440 lines
9.8 KiB
Markdown
440 lines
9.8 KiB
Markdown
# Data Schema & Knowledge Modeling: Advanced Methodology
|
|
|
|
## Workflow
|
|
|
|
```
|
|
Advanced Schema Modeling:
|
|
- [ ] Step 1: Analyze complex domain patterns
|
|
- [ ] Step 2: Design advanced relationship structures
|
|
- [ ] Step 3: Apply normalization or strategic denormalization
|
|
- [ ] Step 4: Model temporal/historical aspects
|
|
- [ ] Step 5: Plan schema evolution strategy
|
|
```
|
|
|
|
**Steps:** (1) Identify patterns in [Advanced Relationships](#1-advanced-relationship-patterns), (2) Apply [Hierarchy](#2-hierarchy-modeling) and [Polymorphic](#3-polymorphic-associations) patterns, (3) Use [Normalization](#4-normalization-levels) then [Denormalization](#5-strategic-denormalization), (4) Add [Temporal](#6-temporal--historical-modeling) if needed, (5) Plan [Evolution](#7-schema-evolution).
|
|
|
|
---
|
|
|
|
## 1. Advanced Relationship Patterns
|
|
|
|
### Self-Referential
|
|
|
|
Entity relates to itself (org charts, categories, social networks).
|
|
|
|
```sql
|
|
CREATE TABLE Employee (
|
|
id BIGINT PRIMARY KEY,
|
|
managerId BIGINT NULL REFERENCES Employee(id),
|
|
CONSTRAINT no_self_ref CHECK (id != managerId)
|
|
);
|
|
```
|
|
|
|
Query with recursive CTE for full hierarchy.
|
|
|
|
### Conditional
|
|
|
|
Relationship exists only under conditions.
|
|
|
|
```sql
|
|
CREATE TABLE Order (
|
|
id BIGINT PRIMARY KEY,
|
|
status VARCHAR(20),
|
|
paymentId BIGINT NULL REFERENCES Payment(id),
|
|
CONSTRAINT payment_when_paid CHECK (
|
|
(status IN ('paid','completed') AND paymentId IS NOT NULL) OR
|
|
(status NOT IN ('paid','completed'))
|
|
)
|
|
);
|
|
```
|
|
|
|
### Multi-Parent
|
|
|
|
Entity has multiple parents (document in folders).
|
|
|
|
```sql
|
|
CREATE TABLE DocumentFolder (
|
|
documentId BIGINT REFERENCES Document(id),
|
|
folderId BIGINT REFERENCES Folder(id),
|
|
PRIMARY KEY (documentId, folderId)
|
|
);
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Hierarchy Modeling
|
|
|
|
Four approaches with trade-offs:
|
|
|
|
| Approach | Implementation | Read | Write | Best For |
|
|
|----------|---------------|------|-------|----------|
|
|
| **Adjacency List** | `parentId` column | Slow (recursive) | Fast | Shallow trees, frequent updates |
|
|
| **Path Enumeration** | `path VARCHAR` ('/1/5/12/') | Fast | Medium | Read-heavy, moderate depth |
|
|
| **Nested Sets** | `lft, rgt INT` | Fastest | Slow | Read-heavy, rare writes |
|
|
| **Closure Table** | Separate ancestor/descendant table | Fastest | Medium | Complex queries, any depth |
|
|
|
|
**Adjacency List:**
|
|
```sql
|
|
CREATE TABLE Category (
|
|
id BIGINT PRIMARY KEY,
|
|
parentId BIGINT NULL REFERENCES Category(id)
|
|
);
|
|
```
|
|
|
|
**Closure Table:**
|
|
```sql
|
|
CREATE TABLE CategoryClosure (
|
|
ancestor BIGINT,
|
|
descendant BIGINT,
|
|
depth INT, -- 0=self, 1=child, 2+=deeper
|
|
PRIMARY KEY (ancestor, descendant)
|
|
);
|
|
```
|
|
|
|
**Recommendation:** Adjacency for < 5 levels, Closure for complex queries.
|
|
|
|
---
|
|
|
|
## 3. Polymorphic Associations
|
|
|
|
Entity relates to multiple types (Comment on Post/Photo/Video).
|
|
|
|
### Approach 1: Separate FKs (Recommended for SQL)
|
|
|
|
```sql
|
|
CREATE TABLE Comment (
|
|
id BIGINT PRIMARY KEY,
|
|
postId BIGINT NULL REFERENCES Post(id),
|
|
photoId BIGINT NULL REFERENCES Photo(id),
|
|
videoId BIGINT NULL REFERENCES Video(id),
|
|
CONSTRAINT one_parent CHECK (
|
|
(postId IS NOT NULL)::int +
|
|
(photoId IS NOT NULL)::int +
|
|
(videoId IS NOT NULL)::int = 1
|
|
)
|
|
);
|
|
```
|
|
|
|
**Pros:** Type-safe, referential integrity
|
|
**Cons:** Schema grows with types
|
|
|
|
### Approach 2: Supertype/Subtype
|
|
|
|
```sql
|
|
CREATE TABLE Commentable (id BIGINT PRIMARY KEY, type VARCHAR(50));
|
|
CREATE TABLE Post (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
|
|
CREATE TABLE Photo (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
|
|
CREATE TABLE Comment (commentableId BIGINT REFERENCES Commentable(id));
|
|
```
|
|
|
|
**Use when:** Shared attributes across types.
|
|
|
|
---
|
|
|
|
## 4. Graph & Ontology Design
|
|
|
|
### Property Graph
|
|
|
|
**Nodes** = entities, **Edges** = relationships, both have properties.
|
|
|
|
```cypher
|
|
CREATE (u:User {id: 1, name: 'Alice'})
|
|
CREATE (p:Product {id: 100, name: 'Widget'})
|
|
CREATE (u)-[:PURCHASED {date: '2024-01-15', quantity: 2}]->(p)
|
|
```
|
|
|
|
**Schema:**
|
|
```
|
|
Nodes: User, Product, Category
|
|
Edges: PURCHASED (User→Product, {date, quantity})
|
|
REVIEWED (User→Product, {rating, comment})
|
|
BELONGS_TO (Product→Category)
|
|
```
|
|
|
|
**Design principles:**
|
|
- Nodes for entities with identity
|
|
- Edges for relationships
|
|
- Properties on edges for context
|
|
- Avoid deep traversals (< 3 hops)
|
|
|
|
### RDF Triples (Semantic Web)
|
|
|
|
Subject-Predicate-Object:
|
|
```turtle
|
|
ex:Alice rdf:type ex:User .
|
|
ex:Alice ex:purchased ex:Widget .
|
|
```
|
|
|
|
**Use RDF when:** Standards compliance, semantic reasoning, linked data
|
|
**Use Property Graph when:** Performance, complex traversals
|
|
|
|
---
|
|
|
|
## 5. Normalization Levels
|
|
|
|
### 1NF: Atomic Values
|
|
|
|
**Violation:** Multiple phones in one column
|
|
**Fix:** Separate UserPhone table
|
|
|
|
### 2NF: No Partial Dependencies
|
|
|
|
**Violation:** In OrderItem(orderId, productId, productName), productName depends only on productId
|
|
**Fix:** productName lives in Product table
|
|
|
|
### 3NF: No Transitive Dependencies
|
|
|
|
**Violation:** In Address(id, zipCode, city, state), city/state depend on zipCode
|
|
**Fix:** Separate ZipCode table
|
|
|
|
**When to normalize to 3NF:** OLTP, frequent updates, consistency required
|
|
|
|
---
|
|
|
|
## 6. Strategic Denormalization
|
|
|
|
**Only after profiling shows bottleneck.**
|
|
|
|
### Pattern 1: Computed Aggregates
|
|
|
|
Store `Order.total` instead of summing OrderItems on every query.
|
|
|
|
**Trade-off:** Faster reads, slower writes, consistency risk (use triggers/app logic)
|
|
|
|
### Pattern 2: Frequent Joins
|
|
|
|
Embed address fields in User table to avoid join.
|
|
|
|
**Trade-off:** No join, but updates must maintain both
|
|
|
|
### Pattern 3: Historical Snapshots
|
|
|
|
```sql
|
|
CREATE TABLE OrderSnapshot (
|
|
orderId BIGINT,
|
|
snapshotDate DATE,
|
|
userName VARCHAR(255), -- denormalized from User
|
|
userEmail VARCHAR(255),
|
|
PRIMARY KEY (orderId, snapshotDate)
|
|
);
|
|
```
|
|
|
|
**Use when:** Need point-in-time data (e.g., user's name at time of order)
|
|
|
|
---
|
|
|
|
## 7. Temporal & Historical Modeling
|
|
|
|
### Pattern 1: Effective Dating
|
|
|
|
```sql
|
|
CREATE TABLE Price (
|
|
productId BIGINT,
|
|
price DECIMAL(10,2),
|
|
effectiveFrom DATE NOT NULL,
|
|
effectiveTo DATE NULL, -- NULL = current
|
|
PRIMARY KEY (productId, effectiveFrom)
|
|
);
|
|
```
|
|
|
|
**Query current:** WHERE effectiveFrom <= TODAY AND (effectiveTo IS NULL OR effectiveTo > TODAY)
|
|
|
|
### Pattern 2: History Table
|
|
|
|
```sql
|
|
CREATE TABLE UserHistory (
|
|
id BIGINT AUTO_INCREMENT PRIMARY KEY,
|
|
userId BIGINT,
|
|
email VARCHAR(255),
|
|
name VARCHAR(255),
|
|
validFrom TIMESTAMP DEFAULT NOW(),
|
|
validTo TIMESTAMP NULL,
|
|
changeType VARCHAR(20) -- 'INSERT', 'UPDATE', 'DELETE'
|
|
);
|
|
```
|
|
|
|
Trigger on User table inserts into UserHistory on changes.
|
|
|
|
### Pattern 3: Event Sourcing
|
|
|
|
```sql
|
|
CREATE TABLE OrderEvent (
|
|
id BIGINT AUTO_INCREMENT PRIMARY KEY,
|
|
orderId BIGINT,
|
|
eventType VARCHAR(50), -- 'CREATED', 'ITEM_ADDED', 'SHIPPED'
|
|
eventData JSON,
|
|
occurredAt TIMESTAMP DEFAULT NOW()
|
|
);
|
|
```
|
|
|
|
Reconstruct state by replaying events.
|
|
|
|
**Trade-offs:**
|
|
**Pros:** Complete audit, time travel
|
|
**Cons:** Query complexity, storage
|
|
|
|
---
|
|
|
|
## 8. Schema Evolution
|
|
|
|
### Strategy 1: Backward-Compatible
|
|
|
|
Safe changes (no app changes):
|
|
- Add nullable column
|
|
- Add table (not referenced)
|
|
- Add index
|
|
- Widen column (VARCHAR(100) → VARCHAR(255))
|
|
|
|
```sql
|
|
ALTER TABLE User ADD COLUMN phoneNumber VARCHAR(20) NULL;
|
|
```
|
|
|
|
### Strategy 2: Expand-Contract
|
|
|
|
For breaking changes:
|
|
|
|
1. **Expand:** Add new alongside old
|
|
```sql
|
|
ALTER TABLE User ADD COLUMN newEmail VARCHAR(255) NULL;
|
|
```
|
|
|
|
2. **Migrate:** Copy data
|
|
```sql
|
|
UPDATE User SET newEmail = email WHERE newEmail IS NULL;
|
|
```
|
|
|
|
3. **Contract:** Remove old
|
|
```sql
|
|
ALTER TABLE User DROP COLUMN email;
|
|
ALTER TABLE User RENAME COLUMN newEmail TO email;
|
|
```
|
|
|
|
### Strategy 3: Versioned Schemas (NoSQL)
|
|
|
|
```json
|
|
{"_schemaVersion": "2.0", "email": "alice@example.com"}
|
|
```
|
|
|
|
App handles multiple versions.
|
|
|
|
### Strategy 4: Blue-Green
|
|
|
|
Run old and new schemas simultaneously, dual-write, migrate, switch reads, remove old.
|
|
|
|
**Best for:** Major redesigns, zero downtime
|
|
|
|
---
|
|
|
|
## 9. Multi-Tenancy
|
|
|
|
### Pattern 1: Separate Databases
|
|
|
|
```
|
|
tenant1_db, tenant2_db, tenant3_db
|
|
```
|
|
|
|
**Pros:** Strong isolation
|
|
**Cons:** High overhead
|
|
|
|
### Pattern 2: Separate Schemas
|
|
|
|
```sql
|
|
CREATE SCHEMA tenant1;
|
|
CREATE TABLE tenant1.User (...);
|
|
```
|
|
|
|
**Pros:** Better than separate DBs
|
|
**Cons:** Still some overhead
|
|
|
|
### Pattern 3: Shared Schema + Tenant ID
|
|
|
|
```sql
|
|
CREATE TABLE User (
|
|
id BIGINT PRIMARY KEY,
|
|
tenantId BIGINT NOT NULL,
|
|
email VARCHAR(255),
|
|
UNIQUE (tenantId, email)
|
|
);
|
|
```
|
|
|
|
**Pros:** Most efficient
|
|
**Cons:** Must filter ALL queries by tenantId
|
|
|
|
**Recommendation:** Pattern 3 for SaaS, Pattern 1 for regulated industries
|
|
|
|
---
|
|
|
|
## 10. Performance
|
|
|
|
### Indexes
|
|
|
|
**Covering index** (includes all query columns):
|
|
```sql
|
|
CREATE INDEX idx_user_status ON User(status) INCLUDE (name, email);
|
|
```
|
|
|
|
**Composite index** (order matters):
|
|
```sql
|
|
-- Good for: WHERE tenantId = X AND createdAt > Y
|
|
CREATE INDEX idx_tenant_date ON Order(tenantId, createdAt);
|
|
```
|
|
|
|
**Partial index** (reduce size):
|
|
```sql
|
|
CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL;
|
|
```
|
|
|
|
### Partitioning
|
|
|
|
**Horizontal (sharding):**
|
|
```sql
|
|
CREATE TABLE Order (...) PARTITION BY RANGE (createdAt);
|
|
CREATE TABLE Order_2024_Q1 PARTITION OF Order
|
|
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
|
|
```
|
|
|
|
**Vertical:** Split hot/cold data into separate tables.
|
|
|
|
---
|
|
|
|
## 11. Common Advanced Patterns
|
|
|
|
### Soft Deletes
|
|
|
|
```sql
|
|
ALTER TABLE User ADD COLUMN deletedAt TIMESTAMP NULL;
|
|
-- Query: WHERE deletedAt IS NULL
|
|
```
|
|
|
|
### Audit Columns
|
|
|
|
```sql
|
|
createdAt TIMESTAMP DEFAULT NOW()
|
|
updatedAt TIMESTAMP DEFAULT NOW() ON UPDATE NOW()
|
|
createdBy BIGINT REFERENCES User(id)
|
|
updatedBy BIGINT REFERENCES User(id)
|
|
```
|
|
|
|
### State Machines
|
|
|
|
```sql
|
|
CREATE TABLE OrderState (
|
|
orderId BIGINT REFERENCES Order(id),
|
|
state VARCHAR(20),
|
|
transitionedAt TIMESTAMP DEFAULT NOW(),
|
|
PRIMARY KEY (orderId, transitionedAt)
|
|
);
|
|
-- Track: draft → pending → confirmed → shipped → delivered
|
|
```
|
|
|
|
### Idempotency Keys
|
|
|
|
```sql
|
|
CREATE TABLE Request (
|
|
idempotencyKey UUID PRIMARY KEY,
|
|
payload JSON,
|
|
result JSON,
|
|
processedAt TIMESTAMP
|
|
);
|
|
-- Prevents duplicate processing
|
|
```
|