9.8 KiB
Data Schema & Knowledge Modeling: Advanced Methodology
Workflow
Advanced Schema Modeling:
- [ ] Step 1: Analyze complex domain patterns
- [ ] Step 2: Design advanced relationship structures
- [ ] Step 3: Apply normalization or strategic denormalization
- [ ] Step 4: Model temporal/historical aspects
- [ ] Step 5: Plan schema evolution strategy
Steps: (1) Identify patterns in Advanced Relationships, (2) Apply Hierarchy and Polymorphic patterns, (3) Use Normalization then Denormalization, (4) Add Temporal if needed, (5) Plan Evolution.
1. Advanced Relationship Patterns
Self-Referential
Entity relates to itself (org charts, categories, social networks).
CREATE TABLE Employee (
id BIGINT PRIMARY KEY,
managerId BIGINT NULL REFERENCES Employee(id),
CONSTRAINT no_self_ref CHECK (id != managerId)
);
Query with recursive CTE for full hierarchy.
Conditional
Relationship exists only under conditions.
CREATE TABLE Order (
id BIGINT PRIMARY KEY,
status VARCHAR(20),
paymentId BIGINT NULL REFERENCES Payment(id),
CONSTRAINT payment_when_paid CHECK (
(status IN ('paid','completed') AND paymentId IS NOT NULL) OR
(status NOT IN ('paid','completed'))
)
);
Multi-Parent
Entity has multiple parents (document in folders).
CREATE TABLE DocumentFolder (
documentId BIGINT REFERENCES Document(id),
folderId BIGINT REFERENCES Folder(id),
PRIMARY KEY (documentId, folderId)
);
2. Hierarchy Modeling
Four approaches with trade-offs:
| Approach | Implementation | Read | Write | Best For |
|---|---|---|---|---|
| Adjacency List | parentId column |
Slow (recursive) | Fast | Shallow trees, frequent updates |
| Path Enumeration | path VARCHAR ('/1/5/12/') |
Fast | Medium | Read-heavy, moderate depth |
| Nested Sets | lft, rgt INT |
Fastest | Slow | Read-heavy, rare writes |
| Closure Table | Separate ancestor/descendant table | Fastest | Medium | Complex queries, any depth |
Adjacency List:
CREATE TABLE Category (
id BIGINT PRIMARY KEY,
parentId BIGINT NULL REFERENCES Category(id)
);
Closure Table:
CREATE TABLE CategoryClosure (
ancestor BIGINT,
descendant BIGINT,
depth INT, -- 0=self, 1=child, 2+=deeper
PRIMARY KEY (ancestor, descendant)
);
Recommendation: Adjacency for < 5 levels, Closure for complex queries.
3. Polymorphic Associations
Entity relates to multiple types (Comment on Post/Photo/Video).
Approach 1: Separate FKs (Recommended for SQL)
CREATE TABLE Comment (
id BIGINT PRIMARY KEY,
postId BIGINT NULL REFERENCES Post(id),
photoId BIGINT NULL REFERENCES Photo(id),
videoId BIGINT NULL REFERENCES Video(id),
CONSTRAINT one_parent CHECK (
(postId IS NOT NULL)::int +
(photoId IS NOT NULL)::int +
(videoId IS NOT NULL)::int = 1
)
);
Pros: Type-safe, referential integrity Cons: Schema grows with types
Approach 2: Supertype/Subtype
CREATE TABLE Commentable (id BIGINT PRIMARY KEY, type VARCHAR(50));
CREATE TABLE Post (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Photo (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Comment (commentableId BIGINT REFERENCES Commentable(id));
Use when: Shared attributes across types.
4. Graph & Ontology Design
Property Graph
Nodes = entities, Edges = relationships, both have properties.
CREATE (u:User {id: 1, name: 'Alice'})
CREATE (p:Product {id: 100, name: 'Widget'})
CREATE (u)-[:PURCHASED {date: '2024-01-15', quantity: 2}]->(p)
Schema:
Nodes: User, Product, Category
Edges: PURCHASED (User→Product, {date, quantity})
REVIEWED (User→Product, {rating, comment})
BELONGS_TO (Product→Category)
Design principles:
- Nodes for entities with identity
- Edges for relationships
- Properties on edges for context
- Avoid deep traversals (< 3 hops)
RDF Triples (Semantic Web)
Subject-Predicate-Object:
ex:Alice rdf:type ex:User .
ex:Alice ex:purchased ex:Widget .
Use RDF when: Standards compliance, semantic reasoning, linked data Use Property Graph when: Performance, complex traversals
5. Normalization Levels
1NF: Atomic Values
Violation: Multiple phones in one column Fix: Separate UserPhone table
2NF: No Partial Dependencies
Violation: In OrderItem(orderId, productId, productName), productName depends only on productId Fix: productName lives in Product table
3NF: No Transitive Dependencies
Violation: In Address(id, zipCode, city, state), city/state depend on zipCode Fix: Separate ZipCode table
When to normalize to 3NF: OLTP, frequent updates, consistency required
6. Strategic Denormalization
Only after profiling shows bottleneck.
Pattern 1: Computed Aggregates
Store Order.total instead of summing OrderItems on every query.
Trade-off: Faster reads, slower writes, consistency risk (use triggers/app logic)
Pattern 2: Frequent Joins
Embed address fields in User table to avoid join.
Trade-off: No join, but updates must maintain both
Pattern 3: Historical Snapshots
CREATE TABLE OrderSnapshot (
orderId BIGINT,
snapshotDate DATE,
userName VARCHAR(255), -- denormalized from User
userEmail VARCHAR(255),
PRIMARY KEY (orderId, snapshotDate)
);
Use when: Need point-in-time data (e.g., user's name at time of order)
7. Temporal & Historical Modeling
Pattern 1: Effective Dating
CREATE TABLE Price (
productId BIGINT,
price DECIMAL(10,2),
effectiveFrom DATE NOT NULL,
effectiveTo DATE NULL, -- NULL = current
PRIMARY KEY (productId, effectiveFrom)
);
Query current: WHERE effectiveFrom <= TODAY AND (effectiveTo IS NULL OR effectiveTo > TODAY)
Pattern 2: History Table
CREATE TABLE UserHistory (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
userId BIGINT,
email VARCHAR(255),
name VARCHAR(255),
validFrom TIMESTAMP DEFAULT NOW(),
validTo TIMESTAMP NULL,
changeType VARCHAR(20) -- 'INSERT', 'UPDATE', 'DELETE'
);
Trigger on User table inserts into UserHistory on changes.
Pattern 3: Event Sourcing
CREATE TABLE OrderEvent (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
orderId BIGINT,
eventType VARCHAR(50), -- 'CREATED', 'ITEM_ADDED', 'SHIPPED'
eventData JSON,
occurredAt TIMESTAMP DEFAULT NOW()
);
Reconstruct state by replaying events.
Trade-offs: Pros: Complete audit, time travel Cons: Query complexity, storage
8. Schema Evolution
Strategy 1: Backward-Compatible
Safe changes (no app changes):
- Add nullable column
- Add table (not referenced)
- Add index
- Widen column (VARCHAR(100) → VARCHAR(255))
ALTER TABLE User ADD COLUMN phoneNumber VARCHAR(20) NULL;
Strategy 2: Expand-Contract
For breaking changes:
-
Expand: Add new alongside old
ALTER TABLE User ADD COLUMN newEmail VARCHAR(255) NULL; -
Migrate: Copy data
UPDATE User SET newEmail = email WHERE newEmail IS NULL; -
Contract: Remove old
ALTER TABLE User DROP COLUMN email; ALTER TABLE User RENAME COLUMN newEmail TO email;
Strategy 3: Versioned Schemas (NoSQL)
{"_schemaVersion": "2.0", "email": "alice@example.com"}
App handles multiple versions.
Strategy 4: Blue-Green
Run old and new schemas simultaneously, dual-write, migrate, switch reads, remove old.
Best for: Major redesigns, zero downtime
9. Multi-Tenancy
Pattern 1: Separate Databases
tenant1_db, tenant2_db, tenant3_db
Pros: Strong isolation Cons: High overhead
Pattern 2: Separate Schemas
CREATE SCHEMA tenant1;
CREATE TABLE tenant1.User (...);
Pros: Better than separate DBs Cons: Still some overhead
Pattern 3: Shared Schema + Tenant ID
CREATE TABLE User (
id BIGINT PRIMARY KEY,
tenantId BIGINT NOT NULL,
email VARCHAR(255),
UNIQUE (tenantId, email)
);
Pros: Most efficient Cons: Must filter ALL queries by tenantId
Recommendation: Pattern 3 for SaaS, Pattern 1 for regulated industries
10. Performance
Indexes
Covering index (includes all query columns):
CREATE INDEX idx_user_status ON User(status) INCLUDE (name, email);
Composite index (order matters):
-- Good for: WHERE tenantId = X AND createdAt > Y
CREATE INDEX idx_tenant_date ON Order(tenantId, createdAt);
Partial index (reduce size):
CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL;
Partitioning
Horizontal (sharding):
CREATE TABLE Order (...) PARTITION BY RANGE (createdAt);
CREATE TABLE Order_2024_Q1 PARTITION OF Order
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
Vertical: Split hot/cold data into separate tables.
11. Common Advanced Patterns
Soft Deletes
ALTER TABLE User ADD COLUMN deletedAt TIMESTAMP NULL;
-- Query: WHERE deletedAt IS NULL
Audit Columns
createdAt TIMESTAMP DEFAULT NOW()
updatedAt TIMESTAMP DEFAULT NOW() ON UPDATE NOW()
createdBy BIGINT REFERENCES User(id)
updatedBy BIGINT REFERENCES User(id)
State Machines
CREATE TABLE OrderState (
orderId BIGINT REFERENCES Order(id),
state VARCHAR(20),
transitionedAt TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (orderId, transitionedAt)
);
-- Track: draft → pending → confirmed → shipped → delivered
Idempotency Keys
CREATE TABLE Request (
idempotencyKey UUID PRIMARY KEY,
payload JSON,
result JSON,
processedAt TIMESTAMP
);
-- Prevents duplicate processing