Initial commit
skills/data-schema-knowledge-modeling/resources/methodology.md
# Data Schema & Knowledge Modeling: Advanced Methodology

## Workflow

```
Advanced Schema Modeling:
- [ ] Step 1: Analyze complex domain patterns
- [ ] Step 2: Design advanced relationship structures
- [ ] Step 3: Apply normalization or strategic denormalization
- [ ] Step 4: Model temporal/historical aspects
- [ ] Step 5: Plan schema evolution strategy
```

**Steps:** (1) Identify patterns in [Advanced Relationships](#1-advanced-relationship-patterns), (2) Apply [Hierarchy](#2-hierarchy-modeling) and [Polymorphic](#3-polymorphic-associations) patterns, (3) Use [Normalization](#5-normalization-levels) then [Denormalization](#6-strategic-denormalization), (4) Add [Temporal](#7-temporal--historical-modeling) if needed, (5) Plan [Evolution](#8-schema-evolution).

---

## 1. Advanced Relationship Patterns

### Self-Referential

An entity relates to itself (org charts, categories, social networks).

```sql
CREATE TABLE Employee (
  id BIGINT PRIMARY KEY,
  managerId BIGINT NULL REFERENCES Employee(id),
  CONSTRAINT no_self_ref CHECK (id != managerId)
);
```

The CHECK only stops a row from referencing itself; longer cycles (A manages B, B manages A) need a trigger or an application-level check. Query the full hierarchy with a recursive CTE.
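
A minimal sketch of that recursive CTE, run through Python's built-in `sqlite3` module so it can be executed as-is. The `name` column, the sample rows, and the `reports_of` helper are assumptions added for illustration; the CTE itself is standard SQL and should carry over to other engines with minor syntax changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,   -- hypothetical column, for readable results
        managerId INTEGER NULL REFERENCES Employee(id),
        CONSTRAINT no_self_ref CHECK (id != managerId)
    );
    INSERT INTO Employee (id, name, managerId) VALUES
        (1, 'Ada',   NULL),
        (2, 'Grace', 1),
        (3, 'Linus', 1),
        (4, 'Ken',   2);
""")


def reports_of(manager_id):
    """Return (id, name, depth) for everyone below manager_id, nearest first."""
    return conn.execute("""
        WITH RECURSIVE subordinates(id, name, depth) AS (
            -- Anchor: direct reports
            SELECT id, name, 1 FROM Employee WHERE managerId = ?
            UNION ALL
            -- Recursive step: reports of the rows found so far
            SELECT e.id, e.name, s.depth + 1
            FROM Employee e JOIN subordinates s ON e.managerId = s.id
        )
        SELECT id, name, depth FROM subordinates ORDER BY depth, id
    """, (manager_id,)).fetchall()
```

`reports_of(1)` walks the whole tree under employee 1; each extra level costs one more recursive step, which is why the adjacency-list style reads slowly on deep trees.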

### Conditional

The relationship exists only under certain conditions.

```sql
-- ORDER is a reserved word, so the table name is quoted.
CREATE TABLE "Order" (
  id BIGINT PRIMARY KEY,
  status VARCHAR(20) NOT NULL, -- a NULL status would make the CHECK evaluate to unknown and pass
  paymentId BIGINT NULL REFERENCES Payment(id),
  CONSTRAINT payment_when_paid CHECK (
    (status IN ('paid','completed') AND paymentId IS NOT NULL) OR
    (status NOT IN ('paid','completed'))
  )
);
```
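
To see the constraint in action, here is a small sketch using Python's `sqlite3` module as a stand-in engine. The `try_insert` helper and the sample payment row are hypothetical; the CHECK expression is the one from the table definition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Payment (id INTEGER PRIMARY KEY);
    CREATE TABLE "Order" (
        id        INTEGER PRIMARY KEY,
        status    TEXT NOT NULL,
        paymentId INTEGER NULL REFERENCES Payment(id),
        CONSTRAINT payment_when_paid CHECK (
            (status IN ('paid','completed') AND paymentId IS NOT NULL) OR
            (status NOT IN ('paid','completed'))
        )
    );
    INSERT INTO Payment (id) VALUES (10);
""")


def try_insert(order_id, status, payment_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'INSERT INTO "Order" (id, status, paymentId) VALUES (?, ?, ?)',
                (order_id, status, payment_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A draft order without a payment is accepted, a paid order without a payment is rejected, and a paid order with `paymentId = 10` is accepted.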

### Multi-Parent

An entity has multiple parents (e.g., a document filed in several folders).

```sql
CREATE TABLE DocumentFolder (
  documentId BIGINT REFERENCES Document(id),
  folderId BIGINT REFERENCES Folder(id),
  PRIMARY KEY (documentId, folderId)
);
```

---

## 2. Hierarchy Modeling

Four approaches with trade-offs:

| Approach | Implementation | Read | Write | Best For |
|----------|---------------|------|-------|----------|
| **Adjacency List** | `parentId` column | Slow (recursive) | Fast | Shallow trees, frequent updates |
| **Path Enumeration** | `path VARCHAR` (`'/1/5/12/'`) | Fast | Medium | Read-heavy, moderate depth |
| **Nested Sets** | `lft, rgt INT` | Fastest | Slow | Read-heavy, rare writes |
| **Closure Table** | Separate ancestor/descendant table | Fastest | Medium | Complex queries, any depth |

**Adjacency List:**
```sql
CREATE TABLE Category (
  id BIGINT PRIMARY KEY,
  parentId BIGINT NULL REFERENCES Category(id)
);
```

**Closure Table:**
```sql
CREATE TABLE CategoryClosure (
  ancestor BIGINT NOT NULL REFERENCES Category(id),
  descendant BIGINT NOT NULL REFERENCES Category(id),
  depth INT NOT NULL, -- 0=self, 1=child, 2+=deeper
  PRIMARY KEY (ancestor, descendant)
);
```

**Recommendation:** Use an adjacency list for trees under 5 levels deep; use a closure table for complex queries.
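
The closure table's "medium" write cost comes from the extra rows each insert must add. A sketch of that maintenance, again using Python's `sqlite3` as a stand-in engine; `add_category` and `descendants` are hypothetical helper names, and moving or deleting subtrees is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (
        id       INTEGER PRIMARY KEY,
        parentId INTEGER NULL REFERENCES Category(id)
    );
    CREATE TABLE CategoryClosure (
        ancestor   INTEGER NOT NULL REFERENCES Category(id),
        descendant INTEGER NOT NULL REFERENCES Category(id),
        depth      INTEGER NOT NULL,   -- 0=self, 1=child, 2+=deeper
        PRIMARY KEY (ancestor, descendant)
    );
""")


def add_category(cat_id, parent_id=None):
    """Insert a node plus closure rows linking it to itself and every ancestor."""
    with conn:
        conn.execute("INSERT INTO Category (id, parentId) VALUES (?, ?)",
                     (cat_id, parent_id))
        # Every node is its own ancestor at depth 0.
        conn.execute("INSERT INTO CategoryClosure VALUES (?, ?, 0)",
                     (cat_id, cat_id))
        if parent_id is not None:
            # Copy each path that ends at the parent, extended by one level.
            conn.execute("""
                INSERT INTO CategoryClosure (ancestor, descendant, depth)
                SELECT ancestor, ?, depth + 1
                FROM CategoryClosure WHERE descendant = ?
            """, (cat_id, parent_id))


def descendants(cat_id):
    """All strict descendants as (id, depth), with no recursion at read time."""
    return conn.execute("""
        SELECT descendant, depth FROM CategoryClosure
        WHERE ancestor = ? AND depth > 0
        ORDER BY depth, descendant
    """, (cat_id,)).fetchall()


# Build the tree 1 -> 5 -> {12, 13}
for cat_id, parent_id in [(1, None), (5, 1), (12, 5), (13, 5)]:
    add_category(cat_id, parent_id)
```

Reads become a single indexed lookup (`descendants(1)` returns the whole subtree), at the price of one closure row per ancestor on every insert.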

---

## 3. Polymorphic Associations

An entity relates to multiple types (e.g., a Comment on a Post, Photo, or Video).

### Approach 1: Separate FKs (Recommended for SQL)

```sql
CREATE TABLE Comment (
  id BIGINT PRIMARY KEY,
  postId BIGINT NULL REFERENCES Post(id),
  photoId BIGINT NULL REFERENCES Photo(id),
  videoId BIGINT NULL REFERENCES Video(id),
  -- Exactly one parent must be set (::int is PostgreSQL cast syntax)
  CONSTRAINT one_parent CHECK (
    (postId IS NOT NULL)::int +
    (photoId IS NOT NULL)::int +
    (videoId IS NOT NULL)::int = 1
  )
);
```

**Pros:** Type-safe, referential integrity
**Cons:** Schema grows with types
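
A runnable sketch of the separate-FK approach using Python's `sqlite3` module. SQLite evaluates `IS NOT NULL` to 0 or 1 directly, so the PostgreSQL `::int` casts are dropped here; `add_comment` and `parent_of` are hypothetical helpers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Post  (id INTEGER PRIMARY KEY);
    CREATE TABLE Photo (id INTEGER PRIMARY KEY);
    CREATE TABLE Video (id INTEGER PRIMARY KEY);
    CREATE TABLE Comment (
        id      INTEGER PRIMARY KEY,
        postId  INTEGER NULL REFERENCES Post(id),
        photoId INTEGER NULL REFERENCES Photo(id),
        videoId INTEGER NULL REFERENCES Video(id),
        CONSTRAINT one_parent CHECK (
            (postId IS NOT NULL) + (photoId IS NOT NULL) + (videoId IS NOT NULL) = 1
        )
    );
    INSERT INTO Post  (id) VALUES (1);
    INSERT INTO Photo (id) VALUES (1);
""")


def add_comment(post_id=None, photo_id=None, video_id=None):
    """Return the new comment id, or None if the one_parent CHECK rejects it."""
    try:
        with conn:
            cur = conn.execute(
                "INSERT INTO Comment (postId, photoId, videoId) VALUES (?, ?, ?)",
                (post_id, photo_id, video_id),
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None


def parent_of(comment_id):
    """Resolve which parent a comment points at, as (type, id)."""
    return conn.execute("""
        SELECT CASE WHEN postId  IS NOT NULL THEN 'post'
                    WHEN photoId IS NOT NULL THEN 'photo'
                    ELSE 'video' END,
               COALESCE(postId, photoId, videoId)
        FROM Comment WHERE id = ?
    """, (comment_id,)).fetchone()
```

The `CASE` in `parent_of` is where "schema grows with types" shows up: each new commentable type adds a column, a CHECK term, and a branch in queries like this one.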

### Approach 2: Supertype/Subtype

```sql
CREATE TABLE Commentable (id BIGINT PRIMARY KEY, type VARCHAR(50));
CREATE TABLE Post (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Photo (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Comment (
  id BIGINT PRIMARY KEY,
  commentableId BIGINT NOT NULL REFERENCES Commentable(id)
);
```

**Use when:** The types share attributes, which can then live on the supertype table.

---

## 4. Graph & Ontology Design

### Property Graph

**Nodes** = entities, **Edges** = relationships; both can carry properties.

```cypher
CREATE (u:User {id: 1, name: 'Alice'})
CREATE (p:Product {id: 100, name: 'Widget'})
CREATE (u)-[:PURCHASED {date: '2024-01-15', quantity: 2}]->(p)
```

**Schema:**
```
Nodes: User, Product, Category
Edges: PURCHASED (User→Product, {date, quantity})
       REVIEWED (User→Product, {rating, comment})
       BELONGS_TO (Product→Category)
```

**Design principles:**
- Nodes for entities with identity
- Edges for relationships
- Properties on edges for context
- Keep traversals shallow (fewer than 3 hops)

### RDF Triples (Semantic Web)

Subject-Predicate-Object:
```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

ex:Alice rdf:type ex:User .
ex:Alice ex:purchased ex:Widget .
```

**Use RDF when:** Standards compliance, semantic reasoning, linked data
**Use Property Graph when:** Performance, complex traversals

---

## 5. Normalization Levels

### 1NF: Atomic Values

**Violation:** Multiple phone numbers stored in one column
**Fix:** Separate UserPhone table with one row per number

### 2NF: No Partial Dependencies

**Violation:** In OrderItem(orderId, productId, productName), productName depends only on productId, which is just part of the composite key
**Fix:** productName lives in the Product table

### 3NF: No Transitive Dependencies

**Violation:** In Address(id, zipCode, city, state), city and state depend on zipCode, a non-key column
**Fix:** Separate ZipCode table keyed by zipCode

**When to normalize to 3NF:** OLTP workloads, frequent updates, consistency required

---

## 6. Strategic Denormalization

**Denormalize only after profiling shows a bottleneck.**

### Pattern 1: Computed Aggregates

Store `Order.total` instead of summing OrderItems on every query.

**Trade-off:** Faster reads, slower writes, and a consistency risk (keep the total in sync with triggers or application logic)
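
One way to keep the stored total consistent is a set of triggers on the item table. The sketch below uses Python's `sqlite3` module; the `price` and `quantity` columns, the integer-cents convention, and the trigger names are assumptions, and trigger syntax differs between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "Order" (
        id    INTEGER PRIMARY KEY,
        total INTEGER NOT NULL DEFAULT 0   -- denormalized sum, in cents
    );
    CREATE TABLE OrderItem (
        id       INTEGER PRIMARY KEY,
        orderId  INTEGER NOT NULL REFERENCES "Order"(id),
        price    INTEGER NOT NULL,         -- unit price in cents (assumed)
        quantity INTEGER NOT NULL
    );
    CREATE TRIGGER order_item_added AFTER INSERT ON OrderItem BEGIN
        UPDATE "Order" SET total = total + NEW.price * NEW.quantity
        WHERE id = NEW.orderId;
    END;
    CREATE TRIGGER order_item_changed AFTER UPDATE ON OrderItem BEGIN
        UPDATE "Order" SET total = total - OLD.price * OLD.quantity
        WHERE id = OLD.orderId;
        UPDATE "Order" SET total = total + NEW.price * NEW.quantity
        WHERE id = NEW.orderId;
    END;
    CREATE TRIGGER order_item_removed AFTER DELETE ON OrderItem BEGIN
        UPDATE "Order" SET total = total - OLD.price * OLD.quantity
        WHERE id = OLD.orderId;
    END;
""")


def stored_total(order_id):
    """Fast path: read the denormalized column."""
    return conn.execute('SELECT total FROM "Order" WHERE id = ?',
                        (order_id,)).fetchone()[0]


def computed_total(order_id):
    """Slow path: recompute from the items, useful as a consistency check."""
    return conn.execute(
        "SELECT COALESCE(SUM(price * quantity), 0) FROM OrderItem WHERE orderId = ?",
        (order_id,),
    ).fetchone()[0]


conn.execute('INSERT INTO "Order" (id) VALUES (1)')
conn.execute("INSERT INTO OrderItem (id, orderId, price, quantity) VALUES (1, 1, 250, 2)")
conn.execute("INSERT INTO OrderItem (id, orderId, price, quantity) VALUES (2, 1, 100, 1)")
conn.execute("UPDATE OrderItem SET quantity = 3 WHERE id = 2")
conn.execute("DELETE FROM OrderItem WHERE id = 1")
```

Comparing `stored_total` against `computed_total` in a periodic job is a cheap way to detect drift if some write path bypasses the triggers.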

### Pattern 2: Frequent Joins

Embed address fields in the User table to avoid a join.

**Trade-off:** No join, but every update must keep both copies in sync

### Pattern 3: Historical Snapshots

```sql
CREATE TABLE OrderSnapshot (
  orderId BIGINT,
  snapshotDate DATE,
  userName VARCHAR(255), -- denormalized from User
  userEmail VARCHAR(255),
  PRIMARY KEY (orderId, snapshotDate)
);
```

**Use when:** You need point-in-time data (e.g., the user's name at the time of the order)

---

## 7. Temporal & Historical Modeling

### Pattern 1: Effective Dating

```sql
CREATE TABLE Price (
  productId BIGINT,
  price DECIMAL(10,2),
  effectiveFrom DATE NOT NULL,
  effectiveTo DATE NULL, -- NULL = current
  PRIMARY KEY (productId, effectiveFrom)
);
```

**Query current:** `WHERE effectiveFrom <= CURRENT_DATE AND (effectiveTo IS NULL OR effectiveTo > CURRENT_DATE)`
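
The same predicate works for any "as of" date, not just today. A sketch with Python's `sqlite3` module, storing dates as ISO-8601 text (which sorts correctly as strings); `price_on` and `change_price` are hypothetical helpers, and the half-open interval `[effectiveFrom, effectiveTo)` is an assumption that matches the `>` in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Price (
        productId     INTEGER,
        price         NUMERIC,
        effectiveFrom TEXT NOT NULL,   -- ISO-8601 date
        effectiveTo   TEXT NULL,       -- NULL = current
        PRIMARY KEY (productId, effectiveFrom)
    );
    INSERT INTO Price VALUES
        (1, 9.99,  '2024-01-01', '2024-06-01'),
        (1, 12.49, '2024-06-01', NULL);
""")


def price_on(product_id, as_of):
    """Price in effect on the given ISO date, or None if none applies."""
    row = conn.execute("""
        SELECT price FROM Price
        WHERE productId = ?
          AND effectiveFrom <= ?
          AND (effectiveTo IS NULL OR effectiveTo > ?)
    """, (product_id, as_of, as_of)).fetchone()
    return row[0] if row else None


def change_price(product_id, new_price, starting):
    """Close the open-ended row and open a new one in a single transaction."""
    with conn:
        conn.execute(
            "UPDATE Price SET effectiveTo = ? WHERE productId = ? AND effectiveTo IS NULL",
            (starting, product_id),
        )
        conn.execute("INSERT INTO Price VALUES (?, ?, ?, NULL)",
                     (product_id, new_price, starting))
```

Closing the old row and inserting the new one in one transaction avoids a window where a product has either two current prices or none.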

### Pattern 2: History Table

```sql
CREATE TABLE UserHistory (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  userId BIGINT,
  email VARCHAR(255),
  name VARCHAR(255),
  validFrom TIMESTAMP DEFAULT NOW(),
  validTo TIMESTAMP NULL,
  changeType VARCHAR(20) -- 'INSERT', 'UPDATE', 'DELETE'
);
```

A trigger on the User table inserts a row into UserHistory on every change.
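
A sketch of such triggers, written for SQLite and driven from Python's `sqlite3` module; the trigger names and the `history` helper are hypothetical, and MySQL or PostgreSQL would need their own trigger syntax. Each change closes the previous open row by setting `validTo` and then appends a new one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
    CREATE TABLE UserHistory (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        userId     INTEGER,
        email      TEXT,
        name       TEXT,
        validFrom  TEXT DEFAULT CURRENT_TIMESTAMP,
        validTo    TEXT NULL,
        changeType TEXT   -- 'INSERT', 'UPDATE', 'DELETE'
    );
    CREATE TRIGGER user_inserted AFTER INSERT ON User BEGIN
        INSERT INTO UserHistory (userId, email, name, changeType)
        VALUES (NEW.id, NEW.email, NEW.name, 'INSERT');
    END;
    CREATE TRIGGER user_updated AFTER UPDATE ON User BEGIN
        UPDATE UserHistory SET validTo = CURRENT_TIMESTAMP
        WHERE userId = OLD.id AND validTo IS NULL;
        INSERT INTO UserHistory (userId, email, name, changeType)
        VALUES (NEW.id, NEW.email, NEW.name, 'UPDATE');
    END;
    CREATE TRIGGER user_deleted AFTER DELETE ON User BEGIN
        UPDATE UserHistory SET validTo = CURRENT_TIMESTAMP
        WHERE userId = OLD.id AND validTo IS NULL;
        INSERT INTO UserHistory (userId, email, name, validTo, changeType)
        VALUES (OLD.id, OLD.email, OLD.name, CURRENT_TIMESTAMP, 'DELETE');
    END;
""")


def history(user_id):
    """Return (email, changeType, validTo) rows for one user, oldest first."""
    return conn.execute(
        "SELECT email, changeType, validTo FROM UserHistory "
        "WHERE userId = ? ORDER BY id",
        (user_id,),
    ).fetchall()


conn.execute("INSERT INTO User (id, email, name) VALUES (1, 'alice@old.example', 'Alice')")
conn.execute("UPDATE User SET email = 'alice@new.example' WHERE id = 1")
```

After these two statements `history(1)` holds a closed `INSERT` row with the old email and an open `UPDATE` row with the new one.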

### Pattern 3: Event Sourcing

```sql
CREATE TABLE OrderEvent (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  orderId BIGINT,
  eventType VARCHAR(50), -- 'CREATED', 'ITEM_ADDED', 'SHIPPED'
  eventData JSON,
  occurredAt TIMESTAMP DEFAULT NOW()
);
```

Reconstruct the current state by replaying the events in order.
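
Replaying is a fold over the event log. A minimal sketch with Python's `sqlite3` and `json` modules; the payload fields (`sku`, `quantity`), the shape of the state dictionary, and the helper names are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderEvent (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        orderId    INTEGER,
        eventType  TEXT,   -- 'CREATED', 'ITEM_ADDED', 'SHIPPED'
        eventData  TEXT,   -- JSON stored as text
        occurredAt TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")


def append_event(order_id, event_type, data):
    """Events are only ever appended, never updated or deleted."""
    with conn:
        conn.execute(
            "INSERT INTO OrderEvent (orderId, eventType, eventData) VALUES (?, ?, ?)",
            (order_id, event_type, json.dumps(data)),
        )


def replay(order_id):
    """Fold the order's events, oldest first, into its current state."""
    state = {"status": None, "items": {}}
    rows = conn.execute(
        "SELECT eventType, eventData FROM OrderEvent WHERE orderId = ? ORDER BY id",
        (order_id,),
    )
    for event_type, raw in rows:
        data = json.loads(raw)
        if event_type == "CREATED":
            state["status"] = "created"
        elif event_type == "ITEM_ADDED":
            sku = data["sku"]
            state["items"][sku] = state["items"].get(sku, 0) + data["quantity"]
        elif event_type == "SHIPPED":
            state["status"] = "shipped"
    return state


append_event(7, "CREATED", {})
append_event(7, "ITEM_ADDED", {"sku": "widget", "quantity": 2})
append_event(7, "ITEM_ADDED", {"sku": "gadget", "quantity": 1})
append_event(7, "ITEM_ADDED", {"sku": "widget", "quantity": 1})
append_event(7, "SHIPPED", {})
```

Stopping the loop at an earlier event id yields the state at that point in time, which is the "time travel" benefit; the cost is that every read of current state repeats this fold unless a snapshot is cached.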

**Trade-offs:**
**Pros:** Complete audit trail, time travel
**Cons:** Query complexity, storage growth

---

## 8. Schema Evolution

### Strategy 1: Backward-Compatible

Safe changes (no app changes required):
- Add a nullable column
- Add a table (not yet referenced)
- Add an index
- Widen a column (VARCHAR(100) → VARCHAR(255))

```sql
ALTER TABLE User ADD COLUMN phoneNumber VARCHAR(20) NULL;
```

### Strategy 2: Expand-Contract

For breaking changes:

1. **Expand:** Add the new column alongside the old one
   ```sql
   ALTER TABLE User ADD COLUMN newEmail VARCHAR(255) NULL;
   ```

2. **Migrate:** Copy the data, then switch application reads and writes to the new column
   ```sql
   UPDATE User SET newEmail = email WHERE newEmail IS NULL;
   ```

3. **Contract:** Remove the old column once nothing uses it
   ```sql
   ALTER TABLE User DROP COLUMN email;
   ALTER TABLE User RENAME COLUMN newEmail TO email;
   ```

### Strategy 3: Versioned Schemas (NoSQL)

```json
{"_schemaVersion": "2.0", "email": "alice@example.com"}
```

The application reads `_schemaVersion` and handles every version it may encounter.
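
One common way for the application to do this is to upgrade documents on read, one version step at a time. In the sketch below the version 1.0 shape (an `emailAddress` field) and all function names are hypothetical; only the 2.0 document matches the example above.

```python
def upgrade_1_0_to_2_0(doc):
    """Hypothetical change: 1.0 stored the address under 'emailAddress'."""
    upgraded = {k: v for k, v in doc.items() if k != "emailAddress"}
    upgraded["email"] = doc["emailAddress"]
    upgraded["_schemaVersion"] = "2.0"
    return upgraded


# Maps a version to the function that lifts it to the next version.
UPGRADES = {"1.0": upgrade_1_0_to_2_0}
CURRENT_VERSION = "2.0"


def load(doc):
    """Upgrade a stored document step by step until it reaches CURRENT_VERSION."""
    version = doc.get("_schemaVersion", "1.0")  # assume unversioned means 1.0
    while version != CURRENT_VERSION:
        if version not in UPGRADES:
            raise ValueError(f"no upgrade path from schema version {version}")
        doc = UPGRADES[version](doc)
        version = doc["_schemaVersion"]
    return doc
```

The rest of the code then only ever sees the current shape; adding version 3.0 means adding one entry to `UPGRADES` and bumping `CURRENT_VERSION`.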

### Strategy 4: Blue-Green

Run the old and new schemas side by side: dual-write to both, migrate existing data, switch reads to the new schema, then remove the old one.

**Best for:** Major redesigns that need zero downtime

---

## 9. Multi-Tenancy

### Pattern 1: Separate Databases

```
tenant1_db, tenant2_db, tenant3_db
```

**Pros:** Strong isolation
**Cons:** High operational overhead

### Pattern 2: Separate Schemas

```sql
CREATE SCHEMA tenant1;
CREATE TABLE tenant1.User (...);
```

**Pros:** Lower overhead than separate databases
**Cons:** Still some per-tenant overhead (each schema is migrated separately)

### Pattern 3: Shared Schema + Tenant ID

```sql
CREATE TABLE User (
  id BIGINT PRIMARY KEY,
  tenantId BIGINT NOT NULL,
  email VARCHAR(255),
  UNIQUE (tenantId, email)
);
```

**Pros:** Most efficient
**Cons:** Every query must filter by tenantId

**Recommendation:** Pattern 3 for SaaS, Pattern 1 for regulated industries
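
With Pattern 3, the risk is a query that forgets the tenant filter. One mitigation is to route all access through an object that is bound to a single tenant. The `TenantScope` class below is a hypothetical sketch using Python's `sqlite3` module, not a feature of any particular framework.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (
        id       INTEGER PRIMARY KEY,
        tenantId INTEGER NOT NULL,
        email    TEXT,
        UNIQUE (tenantId, email)
    );
    INSERT INTO User (tenantId, email) VALUES
        (1, 'alice@example.com'),
        (2, 'alice@example.com'),   -- same email is fine in another tenant
        (2, 'bob@example.com');
""")


class TenantScope:
    """Every query issued through this object is bound to one tenant."""

    def __init__(self, connection, tenant_id):
        self._conn = connection
        self._tenant_id = tenant_id

    def emails(self):
        rows = self._conn.execute(
            "SELECT email FROM User WHERE tenantId = ? ORDER BY email",
            (self._tenant_id,),
        )
        return [email for (email,) in rows]

    def add_user(self, email):
        # The tenant id comes from the scope, never from the caller.
        with self._conn:
            self._conn.execute(
                "INSERT INTO User (tenantId, email) VALUES (?, ?)",
                (self._tenant_id, email),
            )
```

`TenantScope(conn, 1).emails()` only ever sees tenant 1's rows, and the `UNIQUE (tenantId, email)` constraint still rejects a duplicate email within the same tenant.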

---

## 10. Performance

### Indexes

**Covering index** (includes all columns the query needs):
```sql
-- INCLUDE is supported by PostgreSQL 11+ and SQL Server
CREATE INDEX idx_user_status ON User(status) INCLUDE (name, email);
```

**Composite index** (column order matters):
```sql
-- Good for: WHERE tenantId = X AND createdAt > Y
CREATE INDEX idx_tenant_date ON "Order"(tenantId, createdAt);
```

**Partial index** (reduces index size):
```sql
CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL;
```
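
A partial index is only considered when the query's WHERE clause implies the index predicate. The sketch below checks this with SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module; the `plan` helper is hypothetical, and the exact plan wording varies between SQLite versions, so the code only looks for the index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, email TEXT, deletedAt TEXT NULL);
    CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL;
""")


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Repeats the index predicate, so the partial index is eligible.
with_filter = plan(
    "SELECT id FROM User WHERE email = ? AND deletedAt IS NULL",
    ("a@example.com",),
)

# Omits the predicate, so the index cannot be used and the table is scanned.
without_filter = plan(
    "SELECT id FROM User WHERE email = ?",
    ("a@example.com",),
)
```

In practice this means queries on soft-deleted tables should always spell out `deletedAt IS NULL`, both for correctness and so the smaller index can be used.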

### Partitioning

**Horizontal** (split rows by range; sharding places such partitions on separate servers):
```sql
-- PostgreSQL declarative partitioning
CREATE TABLE "Order" (...) PARTITION BY RANGE (createdAt);
CREATE TABLE Order_2024_Q1 PARTITION OF "Order"
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
```

**Vertical:** Split hot and cold columns into separate tables.

---

## 11. Common Advanced Patterns

### Soft Deletes

```sql
ALTER TABLE User ADD COLUMN deletedAt TIMESTAMP NULL;
-- Query: WHERE deletedAt IS NULL
```

### Audit Columns

```sql
-- Column definitions to include in a CREATE TABLE
createdAt TIMESTAMP DEFAULT NOW(),
updatedAt TIMESTAMP DEFAULT NOW() ON UPDATE NOW(), -- ON UPDATE is MySQL syntax; other engines need a trigger
createdBy BIGINT REFERENCES User(id),
updatedBy BIGINT REFERENCES User(id)
```

### State Machines

```sql
CREATE TABLE OrderState (
  orderId BIGINT REFERENCES "Order"(id),
  state VARCHAR(20),
  transitionedAt TIMESTAMP DEFAULT NOW(),
  PRIMARY KEY (orderId, transitionedAt)
);
-- Track: draft → pending → confirmed → shipped → delivered
```
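
The table records transitions but does not by itself stop an illegal jump such as draft to shipped. A sketch of application-side validation with Python's `sqlite3` module; the `ALLOWED` map, the helper names, and the explicit timestamp argument are assumptions, the foreign key to the order table is omitted for brevity, and a concurrent system would need to run the check and insert under a lock or serializable transaction.

```python
import sqlite3

# Allowed moves; None stands for "no state recorded yet".
ALLOWED = {
    None: {"draft"},
    "draft": {"pending"},
    "pending": {"confirmed"},
    "confirmed": {"shipped"},
    "shipped": {"delivered"},
    "delivered": set(),
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderState (
        orderId        INTEGER,
        state          TEXT,
        transitionedAt TEXT,   -- ISO-8601 timestamp supplied by the caller
        PRIMARY KEY (orderId, transitionedAt)
    );
""")


def current_state(order_id):
    """Latest recorded state for the order, or None if it has no history."""
    row = conn.execute(
        "SELECT state FROM OrderState WHERE orderId = ? "
        "ORDER BY transitionedAt DESC LIMIT 1",
        (order_id,),
    ).fetchone()
    return row[0] if row else None


def transition(order_id, new_state, at):
    """Append a state row if the move is allowed; raise ValueError otherwise."""
    current = current_state(order_id)
    if new_state not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {new_state}")
    with conn:
        conn.execute(
            "INSERT INTO OrderState (orderId, state, transitionedAt) VALUES (?, ?, ?)",
            (order_id, new_state, at),
        )
```

Because rows are only appended, the table doubles as a full transition history for each order.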

### Idempotency Keys

```sql
CREATE TABLE Request (
  idempotencyKey UUID PRIMARY KEY,
  payload JSON,
  result JSON,
  processedAt TIMESTAMP
);
-- Prevents duplicate processing
```
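
A sketch of how a handler might use this table, with Python's `sqlite3` and `json` modules. The `charge` handler, the payload shape, and the `calls` list are hypothetical and exist only to show that the side effect runs once. This version assumes a single worker; with concurrent workers the key should be claimed by inserting it before processing, so the primary key arbitrates between them.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Request (
        idempotencyKey TEXT PRIMARY KEY,   -- UUID stored as text
        payload        TEXT,               -- JSON stored as text
        result         TEXT,
        processedAt    TEXT
    )
""")

calls = []  # records each time the side effect really ran


def charge(payload):
    """Hypothetical side-effecting operation that must not run twice."""
    calls.append(payload)
    return {"charged": payload["amount"]}


def handle(key, payload):
    """Run charge() at most once per key; replay the stored result on retries."""
    row = conn.execute(
        "SELECT result FROM Request WHERE idempotencyKey = ?", (key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # retry: return what was computed before
    result = charge(payload)
    with conn:
        conn.execute(
            "INSERT INTO Request (idempotencyKey, payload, result, processedAt) "
            "VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
            (key, json.dumps(payload), json.dumps(result)),
        )
    return result
```

A client that times out and retries with the same key gets the original result back instead of triggering a second charge.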