Files
gh-lyndonkl-claude/skills/data-schema-knowledge-modeling/resources/methodology.md
2025-11-30 08:38:26 +08:00

440 lines
9.8 KiB
Markdown

# Data Schema & Knowledge Modeling: Advanced Methodology
## Workflow
```
Advanced Schema Modeling:
- [ ] Step 1: Analyze complex domain patterns
- [ ] Step 2: Design advanced relationship structures
- [ ] Step 3: Apply normalization or strategic denormalization
- [ ] Step 4: Model temporal/historical aspects
- [ ] Step 5: Plan schema evolution strategy
```
**Steps:** (1) Identify patterns in [Advanced Relationships](#1-advanced-relationship-patterns), (2) Apply [Hierarchy](#2-hierarchy-modeling) and [Polymorphic](#3-polymorphic-associations) patterns, (3) Use [Normalization](#4-normalization-levels) then [Denormalization](#5-strategic-denormalization), (4) Add [Temporal](#6-temporal--historical-modeling) if needed, (5) Plan [Evolution](#7-schema-evolution).
---
## 1. Advanced Relationship Patterns
### Self-Referential
Entity relates to itself (org charts, categories, social networks).
```sql
CREATE TABLE Employee (
id BIGINT PRIMARY KEY,
managerId BIGINT NULL REFERENCES Employee(id),
CONSTRAINT no_self_ref CHECK (id != managerId)
);
```
Query with recursive CTE for full hierarchy.
### Conditional
Relationship exists only under conditions.
```sql
CREATE TABLE Order (
id BIGINT PRIMARY KEY,
status VARCHAR(20),
paymentId BIGINT NULL REFERENCES Payment(id),
CONSTRAINT payment_when_paid CHECK (
(status IN ('paid','completed') AND paymentId IS NOT NULL) OR
(status NOT IN ('paid','completed'))
)
);
```
### Multi-Parent
Entity has multiple parents (document in folders).
```sql
CREATE TABLE DocumentFolder (
documentId BIGINT REFERENCES Document(id),
folderId BIGINT REFERENCES Folder(id),
PRIMARY KEY (documentId, folderId)
);
```
---
## 2. Hierarchy Modeling
Four approaches with trade-offs:
| Approach | Implementation | Read | Write | Best For |
|----------|---------------|------|-------|----------|
| **Adjacency List** | `parentId` column | Slow (recursive) | Fast | Shallow trees, frequent updates |
| **Path Enumeration** | `path VARCHAR` ('/1/5/12/') | Fast | Medium | Read-heavy, moderate depth |
| **Nested Sets** | `lft, rgt INT` | Fastest | Slow | Read-heavy, rare writes |
| **Closure Table** | Separate ancestor/descendant table | Fastest | Medium | Complex queries, any depth |
**Adjacency List:**
```sql
CREATE TABLE Category (
id BIGINT PRIMARY KEY,
parentId BIGINT NULL REFERENCES Category(id)
);
```
**Closure Table:**
```sql
CREATE TABLE CategoryClosure (
ancestor BIGINT,
descendant BIGINT,
depth INT, -- 0=self, 1=child, 2+=deeper
PRIMARY KEY (ancestor, descendant)
);
```
**Recommendation:** Adjacency for < 5 levels, Closure for complex queries.
---
## 3. Polymorphic Associations
Entity relates to multiple types (Comment on Post/Photo/Video).
### Approach 1: Separate FKs (Recommended for SQL)
```sql
CREATE TABLE Comment (
id BIGINT PRIMARY KEY,
postId BIGINT NULL REFERENCES Post(id),
photoId BIGINT NULL REFERENCES Photo(id),
videoId BIGINT NULL REFERENCES Video(id),
CONSTRAINT one_parent CHECK (
(postId IS NOT NULL)::int +
(photoId IS NOT NULL)::int +
(videoId IS NOT NULL)::int = 1
)
);
```
**Pros:** Type-safe, referential integrity
**Cons:** Schema grows with types
### Approach 2: Supertype/Subtype
```sql
CREATE TABLE Commentable (id BIGINT PRIMARY KEY, type VARCHAR(50));
CREATE TABLE Post (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Photo (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Comment (commentableId BIGINT REFERENCES Commentable(id));
```
**Use when:** Shared attributes across types.
---
## 4. Graph & Ontology Design
### Property Graph
**Nodes** = entities, **Edges** = relationships, both have properties.
```cypher
CREATE (u:User {id: 1, name: 'Alice'})
CREATE (p:Product {id: 100, name: 'Widget'})
CREATE (u)-[:PURCHASED {date: '2024-01-15', quantity: 2}]->(p)
```
**Schema:**
```
Nodes: User, Product, Category
Edges: PURCHASED (User→Product, {date, quantity})
REVIEWED (User→Product, {rating, comment})
BELONGS_TO (Product→Category)
```
**Design principles:**
- Nodes for entities with identity
- Edges for relationships
- Properties on edges for context
- Avoid deep traversals (< 3 hops)
### RDF Triples (Semantic Web)
Subject-Predicate-Object:
```turtle
ex:Alice rdf:type ex:User .
ex:Alice ex:purchased ex:Widget .
```
**Use RDF when:** Standards compliance, semantic reasoning, linked data
**Use Property Graph when:** Performance, complex traversals
---
## 5. Normalization Levels
### 1NF: Atomic Values
**Violation:** Multiple phones in one column
**Fix:** Separate UserPhone table
### 2NF: No Partial Dependencies
**Violation:** In OrderItem(orderId, productId, productName), productName depends only on productId
**Fix:** productName lives in Product table
### 3NF: No Transitive Dependencies
**Violation:** In Address(id, zipCode, city, state), city/state depend on zipCode
**Fix:** Separate ZipCode table
**When to normalize to 3NF:** OLTP, frequent updates, consistency required
---
## 6. Strategic Denormalization
**Only after profiling shows bottleneck.**
### Pattern 1: Computed Aggregates
Store `Order.total` instead of summing OrderItems on every query.
**Trade-off:** Faster reads, slower writes, consistency risk (use triggers/app logic)
### Pattern 2: Frequent Joins
Embed address fields in User table to avoid join.
**Trade-off:** No join, but updates must maintain both
### Pattern 3: Historical Snapshots
```sql
CREATE TABLE OrderSnapshot (
orderId BIGINT,
snapshotDate DATE,
userName VARCHAR(255), -- denormalized from User
userEmail VARCHAR(255),
PRIMARY KEY (orderId, snapshotDate)
);
```
**Use when:** Need point-in-time data (e.g., user's name at time of order)
---
## 7. Temporal & Historical Modeling
### Pattern 1: Effective Dating
```sql
CREATE TABLE Price (
productId BIGINT,
price DECIMAL(10,2),
effectiveFrom DATE NOT NULL,
effectiveTo DATE NULL, -- NULL = current
PRIMARY KEY (productId, effectiveFrom)
);
```
**Query current:** WHERE effectiveFrom <= TODAY AND (effectiveTo IS NULL OR effectiveTo > TODAY)
### Pattern 2: History Table
```sql
CREATE TABLE UserHistory (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
userId BIGINT,
email VARCHAR(255),
name VARCHAR(255),
validFrom TIMESTAMP DEFAULT NOW(),
validTo TIMESTAMP NULL,
changeType VARCHAR(20) -- 'INSERT', 'UPDATE', 'DELETE'
);
```
Trigger on User table inserts into UserHistory on changes.
### Pattern 3: Event Sourcing
```sql
CREATE TABLE OrderEvent (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
orderId BIGINT,
eventType VARCHAR(50), -- 'CREATED', 'ITEM_ADDED', 'SHIPPED'
eventData JSON,
occurredAt TIMESTAMP DEFAULT NOW()
);
```
Reconstruct state by replaying events.
**Trade-offs:**
**Pros:** Complete audit, time travel
**Cons:** Query complexity, storage
---
## 8. Schema Evolution
### Strategy 1: Backward-Compatible
Safe changes (no app changes):
- Add nullable column
- Add table (not referenced)
- Add index
- Widen column (VARCHAR(100) → VARCHAR(255))
```sql
ALTER TABLE User ADD COLUMN phoneNumber VARCHAR(20) NULL;
```
### Strategy 2: Expand-Contract
For breaking changes:
1. **Expand:** Add new alongside old
```sql
ALTER TABLE User ADD COLUMN newEmail VARCHAR(255) NULL;
```
2. **Migrate:** Copy data
```sql
UPDATE User SET newEmail = email WHERE newEmail IS NULL;
```
3. **Contract:** Remove old
```sql
ALTER TABLE User DROP COLUMN email;
ALTER TABLE User RENAME COLUMN newEmail TO email;
```
### Strategy 3: Versioned Schemas (NoSQL)
```json
{"_schemaVersion": "2.0", "email": "alice@example.com"}
```
App handles multiple versions.
### Strategy 4: Blue-Green
Run old and new schemas simultaneously, dual-write, migrate, switch reads, remove old.
**Best for:** Major redesigns, zero downtime
---
## 9. Multi-Tenancy
### Pattern 1: Separate Databases
```
tenant1_db, tenant2_db, tenant3_db
```
**Pros:** Strong isolation
**Cons:** High overhead
### Pattern 2: Separate Schemas
```sql
CREATE SCHEMA tenant1;
CREATE TABLE tenant1.User (...);
```
**Pros:** Better than separate DBs
**Cons:** Still some overhead
### Pattern 3: Shared Schema + Tenant ID
```sql
CREATE TABLE User (
id BIGINT PRIMARY KEY,
tenantId BIGINT NOT NULL,
email VARCHAR(255),
UNIQUE (tenantId, email)
);
```
**Pros:** Most efficient
**Cons:** Must filter ALL queries by tenantId
**Recommendation:** Pattern 3 for SaaS, Pattern 1 for regulated industries
---
## 10. Performance
### Indexes
**Covering index** (includes all query columns):
```sql
CREATE INDEX idx_user_status ON User(status) INCLUDE (name, email);
```
**Composite index** (order matters):
```sql
-- Good for: WHERE tenantId = X AND createdAt > Y
CREATE INDEX idx_tenant_date ON Order(tenantId, createdAt);
```
**Partial index** (reduce size):
```sql
CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL;
```
### Partitioning
**Horizontal (sharding):**
```sql
CREATE TABLE Order (...) PARTITION BY RANGE (createdAt);
CREATE TABLE Order_2024_Q1 PARTITION OF Order
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
```
**Vertical:** Split hot/cold data into separate tables.
---
## 11. Common Advanced Patterns
### Soft Deletes
```sql
ALTER TABLE User ADD COLUMN deletedAt TIMESTAMP NULL;
-- Query: WHERE deletedAt IS NULL
```
### Audit Columns
```sql
createdAt TIMESTAMP DEFAULT NOW()
updatedAt TIMESTAMP DEFAULT NOW() ON UPDATE NOW()
createdBy BIGINT REFERENCES User(id)
updatedBy BIGINT REFERENCES User(id)
```
### State Machines
```sql
CREATE TABLE OrderState (
orderId BIGINT REFERENCES Order(id),
state VARCHAR(20),
transitionedAt TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (orderId, transitionedAt)
);
-- Track: draft → pending → confirmed → shipped → delivered
```
### Idempotency Keys
```sql
CREATE TABLE Request (
idempotencyKey UUID PRIMARY KEY,
payload JSON,
result JSON,
processedAt TIMESTAMP
);
-- Prevents duplicate processing
```