# Data Schema & Knowledge Modeling: Advanced Methodology ## Workflow ``` Advanced Schema Modeling: - [ ] Step 1: Analyze complex domain patterns - [ ] Step 2: Design advanced relationship structures - [ ] Step 3: Apply normalization or strategic denormalization - [ ] Step 4: Model temporal/historical aspects - [ ] Step 5: Plan schema evolution strategy ``` **Steps:** (1) Identify patterns in [Advanced Relationships](#1-advanced-relationship-patterns), (2) Apply [Hierarchy](#2-hierarchy-modeling) and [Polymorphic](#3-polymorphic-associations) patterns, (3) Use [Normalization](#4-normalization-levels) then [Denormalization](#5-strategic-denormalization), (4) Add [Temporal](#6-temporal--historical-modeling) if needed, (5) Plan [Evolution](#7-schema-evolution). --- ## 1. Advanced Relationship Patterns ### Self-Referential Entity relates to itself (org charts, categories, social networks). ```sql CREATE TABLE Employee ( id BIGINT PRIMARY KEY, managerId BIGINT NULL REFERENCES Employee(id), CONSTRAINT no_self_ref CHECK (id != managerId) ); ``` Query with recursive CTE for full hierarchy. ### Conditional Relationship exists only under conditions. ```sql CREATE TABLE Order ( id BIGINT PRIMARY KEY, status VARCHAR(20), paymentId BIGINT NULL REFERENCES Payment(id), CONSTRAINT payment_when_paid CHECK ( (status IN ('paid','completed') AND paymentId IS NOT NULL) OR (status NOT IN ('paid','completed')) ) ); ``` ### Multi-Parent Entity has multiple parents (document in folders). ```sql CREATE TABLE DocumentFolder ( documentId BIGINT REFERENCES Document(id), folderId BIGINT REFERENCES Folder(id), PRIMARY KEY (documentId, folderId) ); ``` --- ## 2. Hierarchy Modeling Four approaches with trade-offs: | Approach | Implementation | Read | Write | Best For | |----------|---------------|------|-------|----------| | **Adjacency List** | `parentId` column | Slow (recursive) | Fast | Shallow trees, frequent updates | | **Path Enumeration** | `path VARCHAR` ('/1/5/12/') | Fast | Medium | Read-heavy, moderate depth | | **Nested Sets** | `lft, rgt INT` | Fastest | Slow | Read-heavy, rare writes | | **Closure Table** | Separate ancestor/descendant table | Fastest | Medium | Complex queries, any depth | **Adjacency List:** ```sql CREATE TABLE Category ( id BIGINT PRIMARY KEY, parentId BIGINT NULL REFERENCES Category(id) ); ``` **Closure Table:** ```sql CREATE TABLE CategoryClosure ( ancestor BIGINT, descendant BIGINT, depth INT, -- 0=self, 1=child, 2+=deeper PRIMARY KEY (ancestor, descendant) ); ``` **Recommendation:** Adjacency for < 5 levels, Closure for complex queries. --- ## 3. Polymorphic Associations Entity relates to multiple types (Comment on Post/Photo/Video). ### Approach 1: Separate FKs (Recommended for SQL) ```sql CREATE TABLE Comment ( id BIGINT PRIMARY KEY, postId BIGINT NULL REFERENCES Post(id), photoId BIGINT NULL REFERENCES Photo(id), videoId BIGINT NULL REFERENCES Video(id), CONSTRAINT one_parent CHECK ( (postId IS NOT NULL)::int + (photoId IS NOT NULL)::int + (videoId IS NOT NULL)::int = 1 ) ); ``` **Pros:** Type-safe, referential integrity **Cons:** Schema grows with types ### Approach 2: Supertype/Subtype ```sql CREATE TABLE Commentable (id BIGINT PRIMARY KEY, type VARCHAR(50)); CREATE TABLE Post (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...); CREATE TABLE Photo (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...); CREATE TABLE Comment (commentableId BIGINT REFERENCES Commentable(id)); ``` **Use when:** Shared attributes across types. --- ## 4. Graph & Ontology Design ### Property Graph **Nodes** = entities, **Edges** = relationships, both have properties. ```cypher CREATE (u:User {id: 1, name: 'Alice'}) CREATE (p:Product {id: 100, name: 'Widget'}) CREATE (u)-[:PURCHASED {date: '2024-01-15', quantity: 2}]->(p) ``` **Schema:** ``` Nodes: User, Product, Category Edges: PURCHASED (User→Product, {date, quantity}) REVIEWED (User→Product, {rating, comment}) BELONGS_TO (Product→Category) ``` **Design principles:** - Nodes for entities with identity - Edges for relationships - Properties on edges for context - Avoid deep traversals (< 3 hops) ### RDF Triples (Semantic Web) Subject-Predicate-Object: ```turtle ex:Alice rdf:type ex:User . ex:Alice ex:purchased ex:Widget . ``` **Use RDF when:** Standards compliance, semantic reasoning, linked data **Use Property Graph when:** Performance, complex traversals --- ## 5. Normalization Levels ### 1NF: Atomic Values **Violation:** Multiple phones in one column **Fix:** Separate UserPhone table ### 2NF: No Partial Dependencies **Violation:** In OrderItem(orderId, productId, productName), productName depends only on productId **Fix:** productName lives in Product table ### 3NF: No Transitive Dependencies **Violation:** In Address(id, zipCode, city, state), city/state depend on zipCode **Fix:** Separate ZipCode table **When to normalize to 3NF:** OLTP, frequent updates, consistency required --- ## 6. Strategic Denormalization **Only after profiling shows bottleneck.** ### Pattern 1: Computed Aggregates Store `Order.total` instead of summing OrderItems on every query. **Trade-off:** Faster reads, slower writes, consistency risk (use triggers/app logic) ### Pattern 2: Frequent Joins Embed address fields in User table to avoid join. **Trade-off:** No join, but updates must maintain both ### Pattern 3: Historical Snapshots ```sql CREATE TABLE OrderSnapshot ( orderId BIGINT, snapshotDate DATE, userName VARCHAR(255), -- denormalized from User userEmail VARCHAR(255), PRIMARY KEY (orderId, snapshotDate) ); ``` **Use when:** Need point-in-time data (e.g., user's name at time of order) --- ## 7. Temporal & Historical Modeling ### Pattern 1: Effective Dating ```sql CREATE TABLE Price ( productId BIGINT, price DECIMAL(10,2), effectiveFrom DATE NOT NULL, effectiveTo DATE NULL, -- NULL = current PRIMARY KEY (productId, effectiveFrom) ); ``` **Query current:** WHERE effectiveFrom <= TODAY AND (effectiveTo IS NULL OR effectiveTo > TODAY) ### Pattern 2: History Table ```sql CREATE TABLE UserHistory ( id BIGINT AUTO_INCREMENT PRIMARY KEY, userId BIGINT, email VARCHAR(255), name VARCHAR(255), validFrom TIMESTAMP DEFAULT NOW(), validTo TIMESTAMP NULL, changeType VARCHAR(20) -- 'INSERT', 'UPDATE', 'DELETE' ); ``` Trigger on User table inserts into UserHistory on changes. ### Pattern 3: Event Sourcing ```sql CREATE TABLE OrderEvent ( id BIGINT AUTO_INCREMENT PRIMARY KEY, orderId BIGINT, eventType VARCHAR(50), -- 'CREATED', 'ITEM_ADDED', 'SHIPPED' eventData JSON, occurredAt TIMESTAMP DEFAULT NOW() ); ``` Reconstruct state by replaying events. **Trade-offs:** **Pros:** Complete audit, time travel **Cons:** Query complexity, storage --- ## 8. Schema Evolution ### Strategy 1: Backward-Compatible Safe changes (no app changes): - Add nullable column - Add table (not referenced) - Add index - Widen column (VARCHAR(100) → VARCHAR(255)) ```sql ALTER TABLE User ADD COLUMN phoneNumber VARCHAR(20) NULL; ``` ### Strategy 2: Expand-Contract For breaking changes: 1. **Expand:** Add new alongside old ```sql ALTER TABLE User ADD COLUMN newEmail VARCHAR(255) NULL; ``` 2. **Migrate:** Copy data ```sql UPDATE User SET newEmail = email WHERE newEmail IS NULL; ``` 3. **Contract:** Remove old ```sql ALTER TABLE User DROP COLUMN email; ALTER TABLE User RENAME COLUMN newEmail TO email; ``` ### Strategy 3: Versioned Schemas (NoSQL) ```json {"_schemaVersion": "2.0", "email": "alice@example.com"} ``` App handles multiple versions. ### Strategy 4: Blue-Green Run old and new schemas simultaneously, dual-write, migrate, switch reads, remove old. **Best for:** Major redesigns, zero downtime --- ## 9. Multi-Tenancy ### Pattern 1: Separate Databases ``` tenant1_db, tenant2_db, tenant3_db ``` **Pros:** Strong isolation **Cons:** High overhead ### Pattern 2: Separate Schemas ```sql CREATE SCHEMA tenant1; CREATE TABLE tenant1.User (...); ``` **Pros:** Better than separate DBs **Cons:** Still some overhead ### Pattern 3: Shared Schema + Tenant ID ```sql CREATE TABLE User ( id BIGINT PRIMARY KEY, tenantId BIGINT NOT NULL, email VARCHAR(255), UNIQUE (tenantId, email) ); ``` **Pros:** Most efficient **Cons:** Must filter ALL queries by tenantId **Recommendation:** Pattern 3 for SaaS, Pattern 1 for regulated industries --- ## 10. Performance ### Indexes **Covering index** (includes all query columns): ```sql CREATE INDEX idx_user_status ON User(status) INCLUDE (name, email); ``` **Composite index** (order matters): ```sql -- Good for: WHERE tenantId = X AND createdAt > Y CREATE INDEX idx_tenant_date ON Order(tenantId, createdAt); ``` **Partial index** (reduce size): ```sql CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL; ``` ### Partitioning **Horizontal (sharding):** ```sql CREATE TABLE Order (...) PARTITION BY RANGE (createdAt); CREATE TABLE Order_2024_Q1 PARTITION OF Order FOR VALUES FROM ('2024-01-01') TO ('2024-04-01'); ``` **Vertical:** Split hot/cold data into separate tables. --- ## 11. Common Advanced Patterns ### Soft Deletes ```sql ALTER TABLE User ADD COLUMN deletedAt TIMESTAMP NULL; -- Query: WHERE deletedAt IS NULL ``` ### Audit Columns ```sql createdAt TIMESTAMP DEFAULT NOW() updatedAt TIMESTAMP DEFAULT NOW() ON UPDATE NOW() createdBy BIGINT REFERENCES User(id) updatedBy BIGINT REFERENCES User(id) ``` ### State Machines ```sql CREATE TABLE OrderState ( orderId BIGINT REFERENCES Order(id), state VARCHAR(20), transitionedAt TIMESTAMP DEFAULT NOW(), PRIMARY KEY (orderId, transitionedAt) ); -- Track: draft → pending → confirmed → shipped → delivered ``` ### Idempotency Keys ```sql CREATE TABLE Request ( idempotencyKey UUID PRIMARY KEY, payload JSON, result JSON, processedAt TIMESTAMP ); -- Prevents duplicate processing ```