Files

Zhongwei Li 41d9f6b189 Initial commit

2025-11-30 08:38:26 +08:00

9.8 KiB

Raw Blame History

Data Schema & Knowledge Modeling: Advanced Methodology

Workflow

Advanced Schema Modeling:
- [ ] Step 1: Analyze complex domain patterns
- [ ] Step 2: Design advanced relationship structures
- [ ] Step 3: Apply normalization or strategic denormalization
- [ ] Step 4: Model temporal/historical aspects
- [ ] Step 5: Plan schema evolution strategy

Steps: (1) Identify patterns in Advanced Relationships, (2) Apply Hierarchy and Polymorphic patterns, (3) Use Normalization then Denormalization, (4) Add Temporal if needed, (5) Plan Evolution.

1. Advanced Relationship Patterns

Self-Referential

Entity relates to itself (org charts, categories, social networks).

CREATE TABLE Employee (
  id BIGINT PRIMARY KEY,
  managerId BIGINT NULL REFERENCES Employee(id),
  CONSTRAINT no_self_ref CHECK (id != managerId)
);

Query with recursive CTE for full hierarchy.

Conditional

Relationship exists only under conditions.

CREATE TABLE Order (
  id BIGINT PRIMARY KEY,
  status VARCHAR(20),
  paymentId BIGINT NULL REFERENCES Payment(id),
  CONSTRAINT payment_when_paid CHECK (
    (status IN ('paid','completed') AND paymentId IS NOT NULL) OR
    (status NOT IN ('paid','completed'))
  )
);

Multi-Parent

Entity has multiple parents (document in folders).

CREATE TABLE DocumentFolder (
  documentId BIGINT REFERENCES Document(id),
  folderId BIGINT REFERENCES Folder(id),
  PRIMARY KEY (documentId, folderId)
);

2. Hierarchy Modeling

Four approaches with trade-offs:

Approach	Implementation	Read	Write	Best For
Adjacency List	`parentId` column	Slow (recursive)	Fast	Shallow trees, frequent updates
Path Enumeration	`path VARCHAR` ('/1/5/12/')	Fast	Medium	Read-heavy, moderate depth
Nested Sets	`lft, rgt INT`	Fastest	Slow	Read-heavy, rare writes
Closure Table	Separate ancestor/descendant table	Fastest	Medium	Complex queries, any depth

Adjacency List:

CREATE TABLE Category (
  id BIGINT PRIMARY KEY,
  parentId BIGINT NULL REFERENCES Category(id)
);

Closure Table:

CREATE TABLE CategoryClosure (
  ancestor BIGINT,
  descendant BIGINT,
  depth INT,  -- 0=self, 1=child, 2+=deeper
  PRIMARY KEY (ancestor, descendant)
);

Recommendation: Adjacency for < 5 levels, Closure for complex queries.

3. Polymorphic Associations

Entity relates to multiple types (Comment on Post/Photo/Video).

Approach 1: Separate FKs (Recommended for SQL)

CREATE TABLE Comment (
  id BIGINT PRIMARY KEY,
  postId BIGINT NULL REFERENCES Post(id),
  photoId BIGINT NULL REFERENCES Photo(id),
  videoId BIGINT NULL REFERENCES Video(id),
  CONSTRAINT one_parent CHECK (
    (postId IS NOT NULL)::int +
    (photoId IS NOT NULL)::int +
    (videoId IS NOT NULL)::int = 1
  )
);

Pros: Type-safe, referential integrity Cons: Schema grows with types

Approach 2: Supertype/Subtype

CREATE TABLE Commentable (id BIGINT PRIMARY KEY, type VARCHAR(50));
CREATE TABLE Post (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Photo (id BIGINT PRIMARY KEY REFERENCES Commentable(id), ...);
CREATE TABLE Comment (commentableId BIGINT REFERENCES Commentable(id));

Use when: Shared attributes across types.

4. Graph & Ontology Design

Property Graph

Nodes = entities, Edges = relationships, both have properties.

CREATE (u:User {id: 1, name: 'Alice'})
CREATE (p:Product {id: 100, name: 'Widget'})
CREATE (u)-[:PURCHASED {date: '2024-01-15', quantity: 2}]->(p)

Schema:

Nodes: User, Product, Category
Edges: PURCHASED (User→Product, {date, quantity})
       REVIEWED (User→Product, {rating, comment})
       BELONGS_TO (Product→Category)

Design principles:

Nodes for entities with identity
Edges for relationships
Properties on edges for context
Avoid deep traversals (< 3 hops)

RDF Triples (Semantic Web)

Subject-Predicate-Object:

ex:Alice rdf:type ex:User .
ex:Alice ex:purchased ex:Widget .

Use RDF when: Standards compliance, semantic reasoning, linked data Use Property Graph when: Performance, complex traversals

5. Normalization Levels

1NF: Atomic Values

Violation: Multiple phones in one column Fix: Separate UserPhone table

2NF: No Partial Dependencies

Violation: In OrderItem(orderId, productId, productName), productName depends only on productId Fix: productName lives in Product table

3NF: No Transitive Dependencies

Violation: In Address(id, zipCode, city, state), city/state depend on zipCode Fix: Separate ZipCode table

When to normalize to 3NF: OLTP, frequent updates, consistency required

6. Strategic Denormalization

Only after profiling shows bottleneck.

Pattern 1: Computed Aggregates

Store Order.total instead of summing OrderItems on every query.

Trade-off: Faster reads, slower writes, consistency risk (use triggers/app logic)

Pattern 2: Frequent Joins

Embed address fields in User table to avoid join.

Trade-off: No join, but updates must maintain both

Pattern 3: Historical Snapshots

CREATE TABLE OrderSnapshot (
  orderId BIGINT,
  snapshotDate DATE,
  userName VARCHAR(255),  -- denormalized from User
  userEmail VARCHAR(255),
  PRIMARY KEY (orderId, snapshotDate)
);

Use when: Need point-in-time data (e.g., user's name at time of order)

7. Temporal & Historical Modeling

Pattern 1: Effective Dating

CREATE TABLE Price (
  productId BIGINT,
  price DECIMAL(10,2),
  effectiveFrom DATE NOT NULL,
  effectiveTo DATE NULL,  -- NULL = current
  PRIMARY KEY (productId, effectiveFrom)
);

Query current: WHERE effectiveFrom <= TODAY AND (effectiveTo IS NULL OR effectiveTo > TODAY)

Pattern 2: History Table

CREATE TABLE UserHistory (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  userId BIGINT,
  email VARCHAR(255),
  name VARCHAR(255),
  validFrom TIMESTAMP DEFAULT NOW(),
  validTo TIMESTAMP NULL,
  changeType VARCHAR(20)  -- 'INSERT', 'UPDATE', 'DELETE'
);

Trigger on User table inserts into UserHistory on changes.

Pattern 3: Event Sourcing

CREATE TABLE OrderEvent (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  orderId BIGINT,
  eventType VARCHAR(50),  -- 'CREATED', 'ITEM_ADDED', 'SHIPPED'
  eventData JSON,
  occurredAt TIMESTAMP DEFAULT NOW()
);

Reconstruct state by replaying events.

Trade-offs: Pros: Complete audit, time travel Cons: Query complexity, storage

8. Schema Evolution

Strategy 1: Backward-Compatible

Safe changes (no app changes):

Add nullable column
Add table (not referenced)
Add index
Widen column (VARCHAR(100) → VARCHAR(255))

ALTER TABLE User ADD COLUMN phoneNumber VARCHAR(20) NULL;

Strategy 2: Expand-Contract

For breaking changes:

Expand: Add new alongside old

ALTER TABLE User ADD COLUMN newEmail VARCHAR(255) NULL;

Migrate: Copy data

UPDATE User SET newEmail = email WHERE newEmail IS NULL;

Contract: Remove old

ALTER TABLE User DROP COLUMN email;
ALTER TABLE User RENAME COLUMN newEmail TO email;

Strategy 3: Versioned Schemas (NoSQL)

{"_schemaVersion": "2.0", "email": "alice@example.com"}

App handles multiple versions.

Strategy 4: Blue-Green

Run old and new schemas simultaneously, dual-write, migrate, switch reads, remove old.

Best for: Major redesigns, zero downtime

9. Multi-Tenancy

Pattern 1: Separate Databases

tenant1_db, tenant2_db, tenant3_db

Pros: Strong isolation Cons: High overhead

Pattern 2: Separate Schemas

CREATE SCHEMA tenant1;
CREATE TABLE tenant1.User (...);

Pros: Better than separate DBs Cons: Still some overhead

Pattern 3: Shared Schema + Tenant ID

CREATE TABLE User (
  id BIGINT PRIMARY KEY,
  tenantId BIGINT NOT NULL,
  email VARCHAR(255),
  UNIQUE (tenantId, email)
);

Pros: Most efficient Cons: Must filter ALL queries by tenantId

Recommendation: Pattern 3 for SaaS, Pattern 1 for regulated industries

10. Performance

Indexes

Covering index (includes all query columns):

CREATE INDEX idx_user_status ON User(status) INCLUDE (name, email);

Composite index (order matters):

-- Good for: WHERE tenantId = X AND createdAt > Y
CREATE INDEX idx_tenant_date ON Order(tenantId, createdAt);

Partial index (reduce size):

CREATE INDEX idx_active ON User(email) WHERE deletedAt IS NULL;

Partitioning

Horizontal (sharding):

CREATE TABLE Order (...) PARTITION BY RANGE (createdAt);
CREATE TABLE Order_2024_Q1 PARTITION OF Order
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

Vertical: Split hot/cold data into separate tables.

11. Common Advanced Patterns

Soft Deletes

ALTER TABLE User ADD COLUMN deletedAt TIMESTAMP NULL;
-- Query: WHERE deletedAt IS NULL

Audit Columns

createdAt TIMESTAMP DEFAULT NOW()
updatedAt TIMESTAMP DEFAULT NOW() ON UPDATE NOW()
createdBy BIGINT REFERENCES User(id)
updatedBy BIGINT REFERENCES User(id)

State Machines

CREATE TABLE OrderState (
  orderId BIGINT REFERENCES Order(id),
  state VARCHAR(20),
  transitionedAt TIMESTAMP DEFAULT NOW(),
  PRIMARY KEY (orderId, transitionedAt)
);
-- Track: draft → pending → confirmed → shipped → delivered

Idempotency Keys

CREATE TABLE Request (
  idempotencyKey UUID PRIMARY KEY,
  payload JSON,
  result JSON,
  processedAt TIMESTAMP
);
-- Prevents duplicate processing

9.8 KiB Raw Blame History