--- name: pre-dev-data-model description: | Gate 5: Data structures document - defines entities, relationships, and ownership before database technology selection. Large Track only. trigger: | - API Design passed Gate 4 validation - System stores persistent data - Multiple entities with relationships - Large Track workflow (2+ day features) skip_when: | - Small Track workflow → skip to Task Breakdown - No persistent data → skip to Dependency Map - API Design not validated → complete Gate 4 first sequence: after: [pre-dev-api-design] before: [pre-dev-dependency-map] --- # Data Modeling - Defining Data Structures ## Foundational Principle **Data structures, relationships, and ownership must be defined before database technology selection.** Jumping to database-specific schemas without modeling creates: - Inconsistent data structures across services - Unclear data ownership and authority - Schema conflicts discovered during development - Migration nightmares when requirements change - Performance issues from poor data design **The Data Model answers**: WHAT data exists, HOW entities relate, WHO owns what data? **The Data Model never answers**: WHICH database technology or HOW to implement storage. ## When to Use This Skill Use this skill when: - API Design has passed Gate 4 validation - TRD has passed Gate 3 validation - System stores persistent data - Multiple entities with relationships - Need clear data ownership boundaries - Building data-intensive applications ## Mandatory Workflow ### Phase 1: Data Analysis (Inputs Required) 1. **Approved API Design** (Gate 4 passed) - contracts define data flows 2. **Approved TRD** (Gate 3 passed) - architecture components identified 3. **Approved Feature Map** (Gate 2 passed) - domains defined 4. **Approved PRD** (Gate 1 passed) - business requirements locked 5. **Extract entities** from PRD, Feature Map, and API contracts 6. **Identify relationships** between entities ### Phase 2: Data Modeling 1. **Define entities** (what data objects exist) 2. **Specify attributes** (what properties each entity has) 3. **Model relationships** (how entities connect) 4. **Assign ownership** (which component owns which data) 5. **Define constraints** (uniqueness, required fields, ranges) 6. **Plan data lifecycle** (creation, updates, deletion, archival) 7. **Design access patterns** (how data will be queried) 8. **Consider data quality** (validation, normalization) ### Phase 3: Gate 5 Validation **MANDATORY CHECKPOINT** - Must pass before proceeding to Dependency Map: - [ ] All entities are identified and defined - [ ] Entity attributes are complete with types and constraints - [ ] Relationships between entities are modeled - [ ] Data ownership is explicitly assigned to components - [ ] Primary identifiers are defined - [ ] Unique constraints are specified - [ ] Required vs. optional fields are clear - [ ] Data lifecycle is documented - [ ] Access patterns are identified - [ ] No database-specific details (tables, indexes, SQL) - [ ] No ORM or storage technology specifics ## Explicit Rules ### ✅ DO Include in Data Model - Entity definitions (conceptual data objects) - Attributes with types (string, number, boolean, date, etc.) - Constraints (required, unique, ranges, patterns) - Relationships (one-to-one, one-to-many, many-to-many) - Data ownership (which component is authoritative) - Primary identifiers (how entities are uniquely identified) - Lifecycle rules (soft delete, archival, retention) - Access patterns (how data will be queried) - Data quality rules (validation, normalization) - Referential integrity requirements ### ❌ NEVER Include in Data Model - Database product names (PostgreSQL, MongoDB, Redis) - Table names or collection names - Index definitions - SQL or query language specifics - ORM frameworks (Prisma, TypeORM, SQLAlchemy) - Storage engine specifics (InnoDB, MyISAM) - Partitioning or sharding strategies (implementation detail) - Replication or backup strategies - Database-specific data types (JSONB, UUID, BIGSERIAL) ### Abstraction Rules 1. **Entity**: Say "User" not "users table" 2. **Attribute**: Say "emailAddress: String (email format)" not "email VARCHAR(255)" 3. **Relationship**: Say "User has many Orders" not "foreign key user_id" 4. **Identifier**: Say "Unique identifier" not "UUID primary key" 5. **Constraint**: Say "Must be unique" not "UNIQUE INDEX" ## Rationalization Table | Excuse | Reality | |--------|---------| | "We know it's PostgreSQL, just use PG types" | Database choice comes later. Model abstractly now. | | "Table design is data modeling" | Tables are implementation. Entities are concepts. Stay conceptual. | | "We need indexes for performance" | Indexes are optimization. Model data first, optimize later. | | "ORMs require specific schemas" | ORMs adapt to models. Don't let tooling drive design. | | "Foreign keys define relationships" | Relationships exist conceptually. FKs are implementation. | | "SQL examples help clarity" | Abstract models are clearer. SQL is implementation detail. | | "NoSQL doesn't need relationships" | All systems have data relationships. Model them regardless of DB type. | | "This is just ERD" | ERD is visualization tool. Data model is broader (ownership, lifecycle, etc). | | "We can skip this for simple CRUD" | Even CRUD needs clear entity design. Don't skip. | | "Microservices mean no relationships" | Services interact via data. Model entities per service. | ## Red Flags - STOP If you catch yourself writing any of these in Data Model, **STOP**: - Database product names (Postgres, MySQL, Mongo, Redis) - SQL keywords (CREATE TABLE, ALTER TABLE, SELECT, JOIN) - Database-specific types (SERIAL, JSONB, VARCHAR, TEXT) - Index commands (CREATE INDEX, UNIQUE INDEX) - ORM code (Prisma schema, TypeORM decorators) - Storage details (partitioning, sharding, replication) - Query optimization (EXPLAIN plans, index hints) - Backup/recovery strategies **When you catch yourself**: Replace DB detail with abstract concept. "users table" → "User entity" ## Gate 5 Validation Checklist Before proceeding to Dependency Map, verify: **Entity Completeness**: - [ ] All entities from PRD/Feature Map are modeled - [ ] Entity names are clear and consistent - [ ] Each entity has defined purpose - [ ] Entity boundaries align with component ownership (from TRD) **Attribute Specification**: - [ ] All attributes have types specified - [ ] Required vs. optional is explicit - [ ] Constraints are documented (unique, range, format) - [ ] Default values are specified where relevant - [ ] Computed/derived fields are identified **Relationship Modeling**: - [ ] All relationships between entities are documented - [ ] Cardinality is specified (one-to-one, one-to-many, many-to-many) - [ ] Optional vs. required relationships are clear - [ ] Referential integrity needs are documented - [ ] Circular dependencies are identified and resolved **Data Ownership**: - [ ] Each entity is owned by exactly one component - [ ] Read/write permissions are documented - [ ] Cross-component data access is via APIs (from Gate 4) - [ ] No shared database anti-pattern **Data Quality**: - [ ] Validation rules are specified - [ ] Normalization level is appropriate - [ ] Denormalization decisions are justified - [ ] Data consistency strategy is defined (eventual vs. strong) **Lifecycle Management**: - [ ] Creation rules are documented - [ ] Update patterns are defined - [ ] Deletion strategy is specified (hard vs. soft delete) - [ ] Archival and retention policies exist - [ ] Audit trail needs are identified **Access Patterns**: - [ ] Primary access patterns are documented - [ ] Query needs are identified (lookups, searches, aggregations) - [ ] Write patterns are documented (create, update, delete frequencies) - [ ] Consistency requirements are specified **Technology Agnostic**: - [ ] No database product names - [ ] No SQL or NoSQL specifics - [ ] No table/collection/index definitions - [ ] Can implement in any database technology **Gate Result**: - ✅ **PASS**: All checkboxes checked → Proceed to Dependency Map - ⚠️ **CONDITIONAL**: Remove DB specifics or add missing entities → Re-validate - ❌ **FAIL**: Incomplete model or poor ownership → Rework ## Data Model Template ```markdown # Data Model: [Project/Feature Name] ## Overview - **API Design Reference**: [Link to Gate 4 API contracts] - **TRD Reference**: [Link to Gate 3 TRD] - **Feature Map Reference**: [Link to Gate 2 Feature Map] - **Last Updated**: [Date] - **Status**: Draft / Under Review / Approved ## Data Ownership Map | Entity | Owning Component (from TRD) | Read Access | Write Access | |--------|------------------------------|-------------|--------------| | User | User Service | All components | User Service only | | Order | Order Service | User, Payment, Fulfillment | Order Service only | | Payment | Payment Service | Order, Billing | Payment Service only | | Product | Catalog Service | All components | Catalog Service only | **Principle**: Each entity has exactly ONE authoritative owner. Cross-component access via APIs only. --- ## Entity Definitions ### Entity: User **Purpose**: Represents a system user account **Owned By**: User Service (from TRD) **Primary Identifier**: userId (Unique identifier, immutable) **Attributes**: | Attribute | Type | Required | Unique | Constraints | Description | |-----------|------|----------|--------|-------------|-------------| | userId | Identifier | Yes | Yes | Immutable, UUID format | Unique user identifier | | email | EmailAddress | Yes | Yes | Valid email format, max 254 chars | Primary email | | displayName | String | No | No | 3-50 chars, alphanumeric + spaces | Public name | | passwordHash | String | Yes | No | Hashed value only, never store plain text | Authentication credential | | accountStatus | UserStatus | Yes | No | One of: active, suspended, deleted, pending | Current status | | emailVerified | Boolean | Yes | No | Default: false | Email verification status | | createdAt | Timestamp | Yes | No | Immutable, ISO8601 | Account creation time | | updatedAt | Timestamp | Yes | No | Auto-updated on changes | Last modification time | | lastLoginAt | Timestamp | No | No | ISO8601 | Most recent login time | **Constraints**: - `email` must be unique across all users - `displayName` is optional but recommended (prompt user during registration) - `passwordHash` must never be returned in API responses - `accountStatus` transitions: pending → active, active → suspended → active, active → deleted (final) **Lifecycle**: - **Creation**: Via `CreateUser` API operation (Gate 4 contract) - **Updates**: Via `UpdateUserProfile`, `ChangePassword`, `UpdateStatus` operations - **Deletion**: Soft delete (set `accountStatus = deleted`, retain data for 90 days) - **Archival**: Hard delete after 90 days of soft delete **Access Patterns**: - Lookup by `userId` (primary pattern, most frequent) - Lookup by `email` (login flow, unique constraint) - List users by `accountStatus` (admin operations) - Search by `displayName` (user search feature) **Data Quality**: - `email` is normalized to lowercase before storage - `displayName` is trimmed of leading/trailing whitespace - `passwordHash` uses industry-standard hashing (algorithm TBD in Dependency Map) --- ### Entity: Order **Purpose**: Represents a customer order **Owned By**: Order Service (from TRD) **Primary Identifier**: orderId (Unique identifier, immutable) **Attributes**: | Attribute | Type | Required | Unique | Constraints | Description | |-----------|------|----------|--------|-------------|-------------| | orderId | Identifier | Yes | Yes | Immutable, UUID format | Unique order identifier | | userId | Identifier | Yes | No | Must reference existing User | Customer who placed order | | orderStatus | OrderStatus | Yes | No | One of: pending, confirmed, shipped, delivered, cancelled | Current status | | totalAmount | MonetaryAmount | Yes | No | Non-negative, in smallest currency unit | Total order value | | currency | CurrencyCode | Yes | No | ISO 4217 code | Currency for totalAmount | | shippingAddress | Address | Yes | No | Valid address structure | Delivery destination | | orderItems | List | Yes | No | Min 1 item, max 100 items | Products in order | | createdAt | Timestamp | Yes | No | Immutable, ISO8601 | Order creation time | | updatedAt | Timestamp | Yes | No | Auto-updated on changes | Last modification time | **Nested Types**: #### OrderItem (embedded within Order) | Attribute | Type | Required | Description | |-----------|------|----------|-------------| | productId | Identifier | Yes | References Product entity | | quantity | Integer | Yes | Min 1, max 999 | | unitPrice | MonetaryAmount | Yes | Price per item at time of order | | subtotal | MonetaryAmount | Yes | quantity × unitPrice | #### Address (value object) | Attribute | Type | Required | Description | |-----------|------|----------|-------------| | street1 | String | Yes | Primary street address | | street2 | String | No | Secondary address (apt, suite) | | city | String | Yes | City name | | state | String | Yes | State/province code | | postalCode | String | Yes | Postal/ZIP code | | country | String | Yes | ISO 3166-1 alpha-2 code | **Relationships**: - **User** (one-to-many): One User can have many Orders - **Product** (via OrderItem): Order references Products (read-only, owned by Catalog Service) **Constraints**: - `totalAmount` must equal sum of all `orderItems[].subtotal` + shipping + tax - `orderStatus` transitions: pending → confirmed → shipped → delivered - `orderStatus` can go to cancelled from pending or confirmed only - `orderItems` must contain at least 1 item - All `productId` references must be valid at order creation time (validate via API) **Lifecycle**: - **Creation**: Via `CreateOrder` API operation - **Updates**: Via `UpdateOrderStatus`, `CancelOrder` operations - **Deletion**: Soft delete after 7 years (regulatory compliance) - **Archival**: Orders never hard deleted (permanent record) **Access Patterns**: - Lookup by `orderId` (most frequent) - List by `userId` (user order history) - List by `orderStatus` (fulfillment workflows) - Query by `createdAt` range (reporting, analytics) **Data Quality**: - `orderItems[].unitPrice` is snapshot at order time (price changes don't affect existing orders) - `totalAmount` is computed and validated on creation - `shippingAddress` is validated against address API before order creation --- ### Entity: [Another Entity] [Same structure as above] --- ## Relationship Diagram ``` User (1) ──< has many >── (*) Order │ │ contains ↓ OrderItem (embedded) │ │ references ↓ Product (1) │ │ belongs to ↓ Category (1) Payment (*) ──< processes >── (1) Order ``` **Legend**: - `(1)`: One - `(*)`: Many - `──<`: One-to-many relationship - `──`: One-to-one relationship - Embedded: Data stored within parent entity --- ## Cross-Component Data Access ### User Service → Order Service **Scenario**: User wants to view their order history **Data Flow**: 1. User Service authenticates user (owns User entity) 2. User Service calls Order Service API: `GetOrdersByUserId(userId)` 3. Order Service returns order data (owns Order entity) 4. User Service enriches with user display name if needed **Rules**: - User Service does NOT access Order Service's data store directly - All access via APIs (from Gate 4 contracts) - Order Service is authoritative for Order data --- ### Order Service → Catalog Service **Scenario**: Order creation needs product info **Data Flow**: 1. Order Service receives `CreateOrder` request 2. Order Service calls Catalog Service API: `GetProduct(productId)` 3. Catalog Service returns product data (price, availability) 4. Order Service creates order with snapshot of product data **Rules**: - Order stores `productId` reference and price snapshot - Catalog Service is authoritative for current Product data - Order's product snapshot is immutable (historical record) --- ### [Another Cross-Component Access] [Same structure] --- ## Data Consistency Strategy ### Strong Consistency (Immediate) - User authentication (must be immediately consistent) - Payment transactions (cannot tolerate eventual consistency) - Inventory deductions (prevent overselling) ### Eventual Consistency (Acceptable Delay) - User profile updates reflecting in analytics (delay OK) - Order history in user dashboard (few seconds lag acceptable) - Search indexes (brief staleness tolerable) ### Consistency Implementation**: - Strong: Synchronous API calls with transactional guarantees - Eventual: Async events with idempotent handlers --- ## Data Validation Rules ### User Entity Validation - `email`: Must match RFC 5322 format - `displayName`: No profanity (filter list: [reference]) - `passwordHash`: Must meet complexity requirements (min 12 chars, mixed case, numbers, symbols) ### Order Entity Validation - `totalAmount`: Must match sum of items + fees - `orderItems`: Each `productId` must exist in Catalog Service at order time - `shippingAddress`: Must pass address validation API ### Cross-Entity Validation - User must have `accountStatus = active` to create orders - Products must have `availability > 0` to be added to orders --- ## Data Lifecycle Policies ### Retention Periods | Entity | Active Period | Archive After | Delete After | |--------|---------------|---------------|--------------| | User | Until deleted | 90 days (soft delete) | 90 days after soft delete | | Order | Permanent | N/A | Never (regulatory) | | Payment | 7 years | 7 years | 10 years (compliance) | | AuditLog | 1 year | 1 year | 5 years | ### Soft Delete Strategy - **User**: Set `accountStatus = deleted`, retain data for GDPR compliance window - **Order**: Never deleted (permanent financial record) - **Payment**: Anonymize after 7 years (retain transaction, remove PII) ### Audit Trail Requirements - **User**: Log all status changes, login attempts - **Order**: Log all status transitions, modifications - **Payment**: Log all transaction attempts, results --- ## Data Privacy & Compliance ### Personally Identifiable Information (PII) | Entity | PII Fields | Handling | |--------|-----------|----------| | User | email, displayName | Encrypted at rest, GDPR right to deletion | | Order | shippingAddress | Encrypted at rest, retention per policy | | Payment | card details | Never stored (use tokenization) | ### GDPR Compliance - Users can request data export (via `ExportUserData` API) - Users can request deletion (soft delete with 90-day grace period) - Consent tracking for marketing communications ### Data Encryption - Sensitive fields encrypted at rest (algorithm TBD in Dependency Map) - Encryption keys managed externally (key management TBD in Dependency Map) --- ## Access Pattern Analysis ### High-Frequency Patterns (Optimize for these) 1. **User lookup by ID**: `GetUser(userId)` - 1000 req/sec 2. **Order lookup by ID**: `GetOrder(orderId)` - 500 req/sec 3. **User's orders**: `GetOrdersByUserId(userId)` - 200 req/sec ### Medium-Frequency Patterns 4. **User lookup by email**: `GetUserByEmail(email)` - 50 req/sec (login) 5. **Orders by status**: `GetOrdersByStatus(status)` - 30 req/sec (fulfillment) ### Low-Frequency Patterns 6. **User search**: `SearchUsers(query)` - 5 req/sec (admin) 7. **Order reports**: `GetOrdersByDateRange(start, end)` - 1 req/hour (reporting) **Optimization Notes** (for later, not now): - High-frequency patterns need fast lookups (indexes, caching) - Medium-frequency patterns need balanced design - Low-frequency patterns can tolerate slower queries --- ## Data Quality Standards ### Normalization - **User email**: Stored in lowercase, normalized form - **Addresses**: Standardized format via address validation service - **Phone numbers**: Stored in E.164 format ### Validation - All inputs validated before storage - Validation rules documented per attribute - Failed validation returns clear error messages ### Data Integrity - Referential integrity enforced (userId in Order must exist) - Constraints enforced at model level (not just database) - Data consistency checks in automated tests --- ## Migration & Evolution Strategy ### Schema Evolution - **Additive changes**: Add optional fields (backward compatible) - **Non-breaking**: Default values for new required fields - **Breaking changes**: Versioned approach (v1 → v2 with migration) ### Data Migration - Plan for zero-downtime migrations - Backward and forward compatibility during migration - Rollback strategy for failed migrations ### Versioning - Entities can evolve over time - Migration scripts documented (but not written here - that's implementation) - Compatibility maintained during transitions --- ## Gate 5 Validation **Validation Date**: [Date] **Validated By**: [Person/team] - [ ] All entities defined with complete attributes - [ ] Relationships documented and valid - [ ] Data ownership assigned to components - [ ] Constraints and validation rules specified - [ ] Lifecycle policies documented - [ ] Access patterns identified - [ ] No database-specific details included - [ ] Ready for Dependency Map (Gate 6) **Approval**: ☐ Approved | ☐ Needs Revision | ☐ Rejected **Next Step**: Proceed to Dependency Map (`pre-dev-dependency-map`) ``` ## Common Violations and Fixes ### Violation 1: Database-Specific Schema ❌ **Wrong**: ```sql CREATE TABLE users ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), email VARCHAR(255) UNIQUE NOT NULL, created_at TIMESTAMP DEFAULT NOW() ); ``` ✅ **Correct**: ```markdown ### Entity: User **Attributes**: | Attribute | Type | Required | Unique | Description | |-----------|------|----------|--------|-------------| | userId | Identifier | Yes | Yes | Unique user identifier | | email | EmailAddress | Yes | Yes | Primary email address | | createdAt | Timestamp | Yes | No | Account creation time | ``` ### Violation 2: ORM-Specific Code ❌ **Wrong**: ```typescript @Entity() class User { @PrimaryGeneratedColumn('uuid') id: string; @Column({ unique: true }) email: string; } ``` ✅ **Correct**: ```markdown ### Entity: User **Primary Identifier**: userId (Unique identifier) **Attributes**: userId, email, ... **Constraints**: email must be unique ``` ### Violation 3: Technology in Relationships ❌ **Wrong**: ```markdown **Relationships**: - Foreign key `user_id` references `users.id` - Join table `user_roles` for many-to-many ``` ✅ **Correct**: ```markdown **Relationships**: - User (one-to-many) Order: One user can have many orders - User (many-to-many) Role: Users can have multiple roles, roles can be assigned to multiple users ``` ## Confidence Scoring Use this to adjust your interaction with the user: ```yaml Confidence Factors: Entity Coverage: [0-30] - All entities modeled: 30 - Most entities covered: 20 - Significant gaps: 10 Relationship Clarity: [0-25] - All relationships documented: 25 - Most relationships clear: 15 - Ambiguous connections: 5 Data Ownership: [0-25] - Clear ownership boundaries: 25 - Mostly clear with minor overlaps: 15 - Unclear or contested: 5 Constraint Completeness: [0-20] - All validation rules specified: 20 - Common cases covered: 12 - Minimal specification: 5 Total: [0-100] Action: 80+: Generate complete data model autonomously 50-79: Present options for normalization/relationships <50: Ask clarifying questions about entity boundaries ``` ## Output Location **Always output to**: `docs/pre-development/data-model/data-model-[feature-name].md` ## After Data Model Approval 1. ✅ Lock the data model - entity structure is now reference 2. 🎯 Use data model as input for Dependency Map (next phase: `pre-dev-dependency-map`) 3. 🚫 Never add database specifics to data model retroactively 4. 📋 Keep data model technology-agnostic until Dependency Map ## Quality Self-Check Before declaring Data Model complete, verify: - [ ] All entities are defined with complete attributes - [ ] Attribute types and constraints are specified - [ ] Relationships are modeled with correct cardinality - [ ] Data ownership is explicitly assigned - [ ] Primary identifiers are defined - [ ] Lifecycle policies are documented - [ ] Access patterns are identified - [ ] Validation rules are comprehensive - [ ] Privacy/compliance needs are addressed - [ ] Consistency strategy is defined - [ ] Zero database-specific details (tables, SQL, indexes) - [ ] Zero ORM or framework specifics - [ ] Gate 5 validation checklist 100% complete ## The Bottom Line **If you wrote SQL schemas or ORM code, delete it and model abstractly.** Data modeling is conceptual. Period. No database products. No SQL. No ORMs. Database technology goes in Dependency Map. That's the next phase. Wait for it. Violating this separation means: - You're locked into a database before evaluating alternatives - Data model can't be reused across different database types - You can't objectively compare relational vs. document vs. key-value - Poor separation of concerns (conceptual vs. physical) **Model the data. Stay abstract. Choose database later.**