# How to Create a Data Model Specification

Data Model specifications document the entities, fields, relationships, and constraints for your application's data. They define the "shape" of data your system works with.

## Quick Start

```bash
# 1. Create a new data model
scripts/generate-spec.sh data-model data-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/data-model/data-001-descriptive-slug.md)

# 3. Fill in entities and relationships, then validate:
scripts/validate-spec.sh docs/specs/data-model/data-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/data-model/data-001-descriptive-slug.md
```

## When to Write a Data Model

Use a Data Model when you need to:
- Define database schema for new features
- Document entity relationships and constraints
- Establish consistent naming conventions
- Enable API/UI teams to understand data structure
- Plan data migrations or refactoring
- Document complex data relationships

## Research Phase

### 1. Research Related Specifications
Find what this data model supports:

```bash
# Find technical requirements this fulfills
grep -r "prd\|technical" docs/specs/ --include="*.md"

# Find existing data models that might be related
grep -r "data\|model" docs/specs/ --include="*.md"

# Find API contracts that expose this data
grep -r "api\|endpoint" docs/specs/ --include="*.md"
```

### 2. Review Existing Data Models
- What data modeling patterns does your codebase use?
- What database are you using (PostgreSQL, MongoDB, etc.)?
- How are relationships currently modeled?
- Naming conventions for fields and entities?
- Any legacy schema patterns to respect or migrate from?

### 3. Research Domain Models
- How do industry-standard models structure similar data?
- Are there existing standards (e.g., ISO, RFC) you should follow?
- What are best practices in this domain?

### 4. Understand Business Rules
- What constraints must the data satisfy?
- What are the cardinality rules (one-to-many, many-to-many)?
- What data must be unique or required?
- What's the expected scale/volume?

## Structure & Content Guide

### Title & Metadata
- **Title**: "User Data Model", "Transaction Model", etc.
- **Scope**: What entities does this model cover?
- **Version**: 1.0 for new models

### Overview Section
Provide context:

```markdown
# User & Profile Data Model

This data model defines the core entities for user management and profile
information. Covers user accounts, authentication data, and user preferences.

**Entities**: User, UserProfile, UserPreference
**Relationships**: User → UserProfile (1:1), User → UserPreference (1:many)
**Primary Database**: PostgreSQL
```

### Entity Definitions Section

Document each entity/table:

#### Entity: User

```markdown
### User

Core user account entity. Every user must have exactly one User record.

**Purpose**: Represents a user account in the system.

**Fields**

| Field | Type | Required | Unique | Default | Description |
|-------|------|----------|--------|---------|-------------|
| id | UUID | Yes | Yes | auto | Primary key, auto-generated |
| email | String(255) | Yes | Yes | - | User's email address, used for login |
| password_hash | String(255) | Yes | No | - | Bcrypt hash of password (cost=12) |
| first_name | String(100) | No | No | - | User's first name |
| last_name | String(100) | No | No | - | User's last name |
| status | Enum | Yes | No | active | Account status: active, inactive, suspended |
| created_at | Timestamp | Yes | No | now() | Account creation time (UTC) |
| updated_at | Timestamp | Yes | No | now() | Last update time (UTC) |
| deleted_at | Timestamp | No | No | NULL | Soft-delete timestamp, NULL if active |

**Indexes**
- Primary: `email` (unique for quick lookups)
- Secondary: `created_at` (for user listing/pagination)
- Secondary: `status` (for filtering active users)

**Constraints**
- Email format must be valid (enforced in application)
- Password must be at least 8 characters (enforced in application)
- Email must be globally unique
- Status can only be: active, inactive, suspended

**Data Volume**
- Expected growth: 100 new users/day
- Estimated year 1: ~36k users
- Estimated year 3: ~150k users

**Archival Strategy**
- Deleted users (deleted_at != NULL) moved to archive after 1 year
- Soft deletes used for data recovery capability
```

#### Entity: UserProfile

```markdown
### UserProfile

Extended user profile information. One-to-one relationship with User.

**Purpose**: Stores optional user profile information separate from core account.

**Fields**

| Field | Type | Required | Unique | Description |
|-------|------|----------|--------|-------------|
| id | UUID | Yes | Yes | Primary key |
| user_id | UUID (FK) | Yes | Yes | Foreign key to User.id |
| avatar_url | String(500) | No | No | URL to user's avatar image |
| bio | String(500) | No | No | User bio/description |
| phone | String(20) | No | Yes | User phone number |
| timezone | String(50) | No | No | User's timezone (e.g., America/New_York) |
| language | String(5) | No | No | Preferred language (ISO 639-1, e.g., en, fr) |
| theme | Enum | No | No | UI theme preference: light, dark, auto |
| created_at | Timestamp | Yes | No | Creation time |
| updated_at | Timestamp | Yes | No | Last update time |

**Indexes**
- Primary: `user_id` (unique for 1:1 relationship)

**Constraints**
- Foreign key: user_id references User(id) ON DELETE CASCADE
- Phone must be valid format (if provided)
- Timezone must be valid (e.g., from IANA timezone database)
- Language must be valid ISO 639-1 code
- Theme must be one of: light, dark, auto

**Notes**
- Soft-deleted with parent User (CASCADE delete)
- Profile is optional - some users may not have profile data
```

#### Entity: UserPreference

```markdown
### UserPreference

Key-value preferences for users. Flexible schema for future preference types.

**Purpose**: Stores user preferences without requiring schema changes.

**Fields**

| Field | Type | Required | Unique | Description |
|-------|------|----------|--------|-------------|
| id | UUID | Yes | Yes | Primary key |
| user_id | UUID (FK) | Yes | No | Foreign key to User.id |
| preference_key | String(100) | Yes | No | Preference identifier (e.g., notifications_email) |
| preference_value | String(1000) | Yes | No | Preference value as string |
| created_at | Timestamp | Yes | No | Creation time |
| updated_at | Timestamp | Yes | No | Last update time |

**Indexes**
- Composite: `(user_id, preference_key)` - For efficient preference lookup
- Primary: `user_id` - For finding all preferences for a user

**Constraints**
- Foreign key: user_id references User(id) ON DELETE CASCADE
- Composite unique: `(user_id, preference_key)` - One preference per key per user
- preference_key must match pattern: `[a-z_]+` (lowercase letters and underscores only)
- preference_value must be valid JSON or simple string

**Valid Preferences**
Examples of preference_key values:
- `notifications_email` → "true"/"false"
- `notifications_sms` → "true"/"false"
- `export_format` → "csv"/"json"
- `ui_columns_per_page` → "20"/"50"/"100"

**Notes**
- Flexible key-value design allows adding preferences without schema changes
- Values stored as strings for flexibility, parsed by application layer
```

### Relationships Section

Document how entities relate:

```markdown
## Entity Relationships

```
┌───────────┐         ┌──────────────┐         ┌─────────────┐
│   User    │         │ UserProfile  │         │ UserPref    │
├───────────┤         ├──────────────┤         ├─────────────┤
│ id (PK)   │         │ id (PK)      │         │ id (PK)     │
│ email     │◄───1:1──│ user_id (FK) │         │ user_id(FK) │
│ ...       │         │ avatar_url   │         │ pref_key    │
└───────────┘         │ ...          │         │ pref_value  │
                      └──────────────┘         └─────────────┘
                                                       ▲
                                                       │
                                                     1:many
```

### Relationship: User → UserProfile (1:1)
- **Type**: One-to-One
- **Foreign Key**: UserProfile.user_id → User.id
- **Cardinality**: A User has exactly one UserProfile; a UserProfile belongs to exactly one User
- **Delete Behavior**: CASCADE - Deleting User deletes UserProfile
- **Optional**: UserProfile is optional (some users may not have detailed profile)

### Relationship: User → UserPreference (1:many)
- **Type**: One-to-Many
- **Foreign Key**: UserPreference.user_id → User.id
- **Cardinality**: A User can have many UserPreferences; each UserPreference belongs to one User
- **Delete Behavior**: CASCADE - Deleting User deletes all preferences
- **Optional**: A User can have zero preferences
```

### Constraints & Validation Section

```markdown
## Data Constraints & Validation

### Business Logic Constraints
- Users cannot have duplicate emails (enforced at database + application)
- User phone numbers must be unique if provided
- Email and phone cannot both be deleted/NULL in UserProfile

### Data Integrity Rules
- password_hash must never be exposed in API responses
- deleted_at cannot be set retroactively (only forward through time)
- updated_at must be >= created_at

### Referential Integrity
- Foreign key constraints enforced at database level
- Cascade deletes on User deletion
- No orphaned UserProfile or UserPreference records

### Enumeration Values

**User.status**
- `active` - Account is active
- `inactive` - Account temporarily inactive
- `suspended` - Account suspended (admin action)

**UserProfile.theme**
- `light` - Light theme
- `dark` - Dark theme
- `auto` - Follow system settings

**UserPreference.preference_key**
- Must match pattern: `[a-z_]+`
- Examples: `notifications_email`, `export_format`, `ui_language`
```

### Scaling Considerations Section

```markdown
## Scaling & Performance

### Expected Data Volume
- Users: 100-1000 per day growth
- Preferences: ~5-10 per user on average
- Year 1 estimate: 36k users, ~180k preference records

### Table Sizes
- User table: ~36MB (estimated year 1)
- UserProfile table: ~28MB
- UserPreference table: ~22MB

### Query Patterns & Indexes
- Find user by email: Indexed (UNIQUE index on email)
- Find all preferences for user: Indexed (composite on user_id, pref_key)
- List users by creation date: Indexed (on created_at)
- Filter users by status: Indexed (on status)

### Optimization Notes
- Composite index `(user_id, preference_key)` enables efficient preference lookups
- Email index enables fast login queries
- Consider partitioning UserPreference by user_id for very large scale (100M+ records)
```

### Migration & Change Management Section

```markdown
## Schema Evolution

### Creating These Tables
```sql
CREATE TABLE user (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email VARCHAR(255) UNIQUE NOT NULL,
  password_hash VARCHAR(255) NOT NULL,
  first_name VARCHAR(100),
  last_name VARCHAR(100),
  status VARCHAR(50) DEFAULT 'active' NOT NULL,
  created_at TIMESTAMP DEFAULT now() NOT NULL,
  updated_at TIMESTAMP DEFAULT now() NOT NULL,
  deleted_at TIMESTAMP
);

CREATE TABLE user_profile (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID UNIQUE NOT NULL REFERENCES user(id) ON DELETE CASCADE,
  avatar_url VARCHAR(500),
  bio VARCHAR(500),
  phone VARCHAR(20) UNIQUE,
  timezone VARCHAR(50),
  language VARCHAR(5),
  theme VARCHAR(20),
  created_at TIMESTAMP DEFAULT now() NOT NULL,
  updated_at TIMESTAMP DEFAULT now() NOT NULL
);

CREATE TABLE user_preference (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES user(id) ON DELETE CASCADE,
  preference_key VARCHAR(100) NOT NULL,
  preference_value VARCHAR(1000) NOT NULL,
  created_at TIMESTAMP DEFAULT now() NOT NULL,
  updated_at TIMESTAMP DEFAULT now() NOT NULL,
  UNIQUE(user_id, preference_key)
);

CREATE INDEX idx_user_email ON user(email);
CREATE INDEX idx_user_created_at ON user(created_at);
CREATE INDEX idx_user_status ON user(status);
CREATE INDEX idx_preference_lookup ON user_preference(user_id, preference_key);
```

### Future Migrations
- Q2 2024: Add `last_login_at` to User (nullable, new index)
- Q3 2024: Implement user archival (age > 1 year, no activity)
```

### Documentation & Examples Section

```markdown
## Example Queries

### Find user by email
```sql
SELECT * FROM user WHERE email = 'user@example.com';
```

### Get user with profile
```sql
SELECT u.*, p.*
FROM user u
LEFT JOIN user_profile p ON u.id = p.user_id
WHERE u.id = $1;
```

### Get user's preferences
```sql
SELECT preference_key, preference_value
FROM user_preference
WHERE user_id = $1
ORDER BY created_at DESC;
```

### Archive old inactive users
```sql
UPDATE user
SET deleted_at = now()
WHERE status = 'inactive' AND updated_at < now() - interval '1 year'
AND deleted_at IS NULL;
```
```

## Writing Tips

### Document Constraints Clearly
- Why does each field have the constraints it does?
- What validation rules apply?
- What happens on constraint violations?

### Think About Scale
- How much data will this table store?
- What are the growth projections?
- What indexing strategy is needed?
- Will partitioning be needed in the future?

### Link to Related Specs
- Reference technical requirements: `[PRD-001]`
- Reference API contracts: `[API-001]` (what data is exposed)
- Reference design documents: `[DES-001]`

### Include Examples
- Sample SQL for common queries
- Sample JSON representations
- Example migration scripts

### Document Change Constraints
- What fields can't change after creation?
- What fields are immutable?
- How do we handle schema evolution?

## Validation & Fixing Issues

### Run the Validator
```bash
scripts/validate-spec.sh docs/specs/data-model/data-001-your-spec.md
```

### Common Issues & Fixes

**Issue**: "Missing entity field specifications"
- **Fix**: Complete the fields table for each entity with types, constraints, descriptions

**Issue**: "No relationships documented"
- **Fix**: Add a relationships section showing foreign keys and cardinality

**Issue**: "TODO items in Constraints (3 items)"
- **Fix**: Complete constraint definitions, validation rules, and enumeration values

**Issue**: "No scaling or performance information"
- **Fix**: Add data volume estimates, indexing strategy, and optimization notes

## Decision-Making Framework

As you write the data model, consider:

1. **Entity Design**: What entities do we need?
   - What are distinct concepts?
   - What are attributes vs. relationships?
   - Should data be normalized or denormalized?

2. **Relationships**: How do entities relate?
   - One-to-one, one-to-many, many-to-many?
   - Should relationships be required or optional?
   - How should deletions cascade?

3. **Constraints**: What rules must data satisfy?
   - Uniqueness constraints?
   - Required fields?
   - Data type restrictions?
   - Enumeration values?

4. **Performance**: How will data be queried?
   - What indexes are needed?
   - What's the expected scale?
   - Are there bottlenecks?

5. **Evolution**: How will this model change?
   - Can we add fields without migrations?
   - Can we add entities without breaking things?
   - How do we handle data migrations?

## Next Steps

1. **Create the spec**: `scripts/generate-spec.sh data-model data-XXX-slug`
2. **Define entities**: What are the main entities/tables?
3. **Specify fields** with types, constraints, descriptions
4. **Document relationships** between entities
5. **Plan indexes** for performance
6. **Validate**: `scripts/validate-spec.sh docs/specs/data-model/data-XXX-slug.md`
7. **Share with team** before implementation