# Security Controls Design

## Overview

Design security controls as **layered defenses at trust boundaries**. Core principle: Apply systematic checks at every boundary to ensure no single control failure compromises security.

**Key insight**: List specific controls after identifying WHERE to apply them (trust boundaries first, then controls).

## When to Use

Load this skill when:
- Implementing authentication/authorization systems
- Hardening API endpoints, databases, file storage
- Designing data protection mechanisms
- Securing communication channels
- Protecting sensitive operations

**Symptoms you need this**:
- "How do I secure this API/database/upload feature?"
- "What controls should I implement?"
- "How do I prevent unauthorized access to X?"
- "How do I harden this system?"

**Don't use for**:
- Threat modeling (use `ordis/security-architect/threat-modeling` first)
- Code-level security patterns (use `ordis/security-architect/secure-code-patterns`)
- Reviewing existing designs (use `ordis/security-architect/security-architecture-review`)

## Core Methodology: Trust Boundaries First

**DON'T start with**: "What controls should I implement?"

**DO start with**: "Where are the trust boundaries?"

### Step 1: Identify Trust Boundaries

Trust boundaries are **points where data/requests cross from less-trusted to more-trusted zones**.

**Common boundaries:**
- Internet → API Gateway
- API Gateway → Application Server
- Application → Database
- Application → File Storage
- Unauthenticated → Authenticated
- User Role → Admin Role
- External Service → Internal Service

**Example: File Upload System**
```
Trust Boundaries:
1. User Browser → Upload Endpoint (UNTRUSTED → APP)
2. Upload Endpoint → Virus Scanner (APP → SCANNER)
3. Scanner → Storage (SCANNER → S3)
4. Storage → Display (S3 → USER)
5. Storage → Internal Processing (S3 → APP)
```
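The boundary list above can also be kept as data, so that boundaries with no controls assigned are easy to spot. This is a minimal sketch: `TrustBoundary`, `UPLOAD_BOUNDARIES`, and `boundaries_missing_controls` are hypothetical names introduced for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustBoundary:
    """One crossing between two zones (hypothetical model for illustration)."""
    source: str
    destination: str
    source_zone: str
    destination_zone: str

    def label(self) -> str:
        return (f"{self.source} -> {self.destination} "
                f"({self.source_zone} -> {self.destination_zone})")


# The five boundaries from the file upload example
UPLOAD_BOUNDARIES = [
    TrustBoundary("User Browser", "Upload Endpoint", "UNTRUSTED", "APP"),
    TrustBoundary("Upload Endpoint", "Virus Scanner", "APP", "SCANNER"),
    TrustBoundary("Scanner", "Storage", "SCANNER", "S3"),
    TrustBoundary("Storage", "Display", "S3", "USER"),
    TrustBoundary("Storage", "Internal Processing", "S3", "APP"),
]


def boundaries_missing_controls(boundaries, controls_by_boundary):
    """Return every boundary that has no controls assigned yet."""
    return [b for b in boundaries if not controls_by_boundary.get(b)]
```

Passing a dict that maps each `TrustBoundary` to its list of controls returns the boundaries still waiting for Step 2.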

### Step 2: Apply Defense-in-Depth at Each Boundary

For EACH boundary, apply multiple control layers. If one fails, others provide backup.

## Defense-in-Depth Checklist

Use this checklist at **every trust boundary**:

### Layer 1: Validation (First Line)
- [ ] **Input validation**: Type, format, size, allowed values
- [ ] **Sanitization**: Remove dangerous characters, escape output
- [ ] **Canonicalization**: Resolve to standard form (prevent bypass)
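
As one illustration of the canonicalization item, the sketch below resolves a user-supplied path to its standard form before checking that it stays inside an allowed directory. `resolve_inside` and `ValidationError` are hypothetical names, and the base directory is an assumption for the example.

```python
from pathlib import Path


class ValidationError(Exception):
    """Raised when input fails validation (hypothetical exception type)."""


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Canonicalize user_path and reject it if it escapes base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses "..", ".", and symlinks into one canonical path
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base
        candidate.relative_to(base)
    except ValueError:
        raise ValidationError("Path escapes the allowed directory")
    return candidate
```

Checking the raw string for `..` would miss absolute paths and symlinks. Comparing canonical forms covers those cases with a single check.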

### Layer 2: Authentication (Who Are You?)
- [ ] **Identity verification**: Credentials, tokens, certificates
- [ ] **Multi-factor authentication**: For sensitive boundaries
- [ ] **Session management**: Secure tokens, expiration, rotation

### Layer 3: Authorization (What Can You Do?)
- [ ] **Access control checks**: RBAC, ABAC, resource-level
- [ ] **Least privilege enforcement**: Grant minimum necessary
- [ ] **Privilege escalation prevention**: No path to higher access

### Layer 4: Rate Limiting (Abuse Prevention)
- [ ] **Request rate limits**: Per-IP, per-user, per-endpoint
- [ ] **Resource quotas**: Prevent resource exhaustion
- [ ] **Anomaly detection**: Flag unusual patterns

### Layer 5: Audit Logging (Detective)
- [ ] **Security event logging**: Who, what, when, where, outcome
- [ ] **Tamper-proof logs**: Write-only for applications
- [ ] **Alerting**: Automated detection of suspicious activity
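
Write-only permissions stop an application from editing its own logs. A complementary technique, named plainly here as a hash chain, makes after-the-fact edits detectable. The sketch below is an assumption-laden illustration: `append_event`, `verify_chain`, and the entry layout are hypothetical, and a real deployment would also need to anchor the latest hash somewhere the attacker cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # Placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Serialize deterministically so the same event always hashes the same
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any stored event changes its digest, so `verify_chain` fails from that entry onward.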

### Layer 6: Encryption (Confidentiality)
- [ ] **Data in transit**: TLS 1.3, certificate validation
- [ ] **Data at rest**: Encryption for sensitive data
- [ ] **Key management**: Secure storage, rotation, separation

**Example Application** (API Authentication Boundary):
```
Internet → API Gateway boundary:

Layer 1 (Validation):
- Validate Authorization header present and well-formed
- Check request size limits (prevent DoS)
- Validate content-type and payload structure

Layer 2 (Authentication):
- Verify JWT signature (RS256, public key validation)
- Check token expiration (exp claim)
- Verify token not revoked (check Redis revocation list)

Layer 3 (Authorization):
- Extract scopes from token
- Verify the token carries the scope the endpoint requires
- Check resource-level permissions (can user access THIS resource?)

Layer 4 (Rate Limiting):
- Per-token: 1000 requests/minute
- Per-IP: 100 requests/minute (catch token sharing)
- Per-endpoint: Stricter limits on write operations

Layer 5 (Audit Logging):
- Log authentication attempts (success/failure)
- Log authorization decisions (allowed/denied)
- Log resource access (who accessed what)

Layer 6 (Encryption):
- Enforce TLS 1.3 only (reject unencrypted)
- Validate certificate chain
- Store tokens encrypted in session store
```

**If ANY layer fails, others provide defense.**
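
A minimal sketch of how such layers might be chained in code, assuming each layer is a function that returns `True` to allow. `run_layers`, `Denied`, and the toy `LAYERS` list are hypothetical; real checks would call the gateway's actual validators.

```python
from typing import Callable, Dict, List, Tuple


class Denied(Exception):
    """Raised when any layer rejects or cannot evaluate the request."""


Check = Callable[[Dict], bool]


def run_layers(request: Dict, layers: List[Tuple[str, Check]]) -> None:
    """Run every layer in order; any rejection or error denies the request."""
    for name, check in layers:
        try:
            passed = check(request)
        except Exception as exc:
            # A check that crashes is treated as a denial (fail-closed)
            raise Denied(f"{name}: check errored") from exc
        if not passed:
            raise Denied(f"{name}: rejected")


# Toy layers for illustration only; the token and scope values are made up
LAYERS: List[Tuple[str, Check]] = [
    ("validation", lambda r: isinstance(r.get("token"), str)
        and len(r.get("body", "")) <= 1024),
    ("authentication", lambda r: r["token"] in {"token-alice"}),
    ("authorization", lambda r: r["scope"] in {"read:posts"}),
]
```

A request passes only if every layer returns `True`. A missing field that makes a later check raise is denied rather than skipped.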

## Fail-Secure Patterns

When a control fails, the system should **default to a secure state** (deny access, close connection, reject request).

### Fail-Closed (Secure) vs Fail-Open (Insecure)

| Situation | Fail-Open (❌ BAD) | Fail-Closed (✅ GOOD) |
|-----------|-------------------|---------------------|
| **Auth service down** | Allow all requests through | Deny all requests until service recovers |
| **Token validation fails** | Treat as valid | Reject request |
| **Database unreachable** | Skip permission check | Deny access |
| **Rate limit store unavailable** | No rate limiting | Apply strictest default limit |
| **Audit log fails to write** | Continue operation | Reject operation |

### Examples of Fail-Secure Implementation

**Example 1: Authentication Service Failure**
```python
def authenticate_request(request):
    try:
        token = extract_token(request)
        user = auth_service.validate_token(token)  # External service call
        return user
    except AuthServiceUnavailable:
        # ❌ FAIL-OPEN: return AnonymousUser()  # Let them through
        # ✅ FAIL-CLOSED: raise Unauthorized("Authentication service unavailable")
        raise Unauthorized("Authentication service unavailable")
    except InvalidToken:
        raise Unauthorized("Invalid token")
```

**Example 2: Rate Limiter Failure**
```python
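def check_rate_limit(user_id):
    key = f"rate:{user_id}"
    try:
        # INCR returns the updated counter as an integer, so no separate GET
        count = redis.incr(key)
        if count == 1:
            # Start the window on the first hit so the counter resets.
            # WINDOW_SECONDS is an assumed constant (e.g. 60), like LIMIT.
            redis.expire(key, WINDOW_SECONDS)
        if count > LIMIT:
            raise RateLimitExceeded()
    except RedisConnectionError:
        # ❌ FAIL-OPEN: return  # Let request through
        # ✅ FAIL-CLOSED: Apply strictest default limit
        # If Redis is down, apply aggressive in-memory rate limit
        in_memory_limiter.check(user_id, limit=10)  # Much stricter than normal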
```
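
Example 2 leaves `in_memory_limiter` undefined. One possible shape for it, sketched under the assumption that a per-process sliding window is an acceptable fallback, is shown below. `InMemoryLimiter` is a hypothetical class. Because its state is not shared across application instances, the effective limit is per process.

```python
import time
from collections import defaultdict, deque


class RateLimitExceeded(Exception):
    """Raised when a caller exceeds the allowed request count."""


class InMemoryLimiter:
    """Sliding-window limiter held in process memory (fallback only)."""

    def __init__(self, window_seconds: float = 60.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock  # Injectable so tests can control time
        self.hits = defaultdict(deque)

    def check(self, user_id, limit: int) -> None:
        now = self.clock()
        hits = self.hits[user_id]
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= limit:
            raise RateLimitExceeded(f"fallback limit of {limit} reached")
        hits.append(now)


in_memory_limiter = InMemoryLimiter()
```

With `limit=10`, the eleventh call for the same user inside one window raises `RateLimitExceeded`, while other users are unaffected.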

**Example 3: Database Permission Check**
```python
def can_user_access_resource(user_id, resource_id):
    try:
        permission = db.query(
            "SELECT can_read FROM permissions WHERE user_id = ? AND resource_id = ?",
            user_id, resource_id
        )
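        # No matching row means no grant: default deny
        # (assumes db.query returns None when nothing matches)
        return permission is not None and bool(permission.can_read)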
    except DatabaseConnectionError:
        # ❌ FAIL-OPEN: return True  # Assume they have access
        # ✅ FAIL-CLOSED: return False  # Deny access if can't verify
        logger.error(f"DB unavailable, denying access for user={user_id} resource={resource_id}")
        return False
```

**Example 4: File Type Validation**
```python
def validate_file_upload(file):
    # Layer 1: Check extension
    if file.extension not in ALLOWED_EXTENSIONS:
        raise ValidationError("Invalid file type")

    # Layer 2: Check magic bytes
    try:
        magic_bytes = file.read(16)
        file.seek(0)  # Rewind so later processing sees the whole file
    except Exception as e:
        # ❌ FAIL-OPEN: return True  # Couldn't check, assume valid
        # ✅ FAIL-CLOSED: raise ValidationError("Could not validate file")
        raise ValidationError(f"Could not validate file: {e}")

    # Compare outside the try block so this rejection is not caught
    # and re-wrapped by the generic handler above
    if not is_valid_magic_bytes(magic_bytes):
        # ✅ FAIL-CLOSED: If magic bytes don't match, reject
        # Even if extension passed, magic bytes take precedence
        raise ValidationError("File content doesn't match extension")
```
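
Example 4 calls `is_valid_magic_bytes` without defining it. A minimal sketch of such a helper follows, assuming an allowlist of four common formats. The signature table and the optional `claimed_extension` parameter are illustrative choices, not the author's implementation.

```python
# Leading byte signatures for a small allowlist of formats
MAGIC_SIGNATURES = {
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpg": [b"\xff\xd8\xff"],
    "gif": [b"GIF87a", b"GIF89a"],
    "pdf": [b"%PDF-"],
}


def detect_type(header: bytes):
    """Return the allowlisted type whose signature matches, else None."""
    for file_type, signatures in MAGIC_SIGNATURES.items():
        if any(header.startswith(sig) for sig in signatures):
            return file_type
    return None


def is_valid_magic_bytes(header: bytes, claimed_extension=None) -> bool:
    """Accept only known signatures; optionally require an extension match."""
    detected = detect_type(header)
    if detected is None:
        return False  # Unknown content is rejected (default deny)
    if claimed_extension is None:
        return True
    ext = claimed_extension.lower().lstrip(".")
    if ext == "jpeg":
        ext = "jpg"
    return detected == ext
```

A renamed executable fails here because its first bytes (for example `MZ` for Windows executables) match no allowlisted signature.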

**Principle**: **When in doubt, deny**. It's better to have a false positive (deny a legitimate request) than a false negative (allow a malicious request).

## Least Privilege Principle

Grant **minimum necessary access** for each component to perform its function. No more.

### Application Method

**For each component, ask three questions:**
1. **What does it NEED to do?** (functional requirements)
2. **What's the MINIMUM access to achieve that?** (reduce scope)
3. **What can it NEVER do?** (explicit denials)

### Example: Database Access Roles

**Web Application Role:**
```sql
-- What it NEEDS: Read customers, write audit logs
GRANT SELECT ON customers TO web_app_user;
GRANT INSERT ON audit_logs TO web_app_user;

-- What's MINIMUM: No DELETE, no UPDATE on audit logs (immutable), no admin tables
REVOKE DELETE ON customers FROM web_app_user;
REVOKE ALL ON admin_users FROM web_app_user;

-- Explicit NEVER: Cannot modify audit logs (tamper-proof)
REVOKE UPDATE, DELETE ON audit_logs FROM web_app_user;

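-- Row-level security: Only active customers
-- (PostgreSQL: a policy takes effect only once RLS is enabled on the table)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;
CREATE POLICY web_app_access ON customers
  FOR SELECT TO web_app_user
  USING (status = 'active');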
```

**Analytics Role:**
```sql
-- What it NEEDS: Read non-PII customer data for analytics
-- What's MINIMUM: View with PII columns excluded
CREATE VIEW customers_analytics AS
  SELECT customer_id, country, subscription_tier, created_at
  FROM customers;  -- Excludes: name, email, address

GRANT SELECT ON customers_analytics TO analytics_user;

-- What it can NEVER do: Access PII, modify data, see payment info
REVOKE ALL ON customers FROM analytics_user;
REVOKE ALL ON payment_info FROM analytics_user;
ALTER ROLE analytics_user SET default_transaction_read_only = on;
```

### File System Permissions

**Application Server:**
```bash
# What it NEEDS: Read config, write logs, read/write uploads
/etc/app/config/          → Read-only (owner: root, chmod 640, group: app)
/var/log/app/             → Write-only (owner: app, chmod 200, append-only)
/var/uploads/             → Read/write (owner: app, chmod 700)

# What it can NEVER do: Write to config, execute from uploads
/etc/app/config/          → No write permissions
/var/uploads/             → Mount with noexec flag (prevent execution)
```

### API Scopes (OAuth2 Pattern)

```python
# User requests minimal scopes
scopes_requested = ["read:profile", "read:posts"]

# DON'T grant admin scopes by default
# DO grant only what was requested and approved
token = create_token(user, scopes=scopes_requested)

# At each endpoint, verify scope
@require_scope("write:posts")
def create_post(request):
    # This endpoint is inaccessible with read:posts scope
    pass
```
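
The `@require_scope` decorator above is not defined in this document. One way it might be written, assuming the request object exposes the token's granted scopes as a `scopes` attribute, is sketched below. `Forbidden` and `Request` are hypothetical stand-ins for whatever the web framework provides.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when the token lacks the scope an endpoint requires."""


def require_scope(required: str):
    """Decorator factory: deny unless `required` is among the granted scopes."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(request, *args, **kwargs):
            # A missing or empty scopes attribute is treated as "no scopes"
            granted = getattr(request, "scopes", None) or ()
            if required not in granted:
                raise Forbidden(f"missing scope: {required}")
            return handler(request, *args, **kwargs)
        return wrapper
    return decorator


class Request:
    """Minimal stand-in for a framework request object."""

    def __init__(self, scopes):
        self.scopes = frozenset(scopes)


@require_scope("write:posts")
def create_post(request):
    return "created"
```

Calling `create_post(Request(["read:posts"]))` raises `Forbidden`, while a request carrying `write:posts` reaches the handler.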

**Principle**: **Default deny, explicit allow**. Start with no access, grant only what's needed.

## Separation of Duties

**No single component/person/account should have complete control** over a critical operation.

### Patterns

#### Pattern 1: Multi-Signature Approvals

**Example: Production Deployments**
```yaml
# Require 2 approvals from different teams
approvals:
  required: 2
  teams:
    - engineering-leads
    - security-team

# Cannot approve own PR
prevent_self_approval: true
```

#### Pattern 2: Split Responsibilities

**Example: Payment Processing**
```python
# Component A: Initiates payment (can create, cannot approve)
payment_service.initiate_payment(amount, account)

# Component B: Approves payment (can approve, cannot create)
# Different credentials, different service
approval_service.approve_payment(payment_id)

# Component C: Executes payment (can execute, cannot create/approve)
# Only accepts approved payments
execution_service.execute_payment(approved_payment_id)
```

**No single service can create AND approve AND execute a payment.**
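
A minimal sketch of how that rule might be enforced, assuming each call arrives with an already-authenticated service identity. `PaymentLedger` and the identity strings are hypothetical. In practice the actor would be derived from service credentials, not passed as a plain argument.

```python
class NotPermitted(Exception):
    """Raised when an actor attempts a step it is not allowed to perform."""


class PaymentLedger:
    """Tracks payment state and which service identity may do each step."""

    # Each step is bound to exactly one service identity
    ALLOWED = {
        "initiate": "payment_service",
        "approve": "approval_service",
        "execute": "execution_service",
    }

    def __init__(self):
        self.payments = {}
        self._next_id = 1

    def _require(self, action: str, actor: str) -> None:
        if self.ALLOWED[action] != actor:
            raise NotPermitted(f"{actor} may not {action}")

    def initiate(self, actor: str, amount: int, account: str) -> int:
        self._require("initiate", actor)
        payment_id = self._next_id
        self._next_id += 1
        self.payments[payment_id] = {
            "amount": amount, "account": account, "state": "pending"}
        return payment_id

    def approve(self, actor: str, payment_id: int) -> None:
        self._require("approve", actor)
        payment = self.payments[payment_id]
        if payment["state"] != "pending":
            raise NotPermitted("only pending payments can be approved")
        payment["state"] = "approved"

    def execute(self, actor: str, payment_id: int) -> None:
        self._require("execute", actor)
        payment = self.payments[payment_id]
        if payment["state"] != "approved":
            raise NotPermitted("payment is not approved")
        payment["state"] = "executed"
```

Both the actor check and the state check must pass, so a compromised initiator can neither approve its own payment nor skip approval.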

#### Pattern 3: Key Splitting

**Example: Encryption Key Management**
```python
# Master key split into 3 shares using Shamir Secret Sharing
# Require 2 of 3 shares to reconstruct
shares = split_key(master_key, threshold=2, num_shares=3)

# Distribute to different teams/locations
security_team.store(shares[0])
ops_team.store(shares[1])
compliance_team.store(shares[2])

# Reconstruction requires 2 teams to cooperate
reconstructed = reconstruct_key([shares[0], shares[1]])
```
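
For readers curious about what `split_key` and `reconstruct_key` might do internally, here is a teaching sketch of Shamir Secret Sharing over a prime field. It assumes the key is represented as an integer smaller than `PRIME`. The function bodies are illustrative only. Production systems should rely on a vetted library or an HSM, not hand-rolled code.

```python
import secrets

# 2**521 - 1 is a Mersenne prime, large enough to hold a 256-bit key
PRIME = 2**521 - 1


def split_key(secret: int, threshold: int, num_shares: int):
    """Split `secret` into shares; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 < threshold <= num_shares:
        raise ValueError("need 1 < threshold <= num_shares")
    # Random polynomial of degree threshold-1 with the secret as constant term
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, highest degree first
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct_key(shares):
    """Recover the secret via Lagrange interpolation evaluated at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime)
        inv_den = pow(den, PRIME - 2, PRIME)
        secret = (secret + yi * num * inv_den) % PRIME
    return secret
```

A byte-string key can be converted with `int.from_bytes(key_bytes, "big")` before splitting. Any two of the three shares reconstruct the same value.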

#### Pattern 4: Admin Operations Require Approval

**Example: Database Admin Actions**
```python
# Admin initiates action (creates request, cannot execute)
admin_request = AdminRequest(
    action="DELETE_USER",
    user_id=12345,
    reason="GDPR erasure request",
    requested_by=admin_id
)

# Second admin reviews and approves (cannot initiate)
reviewer.approve(admin_request, reviewer_id=different_admin_id)

# System executes after approval (automated, no single admin control)
if admin_request.is_approved():
    execute_admin_action(admin_request)
```
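
The approval flow above depends on the request object refusing self-approval. A simplified sketch of that check follows, assuming admin identities are plain strings. This `AdminRequest` moves `approve` onto the request itself, which differs slightly from the `reviewer.approve(...)` call in the example, and the `perform` callback is a hypothetical stand-in for the real action.

```python
from dataclasses import dataclass
from typing import Optional


class ApprovalError(Exception):
    """Raised when the two-person rule would be violated."""


@dataclass
class AdminRequest:
    action: str
    user_id: int
    reason: str
    requested_by: str
    approved_by: Optional[str] = None

    def approve(self, reviewer_id: str) -> None:
        # The requester can never be their own reviewer
        if reviewer_id == self.requested_by:
            raise ApprovalError("requester cannot approve their own request")
        if self.approved_by is not None:
            raise ApprovalError("request is already approved")
        self.approved_by = reviewer_id

    def is_approved(self) -> bool:
        # Re-check the identities here in case the field was set directly
        return (self.approved_by is not None
                and self.approved_by != self.requested_by)


def execute_admin_action(request: AdminRequest, perform):
    """Run `perform(request)` only for requests approved by a second admin."""
    if not request.is_approved():
        raise ApprovalError("request has not been approved")
    return perform(request)
```

`is_approved` repeats the identity comparison so that execution stays blocked even if `approved_by` was written without going through `approve`.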

**Principle**: **Break critical paths into multiple steps requiring different actors.**

## Control Verification Method

For **each control you design**, ask: **"What if this control fails?"**

### Verification Checklist

**For each control:**
1. **What attack does this prevent?** (threat it addresses)
2. **How can this control fail?** (failure modes)
3. **What happens if it fails?** (impact)
4. **What's the next layer of defense?** (backup control)
5. **Is failure logged/detected?** (observability)

### Example: API Token Validation

**Control**: Verify JWT signature before processing request

1. **What attack**: Prevents forged tokens, ensures authenticity
2. **How it can fail**:
   - Public key unavailable (service down)
   - Expired token not caught (clock skew)
   - Token revocation list unavailable (Redis down)
   - Signature algorithm downgrade attack (accept HS256 instead of RS256)
3. **What if it fails**:
   - Public key unavailable → Fail-closed (deny all requests)
   - Expired token → Layer 2: Check expiration explicitly
   - Revocation list down → Layer 3: Apply strict rate limits as fallback
   - Algorithm downgrade → Layer 4: Explicitly require RS256, reject others
4. **Next layer**:
   - Authorization checks (even with valid token, check permissions)
   - Rate limiting (limit damage from compromised token)
   - Audit logging (detect unusual access patterns)
5. **Failure logged**: Yes → Log signature validation failures, alert on spike

**Outcome**: Designed 4 layers of defense against token attacks.
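
The algorithm-downgrade case can be sketched as a pre-check that reads the JWT header and rejects anything outside an explicit allowlist. This snippet does **not** verify the signature. It only illustrates pinning the algorithm, and `check_algorithm` and `InvalidToken` are hypothetical names. With a library such as PyJWT, the equivalent pinning is done by passing `algorithms=["RS256"]` to `jwt.decode`.

```python
import base64
import json


class InvalidToken(Exception):
    """Raised when a token is malformed or uses a disallowed algorithm."""


ALLOWED_ALGORITHMS = frozenset({"RS256"})


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding, so restore it before decoding
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def check_algorithm(token: str) -> str:
    """Return the header's alg if allowlisted; raise InvalidToken otherwise."""
    parts = token.split(".")
    if len(parts) != 3:
        raise InvalidToken("malformed token")
    try:
        header = json.loads(_b64url_decode(parts[0]))
    except (ValueError, TypeError) as exc:
        # Covers bad base64, bad UTF-8, and bad JSON: all are rejected
        raise InvalidToken("unreadable header") from exc
    alg = header.get("alg") if isinstance(header, dict) else None
    if not isinstance(alg, str) or alg not in ALLOWED_ALGORITHMS:
        raise InvalidToken(f"algorithm not allowed: {alg!r}")
    return alg
```

Tokens declaring `HS256` or `none` are rejected before any key material is touched, so the verifier cannot be steered into using a weaker algorithm.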

### Example: File Upload Validation

**Control**: Check file extension against allowlist

1. **What attack**: Prevents upload of executable files (.exe, .sh)
2. **How it can fail**:
   - Attacker renames malware.exe → malware.jpg
   - Double extension: malware.jpg.exe
   - Case variation: malware.ExE
3. **What if it fails**: Malicious file stored, potentially executed
4. **Next layers**:
   - Layer 2: Magic byte verification (check file content, not name)
   - Layer 3: Antivirus scanning (detect known malware)
   - Layer 4: File reprocessing (re-encode images, destroying embedded code)
   - Layer 5: noexec mount (storage prevents execution)
   - Layer 6: Separate domain for user content (origin isolation plus CSP limits XSS impact)
5. **Failure logged**: Yes → Log validation failures, rejected files

**Outcome**: Extension check is Layer 1 of 6. If bypassed, 5 more layers prevent exploitation.

## Quick Reference: Control Selection

**For every trust boundary, apply this checklist:**

| Layer | Control Type | Example |
|-------|--------------|---------|
| **1. Validation** | Input checking | Size limits, type validation, sanitization |
| **2. Authentication** | Identity verification | JWT validation, certificate checks, MFA |
| **3. Authorization** | Permission checks | RBAC, resource-level access, least privilege |
| **4. Rate Limiting** | Abuse prevention | Per-user limits, anomaly detection, quotas |
| **5. Audit Logging** | Detective | Security events, tamper-proof logs, alerting |
| **6. Encryption** | Confidentiality | TLS in transit, encryption at rest, key management |

**For each control:**
- Define fail-secure behavior (what happens if it fails?)
- Apply least privilege (minimum necessary access)
- Verify separation of duties (no single point of complete control)
- Test "what if this fails?" (ensure backup layers exist)

## Common Mistakes

### ❌ Designing Controls Before Identifying Boundaries

**Wrong**: "I need authentication and authorization and rate limiting"

**Right**: "Where are my trust boundaries? → Internet→API, API→Database → At each: apply layered controls"

**Why**: Controls are meaningless without knowing WHERE to apply them.


### ❌ Single Layer of Defense

**Wrong**: "Authentication is enough security"

**Right**: "Authentication + Authorization + Rate Limiting + Audit Logging"

**Why**: If authentication is bypassed (bug, misconfiguration), other layers provide defense.


### ❌ Fail-Open Defaults

**Wrong**:
```python
try:
    user = auth_service.validate(token)
except ServiceUnavailable:
    user = AnonymousUser()  # Let them through
```

**Right**:
```python
try:
    user = auth_service.validate(token)
except ServiceUnavailable:
    raise Unauthorized("Auth service unavailable")
```

**Why**: Control failure should result in secure state (deny), not insecure state (allow).


### ❌ Excessive Privileges

**Wrong**: Grant web application full database access (SELECT, INSERT, UPDATE, DELETE on all tables)

**Right**: Grant only needed operations per table (SELECT on customers, INSERT-only on audit_logs)

**Why**: Minimizes damage from compromised application (SQL injection, stolen credentials).


### ❌ Single Point of Control

**Wrong**: One admin account can initiate, approve, and execute critical operations

**Right**: Separate accounts for initiate vs approve, require multi-signature

**Why**: Prevents single compromised account from complete system control.


### ❌ No Verification of "What If This Fails?"

**Wrong**: Design controls, assume they work

**Right**: For each control, ask "how can this fail?" and design backup layers

**Why**: Controls fail due to bugs, misconfigurations, attacks. Backup layers provide resilience.

## Cross-References

**Use BEFORE this skill**:
- `ordis/security-architect/threat-modeling` - Identify threats first, then design controls to address them

**Use WITH this skill**:
- `muna/technical-writer/documentation-structure` - Document control architecture as ADR

**Use AFTER this skill**:
- `ordis/security-architect/security-architecture-review` - Review controls for completeness

## Real-World Impact

**Well-designed controls using this methodology:**
- Multi-layered API authentication catching token forgery even when signature validation was bypassed (algorithm confusion attack)
- Database access controls limiting SQL injection damage to read-only operations (least privilege prevented data deletion)
- File upload defenses stopping malware despite extension check bypass (magic bytes + antivirus + reprocessing layers)

**Key lesson**: **Systematic application of defense-in-depth at trust boundaries is more effective than ad-hoc control selection.**