# Security Controls Design
## Overview
Design security controls as **layered defenses at trust boundaries**. Core principle: Apply systematic checks at every boundary to ensure no single control failure compromises security.
**Key insight**: List specific controls after identifying WHERE to apply them (trust boundaries first, then controls).
## When to Use
Load this skill when:
- Implementing authentication/authorization systems
- Hardening API endpoints, databases, file storage
- Designing data protection mechanisms
- Securing communication channels
- Protecting sensitive operations
**Symptoms you need this**:
- "How do I secure this API/database/upload feature?"
- "What controls should I implement?"
- "How do I prevent unauthorized access to X?"
- "How do I harden this system?"
**Don't use for**:
- Threat modeling (use `ordis/security-architect/threat-modeling` first)
- Code-level security patterns (use `ordis/security-architect/secure-code-patterns`)
- Reviewing existing designs (use `ordis/security-architect/security-architecture-review`)
## Core Methodology: Trust Boundaries First
**DON'T start with**: "What controls should I implement?"
**DO start with**: "Where are the trust boundaries?"
### Step 1: Identify Trust Boundaries
Trust boundaries are **points where data/requests cross from less-trusted to more-trusted zones**.
**Common boundaries:**
- Internet → API Gateway
- API Gateway → Application Server
- Application → Database
- Application → File Storage
- Unauthenticated → Authenticated
- User Role → Admin Role
- External Service → Internal Service
**Example: File Upload System**
```
Trust Boundaries:
1. User Browser → Upload Endpoint (UNTRUSTED → APP)
2. Upload Endpoint → Virus Scanner (APP → SCANNER)
3. Scanner → Storage (SCANNER → S3)
4. Storage → Display (S3 → USER)
5. Storage → Internal Processing (S3 → APP)
```
### Step 2: Apply Defense-in-Depth at Each Boundary
For EACH boundary, apply multiple control layers. If one fails, others provide backup.
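One way to make "multiple layers at each boundary" checkable is to record the controls per boundary and flag any layer left empty. The sketch below is illustrative only: `TrustBoundary`, `LAYERS`, and the sample controls are hypothetical names, with the layer names taken from the checklist in the next section.

```python
from dataclasses import dataclass, field

# The six layers from the defense-in-depth checklist in this document
LAYERS = ("validation", "authentication", "authorization",
          "rate_limiting", "audit_logging", "encryption")


@dataclass
class TrustBoundary:
    source: str
    target: str
    controls: dict = field(default_factory=dict)  # layer name -> list of controls

    def missing_layers(self):
        """Return the layers that have no control assigned at this boundary."""
        return [layer for layer in LAYERS if not self.controls.get(layer)]


# Hypothetical design-in-progress for the upload endpoint boundary
upload = TrustBoundary(
    source="User Browser",
    target="Upload Endpoint",
    controls={
        "validation": ["extension allowlist", "size limit"],
        "authentication": ["session token check"],
        "encryption": ["TLS"],
    },
)
print(upload.missing_layers())
```

A non-empty result marks a boundary that still depends on fewer layers than the checklist calls for.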
## Defense-in-Depth Checklist
Use this checklist at **every trust boundary**:
### Layer 1: Validation (First Line)
- [ ] **Input validation**: Type, format, size, allowed values
- [ ] **Sanitization**: Remove dangerous characters, escape output
- [ ] **Canonicalization**: Resolve to standard form (prevent bypass)
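Canonicalization is easiest to see with file paths: the check has to run on the resolved form, not on the string the client sent. A minimal sketch, assuming a hypothetical helper `resolve_inside` and using only the standard library:

```python
import os


def resolve_inside(base_dir, user_path):
    """Return the canonical path for user_path, or raise if it escapes base_dir."""
    base = os.path.realpath(base_dir)
    # realpath collapses "..", redundant separators, and symlinks
    candidate = os.path.realpath(os.path.join(base, user_path))
    # Compare canonical forms, not the raw strings the client supplied
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("Path escapes the allowed directory")
    return candidate
```

With this shape, inputs such as `../../etc/passwd` or an absolute path are rejected because their canonical form no longer sits under the base directory.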
### Layer 2: Authentication (Who Are You?)
- [ ] **Identity verification**: Credentials, tokens, certificates
- [ ] **Multi-factor authentication**: For sensitive boundaries
- [ ] **Session management**: Secure tokens, expiration, rotation
### Layer 3: Authorization (What Can You Do?)
- [ ] **Access control checks**: RBAC, ABAC, resource-level
- [ ] **Least privilege enforcement**: Grant minimum necessary
- [ ] **Privilege escalation prevention**: No path to higher access
### Layer 4: Rate Limiting (Abuse Prevention)
- [ ] **Request rate limits**: Per-IP, per-user, per-endpoint
- [ ] **Resource quotas**: Prevent resource exhaustion
- [ ] **Anomaly detection**: Flag unusual patterns
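As a rough illustration of per-key request limits, the following fixed-window limiter counts requests per key (a user ID, an IP address, or an endpoint name) within each time window. `FixedWindowLimiter` is a hypothetical name; the sketch keeps state in process memory and never evicts old windows, so it is not a production design.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key):
        """Return True if the request fits in the current window, else False."""
        window_index = int(self.clock() // self.window)
        bucket = (key, window_index)
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
print([limiter.allow("user-42") for _ in range(3)])
```

Separate limiter instances (or separate key prefixes) can cover the per-IP, per-user, and per-endpoint cases from the checklist.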
### Layer 5: Audit Logging (Detective)
- [ ] **Security event logging**: Who, what, when, where, outcome
- [ ] **Tamper-proof logs**: Write-only for applications
- [ ] **Alerting**: Automated detection of suspicious activity
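One common way to approximate tamper-proof logs in software is a hash chain, where each entry commits to the one before it. Strictly speaking this makes tampering detectable rather than impossible. The class below is a hypothetical sketch, not a description of any particular logging product.

```python
import hashlib
import json


class HashChainLog:
    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash, event):
        payload = json.dumps(event, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, event):
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append(
            {"event": event, "prev": prev_hash, "hash": self._digest(prev_hash, event)}
        )

    def verify(self):
        """Recompute the chain; any edited or removed entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(prev_hash, entry["event"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Truncating the tail of the chain is not detected by this sketch alone; periodically publishing the latest hash to a separate system is one way to cover that gap.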
### Layer 6: Encryption (Confidentiality)
- [ ] **Data in transit**: TLS 1.3, certificate validation
- [ ] **Data at rest**: Encryption for sensitive data
- [ ] **Key management**: Secure storage, rotation, separation
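For the "data in transit" item, a client-side sketch using Python's standard `ssl` module might look like the following. The function name is hypothetical; it assumes the interpreter is linked against an OpenSSL build that supports TLS 1.3.

```python
import ssl


def make_tls13_client_context():
    """Build a client context that verifies certificates and refuses anything below TLS 1.3."""
    # create_default_context() enables certificate and hostname verification
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


context = make_tls13_client_context()
print(context.minimum_version, context.verify_mode, context.check_hostname)
```

The returned context can be passed to `ssl.SSLContext.wrap_socket` or to HTTP clients that accept an `SSLContext`.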
**Example Application** (API Authentication Boundary):
```
Internet → API Gateway boundary:
Layer 1 (Validation):
- Validate Authorization header present and well-formed
- Check request size limits (prevent DoS)
- Validate content-type and payload structure
Layer 2 (Authentication):
- Verify JWT signature (RS256, public key validation)
- Check token expiration (exp claim)
- Verify token not revoked (check Redis revocation list)
Layer 3 (Authorization):
- Extract scopes from token
- Verify the token contains the scope this endpoint requires
- Check resource-level permissions (can user access THIS resource?)
Layer 4 (Rate Limiting):
- Per-token: 1000 requests/minute
- Per-IP: 100 requests/minute (catch token sharing)
- Per-endpoint: Stricter limits on write operations
Layer 5 (Audit Logging):
- Log authentication attempts (success/failure)
- Log authorization decisions (allowed/denied)
- Log resource access (who accessed what)
Layer 6 (Encryption):
- Enforce TLS 1.3 only (reject unencrypted)
- Validate certificate chain
- Store tokens encrypted in session store
```
**If ANY layer fails, others provide defense.**
## Fail-Secure Patterns
When a control fails, the system should **default to a secure state** (deny access, close connection, reject request).
### Fail-Closed (Secure) vs Fail-Open (Insecure)
| Situation | Fail-Open (❌ BAD) | Fail-Closed (✅ GOOD) |
|-----------|-------------------|---------------------|
| **Auth service down** | Allow all requests through | Deny all requests until service recovers |
| **Token validation fails** | Treat as valid | Reject request |
| **Database unreachable** | Skip permission check | Deny access |
| **Rate limit store unavailable** | No rate limiting | Apply strictest default limit |
| **Audit log fails to write** | Continue operation | Reject operation |
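The fail-closed column can be enforced generically by wrapping checks so that any unexpected error turns into a denial. The decorator below is a sketch under assumptions: `fail_closed`, `AccessDenied`, and the in-memory `PERMISSIONS` set are hypothetical names used only for illustration.

```python
import functools
import logging

logger = logging.getLogger("security")


class AccessDenied(Exception):
    """Raised when a check fails or cannot be completed."""


def fail_closed(check):
    """Wrap a boolean check so that errors and non-True results both deny."""
    @functools.wraps(check)
    def wrapper(*args, **kwargs):
        try:
            allowed = check(*args, **kwargs)
        except Exception:
            logger.exception("Check %s could not complete; denying", check.__name__)
            raise AccessDenied("Could not verify access") from None
        if allowed is not True:
            raise AccessDenied("Access denied")
        return True
    return wrapper


PERMISSIONS = {("alice", "doc-1")}  # stand-in for a real permission store


@fail_closed
def can_read(user_id, resource_id, store=PERMISSIONS):
    return (user_id, resource_id) in store
```

Requiring the result to be exactly `True` means a check that returns `None` by accident also denies, which matches the "when in doubt, deny" stance of this section.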
### Examples of Fail-Secure Implementation
**Example 1: Authentication Service Failure**
```python
def authenticate_request(request):
    try:
        token = extract_token(request)
        user = auth_service.validate_token(token)  # External service call
        return user
    except AuthServiceUnavailable:
        # ❌ FAIL-OPEN: return AnonymousUser()  # Let them through
        # ✅ FAIL-CLOSED: raise Unauthorized("Authentication service unavailable")
        raise Unauthorized("Authentication service unavailable")
    except InvalidToken:
        raise Unauthorized("Invalid token")
```
**Example 2: Rate Limiter Failure**
```python
def check_rate_limit(user_id):
    try:
        # INCR returns the new counter value as an integer, so no separate GET is needed
        # (expiry of the counter key per window is assumed to be handled elsewhere)
        count = redis.incr(f"rate:{user_id}")
        if count > LIMIT:
            raise RateLimitExceeded()
    except RedisConnectionError:
        # ❌ FAIL-OPEN: return  # Let request through
        # ✅ FAIL-CLOSED: Apply strictest default limit
        # If Redis is down, apply aggressive in-memory rate limit
        in_memory_limiter.check(user_id, limit=10)  # Much stricter than normal
```
**Example 3: Database Permission Check**
```python
def can_user_access_resource(user_id, resource_id):
    try:
        permission = db.query(
            "SELECT can_read FROM permissions WHERE user_id = ? AND resource_id = ?",
            user_id, resource_id
        )
        return permission.can_read
    except DatabaseConnectionError:
        # ❌ FAIL-OPEN: return True  # Assume they have access
        # ✅ FAIL-CLOSED: return False  # Deny access if can't verify
        logger.error(f"DB unavailable, denying access for user={user_id} resource={resource_id}")
        return False
```
**Example 4: File Type Validation**
```python
def validate_file_upload(file):
    # Layer 1: Check extension
    if file.extension not in ALLOWED_EXTENSIONS:
        raise ValidationError("Invalid file type")
    # Layer 2: Check magic bytes
    try:
        magic_bytes = file.read(16)
        if not is_valid_magic_bytes(magic_bytes):
            # ✅ FAIL-CLOSED: If magic bytes don't match, reject
            # Even if extension passed, magic bytes take precedence
            raise ValidationError("File content doesn't match extension")
    except ValidationError:
        # Let the specific rejection above propagate unchanged
        raise
    except Exception as e:
        # ❌ FAIL-OPEN: return True  # Couldn't check, assume valid
        # ✅ FAIL-CLOSED: raise ValidationError("Could not validate file")
        raise ValidationError(f"Could not validate file: {e}")
```
**Principle**: **When in doubt, deny**. It's better to have a false positive (deny a legitimate request) than a false negative (allow a malicious request).
## Least Privilege Principle
Grant **minimum necessary access** for each component to perform its function. No more.
### Application Method
**For each component, ask three questions:**
1. **What does it NEED to do?** (functional requirements)
2. **What's the MINIMUM access to achieve that?** (reduce scope)
3. **What can it NEVER do?** (explicit denials)
### Example: Database Access Roles
**Web Application Role:**
```sql
-- What it NEEDS: Read customers, write audit logs
GRANT SELECT ON customers TO web_app_user;
GRANT INSERT ON audit_logs TO web_app_user;
-- What's MINIMUM: No DELETE, no UPDATE on audit logs (immutable), no admin tables
REVOKE DELETE ON customers FROM web_app_user;
REVOKE ALL ON admin_users FROM web_app_user;
-- Explicit NEVER: Cannot modify audit logs (tamper-proof)
REVOKE UPDATE, DELETE ON audit_logs FROM web_app_user;
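-- Row-level security: Only active customers
-- (policies take effect only once row-level security is enabled on the table)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;
CREATE POLICY web_app_access ON customers
    FOR SELECT TO web_app_user
    USING (status = 'active');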
```
**Analytics Role:**
```sql
-- What it NEEDS: Read non-PII customer data for analytics
-- What's MINIMUM: View with PII columns excluded
CREATE VIEW customers_analytics AS
SELECT customer_id, country, subscription_tier, created_at
FROM customers; -- Excludes: name, email, address
GRANT SELECT ON customers_analytics TO analytics_user;
-- What it can NEVER do: Access PII, modify data, see payment info
REVOKE ALL ON customers FROM analytics_user;
REVOKE ALL ON payment_info FROM analytics_user;
ALTER ROLE analytics_user SET default_transaction_read_only = true;
```
### File System Permissions
**Application Server:**
```bash
# What it NEEDS: Read config, write logs, read/write uploads
/etc/app/config/ → Read-only (owner: root, group: app, directory 750, files 640)
/var/log/app/ → Write-only (owner: app, directory 300, log files append-only)
/var/uploads/ → Read/write (owner: app, chmod 700)
# What it can NEVER do: Write to config, execute from uploads
/etc/app/config/ → No write permissions
/var/uploads/ → Mount with noexec flag (prevent execution)
```
### API Scopes (OAuth2 Pattern)
```python
# User requests minimal scopes
scopes_requested = ["read:profile", "read:posts"]
# DON'T grant admin scopes by default
# DO grant only what was requested and approved
token = create_token(user, scopes=scopes_requested)
# At each endpoint, verify scope
@require_scope("write:posts")
def create_post(request):
    # This endpoint is inaccessible with read:posts scope
    pass
```
**Principle**: **Default deny, explicit allow**. Start with no access, grant only what's needed.
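The `@require_scope` decorator used above is not defined in this document. One possible shape for it, assuming for illustration that a request is a plain dict carrying a `scopes` list, is sketched here; `Forbidden` is likewise a hypothetical exception name.

```python
import functools


class Forbidden(Exception):
    """Raised when the caller's token lacks the required scope."""


def require_scope(scope):
    """Deny by default: the handler runs only if the exact scope was granted."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(request, *args, **kwargs):
            # A missing or empty scope list grants nothing
            granted = set(request.get("scopes") or ())
            if scope not in granted:
                raise Forbidden(f"Missing required scope: {scope}")
            return handler(request, *args, **kwargs)
        return wrapper
    return decorator


@require_scope("write:posts")
def create_post(request):
    return "created"
```

A request carrying only `read:posts` raises `Forbidden` before `create_post` runs, so the handler body never needs its own scope check.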
## Separation of Duties
**No single component/person/account should have complete control** over a critical operation.
### Patterns
#### Pattern 1: Multi-Signature Approvals
**Example: Production Deployments**
```yaml
# Require 2 approvals from different teams
approvals:
  required: 2
  teams:
    - engineering-leads
    - security-team
  # Cannot approve own PR
  prevent_self_approval: true
```
#### Pattern 2: Split Responsibilities
**Example: Payment Processing**
```python
# Component A: Initiates payment (can create, cannot approve)
payment_service.initiate_payment(amount, account)
# Component B: Approves payment (can approve, cannot create)
# Different credentials, different service
approval_service.approve_payment(payment_id)
# Component C: Executes payment (can execute, cannot create/approve)
# Only accepts approved payments
execution_service.execute_payment(approved_payment_id)
```
**No single service can create AND approve AND execute a payment.**
#### Pattern 3: Key Splitting
**Example: Encryption Key Management**
```python
# Master key split into 3 shares using Shamir Secret Sharing
# Require 2 of 3 shares to reconstruct
shares = split_key(master_key, threshold=2, num_shares=3)
# Distribute to different teams/locations
security_team.store(shares[0])
ops_team.store(shares[1])
compliance_team.store(shares[2])
# Reconstruction requires 2 teams to cooperate
reconstructed = reconstruct_key([shares[0], shares[1]])
```
#### Pattern 4: Admin Operations Require Approval
**Example: Database Admin Actions**
```python
# Admin initiates action (creates request, cannot execute)
admin_request = AdminRequest(
    action="DELETE_USER",
    user_id=12345,
    reason="GDPR erasure request",
    requested_by=admin_id
)
# Second admin reviews and approves (cannot initiate)
reviewer.approve(admin_request, reviewer_id=different_admin_id)
# System executes after approval (automated, no single admin control)
if admin_request.is_approved():
    execute_admin_action(admin_request)
```
**Principle**: **Break critical paths into multiple steps requiring different actors.**
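The approval pattern only holds if the code refuses self-approval and refuses to execute unapproved requests. The sketch below shows one way to enforce both; the class and field names are hypothetical and simplified relative to the `AdminRequest` example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdminRequest:
    action: str
    target_id: int
    requested_by: str
    approved_by: Optional[str] = None

    def approve(self, reviewer_id):
        # The requester can never be their own reviewer
        if reviewer_id == self.requested_by:
            raise PermissionError("Requester cannot approve their own request")
        self.approved_by = reviewer_id

    def is_approved(self):
        # Re-check the actor separation in case approved_by was set directly
        return self.approved_by is not None and self.approved_by != self.requested_by


def execute_admin_action(request):
    """Run the action only when a different admin has approved it."""
    if not request.is_approved():
        raise PermissionError("Request has not been approved by a second admin")
    return f"executed {request.action} on {request.target_id}"
```

Checking the separation again in `is_approved` means a bug elsewhere that writes `approved_by` directly still cannot produce a self-approved request.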
## Control Verification Method
For **each control you design**, ask: **"What if this control fails?"**
### Verification Checklist
**For each control:**
1. **What attack does this prevent?** (threat it addresses)
2. **How can this control fail?** (failure modes)
3. **What happens if it fails?** (impact)
4. **What's the next layer of defense?** (backup control)
5. **Is failure logged/detected?** (observability)
### Example: API Token Validation
**Control**: Verify JWT signature before processing request
1. **What attack**: Prevents forged tokens, ensures authenticity
2. **How it can fail**:
- Public key unavailable (service down)
- Expired token not caught (clock skew)
- Token revocation list unavailable (Redis down)
- Signature algorithm downgrade attack (accept HS256 instead of RS256)
3. **What if it fails**:
- Public key unavailable → Fail-closed (deny all requests)
- Expired token → Layer 2: Check expiration explicitly
- Revocation list down → Layer 3: Apply strict rate limits as fallback
- Algorithm downgrade → Layer 4: Explicitly require RS256, reject others
4. **Next layer**:
- Authorization checks (even with valid token, check permissions)
- Rate limiting (limit damage from compromised token)
- Audit logging (detect unusual access patterns)
5. **Failure logged**: Yes → Log signature validation failures, alert on spike
**Outcome**: Designed 4 layers of defense against token attacks.
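The algorithm-downgrade failure mode can be guarded with an explicit allowlist checked before any signature verification. The sketch below only inspects the token header and does not verify the signature; `check_jwt_algorithm` and `ALLOWED_ALGORITHMS` are hypothetical names, and in practice a maintained JWT library configured with a fixed algorithm list would do this job.

```python
import base64
import json

ALLOWED_ALGORITHMS = {"RS256"}


def check_jwt_algorithm(token):
    """Reject a token whose header names an algorithm outside the allowlist."""
    try:
        header_b64 = token.split(".")[0]
        # JWT segments are base64url without padding; restore it before decoding
        padded = header_b64 + "=" * (-len(header_b64) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
        algorithm = header["alg"]
    except Exception:
        # Fail closed on anything that cannot be parsed
        raise ValueError("Malformed token header") from None
    if not isinstance(algorithm, str) or algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"Algorithm not allowed: {algorithm!r}")
    return algorithm
```

Tokens declaring `HS256` or `none` are rejected here regardless of what the rest of the token contains, which is the behavior the "explicitly require RS256" item describes.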
### Example: File Upload Validation
**Control**: Check file extension against allowlist
1. **What attack**: Prevents upload of executable files (.exe, .sh)
2. **How it can fail**:
- Attacker renames malware.exe → malware.jpg
- Double extension: malware.jpg.exe
- Case variation: malware.ExE
3. **What if it fails**: Malicious file stored, potentially executed
4. **Next layers**:
- Layer 2: Magic byte verification (check file content, not name)
- Layer 3: Antivirus scanning (detect known malware)
- Layer 4: File reprocessing (re-encode images, destroying embedded code)
- Layer 5: noexec mount (storage prevents execution)
- Layer 6: Separate domain for user content (same-origin isolation limits XSS impact; CSP adds further restriction)
5. **Failure logged**: Yes → Log validation failures, rejected files
**Outcome**: Extension check is Layer 1 of 6. If bypassed, 5 more layers prevent exploitation.
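Layers 1 and 2 of this example can be combined in a small check that normalizes the final extension and then compares the leading bytes of the file against known signatures. This is a sketch with an assumed image-only allowlist; `MAGIC_BYTES` and `content_matches_extension` are hypothetical names.

```python
import os

# Leading bytes for a few image formats (assumed allowlist for illustration)
MAGIC_BYTES = {
    ".png": [b"\x89PNG\r\n\x1a\n"],
    ".jpg": [b"\xff\xd8\xff"],
    ".jpeg": [b"\xff\xd8\xff"],
    ".gif": [b"GIF87a", b"GIF89a"],
}


def content_matches_extension(filename, head):
    """Return True only if the final extension is allowed and head starts with its signature."""
    # splitext takes the last extension, so "malware.jpg.exe" is judged as ".exe";
    # lower() handles case variations such as ".ExE" or ".JPG"
    extension = os.path.splitext(filename)[1].lower()
    signatures = MAGIC_BYTES.get(extension)
    if not signatures:
        return False  # unknown or missing extension: deny
    return any(head.startswith(signature) for signature in signatures)


print(content_matches_extension("malware.jpg", b"MZ\x90\x00"))
```

A renamed executable fails here because its content does not start with a JPEG signature, even though its name passes the extension allowlist.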
## Quick Reference: Control Selection
**For every trust boundary, apply this checklist:**
| Layer | Control Type | Example |
|-------|--------------|---------|
| **1. Validation** | Input checking | Size limits, type validation, sanitization |
| **2. Authentication** | Identity verification | JWT validation, certificate checks, MFA |
| **3. Authorization** | Permission checks | RBAC, resource-level access, least privilege |
| **4. Rate Limiting** | Abuse prevention | Per-user limits, anomaly detection, quotas |
| **5. Audit Logging** | Detective | Security events, tamper-proof logs, alerting |
| **6. Encryption** | Confidentiality | TLS in transit, encryption at rest, key management |
**For each control:**
- Define fail-secure behavior (what happens if it fails?)
- Apply least privilege (minimum necessary access)
- Verify separation of duties (no single point of complete control)
- Test "what if this fails?" (ensure backup layers exist)
## Common Mistakes
### ❌ Designing Controls Before Identifying Boundaries
**Wrong**: "I need authentication and authorization and rate limiting"
**Right**: "Where are my trust boundaries? → Internet→API, API→Database → At each: apply layered controls"
**Why**: Controls are meaningless without knowing WHERE to apply them.
### ❌ Single Layer of Defense
**Wrong**: "Authentication is enough security"
**Right**: "Authentication + Authorization + Rate Limiting + Audit Logging"
**Why**: If authentication is bypassed (bug, misconfiguration), other layers provide defense.
### ❌ Fail-Open Defaults
**Wrong**:
```python
try:
    user = auth_service.validate(token)
except ServiceUnavailable:
    user = AnonymousUser()  # Let them through
```
**Right**:
```python
try:
    user = auth_service.validate(token)
except ServiceUnavailable:
    raise Unauthorized("Auth service unavailable")
```
**Why**: Control failure should result in secure state (deny), not insecure state (allow).
### ❌ Excessive Privileges
**Wrong**: Grant web application full database access (SELECT, INSERT, UPDATE, DELETE on all tables)
**Right**: Grant only needed operations per table (SELECT on customers, INSERT-only on audit_logs)
**Why**: Minimizes damage from compromised application (SQL injection, stolen credentials).
### ❌ Single Point of Control
**Wrong**: One admin account can initiate, approve, and execute critical operations
**Right**: Separate accounts for initiate vs approve, require multi-signature
**Why**: Prevents single compromised account from complete system control.
### ❌ No Verification of "What If This Fails?"
**Wrong**: Design controls, assume they work
**Right**: For each control, ask "how can this fail?" and design backup layers
**Why**: Controls fail due to bugs, misconfigurations, attacks. Backup layers provide resilience.
## Cross-References
**Use BEFORE this skill**:
- `ordis/security-architect/threat-modeling` - Identify threats first, then design controls to address them
**Use WITH this skill**:
- `muna/technical-writer/documentation-structure` - Document control architecture as ADR
**Use AFTER this skill**:
- `ordis/security-architect/security-architecture-review` - Review controls for completeness
## Real-World Impact
**Well-designed controls using this methodology:**
- Multi-layered API authentication catching token forgery even when signature validation was bypassed (algorithm confusion attack)
- Database access controls limiting SQL injection damage to read-only operations (least privilege prevented data deletion)
- File upload defenses stopping malware despite extension check bypass (magic bytes + antivirus + reprocessing layers)
**Key lesson**: **Systematic application of defense-in-depth at trust boundaries is more effective than ad-hoc control selection.**