# Secure By Design Patterns

## Overview

Build security into system foundations.

Core principle: Design systems that are **secure by default, not secured after the fact**.

**Key insight**: Preventing security issues through architecture is cheaper and more effective than detecting and responding to them.

## When to Use

Load this skill when:
- Designing new systems (greenfield)
- Refactoring existing architecture
- Making fundamental architecture decisions
- Evaluating architecture proposals

**Symptoms you need this**:
- "How do we make this system secure?"
- Designing authentication, secrets management, data access
- Architecting microservices, data pipelines, distributed systems
- Choosing deployment/configuration strategies

**Don't use for**:
- Threat modeling specific attacks (use `ordis/security-architect/threat-modeling`)
- Implementing security controls (use `ordis/security-architect/security-controls-design`)
- Reviewing existing designs (use `ordis/security-architect/security-architecture-review`)

## Core Patterns

### Pattern 1: Zero-Trust Architecture

**Principle**: Never trust, always verify. No implicit trust based on network location.

#### Three Pillars

1. **Verify Explicitly**
   - Authenticate every request (no "internal network = trusted")
   - Authorize based on identity + context (device, location, time, risk)
   - Use strong authentication (mTLS, signed tokens, not IP allowlists alone)

2. **Least Privilege Access**
   - Grant minimum necessary permissions
   - Time-limited access (credentials expire, tokens rotate)
   - Resource-level authorization (not just service-level)

3. **Assume Breach**
   - Design for "when compromised", not "if compromised"
   - Minimize blast radius (segmentation, isolation)
   - Monitor everything (detect lateral movement)

#### Example: Microservices Communication

❌ **Not Zero-Trust**:
```
Service A → Service B (same network, no auth)
# Assumes: Internal network = trusted
# Risk: Compromised service A can access all of service B
```

✅ **Zero-Trust**:
```
Service A → Service B (mTLS + JWT + authz check)
# Every request authenticated + authorized
# Service B validates: Is this service A? Does it have permission for THIS resource?

Implementation:
- Service mesh (Istio/Linkerd) enforces mTLS
- Service B validates JWT from service A
- RBAC policy: service A can only access /resource/{own_resources}
- Audit log: Record all access attempts
```

**Result**: If service A is compromised, the attacker cannot impersonate other services or access resources outside A's scope.

### Pattern 2: Immutable Infrastructure

**Principle**: No runtime modifications. Replace rather than update.

#### Core Concepts

1. **Immutable Artifacts**
   - Container images, VM images, binaries
   - Never modified after creation
   - Versioned and signed

2. **Deployment Replaces Instances**
   - Updates = deploy new version, terminate old
   - No SSH into servers to patch
   - No runtime configuration changes

3. **Configuration as Code**
   - All config in version control
   - Deployments are reproducible
   - Rollback = redeploy previous version

#### Benefits

- **Security**: No drift from known-good state
- **Auditability**: All changes in version control
- **Rollback**: Redeploy previous image
- **Consistency**: Dev/staging/prod run the same artifact

#### Example: Application Updates

❌ **Mutable (Insecure)**:
```bash
# SSH into production server
ssh prod-server

# Update code
git pull origin main

# Restart service
systemctl restart app

# Problem: No audit trail, no rollback, config drift
```

✅ **Immutable (Secure)**:
```bash
# Build a new, versioned image, tagged with the registry path so it can be pushed
docker build -t registry/app:v2.1.0 .

# Push to registry (sign it as well, e.g. with Docker Content Trust or cosign)
docker push registry/app:v2.1.0

# Deploy new version (Kubernetes)
kubectl set image deployment/app app=registry/app:v2.1.0

# Kubernetes: Creates new pods, terminates old pods
# Rollback if needed:
#   kubectl rollout undo deployment/app

# Result: Full audit trail, fast rollback, no drift
```

**Configuration Management**:
```yaml
# Configuration in Git, not edited on servers
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  API_TIMEOUT: "30"
  RATE_LIMIT: "1000"

# Changes = commit to Git → CI/CD applies → new deployment
```

### Pattern 3: Security Boundaries

**Principle**: Explicit trust zones with validation at every boundary crossing.

#### Boundary Identification

Trust boundaries are points where data/requests cross from lower-trust to higher-trust zones:

```
Internet (UNTRUSTED)
    ↓ BOUNDARY 1
API Gateway (TRUSTED)
    ↓ BOUNDARY 2
Application Services (TRUSTED)
    ↓ BOUNDARY 3
Database (HIGHLY TRUSTED)
```

#### Validation at Boundaries

At EACH boundary:
1. **Authenticate**: Verify identity
2. **Authorize**: Check permissions
3. **Validate**: Check input format/constraints
4. **Sanitize**: Remove dangerous content
5. **Log**: Record crossing

#### Example: Data Pipeline

```
External API (UNTRUSTED)
    ↓ BOUNDARY: API Client
      Validate: JSON schema, required fields
      Authenticate: API key
      Rate limit: 1000 req/hour
Message Queue (SEMI-TRUSTED)
    ↓ BOUNDARY: Consumer
      Validate: Message structure, idempotency key
      Authorize: Consumer has permission for this message type
Processing Service (TRUSTED)
    ↓ BOUNDARY: Database Writer
      Validate: Data types, constraints
      Authorize: Service can write to this table
Database (HIGHLY TRUSTED)
```

**Key insight**: Never trust data just because it came from an "internal" service. Validate at every boundary.

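
The five boundary steps above (authenticate, authorize, validate, sanitize, log) can be sketched as a single gate function. This is a minimal illustration under stated assumptions, not a reference implementation: `Message`, `BoundaryError`, `cross_boundary`, and the in-memory `API_KEYS`, `ALLOWED_TYPES`, and `SCHEMAS` tables are hypothetical names invented for the example. A real system would likely back them with a secrets store, a policy engine, and a proper schema validator.

```python
import hmac
import logging
from dataclasses import dataclass

logger = logging.getLogger("boundary")


class BoundaryError(Exception):
    """Raised when a message is rejected at a trust boundary."""


@dataclass(frozen=True)
class Message:
    sender: str
    api_key: str
    msg_type: str
    payload: dict


# Hypothetical in-memory policy data, for illustration only.
API_KEYS = {"ingest-client": "s3cr3t-key"}
ALLOWED_TYPES = {"ingest-client": {"order.created"}}
SCHEMAS = {"order.created": {"order_id": str, "amount": int}}


def cross_boundary(msg: Message) -> dict:
    """Apply all five boundary checks and return the sanitized payload."""
    # 1. Authenticate: constant-time comparison of the presented API key.
    expected = API_KEYS.get(msg.sender)
    if expected is None or not hmac.compare_digest(
        expected.encode(), msg.api_key.encode()
    ):
        logger.warning("rejected %s: authentication failed", msg.sender)
        raise BoundaryError("authentication failed")

    # 2. Authorize: is this sender allowed to submit this message type?
    if msg.msg_type not in ALLOWED_TYPES.get(msg.sender, set()):
        logger.warning("rejected %s: not authorized for %s", msg.sender, msg.msg_type)
        raise BoundaryError("not authorized for message type")

    # 3. Validate: required fields must be present with the expected types.
    schema = SCHEMAS.get(msg.msg_type)
    if schema is None:
        raise BoundaryError("no schema registered for message type")
    for field, field_type in schema.items():
        if not isinstance(msg.payload.get(field), field_type):
            logger.warning("rejected %s: bad field %s", msg.sender, field)
            raise BoundaryError(f"invalid or missing field: {field}")

    # 4. Sanitize: keep only the fields named in the schema (allowlist).
    clean = {field: msg.payload[field] for field in schema}

    # 5. Log: record the successful crossing.
    logger.info("accepted %s from %s", msg.msg_type, msg.sender)
    return clean
```

Calling `cross_boundary` once per boundary, with a policy table specific to that boundary, keeps the "validate at every crossing" rule in one place instead of scattering checks through business logic.
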
#### Minimizing Boundary Surface Area

❌ **Large Surface Area (Insecure)**:
```
# Database accepts connections from any address
firewall:
  allow 0.0.0.0/0 → database:5432

# Problem: Every host, including all 50 services, can connect: huge attack surface
```

✅ **Small Surface Area (Secure)**:
```
# Only specific services connect to database
firewall:
  allow backend-api → database:5432
  allow analytics → database:5432
  deny all others

# Further: Use service mesh with identity-based policies
```

### Pattern 4: Trusted Computing Base (TCB) Minimization

**Principle**: Small security-critical core, everything else untrusted.

#### What is TCB?

TCB = Components you MUST trust for security. If the TCB is compromised, security fails.

**Goal**: Minimize TCB size (less code = fewer vulnerabilities).

#### Pattern: Small Critical Core

```
┌─────────────────────────────────────┐
│  Untrusted Zone (Applications)      │
│  - Web servers                      │
│  - Application logic                │
│  - User-facing services             │
└────────────┬────────────────────────┘
             │ API calls (validated)
┌────────────▼────────────────────────┐
│  TRUSTED COMPUTING BASE (TCB)       │
│  - Authentication service (small!)  │
│  - Secrets vault (minimal code)     │
│  - Audit logger (append-only)       │
└─────────────────────────────────────┘
```

#### Example: Secrets Management

❌ **Large TCB (Risky)**:
```
# Every service has secrets management logic
# TCB = All 50+ services (huge attack surface)

each_service:
  - Fetches secrets from vault
  - Decrypts secrets
  - Manages rotation
  - Handles caching

# Problem: Bug in ANY service compromises secrets
```

✅ **Small TCB (Secure)**:
```
# Secrets Vault = TCB (small, auditable, formally verified)
# Applications = Untrusted (use vault API)

Vault (TCB):
  - 10,000 lines of code
  - Formally verified
  - Hardware-backed encryption (HSM)
  - Minimal attack surface (no network egress)

Applications (Untrusted):
  - Call vault API for secrets
  - Vault enforces all access control
  - Apps cannot access secrets they're not authorized for

# Result: Compromise application ≠ compromise vault
```

#### TCB Characteristics

1. **Small**: Minimize code size
2. **Auditable**: Simple enough to review thoroughly and, ideally, to verify formally
3. **Isolated**: Runs in separate environment (sandbox, separate machine)
4. **Minimal privileges**: TCB has no unnecessary access
5. **Heavily monitored**: All TCB access logged

### Pattern 5: Fail-Fast Security

**Principle**: Validate security properties at construction time. Refuse to operate if misconfigured.

#### Construction-Time vs Runtime

**Construction time**: When system/component is created (startup, initialization)

**Runtime**: When system is processing requests

**Fail-fast**: Validate security at construction, fail immediately if invalid.

#### Example: Security Level Validation

❌ **Runtime Validation (Vulnerable)**:
```python
# Data pipeline starts, processes data, THEN checks security
# (Pipeline, SecurityError, and the datasource/datasink objects are application-defined)
pipeline = Pipeline()
pipeline.add_source(untrusted_datasource)
pipeline.add_sink(trusted_datasink)
pipeline.start()  # Starts processing!

# Runtime: Check if datasource security level matches sink
for record in pipeline:
    if record.security_level > trusted_datasink.max_security_level:
        raise SecurityError("Security mismatch!")

# Problem: Exposed data before detecting mismatch
# Exposure window = time until first mismatched record
```

✅ **Fail-Fast (Secure)**:
```python
# Validate security BEFORE processing any data
# (Pipeline, SecurityError, and the datasource/datasink objects are application-defined)
pipeline = Pipeline()
pipeline.add_source(untrusted_datasource)
pipeline.add_sink(trusted_datasink)

# BEFORE start: Validate security properties
if untrusted_datasource.security_level > trusted_datasink.max_security_level:
    raise SecurityError(
        f"Cannot create pipeline: Source {untrusted_datasource.security_level} "
        f"exceeds sink maximum {trusted_datasink.max_security_level}"
    )

pipeline.start()  # Only starts if validation passed

# Result: No exposure window for this check, fail before processing data
```

#### Startup Validation Checklist

Validate at system startup (fail if any check fails):

- [ ] All required secrets accessible?
- [ ] TLS certificates valid and not expired?
- [ ] Database permissions granted?
- [ ] Security policies loaded?
- [ ] Encryption keys available?

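
The checklist above can be treated as data: a table of named checks that all run before the service accepts traffic, with every failure reported at once. The sketch below is one possible shape, assuming hypothetical helper names (`run_startup_checks`, `secrets_accessible`, `cert_not_expired`, `StartupCheckError`) that are not part of any particular framework.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List


class StartupCheckError(Exception):
    """Raised when one or more startup security checks fail."""


def secrets_accessible(env: Dict[str, str], required: List[str]) -> bool:
    """True only if every required secret is present and non-empty."""
    return all(env.get(name) for name in required)


def cert_not_expired(not_after: datetime, now: datetime) -> bool:
    """True if the certificate's expiry time is still in the future."""
    return now < not_after


def run_startup_checks(checks: Dict[str, Callable[[], bool]]) -> None:
    """Run every check; raise once, listing all failures, if any did not pass."""
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception as exc:  # a check that crashes counts as a failure
            passed = False
            name = f"{name} ({exc})"
        if not passed:
            failures.append(name)
    if failures:
        raise StartupCheckError("startup checks failed: " + ", ".join(failures))


def example_checks(env: Dict[str, str], cert_expiry: datetime) -> Dict[str, Callable[[], bool]]:
    """Build a check table from (hypothetical) runtime inputs."""
    now = datetime.now(timezone.utc)
    return {
        "required secrets": lambda: secrets_accessible(env, ["DB_PASSWORD"]),
        "TLS certificate": lambda: cert_not_expired(cert_expiry, now),
    }
```

Collecting all failures before raising is a design choice: an operator fixing a broken deployment sees every misconfiguration in one error message rather than discovering them one restart at a time.
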
#### Example: Service Startup

```python
import logging
import sys

logger = logging.getLogger(__name__)


class SecurityError(Exception):
    """Raised when a security precondition is not met."""


class SecureService:
    # tls_cert_valid(), can_access_keys(), has_required_db_permissions(), and
    # process_requests() are assumed to be implemented elsewhere on this class.

    def __init__(self):
        # Fail-fast validation at construction
        self._validate_security()

    def _validate_security(self):
        # Check TLS certificate
        if not self.tls_cert_valid():
            raise SecurityError("TLS certificate invalid or expired")

        # Check encryption keys accessible
        if not self.can_access_keys():
            raise SecurityError("Cannot access encryption keys")

        # Check database permissions
        if not self.has_required_db_permissions():
            raise SecurityError("Insufficient database permissions")

        # All checks passed
        logger.info("Security validation passed")

    def start(self):
        # Only callable after __init__ validation passed
        self.process_requests()


# Usage:
try:
    service = SecureService()  # Validates security at construction
    service.start()
except SecurityError as e:
    logger.error(f"Service failed security validation: {e}")
    sys.exit(1)  # Refuse to start with invalid security
```

**Benefits**:
- **No exposure window**: Catch misconfigurations before processing data
- **Clear errors**: Fail with specific message ("TLS cert expired")
- **Operational safety**: Misconfigured services refuse to start, so they never serve production traffic

## Pattern Application Framework

When designing systems, apply patterns in this order:

### 1. Identify Trust Boundaries (Security Boundaries)

Where does data cross trust zones?

### 2. Apply Zero-Trust at Each Boundary

- Authenticate + authorize every crossing
- Never trust based on network location

### 3. Minimize TCB

What MUST be trusted? Can it be smaller?

### 4. Use Immutable Infrastructure

Can deployments replace rather than update?

### 5. Add Fail-Fast Validation

Validate security at construction, refuse to start if invalid

## Quick Reference: Pattern Selection

| Situation | Pattern | Key Action |
|-----------|---------|------------|
| **Designing service-to-service communication** | Zero-Trust | mTLS + JWT + authz on every request |
| **Deciding deployment strategy** | Immutable Infrastructure | Container images, replace not update |
| **Architecting multi-tier system** | Security Boundaries | Validate + authenticate at every tier boundary |
| **Building secrets/auth service** | TCB Minimization | Small core, everything else uses API |
| **System startup logic** | Fail-Fast Security | Validate security before processing requests |

## Common Mistakes

### ❌ Implicit Trust Based on Network

**Wrong**: "Services in VPC are trusted, no auth needed"

**Right**: Zero-trust - authenticate/authorize every request even within VPC

**Why**: Network boundaries are weak. Without per-request auth, one compromised service gives an attacker free lateral movement.

### ❌ Runtime Security Patches

**Wrong**: SSH into production, apply patch, restart

**Right**: Build new immutable image, deploy via CI/CD

**Why**: Runtime patches create drift, leave no audit trail, and are hard to roll back.

### ❌ Large Security-Critical Core

**Wrong**: Every service has secrets logic (large TCB)

**Right**: Small secrets vault (TCB), services call API (untrusted)

**Why**: Smaller TCB = fewer vulnerabilities, easier to audit/verify.

### ❌ Runtime Security Validation

**Wrong**: Start processing, check security during execution

**Right**: Validate security at construction, refuse to start if invalid

**Why**: Runtime checks have exposure windows. Fail-fast closes that window for any misconfiguration detectable at startup.

### ❌ Unclear Trust Boundaries

**Wrong**: No explicit boundaries, assume "internal is safe"

**Right**: Diagram trust zones, validate at every boundary crossing

**Why**: Boundaries are where attacks happen. Explicit validation prevents bypass.

## Cross-References

**Use BEFORE this skill**:
- `ordis/security-architect/threat-modeling` - Identify threats, then apply patterns to address them

**Use WITH this skill**:
- `ordis/security-architect/security-controls-design` - Patterns inform control choices

**Use AFTER this skill**:
- `ordis/security-architect/security-architecture-review` - Review architecture against patterns

## Real-World Impact

**Systems using secure-by-design patterns:**
- **Zero-trust + immutable infrastructure**: No successful lateral movement in 2 years despite multiple compromised services (blast radius contained by mTLS + segmentation)
- **Fail-fast validation**: Prevented VULN-004 class (security level overrides) by refusing to start pipelines with mismatched security levels
- **TCB minimization**: Secrets vault with 8,000 lines of code (vs 50+ services with embedded secrets logic) - single formal verification point instead of 50 attack surfaces

**Key lesson**: **Security built into architecture is more effective and cheaper than security added later.**