name: cicd-pipeline-architecture
description: Use when setting up CI/CD pipelines, experiencing deployment failures, slow feedback loops, or production incidents after deployment - provides deployment strategies, test gates, rollback mechanisms, and environment promotion patterns to prevent downtime and enable safe continuous delivery

CI/CD Pipeline Architecture

Overview

Design CI/CD pipelines with deployment verification, rollback capabilities, and zero-downtime strategies from day one.

Core principle: "Deploy to production" is not a single step - it's a sequence of gates, health checks, gradual rollouts, and automated rollback triggers. Skipping these "for speed" causes production incidents.

When to Use

Use this skill when:

  • Setting up new CI/CD pipelines (before writing workflow files)
  • Experiencing deployment failures in production
  • CI feedback loops are too slow (tests taking too long)
  • No confidence in deployments (fear of breaking production)
  • Manual rollbacks required after bad deploys
  • Downtime during deployments is treated as acceptable (it shouldn't be)
  • Migrations cause production issues

Do NOT skip this for:

  • "Quick MVP" or "demo" pipelines (these become production)
  • "Simple" applications (complexity comes from deployment, not code)
  • "We'll improve it later" (later never comes, incidents do)

Core Pipeline Architecture

Mandatory Pipeline Stages

Every production pipeline MUST include:

1. Build → 2. Test → 3. Deploy to Staging → 4. Verify Staging → 5. Deploy to Production → 6. Verify Production → 7. Monitor

Missing any stage = production incidents waiting to happen.
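
A minimal GitHub Actions sketch of this sequence, assuming a container-based app; job names, script paths, and URLs are illustrative placeholders, not a prescribed layout:

name: pipeline
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/build.sh ${{ github.sha }}       # build + tag artifact with the commit SHA

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/test.sh

  deploy-staging:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh staging ${{ github.sha }}
      - run: ./scripts/verify.sh staging                # gate: a failure here stops the pipeline

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production                             # optional manual approval gate
    steps:
      - run: ./scripts/deploy.sh production ${{ github.sha }}
      - run: ./scripts/verify.sh production             # monitoring and auto-rollback follow this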

1. Build Stage

Purpose: Compile, package, create artifacts

build:
  - Compile code (if applicable)
  - Run linters and formatters
  - Build container image
  - Tag with commit SHA (NOT "latest")
  - Push to registry
  - Create immutable artifact

Key principle: Build once, deploy everywhere. Same artifact to staging and production.
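
A hedged sketch of the build job, assuming Docker and GitHub Actions; the registry host, image name, and lint script are placeholders:

build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Run linters and formatters
      run: ./scripts/lint.sh                                        # placeholder for the project's linters
    - name: Build and tag with the commit SHA (never "latest")
      run: docker build -t registry.example.com/app:${{ github.sha }} .
    - name: Push the immutable artifact
      run: docker push registry.example.com/app:${{ github.sha }}   # assumes registry login in an earlier step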

2. Test Stage

Test Pyramid in CI:

       /\
      /E2\      ← Few, critical paths only (5-10 tests)
     /----\
    / Intg \    ← API contracts, DB integration (50-100 tests)
   /--------\
  /   Unit   \  ← Fast, isolated, thorough (100s-1000s)
 /____________\

Optimization strategies:

  • Parallel execution: Split test suite across multiple runners
  • Smart triggers: Run full suite on main, subset on PRs
  • Caching: Cache dependencies, build artifacts, test databases
  • Fail fast: Run fastest tests first

Anti-pattern: "Tests are slow, let's skip some" → Optimize execution, don't remove coverage

3. Deploy to Staging

Staging MUST match production:

  • Same infrastructure (containers, K8s, serverless)
  • Same environment variables structure
  • Same database migration process
  • Similar data volume (use anonymized production data)

Deployment process:

1. Run database migrations (with rollback tested)
2. Deploy new version alongside old (blue-green)
3. Run smoke tests
4. Cutover traffic
5. Keep old version running for quick rollback
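
The same process as a hedged staging job; the script names are placeholders for whatever migration, deploy, and smoke-test tooling the project already has:

deploy-staging:
  needs: test
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: ./scripts/migrate.sh staging                           # 1. migrations (rollback already tested in CI)
    - run: ./scripts/deploy-green.sh staging ${{ github.sha }}    # 2. new version alongside the old one
    - run: ./scripts/smoke-test.sh staging-green                  # 3. smoke tests before any traffic moves
    - run: ./scripts/cutover.sh staging                           # 4. switch traffic; old stack stays up for rollback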

4. Verify Staging

Automated verification (not manual testing):

verify_staging:
  - Health check endpoint returns 200
  - Critical API endpoints respond correctly
  - Database migrations applied successfully
  - Background jobs processing
  - External integrations functional

Failure = stop pipeline, do NOT proceed to production.
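
A minimal sketch of that verification as pipeline steps, assuming an HTTP /health endpoint and a placeholder staging URL; any failing step fails the job and blocks production:

verify-staging:
  needs: deploy-staging
  runs-on: ubuntu-latest
  steps:
    - name: Health endpoint must return 200
      run: curl --fail --silent https://staging.example.com/health
    - name: Critical API endpoint responds correctly
      run: |
        status=$(curl -s -o /dev/null -w "%{http_code}" https://staging.example.com/api/orders)
        test "$status" = "200"           # anything else fails this step and stops the pipeline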

5. Deploy to Production

Deployment Strategies (choose one):

Blue-Green Deployment

Old (Blue) ← 100% traffic
New (Green) ← deployed, health checked, 0% traffic

→ Switch traffic to Green
→ Keep Blue running for 1 hour for rollback
→ Terminate Blue after monitoring shows Green is stable

Pros: Instant rollback, zero downtime
Cons: Double infrastructure cost during deployment
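
A hedged sketch of the cutover, assuming Kubernetes with app-blue and app-green Deployments behind a single Service; the manifest path and selector labels are placeholders:

cutover:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy green alongside blue and wait for it to become healthy
      run: |
        kubectl apply -f k8s/app-green.yaml
        kubectl rollout status deployment/app-green --timeout=300s
    - name: Point the Service at green (blue keeps running for instant rollback)
      run: kubectl patch service app -p '{"spec":{"selector":{"version":"green"}}}'

Rollback is the same patch pointing the selector back at blue.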

Canary Deployment

Old ← 95% traffic
New ← 5% traffic (canary)

→ Monitor error rates, latency for 15 min
→ If healthy: 50% traffic
→ If healthy: 100% traffic
→ If unhealthy: immediate rollback to 100% old

Pros: Gradual risk exposure, early warning
Cons: More complex monitoring

Rolling Deployment

Instances: [A, B, C, D, E]

→ Deploy to A, health check
→ Deploy to B, health check
→ Deploy to C, D, E sequentially

If any fails → stop, rollback deployed instances

Pros: No extra infrastructure
Cons: Mixed versions during deployment
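
On Kubernetes this is the default Deployment behavior (maxSurge/maxUnavailable control how many instances update at once); a hedged sketch of the pipeline step, with image name and timeout as placeholders:

deploy-production:
  runs-on: ubuntu-latest
  steps:
    - name: Rolling update, gated on readiness probes
      run: |
        kubectl set image deployment/app app=registry.example.com/app:${{ github.sha }}
        kubectl rollout status deployment/app --timeout=600s   # fails the job if the rollout stalls
    - name: Roll back the instances already updated if the rollout failed
      if: failure()
      run: kubectl rollout undo deployment/app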

Choose based on:

  • Blue-Green: Critical systems, tolerance for double cost
  • Canary: High-traffic systems with good metrics
  • Rolling: Cost-sensitive, moderate traffic

NEVER: Direct deployment with restart (causes downtime)

6. Verify Production

Automated post-deployment verification:

verify_production:
  - HTTP 200 from health endpoint
  - Response time < baseline + 20%
  - Error rate < 1%
  - Critical user flows functional (synthetic tests)
  - Database connections healthy
  - Cache hit rates normal

Auto-rollback triggers:

  • Health check fails for 2 consecutive checks
  • Error rate > 5% for 3 minutes
  • Response time > 2x baseline
  • Critical endpoint returns 5xx
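
In practice these triggers usually live in the monitoring system; as a minimal in-pipeline sketch, here is a post-deploy watch that rolls back after two consecutive failed health checks (URL, interval, and duration are placeholders, and Kubernetes is assumed):

verify-production:
  needs: deploy-production
  runs-on: ubuntu-latest
  steps:
    - name: Watch the health endpoint and auto-rollback on repeated failure
      run: |
        failures=0
        for i in $(seq 1 30); do                              # ~15 minutes at 30s intervals
          if curl --fail --silent https://app.example.com/health > /dev/null; then
            failures=0
          else
            failures=$((failures + 1))
          fi
          if [ "$failures" -ge 2 ]; then
            kubectl rollout undo deployment/app               # trigger: 2 consecutive failed checks
            exit 1
          fi
          sleep 30
        done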

7. Monitor

Observe for 1 hour post-deployment:

  • Error rates (by endpoint, by user segment)
  • Latency percentiles (p50, p95, p99)
  • Resource usage (CPU, memory, DB connections)
  • Business metrics (conversions, signups)

Dashboard must show:

  • Current deployment version
  • Time since last deployment
  • Comparison to pre-deployment metrics
  • Rollback button (one-click)

Database Migrations in CI/CD

Migration Strategy

1. Write backward-compatible migrations
   - Add columns as nullable first
   - Create new tables before dropping old
   - Add indexes with CONCURRENTLY (Postgres)

2. Deploy application code that works with old AND new schema

3. Run migration

4. Deploy code that uses new schema exclusively

5. Clean up old schema (separate deployment)

This takes 3 deployments, not 1. That's correct.

Migration Testing

test_migrations:
  - Apply migration to test DB
  - Run application tests against migrated schema
  - Test rollback (down migration)
  - Verify data integrity

Never skip migration rollback testing. You'll need it in production.
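
A hedged sketch of that job, assuming a Python service using Alembic against a throwaway Postgres; the tool and connection details are placeholders for whatever migration framework the project uses:

test-migrations:
  runs-on: ubuntu-latest
  services:
    postgres:
      image: postgres:16
      env:
        POSTGRES_PASSWORD: test
      ports: ["5432:5432"]
  env:
    DATABASE_URL: postgresql://postgres:test@localhost:5432/postgres
  steps:
    - uses: actions/checkout@v4
    - run: pip install -r requirements.txt
    - run: alembic upgrade head          # apply all migrations to the throwaway DB
    - run: pytest tests/                 # application tests against the migrated schema
    - run: alembic downgrade -1          # prove the newest migration rolls back cleanly
    - run: alembic upgrade head          # and applies again after rollback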

Secrets Management

Anti-pattern:

Hardcoded in workflow:

env:
  DATABASE_URL: postgresql://user:pass@localhost/db

Correct:

env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}

Secrets checklist:

  • Store in CI/CD secret manager (GitHub Secrets, GitLab CI/CD variables)
  • Rotate regularly (automated)
  • Different secrets per environment
  • Never log secret values
  • Use secret scanning tools

Environment Promotion

Progression:

Developer → CI Tests → Staging → Production

Gates between environments:

  1. CI → Staging: All tests pass
  2. Staging → Production:
    • Staging verification passes
    • Manual approval (for critical systems)
    • Business hours only (optional)
    • Monitoring shows staging is healthy
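
In GitHub Actions these gates can be expressed with needs: plus a protected environment; required reviewers and deployment-window rules are configured on the environment in repository settings (job names and URL below are placeholders):

deploy-production:
  needs: verify-staging                  # gate: staging deploy and verification must have passed
  runs-on: ubuntu-latest
  environment:
    name: production                     # gate: manual approval via environment protection rules
    url: https://app.example.com
  steps:
    - run: ./scripts/deploy.sh production ${{ github.sha }}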

Quick Reference: Pipeline Checklist

Before deploying to production, verify:

  • Tests run and pass in CI
  • Build creates immutable, tagged artifact
  • Staging environment exists and matches production
  • Migrations tested with rollback
  • Deployment strategy chosen (blue-green/canary/rolling)
  • Health check endpoint implemented
  • Automated verification tests written
  • Auto-rollback triggers configured
  • Monitoring dashboard shows deployment metrics
  • Rollback procedure tested
  • Secrets managed securely
  • On-call engineer notified of deployment

Common Mistakes

Mistake | Why It's Wrong | Fix
"Just restart the service" | Causes downtime, no rollback | Use blue-green or canary deployment
"Tests are slow, skip some" | Removes safety net | Parallel execution, smart caching
"We'll add staging later" | Production becomes your staging | Create staging first, before production pipeline
"Migrations in deployment script" | Can't roll back safely | Backward-compatible migrations, 3-step deployment
"Manual verification after deploy" | Slow, error-prone, doesn't scale | Automated health checks and smoke tests
"Deploy on main merge" | No gate, broken main can deploy | Require staging verification first
Hardcoded database credentials | Security risk, can't rotate | Use secret manager
"Single server is fine for now" | Downtime during deployment | Use multiple instances from day one

Rationalization Table

Excuse | Reality
"This is just an MVP/demo" | MVP pipelines become production pipelines. Build it right once.
"Staging is expensive" | Production incidents are more expensive. Staging prevents them.
"Blue-green doubles our costs" | Downtime and incidents cost more than temporary double infrastructure.
"We'll add rollback later" | You need rollback when a deployment fails. Later = too late.
"Health checks are overkill" | Silent failures in production are worse than no deployment.
"Migrations always work" | They don't. Test rollbacks before you need them.
"Our app is too simple for this" | Deployment complexity isn't about code complexity.

Red Flags - Stop and Fix Pipeline

If you catch yourself thinking:

  • "Just push to main, it'll be fine" → Add staging gate
  • "Tests passed locally, skip CI" → Never skip CI
  • "Restart is faster than blue-green" → Downtime is never acceptable
  • "We'll monitor manually after deploy" → Automate verification
  • "If it breaks, we'll fix forward" → Implement auto-rollback
  • "Migrations can run during deployment" → Backward-compatible migrations first
  • "One environment is enough" → Minimum: staging + production

All of these mean: Your pipeline will cause production incidents.

Cross-References

Related skills:

  • test-automation-architecture (ordis-quality-engineering) - Which tests to run where
  • observability-and-monitoring (ordis-quality-engineering) - Deployment monitoring
  • testing-in-production (ordis-quality-engineering) - Canary + feature flags
  • api-testing (axiom-web-backend) - Contract tests in CI
  • database-integration (axiom-web-backend) - Migration patterns

The Bottom Line

"Deploy to production" is not one step. It's:

  1. Build immutable artifact
  2. Deploy to staging
  3. Verify staging
  4. Deploy with zero-downtime strategy
  5. Verify production automatically
  6. Monitor with auto-rollback triggers

Skipping steps to "move fast" causes incidents. This IS moving fast.