name: cicd-pipeline-architecture
description: Use when setting up CI/CD pipelines, experiencing deployment failures, slow feedback loops, or production incidents after deployment - provides deployment strategies, test gates, rollback mechanisms, and environment promotion patterns to prevent downtime and enable safe continuous delivery

CI/CD Pipeline Architecture

Overview

Design CI/CD pipelines with deployment verification, rollback capabilities, and zero-downtime strategies from day one.

Core principle: "Deploy to production" is not a single step - it's a sequence of gates, health checks, gradual rollouts, and automated rollback triggers. Skipping these "for speed" causes production incidents.

When to Use

Use this skill when:

  • Setting up new CI/CD pipelines (before writing workflow files)
  • Experiencing deployment failures in production
  • CI feedback loops are too slow (tests taking too long)
  • No confidence in deployments (fear of breaking production)
  • Manual rollbacks required after bad deploys
  • Downtime during deployments is treated as acceptable (it shouldn't be)
  • Migrations cause production issues

Do NOT skip this for:

  • "Quick MVP" or "demo" pipelines (these become production)
  • "Simple" applications (complexity comes from deployment, not code)
  • "We'll improve it later" (later never comes, incidents do)

Core Pipeline Architecture

Mandatory Pipeline Stages

Every production pipeline MUST include:

1. Build → 2. Test → 3. Deploy to Staging → 4. Verify Staging → 5. Deploy to Production → 6. Verify Production → 7. Monitor

Missing any stage = production incidents waiting to happen.
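
A minimal GitHub Actions sketch of this sequence, assuming a container-based app; job names, script paths, and URLs are illustrative placeholders, not a prescribed layout:

name: pipeline
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/build.sh ${{ github.sha }}       # build + tag artifact with the commit SHA

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/test.sh

  deploy-staging:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh staging ${{ github.sha }}
      - run: ./scripts/verify.sh staging                # gate: a failure here stops the pipeline

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production                             # optional manual approval gate
    steps:
      - run: ./scripts/deploy.sh production ${{ github.sha }}
      - run: ./scripts/verify.sh production             # monitoring and auto-rollback follow this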

1. Build Stage

Purpose: Compile, package, create artifacts

build:
  - Compile code (if applicable)
  - Run linters and formatters
  - Build container image
  - Tag with commit SHA (NOT "latest")
  - Push to registry
  - Create immutable artifact

Key principle: Build once, deploy everywhere. Same artifact to staging and production.
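
A hedged sketch of the build job, assuming Docker and GitHub Actions; the registry host, image name, and lint script are placeholders:

build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Run linters and formatters
      run: ./scripts/lint.sh                                        # placeholder for the project's linters
    - name: Build and tag with the commit SHA (never "latest")
      run: docker build -t registry.example.com/app:${{ github.sha }} .
    - name: Push the immutable artifact
      run: docker push registry.example.com/app:${{ github.sha }}   # assumes registry login in an earlier step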

2. Test Stage

Test Pyramid in CI:

       /\
      /E2\      ← Few, critical paths only (5-10 tests)
     /----\
    / Intg \    ← API contracts, DB integration (50-100 tests)
   /--------\
  /   Unit   \  ← Fast, isolated, thorough (100s-1000s)
 /____________\

Optimization strategies:

  • Parallel execution: Split test suite across multiple runners
  • Smart triggers: Run full suite on main, subset on PRs
  • Caching: Cache dependencies, build artifacts, test databases
  • Fail fast: Run fastest tests first

Anti-pattern: "Tests are slow, let's skip some" → Optimize execution, don't remove coverage

3. Deploy to Staging

Staging MUST match production:

  • Same infrastructure (containers, K8s, serverless)
  • Same environment variables structure
  • Same database migration process
  • Similar data volume (use anonymized production data)

Deployment process:

1. Run database migrations (with rollback tested)
2. Deploy new version alongside old (blue-green)
3. Run smoke tests
4. Cutover traffic
5. Keep old version running for quick rollback
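
The same process as a hedged staging job; the script names are placeholders for whatever migration, deploy, and smoke-test tooling the project already has:

deploy-staging:
  needs: test
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: ./scripts/migrate.sh staging                           # 1. migrations (rollback already tested in CI)
    - run: ./scripts/deploy-green.sh staging ${{ github.sha }}    # 2. new version alongside the old one
    - run: ./scripts/smoke-test.sh staging-green                  # 3. smoke tests before any traffic moves
    - run: ./scripts/cutover.sh staging                           # 4. switch traffic; old stack stays up for rollback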

4. Verify Staging

Automated verification (not manual testing):

verify_staging:
  - Health check endpoint returns 200
  - Critical API endpoints respond correctly
  - Database migrations applied successfully
  - Background jobs processing
  - External integrations functional

Failure = stop pipeline, do NOT proceed to production.
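
A minimal sketch of that verification as pipeline steps, assuming an HTTP /health endpoint and a placeholder staging URL; any failing step fails the job and blocks production:

verify-staging:
  needs: deploy-staging
  runs-on: ubuntu-latest
  steps:
    - name: Health endpoint must return 200
      run: curl --fail --silent https://staging.example.com/health
    - name: Critical API endpoint responds correctly
      run: |
        status=$(curl -s -o /dev/null -w "%{http_code}" https://staging.example.com/api/orders)
        test "$status" = "200"           # anything else fails this step and stops the pipeline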

5. Deploy to Production

Deployment Strategies (choose one):

Blue-Green Deployment

Old (Blue) ← 100% traffic
New (Green) ← deployed, health checked, 0% traffic

→ Switch traffic to Green
→ Keep Blue running for 1 hour for rollback
→ Terminate Blue after monitoring shows Green is stable

Pros: Instant rollback, zero downtime
Cons: Double infrastructure cost during deployment
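
A hedged sketch of the cutover, assuming Kubernetes with app-blue and app-green Deployments behind a single Service; the manifest path and selector labels are placeholders:

cutover:
  runs-on: ubuntu-latest
  steps:
    - name: Deploy green alongside blue and wait for it to become healthy
      run: |
        kubectl apply -f k8s/app-green.yaml
        kubectl rollout status deployment/app-green --timeout=300s
    - name: Point the Service at green (blue keeps running for instant rollback)
      run: kubectl patch service app -p '{"spec":{"selector":{"version":"green"}}}'

Rollback is the same patch pointing the selector back at blue.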

Canary Deployment

Old ← 95% traffic
New ← 5% traffic (canary)

→ Monitor error rates, latency for 15 min
→ If healthy: 50% traffic
→ If healthy: 100% traffic
→ If unhealthy: immediate rollback to 100% old

Pros: Gradual risk exposure, early warning
Cons: More complex monitoring

Rolling Deployment

Instances: [A, B, C, D, E]

→ Deploy to A, health check
→ Deploy to B, health check
→ Deploy to C, D, E sequentially

If any fails → stop, rollback deployed instances

Pros: No extra infrastructure
Cons: Mixed versions during deployment
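
On Kubernetes this is the default Deployment behavior (maxSurge/maxUnavailable control how many instances update at once); a hedged sketch of the pipeline step, with image name and timeout as placeholders:

deploy-production:
  runs-on: ubuntu-latest
  steps:
    - name: Rolling update, gated on readiness probes
      run: |
        kubectl set image deployment/app app=registry.example.com/app:${{ github.sha }}
        kubectl rollout status deployment/app --timeout=600s   # fails the job if the rollout stalls
    - name: Roll back the instances already updated if the rollout failed
      if: failure()
      run: kubectl rollout undo deployment/app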

Choose based on:

  • Blue-Green: Critical systems, tolerance for double cost
  • Canary: High-traffic systems with good metrics
  • Rolling: Cost-sensitive, moderate traffic

NEVER: Direct deployment with restart (causes downtime)

6. Verify Production

Automated post-deployment verification:

verify_production:
  - HTTP 200 from health endpoint
  - Response time < baseline + 20%
  - Error rate < 1%
  - Critical user flows functional (synthetic tests)
  - Database connections healthy
  - Cache hit rates normal

Auto-rollback triggers:

  • Health check fails for 2 consecutive checks
  • Error rate > 5% for 3 minutes
  • Response time > 2x baseline
  • Critical endpoint returns 5xx
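
In practice these triggers usually live in the monitoring system; as a minimal in-pipeline sketch, here is a post-deploy watch that rolls back after two consecutive failed health checks (URL, interval, and duration are placeholders, and Kubernetes is assumed):

verify-production:
  needs: deploy-production
  runs-on: ubuntu-latest
  steps:
    - name: Watch the health endpoint and auto-rollback on repeated failure
      run: |
        failures=0
        for i in $(seq 1 30); do                              # ~15 minutes at 30s intervals
          if curl --fail --silent https://app.example.com/health > /dev/null; then
            failures=0
          else
            failures=$((failures + 1))
          fi
          if [ "$failures" -ge 2 ]; then
            kubectl rollout undo deployment/app               # trigger: 2 consecutive failed checks
            exit 1
          fi
          sleep 30
        done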

7. Monitor

Observe for 1 hour post-deployment:

  • Error rates (by endpoint, by user segment)
  • Latency percentiles (p50, p95, p99)
  • Resource usage (CPU, memory, DB connections)
  • Business metrics (conversions, signups)

Dashboard must show:

  • Current deployment version
  • Time since last deployment
  • Comparison to pre-deployment metrics
  • Rollback button (one-click)

Database Migrations in CI/CD

Migration Strategy

1. Write backward-compatible migrations
   - Add columns as nullable first
   - Create new tables before dropping old
   - Add indexes with CONCURRENTLY (Postgres)

2. Deploy application code that works with old AND new schema

3. Run migration

4. Deploy code that uses new schema exclusively

5. Clean up old schema (separate deployment)

This takes 3 deployments, not 1. That's correct.

Migration Testing

test_migrations:
  - Apply migration to test DB
  - Run application tests against migrated schema
  - Test rollback (down migration)
  - Verify data integrity

Never skip migration rollback testing. You'll need it in production.
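
A hedged sketch of that job, assuming a Python service using Alembic against a throwaway Postgres; the tool and connection details are placeholders for whatever migration framework the project uses:

test-migrations:
  runs-on: ubuntu-latest
  services:
    postgres:
      image: postgres:16
      env:
        POSTGRES_PASSWORD: test
      ports: ["5432:5432"]
  env:
    DATABASE_URL: postgresql://postgres:test@localhost:5432/postgres
  steps:
    - uses: actions/checkout@v4
    - run: pip install -r requirements.txt
    - run: alembic upgrade head          # apply all migrations to the throwaway DB
    - run: pytest tests/                 # application tests against the migrated schema
    - run: alembic downgrade -1          # prove the newest migration rolls back cleanly
    - run: alembic upgrade head          # and applies again after rollback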

Secrets Management

Anti-pattern:

Hardcoded in workflow:

env:
  DATABASE_URL: postgresql://user:pass@localhost/db

Correct:

env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}

Secrets checklist:

  • Store in CI/CD secret manager (GitHub Secrets, GitLab CI/CD variables)
  • Rotate regularly (automated)
  • Different secrets per environment
  • Never log secret values
  • Use secret scanning tools

Environment Promotion

Progression:

Developer → CI Tests → Staging → Production

Gates between environments:

  1. CI → Staging: All tests pass
  2. Staging → Production:
    • Staging verification passes
    • Manual approval (for critical systems)
    • Business hours only (optional)
    • Monitoring shows staging is healthy
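
In GitHub Actions these gates can be expressed with needs: plus a protected environment; required reviewers and deployment-window rules are configured on the environment in repository settings (job names and URL below are placeholders):

deploy-production:
  needs: verify-staging                  # gate: staging deploy and verification must have passed
  runs-on: ubuntu-latest
  environment:
    name: production                     # gate: manual approval via environment protection rules
    url: https://app.example.com
  steps:
    - run: ./scripts/deploy.sh production ${{ github.sha }}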

Quick Reference: Pipeline Checklist

Before deploying to production, verify:

  • Tests run and pass in CI
  • Build creates immutable, tagged artifact
  • Staging environment exists and matches production
  • Migrations tested with rollback
  • Deployment strategy chosen (blue-green/canary/rolling)
  • Health check endpoint implemented
  • Automated verification tests written
  • Auto-rollback triggers configured
  • Monitoring dashboard shows deployment metrics
  • Rollback procedure tested
  • Secrets managed securely
  • On-call engineer notified of deployment

Common Mistakes

Mistake | Why It's Wrong | Fix
"Just restart the service" | Causes downtime, no rollback | Use blue-green or canary deployment
"Tests are slow, skip some" | Removes safety net | Parallel execution, smart caching
"We'll add staging later" | Production becomes your staging | Create staging first, before production pipeline
"Migrations in deployment script" | Can't roll back safely | Backward-compatible migrations, 3-step deployment
"Manual verification after deploy" | Slow, error-prone, doesn't scale | Automated health checks and smoke tests
"Deploy on main merge" | No gate, broken main can deploy | Require staging verification first
Hardcoded database credentials | Security risk, can't rotate | Use secret manager
"Single server is fine for now" | Downtime during deployment | Use multiple instances from day one

Rationalization Table

Excuse | Reality
"This is just an MVP/demo" | MVP pipelines become production pipelines. Build it right once.
"Staging is expensive" | Production incidents are more expensive. Staging prevents them.
"Blue-green doubles our costs" | Downtime and incidents cost more than temporary double infrastructure.
"We'll add rollback later" | You need rollback when a deployment fails. Later = too late.
"Health checks are overkill" | Silent failures in production are worse than no deployment.
"Migrations always work" | They don't. Test rollbacks before you need them.
"Our app is too simple for this" | Deployment complexity isn't about code complexity.

Red Flags - Stop and Fix Pipeline

If you catch yourself thinking:

  • "Just push to main, it'll be fine" → Add staging gate
  • "Tests passed locally, skip CI" → Never skip CI
  • "Restart is faster than blue-green" → Downtime is never acceptable
  • "We'll monitor manually after deploy" → Automate verification
  • "If it breaks, we'll fix forward" → Implement auto-rollback
  • "Migrations can run during deployment" → Backward-compatible migrations first
  • "One environment is enough" → Minimum: staging + production

All of these mean: Your pipeline will cause production incidents.

Cross-References

Related skills:

  • test-automation-architecture (ordis-quality-engineering) - Which tests to run where
  • observability-and-monitoring (ordis-quality-engineering) - Deployment monitoring
  • testing-in-production (ordis-quality-engineering) - Canary + feature flags
  • api-testing (axiom-web-backend) - Contract tests in CI
  • database-integration (axiom-web-backend) - Migration patterns

The Bottom Line

"Deploy to production" is not one step. It's:

  1. Build immutable artifact
  2. Deploy to staging
  3. Verify staging
  4. Deploy with zero-downtime strategy
  5. Verify production automatically
  6. Monitor with auto-rollback triggers

Skipping steps to "move fast" causes incidents. This IS moving fast.