# CI/CD Troubleshooting Comprehensive guide to diagnosing and resolving common CI/CD pipeline issues. ## Table of Contents - [Pipeline Failures](#pipeline-failures) - [Dependency Issues](#dependency-issues) - [Docker & Container Problems](#docker--container-problems) - [Authentication & Permissions](#authentication--permissions) - [Performance Issues](#performance-issues) - [Platform-Specific Issues](#platform-specific-issues) --- ## Pipeline Failures ### Workflow Not Triggering **GitHub Actions:** **Symptoms:** Workflow doesn't run on push/PR **Common causes:** 1. Workflow file in wrong location (must be `.github/workflows/`) 2. Invalid YAML syntax 3. Branch/path filters excluding the changes 4. Workflow disabled in repository settings **Diagnostics:** ```bash # Validate YAML yamllint .github/workflows/ci.yml # Check if workflow is disabled gh workflow list --repo owner/repo ``` **Solutions:** ```yaml # Check trigger configuration on: push: branches: [main] # Ensure your branch matches paths-ignore: - 'docs/**' # May be excluding your changes # Enable workflow gh workflow enable ci.yml --repo owner/repo ``` **GitLab CI:** **Symptoms:** Pipeline doesn't start **Diagnostics:** ```bash # Validate .gitlab-ci.yml gl-ci-lint < .gitlab-ci.yml # Check CI/CD settings # Project > Settings > CI/CD > General pipelines ``` **Solutions:** - Check if CI/CD is enabled for the project - Verify `.gitlab-ci.yml` is in repository root - Check pipeline must succeed setting isn't blocking - Review `only`/`except` or `rules` configuration ### Jobs Failing Intermittently **Symptoms:** Same job passes sometimes, fails others **Common causes:** 1. Flaky tests 2. Race conditions 3. Network timeouts 4. Resource constraints 5. Time-dependent tests **Identify flaky tests:** ```yaml # GitHub Actions - Run multiple times strategy: matrix: attempt: [1, 2, 3, 4, 5] steps: - run: npm test ``` **Solutions:** ```javascript // Add retries to flaky tests jest.retryTimes(3); // Increase timeouts jest.setTimeout(30000); // Fix race conditions await waitFor(() => expect(element).toBeInDocument(), { timeout: 5000 }); ``` **Network retry pattern:** ```yaml - name: Install with retry uses: nick-invision/retry@v2 with: timeout_minutes: 10 max_attempts: 3 command: npm ci ``` ### Timeout Errors **Symptoms:** "Job exceeded maximum time" or similar **Solutions:** ```yaml # GitHub Actions - Increase timeout jobs: build: timeout-minutes: 60 # Default: 360 # GitLab CI test: timeout: 2h # Default: 1h ``` **Optimize long-running jobs:** - Add caching for dependencies - Split tests into parallel jobs - Use faster runners - Identify and optimize slow tests ### Exit Code Errors **Symptoms:** "Process completed with exit code 1" **Diagnostics:** ```yaml # Add verbose logging - run: npm test -- --verbose # Check specific exit codes - run: | npm test EXIT_CODE=$? echo "Exit code: $EXIT_CODE" if [ $EXIT_CODE -eq 127 ]; then echo "Command not found" elif [ $EXIT_CODE -eq 1 ]; then echo "General error" fi exit $EXIT_CODE ``` **Common exit codes:** - `1`: General error - `2`: Misuse of shell command - `126`: Command cannot execute - `127`: Command not found - `130`: Terminated by Ctrl+C - `137`: Killed (OOM) - `143`: Terminated (SIGTERM) --- ## Dependency Issues ### "Module not found" or "Cannot find package" **Symptoms:** Build fails with missing dependency error **Causes:** 1. Missing dependency in `package.json` 2. Cache corruption 3. Lock file out of sync 4. Private package access issues **Solutions:** ```yaml # Clear cache and reinstall - run: rm -rf node_modules package-lock.json - run: npm install # Use npm ci for clean install - run: npm ci # Clear GitHub Actions cache # Settings > Actions > Caches > Delete specific cache # GitLab - clear cache cache: key: $CI_COMMIT_REF_SLUG policy: push # Force new cache ``` ### Version Conflicts **Symptoms:** Dependency resolution errors, peer dependency warnings **Diagnostics:** ```bash # Check for conflicts npm ls npm outdated # View dependency tree npm list --depth=1 ``` **Solutions:** ```json // Use overrides (package.json) { "overrides": { "problematic-package": "2.0.0" } } // Or resolutions (Yarn) { "resolutions": { "problematic-package": "2.0.0" } } ``` ### Private Package Access **Symptoms:** "401 Unauthorized" or "404 Not Found" for private packages **GitHub Packages:** ```yaml - run: | echo "@myorg:registry=https://npm.pkg.github.com" >> .npmrc echo "//npm.pkg.github.com/:_authToken=${{ secrets.GITHUB_TOKEN }}" >> .npmrc - run: npm ci ``` **npm Registry:** ```yaml - run: echo "//registry.npmjs.org/:_authToken=${{ secrets.NPM_TOKEN }}" >> .npmrc - run: npm ci ``` **GitLab Package Registry:** ```yaml before_script: - echo "@mygroup:registry=${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/npm/" >> .npmrc - echo "${CI_API_V4_URL#https?}/projects/${CI_PROJECT_ID}/packages/npm/:_authToken=${CI_JOB_TOKEN}" >> .npmrc ``` --- ## Docker & Container Problems ### "Cannot connect to Docker daemon" **Symptoms:** Docker commands fail with connection error **GitHub Actions:** ```yaml # Ensure Docker is available runs-on: ubuntu-latest # Has Docker pre-installed steps: - run: docker ps # Test Docker access ``` **GitLab CI:** ```yaml # Use Docker-in-Docker image: docker:latest services: - docker:dind variables: DOCKER_HOST: tcp://docker:2376 DOCKER_TLS_CERTDIR: "/certs" DOCKER_TLS_VERIFY: 1 DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client" ``` ### Image Pull Errors **Symptoms:** "Error response from daemon: pull access denied" or timeout **Solutions:** ```yaml # GitHub Actions - Login to registry - uses: docker/login-action@v3 with: registry: ghcr.io username: ${{ github.actor }} password: ${{ secrets.GITHUB_TOKEN }} # Or for Docker Hub - uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_TOKEN }} # Add retry logic - run: | for i in {1..3}; do docker pull myimage:latest && break sleep 5 done ``` ### "No space left on device" **Symptoms:** Docker build fails with disk space error **Solutions:** ```yaml # GitHub Actions - Clean up space - run: docker system prune -af --volumes # Or use built-in action - uses: jlumbroso/free-disk-space@main with: tool-cache: true android: true dotnet: true # GitLab - configure runner [[runners]] [runners.docker] volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"] [runners.docker.tmpfs] "/tmp" = "rw,noexec" ``` ### Multi-platform Build Issues **Symptoms:** Build fails for ARM/different architecture **Solution:** ```yaml - uses: docker/setup-qemu-action@v3 - uses: docker/setup-buildx-action@v3 - uses: docker/build-push-action@v5 with: platforms: linux/amd64,linux/arm64 context: . push: false ``` --- ## Authentication & Permissions ### "Permission denied" or "403 Forbidden" **GitHub Actions:** **Symptoms:** Cannot push, create release, or access API **Solutions:** ```yaml # Add necessary permissions permissions: contents: write # For pushing tags/releases pull-requests: write # For commenting on PRs packages: write # For pushing packages id-token: write # For OIDC # Check GITHUB_TOKEN permissions - run: | curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \ https://api.github.com/repos/${{ github.repository }} ``` **GitLab CI:** **Symptoms:** Cannot push to repository or access API **Solutions:** ```yaml # Use CI_JOB_TOKEN for API access script: - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" "${CI_API_V4_URL}/projects"' # Or use personal/project access token variables: GIT_STRATEGY: clone before_script: - git config --global user.email "ci@example.com" - git config --global user.name "CI Bot" ``` ### Git Push Failures **Symptoms:** "failed to push some refs" or "protected branch" **Solutions:** ```yaml # GitHub Actions - Check branch protection # Settings > Branches > Branch protection rules # Allow bypass permissions: contents: write # Or use PAT with admin access - uses: actions/checkout@v4 with: token: ${{ secrets.ADMIN_PAT }} # GitLab - Grant permissions # Settings > Repository > Protected Branches # Add CI/CD role with push permission ``` ### AWS Credentials Issues **Symptoms:** "Unable to locate credentials" **Solutions:** ```yaml # Using OIDC (recommended) - uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: arn:aws:iam::123456789:role/GitHubActionsRole aws-region: us-east-1 # Using secrets (legacy) - uses: aws-actions/configure-aws-credentials@v4 with: aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} aws-region: us-east-1 # Test credentials - run: aws sts get-caller-identity ``` --- ## Performance Issues ### Slow Pipeline Execution **Diagnostics:** ```bash # GitHub - View timing gh run view --log # Identify slow steps # Each step shows duration in UI ``` **Solutions:** - See [optimization.md](optimization.md) for comprehensive guide - Add dependency caching - Parallelize independent jobs - Use faster runners - Reduce test scope on PRs ### Cache Not Working **Symptoms:** Cache always misses, builds still slow **Diagnostics:** ```yaml - uses: actions/cache@v4 id: cache with: path: node_modules key: ${{ hashFiles('**/package-lock.json') }} - run: echo "Cache hit: ${{ steps.cache.outputs.cache-hit }}" ``` **Common issues:** 1. Key changes every time 2. Path doesn't exist 3. Cache size exceeds limit 4. Cache evicted (LRU after 7 days on GitHub) **Solutions:** ```yaml # Use consistent key key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} # Add restore-keys for partial match restore-keys: | ${{ runner.os }}-node- # Check cache size - run: du -sh node_modules ``` --- ## Platform-Specific Issues ### GitHub Actions **"Resource not accessible by integration":** ```yaml # Add required permission permissions: issues: write # Or whatever resource you're accessing ``` **"Workflow is not shared":** - Reusable workflows must be in `.github/workflows/` - Repository must be public or org member - Check workflow access settings **"No runner available":** - Self-hosted: Check runner is online and has matching labels - GitHub-hosted: May hit concurrent job limit (check usage) ### GitLab CI **"This job is stuck":** - No runner available with matching tags - All runners are busy - Runner not configured for this project **Solutions:** ```yaml # Remove tags to use any available runner job: tags: [] # Or check runner configuration # Settings > CI/CD > Runners ``` **"Job failed (system failure)":** - Runner disconnected - Resource limits exceeded - Infrastructure issue **Check runner logs:** ```bash # On runner host journalctl -u gitlab-runner -f ``` --- ## Debugging Techniques ### Enable Debug Logging **GitHub Actions:** ```yaml # Repository > Settings > Secrets > Add: # ACTIONS_RUNNER_DEBUG = true # ACTIONS_STEP_DEBUG = true ``` **GitLab CI:** ```yaml variables: CI_DEBUG_TRACE: "true" # Caution: May expose secrets! ``` ### Interactive Debugging **GitHub Actions:** ```yaml # Add tmate for SSH access - uses: mxschmitt/action-tmate@v3 if: failure() ``` **Local reproduction:** ```bash # Use act to run GitHub Actions locally act -j build # Or nektos/act for Docker docker run -v $(pwd):/workspace -it nektos/act -j build ``` ### Reproduce Locally ```bash # GitHub Actions - Use same Docker image docker run -it ubuntu:latest bash # Install dependencies and test apt-get update && apt-get install -y nodejs npm npm ci npm test ``` --- ## Prevention Strategies ### Pre-commit Checks ```yaml # .pre-commit-config.yaml repos: - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.5.0 hooks: - id: trailing-whitespace - id: check-yaml - id: check-added-large-files - repo: local hooks: - id: tests name: Run tests entry: npm test language: system pass_filenames: false ``` ### CI/CD Health Monitoring Use the `scripts/ci_health.py` script: ```bash python3 scripts/ci_health.py --platform github --repo owner/repo ``` ### Regular Maintenance - [ ] Monthly: Review failed job patterns - [ ] Monthly: Update actions/dependencies - [ ] Quarterly: Audit pipeline efficiency - [ ] Quarterly: Review and clean old caches - [ ] Yearly: Major version updates --- ## Getting Help **GitHub Actions:** - Community Forum: https://github.community - Documentation: https://docs.github.com/actions - Status: https://www.githubstatus.com **GitLab CI:** - Forum: https://forum.gitlab.com - Documentation: https://docs.gitlab.com/ee/ci - Status: https://status.gitlab.com **General CI/CD:** - Stack Overflow: Tag [github-actions] or [gitlab-ci] - Reddit: r/devops, r/cicd