CI/CD Troubleshooting

Comprehensive guide to diagnosing and resolving common CI/CD pipeline issues.

Pipeline Failures

Workflow Not Triggering

GitHub Actions:

Symptoms: Workflow doesn't run on push/PR

Common causes:

  1. Workflow file in wrong location (must be .github/workflows/)
  2. Invalid YAML syntax
  3. Branch/path filters excluding the changes
  4. Workflow disabled in repository settings

Diagnostics:

# Validate YAML
yamllint .github/workflows/ci.yml

# Check if workflow is disabled
gh workflow list --repo owner/repo

Solutions:

# Check trigger configuration
on:
  push:
    branches: [main]  # Ensure your branch matches
    paths-ignore:
      - 'docs/**'  # May be excluding your changes

# Enable workflow
gh workflow enable ci.yml --repo owner/repo
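When a `paths-ignore` filter is the suspect, it can help to compare the changed files against the ignore patterns. The following is a rough sketch only: it uses Python's `fnmatch`, whose glob rules are not identical to GitHub's filter syntax, and the function name `push_would_trigger` is hypothetical.

```python
from fnmatch import fnmatch


def push_would_trigger(changed_files, paths_ignore):
    """Return True if at least one changed file is NOT matched by paths-ignore.

    Approximation only: fnmatch's "*" also matches "/", which differs from
    GitHub's own pattern rules, so treat the result as a hint.
    """
    def ignored(path):
        return any(fnmatch(path, pattern) for pattern in paths_ignore)

    return any(not ignored(path) for path in changed_files)


if __name__ == "__main__":
    # A docs-only push is filtered out; a mixed push still triggers.
    print(push_would_trigger(["docs/guide.md"], ["docs/**"]))
    print(push_would_trigger(["docs/guide.md", "src/app.js"], ["docs/**"]))
```

The key point is that a push is skipped only when every changed file matches an ignore pattern.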

GitLab CI:

Symptoms: Pipeline doesn't start

Diagnostics:

# Validate .gitlab-ci.yml (requires the GitLab CLI, glab)
glab ci lint

# Check CI/CD settings
# Project > Settings > CI/CD > General pipelines

Solutions:

  • Check if CI/CD is enabled for the project
  • Verify .gitlab-ci.yml is in repository root
  • Check that the "Pipelines must succeed" merge setting isn't what is blocking you
  • Review only/except or rules configuration

Jobs Failing Intermittently

Symptoms: Same job passes sometimes, fails others

Common causes:

  1. Flaky tests
  2. Race conditions
  3. Network timeouts
  4. Resource constraints
  5. Time-dependent tests

Identify flaky tests:

# GitHub Actions - Run multiple times
strategy:
  matrix:
    attempt: [1, 2, 3, 4, 5]
steps:
  - run: npm test
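The same repeat-and-compare idea can be run outside CI. This is a minimal sketch, assuming the test command is safe to run several times in a row; the helper names `run_repeatedly` and `summarize` are hypothetical.

```python
import subprocess
import sys


def run_repeatedly(command, attempts=5):
    """Run `command` several times and return the list of exit codes."""
    return [subprocess.run(command).returncode for _ in range(attempts)]


def summarize(exit_codes):
    """Classify a series of exit codes as stable-pass, stable-fail, or flaky."""
    failures = sum(1 for code in exit_codes if code != 0)
    if failures == 0:
        return "stable-pass"
    if failures == len(exit_codes):
        return "stable-fail"
    return "flaky"


if __name__ == "__main__":
    # Placeholder command; replace with e.g. ["npm", "test"].
    codes = run_repeatedly([sys.executable, "-c", "pass"], attempts=3)
    print(codes, summarize(codes))
```

A mix of zero and non-zero exit codes for an unchanged commit is the signal that the job is flaky rather than broken.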

Solutions:

// Add retries to flaky tests
jest.retryTimes(3);

// Increase timeouts
jest.setTimeout(30000);

// Fix race conditions
await waitFor(() => expect(element).toBeInTheDocument(), {
  timeout: 5000
});

Network retry pattern:

- name: Install with retry
  uses: nick-fields/retry@v3  # Formerly published as nick-invision/retry
  with:
    timeout_minutes: 10
    max_attempts: 3
    command: npm ci
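For scripts that make their own network calls, the same retry pattern can be applied in code. The sketch below is one possible shape, not the retry action's implementation; `retry` and `flaky_download` are hypothetical names, and the doubling delay is an assumption about what backoff suits your service.

```python
import time


def retry(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The delay doubles after each failure (base_delay, 2x, 4x, ...). The last
    exception is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_download():
        # Hypothetical stand-in for a network call that fails twice.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated timeout")
        return "ok"

    # Skip real sleeping in the demo by injecting a no-op sleep.
    print(retry(flaky_download, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to test without waiting for real delays.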

Timeout Errors

Symptoms: "Job exceeded maximum time" or similar

Solutions:

# GitHub Actions - Increase timeout
jobs:
  build:
    timeout-minutes: 60  # Default: 360

# GitLab CI
test:
  timeout: 2h  # Default: 1h

Optimize long-running jobs:

  • Add caching for dependencies
  • Split tests into parallel jobs
  • Use faster runners
  • Identify and optimize slow tests
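Splitting tests into parallel jobs needs a deterministic way to assign files to jobs. A minimal sketch, assuming each job knows its zero-based index and the total job count (for example from a matrix value); the function name `shard` is hypothetical.

```python
def shard(test_files, shard_index, shard_count):
    """Return the subset of test files assigned to one parallel job.

    Files are sorted first so every job computes the same assignment, then
    dealt out round-robin. `shard_index` is zero-based.
    """
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return sorted(test_files)[shard_index::shard_count]


if __name__ == "__main__":
    files = ["c.test.js", "a.test.js", "b.test.js", "d.test.js"]
    for index in range(2):
        print(index, shard(files, index, 2))
```

Round-robin assignment balances file counts, not run time; weighting by recorded test duration is a common refinement.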

Exit Code Errors

Symptoms: "Process completed with exit code 1"

Diagnostics:

# Add verbose logging
- run: npm test -- --verbose

# Check specific exit codes
# (the default shell runs with -e, so capture the code without aborting the step)
- run: |
    EXIT_CODE=0
    npm test || EXIT_CODE=$?
    echo "Exit code: $EXIT_CODE"
    if [ "$EXIT_CODE" -eq 127 ]; then
      echo "Command not found"
    elif [ "$EXIT_CODE" -eq 1 ]; then
      echo "General error"
    fi
    exit "$EXIT_CODE"

Common exit codes:

  • 1: General error
  • 2: Misuse of shell command
  • 126: Command cannot execute
  • 127: Command not found
  • 130: Terminated by Ctrl+C (SIGINT)
  • 137: Killed (SIGKILL, often out of memory)
  • 143: Terminated (SIGTERM)
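Codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 maps to SIGKILL (9) and 143 to SIGTERM (15). The sketch below decodes that convention on a POSIX system; `describe_exit_code` and `KNOWN_CODES` are hypothetical names.

```python
import signal

KNOWN_CODES = {
    1: "General error",
    2: "Misuse of shell command",
    126: "Command cannot execute",
    127: "Command not found",
}


def describe_exit_code(code):
    """Return a short description of a process exit code.

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code == 0:
        return "Success"
    if code in KNOWN_CODES:
        return KNOWN_CODES[code]
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {code - 128}"
        return f"Terminated by {name}"
    return "Application-specific error"


if __name__ == "__main__":
    for code in (1, 127, 130, 137, 143):
        print(code, describe_exit_code(code))
```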

Dependency Issues

"Module not found" or "Cannot find package"

Symptoms: Build fails with missing dependency error

Causes:

  1. Missing dependency in package.json
  2. Cache corruption
  3. Lock file out of sync
  4. Private package access issues
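
A lock file that is out of sync can be detected before the install step runs. This is a minimal sketch, assuming `lockfileVersion` 2 or 3, where the root project is recorded under `packages[""]`; it only compares dependency names, not version ranges, and `missing_from_lock` is a hypothetical helper.

```python
import json


def missing_from_lock(package_json_text, package_lock_text):
    """Return dependency names declared in package.json but absent from the lock.

    Assumes lockfileVersion 2 or 3, where the root project is stored under
    packages[""] in package-lock.json.
    """
    manifest = json.loads(package_json_text)
    lock_root = json.loads(package_lock_text).get("packages", {}).get("", {})

    missing = []
    for section in ("dependencies", "devDependencies"):
        declared = manifest.get(section, {})
        locked = lock_root.get(section, {})
        missing.extend(name for name in declared if name not in locked)
    return sorted(missing)


if __name__ == "__main__":
    manifest = '{"dependencies": {"express": "^4.18.0", "lodash": "^4.17.21"}}'
    lock = '{"lockfileVersion": 3, "packages": {"": {"dependencies": {"express": "^4.18.0"}}}}'
    print(missing_from_lock(manifest, lock))
```

`npm ci` performs a stricter version of this check itself and fails when the two files disagree.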

Solutions:

# Clear cache and reinstall
- run: rm -rf node_modules package-lock.json
- run: npm install

# Use npm ci for clean install
- run: npm ci

# Clear GitHub Actions cache
# Repository > Actions tab > Caches > Delete specific cache

# GitLab - clear cache
cache:
  key: $CI_COMMIT_REF_SLUG
  policy: push  # Force new cache

Version Conflicts

Symptoms: Dependency resolution errors, peer dependency warnings

Diagnostics:

# Check for conflicts
npm ls
npm outdated

# View dependency tree
npm list --depth=1

Solutions:

// Use overrides (package.json)
{
  "overrides": {
    "problematic-package": "2.0.0"
  }
}

// Or resolutions (Yarn)
{
  "resolutions": {
    "problematic-package": "2.0.0"
  }
}

Private Package Access

Symptoms: "401 Unauthorized" or "404 Not Found" for private packages

GitHub Packages:

- run: |
    echo "@myorg:registry=https://npm.pkg.github.com" >> .npmrc
    echo "//npm.pkg.github.com/:_authToken=${{ secrets.GITHUB_TOKEN }}" >> .npmrc
- run: npm ci

npm Registry:

- run: echo "//registry.npmjs.org/:_authToken=${{ secrets.NPM_TOKEN }}" >> .npmrc
- run: npm ci

GitLab Package Registry:

before_script:
  - echo "@mygroup:registry=${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/npm/" >> .npmrc
  - echo "${CI_API_V4_URL#https?}/projects/${CI_PROJECT_ID}/packages/npm/:_authToken=${CI_JOB_TOKEN}" >> .npmrc
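The second line relies on the shell expansion `${CI_API_V4_URL#https?}`, which strips the leading `https:` so the auth entry starts with `//host/path`. The sketch below rebuilds both `.npmrc` lines to make the expected output visible; `gitlab_npmrc_lines` is a hypothetical helper and the example host is made up.

```python
from urllib.parse import urlsplit


def gitlab_npmrc_lines(api_v4_url, project_id, scope="@mygroup", token_ref="${CI_JOB_TOKEN}"):
    """Build the two .npmrc lines for a GitLab project-level npm registry.

    The auth line must start with "//host/path" (no scheme), which is what the
    shell expansion ${CI_API_V4_URL#https?} produces by stripping "https:".
    """
    registry = f"{api_v4_url}/projects/{project_id}/packages/npm/"
    parts = urlsplit(registry)
    scheme_less = f"//{parts.netloc}{parts.path}"
    return [
        f"{scope}:registry={registry}",
        f"{scheme_less}:_authToken={token_ref}",
    ]


if __name__ == "__main__":
    # Hypothetical instance URL and project ID, for illustration only.
    for line in gitlab_npmrc_lines("https://gitlab.example.com/api/v4", 42):
        print(line)
```

If the generated `.npmrc` in a failing job does not match this shape, the 401 or 404 usually comes from the registry URL rather than the token.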

Docker & Container Problems

"Cannot connect to Docker daemon"

Symptoms: Docker commands fail with connection error

GitHub Actions:

# Ensure Docker is available
runs-on: ubuntu-latest  # Has Docker pre-installed

steps:
  - run: docker ps  # Test Docker access

GitLab CI:

# Use Docker-in-Docker
image: docker:latest
services:
  - docker:dind

variables:
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"

Image Pull Errors

Symptoms: "Error response from daemon: pull access denied" or timeout

Solutions:

# GitHub Actions - Login to registry
- uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

# Or for Docker Hub
- uses: docker/login-action@v3
  with:
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}

# Add retry logic (fail the step if every attempt fails)
- run: |
    for i in 1 2 3; do
      docker pull myimage:latest && exit 0
      echo "Pull attempt $i failed; retrying in 5s"
      sleep 5
    done
    exit 1

"No space left on device"

Symptoms: Docker build fails with disk space error

Solutions:

# GitHub Actions - Clean up space
- run: docker system prune -af --volumes

# Or use a community action (third-party; consider pinning to a release or commit)
- uses: jlumbroso/free-disk-space@main
  with:
    tool-cache: true
    android: true
    dotnet: true

# GitLab - configure runner (config.toml on the runner host)
[[runners]]
  [runners.docker]
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]
  [runners.docker.tmpfs]
    "/tmp" = "rw,noexec"

Multi-platform Build Issues

Symptoms: Build fails for ARM/different architecture

Solution:

- uses: docker/setup-qemu-action@v3

- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    platforms: linux/amd64,linux/arm64
    context: .
    push: false

Authentication & Permissions

"Permission denied" or "403 Forbidden"

GitHub Actions:

Symptoms: Cannot push, create release, or access API

Solutions:

# Add necessary permissions
permissions:
  contents: write  # For pushing tags/releases
  pull-requests: write  # For commenting on PRs
  packages: write  # For pushing packages
  id-token: write  # For OIDC

# Check GITHUB_TOKEN permissions
- run: |
    curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
      https://api.github.com/repos/${{ github.repository }}

GitLab CI:

Symptoms: Cannot push to repository or access API

Solutions:

# Use CI_JOB_TOKEN for API access
script:
  - 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" "${CI_API_V4_URL}/projects"'

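# Or use a personal/project access token to push
# (PROJECT_ACCESS_TOKEN is a hypothetical CI/CD variable holding the token)
variables:
  GIT_STRATEGY: clone
before_script:
  - git config --global user.email "ci@example.com"
  - git config --global user.name "CI Bot"
  - git remote set-url origin "https://oauth2:${PROJECT_ACCESS_TOKEN}@${CI_SERVER_HOST}/${CI_PROJECT_PATH}.git"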

Git Push Failures

Symptoms: "failed to push some refs" or "protected branch"

Solutions:

# GitHub Actions - Check branch protection
# Settings > Branches > Branch protection rules

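# Grant the workflow token write access to repository contents
# (this alone does not bypass branch protection rules)
permissions:
  contents: write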

# Or use PAT with admin access
- uses: actions/checkout@v4
  with:
    token: ${{ secrets.ADMIN_PAT }}

# GitLab - Grant permissions
# Settings > Repository > Protected Branches
# Under "Allowed to push and merge", add the role or token identity the job pushes as

AWS Credentials Issues

Symptoms: "Unable to locate credentials"

Solutions:

# Using OIDC (recommended; the job needs `permissions: id-token: write`)
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsRole  # Placeholder 12-digit account ID
    aws-region: us-east-1

# Using secrets (legacy)
- uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1

# Test credentials
- run: aws sts get-caller-identity

Performance Issues

Slow Pipeline Execution

Diagnostics:

# GitHub - View timing
gh run view <run-id> --log

# Identify slow steps
# Each step shows duration in UI

Solutions:

  • See optimization.md for comprehensive guide
  • Add dependency caching
  • Parallelize independent jobs
  • Use faster runners
  • Reduce test scope on PRs

Cache Not Working

Symptoms: Cache always misses, builds still slow

Diagnostics:

- uses: actions/cache@v4
  id: cache
  with:
    path: node_modules
    key: ${{ hashFiles('**/package-lock.json') }}

- run: echo "Cache hit: ${{ steps.cache.outputs.cache-hit }}"

Common issues:

  1. Key changes every time
  2. Path doesn't exist
  3. Cache size exceeds limit
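  4. Cache evicted (on GitHub, entries not accessed for more than 7 days are removed, and the oldest entries are evicted when the repository exceeds its cache storage limit)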

Solutions:

# Use consistent key
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

# Add restore-keys for partial match
restore-keys: |
  ${{ runner.os }}-node-

# Check cache size
- run: du -sh node_modules
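A cache key is stable when it depends only on inputs that change when the dependencies change. The sketch below illustrates the idea with a SHA-256 of the lock file contents; it is not how `hashFiles` computes its value, and `cache_key` and `restore_prefix` are hypothetical helpers.

```python
import hashlib


def cache_key(runner_os, lockfile_bytes, prefix="node"):
    """Build a cache key that changes only when the lock file content changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{runner_os}-{prefix}-{digest}"


def restore_prefix(runner_os, prefix="node"):
    """Prefix used as a fallback when no exact key matches."""
    return f"{runner_os}-{prefix}-"


if __name__ == "__main__":
    lock_v1 = b'{"lockfileVersion": 3}'
    lock_v2 = b'{"lockfileVersion": 3, "packages": {}}'
    print(cache_key("Linux", lock_v1) == cache_key("Linux", lock_v1))  # same input, same key
    print(cache_key("Linux", lock_v1) == cache_key("Linux", lock_v2))  # changed lock, new key
```

Keys that embed a timestamp, run number, or commit SHA change on every run, which is the usual cause of a cache that never hits.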

Platform-Specific Issues

GitHub Actions

"Resource not accessible by integration":

# Add required permission
permissions:
  issues: write  # Or whatever resource you're accessing

"Workflow is not shared":

  • Reusable workflows must be in .github/workflows/
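  • The repository hosting the reusable workflow must be public, or must grant access to the calling repository (for example, within the same organization)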
  • Check workflow access settings

"No runner available":

  • Self-hosted: Check runner is online and has matching labels
  • GitHub-hosted: May hit concurrent job limit (check usage)

GitLab CI

"This job is stuck":

  • No runner available with matching tags
  • All runners are busy
  • Runner not configured for this project

Solutions:

# Remove tags to use any available runner that accepts untagged jobs
job:
  tags: []

# Or check runner configuration
# Settings > CI/CD > Runners

"Job failed (system failure)":

  • Runner disconnected
  • Resource limits exceeded
  • Infrastructure issue

Check runner logs:

# On runner host
journalctl -u gitlab-runner -f

Debugging Techniques

Enable Debug Logging

GitHub Actions:

# Repository > Settings > Secrets and variables > Actions > Add (as a secret or variable):
# ACTIONS_RUNNER_DEBUG = true
# ACTIONS_STEP_DEBUG = true

GitLab CI:

variables:
  CI_DEBUG_TRACE: "true"  # Caution: May expose secrets!

Interactive Debugging

GitHub Actions:

# Add tmate for SSH access
- uses: mxschmitt/action-tmate@v3
  if: failure()

Local reproduction:

# Use act to run GitHub Actions locally
act -j build

# Or run act from a container (assumes an image named nektos/act is available to you)
docker run -v "$(pwd)":/workspace -it nektos/act -j build

Reproduce Locally

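# GitHub Actions - Approximate the runner with a plain Ubuntu image
# (hosted runners ship many more preinstalled tools than this base image)
docker run -it ubuntu:latest bash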

# Install dependencies and test
apt-get update && apt-get install -y nodejs npm
npm ci
npm test

Prevention Strategies

Pre-commit Checks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: check-yaml
      - id: check-added-large-files

  - repo: local
    hooks:
      - id: tests
        name: Run tests
        entry: npm test
        language: system
        pass_filenames: false

CI/CD Health Monitoring

Use the scripts/ci_health.py script:

python3 scripts/ci_health.py --platform github --repo owner/repo

Regular Maintenance

  • Monthly: Review failed job patterns
  • Monthly: Update actions/dependencies
  • Quarterly: Audit pipeline efficiency
  • Quarterly: Review and clean old caches
  • Yearly: Major version updates
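
The monthly review of failed job patterns can start from a simple tally of failures per workflow. This is a minimal sketch: it assumes run records shaped like JSON from a CI API or CLI (for example `gh run list --json workflowName,conclusion`; check the field names against your CLI version), and `failure_counts` is a hypothetical helper.

```python
from collections import Counter


def failure_counts(runs):
    """Count failed runs per workflow, most frequent first.

    `runs` is a list of dicts with "workflowName" and "conclusion" keys; this
    shape is an assumption modeled on JSON output from a CI API or CLI.
    """
    counts = Counter(
        run["workflowName"] for run in runs if run.get("conclusion") == "failure"
    )
    return counts.most_common()


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample = [
        {"workflowName": "CI", "conclusion": "failure"},
        {"workflowName": "CI", "conclusion": "success"},
        {"workflowName": "Deploy", "conclusion": "failure"},
        {"workflowName": "CI", "conclusion": "failure"},
    ]
    for name, count in failure_counts(sample):
        print(f"{name}: {count} failed runs")
```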

Getting Help

GitHub Actions:

GitLab CI:

General CI/CD:

  • Stack Overflow: Tag [github-actions] or [gitlab-ci]
  • Reddit: r/devops, r/cicd