13 KiB
CI/CD Troubleshooting
Comprehensive guide to diagnosing and resolving common CI/CD pipeline issues.
Table of Contents
- Pipeline Failures
- Dependency Issues
- Docker & Container Problems
- Authentication & Permissions
- Performance Issues
- Platform-Specific Issues
Pipeline Failures
Workflow Not Triggering
GitHub Actions:
Symptoms: Workflow doesn't run on push/PR
Common causes:
- Workflow file in wrong location (must be
.github/workflows/) - Invalid YAML syntax
- Branch/path filters excluding the changes
- Workflow disabled in repository settings
Diagnostics:
# Validate YAML
yamllint .github/workflows/ci.yml
# Check if workflow is disabled
gh workflow list --repo owner/repo
Solutions:
# Check trigger configuration
on:
push:
branches: [main] # Ensure your branch matches
paths-ignore:
- 'docs/**' # May be excluding your changes
# Enable workflow
gh workflow enable ci.yml --repo owner/repo
GitLab CI:
Symptoms: Pipeline doesn't start
Diagnostics:
# Validate .gitlab-ci.yml
gl-ci-lint < .gitlab-ci.yml
# Check CI/CD settings
# Project > Settings > CI/CD > General pipelines
Solutions:
- Check if CI/CD is enabled for the project
- Verify
.gitlab-ci.ymlis in repository root - Check pipeline must succeed setting isn't blocking
- Review
only/exceptorrulesconfiguration
Jobs Failing Intermittently
Symptoms: Same job passes sometimes, fails others
Common causes:
- Flaky tests
- Race conditions
- Network timeouts
- Resource constraints
- Time-dependent tests
Identify flaky tests:
# GitHub Actions - Run multiple times
strategy:
matrix:
attempt: [1, 2, 3, 4, 5]
steps:
- run: npm test
Solutions:
// Add retries to flaky tests
jest.retryTimes(3);
// Increase timeouts
jest.setTimeout(30000);
// Fix race conditions
await waitFor(() => expect(element).toBeInDocument(), {
timeout: 5000
});
Network retry pattern:
- name: Install with retry
uses: nick-invision/retry@v2
with:
timeout_minutes: 10
max_attempts: 3
command: npm ci
Timeout Errors
Symptoms: "Job exceeded maximum time" or similar
Solutions:
# GitHub Actions - Increase timeout
jobs:
build:
timeout-minutes: 60 # Default: 360
# GitLab CI
test:
timeout: 2h # Default: 1h
Optimize long-running jobs:
- Add caching for dependencies
- Split tests into parallel jobs
- Use faster runners
- Identify and optimize slow tests
Exit Code Errors
Symptoms: "Process completed with exit code 1"
Diagnostics:
# Add verbose logging
- run: npm test -- --verbose
# Check specific exit codes
- run: |
npm test
EXIT_CODE=$?
echo "Exit code: $EXIT_CODE"
if [ $EXIT_CODE -eq 127 ]; then
echo "Command not found"
elif [ $EXIT_CODE -eq 1 ]; then
echo "General error"
fi
exit $EXIT_CODE
Common exit codes:
1: General error2: Misuse of shell command126: Command cannot execute127: Command not found130: Terminated by Ctrl+C137: Killed (OOM)143: Terminated (SIGTERM)
Dependency Issues
"Module not found" or "Cannot find package"
Symptoms: Build fails with missing dependency error
Causes:
- Missing dependency in
package.json - Cache corruption
- Lock file out of sync
- Private package access issues
Solutions:
# Clear cache and reinstall
- run: rm -rf node_modules package-lock.json
- run: npm install
# Use npm ci for clean install
- run: npm ci
# Clear GitHub Actions cache
# Settings > Actions > Caches > Delete specific cache
# GitLab - clear cache
cache:
key: $CI_COMMIT_REF_SLUG
policy: push # Force new cache
Version Conflicts
Symptoms: Dependency resolution errors, peer dependency warnings
Diagnostics:
# Check for conflicts
npm ls
npm outdated
# View dependency tree
npm list --depth=1
Solutions:
// Use overrides (package.json)
{
"overrides": {
"problematic-package": "2.0.0"
}
}
// Or resolutions (Yarn)
{
"resolutions": {
"problematic-package": "2.0.0"
}
}
Private Package Access
Symptoms: "401 Unauthorized" or "404 Not Found" for private packages
GitHub Packages:
- run: |
echo "@myorg:registry=https://npm.pkg.github.com" >> .npmrc
echo "//npm.pkg.github.com/:_authToken=${{ secrets.GITHUB_TOKEN }}" >> .npmrc
- run: npm ci
npm Registry:
- run: echo "//registry.npmjs.org/:_authToken=${{ secrets.NPM_TOKEN }}" >> .npmrc
- run: npm ci
GitLab Package Registry:
before_script:
- echo "@mygroup:registry=${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/npm/" >> .npmrc
- echo "${CI_API_V4_URL#https?}/projects/${CI_PROJECT_ID}/packages/npm/:_authToken=${CI_JOB_TOKEN}" >> .npmrc
Docker & Container Problems
"Cannot connect to Docker daemon"
Symptoms: Docker commands fail with connection error
GitHub Actions:
# Ensure Docker is available
runs-on: ubuntu-latest # Has Docker pre-installed
steps:
- run: docker ps # Test Docker access
GitLab CI:
# Use Docker-in-Docker
image: docker:latest
services:
- docker:dind
variables:
DOCKER_HOST: tcp://docker:2376
DOCKER_TLS_CERTDIR: "/certs"
DOCKER_TLS_VERIFY: 1
DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
Image Pull Errors
Symptoms: "Error response from daemon: pull access denied" or timeout
Solutions:
# GitHub Actions - Login to registry
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
# Or for Docker Hub
- uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
# Add retry logic
- run: |
for i in {1..3}; do
docker pull myimage:latest && break
sleep 5
done
"No space left on device"
Symptoms: Docker build fails with disk space error
Solutions:
# GitHub Actions - Clean up space
- run: docker system prune -af --volumes
# Or use built-in action
- uses: jlumbroso/free-disk-space@main
with:
tool-cache: true
android: true
dotnet: true
# GitLab - configure runner
[[runners]]
[runners.docker]
volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]
[runners.docker.tmpfs]
"/tmp" = "rw,noexec"
Multi-platform Build Issues
Symptoms: Build fails for ARM/different architecture
Solution:
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
with:
platforms: linux/amd64,linux/arm64
context: .
push: false
Authentication & Permissions
"Permission denied" or "403 Forbidden"
GitHub Actions:
Symptoms: Cannot push, create release, or access API
Solutions:
# Add necessary permissions
permissions:
contents: write # For pushing tags/releases
pull-requests: write # For commenting on PRs
packages: write # For pushing packages
id-token: write # For OIDC
# Check GITHUB_TOKEN permissions
- run: |
curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
https://api.github.com/repos/${{ github.repository }}
GitLab CI:
Symptoms: Cannot push to repository or access API
Solutions:
# Use CI_JOB_TOKEN for API access
script:
- 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" "${CI_API_V4_URL}/projects"'
# Or use personal/project access token
variables:
GIT_STRATEGY: clone
before_script:
- git config --global user.email "ci@example.com"
- git config --global user.name "CI Bot"
Git Push Failures
Symptoms: "failed to push some refs" or "protected branch"
Solutions:
# GitHub Actions - Check branch protection
# Settings > Branches > Branch protection rules
# Allow bypass
permissions:
contents: write
# Or use PAT with admin access
- uses: actions/checkout@v4
with:
token: ${{ secrets.ADMIN_PAT }}
# GitLab - Grant permissions
# Settings > Repository > Protected Branches
# Add CI/CD role with push permission
AWS Credentials Issues
Symptoms: "Unable to locate credentials"
Solutions:
# Using OIDC (recommended)
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/GitHubActionsRole
aws-region: us-east-1
# Using secrets (legacy)
- uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
# Test credentials
- run: aws sts get-caller-identity
Performance Issues
Slow Pipeline Execution
Diagnostics:
# GitHub - View timing
gh run view <run-id> --log
# Identify slow steps
# Each step shows duration in UI
Solutions:
- See optimization.md for comprehensive guide
- Add dependency caching
- Parallelize independent jobs
- Use faster runners
- Reduce test scope on PRs
Cache Not Working
Symptoms: Cache always misses, builds still slow
Diagnostics:
- uses: actions/cache@v4
id: cache
with:
path: node_modules
key: ${{ hashFiles('**/package-lock.json') }}
- run: echo "Cache hit: ${{ steps.cache.outputs.cache-hit }}"
Common issues:
- Key changes every time
- Path doesn't exist
- Cache size exceeds limit
- Cache evicted (LRU after 7 days on GitHub)
Solutions:
# Use consistent key
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
# Add restore-keys for partial match
restore-keys: |
${{ runner.os }}-node-
# Check cache size
- run: du -sh node_modules
Platform-Specific Issues
GitHub Actions
"Resource not accessible by integration":
# Add required permission
permissions:
issues: write # Or whatever resource you're accessing
"Workflow is not shared":
- Reusable workflows must be in
.github/workflows/ - Repository must be public or org member
- Check workflow access settings
"No runner available":
- Self-hosted: Check runner is online and has matching labels
- GitHub-hosted: May hit concurrent job limit (check usage)
GitLab CI
"This job is stuck":
- No runner available with matching tags
- All runners are busy
- Runner not configured for this project
Solutions:
# Remove tags to use any available runner
job:
tags: []
# Or check runner configuration
# Settings > CI/CD > Runners
"Job failed (system failure)":
- Runner disconnected
- Resource limits exceeded
- Infrastructure issue
Check runner logs:
# On runner host
journalctl -u gitlab-runner -f
Debugging Techniques
Enable Debug Logging
GitHub Actions:
# Repository > Settings > Secrets > Add:
# ACTIONS_RUNNER_DEBUG = true
# ACTIONS_STEP_DEBUG = true
GitLab CI:
variables:
CI_DEBUG_TRACE: "true" # Caution: May expose secrets!
Interactive Debugging
GitHub Actions:
# Add tmate for SSH access
- uses: mxschmitt/action-tmate@v3
if: failure()
Local reproduction:
# Use act to run GitHub Actions locally
act -j build
# Or nektos/act for Docker
docker run -v $(pwd):/workspace -it nektos/act -j build
Reproduce Locally
# GitHub Actions - Use same Docker image
docker run -it ubuntu:latest bash
# Install dependencies and test
apt-get update && apt-get install -y nodejs npm
npm ci
npm test
Prevention Strategies
Pre-commit Checks
# .pre-commit-config.yaml
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace
- id: check-yaml
- id: check-added-large-files
- repo: local
hooks:
- id: tests
name: Run tests
entry: npm test
language: system
pass_filenames: false
CI/CD Health Monitoring
Use the scripts/ci_health.py script:
python3 scripts/ci_health.py --platform github --repo owner/repo
Regular Maintenance
- Monthly: Review failed job patterns
- Monthly: Update actions/dependencies
- Quarterly: Audit pipeline efficiency
- Quarterly: Review and clean old caches
- Yearly: Major version updates
Getting Help
GitHub Actions:
- Community Forum: https://github.community
- Documentation: https://docs.github.com/actions
- Status: https://www.githubstatus.com
GitLab CI:
- Forum: https://forum.gitlab.com
- Documentation: https://docs.gitlab.com/ee/ci
- Status: https://status.gitlab.com
General CI/CD:
- Stack Overflow: Tag [github-actions] or [gitlab-ci]
- Reddit: r/devops, r/cicd