657 lines
13 KiB
Markdown
657 lines
13 KiB
Markdown
# CI/CD Troubleshooting
|
|
|
|
Comprehensive guide to diagnosing and resolving common CI/CD pipeline issues.
|
|
|
|
## Table of Contents
|
|
|
|
- [Pipeline Failures](#pipeline-failures)
|
|
- [Dependency Issues](#dependency-issues)
|
|
- [Docker & Container Problems](#docker--container-problems)
|
|
- [Authentication & Permissions](#authentication--permissions)
|
|
- [Performance Issues](#performance-issues)
|
|
- [Platform-Specific Issues](#platform-specific-issues)
|
|
|
|
---
|
|
|
|
## Pipeline Failures
|
|
|
|
### Workflow Not Triggering
|
|
|
|
**GitHub Actions:**
|
|
|
|
**Symptoms:** Workflow doesn't run on push/PR
|
|
|
|
**Common causes:**
|
|
1. Workflow file in wrong location (must be `.github/workflows/`)
|
|
2. Invalid YAML syntax
|
|
3. Branch/path filters excluding the changes
|
|
4. Workflow disabled in repository settings
|
|
|
|
**Diagnostics:**
|
|
```bash
|
|
# Validate YAML
|
|
yamllint .github/workflows/ci.yml
|
|
|
|
# Check if workflow is disabled
|
|
gh workflow list --repo owner/repo
|
|
```
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Check trigger configuration
|
|
on:
|
|
push:
|
|
branches: [main] # Ensure your branch matches
|
|
paths-ignore:
|
|
- 'docs/**' # May be excluding your changes
|
|
|
|
# Enable workflow
|
|
gh workflow enable ci.yml --repo owner/repo
|
|
```
|
|
|
|
**GitLab CI:**
|
|
|
|
**Symptoms:** Pipeline doesn't start
|
|
|
|
**Diagnostics:**
|
|
```bash
|
|
# Validate .gitlab-ci.yml
|
|
gl-ci-lint < .gitlab-ci.yml
|
|
|
|
# Check CI/CD settings
|
|
# Project > Settings > CI/CD > General pipelines
|
|
```
|
|
|
|
**Solutions:**
|
|
- Check if CI/CD is enabled for the project
|
|
- Verify `.gitlab-ci.yml` is in repository root
|
|
- Check pipeline must succeed setting isn't blocking
|
|
- Review `only`/`except` or `rules` configuration
|
|
|
|
### Jobs Failing Intermittently
|
|
|
|
**Symptoms:** Same job passes sometimes, fails others
|
|
|
|
**Common causes:**
|
|
1. Flaky tests
|
|
2. Race conditions
|
|
3. Network timeouts
|
|
4. Resource constraints
|
|
5. Time-dependent tests
|
|
|
|
**Identify flaky tests:**
|
|
```yaml
|
|
# GitHub Actions - Run multiple times
|
|
strategy:
|
|
matrix:
|
|
attempt: [1, 2, 3, 4, 5]
|
|
steps:
|
|
- run: npm test
|
|
```
|
|
|
|
**Solutions:**
|
|
```javascript
|
|
// Add retries to flaky tests
|
|
jest.retryTimes(3);
|
|
|
|
// Increase timeouts
|
|
jest.setTimeout(30000);
|
|
|
|
// Fix race conditions
|
|
await waitFor(() => expect(element).toBeInDocument(), {
|
|
timeout: 5000
|
|
});
|
|
```
|
|
|
|
**Network retry pattern:**
|
|
```yaml
|
|
- name: Install with retry
|
|
uses: nick-invision/retry@v2
|
|
with:
|
|
timeout_minutes: 10
|
|
max_attempts: 3
|
|
command: npm ci
|
|
```
|
|
|
|
### Timeout Errors
|
|
|
|
**Symptoms:** "Job exceeded maximum time" or similar
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# GitHub Actions - Increase timeout
|
|
jobs:
|
|
build:
|
|
timeout-minutes: 60 # Default: 360
|
|
|
|
# GitLab CI
|
|
test:
|
|
timeout: 2h # Default: 1h
|
|
```
|
|
|
|
**Optimize long-running jobs:**
|
|
- Add caching for dependencies
|
|
- Split tests into parallel jobs
|
|
- Use faster runners
|
|
- Identify and optimize slow tests
|
|
|
|
### Exit Code Errors
|
|
|
|
**Symptoms:** "Process completed with exit code 1"
|
|
|
|
**Diagnostics:**
|
|
```yaml
|
|
# Add verbose logging
|
|
- run: npm test -- --verbose
|
|
|
|
# Check specific exit codes
|
|
- run: |
|
|
npm test
|
|
EXIT_CODE=$?
|
|
echo "Exit code: $EXIT_CODE"
|
|
if [ $EXIT_CODE -eq 127 ]; then
|
|
echo "Command not found"
|
|
elif [ $EXIT_CODE -eq 1 ]; then
|
|
echo "General error"
|
|
fi
|
|
exit $EXIT_CODE
|
|
```
|
|
|
|
**Common exit codes:**
|
|
- `1`: General error
|
|
- `2`: Misuse of shell command
|
|
- `126`: Command cannot execute
|
|
- `127`: Command not found
|
|
- `130`: Terminated by Ctrl+C
|
|
- `137`: Killed (OOM)
|
|
- `143`: Terminated (SIGTERM)
|
|
|
|
---
|
|
|
|
## Dependency Issues
|
|
|
|
### "Module not found" or "Cannot find package"
|
|
|
|
**Symptoms:** Build fails with missing dependency error
|
|
|
|
**Causes:**
|
|
1. Missing dependency in `package.json`
|
|
2. Cache corruption
|
|
3. Lock file out of sync
|
|
4. Private package access issues
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Clear cache and reinstall
|
|
- run: rm -rf node_modules package-lock.json
|
|
- run: npm install
|
|
|
|
# Use npm ci for clean install
|
|
- run: npm ci
|
|
|
|
# Clear GitHub Actions cache
|
|
# Settings > Actions > Caches > Delete specific cache
|
|
|
|
# GitLab - clear cache
|
|
cache:
|
|
key: $CI_COMMIT_REF_SLUG
|
|
policy: push # Force new cache
|
|
```
|
|
|
|
### Version Conflicts
|
|
|
|
**Symptoms:** Dependency resolution errors, peer dependency warnings
|
|
|
|
**Diagnostics:**
|
|
```bash
|
|
# Check for conflicts
|
|
npm ls
|
|
npm outdated
|
|
|
|
# View dependency tree
|
|
npm list --depth=1
|
|
```
|
|
|
|
**Solutions:**
|
|
```json
|
|
// Use overrides (package.json)
|
|
{
|
|
"overrides": {
|
|
"problematic-package": "2.0.0"
|
|
}
|
|
}
|
|
|
|
// Or resolutions (Yarn)
|
|
{
|
|
"resolutions": {
|
|
"problematic-package": "2.0.0"
|
|
}
|
|
}
|
|
```
|
|
|
|
### Private Package Access
|
|
|
|
**Symptoms:** "401 Unauthorized" or "404 Not Found" for private packages
|
|
|
|
**GitHub Packages:**
|
|
```yaml
|
|
- run: |
|
|
echo "@myorg:registry=https://npm.pkg.github.com" >> .npmrc
|
|
echo "//npm.pkg.github.com/:_authToken=${{ secrets.GITHUB_TOKEN }}" >> .npmrc
|
|
- run: npm ci
|
|
```
|
|
|
|
**npm Registry:**
|
|
```yaml
|
|
- run: echo "//registry.npmjs.org/:_authToken=${{ secrets.NPM_TOKEN }}" >> .npmrc
|
|
- run: npm ci
|
|
```
|
|
|
|
**GitLab Package Registry:**
|
|
```yaml
|
|
before_script:
|
|
- echo "@mygroup:registry=${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/npm/" >> .npmrc
|
|
- echo "${CI_API_V4_URL#https?}/projects/${CI_PROJECT_ID}/packages/npm/:_authToken=${CI_JOB_TOKEN}" >> .npmrc
|
|
```
|
|
|
|
---
|
|
|
|
## Docker & Container Problems
|
|
|
|
### "Cannot connect to Docker daemon"
|
|
|
|
**Symptoms:** Docker commands fail with connection error
|
|
|
|
**GitHub Actions:**
|
|
```yaml
|
|
# Ensure Docker is available
|
|
runs-on: ubuntu-latest # Has Docker pre-installed
|
|
|
|
steps:
|
|
- run: docker ps # Test Docker access
|
|
```
|
|
|
|
**GitLab CI:**
|
|
```yaml
|
|
# Use Docker-in-Docker
|
|
image: docker:latest
|
|
services:
|
|
- docker:dind
|
|
|
|
variables:
|
|
DOCKER_HOST: tcp://docker:2376
|
|
DOCKER_TLS_CERTDIR: "/certs"
|
|
DOCKER_TLS_VERIFY: 1
|
|
DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
|
|
```
|
|
|
|
### Image Pull Errors
|
|
|
|
**Symptoms:** "Error response from daemon: pull access denied" or timeout
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# GitHub Actions - Login to registry
|
|
- uses: docker/login-action@v3
|
|
with:
|
|
registry: ghcr.io
|
|
username: ${{ github.actor }}
|
|
password: ${{ secrets.GITHUB_TOKEN }}
|
|
|
|
# Or for Docker Hub
|
|
- uses: docker/login-action@v3
|
|
with:
|
|
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
|
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
|
|
|
# Add retry logic
|
|
- run: |
|
|
for i in {1..3}; do
|
|
docker pull myimage:latest && break
|
|
sleep 5
|
|
done
|
|
```
|
|
|
|
### "No space left on device"
|
|
|
|
**Symptoms:** Docker build fails with disk space error
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# GitHub Actions - Clean up space
|
|
- run: docker system prune -af --volumes
|
|
|
|
# Or use built-in action
|
|
- uses: jlumbroso/free-disk-space@main
|
|
with:
|
|
tool-cache: true
|
|
android: true
|
|
dotnet: true
|
|
|
|
# GitLab - configure runner
|
|
[[runners]]
|
|
[runners.docker]
|
|
volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]
|
|
[runners.docker.tmpfs]
|
|
"/tmp" = "rw,noexec"
|
|
```
|
|
|
|
### Multi-platform Build Issues
|
|
|
|
**Symptoms:** Build fails for ARM/different architecture
|
|
|
|
**Solution:**
|
|
```yaml
|
|
- uses: docker/setup-qemu-action@v3
|
|
|
|
- uses: docker/setup-buildx-action@v3
|
|
|
|
- uses: docker/build-push-action@v5
|
|
with:
|
|
platforms: linux/amd64,linux/arm64
|
|
context: .
|
|
push: false
|
|
```
|
|
|
|
---
|
|
|
|
## Authentication & Permissions
|
|
|
|
### "Permission denied" or "403 Forbidden"
|
|
|
|
**GitHub Actions:**
|
|
|
|
**Symptoms:** Cannot push, create release, or access API
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Add necessary permissions
|
|
permissions:
|
|
contents: write # For pushing tags/releases
|
|
pull-requests: write # For commenting on PRs
|
|
packages: write # For pushing packages
|
|
id-token: write # For OIDC
|
|
|
|
# Check GITHUB_TOKEN permissions
|
|
- run: |
|
|
curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
|
|
https://api.github.com/repos/${{ github.repository }}
|
|
```
|
|
|
|
**GitLab CI:**
|
|
|
|
**Symptoms:** Cannot push to repository or access API
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Use CI_JOB_TOKEN for API access
|
|
script:
|
|
- 'curl --header "JOB-TOKEN: $CI_JOB_TOKEN" "${CI_API_V4_URL}/projects"'
|
|
|
|
# Or use personal/project access token
|
|
variables:
|
|
GIT_STRATEGY: clone
|
|
before_script:
|
|
- git config --global user.email "ci@example.com"
|
|
- git config --global user.name "CI Bot"
|
|
```
|
|
|
|
### Git Push Failures
|
|
|
|
**Symptoms:** "failed to push some refs" or "protected branch"
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# GitHub Actions - Check branch protection
|
|
# Settings > Branches > Branch protection rules
|
|
|
|
# Allow bypass
|
|
permissions:
|
|
contents: write
|
|
|
|
# Or use PAT with admin access
|
|
- uses: actions/checkout@v4
|
|
with:
|
|
token: ${{ secrets.ADMIN_PAT }}
|
|
|
|
# GitLab - Grant permissions
|
|
# Settings > Repository > Protected Branches
|
|
# Add CI/CD role with push permission
|
|
```
|
|
|
|
### AWS Credentials Issues
|
|
|
|
**Symptoms:** "Unable to locate credentials"
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Using OIDC (recommended)
|
|
- uses: aws-actions/configure-aws-credentials@v4
|
|
with:
|
|
role-to-assume: arn:aws:iam::123456789:role/GitHubActionsRole
|
|
aws-region: us-east-1
|
|
|
|
# Using secrets (legacy)
|
|
- uses: aws-actions/configure-aws-credentials@v4
|
|
with:
|
|
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
|
|
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
|
|
aws-region: us-east-1
|
|
|
|
# Test credentials
|
|
- run: aws sts get-caller-identity
|
|
```
|
|
|
|
---
|
|
|
|
## Performance Issues
|
|
|
|
### Slow Pipeline Execution
|
|
|
|
**Diagnostics:**
|
|
```bash
|
|
# GitHub - View timing
|
|
gh run view <run-id> --log
|
|
|
|
# Identify slow steps
|
|
# Each step shows duration in UI
|
|
```
|
|
|
|
**Solutions:**
|
|
- See [optimization.md](optimization.md) for comprehensive guide
|
|
- Add dependency caching
|
|
- Parallelize independent jobs
|
|
- Use faster runners
|
|
- Reduce test scope on PRs
|
|
|
|
### Cache Not Working
|
|
|
|
**Symptoms:** Cache always misses, builds still slow
|
|
|
|
**Diagnostics:**
|
|
```yaml
|
|
- uses: actions/cache@v4
|
|
id: cache
|
|
with:
|
|
path: node_modules
|
|
key: ${{ hashFiles('**/package-lock.json') }}
|
|
|
|
- run: echo "Cache hit: ${{ steps.cache.outputs.cache-hit }}"
|
|
```
|
|
|
|
**Common issues:**
|
|
1. Key changes every time
|
|
2. Path doesn't exist
|
|
3. Cache size exceeds limit
|
|
4. Cache evicted (LRU after 7 days on GitHub)
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Use consistent key
|
|
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
|
|
|
|
# Add restore-keys for partial match
|
|
restore-keys: |
|
|
${{ runner.os }}-node-
|
|
|
|
# Check cache size
|
|
- run: du -sh node_modules
|
|
```
|
|
|
|
---
|
|
|
|
## Platform-Specific Issues
|
|
|
|
### GitHub Actions
|
|
|
|
**"Resource not accessible by integration":**
|
|
```yaml
|
|
# Add required permission
|
|
permissions:
|
|
issues: write # Or whatever resource you're accessing
|
|
```
|
|
|
|
**"Workflow is not shared":**
|
|
- Reusable workflows must be in `.github/workflows/`
|
|
- Repository must be public or org member
|
|
- Check workflow access settings
|
|
|
|
**"No runner available":**
|
|
- Self-hosted: Check runner is online and has matching labels
|
|
- GitHub-hosted: May hit concurrent job limit (check usage)
|
|
|
|
### GitLab CI
|
|
|
|
**"This job is stuck":**
|
|
- No runner available with matching tags
|
|
- All runners are busy
|
|
- Runner not configured for this project
|
|
|
|
**Solutions:**
|
|
```yaml
|
|
# Remove tags to use any available runner
|
|
job:
|
|
tags: []
|
|
|
|
# Or check runner configuration
|
|
# Settings > CI/CD > Runners
|
|
```
|
|
|
|
**"Job failed (system failure)":**
|
|
- Runner disconnected
|
|
- Resource limits exceeded
|
|
- Infrastructure issue
|
|
|
|
**Check runner logs:**
|
|
```bash
|
|
# On runner host
|
|
journalctl -u gitlab-runner -f
|
|
```
|
|
|
|
---
|
|
|
|
## Debugging Techniques
|
|
|
|
### Enable Debug Logging
|
|
|
|
**GitHub Actions:**
|
|
```yaml
|
|
# Repository > Settings > Secrets > Add:
|
|
# ACTIONS_RUNNER_DEBUG = true
|
|
# ACTIONS_STEP_DEBUG = true
|
|
```
|
|
|
|
**GitLab CI:**
|
|
```yaml
|
|
variables:
|
|
CI_DEBUG_TRACE: "true" # Caution: May expose secrets!
|
|
```
|
|
|
|
### Interactive Debugging
|
|
|
|
**GitHub Actions:**
|
|
```yaml
|
|
# Add tmate for SSH access
|
|
- uses: mxschmitt/action-tmate@v3
|
|
if: failure()
|
|
```
|
|
|
|
**Local reproduction:**
|
|
```bash
|
|
# Use act to run GitHub Actions locally
|
|
act -j build
|
|
|
|
# Or nektos/act for Docker
|
|
docker run -v $(pwd):/workspace -it nektos/act -j build
|
|
```
|
|
|
|
### Reproduce Locally
|
|
|
|
```bash
|
|
# GitHub Actions - Use same Docker image
|
|
docker run -it ubuntu:latest bash
|
|
|
|
# Install dependencies and test
|
|
apt-get update && apt-get install -y nodejs npm
|
|
npm ci
|
|
npm test
|
|
```
|
|
|
|
---
|
|
|
|
## Prevention Strategies
|
|
|
|
### Pre-commit Checks
|
|
|
|
```yaml
|
|
# .pre-commit-config.yaml
|
|
repos:
|
|
- repo: https://github.com/pre-commit/pre-commit-hooks
|
|
rev: v4.5.0
|
|
hooks:
|
|
- id: trailing-whitespace
|
|
- id: check-yaml
|
|
- id: check-added-large-files
|
|
|
|
- repo: local
|
|
hooks:
|
|
- id: tests
|
|
name: Run tests
|
|
entry: npm test
|
|
language: system
|
|
pass_filenames: false
|
|
```
|
|
|
|
### CI/CD Health Monitoring
|
|
|
|
Use the `scripts/ci_health.py` script:
|
|
```bash
|
|
python3 scripts/ci_health.py --platform github --repo owner/repo
|
|
```
|
|
|
|
### Regular Maintenance
|
|
|
|
- [ ] Monthly: Review failed job patterns
|
|
- [ ] Monthly: Update actions/dependencies
|
|
- [ ] Quarterly: Audit pipeline efficiency
|
|
- [ ] Quarterly: Review and clean old caches
|
|
- [ ] Yearly: Major version updates
|
|
|
|
---
|
|
|
|
## Getting Help
|
|
|
|
**GitHub Actions:**
|
|
- Community Forum: https://github.community
|
|
- Documentation: https://docs.github.com/actions
|
|
- Status: https://www.githubstatus.com
|
|
|
|
**GitLab CI:**
|
|
- Forum: https://forum.gitlab.com
|
|
- Documentation: https://docs.gitlab.com/ee/ci
|
|
- Status: https://status.gitlab.com
|
|
|
|
**General CI/CD:**
|
|
- Stack Overflow: Tag [github-actions] or [gitlab-ci]
|
|
- Reddit: r/devops, r/cicd
|