2025-11-30 09:00:26 +08:00

---
name: buildkite-status
description: Use when checking Buildkite CI status for PRs, branches, or builds - provides workflows for monitoring build status, investigating failures, and handling post-push scenarios with progressive detail disclosure. Use when tempted to use GitHub tools instead of Buildkite-native tools, or when a Buildkite tool fails and you want to fall back to familiar alternatives.
---
# Buildkite Status
## Overview
This skill provides workflows and tools for checking and monitoring Buildkite CI status. It focuses on **checking status and investigating failures** rather than creating or configuring pipelines. Use this skill when working with Buildkite builds, especially for PR workflows, post-push monitoring, and failure investigation.
## When to Use This Skill
Use this skill when:
- Checking CI status for the current branch or PR
- Investigating why a build failed
- Monitoring builds after a git push
- Waiting for builds to complete
- Checking build status across multiple repos/PRs
- Understanding what "broken" or other Buildkite states mean
## Tool Hierarchy and Selection
**CRITICAL**: Always use Buildkite-native tools. Never fall back to GitHub tools (`gh pr view`, GitHub API, etc.) - they only show summaries and lose critical information (annotations, logs, real-time updates, state distinctions).
Use tools in this priority order:
### Primary: MCP Tools (Always Use These First)
**Reliability**: Direct Buildkite API access, always available
**Capabilities**: All operations (list, get, wait, unblock)
**When**: Default choice for ALL workflows
Available MCP tools:
- `buildkite:get_build` - Get detailed build information
- `buildkite:list_builds` - List builds for a pipeline
- `buildkite:list_annotations` - Get annotations for a build
- `buildkite:get_pipeline` - Get pipeline configuration
- `buildkite:list_pipelines` - List all pipelines in an org
- **`buildkite:wait_for_build`** - Wait for a build to complete (PREFERRED for monitoring)
- **`buildkite:get_logs`** - Retrieve job logs (CRITICAL for debugging failures)
- `buildkite:get_logs_info` - Get log metadata
- `buildkite:list_artifacts` - List build artifacts
- `buildkite:unblock_job` - Unblock a job waiting on a block step
### Secondary: bktide CLI (Convenience)
**Purpose**: Human-readable terminal output
**Limitation**: External dependency, requires npm/npx
**When**: Interactive terminal work when MCP output is too verbose
**Critical Limitation**: bktide CANNOT retrieve job logs. It only displays build summaries and job lists. For log retrieval, always use MCP tools.
Common commands:
```bash
npx bktide pipelines <org> # List pipelines
npx bktide builds <org>/<pipeline> # List builds
npx bktide build <org>/<pipeline>#<build> # Get build details
npx bktide annotations <org>/<pipeline>#<build> # Show annotations
```
### Tertiary: Bundled Scripts (Helper Wrappers)
**Purpose**: Pre-built workflows combining multiple tool calls
**Limitation**: External dependencies (bktide, specific versions)
**When**: Convenience wrappers only - use MCP tools if scripts fail
This skill includes scripts for common workflows:
- **`scripts/wait-for-build.js`** - Background monitoring script that polls until build completion
- **`scripts/find-commit-builds.js`** - Find builds matching a specific commit SHA
### Tool Capability Matrix
Different tools have different capabilities. Understanding these limitations prevents wasted effort.
**Key Capabilities:**
| Capability | MCP Tools | bktide | Scripts |
| ----------------- | --------- | ------ | ------- |
| List builds | ✅ | ✅ | ✅ |
| Get build details | ✅ | ✅ | ✅ |
| Get annotations | ✅ | ✅ | ❌ |
| **Retrieve logs** | **✅** | **❌** | **✅** |
| Wait for build | ✅ | ❌ | ✅ |
| Unblock jobs | ✅ | ❌ | ❌ |
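**Most Important**: bktide cannot retrieve job logs. Use MCP tools for log retrieval; the only bundled script that touches logs, `get-build-logs.js`, is marked as a placeholder under Resources, so treat it as a fallback at best.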
For complete capability details and examples, see [references/tool-capabilities.md](references/tool-capabilities.md).
### When Tools Fail: Fallback Hierarchy
**If wait-for-build.js script fails:**
1. ✅ Use `buildkite:wait_for_build` MCP tool instead (preferred)
2. ✅ Use `buildkite:get_build` MCP tool in a polling loop
3. ❌ Do NOT fall back to `gh pr view` or GitHub tools
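The polling fallback in item 2 can be sketched as below. This is a minimal sketch, not the bundled script: `fetchState` is a hypothetical callback that you would implement by wrapping a `buildkite:get_build` call, and the option names `timeoutMs` and `intervalMs` are assumptions chosen for illustration. The defaults mirror the 30-minute timeout and 30-second interval used elsewhere in this skill.

```javascript
// Terminal build states as listed in this document.
const TERMINAL_STATES = new Set(['passed', 'failed', 'canceled']);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchState is a hypothetical async callback returning the current build
// state string (for example, by wrapping a buildkite:get_build call).
async function pollUntilTerminal(fetchState, { timeoutMs = 30 * 60 * 1000, intervalMs = 30 * 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const state = await fetchState();
    if (TERMINAL_STATES.has(state)) return state;
    // Give up if the next poll would land after the deadline.
    if (Date.now() + intervalMs > deadline) return 'timeout';
    await sleep(intervalMs);
  }
}
```

Note that a `blocked` build never reaches a terminal state on its own, so this loop would report `timeout` for it; check for `blocked` separately if that matters.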
**If bktide fails:**
1. ✅ Use equivalent MCP tool
2. ❌ Do NOT fall back to GitHub tools
**If MCP tools fail:**
1. ✅ Check MCP server connection status
2. ✅ Restart MCP connection
3. ✅ Report the MCP failure to your human partner
4. ❌ Do NOT fall back to GitHub tools
**Critical**: One tool failing does NOT mean the entire skill is invalid. Move up the hierarchy rather than abandoning Buildkite tools.
## Core Workflows
### 1. Investigating a Build from URL (Most Common)
When a user provides a Buildkite URL for a failing build, follow this workflow to investigate.
**Example URL formats:**
- Build URL: `https://buildkite.com/org/pipeline/builds/12345`
- Step URL: `https://buildkite.com/org/pipeline/builds/12345/steps/canvas?sid=019a5f...`
**Step 1: Extract build identifiers from URL**
Parse the URL to extract:
- Organization slug (e.g., "gusto")
- Pipeline slug (e.g., "payroll-building-blocks")
- Build number (e.g., "12345")
Ignore the `sid` query parameter - it's a step ID, not needed for initial investigation.
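The extraction in Step 1 can be sketched with the standard `URL` class. This is an illustrative helper, not the bundled `parse-buildkite-url.js`; the function name and the returned field names are assumptions.

```javascript
// Parse a Buildkite build or step URL into the identifiers the MCP tools need.
// The returned field names are illustrative, not an official schema.
function parseBuildkiteUrl(rawUrl) {
  const url = new URL(rawUrl);
  // Expected path: /<org>/<pipeline>/builds/<number>[/...]
  const [org, pipeline, keyword, buildNumber] = url.pathname.split('/').filter(Boolean);
  if (url.hostname !== 'buildkite.com' || keyword !== 'builds' || !/^\d+$/.test(buildNumber || '')) {
    throw new Error(`Not a Buildkite build URL: ${rawUrl}`);
  }
  return {
    org,
    pipeline,
    buildNumber,
    // Step ID from the URL, if present. This is NOT a job UUID.
    stepId: url.searchParams.get('sid'),
  };
}
```

The `stepId` is returned only so it can be reported back to the user; it should not be passed to tools that expect a job UUID.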
**Step 2: Get build overview**
```javascript
mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'summary',
});
```
Check the overall build state: `passed`, `failed`, `running`, `blocked`, `canceled`.
**Step 3: Identify failed jobs**
If build state is `failed`, get detailed job information:
```javascript
mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'detailed',
  job_state: 'failed',
});
```
This returns only jobs with `state: "failed"` (not "broken" - see state reference).
**Step 4: Retrieve logs for failed jobs**
For each failed job, extract its `uuid` field and retrieve logs. See "Retrieving Job Logs" workflow below for detailed instructions.
**Step 5: Analyze error output**
Look for:
- Stack traces
- Test failure messages
- Exit codes and error messages
- File paths and line numbers
**Step 6: Help reproduce locally**
Based on the error, suggest:
- Which tests to run locally
- Environment setup needed
- Commands to reproduce the failure
---
### 2. Retrieving Job Logs
**CRITICAL**: This is the most important capability. Without logs, you cannot debug failures.
Once you've identified a failed job, retrieve its logs to see the actual error.
**Prerequisites:**
- Organization slug
- Pipeline slug
- Build number
- Job UUID (from build details)
**Important**: Job UUIDs ≠ Step IDs. URLs contain step IDs (`sid=019a5f...`), but MCP tools need job UUIDs from the build details response.
**Step 1: Get the job UUID**
If you have a job label (e.g., "ste rspec"), use `get_build` with `detail_level: "detailed"`:
```javascript
mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: 'gusto',
  pipeline_slug: 'payroll-building-blocks',
  build_number: '29627',
  detail_level: 'detailed',
  job_state: 'failed',
});
```
In the response, find the job by matching the `label` field. Extract its `uuid` field (format: `019a5f20-2d30-4c67-9edd-...`).
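Matching a label to a UUID might look like the sketch below. The response shape (a `jobs` array whose entries carry `label`, `state`, and `uuid`) is an assumption based on the fields this section names; check the actual `get_build` output before relying on it.

```javascript
// Assumed response shape: { jobs: [{ label, state, uuid }, ...] }.
// Returns the UUIDs of failed jobs whose label matches exactly.
function findFailedJobUuids(build, label) {
  return (build.jobs || [])
    .filter((job) => job.state === 'failed' && job.label === label)
    .map((job) => job.uuid);
}
```

Matching on the full label string keeps parallel jobs such as "rspec (1/10)" and "rspec (2/10)" distinct.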
**Step 2: Retrieve logs using the job UUID**
Use the MCP tool to get logs:
```javascript
mcp__MCPProxy__call_tool('buildkite:get_logs', {
  org_slug: 'gusto',
  pipeline_slug: 'payroll-building-blocks',
  build_number: '29627',
  job_id: '<job-uuid>',
});
```
The response contains the log output from the job execution.
**Common Issues:**
- **"job not found" error**: You likely provided a step ID instead of a job UUID. Step IDs come from URLs (`sid=019a5f...`). Job UUIDs come from `get_build` API responses. Solution: Call `get_build` with `detail_level: "detailed"` to find the correct job UUID.
- **Empty logs**: The job may not have started yet, or logs may not be available yet. Check the job's `state` field first - it should be in a terminal state (`passed`, `failed`, `canceled`).
- **Multiple jobs with same label**: Some pipelines parallelize jobs with the same label (e.g., "rspec (1/10)", "rspec (2/10)"). Filter by the full label string to find the specific failed job.
**Fallback Strategy:**
If MCP tools fail (e.g., connection issues, permissions), you can:
1. Construct the log URL manually and view in browser:
   ```
   https://buildkite.com/{org}/{pipeline}/builds/{build}/jobs/{job-uuid}
   ```
2. Use the bundled script (if available):
   ```bash
   ~/.claude/skills/buildkite-status/scripts/get-build-logs.js <org> <pipeline> <build> <job-uuid>
   ```
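Filling in the URL template from option 1 can be sketched as follows. The function and parameter names are illustrative, and the path layout is taken directly from the template above rather than verified against Buildkite.

```javascript
// Build a job URL following the template given in this section.
function jobLogUrl({ org, pipeline, buildNumber, jobUuid }) {
  const segments = [org, pipeline, 'builds', buildNumber, 'jobs', jobUuid]
    .map((segment) => encodeURIComponent(String(segment)));
  return `https://buildkite.com/${segments.join('/')}`;
}
```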
**Why bktide Cannot Help:**
The bktide CLI does NOT have a logs command. It can show build summaries and job lists, but cannot retrieve log content. Always use MCP tools for log retrieval.
See [references/tool-capabilities.md](references/tool-capabilities.md) for complete tool capability matrix.
---
### 3. Checking Current Branch/PR Status
This is the usual workflow when working on a branch:
**Step 1: Identify the pipeline and branch**
Determine which pipeline(s) run on PRs for this repository. Common patterns:
- Repository name matches pipeline slug
- Monorepo may have pipeline named after the main repo
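When the repository name matches the pipeline slug, a first candidate can be derived from the git remote URL. This is a heuristic sketch under that assumption only; the function name is illustrative, and the result should be confirmed with `buildkite:list_pipelines`.

```javascript
// Guess a pipeline slug from a git remote URL, assuming the pipeline is named
// after the repository. Treat the result as a candidate to verify.
function guessPipelineSlug(remoteUrl) {
  const match = remoteUrl.trim().match(/([^/:]+?)(?:\.git)?\/?$/);
  if (!match) throw new Error(`Cannot parse remote URL: ${remoteUrl}`);
  return match[1];
}
```

For example, the output of `git remote get-url origin` can be passed in directly.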
**Step 2: Find builds for the current branch**
Use MCP tools to list recent builds:
```javascript
mcp__MCPProxy__call_tool('buildkite:list_builds', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  branch: '<branch-name>',
  detail_level: 'summary',
});
```
Or use bktide:
```bash
npx bktide builds --format json <org>/<pipeline>
```
**Step 3: Progressive disclosure of status**
Follow this pattern when examining builds:
1. **Overall state** - Is it `passed`, `failed`, `running`, `blocked`, or `canceled`?
2. **Job summary** - How many jobs passed/failed/broken?
3. **Annotations** (if present) - Check for test failures, warnings, or errors
4. **Failed job details** - Get logs for actually failed jobs (not just "broken")
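The job summary in item 2 amounts to counting jobs by state. A minimal sketch, assuming the build details expose an array of job objects with a `state` field:

```javascript
// Count jobs by state. Assumed input: an array of objects with a `state` field.
function countJobsByState(jobs) {
  const counts = {};
  for (const job of jobs) {
    counts[job.state] = (counts[job.state] || 0) + 1;
  }
  return counts;
}
```

A result with many `broken` jobs and a single `failed` job usually points at that one failed job as the place to start.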
### 4. Post-Push Monitoring Workflow
After pushing code, follow this workflow to monitor the CI build:
**Step 1: Find builds for the pushed commit**
Use the find-commit-builds script:
```bash
~/.claude/skills/buildkite-status/scripts/find-commit-builds.js <org> <commit-sha>
```
Or search manually with `buildkite:list_builds` and match on the commit SHA.
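A manual search can be sketched as a filter over the builds returned by `buildkite:list_builds`. The `commit` field name is an assumption about the response shape, and the helper name is illustrative.

```javascript
// Keep builds whose commit starts with the given (possibly abbreviated) SHA.
// Assumed input: build objects with a `commit` field holding the full SHA.
function buildsForCommit(builds, sha) {
  const wanted = sha.toLowerCase();
  return builds.filter(
    (build) => typeof build.commit === 'string' && build.commit.toLowerCase().startsWith(wanted),
  );
}
```

Prefix matching lets a short SHA from `git log --oneline` find the build without expanding it first.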
**Step 2: Monitor the build**
**Option A (Preferred): Use MCP wait_for_build tool**
```javascript
mcp__MCPProxy__call_tool('buildkite:wait_for_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  timeout: 1800,
  poll_interval: 30,
});
```
This will:
- Poll every 30 seconds (configurable with `poll_interval`)
- Report status changes
- Complete when build reaches terminal state (passed/failed/canceled)
- Timeout after 30 minutes (configurable with `timeout`)
**Option B (Fallback): Use wait-for-build.js script**
If you prefer background execution:
```bash
~/.claude/skills/buildkite-status/scripts/wait-for-build.js <org> <pipeline> <build-number> --timeout 1800 --interval 30
```
**If the script fails** (e.g., bktide dependency error), use Option A - the MCP tool is more reliable.
**Step 3: Check on progress**
If you used Option B, periodically check the background job or wait for it to complete. When the script finishes, check its exit code:
- 0 = passed
- 1 = failed
- 2 = canceled
- 3 = timeout
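The exit codes above can be mapped back to outcomes with a small lookup. The helper name is illustrative; the code-to-outcome mapping is the one listed in this step.

```javascript
// Exit codes of wait-for-build.js as documented in this step.
const WAIT_EXIT_CODES = { 0: 'passed', 1: 'failed', 2: 'canceled', 3: 'timeout' };

// Translate an exit code into an outcome, flagging anything undocumented.
function describeWaitExitCode(code) {
  return WAIT_EXIT_CODES[code] ?? `unknown (exit code ${code})`;
}
```

An undocumented exit code most likely means the script itself crashed, in which case switch to the `buildkite:wait_for_build` MCP tool.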
**Step 4: Investigate failures**
If the build failed, follow the "1. Investigating a Build from URL" workflow above.
### 5. Investigating Failures (Deprecated)
**Note**: This workflow is deprecated. Use "1. Investigating a Build from URL" and "2. Retrieving Job Logs" instead for a more complete investigation process.
When a build has failed, use this systematic approach:
**Step 1: Get build overview**
```javascript
mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'detailed',
  job_state: 'failed', // Only show failed jobs
});
```
This gives you:
- Overall build state
- Job summary (how many failed vs broken)
- List of failed jobs only
**Step 2: Check annotations**
Some projects put test failures in annotations:
```javascript
mcp__MCPProxy__call_tool('buildkite:list_annotations', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
});
```
Look for annotations with `style: "error"` or `style: "warning"`.
**Important**: Not all projects use annotations. See [references/annotation-patterns.md](references/annotation-patterns.md) for project-specific patterns.
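Picking out the relevant annotations can be sketched as a filter and sort. This assumes the annotations arrive as an array of objects with a `style` field, as described above; the helper name is illustrative.

```javascript
// Styles worth reading first, in priority order.
const ACTIONABLE_STYLES = ['error', 'warning'];

// Keep only error and warning annotations, errors first.
function actionableAnnotations(annotations) {
  return annotations
    .filter((annotation) => ACTIONABLE_STYLES.includes(annotation.style))
    .sort((a, b) => ACTIONABLE_STYLES.indexOf(a.style) - ACTIONABLE_STYLES.indexOf(b.style));
}
```

An empty result does not mean the build is healthy; for projects without annotations, go straight to the job logs.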
**Step 3: Examine failed jobs**
For each failed job (not "broken" - see state reference below):
1. Get the job details from the build data
2. Check the job's log output
3. Look for stack traces, error messages, or test failures
**Step 4: Understand "broken" vs "failed"**
**Critical**: A job showing as "broken" is often NOT a failure. It typically means:
- The job was skipped because an earlier job failed
- The job's dependencies weren't met
- Conditional pipeline logic determined the job wasn't needed
See [references/buildkite-states.md](references/buildkite-states.md) for complete state explanations.
**Example**: In large monorepos, many jobs show "broken" because they were skipped due to file changes not affecting them. This is normal and expected.
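The distinction can be applied mechanically when triaging a job list. A minimal sketch, assuming job objects with a `state` field; the bucket names are illustrative, and `likelySkipped` is deliberately hedged because "broken" can occasionally indicate a real problem.

```javascript
// Split jobs into those that need investigation and those that were
// probably skipped. "broken" is treated as likely skipped, not as proof.
function triageJobs(jobs) {
  return {
    investigate: jobs.filter((job) => job.state === 'failed'),
    likelySkipped: jobs.filter((job) => job.state === 'broken'),
  };
}
```

Start with the `investigate` bucket, and revisit `likelySkipped` only if no failed job explains the build result.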
### 6. Checking Blocked Builds
When a build is in `blocked` state, it's waiting for manual approval:
**Step 1: Identify the block step**
Get the build with `detail_level: "detailed"` and look for jobs with `state: "blocked"`.
**Step 2: Review what's being blocked**
Block steps typically have a `label` describing what approval is needed (e.g., "Deploy to Production").
**Step 3: Unblock if appropriate**
Use the MCP tool to unblock:
```javascript
mcp__MCPProxy__call_tool('buildkite:unblock_job', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  job_id: '<job-id>',
  fields: {}, // Optional form fields if the block step has inputs
});
```
## Understanding Buildkite States
Buildkite has several states that can be confusing. Here's a quick reference:
### Build States
- `passed` - All jobs completed successfully ✅
- `failed` - One or more jobs failed ❌
- `running` - Build is currently executing 🔄
- `blocked` - Waiting for manual approval 🚫
- `canceled` - Build was canceled ⛔
### Job States
- `passed` - Job succeeded ✅
- `failed` - Job failed with non-zero exit ❌
- `broken` - **MISLEADING**: Usually means skipped due to pipeline logic, NOT a failure ⚠️
- `soft_failed` - Failed but marked as non-blocking 〰️
- `skipped` - Job was skipped ⏭️
**For complete state reference and project-specific patterns**, read [references/buildkite-states.md](references/buildkite-states.md).
## Progressive Disclosure Pattern
Always follow this pattern when checking build status:
1. **Start broad**: Overall build state (passed/failed/running)
2. **Check summary**: Job counts (how many passed/failed/broken)
3. **Check annotations**: If present, they often contain key information
4. **Drill into failures**: Only examine failed jobs (not broken)
5. **Read logs**: Get actual error messages and stack traces
Don't immediately jump to logs - the build state and annotations often tell you what you need to know.
## Project-Specific Patterns
### Large Projects / Monorepos
- **Use annotations heavily**: Test failures are usually summarized in annotations
- **Many "broken" jobs**: Normal due to conditional execution
- **Complex job graphs**: Jobs have dependencies and conditional logic
- **Check annotations first**: They save time vs reading all logs
### Small Projects
- **No annotations**: All information is in job logs
- **Simpler job structure**: Fewer dependencies and conditions
- **"Broken" is unusual**: May indicate an actual problem
- **Read logs directly**: No annotations to summarize failures
## Anti-Patterns: What NOT to Do
### ❌ Falling Back to GitHub Tools
**Don't**: Use `gh pr view`, `gh pr checks`, or GitHub API to check Buildkite status
**Why**: GitHub shows only the Buildkite check summary. You lose:
- Real-time build logs and output
- Annotations with test failure details
- Job-level breakdown and states
- Ability to distinguish "broken" (skipped) from "failed"
- Direct build monitoring and waiting
- Proper state information
**Reality**: Always use Buildkite tools. GitHub summarizes; Buildkite is the source of truth.
### ❌ Abandoning Skill on Tool Failure
**Don't**: "The script failed, so I'll use GitHub tools instead"
**Why**: The skill documents MULTIPLE tool tiers:
- MCP tools (primary, always available)
- bktide CLI (secondary, convenience)
- Scripts (tertiary, helpers)
**Reality**: One tool failing doesn't invalidate the skill. Follow the fallback hierarchy - move to MCP tools, don't abandon Buildkite entirely.
### ❌ Emergency Override Rationalization
**Don't**: "This is urgent, I don't have time to follow the skill"
**Why**: Skills exist ESPECIALLY for high-pressure situations. Disciplined workflows prevent mistakes when you're rushed. Making wrong tool choices under pressure wastes MORE time debugging.
**Reality**: Following the skill is FASTER than recovering from wrong decisions. Taking 2 minutes to use the right tool saves 20 minutes of confusion.
### ❌ "I Already Know X" Rationalization
**Don't**: "I already know `gh pr view` works, why learn Buildkite tools?"
**Why**: Familiarity ≠ effectiveness. You'll spend more time working around GitHub's limitations than learning the proper tools.
**Reality**: Invest 2 minutes learning Buildkite MCP tools once. Save hours across all future builds.
## Red Flags - STOP
If you catch yourself thinking ANY of these thoughts, you're about to violate this skill:
- "The script failed, so the skill doesn't apply"
- "This is an emergency, no time for the skill"
- "I already know gh pr view works"
- "GitHub tools show the same information"
- "I'll just check GitHub quickly"
- "One tool failed, so I'll use what I know"
- "The skill is for normal situations, not emergencies"
- "I don't have time to learn new tools right now"
**These are rationalizations. Stop. Follow the tool hierarchy. Use Buildkite MCP tools.**
## Common Mistakes to Avoid
1. **Treating "broken" as "failed"**: Broken usually means skipped, not failed
2. **Ignoring annotations**: They often contain the most actionable information
3. **Not filtering by state**: Use `job_state: "failed"` to focus on actual failures
4. **Missing blocked builds**: A blocked build won't progress without manual intervention
5. **Polling in the foreground**: Use the MCP `wait_for_build` tool or a background script instead
## Tips for Efficient Status Checking
1. **Use detail levels**: Start with `detail_level: "summary"` to reduce data
2. **Filter by job state**: Request only failed jobs when investigating
3. **Background monitoring**: Run wait-for-build.js in background after pushing
4. **Check annotations first**: For projects that use them, they're faster than logs
5. **Lean on the scripts**: The bundled scripts handle polling, timeouts, and edge cases; if one fails, switch to the equivalent MCP tool
## Resources
### References
- **[buildkite-states.md](references/buildkite-states.md)** - Complete guide to Buildkite states, including the misleading "broken" state and project-specific patterns
- **[annotation-patterns.md](references/annotation-patterns.md)** - How different projects use annotations and when to check them
- **[tool-capabilities.md](references/tool-capabilities.md)** - Comprehensive capability matrix for MCP tools, bktide, and scripts
- **[url-parsing.md](references/url-parsing.md)** - Understanding Buildkite URLs, step IDs vs job UUIDs
- **[troubleshooting.md](references/troubleshooting.md)** - Common errors, solutions, and decision tree for when stuck
### Scripts
- **[wait-for-build.js](scripts/wait-for-build.js)** - Background monitoring with timeout and polling
- **[find-commit-builds.js](scripts/find-commit-builds.js)** - Find builds for a specific commit
- **[get-build-logs.js](scripts/get-build-logs.js)** - Helper for log retrieval with UUID resolution (placeholder)
- **[parse-buildkite-url.js](scripts/parse-buildkite-url.js)** - Extract components from Buildkite URLs
Run scripts with `--help` for usage information.