# Prow Job Extract Must-Gather Skill

This skill extracts and decompresses must-gather archives from Prow CI job artifacts, automatically handling nested tar and gzip archives, and generating an interactive HTML file browser.

## Overview

The skill provides both a Claude Code skill interface and standalone scripts for extracting must-gather data from Prow CI jobs. It eliminates the manual steps of downloading and recursively extracting nested archives.
## Components

### 1. SKILL.md

Claude Code skill definition that provides detailed implementation instructions for the AI assistant.

### 2. Python Scripts

#### extract_archives.py

Extracts and recursively processes must-gather archives.
**Features:**

- Extracts `must-gather.tar` to the specified directory
- Renames the long subdirectory (containing "-ci-") to `content/` for readability
- Recursively processes nested archives:
  - `.tar.gz` and `.tgz`: extract in place, remove the original
  - `.gz` (plain gzip): decompress in place, remove the original
- Handles up to 10 levels of nested archives
- Reports extraction statistics
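The recursive pass described above can be sketched roughly as follows. This is a minimal illustration, not the actual `extract_archives.py`; the function name and structure are assumptions, only the 10-pass limit and the extract/decompress/remove behavior come from the description above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path

MAX_PASSES = 10  # matches the documented 10-level nesting limit

def extract_nested(root: str) -> int:
    """Repeatedly scan `root`, expanding any .tar.gz/.tgz/.gz found."""
    processed = 0
    for _ in range(MAX_PASSES):
        archives = [p for p in Path(root).rglob("*")
                    if p.suffix in (".tgz", ".gz")]
        if not archives:
            break  # nothing left to expand
        for path in archives:
            if path.name.endswith((".tar.gz", ".tgz")):
                with tarfile.open(path) as tar:
                    tar.extractall(path.parent)  # extract in place
            else:  # plain gzip: decompress next to the original
                with gzip.open(path, "rb") as src, \
                        open(path.with_suffix(""), "wb") as dst:
                    shutil.copyfileobj(src, dst)
            path.unlink()  # remove the original compressed file
            processed += 1
    return processed
```

Each pass may uncover new archives inside the ones just extracted, which is why the scan repeats until a pass finds nothing.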
**Usage:**

```bash
python3 extract_archives.py <must-gather.tar> <output-directory>
```

**Example:**

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
  .work/prow-job-extract-must-gather/1965715986610917376/tmp/must-gather.tar \
  .work/prow-job-extract-must-gather/1965715986610917376/logs
```
**Output:**

```text
================================================================================
Must-Gather Archive Extraction
================================================================================

Step 1: Extracting must-gather.tar
  From: .work/.../tmp/must-gather.tar
  To:   .work/.../logs
  Extracting: .work/.../tmp/must-gather.tar

Step 2: Renaming long directory to 'content/'
  From: registry-build09-ci-openshift-org-ci-op-...
  To:   content/

Step 3: Processing nested archives
  Extracting:    .../content/namespaces/openshift-etcd/pods/etcd-0.tar.gz
  Decompressing: .../content/cluster-scoped-resources/nodes/ip-10-0-1-234.log.gz
  ... (continues for all archives)

================================================================================
Extraction Complete
================================================================================
Statistics:
  Total files: 3,421
  Total size: 234.5 MB
  Archives processed: 247
  Extracted to: .work/prow-job-extract-must-gather/1965715986610917376/logs
```
#### generate_html_report.py

Generates an interactive HTML file browser with filters and search.

**Features:**

- Scans the directory tree and collects file metadata
- Classifies files by type (log, yaml, json, xml, cert, archive, script, config, other)
- Generates statistics (total files, total size, counts by type)
- Creates interactive HTML with:
  - Multi-select file type filters
  - Regex pattern filter for powerful searches
  - Text search for file names/paths
  - Direct links to files (relative paths)
  - Same dark theme as the analyze-resource skill
**Usage:**

```bash
python3 generate_html_report.py <logs-directory> <prowjob_name> <build_id> <target> <gcsweb_url>
```

**Example:**

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
  .work/prow-job-extract-must-gather/1965715986610917376/logs \
  "periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview" \
  "1965715986610917376" \
  "e2e-aws-ovn-techpreview" \
  "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"
```
**Output:**

- Creates `.work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
## Prerequisites

- **Python 3** - for running the extraction and report generator scripts
- **gcloud CLI** - for downloading artifacts from GCS
  - Install: https://cloud.google.com/sdk/docs/install
  - Authentication NOT required (the bucket is publicly accessible)
## Workflow

1. **URL Parsing**
   - Validate the URL contains `test-platform-results/`
   - Extract the build_id (10+ digits)
   - Extract the prowjob name
   - Construct GCS paths
2. **Working Directory**
   - Create the `.work/prow-job-extract-must-gather/{build_id}/` directory
   - Create a `logs/` subdirectory for extraction
   - Create a `tmp/` subdirectory for temporary files
   - Check for an existing extraction (offers to skip re-extraction)
3. **prowjob.json Validation**
   - Download prowjob.json
   - Search for the `--target=` pattern
   - Exit if not a ci-operator job
4. **Must-Gather Download**
   - Download from: `artifacts/{target}/gather-must-gather/artifacts/must-gather.tar`
   - Save to: `{build_id}/tmp/must-gather.tar`
5. **Extraction and Processing**
   - Extract must-gather.tar to `{build_id}/logs/`
   - Rename the long subdirectory to `content/`
   - Recursively extract nested archives (`.tar.gz`, `.tgz`, `.gz`)
   - Remove original compressed files after extraction
6. **HTML Report Generation**
   - Scan the directory tree
   - Classify files by type
   - Calculate statistics
   - Generate the interactive HTML browser
   - Output to `{build_id}/must-gather-browser.html`
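The URL-parsing step can be sketched as a small helper. This is an illustration only; the regex, function name, and return shape are assumptions rather than the skill's actual code — only the `test-platform-results/` check and the "10+ digit build_id" rule come from the workflow above.

```python
import re

def parse_prow_url(url: str) -> dict:
    """Extract the prowjob name and build_id from a gcsweb artifacts URL."""
    if "test-platform-results/" not in url:
        raise ValueError("not a test-platform-results URL")
    # Expect .../logs/<prowjob-name>/<build_id>[/...], build_id being 10+ digits
    m = re.search(r"/logs/([^/]+)/(\d{10,})", url)
    if not m:
        raise ValueError("could not find prowjob name and build_id")
    prowjob, build_id = m.group(1), m.group(2)
    return {
        "prowjob": prowjob,
        "build_id": build_id,
        # Hypothetical GCS path layout, mirroring the gcsweb URL structure
        "gcs_path": f"gs://test-platform-results/logs/{prowjob}/{build_id}",
    }
```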
## Output

### Console Output

```text
Must-Gather Extraction Complete

Prow Job: periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview
Build ID: 1965715986610917376
Target: e2e-aws-ovn-techpreview

Extraction Statistics:
- Total files: 3,421
- Total size: 234.5 MB
- Archives extracted: 247
- Log files: 1,234
- YAML files: 856
- JSON files: 423

Extracted to: .work/prow-job-extract-must-gather/1965715986610917376/logs/
File browser generated: .work/prow-job-extract-must-gather/1965715986610917376/must-gather-browser.html

Open in browser to browse and search extracted files.
```
### HTML File Browser

The generated HTML report includes:

1. **Header Section**
   - Prow job name
   - Build ID
   - Target name
   - GCS URL (link to gcsweb)
   - Local extraction path
2. **Statistics Dashboard**
   - Total files count
   - Total size (human-readable)
   - Counts by file type (log, yaml, json, xml, cert, archive, script, config, other)
3. **Filter Controls**
   - **File Type Filter:** multi-select buttons to filter by type
   - **Regex Pattern Filter:** input field for regex patterns (e.g., `.*etcd.*`, `.*\.log$`, `^content/namespaces/.*`)
   - **Name Search:** text search for file names and paths
4. **File List**
   - Icon for each file type
   - File name (clickable link to open the file)
   - Directory path
   - File size
   - File type badge (color-coded)
   - Sorted alphabetically by path
5. **Interactive Features**
   - All filters work together (AND logic)
   - Real-time filtering (300ms debounce)
   - Regex pattern validation
   - Scroll-to-top button
   - "No results" message when filters match nothing
## Directory Structure

```text
.work/prow-job-extract-must-gather/{build_id}/
├── tmp/
│   ├── prowjob.json
│   └── must-gather.tar          # downloaded, not deleted
├── logs/
│   └── content/                 # renamed from the long directory
│       ├── cluster-scoped-resources/
│       │   ├── nodes/
│       │   ├── clusterroles/
│       │   └── ...
│       ├── namespaces/
│       │   ├── openshift-etcd/
│       │   │   ├── pods/
│       │   │   ├── services/
│       │   │   └── ...
│       │   └── ...
│       └── ... (all extracted and decompressed)
└── must-gather-browser.html
```
## Performance Features

1. **Caching**
   - Extracted files are cached in `{build_id}/logs/`
   - Offers to skip re-extraction if content already exists
2. **Incremental Processing**
   - Archives processed iteratively (up to 10 passes)
   - Handles deeply nested archive structures
3. **Progress Indicators**
   - Colored output for different stages
   - Status messages for long-running operations
   - Final statistics summary
4. **Error Handling**
   - Graceful handling of corrupted archives
   - Continues processing after errors
   - Reports all errors in the final summary
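The error-handling behavior above — continue past a corrupted archive and collect it for the final summary — can be sketched like this. The function name and error-list shape are illustrative assumptions, not the skill's actual implementation.

```python
import tarfile

def safe_extract(archive_path: str, dest: str, errors: list) -> bool:
    """Extract one archive; on failure, record the error instead of aborting."""
    try:
        with tarfile.open(archive_path) as tar:
            tar.extractall(dest)
        return True
    except (tarfile.TarError, OSError) as exc:
        errors.append(f"{archive_path}: {exc}")  # reported in the final summary
        return False
```

The caller keeps iterating over remaining archives either way, so one corrupted member never aborts the whole extraction.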
## Examples

### Basic Usage

Via Claude Code:

```text
User: "Extract must-gather from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"
```

Standalone scripts:

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
  .work/prow-job-extract-must-gather/1965715986610917376/tmp/must-gather.tar \
  .work/prow-job-extract-must-gather/1965715986610917376/logs

python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
  .work/prow-job-extract-must-gather/1965715986610917376/logs \
  "periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview" \
  "1965715986610917376" \
  "e2e-aws-ovn-techpreview" \
  "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"
```
### Using Regex Filters in the HTML Browser

- Find all etcd-related files: `.*etcd.*`
- Find all log files: `.*\.log$`
- Find files in a specific namespace: `^content/namespaces/openshift-etcd/.*`
- Find YAML manifests for pods: `.*pods/.*\.yaml$`
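These patterns behave like ordinary regular expressions matched against each file's relative path. A quick illustration using Python's `re` module (the assumption that the browser matches against relative paths with standard regex semantics is ours, the sample paths are hypothetical):

```python
import re

paths = [
    "content/namespaces/openshift-etcd/pods/etcd-0/etcd.log",
    "content/namespaces/openshift-etcd/pods/etcd-0.yaml",
    "content/cluster-scoped-resources/nodes/ip-10-0-1-234.log",
]

# Filter for the openshift-etcd namespace only
pattern = re.compile(r"^content/namespaces/openshift-etcd/.*")
matches = [p for p in paths if pattern.search(p)]
# The node log falls outside the namespace prefix, so only the
# two etcd-namespace entries remain.
```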
### Using with Claude Code

When you ask Claude to extract a must-gather, it will automatically use this skill. The skill provides detailed instructions that guide Claude through:

- Validating prerequisites
- Parsing URLs
- Downloading archives
- Extracting and decompressing
- Generating the HTML browser

You can simply ask:

> "Extract must-gather from this Prow job: https://gcsweb-ci.../1965715986610917376/"

Claude will execute the workflow and generate the interactive HTML file browser.
## Troubleshooting

### gcloud not installed

```bash
# Check installation
which gcloud

# Install (follow platform-specific instructions)
# https://cloud.google.com/sdk/docs/install
```

### must-gather.tar not found

- Verify the job completed successfully
- Check that the target name is correct
- Confirm gather-must-gather ran in the job
- Manually check the GCS path in gcsweb

### Corrupted archives

- Check the error messages in the extraction output
- Extraction continues despite individual failures
- The final summary lists all errors

### No "-ci-" directory found

- Extraction continues with the original directory names
- Check the logs for a warning message
- Files will still be accessible
### HTML browser not opening files

- Verify files were extracted to the `logs/` directory
- Check that relative paths are correct
- The HTML file uses relative links, so it must stay in the directory where it was generated (alongside `logs/`)
## File Type Classifications

| Extension | Type | Badge Color |
|---|---|---|
| `.log`, `.txt` | log | Blue |
| `.yaml`, `.yml` | yaml | Purple |
| `.json` | json | Green |
| `.xml` | xml | Yellow |
| `.crt`, `.pem`, `.key` | cert | Red |
| `.tar`, `.gz`, `.tgz`, `.zip` | archive | Gray |
| `.sh`, `.py` | script | Blue |
| `.conf`, `.cfg`, `.ini` | config | Yellow |
| others | other | Gray |
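The table above suggests a simple extension-to-type mapping. A minimal sketch of such a classifier — the mapping mirrors the table, but the function name and dictionary structure are assumptions about `generate_html_report.py`, not its actual code:

```python
from pathlib import Path

# Extension → type label, mirroring the classification table
EXT_TYPES = {
    ".log": "log", ".txt": "log",
    ".yaml": "yaml", ".yml": "yaml",
    ".json": "json",
    ".xml": "xml",
    ".crt": "cert", ".pem": "cert", ".key": "cert",
    ".tar": "archive", ".gz": "archive", ".tgz": "archive", ".zip": "archive",
    ".sh": "script", ".py": "script",
    ".conf": "config", ".cfg": "config", ".ini": "config",
}

def classify(path: str) -> str:
    """Return the file-type label for a path, defaulting to 'other'."""
    return EXT_TYPES.get(Path(path).suffix.lower(), "other")
```

Note that only the final suffix is consulted, so `etcd-0.tar.gz` classifies as `archive` via `.gz`.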