This skill extracts and decompresses must-gather archives from Prow CI job artifacts, automatically handling nested tar and gzip archives, and generating an interactive HTML file browser.

Overview

The skill provides both a Claude Code skill interface and standalone scripts for extracting must-gather data from Prow CI jobs. It eliminates the manual steps of downloading and recursively extracting nested archives.

Components

1. SKILL.md

Claude Code skill definition that provides detailed implementation instructions for the AI assistant.

2. Python Scripts

extract_archives.py

Extracts and recursively processes must-gather archives.

Features:

Extracts must-gather.tar to specified directory
Renames long subdirectory (containing "-ci-") to "content/" for readability
Recursively processes nested archives:
- .tar.gz and .tgz: Extract in place, remove original
- .gz (plain gzip): Decompress in place, remove original
Handles up to 10 levels of nested archives
Reports extraction statistics

Usage:

python3 extract_archives.py <must-gather.tar> <output-directory>

Example:

python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
  .work/prow-job-extract-must-gather/1965715986610917376/tmp/must-gather.tar \
  .work/prow-job-extract-must-gather/1965715986610917376/logs

Output:

================================================================================
Must-Gather Archive Extraction
================================================================================

Step 1: Extracting must-gather.tar
  From: .work/.../tmp/must-gather.tar
  To: .work/.../logs
  Extracting: .work/.../tmp/must-gather.tar

Step 2: Renaming long directory to 'content/'
  From: registry-build09-ci-openshift-org-ci-op-...
  To: content/

Step 3: Processing nested archives
  Extracting: .../content/namespaces/openshift-etcd/pods/etcd-0.tar.gz
  Decompressing: .../content/cluster-scoped-resources/nodes/ip-10-0-1-234.log.gz
  ... (continues for all archives)

================================================================================
Extraction Complete
================================================================================

Statistics:
  Total files: 3,421
  Total size: 234.5 MB
  Archives processed: 247

Extracted to: .work/prow-job-extract-must-gather/1965715986610917376/logs

generate_html_report.py

Generates an interactive HTML file browser with filters and search.

Features:

Scans directory tree and collects file metadata
Classifies files by type (log, yaml, json, xml, cert, archive, script, config, other)
Generates statistics (total files, total size, counts by type)
Creates interactive HTML with:
- Multi-select file type filters
- Regex pattern filter for powerful searches
- Text search for file names/paths
- Direct links to files (relative paths)
- Same dark theme as analyze-resource skill

Usage:

python3 generate_html_report.py <logs-directory> <prowjob_name> <build_id> <target> <gcsweb_url>

Example:

python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
  .work/prow-job-extract-must-gather/1965715986610917376/logs \
  "periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview" \
  "1965715986610917376" \
  "e2e-aws-ovn-techpreview" \
  "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"

Output:

Creates .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html

Prerequisites

Python 3 - For running extraction and report generator scripts
gcloud CLI - For downloading artifacts from GCS
- Install: https://cloud.google.com/sdk/docs/install
- Authentication NOT required (bucket is publicly accessible)

Workflow

URL Parsing
- Validate URL contains test-platform-results/
- Extract build_id (10+ digits)
- Extract prowjob name
- Construct GCS paths
Working Directory
- Create .work/prow-job-extract-must-gather/{build_id}/ directory
- Create logs/ subdirectory for extraction
- Create tmp/ subdirectory for temporary files
- Check for existing extraction (offers to skip re-extraction)
prowjob.json Validation
- Download prowjob.json
- Search for --target= pattern
- Exit if not a ci-operator job
Must-Gather Download
- Download from: artifacts/{target}/gather-must-gather/artifacts/must-gather.tar
- Save to: {build_id}/tmp/must-gather.tar
Extraction and Processing
- Extract must-gather.tar to {build_id}/logs/
- Rename long subdirectory to "content/"
- Recursively extract nested archives (.tar.gz, .tgz, .gz)
- Remove original compressed files after extraction
HTML Report Generation
- Scan directory tree
- Classify files by type
- Calculate statistics
- Generate interactive HTML browser
- Output to {build_id}/must-gather-browser.html

Output

Console Output

Must-Gather Extraction Complete

Prow Job: periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview
Build ID: 1965715986610917376
Target: e2e-aws-ovn-techpreview

Extraction Statistics:
- Total files: 3,421
- Total size: 234.5 MB
- Archives extracted: 247
- Log files: 1,234
- YAML files: 856
- JSON files: 423

Extracted to: .work/prow-job-extract-must-gather/1965715986610917376/logs/

File browser generated: .work/prow-job-extract-must-gather/1965715986610917376/must-gather-browser.html

Open in browser to browse and search extracted files.

HTML File Browser

The generated HTML report includes:

Header Section
- Prow job name
- Build ID
- Target name
- GCS URL (link to gcsweb)
- Local extraction path
Statistics Dashboard
- Total files count
- Total size (human-readable)
- Counts by file type (log, yaml, json, xml, cert, archive, script, config, other)
Filter Controls
- File Type Filter: Multi-select buttons to filter by type
- Regex Pattern Filter: Input field for regex patterns (e.g., .*etcd.*, .*\.log$, ^content/namespaces/.*)
- Name Search: Text search for file names and paths
File List
- Icon for each file type
- File name (clickable link to open file)
- Directory path
- File size
- File type badge (color-coded)
- Sorted alphabetically by path
Interactive Features
- All filters work together (AND logic)
- Real-time filtering (300ms debounce)
- Regex pattern validation
- Scroll to top button
- No results message when filters match nothing

Directory Structure

.work/prow-job-extract-must-gather/{build_id}/
├── tmp/
│   ├── prowjob.json
│   └── must-gather.tar (downloaded, not deleted)
├── logs/
│   └── content/                    # Renamed from long directory
│       ├── cluster-scoped-resources/
│       │   ├── nodes/
│       │   ├── clusterroles/
│       │   └── ...
│       ├── namespaces/
│       │   ├── openshift-etcd/
│       │   │   ├── pods/
│       │   │   ├── services/
│       │   │   └── ...
│       │   └── ...
│       └── ... (all extracted and decompressed)
└── must-gather-browser.html

Performance Features

Caching
- Extracted files are cached in {build_id}/logs/
- Offers to skip re-extraction if content already exists
Incremental Processing
- Archives processed iteratively (up to 10 passes)
- Handles deeply nested archive structures
Progress Indicators
- Colored output for different stages
- Status messages for long-running operations
- Final statistics summary
Error Handling
- Graceful handling of corrupted archives
- Continues processing after errors
- Reports all errors in final summary

Examples

Basic Usage

# Via Claude Code
User: "Extract must-gather from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"

# Standalone script
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
  .work/prow-job-extract-must-gather/1965715986610917376/tmp/must-gather.tar \
  .work/prow-job-extract-must-gather/1965715986610917376/logs

python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
  .work/prow-job-extract-must-gather/1965715986610917376/logs \
  "periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview" \
  "1965715986610917376" \
  "e2e-aws-ovn-techpreview" \
  "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"

Using Regex Filters in HTML Browser

Find all etcd-related files:

.*etcd.*

Find all log files:

.*\.log$

Find files in specific namespace:

^content/namespaces/openshift-etcd/.*

Find YAML manifests for pods:

.*pods/.*\.yaml$

Using with Claude Code

When you ask Claude to extract a must-gather, it will automatically use this skill. The skill provides detailed instructions that guide Claude through:

Validating prerequisites
Parsing URLs
Downloading archives
Extracting and decompressing
Generating HTML browser

You can simply ask:

"Extract must-gather from this Prow job: https://gcsweb-ci.../1965715986610917376/"

Claude will execute the workflow and generate the interactive HTML file browser.

Troubleshooting

gcloud not installed

# Check installation
which gcloud

# Install (follow platform-specific instructions)
# https://cloud.google.com/sdk/docs/install

must-gather.tar not found

Verify job completed successfully
Check target name is correct
Confirm gather-must-gather ran in the job
Manually check GCS path in gcsweb

Corrupted archives

Check error messages in extraction output
Extraction continues despite individual failures
Final summary lists all errors

No "-ci-" directory found

Extraction continues with original directory names
Check logs for warning message
Files will still be accessible

HTML browser not opening files

Verify files were extracted to logs/ directory
Check that relative paths are correct
Files must be opened from the same directory as HTML file

File Type Classifications

Extension	Type	Badge Color
.log, .txt	log	Blue
.yaml, .yml	yaml	Purple
.json	json	Green
.xml	xml	Yellow
.crt, .pem, .key	cert	Red
.tar, .gz, .tgz, .zip	archive	Gray
.sh, .py	script	Blue
.conf, .cfg, .ini	config	Yellow
others	other	Gray