# Prow Job Extract Must-Gather Skill This skill extracts and decompresses must-gather archives from Prow CI job artifacts, automatically handling nested tar and gzip archives, and generating an interactive HTML file browser. ## Overview The skill provides both a Claude Code skill interface and standalone scripts for extracting must-gather data from Prow CI jobs. It eliminates the manual steps of downloading and recursively extracting nested archives. ## Components ### 1. SKILL.md Claude Code skill definition that provides detailed implementation instructions for the AI assistant. ### 2. Python Scripts #### extract_archives.py Extracts and recursively processes must-gather archives. **Features:** - Extracts must-gather.tar to specified directory - Renames long subdirectory (containing "-ci-") to "content/" for readability - Recursively processes nested archives: - `.tar.gz` and `.tgz`: Extract in place, remove original - `.gz` (plain gzip): Decompress in place, remove original - Handles up to 10 levels of nested archives - Reports extraction statistics **Usage:** ```bash python3 extract_archives.py ``` **Example:** ```bash python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \ .work/prow-job-extract-must-gather/1965715986610917376/tmp/must-gather.tar \ .work/prow-job-extract-must-gather/1965715986610917376/logs ``` **Output:** ``` ================================================================================ Must-Gather Archive Extraction ================================================================================ Step 1: Extracting must-gather.tar From: .work/.../tmp/must-gather.tar To: .work/.../logs Extracting: .work/.../tmp/must-gather.tar Step 2: Renaming long directory to 'content/' From: registry-build09-ci-openshift-org-ci-op-... To: content/ Step 3: Processing nested archives Extracting: .../content/namespaces/openshift-etcd/pods/etcd-0.tar.gz Decompressing: .../content/cluster-scoped-resources/nodes/ip-10-0-1-234.log.gz ... (continues for all archives) ================================================================================ Extraction Complete ================================================================================ Statistics: Total files: 3,421 Total size: 234.5 MB Archives processed: 247 Extracted to: .work/prow-job-extract-must-gather/1965715986610917376/logs ``` #### generate_html_report.py Generates an interactive HTML file browser with filters and search. **Features:** - Scans directory tree and collects file metadata - Classifies files by type (log, yaml, json, xml, cert, archive, script, config, other) - Generates statistics (total files, total size, counts by type) - Creates interactive HTML with: - Multi-select file type filters - Regex pattern filter for powerful searches - Text search for file names/paths - Direct links to files (relative paths) - Same dark theme as analyze-resource skill **Usage:** ```bash python3 generate_html_report.py ``` **Example:** ```bash python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \ .work/prow-job-extract-must-gather/1965715986610917376/logs \ "periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview" \ "1965715986610917376" \ "e2e-aws-ovn-techpreview" \ "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376" ``` **Output:** - Creates `.work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html` ## Prerequisites 1. **Python 3** - For running extraction and report generator scripts 2. **gcloud CLI** - For downloading artifacts from GCS - Install: https://cloud.google.com/sdk/docs/install - Authentication NOT required (bucket is publicly accessible) ## Workflow 1. **URL Parsing** - Validate URL contains `test-platform-results/` - Extract build_id (10+ digits) - Extract prowjob name - Construct GCS paths 2. **Working Directory** - Create `.work/prow-job-extract-must-gather/{build_id}/` directory - Create `logs/` subdirectory for extraction - Create `tmp/` subdirectory for temporary files - Check for existing extraction (offers to skip re-extraction) 3. **prowjob.json Validation** - Download prowjob.json - Search for `--target=` pattern - Exit if not a ci-operator job 4. **Must-Gather Download** - Download from: `artifacts/{target}/gather-must-gather/artifacts/must-gather.tar` - Save to: `{build_id}/tmp/must-gather.tar` 5. **Extraction and Processing** - Extract must-gather.tar to `{build_id}/logs/` - Rename long subdirectory to "content/" - Recursively extract nested archives (.tar.gz, .tgz, .gz) - Remove original compressed files after extraction 6. **HTML Report Generation** - Scan directory tree - Classify files by type - Calculate statistics - Generate interactive HTML browser - Output to `{build_id}/must-gather-browser.html` ## Output ### Console Output ``` Must-Gather Extraction Complete Prow Job: periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview Build ID: 1965715986610917376 Target: e2e-aws-ovn-techpreview Extraction Statistics: - Total files: 3,421 - Total size: 234.5 MB - Archives extracted: 247 - Log files: 1,234 - YAML files: 856 - JSON files: 423 Extracted to: .work/prow-job-extract-must-gather/1965715986610917376/logs/ File browser generated: .work/prow-job-extract-must-gather/1965715986610917376/must-gather-browser.html Open in browser to browse and search extracted files. ``` ### HTML File Browser The generated HTML report includes: 1. **Header Section** - Prow job name - Build ID - Target name - GCS URL (link to gcsweb) - Local extraction path 2. **Statistics Dashboard** - Total files count - Total size (human-readable) - Counts by file type (log, yaml, json, xml, cert, archive, script, config, other) 3. **Filter Controls** - **File Type Filter**: Multi-select buttons to filter by type - **Regex Pattern Filter**: Input field for regex patterns (e.g., `.*etcd.*`, `.*\.log$`, `^content/namespaces/.*`) - **Name Search**: Text search for file names and paths 4. **File List** - Icon for each file type - File name (clickable link to open file) - Directory path - File size - File type badge (color-coded) - Sorted alphabetically by path 5. **Interactive Features** - All filters work together (AND logic) - Real-time filtering (300ms debounce) - Regex pattern validation - Scroll to top button - No results message when filters match nothing ### Directory Structure ``` .work/prow-job-extract-must-gather/{build_id}/ ├── tmp/ │ ├── prowjob.json │ └── must-gather.tar (downloaded, not deleted) ├── logs/ │ └── content/ # Renamed from long directory │ ├── cluster-scoped-resources/ │ │ ├── nodes/ │ │ ├── clusterroles/ │ │ └── ... │ ├── namespaces/ │ │ ├── openshift-etcd/ │ │ │ ├── pods/ │ │ │ ├── services/ │ │ │ └── ... │ │ └── ... │ └── ... (all extracted and decompressed) └── must-gather-browser.html ``` ## Performance Features 1. **Caching** - Extracted files are cached in `{build_id}/logs/` - Offers to skip re-extraction if content already exists 2. **Incremental Processing** - Archives processed iteratively (up to 10 passes) - Handles deeply nested archive structures 3. **Progress Indicators** - Colored output for different stages - Status messages for long-running operations - Final statistics summary 4. **Error Handling** - Graceful handling of corrupted archives - Continues processing after errors - Reports all errors in final summary ## Examples ### Basic Usage ```bash # Via Claude Code User: "Extract must-gather from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376" # Standalone script python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \ .work/prow-job-extract-must-gather/1965715986610917376/tmp/must-gather.tar \ .work/prow-job-extract-must-gather/1965715986610917376/logs python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \ .work/prow-job-extract-must-gather/1965715986610917376/logs \ "periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview" \ "1965715986610917376" \ "e2e-aws-ovn-techpreview" \ "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376" ``` ### Using Regex Filters in HTML Browser **Find all etcd-related files:** ```regex .*etcd.* ``` **Find all log files:** ```regex .*\.log$ ``` **Find files in specific namespace:** ```regex ^content/namespaces/openshift-etcd/.* ``` **Find YAML manifests for pods:** ```regex .*pods/.*\.yaml$ ``` ## Using with Claude Code When you ask Claude to extract a must-gather, it will automatically use this skill. The skill provides detailed instructions that guide Claude through: - Validating prerequisites - Parsing URLs - Downloading archives - Extracting and decompressing - Generating HTML browser You can simply ask: > "Extract must-gather from this Prow job: https://gcsweb-ci.../1965715986610917376/" Claude will execute the workflow and generate the interactive HTML file browser. ## Troubleshooting ### gcloud not installed ```bash # Check installation which gcloud # Install (follow platform-specific instructions) # https://cloud.google.com/sdk/docs/install ``` ### must-gather.tar not found - Verify job completed successfully - Check target name is correct - Confirm gather-must-gather ran in the job - Manually check GCS path in gcsweb ### Corrupted archives - Check error messages in extraction output - Extraction continues despite individual failures - Final summary lists all errors ### No "-ci-" directory found - Extraction continues with original directory names - Check logs for warning message - Files will still be accessible ### HTML browser not opening files - Verify files were extracted to `logs/` directory - Check that relative paths are correct - Files must be opened from the same directory as HTML file ## File Type Classifications | Extension | Type | Badge Color | |-----------|------|-------------| | .log, .txt | log | Blue | | .yaml, .yml | yaml | Purple | | .json | json | Green | | .xml | xml | Yellow | | .crt, .pem, .key | cert | Red | | .tar, .gz, .tgz, .zip | archive | Gray | | .sh, .py | script | Blue | | .conf, .cfg, .ini | config | Yellow | | others | other | Gray |