---
name: Prow Job Extract Must-Gather
description: Extract and decompress must-gather archives from Prow CI job artifacts, generating an interactive HTML file browser with filters
---

# Prow Job Extract Must-Gather

This skill extracts and decompresses must-gather archives from Prow CI job artifacts, automatically handling nested tar and gzip archives and generating an interactive HTML file browser.

## When to Use This Skill

Use this skill when the user wants to:

- Extract must-gather archives from Prow CI job artifacts
- Avoid manually downloading and extracting nested archives
- Browse must-gather contents with an interactive HTML interface
- Search for specific files or file types in must-gather data
- Analyze OpenShift cluster state from CI test runs

## Prerequisites

Before starting, verify these prerequisites:

1. **gcloud CLI Installation**
   - Check if installed: `which gcloud`
   - If not installed, provide instructions for the user's platform
   - Installation guide: https://cloud.google.com/sdk/docs/install

2. **gcloud Authentication (Optional)**
   - The `test-platform-results` bucket is publicly accessible
   - No authentication is required for read access
   - Skip authentication checks

## Input Format

The user will provide:

1. **Prow job URL** - gcsweb URL containing `test-platform-results/`
   - Example: `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/`
   - The URL may or may not have a trailing slash

## Implementation Steps

### Step 1: Parse and Validate URL

1. **Extract bucket path**
   - Find `test-platform-results/` in the URL
   - Extract everything after it as the GCS bucket-relative path
   - If not found, error: "URL must contain 'test-platform-results/'"

2. **Extract build_id**
   - Search for the pattern `/(\d{10,})/` in the bucket path
   - The build_id must be at least 10 consecutive decimal digits
   - Handle URLs with or without a trailing slash
   - If not found, error: "Could not find build ID (10+ digits) in URL"

3. **Extract prowjob name**
   - Take the path segment immediately preceding the build_id
   - Example: in `.../periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/`
   - Prowjob name: `periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview`

4. **Construct GCS paths**
   - Bucket: `test-platform-results`
   - Base GCS path: `gs://test-platform-results/{bucket-path}/`
   - Ensure the path ends with `/`
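The parsing described in Step 1 can be summarized in a short Python sketch. The function name and return shape are illustrative, not part of the skill's scripts; the example URL is the one from the Input Format section:

```python
import re

def parse_prow_url(url: str):
    """Illustrative sketch of Step 1: split a gcsweb URL into its parts."""
    marker = "test-platform-results/"
    idx = url.find(marker)
    if idx == -1:
        raise ValueError("URL must contain 'test-platform-results/'")
    # GCS bucket-relative path, without a trailing slash
    bucket_path = url[idx + len(marker):].strip("/")
    # build_id: a path segment of 10+ consecutive decimal digits
    match = re.search(r"/(\d{10,})(?=/|$)", "/" + bucket_path)
    if not match:
        raise ValueError("Could not find build ID (10+ digits) in URL")
    build_id = match.group(1)
    segments = bucket_path.split("/")
    prowjob_name = segments[segments.index(build_id) - 1]
    gcs_base = f"gs://test-platform-results/{bucket_path}/"
    return bucket_path, build_id, prowjob_name, gcs_base

# Example from the Input Format section above:
# parse_prow_url("https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/"
#                "test-platform-results/logs/periodic-ci-openshift-release-"
#                "master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/")
```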
### Step 2: Create Working Directory

1. **Check for existing extraction first**
   - Check if the `.work/prow-job-extract-must-gather/{build_id}/logs/` directory exists and has content
   - If it exists with content:
     - Use the AskUserQuestion tool to ask:
       - Question: "Must-gather already extracted for build {build_id}. Would you like to use the existing extraction or re-extract?"
       - Options:
         - "Use existing" - Skip to HTML report generation (Step 6)
         - "Re-extract" - Continue to clean and re-download
     - If the user chooses "Re-extract":
       - Remove all existing content: `rm -rf .work/prow-job-extract-must-gather/{build_id}/logs/`
       - Also remove the tmp directory: `rm -rf .work/prow-job-extract-must-gather/{build_id}/tmp/`
       - This ensures a clean state before downloading new content
     - If the user chooses "Use existing":
       - Skip directly to Step 6 (Generate HTML Report)

2. **Create directory structure**

   ```bash
   mkdir -p .work/prow-job-extract-must-gather/{build_id}/logs
   mkdir -p .work/prow-job-extract-must-gather/{build_id}/tmp
   ```

   - Use `.work/prow-job-extract-must-gather/` as the base directory (already in .gitignore)
   - Use the build_id as the subdirectory name
   - Create the `logs/` subdirectory for extraction
   - Create the `tmp/` subdirectory for temporary files
   - Working directory: `.work/prow-job-extract-must-gather/{build_id}/`

### Step 3: Download and Validate prowjob.json

1. **Download prowjob.json**

   ```bash
   gcloud storage cp gs://test-platform-results/{bucket-path}/prowjob.json .work/prow-job-extract-must-gather/{build_id}/tmp/prowjob.json --no-user-output-enabled
   ```

2. **Parse and validate**
   - Read `.work/prow-job-extract-must-gather/{build_id}/tmp/prowjob.json`
   - Search for the pattern: `--target=([a-zA-Z0-9-]+)`
   - If not found:
     - Display: "This is not a ci-operator job. The prowjob cannot be analyzed by this skill."
     - Explain: ci-operator jobs have a `--target` argument specifying the test target
     - Exit the skill

3. **Extract target name**
   - Capture the target value (e.g., `e2e-aws-ovn-techpreview`)
   - Store it for constructing the must-gather path
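The validation in items 2 and 3 above amounts to a single regex scan of the downloaded file. A minimal sketch; `extract_target` is an illustrative name, not an existing helper, and the error text mirrors the message above:

```python
import re
import sys

def extract_target(prowjob_path: str) -> str:
    """Illustrative sketch: pull the ci-operator --target out of prowjob.json."""
    with open(prowjob_path) as f:
        contents = f.read()
    match = re.search(r"--target=([a-zA-Z0-9-]+)", contents)
    if not match:
        sys.exit("This is not a ci-operator job. "
                 "The prowjob cannot be analyzed by this skill.")
    return match.group(1)  # e.g. "e2e-aws-ovn-techpreview"
```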
### Step 4: Download Must-Gather Archive

1. **Construct must-gather path**
   - GCS path: `gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-must-gather/artifacts/must-gather.tar`
   - Local path: `.work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar`

2. **Download must-gather.tar**

   ```bash
   gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-must-gather/artifacts/must-gather.tar .work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar --no-user-output-enabled
   ```

   - Use `--no-user-output-enabled` to suppress progress output
   - If the file is not found, error: "No must-gather archive found. Job may not have completed or gather-must-gather may not have run."

### Step 5: Extract and Process Archives

**IMPORTANT: Use the provided Python script `extract_archives.py` from the skill directory.**

**Usage:**

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
  .work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar \
  .work/prow-job-extract-must-gather/{build_id}/logs
```

**What the script does:**

1. **Extract must-gather.tar**
   - Extract to the `{build_id}/logs/` directory
   - Uses Python's tarfile module for reliable extraction

2. **Rename long subdirectory to "content/"**
   - Find the subdirectory containing "-ci-" in its name
   - Example: `registry-build09-ci-openshift-org-ci-op-m8t77165-stable-sha256-d1ae126eed86a47fdbc8db0ad176bf078a5edebdbb0df180d73f02e5f03779e0/`
   - Rename to: `content/`
   - Preserves all files and subdirectories

3. **Recursively process nested archives**
   - Walk the entire directory tree
   - Find and process archives (see the combined sketch after this list):

   **For .tar.gz and .tgz files:**

   ```python
   # Extract in place
   with tarfile.open(archive_path, 'r:gz') as tar:
       tar.extractall(path=parent_dir)
   # Remove original archive
   os.remove(archive_path)
   ```

   **For .gz files (no tar):**

   ```python
   # Gunzip in place
   with gzip.open(gz_path, 'rb') as f_in:
       with open(output_path, 'wb') as f_out:
           shutil.copyfileobj(f_in, f_out)
   # Remove original archive
   os.remove(gz_path)
   ```

4. **Progress reporting**
   - Print status for each extracted archive
   - Count total files and archives processed
   - Report final statistics

5. **Error handling**
   - Skip corrupted archives with a warning
   - Continue processing other files
   - Report all errors at the end
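The two snippets in item 3 can be combined into a single recursive pass. This is only a condensed, illustrative sketch of the approach; the authoritative logic lives in `extract_archives.py`, which additionally renames the "-ci-" directory, reports progress, and may need to repeat the walk until no archives remain:

```python
import gzip
import os
import shutil
import tarfile

def process_nested_archives(root: str) -> None:
    """Sketch of the recursive pass: expand nested archives in place."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if name.endswith((".tar.gz", ".tgz")):
                    # Extract next to the archive, then drop the original
                    with tarfile.open(path, "r:gz") as tar:
                        tar.extractall(path=dirpath)
                    os.remove(path)
                elif name.endswith(".gz"):
                    # Plain gzip: decompress to the same name minus ".gz"
                    with gzip.open(path, "rb") as f_in, open(path[:-3], "wb") as f_out:
                        shutil.copyfileobj(f_in, f_out)
                    os.remove(path)
            except (tarfile.TarError, OSError) as err:
                # Skip corrupted archives with a warning and keep going
                print(f"Warning: could not extract {path}: {err}")
```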

### Step 6: Generate HTML File Browser

**IMPORTANT: Use the provided Python script `generate_html_report.py` from the skill directory.**

**Usage:**

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
  .work/prow-job-extract-must-gather/{build_id}/logs \
  "{prowjob_name}" \
  "{build_id}" \
  "{target}" \
  "{gcsweb_url}"
```

**Output:** The script generates `.work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`

**What the script does:**

1. **Scan directory tree**
   - Recursively walk the `{build_id}/logs/` directory
   - Collect all files with metadata:
     - Relative path from logs/
     - File size (human-readable: KB, MB, GB)
     - File extension
     - Directory depth
     - Last modified time

2. **Classify files**
   - Detect file types based on extension (see the classification sketch after this list):
     - Logs: `.log`, `.txt`
     - YAML: `.yaml`, `.yml`
     - JSON: `.json`
     - XML: `.xml`
     - Certificates: `.crt`, `.pem`, `.key`
     - Binaries: `.tar`, `.gz`, `.tgz`, `.tar.gz`
     - Other
   - Count files by type for statistics
3. **Generate HTML structure**

   **Header Section:**

   ```html
   <!-- Illustrative markup; element names and classes are placeholders -->
   <div class="header">
     <h1>Must-Gather File Browser</h1>
     <p>Prow Job: {prowjob-name}</p>
     <p>Build ID: {build_id}</p>
     <p>gcsweb URL: {original-url}</p>
     <p>Target: {target}</p>
     <p>Total Files: {count}</p>
     <p>Total Size: {human-readable-size}</p>
   </div>
   ```

   **Filter Controls:**

   ```html
   <!-- Illustrative markup; the IDs and classes match the JavaScript below -->
   <div class="filters">
     <input type="text" id="search" placeholder="Search by file name">
     <input type="text" id="pattern" placeholder="Filter by regex pattern">
     <button class="filter-btn" data-type="log">Logs</button>
     <button class="filter-btn" data-type="yaml">YAML</button>
     <button class="filter-btn" data-type="json">JSON</button>
     <!-- ...one button per file type... -->
   </div>
   ```

   **File List:**

   ```html
   <!-- Illustrative markup for a single entry -->
   <div class="file-item">
     <span class="file-icon">{icon}</span>
     <span class="file-path">{directory-path}</span>
     <span class="file-size">{size}</span>
     <span class="file-type">{type}</span>
   </div>
   ```
   **CSS Styling:**
   - Use the same dark theme as the analyze-resource skill
   - Modern, clean design with good contrast
   - Responsive layout
   - File type color coding
   - Monospace fonts for paths
   - Hover effects on file items

   **JavaScript Interactivity:**

   ```javascript
   // Multi-select file type filters
   document.querySelectorAll('.filter-btn').forEach(btn => {
     btn.addEventListener('click', function() {
       // Toggle active state
       // Apply filters
     });
   });

   // Regex pattern filter
   document.getElementById('pattern').addEventListener('input', function() {
     const pattern = this.value;
     if (pattern) {
       const regex = new RegExp(pattern);
       // Filter files matching regex
     }
   });

   // Name search filter
   document.getElementById('search').addEventListener('input', function() {
     const query = this.value.toLowerCase();
     // Filter files by name substring
   });

   // Combine all active filters
   function applyFilters() {
     // Show/hide files based on all active filters
   }
   ```
4. **Statistics Section:**

   ```html
   <!-- Illustrative markup; class names are placeholders -->
   <div class="stats">
     <div class="stat"><span class="value">{total-files}</span> Total Files</div>
     <div class="stat"><span class="value">{total-size}</span> Total Size</div>
     <div class="stat"><span class="value">{log-count}</span> Log Files</div>
     <div class="stat"><span class="value">{yaml-count}</span> YAML Files</div>
   </div>
   ```

5. **Write HTML to file**
   - The script automatically writes to `.work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
   - Includes proper HTML5 structure
   - All CSS and JavaScript are inline for portability
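As referenced in item 2 ("Classify files"), the extension-to-type mapping and the human-readable size formatting can be sketched as follows. This is a hedged illustration, not the actual code in `generate_html_report.py`; the `FILE_TYPES` table and `human_size` helper are hypothetical names:

```python
import os

# Hypothetical lookup table mirroring the extension groups listed in item 2
FILE_TYPES = {
    ".log": "log", ".txt": "log",
    ".yaml": "yaml", ".yml": "yaml",
    ".json": "json",
    ".xml": "xml",
    ".crt": "cert", ".pem": "cert", ".key": "cert",
    ".tar": "binary", ".gz": "binary", ".tgz": "binary",
}

def classify(path: str) -> str:
    """Map a file path to one of the browser's type categories."""
    name = path.lower()
    if name.endswith(".tar.gz"):
        return "binary"
    return FILE_TYPES.get(os.path.splitext(name)[1], "other")

def human_size(num_bytes: float) -> str:
    """Render a byte count as a human-readable size (KB, MB, GB)."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024 or unit == "GB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
```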
### Step 7: Present Results to User

1. **Display summary**

   ```
   Must-Gather Extraction Complete

   Prow Job: {prowjob-name}
   Build ID: {build_id}
   Target: {target}

   Extraction Statistics:
   - Total files: {file-count}
   - Total size: {human-readable-size}
   - Archives extracted: {archive-count}
   - Log files: {log-count}
   - YAML files: {yaml-count}
   - JSON files: {json-count}

   Extracted to: .work/prow-job-extract-must-gather/{build_id}/logs/
   File browser generated: .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html

   Open in browser to browse and search extracted files.
   ```

2. **Open report in browser**
   - Detect the platform and automatically open the HTML report in the default browser (see the sketch after this list)
   - Linux: `xdg-open .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
   - macOS: `open .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
   - Windows: `start .work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`
   - On Linux (most common for this environment), use `xdg-open`

3. **Offer next steps**
   - Ask if the user wants to search for specific files
   - Explain that extracted files are available in `.work/prow-job-extract-must-gather/{build_id}/logs/`
   - Mention that the extraction is cached for faster subsequent browsing
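As referenced in item 2 above, the platform-specific commands can also be replaced by Python's standard `webbrowser` module. A minimal sketch, assuming the example build ID shown elsewhere in this document:

```python
import os
import webbrowser

build_id = "1965715986610917376"  # example build ID from this document
report = os.path.abspath(
    f".work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html"
)
# webbrowser delegates to the platform's default opener (xdg-open, open, start, ...)
webbrowser.open(f"file://{report}")
```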
## Error Handling

Handle these error scenarios gracefully:

1. **Invalid URL format**
   - Error: "URL must contain 'test-platform-results/' substring"
   - Provide an example of a valid URL

2. **Build ID not found**
   - Error: "Could not find build ID (10+ decimal digits) in URL path"
   - Explain the requirement and show the URL parsing

3. **gcloud not installed**
   - Detect with: `which gcloud`
   - Provide installation instructions for the user's platform
   - Link: https://cloud.google.com/sdk/docs/install

4. **prowjob.json not found**
   - Suggest verifying the URL and checking whether the job completed
   - Provide the gcsweb URL for manual verification

5. **Not a ci-operator job**
   - Error: "This is not a ci-operator job. No --target found in prowjob.json."
   - Explain: Only ci-operator jobs can be analyzed by this skill

6. **must-gather.tar not found**
   - Warn: "Must-gather archive not found at expected path"
   - Suggest: Job may not have completed or gather-must-gather may not have run
   - Provide the full GCS path that was checked

7. **Corrupted archive**
   - Warn: "Could not extract {archive-path}: {error}"
   - Continue processing other archives
   - Report all errors in the final summary

8. **No "-ci-" subdirectory found**
   - Warn: "Could not find expected subdirectory to rename to 'content/'"
   - Continue with extraction anyway
   - Files will remain in the original directory structure

## Performance Considerations

1. **Avoid re-extracting**
   - Check if `.work/prow-job-extract-must-gather/{build_id}/logs/` already has content (see the sketch after this list)
   - Ask the user before re-extracting

2. **Efficient downloads**
   - Use `gcloud storage cp` with `--no-user-output-enabled` to suppress verbose output

3. **Memory efficiency**
   - Process archives incrementally
   - Don't load entire files into memory
   - Use streaming extraction

4. **Progress indicators**
   - Show "Downloading must-gather archive..." before the gcloud command
   - Show "Extracting must-gather.tar..." before extraction
   - Show "Processing nested archives..." during recursive extraction
   - Show "Generating HTML file browser..." before report generation
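As referenced in item 1 above, the "already extracted" check amounts to testing that the logs directory exists and is non-empty. A minimal sketch; the function name is illustrative:

```python
import os

def has_existing_extraction(build_id: str) -> bool:
    """Return True if a previous run already populated the logs/ directory."""
    logs_dir = f".work/prow-job-extract-must-gather/{build_id}/logs"
    return os.path.isdir(logs_dir) and any(os.scandir(logs_dir))
```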
## Examples

### Example 1: Extract must-gather from periodic job

```
User: "Extract must-gather from this Prow job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376"

Output:
- Downloads must-gather.tar to: .work/prow-job-extract-must-gather/1965715986610917376/tmp/
- Extracts to: .work/prow-job-extract-must-gather/1965715986610917376/logs/
- Renames long subdirectory to: content/
- Processes 247 nested archives (.tar.gz, .tgz, .gz)
- Creates: .work/prow-job-extract-must-gather/1965715986610917376/must-gather-browser.html
- Opens browser with interactive file list (3,421 files, 234 MB)
```

## Tips

- Always verify gcloud prerequisites before starting (the gcloud CLI must be installed)
- Authentication is NOT required - the bucket is publicly accessible
- Use the `.work/prow-job-extract-must-gather/{build_id}/` directory structure for organization
- All work files are in `.work/`, which is already in .gitignore
- The Python scripts handle all extraction and HTML generation - use them!
- Cache extracted files in `.work/prow-job-extract-must-gather/{build_id}/` to avoid re-extraction
- The HTML file browser supports regex patterns for powerful file filtering
- Extracted files can be opened directly from the HTML browser (links are relative)

## Important Notes

1. **Archive Processing:**
   - The script automatically handles nested archives
   - Original compressed files are removed after successful extraction
   - Corrupted archives are skipped with warnings

2. **Directory Renaming:**
   - The long subdirectory name (containing "-ci-") is renamed to "content/" for brevity
   - Files within "content/" are NOT altered
   - This makes paths more readable in the HTML browser

3. **File Type Detection:**
   - File types are detected based on extension
   - Common types are color-coded in the HTML browser
   - All file types can be filtered

4. **Regex Pattern Filtering:**
   - Users can enter regex patterns in the filter input
   - Patterns match against full file paths
   - Invalid regex patterns are ignored gracefully

5. **Working with Scripts:**
   - All scripts are in `plugins/prow-job/skills/prow-job-extract-must-gather/`
   - `extract_archives.py` - Extracts and processes archives
   - `generate_html_report.py` - Generates the interactive HTML file browser