---
name: Prow Job Extract Must-Gather
description: Extract and decompress must-gather archives from Prow CI job artifacts, generating an interactive HTML file browser with filters
---

# Prow Job Extract Must-Gather

This skill extracts and decompresses must-gather archives from Prow CI job artifacts, automatically handling nested tar and gzip archives and generating an interactive HTML file browser.

## When to Use This Skill

Use this skill when the user wants to:

- Extract must-gather archives from Prow CI job artifacts
- Avoid manually downloading and extracting nested archives
- Browse must-gather contents with an interactive HTML interface
- Search for specific files or file types in must-gather data
- Analyze OpenShift cluster state from CI test runs

## Prerequisites

Before starting, verify these prerequisites:

1. **gcloud CLI Installation**
   - Check if installed: `which gcloud`
   - If not installed, provide instructions for the user's platform
   - Installation guide: https://cloud.google.com/sdk/docs/install

2. **gcloud Authentication (Optional)**
   - The `test-platform-results` bucket is publicly accessible
   - No authentication is required for read access
   - Skip authentication checks

## Input Format

The user will provide:

1. **Prow job URL** - gcsweb URL containing `test-platform-results/`
   - Example: `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/`
   - URL may or may not have a trailing slash

## Implementation Steps

### Step 1: Parse and Validate URL

1. **Extract bucket path**
   - Find `test-platform-results/` in the URL
   - Extract everything after it as the GCS bucket-relative path
   - If not found, error: "URL must contain 'test-platform-results/'"

2. **Extract build_id**
   - Search for the pattern `/(\d{10,})/` in the bucket path
   - build_id must be at least 10 consecutive decimal digits
   - Handle URLs with or without a trailing slash
   - If not found, error: "Could not find build ID (10+ digits) in URL"

3. **Extract prowjob name**
   - Find the path segment immediately preceding build_id
   - Example: in `.../periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview/1965715986610917376/`
   - Prowjob name: `periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview`

4. **Construct GCS paths**
   - Bucket: `test-platform-results`
   - Base GCS path: `gs://test-platform-results/{bucket-path}/`
   - Ensure the path ends with `/`
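Step 1 can be prototyped with a few standard-library calls. The sketch below is illustrative only and assumes nothing beyond the rules above; the function name `parse_prow_url` and the tuple it returns are hypothetical, not part of the skill's bundled scripts.

```python
import re
import sys


def parse_prow_url(url: str):
    """Derive bucket path, build_id, prowjob name, and base GCS path from a gcsweb URL."""
    marker = "test-platform-results/"
    idx = url.find(marker)
    if idx == -1:
        sys.exit("URL must contain 'test-platform-results/'")

    # Everything after the marker is the bucket-relative path; normalize the trailing slash.
    bucket_path = url[idx + len(marker):].strip("/")

    # build_id: a path segment of at least 10 consecutive decimal digits.
    match = re.search(r"/(\d{10,})/", "/" + bucket_path + "/")
    if not match:
        sys.exit("Could not find build ID (10+ digits) in URL")
    build_id = match.group(1)

    # Prowjob name: the path segment immediately preceding the build_id.
    segments = bucket_path.split("/")
    prowjob_name = segments[segments.index(build_id) - 1]

    # Base GCS path always ends with "/".
    gcs_base = f"gs://test-platform-results/{bucket_path}/"
    return bucket_path, build_id, prowjob_name, gcs_base
```

For the example URL above, this yields build_id `1965715986610917376` and the prowjob name shown in item 3.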
### Step 2: Create Working Directory

1. **Check for existing extraction first**
   - Check if the `.work/prow-job-extract-must-gather/{build_id}/logs/` directory exists and has content
   - If it exists with content:
     - Use the AskUserQuestion tool to ask:
       - Question: "Must-gather already extracted for build {build_id}. Would you like to use the existing extraction or re-extract?"
       - Options:
         - "Use existing" - Skip to HTML report generation (Step 6)
         - "Re-extract" - Continue to clean and re-download
     - If the user chooses "Re-extract":
       - Remove all existing content: `rm -rf .work/prow-job-extract-must-gather/{build_id}/logs/`
       - Also remove the tmp directory: `rm -rf .work/prow-job-extract-must-gather/{build_id}/tmp/`
       - This ensures a clean state before downloading new content
     - If the user chooses "Use existing":
       - Skip directly to Step 6 (Generate HTML Report)

2. **Create directory structure**

   ```bash
   mkdir -p .work/prow-job-extract-must-gather/{build_id}/logs
   mkdir -p .work/prow-job-extract-must-gather/{build_id}/tmp
   ```

   - Use `.work/prow-job-extract-must-gather/` as the base directory (already in .gitignore)
   - Use build_id as the subdirectory name
   - Create the `logs/` subdirectory for extraction
   - Create the `tmp/` subdirectory for temporary files
   - Working directory: `.work/prow-job-extract-must-gather/{build_id}/`

### Step 3: Download and Validate prowjob.json

1. **Download prowjob.json**

   ```bash
   gcloud storage cp gs://test-platform-results/{bucket-path}/prowjob.json .work/prow-job-extract-must-gather/{build_id}/tmp/prowjob.json --no-user-output-enabled
   ```

2. **Parse and validate**
   - Read `.work/prow-job-extract-must-gather/{build_id}/tmp/prowjob.json`
   - Search for the pattern: `--target=([a-zA-Z0-9-]+)`
   - If not found:
     - Display: "This is not a ci-operator job. The prowjob cannot be analyzed by this skill."
     - Explain: ci-operator jobs have a `--target` argument specifying the test target
     - Exit the skill

3. **Extract target name**
   - Capture the target value (e.g., `e2e-aws-ovn-techpreview`)
   - Store it for constructing the must-gather path

### Step 4: Download Must-Gather Archive

1. **Construct must-gather path**
   - GCS path: `gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-must-gather/artifacts/must-gather.tar`
   - Local path: `.work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar`

2. **Download must-gather.tar**

   ```bash
   gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-must-gather/artifacts/must-gather.tar .work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar --no-user-output-enabled
   ```

   - Use `--no-user-output-enabled` to suppress progress output
   - If the file is not found, error: "No must-gather archive found. Job may not have completed or gather-must-gather may not have run."

### Step 5: Extract and Process Archives

**IMPORTANT: Use the provided Python script `extract_archives.py` from the skill directory.**

**Usage:**

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
  .work/prow-job-extract-must-gather/{build_id}/tmp/must-gather.tar \
  .work/prow-job-extract-must-gather/{build_id}/logs
```

**What the script does:**

1. **Extract must-gather.tar**
   - Extract to the `{build_id}/logs/` directory
   - Uses Python's tarfile module for reliable extraction

2. **Rename long subdirectory to "content/"**
   - Find the subdirectory containing "-ci-" in its name
   - Example: `registry-build09-ci-openshift-org-ci-op-m8t77165-stable-sha256-d1ae126eed86a47fdbc8db0ad176bf078a5edebdbb0df180d73f02e5f03779e0/`
   - Rename to: `content/`
   - Preserves all files and subdirectories

3. **Recursively process nested archives**
   - Walk the entire directory tree
   - Find and process archives (a combined sketch of this walk appears after this step):

   **For .tar.gz and .tgz files:**

   ```python
   # Extract in place
   with tarfile.open(archive_path, 'r:gz') as tar:
       tar.extractall(path=parent_dir)
   # Remove original archive
   os.remove(archive_path)
   ```

   **For .gz files (no tar):**

   ```python
   # Gunzip in place
   with gzip.open(gz_path, 'rb') as f_in:
       with open(output_path, 'wb') as f_out:
           shutil.copyfileobj(f_in, f_out)
   # Remove original archive
   os.remove(gz_path)
   ```

4. **Progress reporting**
   - Print status for each extracted archive
   - Count total files and archives processed
   - Report final statistics

5. **Error handling**
   - Skip corrupted archives with a warning
   - Continue processing other files
   - Report all errors at the end
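For orientation, here is a minimal sketch of the walk that ties the two per-archive snippets above together, assuming archives are expanded in place until none remain. It is an illustration only; the authoritative logic lives in `extract_archives.py`, which also handles the `content/` rename and progress reporting omitted here.

```python
import gzip
import os
import shutil
import tarfile


def extract_nested_archives(root_dir: str) -> None:
    """Expand nested .tar.gz/.tgz/.gz files under root_dir until none remain."""
    found = True
    while found:  # newly extracted content may itself contain archives
        found = False
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if name.endswith((".tar.gz", ".tgz")):
                        # Extract in place, then drop the original archive
                        with tarfile.open(path, "r:gz") as tar:
                            tar.extractall(path=dirpath)
                        os.remove(path)
                        found = True
                    elif name.endswith(".gz"):
                        # Gunzip in place (strip the ".gz" suffix for the output name)
                        with gzip.open(path, "rb") as f_in, open(path[:-3], "wb") as f_out:
                            shutil.copyfileobj(f_in, f_out)
                        os.remove(path)
                        found = True
                except (tarfile.TarError, gzip.BadGzipFile, OSError) as err:
                    # Matches the documented behavior: warn, skip, keep going
                    print(f"WARNING: skipping corrupted archive {path}: {err}")
```

Re-walking the tree until no archive is found handles archives nested inside other archives without explicit recursion.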
### Step 6: Generate HTML File Browser

**IMPORTANT: Use the provided Python script `generate_html_report.py` from the skill directory.**

**Usage:**

```bash
python3 plugins/prow-job/skills/prow-job-extract-must-gather/generate_html_report.py \
  .work/prow-job-extract-must-gather/{build_id}/logs \
  "{prowjob_name}" \
  "{build_id}" \
  "{target}" \
  "{gcsweb_url}"
```

**Output:** The script generates `.work/prow-job-extract-must-gather/{build_id}/must-gather-browser.html`

**What the script does:**

1. **Scan directory tree**
   - Recursively walk the `{build_id}/logs/` directory
   - Collect all files with metadata:
     - Relative path from logs/
     - File size (human-readable: KB, MB, GB)
     - File extension
     - Directory depth
     - Last modified time

2. **Classify files**
   - Detect file types based on extension:
     - Logs: `.log`, `.txt`
     - YAML: `.yaml`, `.yml`
     - JSON: `.json`
     - XML: `.xml`
     - Certificates: `.crt`, `.pem`, `.key`
     - Binaries: `.tar`, `.gz`, `.tgz`, `.tar.gz`
     - Other
   - Count files by type for statistics

3. **Generate HTML structure**

**Header Section:**

```html