---
name: Prow Job Analyze Resource
description: Analyze Kubernetes resource lifecycle in Prow CI job artifacts by parsing audit logs and pod logs from GCS, generating interactive HTML reports with timelines
---
# Prow Job Analyze Resource
This skill analyzes the lifecycle of Kubernetes resources during Prow CI job execution by downloading and parsing artifacts from Google Cloud Storage.
## When to Use This Skill
Use this skill when the user wants to:
- Debug Prow CI test failures by tracking resource state changes
- Understand when and how a Kubernetes resource was created, modified, or deleted during a test
- Analyze resource lifecycle across audit logs and pod logs from ephemeral test clusters
- Generate interactive HTML reports showing resource events over time
- Search for specific resources (pods, deployments, configmaps, etc.) in Prow job artifacts
## Prerequisites
Before starting, verify these prerequisites:
1. **gcloud CLI Installation**
- Check if installed: `which gcloud`
- If not installed, provide instructions for the user's platform
- Installation guide: https://cloud.google.com/sdk/docs/install
2. **gcloud Authentication (Optional)**
- The `test-platform-results` bucket is publicly accessible
- No authentication is required for read access
- Skip authentication checks
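A programmatic equivalent of the `which gcloud` check, if the agent prefers to verify it from Python (a sketch; the shell command above works just as well):
```python
# Minimal prerequisite check: shutil.which returns None when gcloud is not on PATH.
import shutil

if shutil.which("gcloud") is None:
    print("gcloud CLI not found. Install it first: https://cloud.google.com/sdk/docs/install")
```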
## Input Format
The user will provide:
1. **Prow job URL** - gcsweb URL containing `test-platform-results/`
- Example: `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30393/pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn/1978913325970362368/`
- URL may or may not have trailing slash
2. **Resource specifications** - Comma-delimited list in format `[namespace:][kind/]name`
- Supports regex patterns for matching multiple resources
- Examples:
- `pod/etcd-0` - pod named etcd-0 in any namespace
- `openshift-etcd:pod/etcd-0` - pod in specific namespace
- `etcd-0` - any resource named etcd-0 (no kind filter)
- `pod/etcd-0,configmap/cluster-config` - multiple resources
- `resource-name-1|resource-name-2` - multiple resources using regex OR
- `e2e-test-project-api-.*` - all resources matching the pattern
## Implementation Steps
### Step 1: Parse and Validate URL
1. **Extract bucket path**
- Find `test-platform-results/` in URL
- Extract everything after it as the GCS bucket relative path
- If not found, error: "URL must contain 'test-platform-results/'"
2. **Extract build_id**
- Search for pattern `/(\d{10,})/` in the bucket path
- build_id must be at least 10 consecutive decimal digits
- Handle URLs with or without trailing slash
- If not found, error: "Could not find build ID (10+ digits) in URL"
3. **Extract prowjob name**
- Find the path segment immediately preceding build_id
- Example: In `.../pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn/1978913325970362368/`
- Prowjob name: `pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn`
4. **Construct GCS paths**
- Bucket: `test-platform-results`
- Base GCS path: `gs://test-platform-results/{bucket-path}/`
- Ensure path ends with `/`
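The parsing rules above can be summarized in a short Python sketch (illustrative only; the exact error messages and edge-case handling in the workflow may differ):
```python
# Sketch of Step 1: extract bucket path, build_id, prowjob name, and GCS base path.
import re

def parse_prow_url(url: str) -> dict:
    marker = "test-platform-results/"
    idx = url.find(marker)
    if idx == -1:
        raise ValueError("URL must contain 'test-platform-results/'")

    # Everything after the marker is the bucket-relative path.
    bucket_path = url[idx + len(marker):].strip("/")

    # build_id: a path segment of 10+ consecutive decimal digits.
    m = re.search(r"/(\d{10,})(?:/|$)", "/" + bucket_path + "/")
    if not m:
        raise ValueError("Could not find build ID (10+ digits) in URL")
    build_id = m.group(1)

    # Prowjob name: the segment immediately preceding the build_id.
    segments = bucket_path.split("/")
    prowjob_name = segments[segments.index(build_id) - 1]

    return {
        "bucket_path": bucket_path,
        "build_id": build_id,
        "prowjob_name": prowjob_name,
        # Base GCS path always ends with a trailing slash.
        "gcs_base": f"gs://test-platform-results/{bucket_path}/",
    }

if __name__ == "__main__":
    url = ("https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/"
           "test-platform-results/pr-logs/pull/30393/"
           "pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn/1978913325970362368/")
    print(parse_prow_url(url))
```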
### Step 2: Parse Resource Specifications
For each comma-delimited resource spec:
1. **Parse format** `[namespace:][kind/]name`
- Split on `:` to get namespace (optional)
- Split remaining on `/` to get kind (optional) and name (required)
- Store as structured data: `{namespace, kind, name}`
2. **Validate**
- name is required
- namespace and kind are optional
- Examples:
- `pod/etcd-0` → `{kind: "pod", name: "etcd-0"}`
- `openshift-etcd:pod/etcd-0` → `{namespace: "openshift-etcd", kind: "pod", name: "etcd-0"}`
- `etcd-0` → `{name: "etcd-0"}`
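A minimal Python sketch of this parsing, assuming plain specs (regex patterns such as `name1|name2` or `e2e-test-.*` simply pass through as the `name` field):
```python
# Sketch of [namespace:][kind/]name parsing; not the skill's exact implementation.
def parse_resource_spec(spec: str) -> dict:
    namespace = kind = None

    # Optional namespace prefix, separated by ':'.
    if ":" in spec:
        namespace, spec = spec.split(":", 1)

    # Optional kind prefix, separated by '/'.
    if "/" in spec:
        kind, spec = spec.split("/", 1)

    name = spec
    if not name:
        raise ValueError("resource name is required")

    # Only include the optional fields when present.
    result = {"name": name}
    if kind:
        result["kind"] = kind
    if namespace:
        result["namespace"] = namespace
    return result

def parse_resource_specs(arg: str) -> list:
    # Comma-delimited list of specs.
    return [parse_resource_spec(s.strip()) for s in arg.split(",") if s.strip()]

if __name__ == "__main__":
    print(parse_resource_specs("openshift-etcd:pod/etcd-0,configmap/cluster-config,etcd-0"))
```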
### Step 3: Create Working Directory
1. **Check for existing artifacts first**
- Check if `.work/prow-job-analyze-resource/{build_id}/logs/` directory exists and has content
- If it exists with content:
- Use AskUserQuestion tool to ask:
- Question: "Artifacts already exist for build {build_id}. Would you like to use the existing download or re-download?"
- Options:
- "Use existing" - Skip to artifact parsing step (Step 6)
- "Re-download" - Continue to clean and re-download
- If user chooses "Re-download":
- Remove all existing content: `rm -rf .work/prow-job-analyze-resource/{build_id}/logs/`
- Also remove tmp directory: `rm -rf .work/prow-job-analyze-resource/{build_id}/tmp/`
- This ensures clean state before downloading new content
- If user chooses "Use existing":
- Skip directly to Step 6 (Parse Audit Logs)
- Still need to download prowjob.json if it doesn't exist
2. **Create directory structure**
```bash
mkdir -p .work/prow-job-analyze-resource/{build_id}/logs
mkdir -p .work/prow-job-analyze-resource/{build_id}/tmp
```
- Use `.work/prow-job-analyze-resource/` as the base directory (already in .gitignore)
- Use build_id as subdirectory name
- Create `logs/` subdirectory for all downloads
- Create `tmp/` subdirectory for temporary files (intermediate JSON, etc.)
- Working directory: `.work/prow-job-analyze-resource/{build_id}/`
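A pathlib-based sketch of the cache check and directory creation described in this step (the AskUserQuestion interaction itself stays with the agent; `reuse_existing` stands in for the user's answer):
```python
# Sketch of the working-directory handling for Step 3.
import shutil
from pathlib import Path

def prepare_workdir(build_id: str, reuse_existing: bool) -> Path:
    base = Path(".work/prow-job-analyze-resource") / build_id
    logs = base / "logs"
    tmp = base / "tmp"

    # If artifacts already exist and the user chose to re-download,
    # wipe both logs/ and tmp/ to guarantee a clean state.
    if logs.exists() and any(logs.iterdir()) and not reuse_existing:
        shutil.rmtree(logs)
        shutil.rmtree(tmp, ignore_errors=True)

    # Equivalent of `mkdir -p` for both subdirectories.
    logs.mkdir(parents=True, exist_ok=True)
    tmp.mkdir(parents=True, exist_ok=True)
    return base
```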
### Step 4: Download and Validate prowjob.json
1. **Download prowjob.json**
```bash
gcloud storage cp gs://test-platform-results/{bucket-path}/prowjob.json .work/prow-job-analyze-resource/{build_id}/logs/prowjob.json --no-user-output-enabled
```
2. **Parse and validate**
- Read `.work/prow-job-analyze-resource/{build_id}/logs/prowjob.json`
- Search for pattern: `--target=([a-zA-Z0-9-]+)`
- If not found:
- Display: "This is not a ci-operator job. The prowjob cannot be analyzed by this skill."
- Explain: ci-operator jobs have a --target argument specifying the test target
- Exit skill
3. **Extract target name**
- Capture the target value (e.g., `e2e-aws-ovn`)
- Store for constructing gather-extra path
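A minimal sketch of the target extraction, scanning the raw prowjob.json text with the regex given above rather than walking the JSON structure:
```python
# Sketch of Step 4: find the --target argument in prowjob.json.
import re
from pathlib import Path

def extract_target(prowjob_path: str) -> str:
    text = Path(prowjob_path).read_text()
    m = re.search(r"--target=([a-zA-Z0-9-]+)", text)
    if not m:
        raise SystemExit(
            "This is not a ci-operator job. The prowjob cannot be analyzed by this skill."
        )
    return m.group(1)  # e.g. "e2e-aws-ovn"
```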
### Step 5: Download Audit Logs and Pod Logs
1. **Construct gather-extra paths**
- GCS path: `gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-extra/`
- Local path: `.work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/`
2. **Download audit logs**
```bash
mkdir -p .work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/audit_logs
gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-extra/artifacts/audit_logs/ .work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/audit_logs/ --no-user-output-enabled
```
- Create directory first to avoid gcloud errors
- Use `--no-user-output-enabled` to suppress progress output
- If directory not found, warn: "No audit logs found. Job may not have completed or audit logging may be disabled."
3. **Download pod logs**
```bash
mkdir -p .work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/pods
gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-extra/artifacts/pods/ .work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/pods/ --no-user-output-enabled
```
- Create directory first to avoid gcloud errors
- Use `--no-user-output-enabled` to suppress progress output
- If directory not found, warn: "No pod logs found."
### Step 6: Parse Audit Logs and Pod Logs
**IMPORTANT: Use the provided Python script `parse_all_logs.py` from the skill directory to parse both audit logs and pod logs efficiently.**
**Usage:**
```bash
python3 plugins/prow-job/skills/prow-job-analyze-resource/parse_all_logs.py <resource_pattern> \
.work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/audit_logs \
.work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/pods \
> .work/prow-job-analyze-resource/{build_id}/tmp/all_entries.json
```
**Resource Pattern Parameter:**
- The `<resource_pattern>` parameter supports **regex patterns**
- Use `|` (pipe) to search for multiple resources: `resource1|resource2|resource3`
- Use `.*` for wildcards: `e2e-test-project-.*`
- Simple substring matching still works: `my-namespace`
- Examples:
- Single resource: `e2e-test-project-api-pkjxf`
- Multiple resources: `e2e-test-project-api-pkjxf|e2e-test-project-api-7zdxx`
- Pattern matching: `e2e-test-project-api-.*`
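If the user supplies several exact names, a pattern for a single pass can be built by escaping each name and joining with `|` (an optional helper, not part of the skill's scripts):
```python
# Build a <resource_pattern> that matches any of several exact resource names.
import re

def build_pattern(names: list) -> str:
    # re.escape keeps regex metacharacters in the names literal.
    return "|".join(re.escape(n) for n in names)

# build_pattern(["e2e-test-project-api-pkjxf", "e2e-test-project-api-7zdxx"])
# -> 'e2e-test-project-api-pkjxf|e2e-test-project-api-7zdxx'
```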
**Note:** The script outputs status messages to stderr which will display as progress. The JSON output to stdout is clean and ready to use.
**What the script does:**
1. **Find all log files**
- Audit logs: `.work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/audit_logs/**/*.log`
- Pod logs: `.work/prow-job-analyze-resource/{build_id}/logs/artifacts/{target}/gather-extra/artifacts/pods/**/*.log`
2. **Parse audit log files (JSONL format)**
- Read file line by line
- Each line is a JSON object (JSONL format)
- Parse JSON into object `e`
3. **Extract fields from each audit log entry**
- `e.verb` - action (get, list, create, update, patch, delete, watch)
- `e.user.username` - user making request
- `e.responseStatus.code` - HTTP response code (integer)
- `e.objectRef.namespace` - namespace (if namespaced)
- `e.objectRef.resource` - lowercase plural kind (e.g., "pods", "configmaps")
- `e.objectRef.name` - resource name
- `e.requestReceivedTimestamp` - ISO 8601 timestamp
4. **Filter matches for each resource spec**
- Uses **regex matching** on `e.objectRef.namespace` and `e.objectRef.name`
- Pattern matches if found in either namespace or name field
- Supports all regex features:
- Pipe operator: `resource1|resource2` matches either resource
- Wildcards: `e2e-test-.*` matches all resources starting with `e2e-test-`
- Character classes: `[abc]` matches a, b, or c
- Simple substring matching still works for patterns without regex special chars
- Performance optimization: plain strings use fast substring search
5. **For each audit log match, capture**
- **Source**: "audit"
- **Filename**: Full path to .log file
- **Line number**: Line number in file (1-indexed)
- **Level**: Based on `e.responseStatus.code`
- 200-299: "info"
- 400-499: "warn"
- 500-599: "error"
- **Timestamp**: Parse `e.requestReceivedTimestamp` to datetime
- **Content**: Full JSON line (for expandable details)
- **Summary**: Generate formatted summary
- Format: `{verb} {resource}/{name} in {namespace} by {username} → HTTP {code}`
- Example: `create pod/etcd-0 in openshift-etcd by system:serviceaccount:kube-system:deployment-controller → HTTP 201`
6. **Parse pod log files (plain text format)**
- Read file line by line
- Each line is plain text (not JSON)
- Search for resource pattern in line content
7. **For each pod log match, capture**
- **Source**: "pod"
- **Filename**: Full path to .log file
- **Line number**: Line number in file (1-indexed)
- **Level**: Detect from glog format or default to "info"
- Glog format: `E0910 11:43:41.153414 ...` (E=error, W=warn, I=info, F=fatal→error)
- Non-glog format: default to "info"
- **Timestamp**: Extract from start of line if present (format: `YYYY-MM-DDTHH:MM:SS.mmmmmmZ`)
- **Content**: Full log line
- **Summary**: First 200 characters of line (after timestamp if present)
8. **Combine and sort all entries**
- Merge audit log entries and pod log entries
- Sort all entries chronologically by timestamp
- Entries without timestamps are placed at the end
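The following condensed Python sketch illustrates the matching, level mapping, and summary formatting described above; `parse_all_logs.py` remains the authoritative implementation and its exact field handling may differ:
```python
# Illustrative sketch only; see parse_all_logs.py for the real implementation.
import json
import re

def audit_level(code: int) -> str:
    # Map HTTP response codes to report levels: 2xx info, 4xx warn, 5xx error.
    if 500 <= code <= 599:
        return "error"
    if 400 <= code <= 499:
        return "warn"
    return "info"

def matches(pattern: str, event: dict) -> bool:
    # Regex search against both objectRef.namespace and objectRef.name.
    ref = event.get("objectRef", {})
    return bool(re.search(pattern, ref.get("namespace") or "")
                or re.search(pattern, ref.get("name") or ""))

def summarize(event: dict) -> str:
    # "{verb} {resource}/{name} in {namespace} by {username} → HTTP {code}"
    ref = event.get("objectRef", {})
    return (f"{event.get('verb')} {ref.get('resource')}/{ref.get('name')} "
            f"in {ref.get('namespace')} by {event.get('user', {}).get('username')} "
            f"→ HTTP {event.get('responseStatus', {}).get('code')}")

def audit_events(path: str):
    # JSONL: one audit event per line; line numbers are 1-indexed.
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            try:
                yield lineno, json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
```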
### Step 7: Generate HTML Report
**IMPORTANT: Use the provided Python script `generate_html_report.py` from the skill directory.**
**Usage:**
```bash
python3 plugins/prow-job/skills/prow-job-analyze-resource/generate_html_report.py \
.work/prow-job-analyze-resource/{build_id}/tmp/all_entries.json \
"{prowjob_name}" \
"{build_id}" \
"{target}" \
"{resource_pattern}" \
"{gcsweb_url}"
```
**Resource Pattern Parameter:**
- The `{resource_pattern}` should be the **same pattern used in the parse script**
- For single resources: `e2e-test-project-api-pkjxf`
- For multiple resources: `e2e-test-project-api-pkjxf|e2e-test-project-api-7zdxx`
- The script will parse the pattern to display the searched resources in the HTML header
**Output:** The script generates `.work/prow-job-analyze-resource/{build_id}/{first_resource_name}.html`
**What the script does:**
1. **Determine report filename**
- Format: `.work/prow-job-analyze-resource/{build_id}/{resource_name}.html`
- Uses the primary resource name for the filename
2. **Sort all entries by timestamp**
- Loads all entries (audit and pod) from the JSON file
- Sorts chronologically (ascending)
- Entries without timestamps go at the end
3. **Calculate timeline bounds**
- min_time: Earliest timestamp found
- max_time: Latest timestamp found
- Time range: max_time - min_time
4. **Generate HTML structure**
**Header Section:**
```html
<div class="header">
<h1>Prow Job Resource Lifecycle Analysis</h1>
<div class="metadata">
<p><strong>Prow Job:</strong> {prowjob-name}</p>
<p><strong>Build ID:</strong> {build_id}</p>
<p><strong>gcsweb URL:</strong> <a href="{original-url}">{original-url}</a></p>
<p><strong>Target:</strong> {target}</p>
<p><strong>Resources:</strong> {resource-list}</p>
<p><strong>Total Entries:</strong> {count}</p>
<p><strong>Time Range:</strong> {min_time} to {max_time}</p>
</div>
</div>
```
**Interactive Timeline:**
```html
<div class="timeline-container">
<svg id="timeline" width="100%" height="100">
<!-- For each entry, render colored vertical line -->
<line x1="{position}%" y1="0" x2="{position}%" y2="100"
stroke="{color}" stroke-width="2"
class="timeline-event" data-entry-id="{entry-id}"
title="{summary}">
</line>
</svg>
</div>
```
- Position: Calculate percentage based on timestamp between min_time and max_time
- Color: white/lightgray (info), yellow (warn), red (error)
- Clickable: Jump to corresponding entry
- Tooltip on hover: Show summary
**Log Entries Section:**
```html
<div class="entries">
<div class="filters">
<!-- Filter controls: by level, by resource, by time range -->
</div>
<div class="entry" id="entry-{index}">
<div class="entry-header">
<span class="timestamp">{formatted-timestamp}</span>
<span class="level badge-{level}">{level}</span>
<span class="source">{filename}:{line-number}</span>
</div>
<div class="entry-summary">{summary}</div>
<details class="entry-details">
<summary>Show full content</summary>
<pre><code>{content}</code></pre>
</details>
</div>
</div>
```
**CSS Styling:**
- Modern, clean design with good contrast
- Responsive layout
- Badge colors: info=gray, warn=yellow, error=red
- Monospace font for log content
- Syntax highlighting for JSON (in audit logs)
**JavaScript Interactivity:**
```javascript
// Timeline click handler
document.querySelectorAll('.timeline-event').forEach(el => {
el.addEventListener('click', () => {
const entryId = el.dataset.entryId;
document.getElementById(entryId).scrollIntoView({behavior: 'smooth'});
});
});
// Filter controls
// Expand/collapse details
// Search within entries
```
5. **Write HTML to file**
- Script automatically writes to `.work/prow-job-analyze-resource/{build_id}/{resource_name}.html`
- Includes proper HTML5 structure
- All CSS and JavaScript are inline for portability
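The timeline geometry boils down to a simple proportion; a Python sketch of how each event might be positioned and colored (illustrative, not the script's exact markup):
```python
# Illustrative sketch of timeline positioning and coloring.
import html
from datetime import datetime

LEVEL_COLORS = {"info": "lightgray", "warn": "yellow", "error": "red"}

def timeline_position(ts: datetime, min_time: datetime, max_time: datetime) -> float:
    # Horizontal position as a percentage of the overall time range.
    span = (max_time - min_time).total_seconds()
    if span == 0:
        return 0.0
    return 100.0 * (ts - min_time).total_seconds() / span

def timeline_line(entry_id: str, ts: datetime, min_time: datetime,
                  max_time: datetime, level: str, summary: str) -> str:
    pos = timeline_position(ts, min_time, max_time)
    color = LEVEL_COLORS.get(level, "lightgray")
    # A <title> child is the usual way to get hover tooltips inside an SVG.
    return (f'<line x1="{pos:.2f}%" y1="0" x2="{pos:.2f}%" y2="100" '
            f'stroke="{color}" stroke-width="2" class="timeline-event" '
            f'data-entry-id="{entry_id}">'
            f'<title>{html.escape(summary)}</title></line>')
```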
### Step 8: Present Results to User
1. **Display summary**
```
Resource Lifecycle Analysis Complete
Prow Job: {prowjob-name}
Build ID: {build_id}
Target: {target}
Resources Analyzed:
- {resource-spec-1}
- {resource-spec-2}
...
Artifacts downloaded to: .work/prow-job-analyze-resource/{build_id}/logs/
Results:
- Audit log entries: {audit-count}
- Pod log entries: {pod-count}
- Total entries: {total-count}
- Time range: {min_time} to {max_time}
Report generated: .work/prow-job-analyze-resource/{build_id}/{resource_name}.html
Open in browser to view interactive timeline and detailed entries.
```
2. **Open report in browser**
- Detect platform and automatically open the HTML report in the default browser
- Linux: `xdg-open .work/prow-job-analyze-resource/{build_id}/{resource_name}.html`
- macOS: `open .work/prow-job-analyze-resource/{build_id}/{resource_name}.html`
- Windows: `start .work/prow-job-analyze-resource/{build_id}/{resource_name}.html`
- On Linux (most common for this environment), use `xdg-open`
3. **Offer next steps**
- Ask if user wants to search for additional resources in the same job
- Ask if user wants to analyze a different Prow job
- Explain that artifacts are cached in `.work/prow-job-analyze-resource/{build_id}/` for faster subsequent searches
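If the open step is scripted in Python rather than with the per-platform commands in step 2 above, the standard-library `webbrowser` module handles the platform differences (a sketch, not a requirement of the skill):
```python
# Cross-platform "open the report" sketch using only the standard library.
import webbrowser
from pathlib import Path

def open_report(report_path: str) -> None:
    # webbrowser.open delegates to the platform's default handler,
    # which is what xdg-open / open / start do on Linux / macOS / Windows.
    webbrowser.open(Path(report_path).resolve().as_uri())
```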
## Error Handling
Handle these error scenarios gracefully:
1. **Invalid URL format**
- Error: "URL must contain 'test-platform-results/' substring"
- Provide example of valid URL
2. **Build ID not found**
- Error: "Could not find build ID (10+ decimal digits) in URL path"
- Explain requirement and show URL parsing
3. **gcloud not installed**
- Detect with: `which gcloud`
- Provide installation instructions for user's platform
- Link: https://cloud.google.com/sdk/docs/install
4. **gcloud authentication errors**
- Normally not applicable: the bucket is publicly readable (see Prerequisites)
- If gcloud still reports a credential error, detect with: `gcloud auth list`
- Instruct: "Please run: gcloud auth login"
5. **No access to bucket**
- Error from gcloud storage commands
- Explain: "You need read access to the test-platform-results GCS bucket"
- Suggest checking project access
6. **prowjob.json not found**
- Suggest verifying URL and checking if job completed
- Provide gcsweb URL for manual verification
7. **Not a ci-operator job**
- Error: "This is not a ci-operator job. No --target found in prowjob.json."
- Explain: Only ci-operator jobs can be analyzed by this skill
8. **gather-extra not found**
- Warn: "gather-extra directory not found for target {target}"
- Suggest: Job may not have completed or target name is incorrect
9. **No matches found**
- Display: "No log entries found matching the specified resources"
- Suggest:
- Check resource names for typos
- Try searching without kind or namespace filters
- Verify resources existed during this job execution
10. **Timestamp parsing failures**
- Warn about unparseable timestamps
- Fall back to line order for sorting
- Still include entries in report
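A sketch of the timestamp fallback in scenario 10: parse what can be parsed, and let unparseable entries keep their original (line) order at the end, assuming ISO 8601 timestamps in the entries' `timestamp` field:
```python
# Illustrative sketch of "sort by timestamp, fall back to line order".
from datetime import datetime

def safe_timestamp(raw):
    if not raw:
        return None
    try:
        # Audit logs use ISO 8601 timestamps ending in 'Z'.
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return None  # unparseable: handled by the fallback below

def sort_entries(entries):
    # Entries with a parseable timestamp sort chronologically; the rest keep
    # their original order and are placed at the end.
    def key(item):
        index, entry = item
        ts = safe_timestamp(entry.get("timestamp"))
        return (ts is None, ts or datetime.min, index)
    return [entry for _, entry in sorted(enumerate(entries), key=key)]
```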
## Performance Considerations
1. **Avoid re-downloading**
- Check if `.work/prow-job-analyze-resource/{build_id}/logs/` already has content
- Ask user before re-downloading
2. **Efficient downloads**
- Use `gcloud storage cp -r` for recursive downloads
- Use `--no-user-output-enabled` to suppress verbose output
- Create target directories with `mkdir -p` before downloading to avoid gcloud errors
3. **Memory efficiency**
- The `parse_all_logs.py` script processes log files incrementally (line by line)
- Don't load entire files into memory
- Script outputs to JSON for efficient HTML generation
4. **Content length limits**
- The HTML generator trims JSON content to ~2000 chars in display
- Full content is available in expandable details sections
5. **Progress indicators**
- Show "Downloading audit logs..." before gcloud commands
- Show "Parsing audit logs..." before running parse script
- Show "Generating HTML report..." before running report generator
## Examples
### Example 1: Search for a namespace/project
```
User: "Analyze e2e-test-project-api-p28m in this Prow job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-okd-scos-4.20-e2e-aws-ovn-techpreview/1964725888612306944"
Output:
- Downloads artifacts to: .work/prow-job-analyze-resource/1964725888612306944/logs/
- Finds actual resource name: e2e-test-project-api-p28mx (namespace)
- Parses 382 audit log entries
- Finds 86 pod log mentions
- Creates: .work/prow-job-analyze-resource/1964725888612306944/e2e-test-project-api-p28mx.html
- Shows timeline from creation (18:11:02) to deletion (18:17:32)
```
### Example 2: Search for a pod
```
User: "Analyze pod/etcd-0 in this Prow job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30393/pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn/1978913325970362368/"
Output:
- Creates: .work/prow-job-analyze-resource/1978913325970362368/etcd-0.html
- Shows timeline of all pod/etcd-0 events across namespaces
```
### Example 3: Search by name only
```
User: "Find all resources named cluster-version-operator in job {url}"
Output:
- Searches without kind filter
- Finds deployments, pods, services, etc. all named cluster-version-operator
- Creates: .work/prow-job-analyze-resource/{build_id}/cluster-version-operator.html
```
### Example 4: Search for multiple resources using regex
```
User: "Analyze e2e-test-project-api-pkjxf and e2e-test-project-api-7zdxx in job {url}"
Output:
- Uses regex pattern: `e2e-test-project-api-pkjxf|e2e-test-project-api-7zdxx`
- Finds all events for both namespaces in a single pass
- Parses 1,047 total entries (501 for first namespace, 546 for second)
- Passes the same pattern to generate_html_report.py
- HTML displays: "Resources: e2e-test-project-api-7zdxx, e2e-test-project-api-pkjxf"
- Creates: .work/prow-job-analyze-resource/{build_id}/e2e-test-project-api-pkjxf.html
- Timeline shows interleaved events from both namespaces chronologically
```
## Tips
- Always verify gcloud prerequisites before starting (gcloud CLI must be installed)
- Authentication is NOT required - the bucket is publicly accessible
- Use `.work/prow-job-analyze-resource/{build_id}/` directory structure for organization
- All work files are in `.work/` which is already in .gitignore
- The Python scripts handle all parsing and HTML generation - use them!
- Cache artifacts in `.work/prow-job-analyze-resource/{build_id}/` to speed up subsequent searches
- The parse script supports **regex patterns** for flexible matching:
- Use `resource1|resource2` to search for multiple resources in a single pass
- Use `.*` wildcards to match resource name patterns
- Simple substring matching still works for basic searches
- The resource name provided by the user may not exactly match the actual resource name in logs
- Example: User asks for `e2e-test-project-api-p28m` but actual resource is `e2e-test-project-api-p28mx`
- Use regex patterns like `e2e-test-project-api-p28m.*` to find partial matches
- For namespaces/projects, search for the resource name - it will match both `namespace` and `project` resources
- Provide helpful error messages with actionable solutions
## Important Notes
1. **Resource Name Matching:**
- The parse script uses **regex pattern matching** for maximum flexibility
- Supports pipe operator (`|`) to search for multiple resources: `resource1|resource2`
- Supports wildcards (`.*`) for pattern matching: `e2e-test-.*`
- Simple substrings still work for basic searches
- May match multiple related resources (e.g., namespace, project, rolebindings in that namespace)
- Report all matches - this provides complete lifecycle context
2. **Namespace vs Project:**
- In OpenShift, a `project` is essentially a `namespace` with additional metadata
- Searching for a namespace will find both namespace and project resources
- The audit logs contain events for both resource types
3. **Target Extraction:**
- Must extract the `--target` argument from prowjob.json
- This is critical for finding the correct gather-extra path
- Non-ci-operator jobs cannot be analyzed (they don't have --target)
4. **Working with Scripts:**
- All scripts are in `plugins/prow-job/skills/prow-job-analyze-resource/`
- `parse_all_logs.py` - Parses audit logs and pod logs, outputs JSON
- Detects glog severity levels (E=error, W=warn, I=info, F=fatal)
- Supports regex patterns for resource matching
- `generate_html_report.py` - Generates interactive HTML report from JSON
- Scripts output status messages to stderr for progress display. JSON output to stdout is clean.
5. **Pod Log Glog Format Support:**
- The parser automatically detects and parses glog format logs
- Glog format: `E0910 11:43:41.153414 ...`
- `E` = severity (E/F → error, W → warn, I → info)
- `0910` = month/day (MMDD)
- `11:43:41.153414` = time with microseconds
- Timestamp parsing: Extracts the timestamp and infers the year, which glog omits (the parser assumes 2025)
- Severity mapping allows filtering by level in HTML report
- Non-glog logs default to info level
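For reference, a minimal sketch of the glog detection and timestamp reconstruction described above (the year is not encoded in glog lines, so it has to be assumed, per the note above):
```python
# Sketch of glog severity detection and timestamp reconstruction.
import re
from datetime import datetime

GLOG_RE = re.compile(r"^([EWIF])(\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6})\b")
LEVELS = {"E": "error", "F": "error", "W": "warn", "I": "info"}

def parse_glog_line(line: str, assumed_year: int = 2025):
    m = GLOG_RE.match(line)
    if not m:
        # Not glog-formatted: default to "info" with no timestamp.
        return "info", None
    sev, month, day, hh, mm, ss, micro = m.groups()
    ts = datetime(assumed_year, int(month), int(day),
                  int(hh), int(mm), int(ss), int(micro))
    return LEVELS[sev], ts

# parse_glog_line("E0910 11:43:41.153414       1 controller.go:114] sync failed")
# -> ("error", datetime(2025, 9, 10, 11, 43, 41, 153414))
```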