Initial commit
14
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,14 @@
{
  "name": "sosreport",
  "description": "Analyze sosreport archives for system diagnostics and troubleshooting",
  "version": "0.0.1",
  "author": {
    "name": "github.com/arkadeepsen"
  },
  "skills": [
    "./skills"
  ],
  "commands": [
    "./commands"
  ]
}
3
README.md
Normal file
@@ -0,0 +1,3 @@
# sosreport

Analyze sosreport archives for system diagnostics and troubleshooting
348
commands/analyze.md
Normal file
@@ -0,0 +1,348 @@
|
||||
---
|
||||
description: Analyze sosreport archive for system diagnostics and issues
|
||||
argument-hint: <path-to-sosreport> [--only <areas>] [--skip <areas>]
|
||||
---
|
||||
|
||||
## Name
|
||||
sosreport:analyze
|
||||
|
||||
## Synopsis
|
||||
```
/sosreport:analyze <path-to-sosreport> [--only <areas>] [--skip <areas>]
```
|
||||
|
||||
**Analysis Areas:**
|
||||
|
||||
- **`logs`**: Analyze system and application logs (journald, syslog, dmesg, application logs)
|
||||
- Identifies errors, warnings, critical messages
|
||||
- Detects OOM killer events, kernel panics, segfaults
|
||||
- Counts and categorizes errors by severity
|
||||
- Provides timeline of critical events
|
||||
|
||||
- **`resources`**: Analyze system resource usage (memory, CPU, disk, processes)
|
||||
- Memory usage, swap, and pressure indicators
|
||||
- CPU information and load averages
|
||||
- Disk usage and filesystem capacity
|
||||
- Top resource consumers and zombie processes
|
||||
|
||||
- **`network`**: Analyze network configuration and connectivity
|
||||
- Network interface status and IP addresses
|
||||
- Routing table and default gateway
|
||||
- Active connections and listening services
|
||||
- Firewall rules (firewalld/iptables)
|
||||
- DNS configuration and hostname resolution
|
||||
|
||||
- **`system-config`**: Analyze system configuration (packages, services, security)
|
||||
- OS version and kernel information
|
||||
- Installed package versions
|
||||
- Systemd service status and failures
|
||||
- SELinux/AppArmor configuration and denials
|
||||
- Kernel parameters and resource limits
|
||||
|
||||
## Description
|
||||
The `sosreport:analyze` command performs comprehensive analysis of a sosreport archive (from <https://github.com/sosreport/sos>) to identify system issues, configuration problems, and potential causes of failures. It examines system logs, resource usage, network configuration, installed packages, and other diagnostic data collected by sosreport.
|
||||
|
||||
By default, all analysis areas are executed. Use `--only` to run specific areas or `--skip` to exclude areas from analysis.
|
||||
|
||||
## Arguments
|
||||
- `$1` (required): Path to the sosreport archive file (`.tar.gz` or `.tar.xz`) or extracted directory
|
||||
- `--only <areas>` (optional): Comma-separated list of analysis areas to run. Valid areas: `logs`, `resources`, `network`, `system-config`. If not specified, all areas are analyzed.
|
||||
- `--skip <areas>` (optional): Comma-separated list of analysis areas to skip. Valid areas: `logs`, `resources`, `network`, `system-config`. Cannot be used with `--only`.
|
||||
|
||||
## Implementation
|
||||
|
||||
The sosreport analysis is organized into several specialized phases, each with detailed implementation guidance in separate skill documents. The command supports selective analysis through optional arguments.
|
||||
|
||||
### 1. Parse Arguments and Determine Analysis Scope
|
||||
|
||||
1. **Parse command-line arguments**
|
||||
- Extract the sosreport path (required first argument)
|
||||
- Check for `--only` flag and parse comma-separated areas
|
||||
- Check for `--skip` flag and parse comma-separated areas
|
||||
- Validate that `--only` and `--skip` are not used together
|
||||
|
||||
2. **Validate analysis areas**
|
||||
- Valid areas: `logs`, `resources`, `network`, `system-config`
|
||||
- If invalid area specified, return error with list of valid areas
|
||||
- Normalize area names (case-insensitive, accept variations like `system` for `system-config`)
|
||||
|
||||
3. **Determine which skills to run**
|
||||
- If no flags specified: Run all skills (default comprehensive analysis)
|
||||
- If `--only` specified: Run only the specified skills
|
||||
- If `--skip` specified: Run all skills except the specified ones
|
||||
- Store the list of skills to execute for later phases
|
||||
|
||||
4. **Example argument parsing**:
|
||||
```bash
# Parse: /sosreport:analyze /path/sos.tar.gz --only logs,network
# Result: Run only logs-analysis and network-analysis skills

# Parse: /sosreport:analyze /path/sos.tar.gz --skip resources
# Result: Run logs, network, and system-config (skip resources)

# Parse: /sosreport:analyze /path/sos.tar.gz
# Result: Run all skills (comprehensive analysis)
```
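
The parsing and validation rules above could be sketched in shell as follows. This is a minimal illustration, not the plugin's actual implementation: `resolve_areas` and `normalize_area` are hypothetical names, the required path argument is assumed to have been consumed already, and only the `system` alias mentioned above is handled.

```bash
# Hypothetical helpers (not part of the plugin): resolve which areas to run.
ALL_AREAS="logs resources network system-config"

# Lowercase a name, accept "system" as an alias, and reject unknown areas.
normalize_area() {
    area=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    if [ "$area" = "system" ]; then
        area="system-config"
    fi
    case " $ALL_AREAS " in
        *" $area "*) printf '%s\n' "$area" ;;
        *) echo "Invalid area '$1'. Valid areas: $ALL_AREAS" >&2; return 1 ;;
    esac
}

# Usage: resolve_areas [--only a,b] [--skip a,b]
# Prints the selected areas, one per line; returns 1 on invalid input.
resolve_areas() {
    only=""
    skip=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --only|--skip)
                if [ $# -lt 2 ]; then
                    echo "$1 requires a comma-separated list" >&2
                    return 1
                fi
                if [ "$1" = "--only" ]; then only=$2; else skip=$2; fi
                shift 2
                ;;
            *) echo "Unknown option: $1" >&2; return 1 ;;
        esac
    done
    if [ -n "$only" ] && [ -n "$skip" ]; then
        echo "--only and --skip cannot be used together" >&2
        return 1
    fi
    # Validate and normalize whichever list was supplied.
    selected=""
    for raw in $(printf '%s' "${only}${skip}" | tr ',' ' '); do
        name=$(normalize_area "$raw") || return 1
        selected="$selected $name"
    done
    for area in $ALL_AREAS; do
        case " $selected " in
            *" $area "*) listed=1 ;;
            *) listed=0 ;;
        esac
        if [ -n "$only" ]; then
            if [ "$listed" -eq 1 ]; then echo "$area"; fi
        elif [ -n "$skip" ]; then
            if [ "$listed" -eq 0 ]; then echo "$area"; fi
        else
            echo "$area"
        fi
    done
    return 0
}
```

Printing the areas in a fixed order keeps later phases deterministic regardless of the order the user typed them in.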
|
||||
|
||||
### 2. Extract and Validate Sosreport
|
||||
|
||||
1. **Check if path exists**
|
||||
- Verify the provided path points to a valid file or directory
|
||||
- If file doesn't exist, return error with helpful message
|
||||
|
||||
2. **Extract archive if needed**
|
||||
- If path is a `.tar.gz` or `.tar.xz` file:
|
||||
- Create extraction directory: `.work/sosreport-analyze/{timestamp}/`
|
||||
- Extract archive: `tar -xf <path> -C .work/sosreport-analyze/{timestamp}/`
|
||||
- Store extracted directory path for analysis
|
||||
- If path is already a directory:
|
||||
- Verify it's a valid sosreport directory (check for `sos_commands/`, `sos_logs/`, etc.)
|
||||
- Use the directory directly
|
||||
|
||||
3. **Identify sosreport structure**
|
||||
- Locate the root directory (usually has format `sosreport-{hostname}-{date}/`)
|
||||
- Verify expected directories exist: `sos_commands/`, `sos_logs/`, `sos_reports/`
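
A possible shell sketch of the extraction and root-detection steps above. The function name `prepare_sosreport` and the `WORK_ROOT` variable are assumptions made for illustration, and the sketch only looks for `sos_commands/` at the top level or one directory down.

```bash
# Hypothetical helper (not part of the plugin): print the sosreport root for an
# archive or an already extracted directory.
# Returns 1 if the path is invalid and 2 if no sosreport structure is found.
prepare_sosreport() {
    src=$1
    work_root=${WORK_ROOT:-.work/sosreport-analyze}
    if [ -d "$src" ]; then
        search_dir=$src
    elif [ -f "$src" ]; then
        search_dir="$work_root/$(date +%Y%m%d-%H%M%S)"
        mkdir -p "$search_dir" || return 1
        # GNU tar and bsdtar detect gzip/xz compression on their own when reading.
        tar -xf "$src" -C "$search_dir" || return 1
    else
        echo "Path not found: $src" >&2
        return 1
    fi
    # The root is the directory that contains sos_commands/.
    for candidate in "$search_dir" "$search_dir"/*/; do
        if [ -d "${candidate%/}/sos_commands" ]; then
            printf '%s\n' "${candidate%/}"
            return 0
        fi
    done
    echo "No sos_commands/ directory found under $search_dir" >&2
    return 2
}
```

The two non-zero return values are chosen to line up with the exit codes documented under Return Value.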
|
||||
|
||||
### 3. Analyze System Logs
|
||||
|
||||
**Run condition**: Only if `logs` area is selected (or no filters specified)
|
||||
**Detailed implementation**: See `plugins/sosreport/skills/logs-analysis/SKILL.md`
|
||||
|
||||
Perform comprehensive log analysis including:
|
||||
- Journald logs (journalctl output)
|
||||
- System logs (messages, dmesg, secure)
|
||||
- Application-specific logs
|
||||
- Error counting and categorization
|
||||
- Timeline of critical events
|
||||
- OOM killer events, kernel panics, segfaults
|
||||
|
||||
**Key outputs**:
|
||||
- Error statistics by severity
|
||||
- Top error messages by frequency
|
||||
- Critical findings with timestamps
|
||||
- Log file locations for investigation
|
||||
|
||||
### 4. Analyze Resource Usage
|
||||
|
||||
**Run condition**: Only if `resources` area is selected (or no filters specified)
|
||||
**Detailed implementation**: See `plugins/sosreport/skills/resource-analysis/SKILL.md`
|
||||
|
||||
Perform resource analysis including:
|
||||
- Memory usage and pressure indicators
|
||||
- CPU information and load averages
|
||||
- Disk usage and I/O errors
|
||||
- Process analysis (top consumers, zombies)
|
||||
- Resource exhaustion patterns
|
||||
|
||||
**Key outputs**:
|
||||
- Memory usage metrics and swap status
|
||||
- CPU count and load per CPU
|
||||
- Filesystems near capacity
|
||||
- Top CPU and memory-consuming processes
|
||||
- Resource-related issues and recommendations
|
||||
|
||||
### 5. Analyze Network Configuration
|
||||
|
||||
**Run condition**: Only if `network` area is selected (or no filters specified)
|
||||
**Detailed implementation**: See `plugins/sosreport/skills/network-analysis/SKILL.md`
|
||||
|
||||
Perform network analysis including:
|
||||
- Network interface configuration and status
|
||||
- Routing table and default gateway
|
||||
- Active connections and listening services
|
||||
- Firewall rules (firewalld/iptables)
|
||||
- DNS configuration and hostname resolution
|
||||
- Network errors from logs
|
||||
|
||||
**Key outputs**:
|
||||
- Interface status with IP addresses
|
||||
- Routing configuration
|
||||
- Connection statistics by state
|
||||
- Firewall configuration summary
|
||||
- DNS and hostname settings
|
||||
- Network-related errors and issues
|
||||
|
||||
### 6. Analyze Installed Packages and System Configuration
|
||||
|
||||
**Run condition**: Only if `system-config` area is selected (or no filters specified)
|
||||
**Detailed implementation**: See `plugins/sosreport/skills/system-config-analysis/SKILL.md`
|
||||
|
||||
Perform system configuration analysis including:
|
||||
- OS version and kernel information
|
||||
- Installed package versions
|
||||
- Systemd service status
|
||||
- Failed services with reasons
|
||||
- SELinux/AppArmor configuration and denials
|
||||
- Kernel parameters and resource limits
|
||||
|
||||
**Key outputs**:
|
||||
- System information summary
|
||||
- Key package versions
|
||||
- Failed services with failure reasons
|
||||
- SELinux status and denial count
|
||||
- Configuration issues and recommendations
|
||||
|
||||
### 7. Generate Interactive Summary
|
||||
|
||||
1. **Create findings structure**
|
||||
- Organize findings by category (Critical, High, Medium, Low, Info)
|
||||
- Include only findings from the selected analysis areas
|
||||
- For each finding, include:
|
||||
- Severity level
|
||||
- Category (logs, resources, network, system-config)
|
||||
- Description of the issue
|
||||
- Evidence (file paths, log snippets, metrics)
|
||||
- Recommended actions
|
||||
|
||||
2. **Display summary in terminal**
|
||||
- Show executive summary with key statistics
|
||||
- List critical and high-severity findings
|
||||
- Provide file paths for detailed investigation
|
||||
- Include timeline of significant events
|
||||
- Suggest next steps for troubleshooting
|
||||
|
||||
3. **Format output**
|
||||
```
SOSREPORT ANALYSIS SUMMARY
==========================

System: {hostname}
Report Date: {date}
OS: {os_version}
Kernel: {kernel_version}

CRITICAL ISSUES (count)
-----------------------
- [Issue description with file reference]

HIGH PRIORITY (count)
---------------------
- [Issue description with file reference]

MEDIUM PRIORITY (count)
-----------------------
- [Issue description with file reference]

RESOURCE SUMMARY
----------------
- Memory: X GB used / Y GB total (Z% used)
- Disk: Most full filesystem at X%
- Load Average: X.XX, X.XX, X.XX

TOP ERRORS IN LOGS
------------------
1. [Error message] (count occurrences)
2. [Error message] (count occurrences)

FAILED SERVICES
---------------
- [service name]: [reason]

RECOMMENDATIONS
---------------
1. [Actionable recommendation]
2. [Actionable recommendation]

ANALYSIS LOCATION
-----------------
Extracted to: {extraction_path}
```
|
||||
|
||||
4. **Interactive drill-down**
|
||||
- Offer to explore specific areas in more detail
|
||||
- Allow user to ask follow-up questions about findings
|
||||
- Provide file paths for manual investigation
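
As one way to picture the findings structure from step 1, the sketch below groups findings by severity before display. The `severity|category|description` record format and the `print_findings` name are assumptions made for this example; the plugin does not prescribe a storage format, and descriptions containing `|` would need a different delimiter.

```bash
# Hypothetical helper: read "severity|category|description" records on stdin
# and print them grouped by severity, most severe first.
# Records whose severity is not one of the five known levels are dropped.
print_findings() {
    awk -F'|' '
        { items[$1] = items[$1] "- [" $2 "] " $3 "\n"; count[$1]++ }
        END {
            n = split("Critical High Medium Low Info", order, " ")
            for (i = 1; i <= n; i++) {
                level = order[i]
                if (count[level] > 0) {
                    printf "%s (%d)\n%s", toupper(level), count[level], items[level]
                }
            }
        }
    '
}
```

For example, `printf 'High|network|eth1 is DOWN\n' | print_findings` prints a `HIGH (1)` heading followed by the finding.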
|
||||
|
||||
## Return Value
|
||||
|
||||
- **Format**: Interactive summary displayed in terminal with categorized findings
|
||||
- **Exit code**:
|
||||
- 0 if analysis completes successfully
|
||||
- 1 if sosreport path is invalid
|
||||
- 2 if sosreport structure is malformed
|
||||
|
||||
## Examples
|
||||
|
||||
1. **Comprehensive analysis (default)**:
|
||||
```bash
|
||||
/sosreport:analyze /tmp/sosreport-server01-2024-01-15.tar.xz
|
||||
```
|
||||
|
||||
Extracts archive to `.work/sosreport-analyze/{timestamp}/` and performs comprehensive analysis using all skills (logs, resources, network, system-config).
|
||||
|
||||
2. **Analyze only logs and network**:
|
||||
```bash
|
||||
/sosreport:analyze /tmp/sosreport-server01-2024-01-15.tar.xz --only logs,network
|
||||
```
|
||||
|
||||
Performs only log analysis and network analysis. Useful when investigating connectivity or service issues without needing full resource analysis.
|
||||
|
||||
3. **Skip resource analysis**:
|
||||
```bash
|
||||
/sosreport:analyze /tmp/sosreport.tar.gz --skip resources
|
||||
```
|
||||
|
||||
Performs all analysis except resource analysis. Useful when you already know resource metrics and want to focus on configuration and logs.
|
||||
|
||||
4. **Quick log-only analysis**:
|
||||
```bash
|
||||
/sosreport:analyze /tmp/sosreport.tar.xz --only logs
|
||||
```
|
||||
|
||||
Performs only log analysis. Fastest option for quickly identifying errors and critical events without analyzing configuration or resources.
|
||||
|
||||
5. **Analyze extracted sosreport directory**:
|
||||
```bash
|
||||
/sosreport:analyze /tmp/sosreport-server01-2024-01-15/
|
||||
```
|
||||
|
||||
Analyzes an already extracted sosreport directory with comprehensive analysis.
|
||||
|
||||
6. **Selective analysis on extracted directory**:
|
||||
```bash
|
||||
/sosreport:analyze /tmp/sosreport-server01-2024-01-15/ --only system-config,network
|
||||
```
|
||||
|
||||
Analyzes only system configuration and network from an already extracted directory.
|
||||
|
||||
7. **Follow-up investigation**:
|
||||
```
User: /sosreport:analyze /tmp/sosreport.tar.gz --only logs
Agent: [Shows log analysis summary]
User: Can you now analyze the resources as well?
Agent: /sosreport:analyze /tmp/sosreport.tar.gz --only resources
Agent: [Shows resource analysis]
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Sosreport structure varies by OS version and sosreport version
|
||||
- Command handles both compressed archives and extracted directories
|
||||
- Analysis focuses on common issues but can be extended for specific use cases
|
||||
- For OpenShift/Kubernetes sosreports, additional pod/container analysis may be relevant
|
||||
- Large sosreports (>1GB) may take several minutes to analyze
|
||||
- **Selective analysis**: Use `--only` or `--skip` to run specific analysis areas for faster results
|
||||
- **Performance**: Running only needed analysis areas reduces analysis time significantly
|
||||
- **Valid areas**: `logs`, `resources`, `network`, `system-config`
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. **tar utility**: Required for extracting compressed sosreports
|
||||
- Check: `which tar`
|
||||
- Usually pre-installed on Linux/macOS
|
||||
|
||||
2. **Sufficient disk space**: Extracted sosreports can be large
|
||||
- Check available space: `df -h .work/`
|
||||
- Recommend at least 2x the compressed archive size
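
The 2x rule of thumb can be checked before extraction. The following is a rough sketch under stated assumptions: `has_space_for` is a hypothetical name, and `du -k` reports allocated blocks rather than the exact file size, which is close enough for a pre-flight check.

```bash
# Hypothetical pre-flight check: succeed only if target_dir has at least
# twice the archive size available.
# Usage: has_space_for <archive> <target_dir>
has_space_for() {
    archive=$1
    target_dir=$2
    [ -f "$archive" ] || return 1
    # Archive size in KiB.
    archive_kb=$(du -k "$archive" | awk '{print $1}')
    # Available KiB on the filesystem holding target_dir (-P keeps one line per filesystem).
    avail_kb=$(df -Pk "$target_dir" | awk 'NR == 2 {print $4}')
    if [ -z "$archive_kb" ] || [ -z "$avail_kb" ]; then
        return 1
    fi
    [ "$avail_kb" -ge $((archive_kb * 2)) ]
}
```

A caller might run `has_space_for /tmp/sosreport.tar.xz . || echo "Not enough free space"` before creating the extraction directory.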
|
||||
|
||||
## See Also
|
||||
|
||||
### Analysis Skills
|
||||
- **Logs Analysis**: `plugins/sosreport/skills/logs-analysis/SKILL.md` - Detailed guidance for analyzing system and application logs
|
||||
- **Resource Analysis**: `plugins/sosreport/skills/resource-analysis/SKILL.md` - Detailed guidance for analyzing memory, CPU, disk, and processes
|
||||
- **Network Analysis**: `plugins/sosreport/skills/network-analysis/SKILL.md` - Detailed guidance for analyzing network configuration and connectivity
|
||||
- **System Configuration Analysis**: `plugins/sosreport/skills/system-config-analysis/SKILL.md` - Detailed guidance for analyzing packages, services, and security settings
|
||||
|
||||
### External Resources
|
||||
- Sosreport documentation: <https://github.com/sosreport/sos>
|
||||
- Red Hat sosreport guide: <https://access.redhat.com/solutions/3592>
|
||||
61
plugin.lock.json
Normal file
@@ -0,0 +1,61 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:openshift-eng/ai-helpers:plugins/sosreport",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "0eb867b3b69859b6b81a01e8b9c6b741280b674f",
    "treeHash": "da22225890956bf3249b598dc62c1d1e1449cb67fb03095f655595326bd61251",
    "generatedAt": "2025-11-28T10:27:29.036176Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "sosreport",
    "description": "Analyze sosreport archives for system diagnostics and troubleshooting",
    "version": "0.0.1"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "4e841bb190435f9010865a47307d28b8b167c9c677bddb64ef6b413f691d93fe"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "5797408d5f22c3633b85f0b478986cd5c49cac281c75bf155e67d5f871be69c5"
      },
      {
        "path": "commands/analyze.md",
        "sha256": "c97f8211ef0409b4bf2d0a82b66bfbfb18fca749599cf9bbf98787f1e0d9f93b"
      },
      {
        "path": "skills/resource-analysis/SKILL.md",
        "sha256": "2addd3387384eff8437cfa5770dd3a45d7167f71359c8099f03963454077f711"
      },
      {
        "path": "skills/logs-analysis/SKILL.md",
        "sha256": "e5bd9365de7065ad6862797bd9e566b18e8ea2449d3111173130a5950063d050"
      },
      {
        "path": "skills/network-analysis/SKILL.md",
        "sha256": "34de4e01b28a6f5fc17f7650360c5928251a1d4c256dd4ee4f86f722406ebf92"
      },
      {
        "path": "skills/system-config-analysis/SKILL.md",
        "sha256": "3b5083ced0943d488f1e68ac31cb15de9fdc047e47e42903d158d39a391b11a4"
      }
    ],
    "dirSha256": "da22225890956bf3249b598dc62c1d1e1449cb67fb03095f655595326bd61251"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}
343
skills/logs-analysis/SKILL.md
Normal file
@@ -0,0 +1,343 @@
|
||||
---
|
||||
name: Logs Analysis
|
||||
description: Analyze system and application log data from sosreport archives, extracting error patterns, kernel panics, OOM events, service failures, and application crashes from journald logs and traditional log files within the sosreport directory structure to identify root causes of system failures and issues
|
||||
---
|
||||
|
||||
# Logs Analysis Skill
|
||||
|
||||
This skill provides detailed guidance for analyzing logs from sosreport archives, including journald logs, system logs, kernel messages, and application logs.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Running the log analysis phase of the `/sosreport:analyze` command
|
||||
- Investigating specific log-related errors or warnings in a sosreport
|
||||
- Performing deep-dive analysis of system failures from logs
|
||||
- Identifying patterns and root causes in system logs
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Sosreport archive must be extracted to a working directory
|
||||
- Path to the sosreport root directory must be known
|
||||
- Basic understanding of Linux log structure and journald
|
||||
|
||||
## Key Log Locations in Sosreport
|
||||
|
||||
Sosreports contain logs in several locations:
|
||||
|
||||
1. **Journald logs**: `sos_commands/logs/journalctl_*`
|
||||
- `journalctl_--no-pager_--boot` - Current boot logs
|
||||
- `journalctl_--no-pager` - All available logs
|
||||
- `journalctl_--no-pager_--priority_err` - Error priority logs
|
||||
|
||||
2. **Traditional system logs**: `var/log/`
|
||||
- `messages` - System-level messages
|
||||
- `dmesg` - Kernel ring buffer
|
||||
- `secure` - Authentication and security logs
|
||||
- `cron` - Cron job logs
|
||||
|
||||
3. **Application logs**: `var/log/` (varies by application)
|
||||
- `httpd/` - Apache logs
|
||||
- `nginx/` - Nginx logs
|
||||
- `audit/audit.log` - Linux audit logs (including SELinux AVC denials)
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
### Step 1: Identify Available Log Sources
|
||||
|
||||
1. **Check for journald logs**:
|
||||
```bash
|
||||
ls -la sos_commands/logs/journalctl_* 2>/dev/null || echo "No journald logs found"
|
||||
```
|
||||
|
||||
2. **Check for traditional system logs**:
|
||||
```bash
|
||||
ls -la var/log/{messages,dmesg,secure} 2>/dev/null | grep . || echo "No traditional logs found"
|
||||
```
|
||||
|
||||
3. **Identify application-specific logs**:
|
||||
```bash
|
||||
find var/log/ -type f -name "*.log" 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
### Step 2: Analyze Journald Logs
|
||||
|
||||
1. **Parse journalctl output for error patterns**:
|
||||
```bash
|
||||
# Look for common error indicators
|
||||
grep -iE "(error|failed|failure|critical|panic|segfault|oom)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -100
|
||||
```
|
||||
|
||||
2. **Identify OOM (Out of Memory) killer events**:
|
||||
```bash
|
||||
grep -i "out of memory\|oom.*kill" sos_commands/logs/journalctl_--no-pager 2>/dev/null
|
||||
```
|
||||
|
||||
3. **Find kernel panics**:
|
||||
```bash
|
||||
grep -i "kernel panic\|bug:\|oops:" sos_commands/logs/journalctl_--no-pager 2>/dev/null
|
||||
```
|
||||
|
||||
4. **Check for segmentation faults**:
|
||||
```bash
|
||||
grep -i "segfault\|sigsegv\|core dump" sos_commands/logs/journalctl_--no-pager 2>/dev/null
|
||||
```
|
||||
|
||||
5. **Extract service failures**:
|
||||
```bash
|
||||
grep -i "failed to start\|failed with result" sos_commands/logs/journalctl_--no-pager 2>/dev/null
|
||||
```
|
||||
|
||||
### Step 3: Analyze System Logs (var/log)
|
||||
|
||||
1. **Check messages for errors**:
|
||||
```bash
# If file exists and is readable
if [ -f var/log/messages ]; then
    grep -iE "(error|failed|failure|critical)" var/log/messages | tail -100
fi
```
|
||||
|
||||
2. **Check dmesg for hardware issues**:
|
||||
```bash
if [ -f var/log/dmesg ]; then
    grep -iE "(error|fail|warning|i/o error|bad sector)" var/log/dmesg
fi
```
|
||||
|
||||
3. **Analyze authentication logs**:
|
||||
```bash
if [ -f var/log/secure ]; then
    grep -iE "(failed|failure|invalid|denied)" var/log/secure | tail -50
fi
```
|
||||
|
||||
### Step 4: Count and Categorize Errors
|
||||
|
||||
1. **Count errors by severity**:
|
||||
```bash
# grep -c already prints 0 (and exits non-zero) when nothing matches,
# so only a missing file needs a fallback value.

# Critical errors
c=$(grep -icE "critical|panic|fatal" sos_commands/logs/journalctl_--no-pager 2>/dev/null); echo "${c:-0}"

# Errors
c=$(grep -ic "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null); echo "${c:-0}"

# Warnings
c=$(grep -icE "warning|warn" sos_commands/logs/journalctl_--no-pager 2>/dev/null); echo "${c:-0}"
```
|
||||
|
||||
2. **Find most frequent error messages**:
|
||||
```bash
grep -iE "(error|failed)" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
    sed 's/^.*\]: //' | \
    sort | uniq -c | sort -rn | head -10
```
|
||||
|
||||
3. **Extract timestamps for error timeline**:
|
||||
```bash
# Get first and last error timestamps
grep -i "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
    head -1 | awk '{print $1, $2, $3}'
grep -i "error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | \
    tail -1 | awk '{print $1, $2, $3}'
```
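
The severity counts in this step can be wrapped in a small reusable helper. This is a sketch only: `count_matches` and `severity_counts` are hypothetical names, and the patterns mirror the ones used in this step. Note that `grep -c` itself prints `0` (and exits with status 1) when no line matches, so the fallback is only needed for missing or unreadable files.

```bash
# Hypothetical helper: count lines matching an extended regex, ignoring case.
# Prints 0 when the file is missing or unreadable.
count_matches() {
    pattern=$1
    file=$2
    count=$(grep -icE "$pattern" "$file" 2>/dev/null) || true
    printf '%s\n' "${count:-0}"
}

# Print one "label count" line per severity bucket for a journal text file.
# Counts are line-based substring matches, so they are estimates, not exact totals.
severity_counts() {
    journal=$1
    printf 'critical %s\n' "$(count_matches 'critical|panic|fatal' "$journal")"
    printf 'error %s\n' "$(count_matches 'error' "$journal")"
    printf 'warning %s\n' "$(count_matches 'warn' "$journal")"
}
```

Usage from the sosreport root: `severity_counts sos_commands/logs/journalctl_--no-pager`.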
|
||||
|
||||
### Step 5: Analyze Application-Specific Logs
|
||||
|
||||
1. **Identify application logs**:
|
||||
```bash
|
||||
find var/log/ -type f \( -name "*.log" -o -name "*_log" \) 2>/dev/null
|
||||
```
|
||||
|
||||
2. **Check for stack traces and exceptions**:
|
||||
```bash
|
||||
# Python tracebacks
|
||||
grep -A 10 "Traceback (most recent call last)" var/log/*.log 2>/dev/null | head -50
|
||||
|
||||
# Java exceptions
|
||||
grep -B 2 -A 10 "Exception\|Error:" var/log/*.log 2>/dev/null | head -50
|
||||
```
|
||||
|
||||
3. **Look for common application errors**:
|
||||
```bash
|
||||
# Database connection errors
|
||||
grep -i "connection.*refused\|connection.*timeout\|database.*error" var/log/*.log 2>/dev/null
|
||||
|
||||
# HTTP/API errors
|
||||
grep -E "HTTP [45][0-9]{2}|status.*[45][0-9]{2}" var/log/*.log 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
### Step 6: Generate Log Analysis Summary
|
||||
|
||||
Create a structured summary with the following information:
|
||||
|
||||
1. **Error Statistics**:
|
||||
- Total critical errors
|
||||
- Total errors
|
||||
- Total warnings
|
||||
- Time range of errors (first to last)
|
||||
|
||||
2. **Critical Findings**:
|
||||
- Kernel panics (with timestamps)
|
||||
- OOM killer events (with victim processes)
|
||||
- Segmentation faults (with process names)
|
||||
- Service failures (with service names)
|
||||
|
||||
3. **Top Error Messages** (sorted by frequency):
|
||||
- Error message
|
||||
- Count
|
||||
- First occurrence timestamp
|
||||
- Affected component/service
|
||||
|
||||
4. **Application-Specific Issues**:
|
||||
- Stack traces found
|
||||
- Database errors
|
||||
- Network/connectivity errors
|
||||
- Authentication failures
|
||||
|
||||
5. **Log File Locations**:
|
||||
- Provide paths to specific log files for manual investigation
|
||||
- Indicate which logs contain the most relevant information
|
||||
|
||||
## Error Handling
|
||||
|
||||
1. **Missing log files**:
|
||||
- If journalctl logs are missing, fall back to var/log/* files
|
||||
- If traditional logs are missing, document this in the summary
|
||||
- Some sosreports may have limited logs due to collection parameters
|
||||
|
||||
2. **Large log files**:
|
||||
- For files larger than 100MB, sample the beginning and end
|
||||
- Use `head -n 10000` and `tail -n 10000` to avoid memory issues
|
||||
- Inform user that analysis is based on sampling
|
||||
|
||||
3. **Compressed logs**:
|
||||
- Check for `.gz` files in `var/log/`
|
||||
- Use `zgrep` instead of `grep` for compressed files
|
||||
- Example: `zgrep -i "error" var/log/messages*.gz`
|
||||
|
||||
4. **Binary log formats**:
|
||||
- Some logs may be in binary format (e.g., journald binary logs)
|
||||
- Rely on `sos_commands/logs/journalctl_*` text outputs
|
||||
- Do not attempt to parse binary files directly
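
The fallbacks above can be combined into one search helper. The sketch below is illustrative rather than the plugin's implementation: `search_log`, `MAX_BYTES`, and `SAMPLE_LINES` are hypothetical names, with defaults taken from the 100MB and 10000-line figures in this section.

```bash
# Hypothetical helper: search one log file for an extended regex while applying
# the fallbacks above (skip missing files, zgrep for .gz, sample large files).
MAX_BYTES=$((100 * 1024 * 1024))
SAMPLE_LINES=10000

search_log() {
    pattern=$1
    file=$2
    # A missing log is documented by the caller, not treated as an error here.
    [ -f "$file" ] || return 0
    case "$file" in
        *.gz)
            # "No match" is not an error for this helper.
            zgrep -iE "$pattern" "$file" || true
            return 0
            ;;
    esac
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$MAX_BYTES" ]; then
        echo "Note: sampling first and last $SAMPLE_LINES lines of $file" >&2
        { head -n "$SAMPLE_LINES" "$file"; tail -n "$SAMPLE_LINES" "$file"; } |
            grep -iE "$pattern" || true
    else
        grep -iE "$pattern" "$file" || true
    fi
    return 0
}
```

The sampling notice goes to stderr so that it reaches the user without polluting the matched lines on stdout.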
|
||||
|
||||
## Output Format
|
||||
|
||||
The log analysis should produce:
|
||||
|
||||
```
LOG ANALYSIS SUMMARY
====================

Time Range: {first_log_entry} to {last_log_entry}

ERROR STATISTICS
----------------
Critical: {count}
Errors: {count}
Warnings: {count}

CRITICAL FINDINGS
-----------------
Kernel Panics: {count}
- {timestamp}: {panic_message}

OOM Killer Events: {count}
- {timestamp}: Killed {process_name} (PID: {pid})

Segmentation Faults: {count}
- {timestamp}: {process_name} segfaulted

Service Failures: {count}
- {service_name}: {failure_reason}

TOP ERROR MESSAGES
------------------
1. [{count}x] {error_message}
First seen: {timestamp}
Component: {component}

2. [{count}x] {error_message}
First seen: {timestamp}
Component: {component}

APPLICATION ERRORS
------------------
Stack Traces: {count} found in {log_files}
Database Errors: {count}
Network Errors: {count}
Auth Failures: {count}

LOG FILES FOR INVESTIGATION
---------------------------
- Primary: {sosreport_path}/sos_commands/logs/journalctl_--no-pager
- System: {sosreport_path}/var/log/messages
- Kernel: {sosreport_path}/var/log/dmesg
- Security: {sosreport_path}/var/log/secure
- Application: {sosreport_path}/var/log/{app_specific}

RECOMMENDATIONS
---------------
1. {actionable_recommendation_based_on_findings}
2. {actionable_recommendation_based_on_findings}
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: OOM Killer Analysis
|
||||
|
||||
```bash
# Detect OOM events
grep -B 5 -A 15 "Out of memory" sos_commands/logs/journalctl_--no-pager

# Output interpretation:
# - Which process was killed
# - Memory state at the time
# - What triggered the OOM
```
|
||||
|
||||
### Example 2: Service Failure Pattern
|
||||
|
||||
```bash
# Find failed services
grep -iE "failed to start|failed with result" sos_commands/logs/journalctl_--no-pager | \
    sed 's/^.*\]: //' | sort | uniq -c | sort -rn

# Counts identical failure messages, so the most frequently failing services come first
```
|
||||
|
||||
### Example 3: Timeline of Errors
|
||||
|
||||
```bash
# Create error timeline
grep -i "error\|fail" sos_commands/logs/journalctl_--no-pager | \
    awk '{print $1, $2, $3}' | sort | uniq -c

# Shows error frequency over time
```
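
Grouping by the full timestamp yields one bucket per second, which can be too fine for spotting bursts. A coarser variant could bucket by hour, as sketched below. It assumes syslog-style lines such as `Jan 15 10:23:45 host ...` (month, day, and time in the first three fields); `errors_per_hour` is a hypothetical name, and other journalctl output formats would need a different parser.

```bash
# Hypothetical helper: count error/failure lines per hour, in log order.
# Assumes the first three fields are month, day, and HH:MM:SS.
errors_per_hour() {
    grep -iE "error|fail" "$1" 2>/dev/null |
        awk '{ split($3, t, ":"); print $1, $2, t[1] ":00" }' |
        uniq -c
}
```

Skipping `sort` keeps the buckets in chronological order, which relies on the log itself being written in time order.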
|
||||
|
||||
## Tips for Effective Analysis
|
||||
|
||||
1. **Start with critical errors**: Focus on panics, OOMs, and segfaults first
|
||||
2. **Look for patterns**: Repeated errors often indicate systemic issues
|
||||
3. **Check timestamps**: Correlate errors with the reported incident time
|
||||
4. **Consider context**: Read surrounding log lines for context
|
||||
5. **Cross-reference**: Correlate log findings with resource analysis
|
||||
6. **Be thorough**: Check both journald and traditional logs
|
||||
7. **Document findings**: Note file paths and line numbers for reference
|
||||
|
||||
## Common Log Patterns to Look For
|
||||
|
||||
1. **OOM Killer**: "Out of memory: Kill process" (newer kernels log "Out of memory: Killed process") → Memory pressure issue
|
||||
2. **Segfault**: "segfault at" → Application crash, possible bug
|
||||
3. **I/O Error**: "I/O error" in dmesg → Hardware or filesystem issue
|
||||
4. **Connection Refused**: "Connection refused" → Service not running or firewall
|
||||
5. **Permission Denied**: "Permission denied" → SELinux, file permissions, or ACL issue
|
||||
6. **Timeout**: "timeout" → Network or resource contention
|
||||
7. **Failed to start**: "Failed to start" → Service configuration or dependency issue
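
The pattern list above can be turned into a simple lookup. This is a hedged sketch: `likely_cause` is a hypothetical helper, matching is by case-insensitive substring, and the first matching pattern wins, so the result is a hint for where to look rather than a diagnosis.

```bash
# Hypothetical helper: map a log line to the likely cause from the list above.
likely_cause() {
    line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$line" in
        *"out of memory"*)      echo "memory pressure" ;;
        *"segfault at"*)        echo "application crash" ;;
        *"i/o error"*)          echo "hardware or filesystem issue" ;;
        *"connection refused"*) echo "service not running or firewall" ;;
        *"permission denied"*)  echo "SELinux, file permissions, or ACL" ;;
        *"failed to start"*)    echo "service configuration or dependency" ;;
        *timeout*)              echo "network or resource contention" ;;
        *)                      echo "unclassified" ;;
    esac
}
```

The generic `timeout` pattern is checked last so that more specific messages are classified first.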
|
||||
|
||||
## See Also
|
||||
|
||||
- Resource Analysis Skill: For correlating log errors with resource constraints
|
||||
- System Configuration Analysis Skill: For investigating service failures
|
||||
- Network Analysis Skill: For investigating connectivity errors
|
||||
507
skills/network-analysis/SKILL.md
Normal file
@@ -0,0 +1,507 @@
|
||||
---
|
||||
name: Network Analysis
|
||||
description: Analyze network configuration data from sosreport archives, extracting interface configurations, routing tables, active connections, firewall rules (firewalld/iptables), and DNS settings from the sosreport directory structure to diagnose network connectivity and configuration issues
|
||||
---
|
||||
|
||||
# Network Analysis Skill
|
||||
|
||||
This skill provides detailed guidance for analyzing network configuration and connectivity from sosreport archives, including interfaces, routing, firewall rules, and DNS configuration.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Running the network analysis phase of the `/sosreport:analyze` command
|
||||
- Investigating network connectivity issues
|
||||
- Diagnosing firewall or routing problems
|
||||
- Verifying network configuration
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Sosreport archive must be extracted to a working directory
|
||||
- Path to the sosreport root directory must be known
|
||||
- Understanding of Linux networking concepts
|
||||
|
||||
## Key Network Data Locations in Sosreport
|
||||
|
||||
1. **Network Interfaces**:
|
||||
- `sos_commands/networking/ip_-o_addr` - IP addresses
|
||||
- `sos_commands/networking/ip_link` - Link status
|
||||
- `sos_commands/networking/ip_-s_link` - Link statistics with errors
|
||||
- `etc/sysconfig/network-scripts/` - Network configuration files (RHEL)
|
||||
|
||||
2. **Routing**:
|
||||
- `sos_commands/networking/ip_route` - Routing table
|
||||
- `sos_commands/networking/ip_-6_route` - IPv6 routing table
|
||||
- `proc/net/route` - Kernel routing table
|
||||
|
||||
3. **Network Connections**:
|
||||
- `sos_commands/networking/netstat_-neopa` - Active connections
|
||||
- `sos_commands/networking/ss_-tupna` - Socket statistics
|
||||
- `proc/net/tcp` - TCP connections
|
||||
- `proc/net/udp` - UDP connections
|
||||
|
||||
4. **Firewall**:
|
||||
- `sos_commands/firewalld/` - Firewalld configuration
|
||||
- `sos_commands/iptables/iptables_-vnxL` - iptables rules
|
||||
- `sos_commands/nftables/` - nftables configuration
|
||||
|
||||
5. **DNS and Resolution**:
|
||||
- `etc/resolv.conf` - DNS servers
|
||||
- `etc/hosts` - Static hostname mappings
|
||||
- `etc/nsswitch.conf` - Name resolution order
|
||||
|
||||
6. **Network Services**:
|
||||
- `sos_commands/networking/networkmanager_info` - NetworkManager status
|
||||
- `systemctl status NetworkManager` output
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
### Step 1: Analyze Network Interfaces
|
||||
|
||||
1. **List all network interfaces**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/ip_-o_addr ]; then
|
||||
cat sos_commands/networking/ip_-o_addr
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Check interface states**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/ip_link ]; then
|
||||
# Look for interface states (UP/DOWN)
|
||||
grep -E "^[0-9]+:" sos_commands/networking/ip_link
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Parse interface information**:
|
||||
- Interface name (eth0, ens192, etc.)
|
||||
- State (UP/DOWN)
|
||||
- IP addresses (IPv4 and IPv6)
|
||||
- MAC address
|
||||
- MTU size
|
||||
|
||||
4. **Check for interface errors**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/ip_-s_link ]; then
|
||||
# Look for RX/TX errors, drops, overruns
|
||||
cat sos_commands/networking/ip_-s_link
|
||||
fi
|
||||
```
|
||||
|
||||
5. **Identify interface issues**:
|
||||
- Interfaces with no IP address (when expected)
|
||||
- Interfaces in DOWN state (when should be UP)
|
||||
- High error counts (RX/TX errors, drops)
|
||||
- Duplicate IP addresses
|
||||
- MTU mismatches
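
The error-count check above can be sketched with `awk`. This is a minimal sketch under stated assumptions: `iface_errors` is a hypothetical helper name, and it assumes the classic `ip -s link` layout in which each `RX:` / `TX:` header line is followed by a counter row whose third and fourth columns are errors and dropped.

```bash
# Hypothetical helper: list interfaces with non-zero error or drop counters.
# $1: path to saved `ip -s link` output (assumed layout described above).
iface_errors() {
    awk '
        /^[0-9]+: / { iface = $2; sub(/:$/, "", iface); next }   # "2: eth0: ..." header
        /^ *RX:/    { dir = "RX"; want = 1; next }                # counter row follows
        /^ *TX:/    { dir = "TX"; want = 1; next }
        want        { want = 0
                      if ($3 + 0 > 0 || $4 + 0 > 0)
                          printf "%s %s errors=%s dropped=%s\n", iface, dir, $3, $4 }
    ' "$1"
}
```

Run as `iface_errors sos_commands/networking/ip_-s_link`; no output means no interface reported errors or drops.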
|
||||
|
||||
### Step 2: Analyze Routing Configuration
|
||||
|
||||
1. **Check default route**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/ip_route ]; then
|
||||
grep "^default" sos_commands/networking/ip_route || echo "No default route found"
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Review routing table**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/ip_route ]; then
|
||||
cat sos_commands/networking/ip_route
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check IPv6 routing**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/ip_-6_route ]; then
|
||||
cat sos_commands/networking/ip_-6_route
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Identify routing issues**:
|
||||
- Missing default route
|
||||
- Multiple default routes (conflicting)
|
||||
- Incorrect gateway addresses
|
||||
- Route to nowhere (unreachable gateway)
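
The first two routing checks (missing or multiple default routes) can be sketched as follows. `check_default_route` is a hypothetical helper name, and the status labels simply reuse the OK/WARNING/CRITICAL vocabulary of this skill. Multiple default routes with different metrics can be intentional, so the sketch only flags them for review.

```bash
# Hypothetical helper: classify the default-route situation in a saved `ip route` file.
check_default_route() {
    local routes=$1 count
    if [ ! -f "$routes" ]; then
        echo "UNKNOWN: $routes not collected"
        return 0
    fi
    # grep -c prints 0 (and exits non-zero) when nothing matches.
    count=$(grep -c '^default' "$routes" || true)
    case "$count" in
        0) echo "CRITICAL: no default route" ;;
        1) echo "OK: $(grep '^default' "$routes")" ;;
        *) echo "WARNING: $count default routes, compare metrics and gateways" ;;
    esac
}
```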
|
||||
|
||||
### Step 3: Analyze Network Connectivity
|
||||
|
||||
1. **Check active connections**:
|
||||
```bash
|
||||
if [ -f sos_commands/networking/netstat_-neopa ]; then
|
||||
cat sos_commands/networking/netstat_-neopa
|
||||
elif [ -f sos_commands/networking/ss_-tupna ]; then
|
||||
cat sos_commands/networking/ss_-tupna
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Count connections by state**:
|
||||
```bash
|
||||
# Count TCP connection states
|
||||
if [ -f sos_commands/networking/netstat_-neopa ]; then
|
||||
grep "^tcp" sos_commands/networking/netstat_-neopa | awk '{print $6}' | sort | uniq -c
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Find listening services**:
|
||||
```bash
|
||||
# Show what's listening on which ports
|
||||
if [ -f sos_commands/networking/netstat_-neopa ]; then
|
||||
grep "LISTEN" sos_commands/networking/netstat_-neopa
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Check for connection issues**:
|
||||
- Excessive TIME_WAIT connections
|
||||
- Many connections in SYN_SENT (connection attempts failing)
|
||||
- High number of CLOSE_WAIT (application not closing)
|
||||
- Port conflicts (multiple services on same port)
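
The state counts used for these checks can also be taken from the `ss` file when the `netstat` file is absent. The sketch below is one way to do that; `tcp_states` is a hypothetical helper name. Note that `ss` reports states under different names than `netstat` (for example `ESTAB` and `TIME-WAIT` rather than `ESTABLISHED` and `TIME_WAIT`), so the two outputs are not directly comparable.

```bash
# Hypothetical helper: count TCP connections by state, preferring netstat over ss.
# $1: sosreport root directory (defaults to the current directory).
tcp_states() {
    local root=${1:-.}
    local netstat="$root/sos_commands/networking/netstat_-neopa"
    local ss="$root/sos_commands/networking/ss_-tupna"
    if [ -f "$netstat" ]; then
        # netstat columns: Proto Recv-Q Send-Q Local Foreign State ...
        awk '$1 ~ /^tcp/ { print $6 }' "$netstat"
    elif [ -f "$ss" ]; then
        # ss columns: Netid State Recv-Q Send-Q Local Peer ...
        awk '$1 == "tcp" { print $2 }' "$ss"
    fi | sort | uniq -c | sort -rn
}
```

If neither file was collected the function prints nothing, which should be recorded as missing data in the summary.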
|
||||
|
||||
### Step 4: Analyze Firewall Configuration
|
||||
|
||||
1. **Check if firewalld is active**:
|
||||
```bash
|
||||
if [ -d sos_commands/firewalld ]; then
|
||||
# Firewalld is present
|
||||
if [ -f sos_commands/firewalld/firewall-cmd_--list-all-zones ]; then
|
||||
cat sos_commands/firewalld/firewall-cmd_--list-all-zones
|
||||
fi
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Review iptables rules**:
|
||||
```bash
|
||||
if [ -f sos_commands/iptables/iptables_-vnxL ]; then
|
||||
cat sos_commands/iptables/iptables_-vnxL
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check firewall zones and rules**:
|
||||
- Active zones
|
||||
- Allowed services
|
||||
- Allowed ports
|
||||
- Rich rules
|
||||
- Drop/reject policies
|
||||
|
||||
4. **Identify firewall issues**:
|
||||
- Required ports blocked
|
||||
- Overly permissive rules (any any accept)
|
||||
- Conflicting rules
|
||||
- Missing rules for services
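
To compare required ports against what the firewall allows, the services and ports of the active zones can be pulled out of the zone listing. This is a minimal sketch: `active_zone_rules` is a hypothetical helper name, and it assumes the usual `firewall-cmd --list-all-zones` layout, where each zone name starts at column 0 (with an `(active)` marker when in use) and its attributes are indented below it.

```bash
# Hypothetical helper: print allowed services and ports for active firewalld zones.
# $1: path to saved `firewall-cmd --list-all-zones` output.
active_zone_rules() {
    awk '
        # A non-indented line starts a new zone; remember whether it is active.
        /^[^ \t]/ { active = ($0 ~ /\(active\)/); zone = $1; next }
        active && $1 == "services:" { $1 = ""; print zone " services:" $0 }
        active && $1 == "ports:"    { $1 = ""; print zone " ports:" $0 }
    ' "$1"
}
```

Rich rules and direct rules are not covered by this sketch and still need a manual read.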
|
||||
|
||||
### Step 5: Analyze DNS Configuration
|
||||
|
||||
1. **Check DNS servers**:
|
||||
```bash
|
||||
if [ -f etc/resolv.conf ]; then
|
||||
cat etc/resolv.conf
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Review /etc/hosts**:
|
||||
```bash
|
||||
if [ -f etc/hosts ]; then
|
||||
# Show non-comment, non-empty lines
|
||||
grep -v "^#\|^$" etc/hosts
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check hostname resolution**:
|
||||
```bash
|
||||
# Check hostname
|
||||
if [ -f hostname ]; then
|
||||
cat hostname
|
||||
fi
|
||||
|
||||
# Check the static hostname from etc/hostname (may be a short name or an FQDN)
|
||||
if [ -f etc/hostname ]; then
|
||||
cat etc/hostname
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Verify nsswitch configuration**:
|
||||
```bash
|
||||
if [ -f etc/nsswitch.conf ]; then
|
||||
grep "^hosts:" etc/nsswitch.conf
|
||||
fi
|
||||
```
|
||||
|
||||
5. **Identify DNS issues**:
|
||||
- No DNS servers configured
|
||||
- Unreachable DNS servers (check connectivity in logs)
|
||||
- Incorrect search domains
|
||||
- Hostname resolution failures in logs
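
The first and third checks above can be sketched by extracting the resolver settings into the two lines used in the summary. `dns_summary` is a hypothetical helper name; it reads only `nameserver`, `search`, and `domain` entries and does not test whether the servers were reachable.

```bash
# Hypothetical helper: summarize nameservers and search domains from a saved resolv.conf.
dns_summary() {
    local conf=$1 servers search
    if [ ! -f "$conf" ]; then
        echo "UNKNOWN: $conf not collected"
        return 0
    fi
    # Join the values with commas; an empty result means nothing was configured.
    servers=$(awk '$1 == "nameserver" { print $2 }' "$conf" | paste -sd, -)
    search=$(awk '$1 == "search" || $1 == "domain" { for (i = 2; i <= NF; i++) print $i }' "$conf" | paste -sd, -)
    echo "DNS Servers: ${servers:-none configured}"
    echo "Search Domains: ${search:-none}"
}
```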
|
||||
|
||||
### Step 6: Check for Network Errors in Logs
|
||||
|
||||
1. **Look for network-related errors**:
|
||||
```bash
|
||||
# Connection refused errors
|
||||
grep -i "connection refused" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
|
||||
# Timeout errors
|
||||
grep -i "timeout\|timed out" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
|
||||
# Network unreachable
|
||||
grep -i "network.*unreachable\|no route to host" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
|
||||
# DNS resolution failures
|
||||
grep -i "could not resolve\|dns.*fail\|name resolution" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
2. **Check for link state changes**:
|
||||
```bash
|
||||
grep -i "link.*up\|link.*down\|carrier.*lost" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
3. **Look for network device errors**:
|
||||
```bash
|
||||
grep -iE "network.*error|(eth|en[opsx])[0-9a-z]*.*error|transmit.*error" var/log/dmesg 2>/dev/null
|
||||
```
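
For the summary, the same patterns can be turned into per-category counts. The sketch below reuses the search patterns from this step; `count_net_errors` is a hypothetical helper name, and the counts are matching lines, so one line that matches a pattern twice is counted once.

```bash
# Hypothetical helper: count network-related log lines by category.
# $1: path to a saved journal, e.g. sos_commands/logs/journalctl_--no-pager
count_net_errors() {
    local log=$1
    if [ ! -f "$log" ]; then
        echo "UNKNOWN: $log not collected"
        return 0
    fi
    # grep -c counts matching lines and prints 0 when nothing matches.
    echo "Connection Refused: $(grep -ciE 'connection refused' "$log") occurrences"
    echo "Timeouts: $(grep -ciE 'timeout|timed out' "$log") occurrences"
    echo "DNS Failures: $(grep -ciE 'could not resolve|dns.*fail|name resolution' "$log") occurrences"
    echo "Link State Changes: $(grep -ciE 'link.*up|link.*down|carrier.*lost' "$log") occurrences"
}
```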
|
||||
|
||||
### Step 7: Generate Network Analysis Summary
|
||||
|
||||
Create a structured summary with the following sections:
|
||||
|
||||
1. **Interface Summary**:
|
||||
- List of all interfaces with status
|
||||
- IP addresses assigned
|
||||
- Interface errors/drops
|
||||
- Link speeds and duplex settings
|
||||
|
||||
2. **Routing Summary**:
|
||||
- Default gateway
|
||||
- Number of routes
|
||||
- Any routing anomalies
|
||||
|
||||
3. **Connectivity Summary**:
|
||||
- Active connection count by state
|
||||
- Listening services and ports
|
||||
- Connection issues detected
|
||||
|
||||
4. **Firewall Summary**:
|
||||
- Firewall type (firewalld/iptables/nftables)
|
||||
- Active zones (if firewalld)
|
||||
- Key allowed services/ports
|
||||
- Potential blocking rules
|
||||
|
||||
5. **DNS Summary**:
|
||||
- DNS servers configured
|
||||
- Search domains
|
||||
- Hostname configuration
|
||||
- DNS resolution issues
|
||||
|
||||
6. **Network Issues**:
|
||||
- Critical network problems
|
||||
- Warnings and recommendations
|
||||
- Evidence from logs
|
||||
|
||||
## Error Handling
|
||||
|
||||
1. **Missing network files**:
|
||||
- Different sosreport versions may have different file names
|
||||
- Fall back to alternative files (netstat vs ss)
|
||||
- Document missing data in summary
|
||||
|
||||
2. **Multiple network configurations**:
|
||||
- System may use NetworkManager, systemd-networkd, or traditional ifcfg
|
||||
- Identify which is in use and analyze accordingly
|
||||
|
||||
3. **IPv6 presence**:
|
||||
- Check if IPv6 is enabled
|
||||
- Analyze IPv6 configuration if present
|
||||
- Note if IPv6 is disabled when expected
|
||||
|
||||
## Output Format
|
||||
|
||||
The network analysis should produce:
|
||||
|
||||
```text
|
||||
NETWORK CONFIGURATION SUMMARY
|
||||
==============================
|
||||
|
||||
NETWORK INTERFACES
|
||||
------------------
|
||||
Interface: {name}
|
||||
State: {UP|DOWN}
|
||||
IP Addresses: {ipv4}, {ipv6}
|
||||
MAC: {mac_address}
|
||||
MTU: {mtu}
|
||||
RX Errors: {rx_errors} packets, {rx_dropped} dropped
|
||||
TX Errors: {tx_errors} packets, {tx_dropped} dropped
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
|
||||
ROUTING
|
||||
-------
|
||||
Default Gateway: {gateway_ip} via {interface}
|
||||
Total Routes: {count}
|
||||
|
||||
Key Routes:
|
||||
{destination} via {gateway} dev {interface}
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {routing_issue_description}
|
||||
|
||||
CONNECTIVITY
|
||||
------------
|
||||
Total Active Connections: {count}
|
||||
|
||||
Connections by State:
|
||||
ESTABLISHED: {count}
|
||||
TIME_WAIT: {count}
|
||||
CLOSE_WAIT: {count}
|
||||
SYN_SENT: {count}
|
||||
|
||||
Listening Services:
|
||||
{port}/{protocol} - {service_name} (PID {pid})
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {connectivity_issue_description}
|
||||
|
||||
FIREWALL
|
||||
--------
|
||||
Type: {firewalld|iptables|nftables|none}
|
||||
Default Zone: {zone_name} (if firewalld)
|
||||
|
||||
Allowed Services: {service1}, {service2}, ...
|
||||
Allowed Ports: {port1/protocol}, {port2/protocol}, ...
|
||||
|
||||
Active Rules Count: {count}
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Potential Issues:
|
||||
- {firewall_issue_description}
|
||||
|
||||
DNS CONFIGURATION
|
||||
-----------------
|
||||
DNS Servers: {dns1}, {dns2}, {dns3}
|
||||
Search Domains: {domain1}, {domain2}
|
||||
Hostname: {hostname}
|
||||
FQDN: {fqdn}
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {dns_issue_description}
|
||||
|
||||
NETWORK ERRORS FROM LOGS
|
||||
------------------------
|
||||
Connection Refused: {count} occurrences
|
||||
Timeouts: {count} occurrences
|
||||
DNS Failures: {count} occurrences
|
||||
Link State Changes: {count} occurrences
|
||||
|
||||
Recent Network Errors:
|
||||
{timestamp}: {error_message}
|
||||
|
||||
CRITICAL NETWORK ISSUES
|
||||
-----------------------
|
||||
{severity}: {issue_description}
|
||||
Evidence: {file_path_or_log_excerpt}
|
||||
Impact: {impact_description}
|
||||
Recommendation: {remediation_action}
|
||||
|
||||
RECOMMENDATIONS
|
||||
---------------
|
||||
1. {actionable_recommendation}
|
||||
2. {actionable_recommendation}
|
||||
|
||||
DATA SOURCES
|
||||
------------
|
||||
- Interfaces: {sosreport_path}/sos_commands/networking/ip_-o_addr
|
||||
- Routes: {sosreport_path}/sos_commands/networking/ip_route
|
||||
- Connections: {sosreport_path}/sos_commands/networking/netstat_-neopa
|
||||
- Firewall: {sosreport_path}/sos_commands/firewalld/
|
||||
- DNS: {sosreport_path}/etc/resolv.conf
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Interface Analysis
|
||||
|
||||
```bash
|
||||
# Check interface IP addresses
|
||||
$ cat sos_commands/networking/ip_-o_addr
|
||||
1: lo inet 127.0.0.1/8 scope host lo
|
||||
2: eth0 inet 192.168.1.100/24 brd 192.168.1.255 scope global eth0
|
||||
2: eth0 inet6 fe80::a00:27ff:fe4e:66a1/64 scope link
|
||||
|
||||
# Check for errors
|
||||
$ cat sos_commands/networking/ip_-s_link | grep -A 4 "eth0"
|
||||
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
|
||||
RX: bytes packets errors dropped overrun mcast
|
||||
15234567 98234 0 0 0 123
|
||||
TX: bytes packets errors dropped carrier collsns
|
||||
8765432 54321 15 0 0 0
|
||||
|
||||
# Interpretation: eth0 has 15 TX errors out of 54321 packets (~0.03%), below the 1% warning threshold; investigate cable/switch if errors appear elsewhere too
|
||||
```
|
||||
|
||||
### Example 2: Firewall Rule Analysis
|
||||
|
||||
```bash
|
||||
# Check firewalld active zone
|
||||
$ grep -A 20 "public" sos_commands/firewalld/firewall-cmd_--list-all-zones
|
||||
public (active)
|
||||
target: default
|
||||
services: ssh dhcpv6-client http https
|
||||
ports: 8080/tcp 9090/tcp
|
||||
...
|
||||
|
||||
# Interpretation: HTTP/HTTPS allowed, custom ports 8080 and 9090 open
|
||||
```
|
||||
|
||||
### Example 3: Connection State Issues
|
||||
|
||||
```bash
|
||||
# Count connection states
|
||||
$ grep "^tcp" sos_commands/networking/netstat_-neopa | awk '{print $6}' | sort | uniq -c
|
||||
234 ESTABLISHED
|
||||
1523 TIME_WAIT
|
||||
12 CLOSE_WAIT
|
||||
5 SYN_SENT
|
||||
|
||||
# Interpretation:
|
||||
# - TIME_WAIT is normal after a close, but 1523 against 234 ESTABLISHED points to heavy connection churn
# - CLOSE_WAIT suggests an application is not closing its sockets properly
# - SYN_SENT means outbound connection attempts are still unanswered, often an unreachable or filtered peer
|
||||
```
|
||||
|
||||
## Tips for Effective Analysis
|
||||
|
||||
1. **Check interface consistency**: Ensure IP addresses match expected configuration
|
||||
2. **Verify gateway reachability**: The default gateway should fall within a subnet configured on one of the host's interfaces
|
||||
3. **Look for asymmetric routing**: Packets in/out may take different paths
|
||||
4. **Check MTU settings**: MTU mismatches can cause packet fragmentation issues
|
||||
5. **Correlate with logs**: Network errors in logs often explain configuration issues
|
||||
6. **Consider network topology**: Understand expected network layout
|
||||
7. **Check both IPv4 and IPv6**: Review IPv6 addresses, routes, and firewall rules whenever IPv6 is in use
|
||||
|
||||
## Common Network Patterns and Issues
|
||||
|
||||
1. **No default route**: "Network unreachable" errors, can't reach internet
|
||||
2. **Interface down**: "Network is down" errors, no connectivity
|
||||
3. **Duplicate IP**: ARP conflicts, intermittent connectivity
|
||||
4. **Firewall blocking**: "Connection refused/timeout" for specific ports
|
||||
5. **DNS failure**: Can't resolve hostnames, but IP connectivity works
|
||||
6. **Port exhaustion**: Too many TIME_WAIT connections, can't create new connections
|
||||
7. **MTU issues**: Large packets fail, small packets work (PMTUD failure)
|
||||
|
||||
## Network Issue Severity Classification
|
||||
|
||||
| Issue Type | Severity | Impact |
|
||||
|------------|----------|--------|
|
||||
| No network interface | Critical | Complete loss of connectivity |
|
||||
| No default route | Critical | No external connectivity |
|
||||
| Interface errors >1% | Warning | Potential packet loss |
|
||||
| Excessive TIME_WAIT | Warning | May indicate performance issue |
|
||||
| Missing DNS server | Critical | Name resolution failure |
|
||||
| Firewall blocking required port | High | Service unavailable |
|
||||
| IPv6 autoconfiguration failure | Low | IPv6 connectivity issue |
|
||||
|
||||
## See Also
|
||||
|
||||
- Logs Analysis Skill: For detailed network error log analysis
|
||||
- System Configuration Analysis Skill: For network service status
|
||||
- Resource Analysis Skill: For network I/O statistics
|
||||
455
skills/resource-analysis/SKILL.md
Normal file
@@ -0,0 +1,455 @@
|
||||
---
|
||||
name: Resource Analysis
|
||||
description: Analyze system resource usage data from sosreport archives, extracting memory statistics, CPU load averages, disk space utilization, and process information from the sosreport directory structure to diagnose resource exhaustion, performance bottlenecks, and capacity issues
|
||||
---
|
||||
|
||||
# Resource Analysis Skill
|
||||
|
||||
This skill provides detailed guidance for analyzing system resource usage from sosreport archives, including memory, CPU, disk space, and process information.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Running the resource analysis phase of the `/sosreport:analyze` command
|
||||
- Investigating performance issues or resource bottlenecks
|
||||
- Identifying resource exhaustion problems
|
||||
- Correlating resource usage with system failures
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Sosreport archive must be extracted to a working directory
|
||||
- Path to the sosreport root directory must be known
|
||||
- Understanding of Linux resource management
|
||||
|
||||
## Key Resource Data Locations in Sosreport
|
||||
|
||||
1. **Memory Information**:
|
||||
- `sos_commands/memory/free` - Memory usage snapshot
|
||||
- `proc/meminfo` - Detailed memory statistics
|
||||
- `sos_commands/memory/swapon_-s` - Swap usage
|
||||
- `proc/buddyinfo` - Memory fragmentation
|
||||
|
||||
2. **CPU Information**:
|
||||
- `sos_commands/processor/lscpu` - CPU architecture and features
|
||||
- `proc/cpuinfo` - Detailed CPU information
|
||||
- `sos_commands/processor/turbostat` - CPU frequency and power states (if available)
|
||||
- `uptime` - Load averages
|
||||
|
||||
3. **Disk Information**:
|
||||
- `sos_commands/filesys/df_-al` - Filesystem usage
|
||||
- `sos_commands/block/lsblk` - Block device information
|
||||
- `sos_commands/filesys/mount` - Mounted filesystems
|
||||
- `proc/diskstats` - Disk I/O statistics
|
||||
|
||||
4. **Process Information**:
|
||||
- `sos_commands/process/ps_auxwww` - Process list with details
|
||||
- `sos_commands/process/top` - Process snapshot (if available)
|
||||
- `proc/[pid]/` - Per-process information
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
### Step 1: Analyze Memory Usage
|
||||
|
||||
1. **Parse free command output**:
|
||||
```bash
|
||||
# Check if free output exists
|
||||
if [ -f sos_commands/memory/free ]; then
|
||||
cat sos_commands/memory/free
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Extract memory metrics**:
|
||||
```bash
|
||||
# Parse /proc/meminfo for detailed stats
|
||||
if [ -f proc/meminfo ]; then
|
||||
grep -E "^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Dirty|Slab):" proc/meminfo
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Calculate memory usage percentage**:
|
||||
- Total memory = MemTotal
|
||||
- Used memory = MemTotal - MemAvailable
|
||||
- Usage percentage = (Used / Total) * 100
|
||||
- Parse from `free` output or calculate from `meminfo`
|
||||
|
||||
4. **Check for memory pressure indicators**:
|
||||
```bash
|
||||
# Look for OOM events in logs
|
||||
grep -i "out of memory\|oom[-_ ]killer" sos_commands/logs/journalctl_--no-pager 2>/dev/null
|
||||
|
||||
# Check swap usage
|
||||
if [ -f sos_commands/memory/swapon_-s ]; then
|
||||
cat sos_commands/memory/swapon_-s
|
||||
fi
|
||||
```
|
||||
|
||||
5. **Identify memory issues**:
|
||||
- Memory usage > 90% → Critical
|
||||
- Memory usage > 80% → Warning
|
||||
- Heavy swap usage (>50% swap used) → Performance issue
|
||||
- OOM killer events → Critical memory exhaustion
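
The usage calculation and thresholds above can be sketched directly against `proc/meminfo`. `mem_usage` is a hypothetical helper name. It computes used memory as MemTotal minus MemAvailable, as defined in this step, and reports `UNKNOWN` when `MemAvailable` is missing (older kernels do not export it) rather than guessing.

```bash
# Hypothetical helper: memory usage percentage and status from a saved proc/meminfo.
mem_usage() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2; seen = 1 }
        END {
            # Without both values the percentage cannot be computed.
            if (total + 0 == 0 || !seen) { print "UNKNOWN"; exit }
            pct = (total - avail) * 100 / total
            status = pct > 90 ? "CRITICAL" : (pct > 80 ? "WARNING" : "OK")
            printf "%.1f%% %s\n", pct, status
        }
    ' "$1"
}
```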
|
||||
|
||||
### Step 2: Analyze CPU Usage
|
||||
|
||||
1. **Extract CPU information**:
|
||||
```bash
|
||||
# Get CPU count and model
|
||||
if [ -f sos_commands/processor/lscpu ]; then
|
||||
grep -E "^(CPU\(s\)|Model name|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)|CPU MHz):" sos_commands/processor/lscpu
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Check load averages**:
|
||||
```bash
|
||||
# Parse uptime for load averages
|
||||
if [ -f uptime ]; then
|
||||
cat uptime
|
||||
fi
|
||||
|
||||
# Or from proc/loadavg
|
||||
if [ -f proc/loadavg ]; then
|
||||
cat proc/loadavg
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Interpret load averages**:
|
||||
- Load average format: 1-min, 5-min, 15-min
|
||||
- Compare with CPU count from lscpu
|
||||
- Load > CPU count → System overloaded
|
||||
- Load >> CPU count (2x or more) → Critical overload
|
||||
|
||||
4. **Check for CPU throttling**:
|
||||
```bash
|
||||
# Look for thermal throttling in logs
|
||||
grep -i "throttl\|temperature\|thermal" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
5. **Identify CPU issues**:
|
||||
- 1-min load > 2x CPU count → Critical
|
||||
- 5-min load > CPU count → Warning
|
||||
- Thermal throttling present → Hardware/cooling issue
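
The two load thresholds above can be sketched by combining the CPU count from `lscpu` with `proc/loadavg`. `load_status` is a hypothetical helper name, and reading `proc/loadavg` instead of `uptime` is a choice made here only because it is simpler to parse.

```bash
# Hypothetical helper: compare load averages with the CPU count.
# $1: sosreport root directory (defaults to the current directory).
load_status() {
    local root=${1:-.} cpus loads
    cpus=$(awk -F: '/^CPU\(s\):/ { gsub(/[ \t]/, "", $2); print $2; exit }' \
        "$root/sos_commands/processor/lscpu" 2>/dev/null)
    loads=$(awk '{ print $1, $2; exit }' "$root/proc/loadavg" 2>/dev/null)
    if [ -z "$cpus" ] || [ -z "$loads" ]; then
        echo "UNKNOWN: lscpu or proc/loadavg not collected"
        return 0
    fi
    awk -v c="$cpus" -v l="$loads" 'BEGIN {
        split(l, a, " ")
        status = "OK"
        if (a[2] + 0 > c + 0)       status = "WARNING"    # 5-min load above CPU count
        if (a[1] + 0 > 2 * (c + 0)) status = "CRITICAL"   # 1-min load above 2x CPU count
        printf "load1/cpu=%.2f load5/cpu=%.2f %s\n", a[1] / c, a[2] / c, status
    }'
}
```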
|
||||
|
||||
### Step 3: Analyze Disk Usage
|
||||
|
||||
1. **Parse df output for filesystem usage**:
|
||||
```bash
|
||||
if [ -f sos_commands/filesys/df_-al ]; then
|
||||
# Skip header and special filesystems, show only regular filesystems
|
||||
grep -v "^Filesystem\|tmpfs\|devtmpfs\|overlay" sos_commands/filesys/df_-al | grep -v "^$"
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Identify full or nearly-full filesystems**:
|
||||
```bash
|
||||
# Extract filesystems with usage >= 85%
|
||||
if [ -f sos_commands/filesys/df_-al ]; then
|
||||
awk 'NR>1 && $5+0 >= 85 {print $5, $6, $1}' sos_commands/filesys/df_-al | grep -v "tmpfs\|devtmpfs"
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check disk I/O errors**:
|
||||
```bash
|
||||
# Look for I/O errors in logs
|
||||
grep -i "i/o error\|read error\|write error\|bad sector" var/log/dmesg 2>/dev/null
|
||||
grep -i "i/o error\|read error\|write error" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
4. **Analyze block devices**:
|
||||
```bash
|
||||
if [ -f sos_commands/block/lsblk ]; then
|
||||
cat sos_commands/block/lsblk
|
||||
fi
|
||||
```
|
||||
|
||||
5. **Identify disk issues**:
|
||||
- Filesystem > 95% full → Critical
|
||||
- Filesystem > 85% full → Warning
|
||||
- I/O errors present → Hardware issue
|
||||
- Root filesystem full → System stability risk
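
The two usage thresholds above can be applied in one pass over the `df` output. This is a minimal sketch: `classify_filesystems` is a hypothetical helper name, and it assumes one filesystem per line with the usage percentage in column 5 and a mount point without spaces in column 6.

```bash
# Hypothetical helper: label filesystems at or above 85% usage.
# $1: path to saved df output, e.g. sos_commands/filesys/df_-al
classify_filesystems() {
    awk 'NR > 1 && $1 !~ /^(tmpfs|devtmpfs|overlay)$/ && $5 ~ /^[0-9]+%$/ {
        pct = $5 + 0
        if (pct < 85) next
        # Above 95% is Critical; 85% to 95% is Warning.
        print (pct > 95 ? "CRITICAL" : "WARNING"), $6, $5
    }' "$1"
}
```

Pseudo-filesystems that report `-` instead of a percentage are skipped by the column 5 check.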
|
||||
|
||||
### Step 4: Analyze Process Information
|
||||
|
||||
1. **Parse ps output**:
|
||||
```bash
|
||||
if [ -f sos_commands/process/ps_auxwww ]; then
|
||||
# Show header
|
||||
head -1 sos_commands/process/ps_auxwww
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Find top CPU consumers**:
|
||||
```bash
|
||||
# Sort by CPU usage (column 3), show top 10
|
||||
if [ -f sos_commands/process/ps_auxwww ]; then
|
||||
tail -n +2 sos_commands/process/ps_auxwww | sort -k3 -rn | head -10
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Find top memory consumers**:
|
||||
```bash
|
||||
# Sort by memory usage (column 4), show top 10
|
||||
if [ -f sos_commands/process/ps_auxwww ]; then
|
||||
tail -n +2 sos_commands/process/ps_auxwww | sort -k4 -rn | head -10
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Check for zombie processes**:
|
||||
```bash
|
||||
# Look for processes in Z state
|
||||
if [ -f sos_commands/process/ps_auxwww ]; then
|
||||
awk 'NR > 1 && $8 ~ /^Z/ { print; n++ } END { if (!n) print "No zombie processes found" }' sos_commands/process/ps_auxwww
|
||||
fi
|
||||
```
|
||||
|
||||
5. **Count processes by state**:
|
||||
```bash
|
||||
# Count processes by state (R=running, S=sleeping, D=uninterruptible, Z=zombie, T=stopped)
|
||||
if [ -f sos_commands/process/ps_auxwww ]; then
|
||||
tail -n +2 sos_commands/process/ps_auxwww | awk '{print $8}' | cut -c1 | sort | uniq -c
|
||||
fi
|
||||
```
|
||||
|
||||
6. **Identify process issues**:
|
||||
- Zombie processes present → Parent process not reaping children
|
||||
- Many processes in D state → I/O bottleneck
|
||||
- Single process using >80% memory → Memory leak or expected behavior
|
||||
- Many processes using high CPU → CPU contention
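
Three of the checks above (zombies, D-state processes, and a single process above 80% memory) can be sketched in one pass over the `ps` output. `process_issues` is a hypothetical helper name; it assumes the standard `ps aux` column order, with `%MEM` in column 4, `STAT` in column 8, and the command starting at column 11.

```bash
# Hypothetical helper: flag process-level issues in saved `ps auxwww` output.
process_issues() {
    awk 'NR > 1 {
            if ($8 ~ /^Z/) z++     # zombie, including variants such as Z+
            if ($8 ~ /^D/) d++     # uninterruptible sleep
            if ($4 + 0 > 80)
                printf "High memory: PID %s (%s) at %s%% MEM\n", $2, $11, $4
         }
         END { printf "Zombie: %d\nUninterruptible: %d\n", z, d }' "$1"
}
```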
|
||||
|
||||
### Step 5: Correlate Resource Usage with Issues
|
||||
|
||||
1. **Cross-reference with logs**:
|
||||
- If high memory usage, check for OOM events in logs
|
||||
- If high disk usage, check for disk full errors
|
||||
- If high load, check for performance-related errors
|
||||
|
||||
2. **Identify resource exhaustion patterns**:
|
||||
- Memory exhaustion → OOM killer → Service crashes
|
||||
- Disk full → Write failures → Application errors
|
||||
- CPU overload → Timeouts → Request failures
|
||||
|
||||
3. **Build timeline**:
|
||||
- When did resource issues start?
|
||||
- Correlate with log timestamps
|
||||
- Identify triggering event if possible
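
For the timeline, a quick way to see when a class of events started and stopped is to take the first and last matching log lines. This is a minimal sketch: `event_window` is a hypothetical helper name, and it assumes syslog-style lines that begin with a `Mon DD HH:MM:SS` timestamp.

```bash
# Hypothetical helper: first and last timestamp of log lines matching a pattern.
# $1: extended regex (case-insensitive), $2: log file
event_window() {
    local pattern=$1 log=$2
    grep -iE "$pattern" "$log" 2>/dev/null | awk '
        NR == 1 { first = $1 " " $2 " " $3 }
                { last  = $1 " " $2 " " $3 }
        END {
            if (NR) {
                printf "first=%s last=%s count=%d\n", first, last, NR
            } else {
                print "no matches"
            }
        }'
}
```

For example, `event_window 'out of memory|oom-killer' sos_commands/logs/journalctl_--no-pager` gives the window to compare against service crash times.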
|
||||
|
||||
### Step 6: Generate Resource Analysis Summary
|
||||
|
||||
Create a structured summary with the following sections:
|
||||
|
||||
1. **Memory Summary**:
|
||||
- Total memory
|
||||
- Used memory (GB and %)
|
||||
- Available memory
|
||||
- Swap usage (GB and %)
|
||||
- Memory pressure indicators (OOM events)
|
||||
|
||||
2. **CPU Summary**:
|
||||
- CPU count and model
|
||||
- Load averages (1-min, 5-min, 15-min)
|
||||
- Load per CPU
|
||||
- CPU issues (throttling, overload)
|
||||
|
||||
3. **Disk Summary**:
|
||||
- Filesystems and usage percentages
|
||||
- Full or nearly-full filesystems
|
||||
- I/O errors count
|
||||
- Most full filesystem
|
||||
|
||||
4. **Process Summary**:
|
||||
- Total process count
|
||||
- Top CPU consumers (top 5)
|
||||
- Top memory consumers (top 5)
|
||||
- Zombie process count
|
||||
- Processes in uninterruptible sleep (D state)
|
||||
|
||||
5. **Critical Resource Issues**:
|
||||
- List issues by severity
|
||||
- Provide evidence (file paths, metrics)
|
||||
- Suggest remediation
|
||||
|
||||
## Error Handling
|
||||
|
||||
1. **Missing resource files**:
|
||||
- If `free` is missing, parse `proc/meminfo` directly
|
||||
- If `ps` is missing, check `proc/` for process information
|
||||
- Document missing data in summary
|
||||
|
||||
2. **Parsing errors**:
|
||||
- Handle different output formats (free -h vs free -m)
|
||||
- Account for locale differences in number formats
|
||||
- Validate data before calculations
|
||||
|
||||
3. **Incomplete data**:
|
||||
- Some sosreports may not include all resource files
|
||||
- Indicate which metrics are unavailable
|
||||
- Work with available data only
|
||||
|
||||
## Output Format
|
||||
|
||||
The resource analysis should produce:
|
||||
|
||||
```text
|
||||
RESOURCE USAGE SUMMARY
|
||||
======================
|
||||
|
||||
MEMORY
|
||||
------
|
||||
Total: {total_gb} GB
|
||||
Used: {used_gb} GB ({used_pct}%)
|
||||
Available: {available_gb} GB ({available_pct}%)
|
||||
Buffers: {buffers_gb} GB
|
||||
Cached: {cached_gb} GB
|
||||
Swap Total: {swap_total_gb} GB
|
||||
Swap Used: {swap_used_gb} GB ({swap_used_pct}%)
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {memory_issue_description}
|
||||
|
||||
CPU
|
||||
---
|
||||
Model: {cpu_model}
|
||||
CPU Count: {cpu_count}
|
||||
Threads/Core: {threads_per_core}
|
||||
|
||||
Load Averages: {load_1m}, {load_5m}, {load_15m}
|
||||
Load per CPU: {load_1m_per_cpu}, {load_5m_per_cpu}, {load_15m_per_cpu}
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {cpu_issue_description}
|
||||
|
||||
DISK USAGE
|
||||
----------
|
||||
Filesystem Size Used Avail Use% Mounted on
|
||||
{filesystem} {size} {used} {avail} {pct}% {mount}
|
||||
|
||||
Nearly Full Filesystems (>85%):
|
||||
- {mount}: {pct}% full ({available} available)
|
||||
|
||||
I/O Errors: {count} errors found in logs
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {disk_issue_description}
|
||||
|
||||
PROCESSES
|
||||
---------
|
||||
Total Processes: {total}
|
||||
Running: {running}
|
||||
Sleeping: {sleeping}
|
||||
Zombie: {zombie}
|
||||
Uninterruptible: {uninterruptible}
|
||||
|
||||
Top CPU Consumers:
|
||||
1. {process_name} (PID {pid}): {cpu}% CPU, {mem}% MEM
|
||||
2. {process_name} (PID {pid}): {cpu}% CPU, {mem}% MEM
|
||||
3. {process_name} (PID {pid}): {cpu}% CPU, {mem}% MEM
|
||||
|
||||
Top Memory Consumers:
|
||||
1. {process_name} (PID {pid}): {mem}% MEM, {cpu}% CPU
|
||||
2. {process_name} (PID {pid}): {mem}% MEM, {cpu}% CPU
|
||||
3. {process_name} (PID {pid}): {mem}% MEM, {cpu}% CPU
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {process_issue_description}
|
||||
|
||||
CRITICAL RESOURCE ISSUES
|
||||
------------------------
|
||||
{severity}: {issue_description}
|
||||
Evidence: {file_path}
|
||||
Impact: {impact_description}
|
||||
Recommendation: {remediation_action}
|
||||
|
||||
RECOMMENDATIONS
|
||||
---------------
|
||||
1. {actionable_recommendation}
|
||||
2. {actionable_recommendation}
|
||||
|
||||
DATA SOURCES
|
||||
------------
|
||||
- Memory: {sosreport_path}/sos_commands/memory/free
|
||||
- Memory: {sosreport_path}/proc/meminfo
|
||||
- CPU: {sosreport_path}/sos_commands/processor/lscpu
|
||||
- Load: {sosreport_path}/uptime
|
||||
- Disk: {sosreport_path}/sos_commands/filesys/df_-al
|
||||
- Processes: {sosreport_path}/sos_commands/process/ps_auxwww
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Memory Analysis
|
||||
|
||||
```bash
|
||||
# Parse free command output
|
||||
$ cat sos_commands/memory/free
|
||||
total used free shared buff/cache available
|
||||
Mem: 16277396 8123456 2145678 123456 6008262 7654321
|
||||
Swap: 8388604 512000 7876604
|
||||
|
||||
# Interpretation:
|
||||
# - Total RAM: ~16 GB
|
||||
# - Used: ~8 GB (50%)
|
||||
# - Available: ~7.6 GB (47%)
|
||||
# - Swap used: ~500 MB (6%)
|
||||
# Status: OK - healthy memory usage
|
||||
```
|
||||
|
||||
### Example 2: Disk Full Detection
|
||||
|
||||
```bash
|
||||
# Find filesystems >= 85% full
|
||||
$ awk 'NR>1 && $5+0 >= 85' sos_commands/filesys/df_-al
|
||||
/dev/sda1 50G 45G 5G 90% /
|
||||
/dev/sdb1 100G 96G 4G 96% /var/log
|
||||
|
||||
# Root filesystem at 90% (Warning, on the edge of Critical); /var/log at 96% (Critical)
|
||||
# Action required: Clean up disk space
|
||||
```
|
||||
|
||||
### Example 3: High Load Investigation
|
||||
|
||||
```bash
|
||||
# Check load averages
|
||||
$ cat uptime
|
||||
14:23:45 up 10 days, 3:42, 2 users, load average: 8.45, 7.23, 6.12
|
||||
|
||||
# With lscpu showing 4 CPUs:
|
||||
# Load per CPU: 2.1, 1.8, 1.5
|
||||
# System is overloaded (load > 2x CPU count)
|
||||
```
|
||||
|
||||
## Tips for Effective Analysis
|
||||
|
||||
1. **Context matters**: High resource usage isn't always bad - consider the workload
|
||||
2. **Look for trends**: Compare 1-min, 5-min, 15-min loads to see if issues are growing
|
||||
3. **Correlate metrics**: High load + high memory + disk full = multiple issues
|
||||
4. **Check ratios**: Usage percentages are more meaningful than absolute values
|
||||
5. **Validate findings**: Cross-reference with log analysis for confirmation
|
||||
6. **Consider capacity**: Is the system appropriately sized for its workload?
|
||||
|
||||
## Common Resource Patterns
|
||||
|
||||
1. **Memory leak**: Steadily increasing memory usage, eventual OOM
|
||||
2. **Disk full**: Application writes failing, log rotation issues
|
||||
3. **CPU spike**: Load average spike, potentially from runaway process
|
||||
4. **I/O bottleneck**: High load but low CPU usage, many D-state processes
|
||||
5. **Swap thrashing**: High swap usage, poor performance
|
||||
6. **Zombie accumulation**: Parent process bug not reaping children
|
||||
|
||||
## Severity Classification
|
||||
|
||||
| Metric | OK | Warning | Critical |
|
||||
|--------|----|---------|----------|
|
||||
| Memory Usage | < 80% | 80-90% | > 90% |
|
||||
| Swap Usage | < 20% | 20-50% | > 50% |
|
||||
| Disk Usage | < 85% | 85-95% | > 95% |
|
||||
| Load (per CPU) | < 1.0 | 1.0-2.0 | > 2.0 |
|
||||
| Root FS Usage | < 80% | 80-90% | > 90% |
|
||||
|
||||
## See Also
|
||||
|
||||
- Logs Analysis Skill: For finding resource-related errors in logs
|
||||
- System Configuration Analysis Skill: For investigating service resource limits
|
||||
- Network Analysis Skill: For network-related performance issues
|
||||
546
skills/system-config-analysis/SKILL.md
Normal file
@@ -0,0 +1,546 @@
|
||||
---
|
||||
name: System Configuration Analysis
|
||||
description: Analyze system configuration data from sosreport archives, extracting OS details, installed packages, systemd service status, SELinux/AppArmor policies, and kernel parameters from the sosreport directory structure to diagnose configuration-related system issues
|
||||
---
|
||||
|
||||
# System Configuration Analysis Skill
|
||||
|
||||
This skill provides detailed guidance for analyzing system configuration from sosreport archives, including OS information, installed packages, systemd services, and SELinux/AppArmor settings.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when:
|
||||
- Running the system configuration phase of the `/sosreport:analyze` command
|
||||
- Investigating service failures or misconfigurations
|
||||
- Verifying package versions and updates
|
||||
- Checking security policy settings (SELinux/AppArmor)
|
||||
- Understanding system state and configuration
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Sosreport archive must be extracted to a working directory
|
||||
- Path to the sosreport root directory must be known
|
||||
- Understanding of Linux system administration
|
||||
|
||||
## Key Configuration Data Locations in Sosreport
|
||||
|
||||
1. **System Information**:
|
||||
- `uname` - Kernel version
|
||||
- `etc/os-release` - OS distribution and version
|
||||
- `uptime` - System uptime
|
||||
- `proc/uptime` - Uptime in seconds
|
||||
- `sos_commands/release/` - Release information
|
||||
|
||||
2. **Package Information**:
|
||||
- `installed-rpms` - RPM packages (RHEL/Fedora/CentOS)
|
||||
- `installed-debs` - DEB packages (Debian/Ubuntu)
|
||||
- `sos_commands/yum/` - Yum/DNF information
|
||||
- `sos_commands/rpm/` - RPM database queries
|
||||
|
||||
3. **Service Status**:
|
||||
- `sos_commands/systemd/systemctl_list-units` - All units
|
||||
- `sos_commands/systemd/systemctl_list-units_--failed` - Failed units
|
||||
- `sos_commands/systemd/systemctl_status_--all` - Detailed service status
|
||||
- `sos_commands/systemd/systemctl_list-unit-files` - Unit files
|
||||
|
||||
4. **SELinux**:
|
||||
- `sos_commands/selinux/sestatus` - SELinux status
|
||||
- `sos_commands/selinux/getenforce` - Current enforcement mode
|
||||
- `sos_commands/selinux/selinux-policy` - Policy information
|
||||
- `var/log/audit/audit.log` - SELinux denials
|
||||
|
||||
5. **AppArmor** (if applicable):
|
||||
- `sos_commands/apparmor/` - AppArmor configuration
|
||||
- `etc/apparmor.d/` - AppArmor profiles
|
||||
|
||||
6. **System Configuration Files**:
|
||||
- `etc/` - System-wide configuration
|
||||
- `etc/sysctl.conf` or `etc/sysctl.d/` - Kernel parameters
|
||||
- `etc/security/limits.conf` - Resource limits
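
Since not every sosreport contains every file, it can help to take an inventory before starting. The following is a minimal sketch under stated assumptions: `check_sources` is a hypothetical name, the path list is a subset of the locations above, and the argument is the extracted sosreport root.

```bash
# check_sources SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Prints "present" or "missing" for a subset of the data sources listed above.
check_sources() {
  local root=$1 path
  for path in \
    uname etc/os-release uptime proc/uptime \
    installed-rpms installed-debs \
    sos_commands/systemd/systemctl_list-units \
    sos_commands/selinux/sestatus \
    var/log/audit/audit.log \
    etc/sysctl.conf etc/security/limits.conf
  do
    if [ -e "$root/$path" ]; then
      printf 'present  %s\n' "$path"
    else
      printf 'missing  %s\n' "$path"
    fi
  done
}
```

The `missing` lines can be copied straight into the summary to document data that was not collected.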
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
### Step 1: Analyze System Information
|
||||
|
||||
1. **Check OS version and distribution**:
|
||||
```bash
|
||||
if [ -f etc/os-release ]; then
|
||||
cat etc/os-release
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Get kernel version**:
|
||||
```bash
|
||||
if [ -f uname ]; then
|
||||
cat uname
|
||||
elif [ -f proc/version ]; then
|
||||
cat proc/version
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check system uptime**:
|
||||
```bash
|
||||
if [ -f uptime ]; then
|
||||
cat uptime
|
||||
elif [ -f proc/uptime ]; then
|
||||
# Parse uptime from proc/uptime (seconds)
|
||||
awk '{printf "%.2f days\n", $1/86400}' proc/uptime
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Extract key system details**:
|
||||
- OS name and version
|
||||
- Kernel version
|
||||
- System architecture (x86_64, aarch64, etc.)
|
||||
- Uptime (days)
|
||||
|
||||
5. **Check for outdated kernel or OS**:
|
||||
- Compare the kernel version with the newest kernel available for that OS release
|
||||
- Note if system hasn't been rebooted in a very long time (>365 days)
|
||||
- Identify if OS version is EOL
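
The checks in this step can be combined into one helper. This is a sketch, not a definitive implementation: `os_summary` is a hypothetical name, it assumes the `uname` file contains `uname -a` output (so the kernel release is the third field), and it reads `PRETTY_NAME` from `etc/os-release` with `sed` rather than sourcing the file.

```bash
# os_summary SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Prints OS name, kernel release, and uptime in whole days, and flags
# an uptime above 365 days as described in item 5.
os_summary() {
  local root=$1 name kernel days
  # PRETTY_NAME="..." in os-release; strip the key and the quotes.
  name=$(sed -n 's/^PRETTY_NAME=//p' "$root/etc/os-release" 2>/dev/null | tr -d '"')
  # Assumes `uname -a` layout: "Linux <host> <release> ...".
  kernel=$(awk 'NR == 1 {print $3}' "$root/uname" 2>/dev/null)
  # First field of proc/uptime is seconds since boot.
  days=$(awk '{printf "%d", $1 / 86400}' "$root/proc/uptime" 2>/dev/null)

  echo "OS: ${name:-unknown}"
  echo "Kernel: ${kernel:-unknown}"
  echo "Uptime days: ${days:-unknown}"
  if [ -n "$days" ] && [ "$days" -gt 365 ]; then
    echo "WARNING: no reboot in over 365 days"
  fi
}
```

Missing files produce `unknown` rather than an error, so the helper can run against partial sosreports.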
|
||||
|
||||
### Step 2: Analyze Installed Packages
|
||||
|
||||
1. **List installed packages**:
|
||||
```bash
|
||||
# For RPM-based systems
|
||||
if [ -f installed-rpms ]; then
|
||||
cat installed-rpms
|
||||
fi
|
||||
|
||||
# For DEB-based systems
|
||||
if [ -f installed-debs ]; then
|
||||
cat installed-debs
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Extract key package versions**:
|
||||
```bash
|
||||
# Important system packages
|
||||
grep -E "^(kernel|systemd|glibc|openssh|openssl)" installed-rpms 2>/dev/null
|
||||
|
||||
# Or print only the first column (name-version-release) for the first 20 packages
|
||||
awk '{print $1}' installed-rpms | head -20
|
||||
```
|
||||
|
||||
3. **Check for known problematic versions**:
|
||||
- Security vulnerabilities (if known CVEs)
|
||||
- Buggy package versions
|
||||
- Compatibility issues
|
||||
|
||||
4. **Identify package manager issues**:
|
||||
```bash
|
||||
# Check yum/dnf command output for errors
for dir in sos_commands/yum sos_commands/dnf; do
if [ -d "$dir" ]; then
grep -i "error\|fail" "$dir"/* 2>/dev/null
fi
done
|
||||
```
|
||||
|
||||
5. **Count packages and categorize**:
|
||||
- Total packages installed
|
||||
- Key package versions (kernel, systemd, glibc, etc.)
|
||||
- Recently updated packages (if timestamps available)
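
Counting packages and pulling the key versions might look like the sketch below. The assumptions: `pkg_summary` is a hypothetical name, `installed-rpms` has one package per line with `name-version-release.arch` in the first column, and `installed-debs` follows the `dpkg -l` layout in which installed packages start with `ii`.

```bash
# pkg_summary SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Prints the package count and, for RPM systems, the key package versions.
pkg_summary() {
  local root=$1
  if [ -f "$root/installed-rpms" ]; then
    echo "Total packages: $(grep -c . "$root/installed-rpms")"
    # Require a digit after the name so "kernel-core-..." does not match "kernel".
    awk '{print $1}' "$root/installed-rpms" \
      | grep -E '^(kernel|systemd|glibc|openssl|openssh-server)-[0-9]' || true
  elif [ -f "$root/installed-debs" ]; then
    # Assumes `dpkg -l` output; header lines do not start with "ii".
    echo "Total packages: $(grep -c '^ii' "$root/installed-debs")"
  else
    echo "No package list found"
    return 1
  fi
}
```

The digit check after the package name is a deliberate choice: a plain prefix match such as `^kernel` also returns subpackages, which is useful for browsing but noisy in a summary.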
|
||||
|
||||
### Step 3: Analyze Service Status
|
||||
|
||||
1. **List all systemd units**:
|
||||
```bash
|
||||
if [ -f sos_commands/systemd/systemctl_list-units ]; then
|
||||
cat sos_commands/systemd/systemctl_list-units
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Identify failed services**:
|
||||
```bash
|
||||
if [ -f sos_commands/systemd/systemctl_list-units_--failed ]; then
|
||||
cat sos_commands/systemd/systemctl_list-units_--failed
|
||||
elif [ -f sos_commands/systemd/systemctl_list-units ]; then
|
||||
grep "failed" sos_commands/systemd/systemctl_list-units
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check service details**:
|
||||
```bash
|
||||
# Parse detailed status for failed services
|
||||
if [ -f sos_commands/systemd/systemctl_status_--all ]; then
|
||||
# Extract service names and their status
|
||||
grep -E "●|Active:" sos_commands/systemd/systemctl_status_--all | head -50
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Count services by state**:
|
||||
```bash
|
||||
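# Count services by SUB state (running, exited, failed, dead, ...).
# Failed units carry a leading "●" marker that shifts the columns,
# so strip any leading non-alphanumeric characters first.
if [ -f sos_commands/systemd/systemctl_list-units ]; then
sed 's/^[^[:alnum:]]*//' sos_commands/systemd/systemctl_list-units \
| awk '$1 ~ /\.service$/ {print $4}' | sort | uniq -c
fi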
|
||||
```
|
||||
|
||||
5. **Identify critical service failures**:
|
||||
- System services (systemd-*, dbus, NetworkManager)
|
||||
- Application services (httpd, nginx, postgresql, etc.)
|
||||
- Custom services
|
||||
|
||||
6. **Extract failure reasons from logs**:
|
||||
```bash
|
||||
# For each failed service, find related log entries
|
||||
grep -i "failed to start\|service.*failed" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
```
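
When the dedicated `--failed` file is absent, failed unit names can still be pulled from the full unit list. The sketch below assumes the usual `UNIT LOAD ACTIVE SUB DESCRIPTION` column order and a leading `●` marker on failed rows; `failed_units` is a hypothetical name.

```bash
# failed_units SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Prints the names of units whose ACTIVE column is "failed".
failed_units() {
  local file=$1/sos_commands/systemd/systemctl_list-units
  [ -f "$file" ] || return 0
  # Drop the leading marker and indentation so columns line up,
  # then keep rows whose third column (ACTIVE) is "failed".
  sed 's/^[^[:alnum:]]*//' "$file" \
    | awk '$3 == "failed" && $1 ~ /\./ {print $1}'
}
```

Each name it prints can then be used as the search term for the log query in item 6.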
|
||||
|
||||
### Step 4: Analyze SELinux Configuration
|
||||
|
||||
1. **Check SELinux status**:
|
||||
```bash
|
||||
if [ -f sos_commands/selinux/sestatus ]; then
|
||||
cat sos_commands/selinux/sestatus
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Get SELinux mode**:
|
||||
```bash
|
||||
if [ -f sos_commands/selinux/getenforce ]; then
|
||||
cat sos_commands/selinux/getenforce
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Check for SELinux denials**:
|
||||
```bash
|
||||
# Look for AVC denials in audit log
|
||||
if [ -f var/log/audit/audit.log ]; then
|
||||
grep "avc.*denied" var/log/audit/audit.log | head -50
|
||||
fi
|
||||
|
||||
# Or in journald logs
|
||||
grep -i "selinux.*denied\|avc.*denied" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
|
||||
```
|
||||
|
||||
4. **Parse denial information**:
|
||||
- Extract denied operations (read, write, execute, etc.)
|
||||
- Identify source and target contexts
|
||||
- Note which services are affected
|
||||
|
||||
5. **Check for SELinux booleans**:
|
||||
```bash
|
||||
if [ -f sos_commands/selinux/getsebool_-a ]; then
|
||||
cat sos_commands/selinux/getsebool_-a
|
||||
fi
|
||||
```
|
||||
|
||||
6. **Identify SELinux issues**:
|
||||
- SELinux in permissive mode (denials are logged but not enforced, so policy problems stay hidden)
|
||||
- SELinux disabled (security concern)
|
||||
- Frequent AVC denials (policy may need adjustment)
|
||||
- Context mismatches
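
Parsing denials into counts per process, permission, and object class could be sketched as follows. This is an illustration under assumptions: `avc_summary` is a hypothetical name, and each AVC record is expected to contain `{ <permissions> }`, `comm="..."`, and `tclass=...` fields as in a standard audit log; any field that is missing is reported as `?`.

```bash
# avc_summary AUDIT_LOG   (hypothetical helper, for illustration)
# Prints "<count> <comm> <permission> <tclass>", most frequent first.
avc_summary() {
  local log=$1
  [ -f "$log" ] || return 0
  awk '/avc: +denied/ {
    perm = comm = tclass = "?"
    # "{ write }" -> "write"
    if (match($0, /[{] [^}]* [}]/))  perm   = substr($0, RSTART + 2, RLENGTH - 4)
    # comm="httpd" -> httpd
    if (match($0, /comm="[^"]*"/))   comm   = substr($0, RSTART + 6, RLENGTH - 7)
    # tclass=file -> file
    if (match($0, /tclass=[^ ]+/))   tclass = substr($0, RSTART + 7, RLENGTH - 7)
    count[comm " " perm " " tclass]++
  }
  END { for (k in count) printf "%d %s\n", count[k], k }' "$log" | sort -rn
}
```

Run it as `avc_summary var/log/audit/audit.log` from the sosreport root; the first lines of output correspond to the "Top Denied Operations" entries in the summary.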
|
||||
|
||||
### Step 5: Check System Configuration
|
||||
|
||||
1. **Review kernel parameters**:
|
||||
```bash
|
||||
# Check sysctl settings
|
||||
if [ -f sos_commands/kernel/sysctl_-a ]; then
|
||||
cat sos_commands/kernel/sysctl_-a
|
||||
elif [ -d etc/sysctl.d ]; then
|
||||
cat etc/sysctl.d/*.conf 2>/dev/null
|
||||
fi
|
||||
```
|
||||
|
||||
2. **Check resource limits**:
|
||||
```bash
|
||||
if [ -f etc/security/limits.conf ]; then
|
||||
grep -v "^#\|^$" etc/security/limits.conf
|
||||
fi
|
||||
|
||||
# Check limits.d directory
|
||||
if [ -d etc/security/limits.d ]; then
|
||||
cat etc/security/limits.d/*.conf 2>/dev/null
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Review boot parameters**:
|
||||
```bash
|
||||
if [ -f sos_commands/boot/grub2-editenv_list ]; then
|
||||
cat sos_commands/boot/grub2-editenv_list
|
||||
elif [ -f proc/cmdline ]; then
|
||||
cat proc/cmdline
|
||||
fi
|
||||
```
|
||||
|
||||
4. **Check systemd configuration**:
|
||||
```bash
|
||||
# Look for systemd configuration overrides
|
||||
if [ -d etc/systemd/system ]; then
|
||||
find etc/systemd/system -name "*.conf" 2>/dev/null
|
||||
fi
|
||||
```
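
To pull individual kernel parameters out of the `sysctl -a` capture, a lookup helper along these lines may be useful. It is a sketch only: `sysctl_value` is a hypothetical name, and it assumes the `key = value` layout that `sysctl -a` prints.

```bash
# sysctl_value SOSREPORT_ROOT KEY   (hypothetical helper, for illustration)
# Prints the value recorded for KEY; returns non-zero if the key or file is absent.
sysctl_value() {
  local file=$1/sos_commands/kernel/sysctl_-a key=$2
  [ -f "$file" ] || return 1
  # Compare the key as a literal string so dots are not treated as wildcards.
  awk -F' *= *' -v key="$key" '
    $1 == key { print $2; found = 1 }
    END { exit !found }' "$file"
}
```

A short loop such as `for k in vm.swappiness net.ipv4.ip_forward kernel.panic; do echo "$k: $(sysctl_value . "$k")"; done` collects a few commonly reviewed settings from the sosreport root.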
|
||||
|
||||
### Step 6: Generate System Configuration Summary
|
||||
|
||||
Create a structured summary with the following sections:
|
||||
|
||||
1. **System Information**:
|
||||
- OS name and version
|
||||
- Kernel version
|
||||
- Architecture
|
||||
- System uptime
|
||||
- Last boot time
|
||||
|
||||
2. **Package Summary**:
|
||||
- Total packages installed
|
||||
- Key package versions (kernel, systemd, glibc, openssl, openssh)
|
||||
- Known problematic packages (if any)
|
||||
- Package manager issues
|
||||
|
||||
3. **Service Status**:
|
||||
- Total services
|
||||
- Running services count
|
||||
- Failed services count
|
||||
- List of failed services with reasons
|
||||
- Critical service status
|
||||
|
||||
4. **SELinux/AppArmor**:
|
||||
- SELinux status (enabled/disabled)
|
||||
- SELinux mode (enforcing/permissive)
|
||||
- Denial count
|
||||
- Top denied operations
|
||||
- Policy recommendations
|
||||
|
||||
5. **Configuration Issues**:
|
||||
- Kernel parameter anomalies
|
||||
- Resource limit issues
|
||||
- Boot parameter problems
|
||||
- Configuration file errors
|
||||
|
||||
## Error Handling
|
||||
|
||||
1. **Missing configuration files**:
|
||||
- Different distributions have different file locations
|
||||
- Some files may not be collected based on sosreport options
|
||||
- Document missing data in summary
|
||||
|
||||
2. **Package manager variations**:
|
||||
- Handle both RPM and DEB systems
|
||||
- Account for different package naming conventions
|
||||
- Support multiple package managers (yum, dnf, apt)
|
||||
|
||||
3. **SELinux vs AppArmor**:
|
||||
- Check which MAC system is in use
|
||||
- Analyze accordingly
|
||||
- Note if both or neither are present
|
||||
|
||||
4. **Systemd vs init**:
|
||||
- Older systems may use init instead of systemd
|
||||
- Check for both service management systems
|
||||
- Adapt analysis based on what's present
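
The variations above can be detected up front so that later steps know which branch to take. The sketch below makes two assumptions worth stating: the function names are hypothetical, and the presence of a `sos_commands/selinux` or `sos_commands/apparmor` directory is used only as a proxy for which MAC system the host runs; the actual mode still has to be read from `sestatus` or the AppArmor output.

```bash
# detect_pkg_format SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Prints rpm, deb, or unknown based on which package list was collected.
detect_pkg_format() {
  local root=$1
  if   [ -f "$root/installed-rpms" ]; then echo rpm
  elif [ -f "$root/installed-debs" ]; then echo deb
  else echo unknown
  fi
}

# detect_mac SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Prints selinux, apparmor, both, or none based on collected directories.
detect_mac() {
  local root=$1 selinux=no apparmor=no
  if [ -d "$root/sos_commands/selinux" ];  then selinux=yes;  fi
  if [ -d "$root/sos_commands/apparmor" ]; then apparmor=yes; fi
  case "$selinux/$apparmor" in
    yes/yes) echo both ;;
    yes/no)  echo selinux ;;
    no/yes)  echo apparmor ;;
    *)       echo none ;;
  esac
}
```

Recording both results at the top of the summary makes it clear why a section such as SELINUX was filled in or skipped.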
|
||||
|
||||
## Output Format
|
||||
|
||||
The system configuration analysis should produce:
|
||||
|
||||
```
|
||||
SYSTEM CONFIGURATION SUMMARY
|
||||
============================
|
||||
|
||||
SYSTEM INFORMATION
|
||||
------------------
|
||||
OS: {os_name} {os_version}
|
||||
Kernel: {kernel_version}
|
||||
Architecture: {arch}
|
||||
Uptime: {uptime_days} days ({last_boot_time})
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Notes:
|
||||
- {system_info_note}
|
||||
|
||||
INSTALLED PACKAGES
|
||||
------------------
|
||||
Total Packages: {count}
|
||||
|
||||
Key Package Versions:
|
||||
kernel: {version}
|
||||
systemd: {version}
|
||||
glibc: {version}
|
||||
openssl: {version}
|
||||
openssh-server: {version}
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {package_issue_description}
|
||||
|
||||
SYSTEMD SERVICES
|
||||
----------------
|
||||
Total Units: {total}
|
||||
Active: {active_count}
|
||||
Failed: {failed_count}
|
||||
Inactive: {inactive_count}
|
||||
|
||||
Failed Services:
|
||||
● {service_name}.service - {description}
|
||||
Reason: {failure_reason}
|
||||
Last Failed: {timestamp}
|
||||
|
||||
● {service_name}.service - {description}
|
||||
Reason: {failure_reason}
|
||||
Last Failed: {timestamp}
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Recommendations:
|
||||
- {service_recommendation}
|
||||
|
||||
SELINUX
|
||||
-------
|
||||
Status: {enabled|disabled}
|
||||
Mode: {enforcing|permissive|disabled}
|
||||
Policy: {policy_name}
|
||||
|
||||
AVC Denials: {count} denials found
|
||||
|
||||
Top Denied Operations:
|
||||
[{count}x] {operation} on {target} by {source}
|
||||
[{count}x] {operation} on {target} by {source}
|
||||
|
||||
SELinux Booleans: {count} custom settings
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Issues:
|
||||
- {selinux_issue_description}
|
||||
|
||||
Recommendations:
|
||||
- {selinux_recommendation}
|
||||
|
||||
KERNEL PARAMETERS
|
||||
-----------------
|
||||
Key sysctl Settings:
|
||||
vm.swappiness: {value}
|
||||
net.ipv4.ip_forward: {value}
|
||||
kernel.panic: {value}
|
||||
|
||||
Custom Parameters: {count} custom settings found
|
||||
|
||||
Status: {OK|WARNING|CRITICAL}
|
||||
Notes:
|
||||
- {kernel_param_note}
|
||||
|
||||
RESOURCE LIMITS
|
||||
---------------
|
||||
Custom Limits Found: {count}
|
||||
|
||||
{user_or_group} {type} {item} {value}
|
||||
|
||||
Status: {OK|WARNING}
|
||||
Notes:
|
||||
- {limits_note}
|
||||
|
||||
CRITICAL CONFIGURATION ISSUES
|
||||
-----------------------------
|
||||
{severity}: {issue_description}
|
||||
Evidence: {file_path}
|
||||
Impact: {impact_description}
|
||||
Recommendation: {remediation_action}
|
||||
|
||||
RECOMMENDATIONS
|
||||
---------------
|
||||
1. {actionable_recommendation}
|
||||
2. {actionable_recommendation}
|
||||
|
||||
DATA SOURCES
|
||||
------------
|
||||
- OS Info: {sosreport_path}/etc/os-release
|
||||
- Kernel: {sosreport_path}/uname
|
||||
- Packages: {sosreport_path}/installed-rpms
|
||||
- Services: {sosreport_path}/sos_commands/systemd/systemctl_list-units
|
||||
- SELinux: {sosreport_path}/sos_commands/selinux/sestatus
|
||||
- Audit Log: {sosreport_path}/var/log/audit/audit.log
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Failed Service Analysis
|
||||
|
||||
```bash
|
||||
# List failed services
|
||||
$ cat sos_commands/systemd/systemctl_list-units_--failed
|
||||
UNIT LOAD ACTIVE SUB DESCRIPTION
|
||||
● httpd.service loaded failed failed Apache Web Server
|
||||
● postgresql.service loaded failed failed PostgreSQL database
|
||||
|
||||
# Find failure reason in logs
|
||||
$ grep "httpd.service" sos_commands/logs/journalctl_--no-pager | grep -i "failed\|error"
|
||||
Jan 15 10:23:45 server systemd[1]: httpd.service: Main process exited, code=exited, status=1/FAILURE
|
||||
Jan 15 10:23:45 server systemd[1]: httpd.service: Failed with result 'exit-code'
|
||||
Jan 15 10:23:45 server httpd[12345]: (98)Address already in use: AH00072: make_sock: could not bind to address [::]:80
|
||||
|
||||
# Interpretation: httpd failed because port 80 is already in use
|
||||
```
|
||||
|
||||
### Example 2: SELinux Denial Analysis
|
||||
|
||||
```bash
|
||||
# Check for AVC denials
|
||||
$ grep "avc.*denied" var/log/audit/audit.log | head -5
|
||||
type=AVC msg=audit(1705320245.123:456): avc: denied { write } for pid=1234 comm="httpd" name="index.html" dev="sda1" ino=789012 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:user_home_t:s0 tclass=file permissive=0
|
||||
|
||||
# Interpretation:
|
||||
# - httpd (web server) was denied write access
|
||||
# - Target file: index.html with context user_home_t
|
||||
# - Issue: Web server trying to write to user home directory
|
||||
# - Solution: Fix file context or move file to proper location
|
||||
```
|
||||
|
||||
### Example 3: Package Version Check
|
||||
|
||||
```bash
|
||||
# Check for specific package versions
|
||||
$ grep "^openssl" installed-rpms
|
||||
openssl-1.1.1k-7.el8_6.x86_64
|
||||
openssl-libs-1.1.1k-7.el8_6.x86_64
|
||||
|
||||
$ grep "^kernel" installed-rpms
|
||||
kernel-4.18.0-425.el8.x86_64
|
||||
kernel-4.18.0-477.el8.x86_64
|
||||
kernel-core-4.18.0-425.el8.x86_64
|
||||
kernel-core-4.18.0-477.el8.x86_64
|
||||
|
||||
# Interpretation:
|
||||
# - OpenSSL version 1.1.1k (check for known CVEs)
|
||||
# - Multiple kernels installed (good for rollback)
|
||||
# - Current kernel is 4.18.0-477 (from uname)
|
||||
```
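
Comparing the running kernel with the newest installed one can be automated roughly as shown here. This is a sketch with assumptions: `kernel_check` is a hypothetical name, `uname` is expected to hold `uname -a` output, the package list is RPM-style, and version ordering relies on `sort -V`, which is available in GNU coreutils but not guaranteed everywhere.

```bash
# kernel_check SOSREPORT_ROOT   (hypothetical helper, for illustration)
# Reports the running kernel and the newest installed "kernel" package,
# and warns when they differ (for example, a reboot is pending).
kernel_check() {
  local root=$1 running newest
  # Assumes `uname -a` layout: "Linux <host> <release> ...".
  running=$(awk 'NR == 1 {print $3}' "$root/uname" 2>/dev/null)
  # Keep only "kernel-<version>" entries, strip the prefix, pick the highest.
  newest=$(awk '{print $1}' "$root/installed-rpms" 2>/dev/null \
    | sed -n 's/^kernel-\([0-9].*\)$/\1/p' | sort -V | tail -n 1)

  echo "running=${running:-unknown} newest_installed=${newest:-unknown}"
  if [ -n "$running" ] && [ -n "$newest" ] && [ "$running" != "$newest" ]; then
    echo "WARNING: running kernel is not the newest installed kernel"
  fi
}
```

A mismatch is not necessarily a fault; it often just means the system has not been rebooted since the last kernel update, which is worth noting alongside the uptime.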
|
||||
|
||||
## Tips for Effective Analysis
|
||||
|
||||
1. **Check service dependencies**: A failed service may be the result of a failed dependency
|
||||
2. **Correlate with logs**: Service failures often have detailed errors in logs
|
||||
3. **Verify configurations**: Check service config files for syntax errors
|
||||
4. **Consider timing**: When did service fail? Correlate with system events
|
||||
5. **SELinux context matters**: File contexts must match policy expectations
|
||||
6. **Package versions**: Compare with known good/bad versions
|
||||
7. **Uptime significance**: Very long uptime may mean missed security updates
|
||||
|
||||
## Common Configuration Patterns and Issues
|
||||
|
||||
1. **Service dependency failure**: ServiceB fails because ServiceA is not running
|
||||
2. **Port conflict**: Service fails to bind - port already in use
|
||||
3. **Permission denied**: Service can't access required files/directories
|
||||
4. **SELinux blocking**: Service denied access by SELinux policy
|
||||
5. **Missing dependencies**: Required package not installed
|
||||
6. **Configuration error**: Syntax error in config file
|
||||
7. **Resource limits**: Service hits ulimit (open files, processes, etc.)
|
||||
8. **Outdated kernel**: Running kernel doesn't match installed packages
|
||||
|
||||
## Configuration Issue Severity Classification
|
||||
|
||||
| Issue Type | Severity | Impact |
|
||||
|------------|----------|--------|
|
||||
| Critical service failed | High | Core functionality unavailable |
|
||||
| Optional service failed | Low | Non-essential feature unavailable |
|
||||
| SELinux in permissive mode | Warning | Denials logged but not enforced; reduced security |
|
||||
| SELinux disabled | Critical | No mandatory access control |
|
||||
| Kernel very outdated | High | Missing security fixes |
|
||||
| EOL OS version | Critical | No security updates |
|
||||
| Many AVC denials | Warning | Policy may need tuning |
|
||||
|
||||
## See Also
|
||||
|
||||
- Logs Analysis Skill: For detailed service failure log analysis
|
||||
- Resource Analysis Skill: For resource limit issues
|
||||
- Network Analysis Skill: For network service configuration
|
||||