Files
gh-openshift-eng-ai-helpers…/skills/system-config-analysis/SKILL.md
2025-11-30 08:46:21 +08:00

547 lines
16 KiB
Markdown

---
name: System Configuration Analysis
description: Analyze system configuration data from sosreport archives, extracting OS details, installed packages, systemd service status, SELinux/AppArmor policies, and kernel parameters from the sosreport directory structure to diagnose configuration-related system issues
---
# System Configuration Analysis Skill
This skill provides detailed guidance for analyzing system configuration from sosreport archives, including OS information, installed packages, systemd services, and SELinux/AppArmor settings.
## When to Use This Skill
Use this skill when:
- Analyzing the `/sosreport:analyze` command's system configuration phase
- Investigating service failures or misconfigurations
- Verifying package versions and updates
- Checking security policy settings (SELinux/AppArmor)
- Understanding system state and configuration
## Prerequisites
- Sosreport archive must be extracted to a working directory
- Path to the sosreport root directory must be known
- Understanding of Linux system administration
## Key Configuration Data Locations in Sosreport
1. **System Information**:
- `uname` - Kernel version
- `etc/os-release` - OS distribution and version
- `uptime` - System uptime
- `proc/uptime` - Uptime in seconds
- `sos_commands/release/` - Release information
2. **Package Information**:
- `installed-rpms` - RPM packages (RHEL/Fedora/CentOS)
- `installed-debs` - DEB packages (Debian/Ubuntu)
- `sos_commands/yum/` - Yum/DNF information
- `sos_commands/rpm/` - RPM database queries
3. **Service Status**:
- `sos_commands/systemd/systemctl_list-units` - All units
- `sos_commands/systemd/systemctl_list-units_--failed` - Failed units
- `sos_commands/systemd/systemctl_status_--all` - Detailed service status
- `sos_commands/systemd/systemctl_list-unit-files` - Unit files
4. **SELinux**:
- `sos_commands/selinux/sestatus` - SELinux status
- `sos_commands/selinux/getenforce` - Current enforcement mode
- `sos_commands/selinux/selinux-policy` - Policy information
- `var/log/audit/audit.log` - SELinux denials
5. **AppArmor** (if applicable):
- `sos_commands/apparmor/` - AppArmor configuration
- `etc/apparmor.d/` - AppArmor profiles
6. **System Configuration Files**:
- `etc/` - System-wide configuration
- `etc/sysctl.conf` or `etc/sysctl.d/` - Kernel parameters
- `etc/security/limits.conf` - Resource limits
## Implementation Steps
### Step 1: Analyze System Information
1. **Check OS version and distribution**:
```bash
if [ -f etc/os-release ]; then
cat etc/os-release
fi
```
2. **Get kernel version**:
```bash
if [ -f uname ]; then
cat uname
elif [ -f proc/version ]; then
cat proc/version
fi
```
3. **Check system uptime**:
```bash
if [ -f uptime ]; then
cat uptime
elif [ -f proc/uptime ]; then
# Parse uptime from proc/uptime (seconds)
awk '{printf "%.2f days\n", $1/86400}' proc/uptime
fi
```
4. **Extract key system details**:
- OS name and version
- Kernel version
- System architecture (x86_64, aarch64, etc.)
- Uptime (days)
5. **Check for outdated kernel or OS**:
- Compare kernel version with current stable
- Note if system hasn't been rebooted in a very long time (>365 days)
- Identify if OS version is EOL
### Step 2: Analyze Installed Packages
1. **List installed packages**:
```bash
# For RPM-based systems
if [ -f installed-rpms ]; then
cat installed-rpms
fi
# For DEB-based systems
if [ -f installed-debs ]; then
cat installed-debs
fi
```
2. **Extract key package versions**:
```bash
# Important system packages
grep -E "^(kernel|systemd|glibc|openssh|openssl)" installed-rpms 2>/dev/null
# Or use awk to parse package name and version
awk '{print $1}' installed-rpms | head -20
```
3. **Check for known problematic versions**:
- Security vulnerabilities (if known CVEs)
- Buggy package versions
- Compatibility issues
4. **Identify package manager issues**:
```bash
# Check yum/dnf logs for errors
if [ -d sos_commands/yum ]; then
grep -i "error\|fail" sos_commands/yum/* 2>/dev/null
fi
```
5. **Count packages and categorize**:
- Total packages installed
- Key package versions (kernel, systemd, glibc, etc.)
- Recently updated packages (if timestamps available)
### Step 3: Analyze Service Status
1. **List all systemd units**:
```bash
if [ -f sos_commands/systemd/systemctl_list-units ]; then
cat sos_commands/systemd/systemctl_list-units
fi
```
2. **Identify failed services**:
```bash
if [ -f sos_commands/systemd/systemctl_list-units_--failed ]; then
cat sos_commands/systemd/systemctl_list-units_--failed
elif [ -f sos_commands/systemd/systemctl_list-units ]; then
grep "failed" sos_commands/systemd/systemctl_list-units
fi
```
3. **Check service details**:
```bash
# Parse detailed status for failed services
if [ -f sos_commands/systemd/systemctl_status_--all ]; then
# Extract service names and their status
grep -E "●|Active:" sos_commands/systemd/systemctl_status_--all | head -50
fi
```
4. **Count services by state**:
```bash
# Count running, failed, inactive services
if [ -f sos_commands/systemd/systemctl_list-units ]; then
awk '{print $4}' sos_commands/systemd/systemctl_list-units | sort | uniq -c
fi
```
5. **Identify critical service failures**:
- System services (systemd-*, dbus, NetworkManager)
- Application services (httpd, nginx, postgresql, etc.)
- Custom services
6. **Extract failure reasons from logs**:
```bash
# For each failed service, find related log entries
grep -i "failed to start\|service.*failed" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
```
### Step 4: Analyze SELinux Configuration
1. **Check SELinux status**:
```bash
if [ -f sos_commands/selinux/sestatus ]; then
cat sos_commands/selinux/sestatus
fi
```
2. **Get SELinux mode**:
```bash
if [ -f sos_commands/selinux/getenforce ]; then
cat sos_commands/selinux/getenforce
fi
```
3. **Check for SELinux denials**:
```bash
# Look for AVC denials in audit log
if [ -f var/log/audit/audit.log ]; then
grep "avc.*denied" var/log/audit/audit.log | head -50
fi
# Or in journald logs
grep -i "selinux.*denied\|avc.*denied" sos_commands/logs/journalctl_--no-pager 2>/dev/null | head -20
```
4. **Parse denial information**:
- Extract denied operations (read, write, execute, etc.)
- Identify source and target contexts
- Note which services are affected
5. **Check for SELinux booleans**:
```bash
if [ -f sos_commands/selinux/getsebool_-a ]; then
cat sos_commands/selinux/getsebool_-a
fi
```
6. **Identify SELinux issues**:
- SELinux in permissive mode (may hide errors)
- SELinux disabled (security concern)
- Frequent AVC denials (policy may need adjustment)
- Context mismatches
### Step 5: Check System Configuration
1. **Review kernel parameters**:
```bash
# Check sysctl settings
if [ -f sos_commands/kernel/sysctl_-a ]; then
cat sos_commands/kernel/sysctl_-a
elif [ -d etc/sysctl.d ]; then
cat etc/sysctl.d/*.conf 2>/dev/null
fi
```
2. **Check resource limits**:
```bash
if [ -f etc/security/limits.conf ]; then
grep -v "^#\|^$" etc/security/limits.conf
fi
# Check limits.d directory
if [ -d etc/security/limits.d ]; then
cat etc/security/limits.d/*.conf 2>/dev/null
fi
```
3. **Review boot parameters**:
```bash
if [ -f sos_commands/boot/grub2-editenv_list ]; then
cat sos_commands/boot/grub2-editenv_list
elif [ -f proc/cmdline ]; then
cat proc/cmdline
fi
```
4. **Check systemd configuration**:
```bash
# Look for systemd configuration overrides
if [ -d etc/systemd/system ]; then
find etc/systemd/system -name "*.conf" 2>/dev/null
fi
```
### Step 6: Generate System Configuration Summary
Create a structured summary with the following sections:
1. **System Information**:
- OS name and version
- Kernel version
- Architecture
- System uptime
- Last boot time
2. **Package Summary**:
- Total packages installed
- Key package versions (kernel, systemd, glibc, openssl, openssh)
- Known problematic packages (if any)
- Package manager issues
3. **Service Status**:
- Total services
- Running services count
- Failed services count
- List of failed services with reasons
- Critical service status
4. **SELinux/AppArmor**:
- SELinux status (enabled/disabled)
- SELinux mode (enforcing/permissive)
- Denial count
- Top denied operations
- Policy recommendations
5. **Configuration Issues**:
- Kernel parameter anomalies
- Resource limit issues
- Boot parameter problems
- Configuration file errors
## Error Handling
1. **Missing configuration files**:
- Different distributions have different file locations
- Some files may not be collected based on sosreport options
- Document missing data in summary
2. **Package manager variations**:
- Handle both RPM and DEB systems
- Account for different package naming conventions
- Support multiple package managers (yum, dnf, apt)
3. **SELinux vs AppArmor**:
- Check which MAC system is in use
- Analyze accordingly
- Note if both or neither are present
4. **Systemd vs init**:
- Older systems may use init instead of systemd
- Check for both service management systems
- Adapt analysis based on what's present
## Output Format
The system configuration analysis should produce:
```bash
SYSTEM CONFIGURATION SUMMARY
============================
SYSTEM INFORMATION
------------------
OS: {os_name} {os_version}
Kernel: {kernel_version}
Architecture: {arch}
Uptime: {uptime_days} days ({last_boot_time})
Status: {OK|WARNING|CRITICAL}
Notes:
- {system_info_note}
INSTALLED PACKAGES
------------------
Total Packages: {count}
Key Package Versions:
kernel: {version}
systemd: {version}
glibc: {version}
openssl: {version}
openssh-server: {version}
Status: {OK|WARNING|CRITICAL}
Issues:
- {package_issue_description}
SYSTEMD SERVICES
----------------
Total Units: {total}
Active: {active_count}
Failed: {failed_count}
Inactive: {inactive_count}
Failed Services:
● {service_name}.service - {description}
Reason: {failure_reason}
Last Failed: {timestamp}
● {service_name}.service - {description}
Reason: {failure_reason}
Last Failed: {timestamp}
Status: {OK|WARNING|CRITICAL}
Recommendations:
- {service_recommendation}
SELINUX
-------
Status: {enabled|disabled}
Mode: {enforcing|permissive|disabled}
Policy: {policy_name}
AVC Denials: {count} denials found
Top Denied Operations:
[{count}x] {operation} on {target} by {source}
[{count}x] {operation} on {target} by {source}
SELinux Booleans: {count} custom settings
Status: {OK|WARNING|CRITICAL}
Issues:
- {selinux_issue_description}
Recommendations:
- {selinux_recommendation}
KERNEL PARAMETERS
-----------------
Key sysctl Settings:
vm.swappiness: {value}
net.ipv4.ip_forward: {value}
kernel.panic: {value}
Custom Parameters: {count} custom settings found
Status: {OK|WARNING|CRITICAL}
Notes:
- {kernel_param_note}
RESOURCE LIMITS
---------------
Custom Limits Found: {count}
{user_or_group} {type} {item} {value}
Status: {OK|WARNING}
Notes:
- {limits_note}
CRITICAL CONFIGURATION ISSUES
-----------------------------
{severity}: {issue_description}
Evidence: {file_path}
Impact: {impact_description}
Recommendation: {remediation_action}
RECOMMENDATIONS
---------------
1. {actionable_recommendation}
2. {actionable_recommendation}
DATA SOURCES
------------
- OS Info: {sosreport_path}/etc/os-release
- Kernel: {sosreport_path}/uname
- Packages: {sosreport_path}/installed-rpms
- Services: {sosreport_path}/sos_commands/systemd/systemctl_list-units
- SELinux: {sosreport_path}/sos_commands/selinux/sestatus
- Audit Log: {sosreport_path}/var/log/audit/audit.log
```
## Examples
### Example 1: Failed Service Analysis
```bash
# List failed services
$ cat sos_commands/systemd/systemctl_list-units_--failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● httpd.service loaded failed failed Apache Web Server
● postgresql.service loaded failed failed PostgreSQL database
# Find failure reason in logs
$ grep "httpd.service" sos_commands/logs/journalctl_--no-pager | grep -i "failed\|error"
Jan 15 10:23:45 server systemd[1]: httpd.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 10:23:45 server systemd[1]: httpd.service: Failed with result 'exit-code'
Jan 15 10:23:45 server httpd[12345]: (98)Address already in use: AH00072: make_sock: could not bind to address [::]:80
# Interpretation: httpd failed because port 80 is already in use
```
### Example 2: SELinux Denial Analysis
```bash
# Check for AVC denials
$ grep "avc.*denied" var/log/audit/audit.log | head -5
type=AVC msg=audit(1705320245.123:456): avc: denied { write } for pid=1234 comm="httpd" name="index.html" dev="sda1" ino=789012 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:user_home_t:s0 tclass=file permissive=0
# Interpretation:
# - httpd (web server) was denied write access
# - Target file: index.html with context user_home_t
# - Issue: Web server trying to write to user home directory
# - Solution: Fix file context or move file to proper location
```
### Example 3: Package Version Check
```bash
# Check for specific package versions
$ grep "^openssl" installed-rpms
openssl-1.1.1k-7.el8_6.x86_64
openssl-libs-1.1.1k-7.el8_6.x86_64
$ grep "^kernel" installed-rpms
kernel-4.18.0-425.el8.x86_64
kernel-4.18.0-477.el8.x86_64
kernel-core-4.18.0-425.el8.x86_64
kernel-core-4.18.0-477.el8.x86_64
# Interpretation:
# - OpenSSL version 1.1.1k (check for known CVEs)
# - Multiple kernels installed (good for rollback)
# - Current kernel is 4.18.0-477 (from uname)
```
## Tips for Effective Analysis
1. **Check service dependencies**: Failed service may be due to dependency failure
2. **Correlate with logs**: Service failures often have detailed errors in logs
3. **Verify configurations**: Check service config files for syntax errors
4. **Consider timing**: When did service fail? Correlate with system events
5. **SELinux context matters**: File contexts must match policy expectations
6. **Package versions**: Compare with known good/bad versions
7. **Uptime significance**: Very long uptime may mean missed security updates
## Common Configuration Patterns and Issues
1. **Service dependency failure**: ServiceB fails because ServiceA is not running
2. **Port conflict**: Service fails to bind - port already in use
3. **Permission denied**: Service can't access required files/directories
4. **SELinux blocking**: Service denied access by SELinux policy
5. **Missing dependencies**: Required package not installed
6. **Configuration error**: Syntax error in config file
7. **Resource limits**: Service hits ulimit (open files, processes, etc.)
8. **Outdated kernel**: Running kernel doesn't match installed packages
## Configuration Issue Severity Classification
| Issue Type | Severity | Impact |
|------------|----------|--------|
| Critical service failed | High | Core functionality unavailable |
| Optional service failed | Low | Non-essential feature unavailable |
| SELinux in permissive | Warning | Reduced security, hiding issues |
| SELinux disabled | Critical | No mandatory access control |
| Kernel very outdated | High | Missing security fixes |
| EOL OS version | Critical | No security updates |
| Many AVC denials | Warning | Policy may need tuning |
## See Also
- Logs Analysis Skill: For detailed service failure log analysis
- Resource Analysis Skill: For resource limit issues
- Network Analysis Skill: For network service configuration