Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/modules/security-incidents.md
+++ b/agents/sre/modules/security-incidents.md
@@ -0,0 +1,421 @@
+# Security Incidents
+
+**Purpose**: Respond to security breaches, DDoS attacks, and unauthorized access attempts.
+
+**IMPORTANT**: For security incidents, SRE Agent collaborates with `security-agent` skill.
+
+## Incident Response Protocol
+
+### SEV1 Security Incidents (CRITICAL)
+
+**Immediate Actions** (First 5 minutes):
+1. **Isolate** affected systems
+2. **Preserve** evidence (logs, snapshots)
+3. **Notify** security team and management
+4. **Assess** scope of breach
+5. **Document** timeline
+
+**DO NOT**:
+- Delete logs (preserve evidence)
+- Reboot systems (unless absolutely necessary)
+- Make changes without documenting
+
+---
+
+## Common Security Incidents
+
+### 1. DDoS Attack
+
+**Symptoms**:
+- Sudden traffic spike (10x-100x normal)
+- Legitimate users can't access service
+- High bandwidth usage
+- Server overload
+
+**Diagnosis**:
+
+#### Check Traffic Patterns
+```bash
+# Check connections by IP
+netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
+
+# Check HTTP requests by IP (nginx)
+awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
+
+# Check requests per second
+tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
+```
+
+**Red flags**:
+- Single IP making thousands of requests
+- Requests from suspicious IPs (botnets)
+- High rate of 4xx errors (probing)
+- Unusual traffic patterns
+
+---
+
+#### Immediate Mitigation
+```bash
+# 1. Rate limiting (nginx)
+# Add to nginx.conf:
+limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
+limit_req zone=one burst=20 nodelay;
+
+# 2. Block suspicious IPs (iptables)
+iptables -A INPUT -s <ATTACKER_IP> -j DROP
+
+# 3. Enable DDoS protection (CloudFlare, AWS Shield)
+# CloudFlare: Enable "I'm Under Attack" mode
+# AWS: Enable AWS Shield Standard/Advanced
+
+# 4. Increase capacity (auto-scaling)
+# Scale up to handle traffic (if legitimate)
+```
+
+---
+
+### 2. Unauthorized Access / Data Breach
+
+**Symptoms**:
+- Alerts for failed login attempts
+- Successful login from unusual location
+- Unusual data access patterns
+- Data exfiltration detected
+
+**Diagnosis**:
+
+#### Check Access Logs
+```bash
+# Check authentication logs (Linux)
+grep "Failed password" /var/log/auth.log | tail -50
+
+# Check successful logins
+grep "Accepted password" /var/log/auth.log | tail -50
+
+# Check login attempts by IP
+awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -nr
+```
+
+**Red flags**:
+- Hundreds of failed login attempts (brute force)
+- Successful login from suspicious IP/location
+- Login at unusual time (3am)
+- Multiple accounts accessed from same IP
+
+---
+
+#### Immediate Response (SEV1)
+```bash
+# 1. ISOLATE: Disable compromised account
+# Application-level:
+UPDATE users SET disabled = true WHERE id = <COMPROMISED_USER_ID>;
+
+# System-level:
+passwd -l <username>  # Lock account
+
+# 2. PRESERVE: Copy logs for forensics
+cp /var/log/auth.log /forensics/auth.log.$(date +%Y%m%d)
+cp /var/log/nginx/access.log /forensics/access.log.$(date +%Y%m%d)
+
+# 3. ASSESS: Check what was accessed
+# Database audit logs
+# Application logs
+# File access logs
+
+# 4. NOTIFY: Alert security team
+# Email, Slack, PagerDuty
+
+# 5. DOCUMENT: Create incident timeline
+```
+
+---
+
+#### Long-term Mitigation
+- Force password reset for all users
+- Enable 2FA/MFA
+- Review access controls
+- Conduct security audit
+- Update security policies
+- Train users on security
+
+---
+
+### 3. SQL Injection Attempt
+
+**Symptoms**:
+- Unusual SQL queries in logs
+- 500 errors with SQL syntax messages
+- Alerts from WAF (Web Application Firewall)
+
+**Diagnosis**:
+
+#### Check Application Logs
+```bash
+# Look for SQL injection patterns
+grep -E "(SELECT|INSERT|UPDATE|DELETE).*FROM.*WHERE" /var/log/application.log
+
+# Look for SQL errors
+grep "SQLException\|SQL syntax" /var/log/application.log
+
+# Check for malicious patterns
+grep -E "(\'\s*OR\s*\'|\-\-|UNION\s+SELECT)" /var/log/nginx/access.log
+```
+
+**Example Malicious Request**:
+```
+GET /api/users?id=1' OR '1'='1
+GET /api/users?id=1; DROP TABLE users;--
+```
+
+---
+
+#### Immediate Response
+```bash
+# 1. Block attacker IP
+iptables -A INPUT -s <ATTACKER_IP> -j DROP
+
+# 2. Enable WAF rule (ModSecurity, AWS WAF)
+# Block requests with SQL keywords
+
+# 3. Check database for unauthorized changes
+# Compare current schema with backup
+# Check audit logs for suspicious queries
+
+# 4. Review application code
+# Use parameterized queries, not string concatenation
+```
+
+**Long-term Fix**:
+```javascript
+// BAD: SQL injection vulnerable
+const query = `SELECT * FROM users WHERE id = ${req.query.id}`;
+
+// GOOD: Parameterized query
+const query = 'SELECT * FROM users WHERE id = ?';
+db.query(query, [req.query.id]);
+```
+
+---
+
+### 4. Malware / Crypto Mining
+
+**Symptoms**:
+- High CPU usage (100%)
+- Unusual network traffic (to crypto pool)
+- Unknown processes running
+- Server slow
+
+**Diagnosis**:
+
+#### Check Running Processes
+```bash
+# Check CPU usage by process
+top -bn1 | head -20
+
+# Check all processes
+ps aux | sort -nrk 3,3 | head -20
+
+# Check for suspicious processes
+ps aux | grep -v -E "^(root|www-data|mysql|postgres)"
+
+# Check network connections
+netstat -tunap | grep ESTABLISHED
+```
+
+**Red flags**:
+- Unknown process using 100% CPU
+- Connections to crypto mining pools
+- Processes running as unexpected user
+- Processes with random names (xmrig, minerd)
+
+---
+
+#### Immediate Response
+```bash
+# 1. Kill malicious process
+kill -9 <PID>
+
+# 2. Find and remove malware
+find / -name "<PROCESS_NAME>" -delete
+
+# 3. Check for persistence mechanisms
+crontab -l                    # Cron jobs
+cat /etc/rc.local             # Startup scripts
+systemctl list-unit-files     # Systemd services
+
+# 4. Change all credentials
+# Root password
+# SSH keys
+# Database passwords
+# API keys
+
+# 5. Restore from clean backup (if available)
+```
+
+---
+
+### 5. Insider Threat / Data Exfiltration
+
+**Symptoms**:
+- Large data downloads
+- Database dump exports
+- Unusual file transfers
+- After-hours access
+
+**Diagnosis**:
+
+#### Check Data Access Logs
+```bash
+# Check database queries (large exports)
+grep "SELECT.*FROM" /var/log/postgresql/postgresql.log | grep -E "LIMIT\s+[0-9]{5,}"
+
+# Check file downloads (nginx)
+awk '$10 > 10000000 {print $1, $7, $10}' /var/log/nginx/access.log
+
+# Check SSH file transfers
+grep "sftp\|scp" /var/log/auth.log
+```
+
+**Red flags**:
+- SELECT with no LIMIT (full table export)
+- Large file downloads (>10MB)
+- Multiple consecutive downloads
+- Access from unusual location
+
+---
+
+#### Immediate Response
+```bash
+# 1. Disable account
+UPDATE users SET disabled = true WHERE id = <USER_ID>;
+
+# 2. Preserve evidence
+cp /var/log/* /forensics/
+
+# 3. Assess damage
+# What data was accessed?
+# What data was exported?
+# What systems were compromised?
+
+# 4. Legal/compliance notification
+# GDPR: Notify within 72 hours
+# HIPAA: Notify within 60 days
+# PCI-DSS: Immediate notification
+
+# 5. Incident report
+```
+
+---
+
+## Security Incident Checklist
+
+**When security incident detected**:
+
+### Phase 1: Immediate Response (0-5 min)
+- [ ] Classify severity (SEV1/SEV2/SEV3)
+- [ ] Isolate affected systems
+- [ ] Preserve evidence (logs, snapshots)
+- [ ] Notify security team
+- [ ] Document timeline (start timestamp)
+
+### Phase 2: Assessment (5-30 min)
+- [ ] Identify attack vector
+- [ ] Assess scope (what was compromised?)
+- [ ] Check for data exfiltration
+- [ ] Identify attacker (IP, location, identity)
+- [ ] Determine if ongoing or stopped
+
+### Phase 3: Containment (30 min - 2 hours)
+- [ ] Block attacker access
+- [ ] Close vulnerability
+- [ ] Revoke compromised credentials
+- [ ] Remove malware/backdoors
+- [ ] Restore from clean backup (if needed)
+
+### Phase 4: Recovery (2 hours - days)
+- [ ] Restore normal operations
+- [ ] Verify no persistence mechanisms
+- [ ] Monitor for re-infection
+- [ ] Change all credentials
+- [ ] Apply security patches
+
+### Phase 5: Post-Incident (1 week)
+- [ ] Complete post-mortem
+- [ ] Legal/compliance notifications
+- [ ] Security audit
+- [ ] Update security policies
+- [ ] Train team on lessons learned
+
+---
+
+## Collaboration with Security Agent
+
+**SRE Agent Role**:
+- Initial detection and triage
+- Immediate containment
+- Preserve evidence
+- Restore service
+
+**Security Agent Role** (handoff):
+- Forensic analysis
+- Legal compliance
+- Security audit
+- Policy updates
+
+**Handoff Protocol**:
+```
+SRE: Detects security incident → Immediate containment
+SRE: Preserves evidence → Creates incident report
+SRE: Hands off to Security Agent
+Security Agent: Forensic analysis → Legal compliance → Long-term fixes
+SRE: Implements security fixes → Updates runbook
+```
+
+---
+
+## Security Metrics
+
+**Detection Time**:
+- SEV1: <5 minutes from first indicator
+- SEV2: <30 minutes
+- SEV3: <24 hours
+
+**Response Time**:
+- SEV1: Containment within 30 minutes
+- SEV2: Containment within 2 hours
+- SEV3: Containment within 24 hours
+
+**False Positives**:
+- Target: <5% of security alerts
+
+---
+
+## Related Documentation
+
+- [SKILL.md](../SKILL.md) - Main SRE agent
+- [infrastructure.md](infrastructure.md) - Server security hardening
+- [monitoring.md](monitoring.md) - Security monitoring setup
+- `security-agent` skill - Full security expertise (handoff for forensics)
+
+---
+
+## Important Notes
+
+**For SRE Agent**:
+- Focus on IMMEDIATE containment and service restoration
+- Preserve evidence (don't delete logs!)
+- Hand off to `security-agent` for forensic analysis
+- Document everything with timestamps
+- Blameless post-mortem (focus on systems, not people)
+
+**Legal Compliance**:
+- GDPR: Notify within 72 hours of breach
+- HIPAA: Notify within 60 days
+- PCI-DSS: Immediate notification to card brands
+- SOC 2: Document in audit trail
+
+**Evidence Preservation**:
+- Copy logs before any changes
+- Take disk/memory snapshots
+- Document all actions taken
+- Preserve chain of custody