9.5 KiB
9.5 KiB
Security Incidents
Purpose: Respond to security breaches, DDoS attacks, and unauthorized access attempts.
IMPORTANT: For security incidents, SRE Agent collaborates with security-agent skill.
Incident Response Protocol
SEV1 Security Incidents (CRITICAL)
Immediate Actions (First 5 minutes):
- Isolate affected systems
- Preserve evidence (logs, snapshots)
- Notify security team and management
- Assess scope of breach
- Document timeline
DO NOT:
- Delete logs (preserve evidence)
- Reboot systems (unless absolutely necessary)
- Make changes without documenting
Common Security Incidents
1. DDoS Attack
Symptoms:
- Sudden traffic spike (10x-100x normal)
- Legitimate users can't access service
- High bandwidth usage
- Server overload
Diagnosis:
Check Traffic Patterns
# Check connections by IP
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
# Check HTTP requests by IP (nginx)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
# Check requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c
Red flags:
- Single IP making thousands of requests
- Requests from suspicious IPs (botnets)
- High rate of 4xx errors (probing)
- Unusual traffic patterns
Immediate Mitigation
# 1. Rate limiting (nginx)
# Add to nginx.conf:
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
limit_req zone=one burst=20 nodelay;
# 2. Block suspicious IPs (iptables)
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# 3. Enable DDoS protection (CloudFlare, AWS Shield)
# CloudFlare: Enable "I'm Under Attack" mode
# AWS: Enable AWS Shield Standard/Advanced
# 4. Increase capacity (auto-scaling)
# Scale up to handle traffic (if legitimate)
2. Unauthorized Access / Data Breach
Symptoms:
- Alerts for failed login attempts
- Successful login from unusual location
- Unusual data access patterns
- Data exfiltration detected
Diagnosis:
Check Access Logs
# Check authentication logs (Linux)
grep "Failed password" /var/log/auth.log | tail -50
# Check successful logins
grep "Accepted password" /var/log/auth.log | tail -50
# Check login attempts by IP
awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -nr
Red flags:
- Hundreds of failed login attempts (brute force)
- Successful login from suspicious IP/location
- Login at unusual time (3am)
- Multiple accounts accessed from same IP
Immediate Response (SEV1)
# 1. ISOLATE: Disable compromised account
# Application-level:
UPDATE users SET disabled = true WHERE id = <COMPROMISED_USER_ID>;
# System-level:
passwd -l <username> # Lock account
# 2. PRESERVE: Copy logs for forensics
cp /var/log/auth.log /forensics/auth.log.$(date +%Y%m%d)
cp /var/log/nginx/access.log /forensics/access.log.$(date +%Y%m%d)
# 3. ASSESS: Check what was accessed
# Database audit logs
# Application logs
# File access logs
# 4. NOTIFY: Alert security team
# Email, Slack, PagerDuty
# 5. DOCUMENT: Create incident timeline
Long-term Mitigation
- Force password reset for all users
- Enable 2FA/MFA
- Review access controls
- Conduct security audit
- Update security policies
- Train users on security
3. SQL Injection Attempt
Symptoms:
- Unusual SQL queries in logs
- 500 errors with SQL syntax messages
- Alerts from WAF (Web Application Firewall)
Diagnosis:
Check Application Logs
# Look for SQL injection patterns
grep -E "(SELECT|INSERT|UPDATE|DELETE).*FROM.*WHERE" /var/log/application.log
# Look for SQL errors
grep "SQLException\|SQL syntax" /var/log/application.log
# Check for malicious patterns
grep -E "(\'\s*OR\s*\'|\-\-|UNION\s+SELECT)" /var/log/nginx/access.log
Example Malicious Request:
GET /api/users?id=1' OR '1'='1
GET /api/users?id=1; DROP TABLE users;--
Immediate Response
# 1. Block attacker IP
iptables -A INPUT -s <ATTACKER_IP> -j DROP
# 2. Enable WAF rule (ModSecurity, AWS WAF)
# Block requests with SQL keywords
# 3. Check database for unauthorized changes
# Compare current schema with backup
# Check audit logs for suspicious queries
# 4. Review application code
# Use parameterized queries, not string concatenation
Long-term Fix:
// BAD: SQL injection vulnerable
const query = `SELECT * FROM users WHERE id = ${req.query.id}`;
// GOOD: Parameterized query
const query = 'SELECT * FROM users WHERE id = ?';
db.query(query, [req.query.id]);
4. Malware / Crypto Mining
Symptoms:
- High CPU usage (100%)
- Unusual network traffic (to crypto pool)
- Unknown processes running
- Server slow
Diagnosis:
Check Running Processes
# Check CPU usage by process
top -bn1 | head -20
# Check all processes
ps aux | sort -nrk 3,3 | head -20
# Check for suspicious processes
ps aux | grep -v -E "^(root|www-data|mysql|postgres)"
# Check network connections
netstat -tunap | grep ESTABLISHED
Red flags:
- Unknown process using 100% CPU
- Connections to crypto mining pools
- Processes running as unexpected user
- Processes with random names (xmrig, minerd)
Immediate Response
# 1. Kill malicious process
kill -9 <PID>
# 2. Find and remove malware
find / -name "<PROCESS_NAME>" -delete
# 3. Check for persistence mechanisms
crontab -l # Cron jobs
cat /etc/rc.local # Startup scripts
systemctl list-unit-files # Systemd services
# 4. Change all credentials
# Root password
# SSH keys
# Database passwords
# API keys
# 5. Restore from clean backup (if available)
5. Insider Threat / Data Exfiltration
Symptoms:
- Large data downloads
- Database dump exports
- Unusual file transfers
- After-hours access
Diagnosis:
Check Data Access Logs
# Check database queries (large exports)
grep "SELECT.*FROM" /var/log/postgresql/postgresql.log | grep -E "LIMIT\s+[0-9]{5,}"
# Check file downloads (nginx)
awk '$10 > 10000000 {print $1, $7, $10}' /var/log/nginx/access.log
# Check SSH file transfers
grep "sftp\|scp" /var/log/auth.log
Red flags:
- SELECT with no LIMIT (full table export)
- Large file downloads (>10MB)
- Multiple consecutive downloads
- Access from unusual location
Immediate Response
# 1. Disable account
UPDATE users SET disabled = true WHERE id = <USER_ID>;
# 2. Preserve evidence
cp /var/log/* /forensics/
# 3. Assess damage
# What data was accessed?
# What data was exported?
# What systems were compromised?
# 4. Legal/compliance notification
# GDPR: Notify within 72 hours
# HIPAA: Notify within 60 days
# PCI-DSS: Immediate notification
# 5. Incident report
Security Incident Checklist
When security incident detected:
Phase 1: Immediate Response (0-5 min)
- Classify severity (SEV1/SEV2/SEV3)
- Isolate affected systems
- Preserve evidence (logs, snapshots)
- Notify security team
- Document timeline (start timestamp)
Phase 2: Assessment (5-30 min)
- Identify attack vector
- Assess scope (what was compromised?)
- Check for data exfiltration
- Identify attacker (IP, location, identity)
- Determine if ongoing or stopped
Phase 3: Containment (30 min - 2 hours)
- Block attacker access
- Close vulnerability
- Revoke compromised credentials
- Remove malware/backdoors
- Restore from clean backup (if needed)
Phase 4: Recovery (2 hours - days)
- Restore normal operations
- Verify no persistence mechanisms
- Monitor for re-infection
- Change all credentials
- Apply security patches
Phase 5: Post-Incident (1 week)
- Complete post-mortem
- Legal/compliance notifications
- Security audit
- Update security policies
- Train team on lessons learned
Collaboration with Security Agent
SRE Agent Role:
- Initial detection and triage
- Immediate containment
- Preserve evidence
- Restore service
Security Agent Role (handoff):
- Forensic analysis
- Legal compliance
- Security audit
- Policy updates
Handoff Protocol:
SRE: Detects security incident → Immediate containment
SRE: Preserves evidence → Creates incident report
SRE: Hands off to Security Agent
Security Agent: Forensic analysis → Legal compliance → Long-term fixes
SRE: Implements security fixes → Updates runbook
Security Metrics
Detection Time:
- SEV1: <5 minutes from first indicator
- SEV2: <30 minutes
- SEV3: <24 hours
Response Time:
- SEV1: Containment within 30 minutes
- SEV2: Containment within 2 hours
- SEV3: Containment within 24 hours
False Positives:
- Target: <5% of security alerts
Related Documentation
- SKILL.md - Main SRE agent
- infrastructure.md - Server security hardening
- monitoring.md - Security monitoring setup
security-agentskill - Full security expertise (handoff for forensics)
Important Notes
For SRE Agent:
- Focus on IMMEDIATE containment and service restoration
- Preserve evidence (don't delete logs!)
- Hand off to
security-agentfor forensic analysis - Document everything with timestamps
- Blameless post-mortem (focus on systems, not people)
Legal Compliance:
- GDPR: Notify within 72 hours of breach
- HIPAA: Notify within 60 days
- PCI-DSS: Immediate notification to card brands
- SOC 2: Document in audit trail
Evidence Preservation:
- Copy logs before any changes
- Take disk/memory snapshots
- Document all actions taken
- Preserve chain of custody