zhongwei/gh-anton-abyzov-specweave-plugins-specweave-infrastructure

Fork 0

Files

Zhongwei Li 9427ed1eea Initial commit

2025-11-29 17:56:41 +08:00

9.5 KiB

Raw Blame History

Security Incidents

Purpose: Respond to security breaches, DDoS attacks, and unauthorized access attempts.

IMPORTANT: For security incidents, SRE Agent collaborates with security-agent skill.

Incident Response Protocol

SEV1 Security Incidents (CRITICAL)

Immediate Actions (First 5 minutes):

Isolate affected systems
Preserve evidence (logs, snapshots)
Notify security team and management
Assess scope of breach
Document timeline

DO NOT:

Delete logs (preserve evidence)
Reboot systems (unless absolutely necessary)
Make changes without documenting

Common Security Incidents

1. DDoS Attack

Symptoms:

Sudden traffic spike (10x-100x normal)
Legitimate users can't access service
High bandwidth usage
Server overload

Diagnosis:

Check Traffic Patterns

# Check connections by IP
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20

# Check HTTP requests by IP (nginx)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20

# Check requests per second
tail -f /var/log/nginx/access.log | awk '{print $4}' | uniq -c

Red flags:

Single IP making thousands of requests
Requests from suspicious IPs (botnets)
High rate of 4xx errors (probing)
Unusual traffic patterns

Immediate Mitigation

# 1. Rate limiting (nginx)
# Add to nginx.conf:
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
limit_req zone=one burst=20 nodelay;

# 2. Block suspicious IPs (iptables)
iptables -A INPUT -s <ATTACKER_IP> -j DROP

# 3. Enable DDoS protection (CloudFlare, AWS Shield)
# CloudFlare: Enable "I'm Under Attack" mode
# AWS: Enable AWS Shield Standard/Advanced

# 4. Increase capacity (auto-scaling)
# Scale up to handle traffic (if legitimate)

2. Unauthorized Access / Data Breach

Symptoms:

Alerts for failed login attempts
Successful login from unusual location
Unusual data access patterns
Data exfiltration detected

Diagnosis:

Check Access Logs

# Check authentication logs (Linux)
grep "Failed password" /var/log/auth.log | tail -50

# Check successful logins
grep "Accepted password" /var/log/auth.log | tail -50

# Check login attempts by IP
awk '/Failed password/ {print $(NF-3)}' /var/log/auth.log | sort | uniq -c | sort -nr

Red flags:

Hundreds of failed login attempts (brute force)
Successful login from suspicious IP/location
Login at unusual time (3am)
Multiple accounts accessed from same IP

Immediate Response (SEV1)

# 1. ISOLATE: Disable compromised account
# Application-level:
UPDATE users SET disabled = true WHERE id = <COMPROMISED_USER_ID>;

# System-level:
passwd -l <username>  # Lock account

# 2. PRESERVE: Copy logs for forensics
cp /var/log/auth.log /forensics/auth.log.$(date +%Y%m%d)
cp /var/log/nginx/access.log /forensics/access.log.$(date +%Y%m%d)

# 3. ASSESS: Check what was accessed
# Database audit logs
# Application logs
# File access logs

# 4. NOTIFY: Alert security team
# Email, Slack, PagerDuty

# 5. DOCUMENT: Create incident timeline

Long-term Mitigation

Force password reset for all users
Enable 2FA/MFA
Review access controls
Conduct security audit
Update security policies
Train users on security

3. SQL Injection Attempt

Symptoms:

Unusual SQL queries in logs
500 errors with SQL syntax messages
Alerts from WAF (Web Application Firewall)

Diagnosis:

Check Application Logs

# Look for SQL injection patterns
grep -E "(SELECT|INSERT|UPDATE|DELETE).*FROM.*WHERE" /var/log/application.log

# Look for SQL errors
grep "SQLException\|SQL syntax" /var/log/application.log

# Check for malicious patterns
grep -E "(\'\s*OR\s*\'|\-\-|UNION\s+SELECT)" /var/log/nginx/access.log

Example Malicious Request:

GET /api/users?id=1' OR '1'='1
GET /api/users?id=1; DROP TABLE users;--

Immediate Response

# 1. Block attacker IP
iptables -A INPUT -s <ATTACKER_IP> -j DROP

# 2. Enable WAF rule (ModSecurity, AWS WAF)
# Block requests with SQL keywords

# 3. Check database for unauthorized changes
# Compare current schema with backup
# Check audit logs for suspicious queries

# 4. Review application code
# Use parameterized queries, not string concatenation

Long-term Fix:

// BAD: SQL injection vulnerable
const query = `SELECT * FROM users WHERE id = ${req.query.id}`;

// GOOD: Parameterized query
const query = 'SELECT * FROM users WHERE id = ?';
db.query(query, [req.query.id]);

4. Malware / Crypto Mining

Symptoms:

High CPU usage (100%)
Unusual network traffic (to crypto pool)
Unknown processes running
Server slow

Diagnosis:

Check Running Processes

# Check CPU usage by process
top -bn1 | head -20

# Check all processes
ps aux | sort -nrk 3,3 | head -20

# Check for suspicious processes
ps aux | grep -v -E "^(root|www-data|mysql|postgres)"

# Check network connections
netstat -tunap | grep ESTABLISHED

Red flags:

Unknown process using 100% CPU
Connections to crypto mining pools
Processes running as unexpected user
Processes with random names (xmrig, minerd)

Immediate Response

# 1. Kill malicious process
kill -9 <PID>

# 2. Find and remove malware
find / -name "<PROCESS_NAME>" -delete

# 3. Check for persistence mechanisms
crontab -l                    # Cron jobs
cat /etc/rc.local             # Startup scripts
systemctl list-unit-files     # Systemd services

# 4. Change all credentials
# Root password
# SSH keys
# Database passwords
# API keys

# 5. Restore from clean backup (if available)

5. Insider Threat / Data Exfiltration

Symptoms:

Large data downloads
Database dump exports
Unusual file transfers
After-hours access

Diagnosis:

Check Data Access Logs

# Check database queries (large exports)
grep "SELECT.*FROM" /var/log/postgresql/postgresql.log | grep -E "LIMIT\s+[0-9]{5,}"

# Check file downloads (nginx)
awk '$10 > 10000000 {print $1, $7, $10}' /var/log/nginx/access.log

# Check SSH file transfers
grep "sftp\|scp" /var/log/auth.log

Red flags:

SELECT with no LIMIT (full table export)
Large file downloads (>10MB)
Multiple consecutive downloads
Access from unusual location

Immediate Response

# 1. Disable account
UPDATE users SET disabled = true WHERE id = <USER_ID>;

# 2. Preserve evidence
cp /var/log/* /forensics/

# 3. Assess damage
# What data was accessed?
# What data was exported?
# What systems were compromised?

# 4. Legal/compliance notification
# GDPR: Notify within 72 hours
# HIPAA: Notify within 60 days
# PCI-DSS: Immediate notification

# 5. Incident report

Security Incident Checklist

When security incident detected:

Phase 1: Immediate Response (0-5 min)

Classify severity (SEV1/SEV2/SEV3)
Isolate affected systems
Preserve evidence (logs, snapshots)
Notify security team
Document timeline (start timestamp)

Phase 2: Assessment (5-30 min)

Identify attack vector
Assess scope (what was compromised?)
Check for data exfiltration
Identify attacker (IP, location, identity)
Determine if ongoing or stopped

Phase 3: Containment (30 min - 2 hours)

Block attacker access
Close vulnerability
Revoke compromised credentials
Remove malware/backdoors
Restore from clean backup (if needed)

Phase 4: Recovery (2 hours - days)

Restore normal operations
Verify no persistence mechanisms
Monitor for re-infection
Change all credentials
Apply security patches

Phase 5: Post-Incident (1 week)

Complete post-mortem
Legal/compliance notifications
Security audit
Update security policies
Train team on lessons learned

Collaboration with Security Agent

SRE Agent Role:

Initial detection and triage
Immediate containment
Preserve evidence
Restore service

Security Agent Role (handoff):

Forensic analysis
Legal compliance
Security audit
Policy updates

Handoff Protocol:

SRE: Detects security incident → Immediate containment
SRE: Preserves evidence → Creates incident report
SRE: Hands off to Security Agent
Security Agent: Forensic analysis → Legal compliance → Long-term fixes
SRE: Implements security fixes → Updates runbook

Security Metrics

Detection Time:

SEV1: <5 minutes from first indicator
SEV2: <30 minutes
SEV3: <24 hours

Response Time:

SEV1: Containment within 30 minutes
SEV2: Containment within 2 hours
SEV3: Containment within 24 hours

False Positives:

Target: <5% of security alerts

SKILL.md - Main SRE agent
infrastructure.md - Server security hardening
monitoring.md - Security monitoring setup
security-agent skill - Full security expertise (handoff for forensics)

Important Notes

For SRE Agent:

Focus on IMMEDIATE containment and service restoration
Preserve evidence (don't delete logs!)
Hand off to security-agent for forensic analysis
Document everything with timestamps
Blameless post-mortem (focus on systems, not people)

Legal Compliance:

GDPR: Notify within 72 hours of breach
HIPAA: Notify within 60 days
PCI-DSS: Immediate notification to card brands
SOC 2: Document in audit trail

Evidence Preservation:

Copy logs before any changes
Take disk/memory snapshots
Document all actions taken
Preserve chain of custody

9.5 KiB Raw Blame History

Security Incidents

Incident Response Protocol

SEV1 Security Incidents (CRITICAL)

Common Security Incidents

1. DDoS Attack

Check Traffic Patterns

Immediate Mitigation

2. Unauthorized Access / Data Breach

Check Access Logs

Immediate Response (SEV1)

Long-term Mitigation

3. SQL Injection Attempt

Check Application Logs

Immediate Response

4. Malware / Crypto Mining

Check Running Processes

Immediate Response

5. Insider Threat / Data Exfiltration

Check Data Access Logs

Immediate Response

Security Incident Checklist

Phase 1: Immediate Response (0-5 min)

Phase 2: Assessment (5-30 min)

Phase 3: Containment (30 min - 2 hours)

Phase 4: Recovery (2 hours - days)

Phase 5: Post-Incident (1 week)

Collaboration with Security Agent

Security Metrics

Related Documentation

Important Notes

9.5 KiB

Raw Blame History