Initial commit

skills/memory-profiling/examples/INDEX.md (new file, 86 lines)

# Memory Profiling Examples

Production memory profiling implementations for Node.js and Python with leak detection, heap analysis, and optimization strategies.

## Examples Overview

### Node.js Memory Leak Detection

**File**: [nodejs-memory-leak.md](nodejs-memory-leak.md)

Identifying and fixing memory leaks in Node.js applications:

- **Memory leak detection**: Chrome DevTools, heapdump analysis
- **Common leak patterns**: Event listeners, closures, global variables
- **Heap snapshots**: Before/after comparison, retained object analysis
- **Real leak**: EventEmitter leak causing 2GB memory growth
- **Fix**: Proper cleanup with `removeListener()`, WeakMap for caching
- **Result**: Memory stabilized at 150MB (93% reduction)

**Use when**: Node.js memory growing over time, debugging production memory issues

---

### Python Memory Profiling with Scalene

**File**: [python-scalene-profiling.md](python-scalene-profiling.md)

Line-by-line memory profiling for Python applications:

- **Scalene setup**: Installation, pytest integration, CLI usage
- **Memory hotspots**: Line-by-line allocation tracking
- **CPU + Memory**: Combined profiling for performance bottlenecks
- **Real scenario**: 500MB dataset causing OOM, fixed with generators
- **Optimization**: List comprehension → generator (500MB → 5MB)
- **Result**: 99% memory reduction, no OOM errors

**Use when**: Python memory spikes, profiling pytest tests, finding allocation hotspots

---

### Database Connection Pool Leak

**File**: [database-connection-leak.md](database-connection-leak.md)

PostgreSQL connection pool exhaustion and memory leaks:

- **Symptom**: Connection pool maxed out, memory growing linearly
- **Root cause**: Unclosed connections in error paths, missing `finally` blocks
- **Detection**: Connection pool metrics, memory profiling
- **Fix**: Context managers (`with` statement), proper cleanup
- **Result**: Zero connection leaks, memory stable at 80MB

**Use when**: Database connection errors, "too many clients" errors, connection pool issues

---

### Large Dataset Memory Optimization

**File**: [large-dataset-optimization.md](large-dataset-optimization.md)

Memory-efficient data processing for large datasets:

- **Problem**: Loading a 10GB CSV into memory (OOM killer)
- **Solutions**: Streaming with `pandas.read_csv(chunksize)`, generators, memory mapping
- **Techniques**: Lazy evaluation, columnar processing, batch processing
- **Before/After**: 10GB memory → 500MB (95% reduction)
- **Tools**: Pandas chunking, Dask for parallel processing

**Use when**: Processing large files, OOM errors, batch data processing

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **Node.js Leaks** | [nodejs-memory-leak.md](nodejs-memory-leak.md) | ~450 | EventEmitter, heap snapshots |
| **Python Scalene** | [python-scalene-profiling.md](python-scalene-profiling.md) | ~420 | Line-by-line profiling |
| **DB Connection Leaks** | [database-connection-leak.md](database-connection-leak.md) | ~380 | Connection pool management |
| **Large Datasets** | [large-dataset-optimization.md](large-dataset-optimization.md) | ~400 | Streaming, chunking |

## Related Documentation

- **Reference**: [Reference Index](../reference/INDEX.md) - Memory patterns, profiling tools
- **Templates**: [Templates Index](../templates/INDEX.md) - Profiling report template
- **Main Agent**: [memory-profiler.md](../memory-profiler.md) - Memory profiler agent

---

Return to [main agent](../memory-profiler.md)

skills/memory-profiling/examples/database-connection-leak.md (new file, 490 lines)

# Database Connection Pool Memory Leaks

Detecting and fixing PostgreSQL connection pool leaks in FastAPI applications using connection monitoring and proper cleanup patterns.

## Overview

**Before Optimization**:

- Active connections: 95/100 (pool exhausted)
- Connection timeouts: 15-20/min during peak
- Memory growth: 100MB/hour (unclosed connections)
- Service restarts: 3-4x/day

**After Optimization**:

- Active connections: 8-12/100 (healthy pool)
- Connection timeouts: 0/day
- Memory growth: 0MB/hour (stable)
- Service restarts: 0/month

**Tools**: asyncpg, SQLModel, psycopg3, pg_stat_activity, Prometheus

## 1. Connection Pool Architecture

### Grey Haven Stack: PostgreSQL + SQLModel

**Connection Pool Configuration**:

```python
# database.py
from sqlmodel import create_engine
from sqlalchemy.pool import QueuePool

# ❌ VULNERABLE: No max_overflow, no timeout
engine = create_engine(
    "postgresql://user:pass@localhost/db",
    poolclass=QueuePool,
    pool_size=20,
    echo=True
)

# ✅ SECURE: Proper pool configuration
engine = create_engine(
    "postgresql://user:pass@localhost/db",
    poolclass=QueuePool,
    pool_size=20,          # Core connections
    max_overflow=10,       # Max additional connections
    pool_timeout=30,       # Wait timeout (seconds)
    pool_recycle=3600,     # Recycle after 1 hour
    pool_pre_ping=True,    # Verify connection before use
    echo=False
)
```

**Pool Health Monitoring**:

```python
# monitoring.py
from prometheus_client import Gauge

# Prometheus metrics
db_pool_size = Gauge('db_pool_connections_total', 'Total pool size')
db_pool_active = Gauge('db_pool_connections_active', 'Active connections')
db_pool_idle = Gauge('db_pool_connections_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_connections_overflow', 'Overflow connections')

def record_pool_metrics(engine):
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_active.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())
```

## 2. Common Leak Pattern: Unclosed Connections

### Vulnerable Code (Connection Leak)

```python
# api/orders.py (BEFORE)
from fastapi import APIRouter, Depends
from sqlmodel import Session, select
from database import engine

router = APIRouter()

@router.get("/orders")
async def get_orders():
    # ❌ LEAK: Connection never closed
    session = Session(engine)

    # If an exception occurs here, the session is never closed
    orders = session.exec(select(Order)).all()

    # If we return here, the session is never closed
    return orders

    # session.close() is unreachable after the return above
    session.close()
```

**What Happens**:

1. Every request acquires a connection from the pool
2. An exception or early return prevents `session.close()`
3. The connection remains in the "active" state
4. The pool exhausts after 100 leaked requests (with pool_size=100)
5. New requests time out waiting for a connection

**Memory Impact**:

```
Initial pool:  20 connections (40MB)
After 1 hour:  95 leaked connections (190MB)
After 6 hours: Pool exhausted + 100MB leaked memory
```

### Fixed Code (Context Manager)

```python
# api/orders.py (AFTER)
from fastapi import APIRouter, Depends
from sqlmodel import Session, select
from database import engine

router = APIRouter()

# ✅ Option 1: FastAPI dependency injection (recommended)
def get_session():
    """Session dependency with automatic cleanup"""
    with Session(engine) as session:
        yield session

@router.get("/orders")
async def get_orders(session: Session = Depends(get_session)):
    # Session automatically closed after the request
    orders = session.exec(select(Order)).all()
    return orders


# ✅ Option 2: Explicit context manager
@router.get("/orders-alt")
async def get_orders_alt():
    with Session(engine) as session:
        orders = session.exec(select(Order)).all()
        return orders
    # Session guaranteed to close (even on exception)
```

**Why This Works**:

- The context manager ensures `session.close()` is called in `__exit__`
- Works even if an exception is raised
- Works even on an early return
- FastAPI's `Depends()` handles cleanup of generator dependencies
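
For readers new to context managers, the `with` block above is roughly equivalent to a `try`/`finally`. A minimal sketch of what `__enter__`/`__exit__` guarantee (illustrative only, not SQLModel's actual implementation):

```python
# Roughly what `with Session(engine) as session:` expands to (sketch)
session = Session(engine)   # __enter__ returns the session
try:
    orders = session.exec(select(Order)).all()
    # early returns and exceptions both fall through to finally
finally:
    session.close()         # __exit__ always runs, releasing the connection
```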

## 3. Async Connection Leaks (asyncpg)

### Vulnerable Async Pattern

```python
# api/analytics.py (BEFORE)
import asyncpg
from fastapi import APIRouter

router = APIRouter()

@router.get("/analytics")
async def get_analytics():
    # ❌ LEAK: Connection never closed
    conn = await asyncpg.connect(
        user='postgres',
        password='secret',
        database='analytics'
    )

    # Exception here = connection leaked
    # ('date' is assumed to come from a query parameter)
    result = await conn.fetch('SELECT * FROM metrics WHERE date > $1', date)

    # Early return = connection leaked
    if not result:
        return []

    await conn.close()  # Skipped on the error/early-return paths above
    return result
```

### Fixed Async Pattern

```python
# api/analytics.py (AFTER)
import asyncpg
from fastapi import APIRouter
from contextlib import asynccontextmanager

router = APIRouter()

# ✅ Connection pool (shared across requests, created at startup)
pool: asyncpg.Pool | None = None

@asynccontextmanager
async def get_db_connection():
    """Async context manager for connections"""
    conn = await pool.acquire()
    try:
        yield conn
    finally:
        await pool.release(conn)

@router.get("/analytics")
async def get_analytics():
    async with get_db_connection() as conn:
        result = await conn.fetch(
            'SELECT * FROM metrics WHERE date > $1',
            date  # as above, assumed to come from request scope
        )
        return result
    # Connection automatically released to the pool
```

**Pool Setup** (application startup):

```python
# main.py
from fastapi import FastAPI
import asyncpg

app = FastAPI()

@app.on_event("startup")
async def startup():
    global pool
    pool = await asyncpg.create_pool(
        user='postgres',
        password='secret',
        database='analytics',
        min_size=10,                          # Minimum connections
        max_size=20,                          # Maximum connections
        max_inactive_connection_lifetime=300  # Recycle after 5 min
    )

@app.on_event("shutdown")
async def shutdown():
    await pool.close()
```

## 4. Transaction Leak Detection

### Monitoring Active Connections

**PostgreSQL Query**:

```sql
-- Show active connections with details
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    query,
    state_change,
    NOW() - state_change AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```
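
Sessions stuck in `idle in transaction` are the classic transaction-leak signature: a transaction was opened but never committed or rolled back. A companion query targeting them, plus PostgreSQL's server-side guardrail:

```sql
-- Connections holding a transaction open with no running query
SELECT pid, usename, application_name,
       NOW() - state_change AS idle_for, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_for DESC;

-- Safety net (shown per-session; can also be set in postgresql.conf):
-- terminate transactions left idle for more than 5 minutes
SET idle_in_transaction_session_timeout = '5min';
```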

**Prometheus Metrics**:

```python
# monitoring.py
from prometheus_client import Gauge
import asyncpg

db_connections_active = Gauge(
    'db_connections_active',
    'Active database connections',
    ['state']
)

async def monitor_connections(pool: asyncpg.Pool):
    """Sample PostgreSQL connection states (call periodically, e.g. every 30 seconds)"""
    async with pool.acquire() as conn:
        rows = await conn.fetch("""
            SELECT state, COUNT(*) as count
            FROM pg_stat_activity
            WHERE datname = current_database()
            GROUP BY state
        """)

    for row in rows:
        db_connections_active.labels(state=row['state']).set(row['count'])
```
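
`monitor_connections` takes a single sample; one way to run it on a schedule is a background task started at app startup. A sketch, assuming the `pool` and FastAPI `app` from the sections above:

```python
# main.py (sketch) - sample pool state every 30 seconds in the background
import asyncio

async def monitor_loop(pool: asyncpg.Pool, interval: float = 30.0):
    while True:
        await monitor_connections(pool)
        await asyncio.sleep(interval)

@app.on_event("startup")
async def start_monitoring():
    # Fire-and-forget task; keep a reference so it isn't garbage collected
    app.state.monitor_task = asyncio.create_task(monitor_loop(pool))
```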

**Grafana Alert** (connection leak):

```yaml
alert: DatabaseConnectionLeak
expr: db_connections_active{state="active"} > 80
for: 5m
annotations:
  summary: "Potential connection leak ({{ $value }} active connections)"
  description: "Active connections have been above 80 for 5+ minutes"
```

## 5. Real-World Fix: FastAPI Order Service

### Before (Connection Pool Exhaustion)

```python
# services/order_processor.py (BEFORE)
from sqlmodel import Session, select
from database import engine
from models import Order, OrderItem

class OrderProcessor:
    async def process_order(self, order_id: int):
        # ❌ LEAK: Multiple sessions, some never closed
        session1 = Session(engine)
        order = session1.get(Order, order_id)

        if not order:
            # Early return = session1 leaked
            return None

        # ❌ LEAK: Second session
        session2 = Session(engine)
        items = session2.exec(
            select(OrderItem).where(OrderItem.order_id == order_id)
        ).all()

        # Exception here = both sessions leaked
        total = sum(item.price * item.quantity for item in items)

        order.total = total
        session1.commit()

        # Only session1 closed; session2 leaked
        session1.close()
        return order
```

**Metrics (Before)**:

```
Connection pool: 100 connections
Active connections after 1 hour: 95/100
Leaked connections: ~12/min
Memory growth: 100MB/hour
Pool exhaustion: Every 6-8 hours
```

### After (Proper Resource Management)

```python
# services/order_processor.py (AFTER)
from sqlmodel import Session, select
from database import engine
from models import Order, OrderItem

class OrderProcessor:
    async def process_order(self, order_id: int):
        # ✅ Single session, guaranteed cleanup
        with Session(engine) as session:
            # Query order
            order = session.get(Order, order_id)
            if not order:
                return None

            # Query items (same session)
            items = session.exec(
                select(OrderItem).where(OrderItem.order_id == order_id)
            ).all()

            # Calculate total
            total = sum(item.price * item.quantity for item in items)

            # Update order
            order.total = total
            session.add(order)
            session.commit()
            session.refresh(order)

            return order
        # Session automatically closed (even on exception)
```

**Metrics (After)**:

```
Connection pool: 100 connections
Active connections: 8-12/100 (stable)
Leaked connections: 0/day
Memory growth: 0MB/hour
Pool exhaustion: Never (0 incidents/month)
```

## 6. Connection Pool Configuration Best Practices

### Recommended Settings (Grey Haven Stack)

```python
# database.py - Production settings
from sqlmodel import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    database_url,
    poolclass=QueuePool,
    pool_size=20,        # (workers * connections/worker) + buffer
    max_overflow=10,     # 50% of pool_size
    pool_timeout=30,     # Wait timeout
    pool_recycle=3600,   # Recycle after 1h
    pool_pre_ping=True   # Health check
)
```

**Pool Size Formula**: `pool_size = (workers * conn_per_worker) + buffer`

Example: `(4 workers * 3 conn) + 8 buffer = 20`
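
The formula is easy to encode as a sanity check. A small helper (hypothetical names) that derives the settings from deployment parameters:

```python
# pool_sizing.py (sketch) - derive pool settings from the formula above
def pool_settings(workers: int, conn_per_worker: int, buffer: int = 8) -> dict:
    pool_size = workers * conn_per_worker + buffer
    return {
        "pool_size": pool_size,
        "max_overflow": pool_size // 2,  # 50% of pool_size, per the guideline above
        "pool_timeout": 30,
        "pool_recycle": 3600,
        "pool_pre_ping": True,
    }

# Example: 4 uvicorn workers, 3 connections each -> pool_size=20, max_overflow=10
print(pool_settings(workers=4, conn_per_worker=3))
```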

## 7. Testing Connection Cleanup

### Pytest Fixture for Connection Tracking

```python
# tests/conftest.py
import asyncio

import pytest
from sqlmodel import Session, create_engine

from api.orders import get_orders  # the endpoint from section 2

@pytest.fixture
def engine():
    """Test engine with connection tracking"""
    test_engine = create_engine("postgresql://test:test@localhost/test_db", pool_size=5)
    initial_active = test_engine.pool.checkedout()
    yield test_engine
    final_active = test_engine.pool.checkedout()
    assert final_active == initial_active, f"Leaked {final_active - initial_active} connections"

@pytest.mark.asyncio
async def test_no_connection_leak_under_load(engine):
    """Simulate 1000 concurrent requests"""
    initial = engine.pool.checkedout()
    tasks = [get_orders() for _ in range(1000)]
    await asyncio.gather(*tasks)
    await asyncio.sleep(1)
    assert engine.pool.checkedout() == initial, "Connection leak detected"
```

## 8. CI/CD Integration

```yaml
# .github/workflows/connection-leak-test.yml
name: Connection Leak Detection
on: [pull_request]
jobs:
  leak-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env: {POSTGRES_PASSWORD: test, POSTGRES_DB: test_db}
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest pytest-asyncio
      - run: pytest tests/test_connection_leaks.py -v
```

## 9. Results and Impact

### Before vs After Metrics

| Metric | Before | After | Impact |
|--------|--------|-------|--------|
| **Active Connections** | 95/100 (95%) | 8-12/100 (10%) | **85% reduction** |
| **Connection Timeouts** | 15-20/min | 0/day | **100% eliminated** |
| **Memory Growth** | 100MB/hour | 0MB/hour | **100% eliminated** |
| **Service Restarts** | 3-4x/day | 0/month | **100% eliminated** |
| **Pool Wait Time (p95)** | 5.2s | 0.01s | **99.8% faster** |

### Key Optimizations Applied

1. **Context Managers**: Guaranteed connection cleanup (even on exceptions)
2. **FastAPI Dependencies**: Automatic session lifecycle management
3. **Connection Pooling**: Proper pool_size, max_overflow, pool_timeout
4. **Prometheus Monitoring**: Real-time pool saturation metrics
5. **Load Testing**: CI/CD checks for connection leaks

## Related Documentation

- **Node.js Leaks**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
- **Python Profiling**: [python-scalene-profiling.md](python-scalene-profiling.md)
- **Large Datasets**: [large-dataset-optimization.md](large-dataset-optimization.md)
- **Reference**: [../reference/profiling-tools.md](../reference/profiling-tools.md)

---

Return to [examples index](INDEX.md)

skills/memory-profiling/examples/large-dataset-optimization.md (new file, 452 lines)

# Large Dataset Memory Optimization

Memory-efficient patterns for processing multi-GB datasets in Python and Node.js without OOM errors.

## Overview

**Before Optimization**:

- Dataset size: 10GB CSV (50M rows)
- Memory usage: 20GB (2x dataset size)
- Processing time: 45 minutes
- OOM errors: Frequent (3-4x/day)

**After Optimization**:

- Dataset size: Same (10GB, 50M rows)
- Memory usage: 500MB (constant)
- Processing time: 12 minutes (73% faster)
- OOM errors: 0/month

**Tools**: Polars, pandas chunking, generators, streaming parsers

## 1. Problem: Loading the Entire Dataset

### Vulnerable Pattern (Pandas read_csv)

```python
# analysis.py (BEFORE)
import pandas as pd

def analyze_sales_data(filename: str):
    # ❌ Loads the entire 10GB file into memory
    df = pd.read_csv(filename)  # 20GB RAM usage

    # ❌ Creates copies for each operation
    df['total'] = df['quantity'] * df['price']                     # +10GB
    df_filtered = df[df['total'] > 1000]                           # +8GB
    df_sorted = df_filtered.sort_values('total', ascending=False)  # +8GB

    # Peak memory: 46GB for a 10GB file!
    return df_sorted.head(100)
```

**Memory Profile**:

```
Step 1 (read_csv):     20GB
Step 2 (calculation): +10GB = 30GB
Step 3 (filter):       +8GB = 38GB
Step 4 (sort):         +8GB = 46GB
Result: OOM on a 32GB machine
```

## 2. Solution 1: Pandas Chunking

### Chunk-Based Processing

```python
# analysis.py (AFTER - Chunking)
import pandas as pd

def analyze_sales_data_chunked(filename: str, chunk_size: int = 100000):
    """Process 100K rows at a time (constant memory)"""

    top_sales = []

    # ✅ Process in chunks (100K rows = ~50MB each)
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Calculate total (in-place when possible)
        chunk['total'] = chunk['quantity'] * chunk['price']

        # Filter high-value sales
        filtered = chunk[chunk['total'] > 1000]

        # Keep the top 100 from this chunk
        top_chunk = filtered.nlargest(100, 'total')
        top_sales.append(top_chunk)

        # chunk goes out of scope, memory freed

    # Combine the top results from all chunks
    final_df = pd.concat(top_sales).nlargest(100, 'total')
    return final_df
```

**Memory Profile (Chunked)**:

```
Chunk 1:   50MB (process) → 10MB (top 100) → garbage collected
Chunk 2:   50MB (process) → 10MB (top 100) → garbage collected
...
Chunk 500: 50MB (process) → 10MB (top 100) → garbage collected
Final combine: 500 * 10MB = 500MB total
Peak memory: 500MB (99% reduction!)
```

## 3. Solution 2: Polars (Lazy Evaluation)

### Polars for Large Datasets

**Why Polars**:

- 10-100x faster than pandas
- True streaming (doesn't load the entire file)
- Query optimizer (like SQL databases)
- Parallel processing (uses all CPU cores)

```python
# analysis.py (POLARS)
import polars as pl

def analyze_sales_data_polars(filename: str):
    """Polars lazy evaluation - constant memory"""

    result = (
        pl.scan_csv(filename)  # ✅ Lazy: doesn't load yet
        .with_columns([
            (pl.col('quantity') * pl.col('price')).alias('total')
        ])
        .filter(pl.col('total') > 1000)
        .sort('total', descending=True)
        .head(100)
        .collect(streaming=True)  # ✅ Streaming: processes in chunks
    )

    return result
```

**Memory Profile (Polars Streaming)**:

```
Memory usage: 200-300MB (constant)
Processing: Parallel chunks, optimized query plan
Time: 12 minutes vs 45 minutes (pandas)
```

## 4. Node.js Streaming

### CSV Streaming with csv-parser

```typescript
// analysis.ts (BEFORE)
import fs from 'fs';
import Papa from 'papaparse';

async function analyzeSalesData(filename: string) {
  // ❌ Loads the entire 10GB file
  const fileContent = fs.readFileSync(filename, 'utf-8');    // 20GB RAM
  const parsed = Papa.parse(fileContent, { header: true });  // +10GB

  // Process all rows
  const results = parsed.data.map(row => ({
    total: row.quantity * row.price
  }));

  return results; // 30GB total
}
```

**Fixed with Streaming**:

```typescript
// analysis.ts (AFTER - Streaming)
import fs from 'fs';
import csv from 'csv-parser';
import { pipeline } from 'stream/promises';

async function analyzeSalesDataStreaming(filename: string) {
  const topSales: Array<{row: any, total: number}> = [];

  await pipeline(
    fs.createReadStream(filename), // ✅ Stream (not loading everything)
    csv(),
    async function (source) {
      for await (const row of source) {
        // csv-parser yields string fields; convert before multiplying
        const total = Number(row.quantity) * Number(row.price);

        if (total > 1000) {
          topSales.push({ row, total });

          // Keep only the top 100 (memory bounded)
          if (topSales.length > 100) {
            topSales.sort((a, b) => b.total - a.total);
            topSales.length = 100;
          }
        }
      }
    }
  );

  return topSales;
}
```

**Memory Profile (Streaming)**:

```
Buffer: 64KB (stream chunk size)
Processing: One row at a time
Array: 100 rows max (bounded)
Peak memory: 5MB vs 30GB (99.98% reduction!)
```

## 5. Generator Pattern (Python)

### Memory-Efficient Pipeline

```python
# pipeline.py (Generator-based)
import csv
import heapq
from typing import Iterator

def read_csv_streaming(filename: str) -> Iterator[dict]:
    """Read CSV line by line (not all at once)"""
    with open(filename, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row  # ✅ One row at a time

def calculate_totals(rows: Iterator[dict]) -> Iterator[dict]:
    """Calculate totals (lazy)"""
    for row in rows:
        row['total'] = float(row['quantity']) * float(row['price'])
        yield row

def filter_high_value(rows: Iterator[dict], threshold: float = 1000) -> Iterator[dict]:
    """Filter high-value sales (lazy)"""
    for row in rows:
        if row['total'] > threshold:
            yield row

def top_n(rows: Iterator[dict], n: int = 100) -> list[dict]:
    """Keep the top N rows (bounded memory)"""
    return heapq.nlargest(n, rows, key=lambda x: x['total'])

# ✅ Pipeline: each stage processes one row at a time
def analyze_sales_pipeline(filename: str):
    rows = read_csv_streaming(filename)
    with_totals = calculate_totals(rows)
    high_value = filter_high_value(with_totals)
    top_100 = top_n(high_value, 100)
    return top_100
```

**Memory Profile (Generator Pipeline)**:

```
Stage 1 (read):      1 row (a few KB)
Stage 2 (calculate): 1 row (a few KB)
Stage 3 (filter):    1 row (a few KB)
Stage 4 (top_n):     100 rows (bounded)
Peak memory: <1MB (constant)
```

## 6. Real-World: E-Commerce Analytics

### Before (Pandas load_all)

```python
# analytics_service.py (BEFORE)
import pandas as pd

# engine: SQLAlchemy engine, defined elsewhere (see database.py)

class AnalyticsService:
    def generate_sales_report(self, start_date: str, end_date: str):
        # ❌ Load the entire orders table (10GB)
        orders = pd.read_sql(
            "SELECT * FROM orders WHERE date BETWEEN %s AND %s",
            engine,
            params=(start_date, end_date)
        )  # 20GB RAM

        # ❌ Load the entire order_items table (50GB)
        items = pd.read_sql("SELECT * FROM order_items", engine)  # +100GB RAM

        # Join (creates another copy)
        merged = orders.merge(items, on='order_id')  # +150GB

        # Aggregate
        summary = merged.groupby('category').agg({
            'total': 'sum',
            'quantity': 'sum'
        })

        return summary  # Peak: 270GB - OOM!
```

### After (Database Aggregation + Chunking)

```python
# analytics_service.py (AFTER)
import pandas as pd

class AnalyticsService:
    def generate_sales_report(self, start_date: str, end_date: str):
        # ✅ Aggregate in the database (PostgreSQL does the work)
        query = """
            SELECT
                oi.category,
                SUM(oi.price * oi.quantity) as total,
                SUM(oi.quantity) as quantity
            FROM orders o
            JOIN order_items oi ON o.id = oi.order_id
            WHERE o.date BETWEEN %(start)s AND %(end)s
            GROUP BY oi.category
        """

        # Result: aggregated data (a few KB, not 270GB!)
        summary = pd.read_sql(
            query,
            engine,
            params={'start': start_date, 'end': end_date}
        )

        return summary  # Peak: 1MB vs 270GB
```

**Metrics**:

```
Before: 270GB RAM, OOM error
After: 1MB RAM, 99.9996% reduction
Time: 45 min → 30 seconds (90x faster)
```

## 7. Dask for Parallel Processing

### Dask DataFrame (Parallel Chunking)

```python
# analysis_dask.py
import dask.dataframe as dd

def analyze_sales_data_dask(filename: str):
    """Process in parallel chunks across CPU cores"""

    # ✅ Lazy loading, parallel processing
    df = dd.read_csv(
        filename,
        blocksize='64MB'  # Process 64MB chunks
    )

    # All operations are lazy (no computation yet)
    df['total'] = df['quantity'] * df['price']
    filtered = df[df['total'] > 1000]
    top_100 = filtered.nlargest(100, 'total')

    # ✅ Trigger computation (parallel across cores)
    result = top_100.compute()

    return result
```

**Memory Profile (Dask)**:

```
Workers: 8 (one per CPU core)
Memory per worker: 100MB
Total memory: 800MB vs 46GB
Speed: 4-8x faster (parallel)
```

## 8. Memory Monitoring

### Track Memory Usage During Processing

```python
# monitor.py
import tracemalloc
import psutil
from contextlib import contextmanager

@contextmanager
def memory_monitor(label: str):
    """Monitor memory usage of a code block"""

    # Start tracking
    tracemalloc.start()
    process = psutil.Process()
    mem_before = process.memory_info().rss / 1024 / 1024  # MB

    yield

    # Measure after
    mem_after = process.memory_info().rss / 1024 / 1024
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"{label}:")
    print(f"  Memory before: {mem_before:.1f} MB")
    print(f"  Memory after:  {mem_after:.1f} MB")
    print(f"  Memory delta:  {mem_after - mem_before:.1f} MB")
    print(f"  Peak traced:   {peak / 1024 / 1024:.1f} MB")

# Usage (assumes pandas as pd and polars as pl are imported)
with memory_monitor("Pandas load_all"):
    df = pd.read_csv("large_file.csv")  # Shows high memory usage

with memory_monitor("Polars streaming"):
    df = pl.scan_csv("large_file.csv").collect(streaming=True)  # Low memory
```

## 9. Optimization Decision Tree

**Choose the right tool based on dataset size**:

```
Dataset < 1GB:
  → Use pandas.read_csv() (simple, fast)

Dataset 1-10GB:
  → Use pandas chunking (chunksize=100000)
  → Or Polars streaming (faster, less memory)

Dataset 10-100GB:
  → Use Polars streaming (best performance)
  → Or Dask (parallel processing)
  → Or database aggregation (PostgreSQL, ClickHouse)

Dataset > 100GB:
  → Database aggregation (required)
  → Or Spark/Ray (distributed computing)
  → Never load into memory
```
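
The same heuristic as a small helper function (the thresholds are the rough guidelines above, not hard limits):

```python
# tool_choice.py (sketch) - encode the decision tree above
def choose_tool(dataset_gb: float) -> str:
    if dataset_gb < 1:
        return "pandas.read_csv() - simple, fast"
    if dataset_gb <= 10:
        return "pandas chunking or Polars streaming"
    if dataset_gb <= 100:
        return "Polars streaming, Dask, or database aggregation"
    return "database aggregation or Spark/Ray - never load into memory"

assert choose_tool(0.5).startswith("pandas.read_csv")
assert choose_tool(50).startswith("Polars")
```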

## 10. Results and Impact

### Before vs After Metrics

| Metric | Before (pandas) | After (Polars) | Impact |
|--------|----------------|----------------|--------|
| **Memory Usage** | 46GB | 300MB | **99.3% reduction** |
| **Processing Time** | 45 min | 12 min | **73% faster** |
| **OOM Errors** | 3-4/day | 0/month | **100% eliminated** |
| **Max Dataset Size** | 10GB | 500GB+ | **50x scalability** |

### Key Optimizations Applied

1. **Chunking**: Process 100K rows at a time (constant memory)
2. **Lazy Evaluation**: Polars/Dask don't load data until needed
3. **Streaming**: One row at a time (generators, Node.js streams)
4. **Database Aggregation**: Let PostgreSQL do the work
5. **Bounded Memory**: `heapq.nlargest()` keeps the top N (not all rows)

### Cost Savings

**Infrastructure costs**:

- Before: r5.8xlarge (256GB RAM) = $1.344/hour
- After: r5.large (16GB RAM) = $0.084/hour
- **Savings**: 94% reduction ($23,000/year per service)

## Related Documentation

- **Node.js Leaks**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
- **Python Profiling**: [python-scalene-profiling.md](python-scalene-profiling.md)
- **DB Leaks**: [database-connection-leak.md](database-connection-leak.md)
- **Reference**: [../reference/memory-optimization-patterns.md](../reference/memory-optimization-patterns.md)

---

Return to [examples index](INDEX.md)

skills/memory-profiling/examples/nodejs-memory-leak.md (new file, 490 lines)

# Node.js Memory Leak Detection

Identifying and fixing memory leaks in Node.js applications using Chrome DevTools, heapdump, and memory profiling techniques.

## Overview

**Symptoms Before Fix**:

- Memory usage: 150MB → 2GB over 6 hours
- Heap size growing linearly (5MB/minute)
- V8 garbage collection ineffective
- Production outages (OOM killer)

**After Fix**:

- Memory stable at 150MB (93% reduction)
- Heap size constant over time
- Zero OOM errors in 30 days
- Proper resource cleanup

**Tools**: Chrome DevTools, heapdump, memwatch-next, Prometheus monitoring

## 1. Memory Leak Symptoms

### Linear Memory Growth

```bash
# Monitor Node.js memory usage
node --expose-gc --inspect app.js

# Connect Chrome DevTools: chrome://inspect
# Memory tab → Take heap snapshot every 5 minutes
```

**Heap growth pattern**:

```
Time  | Heap Size | External | Total
------|-----------|----------|-------
0 min | 50MB      | 10MB     | 60MB
5 min | 75MB      | 15MB     | 90MB
10min | 100MB     | 20MB     | 120MB
15min | 125MB     | 25MB     | 150MB
...   | ...       | ...      | ...
6 hrs | 1.8GB     | 200MB    | 2GB
```

**Diagnosis**: Linear growth indicates a memory leak (not the normal sawtooth GC pattern)

### High GC Activity

```javascript
// Log process memory every minute; steadily rising heapUsed under
// constant load means the GC is reclaiming less than is allocated
setInterval(() => {
  const usage = process.memoryUsage();
  console.log({
    heapUsed: `${Math.round(usage.heapUsed / 1024 / 1024)}MB`,
    heapTotal: `${Math.round(usage.heapTotal / 1024 / 1024)}MB`,
    external: `${Math.round(usage.external / 1024 / 1024)}MB`,
    rss: `${Math.round(usage.rss / 1024 / 1024)}MB`
  });
}, 60000); // Every minute
```

**Output showing a leak**:

```
{heapUsed: '75MB', heapTotal: '100MB', external: '15MB', rss: '120MB'}
{heapUsed: '100MB', heapTotal: '130MB', external: '20MB', rss: '150MB'}
{heapUsed: '125MB', heapTotal: '160MB', external: '25MB', rss: '185MB'}
```

## 2. Heap Snapshot Analysis

### Taking Heap Snapshots

```javascript
// Generate a heap snapshot programmatically
const v8 = require('v8');

function takeHeapSnapshot(filename) {
  const heapSnapshot = v8.writeHeapSnapshot(filename);
  console.log(`Heap snapshot written to ${heapSnapshot}`);
}

// Take a snapshot every hour
setInterval(() => {
  const timestamp = new Date().toISOString().replace(/:/g, '-');
  takeHeapSnapshot(`heap-${timestamp}.heapsnapshot`);
}, 3600000);
```

### Analyzing Snapshots in Chrome DevTools

**Steps**:

1. Load two snapshots (before and after 1 hour)
2. Compare snapshots (Comparison view)
3. Sort by "Size Delta" (descending)
4. Look for objects growing significantly

**Example Analysis**:

```
Object Type           | Count  | Size Delta | Retained Size
----------------------|--------|------------|---------------
(array)               | +5,000 | +50MB      | +60MB
EventEmitter          | +1,200 | +12MB      | +15MB
Closure (anonymous)   | +800   | +8MB       | +10MB
```

**Diagnosis**: A growing EventEmitter count = likely event listener leak

### Retained Objects Analysis

```javascript
// Chrome DevTools → Heap Snapshot → Summary → sort by "Retained Size"
// Click an object → view its Retainer tree
```

**Retainer tree example** (EventEmitter leak):

```
EventEmitter @123456
  ← listeners: Array[50]
    ← _events.data: Array
      ← EventEmitter @123456 (self-reference leak!)
```

## 3. Common Memory Leak Patterns

### Pattern 1: Event Listener Leak

**Vulnerable Code**:

```typescript
// ❌ LEAK: EventEmitter listeners never removed
import {EventEmitter} from 'events';

class DataProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    // A new listener is added on every call
    this.emitter.on('data', (data) => {
      console.log('Processing:', data);
    });

    // Emit 1000 events
    for (let i = 0; i < 1000; i++) {
      this.emitter.emit('data', {id: i});
    }
  }
}

// Called 1000 times = 1000 listeners accumulate!
setInterval(() => new DataProcessor().processOrders(), 1000);
```

**Result**: 1000 listeners/second = 3.6M listeners/hour → a 2GB memory leak

**Fixed Code**:

```typescript
// ✅ FIXED: Remove the listener after use
class DataProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    const handler = (data: {id: number}) => {
      console.log('Processing:', data);
    };

    this.emitter.on('data', handler);

    try {
      for (let i = 0; i < 1000; i++) {
        this.emitter.emit('data', {id: i});
      }
    } finally {
      // ✅ Clean up the listener
      this.emitter.removeListener('data', handler);
    }
  }
}
```

**Better**: Use `once()` for one-time listeners:

```typescript
this.emitter.once('data', handler); // Auto-removed after the first emit
```

### Pattern 2: Closure Leak

**Vulnerable Code**:

```typescript
// ❌ LEAK: Closure captures a large object
const cache = new Map();

function processRequest(userId: string) {
  const largeData = fetchLargeDataset(userId); // 10MB object

  // The closure captures all of largeData
  cache.set(userId, () => {
    return largeData.summary; // Only the summary (1KB) is needed
  });
}

// Called for 1000 users = 10GB in the cache!
```

**Fixed Code**:

```typescript
// ✅ FIXED: Only store what you need
const cache = new Map();

function processRequest(userId: string) {
  const largeData = fetchLargeDataset(userId);
  const summary = largeData.summary; // Extract only the 1KB summary

  // Store minimal data
  cache.set(userId, () => summary);
}

// 1000 users = 1MB in the cache ✅
```

### Pattern 3: Global Variable Accumulation

**Vulnerable Code**:

```typescript
// ❌ LEAK: Global array keeps growing
const requestLog: Request[] = [];

app.post('/api/orders', (req, res) => {
  requestLog.push(req); // Never removed!
  // ... process order
});

// 1M requests = 1M objects in memory permanently
```

**Fixed Code**:

```typescript
// ✅ FIXED: Use an LRU cache with a size limit
// (recent lru-cache versions export the class as `LRUCache` instead)
import LRU from 'lru-cache';

const requestLog = new LRU({
  max: 1000,          // Maximum 1000 items
  ttl: 1000 * 60 * 5  // 5-minute TTL
});

app.post('/api/orders', (req, res) => {
  requestLog.set(req.id, req); // Auto-evicts old items
});
```

### Pattern 4: Forgotten Timers/Intervals

**Vulnerable Code**:

```typescript
// ❌ LEAK: setInterval never cleared
class ReportGenerator {
  private data: any[] = [];

  start() {
    setInterval(() => {
      this.data.push(generateReport()); // Accumulates forever
    }, 60000);
  }
}

// Each instance leaks!
const generator = new ReportGenerator();
generator.start();
```

**Fixed Code**:

```typescript
// ✅ FIXED: Clear the interval on cleanup
class ReportGenerator {
  private data: any[] = [];
  private intervalId?: NodeJS.Timeout;

  start() {
    this.intervalId = setInterval(() => {
      this.data.push(generateReport());
    }, 60000);
  }

  stop() {
    if (this.intervalId) {
      clearInterval(this.intervalId);
      this.intervalId = undefined;
      this.data = []; // Clear accumulated data
    }
  }
}
```

## 4. Memory Profiling with memwatch-next

### Installation

```bash
bun add memwatch-next
```

### Leak Detection

```typescript
// memory-monitor.ts
import memwatch from 'memwatch-next';

// Detect memory leaks
memwatch.on('leak', (info) => {
  console.error('Memory leak detected:', {
    growth: info.growth,
    reason: info.reason,
    current_base: `${Math.round(info.current_base / 1024 / 1024)}MB`,
    leaked: `${Math.round((info.current_base - info.start) / 1024 / 1024)}MB`
  });

  // Alert to PagerDuty/Slack
  alertOps('Memory leak detected', info);
});

// Monitor GC stats
memwatch.on('stats', (stats) => {
  console.log('GC stats:', {
    used_heap_size: `${Math.round(stats.used_heap_size / 1024 / 1024)}MB`,
    heap_size_limit: `${Math.round(stats.heap_size_limit / 1024 / 1024)}MB`,
    num_full_gc: stats.num_full_gc,
    num_inc_gc: stats.num_inc_gc
  });
});
```

### HeapDiff for Leak Analysis

```typescript
import memwatch from 'memwatch-next';

const hd = new memwatch.HeapDiff();

// Simulate a leak
const leak: any[] = [];
for (let i = 0; i < 10000; i++) {
  leak.push({data: new Array(1000).fill('x')});
}

// Compare heaps
const diff = hd.end();
console.log('Heap diff:', JSON.stringify(diff, null, 2));

// Output:
// {
//   "before": {"nodes": 12345, "size": 50000000},
//   "after": {"nodes": 22345, "size": 150000000},
//   "change": {
//     "size_bytes": 100000000,  // 100MB leak!
//     "size": "100.00MB",
//     "freed_nodes": 100,
//     "allocated_nodes": 10100  // Net increase
//   }
// }
```

## 5. Production Memory Monitoring

### Prometheus Metrics

```typescript
// metrics.ts
import {Gauge} from 'prom-client';

const memoryUsageGauge = new Gauge({
  name: 'nodejs_memory_usage_bytes',
  help: 'Node.js memory usage in bytes',
  labelNames: ['type']
});

setInterval(() => {
  const usage = process.memoryUsage();
  memoryUsageGauge.set({type: 'heap_used'}, usage.heapUsed);
  memoryUsageGauge.set({type: 'heap_total'}, usage.heapTotal);
  memoryUsageGauge.set({type: 'external'}, usage.external);
  memoryUsageGauge.set({type: 'rss'}, usage.rss);
}, 15000);
```

**Grafana Alert**:

```promql
# Alert if heap usage is growing linearly
increase(nodejs_memory_usage_bytes{type="heap_used"}[1h]) > 100000000 # 100MB/hour
```

## 6. Real-World Fix: EventEmitter Leak

### Before (Leaking)

```typescript
// order-processor.ts (BEFORE FIX)
class OrderProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    // ❌ LEAK: A listener is added on every call
    this.emitter.on('order:created', async (order) => {
      await this.sendConfirmationEmail(order);
      await this.updateInventory(order);
    });

    const orders = await db.query.orders.findMany({status: 'pending'});
    for (const order of orders) {
      this.emitter.emit('order:created', order);
    }
  }
}

// Called every minute
setInterval(() => new OrderProcessor().processOrders(), 60000);
```

**Result**: 1,440 listeners/day → a 2GB memory leak in production

### After (Fixed)

```typescript
// order-processor.ts (AFTER FIX)
class OrderProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    const handler = async (order) => {
      await this.sendConfirmationEmail(order);
      await this.updateInventory(order);
    };

    // ✅ Register a single named handler per run
    // (once() would only handle the first order, so use on() plus cleanup)
    this.emitter.on('order:created', handler);

    try {
      const orders = await db.query.orders.findMany({status: 'pending'});
      for (const order of orders) {
        this.emitter.emit('order:created', order);
      }
    } finally {
      // ✅ Guaranteed cleanup, even if the query or a handler throws
      this.emitter.removeListener('order:created', handler);
    }
  }
}
```

**Result**: Memory stable at 150MB, zero leaks

## 7. Results and Impact

### Before vs After Metrics

| Metric | Before Fix | After Fix | Impact |
|--------|-----------|-----------|---------|
| **Memory Usage** | 2GB (after 6h) | 150MB (stable) | **93% reduction** |
| **Heap Size** | Linear growth (5MB/min) | Stable | **Zero growth** |
| **OOM Incidents** | 12/month | 0/month | **100% eliminated** |
| **GC Pause Time** | 200ms avg | 50ms avg | **75% faster** |
| **Uptime** | 6 hours avg | 30+ days | **120x improvement** |

### Lessons Learned

**1. Always remove event listeners**

- Use `once()` for one-time events
- Use `removeListener()` in `finally` blocks
- Track listeners with a WeakMap for debugging

**2. Avoid closures capturing large objects** (see the WeakMap sketch below)

- Extract only the needed data before creating the closure
- Use WeakMap/WeakSet for object references
- Profile with heap snapshots regularly

**3. Monitor memory in production**

- Prometheus metrics for heap usage
- Alert on linear growth patterns
- Weekly heap snapshot analysis
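
On lesson 2's WeakMap advice: keying a cache by the object itself lets the cached value be collected together with its key, instead of pinning it in a `Map` forever. A minimal sketch — it assumes a `fetchLargeDataset` that takes a user *object* (unlike Pattern 2's string-keyed version), since WeakMap keys must be objects:

```typescript
// WeakMap cache: entries are dropped automatically once the key object
// is no longer reachable anywhere else
declare function fetchLargeDataset(user: object): { summary: string }; // assumed

const summaryCache = new WeakMap<object, string>();

function getSummary(user: object): string {
  let summary = summaryCache.get(user);
  if (summary === undefined) {
    summary = fetchLargeDataset(user).summary; // keep only the 1KB summary
    summaryCache.set(user, summary);           // no manual eviction needed
  }
  return summary;
}
```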

## Related Documentation

- **Python Profiling**: [python-scalene-profiling.md](python-scalene-profiling.md)
- **DB Leaks**: [database-connection-leak.md](database-connection-leak.md)
- **Reference**: [../reference/memory-patterns.md](../reference/memory-patterns.md)
- **Templates**: [../templates/memory-report.md](../templates/memory-report.md)

---

Return to [examples index](INDEX.md)

skills/memory-profiling/examples/python-scalene-profiling.md (new file, 456 lines)

# Python Memory Profiling with Scalene

Line-by-line memory and CPU profiling for Python applications using Scalene, with pytest integration and optimization strategies.

## Overview

**Before Optimization**:

- Memory usage: 500MB for processing 10K records
- OOM (Out of Memory) errors with 100K records
- Processing time: 45 seconds for 10K records
- List comprehensions loading the entire dataset

**After Optimization**:

- Memory usage: 5MB for processing 10K records (99% reduction)
- No OOM errors with 1M records
- Processing time: 8 seconds for 10K records (82% faster)
- Generator-based streaming

**Tools**: Scalene, pytest, memory_profiler, tracemalloc

## 1. Scalene Installation and Setup

### Installation

```bash
# Install Scalene
pip install scalene

# Or with uv (faster)
uv pip install scalene
```

### Basic Usage

```bash
# Profile an entire script
scalene script.py

# Profile with pytest (recommended)
scalene --cli --memory -m pytest tests/

# HTML output
scalene --html --outfile profile.html script.py

# Reduced profile (only lines with significant usage)
scalene --reduced-profile script.py
```

## 2. Profiling with pytest

### Test File Setup

```python
# tests/test_data_processing.py
import pytest
from data_processor import DataProcessor

@pytest.fixture
def processor():
    return DataProcessor()

def test_process_large_dataset(processor):
    # Generate 10K records
    records = [{'id': i, 'value': i * 2} for i in range(10000)]

    # Process (this is where the memory spike occurs)
    result = processor.process_records(records)

    assert len(result) == 10000
```

### Running Scalene with pytest

```bash
# Profile memory usage during test execution
uv run scalene --cli --memory -m pytest tests/test_data_processing.py 2>&1 | grep -i "memory\|mb\|test"

# Output shows line-by-line memory allocation
```

**Scalene Output** (before optimization):

```
data_processor.py:
  Line | Memory % | Memory (MB) | CPU % | Code
  -----|----------|-------------|-------|-----
  12   | 45%      | 225 MB      | 10%   | result = [transform(r) for r in records]
  18   | 30%      | 150 MB      | 5%    | filtered = [r for r in result if r['value'] > 0]
  25   | 15%      | 75 MB       | 20%   | sorted_data = sorted(filtered, key=lambda x: x['id'])
```

**Analysis**: Line 12 is the hotspot (45% of memory)

## 3. Memory Hotspot Identification

### Vulnerable Code (Memory Spike)

```python
# data_processor.py (BEFORE OPTIMIZATION)
from datetime import datetime

class DataProcessor:
    def process_records(self, records: list[dict]) -> list[dict]:
        # ❌ HOTSPOT: List comprehension materializes the entire dataset
        result = [self.transform(r) for r in records]  # 225MB for 10K records

        # ❌ Creates another copy
        filtered = [r for r in result if r['value'] > 0]  # +150MB

        # ❌ sorted() creates yet another copy
        sorted_data = sorted(filtered, key=lambda x: x['id'])  # +75MB

        return sorted_data  # Total: 450MB for 10K records

    def transform(self, record: dict) -> dict:
        return {
            'id': record['id'],
            'value': record['value'] * 2,
            'timestamp': datetime.now()
        }
```

**Scalene Report**:

```
Memory allocation breakdown:
- Line 12 (list comprehension): 225MB (50%)
- Line 18 (filtering): 150MB (33%)
- Line 25 (sorting): 75MB (17%)

Total memory: 450MB for 10,000 records
Projected for 100K: 4.5GB → OOM!
```

### Optimized Code (Generator-Based)

```python
# data_processor.py (AFTER OPTIMIZATION)
from datetime import datetime
from typing import Iterator

class DataProcessor:
    def process_records(self, records: list[dict]) -> Iterator[dict]:
        # ✅ Generator: processes one record at a time
        transformed = (self.transform(r) for r in records)  # O(1) memory

        # ✅ Generator chaining
        filtered = (r for r in transformed if r['value'] > 0)  # O(1) memory

        # ⚠️ sorted() must materialize the filtered rows (O(n) for survivors);
        # for very large datasets, use external sorting or database ORDER BY
        yield from sorted(filtered, key=lambda x: x['id'])

    def transform(self, record: dict) -> dict:
        return {
            'id': record['id'],
            'value': record['value'] * 2,
            'timestamp': datetime.now()
        }

    # Alternative: fully streaming (no sorting)
    def process_records_streaming(self, records: list[dict]) -> Iterator[dict]:
        for record in records:
            transformed = self.transform(record)
            if transformed['value'] > 0:
                yield transformed  # O(1) memory, fully streaming
```

**Scalene Report (After)**:

```
Memory allocation breakdown:
- Line 12 (generator): 5MB (100% - constant overhead)
- Line 18 (filter generator): 0MB (lazy)
- Line 25 (yield): 0MB (lazy)

Total memory: 5MB for 10,000 records (99% reduction!)
Scalable to 1M+ records without OOM
```
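
When only the top N rows matter, a bounded heap avoids materializing the sort at all. A sketch using the stdlib, combinable with the streaming variant above:

```python
import heapq

def top_n(rows, n: int = 100) -> list[dict]:
    # Keeps at most n items in memory regardless of input size
    return heapq.nlargest(n, rows, key=lambda r: r['value'])

# Usage: top = top_n(processor.process_records_streaming(records), n=100)
```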

## 4. Common Memory Patterns

### Pattern 1: List Comprehension → Generator

**Before** (High Memory):

```python
# ❌ Loads the entire list into memory
import json

def process_large_file(filename: str) -> list[dict]:
    with open(filename) as f:
        lines = f.readlines()  # Loads the entire file (500MB)

    # Another copy
    return [json.loads(line) for line in lines]  # +500MB = 1GB total
```

**After** (Low Memory):

```python
# ✅ Generator: processes line by line
import json
from typing import Iterator

def process_large_file(filename: str) -> Iterator[dict]:
    with open(filename) as f:
        for line in f:  # Reads one line at a time
            yield json.loads(line)  # O(1) memory
```

**Scalene diff**: 1GB → 5MB (99.5% reduction)

### Pattern 2: DataFrame Memory Optimization

**Before** (High Memory):

```python
# ❌ Loads the entire CSV into memory
import pandas as pd

def analyze_data(filename: str):
    df = pd.read_csv(filename)  # 10GB CSV → 10GB RAM

    # All transformations in memory
    df['new_col'] = df['value'] * 2
    df_filtered = df[df['value'] > 0]
    return df_filtered.groupby('category').sum()
```

**After** (Low Memory with Chunking):

```python
# ✅ Process in chunks
import pandas as pd

def analyze_data(filename: str):
    chunk_size = 10000
    results = []

    # Process 10K rows at a time
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        chunk['new_col'] = chunk['value'] * 2
        filtered = chunk[chunk['value'] > 0]
        group_result = filtered.groupby('category').sum()
        results.append(group_result)

    # Combine results
    return pd.concat(results).groupby(level=0).sum()  # Much smaller
```

**Scalene diff**: 10GB → 500MB (95% reduction)

### Pattern 3: String Concatenation

**Before** (High Memory):

```python
# ❌ Creates a new string each iteration (O(n²) memory churn)
def build_report(data: list[dict]) -> str:
    report = ""
    for item in data:  # 100K items
        report += f"{item['id']}: {item['value']}\n"  # New string every time
    return report  # 500MB final string + 500MB garbage = 1GB
```

**After** (Low Memory):

```python
# ✅ StringIO or join (O(n) memory)
from io import StringIO
from typing import Iterator

def build_report(data: list[dict]) -> str:
    buffer = StringIO()
    for item in data:
        buffer.write(f"{item['id']}: {item['value']}\n")
    return buffer.getvalue()

# Or even better: a generator
def build_report_streaming(data: list[dict]) -> Iterator[str]:
    for item in data:
        yield f"{item['id']}: {item['value']}\n"
```

**Scalene diff**: 1GB → 50MB (95% reduction)

## 5. Scalene CLI Reference

### Common Options

```bash
# Memory-only profiling (fastest)
scalene --cli --memory script.py

# CPU + Memory profiling
scalene --cli --cpu --memory script.py

# Reduced profile (only lines with non-trivial usage)
scalene --reduced-profile script.py

# Restrict profiling to files whose path contains a substring
scalene --profile-only process_data script.py

# HTML report
scalene --html --outfile profile.html script.py

# Profile with pytest
scalene --cli --memory -m pytest tests/

# Adjust the allocation-sampling threshold
scalene --malloc-threshold 0.1 script.py
```

### Interpreting Output

**Column Meanings**:

```
Memory %  | Percentage of total memory allocated
Memory MB | Absolute memory allocated (in megabytes)
CPU %     | Percentage of CPU time spent
Python %  | Time spent in Python (vs native code)
```

**Example Output**:

```
script.py:
  Line | Memory % | Memory MB | CPU % | Python % | Code
  -----|----------|-----------|-------|----------|-----
  12   | 45.2%    | 225.6 MB  | 10.5% | 95.2%    | data = [x for x in range(1000000)]
  18   | 30.1%    | 150.3 MB  | 5.2%  | 98.1%    | filtered = list(filter(lambda x: x > 0, data))
```

**Analysis**:

- Line 12: High memory (45.2%) → optimize the list comprehension
- Line 18: Moderate memory (30.1%) → use a generator instead of list()
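
The overview lists `tracemalloc` among the tools; it ships with CPython and is handy for cross-checking Scalene's numbers where Scalene isn't installed. A minimal sketch:

```python
# trace_check.py (sketch) - stdlib-only snapshot of top allocation sites
import tracemalloc

tracemalloc.start()

data = [x for x in range(1_000_000)]  # suspect allocation

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # prints file:line with size and allocation count
tracemalloc.stop()
```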

## 6. Integration with CI/CD

### GitHub Actions Workflow

```yaml
# .github/workflows/memory-profiling.yml
name: Memory Profiling

on: [pull_request]

jobs:
  profile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install scalene pytest

      - name: Run memory profiling
        run: |
          scalene --cli --memory --reduced-profile -m pytest tests/ > profile.txt

      - name: Check for memory hotspots
        run: |
          if grep -q "Memory %" profile.txt; then
            # Rough check: fail if any line's third column exceeds 100 (MB)
            # (awk exits 0 when a hotspot is found, triggering the alert)
            if awk '$3+0 > 100 { found=1 } END { exit found ? 0 : 1 }' profile.txt; then
              echo "Memory hotspot detected!"
              exit 1
            fi
          fi

      - name: Upload profile
        uses: actions/upload-artifact@v3
        with:
          name: memory-profile
          path: profile.txt
```

## 7. Real-World Optimization: CSV Processing

### Before (500MB Memory, OOM at 100K rows)

```python
# csv_processor.py (BEFORE)
import pandas as pd

class CSVProcessor:
    def process_file(self, filename: str) -> dict:
        # ❌ Loads the entire CSV
        df = pd.read_csv(filename)  # 500MB for 10K rows

        # ❌ Multiple copies
        df['total'] = df['quantity'] * df['price']
        df_filtered = df[df['total'] > 100]
        summary = df_filtered.groupby('category').agg({
            'total': 'sum',
            'quantity': 'sum'
        })

        return summary.to_dict()
```

**Scalene Output**:

```
Line 8: 500MB (75%) - pd.read_csv()
Line 11: 100MB (15%) - df['total'] calculation
Line 12: 50MB (10%) - filtering
Total: 650MB for 10K rows
```

### After (5MB Memory, Handles 1M rows)

```python
# csv_processor.py (AFTER)
import pandas as pd
from collections import defaultdict

class CSVProcessor:
    def process_file(self, filename: str) -> dict:
        # ✅ Process in 10K-row chunks
        chunk_size = 10000
        results = defaultdict(lambda: {'total': 0, 'quantity': 0})

        for chunk in pd.read_csv(filename, chunksize=chunk_size):
            chunk['total'] = chunk['quantity'] * chunk['price']
            filtered = chunk[chunk['total'] > 100]

            # Aggregate incrementally
            for category, group in filtered.groupby('category'):
                results[category]['total'] += group['total'].sum()
                results[category]['quantity'] += group['quantity'].sum()

        return dict(results)
```

**Scalene Output (After)**:

```
Line 9: 5MB (100%) - chunk processing (constant memory)
Total: 5MB for any file size (99% reduction)
```

## 8. Results and Impact

### Before vs After Metrics

| Metric | Before | After | Impact |
|--------|--------|-------|--------|
| **Memory Usage** | 500MB (10K rows) | 5MB (1M rows) | **99% reduction** |
| **Processing Time** | 45s (10K rows) | 8s (10K rows) | **82% faster** |
| **Max File Size** | 100K rows (OOM) | 10M+ rows | **100x scalability** |
| **OOM Errors** | 5/week | 0/month | **100% eliminated** |

### Key Optimizations Applied

1. **List comprehension → Generator**: 225MB → 0MB
2. **DataFrame chunking**: 500MB → 5MB per chunk
3. **String concatenation**: 1GB → 50MB (StringIO)
4. **Lazy evaluation**: Load on demand vs load all

## Related Documentation

- **Node.js Leaks**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
- **DB Leaks**: [database-connection-leak.md](database-connection-leak.md)
- **Reference**: [../reference/profiling-tools.md](../reference/profiling-tools.md)
- **Templates**: [../templates/scalene-config.txt](../templates/scalene-config.txt)

---

Return to [examples index](INDEX.md)