Initial commit

Zhongwei Li
2025-11-29 18:29:23 +08:00
commit ebc71f5387
37 changed files with 9382 additions and 0 deletions


@@ -0,0 +1,86 @@
# Memory Profiling Examples
Production memory profiling implementations for Node.js and Python with leak detection, heap analysis, and optimization strategies.
## Examples Overview
### Node.js Memory Leak Detection
**File**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
Identifying and fixing memory leaks in Node.js applications:
- **Memory leak detection**: Chrome DevTools, heapdump analysis
- **Common leak patterns**: Event listeners, closures, global variables
- **Heap snapshots**: Before/after comparison, retained object analysis
- **Real leak**: EventEmitter leak causing 2GB memory growth
- **Fix**: Proper cleanup with `removeListener()`, WeakMap for caching
- **Result**: Memory stabilized at 150MB (93% reduction)
**Use when**: Node.js memory growing over time, debugging production memory issues
---
### Python Memory Profiling with Scalene
**File**: [python-scalene-profiling.md](python-scalene-profiling.md)
Line-by-line memory profiling for Python applications:
- **Scalene setup**: Installation, pytest integration, CLI usage
- **Memory hotspots**: Line-by-line allocation tracking
- **CPU + Memory**: Combined profiling for performance bottlenecks
- **Real scenario**: 500MB dataset causing OOM, fixed with generators
- **Optimization**: List comprehension → generator (500MB → 5MB)
- **Result**: 99% memory reduction, no OOM errors
**Use when**: Python memory spikes, profiling pytest tests, finding allocation hotspots
---
### Database Connection Pool Leak
**File**: [database-connection-leak.md](database-connection-leak.md)
PostgreSQL connection pool exhaustion and memory leaks:
- **Symptom**: Connection pool maxed out, memory growing linearly
- **Root cause**: Unclosed connections in error paths, missing `finally` blocks
- **Detection**: Connection pool metrics, memory profiling
- **Fix**: Context managers (`with` statement), proper cleanup
- **Result**: Zero connection leaks, memory stable at 80MB
**Use when**: Database connection errors, "too many clients" errors, connection pool issues
---
### Large Dataset Memory Optimization
**File**: [large-dataset-optimization.md](large-dataset-optimization.md)
Memory-efficient data processing for large datasets:
- **Problem**: Loading 10GB CSV into memory (OOM killer)
- **Solutions**: Streaming with `pandas.read_csv(chunksize)`, generators, memory mapping
- **Techniques**: Lazy evaluation, columnar processing, batch processing
- **Before/After**: 10GB memory → 500MB (95% reduction)
- **Tools**: Pandas chunking, Dask for parallel processing
**Use when**: Processing large files, OOM errors, batch data processing
---
## Quick Navigation
| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **Node.js Leaks** | [nodejs-memory-leak.md](nodejs-memory-leak.md) | ~450 | EventEmitter, heap snapshots |
| **Python Scalene** | [python-scalene-profiling.md](python-scalene-profiling.md) | ~420 | Line-by-line profiling |
| **DB Connection Leaks** | [database-connection-leak.md](database-connection-leak.md) | ~380 | Connection pool management |
| **Large Datasets** | [large-dataset-optimization.md](large-dataset-optimization.md) | ~400 | Streaming, chunking |
## Related Documentation
- **Reference**: [Reference Index](../reference/INDEX.md) - Memory patterns, profiling tools
- **Templates**: [Templates Index](../templates/INDEX.md) - Profiling report template
- **Main Agent**: [memory-profiler.md](../memory-profiler.md) - Memory profiler agent
---
Return to [main agent](../memory-profiler.md)


@@ -0,0 +1,490 @@
# Database Connection Pool Memory Leaks
Detecting and fixing PostgreSQL connection pool leaks in FastAPI applications using connection monitoring and proper cleanup patterns.
## Overview
**Before Optimization**:
- Active connections: 95/100 (pool exhausted)
- Connection timeouts: 15-20/min during peak
- Memory growth: 100MB/hour (unclosed connections)
- Service restarts: 3-4x/day
**After Optimization**:
- Active connections: 8-12/100 (healthy pool)
- Connection timeouts: 0/day
- Memory growth: 0MB/hour (stable)
- Service restarts: 0/month
**Tools**: asyncpg, SQLModel, psycopg3, pg_stat_activity, Prometheus
## 1. Connection Pool Architecture
### Grey Haven Stack: PostgreSQL + SQLModel
**Connection Pool Configuration**:
```python
# database.py
from sqlmodel import create_engine
from sqlalchemy.pool import QueuePool

# ❌ VULNERABLE: no explicit max_overflow, timeout, recycle, or pre-ping
engine = create_engine(
    "postgresql://user:pass@localhost/db",
    poolclass=QueuePool,
    pool_size=20,
    echo=True
)

# ✅ SECURE: Proper pool configuration
engine = create_engine(
    "postgresql://user:pass@localhost/db",
    poolclass=QueuePool,
    pool_size=20,          # Core connections
    max_overflow=10,       # Max additional connections
    pool_timeout=30,       # Wait timeout (seconds)
    pool_recycle=3600,     # Recycle after 1 hour
    pool_pre_ping=True,    # Verify connection before use
    echo=False
)
```
**Pool Health Monitoring**:
```python
# monitoring.py
from prometheus_client import Gauge

# Prometheus metrics
db_pool_size = Gauge('db_pool_connections_total', 'Total pool size')
db_pool_active = Gauge('db_pool_connections_active', 'Active connections')
db_pool_idle = Gauge('db_pool_connections_idle', 'Idle connections')
db_pool_overflow = Gauge('db_pool_connections_overflow', 'Overflow connections')

def record_pool_metrics(engine):
    pool = engine.pool
    db_pool_size.set(pool.size())
    db_pool_active.set(pool.checkedout())
    db_pool_idle.set(pool.size() - pool.checkedout())
    db_pool_overflow.set(pool.overflow())
```
## 2. Common Leak Pattern: Unclosed Connections
### Vulnerable Code (Connection Leak)
```python
# api/orders.py (BEFORE)
from fastapi import APIRouter
from sqlmodel import Session, select
from database import engine
from models import Order

router = APIRouter()

@router.get("/orders")
async def get_orders():
    # ❌ LEAK: Connection never closed
    session = Session(engine)
    # If exception occurs here, session never closed
    orders = session.exec(select(Order)).all()
    # If return happens here, session never closed
    return orders
    # session.close() never reached if early return/exception
    session.close()
```
**What Happens**:
1. Every request acquires connection from pool
2. Exception/early return prevents `session.close()`
3. Connection remains in "active" state
4. Pool exhausts once every connection has leaked (~100 requests in this service, whose pool allows 100 connections)
5. New requests timeout waiting for connection
**Memory Impact**:
```
Initial pool: 20 connections (40MB)
After 1 hour: 95 leaked connections (190MB)
After 6 hours: Pool exhausted + 100MB leaked memory
```
### Fixed Code (Context Manager)
```python
# api/orders.py (AFTER)
from fastapi import APIRouter, Depends
from sqlmodel import Session, select
from database import engine
from models import Order

router = APIRouter()

# ✅ Option 1: FastAPI dependency injection (recommended)
def get_session():
    """Session dependency with automatic cleanup"""
    with Session(engine) as session:
        yield session

@router.get("/orders")
async def get_orders(session: Session = Depends(get_session)):
    # Session automatically closed after request
    orders = session.exec(select(Order)).all()
    return orders

# ✅ Option 2: Explicit context manager
@router.get("/orders-alt")
async def get_orders_alt():
    with Session(engine) as session:
        orders = session.exec(select(Order)).all()
        return orders
    # Session guaranteed to close (even on exception)
```
**Why This Works**:
- Context manager ensures `session.close()` called in `__exit__`
- Works even if exception raised
- Works even if early return
- FastAPI `Depends()` handles async cleanup
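**Equivalent try/finally** — a minimal sketch of what the context manager guarantees, not SQLModel's internal implementation; it reuses `engine` and `Order` from the examples above:
```python
from sqlmodel import Session, select
from database import engine
from models import Order

def get_orders_equivalent():
    """Roughly what `with Session(engine) as session:` expands to."""
    session = Session(engine)
    try:
        return session.exec(select(Order)).all()
    finally:
        session.close()  # runs on normal return, early return, and exceptions
```
Whether you use the dependency or the explicit `with` block, the cleanup path is the same; the context manager simply writes the `finally` for you.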
## 3. Async Connection Leaks (asyncpg)
### Vulnerable Async Pattern
```python
# api/analytics.py (BEFORE)
import asyncpg
from fastapi import APIRouter

router = APIRouter()

@router.get("/analytics")
async def get_analytics(date: str):
    # ❌ LEAK: Connection never closed
    conn = await asyncpg.connect(
        user='postgres',
        password='secret',
        database='analytics'
    )
    # Exception here = connection leaked
    result = await conn.fetch('SELECT * FROM metrics WHERE date > $1', date)
    # Early return = connection leaked
    if not result:
        return []
    await conn.close()  # Skipped on the early-return and exception paths
    return result
```
### Fixed Async Pattern
```python
# api/analytics.py (AFTER)
import asyncpg
from fastapi import APIRouter
from contextlib import asynccontextmanager

router = APIRouter()

# ✅ Connection pool (shared across requests, created at startup)
pool: asyncpg.Pool | None = None

@asynccontextmanager
async def get_db_connection():
    """Async context manager for connections"""
    conn = await pool.acquire()
    try:
        yield conn
    finally:
        await pool.release(conn)

@router.get("/analytics")
async def get_analytics(date: str):
    async with get_db_connection() as conn:
        result = await conn.fetch(
            'SELECT * FROM metrics WHERE date > $1',
            date
        )
    return result
    # Connection automatically released to pool
```
**Pool Setup** (application startup):
```python
# main.py
from fastapi import FastAPI
import asyncpg

app = FastAPI()

@app.on_event("startup")
async def startup():
    global pool
    pool = await asyncpg.create_pool(
        user='postgres',
        password='secret',
        database='analytics',
        min_size=10,   # Minimum connections
        max_size=20,   # Maximum connections
        max_inactive_connection_lifetime=300  # Recycle after 5 min
    )

@app.on_event("shutdown")
async def shutdown():
    await pool.close()
```
## 4. Transaction Leak Detection
### Monitoring Active Connections
**PostgreSQL Query**:
```sql
-- Show active connections with details
SELECT
pid,
usename,
application_name,
client_addr,
state,
query,
state_change,
NOW() - state_change AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```
**Prometheus Metrics**:
```python
# monitoring.py
from prometheus_client import Gauge
import asyncpg

db_connections_active = Gauge(
    'db_connections_active',
    'Active database connections',
    ['state']
)

async def monitor_connections(pool: asyncpg.Pool):
    """Monitor PostgreSQL connections every 30 seconds"""
    async with pool.acquire() as conn:
        rows = await conn.fetch("""
            SELECT state, COUNT(*) as count
            FROM pg_stat_activity
            WHERE datname = current_database()
            GROUP BY state
        """)
        for row in rows:
            db_connections_active.labels(state=row['state']).set(row['count'])
```
**Grafana Alert** (connection leak):
```yaml
alert: DatabaseConnectionLeak
expr: db_connections_active{state="active"} > 80
for: 5m
annotations:
  summary: "Potential connection leak ({{ $value }} active connections)"
  description: "Active connections have been above 80 for 5+ minutes"
```
## 5. Real-World Fix: FastAPI Order Service
### Before (Connection Pool Exhaustion)
```python
# services/order_processor.py (BEFORE)
from sqlmodel import Session, select
from database import engine
from models import Order, OrderItem

class OrderProcessor:
    async def process_order(self, order_id: int):
        # ❌ LEAK: Multiple sessions, some never closed
        session1 = Session(engine)
        order = session1.get(Order, order_id)
        if not order:
            # Early return = session1 leaked
            return None

        # ❌ LEAK: Second session
        session2 = Session(engine)
        items = session2.exec(
            select(OrderItem).where(OrderItem.order_id == order_id)
        ).all()

        # Exception here = both sessions leaked
        total = sum(item.price * item.quantity for item in items)
        order.total = total
        session1.commit()

        # Only session1 closed, session2 leaked
        session1.close()
        return order
```
**Metrics (Before)**:
```
Connection pool: 100 connections
Active connections after 1 hour: 95/100
Leaked connections: ~12/min
Memory growth: 100MB/hour
Pool exhaustion: Every 6-8 hours
```
### After (Proper Resource Management)
```python
# services/order_processor.py (AFTER)
from sqlmodel import Session, select
from database import engine
from models import Order, OrderItem

class OrderProcessor:
    async def process_order(self, order_id: int):
        # ✅ Single session, guaranteed cleanup
        with Session(engine) as session:
            # Query order
            order = session.get(Order, order_id)
            if not order:
                return None

            # Query items (same session)
            items = session.exec(
                select(OrderItem).where(OrderItem.order_id == order_id)
            ).all()

            # Calculate total
            total = sum(item.price * item.quantity for item in items)

            # Update order
            order.total = total
            session.add(order)
            session.commit()
            session.refresh(order)
            return order
        # Session automatically closed (even on exception)
```
**Metrics (After)**:
```
Connection pool: 100 connections
Active connections: 8-12/100 (stable)
Leaked connections: 0/day
Memory growth: 0MB/hour
Pool exhaustion: Never (0 incidents/month)
```
## 6. Connection Pool Configuration Best Practices
### Recommended Settings (Grey Haven Stack)
```python
# database.py - Production settings
from sqlmodel import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    database_url,
    poolclass=QueuePool,
    pool_size=20,        # (workers * connections/worker) + buffer
    max_overflow=10,     # 50% of pool_size
    pool_timeout=30,     # Wait timeout
    pool_recycle=3600,   # Recycle after 1h
    pool_pre_ping=True   # Health check
)
```
**Pool Size Formula**: `pool_size = (workers * conn_per_worker) + buffer`
Example: `(4 workers * 3 conn) + 8 buffer = 20`
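To keep that formula visible in configuration code, a small helper can compute it; a minimal sketch (the worker and per-worker counts below are illustrative deployment values, not measured ones):
```python
def recommended_pool_size(workers: int, conn_per_worker: int, buffer: int = 8) -> int:
    """pool_size = (workers * conn_per_worker) + buffer"""
    return workers * conn_per_worker + buffer

# Example from above: 4 workers x 3 connections each + 8 spare = 20
assert recommended_pool_size(4, 3, buffer=8) == 20
```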
## 7. Testing Connection Cleanup
### Pytest Fixture for Connection Tracking
```python
# tests/conftest.py
import asyncio

import pytest
from sqlmodel import Session, create_engine, select

from models import Order

@pytest.fixture
def engine():
    """Test engine with connection tracking"""
    test_engine = create_engine("postgresql://test:test@localhost/test_db", pool_size=5)
    initial_active = test_engine.pool.checkedout()
    yield test_engine
    final_active = test_engine.pool.checkedout()
    assert final_active == initial_active, f"Leaked {final_active - initial_active} connections"

@pytest.mark.asyncio
async def test_no_connection_leak_under_load(engine):
    """Simulate 1000 concurrent requests"""

    async def handle_request():
        # Mirrors the fixed endpoint: one session per request, always closed
        with Session(engine) as session:
            session.exec(select(Order)).all()

    initial = engine.pool.checkedout()
    tasks = [handle_request() for _ in range(1000)]
    await asyncio.gather(*tasks)
    await asyncio.sleep(1)
    assert engine.pool.checkedout() == initial, "Connection leak detected"
```
## 8. CI/CD Integration
```yaml
# .github/workflows/connection-leak-test.yml
name: Connection Leak Detection
on: [pull_request]
jobs:
  leak-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env: {POSTGRES_PASSWORD: test, POSTGRES_DB: test_db}
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: {python-version: '3.11'}
      - run: pip install -r requirements.txt pytest pytest-asyncio
      - run: pytest tests/test_connection_leaks.py -v
```
## 9. Results and Impact
### Before vs After Metrics
| Metric | Before | After | Impact |
|--------|--------|-------|--------|
| **Active Connections** | 95/100 (95%) | 8-12/100 (10%) | **85% reduction** |
| **Connection Timeouts** | 15-20/min | 0/day | **100% eliminated** |
| **Memory Growth** | 100MB/hour | 0MB/hour | **100% eliminated** |
| **Service Restarts** | 3-4x/day | 0/month | **100% eliminated** |
| **Pool Wait Time (p95)** | 5.2s | 0.01s | **99.8% faster** |
### Key Optimizations Applied
1. **Context Managers**: Guaranteed connection cleanup (even on exceptions)
2. **FastAPI Dependencies**: Automatic session lifecycle management
3. **Connection Pooling**: Proper pool_size, max_overflow, pool_timeout
4. **Prometheus Monitoring**: Real-time pool saturation metrics
5. **Load Testing**: CI/CD checks for connection leaks
## Related Documentation
- **Node.js Leaks**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
- **Python Profiling**: [python-scalene-profiling.md](python-scalene-profiling.md)
- **Large Datasets**: [large-dataset-optimization.md](large-dataset-optimization.md)
- **Reference**: [../reference/profiling-tools.md](../reference/profiling-tools.md)
---
Return to [examples index](INDEX.md)


@@ -0,0 +1,452 @@
# Large Dataset Memory Optimization
Memory-efficient patterns for processing multi-GB datasets in Python and Node.js without OOM errors.
## Overview
**Before Optimization**:
- Dataset size: 10GB CSV (50M rows)
- Memory usage: 20GB (2x dataset size)
- Processing time: 45 minutes
- OOM errors: Frequent (3-4x/day)
**After Optimization**:
- Dataset size: Same (10GB, 50M rows)
- Memory usage: 500MB (constant)
- Processing time: 12 minutes (73% faster)
- OOM errors: 0/month
**Tools**: Polars, pandas chunking, generators, streaming parsers
## 1. Problem: Loading Entire Dataset
### Vulnerable Pattern (Pandas read_csv)
```python
# analysis.py (BEFORE)
import pandas as pd

def analyze_sales_data(filename: str):
    # ❌ Loads entire 10GB file into memory
    df = pd.read_csv(filename)  # 20GB RAM usage

    # ❌ Creates copies for each operation
    df['total'] = df['quantity'] * df['price']  # +10GB
    df_filtered = df[df['total'] > 1000]  # +8GB
    df_sorted = df_filtered.sort_values('total', ascending=False)  # +8GB

    # Peak memory: 46GB for 10GB file!
    return df_sorted.head(100)
```
**Memory Profile**:
```
Step 1 (read_csv): 20GB
Step 2 (calculation): +10GB = 30GB
Step 3 (filter): +8GB = 38GB
Step 4 (sort): +8GB = 46GB
Result: OOM on 32GB machine
```
## 2. Solution 1: Pandas Chunking
### Chunk-Based Processing
```python
# analysis.py (AFTER - Chunking)
import pandas as pd

def analyze_sales_data_chunked(filename: str, chunk_size: int = 100000):
    """Process 100K rows at a time (constant memory)"""
    top_sales = []

    # ✅ Process in chunks (100K rows = ~50MB each)
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Calculate total (in-place when possible)
        chunk['total'] = chunk['quantity'] * chunk['price']

        # Filter high-value sales
        filtered = chunk[chunk['total'] > 1000]

        # Keep top 100 from this chunk
        top_chunk = filtered.nlargest(100, 'total')
        top_sales.append(top_chunk)
        # chunk goes out of scope, memory freed

    # Combine top results from all chunks
    final_df = pd.concat(top_sales).nlargest(100, 'total')
    return final_df
```
**Memory Profile (Chunked)**:
```
Chunk 1: 50MB (process) → 10MB (top 100) → garbage collected
Chunk 2: 50MB (process) → 10MB (top 100) → garbage collected
...
Chunk 500: 50MB (process) → 10MB (top 100) → garbage collected
Final combine: 500 * 10MB = 500MB total
Peak memory: 500MB (99% reduction!)
```
## 3. Solution 2: Polars (Lazy Evaluation)
### Polars for Large Datasets
**Why Polars**:
- 10-100x faster than pandas
- True streaming (doesn't load entire file)
- Query optimizer (like SQL databases)
- Parallel processing (uses all CPU cores)
```python
# analysis.py (POLARS)
import polars as pl

def analyze_sales_data_polars(filename: str):
    """Polars lazy evaluation - constant memory"""
    result = (
        pl.scan_csv(filename)  # ✅ Lazy: doesn't load yet
        .with_columns([
            (pl.col('quantity') * pl.col('price')).alias('total')
        ])
        .filter(pl.col('total') > 1000)
        .sort('total', descending=True)
        .head(100)
        .collect(streaming=True)  # ✅ Streaming: processes in chunks
    )
    return result
```
**Memory Profile (Polars Streaming)**:
```
Memory usage: 200-300MB (constant)
Processing: Parallel chunks, optimized query plan
Time: 12 minutes vs 45 minutes (pandas)
```
## 4. Node.js Streaming
### CSV Streaming with csv-parser
```typescript
// analysis.ts (BEFORE)
import fs from 'fs';
import Papa from 'papaparse';

async function analyzeSalesData(filename: string) {
  // ❌ Loads entire 10GB file
  const fileContent = fs.readFileSync(filename, 'utf-8'); // 20GB RAM
  const parsed = Papa.parse(fileContent, { header: true, dynamicTyping: true }); // +10GB

  // Process all rows
  const results = parsed.data.map((row: any) => ({
    total: row.quantity * row.price
  }));
  return results; // 30GB total
}
```
**Fixed with Streaming**:
```typescript
// analysis.ts (AFTER - Streaming)
import fs from 'fs';
import csv from 'csv-parser';
import { pipeline } from 'stream/promises';

async function analyzeSalesDataStreaming(filename: string) {
  const topSales: Array<{ row: any; total: number }> = [];

  await pipeline(
    fs.createReadStream(filename), // ✅ Stream (not load all)
    csv(),
    async function (source: AsyncIterable<any>) {
      for await (const row of source) {
        const total = Number(row.quantity) * Number(row.price);
        if (total > 1000) {
          topSales.push({ row, total });
          // Keep only top 100 (memory bounded)
          if (topSales.length > 100) {
            topSales.sort((a, b) => b.total - a.total);
            topSales.length = 100;
          }
        }
      }
    }
  );
  return topSales;
}
```
**Memory Profile (Streaming)**:
```
Buffer: 64KB (stream chunk size)
Processing: One row at a time
Array: 100 rows max (bounded)
Peak memory: 5MB vs 30GB (99.98% reduction!)
```
## 5. Generator Pattern (Python)
### Memory-Efficient Pipeline
```python
# pipeline.py (Generator-based)
from typing import Iterator
import csv
import heapq

def read_csv_streaming(filename: str) -> Iterator[dict]:
    """Read CSV line by line (not all at once)"""
    with open(filename, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row  # ✅ One row at a time

def calculate_totals(rows: Iterator[dict]) -> Iterator[dict]:
    """Calculate totals (lazy)"""
    for row in rows:
        row['total'] = float(row['quantity']) * float(row['price'])
        yield row

def filter_high_value(rows: Iterator[dict], threshold: float = 1000) -> Iterator[dict]:
    """Filter high-value sales (lazy)"""
    for row in rows:
        if row['total'] > threshold:
            yield row

def top_n(rows: Iterator[dict], n: int = 100) -> list[dict]:
    """Keep top N rows (bounded memory)"""
    return heapq.nlargest(n, rows, key=lambda x: x['total'])

# ✅ Pipeline: each stage processes one row at a time
def analyze_sales_pipeline(filename: str):
    rows = read_csv_streaming(filename)
    with_totals = calculate_totals(rows)
    high_value = filter_high_value(with_totals)
    top_100 = top_n(high_value, 100)
    return top_100
```
**Memory Profile (Generator Pipeline)**:
```
Stage 1 (read): 1 row (few KB)
Stage 2 (calculate): 1 row (few KB)
Stage 3 (filter): 1 row (few KB)
Stage 4 (top_n): 100 rows (bounded)
Peak memory: <1MB (constant)
```
## 6. Real-World: E-Commerce Analytics
### Before (Pandas load_all)
```python
# analytics_service.py (BEFORE)
import pandas as pd
from database import engine  # SQLAlchemy engine

class AnalyticsService:
    def generate_sales_report(self, start_date: str, end_date: str):
        # ❌ Load entire orders table (10GB)
        orders = pd.read_sql(
            "SELECT * FROM orders WHERE date BETWEEN %s AND %s",
            engine,
            params=(start_date, end_date)
        )  # 20GB RAM

        # ❌ Load entire order_items (50GB)
        items = pd.read_sql("SELECT * FROM order_items", engine)  # +100GB RAM

        # Join (creates another copy)
        merged = orders.merge(items, on='order_id')  # +150GB

        # Aggregate
        summary = merged.groupby('category').agg({
            'total': 'sum',
            'quantity': 'sum'
        })
        return summary  # Peak: 270GB - OOM!
```
### After (Database Aggregation + Chunking)
```python
# analytics_service.py (AFTER)
import pandas as pd
from database import engine  # SQLAlchemy engine

class AnalyticsService:
    def generate_sales_report(self, start_date: str, end_date: str):
        # ✅ Aggregate in database (PostgreSQL does the work)
        query = """
            SELECT
                oi.category,
                SUM(oi.price * oi.quantity) as total,
                SUM(oi.quantity) as quantity
            FROM orders o
            JOIN order_items oi ON o.id = oi.order_id
            WHERE o.date BETWEEN %(start)s AND %(end)s
            GROUP BY oi.category
        """
        # Result: aggregated data (few KB, not 270GB!)
        summary = pd.read_sql(
            query,
            engine,
            params={'start': start_date, 'end': end_date}
        )
        return summary  # Peak: 1MB vs 270GB
```
**Metrics**:
```
Before: 270GB RAM, OOM error
After: 1MB RAM, 99.9996% reduction
Time: 45 min → 30 seconds (90x faster)
```
## 7. Dask for Parallel Processing
### Dask DataFrame (Parallel Chunking)
```python
# analysis_dask.py
import dask.dataframe as dd

def analyze_sales_data_dask(filename: str):
    """Process in parallel chunks across CPU cores"""
    # ✅ Lazy loading, parallel processing
    df = dd.read_csv(
        filename,
        blocksize='64MB'  # Process 64MB chunks
    )

    # All operations are lazy (no computation yet)
    df['total'] = df['quantity'] * df['price']
    filtered = df[df['total'] > 1000]
    top_100 = filtered.nlargest(100, 'total')

    # ✅ Trigger computation (parallel across cores)
    result = top_100.compute()
    return result
```
**Memory Profile (Dask)**:
```
Workers: 8 (one per CPU core)
Memory per worker: 100MB
Total memory: 800MB vs 46GB
Speed: 4-8x faster (parallel)
```
## 8. Memory Monitoring
### Track Memory Usage During Processing
```python
# monitor.py
import tracemalloc
import psutil
from contextlib import contextmanager

@contextmanager
def memory_monitor(label: str):
    """Monitor memory usage of code block"""
    # Start tracking
    tracemalloc.start()
    process = psutil.Process()
    mem_before = process.memory_info().rss / 1024 / 1024  # MB

    yield

    # Measure after
    mem_after = process.memory_info().rss / 1024 / 1024
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"{label}:")
    print(f"  Memory before: {mem_before:.1f} MB")
    print(f"  Memory after: {mem_after:.1f} MB")
    print(f"  Memory delta: {mem_after - mem_before:.1f} MB")
    print(f"  Peak traced: {peak / 1024 / 1024:.1f} MB")

# Usage
import pandas as pd
import polars as pl

with memory_monitor("Pandas load_all"):
    df = pd.read_csv("large_file.csv")  # Shows high memory usage

with memory_monitor("Polars streaming"):
    df = pl.scan_csv("large_file.csv").collect(streaming=True)  # Low memory
```
## 9. Optimization Decision Tree
**Choose the right tool based on dataset size**:
```
Dataset < 1GB:
→ Use pandas.read_csv() (simple, fast)
Dataset 1-10GB:
→ Use pandas chunking (chunksize=100000)
→ Or Polars streaming (faster, less memory)
Dataset 10-100GB:
→ Use Polars streaming (best performance)
→ Or Dask (parallel processing)
→ Or Database aggregation (PostgreSQL, ClickHouse)
Dataset > 100GB:
→ Database aggregation (required)
→ Or Spark/Ray (distributed computing)
→ Never load into memory
```
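The same decision tree can be written as a small dispatcher keyed on file size; a hedged sketch (the thresholds mirror the rules of thumb above, and the returned labels refer to techniques shown earlier, not to real functions):
```python
import os

GB = 1024 ** 3

def choose_strategy(filename: str) -> str:
    """Pick a processing strategy from the on-disk file size."""
    size = os.path.getsize(filename)
    if size < 1 * GB:
        return "pandas.read_csv (load directly)"
    if size < 10 * GB:
        return "pandas chunking or Polars streaming"
    if size < 100 * GB:
        return "Polars streaming, Dask, or database aggregation"
    return "database aggregation or distributed engine (never load into memory)"
```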
## 10. Results and Impact
### Before vs After Metrics
| Metric | Before (pandas) | After (Polars) | Impact |
|--------|----------------|----------------|--------|
| **Memory Usage** | 46GB | 300MB | **99.3% reduction** |
| **Processing Time** | 45 min | 12 min | **73% faster** |
| **OOM Errors** | 3-4/day | 0/month | **100% eliminated** |
| **Max Dataset Size** | 10GB | 500GB+ | **50x scalability** |
### Key Optimizations Applied
1. **Chunking**: Process 100K rows at a time (constant memory)
2. **Lazy Evaluation**: Polars/Dask don't load until needed
3. **Streaming**: One row at a time (generators, Node.js streams)
4. **Database Aggregation**: Let PostgreSQL do the work
5. **Bounded Memory**: heapq.nlargest() keeps top N (not all rows)
### Cost Savings
**Infrastructure costs**:
- Before: r5.8xlarge (256GB RAM) = $1.344/hour
- After: r5.large (16GB RAM) = $0.084/hour
- **Savings**: 94% reduction ($23,000/year per service)
## Related Documentation
- **Node.js Leaks**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
- **Python Profiling**: [python-scalene-profiling.md](python-scalene-profiling.md)
- **DB Leaks**: [database-connection-leak.md](database-connection-leak.md)
- **Reference**: [../reference/memory-optimization-patterns.md](../reference/memory-optimization-patterns.md)
---
Return to [examples index](INDEX.md)


@@ -0,0 +1,490 @@
# Node.js Memory Leak Detection
Identifying and fixing memory leaks in Node.js applications using Chrome DevTools, heapdump, and memory profiling techniques.
## Overview
**Symptoms Before Fix**:
- Memory usage: 150MB → 2GB over 6 hours
- Heap size growing linearly (5MB/minute)
- V8 garbage collection ineffective
- Production outages (OOM killer)
**After Fix**:
- Memory stable at 150MB (93% reduction)
- Heap size constant over time
- Zero OOM errors in 30 days
- Proper resource cleanup
**Tools**: Chrome DevTools, heapdump, memwatch-next, Prometheus monitoring
## 1. Memory Leak Symptoms
### Linear Memory Growth
```bash
# Monitor Node.js memory usage
node --expose-gc --inspect app.js
# Connect Chrome DevTools: chrome://inspect
# Memory tab → Take heap snapshot every 5 minutes
```
**Heap growth pattern**:
```
Time | Heap Size | External | Total
------|-----------|----------|-------
0 min | 50MB | 10MB | 60MB
5 min | 75MB | 15MB | 90MB
10min | 100MB | 20MB | 120MB
15min | 125MB | 25MB | 150MB
... | ... | ... | ...
6 hrs | 1.8GB | 200MB | 2GB
```
**Diagnosis**: Linear growth indicates memory leak (not normal sawtooth GC pattern)
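A lightweight in-process check can flag that pattern automatically; a rough sketch that samples `heapUsed` and warns when every recent sample grows (the window and threshold are illustrative, and this complements rather than replaces heap snapshots):
```typescript
// heap-growth-check.ts -- warn when heapUsed rises monotonically
const samples: number[] = [];
const WINDOW = 12;          // 12 samples x 5 min = 1 hour
const MIN_GROWTH_MB = 50;   // ignore normal sawtooth fluctuations

setInterval(() => {
  samples.push(process.memoryUsage().heapUsed);
  if (samples.length > WINDOW) samples.shift();

  const rising = samples.every((v, i) => i === 0 || v >= samples[i - 1]);
  const growthMb = (samples[samples.length - 1] - samples[0]) / 1024 / 1024;

  if (samples.length === WINDOW && rising && growthMb > MIN_GROWTH_MB) {
    console.warn(`Possible leak: heap grew ${growthMb.toFixed(0)}MB over the last hour`);
  }
}, 5 * 60 * 1000); // sample every 5 minutes
```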
### High GC Activity
```javascript
// Monitor memory usage over time
setInterval(() => {
  const usage = process.memoryUsage();
  console.log({
    heapUsed: `${Math.round(usage.heapUsed / 1024 / 1024)}MB`,
    heapTotal: `${Math.round(usage.heapTotal / 1024 / 1024)}MB`,
    external: `${Math.round(usage.external / 1024 / 1024)}MB`,
    rss: `${Math.round(usage.rss / 1024 / 1024)}MB`
  });
}, 60000); // Every minute
```
**Output showing leak**:
```
{heapUsed: '75MB', heapTotal: '100MB', external: '15MB', rss: '120MB'}
{heapUsed: '100MB', heapTotal: '130MB', external: '20MB', rss: '150MB'}
{heapUsed: '125MB', heapTotal: '160MB', external: '25MB', rss: '185MB'}
```
## 2. Heap Snapshot Analysis
### Taking Heap Snapshots
```javascript
// Generate heap snapshot programmatically
const v8 = require('v8');

function takeHeapSnapshot(filename) {
  const heapSnapshot = v8.writeHeapSnapshot(filename);
  console.log(`Heap snapshot written to ${heapSnapshot}`);
}

// Take snapshot every hour
setInterval(() => {
  const timestamp = new Date().toISOString().replace(/:/g, '-');
  takeHeapSnapshot(`heap-${timestamp}.heapsnapshot`);
}, 3600000);
```
### Analyzing Snapshots in Chrome DevTools
**Steps**:
1. Load two snapshots (before and after 1 hour)
2. Compare snapshots (Comparison view)
3. Sort by "Size Delta" (descending)
4. Look for objects growing significantly
**Example Analysis**:
```
Object Type | Count | Size Delta | Retained Size
----------------------|--------|------------|---------------
(array) | +5,000 | +50MB | +60MB
EventEmitter | +1,200 | +12MB | +15MB
Closure (anonymous) | +800 | +8MB | +10MB
```
**Diagnosis**: EventEmitter count growing = likely event listener leak
### Retained Objects Analysis
```javascript
// Chrome DevTools → Heap Snapshot → Summary → sort by "Retained Size"
// Click object → view Retainer tree
```
**Retainer tree example** (EventEmitter leak):
```
EventEmitter @123456
← listeners: Array[50]
← _events.data: Array
← EventEmitter @123456 (self-reference leak!)
```
## 3. Common Memory Leak Patterns
### Pattern 1: Event Listener Leak
**Vulnerable Code**:
```typescript
// ❌ LEAK: EventEmitter listeners never removed
import {EventEmitter} from 'events';

class DataProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    // Add listener every time function called
    this.emitter.on('data', (data) => {
      console.log('Processing:', data);
    });

    // Emit 1000 events
    for (let i = 0; i < 1000; i++) {
      this.emitter.emit('data', {id: i});
    }
  }
}

// Called repeatedly on the same instance = listeners accumulate!
const processor = new DataProcessor();
setInterval(() => processor.processOrders(), 1000);
```
**Result**: every call leaves another listener (and its captured closure) attached to the same emitter; left running, this accumulated into a 2GB memory leak
**Fixed Code**:
```typescript
// ✅ FIXED: Remove listener after use
class DataProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    const handler = (data) => {
      console.log('Processing:', data);
    };
    this.emitter.on('data', handler);

    try {
      for (let i = 0; i < 1000; i++) {
        this.emitter.emit('data', {id: i});
      }
    } finally {
      // ✅ Clean up listener
      this.emitter.removeListener('data', handler);
    }
  }
}
```
**Better**: Use `once()` for one-time listeners:
```typescript
this.emitter.once('data', handler); // Auto-removed after first emit
```
### Pattern 2: Closure Leak
**Vulnerable Code**:
```typescript
// ❌ LEAK: Closure captures large object
const cache = new Map();

function processRequest(userId: string) {
  const largeData = fetchLargeDataset(userId); // 10MB object

  // Closure captures entire largeData
  cache.set(userId, () => {
    return largeData.summary; // Only need summary (1KB)
  });
}
// Called for 1000 users = 10GB in cache!
```
**Fixed Code**:
```typescript
// ✅ FIXED: Only store what you need
const cache = new Map();

function processRequest(userId: string) {
  const largeData = fetchLargeDataset(userId);
  const summary = largeData.summary; // Extract only 1KB

  // Store minimal data
  cache.set(userId, () => summary);
}
// 1000 users = 1MB in cache ✅
```
### Pattern 3: Global Variable Accumulation
**Vulnerable Code**:
```typescript
// ❌ LEAK: Global array keeps growing
const requestLog: Request[] = [];

app.post('/api/orders', (req, res) => {
  requestLog.push(req); // Never removed!
  // ... process order
});
// 1M requests = 1M objects in memory permanently
```
**Fixed Code**:
```typescript
// ✅ FIXED: Use LRU cache with size limit
import LRU from 'lru-cache';

const requestLog = new LRU({
  max: 1000,           // Maximum 1000 items
  ttl: 1000 * 60 * 5   // 5-minute TTL
});

app.post('/api/orders', (req, res) => {
  requestLog.set(req.id, req); // Auto-evicts old items
});
```
### Pattern 4: Forgotten Timers/Intervals
**Vulnerable Code**:
```typescript
// ❌ LEAK: setInterval never cleared
class ReportGenerator {
  private data: any[] = [];

  start() {
    setInterval(() => {
      this.data.push(generateReport()); // Accumulates forever
    }, 60000);
  }
}

// Each instance leaks!
const generator = new ReportGenerator();
generator.start();
```
**Fixed Code**:
```typescript
// ✅ FIXED: Clear interval on cleanup
class ReportGenerator {
  private data: any[] = [];
  private intervalId?: NodeJS.Timeout;

  start() {
    this.intervalId = setInterval(() => {
      this.data.push(generateReport());
    }, 60000);
  }

  stop() {
    if (this.intervalId) {
      clearInterval(this.intervalId);
      this.intervalId = undefined;
      this.data = []; // Clear accumulated data
    }
  }
}
```
## 4. Memory Profiling with memwatch-next
### Installation
```bash
bun add memwatch-next
```
### Leak Detection
```typescript
// memory-monitor.ts
import memwatch from 'memwatch-next';

// Detect memory leaks
memwatch.on('leak', (info) => {
  console.error('Memory leak detected:', {
    growth: info.growth,
    reason: info.reason,
    current_base: `${Math.round(info.current_base / 1024 / 1024)}MB`,
    leaked: `${Math.round((info.current_base - info.start) / 1024 / 1024)}MB`
  });

  // Alert to PagerDuty/Slack
  alertOps('Memory leak detected', info);
});

// Monitor GC stats
memwatch.on('stats', (stats) => {
  console.log('GC stats:', {
    used_heap_size: `${Math.round(stats.used_heap_size / 1024 / 1024)}MB`,
    heap_size_limit: `${Math.round(stats.heap_size_limit / 1024 / 1024)}MB`,
    num_full_gc: stats.num_full_gc,
    num_inc_gc: stats.num_inc_gc
  });
});
```
### HeapDiff for Leak Analysis
```typescript
import memwatch from 'memwatch-next';

const hd = new memwatch.HeapDiff();

// Simulate leak
const leak: any[] = [];
for (let i = 0; i < 10000; i++) {
  leak.push({data: new Array(1000).fill('x')});
}

// Compare heaps
const diff = hd.end();
console.log('Heap diff:', JSON.stringify(diff, null, 2));

// Output:
// {
//   "before": {"nodes": 12345, "size": 50000000},
//   "after": {"nodes": 22345, "size": 150000000},
//   "change": {
//     "size_bytes": 100000000,  // 100MB leak!
//     "size": "100.00MB",
//     "freed_nodes": 100,
//     "allocated_nodes": 10100  // Net increase
//   }
// }
```
## 5. Production Memory Monitoring
### Prometheus Metrics
```typescript
// metrics.ts
import {Gauge} from 'prom-client';

const memoryUsageGauge = new Gauge({
  name: 'nodejs_memory_usage_bytes',
  help: 'Node.js memory usage in bytes',
  labelNames: ['type']
});

setInterval(() => {
  const usage = process.memoryUsage();
  memoryUsageGauge.set({type: 'heap_used'}, usage.heapUsed);
  memoryUsageGauge.set({type: 'heap_total'}, usage.heapTotal);
  memoryUsageGauge.set({type: 'external'}, usage.external);
  memoryUsageGauge.set({type: 'rss'}, usage.rss);
}, 15000);
```
**Grafana Alert**:
```promql
# Alert if heap usage growing linearly
increase(nodejs_memory_usage_bytes{type="heap_used"}[1h]) > 100000000 # 100MB/hour
```
## 6. Real-World Fix: EventEmitter Leak
### Before (Leaking)
```typescript
// order-processor.ts (BEFORE FIX)
class OrderProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    // ❌ LEAK: Listener added every call
    this.emitter.on('order:created', async (order) => {
      await this.sendConfirmationEmail(order);
      await this.updateInventory(order);
    });

    const orders = await db.query.orders.findMany({status: 'pending'});
    for (const order of orders) {
      this.emitter.emit('order:created', order);
    }
  }
}

// Called every minute on the same long-lived instance
const processor = new OrderProcessor();
setInterval(() => processor.processOrders(), 60000);
```
**Result**: 1,440 listeners/day → 2GB memory leak in production
### After (Fixed)
```typescript
// order-processor.ts (AFTER FIX)
class OrderProcessor {
  private emitter = new EventEmitter();

  async processOrders() {
    const handler = async (order) => {
      await this.sendConfirmationEmail(order);
      await this.updateInventory(order);
    };

    // ✅ Attach once per run, and always detach when done
    this.emitter.on('order:created', handler);
    try {
      const orders = await db.query.orders.findMany({status: 'pending'});
      for (const order of orders) {
        this.emitter.emit('order:created', order);
      }
    } finally {
      // ✅ Cleanup: no listeners accumulate between runs
      this.emitter.removeListener('order:created', handler);
    }
  }
}
```
**Result**: Memory stable at 150MB, zero leaks
## 7. Results and Impact
### Before vs After Metrics
| Metric | Before Fix | After Fix | Impact |
|--------|-----------|-----------|---------|
| **Memory Usage** | 2GB (after 6h) | 150MB (stable) | **93% reduction** |
| **Heap Size** | Linear growth (5MB/min) | Stable | **Zero growth** |
| **OOM Incidents** | 12/month | 0/month | **100% eliminated** |
| **GC Pause Time** | 200ms avg | 50ms avg | **75% faster** |
| **Uptime** | 6 hours avg | 30+ days | **120x improvement** |
### Lessons Learned
**1. Always remove event listeners**
- Use `once()` for one-time events
- Use `removeListener()` in finally blocks
- Track listeners with WeakMap for debugging
**2. Avoid closures capturing large objects**
- Extract only needed data before closure
- Use WeakMap/WeakSet for object references (see the sketch after this list)
- Profile with heap snapshots regularly
**3. Monitor memory in production**
- Prometheus metrics for heap usage
- Alert on linear growth patterns
- Weekly heap snapshot analysis
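A sketch of the WeakMap-backed caching mentioned above: because entries are keyed by the object itself, the cached value becomes collectible as soon as the key object is no longer referenced elsewhere (the `User`/`Profile` types are illustrative, not the production service's models):
```typescript
// weakmap-cache.ts -- cache keyed by object identity, GC-friendly
interface User { id: string; }
interface Profile { summary: string; }

const profileCache = new WeakMap<User, Profile>();

function getProfile(user: User, load: (u: User) => Profile): Profile {
  let profile = profileCache.get(user);
  if (!profile) {
    profile = load(user);
    profileCache.set(user, profile); // entry is freed together with `user`
  }
  return profile;
}
```
Unlike the global `Map` in Pattern 2, a `WeakMap` never pins its keys, so the cache cannot grow without bound on its own.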
## Related Documentation
- **Python Profiling**: [python-scalene-profiling.md](python-scalene-profiling.md)
- **DB Leaks**: [database-connection-leak.md](database-connection-leak.md)
- **Reference**: [../reference/memory-patterns.md](../reference/memory-patterns.md)
- **Templates**: [../templates/memory-report.md](../templates/memory-report.md)
---
Return to [examples index](INDEX.md)


@@ -0,0 +1,456 @@
# Python Memory Profiling with Scalene
Line-by-line memory and CPU profiling for Python applications using Scalene, with pytest integration and optimization strategies.
## Overview
**Before Optimization**:
- Memory usage: 500MB for processing 10K records
- OOM (Out of Memory) errors with 100K records
- Processing time: 45 seconds for 10K records
- List comprehensions loading entire dataset
**After Optimization**:
- Memory usage: 5MB for processing 10K records (99% reduction)
- No OOM errors with 1M records
- Processing time: 8 seconds for 10K records (82% faster)
- Generator-based streaming
**Tools**: Scalene, pytest, memory_profiler, tracemalloc
## 1. Scalene Installation and Setup
### Installation
```bash
# Install Scalene
pip install scalene
# Or with uv (faster)
uv pip install scalene
```
### Basic Usage
```bash
# Profile entire script
scalene script.py
# Profile with pytest (recommended)
scalene --cli --memory -m pytest tests/
# HTML output
scalene --html --outfile profile.html script.py
# Profile specific function
scalene --reduced-profile script.py
```
## 2. Profiling with pytest
### Test File Setup
```python
# tests/test_data_processing.py
import pytest
from data_processor import DataProcessor

@pytest.fixture
def processor():
    return DataProcessor()

def test_process_large_dataset(processor):
    # Generate 10K records
    records = [{'id': i, 'value': i * 2} for i in range(10000)]

    # Process (this is where memory spike occurs)
    result = processor.process_records(records)
    assert len(result) == 10000
```
### Running Scalene with pytest
```bash
# Profile memory usage during test execution
uv run scalene --cli --memory -m pytest tests/test_data_processing.py 2>&1 | grep -i "memory\|mb\|test"
# Output shows line-by-line memory allocation
```
**Scalene Output** (before optimization):
```
data_processor.py:
Line | Memory % | Memory (MB) | CPU % | Code
-----|----------|-------------|-------|-----
12 | 45% | 225 MB | 10% | result = [transform(r) for r in records]
18 | 30% | 150 MB | 5% | filtered = [r for r in result if r['value'] > 0]
25 | 15% | 75 MB | 20% | sorted_data = sorted(filtered, key=lambda x: x['id'])
```
**Analysis**: Line 12 is the hotspot (45% of memory)
## 3. Memory Hotspot Identification
### Vulnerable Code (Memory Spike)
```python
# data_processor.py (BEFORE OPTIMIZATION)
from datetime import datetime

class DataProcessor:
    def process_records(self, records: list[dict]) -> list[dict]:
        # ❌ HOTSPOT: List comprehension loads entire dataset
        result = [self.transform(r) for r in records]  # 225MB for 10K records

        # ❌ Creates another copy
        filtered = [r for r in result if r['value'] > 0]  # +150MB

        # ❌ sorted() creates yet another copy
        sorted_data = sorted(filtered, key=lambda x: x['id'])  # +75MB
        return sorted_data  # Total: 450MB for 10K records

    def transform(self, record: dict) -> dict:
        return {
            'id': record['id'],
            'value': record['value'] * 2,
            'timestamp': datetime.now()
        }
```
**Scalene Report**:
```
Memory allocation breakdown:
- Line 12 (list comprehension): 225MB (50%)
- Line 18 (filtering): 150MB (33%)
- Line 25 (sorting): 75MB (17%)
Total memory: 450MB for 10,000 records
Projected for 100K: 4.5GB → OOM!
```
### Optimized Code (Generator-Based)
```python
# data_processor.py (AFTER OPTIMIZATION)
from datetime import datetime
from typing import Iterator

class DataProcessor:
    def process_records(self, records: list[dict]) -> Iterator[dict]:
        # ✅ Generator: processes one record at a time
        transformed = (self.transform(r) for r in records)  # O(1) memory

        # ✅ Generator chaining
        filtered = (r for r in transformed if r['value'] > 0)  # O(1) memory

        # ✅ Stream-based sorting (only if needed)
        # For very large datasets, use external sorting or database ORDER BY
        yield from sorted(filtered, key=lambda x: x['id'])  # Sorting still holds O(n)

    def transform(self, record: dict) -> dict:
        return {
            'id': record['id'],
            'value': record['value'] * 2,
            'timestamp': datetime.now()
        }

    # Alternative: Fully streaming (no sorting)
    def process_records_streaming(self, records: list[dict]) -> Iterator[dict]:
        for record in records:
            transformed = self.transform(record)
            if transformed['value'] > 0:
                yield transformed  # O(1) memory, fully streaming
```
**Scalene Report (After)**:
```
Memory allocation breakdown:
- Line 12 (generator): 5MB (100% - constant overhead)
- Line 18 (filter generator): 0MB (lazy)
- Line 25 (yield): 0MB (lazy)
Total memory: 5MB for 10,000 records (99% reduction!)
Scalable to 1M+ records without OOM
```
## 4. Common Memory Patterns
### Pattern 1: List Comprehension → Generator
**Before** (High Memory):
```python
# ❌ Loads entire list into memory
import json

def process_large_file(filename: str) -> list[dict]:
    with open(filename) as f:
        lines = f.readlines()  # Loads entire file (500MB)

    # Another copy
    return [json.loads(line) for line in lines]  # +500MB = 1GB total
```
**After** (Low Memory):
```python
# ✅ Generator: processes line-by-line
import json
from typing import Iterator

def process_large_file(filename: str) -> Iterator[dict]:
    with open(filename) as f:
        for line in f:  # Reads one line at a time
            yield json.loads(line)  # O(1) memory
```
**Scalene diff**: 1GB → 5MB (99.5% reduction)
### Pattern 2: DataFrame Memory Optimization
**Before** (High Memory):
```python
# ❌ Loads entire CSV into memory
import pandas as pd

def analyze_data(filename: str):
    df = pd.read_csv(filename)  # 10GB CSV → 10GB RAM

    # All transformations in memory
    df['new_col'] = df['value'] * 2
    df_filtered = df[df['value'] > 0]
    return df_filtered.groupby('category').sum()
```
**After** (Low Memory with Chunking):
```python
# ✅ Process in chunks
import pandas as pd

def analyze_data(filename: str):
    chunk_size = 10000
    results = []

    # Process 10K rows at a time
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        chunk['new_col'] = chunk['value'] * 2
        filtered = chunk[chunk['value'] > 0]
        group_result = filtered.groupby('category').sum()
        results.append(group_result)

    # Combine results
    return pd.concat(results).groupby(level=0).sum()  # Much smaller
```
**Scalene diff**: 10GB → 500MB (95% reduction)
### Pattern 3: String Concatenation
**Before** (High Memory):
```python
# ❌ Builds a new string each iteration (quadratic copying, heavy GC churn)
def build_report(data: list[dict]) -> str:
    report = ""
    for item in data:  # 100K items
        report += f"{item['id']}: {item['value']}\n"  # New string every time
    return report  # 500MB final string + 500MB garbage = 1GB
```
**After** (Low Memory):
```python
# ✅ StringIO or join (O(n) memory)
from io import StringIO
from typing import Iterator

def build_report(data: list[dict]) -> str:
    buffer = StringIO()
    for item in data:
        buffer.write(f"{item['id']}: {item['value']}\n")
    return buffer.getvalue()

# Or even better: generator
def build_report_streaming(data: list[dict]) -> Iterator[str]:
    for item in data:
        yield f"{item['id']}: {item['value']}\n"
```
**Scalene diff**: 1GB → 50MB (95% reduction)
## 5. Scalene CLI Reference
### Common Options
```bash
# Memory-only profiling (fastest)
scalene --cli --memory script.py
# CPU + Memory profiling
scalene --cli --cpu --memory script.py
# Reduced profile (only lines with non-zero usage)
scalene --reduced-profile script.py
# Profile only matching source files (filename filter)
scalene --profile-only data_processor script.py
# HTML report
scalene --html --outfile profile.html script.py
# Profile with pytest
scalene --cli --memory -m pytest tests/
# Set memory sampling interval (default: 1MB)
scalene --malloc-threshold 0.1 script.py # Sample every 100KB
```
### Interpreting Output
**Column Meanings**:
```
Memory % | Percentage of total memory allocated
Memory MB | Absolute memory allocated (in megabytes)
CPU % | Percentage of CPU time spent
Python % | Time spent in Python (vs native code)
```
**Example Output**:
```
script.py:
Line | Memory % | Memory MB | CPU % | Python % | Code
-----|----------|-----------|-------|----------|-----
12 | 45.2% | 225.6 MB | 10.5% | 95.2% | data = [x for x in range(1000000)]
18 | 30.1% | 150.3 MB | 5.2% | 98.1% | filtered = list(filter(lambda x: x > 0, data))
```
**Analysis**:
- Line 12: High memory (45.2%) → optimize list comprehension
- Line 18: Moderate memory (30.1%) → use generator instead of list()
## 6. Integration with CI/CD
### GitHub Actions Workflow
```yaml
# .github/workflows/memory-profiling.yml
name: Memory Profiling
on: [pull_request]
jobs:
  profile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install scalene pytest
      - name: Run memory profiling
        run: |
          scalene --cli --memory --reduced-profile -m pytest tests/ > profile.txt
      - name: Check for memory hotspots
        run: |
          if grep -q "Memory %" profile.txt; then
            # Fail if any line allocates >100MB (awk exits non-zero on a hotspot)
            if ! awk '$3 > 100 {exit 1}' profile.txt; then
              echo "Memory hotspot detected!"
              exit 1
            fi
          fi
      - name: Upload profile
        uses: actions/upload-artifact@v3
        with:
          name: memory-profile
          path: profile.txt
```
## 7. Real-World Optimization: CSV Processing
### Before (500MB Memory, OOM at 100K rows)
```python
# csv_processor.py (BEFORE)
import pandas as pd

class CSVProcessor:
    def process_file(self, filename: str) -> dict:
        # ❌ Loads entire CSV
        df = pd.read_csv(filename)  # 500MB for 10K rows

        # ❌ Multiple copies
        df['total'] = df['quantity'] * df['price']
        df_filtered = df[df['total'] > 100]
        summary = df_filtered.groupby('category').agg({
            'total': 'sum',
            'quantity': 'sum'
        })
        return summary.to_dict()
```
**Scalene Output**:
```
Line 8: 500MB (75%) - pd.read_csv()
Line 11: 100MB (15%) - df['total'] calculation
Line 12: 50MB (10%) - filtering
Total: 650MB for 10K rows
```
### After (5MB Memory, Handles 1M rows)
```python
# csv_processor.py (AFTER)
import pandas as pd
from collections import defaultdict

class CSVProcessor:
    def process_file(self, filename: str) -> dict:
        # ✅ Process in 10K row chunks
        chunk_size = 10000
        results = defaultdict(lambda: {'total': 0, 'quantity': 0})

        for chunk in pd.read_csv(filename, chunksize=chunk_size):
            chunk['total'] = chunk['quantity'] * chunk['price']
            filtered = chunk[chunk['total'] > 100]

            # Aggregate incrementally
            for category, group in filtered.groupby('category'):
                results[category]['total'] += group['total'].sum()
                results[category]['quantity'] += group['quantity'].sum()

        return dict(results)
```
**Scalene Output (After)**:
```
Line 9: 5MB (100%) - chunk processing (constant memory)
Total: 5MB for any file size (99% reduction)
```
## 8. Results and Impact
### Before vs After Metrics
| Metric | Before | After | Impact |
|--------|--------|-------|--------|
| **Memory Usage** | 500MB (10K rows) | 5MB (1M rows) | **99% reduction** |
| **Processing Time** | 45s (10K rows) | 8s (10K rows) | **82% faster** |
| **Max File Size** | 100K rows (OOM) | 10M+ rows | **100x scalability** |
| **OOM Errors** | 5/week | 0/month | **100% eliminated** |
### Key Optimizations Applied
1. **List comprehension → Generator**: 225MB → 0MB
2. **DataFrame chunking**: 500MB → 5MB per chunk
3. **String concatenation**: 1GB → 50MB (StringIO)
4. **Lazy evaluation**: Load on demand vs load all
## Related Documentation
- **Node.js Leaks**: [nodejs-memory-leak.md](nodejs-memory-leak.md)
- **DB Leaks**: [database-connection-leak.md](database-connection-leak.md)
- **Reference**: [../reference/profiling-tools.md](../reference/profiling-tools.md)
- **Templates**: [../templates/scalene-config.txt](../templates/scalene-config.txt)
---
Return to [examples index](INDEX.md)