494 lines
13 KiB
Markdown
494 lines
13 KiB
Markdown
---
|
|
name: sql-optimization-patterns
|
|
description: Master SQL query optimization, indexing strategies, and EXPLAIN analysis to dramatically improve database performance and eliminate slow queries. Use when debugging slow queries, designing database schemas, or optimizing application performance.
|
|
---
|
|
|
|
# SQL Optimization Patterns
|
|
|
|
Transform slow database queries into lightning-fast operations through systematic optimization, proper indexing, and query plan analysis.
|
|
|
|
## When to Use This Skill
|
|
|
|
- Debugging slow-running queries
|
|
- Designing performant database schemas
|
|
- Optimizing application response times
|
|
- Reducing database load and costs
|
|
- Improving scalability for growing datasets
|
|
- Analyzing EXPLAIN query plans
|
|
- Implementing efficient indexes
|
|
- Resolving N+1 query problems
|
|
|
|
## Core Concepts
|
|
|
|
### 1. Query Execution Plans (EXPLAIN)
|
|
|
|
Understanding EXPLAIN output is fundamental to optimization.
|
|
|
|
**PostgreSQL EXPLAIN:**
|
|
```sql
|
|
-- Basic explain
|
|
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
|
|
|
|
-- With actual execution stats
|
|
EXPLAIN ANALYZE
|
|
SELECT * FROM users WHERE email = 'user@example.com';
|
|
|
|
-- Verbose output with more details
|
|
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
|
|
SELECT u.*, o.order_total
|
|
FROM users u
|
|
JOIN orders o ON u.id = o.user_id
|
|
WHERE u.created_at > NOW() - INTERVAL '30 days';
|
|
```
|
|
|
|
**Key Metrics to Watch:**
|
|
- **Seq Scan**: Full table scan (usually slow for large tables)
|
|
- **Index Scan**: Using index (good)
|
|
- **Index Only Scan**: Using index without touching table (best)
|
|
- **Nested Loop**: Join method (okay for small datasets)
|
|
- **Hash Join**: Join method (good for larger datasets)
|
|
- **Merge Join**: Join method (good for sorted data)
|
|
- **Cost**: Estimated query cost (lower is better)
|
|
- **Rows**: Estimated rows returned
|
|
- **Actual Time**: Real execution time
|
|
|
|
### 2. Index Strategies
|
|
|
|
Indexes are the most powerful optimization tool.
|
|
|
|
**Index Types:**
|
|
- **B-Tree**: Default, good for equality and range queries
|
|
- **Hash**: Only for equality (=) comparisons
|
|
- **GIN**: Full-text search, array queries, JSONB
|
|
- **GiST**: Geometric data, full-text search
|
|
- **BRIN**: Block Range INdex for very large tables with correlation
|
|
|
|
```sql
|
|
-- Standard B-Tree index
|
|
CREATE INDEX idx_users_email ON users(email);
|
|
|
|
-- Composite index (order matters!)
|
|
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
|
|
|
|
-- Partial index (index subset of rows)
|
|
CREATE INDEX idx_active_users ON users(email)
|
|
WHERE status = 'active';
|
|
|
|
-- Expression index
|
|
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
|
|
|
|
-- Covering index (include additional columns)
|
|
CREATE INDEX idx_users_email_covering ON users(email)
|
|
INCLUDE (name, created_at);
|
|
|
|
-- Full-text search index
|
|
CREATE INDEX idx_posts_search ON posts
|
|
USING GIN(to_tsvector('english', title || ' ' || body));
|
|
|
|
-- JSONB index
|
|
CREATE INDEX idx_metadata ON events USING GIN(metadata);
|
|
```
|
|
|
|
### 3. Query Optimization Patterns
|
|
|
|
**Avoid SELECT \*:**
|
|
```sql
|
|
-- Bad: Fetches unnecessary columns
|
|
SELECT * FROM users WHERE id = 123;
|
|
|
|
-- Good: Fetch only what you need
|
|
SELECT id, email, name FROM users WHERE id = 123;
|
|
```
|
|
|
|
**Use WHERE Clause Efficiently:**
|
|
```sql
|
|
-- Bad: Function prevents index usage
|
|
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
|
|
|
|
-- Good: Create functional index or use exact match
|
|
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
|
|
-- Then:
|
|
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
|
|
|
|
-- Or store normalized data
|
|
SELECT * FROM users WHERE email = 'user@example.com';
|
|
```
|
|
|
|
**Optimize JOINs:**
|
|
```sql
|
|
-- Bad: Cartesian product then filter
|
|
SELECT u.name, o.total
|
|
FROM users u, orders o
|
|
WHERE u.id = o.user_id AND u.created_at > '2024-01-01';
|
|
|
|
-- Good: Filter before join
|
|
SELECT u.name, o.total
|
|
FROM users u
|
|
JOIN orders o ON u.id = o.user_id
|
|
WHERE u.created_at > '2024-01-01';
|
|
|
|
-- Better: Filter both tables
|
|
SELECT u.name, o.total
|
|
FROM (SELECT * FROM users WHERE created_at > '2024-01-01') u
|
|
JOIN orders o ON u.id = o.user_id;
|
|
```
|
|
|
|
## Optimization Patterns
|
|
|
|
### Pattern 1: Eliminate N+1 Queries
|
|
|
|
**Problem: N+1 Query Anti-Pattern**
|
|
```python
|
|
# Bad: Executes N+1 queries
|
|
users = db.query("SELECT * FROM users LIMIT 10")
|
|
for user in users:
|
|
orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
|
|
# Process orders
|
|
```
|
|
|
|
**Solution: Use JOINs or Batch Loading**
|
|
```sql
|
|
-- Solution 1: JOIN
|
|
SELECT
|
|
u.id, u.name,
|
|
o.id as order_id, o.total
|
|
FROM users u
|
|
LEFT JOIN orders o ON u.id = o.user_id
|
|
WHERE u.id IN (1, 2, 3, 4, 5);
|
|
|
|
-- Solution 2: Batch query
|
|
SELECT * FROM orders
|
|
WHERE user_id IN (1, 2, 3, 4, 5);
|
|
```
|
|
|
|
```python
|
|
# Good: Single query with JOIN or batch load
|
|
# Using JOIN
|
|
results = db.query("""
|
|
SELECT u.id, u.name, o.id as order_id, o.total
|
|
FROM users u
|
|
LEFT JOIN orders o ON u.id = o.user_id
|
|
WHERE u.id IN (1, 2, 3, 4, 5)
|
|
""")
|
|
|
|
# Or batch load
|
|
users = db.query("SELECT * FROM users LIMIT 10")
|
|
user_ids = [u.id for u in users]
|
|
orders = db.query(
|
|
"SELECT * FROM orders WHERE user_id IN (?)",
|
|
user_ids
|
|
)
|
|
# Group orders by user_id
|
|
orders_by_user = {}
|
|
for order in orders:
|
|
orders_by_user.setdefault(order.user_id, []).append(order)
|
|
```
|
|
|
|
### Pattern 2: Optimize Pagination
|
|
|
|
**Bad: OFFSET on Large Tables**
|
|
```sql
|
|
-- Slow for large offsets
|
|
SELECT * FROM users
|
|
ORDER BY created_at DESC
|
|
LIMIT 20 OFFSET 100000; -- Very slow!
|
|
```
|
|
|
|
**Good: Cursor-Based Pagination**
|
|
```sql
|
|
-- Much faster: Use cursor (last seen ID)
|
|
SELECT * FROM users
|
|
WHERE created_at < '2024-01-15 10:30:00' -- Last cursor
|
|
ORDER BY created_at DESC
|
|
LIMIT 20;
|
|
|
|
-- With composite sorting
|
|
SELECT * FROM users
|
|
WHERE (created_at, id) < ('2024-01-15 10:30:00', 12345)
|
|
ORDER BY created_at DESC, id DESC
|
|
LIMIT 20;
|
|
|
|
-- Requires index
|
|
CREATE INDEX idx_users_cursor ON users(created_at DESC, id DESC);
|
|
```
|
|
|
|
### Pattern 3: Aggregate Efficiently
|
|
|
|
**Optimize COUNT Queries:**
|
|
```sql
|
|
-- Bad: Counts all rows
|
|
SELECT COUNT(*) FROM orders; -- Slow on large tables
|
|
|
|
-- Good: Use estimates for approximate counts
|
|
SELECT reltuples::bigint AS estimate
|
|
FROM pg_class
|
|
WHERE relname = 'orders';
|
|
|
|
-- Good: Filter before counting
|
|
SELECT COUNT(*) FROM orders
|
|
WHERE created_at > NOW() - INTERVAL '7 days';
|
|
|
|
-- Better: Use index-only scan
|
|
CREATE INDEX idx_orders_created ON orders(created_at);
|
|
SELECT COUNT(*) FROM orders
|
|
WHERE created_at > NOW() - INTERVAL '7 days';
|
|
```
|
|
|
|
**Optimize GROUP BY:**
|
|
```sql
|
|
-- Bad: Group by then filter
|
|
SELECT user_id, COUNT(*) as order_count
|
|
FROM orders
|
|
GROUP BY user_id
|
|
HAVING COUNT(*) > 10;
|
|
|
|
-- Better: Filter first, then group (if possible)
|
|
SELECT user_id, COUNT(*) as order_count
|
|
FROM orders
|
|
WHERE status = 'completed'
|
|
GROUP BY user_id
|
|
HAVING COUNT(*) > 10;
|
|
|
|
-- Best: Use covering index
|
|
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
|
|
```
|
|
|
|
### Pattern 4: Subquery Optimization
|
|
|
|
**Transform Correlated Subqueries:**
|
|
```sql
|
|
-- Bad: Correlated subquery (runs for each row)
|
|
SELECT u.name, u.email,
|
|
(SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) as order_count
|
|
FROM users u;
|
|
|
|
-- Good: JOIN with aggregation
|
|
SELECT u.name, u.email, COUNT(o.id) as order_count
|
|
FROM users u
|
|
LEFT JOIN orders o ON o.user_id = u.id
|
|
GROUP BY u.id, u.name, u.email;
|
|
|
|
-- Better: Use window functions
|
|
SELECT DISTINCT ON (u.id)
|
|
u.name, u.email,
|
|
COUNT(o.id) OVER (PARTITION BY u.id) as order_count
|
|
FROM users u
|
|
LEFT JOIN orders o ON o.user_id = u.id;
|
|
```
|
|
|
|
**Use CTEs for Clarity:**
|
|
```sql
|
|
-- Using Common Table Expressions
|
|
WITH recent_users AS (
|
|
SELECT id, name, email
|
|
FROM users
|
|
WHERE created_at > NOW() - INTERVAL '30 days'
|
|
),
|
|
user_order_counts AS (
|
|
SELECT user_id, COUNT(*) as order_count
|
|
FROM orders
|
|
WHERE created_at > NOW() - INTERVAL '30 days'
|
|
GROUP BY user_id
|
|
)
|
|
SELECT ru.name, ru.email, COALESCE(uoc.order_count, 0) as orders
|
|
FROM recent_users ru
|
|
LEFT JOIN user_order_counts uoc ON ru.id = uoc.user_id;
|
|
```
|
|
|
|
### Pattern 5: Batch Operations
|
|
|
|
**Batch INSERT:**
|
|
```sql
|
|
-- Bad: Multiple individual inserts
|
|
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
|
|
INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com');
|
|
INSERT INTO users (name, email) VALUES ('Carol', 'carol@example.com');
|
|
|
|
-- Good: Batch insert
|
|
INSERT INTO users (name, email) VALUES
|
|
('Alice', 'alice@example.com'),
|
|
('Bob', 'bob@example.com'),
|
|
('Carol', 'carol@example.com');
|
|
|
|
-- Better: Use COPY for bulk inserts (PostgreSQL)
|
|
COPY users (name, email) FROM '/tmp/users.csv' CSV HEADER;
|
|
```
|
|
|
|
**Batch UPDATE:**
|
|
```sql
|
|
-- Bad: Update in loop
|
|
UPDATE users SET status = 'active' WHERE id = 1;
|
|
UPDATE users SET status = 'active' WHERE id = 2;
|
|
-- ... repeat for many IDs
|
|
|
|
-- Good: Single UPDATE with IN clause
|
|
UPDATE users
|
|
SET status = 'active'
|
|
WHERE id IN (1, 2, 3, 4, 5, ...);
|
|
|
|
-- Better: Use temporary table for large batches
|
|
CREATE TEMP TABLE temp_user_updates (id INT, new_status VARCHAR);
|
|
INSERT INTO temp_user_updates VALUES (1, 'active'), (2, 'active'), ...;
|
|
|
|
UPDATE users u
|
|
SET status = t.new_status
|
|
FROM temp_user_updates t
|
|
WHERE u.id = t.id;
|
|
```
|
|
|
|
## Advanced Techniques
|
|
|
|
### Materialized Views
|
|
|
|
Pre-compute expensive queries.
|
|
|
|
```sql
|
|
-- Create materialized view
|
|
CREATE MATERIALIZED VIEW user_order_summary AS
|
|
SELECT
|
|
u.id,
|
|
u.name,
|
|
COUNT(o.id) as total_orders,
|
|
SUM(o.total) as total_spent,
|
|
MAX(o.created_at) as last_order_date
|
|
FROM users u
|
|
LEFT JOIN orders o ON u.id = o.user_id
|
|
GROUP BY u.id, u.name;
|
|
|
|
-- Add index to materialized view
|
|
CREATE INDEX idx_user_summary_spent ON user_order_summary(total_spent DESC);
|
|
|
|
-- Refresh materialized view
|
|
REFRESH MATERIALIZED VIEW user_order_summary;
|
|
|
|
-- Concurrent refresh (PostgreSQL)
|
|
REFRESH MATERIALIZED VIEW CONCURRENTLY user_order_summary;
|
|
|
|
-- Query materialized view (very fast)
|
|
SELECT * FROM user_order_summary
|
|
WHERE total_spent > 1000
|
|
ORDER BY total_spent DESC;
|
|
```
|
|
|
|
### Partitioning
|
|
|
|
Split large tables for better performance.
|
|
|
|
```sql
|
|
-- Range partitioning by date (PostgreSQL)
|
|
CREATE TABLE orders (
|
|
id SERIAL,
|
|
user_id INT,
|
|
total DECIMAL,
|
|
created_at TIMESTAMP
|
|
) PARTITION BY RANGE (created_at);
|
|
|
|
-- Create partitions
|
|
CREATE TABLE orders_2024_q1 PARTITION OF orders
|
|
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
|
|
|
|
CREATE TABLE orders_2024_q2 PARTITION OF orders
|
|
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
|
|
|
|
-- Queries automatically use appropriate partition
|
|
SELECT * FROM orders
|
|
WHERE created_at BETWEEN '2024-02-01' AND '2024-02-28';
|
|
-- Only scans orders_2024_q1 partition
|
|
```
|
|
|
|
### Query Hints and Optimization
|
|
|
|
```sql
|
|
-- Force index usage (MySQL)
|
|
SELECT * FROM users
|
|
USE INDEX (idx_users_email)
|
|
WHERE email = 'user@example.com';
|
|
|
|
-- Parallel query (PostgreSQL)
|
|
SET max_parallel_workers_per_gather = 4;
|
|
SELECT * FROM large_table WHERE condition;
|
|
|
|
-- Join hints (PostgreSQL)
|
|
SET enable_nestloop = OFF; -- Force hash or merge join
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
1. **Index Selectively**: Too many indexes slow down writes
|
|
2. **Monitor Query Performance**: Use slow query logs
|
|
3. **Keep Statistics Updated**: Run ANALYZE regularly
|
|
4. **Use Appropriate Data Types**: Smaller types = better performance
|
|
5. **Normalize Thoughtfully**: Balance normalization vs performance
|
|
6. **Cache Frequently Accessed Data**: Use application-level caching
|
|
7. **Connection Pooling**: Reuse database connections
|
|
8. **Regular Maintenance**: VACUUM, ANALYZE, rebuild indexes
|
|
|
|
```sql
|
|
-- Update statistics
|
|
ANALYZE users;
|
|
ANALYZE VERBOSE orders;
|
|
|
|
-- Vacuum (PostgreSQL)
|
|
VACUUM ANALYZE users;
|
|
VACUUM FULL users; -- Reclaim space (locks table)
|
|
|
|
-- Reindex
|
|
REINDEX INDEX idx_users_email;
|
|
REINDEX TABLE users;
|
|
```
|
|
|
|
## Common Pitfalls
|
|
|
|
- **Over-Indexing**: Each index slows down INSERT/UPDATE/DELETE
|
|
- **Unused Indexes**: Waste space and slow writes
|
|
- **Missing Indexes**: Slow queries, full table scans
|
|
- **Implicit Type Conversion**: Prevents index usage
|
|
- **OR Conditions**: Can't use indexes efficiently
|
|
- **LIKE with Leading Wildcard**: `LIKE '%abc'` can't use index
|
|
- **Function in WHERE**: Prevents index usage unless functional index exists
|
|
|
|
## Monitoring Queries
|
|
|
|
```sql
|
|
-- Find slow queries (PostgreSQL)
|
|
SELECT query, calls, total_time, mean_time
|
|
FROM pg_stat_statements
|
|
ORDER BY mean_time DESC
|
|
LIMIT 10;
|
|
|
|
-- Find missing indexes (PostgreSQL)
|
|
SELECT
|
|
schemaname,
|
|
tablename,
|
|
seq_scan,
|
|
seq_tup_read,
|
|
idx_scan,
|
|
seq_tup_read / seq_scan AS avg_seq_tup_read
|
|
FROM pg_stat_user_tables
|
|
WHERE seq_scan > 0
|
|
ORDER BY seq_tup_read DESC
|
|
LIMIT 10;
|
|
|
|
-- Find unused indexes (PostgreSQL)
|
|
SELECT
|
|
schemaname,
|
|
tablename,
|
|
indexname,
|
|
idx_scan,
|
|
idx_tup_read,
|
|
idx_tup_fetch
|
|
FROM pg_stat_user_indexes
|
|
WHERE idx_scan = 0
|
|
ORDER BY pg_relation_size(indexrelid) DESC;
|
|
```
|
|
|
|
## Resources
|
|
|
|
- **references/postgres-optimization-guide.md**: PostgreSQL-specific optimization
|
|
- **references/mysql-optimization-guide.md**: MySQL/MariaDB optimization
|
|
- **references/query-plan-analysis.md**: Deep dive into EXPLAIN plans
|
|
- **assets/index-strategy-checklist.md**: When and how to create indexes
|
|
- **assets/query-optimization-checklist.md**: Step-by-step optimization guide
|
|
- **scripts/analyze-slow-queries.sql**: Identify slow queries in your database
|
|
- **scripts/index-recommendations.sql**: Generate index recommendations
|