---
name: python-data-reviewer
description: |
  WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization
  WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices
  WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer
---

# Python Data Reviewer Skill

## Purpose
Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.

## When to Use
- Pandas/NumPy code review
- Data processing optimization
- Memory efficiency check
- "Why is my data code slow?"
- ETL pipeline review

## Project Detection
- `pandas`, `numpy` in requirements.txt
- `.ipynb` Jupyter notebooks
- `data/`, `notebooks/` directories
- DataFrame operations in code
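
These signals can be checked mechanically. A minimal detection sketch, assuming the review runs from the project root; the helper name and the requirements.txt-only dependency check are illustrative (a real project may declare dependencies in pyproject.toml instead):

```python
from pathlib import Path

def looks_like_data_project(root: str = ".") -> bool:
    """Heuristic check for the detection signals listed above (sketch)."""
    base = Path(root)
    req = base / "requirements.txt"
    has_libs = req.exists() and any(
        lib in req.read_text() for lib in ("pandas", "numpy")
    )
    has_notebooks = any(base.rglob("*.ipynb"))
    has_dirs = any((base / d).is_dir() for d in ("data", "notebooks"))
    return has_libs or has_notebooks or has_dirs
```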
## Workflow

### Step 1: Analyze Project
```
**Pandas**: 2.0+
**NumPy**: 1.24+
**Other**: polars, dask, vaex
**Visualization**: matplotlib, seaborn, plotly
**ML**: scikit-learn, xgboost
```
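To fill in this summary from a live environment, installed versions can be read programmatically. A small sketch; the exact set of packages probed is an assumption:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages=("pandas", "numpy", "polars", "dask",
                                 "matplotlib", "scikit-learn", "xgboost")):
    """Return {package: version} for whatever is installed (sketch)."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass  # not installed; skip
    return found
```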
### Step 2: Select Review Areas
**AskUserQuestion:**
```
"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true
```
## Detection Rules

### Vectorization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized method | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorized ops | MEDIUM |

```python
# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]

# BAD: apply with simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())

# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()

# BAD: apply with condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")

# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")

# Multiple conditions
conditions = [
    df["score"] >= 90,
    df["score"] >= 60,
    df["score"] < 60,
]
choices = ["A", "B", "F"]
df["grade"] = np.select(conditions, choices)
```
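The table also flags list comprehensions over a Series; a short before/after, assuming an existing `df` with a string `name` column:

```python
# BAD: list comprehension drops into Python-level iteration and loses the index
name_lengths = [len(x) for x in df["name"]]

# GOOD: .map() returns an aligned Series; .str.len() is the built-in equivalent
df["name_length"] = df["name"].map(len)
df["name_length"] = df["name"].str.len()
```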

### Memory Optimization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |

```python
# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv")  # All int64, object

# GOOD: Specify dtypes
dtype_map = {
    "id": "int32",
    "age": "int8",
    "status": "category",
    "price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)

# GOOD: Load only needed columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["id", "name", "price"],
    dtype=dtype_map,
)

# GOOD: Process in chunks
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])

# Memory check (prints a summary including deep memory usage)
df.info(memory_usage="deep")

# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
    if df[col].nunique() / len(df) < 0.5:  # < 50% unique
        df[col] = df[col].astype("category")
```
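For data already in memory, numeric columns can be downcast after the fact; a sketch using `pd.to_numeric` and `memory_usage`:

```python
# Measure, downcast numeric columns, measure again
before = df.memory_usage(deep=True).sum()

for col in df.select_dtypes(include=["int64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include=["float64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="float")

after = df.memory_usage(deep=True).sum()
print(f"Memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```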

### Data Validation
| Check | Recommendation | Severity |
|-------|----------------|----------|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |

```python
import logging

logger = logging.getLogger(__name__)

# GOOD: Data validation function
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and clean input DataFrame."""

    # Required columns
    required_cols = ["id", "name", "price", "quantity"]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Null checks
    null_counts = df[required_cols].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")

    # Type validation
    assert df["id"].dtype in ["int32", "int64"], "id must be integer"
    assert df["price"].dtype in ["float32", "float64"], "price must be float"

    # Range validation
    invalid_prices = df[df["price"] < 0]
    if len(invalid_prices) > 0:
        logger.warning(f"Found {len(invalid_prices)} negative prices")
        df = df[df["price"] >= 0]

    # Duplicate check
    duplicates = df.duplicated(subset=["id"])
    if duplicates.any():
        logger.warning(f"Removing {duplicates.sum()} duplicates")
        df = df.drop_duplicates(subset=["id"])

    return df
```
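A typical call site runs the validator immediately after loading, before any transformation; a brief usage sketch (the file name and downstream steps are placeholders):

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

raw = pd.read_csv("orders.csv")   # placeholder input file
clean = validate_dataframe(raw)   # fail fast on structural problems
# ... continue with transformations on the validated frame
```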

### NumPy Patterns
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |

```python
# BAD: Python loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
result = np.array(result)

# GOOD: Vectorized
result = arr * 2 + 1

# BAD: np.append in loop (creates a new array each time)
result = np.array([])
for x in data:
    result = np.append(result, process(x))

# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
    result[i] = process(x)

# BETTER: np.vectorize for convenience (still a Python-level loop internally)
result = np.vectorize(process)(data)

# BEST: Express the logic as pure NumPy ufuncs where possible
result = np.sqrt(data) + np.log(data)

# BAD: Explicit copy when the data is only read
subset = arr[10:100].copy()  # copy() not needed for read-only use

# GOOD: Basic slicing returns a view (no extra memory)
subset = arr[10:100]

# NOTE: Boolean indexing (arr[arr > 0]) always returns a copy, not a view
```
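The "reuse arrays" row is about avoiding fresh allocations in hot loops; most ufuncs accept an `out=` argument for exactly this. A small sketch under assumed array sizes:

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
buf = np.empty_like(a)  # allocated once, reused every iteration

for _ in range(100):
    # np.multiply(a, b) here would allocate a new array on every pass
    np.multiply(a, b, out=buf)   # writes into the existing buffer
    np.add(buf, 1.0, out=buf)    # further steps can reuse the same buffer
```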

### Pandas Best Practices
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (deprecated) | Use pd.concat | HIGH |

```python
# BAD: Chained indexing (assigns to a copy; SettingWithCopyWarning)
df[df["price"] > 100]["status"] = "premium"  # May silently do nothing!

# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"

# BAD: df.append (deprecated, removed in pandas 2.0)
result = pd.DataFrame()
for chunk in chunks:
    result = result.append(chunk)

# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)

# BAD: Repeated reassignment, harder to read
df = df.dropna()
df = df.reset_index(drop=True)
df = df.sort_values("date")

# GOOD: Method chaining
df = (
    df
    .dropna()
    .reset_index(drop=True)
    .sort_values("date")
)
```
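The `inplace=True` row from the table is not shown above; a short illustration, since `inplace` returns `None` and blocks chaining while rarely saving memory:

```python
# BAD: inplace=True (returns None, cannot be chained)
df.dropna(inplace=True)
df.sort_values("date", inplace=True)

# GOOD: Assign the result (works with method chaining)
df = df.dropna().sort_values("date")
```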

## Response Template
```
## Python Data Code Review Results

**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows

### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |

### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |

### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |

### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |

### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops
```

## Performance Tips
```python
# Profile memory usage
df.info(memory_usage="deep")

# Profile time (Jupyter/IPython magics)
%timeit df.apply(func)
%timeit df["col"].map(func)

# Use query() for complex filters
df.query("price > 100 and category == 'A'")

# Use eval() for complex expressions (returns a new DataFrame, so assign it)
df = df.eval("profit = revenue - cost")
```
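Outside a notebook, the `%timeit` magics are unavailable; the standard-library `timeit` module gives the same kind of measurement. A minimal sketch, mirroring the apply-vs-vectorized example above:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"score": np.random.randint(0, 100, 1_000_000)})

apply_time = timeit.timeit(
    lambda: df["score"].apply(lambda x: "pass" if x >= 60 else "fail"), number=5
)
vector_time = timeit.timeit(
    lambda: np.where(df["score"] >= 60, "pass", "fail"), number=5
)
print(f"apply: {apply_time:.2f}s  vectorized: {vector_time:.2f}s")
```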

## Best Practices
1. **Vectorize**: Avoid loops, use broadcasting
2. **Dtypes**: Use smallest sufficient type
3. **Validate**: Check data quality early
4. **Profile**: Measure before optimizing
5. **Chunk**: Process large files incrementally

## Integration
- `python-reviewer`: General Python patterns
- `perf-analyzer`: Performance profiling
- `coverage-analyzer`: Test coverage for data code