---
name: python-data-reviewer
description: |
  WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization
  WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices
  WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer
---

# Python Data Reviewer Skill

## Purpose
Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.

## When to Use
- Pandas/NumPy code review
- Data processing optimization
- Memory efficiency check
- "Why is my data code slow?"
- ETL pipeline review

## Project Detection
- `pandas`, `numpy` in requirements.txt
- `.ipynb` Jupyter notebooks
- `data/`, `notebooks/` directories
- DataFrame operations in code

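A minimal sketch of how these signals could be checked automatically; the helper name, file names, and thresholds below are illustrative assumptions, not part of the skill itself.

```python
# Hypothetical detection helper (illustrative only).
from pathlib import Path

def looks_like_data_project(root: str = ".") -> bool:
    root_path = Path(root)

    # Dependency files that mention pandas or numpy
    for dep_file in ("requirements.txt", "pyproject.toml"):
        path = root_path / dep_file
        if path.exists():
            text = path.read_text(errors="ignore").lower()
            if "pandas" in text or "numpy" in text:
                return True

    # Jupyter notebooks or conventional data-project directories
    if any(root_path.rglob("*.ipynb")):
        return True
    return (root_path / "data").is_dir() or (root_path / "notebooks").is_dir()
```
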
## Workflow

### Step 1: Analyze Project
```
**Pandas**: 2.0+
**NumPy**: 1.24+
**Other**: polars, dask, vaex
**Visualization**: matplotlib, seaborn, plotly
**ML**: scikit-learn, xgboost
```

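One way to gather these versions programmatically; a hedged standard-library sketch, assuming the packages are installed in the reviewed environment.

```python
# Report installed versions of the data-stack packages listed above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("pandas", "numpy", "polars", "dask", "matplotlib", "scikit-learn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```
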
### Step 2: Select Review Areas
**AskUserQuestion:**
```
"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true
```

## Detection Rules

### Vectorization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized methods | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorized ops | MEDIUM |

```python
import numpy as np

# df is the DataFrame under review (illustrative)

# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]

# BAD: apply with a simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())

# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()

# BAD: apply with a condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")

# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")

# Multiple conditions
conditions = [
    df["score"] >= 90,
    df["score"] >= 60,
    df["score"] < 60,
]
choices = ["A", "B", "F"]
df["grade"] = np.select(conditions, choices)
```

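The table above also flags list comprehensions over a Series; a common replacement is `Series.map()` with a lookup dict. A minimal, self-contained sketch (column names and mapping are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country_code": ["DE", "FR", "DE"]})
code_to_name = {"DE": "Germany", "FR": "France"}

# BAD: list comprehension doing dict lookups element by element
df["country_name"] = [code_to_name[c] for c in df["country_code"]]

# GOOD: Series.map performs the lookup in one pass; unmatched codes become NaN instead of raising
df["country_name"] = df["country_code"].map(code_to_name)
```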

### Memory Optimization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |

```python
# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv")  # Everything lands as int64/float64/object

# GOOD: Specify dtypes
dtype_map = {
    "id": "int32",
    "age": "int8",
    "status": "category",
    "price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)

# GOOD: Load only needed columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["id", "name", "price"],
    dtype=dtype_map,
)

# GOOD: Process in chunks (process_chunk is your per-chunk transform)
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])

# Memory check
df.info(memory_usage="deep")

# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
    if df[col].nunique() / len(df) < 0.5:  # < 50% unique
        df[col] = df[col].astype("category")
```

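Beyond a static `dtype_map`, numeric columns can be downcast after loading. A hedged sketch of an automatic downcasting helper built on `pd.to_numeric(..., downcast=...)`; the function name is illustrative:

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast integer and float columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include=["integer"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=["float"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out
```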

### Data Validation
| Check | Recommendation | Severity |
|-------|----------------|----------|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


# GOOD: Data validation function
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and clean input DataFrame."""

    # Required columns
    required_cols = ["id", "name", "price", "quantity"]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Null checks
    null_counts = df[required_cols].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")

    # Type validation
    assert df["id"].dtype in ["int32", "int64"], "id must be integer"
    assert df["price"].dtype in ["float32", "float64"], "price must be float"

    # Range validation
    invalid_prices = df[df["price"] < 0]
    if len(invalid_prices) > 0:
        logger.warning(f"Found {len(invalid_prices)} negative prices")
        df = df[df["price"] >= 0]

    # Duplicate check
    duplicates = df.duplicated(subset=["id"])
    if duplicates.any():
        logger.warning(f"Removing {duplicates.sum()} duplicates")
        df = df.drop_duplicates(subset=["id"])

    return df
```


### NumPy Patterns
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |

```python
import numpy as np

# BAD: Python loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
result = np.array(result)

# GOOD: Vectorized
result = arr * 2 + 1

# BAD: np.append in loop (creates a new array on every iteration)
result = np.array([])
for x in data:
    result = np.append(result, process(x))

# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
    result[i] = process(x)

# BETTER: np.vectorize for readability (a convenience wrapper, not a true speedup)
result = np.vectorize(process)(data)

# BEST: Express process() directly in NumPy primitives when possible, e.g.
result = np.sqrt(data) + np.log(data)

# BAD: Redundant copy - boolean indexing already returns a new array
subset = arr[arr > 0].copy()

# GOOD: Drop the extra .copy(); boolean/fancy indexing always copies
subset = arr[arr > 0]

# Views come from basic slicing, not boolean masks
window = arr[10:20]  # A view on arr's buffer; no data is copied
```

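A quick way to confirm the view/copy behavior above; `np.shares_memory` reports whether two arrays overlap in memory (a minimal check with illustrative values):

```python
import numpy as np

arr = np.arange(10)
sliced = arr[2:6]      # basic slice: a view on arr's buffer
masked = arr[arr > 4]  # boolean indexing: always a new array

print(np.shares_memory(arr, sliced))  # True  -> view
print(np.shares_memory(arr, masked))  # False -> copy
```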

### Pandas Best Practices
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (removed in 2.0) | Use pd.concat | HIGH |

```python
# BAD: Chained indexing (unpredictable; may assign to a temporary copy)
df[df["price"] > 100]["status"] = "premium"  # May not work!

# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"

# BAD: df.append (removed in pandas 2.0)
result = pd.DataFrame()
for chunk in chunks:
    result = result.append(chunk)

# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)

# BAD: Repeated reassignment obscures the pipeline
df = df.dropna()
df = df.reset_index(drop=True)
df = df.sort_values("date")

# GOOD: Method chaining reads as one pipeline
df = (
    df
    .dropna()
    .reset_index(drop=True)
    .sort_values("date")
)
```


## Response Template
```
## Python Data Code Review Results

**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows

### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |

### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |

### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |

### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |

### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops
```


## Performance Tips
```python
# Profile memory usage
df.info(memory_usage="deep")

# Profile time (Jupyter magics)
%timeit df.apply(func)
%timeit df["col"].map(func)

# Use query() for complex filters
df.query("price > 100 and category == 'A'")

# Use eval() for complex expressions (returns a new DataFrame unless inplace=True)
df = df.eval("profit = revenue - cost")
```


## Best Practices
1. **Vectorize**: Avoid loops, use broadcasting
2. **Dtypes**: Use smallest sufficient type
3. **Validate**: Check data quality early
4. **Profile**: Measure before optimizing (see the sketch after this list)
5. **Chunk**: Process large files incrementally

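A hedged sketch of "measure before optimizing": time both variants on representative data before rewriting. The column names and sizes below are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(10_000),
    "quantity": np.random.randint(1, 10, size=10_000),
})

def with_iterrows():
    totals = []
    for _, row in df.iterrows():
        totals.append(row["price"] * row["quantity"])
    return totals

def vectorized():
    return df["price"] * df["quantity"]

print("iterrows  :", timeit.timeit(with_iterrows, number=3))
print("vectorized:", timeit.timeit(vectorized, number=3))
```
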
## Integration
- `python-reviewer`: General Python patterns
- `perf-analyzer`: Performance profiling
- `coverage-analyzer`: Test coverage for data code