---
name: python-data-reviewer
description: "WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer"
---

Python Data Reviewer Skill

Purpose

Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.

When to Use

  • Pandas/NumPy code review
  • Data processing optimization
  • Memory efficiency check
  • "Why is my data code slow?"
  • ETL pipeline review

Project Detection

  • pandas, numpy in requirements.txt
  • .ipynb Jupyter notebooks
  • data/, notebooks/ directories
  • DataFrame operations in code
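
A minimal sketch of these heuristics, assuming the review runs from the repository root; the helper name looks_like_data_project and the exact files probed are illustrative, not part of the skill:

# Hypothetical helper: heuristic check for a Pandas/NumPy project
from pathlib import Path

def looks_like_data_project(root: str = ".") -> bool:
    root_path = Path(root)
    # Dependency files that mention pandas or numpy
    for dep_file in ("requirements.txt", "pyproject.toml"):
        path = root_path / dep_file
        if path.exists():
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            if "pandas" in text or "numpy" in text:
                return True
    # Jupyter notebooks or conventional data directories
    if any(root_path.rglob("*.ipynb")):
        return True
    return any((root_path / d).is_dir() for d in ("data", "notebooks"))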

Workflow

Step 1: Analyze Project

Identify the data stack and versions in use:

  • **Pandas**: 2.0+
  • **NumPy**: 1.24+
  • **Other**: polars, dask, vaex
  • **Visualization**: matplotlib, seaborn, plotly
  • **ML**: scikit-learn, xgboost
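
A sketch of version detection using only the standard library (importlib.metadata, Python 3.8+); the set of package names probed is an assumption based on the list above:

# Detect which data libraries are installed, and their versions
from importlib.metadata import PackageNotFoundError, version

def detect_data_stack() -> dict[str, str]:
    stack = {}
    for pkg in ("pandas", "numpy", "polars", "dask", "vaex",
                "matplotlib", "seaborn", "plotly", "scikit-learn", "xgboost"):
        try:
            stack[pkg] = version(pkg)
        except PackageNotFoundError:
            pass  # not installed; leave it out of the report
    return stack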

Step 2: Select Review Areas

AskUserQuestion:

"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true

Detection Rules

Vectorization

| Check | Recommendation | Severity |
|-------|----------------|----------|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorize | MEDIUM |

# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]

# BAD: apply with simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())

# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()

# BAD: apply with condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")

# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")

# Multiple conditions
conditions = [
    df["score"] >= 90,
    df["score"] >= 60,
    df["score"] < 60,
]
choices = ["A", "B", "F"]
df["grade"] = np.select(conditions, choices)

Memory Optimization

| Check | Recommendation | Severity |
|-------|----------------|----------|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |

# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv")  # All int64, object

# GOOD: Specify dtypes
dtype_map = {
    "id": "int32",
    "age": "int8",
    "status": "category",
    "price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)

# GOOD: Load only needed columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["id", "name", "price"],
    dtype=dtype_map,
)

# GOOD: Process in chunks
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])

# Memory check (df.info prints its report directly; no print() needed)
df.info(memory_usage="deep")

# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
    if df[col].nunique() / len(df) < 0.5:  # < 50% unique
        df[col] = df[col].astype("category")
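
Beyond the category conversion above, numeric columns can often be downcast automatically. A sketch using pd.to_numeric; downcast_numeric is an illustrative helper name:

# Downcast numeric columns to the smallest dtype that can hold the values
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

df = downcast_numeric(df)
print(df.memory_usage(deep=True).sum())  # total bytes after downcasting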

Data Validation

| Check | Recommendation | Severity |
|-------|----------------|----------|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |

# GOOD: Data validation function
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and clean input DataFrame."""

    # Required columns
    required_cols = ["id", "name", "price", "quantity"]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Null checks
    null_counts = df[required_cols].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")

    # Type validation
    assert df["id"].dtype in ["int32", "int64"], "id must be integer"
    assert df["price"].dtype in ["float32", "float64"], "price must be float"

    # Range validation
    invalid_prices = df[df["price"] < 0]
    if len(invalid_prices) > 0:
        logger.warning(f"Found {len(invalid_prices)} negative prices")
        df = df[df["price"] >= 0]

    # Duplicate check
    duplicates = df.duplicated(subset=["id"])
    if duplicates.any():
        logger.warning(f"Removing {duplicates.sum()} duplicates")
        df = df.drop_duplicates(subset=["id"])

    return df
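
A usage sketch for the function above, assuming the standard logging module supplies the logger it calls and using an illustrative file name:

# Hypothetical wiring: validate right after loading, before any transformation
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)  # the logger used by validate_dataframe

raw = pd.read_csv("orders.csv", dtype={"id": "int32", "price": "float32"})
clean = validate_dataframe(raw)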

NumPy Patterns

| Check | Recommendation | Severity |
|-------|----------------|----------|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |

# BAD: Python loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
result = np.array(result)

# GOOD: Vectorized
result = arr * 2 + 1

# BAD: np.append in loop (creates new array each time)
result = np.array([])
for x in data:
    result = np.append(result, process(x))

# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
    result[i] = process(x)

# BETTER: np.vectorize is convenient, but it still loops in Python internally
result = np.vectorize(process)(data)

# BEST: Express the operation directly with NumPy ufuncs when possible,
# e.g. if process(x) == sqrt(x) + log(x)
result = np.sqrt(data) + np.log(data)

# BAD: Explicit copy of a slice that is only read
subset = arr[:1000].copy()  # copy() not needed for read-only access

# GOOD: Basic slicing returns a view (no extra memory)
subset = arr[:1000]

# NOTE: Boolean/fancy indexing (e.g. arr[arr > 0]) always returns a copy,
# so adding .copy() on top of it is redundant
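
When it is unclear whether an expression copies, np.shares_memory gives a quick answer; a small sketch with an illustrative array:

# Quick check: does an indexing result share memory with the original?
import numpy as np

arr = np.arange(10)
print(np.shares_memory(arr, arr[2:8]))      # True  -> basic slice is a view
print(np.shares_memory(arr, arr[arr > 4]))  # False -> boolean mask copies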

Pandas Best Practices

| Check | Recommendation | Severity |
|-------|----------------|----------|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (removed in pandas 2.0) | Use pd.concat | HIGH |

# BAD: Chained indexing (unpredictable)
df[df["price"] > 100]["status"] = "premium"  # May not work!

# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"

# BAD: df.append (removed in pandas 2.0)
result = pd.DataFrame()
for chunk in chunks:
    result = result.append(chunk)

# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)

# BAD: Repeated reassignment with intermediate steps
df = df.dropna()
df = df.sort_values("date")
df = df.reset_index(drop=True)

# GOOD: Method chaining (clearer intent, easier to review)
df = (
    df
    .dropna()
    .sort_values("date")
    .reset_index(drop=True)
)
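
The inplace=True row in the table above has no snippet, so a brief sketch:

# BAD: inplace=True (rarely any performance benefit, obscures data flow)
df.dropna(inplace=True)

# GOOD: Assign the result
df = df.dropna()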

Response Template

## Python Data Code Review Results

**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows

### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |

### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |

### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |

### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |

### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops

Performance Tips

# Profile memory usage
df.info(memory_usage="deep")

# Profile time
%timeit df["col"].apply(func)  # Jupyter magic
%timeit df["col"].map(func)

# Use query() for complex filters
df.query("price > 100 and category == 'A'")

# Use eval() for complex expressions
df.eval("profit = revenue - cost")

Best Practices

  1. Vectorize: Avoid loops, use broadcasting
  2. Dtypes: Use smallest sufficient type
  3. Validate: Check data quality early
  4. Profile: Measure before optimizing
  5. Chunk: Process large files incrementally

Integration

  • python-reviewer: General Python patterns
  • perf-analyzer: Performance profiling
  • coverage-analyzer: Test coverage for data code