---
name: python-data-reviewer
description: |
  WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization
  WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices
  WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer
---

# Python Data Reviewer Skill

## Purpose
Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.

## When to Use
- Pandas/NumPy code review
- Data processing optimization
- Memory efficiency check
- "Why is my data code slow?"
- ETL pipeline review

## Project Detection
- `pandas`, `numpy` in requirements.txt
- `.ipynb` Jupyter notebooks
- `data/`, `notebooks/` directories
- DataFrame operations in code

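A minimal sketch of how these signals could be checked automatically; the helper name, file names, and thresholds below are illustrative assumptions, not part of the skill itself.

```python
# Hypothetical detection helper (illustrative only).
from pathlib import Path

def looks_like_data_project(root: str = ".") -> bool:
    root_path = Path(root)

    # Dependency files that mention pandas or numpy
    for dep_file in ("requirements.txt", "pyproject.toml"):
        path = root_path / dep_file
        if path.exists():
            text = path.read_text(errors="ignore").lower()
            if "pandas" in text or "numpy" in text:
                return True

    # Jupyter notebooks or conventional data-project directories
    if any(root_path.rglob("*.ipynb")):
        return True
    return (root_path / "data").is_dir() or (root_path / "notebooks").is_dir()
```
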
## Workflow

### Step 1: Analyze Project
```
**Pandas**: 2.0+
**NumPy**: 1.24+
**Other**: polars, dask, vaex
**Visualization**: matplotlib, seaborn, plotly
**ML**: scikit-learn, xgboost
```

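One way to gather these versions programmatically; a hedged standard-library sketch, assuming the packages are installed in the reviewed environment.

```python
# Report installed versions of the data-stack packages listed above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("pandas", "numpy", "polars", "dask", "matplotlib", "scikit-learn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```
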
### Step 2: Select Review Areas
**AskUserQuestion:**
```
"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true
```

## Detection Rules

### Vectorization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized methods | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorized ops | MEDIUM |

```python
import numpy as np

# df is the DataFrame under review (illustrative)

# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]

# BAD: apply with a simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())

# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()

# BAD: apply with a condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")

# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")

# Multiple conditions
conditions = [
    df["score"] >= 90,
    df["score"] >= 60,
    df["score"] < 60,
]
choices = ["A", "B", "F"]
df["grade"] = np.select(conditions, choices)
```

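The table above also flags list comprehensions over a Series; a common replacement is `Series.map()` with a lookup dict. A minimal, self-contained sketch (column names and mapping are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country_code": ["DE", "FR", "DE"]})
code_to_name = {"DE": "Germany", "FR": "France"}

# BAD: list comprehension doing dict lookups element by element
df["country_name"] = [code_to_name[c] for c in df["country_code"]]

# GOOD: Series.map performs the lookup in one pass; unmatched codes become NaN instead of raising
df["country_name"] = df["country_code"].map(code_to_name)
```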

### Memory Optimization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |

```python
# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv")  # Everything lands as int64/float64/object

# GOOD: Specify dtypes
dtype_map = {
    "id": "int32",
    "age": "int8",
    "status": "category",
    "price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)

# GOOD: Load only needed columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["id", "name", "price"],
    dtype=dtype_map,
)

# GOOD: Process in chunks (process_chunk is your per-chunk transform)
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])

# Memory check
df.info(memory_usage="deep")

# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
    if df[col].nunique() / len(df) < 0.5:  # < 50% unique
        df[col] = df[col].astype("category")
```

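Beyond a static `dtype_map`, numeric columns can be downcast after loading. A hedged sketch of an automatic downcasting helper built on `pd.to_numeric(..., downcast=...)`; the function name is illustrative:

```python
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast integer and float columns to the smallest dtype that holds their values."""
    out = df.copy()
    for col in out.select_dtypes(include=["integer"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=["float"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out
```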

### Data Validation
| Check | Recommendation | Severity |
|-------|----------------|----------|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


# GOOD: Data validation function
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and clean input DataFrame."""

    # Required columns
    required_cols = ["id", "name", "price", "quantity"]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Null checks
    null_counts = df[required_cols].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")

    # Type validation
    assert df["id"].dtype in ["int32", "int64"], "id must be integer"
    assert df["price"].dtype in ["float32", "float64"], "price must be float"

    # Range validation
    invalid_prices = df[df["price"] < 0]
    if len(invalid_prices) > 0:
        logger.warning(f"Found {len(invalid_prices)} negative prices")
        df = df[df["price"] >= 0]

    # Duplicate check
    duplicates = df.duplicated(subset=["id"])
    if duplicates.any():
        logger.warning(f"Removing {duplicates.sum()} duplicates")
        df = df.drop_duplicates(subset=["id"])

    return df
```


### NumPy Patterns
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |

```python
import numpy as np

# BAD: Python loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
result = np.array(result)

# GOOD: Vectorized
result = arr * 2 + 1

# BAD: np.append in loop (creates a new array on every iteration)
result = np.array([])
for x in data:
    result = np.append(result, process(x))

# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
    result[i] = process(x)

# BETTER: np.vectorize for readability (a convenience wrapper, not a true speedup)
result = np.vectorize(process)(data)

# BEST: Express process() directly in NumPy primitives when possible, e.g.
result = np.sqrt(data) + np.log(data)

# BAD: Redundant copy - boolean indexing already returns a new array
subset = arr[arr > 0].copy()

# GOOD: Drop the extra .copy(); boolean/fancy indexing always copies
subset = arr[arr > 0]

# Views come from basic slicing, not boolean masks
window = arr[10:20]  # A view on arr's buffer; no data is copied
```

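A quick way to confirm the view/copy behavior above; `np.shares_memory` reports whether two arrays overlap in memory (a minimal check with illustrative values):

```python
import numpy as np

arr = np.arange(10)
sliced = arr[2:6]      # basic slice: a view on arr's buffer
masked = arr[arr > 4]  # boolean indexing: always a new array

print(np.shares_memory(arr, sliced))  # True  -> view
print(np.shares_memory(arr, masked))  # False -> copy
```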

### Pandas Best Practices
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (removed in 2.0) | Use pd.concat | HIGH |

```python
# BAD: Chained indexing (unpredictable; may assign to a temporary copy)
df[df["price"] > 100]["status"] = "premium"  # May not work!

# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"

# BAD: df.append (removed in pandas 2.0)
result = pd.DataFrame()
for chunk in chunks:
    result = result.append(chunk)

# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)

# BAD: Repeated reassignment obscures the pipeline
df = df.dropna()
df = df.reset_index(drop=True)
df = df.sort_values("date")

# GOOD: Method chaining reads as one pipeline
df = (
    df
    .dropna()
    .reset_index(drop=True)
    .sort_values("date")
)
```


## Response Template
```
## Python Data Code Review Results

**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows

### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |

### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |

### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |

### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |

### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops
```


## Performance Tips
```python
# Profile memory usage
df.info(memory_usage="deep")

# Profile time (Jupyter magics)
%timeit df.apply(func)
%timeit df["col"].map(func)

# Use query() for complex filters
df.query("price > 100 and category == 'A'")

# Use eval() for complex expressions (returns a new DataFrame unless inplace=True)
df = df.eval("profit = revenue - cost")
```


## Best Practices
1. **Vectorize**: Avoid loops, use broadcasting
2. **Dtypes**: Use smallest sufficient type
3. **Validate**: Check data quality early
4. **Profile**: Measure before optimizing (see the sketch after this list)
5. **Chunk**: Process large files incrementally

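A hedged sketch of "measure before optimizing": time both variants on representative data before rewriting. The column names and sizes below are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(10_000),
    "quantity": np.random.randint(1, 10, size=10_000),
})

def with_iterrows():
    totals = []
    for _, row in df.iterrows():
        totals.append(row["price"] * row["quantity"])
    return totals

def vectorized():
    return df["price"] * df["quantity"]

print("iterrows  :", timeit.timeit(with_iterrows, number=3))
print("vectorized:", timeit.timeit(vectorized, number=3))
```
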
## Integration
- `python-reviewer`: General Python patterns
- `perf-analyzer`: Performance profiling
- `coverage-analyzer`: Test coverage for data code