Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:59:19 +08:00
commit 335acf9d10
14 changed files with 11644 additions and 0 deletions

12
.claude-plugin/plugin.json Normal file
View File

@@ -0,0 +1,12 @@
{
"name": "axiom-python-engineering",
"description": "Modern Python 3.12+ engineering: types, testing, async, scientific computing, ML workflows - 10 skills",
"version": "1.1.1",
"author": {
"name": "tachyon-beep",
"email": "zhongweili@tubi.tv"
},
"skills": [
"./skills"
]
}

3
README.md Normal file
View File

@@ -0,0 +1,3 @@
# axiom-python-engineering
Modern Python 3.12+ engineering: types, testing, async, scientific computing, ML workflows - 10 skills

85
plugin.lock.json Normal file
View File

@@ -0,0 +1,85 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:tachyon-beep/skillpacks:plugins/axiom-python-engineering",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "d2e2cd2da633625dc09d9fa35a3dc0f60c3cafe5",
"treeHash": "0d643778d6d2de81b6df637544298a6b9a3214de1a3ca03e3ba9d6eaf0cd3f4c",
"generatedAt": "2025-11-28T10:28:30.940625Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "axiom-python-engineering",
"description": "Modern Python 3.12+ engineering: types, testing, async, scientific computing, ML workflows - 10 skills",
"version": "1.1.1"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "501cc915afac6582b68359b9cd06b82b4f156ca4439348d16be2515a17d4425e"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "f8c9d4521cad5f45e5aade5cc4aa8041a256579a1c8b00041d146bcb830a01b8"
},
{
"path": "skills/.gitkeep",
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
},
{
"path": "skills/using-python-engineering/resolving-mypy-errors.md",
"sha256": "f5ff38011ecf1304093b84cfa59f41daf585d36a9d75598ac4a045e7c41ceafa"
},
{
"path": "skills/using-python-engineering/ml-engineering-workflows.md",
"sha256": "3b45246f1477f341e770877a95fd5b67f25774e1b5e7c21898aba2f1b27fa8d0"
},
{
"path": "skills/using-python-engineering/project-structure-and-tooling.md",
"sha256": "3e225c744c5138ec6c4945f60e6bc959aac1663d1a4cfb741efaf0e622351dc2"
},
{
"path": "skills/using-python-engineering/modern-syntax-and-types.md",
"sha256": "56a51be261616cc49041af9dcb5943f6e5b3f2424b84669a6f7df84a5b6458c3"
},
{
"path": "skills/using-python-engineering/systematic-delinting.md",
"sha256": "57df2647863de7e4937b4c5d92cc4832559e10f0a77917f64802bf4bf89ace83"
},
{
"path": "skills/using-python-engineering/testing-and-quality.md",
"sha256": "9515f2638edfaaedf0d8664beb141de376f5f6d233aad0fd128588c1fffc257d"
},
{
"path": "skills/using-python-engineering/scientific-computing-foundations.md",
"sha256": "2f1157d97cbc98ed3b7fbf2489b9e5ef8a6c0c05847095bd5b0acb2d45f4cb71"
},
{
"path": "skills/using-python-engineering/SKILL.md",
"sha256": "f265281bc5cd8efd8e3e034ddcbad83038485b2789aa01e0480024cf9f34aee4"
},
{
"path": "skills/using-python-engineering/async-patterns-and-concurrency.md",
"sha256": "83003bd109a5393c689415fe9529a2fb8b77cbc10e4aaf5ec706a609e1122b50"
},
{
"path": "skills/using-python-engineering/debugging-and-profiling.md",
"sha256": "9073f36ae95bcc55458bc78aedacf6e005d1fb6b5d60b883fc7ff6b1e4d61260"
}
],
"dirSha256": "0d643778d6d2de81b6df637544298a6b9a3214de1a3ca03e3ba9d6eaf0cd3f4c"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

0
skills/.gitkeep Normal file
View File

398
skills/using-python-engineering/SKILL.md Normal file
View File

@@ -0,0 +1,398 @@
---
name: using-python-engineering
description: Routes to appropriate Python specialist skill based on symptoms and problem type
mode: true
---
# Using Python Engineering
## Overview
This meta-skill routes you to the right Python specialist based on symptoms. Python engineering problems fall into distinct categories that require specialized knowledge. Load this skill when you encounter Python-specific issues but aren't sure which specialized skill to use.
**Core Principle**: Different Python problems require different specialists. Match symptoms to the appropriate specialist skill. Don't guess at solutions—route to the expert.
## When to Use
Load this skill when:
- Working with Python and encountering problems
- User mentions: "Python", "type hints", "mypy", "pytest", "async", "pandas", "numpy"
- Need to implement Python projects or optimize performance
- Setting up Python tooling or fixing lint warnings
- Debugging Python code or profiling performance
**Don't use for**: Non-Python languages, algorithm theory (not Python-specific), deployment infrastructure (not Python-specific)
---
## Routing by Symptom
### Type Errors and Type Hints
**Symptoms - Learning Type Syntax**:
- "How to use type hints?"
- "Python 3.12 type syntax"
- "Generic types"
- "Protocol vs ABC"
- "TypeVar usage"
- "Configure mypy/pyright"
**Route to**: See [modern-syntax-and-types.md](modern-syntax-and-types.md) for comprehensive type system guidance.
**Why**: Learning type hint syntax, patterns, and configuration.
**Symptoms - Fixing Type Errors**:
- "mypy error: Incompatible types"
- "mypy error: Argument has incompatible type"
- "How to fix mypy errors?"
- "100+ mypy errors, where to start?"
- "When to use type: ignore?"
- "Add types to legacy code"
- "Understanding mypy error messages"
**Route to**: See [resolving-mypy-errors.md](resolving-mypy-errors.md) for systematic mypy error resolution.
**Why**: Resolving type errors requires systematic methodology, understanding error messages, and knowing when to fix vs ignore.
**Example queries**:
- "Getting mypy error about incompatible types" → [resolving-mypy-errors.md](resolving-mypy-errors.md)
- "How to use Python 3.12 type parameter syntax?" → [modern-syntax-and-types.md](modern-syntax-and-types.md)
- "Fix 150 mypy errors systematically" → [resolving-mypy-errors.md](resolving-mypy-errors.md)
---
### Project Setup and Tooling
**Symptoms**:
- "How to structure my Python project?"
- "Setup pyproject.toml"
- "Configure ruff/black/mypy"
- "Dependency management"
- "Pre-commit hooks"
- "Package my project"
- "src layout vs flat layout"
**Route to**: [project-structure-and-tooling.md](project-structure-and-tooling.md)
**Why**: Project setup involves multiple tools (ruff, mypy, pre-commit) and architectural decisions (src vs flat layout). Need comprehensive setup guide.
**Example queries**:
- "Starting new Python project, how to set up?"
- "Configure ruff for my team"
- "Should I use poetry or pip-tools?"
---
### Lint Warnings and Delinting
**Symptoms**:
- "Too many lint warnings"
- "Fix ruff errors"
- "How to delint legacy code?"
- "Systematic approach to fixing lint"
- "Don't want to disable warnings"
- "Clean up codebase lint"
**Route to**: [systematic-delinting.md](systematic-delinting.md)
**Why**: Delinting requires systematic methodology to fix warnings without disabling them or over-refactoring. Process-driven approach needed.
**Example queries**:
- "1000+ lint warnings, where to start?"
- "Fix lint warnings systematically"
- "Legacy code has no linting"
**Note**: If setting UP linting (not fixing), route to [project-structure-and-tooling.md](project-structure-and-tooling.md) first.
---
### Testing Issues
**Symptoms**:
- "pytest not working"
- "Flaky tests"
- "How to structure tests?"
- "Fixture issues"
- "Mock/patch problems"
- "Test coverage"
- "Property-based testing"
**Route to**: [testing-and-quality.md](testing-and-quality.md)
**Why**: Testing requires understanding pytest architecture, fixture scopes, mocking patterns, and test organization strategies.
**Example queries**:
- "Tests fail intermittently"
- "How to use pytest fixtures properly?"
- "Improve test coverage"
---
### Async/Await Issues
**Symptoms**:
- "asyncio not working"
- "async/await errors"
- "Event loop issues"
- "Blocking the event loop"
- "TaskGroup (Python 3.11+)"
- "Async context managers"
- "When to use async?"
**Route to**: [async-patterns-and-concurrency.md](async-patterns-and-concurrency.md)
**Why**: Async programming has unique patterns, pitfalls (blocking event loop), and requires understanding structured concurrency.
**Example queries**:
- "Getting 'coroutine never awaited' error"
- "How to use Python 3.11 TaskGroup?"
- "Async code is slow"
---
### Performance and Profiling
**Symptoms**:
- "Python code is slow"
- "How to profile?"
- "Memory leak"
- "Optimize performance"
- "Bottleneck identification"
- "CPU profiling"
- "Memory profiling"
**Route to**: [debugging-and-profiling.md](debugging-and-profiling.md) FIRST
**Why**: MUST profile before optimizing. Many "performance" problems are actually I/O or algorithm issues. Profile to identify the real bottleneck.
**After profiling**, may route to:
- [async-patterns-and-concurrency.md](async-patterns-and-concurrency.md) if I/O-bound
- [scientific-computing-foundations.md](scientific-computing-foundations.md) if array operations slow
- Same skill for optimization strategies
**Example queries**:
- "Code is slow, how to speed up?"
- "Find bottleneck in my code"
- "Memory usage too high"
---
### Array and DataFrame Operations
**Symptoms**:
- "NumPy operations"
- "Pandas DataFrame slow"
- "Vectorization"
- "Array performance"
- "Replace loops with numpy"
- "DataFrame best practices"
- "Large dataset processing"
**Route to**: [scientific-computing-foundations.md](scientific-computing-foundations.md)
**Why**: NumPy/pandas have specific patterns for vectorization, memory efficiency, and avoiding anti-patterns (iterrows).
**Example queries**:
- "How to vectorize this loop?"
- "Pandas operation too slow"
- "DataFrame memory usage high"
---
### ML Experiment Tracking and Workflows
**Symptoms**:
- "Track ML experiments"
- "MLflow setup"
- "Reproducible ML pipelines"
- "ML model lifecycle"
- "Hyperparameter management"
- "ML monitoring"
- "Data versioning"
**Route to**: [ml-engineering-workflows.md](ml-engineering-workflows.md)
**Why**: ML workflows require experiment tracking, reproducibility patterns, configuration management, and monitoring strategies.
**Example queries**:
- "How to track experiments with MLflow?"
- "Make ML training reproducible"
- "Monitor model in production"
---
## Cross-Cutting Scenarios
### Multiple Skills Needed
Some scenarios require multiple specialized skills in sequence:
**New Python project setup with ML**:
1. Route to [project-structure-and-tooling.md](project-structure-and-tooling.md) (setup)
2. THEN [ml-engineering-workflows.md](ml-engineering-workflows.md) (ML specifics)
**Legacy code cleanup**:
1. Route to [project-structure-and-tooling.md](project-structure-and-tooling.md) (setup linting)
2. THEN [systematic-delinting.md](systematic-delinting.md) (fix warnings)
**Slow pandas code**:
1. Route to [debugging-and-profiling.md](debugging-and-profiling.md) (profile)
2. THEN [scientific-computing-foundations.md](scientific-computing-foundations.md) (optimize)
**Type hints for existing code**:
1. Route to [project-structure-and-tooling.md](project-structure-and-tooling.md) (setup mypy)
2. THEN [modern-syntax-and-types.md](modern-syntax-and-types.md) (add types)
**Load in order of execution**: Setup before optimization, diagnosis before fixes, structure before specialization.
---
## Ambiguous Queries - Ask First
When symptom unclear, ASK ONE clarifying question:
**"Fix my Python code"**
→ Ask: "What specific issue? Type errors? Lint warnings? Tests failing? Performance?"
**"Optimize my code"**
→ Ask: "Optimize what? Speed? Memory? Code quality?"
**"Setup Python project"**
→ Ask: "General project or ML-specific? Starting fresh or fixing existing?"
**"My code doesn't work"**
→ Ask: "What's broken? Import errors? Type errors? Runtime errors? Tests?"
**Never guess when ambiguous. Ask once, route accurately.**
---
## Common Routing Mistakes
| Symptom | Wrong Route | Correct Route | Why |
|---------|-------------|---------------|-----|
| "Code slow" | async-patterns | debugging-and-profiling FIRST | Don't optimize without profiling |
| "Setup linting and fix" | systematic-delinting only | project-structure THEN delinting | Setup before fixing |
| "Pandas slow" | debugging only | debugging THEN scientific-computing | Profile then vectorize |
| "Add type hints" | modern-syntax only | project-structure THEN modern-syntax | Setup mypy first |
| "Fix 1000 lint warnings" | project-structure | systematic-delinting | Process for fixing, not setup |
| "Fix mypy errors" | modern-syntax-and-types | resolving-mypy-errors | Syntax vs resolution process |
| "100 mypy errors" | modern-syntax-and-types | resolving-mypy-errors | Need systematic approach |
**Key principle**: Diagnosis before solutions, setup before optimization, profile before performance fixes.
---
## Red Flags - Stop and Route
If you catch yourself about to:
- Suggest "use async" for slow code → Route to [debugging-and-profiling.md](debugging-and-profiling.md) to profile first
- Show pytest example → Route to [testing-and-quality.md](testing-and-quality.md) for complete patterns
- Suggest "just fix the lint warnings" → Route to [systematic-delinting.md](systematic-delinting.md) for methodology
- Show type hint syntax → Route to [modern-syntax-and-types.md](modern-syntax-and-types.md) for the comprehensive guide
- Suggest "use numpy instead" → Route to [scientific-computing-foundations.md](scientific-computing-foundations.md) for vectorization patterns
**All of these mean: You're about to give incomplete advice. Route to the specialist instead.**
---
## Common Rationalizations (Don't Do These)
| Excuse | Reality | What To Do |
|--------|---------|------------|
| "User is rushed, skip routing" | Routing takes 5 seconds. Wrong fix wastes hours. | Route anyway - specialists have quick answers |
| "Simple question" | Simple questions deserve complete answers. | Route to specialist for comprehensive coverage |
| "Just need quick syntax" | Syntax without context leads to misuse. | Route to get syntax + patterns + anti-patterns |
| "User sounds experienced" | Experience in one area ≠ expertise in all Python. | Route based on symptoms, not perceived skill |
| "Already tried X" | May have done X wrong or incompletely. | Route to specialist to verify X properly |
| "Too many skills" | 8 focused skills > 1 overwhelming wall of text. | Use router to navigate - that's its purpose |
**If you catch yourself thinking ANY of these, STOP and route to the specialist.**
---
## Red Flags Checklist - Self-Check Before Answering
Before giving ANY Python advice, ask yourself:
1. **Did I identify the symptom?**
   - If no → Read query again, identify symptoms
2. **Is this symptom in my routing table?**
   - If yes → Route to that specialist
   - If no → Ask clarifying question
3. **Am I about to give advice directly?**
   - If yes → STOP. Why am I not routing?
   - Check rationalization table - am I making excuses?
4. **Is this a diagnosis issue or solution issue?**
   - Diagnosis → Route to profiling/debugging skill FIRST
   - Solution → Route to appropriate implementation skill
5. **Is query ambiguous?**
   - If yes → Ask ONE clarifying question
   - If no → Route confidently
6. **Am I feeling pressure to skip routing?**
   - Time pressure → Route anyway (faster overall)
   - Complexity → Route anyway (specialists handle complexity)
   - User confidence → Route anyway (verify assumptions)
   - "Simple" question → Route anyway (simple deserves correct)
**If you failed ANY check above, do NOT give direct advice. Route to specialist or ask clarifying question.**
---
## Python Engineering Specialist Skills
After routing, load the appropriate specialist skill for detailed guidance:
1. [modern-syntax-and-types.md](modern-syntax-and-types.md) - Type hints, mypy/pyright, Python 3.10-3.12 features, generics, protocols
2. [resolving-mypy-errors.md](resolving-mypy-errors.md) - Systematic mypy error resolution, type: ignore best practices, typing legacy code
3. [project-structure-and-tooling.md](project-structure-and-tooling.md) - pyproject.toml, ruff, pre-commit, dependency management, packaging
4. [systematic-delinting.md](systematic-delinting.md) - Process for fixing lint warnings without disabling or over-refactoring
5. [testing-and-quality.md](testing-and-quality.md) - pytest patterns, fixtures, mocking, coverage, property-based testing
6. [async-patterns-and-concurrency.md](async-patterns-and-concurrency.md) - async/await, asyncio, TaskGroup, structured concurrency, threading
7. [scientific-computing-foundations.md](scientific-computing-foundations.md) - NumPy/pandas, vectorization, memory efficiency, large datasets
8. [ml-engineering-workflows.md](ml-engineering-workflows.md) - MLflow, experiment tracking, reproducibility, monitoring, model lifecycle
9. [debugging-and-profiling.md](debugging-and-profiling.md) - pdb/debugpy, cProfile, memory_profiler, optimization strategies
---
## When NOT to Use Python Skills
**Skip Python pack when**:
- Non-Python language (use appropriate language pack)
- Algorithm selection (use computer science / algorithms pack)
- Infrastructure/deployment (use DevOps/infrastructure pack)
- Database design (use database pack)
**Python pack is for**: Python-specific implementation, tooling, patterns, debugging, and optimization.
---
## Diagnosis-First Principle
**Critical**: Many Python issues require diagnosis before solutions:
| Issue Type | Diagnosis Skill | Then Solution Skill |
|------------|----------------|---------------------|
| Performance | debugging-and-profiling | async or scientific-computing |
| Slow arrays | debugging-and-profiling | scientific-computing-foundations |
| Type errors | resolving-mypy-errors | modern-syntax-and-types (for syntax gaps) |
| Lint warnings | systematic-delinting | systematic-delinting (same) |
**If unclear what's wrong, route to diagnostic skill first.**
---
## Integration Notes
**Phase 1 - Standalone**: Python skills are self-contained
**Future cross-references**:
- superpowers:test-driven-development (TDD methodology before implementing)
- superpowers:systematic-debugging (systematic debugging before profiling)
**Current focus**: Route within Python pack only. Other packs handle other concerns.

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

848
skills/using-python-engineering/modern-syntax-and-types.md Normal file
View File

@@ -0,0 +1,848 @@
# Modern Python Syntax and Types
## Overview
**Core Principle:** Type hints make code self-documenting and catch bugs before runtime. Python 3.10-3.12 introduced powerful type system features and syntax improvements. Use them.
Modern Python is optionally statically typed, with structural pattern matching (match statements) and cleaner syntax. The type system evolved dramatically: `|` union syntax (3.10), exception groups (3.11), PEP 695 generics (3.12). Master these to write production-quality Python.
## When to Use
**Use this skill when:**
- "mypy error: ..." or "pyright error: ..."
- Adding type hints to existing code
- Using Python 3.10+ features (match, | unions, generics)
- Configuring static type checkers
- Type errors with generics, protocols, or TypedDict
**Don't use when:**
- Setting up project structure (use project-structure-and-tooling)
- Runtime type checking needed (use pydantic or similar)
- Performance optimization (use debugging-and-profiling)
**Symptoms triggering this skill:**
- "Incompatible type" errors
- "How to type hint X?"
- "Use Python 3.12 features"
- "Configure mypy strict mode"
## Type Hints Fundamentals
### Basic Annotations
```python
# ❌ WRONG: No type hints
def calculate_total(prices, tax_rate):
return sum(prices) * (1 + tax_rate)
# ✅ CORRECT: Clear types
def calculate_total(prices: list[float], tax_rate: float) -> float:
return sum(prices) * (1 + tax_rate)
# Why this matters: Type checker catches calculate_total([1, 2], "0.1")
# immediately instead of failing at runtime with TypeError
```
### Built-in Collection Types (Python 3.9+)
```python
# ❌ WRONG: Using typing.List, typing.Dict (deprecated)
from typing import List, Dict, Tuple
def process(items: List[str]) -> Dict[str, int]:
return {item: len(item) for item in items}
# ✅ CORRECT: Use built-in types directly (Python 3.9+)
def process(items: list[str]) -> dict[str, int]:
return {item: len(item) for item in items}
# ✅ More complex built-ins
def transform(data: dict[str, list[int]]) -> tuple[int, ...]:
all_values = [v for values in data.values() for v in values]
return tuple(all_values)
```
**Why this matters**: Python 3.9+ supports `list[T]` directly. Using `typing.List` is deprecated and adds unnecessary imports.
### Optional and None
```python
# ❌ WRONG: Using Optional without understanding
from typing import Optional
def get_user(id: int) -> Optional[dict]:
# Returns dict or None, but which dict structure?
...
# ✅ CORRECT: Use | None (Python 3.10+) with specific types
from dataclasses import dataclass
@dataclass
class User:
id: int
name: str
email: str
def get_user(id: int) -> User | None:
# Clear: Returns User or None
if user_exists(id):
return User(id=id, name="...", email="...")
return None
# Using the result
user = get_user(123)
if user is not None: # Type checker knows user is User here
print(user.name)
```
**Why this matters**: `Optional[X]` is just `X | None`. Python 3.10+ syntax is clearer. TypedDict or dataclass is better than raw dict.
### Union Types
```python
# ❌ WRONG: Old-style Union (Python <3.10)
from typing import Union
def process(value: Union[str, int, float]) -> Union[str, bool]:
...
# ✅ CORRECT: Use | operator (Python 3.10+)
def process(value: str | int | float) -> str | bool:
if isinstance(value, str):
return value.upper()
return value > 0
# ✅ Multiple returns with | None
def parse_config(path: str) -> dict[str, str] | None:
try:
with open(path) as f:
return json.load(f)
except FileNotFoundError:
return None
```
**Why this matters**: `|` is PEP 604, available Python 3.10+. Cleaner, more readable, Pythonic. No imports needed.
### Type Aliases
```python
# ❌ WRONG: Reusing complex types
def process_users(users: list[dict[str, str | int]]) -> dict[str, list[dict[str, str | int]]]:
...
# ✅ CORRECT: Type alias for readability
UserDict = dict[str, str | int]
UserMap = dict[str, list[UserDict]]
def process_users(users: list[UserDict]) -> UserMap:
return {"active": [u for u in users if u.get("active")]}
# ✅ BETTER: Use TypedDict for structure
from typing import TypedDict
class User(TypedDict):
id: int
name: str
email: str
active: bool
def process_users(users: list[User]) -> dict[str, list[User]]:
return {"active": [u for u in users if u["active"]]}
```
**Why this matters**: Type aliases improve readability. TypedDict provides structure validation for dict types.
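The aliases above are plain assignments. Python 3.12 also adds an explicit `type` statement (PEP 695) that type checkers recognise as an alias declaration; a minimal sketch, assuming Python 3.12+ (the `group_users` helper is purely illustrative):
```python
from typing import TypedDict


class User(TypedDict):
    id: int
    name: str


# PEP 695 alias statement (3.12+): unambiguously an alias, not a runtime variable
type UserMap = dict[str, list[User]]

# Generic aliases declare their type parameters inline
type Pairs[K, V] = list[tuple[K, V]]


def group_users(users: list[User]) -> UserMap:
    return {"all": users}
```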
## Advanced Typing
### Generics with TypeVar
```python
from typing import Generic, TypeVar
T = TypeVar('T')
# ✅ Generic function
def first(items: list[T]) -> T | None:
return items[0] if items else None
# Usage: type checker knows the return type
names: list[str] = ["Alice", "Bob"]
first_name: str | None = first(names) # Type checker infers str | None
numbers: list[int] = [1, 2, 3]
first_num: int | None = first(numbers) # Type checker infers int | None
# ✅ Generic class (old style)
class Container(Generic[T]):
def __init__(self, value: T) -> None:
self.value = value
def get(self) -> T:
return self.value
# Usage
container: Container[int] = Container(42)
value: int = container.get() # Type checker knows it's int
```
### Python 3.12+ Generics (PEP 695)
```python
# ❌ WRONG: Old-style generic syntax (still works but verbose)
from typing import TypeVar, Generic
T = TypeVar('T')
class Container(Generic[T]):
def __init__(self, value: T) -> None:
self.value = value
# ✅ CORRECT: Python 3.12+ PEP 695 syntax
class Container[T]:
def __init__(self, value: T) -> None:
self.value = value
def get(self) -> T:
return self.value
# ✅ Generic function with PEP 695
def first[T](items: list[T]) -> T | None:
return items[0] if items else None
# ✅ Multiple type parameters
class Pair[T, U]:
def __init__(self, first: T, second: U) -> None:
self.first = first
self.second = second
def get_first(self) -> T:
return self.first
def get_second(self) -> U:
return self.second
# Usage
pair: Pair[str, int] = Pair("answer", 42)
```
**Why this matters**: PEP 695 (Python 3.12+) simplifies generic syntax. No TypeVar needed. Cleaner, more readable.
### Bounded TypeVars
```python
# ✅ TypeVar with bounds (works with old and new syntax)
from typing import TypeVar
# Bound to specific type
T_Number = TypeVar('T_Number', bound=int | float)
def add[T: int | float](a: T, b: T) -> T: # Python 3.12+ syntax
return a + b # Type checker knows a and b support +
# ✅ Constrained to specific types only
T_Scalar = TypeVar('T_Scalar', int, float, str)
def format_value(value: T_Scalar) -> str:
return str(value)
# Usage
result: int = add(1, 2) # OK
result2: float = add(1.5, 2.5) # OK
# result3 = add("a", "b") # mypy error: str not compatible with int | float
```
### Protocol (Structural Subtyping)
```python
from typing import Protocol
# ✅ Define protocol for duck typing
class Drawable(Protocol):
def draw(self) -> None: ...
class Circle:
def draw(self) -> None:
print("Drawing circle")
class Square:
def draw(self) -> None:
print("Drawing square")
# Works without inheritance - structural typing
def render(shape: Drawable) -> None:
shape.draw()
# Usage - no need to inherit from Drawable
circle = Circle()
square = Square()
render(circle) # OK
render(square) # OK
# ❌ WRONG: Using ABC when Protocol is better
from abc import ABC, abstractmethod
class DrawableABC(ABC):
@abstractmethod
def draw(self) -> None: ...
# Now Circle must inherit from DrawableABC - too rigid!
```
**Why this matters**: Protocol enables structural typing (duck typing with type safety). No inheritance needed. More Pythonic than ABC for many cases.
### TypedDict
```python
from typing import TypedDict
# ✅ Define structured dict types
class UserDict(TypedDict):
id: int
name: str
email: str
active: bool
def create_user(data: UserDict) -> UserDict:
# Type checker ensures all required keys present
return data
# Usage
user: UserDict = {
"id": 1,
"name": "Alice",
"email": "alice@example.com",
"active": True
}
# mypy error: Missing key "active"
# bad_user: UserDict = {"id": 1, "name": "Alice", "email": "alice@example.com"}
# ✅ Optional fields
class UserDictOptional(TypedDict, total=False):
bio: str
avatar_url: str
# ✅ Combining required and optional
class User(TypedDict):
id: int
name: str
class UserWithOptional(User, total=False):
email: str
bio: str
```
**Why this matters**: TypedDict provides structure for dict types. Better than `dict[str, Any]`. Type checker validates keys and value types.
## Python 3.10+ Features
### Match Statements (Structural Pattern Matching)
```python
# ❌ WRONG: Long if-elif chains
def handle_response(response):
if response["status"] == 200:
return response["data"]
elif response["status"] == 404:
return None
elif response["status"] in [500, 502, 503]:
raise ServerError()
else:
raise UnknownError()
# ✅ CORRECT: Match statement (Python 3.10+)
def handle_response(response: dict[str, Any]) -> Any:
match response["status"]:
case 200:
return response["data"]
case 404:
return None
case 500 | 502 | 503:
raise ServerError()
case _:
raise UnknownError()
# ✅ Pattern matching with structure
def process_command(command: dict[str, Any]) -> str:
match command:
case {"action": "create", "type": "user", "data": data}:
return create_user(data)
case {"action": "delete", "type": "user", "id": user_id}:
return delete_user(user_id)
case {"action": action, "type": type_}:
return f"Unknown action {action} for {type_}"
case _:
return "Invalid command"
# ✅ Matching class instances
from dataclasses import dataclass
@dataclass
class Point:
x: float
y: float
def describe_point(point: Point) -> str:
match point:
case Point(x=0, y=0):
return "Origin"
case Point(x=0, y=y):
return f"On Y-axis at {y}"
case Point(x=x, y=0):
return f"On X-axis at {x}"
case Point(x=x, y=y) if x == y:
return f"On diagonal at ({x}, {y})"
case Point(x=x, y=y):
return f"At ({x}, {y})"
```
**Why this matters**: Match statements are more readable than if-elif chains for complex conditionals. Pattern matching extracts values directly.
## Python 3.11 Features
### Exception Groups
```python
# ❌ WRONG: Can't handle multiple exceptions from concurrent tasks
async def fetch_all(urls: list[str]) -> list[str]:
results = []
for url in urls:
try:
results.append(await fetch(url))
except Exception as e:
# Only logs first error, continues
log.error(f"Failed to fetch {url}: {e}")
return results
# ✅ CORRECT: Python 3.11 exception groups
async def fetch_all(urls: list[str]) -> list[str]:
async with asyncio.TaskGroup() as tg:
tasks = [tg.create_task(fetch(url)) for url in urls]
# If any fail, TaskGroup raises ExceptionGroup
return [task.result() for task in tasks]
# Handling exception groups
try:
results = await fetch_all(urls)
except* TimeoutError as e:
# Handle all TimeoutErrors
log.error(f"Timeouts: {e.exceptions}")
except* ConnectionError as e:
# Handle all ConnectionErrors
log.error(f"Connection errors: {e.exceptions}")
# ✅ Creating exception groups manually
errors = [ValueError("Invalid user 1"), ValueError("Invalid user 2")]
raise ExceptionGroup("Validation errors", errors)
```
**Why this matters**: Exception groups handle multiple exceptions from concurrent operations. Essential for structured concurrency (TaskGroup).
### Better Error Messages
Python 3.11 improved error messages significantly:
```python
# Python 3.10 error:
# TypeError: 'NoneType' object is not subscriptable
# Python 3.11 error with exact location:
# TypeError: 'NoneType' object is not subscriptable
# user["name"]
# ^^^^^^^^^^^^
# Helpful for nested expressions
result = data["users"][0]["profile"]["settings"]["theme"]
# Python 3.11 shows exactly which part is None
```
**Why this matters**: Better error messages speed up debugging. Exact location highlighted.
## Python 3.12 Features
### PEP 695 Type Parameter Syntax
Already covered in Generics section above. Key improvement: cleaner syntax for generic classes and functions.
```python
# Old style (still works)
from typing import TypeVar, Generic
T = TypeVar('T')
class Box(Generic[T]):
...
# Python 3.12+ style
class Box[T]:
...
```
### @override Decorator
```python
from typing import override
class Base:
def process(self) -> None:
print("Base process")
class Derived(Base):
@override
def process(self) -> None: # OK - overriding Base.process
print("Derived process")
@override
def compute(self) -> None: # mypy error: Base has no method 'compute'
print("New method")
# Why use @override:
# 1. Documents intent explicitly
# 2. Type checker catches typos (processs vs process)
# 3. Catches issues when base class changes
```
**Why this matters**: @override makes intent explicit and catches errors when base class changes or method names have typos.
### f-string Improvements
```python
# Python 3.12 allows more complex expressions in f-strings
# ✅ Reusing the same quote character inside f-strings (3.12+)
value = "test"
result = f"Value is {value.replace("t", "T")}"  # Nested double quotes are valid only in 3.12+
# ✅ Multi-line expressions inside f-strings (3.12 also allows backslashes in them)
message = f"Processing {
    len(items)
} items"
# ✅ f-string debugging with = (since 3.8, improved in 3.12)
x = 42
print(f"{x=}") # Output: x=42
print(f"{x * 2=}") # Output: x * 2=84
```
## Static Analysis Setup
### mypy Configuration
**File:** `pyproject.toml`
```toml
[tool.mypy]
python_version = "3.12"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
disallow_untyped_decorators = true
warn_redundant_casts = true
warn_unused_ignores = true
warn_no_return = true
warn_unreachable = true
strict_equality = true
strict = true
# Per-module options
[[tool.mypy.overrides]]
module = "tests.*"
disallow_untyped_defs = false # Tests can be less strict
[[tool.mypy.overrides]]
module = "third_party.*"
ignore_missing_imports = true
```
**Strict mode breakdown:**
- `strict = true`: Enables all strict checks
- `disallow_untyped_defs`: All functions must have type hints
- `warn_return_any`: Warn when returning Any type
- `warn_unused_ignores`: Warn on unnecessary `# type: ignore`
**When to use strict mode:**
- New projects: Start strict from day 1
- Existing projects: Enable incrementally per module
### pyright Configuration
**File:** `pyproject.toml`
```toml
[tool.pyright]
pythonVersion = "3.12"
typeCheckingMode = "strict"
reportMissingTypeStubs = false
reportUnknownMemberType = false
# Stricter checks
reportUnusedImport = true
reportUnusedVariable = true
reportDuplicateImport = true
# Exclude patterns
exclude = [
"**/__pycache__",
"**/node_modules",
".venv",
]
```
**pyright vs mypy:**
- pyright: Faster, better IDE integration, stricter by default
- mypy: More configurable, wider adoption, plugin ecosystem
**Recommendation**: Use both if possible. pyright in IDE, mypy in CI.
### Dealing with Untyped Libraries
```python
# ❌ WRONG: Silencing all errors
import untyped_lib # type: ignore
# ✅ CORRECT: Create stub file
# File: stubs/untyped_lib.pyi
def important_function(x: int, y: str) -> bool: ...
class ImportantClass:
def method(self, value: int) -> None: ...
# Configure mypy to find stubs
# pyproject.toml:
# mypy_path = "stubs"
# ✅ Use # type: ignore with explanation
from untyped_lib import obscure_function # type: ignore[import] # TODO: Add stub
# ✅ Use cast when library returns Any
from typing import cast
result = cast(list[int], untyped_lib.get_items())
```
**Why this matters**: Stubs preserve type safety even with untyped libraries. Type: ignore should be specific and documented.
## Common Type Errors and Fixes
### Incompatible Types
```python
# mypy error: Incompatible types in assignment (expression has type "str | None", variable has type "str")
# ❌ WRONG: Ignoring the error
name: str = get_name() # type: ignore
# ✅ CORRECT: Handle None case
name: str | None = get_name()
if name is not None:
process_name(name)
# ✅ CORRECT: Provide default
name: str = get_name() or "default"
# ✅ CORRECT: Assert if you're certain
name = get_name()
assert name is not None
process_name(name)
```
### List/Dict Invariance
```python
# mypy error: Argument has incompatible type "list[int]"; expected "list[float]"
def process_numbers(numbers: list[float]) -> None:
...
int_list: list[int] = [1, 2, 3]
# process_numbers(int_list) # mypy error!
# Why: Lists are mutable. If process_numbers did numbers.append(3.14),
# it would break int_list type safety
# ✅ CORRECT: Use Sequence for read-only
from collections.abc import Sequence
def process_numbers(numbers: Sequence[float]) -> None:
# Can't modify, so safe to accept list[int]
...
process_numbers(int_list) # OK now
```
### Missing Return Type
```python
# mypy error: Function is missing a return type annotation
# ❌ WRONG: No return type
def calculate(x, y):
return x + y
# ✅ CORRECT: Add return type
def calculate(x: int, y: int) -> int:
return x + y
# ✅ Functions that don't return
def log_message(message: str) -> None:
print(message)
```
### Generic Type Issues
```python
# mypy error: Need type annotation for 'items'
# ❌ WRONG: No type for empty container
items = []
items.append(1) # mypy can't infer type
# ✅ CORRECT: Explicit type annotation
items: list[int] = []
items.append(1)
# ✅ CORRECT: Initialize with values
items = [1, 2, 3] # mypy infers list[int]
```
## Anti-Patterns
### Over-Typing
```python
# ❌ WRONG: Too specific, breaks flexibility
def process_items(items: list[str]) -> list[str]:
return [item.upper() for item in items]
# Can't pass tuple, generator, or other iterables
# ✅ CORRECT: Use abstract types
from collections.abc import Sequence
def process_items(items: Sequence[str]) -> list[str]:
return [item.upper() for item in items]
# Now works with list, tuple, etc.
```
### Type: Ignore Abuse
```python
# ❌ WRONG: Blanket ignore
def sketchy_function(data): # type: ignore
return data["key"]
# ✅ CORRECT: Specific ignore code with a comment, placed on the offending line
def legacy_integration(data):  # type: ignore[no-untyped-def]  # TODO(#123): Add proper types
    return data["key"]
# ✅ BETTER: Fix the issue
def fixed_integration(data: dict[str, str]) -> str:
return data["key"]
```
### Using Any Everywhere
```python
# ❌ WRONG: Any defeats the purpose of types
def process(data: Any) -> Any:
return data.transform()
# ✅ CORRECT: Use specific types
from typing import Protocol
class Transformable(Protocol):
def transform(self) -> str: ...
def process(data: Transformable) -> str:
return data.transform()
```
### Incompatible Generics
```python
# ❌ WRONG: Generic type mismatch
T = TypeVar('T')
def combine(a: list[T], b: list[T]) -> list[T]:
return a + b
ints: list[int] = [1, 2]
strs: list[str] = ["a", "b"]
# result = combine(ints, strs) # mypy error: incompatible types
# ✅ CORRECT: Different type parameters (PEP 695 declares them inline, no TypeVar needed)
def combine_any[T1, T2](a: list[T1], b: list[T2]) -> list[T1 | T2]:
    return a + b  # type: ignore[return-value]  # Runtime works, typing is complex
# ✅ BETTER: Keep types consistent
result_ints = combine(ints, [3, 4]) # OK: both list[int]
```
## Decision Trees
### When to Use Which Type?
**For functions accepting sequences:**
```
Read-only? → Sequence[T]
Need indexing? → Sequence[T]
Need mutation? → list[T]
Large data? → Iterator[T] or Generator[T]
```
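For the "Large data" branch above, typing the producer as an `Iterator` keeps memory flat; a small sketch, with the file-reading helper (`read_values`) assumed for illustration:
```python
from collections.abc import Iterator, Sequence


def total(rows: Sequence[float]) -> float:
    # Read-only access: accepts list, tuple, or any other sequence
    return sum(rows)


def read_values(path: str) -> Iterator[float]:
    # Yields one value at a time instead of materialising a list
    with open(path) as f:
        for line in f:
            yield float(line)


# Usage: the whole file is never held in memory at once
# running_total = sum(read_values("measurements.txt"))
```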
**For dictionary-like types:**
```
Known structure? → TypedDict
Dynamic keys? → dict[K, V]
Protocol needed? → Mapping[K, V] (read-only)
Need mutation? → MutableMapping[K, V]
```
**For optional values:**
```
Can be None? → T | None
Has default? → T with default parameter
Really optional? → T | None in TypedDict(total=False)
```
## Integration with Other Skills
**After using this skill:**
- If setting up project → See @project-structure-and-tooling for mypy in pyproject.toml
- If fixing lint → See @systematic-delinting for type-related lint rules
- If testing typed code → See @testing-and-quality for pytest type checking
**Before using this skill:**
- Setup mypy → Use @project-structure-and-tooling first
## Quick Reference
| Python Version | Key Type Features |
|----------------|-------------------|
| 3.9 | Built-in generics (`list[T]` instead of `List[T]`) |
| 3.10 | Union with `|`, match statements, ParamSpec |
| 3.11 | Exception groups, Self type, better errors |
| 3.12 | PEP 695 generics, @override decorator |
**Most impactful features:**
1. `| None` instead of `Optional` (3.10+)
2. Built-in generics: `list[T]` not `List[T]` (3.9+)
3. PEP 695: `class Box[T]` not `class Box(Generic[T])` (3.12+)
4. Match statements for complex conditionals (3.10+)
5. @override for explicit method overriding (3.12+)
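Two entries from the version table above, `Self` (3.11) and `ParamSpec` (3.10), are not demonstrated elsewhere in this skill. A brief sketch assuming those minimum versions (`QueryBuilder` and `logged` are illustrative names, not a fixed API):
```python
from collections.abc import Callable
from typing import ParamSpec, Self, TypeVar

P = ParamSpec("P")
R = TypeVar("R")


def logged(func: Callable[P, R]) -> Callable[P, R]:
    """ParamSpec preserves the wrapped function's exact signature for the checker."""
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


class QueryBuilder:
    def __init__(self) -> None:
        self.filters: list[str] = []

    def where(self, clause: str) -> Self:
        # Self means subclasses keep their own type when chaining
        self.filters.append(clause)
        return self


@logged
def add(a: int, b: int) -> int:
    return a + b


query = QueryBuilder().where("active = true").where("age > 18")
```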

File diff suppressed because it is too large

File diff suppressed because it is too large

981
skills/using-python-engineering/scientific-computing-foundations.md Normal file
View File

@@ -0,0 +1,981 @@
# Scientific Computing Foundations
## Overview
**Core Principle:** Vectorize operations, avoid loops. NumPy and pandas are built on C/Fortran code that's orders of magnitude faster than Python loops. The biggest performance gains come from eliminating iteration over rows/elements.
Scientific computing in Python centers on NumPy (arrays) and pandas (dataframes). These libraries enable fast numerical computation on large datasets through vectorized operations and efficient memory layouts. The most common mistake: using Python loops when vectorized operations exist.
## When to Use
**Use this skill when:**
- "NumPy operations"
- "Pandas DataFrame slow"
- "Vectorization"
- "How to avoid loops?"
- "DataFrame iteration"
- "Array performance"
- "Memory usage too high"
- "Large dataset processing"
**Don't use when:**
- Setting up project (use project-structure-and-tooling)
- Profiling needed first (use debugging-and-profiling)
- ML pipeline orchestration (use ml-engineering-workflows)
**Symptoms triggering this skill:**
- Slow DataFrame operations
- High memory usage with arrays
- Using loops over DataFrame rows
- Need to process large datasets efficiently
## NumPy Fundamentals
### Array Creation and Types
```python
import numpy as np
# ❌ WRONG: Creating arrays from Python lists in loop
data = []
for i in range(1000000):
data.append(i * 2)
arr = np.array(data)
# ✅ CORRECT: Use NumPy functions
arr = np.arange(1000000) * 2
# ✅ CORRECT: Pre-allocate for known size
arr = np.empty(1000000, dtype=np.int64)
for i in range(1000000):
arr[i] = i * 2 # Still slow, but better than list
# ✅ BETTER: Fully vectorized
arr = np.arange(1000000, dtype=np.int64) * 2
# ✅ CORRECT: Specify dtype for memory efficiency
# float64 (default): 8 bytes per element
# float32: 4 bytes per element
large_arr = np.zeros(1000000, dtype=np.float32) # Half the memory
# Why this matters: dtype affects both memory usage and performance
# Use smallest dtype that fits your data
```
### Vectorized Operations
```python
# ❌ WRONG: Loop over array elements
arr = np.arange(1000000)
result = np.empty(1000000)
for i in range(len(arr)):
result[i] = arr[i] ** 2 + 2 * arr[i] + 1
# ✅ CORRECT: Vectorized operations
arr = np.arange(1000000)
result = arr ** 2 + 2 * arr + 1
# Speed difference: ~100x faster with vectorization
# ❌ WRONG: Element-wise comparison in loop
matches = []
for val in arr:
if val > 100:
matches.append(val)
result = np.array(matches)
# ✅ CORRECT: Boolean indexing
result = arr[arr > 100]
# ✅ CORRECT: Complex conditions
result = arr[(arr > 100) & (arr < 200)] # Note: & not 'and'
result = arr[(arr < 50) | (arr > 150)] # Note: | not 'or'
```
**Why this matters**: Vectorized operations run in C, avoiding Python interpreter overhead. 10-100x speedup typical.
### Broadcasting
```python
# Broadcasting: Operating on arrays of different shapes
# ✅ CORRECT: Scalar broadcasting
arr = np.array([1, 2, 3, 4])
result = arr + 10 # [11, 12, 13, 14]
# ✅ CORRECT: 1D array broadcast to 2D
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
row_vector = np.array([10, 20, 30])
result = matrix + row_vector
# [[11, 22, 33],
# [14, 25, 36],
# [17, 28, 39]]
# ✅ CORRECT: Column vector broadcasting
col_vector = np.array([[10],
[20],
[30]])
result = matrix + col_vector
# [[11, 12, 13],
# [24, 25, 26],
# [37, 38, 39]]
# ✅ CORRECT: Add axis for broadcasting
row = np.array([1, 2, 3])
col = row[:, np.newaxis] # Convert to column vector
# col shape: (3, 1)
# Outer product via broadcasting
outer = row[np.newaxis, :] * col
# [[1, 2, 3],
# [2, 4, 6],
# [3, 6, 9]]
# ❌ WRONG: Manual broadcasting with loops
result = np.empty_like(matrix)
for i in range(matrix.shape[0]):
for j in range(matrix.shape[1]):
result[i, j] = matrix[i, j] + row_vector[j]
# Why this matters: Broadcasting eliminates loops and is much faster
```
### Memory-Efficient Operations
```python
# ❌ WRONG: Creating unnecessary copies
large_arr = np.random.rand(10000, 10000) # ~800MB
result1 = large_arr + 1 # Creates new 800MB array
result2 = result1 * 2 # Creates another 800MB array
# Total: 2.4GB memory usage
# ✅ CORRECT: In-place operations
large_arr = np.random.rand(10000, 10000)
large_arr += 1 # Modifies in-place, no copy
large_arr *= 2 # Modifies in-place, no copy
# Total: 800MB memory usage
# ✅ CORRECT: Use 'out' parameter
result = np.empty_like(large_arr)
np.add(large_arr, 1, out=result)
np.multiply(result, 2, out=result)
# ❌ WRONG: Unnecessary array copies
arr = np.arange(1000000)
subset = arr[::2].copy() # Allocates a fresh array even though a view would be enough here
subset[0] = 999 # Doesn't affect arr
# ✅ CORRECT: Views avoid copies (when possible)
arr = np.arange(1000000)
view = arr[::2] # View, not copy (shares memory)
view[0] = 999 # Modifies arr too!
# Check if view or copy
print(view.base is None) # False = view (shares memory), True = owns its own memory
```
**Why this matters**: Large arrays consume lots of memory. In-place operations and views avoid copies, reducing memory usage significantly.
### Aggregations and Reductions
```python
# ✅ CORRECT: Axis-aware aggregations
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Sum all elements
total = matrix.sum() # 45
# Sum along axis 0 (columns)
col_sums = matrix.sum(axis=0) # [12, 15, 18]
# Sum along axis 1 (rows)
row_sums = matrix.sum(axis=1) # [6, 15, 24]
# ❌ WRONG: Manual aggregation
total = 0
for row in matrix:
for val in row:
total += val
# ✅ CORRECT: Multiple aggregations
matrix.mean()
matrix.std()
matrix.min()
matrix.max()
matrix.argmin() # Index of minimum
matrix.argmax() # Index of maximum
# ✅ CORRECT: Conditional aggregations
# Sum only positive values
positive_sum = matrix[matrix > 0].sum()
# Count elements > 5
count = (matrix > 5).sum()
# Percentage > 5
percentage = (matrix > 5).mean() * 100
```
## pandas Fundamentals
### DataFrame Creation
```python
import pandas as pd
# ❌ WRONG: Building DataFrame row by row
df = pd.DataFrame()
for i in range(10000):
df = pd.concat([df, pd.DataFrame({'a': [i], 'b': [i*2]})], ignore_index=True)
# Extremely slow: O(n²) complexity
# ✅ CORRECT: Create from dict of lists
data = {
'a': list(range(10000)),
'b': [i * 2 for i in range(10000)]
}
df = pd.DataFrame(data)
# ✅ BETTER: Use NumPy arrays
df = pd.DataFrame({
'a': np.arange(10000),
'b': np.arange(10000) * 2
})
# ✅ CORRECT: From records
records = [{'a': i, 'b': i*2} for i in range(10000)]
df = pd.DataFrame.from_records(records)
```
### The Iteration Anti-Pattern
```python
# ❌ WRONG: iterrows() - THE MOST COMMON MISTAKE
df = pd.DataFrame({
'value': np.random.rand(100000),
'category': np.random.choice(['A', 'B', 'C'], 100000)
})
result = []
for idx, row in df.iterrows(): # VERY SLOW
if row['value'] > 0.5:
result.append(row['value'] * 2)
# ✅ CORRECT: Vectorized operations
mask = df['value'] > 0.5
result = df.loc[mask, 'value'] * 2
# Speed difference: ~100x faster
# ❌ WRONG: apply() on axis=1 (still row-by-row)
df['result'] = df.apply(
lambda row: row['value'] * 2 if row['value'] > 0.5 else 0,
axis=1
)
# Still slow: applies Python function to each row
# ✅ CORRECT: Vectorized with np.where
df['result'] = np.where(df['value'] > 0.5, df['value'] * 2, 0)
# ✅ CORRECT: Boolean indexing + assignment
df['result'] = 0
df.loc[df['value'] > 0.5, 'result'] = df['value'] * 2
```
**Why this matters**: `iterrows()` is the single biggest DataFrame performance killer. ALWAYS look for vectorized alternatives.
### Efficient Filtering and Selection
```python
df = pd.DataFrame({
'A': np.random.rand(100000),
'B': np.random.rand(100000),
'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})
# ❌ WRONG: Chaining filters inefficiently
df_filtered = df[df['A'] > 0.5]
df_filtered = df_filtered[df_filtered['B'] < 0.3]
df_filtered = df_filtered[df_filtered['C'] == 'X']
# ✅ CORRECT: Single boolean mask
mask = (df['A'] > 0.5) & (df['B'] < 0.3) & (df['C'] == 'X')
df_filtered = df[mask]
# ✅ CORRECT: query() for complex filters (cleaner syntax)
df_filtered = df.query('A > 0.5 and B < 0.3 and C == "X"')
# ✅ CORRECT: isin() for multiple values
df_filtered = df[df['C'].isin(['X', 'Y'])]
# ❌ WRONG: String matching in loop
matches = []
for val in df['C']:
if 'X' in val:
matches.append(True)
else:
matches.append(False)
df_filtered = df[matches]
# ✅ CORRECT: Vectorized string operations
df_filtered = df[df['C'].str.contains('X')]
```
### GroupBy Operations
```python
df = pd.DataFrame({
'category': np.random.choice(['A', 'B', 'C'], 100000),
'value': np.random.rand(100000),
'count': np.random.randint(1, 100, 100000)
})
# ❌ WRONG: Manual grouping
groups = {}
for idx, row in df.iterrows():
cat = row['category']
if cat not in groups:
groups[cat] = []
groups[cat].append(row['value'])
results = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}
# ✅ CORRECT: GroupBy
results = df.groupby('category')['value'].mean()
# ✅ CORRECT: Multiple aggregations
results = df.groupby('category').agg({
'value': ['mean', 'std', 'min', 'max'],
'count': 'sum'
})
# ✅ CORRECT: Named aggregations (pandas 0.25+)
results = df.groupby('category').agg(
mean_value=('value', 'mean'),
std_value=('value', 'std'),
total_count=('count', 'sum')
)
# ✅ CORRECT: Custom aggregation function
def range_func(x):
return x.max() - x.min()
results = df.groupby('category')['value'].agg(range_func)
# ✅ CORRECT: Transform (keeps original shape)
df['value_centered'] = df.groupby('category')['value'].transform(
lambda x: x - x.mean()
)
```
**Why this matters**: GroupBy is highly optimized. Much faster than manual grouping. Use built-in aggregations when possible.
## Performance Anti-Patterns
### Anti-Pattern 1: DataFrame Iteration
```python
# ❌ WRONG: Iterating over rows
for idx, row in df.iterrows():
df.at[idx, 'new_col'] = row['a'] + row['b']
# ✅ CORRECT: Vectorized column operation
df['new_col'] = df['a'] + df['b']
# ❌ WRONG: Itertuples (better than iterrows, but still slow)
for row in df.itertuples():
    new_val = row.a + row.b  # Still a Python-level loop over every row
# ✅ CORRECT: Use vectorized operations or apply to columns
```
### Anti-Pattern 2: Repeated Concatenation
```python
# ❌ WRONG: Growing DataFrame in loop
df = pd.DataFrame()
for i in range(10000):
df = pd.concat([df, new_row_df], ignore_index=True)
# O(n²) complexity, extremely slow
# ✅ CORRECT: Collect data, then create DataFrame
data = []
for i in range(10000):
data.append({'a': i, 'b': i*2})
df = pd.DataFrame(data)
# ✅ CORRECT: Pre-allocate NumPy array
arr = np.empty((10000, 2))
for i in range(10000):
arr[i] = [i, i*2]
df = pd.DataFrame(arr, columns=['a', 'b'])
```
### Anti-Pattern 3: Using apply When Vectorized Exists
```python
# ❌ WRONG: apply() for simple operations
df['result'] = df['value'].apply(lambda x: x * 2)
# ✅ CORRECT: Direct vectorized operation
df['result'] = df['value'] * 2
# ❌ WRONG: apply() for conditions
df['category'] = df['value'].apply(lambda x: 'high' if x > 0.5 else 'low')
# ✅ CORRECT: np.where or pd.cut
df['category'] = np.where(df['value'] > 0.5, 'high', 'low')
# ✅ CORRECT: pd.cut for binning
df['category'] = pd.cut(df['value'], bins=[0, 0.5, 1.0], labels=['low', 'high'])
# When apply IS appropriate:
# - Complex logic not vectorizable
# - Need to call external function per row
# But verify vectorization truly impossible first
```
### Anti-Pattern 4: Not Using Categorical Data
```python
# ❌ WRONG: String columns for repeated values
df = pd.DataFrame({
'category': ['A'] * 10000 + ['B'] * 10000 + ['C'] * 10000
})
# Memory: ~240KB (each string stored separately)
# ✅ CORRECT: Categorical type
df['category'] = pd.Categorical(df['category'])
# Memory: ~30KB (integers + small string table)
# ✅ CORRECT: Define categories at creation
df = pd.DataFrame({
'category': pd.Categorical(
['A'] * 10000 + ['B'] * 10000,
categories=['A', 'B', 'C']
)
})
# When to use categorical:
# - Limited number of unique values (< 50% of rows)
# - Repeated string/object values
# - Memory constraints
# - Faster groupby operations
```
## Memory Optimization
### Choosing Appropriate dtypes
```python
# ❌ WRONG: Default dtypes waste memory
df = pd.DataFrame({
'int_col': [1, 2, 3, 4, 5], # int64 by default
'float_col': [1.0, 2.0, 3.0], # float64 by default
'str_col': ['a', 'b', 'c', 'd', 'e'] # object dtype
})
print(df.memory_usage(deep=True))
# ✅ CORRECT: Optimize dtypes
df = pd.DataFrame({
'int_col': pd.array([1, 2, 3, 4, 5], dtype='int8'), # -128 to 127
'float_col': pd.array([1.0, 2.0, 3.0], dtype='float32'),
'str_col': pd.Categorical(['a', 'b', 'c', 'd', 'e'])
})
# ✅ CORRECT: Downcast after loading
df = pd.read_csv('data.csv')
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
# Integer dtype ranges:
# int8: -128 to 127
# int16: -32,768 to 32,767
# int32: -2.1B to 2.1B
# int64: -9.2E18 to 9.2E18
# Float dtype precision:
# float16: ~3 decimal digits (rarely used)
# float32: ~7 decimal digits
# float64: ~15 decimal digits
```
### Chunked Processing for Large Files
```python
# ❌ WRONG: Loading entire file into memory
df = pd.read_csv('huge_file.csv') # 10GB file, OOM!
df_processed = process_dataframe(df)
# ✅ CORRECT: Process in chunks
chunk_size = 100000
results = []
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
processed = process_dataframe(chunk)
results.append(processed)
df_final = pd.concat(results, ignore_index=True)
# ✅ CORRECT: Streaming aggregation
totals = {'A': 0, 'B': 0, 'C': 0}
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
for col in totals:
totals[col] += chunk[col].sum()
# ✅ CORRECT: Only load needed columns
df = pd.read_csv('huge_file.csv', usecols=['col1', 'col2', 'col3'])
```
### Using Sparse Data Structures
```python
# ❌ WRONG: Dense array for sparse data
# Data with 99% zeros
dense = np.zeros(1000000)
dense[::100] = 1 # Only 1% non-zero
# Memory: 8MB (float64 * 1M)
# ✅ CORRECT: Sparse array
from scipy.sparse import csr_matrix
sparse = csr_matrix(dense)
# Memory: ~80KB (only stores non-zero values + indices)
# ✅ CORRECT: Sparse DataFrame
df = pd.DataFrame({
'A': pd.arrays.SparseArray([0] * 100 + [1] + [0] * 100),
'B': pd.arrays.SparseArray([0] * 50 + [2] + [0] * 150)
})
```
## Data Pipeline Patterns
### Method Chaining
```python
# ❌ WRONG: Many intermediate variables
df = pd.read_csv('data.csv')
df = df[df['value'] > 0]
df = df.groupby('category')['value'].mean()
df = df.reset_index()
df = df.rename(columns={'value': 'mean_value'})
# ✅ CORRECT: Method chaining
df = (
pd.read_csv('data.csv')
.query('value > 0')
.groupby('category')['value']
.mean()
.reset_index()
.rename(columns={'value': 'mean_value'})
)
# ✅ CORRECT: Pipe for custom functions
def remove_outliers(df, column, n_std=3):
mean = df[column].mean()
std = df[column].std()
return df[
(df[column] > mean - n_std * std) &
(df[column] < mean + n_std * std)
]
df = (
pd.read_csv('data.csv')
.pipe(remove_outliers, 'value', n_std=2)
.groupby('category')['value']
.mean()
)
```
### Efficient Merges and Joins
```python
# ❌ WRONG: Multiple small merges
for small_df in list_of_dfs:
main_df = main_df.merge(small_df, on='key')
# Inefficient: creates many intermediate copies
# ✅ CORRECT: Join all at once on a shared index (one pass instead of repeated merges)
df_merged = main_df.set_index('key').join([d.set_index('key') for d in list_of_dfs])
# ✅ CORRECT: Optimize merge with sorted/indexed data
df1 = df1.set_index('key').sort_index()
df2 = df2.set_index('key').sort_index()
result = df1.merge(df2, left_index=True, right_index=True)
# ✅ CORRECT: Use indicator to track merge sources
result = df1.merge(df2, on='key', how='outer', indicator=True)
print(result['_merge'].value_counts())
# Shows: left_only, right_only, both
# ❌ WRONG: Cartesian product by accident
# df1: 1000 rows, df2: 1000 rows
result = df1.merge(df2, on='wrong_key')
# result: 1,000,000 rows! (if all keys match)
# ✅ CORRECT: Validate merge
result = df1.merge(df2, on='key', validate='1:1')
# Raises error if not one-to-one relationship
```
### Handling Missing Data
```python
# ❌ WRONG: Dropping all rows with any NaN
df_clean = df.dropna() # Might lose most of data
# ✅ CORRECT: Drop rows with NaN in specific columns
df_clean = df.dropna(subset=['important_col1', 'important_col2'])
# ✅ CORRECT: Fill NaN with appropriate values
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())
df['category_col'] = df['category_col'].fillna('Unknown')
# ✅ CORRECT: Forward/backward fill for time series
df['value'] = df['value'].ffill() # Forward fill (fillna(method='ffill') is deprecated)
# ✅ CORRECT: Interpolation
df['value'] = df['value'].interpolate(method='linear')
# ❌ WRONG: Not checking for NaN before operations
result = df['value'].mean() # NaN propagates, might return NaN
# ✅ CORRECT: Explicit NaN handling
result = df['value'].mean(skipna=True) # Default, but explicit is better
```
## Advanced NumPy Techniques
### Universal Functions (ufuncs)
```python
# ✅ CORRECT: Using built-in ufuncs
arr = np.random.rand(1000000)
# Trigonometric
result = np.sin(arr)
result = np.cos(arr)
# Exponential
result = np.exp(arr)
result = np.log(arr)
# Comparison
result = np.maximum(arr, 0.5) # Element-wise max with scalar
result = np.minimum(arr, 0.5)
# ✅ CORRECT: Custom ufunc with @vectorize
from numba import vectorize
@vectorize
def custom_func(x):
if x > 0.5:
return x ** 2
else:
return x ** 3
result = custom_func(arr) # Runs at C speed
```
### Advanced Indexing
```python
# ✅ CORRECT: Fancy indexing
arr = np.arange(100)
indices = [0, 5, 10, 15, 20]
result = arr[indices] # Select specific indices
# ✅ CORRECT: Boolean indexing with multiple conditions
arr = np.random.rand(1000000)
mask = (arr > 0.3) & (arr < 0.7)
result = arr[mask]
# ✅ CORRECT: np.where for conditional replacement
arr = np.random.rand(1000)
result = np.where(arr > 0.5, arr, 0) # Replace values <= 0.5 with 0
# ✅ CORRECT: Multi-dimensional indexing
matrix = np.random.rand(100, 100)
rows = [0, 10, 20]
cols = [5, 15, 25]
result = matrix[rows, cols] # Select specific elements
# Get diagonal
diagonal = matrix[np.arange(100), np.arange(100)]
# Or use np.diag
diagonal = np.diag(matrix)
```
### Linear Algebra Operations
```python
# ✅ CORRECT: Matrix multiplication
A = np.random.rand(1000, 500)
B = np.random.rand(500, 200)
C = A @ B # Python 3.5+ matrix multiply operator
# Or
C = np.dot(A, B)
C = np.matmul(A, B)
# ✅ CORRECT: Solve linear system Ax = b
A = np.random.rand(100, 100)
b = np.random.rand(100)
x = np.linalg.solve(A, b)
# ✅ CORRECT: Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
# ✅ CORRECT: SVD (Singular Value Decomposition)
U, s, Vt = np.linalg.svd(A)
# ✅ CORRECT: Inverse
A_inv = np.linalg.inv(A)
# ❌ WRONG: Using inverse for solving Ax = b
x = np.linalg.inv(A) @ b # Slower and less numerically stable
# ✅ CORRECT: Use solve directly
x = np.linalg.solve(A, b)
```
## Type Hints for NumPy and pandas
### NumPy Type Hints
```python
import numpy as np
from numpy.typing import NDArray
# ✅ CORRECT: Type hint for NumPy arrays
def process_array(arr: NDArray[np.float64]) -> NDArray[np.float64]:
return arr * 2
# ✅ CORRECT: Generic array type
def normalize(arr: NDArray) -> NDArray:
return (arr - arr.mean()) / arr.std()
# ✅ CORRECT: Named aliases for array roles (TypeAlias, Python 3.10+; shape itself is not checked)
from typing import TypeAlias
Vector: TypeAlias = NDArray[np.float64] # 1D array
Matrix: TypeAlias = NDArray[np.float64] # 2D array
def matrix_multiply(A: Matrix, B: Matrix) -> Matrix:
return A @ B
```
### pandas Type Hints
```python
import pandas as pd
# ✅ CORRECT: Type hints for Series and DataFrame
def process_series(s: pd.Series) -> pd.Series:
return s * 2
def process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
return df[df['value'] > 0]
# ✅ CORRECT: More specific DataFrame types (using Protocols)
from typing import Protocol
class DataFrameWithColumns(Protocol):
"""DataFrame with specific columns."""
def __getitem__(self, key: str) -> pd.Series: ...
def analyze_data(df: DataFrameWithColumns) -> float:
return df['value'].mean()
```
## Real-World Patterns
### Time Series Operations
```python
# ✅ CORRECT: Efficient time series resampling
df = pd.DataFrame({
'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='1s'),
'value': np.random.rand(1000000)
})
df = df.set_index('timestamp')
# Resample to 1-minute intervals
df_resampled = df.resample('1min').agg({
'value': ['mean', 'std', 'min', 'max']
})
# ✅ CORRECT: Rolling window operations
df['rolling_mean'] = df['value'].rolling(window=60).mean()
df['rolling_std'] = df['value'].rolling(window=60).std()
# ✅ CORRECT: Lag features
df['value_lag1'] = df['value'].shift(1)
df['value_lag60'] = df['value'].shift(60)
# ✅ CORRECT: Difference for stationarity
df['value_diff'] = df['value'].diff()
```
### Multi-Index Operations
```python
# ✅ CORRECT: Creating multi-index DataFrame
df = pd.DataFrame({
'country': ['USA', 'USA', 'UK', 'UK'],
'city': ['NYC', 'LA', 'London', 'Manchester'],
'value': [100, 200, 150, 175]
})
df = df.set_index(['country', 'city'])
# Accessing with multi-index
df.loc['USA'] # All USA cities
df.loc[('USA', 'NYC')] # Specific city
# ✅ CORRECT: Cross-section
df.xs('USA', level='country')
df.xs('London', level='city')
# ✅ CORRECT: GroupBy with multi-index
df.groupby(level='country').mean()
```
### Parallel Processing with Dask
```python
# For datasets larger than available memory, use Dask
import dask.dataframe as dd
# ✅ CORRECT: Dask for out-of-core processing
df = dd.read_csv('huge_file.csv')
result = df.groupby('category')['value'].mean().compute()
# Dask uses same API as pandas, but lazy evaluation
# Only computes when .compute() is called
```
## Anti-Pattern Summary
### Top 5 Performance Killers
1. **iterrows()** - Use vectorized operations
2. **Growing DataFrame in loop** - Collect data, then create DataFrame
3. **apply() for simple operations** - Use vectorized alternatives
4. **Not using categorical for strings** - Convert to categorical
5. **Loading entire file when chunking works** - Use chunksize parameter
### Memory Usage Mistakes
1. **Using float64 when float32 sufficient** - Halves memory
2. **Not using categorical for repeated strings** - 10x memory savings
3. **Creating unnecessary copies** - Use in-place operations
4. **Loading all columns when few needed** - Use usecols parameter
## Decision Trees
### Should I Use NumPy or pandas?
```
What's your data structure?
├─ Homogeneous numeric array → NumPy
├─ Heterogeneous tabular data → pandas
├─ Time series → pandas
└─ Linear algebra → NumPy
```
### How to Optimize DataFrame Operation?
```
Can I vectorize?
├─ Yes → Use vectorized pandas/NumPy operations
└─ No → Can I use groupby?
├─ Yes → Use groupby with built-in aggregations
└─ No → Can I use apply on columns (not rows)?
├─ Yes → Use apply on Series
└─ No → Use itertuples (last resort)
```
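As a sketch of the last two rungs of that ladder (column names and the `parse_code` logic are made up for illustration): `apply` on a single Series beats row-wise `apply`, and `itertuples` stays the last resort for genuinely per-row side effects:
```python
import pandas as pd

df = pd.DataFrame({
    'code': ['A-17', 'B-03', 'A-99'],
    'value': [1.0, 2.0, 3.0],
})

# Column-wise apply: still a Python function, but only over one Series
def parse_code(code: str) -> int:
    prefix, number = code.split('-')
    return int(number) if prefix == 'A' else -int(number)

df['parsed'] = df['code'].apply(parse_code)

# itertuples as a true last resort, e.g. when each row triggers an external call
for row in df.itertuples():
    if row.value > 2.0:
        print(f"would notify about {row.code}")
```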
### Memory Optimization Strategy
```
Is memory usage high?
├─ Yes → Check dtypes (downcast if possible)
│ └─ Still high? → Use categorical for strings
│ └─ Still high? → Process in chunks
└─ No → Continue with current approach
```
## Integration with Other Skills
**After using this skill:**
- If need ML pipelines → See @ml-engineering-workflows for experiment tracking
- If performance issues persist → See @debugging-and-profiling for profiling
- If type hints needed → See @modern-syntax-and-types for advanced typing
**Before using this skill:**
- If unsure if slow → Use @debugging-and-profiling to profile first
- If setting up project → Use @project-structure-and-tooling for dependencies
## Quick Reference
### NumPy Quick Wins
```python
# Vectorization
result = arr ** 2 + 2 * arr + 1 # Not: for loop
# Boolean indexing
result = arr[arr > 0] # Not: list comprehension
# Broadcasting
result = matrix + row_vector # Not: loop over rows
# In-place operations
arr += 1 # Not: arr = arr + 1
```
### pandas Quick Wins
```python
# Never iterrows
df['new'] = df['a'] + df['b'] # Not: iterrows
# Vectorized conditions
df['category'] = np.where(df['value'] > 0.5, 'high', 'low')
# Categorical for strings
df['category'] = pd.Categorical(df['category'])
# Query for complex filters
df.query('A > 0.5 and B < 0.3') # Not: multiple []
```
### Memory Optimization Checklist
- [ ] Use smallest dtype that fits data
- [ ] Convert repeated strings to categorical
- [ ] Use chunking for files > available RAM
- [ ] Avoid unnecessary copies (use views or in-place ops)
- [ ] Only load needed columns (usecols in read_csv)
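A minimal sketch that walks this checklist over an arbitrary DataFrame (the 50% cardinality threshold, the `shrink` helper, and the column names are assumptions, not fixed rules):
```python
import numpy as np
import pandas as pd


def shrink(df: pd.DataFrame, categorical_threshold: float = 0.5) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality strings to categorical."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast='integer')
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast='float')
        elif series.dtype == object:
            # Categorical only pays off when values repeat a lot
            if series.nunique() / len(series) < categorical_threshold:
                out[col] = series.astype('category')
    return out


df = pd.DataFrame({
    'id': np.arange(100_000),
    'value': np.random.rand(100_000),
    'category': np.random.choice(['A', 'B', 'C'], 100_000),
})
before = df.memory_usage(deep=True).sum()
after = shrink(df).memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```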

File diff suppressed because it is too large

File diff suppressed because it is too large