Initial commit

12 .claude-plugin/plugin.json Normal file

@@ -0,0 +1,12 @@
{
  "name": "axiom-python-engineering",
  "description": "Modern Python 3.12+ engineering: types, testing, async, scientific computing, ML workflows - 10 skills",
  "version": "1.1.1",
  "author": {
    "name": "tachyon-beep",
    "email": "zhongweili@tubi.tv"
  },
  "skills": [
    "./skills"
  ]
}

3 README.md Normal file

@@ -0,0 +1,3 @@
# axiom-python-engineering

Modern Python 3.12+ engineering: types, testing, async, scientific computing, ML workflows - 10 skills

85 plugin.lock.json Normal file

@@ -0,0 +1,85 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:tachyon-beep/skillpacks:plugins/axiom-python-engineering",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "d2e2cd2da633625dc09d9fa35a3dc0f60c3cafe5",
    "treeHash": "0d643778d6d2de81b6df637544298a6b9a3214de1a3ca03e3ba9d6eaf0cd3f4c",
    "generatedAt": "2025-11-28T10:28:30.940625Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "axiom-python-engineering",
    "description": "Modern Python 3.12+ engineering: types, testing, async, scientific computing, ML workflows - 10 skills",
    "version": "1.1.1"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "501cc915afac6582b68359b9cd06b82b4f156ca4439348d16be2515a17d4425e"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "f8c9d4521cad5f45e5aade5cc4aa8041a256579a1c8b00041d146bcb830a01b8"
      },
      {
        "path": "skills/.gitkeep",
        "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
      },
      {
        "path": "skills/using-python-engineering/resolving-mypy-errors.md",
        "sha256": "f5ff38011ecf1304093b84cfa59f41daf585d36a9d75598ac4a045e7c41ceafa"
      },
      {
        "path": "skills/using-python-engineering/ml-engineering-workflows.md",
        "sha256": "3b45246f1477f341e770877a95fd5b67f25774e1b5e7c21898aba2f1b27fa8d0"
      },
      {
        "path": "skills/using-python-engineering/project-structure-and-tooling.md",
        "sha256": "3e225c744c5138ec6c4945f60e6bc959aac1663d1a4cfb741efaf0e622351dc2"
      },
      {
        "path": "skills/using-python-engineering/modern-syntax-and-types.md",
        "sha256": "56a51be261616cc49041af9dcb5943f6e5b3f2424b84669a6f7df84a5b6458c3"
      },
      {
        "path": "skills/using-python-engineering/systematic-delinting.md",
        "sha256": "57df2647863de7e4937b4c5d92cc4832559e10f0a77917f64802bf4bf89ace83"
      },
      {
        "path": "skills/using-python-engineering/testing-and-quality.md",
        "sha256": "9515f2638edfaaedf0d8664beb141de376f5f6d233aad0fd128588c1fffc257d"
      },
      {
        "path": "skills/using-python-engineering/scientific-computing-foundations.md",
        "sha256": "2f1157d97cbc98ed3b7fbf2489b9e5ef8a6c0c05847095bd5b0acb2d45f4cb71"
      },
      {
        "path": "skills/using-python-engineering/SKILL.md",
        "sha256": "f265281bc5cd8efd8e3e034ddcbad83038485b2789aa01e0480024cf9f34aee4"
      },
      {
        "path": "skills/using-python-engineering/async-patterns-and-concurrency.md",
        "sha256": "83003bd109a5393c689415fe9529a2fb8b77cbc10e4aaf5ec706a609e1122b50"
      },
      {
        "path": "skills/using-python-engineering/debugging-and-profiling.md",
        "sha256": "9073f36ae95bcc55458bc78aedacf6e005d1fb6b5d60b883fc7ff6b1e4d61260"
      }
    ],
    "dirSha256": "0d643778d6d2de81b6df637544298a6b9a3214de1a3ca03e3ba9d6eaf0cd3f4c"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}

0 skills/.gitkeep Normal file

398 skills/using-python-engineering/SKILL.md Normal file

@@ -0,0 +1,398 @@
---
name: using-python-engineering
description: Routes to appropriate Python specialist skill based on symptoms and problem type
mode: true
---

# Using Python Engineering

## Overview

This meta-skill routes you to the right Python specialist based on symptoms. Python engineering problems fall into distinct categories that require specialized knowledge. Load this skill when you encounter Python-specific issues but aren't sure which specialized skill to use.

**Core Principle**: Different Python problems require different specialists. Match symptoms to the appropriate specialist skill. Don't guess at solutions—route to the expert.

## When to Use

Load this skill when:
- Working with Python and encountering problems
- User mentions: "Python", "type hints", "mypy", "pytest", "async", "pandas", "numpy"
- Need to implement Python projects or optimize performance
- Setting up Python tooling or fixing lint warnings
- Debugging Python code or profiling performance

**Don't use for**: Non-Python languages, algorithm theory (not Python-specific), deployment infrastructure (not Python-specific)

---

## Routing by Symptom

### Type Errors and Type Hints

**Symptoms - Learning Type Syntax**:
- "How to use type hints?"
- "Python 3.12 type syntax"
- "Generic types"
- "Protocol vs ABC"
- "TypeVar usage"
- "Configure mypy/pyright"

**Route to**: See [modern-syntax-and-types.md](modern-syntax-and-types.md) for comprehensive type system guidance.

**Why**: Learning type hint syntax, patterns, and configuration.

**Symptoms - Fixing Type Errors**:
- "mypy error: Incompatible types"
- "mypy error: Argument has incompatible type"
- "How to fix mypy errors?"
- "100+ mypy errors, where to start?"
- "When to use type: ignore?"
- "Add types to legacy code"
- "Understanding mypy error messages"

**Route to**: See [resolving-mypy-errors.md](resolving-mypy-errors.md) for systematic mypy error resolution.

**Why**: Resolving type errors requires systematic methodology, understanding error messages, and knowing when to fix vs ignore.

**Example queries**:
- "Getting mypy error about incompatible types" → [resolving-mypy-errors.md](resolving-mypy-errors.md)
- "How to use Python 3.12 type parameter syntax?" → [modern-syntax-and-types.md](modern-syntax-and-types.md)
- "Fix 150 mypy errors systematically" → [resolving-mypy-errors.md](resolving-mypy-errors.md)

---

### Project Setup and Tooling

**Symptoms**:
- "How to structure my Python project?"
- "Setup pyproject.toml"
- "Configure ruff/black/mypy"
- "Dependency management"
- "Pre-commit hooks"
- "Package my project"
- "src layout vs flat layout"

**Route to**: [project-structure-and-tooling.md](project-structure-and-tooling.md)

**Why**: Project setup involves multiple tools (ruff, mypy, pre-commit) and architectural decisions (src vs flat layout). Need comprehensive setup guide.

**Example queries**:
- "Starting new Python project, how to set up?"
- "Configure ruff for my team"
- "Should I use poetry or pip-tools?"

---

### Lint Warnings and Delinting

**Symptoms**:
- "Too many lint warnings"
- "Fix ruff errors"
- "How to delint legacy code?"
- "Systematic approach to fixing lint"
- "Don't want to disable warnings"
- "Clean up codebase lint"

**Route to**: [systematic-delinting.md](systematic-delinting.md)

**Why**: Delinting requires systematic methodology to fix warnings without disabling them or over-refactoring. Process-driven approach needed.

**Example queries**:
- "1000+ lint warnings, where to start?"
- "Fix lint warnings systematically"
- "Legacy code has no linting"

**Note**: If setting UP linting (not fixing), route to [project-structure-and-tooling.md](project-structure-and-tooling.md) first.

---

### Testing Issues

**Symptoms**:
- "pytest not working"
- "Flaky tests"
- "How to structure tests?"
- "Fixture issues"
- "Mock/patch problems"
- "Test coverage"
- "Property-based testing"

**Route to**: [testing-and-quality.md](testing-and-quality.md)

**Why**: Testing requires understanding pytest architecture, fixture scopes, mocking patterns, and test organization strategies.

**Example queries**:
- "Tests fail intermittently"
- "How to use pytest fixtures properly?"
- "Improve test coverage"

---

### Async/Await Issues

**Symptoms**:
- "asyncio not working"
- "async/await errors"
- "Event loop issues"
- "Blocking the event loop"
- "TaskGroup (Python 3.11+)"
- "Async context managers"
- "When to use async?"

**Route to**: [async-patterns-and-concurrency.md](async-patterns-and-concurrency.md)

**Why**: Async programming has unique patterns and pitfalls (blocking the event loop), and requires understanding structured concurrency.

**Example queries**:
- "Getting 'coroutine never awaited' error"
- "How to use Python 3.11 TaskGroup?"
- "Async code is slow"

---

### Performance and Profiling

**Symptoms**:
- "Python code is slow"
- "How to profile?"
- "Memory leak"
- "Optimize performance"
- "Bottleneck identification"
- "CPU profiling"
- "Memory profiling"

**Route to**: [debugging-and-profiling.md](debugging-and-profiling.md) FIRST

**Why**: MUST profile before optimizing. Many "performance" problems are actually I/O or algorithm issues. Profile to identify the real bottleneck.

**After profiling**, may route to:
- [async-patterns-and-concurrency.md](async-patterns-and-concurrency.md) if I/O-bound
- [scientific-computing-foundations.md](scientific-computing-foundations.md) if array operations are slow
- Same skill for optimization strategies

**Example queries**:
- "Code is slow, how to speed up?"
- "Find bottleneck in my code"
- "Memory usage too high"

---

### Array and DataFrame Operations

**Symptoms**:
- "NumPy operations"
- "Pandas DataFrame slow"
- "Vectorization"
- "Array performance"
- "Replace loops with numpy"
- "DataFrame best practices"
- "Large dataset processing"

**Route to**: [scientific-computing-foundations.md](scientific-computing-foundations.md)

**Why**: NumPy/pandas have specific patterns for vectorization, memory efficiency, and avoiding anti-patterns (iterrows).

**Example queries**:
- "How to vectorize this loop?"
- "Pandas operation too slow"
- "DataFrame memory usage high"

---

### ML Experiment Tracking and Workflows

**Symptoms**:
- "Track ML experiments"
- "MLflow setup"
- "Reproducible ML pipelines"
- "ML model lifecycle"
- "Hyperparameter management"
- "ML monitoring"
- "Data versioning"

**Route to**: [ml-engineering-workflows.md](ml-engineering-workflows.md)

**Why**: ML workflows require experiment tracking, reproducibility patterns, configuration management, and monitoring strategies.

**Example queries**:
- "How to track experiments with MLflow?"
- "Make ML training reproducible"
- "Monitor model in production"

---

## Cross-Cutting Scenarios

### Multiple Skills Needed

Some scenarios require multiple specialized skills in sequence:

**New Python project setup with ML**:
1. Route to [project-structure-and-tooling.md](project-structure-and-tooling.md) (setup)
2. THEN [ml-engineering-workflows.md](ml-engineering-workflows.md) (ML specifics)

**Legacy code cleanup**:
1. Route to [project-structure-and-tooling.md](project-structure-and-tooling.md) (setup linting)
2. THEN [systematic-delinting.md](systematic-delinting.md) (fix warnings)

**Slow pandas code**:
1. Route to [debugging-and-profiling.md](debugging-and-profiling.md) (profile)
2. THEN [scientific-computing-foundations.md](scientific-computing-foundations.md) (optimize)

**Type hints for existing code**:
1. Route to [project-structure-and-tooling.md](project-structure-and-tooling.md) (setup mypy)
2. THEN [modern-syntax-and-types.md](modern-syntax-and-types.md) (add types)

**Load in order of execution**: Setup before optimization, diagnosis before fixes, structure before specialization.

---

## Ambiguous Queries - Ask First

When the symptom is unclear, ASK ONE clarifying question:

**"Fix my Python code"**
→ Ask: "What specific issue? Type errors? Lint warnings? Tests failing? Performance?"

**"Optimize my code"**
→ Ask: "Optimize what? Speed? Memory? Code quality?"

**"Setup Python project"**
→ Ask: "General project or ML-specific? Starting fresh or fixing existing?"

**"My code doesn't work"**
→ Ask: "What's broken? Import errors? Type errors? Runtime errors? Tests?"

**Never guess when ambiguous. Ask once, route accurately.**

---

## Common Routing Mistakes

| Symptom | Wrong Route | Correct Route | Why |
|---------|-------------|---------------|-----|
| "Code slow" | async-patterns | debugging-and-profiling FIRST | Don't optimize without profiling |
| "Setup linting and fix" | systematic-delinting only | project-structure THEN delinting | Setup before fixing |
| "Pandas slow" | debugging only | debugging THEN scientific-computing | Profile then vectorize |
| "Add type hints" | modern-syntax only | project-structure THEN modern-syntax | Setup mypy first |
| "Fix 1000 lint warnings" | project-structure | systematic-delinting | Process for fixing, not setup |
| "Fix mypy errors" | modern-syntax-and-types | resolving-mypy-errors | Syntax vs resolution process |
| "100 mypy errors" | modern-syntax-and-types | resolving-mypy-errors | Need systematic approach |

**Key principle**: Diagnosis before solutions, setup before optimization, profile before performance fixes.

---

## Red Flags - Stop and Route

If you catch yourself about to:
- Suggest "use async" for slow code → Route to [debugging-and-profiling.md](debugging-and-profiling.md) to profile first
- Show a pytest example → Route to [testing-and-quality.md](testing-and-quality.md) for complete patterns
- Suggest "just fix the lint warnings" → Route to [systematic-delinting.md](systematic-delinting.md) for methodology
- Show type hint syntax → Route to [modern-syntax-and-types.md](modern-syntax-and-types.md) for the comprehensive guide
- Suggest "use numpy instead" → Route to [scientific-computing-foundations.md](scientific-computing-foundations.md) for vectorization patterns

**All of these mean: You're about to give incomplete advice. Route to the specialist instead.**

---

## Common Rationalizations (Don't Do These)

| Excuse | Reality | What To Do |
|--------|---------|------------|
| "User is rushed, skip routing" | Routing takes 5 seconds. Wrong fix wastes hours. | Route anyway - specialists have quick answers |
| "Simple question" | Simple questions deserve complete answers. | Route to specialist for comprehensive coverage |
| "Just need quick syntax" | Syntax without context leads to misuse. | Route to get syntax + patterns + anti-patterns |
| "User sounds experienced" | Experience in one area ≠ expertise in all Python. | Route based on symptoms, not perceived skill |
| "Already tried X" | May have done X wrong or incompletely. | Route to specialist to verify X properly |
| "Too many skills" | 9 focused skills > 1 overwhelming wall of text. | Use router to navigate - that's its purpose |

**If you catch yourself thinking ANY of these, STOP and route to the specialist.**

---

## Red Flags Checklist - Self-Check Before Answering

Before giving ANY Python advice, ask yourself:

1. ❓ **Did I identify the symptom?**
   - If no → Read the query again, identify symptoms

2. ❓ **Is this symptom in my routing table?**
   - If yes → Route to that specialist
   - If no → Ask a clarifying question

3. ❓ **Am I about to give advice directly?**
   - If yes → STOP. Why am I not routing?
   - Check the rationalization table - am I making excuses?

4. ❓ **Is this a diagnosis issue or a solution issue?**
   - Diagnosis → Route to profiling/debugging skill FIRST
   - Solution → Route to the appropriate implementation skill

5. ❓ **Is the query ambiguous?**
   - If yes → Ask ONE clarifying question
   - If no → Route confidently

6. ❓ **Am I feeling pressure to skip routing?**
   - Time pressure → Route anyway (faster overall)
   - Complexity → Route anyway (specialists handle complexity)
   - User confidence → Route anyway (verify assumptions)
   - "Simple" question → Route anyway (simple deserves correct)

**If you failed ANY check above, do NOT give direct advice. Route to the specialist or ask a clarifying question.**

---

## Python Engineering Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

1. [modern-syntax-and-types.md](modern-syntax-and-types.md) - Type hints, mypy/pyright, Python 3.10-3.12 features, generics, protocols
2. [resolving-mypy-errors.md](resolving-mypy-errors.md) - Systematic mypy error resolution, type: ignore best practices, typing legacy code
3. [project-structure-and-tooling.md](project-structure-and-tooling.md) - pyproject.toml, ruff, pre-commit, dependency management, packaging
4. [systematic-delinting.md](systematic-delinting.md) - Process for fixing lint warnings without disabling or over-refactoring
5. [testing-and-quality.md](testing-and-quality.md) - pytest patterns, fixtures, mocking, coverage, property-based testing
6. [async-patterns-and-concurrency.md](async-patterns-and-concurrency.md) - async/await, asyncio, TaskGroup, structured concurrency, threading
7. [scientific-computing-foundations.md](scientific-computing-foundations.md) - NumPy/pandas, vectorization, memory efficiency, large datasets
8. [ml-engineering-workflows.md](ml-engineering-workflows.md) - MLflow, experiment tracking, reproducibility, monitoring, model lifecycle
9. [debugging-and-profiling.md](debugging-and-profiling.md) - pdb/debugpy, cProfile, memory_profiler, optimization strategies

---

## When NOT to Use Python Skills

**Skip the Python pack when**:
- Non-Python language (use the appropriate language pack)
- Algorithm selection (use a computer science / algorithms pack)
- Infrastructure/deployment (use a DevOps/infrastructure pack)
- Database design (use a database pack)

**Python pack is for**: Python-specific implementation, tooling, patterns, debugging, and optimization.

---

## Diagnosis-First Principle

**Critical**: Many Python issues require diagnosis before solutions:

| Issue Type | Diagnosis Skill | Then Solution Skill |
|------------|----------------|---------------------|
| Performance | debugging-and-profiling | async or scientific-computing |
| Slow arrays | debugging-and-profiling | scientific-computing-foundations |
| Type errors | resolving-mypy-errors | resolving-mypy-errors (same) |
| Lint warnings | systematic-delinting | systematic-delinting (same) |

**If unclear what's wrong, route to the diagnostic skill first.**

---

## Integration Notes

**Phase 1 - Standalone**: Python skills are self-contained

**Future cross-references**:
- superpowers:test-driven-development (TDD methodology before implementing)
- superpowers:systematic-debugging (systematic debugging before profiling)

**Current focus**: Route within the Python pack only. Other packs handle other concerns.

1131 skills/using-python-engineering/async-patterns-and-concurrency.md Normal file

File diff suppressed because it is too large

1047 skills/using-python-engineering/debugging-and-profiling.md Normal file

File diff suppressed because it is too large

1072 skills/using-python-engineering/ml-engineering-workflows.md Normal file

File diff suppressed because it is too large

848 skills/using-python-engineering/modern-syntax-and-types.md Normal file

@@ -0,0 +1,848 @@

# Modern Python Syntax and Types

## Overview

**Core Principle:** Type hints make code self-documenting and catch bugs before runtime. Python 3.10-3.12 introduced powerful type system features and syntax improvements. Use them.

Modern Python is optionally statically typed, with structural pattern matching (match statements) and cleaner syntax. The type system evolved dramatically: `|` union syntax (3.10), exception groups (3.11), PEP 695 generics (3.12). Master these to write production-quality Python.

## When to Use

**Use this skill when:**
- "mypy error: ..." or "pyright error: ..."
- Adding type hints to existing code
- Using Python 3.10+ features (match, | unions, generics)
- Configuring static type checkers
- Type errors with generics, protocols, or TypedDict

**Don't use when:**
- Setting up project structure (use project-structure-and-tooling)
- Runtime type checking needed (use pydantic or similar)
- Performance optimization (use debugging-and-profiling)

**Symptoms triggering this skill:**
- "Incompatible type" errors
- "How to type hint X?"
- "Use Python 3.12 features"
- "Configure mypy strict mode"

## Type Hints Fundamentals

### Basic Annotations

```python
# ❌ WRONG: No type hints
def calculate_total(prices, tax_rate):
    return sum(prices) * (1 + tax_rate)

# ✅ CORRECT: Clear types
def calculate_total(prices: list[float], tax_rate: float) -> float:
    return sum(prices) * (1 + tax_rate)

# Why this matters: the type checker catches calculate_total([1, 2], "0.1")
# immediately instead of failing at runtime with TypeError
```

### Built-in Collection Types (Python 3.9+)

```python
# ❌ WRONG: Using typing.List, typing.Dict (deprecated)
from typing import List, Dict, Tuple

def process(items: List[str]) -> Dict[str, int]:
    return {item: len(item) for item in items}

# ✅ CORRECT: Use built-in types directly (Python 3.9+)
def process(items: list[str]) -> dict[str, int]:
    return {item: len(item) for item in items}

# ✅ More complex built-ins
def transform(data: dict[str, list[int]]) -> tuple[int, ...]:
    all_values = [v for values in data.values() for v in values]
    return tuple(all_values)
```

**Why this matters**: Python 3.9+ supports `list[T]` directly. Using `typing.List` is deprecated and adds unnecessary imports.

### Optional and None

```python
# ❌ WRONG: Using Optional without understanding
from typing import Optional

def get_user(id: int) -> Optional[dict]:
    # Returns dict or None, but which dict structure?
    ...

# ✅ CORRECT: Use | None (Python 3.10+) with specific types
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str
    email: str

def get_user(id: int) -> User | None:
    # Clear: Returns User or None
    if user_exists(id):  # user_exists: illustrative helper
        return User(id=id, name="...", email="...")
    return None

# Using the result
user = get_user(123)
if user is not None:  # Type checker knows user is User here
    print(user.name)
```

**Why this matters**: `Optional[X]` is just `X | None`. Python 3.10+ syntax is clearer. TypedDict or dataclass is better than a raw dict.

### Union Types

```python
# ❌ WRONG: Old-style Union (Python <3.10)
from typing import Union

def process(value: Union[str, int, float]) -> Union[str, bool]:
    ...

# ✅ CORRECT: Use | operator (Python 3.10+)
def process(value: str | int | float) -> str | bool:
    if isinstance(value, str):
        return value.upper()
    return value > 0

# ✅ Multiple returns with | None
import json

def parse_config(path: str) -> dict[str, str] | None:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

**Why this matters**: `|` is PEP 604, available in Python 3.10+. Cleaner, more readable, Pythonic. No imports needed.

### Type Aliases

```python
# ❌ WRONG: Repeating complex types
def process_users(users: list[dict[str, str | int]]) -> dict[str, list[dict[str, str | int]]]:
    ...

# ✅ CORRECT: Type alias for readability
UserDict = dict[str, str | int]
UserMap = dict[str, list[UserDict]]

def process_users(users: list[UserDict]) -> UserMap:
    return {"active": [u for u in users if u.get("active")]}

# ✅ BETTER: Use TypedDict for structure
from typing import TypedDict

class User(TypedDict):
    id: int
    name: str
    email: str
    active: bool

def process_users(users: list[User]) -> dict[str, list[User]]:
    return {"active": [u for u in users if u["active"]]}
```

**Why this matters**: Type aliases improve readability. TypedDict provides structure validation for dict types.
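
Python 3.12 additionally introduces the `type` statement (PEP 695) for declaring aliases explicitly; the alias is evaluated lazily and is unambiguous to type checkers. A minimal sketch (`group_users` is an illustrative name):

```python
# Python 3.12+ explicit type alias statement (PEP 695)
type UserDict = dict[str, str | int]
type UserMap = dict[str, list[UserDict]]

def group_users(users: list[UserDict]) -> UserMap:
    # Same behavior as the plain-assignment aliases above
    return {"active": [u for u in users if u.get("active")]}
```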

## Advanced Typing

### Generics with TypeVar

```python
from typing import TypeVar, Generic

T = TypeVar('T')

# ✅ Generic function
def first(items: list[T]) -> T | None:
    return items[0] if items else None

# Usage: the type checker knows the return type
names: list[str] = ["Alice", "Bob"]
first_name: str | None = first(names)  # Type checker infers str | None

numbers: list[int] = [1, 2, 3]
first_num: int | None = first(numbers)  # Type checker infers int | None

# ✅ Generic class (old style)
class Container(Generic[T]):
    def __init__(self, value: T) -> None:
        self.value = value

    def get(self) -> T:
        return self.value

# Usage
container: Container[int] = Container(42)
value: int = container.get()  # Type checker knows it's int
```

### Python 3.12+ Generics (PEP 695)

```python
# ❌ WRONG: Old-style generic syntax (still works but verbose)
from typing import TypeVar, Generic

T = TypeVar('T')

class Container(Generic[T]):
    def __init__(self, value: T) -> None:
        self.value = value

# ✅ CORRECT: Python 3.12+ PEP 695 syntax
class Container[T]:
    def __init__(self, value: T) -> None:
        self.value = value

    def get(self) -> T:
        return self.value

# ✅ Generic function with PEP 695
def first[T](items: list[T]) -> T | None:
    return items[0] if items else None

# ✅ Multiple type parameters
class Pair[T, U]:
    def __init__(self, first: T, second: U) -> None:
        self.first = first
        self.second = second

    def get_first(self) -> T:
        return self.first

    def get_second(self) -> U:
        return self.second

# Usage
pair: Pair[str, int] = Pair("answer", 42)
```

**Why this matters**: PEP 695 (Python 3.12+) simplifies generic syntax. No TypeVar needed. Cleaner, more readable.

### Bounded TypeVars

```python
from typing import TypeVar

# ✅ Old style: TypeVar bound to a type
T_Number = TypeVar('T_Number', bound=int | float)

# ✅ Python 3.12+ equivalent: inline bound
def add[T: int | float](a: T, b: T) -> T:
    return a + b  # Type checker knows a and b support +

# ✅ Constrained to specific types only
T_Scalar = TypeVar('T_Scalar', int, float, str)

def format_value(value: T_Scalar) -> str:
    return str(value)

# Usage
result: int = add(1, 2)  # OK
result2: float = add(1.5, 2.5)  # OK
# result3 = add("a", "b")  # mypy error: str not compatible with int | float
```

### Protocol (Structural Subtyping)

```python
from typing import Protocol

# ✅ Define a protocol for duck typing
class Drawable(Protocol):
    def draw(self) -> None: ...

class Circle:
    def draw(self) -> None:
        print("Drawing circle")

class Square:
    def draw(self) -> None:
        print("Drawing square")

# Works without inheritance - structural typing
def render(shape: Drawable) -> None:
    shape.draw()

# Usage - no need to inherit from Drawable
circle = Circle()
square = Square()
render(circle)  # OK
render(square)  # OK

# ❌ WRONG: Using ABC when Protocol is better
from abc import ABC, abstractmethod

class DrawableABC(ABC):
    @abstractmethod
    def draw(self) -> None: ...

# Now Circle must inherit from DrawableABC - too rigid!
```

**Why this matters**: Protocol enables structural typing (duck typing with type safety). No inheritance needed. More Pythonic than ABC for many cases.

### TypedDict

```python
from typing import TypedDict

# ✅ Define structured dict types
class UserDict(TypedDict):
    id: int
    name: str
    email: str
    active: bool

def create_user(data: UserDict) -> UserDict:
    # Type checker ensures all required keys are present
    return data

# Usage
user: UserDict = {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "active": True
}

# mypy error: Missing key "active"
# bad_user: UserDict = {"id": 1, "name": "Alice", "email": "alice@example.com"}

# ✅ Optional fields
class UserDictOptional(TypedDict, total=False):
    bio: str
    avatar_url: str

# ✅ Combining required and optional
class User(TypedDict):
    id: int
    name: str

class UserWithOptional(User, total=False):
    email: str
    bio: str
```

**Why this matters**: TypedDict provides structure for dict types. Better than `dict[str, Any]`. The type checker validates keys and value types.

## Python 3.10+ Features

### Match Statements (Structural Pattern Matching)

```python
from typing import Any
# ServerError/UnknownError and create_user/delete_user are illustrative helpers

# ❌ WRONG: Long if-elif chains
def handle_response(response):
    if response["status"] == 200:
        return response["data"]
    elif response["status"] == 404:
        return None
    elif response["status"] in [500, 502, 503]:
        raise ServerError()
    else:
        raise UnknownError()

# ✅ CORRECT: Match statement (Python 3.10+)
def handle_response(response: dict[str, Any]) -> Any:
    match response["status"]:
        case 200:
            return response["data"]
        case 404:
            return None
        case 500 | 502 | 503:
            raise ServerError()
        case _:
            raise UnknownError()

# ✅ Pattern matching with structure
def process_command(command: dict[str, Any]) -> str:
    match command:
        case {"action": "create", "type": "user", "data": data}:
            return create_user(data)
        case {"action": "delete", "type": "user", "id": user_id}:
            return delete_user(user_id)
        case {"action": action, "type": type_}:
            return f"Unknown action {action} for {type_}"
        case _:
            return "Invalid command"

# ✅ Matching class instances
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

def describe_point(point: Point) -> str:
    match point:
        case Point(x=0, y=0):
            return "Origin"
        case Point(x=0, y=y):
            return f"On Y-axis at {y}"
        case Point(x=x, y=0):
            return f"On X-axis at {x}"
        case Point(x=x, y=y) if x == y:
            return f"On diagonal at ({x}, {y})"
        case Point(x=x, y=y):
            return f"At ({x}, {y})"
```

**Why this matters**: Match statements are more readable than if-elif chains for complex conditionals. Pattern matching extracts values directly.

## Python 3.11 Features

### Exception Groups

```python
import asyncio
# fetch() and log are illustrative helpers

# ❌ WRONG: Sequential awaits; failures are logged one at a time, never aggregated
async def fetch_all(urls: list[str]) -> list[str]:
    results = []
    for url in urls:
        try:
            results.append(await fetch(url))
        except Exception as e:
            log.error(f"Failed to fetch {url}: {e}")
    return results

# ✅ CORRECT: Python 3.11 exception groups
async def fetch_all(urls: list[str]) -> list[str]:
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(fetch(url)) for url in urls]
    # If any task fails, TaskGroup raises ExceptionGroup on exit
    return [task.result() for task in tasks]

# Handling exception groups
try:
    results = await fetch_all(urls)
except* TimeoutError as e:
    # Handle all TimeoutErrors
    log.error(f"Timeouts: {e.exceptions}")
except* ConnectionError as e:
    # Handle all ConnectionErrors
    log.error(f"Connection errors: {e.exceptions}")

# ✅ Creating exception groups manually
errors = [ValueError("Invalid user 1"), ValueError("Invalid user 2")]
raise ExceptionGroup("Validation errors", errors)
```

**Why this matters**: Exception groups handle multiple exceptions from concurrent operations. Essential for structured concurrency (TaskGroup).

### Better Error Messages

Python 3.11 improved error messages significantly:

```python
# Python 3.10 error:
# TypeError: 'NoneType' object is not subscriptable

# Python 3.11 error with exact location:
# TypeError: 'NoneType' object is not subscriptable
#     user["name"]
#     ^^^^^^^^^^^^

# Helpful for nested expressions
result = data["users"][0]["profile"]["settings"]["theme"]
# Python 3.11 shows exactly which part is None
```

**Why this matters**: Better error messages speed up debugging. The exact location is highlighted.

## Python 3.12 Features

### PEP 695 Type Parameter Syntax

Already covered in the Generics section above. Key improvement: cleaner syntax for generic classes and functions.

```python
# Old style (still works)
from typing import TypeVar, Generic
T = TypeVar('T')
class Box(Generic[T]):
    ...

# Python 3.12+ style
class Box[T]:
    ...
```

### @override Decorator

```python
from typing import override

class Base:
    def process(self) -> None:
        print("Base process")

class Derived(Base):
    @override
    def process(self) -> None:  # OK - overriding Base.process
        print("Derived process")

    @override
    def compute(self) -> None:  # mypy error: Base has no method 'compute'
        print("New method")

# Why use @override:
# 1. Documents intent explicitly
# 2. Type checker catches typos (processs vs process)
# 3. Catches issues when the base class changes
```

**Why this matters**: @override makes intent explicit and catches errors when the base class changes or method names have typos.

### f-string Improvements

```python
# Python 3.12 allows more complex expressions in f-strings

# ✅ Reusing the same quote type inside f-strings (3.12+)
value = "test"
result = f"Value is {value.replace("t", "T")}"  # Nested double quotes work in 3.12

# ✅ Multi-line expressions inside f-strings (3.12+)
items = [1, 2, 3]
message = f"Processing {
    len(items)
} items"

# ✅ f-string debugging with = (since 3.8)
x = 42
print(f"{x=}")  # Output: x=42
print(f"{x * 2=}")  # Output: x * 2=84
```

## Static Analysis Setup

### mypy Configuration

**File:** `pyproject.toml`

```toml
[tool.mypy]
python_version = "3.12"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true
disallow_incomplete_defs = true
check_untyped_defs = true
disallow_untyped_decorators = true
warn_redundant_casts = true
warn_unused_ignores = true
warn_no_return = true
warn_unreachable = true
strict_equality = true
strict = true

# Per-module options
[[tool.mypy.overrides]]
module = "tests.*"
disallow_untyped_defs = false  # Tests can be less strict

[[tool.mypy.overrides]]
module = "third_party.*"
ignore_missing_imports = true
```

**Strict mode breakdown:**

- `strict = true`: Enables all strict checks (the individual flags above are spelled out for clarity; `strict` already implies most of them)
- `disallow_untyped_defs`: All functions must have type hints
- `warn_return_any`: Warn when returning Any type
- `warn_unused_ignores`: Warn on unnecessary `# type: ignore`

**When to use strict mode:**
- New projects: Start strict from day 1
- Existing projects: Enable incrementally per module

### pyright Configuration

**File:** `pyproject.toml`

```toml
[tool.pyright]
pythonVersion = "3.12"
typeCheckingMode = "strict"
reportMissingTypeStubs = false
reportUnknownMemberType = false

# Stricter checks
reportUnusedImport = true
reportUnusedVariable = true
reportDuplicateImport = true

# Exclude patterns
exclude = [
    "**/__pycache__",
    "**/node_modules",
    ".venv",
]
```

**pyright vs mypy:**
- pyright: Faster, better IDE integration, stricter by default
- mypy: More configurable, wider adoption, plugin ecosystem

**Recommendation**: Use both if possible: pyright in the IDE, mypy in CI.

### Dealing with Untyped Libraries

```python
# ❌ WRONG: Silencing all errors
import untyped_lib  # type: ignore

# ✅ CORRECT: Create a stub file
# File: stubs/untyped_lib.pyi
def important_function(x: int, y: str) -> bool: ...
class ImportantClass:
    def method(self, value: int) -> None: ...

# Configure mypy to find stubs
# pyproject.toml:
# mypy_path = "stubs"

# ✅ Use # type: ignore with a specific code and explanation
from untyped_lib import obscure_function  # type: ignore[import]  # TODO: Add stub

# ✅ Use cast when a library returns Any
from typing import cast
result = cast(list[int], untyped_lib.get_items())
```

**Why this matters**: Stubs preserve type safety even with untyped libraries. `# type: ignore` should be specific and documented.

## Common Type Errors and Fixes

### Incompatible Types

```python
# mypy error: Incompatible types in assignment
# (expression has type "str | None", variable has type "str")

# ❌ WRONG: Ignoring the error
name: str = get_name()  # type: ignore

# ✅ CORRECT: Handle the None case
name: str | None = get_name()
if name is not None:
    process_name(name)

# ✅ CORRECT: Provide a default
name: str = get_name() or "default"

# ✅ CORRECT: Assert if you're certain
name = get_name()
assert name is not None
process_name(name)
```

### List/Dict Invariance

```python
# mypy error: Argument has incompatible type "list[int]"; expected "list[float]"

def process_numbers(numbers: list[float]) -> None:
    ...

int_list: list[int] = [1, 2, 3]
# process_numbers(int_list)  # mypy error!

# Why: Lists are mutable. If process_numbers did numbers.append(3.14),
# it would break int_list's type safety

# ✅ CORRECT: Use Sequence for read-only access
from collections.abc import Sequence

def process_numbers(numbers: Sequence[float]) -> None:
    # Can't modify, so it's safe to accept list[int]
    ...

process_numbers(int_list)  # OK now
```

### Missing Return Type

```python
# mypy error: Function is missing a return type annotation

# ❌ WRONG: No return type
def calculate(x, y):
    return x + y

# ✅ CORRECT: Add the return type
def calculate(x: int, y: int) -> int:
    return x + y

# ✅ Functions that don't return a value
def log_message(message: str) -> None:
    print(message)
```

### Generic Type Issues

```python
# mypy error: Need type annotation for 'items'

# ❌ WRONG: No type for an empty container
items = []
items.append(1)  # mypy can't infer the type

# ✅ CORRECT: Explicit type annotation
items: list[int] = []
items.append(1)

# ✅ CORRECT: Initialize with values
items = [1, 2, 3]  # mypy infers list[int]
```

## Anti-Patterns

### Over-Typing

```python
# ❌ WRONG: Too specific, breaks flexibility
def process_items(items: list[str]) -> list[str]:
    return [item.upper() for item in items]

# Can't pass a tuple, generator, or other iterable

# ✅ CORRECT: Use abstract types
from collections.abc import Sequence

def process_items(items: Sequence[str]) -> list[str]:
    return [item.upper() for item in items]

# Now works with list, tuple, etc.
```

### Type: Ignore Abuse

```python
# ❌ WRONG: Blanket ignore
def sketchy_function(data):  # type: ignore
    return data["key"]

# ✅ CORRECT: Specific error code on the offending line, with a tracking comment
def legacy_integration(data):  # type: ignore[no-untyped-def]  # TODO(#123): Add proper types
    return data["key"]

# ✅ BETTER: Fix the issue
def fixed_integration(data: dict[str, str]) -> str:
    return data["key"]
```

### Using Any Everywhere

```python
# ❌ WRONG: Any defeats the purpose of types
from typing import Any

def process(data: Any) -> Any:
    return data.transform()

# ✅ CORRECT: Use specific types
from typing import Protocol

class Transformable(Protocol):
    def transform(self) -> str: ...

def process(data: Transformable) -> str:
    return data.transform()
```

### Incompatible Generics

```python
# ❌ WRONG: Generic type mismatch
from typing import TypeVar

T = TypeVar('T')

def combine(a: list[T], b: list[T]) -> list[T]:
    return a + b

ints: list[int] = [1, 2]
strs: list[str] = ["a", "b"]
# result = combine(ints, strs)  # mypy error: incompatible types

# ✅ CORRECT: Different type parameters (PEP 695 syntax)
def combine_any[T1, T2](a: list[T1], b: list[T2]) -> list[T1 | T2]:
    return a + b  # type: ignore[return-value]  # Runtime works; the typing is complex

# ✅ BETTER: Keep types consistent
result_ints = combine(ints, [3, 4])  # OK: both list[int]
```

## Decision Trees

### When to Use Which Type?

**For functions accepting sequences:**
```
Read-only? → Sequence[T]
Need indexing? → Sequence[T]
Need mutation? → list[T]
Large data? → Iterator[T] or Generator[T]
```

**For dictionary-like types:**
```
Known structure? → TypedDict
Dynamic keys? → dict[K, V]
Read-only access? → Mapping[K, V]
Need mutation? → MutableMapping[K, V]
```

**For optional values:**
```
Can be None? → T | None
Has default? → T with a default parameter
Key may be absent? → TypedDict(total=False)
```
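
A short sketch applying these decision rules in one place (the `Settings` structure and function names are illustrative):

```python
from collections.abc import Iterator, Mapping, Sequence
from typing import TypedDict

class Settings(TypedDict):  # Known structure → TypedDict
    theme: str
    retries: int

def average(values: Sequence[float]) -> float:  # Read-only input → Sequence
    return sum(values) / len(values)

def lookup(config: Mapping[str, str], key: str) -> str | None:  # Read-only dict → Mapping; may be absent → | None
    return config.get(key)

def countdown(n: int) -> Iterator[int]:  # Large/streamed data → Iterator (generator)
    yield from range(n, 0, -1)
```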

## Integration with Other Skills

**After using this skill:**
- If setting up a project → See @project-structure-and-tooling for mypy in pyproject.toml
- If fixing lint → See @systematic-delinting for type-related lint rules
- If testing typed code → See @testing-and-quality for pytest type checking

**Before using this skill:**
- Setup mypy → Use @project-structure-and-tooling first

## Quick Reference

| Python Version | Key Type Features |
|----------------|-------------------|
| 3.9 | Built-in generics (`list[T]` instead of `List[T]`) |
| 3.10 | PEP 604 unions, match statements, ParamSpec |
| 3.11 | Exception groups, Self type, better errors |
| 3.12 | PEP 695 generics, @override decorator |

**Most impactful features:**
1. `| None` instead of `Optional` (3.10+)
2. Built-in generics: `list[T]` not `List[T]` (3.9+)
3. PEP 695: `class Box[T]` not `class Box(Generic[T])` (3.12+)
4. Match statements for complex conditionals (3.10+)
5. @override for explicit method overriding (3.12+)
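
A compact sketch combining several of these features (illustrative only; requires Python 3.12):

```python
# PEP 695 generic class (3.12), | unions (3.10), match statement (3.10)
class Stack[T]:
    def __init__(self) -> None:
        self._items: list[T] = []

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> T | None:
        return self._items.pop() if self._items else None

def describe(value: int | str | None) -> str:
    match value:
        case None:
            return "nothing"
        case int():
            return f"number {value}"
        case str():
            return f"text {value!r}"
```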

1593 skills/using-python-engineering/project-structure-and-tooling.md Normal file

File diff suppressed because it is too large

1120 skills/using-python-engineering/resolving-mypy-errors.md Normal file

File diff suppressed because it is too large

981 skills/using-python-engineering/scientific-computing-foundations.md Normal file

@@ -0,0 +1,981 @@

# Scientific Computing Foundations

## Overview

**Core Principle:** Vectorize operations, avoid loops. NumPy and pandas are built on C/Fortran code that's orders of magnitude faster than Python loops. The biggest performance gains come from eliminating iteration over rows/elements.

Scientific computing in Python centers on NumPy (arrays) and pandas (dataframes). These libraries enable fast numerical computation on large datasets through vectorized operations and efficient memory layouts. The most common mistake: using Python loops when vectorized operations exist.
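
The claim is easy to check on your own machine; a minimal timing sketch (exact numbers vary by hardware and array size):

```python
import time

import numpy as np

arr = np.arange(1_000_000)

start = time.perf_counter()
loop_result = [x * 2 + 1 for x in arr]  # Python-level loop over elements
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = arr * 2 + 1  # Vectorized: the loop runs in C
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s, "
      f"speedup: {loop_time / vec_time:.0f}x")
```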

## When to Use

**Use this skill when:**
- "NumPy operations"
- "Pandas DataFrame slow"
- "Vectorization"
- "How to avoid loops?"
- "DataFrame iteration"
- "Array performance"
- "Memory usage too high"
- "Large dataset processing"

**Don't use when:**
- Setting up a project (use project-structure-and-tooling)
- Profiling needed first (use debugging-and-profiling)
- ML pipeline orchestration (use ml-engineering-workflows)

**Symptoms triggering this skill:**
- Slow DataFrame operations
- High memory usage with arrays
- Using loops over DataFrame rows
- Need to process large datasets efficiently

## NumPy Fundamentals

### Array Creation and Types

```python
import numpy as np

# ❌ WRONG: Creating arrays from Python lists in a loop
data = []
for i in range(1000000):
    data.append(i * 2)
arr = np.array(data)

# ✅ CORRECT: Use NumPy functions
arr = np.arange(1000000) * 2

# ✅ CORRECT: Pre-allocate for a known size
arr = np.empty(1000000, dtype=np.int64)
for i in range(1000000):
    arr[i] = i * 2  # Still slow (Python loop), but better than growing a list

# ✅ BETTER: Fully vectorized
arr = np.arange(1000000, dtype=np.int64) * 2

# ✅ CORRECT: Specify dtype for memory efficiency
# float64 (default): 8 bytes per element
# float32: 4 bytes per element
large_arr = np.zeros(1000000, dtype=np.float32)  # Half the memory

# Why this matters: dtype affects both memory usage and performance
# Use the smallest dtype that fits your data
```

### Vectorized Operations

```python
# ❌ WRONG: Loop over array elements
arr = np.arange(1000000)
result = np.empty(1000000)
for i in range(len(arr)):
    result[i] = arr[i] ** 2 + 2 * arr[i] + 1

# ✅ CORRECT: Vectorized operations
arr = np.arange(1000000)
result = arr ** 2 + 2 * arr + 1

# Speed difference: ~100x faster with vectorization

# ❌ WRONG: Element-wise comparison in a loop
matches = []
for val in arr:
    if val > 100:
        matches.append(val)
result = np.array(matches)

# ✅ CORRECT: Boolean indexing
result = arr[arr > 100]

# ✅ CORRECT: Complex conditions
result = arr[(arr > 100) & (arr < 200)]  # Note: & not 'and'
result = arr[(arr < 50) | (arr > 150)]   # Note: | not 'or'
```

**Why this matters**: Vectorized operations run in C, avoiding Python interpreter overhead. A 10-100x speedup is typical.

### Broadcasting

```python
# Broadcasting: Operating on arrays of different shapes

# ✅ CORRECT: Scalar broadcasting
arr = np.array([1, 2, 3, 4])
result = arr + 10  # [11, 12, 13, 14]

# ✅ CORRECT: 1D array broadcast to 2D
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

row_vector = np.array([10, 20, 30])
result = matrix + row_vector
# [[11, 22, 33],
#  [14, 25, 36],
#  [17, 28, 39]]

# ✅ CORRECT: Column vector broadcasting
col_vector = np.array([[10],
                       [20],
                       [30]])
result = matrix + col_vector
# [[11, 12, 13],
#  [24, 25, 26],
#  [37, 38, 39]]

# ✅ CORRECT: Add an axis for broadcasting
row = np.array([1, 2, 3])
col = row[:, np.newaxis]  # Convert to a column vector
# col shape: (3, 1)

# Outer product via broadcasting
outer = row[np.newaxis, :] * col
# [[1, 2, 3],
#  [2, 4, 6],
#  [3, 6, 9]]

# ❌ WRONG: Manual broadcasting with loops
result = np.empty_like(matrix)
for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        result[i, j] = matrix[i, j] + row_vector[j]

# Why this matters: Broadcasting eliminates loops and is much faster
```

### Memory-Efficient Operations

```python
# ❌ WRONG: Creating unnecessary copies
large_arr = np.random.rand(10000, 10000)  # ~800MB
result1 = large_arr + 1  # Creates a new 800MB array
result2 = result1 * 2    # Creates another 800MB array
# Total: 2.4GB memory usage

# ✅ CORRECT: In-place operations
large_arr = np.random.rand(10000, 10000)
large_arr += 1  # Modifies in-place, no copy
large_arr *= 2  # Modifies in-place, no copy
# Total: 800MB memory usage

# ✅ CORRECT: Use the 'out' parameter
result = np.empty_like(large_arr)
np.add(large_arr, 1, out=result)
np.multiply(result, 2, out=result)

# ✅ Explicit copy when you must not affect the original
arr = np.arange(1000000)
subset = arr[::2].copy()
subset[0] = 999  # Doesn't affect arr

# ✅ CORRECT: Views avoid copies (when possible)
arr = np.arange(1000000)
view = arr[::2]  # View, not copy (shares memory)
view[0] = 999    # Modifies arr too!

# Check whether an array is a view or owns its memory
print(view.base is arr)   # True: view shares arr's buffer
print(arr.base is None)   # True: arr owns its memory
```

**Why this matters**: Large arrays consume lots of memory. In-place operations and views avoid copies, reducing memory usage significantly.
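
NumPy also provides `np.shares_memory` to check overlap directly, which is more explicit than inspecting `.base`; a small sketch:

```python
import numpy as np

arr = np.arange(10)
view = arr[::2]           # Basic slicing returns a view
copied = arr[::2].copy()  # An explicit copy owns its own buffer

print(np.shares_memory(arr, view))    # True: the view shares arr's buffer
print(np.shares_memory(arr, copied))  # False: the copy is independent
```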

### Aggregations and Reductions

```python
# ✅ CORRECT: Axis-aware aggregations
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Sum all elements
total = matrix.sum()  # 45

# Sum along axis 0 (down the columns)
col_sums = matrix.sum(axis=0)  # [12, 15, 18]

# Sum along axis 1 (across the rows)
row_sums = matrix.sum(axis=1)  # [6, 15, 24]

# ❌ WRONG: Manual aggregation
total = 0
for row in matrix:
    for val in row:
        total += val

# ✅ CORRECT: Multiple aggregations
matrix.mean()
matrix.std()
matrix.min()
matrix.max()
matrix.argmin()  # Index of the minimum
matrix.argmax()  # Index of the maximum

# ✅ CORRECT: Conditional aggregations
# Sum only positive values
positive_sum = matrix[matrix > 0].sum()

# Count elements > 5
count = (matrix > 5).sum()

# Percentage > 5
percentage = (matrix > 5).mean() * 100
```

## pandas Fundamentals

### DataFrame Creation

```python
import pandas as pd

# ❌ WRONG: Building a DataFrame row by row
df = pd.DataFrame()
for i in range(10000):
    df = pd.concat([df, pd.DataFrame({'a': [i], 'b': [i * 2]})], ignore_index=True)
# Extremely slow: O(n²) complexity

# ✅ CORRECT: Create from a dict of lists
data = {
    'a': list(range(10000)),
    'b': [i * 2 for i in range(10000)]
}
df = pd.DataFrame(data)

# ✅ BETTER: Use NumPy arrays
df = pd.DataFrame({
    'a': np.arange(10000),
    'b': np.arange(10000) * 2
})

# ✅ CORRECT: From records
records = [{'a': i, 'b': i * 2} for i in range(10000)]
df = pd.DataFrame.from_records(records)
```

### The Iteration Anti-Pattern

```python
# ❌ WRONG: iterrows() - THE MOST COMMON MISTAKE
df = pd.DataFrame({
    'value': np.random.rand(100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000)
})

result = []
for idx, row in df.iterrows():  # VERY SLOW
    if row['value'] > 0.5:
        result.append(row['value'] * 2)

# ✅ CORRECT: Vectorized operations
mask = df['value'] > 0.5
result = df.loc[mask, 'value'] * 2

# Speed difference: ~100x faster

# ❌ WRONG: apply() on axis=1 (still row-by-row)
df['result'] = df.apply(
    lambda row: row['value'] * 2 if row['value'] > 0.5 else 0,
    axis=1
)
# Still slow: applies a Python function to each row

# ✅ CORRECT: Vectorized with np.where
df['result'] = np.where(df['value'] > 0.5, df['value'] * 2, 0)

# ✅ CORRECT: Boolean indexing + assignment
df['result'] = 0
df.loc[df['value'] > 0.5, 'result'] = df['value'] * 2
```

**Why this matters**: `iterrows()` is the single biggest DataFrame performance killer. ALWAYS look for vectorized alternatives.

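To see the gap on your own machine, here is a rough timing sketch; the exact ratio varies with hardware and pandas version, but roughly two orders of magnitude is typical:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.rand(10_000)})

def with_iterrows():
    out = []
    for _, row in df.iterrows():  # one Python-level call per row
        if row['value'] > 0.5:
            out.append(row['value'] * 2)
    return out

def vectorized():
    return df.loc[df['value'] > 0.5, 'value'] * 2

print(timeit.timeit(with_iterrows, number=10))  # seconds
print(timeit.timeit(vectorized, number=10))     # far smaller
```
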
### Efficient Filtering and Selection

```python
df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.rand(100000),
    'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})

# ❌ WRONG: Chaining filters inefficiently
df_filtered = df[df['A'] > 0.5]
df_filtered = df_filtered[df_filtered['B'] < 0.3]
df_filtered = df_filtered[df_filtered['C'] == 'X']

# ✅ CORRECT: Single boolean mask
mask = (df['A'] > 0.5) & (df['B'] < 0.3) & (df['C'] == 'X')
df_filtered = df[mask]

# ✅ CORRECT: query() for complex filters (cleaner syntax)
df_filtered = df.query('A > 0.5 and B < 0.3 and C == "X"')

# ✅ CORRECT: isin() for multiple values
df_filtered = df[df['C'].isin(['X', 'Y'])]

# ❌ WRONG: String matching in loop
matches = []
for val in df['C']:
    if 'X' in val:
        matches.append(True)
    else:
        matches.append(False)
df_filtered = df[matches]

# ✅ CORRECT: Vectorized string operations
df_filtered = df[df['C'].str.contains('X')]
```
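
One more `query()` feature worth knowing: local variables can be referenced with `@`, which keeps thresholds out of the query string. A small self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000)})

# @ references local Python variables inside the query string
threshold_a, threshold_b = 0.5, 0.3
df_filtered = df.query('A > @threshold_a and B < @threshold_b')
```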

### GroupBy Operations

```python
df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], 100000),
    'value': np.random.rand(100000),
    'count': np.random.randint(1, 100, 100000)
})

# ❌ WRONG: Manual grouping
groups = {}
for idx, row in df.iterrows():
    cat = row['category']
    if cat not in groups:
        groups[cat] = []
    groups[cat].append(row['value'])

results = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}

# ✅ CORRECT: GroupBy
results = df.groupby('category')['value'].mean()

# ✅ CORRECT: Multiple aggregations
results = df.groupby('category').agg({
    'value': ['mean', 'std', 'min', 'max'],
    'count': 'sum'
})

# ✅ CORRECT: Named aggregations (pandas 0.25+)
results = df.groupby('category').agg(
    mean_value=('value', 'mean'),
    std_value=('value', 'std'),
    total_count=('count', 'sum')
)

# ✅ CORRECT: Custom aggregation function
def range_func(x):
    return x.max() - x.min()

results = df.groupby('category')['value'].agg(range_func)

# ✅ CORRECT: Transform (keeps original shape)
df['value_centered'] = df.groupby('category')['value'].transform(
    lambda x: x - x.mean()
)
```

**Why this matters**: GroupBy runs in optimized native code and is far faster than manual grouping. Use built-in aggregations when possible.

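The practical difference between `agg` and `transform` is the shape of the result: `agg` collapses to one row per group, while `transform` returns a result aligned with the original index, which is what makes the centering example above work. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [1.0, 3.0, 2.0, 6.0],
})

# agg collapses to one row per group
means = df.groupby('category')['value'].agg('mean')
# A -> 2.0, B -> 4.0  (length 2)

# transform broadcasts the group result back to the original rows
df['group_mean'] = df.groupby('category')['value'].transform('mean')
# [2.0, 2.0, 4.0, 4.0]  (length 4, aligned with df)
```
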
## Performance Anti-Patterns

### Anti-Pattern 1: DataFrame Iteration

```python
# ❌ WRONG: Iterating over rows
for idx, row in df.iterrows():
    df.at[idx, 'new_col'] = row['a'] + row['b']

# ✅ CORRECT: Vectorized column operation
df['new_col'] = df['a'] + df['b']

# ❌ WRONG: itertuples() (better than iterrows, but still slow)
for row in df.itertuples():
    ...  # process row: still one Python-level iteration per row

# ✅ CORRECT: Use vectorized operations or apply to columns
```

### Anti-Pattern 2: Repeated Concatenation

```python
# ❌ WRONG: Growing DataFrame in loop
df = pd.DataFrame()
for i in range(10000):
    df = pd.concat([df, pd.DataFrame({'a': [i], 'b': [i*2]})], ignore_index=True)
# O(n²) complexity, extremely slow

# ✅ CORRECT: Collect data, then create DataFrame
data = []
for i in range(10000):
    data.append({'a': i, 'b': i*2})
df = pd.DataFrame(data)

# ✅ CORRECT: Pre-allocate NumPy array
arr = np.empty((10000, 2))
for i in range(10000):
    arr[i] = [i, i*2]
df = pd.DataFrame(arr, columns=['a', 'b'])
```

### Anti-Pattern 3: Using apply When Vectorized Exists

```python
# ❌ WRONG: apply() for simple operations
df['result'] = df['value'].apply(lambda x: x * 2)

# ✅ CORRECT: Direct vectorized operation
df['result'] = df['value'] * 2

# ❌ WRONG: apply() for conditions
df['category'] = df['value'].apply(lambda x: 'high' if x > 0.5 else 'low')

# ✅ CORRECT: np.where or pd.cut
df['category'] = np.where(df['value'] > 0.5, 'high', 'low')

# ✅ CORRECT: pd.cut for binning
df['category'] = pd.cut(df['value'], bins=[0, 0.5, 1.0], labels=['low', 'high'])

# When apply IS appropriate:
# - Complex logic not vectorizable
# - Need to call external function per row
# But verify vectorization truly impossible first
```

### Anti-Pattern 4: Not Using Categorical Data

```python
# ❌ WRONG: String columns for repeated values
df = pd.DataFrame({
    'category': ['A'] * 10000 + ['B'] * 10000 + ['C'] * 10000
})
# Memory: ~240KB of pointers alone (each row references a Python string)

# ✅ CORRECT: Categorical type
df['category'] = pd.Categorical(df['category'])
# Memory: ~30KB (small integer codes + a 3-entry string table)

# ✅ CORRECT: Define categories at creation
df = pd.DataFrame({
    'category': pd.Categorical(
        ['A'] * 10000 + ['B'] * 10000,
        categories=['A', 'B', 'C']
    )
})

# When to use categorical:
# - Limited number of unique values (< 50% of rows)
# - Repeated string/object values
# - Memory constraints
# - Faster groupby operations
```
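
Rather than trusting the ballpark figures above, measure on real data with `memory_usage(deep=True)`. A minimal sketch:

```python
import pandas as pd

s_object = pd.Series(['A'] * 10_000 + ['B'] * 10_000 + ['C'] * 10_000)
s_categorical = s_object.astype('category')

print(s_object.memory_usage(deep=True))       # object dtype: large
print(s_categorical.memory_usage(deep=True))  # categorical: much smaller
```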

## Memory Optimization

### Choosing Appropriate dtypes

```python
# ❌ WRONG: Default dtypes waste memory
df = pd.DataFrame({
    'int_col': [1, 2, 3, 4, 5],  # int64 by default
    'float_col': [1.0, 2.0, 3.0, 4.0, 5.0],  # float64 by default
    'str_col': ['a', 'b', 'c', 'd', 'e']  # object dtype
})

print(df.memory_usage(deep=True))

# ✅ CORRECT: Optimize dtypes
df = pd.DataFrame({
    'int_col': pd.array([1, 2, 3, 4, 5], dtype='int8'),  # -128 to 127
    'float_col': pd.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype='float32'),
    'str_col': pd.Categorical(['a', 'b', 'c', 'd', 'e'])
})

# ✅ CORRECT: Downcast after loading
df = pd.read_csv('data.csv')
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')

# Integer dtype ranges:
# int8:  -128 to 127
# int16: -32,768 to 32,767
# int32: -2.1B to 2.1B
# int64: -9.2E18 to 9.2E18

# Float dtype precision:
# float16: ~3 decimal digits (rarely used)
# float32: ~7 decimal digits
# float64: ~15 decimal digits
```

### Chunked Processing for Large Files

```python
# ❌ WRONG: Loading entire file into memory
df = pd.read_csv('huge_file.csv')  # 10GB file, OOM!
df_processed = process_dataframe(df)

# ✅ CORRECT: Process in chunks
chunk_size = 100000
results = []

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    processed = process_dataframe(chunk)
    results.append(processed)

df_final = pd.concat(results, ignore_index=True)

# ✅ CORRECT: Streaming aggregation
totals = {'A': 0, 'B': 0, 'C': 0}

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    for col in totals:
        totals[col] += chunk[col].sum()

# ✅ CORRECT: Only load needed columns
df = pd.read_csv('huge_file.csv', usecols=['col1', 'col2', 'col3'])
```
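
The same chunked pattern extends to any statistic that decomposes across chunks. A sketch of a streaming global mean, keeping only a running sum and count so memory stays constant regardless of file size (`huge_file.csv` and the `value` column are placeholders, as above):

```python
import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv('huge_file.csv', chunksize=100_000):
    total += chunk['value'].sum()
    count += len(chunk)

mean = total / count  # global mean without ever holding the full file
```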

### Using Sparse Data Structures

```python
# ❌ WRONG: Dense array for sparse data
# Data with 99% zeros
dense = np.zeros(1000000)
dense[::100] = 1  # Only 1% non-zero
# Memory: 8MB (float64 * 1M)

# ✅ CORRECT: Sparse array
from scipy.sparse import csr_matrix
sparse = csr_matrix(dense)
# Memory: ~120KB with int32 indices (only non-zero values + indices are stored)

# ✅ CORRECT: Sparse DataFrame
df = pd.DataFrame({
    'A': pd.arrays.SparseArray([0] * 100 + [1] + [0] * 100),
    'B': pd.arrays.SparseArray([0] * 50 + [2] + [0] * 150)
})
```

## Data Pipeline Patterns

### Method Chaining

```python
# ❌ WRONG: Many intermediate variables
df = pd.read_csv('data.csv')
df = df[df['value'] > 0]
df = df.groupby('category')['value'].mean()
df = df.reset_index()
df = df.rename(columns={'value': 'mean_value'})

# ✅ CORRECT: Method chaining
df = (
    pd.read_csv('data.csv')
    .query('value > 0')
    .groupby('category')['value']
    .mean()
    .reset_index()
    .rename(columns={'value': 'mean_value'})
)

# ✅ CORRECT: Pipe for custom functions
def remove_outliers(df, column, n_std=3):
    mean = df[column].mean()
    std = df[column].std()
    return df[
        (df[column] > mean - n_std * std) &
        (df[column] < mean + n_std * std)
    ]

df = (
    pd.read_csv('data.csv')
    .pipe(remove_outliers, 'value', n_std=2)
    .groupby('category')['value']
    .mean()
)
```

### Efficient Merges and Joins

```python
# ❌ WRONG: Multiple small merges
for small_df in list_of_dfs:
    main_df = main_df.merge(small_df, on='key')
# Inefficient: creates many intermediate copies

# ✅ CORRECT: When stacking rows, concatenate once
df_stacked = pd.concat(list_of_dfs, ignore_index=True)

# ✅ CORRECT: When joining on a key, index once and join in a single call
# (assumes the frames have non-overlapping column names)
df_merged = main_df.set_index('key').join(
    [d.set_index('key') for d in list_of_dfs]
)

# ✅ CORRECT: Optimize merge with sorted/indexed data
df1 = df1.set_index('key').sort_index()
df2 = df2.set_index('key').sort_index()
result = df1.merge(df2, left_index=True, right_index=True)

# ✅ CORRECT: Use indicator to track merge sources
result = df1.merge(df2, on='key', how='outer', indicator=True)
print(result['_merge'].value_counts())
# Shows: left_only, right_only, both

# ❌ WRONG: Cartesian product by accident
# df1: 1000 rows, df2: 1000 rows
result = df1.merge(df2, on='wrong_key')
# result: 1,000,000 rows! (if all keys match)

# ✅ CORRECT: Validate merge
result = df1.merge(df2, on='key', validate='1:1')
# Raises MergeError if the relationship is not one-to-one
# (other options: '1:m', 'm:1', 'm:m')
```

### Handling Missing Data

```python
# ❌ WRONG: Dropping all rows with any NaN
df_clean = df.dropna()  # Might lose most of the data

# ✅ CORRECT: Drop rows with NaN in specific columns
df_clean = df.dropna(subset=['important_col1', 'important_col2'])

# ✅ CORRECT: Fill NaN with appropriate values
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())
df['category_col'] = df['category_col'].fillna('Unknown')

# ✅ CORRECT: Forward/backward fill for time series
df['value'] = df['value'].ffill()  # fillna(method='ffill') is deprecated

# ✅ CORRECT: Interpolation
df['value'] = df['value'].interpolate(method='linear')

# ❌ WRONG: Assuming NaN behavior without checking
result = np.mean(df['value'].to_numpy())  # NumPy propagates NaN: returns nan

# ✅ CORRECT: Explicit NaN handling
result = df['value'].mean(skipna=True)  # skipna=True is the default; explicit is clearer
```

## Advanced NumPy Techniques

### Universal Functions (ufuncs)

```python
# ✅ CORRECT: Using built-in ufuncs
arr = np.random.rand(1000000)

# Trigonometric
result = np.sin(arr)
result = np.cos(arr)

# Exponential
result = np.exp(arr)
result = np.log(arr)

# Comparison
result = np.maximum(arr, 0.5)  # Element-wise max with scalar
result = np.minimum(arr, 0.5)

# ✅ CORRECT: Custom ufunc with @vectorize
from numba import vectorize

@vectorize
def custom_func(x):
    if x > 0.5:
        return x ** 2
    else:
        return x ** 3

result = custom_func(arr)  # Compiled to machine code, near-C speed
```
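
If numba is not available, a simple branch like `custom_func` can stay in pure NumPy. A sketch of two equivalent options; note that `np.where` evaluates both branches for every element, which is fine when they are cheap:

```python
import numpy as np

arr = np.random.rand(1_000_000)

# Both branches computed, then np.where selects per element
result = np.where(arr > 0.5, arr ** 2, arr ** 3)

# np.piecewise evaluates each branch only on its own subset
result = np.piecewise(arr, [arr > 0.5, arr <= 0.5],
                      [lambda x: x ** 2, lambda x: x ** 3])
```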

### Advanced Indexing

```python
# ✅ CORRECT: Fancy indexing
arr = np.arange(100)
indices = [0, 5, 10, 15, 20]
result = arr[indices]  # Select specific indices

# ✅ CORRECT: Boolean indexing with multiple conditions
arr = np.random.rand(1000000)
mask = (arr > 0.3) & (arr < 0.7)
result = arr[mask]

# ✅ CORRECT: np.where for conditional replacement
arr = np.random.rand(1000)
result = np.where(arr > 0.5, arr, 0)  # Replace values <= 0.5 with 0

# ✅ CORRECT: Multi-dimensional indexing
matrix = np.random.rand(100, 100)
rows = [0, 10, 20]
cols = [5, 15, 25]
result = matrix[rows, cols]  # Selects the pairs (0, 5), (10, 15), (20, 25)

# Get diagonal
diagonal = matrix[np.arange(100), np.arange(100)]
# Or use np.diag
diagonal = np.diag(matrix)
```

### Linear Algebra Operations

```python
# ✅ CORRECT: Matrix multiplication
A = np.random.rand(1000, 500)
B = np.random.rand(500, 200)
C = A @ B  # Python 3.5+ matrix multiply operator

# Or
C = np.dot(A, B)
C = np.matmul(A, B)

# ✅ CORRECT: Solve linear system Ax = b
A = np.random.rand(100, 100)
b = np.random.rand(100)
x = np.linalg.solve(A, b)

# ✅ CORRECT: Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

# ✅ CORRECT: SVD (Singular Value Decomposition)
U, s, Vt = np.linalg.svd(A)

# ✅ CORRECT: Inverse
A_inv = np.linalg.inv(A)

# ❌ WRONG: Using inverse for solving Ax = b
x = np.linalg.inv(A) @ b  # Slower and less numerically stable

# ✅ CORRECT: Use solve directly
x = np.linalg.solve(A, b)
```
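
Two hedged additions to the toolbox above: `np.linalg.cond` flags ill-conditioned systems where even `solve` deserves skepticism, and `np.linalg.lstsq` covers rectangular or near-singular cases. A sketch (the 1e12 cutoff is an arbitrary rule of thumb, not a NumPy constant):

```python
import numpy as np

A = np.random.rand(100, 100)
b = np.random.rand(100)

# A large condition number means small input errors are amplified
cond = np.linalg.cond(A)
if cond > 1e12:
    print(f"Warning: ill-conditioned system (cond={cond:.2e})")

x = np.linalg.solve(A, b)

# For rectangular or near-singular systems, least squares is the safer tool
x_ls, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
```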

## Type Hints for NumPy and pandas

### NumPy Type Hints

```python
import numpy as np
from numpy.typing import NDArray

# ✅ CORRECT: Type hint for NumPy arrays
def process_array(arr: NDArray[np.float64]) -> NDArray[np.float64]:
    return arr * 2

# ✅ CORRECT: Generic array type
def normalize(arr: NDArray) -> NDArray:
    return (arr - arr.mean()) / arr.std()

# ✅ CORRECT: Semantic aliases (dimensionality is documentation only;
# NDArray does not encode shape)
from typing import TypeAlias

Vector: TypeAlias = NDArray[np.float64]  # 1D array by convention
Matrix: TypeAlias = NDArray[np.float64]  # 2D array by convention

def matrix_multiply(A: Matrix, B: Matrix) -> Matrix:
    return A @ B
```

### pandas Type Hints

```python
import pandas as pd

# ✅ CORRECT: Type hints for Series and DataFrame
def process_series(s: pd.Series) -> pd.Series:
    return s * 2

def process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['value'] > 0]

# ✅ CORRECT: More specific DataFrame types (using Protocols)
from typing import Protocol

class DataFrameWithColumns(Protocol):
    """DataFrame with specific columns."""
    def __getitem__(self, key: str) -> pd.Series: ...

def analyze_data(df: DataFrameWithColumns) -> float:
    return df['value'].mean()
```

## Real-World Patterns

### Time Series Operations

```python
# ✅ CORRECT: Efficient time series resampling
df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='1s'),
    'value': np.random.rand(1000000)
})

df = df.set_index('timestamp')

# Resample to 1-minute intervals
df_resampled = df.resample('1min').agg({
    'value': ['mean', 'std', 'min', 'max']
})

# ✅ CORRECT: Rolling window operations
df['rolling_mean'] = df['value'].rolling(window=60).mean()  # First 59 rows are NaN unless min_periods is set
df['rolling_std'] = df['value'].rolling(window=60).std()

# ✅ CORRECT: Lag features
df['value_lag1'] = df['value'].shift(1)
df['value_lag60'] = df['value'].shift(60)

# ✅ CORRECT: Difference for stationarity
df['value_diff'] = df['value'].diff()
```

### Multi-Index Operations

```python
# ✅ CORRECT: Creating multi-index DataFrame
df = pd.DataFrame({
    'country': ['USA', 'USA', 'UK', 'UK'],
    'city': ['NYC', 'LA', 'London', 'Manchester'],
    'value': [100, 200, 150, 175]
})

df = df.set_index(['country', 'city'])

# Accessing with multi-index
df.loc['USA']  # All USA cities
df.loc[('USA', 'NYC')]  # Specific city

# ✅ CORRECT: Cross-section
df.xs('USA', level='country')
df.xs('London', level='city')

# ✅ CORRECT: GroupBy with multi-index
df.groupby(level='country').mean()
```

### Parallel Processing with Dask

```python
# For datasets larger than memory, use Dask
import dask.dataframe as dd

# ✅ CORRECT: Dask for out-of-core processing
df = dd.read_csv('huge_file.csv')
result = df.groupby('category')['value'].mean().compute()

# Dask mirrors the pandas API but evaluates lazily:
# nothing is computed until .compute() is called
```

## Anti-Pattern Summary

### Top 5 Performance Killers

1. **iterrows()** - Use vectorized operations
2. **Growing DataFrame in loop** - Collect data, then create the DataFrame
3. **apply() for simple operations** - Use vectorized alternatives
4. **Not using categorical for strings** - Convert to categorical
5. **Loading entire file when chunking works** - Use the chunksize parameter

### Memory Usage Mistakes

1. **Using float64 when float32 suffices** - Halves memory (see the sketch below)
2. **Not using categorical for repeated strings** - ~10x memory savings
3. **Creating unnecessary copies** - Use in-place operations
4. **Loading all columns when few are needed** - Use the usecols parameter

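A minimal sketch putting mistakes 1 and 2 into practice, measuring before and after:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'reading': np.random.rand(100_000),                         # float64
    'label': np.random.choice(['low', 'mid', 'high'], 100_000)  # object
})

before = df.memory_usage(deep=True).sum()
df['reading'] = df['reading'].astype('float32')  # mistake 1
df['label'] = df['label'].astype('category')     # mistake 2
after = df.memory_usage(deep=True).sum()

print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```
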
## Decision Trees

### Should I Use NumPy or pandas?

```
What's your data structure?
├─ Homogeneous numeric array → NumPy
├─ Heterogeneous tabular data → pandas
├─ Time series → pandas
└─ Linear algebra → NumPy
```

### How Do I Optimize a DataFrame Operation?

```
Can I vectorize?
├─ Yes → Use vectorized pandas/NumPy operations
└─ No → Can I use groupby?
   ├─ Yes → Use groupby with built-in aggregations
   └─ No → Can I use apply on columns (not rows)?
      ├─ Yes → Use apply on Series
      └─ No → Use itertuples (last resort)
```

### Memory Optimization Strategy

```
Is memory usage high?
├─ Yes → Check dtypes (downcast if possible)
│   └─ Still high? → Use categorical for strings
│       └─ Still high? → Process in chunks
└─ No → Continue with current approach
```

## Integration with Other Skills

**After using this skill:**
- If you need ML pipelines → see @ml-engineering-workflows for experiment tracking
- If performance issues persist → see @debugging-and-profiling for profiling
- If you need type hints → see @modern-syntax-and-types for advanced typing

**Before using this skill:**
- If unsure whether code is actually slow → use @debugging-and-profiling to profile first
- If setting up a project → use @project-structure-and-tooling for dependencies

## Quick Reference

### NumPy Quick Wins

```python
# Vectorization
result = arr ** 2 + 2 * arr + 1  # Not: for loop

# Boolean indexing
result = arr[arr > 0]  # Not: list comprehension

# Broadcasting
result = matrix + row_vector  # Not: loop over rows

# In-place operations
arr += 1  # Not: arr = arr + 1
```

### pandas Quick Wins

```python
# Never iterrows
df['new'] = df['a'] + df['b']  # Not: iterrows

# Vectorized conditions
df['category'] = np.where(df['value'] > 0.5, 'high', 'low')

# Categorical for strings
df['category'] = pd.Categorical(df['category'])

# Query for complex filters
df.query('A > 0.5 and B < 0.3')  # Not: multiple []
```

### Memory Optimization Checklist

- [ ] Use the smallest dtype that fits the data
- [ ] Convert repeated strings to categorical
- [ ] Use chunking for files larger than available RAM
- [ ] Avoid unnecessary copies (use views or in-place ops)
- [ ] Only load needed columns (usecols in read_csv)

The sketch below combines the dtype-related items into a single pass.

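As a capstone, a hypothetical `shrink_dataframe` helper sketching the dtype-related checklist items; the 50% uniqueness cutoff mirrors the categorical guideline above and should be tuned per dataset:

```python
import pandas as pd

def shrink_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: downcast numerics, categorize repetitive strings.

    Modifies df in place and returns it for chaining.
    """
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_integer_dtype(series):
            df[col] = pd.to_numeric(series, downcast='integer')
        elif pd.api.types.is_float_dtype(series):
            df[col] = pd.to_numeric(series, downcast='float')
        elif series.dtype == object and series.nunique() < 0.5 * len(series):
            df[col] = series.astype('category')
    return df
```
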
1506
skills/using-python-engineering/systematic-delinting.md
Normal file
File diff suppressed because it is too large
1848
skills/using-python-engineering/testing-and-quality.md
Normal file
File diff suppressed because it is too large