Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:47:43 +08:00
commit 2e8d89fca3
41 changed files with 14051 additions and 0 deletions


@@ -0,0 +1,277 @@
---
name: l0
description: L0 regularization for neural network sparsification and intelligent sampling - used in survey calibration
---
# L0 Regularization
L0 is a PyTorch implementation of L0 regularization for neural network sparsification and intelligent sampling, used in PolicyEngine's survey calibration pipeline.
## For Users 👥
### What is L0?
L0 regularization helps PolicyEngine create more efficient survey datasets by intelligently selecting which households to include in calculations.
**Impact you see:**
- Faster population impact calculations
- Smaller dataset sizes
- Maintained accuracy with fewer samples
**Behind the scenes:**
When PolicyEngine shows population-wide impacts, L0 helps select representative households from the full survey, reducing computation time while maintaining accuracy.
## For Analysts 📊
### What L0 Does
L0 provides intelligent sampling gates for:
- **Household selection** - Choose representative samples from CPS
- **Feature selection** - Identify important variables
- **Sparse weighting** - Create compact, efficient datasets
**Used in PolicyEngine for:**
- Survey calibration (via microcalibrate)
- Dataset sparsification in policyengine-us-data
- Efficient microsimulation
### Installation
```bash
pip install l0-python
```
### Quick Example: Sample Selection
```python
from l0 import SampleGate
# Select 1,000 households from 10,000
gate = SampleGate(n_samples=10000, target_samples=1000)
selected_data, indices = gate.select_samples(data)
# Gates learn which samples are most informative
```
### Integration with microcalibrate
```python
from l0 import HardConcrete
from microcalibrate import Calibration
# L0 gates for household selection
gates = HardConcrete(
    len(household_weights),
    temperature=0.25,
    init_mean=0.999  # Start with most households
)
# Use in calibration
# microcalibrate applies gates during weight optimization
```
## For Contributors 💻
### Repository
**Location:** PolicyEngine/L0
**Clone:**
```bash
git clone https://github.com/PolicyEngine/L0
cd L0
```
### Current Implementation
**To see structure:**
```bash
tree l0/
# Key modules:
ls l0/
# - hard_concrete.py - Core L0 distribution
# - layers.py - L0Linear, L0Conv2d
# - gates.py - Sample/feature gates
# - penalties.py - L0/L2 penalty computation
# - temperature.py - Temperature scheduling
```
**To see specific implementations:**
```bash
# Hard Concrete distribution (core algorithm)
cat l0/hard_concrete.py
# Sample gates (used in calibration)
cat l0/gates.py
# Neural network layers
cat l0/layers.py
```
### Key Concepts
**Hard Concrete Distribution:**
- Differentiable approximation of L0 norm
- Allows gradient-based optimization
- Temperature controls sparsity level
**To see implementation:**
```bash
cat l0/hard_concrete.py
```
**Sample Gates:**
- Binary gates for sample selection
- Learn which samples are most informative
- Used in microcalibrate for household selection
**Feature Gates:**
- Select important features/variables
- Reduce dimensionality
- Maintain prediction accuracy
### Usage in PolicyEngine
**In microcalibrate (survey calibration):**
```python
from l0 import HardConcrete
# Create gates for household selection
gates = HardConcrete(
    n_items=len(households),
    temperature=0.25,
    init_mean=0.999  # Start with almost all households
)
# Gates produce probabilities (0 to 1)
probs = gates()
# Apply to weights during calibration
masked_weights = weights * probs
```
**In policyengine-us-data:**
```bash
# See usage in data pipeline
grep -r "from l0 import" ../policyengine-us-data/
```
### Temperature Scheduling
**Controls sparsity over training:**
```python
from l0 import TemperatureScheduler, update_temperatures
scheduler = TemperatureScheduler(
    initial_temp=1.0,  # Start relaxed
    final_temp=0.1,    # End sparse
    total_epochs=100
)
for epoch in range(100):
    temp = scheduler.get_temperature(epoch)
    update_temperatures(model, temp)
    # ... training ...
```
**To see implementation:**
```bash
cat l0/temperature.py
```
### L0L2 Combined Penalty
**Prevents overfitting:**
```python
from l0 import compute_l0l2_penalty
# Combine L0 (sparsity) with L2 (regularization)
penalty = compute_l0l2_penalty(
    model,
    l0_lambda=1e-3,  # Sparsity strength
    l2_lambda=1e-4   # Weight regularization
)
loss = task_loss + penalty
```
### Testing
**Run tests:**
```bash
make test
# Or
pytest tests/ -v --cov=l0
```
**To see test patterns:**
```bash
cat tests/test_hard_concrete.py
cat tests/test_gates.py
```
## Advanced Usage
### Hybrid Gates (L0 + Random)
```python
from l0 import HybridGate
# Combine L0 selection with random sampling
hybrid = HybridGate(
    n_items=10000,
    l0_fraction=0.25,      # 25% from L0
    random_fraction=0.75,  # 75% random
    target_items=1000
)
selected, indices, types = hybrid.select(data)
```
### Feature Selection
```python
from l0 import FeatureGate
# Select top features
gate = FeatureGate(n_features=1000, max_features=50)
selected_data, feature_indices = gate.select_features(data)
# Get feature importance
importance = gate.get_feature_importance()
```
## Mathematical Background
**L0 norm:**
- Counts non-zero elements
- Non-differentiable (discontinuous)
- Hard to optimize directly
**Hard Concrete relaxation:**
- Continuous, differentiable approximation
- Enables gradient descent
- "Stretches" binary distribution to allow gradients
**Paper:**
Louizos, Welling, & Kingma (2017): "Learning Sparse Neural Networks through L0 Regularization"
https://arxiv.org/abs/1712.01312
## Related Packages
**Uses L0:**
- microcalibrate (survey weight calibration)
- policyengine-us-data (household selection)
**See also:**
- **microcalibrate-skill** - Survey calibration using L0
- **policyengine-us-data-skill** - Data pipeline integration
## Resources
**Repository:** https://github.com/PolicyEngine/L0
**Documentation:** https://policyengine.github.io/L0/
**Paper:** https://arxiv.org/abs/1712.01312
**PyPI:** https://pypi.org/project/l0-python/


@@ -0,0 +1,404 @@
---
name: microcalibrate
description: Survey weight calibration to match population targets - used in policyengine-us-data for enhanced microdata
---
# MicroCalibrate
MicroCalibrate calibrates survey weights to match population targets, with L0 regularization for sparsity and automatic hyperparameter tuning.
## For Users 👥
### What is MicroCalibrate?
When you see PolicyEngine population impacts, the underlying data has been "calibrated" using MicroCalibrate to match official population statistics.
**What calibration does:**
- Adjusts survey weights to match known totals (population, income, employment)
- Creates representative datasets
- Reduces dataset size while maintaining accuracy
- Ensures PolicyEngine estimates match administrative data
**Example:**
- Census says US has 331 million people
- Survey has 100,000 households representing the population
- MicroCalibrate adjusts weights so survey totals match census totals
- Result: More accurate PolicyEngine calculations
## For Analysts 📊
### Installation
```bash
pip install microcalibrate
```
### What MicroCalibrate Does
**Calibration problem:**
You have survey data with initial weights, and you know certain population totals (benchmarks). Calibration adjusts weights so weighted survey totals match benchmarks.
**Example:**
```python
from microcalibrate import Calibration
import numpy as np
import pandas as pd
# Survey data (1,000 households); incomes and employment are illustrative random values
rng = np.random.default_rng(0)
household_incomes = rng.normal(50_000, 15_000, 1000)
household_employment = rng.integers(0, 2, 1000)
weights = np.ones(1000)  # Initial weights
# Estimates (how much each household contributes to targets)
estimate_matrix = pd.DataFrame({
    'total_income': household_incomes,      # Each household's income
    'total_employed': household_employment  # 1 if employed, 0 if not
})
# Known population targets (benchmarks)
targets = np.array([
    50_000_000,  # Total income in population
    600,         # Total employed people
])
# Calibrate
cal = Calibration(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01  # Sparsity penalty
)
# Optimize weights
new_weights = cal.calibrate(max_iter=1000)
# Check results
achieved = (estimate_matrix.values.T @ new_weights)
print(f"Target: {targets}")
print(f"Achieved: {achieved}")
print(f"Non-zero weights: {(new_weights > 0).sum()} / {len(weights)}")
```
### L0 Regularization for Sparsity
**Why sparsity matters:**
- Reduces dataset size (fewer households to simulate)
- Faster PolicyEngine calculations
- Easier to validate and understand
**L0 penalty:**
```python
# L0 encourages many weights to be exactly zero
cal = Calibration(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01  # Higher = more sparse
)
```
**To see impact:**
```python
# Without L0
cal_dense = Calibration(..., l0_lambda=0.0)
weights_dense = cal_dense.calibrate()
# With L0
cal_sparse = Calibration(..., l0_lambda=0.01)
weights_sparse = cal_sparse.calibrate()
print(f"Dense: {(weights_dense > 0).sum()} households")
print(f"Sparse: {(weights_sparse > 0).sum()} households")
# Sparse might use 60% fewer households while matching same targets
```
### Automatic Hyperparameter Tuning
**Find optimal l0_lambda:**
```python
from microcalibrate import tune_hyperparameters
# Find best l0_lambda using cross-validation
best_lambda, results = tune_hyperparameters(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    lambda_min=1e-4,
    lambda_max=1e-1,
    n_trials=50
)
print(f"Best lambda: {best_lambda}")
```
### Robustness Evaluation
**Test calibration stability:**
```python
from microcalibrate import evaluate_robustness
# Holdout validation
robustness = evaluate_robustness(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01,
    n_folds=5
)
print(f"Mean error: {robustness['mean_error']}")
print(f"Std error: {robustness['std_error']}")
```
### Interactive Dashboard
**Visualize calibration:**
https://microcalibrate.vercel.app/
Features:
- Upload survey data
- Set targets
- Tune hyperparameters
- View results
- Download calibrated weights
## For Contributors 💻
### Repository
**Location:** PolicyEngine/microcalibrate
**Clone:**
```bash
git clone https://github.com/PolicyEngine/microcalibrate
cd microcalibrate
```
### Current Implementation
**To see structure:**
```bash
tree microcalibrate/
# Key modules:
ls microcalibrate/
# - calibration.py - Main Calibration class
# - hyperparameter_tuning.py - Optuna integration
# - evaluation.py - Robustness testing
# - target_analysis.py - Target diagnostics
```
**To see specific implementations:**
```bash
# Main calibration algorithm
cat microcalibrate/calibration.py
# Hyperparameter tuning
cat microcalibrate/hyperparameter_tuning.py
# Robustness evaluation
cat microcalibrate/evaluation.py
```
### Dependencies
**Required:**
- torch (PyTorch for optimization)
- l0-python (L0 regularization)
- optuna (hyperparameter tuning)
- numpy, pandas, tqdm
**To see all dependencies:**
```bash
cat pyproject.toml
```
### How MicroCalibrate Uses L0
```python
# Internal to microcalibrate
from l0 import HardConcrete
# Create gates for sample selection
gates = HardConcrete(
    n_items=len(weights),
    temperature=temperature,
    init_mean=0.999
)
# Apply gates during optimization
effective_weights = weights * gates()
# L0 penalty encourages gates → 0 or 1
# Result: Many households get weight = 0 (sparse)
```
**To see L0 integration:**
```bash
grep -n "HardConcrete\|l0" microcalibrate/calibration.py
```
### Optimization Algorithm
**Iterative reweighting:**
1. Start with initial weights
2. Apply L0 gates (select samples)
3. Optimize to match targets
4. Apply penalty for sparsity
5. Iterate until convergence
**Loss function:**
```python
# Target matching loss
target_loss = ((achieved_targets - desired_targets) ** 2).sum()
# L0 penalty (number of non-zero weights)
l0_penalty = l0_lambda * (weights != 0).sum()
# Total loss
total_loss = target_loss + l0_penalty
```
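**Illustrative sketch of this loop** (assumes `HardConcrete` is a `torch.nn.Module` whose parameters can be optimized and whose call returns per-household gate probabilities, as shown above; `initial_weights`, `contribs` (a households-by-targets tensor), `targets`, and `l0_lambda` are placeholder inputs, not microcalibrate's exact internals):
```python
import torch
from l0 import HardConcrete

weights = torch.tensor(initial_weights, dtype=torch.float32)  # fixed survey weights
gates = HardConcrete(n_items=len(weights), temperature=0.25, init_mean=0.999)
optimizer = torch.optim.Adam(gates.parameters(), lr=0.1)

for step in range(1000):
    optimizer.zero_grad()
    probs = gates()                            # per-household gate probabilities
    achieved = contribs.T @ (weights * probs)  # gated weighted totals
    target_loss = (((achieved - targets) / targets) ** 2).sum()
    loss = target_loss + l0_lambda * probs.sum()  # probs.sum() ~ expected open gates
    loss.backward()
    optimizer.step()
```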
### Testing
**Run tests:**
```bash
make test
# Or
pytest tests/ -v
```
**To see test patterns:**
```bash
cat tests/test_calibration.py
cat tests/test_hyperparameter_tuning.py
```
### Usage in policyengine-us-data
**To see how data pipeline uses microcalibrate:**
```bash
cd ../policyengine-us-data
# Find usage
grep -r "microcalibrate" policyengine_us_data/
grep -r "Calibration" policyengine_us_data/
```
## Common Patterns
### Pattern 1: Basic Calibration
```python
from microcalibrate import Calibration
cal = Calibration(
    weights=initial_weights,
    targets=benchmark_values,
    estimate_matrix=contributions,
    l0_lambda=0.01
)
calibrated_weights = cal.calibrate(max_iter=1000)
```
### Pattern 2: With Hyperparameter Tuning
```python
from microcalibrate import tune_hyperparameters, Calibration
# Find best lambda
best_lambda, results = tune_hyperparameters(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix
)
# Use best lambda
cal = Calibration(..., l0_lambda=best_lambda)
calibrated_weights = cal.calibrate()
```
### Pattern 3: Multi-Target Calibration
```python
# Multiple population targets
estimate_matrix = pd.DataFrame({
    'total_population': population_counts,
    'total_income': incomes,
    'total_employed': employment_indicators,
    'total_children': child_counts
})
targets = np.array([
    331_000_000,         # US population
    15_000_000_000_000,  # Total income
    160_000_000,         # Employed people
    73_000_000           # Children
])
cal = Calibration(weights, targets, estimate_matrix, l0_lambda=0.01)
```
## Performance Considerations
**Calibration speed:**
- 1,000 households, 5 targets: ~1 second
- 100,000 households, 10 targets: ~30 seconds
- Depends on: dataset size, number of targets, l0_lambda
**Memory usage:**
- PyTorch tensors for optimization
- Scales linearly with dataset size
**To profile:**
```python
import time
start = time.time()
weights = cal.calibrate()
print(f"Calibration took {time.time() - start:.1f}s")
```
## Troubleshooting
**Common issues:**
**1. Calibration not converging:**
```python
# Try:
# - More iterations
# - Lower l0_lambda
# - Better initialization
cal = Calibration(..., l0_lambda=0.001) # Lower sparsity penalty
weights = cal.calibrate(max_iter=5000) # More iterations
```
**2. Targets not matching:**
```python
# Check achieved vs desired
achieved = (estimate_matrix.values.T @ weights)
error = np.abs(achieved - targets) / targets
print(f"Relative errors: {error}")
# If large errors, l0_lambda may be too high
```
**3. Too sparse (all weights zero):**
```python
# Lower l0_lambda
cal = Calibration(..., l0_lambda=0.0001)
```
## Related Skills
- **l0-skill** - Understanding L0 regularization
- **policyengine-us-data-skill** - How calibration fits in data pipeline
- **microdf-skill** - Working with calibrated survey data
## Resources
**Repository:** https://github.com/PolicyEngine/microcalibrate
**Dashboard:** https://microcalibrate.vercel.app/
**PyPI:** https://pypi.org/project/microcalibrate/
**Paper:** Louizos et al. (2017) on L0 regularization


@@ -0,0 +1,329 @@
---
name: microdf
description: Weighted pandas DataFrames for survey microdata analysis - inequality, poverty, and distributional calculations
---
# MicroDF
MicroDF provides weighted pandas DataFrames and Series for analyzing survey microdata, with built-in support for inequality and poverty calculations.
## For Users 👥
### What is MicroDF?
When you see poverty rates, Gini coefficients, or distributional charts in PolicyEngine, those are calculated using MicroDF.
**MicroDF powers:**
- Poverty rate calculations (SPM)
- Inequality metrics (Gini coefficient)
- Income distribution analysis
- Weighted statistics from survey data
### Understanding the Metrics
**Gini coefficient:**
- Calculated using MicroDF from weighted income data
- Ranges from 0 (perfect equality) to 1 (perfect inequality)
- US typically around 0.48
**Poverty rates:**
- Calculated using MicroDF with weighted household data
- Compares income to poverty thresholds
- Accounts for household composition
**Percentiles:**
- MicroDF calculates weighted percentiles
- Shows income distribution (10th, 50th, 90th percentile)
## For Analysts 📊
### Installation
```bash
pip install microdf-python
```
### Quick Start
```python
import microdf as mdf
import pandas as pd
# Create sample data
df = pd.DataFrame({
    'income': [10000, 20000, 30000, 40000, 50000],
    'weights': [1, 2, 3, 2, 1]
})
# Create MicroDataFrame
mdf_df = mdf.MicroDataFrame(df, weights='weights')
# All operations are weight-aware
print(f"Weighted mean: ${mdf_df.income.mean():,.0f}")
print(f"Gini coefficient: {mdf_df.income.gini():.3f}")
```
### Common Operations
**Weighted statistics:**
```python
mdf_df.income.mean() # Weighted mean
mdf_df.income.median() # Weighted median
mdf_df.income.sum() # Weighted sum
mdf_df.income.std() # Weighted standard deviation
```
**Inequality metrics:**
```python
mdf_df.income.gini() # Gini coefficient
mdf_df.income.top_x_pct_share(10) # Top 10% share
mdf_df.income.top_x_pct_share(1) # Top 1% share
```
**Poverty analysis:**
```python
# Poverty rate (income < threshold)
poverty_rate = mdf_df.poverty_rate(
    income_measure='income',
    threshold=poverty_line
)
# Poverty gap (how far below threshold)
poverty_gap = mdf_df.poverty_gap(
    income_measure='income',
    threshold=poverty_line
)
# Deep poverty (income < 50% of threshold)
deep_poverty_rate = mdf_df.deep_poverty_rate(
    income_measure='income',
    threshold=poverty_line,
    deep_poverty_line=0.5
)
```
**Quantiles:**
```python
# Deciles
mdf_df.income.decile_values()
# Quintiles
mdf_df.income.quintile_values()
# Custom quantiles
mdf_df.income.quantile(0.25) # 25th percentile
```
### MicroSeries
```python
# Extract a Series with weights
income_series = mdf_df.income # This is a MicroSeries
# MicroSeries operations
income_series.mean()
income_series.gini()
income_series.percentile(50)
```
### Working with PolicyEngine Results
```python
import microdf as mdf
from policyengine_us import Simulation
# Run simulation with axes (multiple households)
situation_with_axes = {...} # See policyengine-us-skill
sim = Simulation(situation=situation_with_axes)
# Get results as arrays
incomes = sim.calculate("household_net_income", 2024)
weights = sim.calculate("household_weight", 2024)
# Create MicroDataFrame
df = pd.DataFrame({'income': incomes, 'weight': weights})
mdf_df = mdf.MicroDataFrame(df, weights='weight')
# Calculate metrics
gini = mdf_df.income.gini()
poverty_rate = mdf_df.poverty_rate('income', threshold=15000)
print(f"Gini: {gini:.3f}")
print(f"Poverty rate: {poverty_rate:.1%}")
```
## For Contributors 💻
### Repository
**Location:** PolicyEngine/microdf
**Clone:**
```bash
git clone https://github.com/PolicyEngine/microdf
cd microdf
```
### Current Implementation
**To see current API:**
```bash
# Main classes
cat microdf/microframe.py # MicroDataFrame
cat microdf/microseries.py # MicroSeries
# Key modules
cat microdf/generic.py # Generic weighted operations
cat microdf/inequality.py # Gini, top shares
cat microdf/poverty.py # Poverty metrics
```
**To see all methods:**
```bash
# MicroDataFrame methods
grep "def " microdf/microframe.py
# MicroSeries methods
grep "def " microdf/microseries.py
```
### Testing
**To see test patterns:**
```bash
ls tests/
cat tests/test_microframe.py
```
**Run tests:**
```bash
make test
# Or
pytest tests/ -v
```
### Contributing
**Before contributing:**
1. Check if method already exists
2. Ensure it's weighted correctly
3. Add tests
4. Follow policyengine-standards-skill
**Common contributions:**
- New inequality metrics
- New poverty measures
- Performance optimizations
- Bug fixes
## Advanced Patterns
### Custom Aggregations
```python
# Define custom weighted aggregation
def weighted_operation(series, weights):
    return (series * weights).sum() / weights.sum()
# Apply to MicroSeries
result = weighted_operation(mdf_df.income, mdf_df.weights)
```
### Groupby Operations
```python
# Group by with weights
grouped = mdf_df.groupby('state')
state_means = grouped.income.mean() # Weighted means by state
```
### Inequality Decomposition
**To see decomposition methods:**
```bash
grep -rA 20 "def.*decomp" microdf/
```
## Integration Examples
### Example 1: PolicyEngine Blog Post Analysis
```python
# Pattern from PolicyEngine blog posts
import microdf as mdf
# Get simulation results
baseline_income = baseline_sim.calculate("household_net_income", 2024)
reform_income = reform_sim.calculate("household_net_income", 2024)
weights = baseline_sim.calculate("household_weight", 2024)
# Create MicroDataFrame
df = pd.DataFrame({
    'baseline_income': baseline_income,
    'reform_income': reform_income,
    'weight': weights
})
mdf_df = mdf.MicroDataFrame(df, weights='weight')
# Calculate impacts
baseline_gini = mdf_df.baseline_income.gini()
reform_gini = mdf_df.reform_income.gini()
print(f"Gini change: {reform_gini - baseline_gini:+.4f}")
```
### Example 2: Poverty Analysis
```python
# Calculate poverty under baseline and reform
from policyengine_us import Simulation
baseline_sim = Simulation(situation=situation)
reform_sim = Simulation(situation=situation, reform=reform)
# Get incomes
baseline_income = baseline_sim.calculate("spm_unit_net_income", 2024)
reform_income = reform_sim.calculate("spm_unit_net_income", 2024)
spm_threshold = baseline_sim.calculate("spm_unit_poverty_threshold", 2024)
weights = baseline_sim.calculate("spm_unit_weight", 2024)
# Calculate poverty rates
df_baseline = mdf.MicroDataFrame(
    pd.DataFrame({'income': baseline_income, 'threshold': spm_threshold, 'weight': weights}),
    weights='weight'
)
poverty_baseline = (df_baseline.income < df_baseline.threshold).mean()  # Weighted
# Same construction for the reform results
df_reform = mdf.MicroDataFrame(
    pd.DataFrame({'income': reform_income, 'threshold': spm_threshold, 'weight': weights}),
    weights='weight'
)
poverty_reform = (df_reform.income < df_reform.threshold).mean()  # Weighted
print(f"Poverty reduction: {(poverty_baseline - poverty_reform):.1%}")
```
## Package Status
**Maturity:** Stable, production-ready
**API stability:** Stable (rarely breaking changes)
**Performance:** Optimized for large datasets
**To see version:**
```bash
pip show microdf-python
```
**To see changelog:**
```bash
cat CHANGELOG.md # In microdf repo
```
## Related Skills
- **policyengine-us-skill** - Generating data for microdf analysis
- **policyengine-analysis-skill** - Using microdf in policy analysis
- **policyengine-us-data-skill** - Data sources for microdf
## Resources
**Repository:** https://github.com/PolicyEngine/microdf
**PyPI:** https://pypi.org/project/microdf-python/
**Issues:** https://github.com/PolicyEngine/microdf/issues


@@ -0,0 +1,415 @@
---
name: microimpute
description: ML-based variable imputation for survey data - used in policyengine-us-data to fill missing values
---
# MicroImpute
MicroImpute imputes missing survey variables using a range of statistical and machine-learning methods, with built-in comparison and benchmarking.
## For Users 👥
### What is MicroImpute?
When PolicyEngine calculates population impacts, the underlying survey data has missing information. MicroImpute uses machine learning to fill in those gaps intelligently.
**What imputation does:**
- Fills missing data in surveys
- Uses machine learning to predict missing values
- Maintains statistical relationships
- Improves PolicyEngine accuracy
**Example:**
- Survey asks about income but not capital gains breakdown
- MicroImpute predicts short-term vs long-term capital gains
- Based on patterns from IRS data
- Result: More accurate tax calculations
**You benefit from imputation when:**
- PolicyEngine calculates capital gains tax accurately
- Benefits eligibility uses complete household information
- State-specific calculations have all needed data
## For Analysts 📊
### Installation
```bash
pip install microimpute
# With image export (for plots)
pip install microimpute[images]
```
### What MicroImpute Does
**Imputation problem:**
- Donor dataset has complete information (e.g., IRS tax records)
- Recipient dataset has missing variables (e.g., CPS survey)
- Imputation predicts missing values in recipient using donor patterns
**Methods available:**
- Linear regression
- Random forest
- Quantile forest (preserves full distribution)
- XGBoost
- Hot deck (traditional matching)
### Quick Example
```python
from microimpute import Imputer
import pandas as pd
# Donor data (complete)
donor = pd.DataFrame({
    'income': [50000, 60000, 70000],
    'age': [30, 40, 50],
    'capital_gains': [5000, 8000, 12000]  # Variable to impute
})
# Recipient data (missing capital_gains)
recipient = pd.DataFrame({
    'income': [55000, 65000],
    'age': [35, 45],
    # capital_gains is missing
})
# Impute using quantile forest
imputer = Imputer(method='quantile_forest')
imputer.fit(
    donor=donor,
    donor_target='capital_gains',
    common_vars=['income', 'age']
)
recipient_imputed = imputer.predict(recipient)
# Now recipient has predicted capital_gains
```
### Method Comparison
```python
from microimpute import compare_methods
# Compare different imputation methods
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='capital_gains',
    common_vars=['income', 'age'],
    methods=['linear', 'random_forest', 'quantile_forest']
)
# Shows quantile loss for each method
print(results)
```
### Quantile Loss (Quality Metric)
**Why quantile loss:**
- Measures how well imputation preserves the distribution
- Not just mean accuracy, but full distribution shape
- Lower is better
**Interpretation:**
```python
# Quantile loss around 0.1 = good
# Quantile loss around 0.5 = poor
# Compare across methods to choose best
```
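**Reference formula** (a generic NumPy sketch of the pinball loss, not microimpute's exact implementation; `actual_values` and `imputed_values` are placeholders for a held-out comparison):
```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss at quantile q: asymmetric penalty for over- and under-prediction."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Average across several quantiles to score how well a method preserves the distribution
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
score = np.mean([quantile_loss(actual_values, imputed_values, q) for q in quantiles])
```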
## For Contributors 💻
### Repository
**Location:** PolicyEngine/microimpute
**Clone:**
```bash
git clone https://github.com/PolicyEngine/microimpute
cd microimpute
```
### Current Implementation
**To see structure:**
```bash
tree microimpute/
# Key modules:
ls microimpute/
# - imputer.py - Main Imputer class
# - methods/ - Different imputation methods
# - comparison.py - Method benchmarking
# - utils/ - Utilities
```
**To see specific methods:**
```bash
# Quantile forest implementation
cat microimpute/methods/quantile_forest.py
# Random forest
cat microimpute/methods/random_forest.py
# Linear regression
cat microimpute/methods/linear.py
```
### Dependencies
**Required:**
- numpy, pandas (data handling)
- scikit-learn (ML models)
- quantile-forest (distributional imputation)
- optuna (hyperparameter tuning)
- statsmodels (statistical methods)
- scipy (statistical functions)
**To see all dependencies:**
```bash
cat pyproject.toml
```
### Adding New Imputation Methods
**Pattern:**
```python
# microimpute/methods/my_method.py
import numpy as np
from sklearn.linear_model import LinearRegression

class MyMethodImputer:
    """Illustrative example: the same interface backed by a simple linear regression."""
    def fit(self, X_train, y_train):
        """Train on donor data."""
        self.model = LinearRegression().fit(X_train, y_train)
        return self

    def predict(self, X_test):
        """Impute on recipient data."""
        return self.model.predict(X_test)

    def get_quantile_loss(self, X_val, y_val, q=0.5):
        """Compute validation (pinball) loss at quantile q."""
        diff = y_val - self.model.predict(X_val)
        return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))
```
### Usage in policyengine-us-data
**To see how data pipeline uses microimpute:**
```bash
cd ../policyengine-us-data
# Find usage
grep -r "microimpute" policyengine_us_data/
grep -r "Imputer" policyengine_us_data/
```
**Typical workflow:**
1. Load CPS (has demographics, missing capital gains details)
2. Load IRS PUF (has complete tax data)
3. Use microimpute to predict missing CPS variables from PUF patterns
4. Validate imputation quality
5. Save enhanced dataset
### Testing
**Run tests:**
```bash
make test
# Or
pytest tests/ -v --cov=microimpute
```
**To see test patterns:**
```bash
cat tests/test_imputer.py
cat tests/test_methods.py
```
## Common Patterns
### Pattern 1: Basic Imputation
```python
from microimpute import Imputer
# Create imputer
imputer = Imputer(method='quantile_forest')
# Fit on donor (complete data)
imputer.fit(
    donor=donor_df,
    donor_target='target_variable',
    common_vars=['age', 'income', 'state']
)
# Predict on recipient (missing target_variable)
recipient_imputed = imputer.predict(recipient_df)
```
### Pattern 2: Choosing Best Method
```python
from microimpute import compare_methods
# Test multiple methods
methods = ['linear', 'random_forest', 'quantile_forest', 'xgboost']
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='target',
    common_vars=common_vars,
    methods=methods
)
# Use method with lowest quantile loss
best_method = results.sort_values('quantile_loss').iloc[0]['method']
```
### Pattern 3: Multiple Variable Imputation
```python
# Impute several variables
variables_to_impute = [
    'short_term_capital_gains',
    'long_term_capital_gains',
    'qualified_dividends'
]
for var in variables_to_impute:
    imputer = Imputer(method='quantile_forest')
    imputer.fit(donor=irs_puf, donor_target=var, common_vars=common_vars)
    cps[var] = imputer.predict(cps)
```
## Advanced Features
### Hyperparameter Tuning
**Built-in Optuna integration:**
```python
from microimpute import tune_hyperparameters
# Automatically find best hyperparameters
best_params, study = tune_hyperparameters(
    donor=donor,
    target_var='target',
    common_vars=common_vars,
    method='quantile_forest',
    n_trials=100
)
# Use tuned parameters
imputer = Imputer(method='quantile_forest', **best_params)
```
### Cross-Validation
**Validate imputation quality:**
```python
from microimpute import Imputer
from sklearn.model_selection import train_test_split
# Illustrative holdout check: hold out part of the donor, impute it, compare to the truth
donor_train, donor_holdout = train_test_split(donor, test_size=0.2, random_state=0)
imputer = Imputer(method='quantile_forest')
imputer.fit(donor=donor_train, donor_target='capital_gains', common_vars=['income', 'age'])
holdout_imputed = imputer.predict(donor_holdout.drop(columns=['capital_gains']))
# Measure accuracy against the known held-out values
error = (holdout_imputed['capital_gains'] - donor_holdout['capital_gains']).abs().mean()
```
### Visualization
**Plot imputation results:**
```python
import plotly.express as px
# Compare imputed vs actual (on donor validation set)
fig = px.scatter(
    x=actual_values,
    y=imputed_values,
    labels={'x': 'Actual', 'y': 'Imputed'}
)
# Add a 45-degree reference line (perfect imputation falls on this line)
lo, hi = min(actual_values), max(actual_values)
fig.add_shape(type='line', x0=lo, y0=lo, x1=hi, y1=hi)
```
## Statistical Background
**Imputation preserves:**
- Marginal distributions (imputed variable distribution matches donor)
- Conditional relationships (imputation depends on common variables)
- Uncertainty (quantile methods preserve full distribution)
**Trade-offs:**
- **Linear:** Fast, but assumes linear relationships
- **Random forest:** Handles non-linearity, may overfit
- **Quantile forest:** Preserves full distribution, slower
- **XGBoost:** High accuracy, requires tuning
## Integration with PolicyEngine
**Full pipeline (policyengine-us-data):**
```
1. Load CPS survey data
2. microimpute: Fill missing variables from IRS PUF
3. microcalibrate: Adjust weights to match benchmarks
4. Validation: Check against administrative totals
5. Package: Distribute enhanced dataset
6. PolicyEngine: Use for population simulations
```
## Comparison to Other Methods
**MicroImpute vs traditional imputation:**
**Traditional (mean imputation):**
- Fast but destroys distribution
- All missing values get same value
- Underestimates variance
**MicroImpute (ML methods):**
- Preserves relationships
- Different predictions per record
- Maintains distribution shape
**Quantile forest advantage:**
- Predicts full conditional distribution
- Not just point estimates
- Can sample from predicted distribution
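**Illustrative sketch of that last point** (uses the `quantile-forest` package that microimpute depends on, not microimpute's own API; `donor` and `recipient` follow the Quick Example above):
```python
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

qrf = RandomForestQuantileRegressor(n_estimators=200, random_state=0)
qrf.fit(donor[['income', 'age']], donor['capital_gains'])

# Predict a grid of conditional quantiles for each recipient record
grid = np.linspace(0.05, 0.95, 19)
pred = qrf.predict(recipient[['income', 'age']], quantiles=list(grid))  # shape (n, 19)

# Draw a random quantile per record so imputed values keep the conditional spread
rng = np.random.default_rng(0)
imputed = pred[np.arange(len(pred)), rng.integers(0, len(grid), size=len(pred))]
```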
## Performance Tips
**For large datasets:**
```python
# Use random forest (faster than quantile forest)
imputer = Imputer(method='random_forest')
# Or subsample donor
donor_sample = donor.sample(n=10000, random_state=42)
imputer.fit(donor=donor_sample, ...)
```
**For high accuracy:**
```python
# Use quantile forest with tuning
best_params, _ = tune_hyperparameters(...)
imputer = Imputer(method='quantile_forest', **best_params)
```
## Related Skills
- **l0-skill** - Regularization techniques
- **microcalibrate-skill** - Survey calibration (next step after imputation)
- **policyengine-us-data-skill** - Complete data pipeline
- **microdf-skill** - Working with imputed/calibrated data
## Resources
**Repository:** https://github.com/PolicyEngine/microimpute
**PyPI:** https://pypi.org/project/microimpute/
**Documentation:** See README and docstrings in source