Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:47:43 +08:00
commit 2e8d89fca3
41 changed files with 14051 additions and 0 deletions


@@ -0,0 +1,277 @@
---
name: l0
description: L0 regularization for neural network sparsification and intelligent sampling - used in survey calibration
---
# L0 Regularization
L0 is a PyTorch implementation of L0 regularization for neural network sparsification and intelligent sampling, used in PolicyEngine's survey calibration pipeline.
## For Users 👥
### What is L0?
L0 regularization helps PolicyEngine create more efficient survey datasets by intelligently selecting which households to include in calculations.
**Impact you see:**
- Faster population impact calculations
- Smaller dataset sizes
- Maintained accuracy with fewer samples
**Behind the scenes:**
When PolicyEngine shows population-wide impacts, L0 helps select representative households from the full survey, reducing computation time while maintaining accuracy.
## For Analysts 📊
### What L0 Does
L0 provides intelligent sampling gates for:
- **Household selection** - Choose representative samples from CPS
- **Feature selection** - Identify important variables
- **Sparse weighting** - Create compact, efficient datasets
**Used in PolicyEngine for:**
- Survey calibration (via microcalibrate)
- Dataset sparsification in policyengine-us-data
- Efficient microsimulation
### Installation
```bash
pip install l0-python
```
### Quick Example: Sample Selection
```python
from l0 import SampleGate
# Select 1,000 households from 10,000
gate = SampleGate(n_samples=10000, target_samples=1000)
selected_data, indices = gate.select_samples(data)
# Gates learn which samples are most informative
```
### Integration with microcalibrate
```python
from l0 import HardConcrete
from microcalibrate import Calibration
# L0 gates for household selection
gates = HardConcrete(
    len(household_weights),
    temperature=0.25,
    init_mean=0.999  # Start with most households
)
# Use in calibration
# microcalibrate applies gates during weight optimization
```
## For Contributors 💻
### Repository
**Location:** PolicyEngine/L0
**Clone:**
```bash
git clone https://github.com/PolicyEngine/L0
cd L0
```
### Current Implementation
**To see structure:**
```bash
tree l0/
# Key modules:
ls l0/
# - hard_concrete.py - Core L0 distribution
# - layers.py - L0Linear, L0Conv2d
# - gates.py - Sample/feature gates
# - penalties.py - L0/L2 penalty computation
# - temperature.py - Temperature scheduling
```
**To see specific implementations:**
```bash
# Hard Concrete distribution (core algorithm)
cat l0/hard_concrete.py
# Sample gates (used in calibration)
cat l0/gates.py
# Neural network layers
cat l0/layers.py
```
### Key Concepts
**Hard Concrete Distribution:**
- Differentiable approximation of L0 norm
- Allows gradient-based optimization
- Temperature controls sparsity level
**To see implementation:**
```bash
cat l0/hard_concrete.py
```
**Sample Gates:**
- Binary gates for sample selection
- Learn which samples are most informative
- Used in microcalibrate for household selection
**Feature Gates:**
- Select important features/variables
- Reduce dimensionality
- Maintain prediction accuracy
### Usage in PolicyEngine
**In microcalibrate (survey calibration):**
```python
from l0 import HardConcrete
# Create gates for household selection
gates = HardConcrete(
    n_items=len(households),
    temperature=0.25,
    init_mean=0.999  # Start with almost all households
)
# Gates produce probabilities (0 to 1)
probs = gates()
# Apply to weights during calibration
masked_weights = weights * probs
```
**In policyengine-us-data:**
```bash
# See usage in data pipeline
grep -r "from l0 import" ../policyengine-us-data/
```
### Temperature Scheduling
**Controls sparsity over training:**
```python
from l0 import TemperatureScheduler, update_temperatures
scheduler = TemperatureScheduler(
    initial_temp=1.0,  # Start relaxed
    final_temp=0.1,    # End sparse
    total_epochs=100
)
for epoch in range(100):
    temp = scheduler.get_temperature(epoch)
    update_temperatures(model, temp)
    # ... training ...
```
**To see implementation:**
```bash
cat l0/temperature.py
```
### L0L2 Combined Penalty
**Prevents overfitting:**
```python
from l0 import compute_l0l2_penalty
# Combine L0 (sparsity) with L2 (regularization)
penalty = compute_l0l2_penalty(
    model,
    l0_lambda=1e-3,  # Sparsity strength
    l2_lambda=1e-4   # Weight regularization
)
loss = task_loss + penalty
```
### Testing
**Run tests:**
```bash
make test
# Or
pytest tests/ -v --cov=l0
```
**To see test patterns:**
```bash
cat tests/test_hard_concrete.py
cat tests/test_gates.py
```
## Advanced Usage
### Hybrid Gates (L0 + Random)
```python
from l0 import HybridGate
# Combine L0 selection with random sampling
hybrid = HybridGate(
    n_items=10000,
    l0_fraction=0.25,      # 25% from L0
    random_fraction=0.75,  # 75% random
    target_items=1000
)
selected, indices, types = hybrid.select(data)
```
### Feature Selection
```python
from l0 import FeatureGate
# Select top features
gate = FeatureGate(n_features=1000, max_features=50)
selected_data, feature_indices = gate.select_features(data)
# Get feature importance
importance = gate.get_feature_importance()
```
## Mathematical Background
**L0 norm:**
- Counts non-zero elements
- Non-differentiable (discontinuous)
- Hard to optimize directly
**Hard Concrete relaxation:**
- Continuous, differentiable approximation
- Enables gradient descent
- "Stretches" binary distribution to allow gradients
**Paper:**
Louizos, Welling, & Kingma (2017): "Learning Sparse Neural Networks through L0 Regularization"
https://arxiv.org/abs/1712.01312
## Related Packages
**Uses L0:**
- microcalibrate (survey weight calibration)
- policyengine-us-data (household selection)
**See also:**
- **microcalibrate-skill** - Survey calibration using L0
- **policyengine-us-data-skill** - Data pipeline integration
## Resources
**Repository:** https://github.com/PolicyEngine/L0
**Documentation:** https://policyengine.github.io/L0/
**Paper:** https://arxiv.org/abs/1712.01312
**PyPI:** https://pypi.org/project/l0-python/


@@ -0,0 +1,404 @@
---
name: microcalibrate
description: Survey weight calibration to match population targets - used in policyengine-us-data for enhanced microdata
---
# MicroCalibrate
MicroCalibrate calibrates survey weights to match population targets, with L0 regularization for sparsity and automatic hyperparameter tuning.
## For Users 👥
### What is MicroCalibrate?
When you see PolicyEngine population impacts, the underlying data has been "calibrated" using MicroCalibrate to match official population statistics.
**What calibration does:**
- Adjusts survey weights to match known totals (population, income, employment)
- Creates representative datasets
- Reduces dataset size while maintaining accuracy
- Ensures PolicyEngine estimates match administrative data
**Example:**
- Census says US has 331 million people
- Survey has 100,000 households representing the population
- MicroCalibrate adjusts weights so survey totals match census totals
- Result: More accurate PolicyEngine calculations
## For Analysts 📊
### Installation
```bash
pip install microcalibrate
```
### What MicroCalibrate Does
**Calibration problem:**
You have survey data with initial weights, and you know certain population totals (benchmarks). Calibration adjusts weights so weighted survey totals match benchmarks.
**Example:**
```python
from microcalibrate import Calibration
import numpy as np
import pandas as pd
# Survey data (1,000 households); incomes and employment are illustrative random values
rng = np.random.default_rng(0)
household_incomes = rng.normal(50_000, 15_000, 1000)
household_employment = rng.integers(0, 2, 1000)
weights = np.ones(1000)  # Initial weights
# Estimates (how much each household contributes to targets)
estimate_matrix = pd.DataFrame({
    'total_income': household_incomes,      # Each household's income
    'total_employed': household_employment  # 1 if employed, 0 if not
})
# Known population targets (benchmarks)
targets = np.array([
    50_000_000,  # Total income in population
    600,         # Total employed people
])
# Calibrate
cal = Calibration(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01  # Sparsity penalty
)
# Optimize weights
new_weights = cal.calibrate(max_iter=1000)
# Check results
achieved = (estimate_matrix.values.T @ new_weights)
print(f"Target: {targets}")
print(f"Achieved: {achieved}")
print(f"Non-zero weights: {(new_weights > 0).sum()} / {len(weights)}")
```
### L0 Regularization for Sparsity
**Why sparsity matters:**
- Reduces dataset size (fewer households to simulate)
- Faster PolicyEngine calculations
- Easier to validate and understand
**L0 penalty:**
```python
# L0 encourages many weights to be exactly zero
cal = Calibration(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01  # Higher = more sparse
)
```
**To see impact:**
```python
# Without L0
cal_dense = Calibration(..., l0_lambda=0.0)
weights_dense = cal_dense.calibrate()
# With L0
cal_sparse = Calibration(..., l0_lambda=0.01)
weights_sparse = cal_sparse.calibrate()
print(f"Dense: {(weights_dense > 0).sum()} households")
print(f"Sparse: {(weights_sparse > 0).sum()} households")
# Sparse might use 60% fewer households while matching same targets
```
### Automatic Hyperparameter Tuning
**Find optimal l0_lambda:**
```python
from microcalibrate import tune_hyperparameters
# Find best l0_lambda using cross-validation
best_lambda, results = tune_hyperparameters(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    lambda_min=1e-4,
    lambda_max=1e-1,
    n_trials=50
)
print(f"Best lambda: {best_lambda}")
```
### Robustness Evaluation
**Test calibration stability:**
```python
from microcalibrate import evaluate_robustness
# Holdout validation
robustness = evaluate_robustness(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01,
    n_folds=5
)
print(f"Mean error: {robustness['mean_error']}")
print(f"Std error: {robustness['std_error']}")
```
### Interactive Dashboard
**Visualize calibration:**
https://microcalibrate.vercel.app/
Features:
- Upload survey data
- Set targets
- Tune hyperparameters
- View results
- Download calibrated weights
## For Contributors 💻
### Repository
**Location:** PolicyEngine/microcalibrate
**Clone:**
```bash
git clone https://github.com/PolicyEngine/microcalibrate
cd microcalibrate
```
### Current Implementation
**To see structure:**
```bash
tree microcalibrate/
# Key modules:
ls microcalibrate/
# - calibration.py - Main Calibration class
# - hyperparameter_tuning.py - Optuna integration
# - evaluation.py - Robustness testing
# - target_analysis.py - Target diagnostics
```
**To see specific implementations:**
```bash
# Main calibration algorithm
cat microcalibrate/calibration.py
# Hyperparameter tuning
cat microcalibrate/hyperparameter_tuning.py
# Robustness evaluation
cat microcalibrate/evaluation.py
```
### Dependencies
**Required:**
- torch (PyTorch for optimization)
- l0-python (L0 regularization)
- optuna (hyperparameter tuning)
- numpy, pandas, tqdm
**To see all dependencies:**
```bash
cat pyproject.toml
```
### How MicroCalibrate Uses L0
```python
# Internal to microcalibrate
from l0 import HardConcrete
# Create gates for sample selection
gates = HardConcrete(
    n_items=len(weights),
    temperature=temperature,
    init_mean=0.999
)
# Apply gates during optimization
effective_weights = weights * gates()
# L0 penalty encourages gates → 0 or 1
# Result: Many households get weight = 0 (sparse)
```
**To see L0 integration:**
```bash
grep -n "HardConcrete\|l0" microcalibrate/calibration.py
```
### Optimization Algorithm
**Iterative reweighting:**
1. Start with initial weights
2. Apply L0 gates (select samples)
3. Optimize to match targets
4. Apply penalty for sparsity
5. Iterate until convergence
**Loss function:**
```python
# Target matching loss
target_loss = ((achieved_targets - desired_targets) ** 2).sum()
# L0 penalty (number of non-zero weights)
l0_penalty = l0_lambda * (weights != 0).sum()
# Total loss
total_loss = target_loss + l0_penalty
```
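**Illustrative sketch of this loop** (assumes `HardConcrete` is a `torch.nn.Module` whose parameters can be optimized and whose call returns per-household gate probabilities, as shown above; `initial_weights`, `contribs` (a households-by-targets tensor), `targets`, and `l0_lambda` are placeholder inputs, not microcalibrate's exact internals):
```python
import torch
from l0 import HardConcrete

weights = torch.tensor(initial_weights, dtype=torch.float32)  # fixed survey weights
gates = HardConcrete(n_items=len(weights), temperature=0.25, init_mean=0.999)
optimizer = torch.optim.Adam(gates.parameters(), lr=0.1)

for step in range(1000):
    optimizer.zero_grad()
    probs = gates()                            # per-household gate probabilities
    achieved = contribs.T @ (weights * probs)  # gated weighted totals
    target_loss = (((achieved - targets) / targets) ** 2).sum()
    loss = target_loss + l0_lambda * probs.sum()  # probs.sum() ~ expected open gates
    loss.backward()
    optimizer.step()
```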
### Testing
**Run tests:**
```bash
make test
# Or
pytest tests/ -v
```
**To see test patterns:**
```bash
cat tests/test_calibration.py
cat tests/test_hyperparameter_tuning.py
```
### Usage in policyengine-us-data
**To see how data pipeline uses microcalibrate:**
```bash
cd ../policyengine-us-data
# Find usage
grep -r "microcalibrate" policyengine_us_data/
grep -r "Calibration" policyengine_us_data/
```
## Common Patterns
### Pattern 1: Basic Calibration
```python
from microcalibrate import Calibration
cal = Calibration(
    weights=initial_weights,
    targets=benchmark_values,
    estimate_matrix=contributions,
    l0_lambda=0.01
)
calibrated_weights = cal.calibrate(max_iter=1000)
```
### Pattern 2: With Hyperparameter Tuning
```python
from microcalibrate import tune_hyperparameters, Calibration
# Find best lambda
best_lambda, results = tune_hyperparameters(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix
)
# Use best lambda
cal = Calibration(..., l0_lambda=best_lambda)
calibrated_weights = cal.calibrate()
```
### Pattern 3: Multi-Target Calibration
```python
# Multiple population targets
estimate_matrix = pd.DataFrame({
    'total_population': population_counts,
    'total_income': incomes,
    'total_employed': employment_indicators,
    'total_children': child_counts
})
targets = np.array([
    331_000_000,         # US population
    15_000_000_000_000,  # Total income
    160_000_000,         # Employed people
    73_000_000           # Children
])
cal = Calibration(weights, targets, estimate_matrix, l0_lambda=0.01)
```
## Performance Considerations
**Calibration speed:**
- 1,000 households, 5 targets: ~1 second
- 100,000 households, 10 targets: ~30 seconds
- Depends on: dataset size, number of targets, l0_lambda
**Memory usage:**
- PyTorch tensors for optimization
- Scales linearly with dataset size
**To profile:**
```python
import time
start = time.time()
weights = cal.calibrate()
print(f"Calibration took {time.time() - start:.1f}s")
```
## Troubleshooting
**Common issues:**
**1. Calibration not converging:**
```python
# Try:
# - More iterations
# - Lower l0_lambda
# - Better initialization
cal = Calibration(..., l0_lambda=0.001) # Lower sparsity penalty
weights = cal.calibrate(max_iter=5000) # More iterations
```
**2. Targets not matching:**
```python
# Check achieved vs desired
achieved = (estimate_matrix.values.T @ weights)
error = np.abs(achieved - targets) / targets
print(f"Relative errors: {error}")
# If large errors, l0_lambda may be too high
```
**3. Too sparse (all weights zero):**
```python
# Lower l0_lambda
cal = Calibration(..., l0_lambda=0.0001)
```
## Related Skills
- **l0-skill** - Understanding L0 regularization
- **policyengine-us-data-skill** - How calibration fits in data pipeline
- **microdf-skill** - Working with calibrated survey data
## Resources
**Repository:** https://github.com/PolicyEngine/microcalibrate
**Dashboard:** https://microcalibrate.vercel.app/
**PyPI:** https://pypi.org/project/microcalibrate/
**Paper:** Louizos et al. (2017) on L0 regularization


@@ -0,0 +1,329 @@
---
name: microdf
description: Weighted pandas DataFrames for survey microdata analysis - inequality, poverty, and distributional calculations
---
# MicroDF
MicroDF provides weighted pandas DataFrames and Series for analyzing survey microdata, with built-in support for inequality and poverty calculations.
## For Users 👥
### What is MicroDF?
When you see poverty rates, Gini coefficients, or distributional charts in PolicyEngine, those are calculated using MicroDF.
**MicroDF powers:**
- Poverty rate calculations (SPM)
- Inequality metrics (Gini coefficient)
- Income distribution analysis
- Weighted statistics from survey data
### Understanding the Metrics
**Gini coefficient:**
- Calculated using MicroDF from weighted income data
- Ranges from 0 (perfect equality) to 1 (perfect inequality)
- US typically around 0.48
**Poverty rates:**
- Calculated using MicroDF with weighted household data
- Compares income to poverty thresholds
- Accounts for household composition
**Percentiles:**
- MicroDF calculates weighted percentiles
- Shows income distribution (10th, 50th, 90th percentile)
## For Analysts 📊
### Installation
```bash
pip install microdf-python
```
### Quick Start
```python
import microdf as mdf
import pandas as pd
# Create sample data
df = pd.DataFrame({
    'income': [10000, 20000, 30000, 40000, 50000],
    'weights': [1, 2, 3, 2, 1]
})
# Create MicroDataFrame
mdf_df = mdf.MicroDataFrame(df, weights='weights')
# All operations are weight-aware
print(f"Weighted mean: ${mdf_df.income.mean():,.0f}")
print(f"Gini coefficient: {mdf_df.income.gini():.3f}")
```
### Common Operations
**Weighted statistics:**
```python
mdf_df.income.mean() # Weighted mean
mdf_df.income.median() # Weighted median
mdf_df.income.sum() # Weighted sum
mdf_df.income.std() # Weighted standard deviation
```
**Inequality metrics:**
```python
mdf_df.income.gini() # Gini coefficient
mdf_df.income.top_x_pct_share(10) # Top 10% share
mdf_df.income.top_x_pct_share(1) # Top 1% share
```
**Poverty analysis:**
```python
# Poverty rate (income < threshold)
poverty_rate = mdf_df.poverty_rate(
    income_measure='income',
    threshold=poverty_line
)
# Poverty gap (how far below threshold)
poverty_gap = mdf_df.poverty_gap(
    income_measure='income',
    threshold=poverty_line
)
# Deep poverty (income < 50% of threshold)
deep_poverty_rate = mdf_df.deep_poverty_rate(
    income_measure='income',
    threshold=poverty_line,
    deep_poverty_line=0.5
)
```
**Quantiles:**
```python
# Deciles
mdf_df.income.decile_values()
# Quintiles
mdf_df.income.quintile_values()
# Custom quantiles
mdf_df.income.quantile(0.25) # 25th percentile
```
### MicroSeries
```python
# Extract a Series with weights
income_series = mdf_df.income # This is a MicroSeries
# MicroSeries operations
income_series.mean()
income_series.gini()
income_series.percentile(50)
```
### Working with PolicyEngine Results
```python
import microdf as mdf
from policyengine_us import Simulation
# Run simulation with axes (multiple households)
situation_with_axes = {...} # See policyengine-us-skill
sim = Simulation(situation=situation_with_axes)
# Get results as arrays
incomes = sim.calculate("household_net_income", 2024)
weights = sim.calculate("household_weight", 2024)
# Create MicroDataFrame
df = pd.DataFrame({'income': incomes, 'weight': weights})
mdf_df = mdf.MicroDataFrame(df, weights='weight')
# Calculate metrics
gini = mdf_df.income.gini()
poverty_rate = mdf_df.poverty_rate('income', threshold=15000)
print(f"Gini: {gini:.3f}")
print(f"Poverty rate: {poverty_rate:.1%}")
```
## For Contributors 💻
### Repository
**Location:** PolicyEngine/microdf
**Clone:**
```bash
git clone https://github.com/PolicyEngine/microdf
cd microdf
```
### Current Implementation
**To see current API:**
```bash
# Main classes
cat microdf/microframe.py # MicroDataFrame
cat microdf/microseries.py # MicroSeries
# Key modules
cat microdf/generic.py # Generic weighted operations
cat microdf/inequality.py # Gini, top shares
cat microdf/poverty.py # Poverty metrics
```
**To see all methods:**
```bash
# MicroDataFrame methods
grep "def " microdf/microframe.py
# MicroSeries methods
grep "def " microdf/microseries.py
```
### Testing
**To see test patterns:**
```bash
ls tests/
cat tests/test_microframe.py
```
**Run tests:**
```bash
make test
# Or
pytest tests/ -v
```
### Contributing
**Before contributing:**
1. Check if method already exists
2. Ensure it's weighted correctly
3. Add tests
4. Follow policyengine-standards-skill
**Common contributions:**
- New inequality metrics
- New poverty measures
- Performance optimizations
- Bug fixes
## Advanced Patterns
### Custom Aggregations
```python
# Define custom weighted aggregation
def weighted_operation(series, weights):
    return (series * weights).sum() / weights.sum()
# Apply to MicroSeries
result = weighted_operation(mdf_df.income, mdf_df.weights)
```
### Groupby Operations
```python
# Group by with weights
grouped = mdf_df.groupby('state')
state_means = grouped.income.mean() # Weighted means by state
```
### Inequality Decomposition
**To see decomposition methods:**
```bash
grep -rA 20 "def.*decomp" microdf/
```
## Integration Examples
### Example 1: PolicyEngine Blog Post Analysis
```python
# Pattern from PolicyEngine blog posts
import microdf as mdf
# Get simulation results
baseline_income = baseline_sim.calculate("household_net_income", 2024)
reform_income = reform_sim.calculate("household_net_income", 2024)
weights = baseline_sim.calculate("household_weight", 2024)
# Create MicroDataFrame
df = pd.DataFrame({
    'baseline_income': baseline_income,
    'reform_income': reform_income,
    'weight': weights
})
mdf_df = mdf.MicroDataFrame(df, weights='weight')
# Calculate impacts
baseline_gini = mdf_df.baseline_income.gini()
reform_gini = mdf_df.reform_income.gini()
print(f"Gini change: {reform_gini - baseline_gini:+.4f}")
```
### Example 2: Poverty Analysis
```python
# Calculate poverty under baseline and reform
from policyengine_us import Simulation
baseline_sim = Simulation(situation=situation)
reform_sim = Simulation(situation=situation, reform=reform)
# Get incomes
baseline_income = baseline_sim.calculate("spm_unit_net_income", 2024)
reform_income = reform_sim.calculate("spm_unit_net_income", 2024)
spm_threshold = baseline_sim.calculate("spm_unit_poverty_threshold", 2024)
weights = baseline_sim.calculate("spm_unit_weight", 2024)
# Calculate poverty rates
df_baseline = mdf.MicroDataFrame(
    pd.DataFrame({'income': baseline_income, 'threshold': spm_threshold, 'weight': weights}),
    weights='weight'
)
poverty_baseline = (df_baseline.income < df_baseline.threshold).mean()  # Weighted
# Same construction for the reform results
df_reform = mdf.MicroDataFrame(
    pd.DataFrame({'income': reform_income, 'threshold': spm_threshold, 'weight': weights}),
    weights='weight'
)
poverty_reform = (df_reform.income < df_reform.threshold).mean()  # Weighted
print(f"Poverty reduction: {(poverty_baseline - poverty_reform):.1%}")
```
## Package Status
**Maturity:** Stable, production-ready
**API stability:** Stable (rarely breaking changes)
**Performance:** Optimized for large datasets
**To see version:**
```bash
pip show microdf-python
```
**To see changelog:**
```bash
cat CHANGELOG.md # In microdf repo
```
## Related Skills
- **policyengine-us-skill** - Generating data for microdf analysis
- **policyengine-analysis-skill** - Using microdf in policy analysis
- **policyengine-us-data-skill** - Data sources for microdf
## Resources
**Repository:** https://github.com/PolicyEngine/microdf
**PyPI:** https://pypi.org/project/microdf-python/
**Issues:** https://github.com/PolicyEngine/microdf/issues


@@ -0,0 +1,415 @@
---
name: microimpute
description: ML-based variable imputation for survey data - used in policyengine-us-data to fill missing values
---
# MicroImpute
MicroImpute imputes missing survey variables using a range of statistical and machine-learning methods, with built-in comparison and benchmarking.
## For Users 👥
### What is MicroImpute?
When PolicyEngine calculates population impacts, the underlying survey data has missing information. MicroImpute uses machine learning to fill in those gaps intelligently.
**What imputation does:**
- Fills missing data in surveys
- Uses machine learning to predict missing values
- Maintains statistical relationships
- Improves PolicyEngine accuracy
**Example:**
- Survey asks about income but not capital gains breakdown
- MicroImpute predicts short-term vs long-term capital gains
- Based on patterns from IRS data
- Result: More accurate tax calculations
**You benefit from imputation when:**
- PolicyEngine calculates capital gains tax accurately
- Benefits eligibility uses complete household information
- State-specific calculations have all needed data
## For Analysts 📊
### Installation
```bash
pip install microimpute
# With image export (for plots)
pip install microimpute[images]
```
### What MicroImpute Does
**Imputation problem:**
- Donor dataset has complete information (e.g., IRS tax records)
- Recipient dataset has missing variables (e.g., CPS survey)
- Imputation predicts missing values in recipient using donor patterns
**Methods available:**
- Linear regression
- Random forest
- Quantile forest (preserves full distribution)
- XGBoost
- Hot deck (traditional matching)
### Quick Example
```python
from microimpute import Imputer
import pandas as pd
# Donor data (complete)
donor = pd.DataFrame({
    'income': [50000, 60000, 70000],
    'age': [30, 40, 50],
    'capital_gains': [5000, 8000, 12000]  # Variable to impute
})
# Recipient data (missing capital_gains)
recipient = pd.DataFrame({
    'income': [55000, 65000],
    'age': [35, 45],
    # capital_gains is missing
})
# Impute using quantile forest
imputer = Imputer(method='quantile_forest')
imputer.fit(
    donor=donor,
    donor_target='capital_gains',
    common_vars=['income', 'age']
)
recipient_imputed = imputer.predict(recipient)
# Now recipient has predicted capital_gains
```
### Method Comparison
```python
from microimpute import compare_methods
# Compare different imputation methods
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='capital_gains',
    common_vars=['income', 'age'],
    methods=['linear', 'random_forest', 'quantile_forest']
)
# Shows quantile loss for each method
print(results)
```
### Quantile Loss (Quality Metric)
**Why quantile loss:**
- Measures how well imputation preserves the distribution
- Not just mean accuracy, but full distribution shape
- Lower is better
**Interpretation:**
```python
# Quantile loss around 0.1 = good
# Quantile loss around 0.5 = poor
# Compare across methods to choose best
```
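**Reference formula** (a generic NumPy sketch of the pinball loss, not microimpute's exact implementation; `actual_values` and `imputed_values` are placeholders for a held-out comparison):
```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss at quantile q: asymmetric penalty for over- and under-prediction."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Average across several quantiles to score how well a method preserves the distribution
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
score = np.mean([quantile_loss(actual_values, imputed_values, q) for q in quantiles])
```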
## For Contributors 💻
### Repository
**Location:** PolicyEngine/microimpute
**Clone:**
```bash
git clone https://github.com/PolicyEngine/microimpute
cd microimpute
```
### Current Implementation
**To see structure:**
```bash
tree microimpute/
# Key modules:
ls microimpute/
# - imputer.py - Main Imputer class
# - methods/ - Different imputation methods
# - comparison.py - Method benchmarking
# - utils/ - Utilities
```
**To see specific methods:**
```bash
# Quantile forest implementation
cat microimpute/methods/quantile_forest.py
# Random forest
cat microimpute/methods/random_forest.py
# Linear regression
cat microimpute/methods/linear.py
```
### Dependencies
**Required:**
- numpy, pandas (data handling)
- scikit-learn (ML models)
- quantile-forest (distributional imputation)
- optuna (hyperparameter tuning)
- statsmodels (statistical methods)
- scipy (statistical functions)
**To see all dependencies:**
```bash
cat pyproject.toml
```
### Adding New Imputation Methods
**Pattern:**
```python
# microimpute/methods/my_method.py
import numpy as np
from sklearn.linear_model import LinearRegression

class MyMethodImputer:
    """Illustrative example: the same interface backed by a simple linear regression."""
    def fit(self, X_train, y_train):
        """Train on donor data."""
        self.model = LinearRegression().fit(X_train, y_train)
        return self

    def predict(self, X_test):
        """Impute on recipient data."""
        return self.model.predict(X_test)

    def get_quantile_loss(self, X_val, y_val, q=0.5):
        """Compute validation (pinball) loss at quantile q."""
        diff = y_val - self.model.predict(X_val)
        return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))
```
### Usage in policyengine-us-data
**To see how data pipeline uses microimpute:**
```bash
cd ../policyengine-us-data
# Find usage
grep -r "microimpute" policyengine_us_data/
grep -r "Imputer" policyengine_us_data/
```
**Typical workflow:**
1. Load CPS (has demographics, missing capital gains details)
2. Load IRS PUF (has complete tax data)
3. Use microimpute to predict missing CPS variables from PUF patterns
4. Validate imputation quality
5. Save enhanced dataset
### Testing
**Run tests:**
```bash
make test
# Or
pytest tests/ -v --cov=microimpute
```
**To see test patterns:**
```bash
cat tests/test_imputer.py
cat tests/test_methods.py
```
## Common Patterns
### Pattern 1: Basic Imputation
```python
from microimpute import Imputer
# Create imputer
imputer = Imputer(method='quantile_forest')
# Fit on donor (complete data)
imputer.fit(
    donor=donor_df,
    donor_target='target_variable',
    common_vars=['age', 'income', 'state']
)
# Predict on recipient (missing target_variable)
recipient_imputed = imputer.predict(recipient_df)
```
### Pattern 2: Choosing Best Method
```python
from microimpute import compare_methods
# Test multiple methods
methods = ['linear', 'random_forest', 'quantile_forest', 'xgboost']
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='target',
    common_vars=common_vars,
    methods=methods
)
# Use method with lowest quantile loss
best_method = results.sort_values('quantile_loss').iloc[0]['method']
```
### Pattern 3: Multiple Variable Imputation
```python
# Impute several variables
variables_to_impute = [
    'short_term_capital_gains',
    'long_term_capital_gains',
    'qualified_dividends'
]
for var in variables_to_impute:
    imputer = Imputer(method='quantile_forest')
    imputer.fit(donor=irs_puf, donor_target=var, common_vars=common_vars)
    cps[var] = imputer.predict(cps)
```
## Advanced Features
### Hyperparameter Tuning
**Built-in Optuna integration:**
```python
from microimpute import tune_hyperparameters
# Automatically find best hyperparameters
best_params, study = tune_hyperparameters(
    donor=donor,
    target_var='target',
    common_vars=common_vars,
    method='quantile_forest',
    n_trials=100
)
# Use tuned parameters
imputer = Imputer(method='quantile_forest', **best_params)
```
### Cross-Validation
**Validate imputation quality:**
```python
from microimpute import Imputer
from sklearn.model_selection import train_test_split
# Illustrative holdout check: hold out part of the donor, impute it, compare to the truth
donor_train, donor_holdout = train_test_split(donor, test_size=0.2, random_state=0)
imputer = Imputer(method='quantile_forest')
imputer.fit(donor=donor_train, donor_target='capital_gains', common_vars=['income', 'age'])
holdout_imputed = imputer.predict(donor_holdout.drop(columns=['capital_gains']))
# Measure accuracy against the known held-out values
error = (holdout_imputed['capital_gains'] - donor_holdout['capital_gains']).abs().mean()
```
### Visualization
**Plot imputation results:**
```python
import plotly.express as px
# Compare imputed vs actual (on donor validation set)
fig = px.scatter(
    x=actual_values,
    y=imputed_values,
    labels={'x': 'Actual', 'y': 'Imputed'}
)
# Add a 45-degree reference line (perfect imputation falls on this line)
lo, hi = min(actual_values), max(actual_values)
fig.add_shape(type='line', x0=lo, y0=lo, x1=hi, y1=hi)
```
## Statistical Background
**Imputation preserves:**
- Marginal distributions (imputed variable distribution matches donor)
- Conditional relationships (imputation depends on common variables)
- Uncertainty (quantile methods preserve full distribution)
**Trade-offs:**
- **Linear:** Fast, but assumes linear relationships
- **Random forest:** Handles non-linearity, may overfit
- **Quantile forest:** Preserves full distribution, slower
- **XGBoost:** High accuracy, requires tuning
## Integration with PolicyEngine
**Full pipeline (policyengine-us-data):**
```
1. Load CPS survey data
2. microimpute: Fill missing variables from IRS PUF
3. microcalibrate: Adjust weights to match benchmarks
4. Validation: Check against administrative totals
5. Package: Distribute enhanced dataset
6. PolicyEngine: Use for population simulations
```
## Comparison to Other Methods
**MicroImpute vs traditional imputation:**
**Traditional (mean imputation):**
- Fast but destroys distribution
- All missing values get same value
- Underestimates variance
**MicroImpute (ML methods):**
- Preserves relationships
- Different predictions per record
- Maintains distribution shape
**Quantile forest advantage:**
- Predicts full conditional distribution
- Not just point estimates
- Can sample from predicted distribution
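**Illustrative sketch of that last point** (uses the `quantile-forest` package that microimpute depends on, not microimpute's own API; `donor` and `recipient` follow the Quick Example above):
```python
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

qrf = RandomForestQuantileRegressor(n_estimators=200, random_state=0)
qrf.fit(donor[['income', 'age']], donor['capital_gains'])

# Predict a grid of conditional quantiles for each recipient record
grid = np.linspace(0.05, 0.95, 19)
pred = qrf.predict(recipient[['income', 'age']], quantiles=list(grid))  # shape (n, 19)

# Draw a random quantile per record so imputed values keep the conditional spread
rng = np.random.default_rng(0)
imputed = pred[np.arange(len(pred)), rng.integers(0, len(grid), size=len(pred))]
```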
## Performance Tips
**For large datasets:**
```python
# Use random forest (faster than quantile forest)
imputer = Imputer(method='random_forest')
# Or subsample donor
donor_sample = donor.sample(n=10000, random_state=42)
imputer.fit(donor=donor_sample, ...)
```
**For high accuracy:**
```python
# Use quantile forest with tuning
best_params, _ = tune_hyperparameters(...)
imputer = Imputer(method='quantile_forest', **best_params)
```
## Related Skills
- **l0-skill** - Regularization techniques
- **microcalibrate-skill** - Survey calibration (next step after imputation)
- **policyengine-us-data-skill** - Complete data pipeline
- **microdf-skill** - Working with imputed/calibrated data
## Resources
**Repository:** https://github.com/PolicyEngine/microimpute
**PyPI:** https://pypi.org/project/microimpute/
**Documentation:** See README and docstrings in source