Initial commit
277
skills/data-science/l0-skill/SKILL.md
Normal file
@@ -0,0 +1,277 @@
---
name: l0
description: L0 regularization for neural network sparsification and intelligent sampling - used in survey calibration
---

# L0 Regularization

L0 is a PyTorch implementation of L0 regularization for neural network sparsification and intelligent sampling, used in PolicyEngine's survey calibration pipeline.

## For Users 👥

### What is L0?

L0 regularization helps PolicyEngine create more efficient survey datasets by intelligently selecting which households to include in calculations.

**Impact you see:**
- Faster population impact calculations
- Smaller dataset sizes
- Maintained accuracy with fewer samples

**Behind the scenes:**
When PolicyEngine shows population-wide impacts, L0 helps select representative households from the full survey, reducing computation time while maintaining accuracy.

## For Analysts 📊

### What L0 Does

L0 provides intelligent sampling gates for:
- **Household selection** - Choose representative samples from CPS
- **Feature selection** - Identify important variables
- **Sparse weighting** - Create compact, efficient datasets

**Used in PolicyEngine for:**
- Survey calibration (via microcalibrate)
- Dataset sparsification in policyengine-us-data
- Efficient microsimulation

### Installation

```bash
pip install l0-python
```

### Quick Example: Sample Selection

```python
from l0 import SampleGate

# Select 1,000 households from 10,000
gate = SampleGate(n_samples=10000, target_samples=1000)
selected_data, indices = gate.select_samples(data)

# Gates learn which samples are most informative
```

### Integration with microcalibrate

```python
from l0 import HardConcrete
from microcalibrate import Calibration

# L0 gates for household selection
gates = HardConcrete(
    len(household_weights),
    temperature=0.25,
    init_mean=0.999  # Start with most households
)

# Use in calibration
# microcalibrate applies gates during weight optimization
```

## For Contributors 💻

### Repository

**Location:** PolicyEngine/L0

**Clone:**
```bash
git clone https://github.com/PolicyEngine/L0
cd L0
```

### Current Implementation

**To see structure:**
```bash
tree l0/

# Key modules:
ls l0/
# - hard_concrete.py - Core L0 distribution
# - layers.py - L0Linear, L0Conv2d
# - gates.py - Sample/feature gates
# - penalties.py - L0/L2 penalty computation
# - temperature.py - Temperature scheduling
```

**To see specific implementations:**
```bash
# Hard Concrete distribution (core algorithm)
cat l0/hard_concrete.py

# Sample gates (used in calibration)
cat l0/gates.py

# Neural network layers
cat l0/layers.py
```

### Key Concepts

**Hard Concrete Distribution:**
- Differentiable approximation of L0 norm
- Allows gradient-based optimization
- Temperature controls how close gates are to binary (lower = closer to exact 0 or 1)

**To see implementation:**
```bash
cat l0/hard_concrete.py
```

**Sample Gates:**
- Binary gates for sample selection
- Learn which samples are most informative
- Used in microcalibrate for household selection

**Feature Gates:**
- Select important features/variables
- Reduce dimensionality
- Maintain prediction accuracy

### Usage in PolicyEngine

**In microcalibrate (survey calibration):**
```python
from l0 import HardConcrete

# Create gates for household selection
gates = HardConcrete(
    n_items=len(households),
    temperature=0.25,
    init_mean=0.999  # Start with almost all households
)

# Gates produce values between 0 and 1
probs = gates()

# Apply to weights during calibration
masked_weights = weights * probs
```

**In policyengine-us-data:**
```bash
# See usage in data pipeline
grep -r "from l0 import" ../policyengine-us-data/
```

### Temperature Scheduling

**Controls sparsity over training:**
```python
from l0 import TemperatureScheduler, update_temperatures

scheduler = TemperatureScheduler(
    initial_temp=1.0,  # Start relaxed
    final_temp=0.1,  # End sparse
    total_epochs=100
)

for epoch in range(100):
    temp = scheduler.get_temperature(epoch)
    update_temperatures(model, temp)
    # ... training ...
```

**To see implementation:**
```bash
cat l0/temperature.py
```
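
How the temperature moves from `initial_temp` to `final_temp` depends on the scheduler. The sketch below shows one common choice, exponential (geometric) annealing, in plain Python. The function name and the schedule itself are assumptions for illustration; the `l0` package's `TemperatureScheduler` may use a different rule, so check `l0/temperature.py` for the actual behavior.

```python
def exponential_temperature(epoch, total_epochs, initial_temp=1.0, final_temp=0.1):
    """Hypothetical helper: geometric interpolation between two temperatures."""
    if total_epochs <= 1:
        return final_temp
    # Clamp the epoch into [0, total_epochs - 1] and map it to [0, 1].
    progress = min(max(epoch, 0), total_epochs - 1) / (total_epochs - 1)
    # Equal ratios between consecutive epochs (a straight line in log space).
    return initial_temp * (final_temp / initial_temp) ** progress


# Temperatures at the start, middle, and end of a 100-epoch run.
schedule = [exponential_temperature(e, 100) for e in (0, 50, 99)]
print(schedule)
```

A geometric schedule spends more epochs at low temperatures than a linear one, which gives the gates more time to settle near 0 or 1.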

### L0L2 Combined Penalty

**Prevents overfitting:**
```python
from l0 import compute_l0l2_penalty

# Combine L0 (sparsity) with L2 (regularization)
penalty = compute_l0l2_penalty(
    model,
    l0_lambda=1e-3,  # Sparsity strength
    l2_lambda=1e-4  # Weight regularization
)

loss = task_loss + penalty
```

### Testing

**Run tests:**
```bash
make test

# Or
pytest tests/ -v --cov=l0
```

**To see test patterns:**
```bash
cat tests/test_hard_concrete.py
cat tests/test_gates.py
```

## Advanced Usage

### Hybrid Gates (L0 + Random)

```python
from l0 import HybridGate

# Combine L0 selection with random sampling
hybrid = HybridGate(
    n_items=10000,
    l0_fraction=0.25,  # 25% from L0
    random_fraction=0.75,  # 75% random
    target_items=1000
)

selected, indices, types = hybrid.select(data)
```

### Feature Selection

```python
from l0 import FeatureGate

# Select top features
gate = FeatureGate(n_features=1000, max_features=50)
selected_data, feature_indices = gate.select_features(data)

# Get feature importance
importance = gate.get_feature_importance()
```

## Mathematical Background

**L0 norm:**
- Counts non-zero elements
- Non-differentiable (discontinuous)
- Hard to optimize directly

**Hard Concrete relaxation:**
- Continuous, differentiable approximation
- Enables gradient descent
- "Stretches" binary distribution to allow gradients

**Paper:**
Louizos, Welling, & Kingma (2017): "Learning Sparse Neural Networks through L0 Regularization"
https://arxiv.org/abs/1712.01312
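
As a rough illustration of the relaxation, the sketch below samples a single Hard Concrete gate and computes the probability that it is non-zero, following the formulas in the paper. It uses only the standard library. The function names are hypothetical, and the default stretch parameters (`gamma=-0.1`, `zeta=1.1`, `beta=2/3`) are the values suggested in the paper, not values read from the `l0` package.

```python
import math
import random


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, beta=2 / 3, gamma=-0.1, zeta=1.1, rng=random):
    """Draw one Hard Concrete sample in [0, 1] (hypothetical helper)."""
    u = rng.uniform(1e-6, 1 - 1e-6)  # Uniform noise, kept away from 0 and 1
    # Binary Concrete sample with location log_alpha and temperature beta
    s = sigmoid((math.log(u) - math.log(1 - u) + log_alpha) / beta)
    # Stretch to (gamma, zeta), then clip so exact 0 and 1 have positive mass
    stretched = s * (zeta - gamma) + gamma
    return min(1.0, max(0.0, stretched))


def prob_nonzero(log_alpha, beta=2 / 3, gamma=-0.1, zeta=1.1):
    """Probability that the gate is non-zero; summed over gates, this is the
    differentiable surrogate for the L0 norm."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))


random.seed(0)
print([round(sample_gate(2.0), 3) for _ in range(5)])
print(prob_nonzero(2.0))
```

Summing `prob_nonzero` over all gates gives the expected number of active items, which is the quantity an L0 penalty pushes down during training.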

## Related Packages

**Uses L0:**
- microcalibrate (survey weight calibration)
- policyengine-us-data (household selection)

**See also:**
- **microcalibrate-skill** - Survey calibration using L0
- **policyengine-us-data-skill** - Data pipeline integration

## Resources

**Repository:** https://github.com/PolicyEngine/L0
**Documentation:** https://policyengine.github.io/L0/
**Paper:** https://arxiv.org/abs/1712.01312
**PyPI:** https://pypi.org/project/l0-python/
404
skills/data-science/microcalibrate-skill/SKILL.md
Normal file
@@ -0,0 +1,404 @@
---
name: microcalibrate
description: Survey weight calibration to match population targets - used in policyengine-us-data for enhanced microdata
---

# MicroCalibrate

MicroCalibrate calibrates survey weights to match population targets, with L0 regularization for sparsity and automatic hyperparameter tuning.

## For Users 👥

### What is MicroCalibrate?

When you see PolicyEngine population impacts, the underlying data has been "calibrated" using MicroCalibrate to match official population statistics.

**What calibration does:**
- Adjusts survey weights to match known totals (population, income, employment)
- Creates representative datasets
- Reduces dataset size while maintaining accuracy
- Ensures PolicyEngine estimates match administrative data

**Example:**
- Census says US has 331 million people
- Survey has 100,000 households representing the population
- MicroCalibrate adjusts weights so survey totals match census totals
- Result: More accurate PolicyEngine calculations

## For Analysts 📊

### Installation

```bash
pip install microcalibrate
```

### What MicroCalibrate Does

**Calibration problem:**
You have survey data with initial weights, and you know certain population totals (benchmarks). Calibration adjusts weights so weighted survey totals match benchmarks.

**Example:**
```python
from microcalibrate import Calibration
import numpy as np
import pandas as pd

# Survey data (1,000 households)
weights = np.ones(1000)  # Initial weights

# Estimates (how much each household contributes to targets)
estimate_matrix = pd.DataFrame({
    'total_income': household_incomes,  # Each household's income
    'total_employed': household_employment  # 1 if employed, 0 if not
})

# Known population targets (benchmarks)
targets = np.array([
    50_000_000,  # Total income in population
    600,  # Total employed people
])

# Calibrate
cal = Calibration(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01  # Sparsity penalty
)

# Optimize weights
new_weights = cal.calibrate(max_iter=1000)

# Check results
achieved = (estimate_matrix.values.T @ new_weights)
print(f"Target: {targets}")
print(f"Achieved: {achieved}")
print(f"Non-zero weights: {(new_weights > 0).sum()} / {len(weights)}")
```
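
For intuition, the "Check results" step in the example is a weighted column sum compared against each target. The dependency-free sketch below reproduces that arithmetic on three made-up households. The numbers and the helper name are illustrative assumptions, not output from microcalibrate.

```python
def weighted_totals(weights, rows):
    """Sum each column of `rows`, weighting every row by its survey weight."""
    n_targets = len(rows[0])
    return [
        sum(w * row[j] for w, row in zip(weights, rows))
        for j in range(n_targets)
    ]


# Each row is one household: (income, employed indicator)
rows = [(30_000, 1), (50_000, 1), (0, 0)]
weights = [2.0, 1.0, 3.0]
targets = [110_000, 3]

achieved = weighted_totals(weights, rows)
relative_errors = [abs(a - t) / t for a, t in zip(achieved, targets)]

print(achieved)         # [110000.0, 3.0]
print(relative_errors)  # [0.0, 0.0]
```

Here the weights already hit both targets exactly. Calibration searches for weights with this property when the initial ones miss.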

### L0 Regularization for Sparsity

**Why sparsity matters:**
- Reduces dataset size (fewer households to simulate)
- Faster PolicyEngine calculations
- Easier to validate and understand

**L0 penalty:**
```python
# L0 encourages many weights to be exactly zero
cal = Calibration(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01  # Higher = more sparse
)
```

**To see impact:**
```python
# Without L0
cal_dense = Calibration(..., l0_lambda=0.0)
weights_dense = cal_dense.calibrate()

# With L0
cal_sparse = Calibration(..., l0_lambda=0.01)
weights_sparse = cal_sparse.calibrate()

print(f"Dense: {(weights_dense > 0).sum()} households")
print(f"Sparse: {(weights_sparse > 0).sum()} households")
# Sparse might use 60% fewer households while matching same targets
```

### Automatic Hyperparameter Tuning

**Find optimal l0_lambda:**
```python
from microcalibrate import tune_hyperparameters

# Find best l0_lambda using cross-validation
best_lambda, results = tune_hyperparameters(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    lambda_min=1e-4,
    lambda_max=1e-1,
    n_trials=50
)

print(f"Best lambda: {best_lambda}")
```
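
Penalty strengths that span several orders of magnitude, such as `1e-4` to `1e-1`, are usually searched on a log scale so that each decade gets equal attention. The sketch below draws log-uniform candidates and keeps the one with the lowest score. Both helpers and the toy score function are hypothetical, and whether `tune_hyperparameters` samples this way internally is an assumption, not something this document confirms.

```python
import math
import random


def sample_log_uniform(low, high, n, seed=0):
    """Draw n candidates whose logarithms are uniform on [log(low), log(high)]."""
    rng = random.Random(seed)
    lo, hi = math.log(low), math.log(high)
    return [math.exp(rng.uniform(lo, hi)) for _ in range(n)]


def pick_best(candidates, score):
    """Return the candidate with the lowest score (lower is better)."""
    return min(candidates, key=score)


candidates = sample_log_uniform(1e-4, 1e-1, n=50)

# Toy score standing in for a real validation error; it is smallest near 0.01.
best = pick_best(candidates, score=lambda lam: abs(math.log10(lam) + 2))
print(f"Best lambda: {best:.4g}")
```

In a real search, `score` would run a calibration with the candidate `l0_lambda` and return a held-out error.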

### Robustness Evaluation

**Test calibration stability:**
```python
from microcalibrate import evaluate_robustness

# Holdout validation
robustness = evaluate_robustness(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix,
    l0_lambda=0.01,
    n_folds=5
)

print(f"Mean error: {robustness['mean_error']}")
print(f"Std error: {robustness['std_error']}")
```

### Interactive Dashboard

**Visualize calibration:**
https://microcalibrate.vercel.app/

Features:
- Upload survey data
- Set targets
- Tune hyperparameters
- View results
- Download calibrated weights

## For Contributors 💻

### Repository

**Location:** PolicyEngine/microcalibrate

**Clone:**
```bash
git clone https://github.com/PolicyEngine/microcalibrate
cd microcalibrate
```

### Current Implementation

**To see structure:**
```bash
tree microcalibrate/

# Key modules:
ls microcalibrate/
# - calibration.py - Main Calibration class
# - hyperparameter_tuning.py - Optuna integration
# - evaluation.py - Robustness testing
# - target_analysis.py - Target diagnostics
```

**To see specific implementations:**
```bash
# Main calibration algorithm
cat microcalibrate/calibration.py

# Hyperparameter tuning
cat microcalibrate/hyperparameter_tuning.py

# Robustness evaluation
cat microcalibrate/evaluation.py
```

### Dependencies

**Required:**
- torch (PyTorch for optimization)
- l0-python (L0 regularization)
- optuna (hyperparameter tuning)
- numpy, pandas, tqdm

**To see all dependencies:**
```bash
cat pyproject.toml
```

### How MicroCalibrate Uses L0

```python
# Internal to microcalibrate
from l0 import HardConcrete

# Create gates for sample selection
gates = HardConcrete(
    n_items=len(weights),
    temperature=temperature,
    init_mean=0.999
)

# Apply gates during optimization
effective_weights = weights * gates()

# L0 penalty encourages gates → 0 or 1
# Result: Many households get weight = 0 (sparse)
```

**To see L0 integration:**
```bash
grep -n "HardConcrete\|l0" microcalibrate/calibration.py
```

### Optimization Algorithm

**Iterative reweighting:**
1. Start with initial weights
2. Apply L0 gates (select samples)
3. Optimize to match targets
4. Apply penalty for sparsity
5. Iterate until convergence

**Loss function:**
```python
# Conceptual form of the objective (count_nonzero stands in for the L0 term)

# Target matching loss
target_loss = sum((achieved_targets - desired_targets) ** 2)

# L0 penalty (number of non-zero weights)
l0_penalty = l0_lambda * count_nonzero(weights)

# Total loss
total_loss = target_loss + l0_penalty
```
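
A literal count of non-zero weights has no useful gradient, so L0-regularized methods typically replace it with the expected number of open gates under the Hard Concrete distribution. The sketch below shows that objective in plain Python, assuming one `log_alpha` parameter per household and the stretch defaults from Louizos et al. (2017). The function names and argument layout are hypothetical, and this is not microcalibrate's actual code.

```python
import math


def expected_l0(log_alphas, beta=2 / 3, gamma=-0.1, zeta=1.1):
    """Expected number of non-zero gates (differentiable L0 surrogate)."""
    shift = beta * math.log(-gamma / zeta)
    return sum(1.0 / (1.0 + math.exp(-(la - shift))) for la in log_alphas)


def calibration_loss(weights, gates, rows, targets, log_alphas, l0_lambda):
    """Squared target error plus the expected-L0 sparsity penalty."""
    effective = [w * g for w, g in zip(weights, gates)]
    achieved = [
        sum(w * row[j] for w, row in zip(effective, rows))
        for j in range(len(targets))
    ]
    target_loss = sum((a - t) ** 2 for a, t in zip(achieved, targets))
    return target_loss + l0_lambda * expected_l0(log_alphas)


# Two households, two targets; both gates fully open.
loss = calibration_loss(
    weights=[1.0, 1.0],
    gates=[1.0, 1.0],
    rows=[(1, 0), (0, 1)],
    targets=[1, 1],
    log_alphas=[3.0, 3.0],
    l0_lambda=0.01,
)
print(loss)
```

With the targets matched exactly, the only remaining loss is the penalty term, so raising `l0_lambda` increases the pressure to close gates.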

### Testing

**Run tests:**
```bash
make test

# Or
pytest tests/ -v
```

**To see test patterns:**
```bash
cat tests/test_calibration.py
cat tests/test_hyperparameter_tuning.py
```

### Usage in policyengine-us-data

**To see how data pipeline uses microcalibrate:**
```bash
cd ../policyengine-us-data

# Find usage
grep -r "microcalibrate" policyengine_us_data/
grep -r "Calibration" policyengine_us_data/
```

## Common Patterns

### Pattern 1: Basic Calibration

```python
from microcalibrate import Calibration

cal = Calibration(
    weights=initial_weights,
    targets=benchmark_values,
    estimate_matrix=contributions,
    l0_lambda=0.01
)

calibrated_weights = cal.calibrate(max_iter=1000)
```

### Pattern 2: With Hyperparameter Tuning

```python
from microcalibrate import tune_hyperparameters, Calibration

# Find best lambda
best_lambda, results = tune_hyperparameters(
    weights=weights,
    targets=targets,
    estimate_matrix=estimate_matrix
)

# Use best lambda
cal = Calibration(..., l0_lambda=best_lambda)
calibrated_weights = cal.calibrate()
```

### Pattern 3: Multi-Target Calibration

```python
import numpy as np
import pandas as pd
from microcalibrate import Calibration

# Multiple population targets
estimate_matrix = pd.DataFrame({
    'total_population': population_counts,
    'total_income': incomes,
    'total_employed': employment_indicators,
    'total_children': child_counts
})

targets = np.array([
    331_000_000,  # US population
    15_000_000_000_000,  # Total income
    160_000_000,  # Employed people
    73_000_000  # Children
])

cal = Calibration(weights, targets, estimate_matrix, l0_lambda=0.01)
```

## Performance Considerations

**Calibration speed:**
- 1,000 households, 5 targets: ~1 second
- 100,000 households, 10 targets: ~30 seconds
- Depends on: dataset size, number of targets, l0_lambda

**Memory usage:**
- PyTorch tensors for optimization
- Scales linearly with dataset size

**To profile:**
```python
import time

start = time.time()
weights = cal.calibrate()
print(f"Calibration took {time.time() - start:.1f}s")
```

## Troubleshooting

**Common issues:**

**1. Calibration not converging:**
```python
# Try:
# - More iterations
# - Lower l0_lambda
# - Better initialization

cal = Calibration(..., l0_lambda=0.001)  # Lower sparsity penalty
weights = cal.calibrate(max_iter=5000)  # More iterations
```

**2. Targets not matching:**
```python
# Check achieved vs desired
achieved = (estimate_matrix.values.T @ weights)
error = np.abs(achieved - targets) / targets
print(f"Relative errors: {error}")

# If large errors, l0_lambda may be too high
```

**3. Too sparse (all weights zero):**
```python
# Lower l0_lambda
cal = Calibration(..., l0_lambda=0.0001)
```

## Related Skills

- **l0-skill** - Understanding L0 regularization
- **policyengine-us-data-skill** - How calibration fits in data pipeline
- **microdf-skill** - Working with calibrated survey data

## Resources

**Repository:** https://github.com/PolicyEngine/microcalibrate
**Dashboard:** https://microcalibrate.vercel.app/
**PyPI:** https://pypi.org/project/microcalibrate/
**Paper:** Louizos et al. (2017) on L0 regularization
329
skills/data-science/microdf-skill/SKILL.md
Normal file
@@ -0,0 +1,329 @@
---
name: microdf
description: Weighted pandas DataFrames for survey microdata analysis - inequality, poverty, and distributional calculations
---

# MicroDF

MicroDF provides weighted pandas DataFrames and Series for analyzing survey microdata, with built-in support for inequality and poverty calculations.

## For Users 👥

### What is MicroDF?

When you see poverty rates, Gini coefficients, or distributional charts in PolicyEngine, those are calculated using MicroDF.

**MicroDF powers:**
- Poverty rate calculations (SPM)
- Inequality metrics (Gini coefficient)
- Income distribution analysis
- Weighted statistics from survey data

### Understanding the Metrics

**Gini coefficient:**
- Calculated using MicroDF from weighted income data
- Ranges from 0 (perfect equality) to 1 (perfect inequality)
- US typically around 0.48

**Poverty rates:**
- Calculated using MicroDF with weighted household data
- Compares income to poverty thresholds
- Accounts for household composition

**Percentiles:**
- MicroDF calculates weighted percentiles
- Shows income distribution (10th, 50th, 90th percentile)

## For Analysts 📊

### Installation

```bash
pip install microdf-python
```

### Quick Start

```python
import microdf as mdf
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'income': [10000, 20000, 30000, 40000, 50000],
    'weights': [1, 2, 3, 2, 1]
})

# Create MicroDataFrame
mdf_df = mdf.MicroDataFrame(df, weights='weights')

# All operations are weight-aware
print(f"Weighted mean: ${mdf_df.income.mean():,.0f}")
print(f"Gini coefficient: {mdf_df.income.gini():.3f}")
```

### Common Operations

**Weighted statistics:**
```python
mdf_df.income.mean()    # Weighted mean
mdf_df.income.median()  # Weighted median
mdf_df.income.sum()     # Weighted sum
mdf_df.income.std()     # Weighted standard deviation
```

**Inequality metrics:**
```python
mdf_df.income.gini()               # Gini coefficient
mdf_df.income.top_x_pct_share(10)  # Top 10% share
mdf_df.income.top_x_pct_share(1)   # Top 1% share
```
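
To make the weighted Gini concrete, the sketch below computes it from the Lorenz curve using trapezoids, in plain Python. The function name is hypothetical and the formula is one standard definition. MicroDF's `gini()` may use a different estimator, so small numerical differences are possible.

```python
def weighted_gini(values, weights):
    """Gini coefficient as 1 minus twice the area under the weighted Lorenz curve."""
    pairs = sorted(zip(values, weights))  # Poorest first
    total_w = sum(w for _, w in pairs)
    total_v = sum(v * w for v, w in pairs)
    if total_w <= 0 or total_v <= 0:
        raise ValueError("weights and weighted total must be positive")

    cum_v = 0.0
    area = 0.0  # Accumulates twice the area under the Lorenz curve
    for v, w in pairs:
        prev_share = cum_v / total_v
        cum_v += v * w
        # Trapezoid: population-share width times sum of the two income-share heights
        area += (w / total_w) * (prev_share + cum_v / total_v)
    return 1.0 - area


print(weighted_gini([10000, 20000, 30000, 40000, 50000], [1, 2, 3, 2, 1]))
```

Equal incomes give 0 regardless of the weights, and concentrating all income in one of two equally weighted units gives 0.5.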

**Poverty analysis:**
```python
# Poverty rate (income < threshold)
poverty_rate = mdf_df.poverty_rate(
    income_measure='income',
    threshold=poverty_line
)

# Poverty gap (how far below threshold)
poverty_gap = mdf_df.poverty_gap(
    income_measure='income',
    threshold=poverty_line
)

# Deep poverty (income < 50% of threshold)
deep_poverty_rate = mdf_df.deep_poverty_rate(
    income_measure='income',
    threshold=poverty_line,
    deep_poverty_line=0.5
)
```
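
The three measures reduce to simple weighted sums. The sketch below spells them out in plain Python on made-up data. The helper names are hypothetical, and the poverty gap is defined here as the weighted total shortfall in currency units, which is an assumption; MicroDF's own definitions may differ in detail.

```python
def poverty_rate(incomes, weights, threshold):
    """Share of total weight held by units with income below the threshold."""
    poor_weight = sum(w for y, w in zip(incomes, weights) if y < threshold)
    return poor_weight / sum(weights)


def poverty_gap(incomes, weights, threshold):
    """Weighted total shortfall of poor units below the threshold."""
    return sum(w * (threshold - y) for y, w in zip(incomes, weights) if y < threshold)


def deep_poverty_rate(incomes, weights, threshold, deep_fraction=0.5):
    """Poverty rate measured against a fraction of the threshold."""
    return poverty_rate(incomes, weights, threshold * deep_fraction)


incomes = [5_000, 12_000, 30_000]
weights = [1, 2, 7]
poverty_line = 15_000

print(poverty_rate(incomes, weights, poverty_line))
print(poverty_gap(incomes, weights, poverty_line))
print(deep_poverty_rate(incomes, weights, poverty_line))
```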

**Quantiles:**
```python
# Deciles
mdf_df.income.decile_values()

# Quintiles
mdf_df.income.quintile_values()

# Custom quantiles
mdf_df.income.quantile(0.25)  # 25th percentile
```
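
A weighted quantile is the value at which the cumulative weight first reaches the requested share of the total. The sketch below implements that definition in plain Python. The function name is hypothetical, and since several quantile conventions exist (with and without interpolation), MicroDF's result may differ slightly.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share is at least q."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    pairs = sorted(zip(values, weights))
    cutoff = q * sum(w for _, w in pairs)

    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= cutoff:
            return value
    return pairs[-1][0]  # Guard against floating-point shortfall


incomes = [10000, 20000, 30000, 40000, 50000]
weights = [1, 2, 3, 2, 1]
print(weighted_quantile(incomes, weights, 0.5))  # 30000
```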

### MicroSeries

```python
# Extract a Series with weights
income_series = mdf_df.income  # This is a MicroSeries

# MicroSeries operations
income_series.mean()
income_series.gini()
income_series.percentile(50)
```

### Working with PolicyEngine Results

```python
import microdf as mdf
import pandas as pd
from policyengine_us import Simulation

# Run simulation with axes (multiple households)
situation_with_axes = {...}  # See policyengine-us-skill
sim = Simulation(situation=situation_with_axes)

# Get results as arrays
incomes = sim.calculate("household_net_income", 2024)
weights = sim.calculate("household_weight", 2024)

# Create MicroDataFrame
df = pd.DataFrame({'income': incomes, 'weight': weights})
mdf_df = mdf.MicroDataFrame(df, weights='weight')

# Calculate metrics
gini = mdf_df.income.gini()
poverty_rate = mdf_df.poverty_rate('income', threshold=15000)

print(f"Gini: {gini:.3f}")
print(f"Poverty rate: {poverty_rate:.1%}")
```

## For Contributors 💻

### Repository

**Location:** PolicyEngine/microdf

**Clone:**
```bash
git clone https://github.com/PolicyEngine/microdf
cd microdf
```

### Current Implementation

**To see current API:**
```bash
# Main classes
cat microdf/microframe.py   # MicroDataFrame
cat microdf/microseries.py  # MicroSeries

# Key modules
cat microdf/generic.py     # Generic weighted operations
cat microdf/inequality.py  # Gini, top shares
cat microdf/poverty.py     # Poverty metrics
```

**To see all methods:**
```bash
# MicroDataFrame methods
grep "def " microdf/microframe.py

# MicroSeries methods
grep "def " microdf/microseries.py
```

### Testing

**To see test patterns:**
```bash
ls tests/
cat tests/test_microframe.py
```

**Run tests:**
```bash
make test

# Or
pytest tests/ -v
```

### Contributing

**Before contributing:**
1. Check if method already exists
2. Ensure it's weighted correctly
3. Add tests
4. Follow policyengine-standards-skill

**Common contributions:**
- New inequality metrics
- New poverty measures
- Performance optimizations
- Bug fixes

## Advanced Patterns

### Custom Aggregations

```python
# Define custom weighted aggregation
def weighted_operation(series, weights):
    return (series * weights).sum() / weights.sum()

# Apply to MicroSeries
result = weighted_operation(mdf_df.income, mdf_df.weights)
```

### Groupby Operations

```python
# Group by with weights
grouped = mdf_df.groupby('state')
state_means = grouped.income.mean()  # Weighted means by state
```

### Inequality Decomposition

**To see decomposition methods:**
```bash
grep -r -A 20 "def.*decomp" microdf/
```

## Integration Examples

### Example 1: PolicyEngine Blog Post Analysis

```python
# Pattern from PolicyEngine blog posts
import microdf as mdf
import pandas as pd

# Get simulation results (baseline_sim and reform_sim are existing Simulation objects)
baseline_income = baseline_sim.calculate("household_net_income", 2024)
reform_income = reform_sim.calculate("household_net_income", 2024)
weights = baseline_sim.calculate("household_weight", 2024)

# Create MicroDataFrame
df = pd.DataFrame({
    'baseline_income': baseline_income,
    'reform_income': reform_income,
    'weight': weights
})
mdf_df = mdf.MicroDataFrame(df, weights='weight')

# Calculate impacts
baseline_gini = mdf_df.baseline_income.gini()
reform_gini = mdf_df.reform_income.gini()

print(f"Gini change: {reform_gini - baseline_gini:+.4f}")
```

### Example 2: Poverty Analysis

```python
# Calculate poverty under baseline and reform
import microdf as mdf
import pandas as pd
from policyengine_us import Simulation

baseline_sim = Simulation(situation=situation)
reform_sim = Simulation(situation=situation, reform=reform)

# Get incomes
baseline_income = baseline_sim.calculate("spm_unit_net_income", 2024)
reform_income = reform_sim.calculate("spm_unit_net_income", 2024)
spm_threshold = baseline_sim.calculate("spm_unit_poverty_threshold", 2024)
weights = baseline_sim.calculate("spm_unit_weight", 2024)

# Calculate poverty rates
df_baseline = mdf.MicroDataFrame(
    pd.DataFrame({'income': baseline_income, 'threshold': spm_threshold, 'weight': weights}),
    weights='weight'
)

poverty_baseline = (df_baseline.income < df_baseline.threshold).mean()  # Weighted

# Same calculation for the reform
df_reform = mdf.MicroDataFrame(
    pd.DataFrame({'income': reform_income, 'threshold': spm_threshold, 'weight': weights}),
    weights='weight'
)

poverty_reform = (df_reform.income < df_reform.threshold).mean()  # Weighted

print(f"Poverty reduction: {(poverty_baseline - poverty_reform):.1%}")
```

## Package Status

**Maturity:** Stable, production-ready
**API stability:** Stable (rarely breaking changes)
**Performance:** Optimized for large datasets

**To see version:**
```bash
pip show microdf-python
```

**To see changelog:**
```bash
cat CHANGELOG.md  # In microdf repo
```

## Related Skills

- **policyengine-us-skill** - Generating data for microdf analysis
- **policyengine-analysis-skill** - Using microdf in policy analysis
- **policyengine-us-data-skill** - Data sources for microdf

## Resources

**Repository:** https://github.com/PolicyEngine/microdf
**PyPI:** https://pypi.org/project/microdf-python/
**Issues:** https://github.com/PolicyEngine/microdf/issues
415
skills/data-science/microimpute-skill/SKILL.md
Normal file
@@ -0,0 +1,415 @@
---
name: microimpute
description: ML-based variable imputation for survey data - used in policyengine-us-data to fill missing values
---

# MicroImpute

MicroImpute enables ML-based variable imputation through different statistical methods, with comparison and benchmarking capabilities.

## For Users 👥

### What is MicroImpute?

When PolicyEngine calculates population impacts, the underlying survey data has missing information. MicroImpute uses machine learning to fill in those gaps intelligently.

**What imputation does:**
- Fills missing data in surveys
- Uses machine learning to predict missing values
- Maintains statistical relationships
- Improves PolicyEngine accuracy

**Example:**
- Survey asks about income but not capital gains breakdown
- MicroImpute predicts short-term vs long-term capital gains
- Based on patterns from IRS data
- Result: More accurate tax calculations

**You benefit from imputation when:**
- PolicyEngine calculates capital gains tax accurately
- Benefits eligibility uses complete household information
- State-specific calculations have all needed data

## For Analysts 📊

### Installation

```bash
pip install microimpute

# With image export (for plots); quotes keep shells like zsh from expanding the brackets
pip install "microimpute[images]"
```

### What MicroImpute Does

**Imputation problem:**
- Donor dataset has complete information (e.g., IRS tax records)
- Recipient dataset has missing variables (e.g., CPS survey)
- Imputation predicts missing values in recipient using donor patterns

**Methods available:**
- Linear regression
- Random forest
- Quantile forest (preserves full distribution)
- XGBoost
- Hot deck (traditional matching)
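
Of these methods, hot deck is the easiest to show in a few lines: each recipient record borrows the target value of its most similar donor. The sketch below is a toy nearest-neighbor variant in plain Python. The function name and the distance measure are assumptions for illustration, and this is not MicroImpute's implementation.

```python
def hot_deck_impute(donor_rows, recipient_rows, common_vars, target):
    """Copy `target` from the closest donor (squared Euclidean distance)."""
    imputed = []
    for rec in recipient_rows:
        nearest = min(
            donor_rows,
            key=lambda d: sum((d[v] - rec[v]) ** 2 for v in common_vars),
        )
        # Keep the recipient's own fields and add the borrowed value
        imputed.append({**rec, target: nearest[target]})
    return imputed


donor = [
    {"income": 50000, "age": 30, "capital_gains": 5000},
    {"income": 60000, "age": 40, "capital_gains": 8000},
    {"income": 70000, "age": 50, "capital_gains": 12000},
]
recipient = [
    {"income": 52000, "age": 33},
    {"income": 68000, "age": 48},
]

result = hot_deck_impute(donor, recipient, ["income", "age"], "capital_gains")
print(result)
```

In this unscaled form income dominates the distance. A practical version would standardize the matching variables first.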
|
||||
|
||||
### Quick Example

```python
from microimpute import Imputer
import pandas as pd

# Donor data (complete)
donor = pd.DataFrame({
    'income': [50000, 60000, 70000],
    'age': [30, 40, 50],
    'capital_gains': [5000, 8000, 12000]  # Variable to impute
})

# Recipient data (missing capital_gains)
recipient = pd.DataFrame({
    'income': [55000, 65000],
    'age': [35, 45],
    # capital_gains is missing
})

# Impute using quantile forest
imputer = Imputer(method='quantile_forest')
imputer.fit(
    donor=donor,
    donor_target='capital_gains',
    common_vars=['income', 'age']
)

recipient_imputed = imputer.predict(recipient)
# Now recipient has predicted capital_gains
```

### Method Comparison

```python
from microimpute import compare_methods

# Compare different imputation methods
results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='capital_gains',
    common_vars=['income', 'age'],
    methods=['linear', 'random_forest', 'quantile_forest']
)

# Shows quantile loss for each method
print(results)
```

### Quantile Loss (Quality Metric)

**Why quantile loss:**
- Measures how well imputation preserves the distribution
- Not just mean accuracy, but full distribution shape
- Lower is better

**Interpretation:**
```python
# Quantile loss around 0.1 = good
# Quantile loss around 0.5 = poor
# Compare across methods to choose best
```

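For reference, the standard quantile (pinball) loss at level `q` weights under-predictions by `q` and over-predictions by `1 - q`. The sketch below is a generic NumPy version; the helper name `quantile_loss` is hypothetical, and MicroImpute's own metric may average over several quantiles or normalize differently.

```python
import numpy as np


def quantile_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile level q (0 < q < 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-predictions (diff > 0) cost q * diff; over-predictions cost (1 - q) * |diff|.
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


y_true = [5000, 8000, 12000]
y_pred = [6000, 8000, 10000]

loss_median = quantile_loss(y_true, y_pred, q=0.5)
loss_upper = quantile_loss(y_true, y_pred, q=0.9)
print(loss_median, loss_upper)
```

At `q = 0.5` the loss equals half the mean absolute error. In this unnormalized form it is expressed in the units of the target variable, so values are only comparable between methods evaluated on the same target.
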
## For Contributors 💻

### Repository

**Location:** PolicyEngine/microimpute

**Clone:**
```bash
git clone https://github.com/PolicyEngine/microimpute
cd microimpute
```

### Current Implementation

**To see structure:**
```bash
tree microimpute/

# Key modules:
ls microimpute/
# - imputer.py - Main Imputer class
# - methods/ - Different imputation methods
# - comparison.py - Method benchmarking
# - utils/ - Utilities
```

**To see specific methods:**
```bash
# Quantile forest implementation
cat microimpute/methods/quantile_forest.py

# Random forest
cat microimpute/methods/random_forest.py

# Linear regression
cat microimpute/methods/linear.py
```

### Dependencies

**Required:**
- numpy, pandas (data handling)
- scikit-learn (ML models)
- quantile-forest (distributional imputation)
- optuna (hyperparameter tuning)
- statsmodels (statistical methods)
- scipy (statistical functions)

**To see all dependencies:**
```bash
cat pyproject.toml
```

### Adding New Imputation Methods

**Pattern:**
```python
# microimpute/methods/my_method.py

class MyMethodImputer:
    def fit(self, X_train, y_train):
        """Train on donor data."""
        # Fit your model
        pass

    def predict(self, X_test):
        """Impute on recipient data."""
        # Return predictions
        pass

    def get_quantile_loss(self, X_val, y_val):
        """Compute validation loss."""
        # Evaluate quality
        pass
```

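As an illustration of the pattern above, the following sketch fills in the three methods with ordinary least squares in NumPy. The class name `LeastSquaresImputer`, the `q` parameter, and the private helper are assumptions for this example; they are not taken from the MicroImpute codebase, which may require a different base class or signature.

```python
import numpy as np


class LeastSquaresImputer:
    """Hypothetical example method: ordinary least squares via NumPy."""

    def fit(self, X_train, y_train):
        """Train on donor data."""
        X = self._with_intercept(X_train)
        y = np.asarray(y_train, dtype=float)
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X_test):
        """Impute on recipient data."""
        return self._with_intercept(X_test) @ self.coef_

    def get_quantile_loss(self, X_val, y_val, q=0.5):
        """Compute validation loss as the mean pinball loss at level q."""
        diff = np.asarray(y_val, dtype=float) - self.predict(X_val)
        return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

    @staticmethod
    def _with_intercept(X):
        # Prepend a column of ones so the model learns an intercept.
        X = np.asarray(X, dtype=float)
        return np.column_stack([np.ones(len(X)), X])


# Toy donor data where y = 2 * x + 1 exactly.
X_donor = [[1.0], [2.0], [3.0], [4.0]]
y_donor = [3.0, 5.0, 7.0, 9.0]

model = LeastSquaresImputer().fit(X_donor, y_donor)
prediction = model.predict([[5.0]])
print(prediction, model.get_quantile_loss(X_donor, y_donor))
```

Returning `self` from `fit` is a convenience borrowed from scikit-learn style estimators so that construction and fitting can be chained.
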
### Usage in policyengine-us-data

**To see how the data pipeline uses microimpute:**
```bash
cd ../policyengine-us-data

# Find usage
grep -r "microimpute" policyengine_us_data/
grep -r "Imputer" policyengine_us_data/
```

**Typical workflow:**
1. Load CPS (has demographics, missing capital gains details)
2. Load IRS PUF (has complete tax data)
3. Use microimpute to predict missing CPS variables from PUF patterns
4. Validate imputation quality
5. Save enhanced dataset

### Testing

**Run tests:**
```bash
make test

# Or
pytest tests/ -v --cov=microimpute
```

**To see test patterns:**
```bash
cat tests/test_imputer.py
cat tests/test_methods.py
```

## Common Patterns

### Pattern 1: Basic Imputation

```python
from microimpute import Imputer

# Create imputer
imputer = Imputer(method='quantile_forest')

# Fit on donor (complete data)
imputer.fit(
    donor=donor_df,
    donor_target='target_variable',
    common_vars=['age', 'income', 'state']
)

# Predict on recipient (missing target_variable)
recipient_imputed = imputer.predict(recipient_df)
```

### Pattern 2: Choosing Best Method

```python
from microimpute import compare_methods

# Test multiple methods
methods = ['linear', 'random_forest', 'quantile_forest', 'xgboost']

results = compare_methods(
    donor=donor,
    recipient=recipient,
    target_var='target',
    common_vars=common_vars,
    methods=methods
)

# Use method with lowest quantile loss
best_method = results.sort_values('quantile_loss').iloc[0]['method']
```

### Pattern 3: Multiple Variable Imputation

```python
from microimpute import Imputer

# Impute several variables
variables_to_impute = [
    'short_term_capital_gains',
    'long_term_capital_gains',
    'qualified_dividends'
]

# Fit one imputer per target variable on the donor (IRS PUF)
for var in variables_to_impute:
    imputer = Imputer(method='quantile_forest')
    imputer.fit(donor=irs_puf, donor_target=var, common_vars=common_vars)
    cps[var] = imputer.predict(cps)
```

## Advanced Features

### Hyperparameter Tuning

**Built-in Optuna integration:**
```python
from microimpute import Imputer, tune_hyperparameters

# Automatically find best hyperparameters
best_params, study = tune_hyperparameters(
    donor=donor,
    target_var='target',
    common_vars=common_vars,
    method='quantile_forest',
    n_trials=100
)

# Use tuned parameters
imputer = Imputer(method='quantile_forest', **best_params)
```

### Cross-Validation

**Validate imputation quality:**
```python
from sklearn.model_selection import cross_val_score

# Split donor for validation
# Impute on validation set
# Measure accuracy
```

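The three commented steps can be sketched end to end as follows. This is a self-contained illustration under stated assumptions: it uses a plain NumPy least-squares model as a stand-in for an imputer, manual k-fold splitting instead of scikit-learn's `cross_val_score`, and synthetic donor data. The function names `pinball` and `kfold_quantile_loss` are hypothetical.

```python
import numpy as np


def pinball(y_true, y_pred, q=0.5):
    """Mean pinball (quantile) loss at level q."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def kfold_quantile_loss(X, y, n_splits=5, q=0.5, seed=0):
    """Average held-out pinball loss of a least-squares imputer over k folds."""
    X = np.asarray(X, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # add intercept column
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)

    losses = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit on the training folds only, then impute the held-out donor rows.
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        losses.append(pinball(y[val_idx], X[val_idx] @ coef, q))
    return float(np.mean(losses))


# Synthetic donor data: capital gains depend linearly on income and age plus noise.
rng = np.random.default_rng(42)
income = rng.uniform(20_000, 150_000, size=200)
age = rng.uniform(20, 80, size=200)
capital_gains = 0.05 * income + 20 * age + rng.normal(0, 500, size=200)

cv_loss = kfold_quantile_loss(np.column_stack([income, age]), capital_gains)
print(f"5-fold pinball loss at q=0.5: {cv_loss:.1f}")
```

Because the target is known for every donor record, holding out part of the donor is the only way to compare imputed values against true ones; the recipient has no ground truth to score against.
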
### Visualization

**Plot imputation results:**
```python
import plotly.express as px

# Compare imputed vs actual (on donor validation set)
fig = px.scatter(
    x=actual_values,
    y=imputed_values,
    labels={'x': 'Actual', 'y': 'Imputed'}
)

# 45-degree reference line: points on it were imputed exactly
lo, hi = min(actual_values), max(actual_values)
fig.add_shape(type='line', x0=lo, y0=lo, x1=hi, y1=hi)
```

## Statistical Background

**Imputation preserves:**
- Marginal distributions (imputed variable distribution matches donor)
- Conditional relationships (imputation depends on common variables)
- Uncertainty (quantile methods preserve full distribution)

**Trade-offs:**
- **Linear:** Fast, but assumes linear relationships
- **Random forest:** Handles non-linearity, may overfit
- **Quantile forest:** Preserves full distribution, slower
- **XGBoost:** High accuracy, requires tuning

## Integration with PolicyEngine

**Full pipeline (policyengine-us-data):**
```
1. Load CPS survey data
   ↓
2. microimpute: Fill missing variables from IRS PUF
   ↓
3. microcalibrate: Adjust weights to match benchmarks
   ↓
4. Validation: Check against administrative totals
   ↓
5. Package: Distribute enhanced dataset
   ↓
6. PolicyEngine: Use for population simulations
```

## Comparison to Other Methods

**MicroImpute vs traditional imputation:**

**Traditional (mean imputation):**
- Fast but destroys distribution
- All missing values get same value
- Underestimates variance

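The variance claim can be checked numerically. The sketch below uses synthetic data and, as a simple stand-in for a distribution-preserving method, a random hot deck that copies observed values at random; it does not call MicroImpute, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)  # synthetic skewed variable

# Mark roughly 40% of the records as missing at random.
missing = rng.random(values.size) < 0.4
observed = values[~missing]

# Mean imputation: every missing record receives the same value.
mean_imputed = values.copy()
mean_imputed[missing] = observed.mean()

# Random hot deck: each missing record copies a randomly chosen observed value.
hot_deck_imputed = values.copy()
hot_deck_imputed[missing] = rng.choice(observed, size=missing.sum())

print(f"true std:        {values.std():.0f}")
print(f"mean-imputed:    {mean_imputed.std():.0f}")
print(f"random hot deck: {hot_deck_imputed.std():.0f}")
```

The mean-imputed column always has a smaller spread than the complete data, because the filled-in records contribute no deviation from the mean at all.
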
**MicroImpute (ML methods):**
- Preserves relationships
- Different predictions per record
- Maintains distribution shape

**Quantile forest advantage:**
- Predicts full conditional distribution
- Not just point estimates
- Can sample from predicted distribution

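One common way to sample from a predicted distribution, given a handful of predicted quantiles for a record, is inverse-CDF sampling with linear interpolation between quantile levels. The sketch below assumes the quantile predictions are already available as an array; the numbers and the function name `sample_from_quantiles` are hypothetical, and this is not necessarily how MicroImpute draws its samples.

```python
import numpy as np


def sample_from_quantiles(levels, quantile_values, n_draws, rng):
    """Inverse-CDF sampling from a piecewise-linear quantile function."""
    # Draw uniform levels within the predicted range, then map them to values.
    u = rng.uniform(levels[0], levels[-1], size=n_draws)
    return np.interp(u, levels, quantile_values)


# Hypothetical predicted quantiles of capital gains for one record.
levels = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
quantile_values = np.array([0.0, 1_000.0, 4_000.0, 9_000.0, 30_000.0])

rng = np.random.default_rng(0)
draws = sample_from_quantiles(levels, quantile_values, n_draws=5, rng=rng)
print(draws)
```

Restricting the uniform draw to the range of predicted levels avoids extrapolating beyond the outermost quantiles, at the cost of truncating the tails.
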
## Performance Tips

**For large datasets:**
```python
# Use random forest (faster than quantile forest)
imputer = Imputer(method='random_forest')

# Or subsample donor
donor_sample = donor.sample(n=10000, random_state=42)
imputer.fit(donor=donor_sample, ...)
```

**For high accuracy:**
```python
# Use quantile forest with tuning
best_params, _ = tune_hyperparameters(...)
imputer = Imputer(method='quantile_forest', **best_params)
```

## Related Skills

- **l0-skill** - Regularization techniques
- **microcalibrate-skill** - Survey calibration (next step after imputation)
- **policyengine-us-data-skill** - Complete data pipeline
- **microdf-skill** - Working with imputed/calibrated data

## Resources

**Repository:** https://github.com/PolicyEngine/microimpute
**PyPI:** https://pypi.org/project/microimpute/
**Documentation:** See README and docstrings in source