Initial commit

Zhongwei Li
2025-11-30 08:47:57 +08:00
commit 9794eed0aa
25 changed files with 7709 additions and 0 deletions

@@ -0,0 +1,303 @@
---
name: policyengine-vectorization
description: PolicyEngine vectorization patterns - NumPy operations, where/select usage, avoiding scalar logic with arrays
---
# PolicyEngine Vectorization Patterns
Critical patterns for vectorized operations in PolicyEngine. Applying scalar control flow (`if`, `and`, `or`, `not`) to array-valued entity data raises a `ValueError` and crashes the microsimulation.
## The Golden Rule
**PolicyEngine processes multiple households simultaneously using NumPy arrays. NEVER use if-elif-else with entity data.**

---
## 1. Critical: What Will Crash
### ❌ NEVER: if-elif-else with Arrays
```python
# THIS WILL CRASH - household data is an array
def formula(household, period, parameters):
    income = household("income", period)
    if income > 1000:  # ❌ CRASH: "truth value of array is ambiguous"
        return 500
    else:
        return 100
```
### ✅ ALWAYS: Vectorized Operations
```python
# CORRECT - works with arrays
def formula(household, period, parameters):
    income = household("income", period)
    return where(income > 1000, 500, 100)  # ✅ Vectorized
```
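The failure can be reproduced outside PolicyEngine with plain NumPy. This is a minimal sketch that assumes PolicyEngine's `where` behaves like `numpy.where`; the income values are made up for illustration.

```python
import numpy as np

# Hypothetical incomes for three households
income = np.array([500, 1500, 3000])

# Scalar logic: `if` calls bool() on the whole comparison array
try:
    if income > 1000:
        benefit = 500
    else:
        benefit = 100
    scalar_error = None
except ValueError as err:
    scalar_error = str(err)

# Vectorized logic: one result per household
benefit = np.where(income > 1000, 500, 100)

print(scalar_error)
print(benefit)  # [100 500 500]
```

The comparison `income > 1000` yields one boolean per household, so there is no single truth value for `if` to branch on.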
---
## 2. Common Vectorization Patterns
### Pattern 1: Simple Conditions → `where()`
```python
# Instead of if-else (fails when age is an array)
if age >= 65:
    amount = senior_amount
else:
    amount = regular_amount

# Use where()
amount = where(age >= 65, senior_amount, regular_amount)
```
### Pattern 2: Multiple Conditions → `select()`
```python
# Instead of if-elif-else (fails when age is an array)
if age < 18:
    benefit = child_amount
elif age >= 65:
    benefit = senior_amount
else:
    benefit = adult_amount

# Use select()
benefit = select(
    [age < 18, age >= 65],
    [child_amount, senior_amount],
    default=adult_amount
)
```
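One detail worth knowing: when several conditions are true for the same element, `numpy.select` uses the first match, just as an `if-elif` chain would. The sketch below uses plain NumPy (assuming PolicyEngine's `select` is `numpy.select`) with illustrative ages and amounts.

```python
import numpy as np

age = np.array([10, 30, 70])
child_amount, adult_amount, senior_amount = 100, 50, 80  # illustrative values

benefit = np.select(
    [age < 18, age >= 65],
    [child_amount, senior_amount],
    default=adult_amount,
)
print(benefit)  # [100  50  80]

# Overlapping conditions: age 70 satisfies both, and the first one wins
tier = np.select([age >= 18, age >= 65], [1, 2], default=0)
print(tier)  # [0 1 1]
```

Order the conditions from most specific to least specific, the same way you would order `elif` branches.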
### Pattern 3: Boolean Operations
```python
# Combining conditions
eligible = (age >= 18) & (income < threshold) # Use & not 'and'
eligible = (is_disabled | is_elderly) # Use | not 'or'
eligible = ~is_excluded # Use ~ not 'not'
```
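The parentheses around each comparison are required, not stylistic: `&` and `|` bind more tightly than `>=` and `<`. A small NumPy sketch with made-up values shows what happens when they are dropped.

```python
import numpy as np

age = np.array([16, 30, 70])
income = np.array([0, 20_000, 5_000])
threshold = 10_000

# With parentheses: each comparison is evaluated first, then combined
eligible = (age >= 18) & (income < threshold)
print(eligible)  # [False False  True]

# Without parentheses Python parses this as
#   age >= (18 & income) < threshold
# which is a chained comparison and calls bool() on an array
try:
    broken = age >= 18 & income < threshold
    precedence_error = None
except ValueError as err:
    precedence_error = str(err)

print(precedence_error)
```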
### Pattern 4: Clipping Values
```python
# Instead of if for bounds checking
if amount < 0:
    amount = 0
elif amount > maximum:
    amount = maximum

# Use clip()
amount = clip(amount, 0, maximum)
# Or: amount = max_(0, min_(amount, maximum))
```
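As a quick check that the two spellings agree, here is a NumPy sketch. It assumes `max_` and `min_` correspond to `numpy.maximum` and `numpy.minimum`; the amounts are illustrative. The bounds can also be arrays, which is useful when each household has its own cap.

```python
import numpy as np

amount = np.array([-50.0, 120.0, 900.0])
maximum = 500.0

clipped = np.clip(amount, 0, maximum)
nested = np.maximum(0, np.minimum(amount, maximum))
print(clipped)  # [  0. 120. 500.]

# Per-household caps: the upper bound is an array of the same length
per_household_max = np.array([100.0, 100.0, 1000.0])
capped = np.clip(amount, 0, per_household_max)
print(capped)  # [  0. 100. 900.]
```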
---
## 3. When if-else IS Acceptable
### ✅ OK: Parameter-Only Conditions
```python
# OK - parameters are scalars, not arrays
def formula(entity, period, parameters):
    p = parameters(period).gov.program
    # This is fine - p.enabled is a scalar boolean
    if p.enabled:
        base = p.base_amount
    else:
        base = 0
    # But must vectorize when using entity data
    income = entity("income", period)
    return where(income < p.threshold, base, 0)
```
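The split between scalar parameters and array data can be tried without a PolicyEngine installation. In this sketch a `SimpleNamespace` stands in for the parameter node (a hypothetical stand-in, not the real parameter API), and `benefit` is an illustrative helper.

```python
from types import SimpleNamespace

import numpy as np

# Hypothetical stand-in for parameters(period).gov.program
p = SimpleNamespace(enabled=True, base_amount=200, threshold=1000)


def benefit(income, p):
    # Scalar branch: p.enabled is a single bool, so a plain conditional is safe
    base = p.base_amount if p.enabled else 0
    # Array branch: income has one value per entity, so use where()
    return np.where(income < p.threshold, base, 0)


income = np.array([500, 1500])
print(benefit(income, p))  # [200   0]

disabled = SimpleNamespace(enabled=False, base_amount=200, threshold=1000)
print(benefit(income, disabled))  # [0 0]
```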
### ✅ OK: Control Flow (Not Data)
```python
# OK - controlling which calculation to use
def formula(entity, period, parameters):
    year = period.start.year
    if year >= 2024:
        # Use new formula (still vectorized)
        return entity("new_calculation", period)
    else:
        # Use old formula (still vectorized)
        return entity("old_calculation", period)
```
---
## 4. Common Vectorization Mistakes
### Mistake 1: Scalar Comparison with Array
```python
# WRONG:
if household("income", period) > 1000:
    ...  # Error: truth value of array is ambiguous

# CORRECT:
income = household("income", period)
high_income = income > 1000  # Boolean array
benefit = where(high_income, low_benefit, high_benefit)
```
### Mistake 2: Using Python's and/or/not
```python
# WRONG:
eligible = is_elderly or is_disabled  # Python's 'or'

# CORRECT:
eligible = is_elderly | is_disabled  # NumPy's '|'
```
### Mistake 3: Nested if Statements
```python
# WRONG:
if eligible:
    if income < threshold:
        return full_benefit
    else:
        return partial_benefit
else:
    return 0

# CORRECT:
return where(
    eligible,
    where(income < threshold, full_benefit, partial_benefit),
    0
)
```
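Deeply nested `where()` calls get hard to read. The same logic can usually be flattened into a single `select()`, with the outermost check listed first. A NumPy sketch with illustrative values, assuming `where` and `select` are the NumPy functions:

```python
import numpy as np

eligible = np.array([True, True, False])
income = np.array([500, 1500, 500])
threshold, full_benefit, partial_benefit = 1000, 300, 120  # illustrative

nested = np.where(
    eligible,
    np.where(income < threshold, full_benefit, partial_benefit),
    0,
)

# Flat equivalent: ineligible first, then the income test, else partial
flat = np.select(
    [~eligible, income < threshold],
    [0, full_benefit],
    default=partial_benefit,
)

print(nested)  # [300 120   0]
print(flat)    # [300 120   0]
```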
---
## 5. Advanced Patterns
### Pattern: Vectorized Lookup Tables
```python
# Instead of if-elif for ranges
if size == 1:
    amount = 100
elif size == 2:
    amount = 150
elif size == 3:
    amount = 190

# Using parameter brackets
amount = p.benefit_schedule.calc(size)

# Or using select
amounts = [100, 150, 190, 220, 250]
amount = select(
    [size == i for i in range(1, 6)],
    amounts[:5],
    default=amounts[-1]  # sizes above 5 fall back to the 5-person amount
)
```
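Another option, not shown in the original patterns, is to index directly into an array of amounts. The sketch below is plain NumPy and assumes household sizes are integers of at least 1; unlike the `select` version, a size below 1 would map to the first amount rather than the default.

```python
import numpy as np

amounts = np.array([100, 150, 190, 220, 250])
size = np.array([1, 3, 5, 8])

# Clamp sizes into 1..5, then shift to a zero-based index
index = np.clip(size, 1, len(amounts)) - 1
amount = amounts[index]

print(amount)  # [100 190 250 250]
```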
### Pattern: Accumulating Conditions
```python
# Building complex eligibility
income_eligible = income < p.income_threshold
resource_eligible = resources < p.resource_limit
demographic_eligible = (age < 18) | is_pregnant
# Combine with & (not 'and')
eligible = income_eligible & resource_eligible & demographic_eligible
```
### Pattern: Conditional Accumulation
```python
# Sum only for eligible members
person = household.members
is_eligible = person("is_eligible", period)
person_income = person("income", period)
# Only count income of eligible members
eligible_income = where(is_eligible, person_income, 0)
total = household.sum(eligible_income)
```
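To see what this aggregation does to the numbers, here is a NumPy-only sketch. It is not PolicyEngine's implementation of `household.sum`; the `household_index` array (which household each person belongs to) and all values are hypothetical.

```python
import numpy as np

# Hypothetical mapping: person i belongs to household household_index[i]
household_index = np.array([0, 0, 1, 2, 2, 2])
is_eligible = np.array([True, False, True, True, True, False])
person_income = np.array([1000.0, 400.0, 250.0, 300.0, 700.0, 50.0])

# Zero out income of ineligible members, then sum within each household
eligible_income = np.where(is_eligible, person_income, 0)
total = np.bincount(household_index, weights=eligible_income, minlength=3)

print(total)  # [1000.  250. 1000.]
```

The result has one entry per household, while the inputs have one entry per person.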
---
## 6. Performance Implications
### Why Vectorization Matters
- **Scalar logic**: Processes 1 household at a time → SLOW
- **Vectorized**: Processes 1000s of households simultaneously → FAST
```python
# Performance comparison
# SLOW (if it worked):
for household in households:
    if household.income > 1000:
        household.benefit = 500

# FAST:
benefits = where(incomes > 1000, 500, 100)  # All at once!
```
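A rough way to see the difference on your own machine is to time both approaches on synthetic data. This sketch uses randomly generated incomes and plain NumPy; the timings depend on hardware, so none are quoted here.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
incomes = rng.uniform(0, 5000, size=200_000)  # synthetic data

# Element-by-element Python loop
start = time.perf_counter()
loop_benefits = np.array([500 if x > 1000 else 100 for x in incomes])
loop_seconds = time.perf_counter() - start

# Single vectorized call
start = time.perf_counter()
vector_benefits = np.where(incomes > 1000, 500, 100)
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
print(np.array_equal(loop_benefits, vector_benefits))  # True
```

Both approaches produce identical results; only the running time differs.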
---
## 7. Testing for Vectorization Issues
### Signs Your Code Isn't Vectorized
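**Error messages:**

- `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`
- The same error is often quoted in short form as "truth value of an array is ambiguous"

**Performance:**

- Tests run slowly (often a sign of Python loops over individual entities)
- Microsimulation times out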
### How to Test
```python
import numpy as np

# Your formula should work with arrays
# (formula_with_arrays is a placeholder for the function under test)
def test_vectorization():
    # Create array inputs
    incomes = np.array([500, 1500, 3000])
    # Should return array output
    benefits = formula_with_arrays(incomes)
    assert len(benefits) == 3
```
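A length check alone will not catch wrong values. One stronger approach is to compare the vectorized result against a scalar reference applied element by element. In this sketch both `scalar_benefit` and `formula_with_arrays` are hypothetical stand-ins for your own rule.

```python
import numpy as np


def scalar_benefit(income):
    # Reference rule written in ordinary scalar Python
    return 500 if income > 1000 else 100


def formula_with_arrays(incomes):
    # Vectorized version of the same rule
    return np.where(incomes > 1000, 500, 100)


def test_vectorization():
    # Include the boundary value (1000) as well as values on either side
    incomes = np.array([500, 1000, 1500, 3000])
    benefits = formula_with_arrays(incomes)
    assert benefits.shape == incomes.shape
    expected = np.array([scalar_benefit(x) for x in incomes])
    assert np.array_equal(benefits, expected)


test_vectorization()
```

Including the threshold itself in the inputs helps catch `>` versus `>=` mistakes.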
---
## Quick Reference Card
| Operation | Scalar (WRONG) | Vectorized (CORRECT) |
|-----------|---------------|---------------------|
| Simple condition | `if x > 5:` | `where(x > 5, ...)` |
| Multiple conditions | `if-elif-else` | `select([...], [...])` |
| Boolean AND | `and` | `&` |
| Boolean OR | `or` | `\|` |
| Boolean NOT | `not` | `~` |
| Bounds checking | `if x < 0: x = 0` | `max_(0, x)` |
| Complex logic | Nested if | Nested where/select |
---
## For Agents
When implementing formulas:
1. **Never use if-elif-else** with entity data
2. **Always use where()** for simple conditions
3. **Use select()** for multiple conditions
4. **Use NumPy operators** (&, |, ~) not Python (and, or, not)
5. **Test with arrays** to ensure vectorization
6. **Parameter conditions** can use if-else (scalars)
7. **Entity data** must use vectorized operations