# Code Data Analysis Scaffolds Methodology

Advanced techniques for causal inference, predictive modeling, property-based testing, and complex data analysis.

## Workflow

Copy this checklist and track your progress:

```
Code Data Analysis Scaffolds Progress:
- [ ] Step 1: Clarify task and objectives
- [ ] Step 2: Choose appropriate scaffold type
- [ ] Step 3: Generate scaffold structure
- [ ] Step 4: Validate scaffold completeness
- [ ] Step 5: Deliver scaffold and guide execution
```

**Step 1: Clarify task** - Assess complexity and determine whether advanced techniques are needed. See [1. When to Use Advanced Methods](#1-when-to-use-advanced-methods).

**Step 2: Choose scaffold** - Select from Causal Inference, Predictive Modeling, Property-Based Testing, or Advanced EDA. See the specific sections below.

**Step 3: Generate structure** - Apply the advanced scaffold that matches the task's complexity. See [2. Causal Inference Methods](#2-causal-inference-methods), [3. Predictive Modeling Pipeline](#3-predictive-modeling-pipeline), [4. Property-Based Testing](#4-property-based-testing), [5. Advanced EDA Techniques](#5-advanced-eda-techniques).

**Step 4: Validate** - Check assumptions and run sensitivity analyses and robustness checks using [6. Advanced Validation Patterns](#6-advanced-validation-patterns).

**Step 5: Deliver** - Present results with caveats, limitations, and recommendations for further analysis.

## 1. When to Use Advanced Methods

| Task Characteristic | Standard Template | Advanced Methodology |
|---------------------|-------------------|----------------------|
| **Causal question** | "Does X correlate with Y?" | "Does X cause Y?" → Causal inference needed |
| **Sample size** | < 1000 rows | > 10K rows with complex patterns |
| **Model complexity** | Linear/logistic regression | Ensemble methods, neural nets, feature interactions |
| **Test sophistication** | Unit tests, integration tests | Property-based tests, mutation testing, fuzz testing |
| **Data complexity** | Clean, tabular data | Multi-modal, high-dimensional, unstructured data |
| **Stakes** | Low (exploratory) | High (production ML, regulatory compliance) |

## 2. Causal Inference Methods

Use when the research question is "Does X **cause** Y?", not just "Are X and Y correlated?"

### Causal Inference Scaffold

```python
# CAUSAL INFERENCE SCAFFOLD
# Assumes a pandas DataFrame `df` with treatment, outcome, and confounder columns.

# 1. DRAW CAUSAL DAG (Directed Acyclic Graph)
# Explicitly model: Treatment → Outcome, Confounders → Treatment & Outcome
#
# Example:
#   Education → Income
#       ↑          ↑
#     Family Background
#
# Treatment:  Education
# Outcome:    Income
# Confounder: Family Background (affects both education and income)

# 2. IDENTIFY CONFOUNDERS
confounders = ['age', 'gender', 'family_income', 'region']
# These variables affect BOTH treatment and outcome.
# If not controlled for, they bias the causal estimate.

# 3. CHECK IDENTIFICATION ASSUMPTIONS
# For the causal effect to be identifiable:
# a) No unmeasured confounders (all variables in the DAG are observed)
# b) Treatment assignment is as-if random conditional on confounders
# c) Positivity: every unit has a nonzero probability of treatment/control

# 4. CHOOSE IDENTIFICATION STRATEGY

# Option A: RCT - Random assignment eliminates confounding. Check balance on confounders.
from scipy import stats
for var in confounders:
    # treatment_group / control_group: subsets of df split on the treatment indicator
    _, p = stats.ttest_ind(treatment_group[var], control_group[var])
    print(f"{var}: {'✓' if p > 0.05 else '✗'} balanced")

# Option B: Regression - Control for confounders. Assumes no unmeasured confounding.
import statsmodels.formula.api as smf
model = smf.ols('outcome ~ treatment + age + gender + family_income', data=df).fit()
treatment_effect = model.params['treatment']

# Option C: Propensity Score Matching - Match treated units to similar controls on P(treatment|X).
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
ps_model = LogisticRegression().fit(df[confounders], df['treatment'])
df['ps'] = ps_model.predict_proba(df[confounders])[:, 1]
treated, control = df[df['treatment'] == 1], df[df['treatment'] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[['ps']])
_, indices = nn.kneighbors(treated[['ps']])
treatment_effect = treated['outcome'].mean() - control.iloc[indices.flatten()]['outcome'].mean()

# Option D: IV - Needs an instrument Z that affects treatment but not the outcome (except through treatment).
from statsmodels.sandbox.regression.gmm import IV2SLS
iv_model = IV2SLS(df['income'], df[['education'] + confounders], df[['instrument'] + confounders]).fit()

# Option E: RDD - Treatment assigned at a cutoff. Compare units just above/below the threshold.
df['above_cutoff'] = (df['running_var'] >= cutoff).astype(int)  # cutoff: the assignment threshold
# Use local linear regression around the cutoff to estimate the effect

# Option F: DiD - Compare treatment vs control, before vs after. Assumes parallel trends.
t_before = df[(df['group'] == 'T') & (df['time'] == 'before')]['y'].mean()
t_after = df[(df['group'] == 'T') & (df['time'] == 'after')]['y'].mean()
c_before = df[(df['group'] == 'C') & (df['time'] == 'before')]['y'].mean()
c_after = df[(df['group'] == 'C') & (df['time'] == 'after')]['y'].mean()
did_estimate = (t_after - t_before) - (c_after - c_before)

# 5. SENSITIVITY ANALYSIS
print("\n=== SENSITIVITY CHECKS ===")
print("1. Unmeasured confounding: How strong would a confounder need to be to change the conclusion?")
print("2. Placebo tests: Check for an effect in the period before treatment (should be zero)")
print("3. Falsification tests: Check for an effect on an outcome that shouldn't be affected")
print("4. Robustness: Try different model specifications, subsamples, bandwidths (RDD)")

# 6. REPORT CAUSAL ESTIMATE WITH UNCERTAINTY
# ci_lower, ci_upper: e.g. model.conf_int().loc['treatment'] for Option B, or a bootstrap (see sketch below)
print(f"\nCausal Effect: {treatment_effect:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"Interpretation: Treatment X causes {treatment_effect:.1%} change in outcome Y")
print("Assumptions: [List key identifying assumptions]")
print("Limitations: [Threats to validity]")
```
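
The scaffold above prints `ci_lower` / `ci_upper` without defining them. A minimal sketch of two common ways to fill them in, assuming the same illustrative `df`, `confounders`, and Option B regression formula (adapt the formula and resampling unit to your own design):

```python
import numpy as np
import statsmodels.formula.api as smf

# (a) Analytic CI straight from the fitted OLS model (Option B)
model = smf.ols('outcome ~ treatment + age + gender + family_income', data=df).fit()
ci_lower, ci_upper = model.conf_int().loc['treatment']

# (b) Nonparametric bootstrap - also usable for matching or DiD estimates
boot_effects = []
for _ in range(1000):
    sample = df.sample(frac=1, replace=True)  # resample rows with replacement
    boot_fit = smf.ols('outcome ~ treatment + age + gender + family_income', data=sample).fit()
    boot_effects.append(boot_fit.params['treatment'])
ci_lower, ci_upper = np.percentile(boot_effects, [2.5, 97.5])
```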

### Causal Inference Checklist

- [ ] **Causal question clearly stated**: "Does X cause Y?" not "Are X and Y related?"
- [ ] **DAG drawn**: Treatment, outcome, confounders, mediators identified
- [ ] **Identification strategy chosen**: RCT, regression, PS matching, IV, RDD, DiD
- [ ] **Assumptions checked**: No unmeasured confounding, positivity, parallel trends (DiD), etc.
- [ ] **Sensitivity analysis**: Test robustness to violations of assumptions
- [ ] **Limitations acknowledged**: Threats to internal/external validity stated

## 3. Predictive Modeling Pipeline

Use for forecasting, classification, and regression - when the goal is prediction rather than causal understanding.

### Predictive Modeling Scaffold

```python
# PREDICTIVE MODELING SCAFFOLD
# Assumes a pandas DataFrame `df` with a binary 'target' column plus feature columns.

# 1. DEFINE PREDICTION TASK & METRIC
task = "Predict customer churn (binary classification)"
primary_metric = "F1-score"  # Balances precision/recall
secondary_metrics = ["AUC-ROC", "precision", "recall", "accuracy"]

# 2. TRAIN/VAL/TEST SPLIT (before any preprocessing!)
from sklearn.model_selection import train_test_split

# Split: 60% train, 20% validation, 20% test
train_val, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['target'])
train, val = train_test_split(train_val, test_size=0.25, random_state=42, stratify=train_val['target'])

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
print(f"Class balance - Train: {train['target'].mean():.2%}, Test: {test['target'].mean():.2%}")

# 3. FEATURE ENGINEERING (fit on train, transform train/val/test)
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Numeric features: impute missing values, standardize
numeric_features = ['age', 'income', 'tenure']
num_imputer = SimpleImputer(strategy='median').fit(train[numeric_features])
num_scaler = StandardScaler().fit(num_imputer.transform(train[numeric_features]))

X_train_num = num_scaler.transform(num_imputer.transform(train[numeric_features]))
X_val_num = num_scaler.transform(num_imputer.transform(val[numeric_features]))
X_test_num = num_scaler.transform(num_imputer.transform(test[numeric_features]))

# Categorical features: one-hot encode
from sklearn.preprocessing import OneHotEncoder
cat_features = ['region', 'product_type']
cat_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False).fit(train[cat_features])  # use sparse=False on scikit-learn < 1.2

X_train_cat = cat_encoder.transform(train[cat_features])
X_val_cat = cat_encoder.transform(val[cat_features])
X_test_cat = cat_encoder.transform(test[cat_features])

# Combine features
import numpy as np
X_train = np.hstack([X_train_num, X_train_cat])
X_val = np.hstack([X_val_num, X_val_cat])
X_test = np.hstack([X_test_num, X_test_cat])
y_train, y_val, y_test = train['target'], val['target'], test['target']

# 4. BASELINE MODEL (always start simple!)
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_f1 = f1_score(y_val, baseline.predict(X_val))
print(f"Baseline F1: {baseline_f1:.3f}")

# 5. MODEL SELECTION & HYPERPARAMETER TUNING
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Try multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    y_proba = model.predict_proba(X_val)[:, 1]

    results[name] = {
        'F1': f1_score(y_val, y_pred),
        'AUC': roc_auc_score(y_val, y_proba),
        'Precision': precision_score(y_val, y_pred),
        'Recall': recall_score(y_val, y_pred)
    }
    print(f"{name}: F1={results[name]['F1']:.3f}, AUC={results[name]['AUC']:.3f}")

# Select best model (highest F1 on validation)
best_model_name = max(results, key=lambda x: results[x]['F1'])
best_model = models[best_model_name]
print(f"\nBest model: {best_model_name}")

# Hyperparameter tuning on best model
if best_model_name == 'Random Forest':
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5, 10]
    }
    grid_search = GridSearchCV(best_model, param_grid, cv=5, scoring='f1', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    print(f"Best params: {grid_search.best_params_}")

# 6. CROSS-VALIDATION (check for overfitting)
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1 scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.3f}, Std: {cv_scores.std():.3f}")

# 7. FINAL EVALUATION ON TEST SET (only once!)
y_test_pred = best_model.predict(X_test)
y_test_proba = best_model.predict_proba(X_test)[:, 1]

test_f1 = f1_score(y_test, y_test_pred)
test_auc = roc_auc_score(y_test, y_test_proba)
print("\n=== FINAL TEST PERFORMANCE ===")
print(f"F1: {test_f1:.3f}, AUC: {test_auc:.3f}")

# 8. ERROR ANALYSIS
from sklearn.metrics import confusion_matrix, classification_report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))

# Analyze misclassifications
test_df = test.copy()
test_df['prediction'] = y_test_pred
test_df['prediction_proba'] = y_test_proba
false_positives = test_df[(test_df['target'] == 0) & (test_df['prediction'] == 1)]
false_negatives = test_df[(test_df['target'] == 1) & (test_df['prediction'] == 0)]
print(f"False Positives: {len(false_positives)}")
print(f"False Negatives: {len(false_negatives)}")
# Inspect these cases to understand failure modes

# 9. FEATURE IMPORTANCE
import pandas as pd
if hasattr(best_model, 'feature_importances_'):
    feature_names = numeric_features + list(cat_encoder.get_feature_names_out(cat_features))
    importances = pd.DataFrame({
        'feature': feature_names,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    print("\nTop 10 Features:")
    print(importances.head(10))

# 10. MODEL DEPLOYMENT CHECKLIST
print("\n=== DEPLOYMENT READINESS ===")
print(f"✓ Test F1 ({test_f1:.3f}) > Baseline ({baseline_f1:.3f})")
print(f"✓ Cross-validation shows consistent performance (CV std={cv_scores.std():.3f})")
print("✓ Error analysis completed, failure modes understood")
print("✓ Feature importance computed, no surprising features")
print("□ Model serialized and saved")
print("□ Monitoring plan in place (track drift in input features, output distribution)")
print("□ Rollback plan if model underperforms in production")
```
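
The manual fit-on-train/transform pattern above works but is easy to get wrong. A hedged alternative sketch using scikit-learn's `Pipeline` and `ColumnTransformer` (reusing the scaffold's hypothetical column names and `train` split), which keeps preprocessing inside cross-validation and prevents leakage:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'income', 'tenure']      # hypothetical columns from the scaffold
cat_features = ['region', 'product_type']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
])

clf = Pipeline([('preprocess', preprocess),
                ('model', RandomForestClassifier(random_state=42))])

# Preprocessing is re-fit on each CV training fold, so nothing leaks from the validation folds
scores = cross_val_score(clf, train[numeric_features + cat_features], train['target'], cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

The same pipeline can be passed to `GridSearchCV`, with step-prefixed parameter names such as `model__n_estimators`.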

### Predictive Modeling Checklist

- [ ] **Clear prediction task**: Classification, regression, time series forecasting (for time-ordered data, see the split sketch after this checklist)
- [ ] **Appropriate metrics**: Match business objectives (precision vs recall tradeoff, etc.)
- [ ] **Train/val/test split**: Before any preprocessing (no data leakage)
- [ ] **Baseline model**: Simple model for comparison
- [ ] **Feature engineering**: Proper handling of missing values, scaling, encoding
- [ ] **Cross-validation**: k-fold CV to check for overfitting
- [ ] **Model selection**: Compare multiple model types
- [ ] **Hyperparameter tuning**: Grid/random search on validation set
- [ ] **Error analysis**: Understand failure modes, inspect misclassifications
- [ ] **Test set evaluation**: Final performance check (only once!)
- [ ] **Deployment readiness**: Monitoring, rollback plan, model versioning
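
The scaffold's random stratified split assumes rows are exchangeable. For time series forecasting, a minimal sketch of a time-ordered split instead (assuming a hypothetical `date` column and the same `df`/`target` naming), so the model is never validated on data older than its training data:

```python
from sklearn.model_selection import TimeSeriesSplit

feature_columns = ['age', 'income', 'tenure']   # hypothetical feature list
df_sorted = df.sort_values('date')              # 'date': hypothetical timestamp column
X, y = df_sorted[feature_columns], df_sorted['target']

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Each fold trains on an earlier window and validates on the period that follows it
    print(f"Fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, val rows {val_idx[0]}-{val_idx[-1]}")
```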

## 4. Property-Based Testing

Use for testing complex logic, data transformations, invariants. Goes beyond example-based tests.

### Property-Based Testing Scaffold

```python
# PROPERTY-BASED TESTING SCAFFOLD
from hypothesis import given, strategies as st
import pytest

# Example: Testing a sort function
def my_sort(lst):
    return sorted(lst)

# Property 1: Output length equals input length
@given(st.lists(st.integers()))
def test_sort_preserves_length(lst):
    assert len(my_sort(lst)) == len(lst)

# Property 2: Output is sorted (each element <= next element)
@given(st.lists(st.integers()))
def test_sort_is_sorted(lst):
    result = my_sort(lst)
    for i in range(len(result) - 1):
        assert result[i] <= result[i+1]

# Property 3: Output contains same elements as input (multiset equality)
@given(st.lists(st.integers()))
def test_sort_preserves_elements(lst):
    result = my_sort(lst)
    assert sorted(lst) == sorted(result)  # Canonical form comparison

# Property 4: Idempotence (sorting twice = sorting once)
@given(st.lists(st.integers()))
def test_sort_is_idempotent(lst):
    result = my_sort(lst)
    assert my_sort(result) == result

# Property 5: Empty input → empty output
def test_sort_empty_list():
    assert my_sort([]) == []

# Property 6: Single element → unchanged
@given(st.integers())
def test_sort_single_element(x):
    assert my_sort([x]) == [x]
```

### Property-Based Testing Strategies

**For data transformations:**
- Idempotence: `f(f(x)) == f(x)`
- Round-trip: `decode(encode(x)) == x` (see the sketch after this list)
- Commutativity: `f(g(x)) == g(f(x))`
- Invariants: Properties that never change (e.g., sum after transformation)
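
A minimal round-trip sketch using Hypothesis and the standard library's `json` module, as a stand-in for whatever encode/decode pair you are actually testing:

```python
import json
from hypothesis import given, strategies as st

# JSON-serializable values: recursive strategy over None, bools, numbers, strings, lists, dicts
json_values = st.recursive(
    st.none() | st.booleans() | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False) | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=20,
)

@given(json_values)
def test_json_round_trip(value):
    # decode(encode(x)) == x: serializing then parsing should reproduce the original value
    assert json.loads(json.dumps(value)) == value
```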

**For numeric functions:**
- Boundary conditions: Zero, negative, very large numbers
- Inverse relationships: `f(f_inverse(x)) ≈ x` (see the sketch after this list)
- Known identities: `sin²(x) + cos²(x) = 1`
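
A minimal sketch of an inverse-relationship property with approximate equality (floating-point round-off means exact `==` is usually the wrong check here):

```python
import math
from hypothesis import given, strategies as st

@given(st.floats(min_value=0.0, max_value=1e12, allow_nan=False))
def test_sqrt_inverts_square(x):
    # f(f_inverse(x)) ≈ x, with a tolerance for float round-off
    assert math.isclose(math.sqrt(x) ** 2, x, rel_tol=1e-9, abs_tol=1e-9)
```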

**For string/list operations:**
- Length preservation or predictable change
- Character/element preservation
- Order properties (sorted, reversed)

## 5. Advanced EDA Techniques

For high-dimensional, multi-modal, or complex data.

### Dimensionality Reduction

```python
import matplotlib.pyplot as plt
# X_scaled: standardized feature matrix; y: labels for coloring (both prepared earlier)

# PCA: Linear dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")

# t-SNE: Non-linear, good for visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.show()

# UMAP: Faster alternative to t-SNE, better preserves global structure
# pip install umap-learn
import umap
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
```

### Cluster Analysis

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Elbow method: Find optimal K
inertias = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(range(2, 11), inertias)
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()

# Silhouette score: Measure cluster quality
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(X_scaled)
    score = silhouette_score(X_scaled, kmeans.labels_)
    print(f"K={k}: Silhouette={score:.3f}")

# DBSCAN: Density-based clustering (finds arbitrarily shaped clusters)
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
print(f"Clusters found: {len(set(clusters)) - (1 if -1 in clusters else 0)}")
print(f"Noise points: {(clusters == -1).sum()}")
```

## 6. Advanced Validation Patterns

### Mutation Testing

Tests the quality of your tests by introducing small bugs (mutants) into the code and checking whether the tests catch them.

```python
# Install: pip install mutmut
# Run:     mutmut run --paths-to-mutate=src/
# Check:   mutmut results
# Survivors (mutations not caught) indicate weak tests
```
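
An illustrative (hypothetical) example of what a surviving mutant means: a mutation tool might flip `>=` to `>` in the function below; the weak test still passes, while the boundary test kills the mutant.

```python
def is_adult(age: int) -> bool:
    return age >= 18  # a mutant might change this to `age > 18`

def test_is_adult_weak():
    # Survives the `>=` → `>` mutant: never checks the boundary value
    assert is_adult(30)
    assert not is_adult(5)

def test_is_adult_boundary():
    # Kills the mutant: fails if the implementation becomes `age > 18`
    assert is_adult(18)
```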

### Fuzz Testing

Generate random/malformed inputs to find edge cases.

```python
from hypothesis import given, strategies as st

@given(st.text())
def test_function_doesnt_crash_on_any_string(s):
    result = my_function(s)  # my_function: the function under test; should never raise an exception
    assert result is not None
```

### Data Validation Framework (Great Expectations)

```python
import great_expectations as gx

# Define expectations (GX 1.x-style API)
expectation_suite = gx.ExpectationSuite(name="my_data_suite")
expectation_suite.add_expectation(gx.expectations.ExpectColumnToExist(column="user_id"))
expectation_suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id"))
expectation_suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="age", min_value=0, max_value=120))

# Validate data
# `context` and `batch_request` are assumed to have been created earlier (e.g. via gx.get_context()
# and a configured data source); the exact validation entry point varies by GX version.
results = context.run_validation(batch_request, expectation_suite)
print(results["success"])  # True if all expectations are met
```

## 7. When to Use Each Method

| Research Goal | Method | Key Consideration |
|---------------|--------|-------------------|
| Causal effect estimation | RCT, IV, RDD, DiD | Identify confounders, check assumptions |
| Prediction/forecasting | Supervised ML | Avoid data leakage, validate out-of-sample |
| Pattern discovery | Clustering, PCA, t-SNE | Dimensionality reduction first if high-D |
| Complex logic testing | Property-based testing | Define invariants that must hold |
| Data quality | Great Expectations | Automate checks in pipelines |