Initial commit

Zhongwei Li
2025-11-30 08:38:26 +08:00
commit 41d9f6b189
304 changed files with 98322 additions and 0 deletions


@@ -0,0 +1,171 @@
---
name: code-data-analysis-scaffolds
description: Use when starting technical work requiring structured approach - writing tests before code (TDD), planning data exploration (EDA), designing statistical analysis, clarifying modeling objectives (causal vs predictive), or validating results. Invoke when user mentions "write tests for", "explore this dataset", "analyze", "model", "validate", or when technical work needs systematic scaffolding before execution.
---
# Code Data Analysis Scaffolds
## Table of Contents
- [Purpose](#purpose)
- [When to Use This Skill](#when-to-use-this-skill)
- [When NOT to Use This Skill](#when-not-to-use-this-skill)
- [What Is It?](#what-is-it)
- [Workflow](#workflow)
- [Scaffold Types](#scaffold-types)
- [Guardrails](#guardrails)
- [Quick Reference](#quick-reference)
## Purpose
This skill provides structured scaffolds (frameworks, checklists, templates) for technical work in software engineering and data science. It helps you approach complex tasks systematically by defining what to do, in what order, and what to validate before proceeding.
## When to Use This Skill
Use this skill when you need to:
- **Write tests systematically** - TDD scaffolding, test suite design, test data generation
- **Explore data rigorously** - EDA plans, data quality checks, feature analysis strategies
- **Design statistical analyses** - A/B tests, causal inference, hypothesis testing frameworks
- **Build predictive models** - Model selection, validation strategy, evaluation metrics
- **Refactor with confidence** - Test coverage strategy, refactoring checklist, regression prevention
- **Validate technical work** - Data validation, model evaluation, code quality checks
- **Clarify technical approach** - Distinguish causal vs predictive goals, choose appropriate methods
**Trigger phrases:**
- "Write tests for [code/feature]"
- "Explore this dataset"
- "Analyze [data/problem]"
- "Build a model to predict..."
- "How should I validate..."
- "Design an A/B test for..."
- "What's the right approach to..."
## When NOT to Use This Skill
Skip this skill when:
- **Execution only** - User wants immediate code/analysis without scaffolding
- **Scaffold already exists** - User has clear plan and just needs execution help
- **Non-technical tasks** - Use appropriate skill for writing, planning, decision-making
- **Simple one-liners** - No scaffold needed for trivial tasks
- **Exploratory conversation** - User is brainstorming, not ready for structured approach yet
## What Is It?
Code Data Analysis Scaffolds provides structured frameworks for common technical patterns:
1. **TDD Scaffold**: Given requirements, generate test structure before implementing code
2. **EDA Scaffold**: Given dataset, create systematic exploration plan
3. **Statistical Analysis Scaffold**: Given question, design appropriate statistical test/model
4. **Validation Scaffold**: Given code/model/data, create comprehensive validation checklist
**Quick example:**
> **Task**: "Write authentication function"
>
> **TDD Scaffold**:
> ```python
> # Test structure (write these FIRST)
> def test_valid_credentials():
> assert authenticate("user@example.com", "correct_pass") == True
>
> def test_invalid_password():
> assert authenticate("user@example.com", "wrong_pass") == False
>
> def test_nonexistent_user():
> assert authenticate("nobody@example.com", "any_pass") == False
>
> def test_empty_credentials():
> with pytest.raises(ValueError):
> authenticate("", "")
>
> # Now implement authenticate() to make tests pass
> ```
## Workflow
Copy this checklist and track your progress:
```
Code Data Analysis Scaffolds Progress:
- [ ] Step 1: Clarify task and objectives
- [ ] Step 2: Choose appropriate scaffold type
- [ ] Step 3: Generate scaffold structure
- [ ] Step 4: Validate scaffold completeness
- [ ] Step 5: Deliver scaffold and guide execution
```
**Step 1: Clarify task and objectives**
Ask user for the task, dataset/codebase context, constraints, and expected outcome. Determine if this is TDD (write tests first), EDA (explore data), statistical analysis (test hypothesis), or validation (check quality). See [resources/template.md](resources/template.md) for context questions.
**Step 2: Choose appropriate scaffold type**
Based on task, select scaffold: TDD (testing code), EDA (exploring data), Statistical Analysis (hypothesis testing, A/B tests), Causal Inference (estimating treatment effects), Predictive Modeling (building ML models), or Validation (checking quality). See [Scaffold Types](#scaffold-types) for guidance on choosing.
**Step 3: Generate scaffold structure**
Create systematic framework with clear steps, validation checkpoints, and expected outputs at each stage. For standard cases use [resources/template.md](resources/template.md); for advanced techniques see [resources/methodology.md](resources/methodology.md).
**Step 4: Validate scaffold completeness**
Check scaffold covers all requirements, includes validation steps, makes assumptions explicit, and provides clear success criteria. Self-assess using [resources/evaluators/rubric_code_data_analysis_scaffolds.json](resources/evaluators/rubric_code_data_analysis_scaffolds.json) - minimum score ≥3.5.
**Step 5: Deliver scaffold and guide execution**
Present scaffold with clear next steps. If user wants execution help, follow the scaffold systematically. If scaffold reveals gaps (missing data, unclear requirements), surface these before proceeding.
## Scaffold Types
### TDD (Test-Driven Development)
**When**: Writing new code, refactoring existing code, fixing bugs
**Output**: Test structure (test cases → implementation → refactor)
**Key Elements**: Test cases covering happy path, edge cases, error conditions, test data setup
### EDA (Exploratory Data Analysis)
**When**: New dataset, data quality questions, feature engineering
**Output**: Exploration plan (data overview → quality checks → univariate → bivariate → insights)
**Key Elements**: Data shape/types, missing values, distributions, outliers, correlations
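A minimal sketch of the first quality pass, assuming a pandas DataFrame is already loaded (the toy frame at the end is only for illustration):
```python
import pandas as pd

def quick_quality_report(df: pd.DataFrame) -> None:
    """First-pass EDA checks to run before any analysis."""
    print(df.shape, df.dtypes.value_counts().to_dict())              # size and type mix
    print(df.isnull().mean().sort_values(ascending=False).head())    # worst missing-value columns
    print(f"duplicate rows: {df.duplicated().sum()}")
    print(df.describe().T[["min", "max"]])                           # spot impossible ranges

quick_quality_report(pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "z"]}))
```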
### Statistical Analysis
**When**: Hypothesis testing, A/B testing, comparing groups
**Output**: Analysis design (question → hypothesis → test selection → assumptions → interpretation)
**Key Elements**: Null/alternative hypotheses, significance level, power analysis, assumption checks
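A minimal sketch of such a design for a two-group comparison, with synthetic stand-in data (the 0.05 significance level is an assumption):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(10.0, 2.0, 500)   # placeholder data
variant = rng.normal(10.3, 2.0, 500)   # placeholder data

alpha = 0.05                            # chosen before looking at the data
_, p_var = stats.levene(control, variant)                          # equal-variance assumption check
t, p = stats.ttest_ind(control, variant, equal_var=p_var > alpha)  # Welch's t-test if variances differ
cohens_d = (variant.mean() - control.mean()) / np.sqrt(
    (control.var(ddof=1) + variant.var(ddof=1)) / 2)
print(f"t={t:.2f}, p={p:.4f}, Cohen's d={cohens_d:.2f}, reject H0: {p < alpha}")
```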
### Causal Inference
**When**: Estimating treatment effects, understanding causation not just correlation
**Output**: Causal design (DAG → identification strategy → estimation → sensitivity analysis)
**Key Elements**: Confounders, treatment/control groups, identification assumptions, effect estimation
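A minimal sketch of the simplest identification strategy (regression adjustment) on a synthetic confounded example; see [resources/methodology.md](resources/methodology.md) for matching, IV, RDD, and DiD:
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic example: income confounds both treatment take-up and the outcome
rng = np.random.default_rng(0)
income = rng.normal(50, 10, 1000)
treatment = (income + rng.normal(0, 5, 1000) > 52).astype(int)
outcome = 2.0 * treatment + 0.3 * income + rng.normal(0, 1, 1000)
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "income": income})

# Adjusting for the confounder recovers ~2.0; valid only if the DAG says income closes all backdoor paths
fit = smf.ols("outcome ~ treatment + income", data=df).fit()
print(fit.params["treatment"], fit.conf_int().loc["treatment"].tolist())
```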
### Predictive Modeling
**When**: Building ML models, forecasting, classification/regression tasks
**Output**: Modeling pipeline (data prep → feature engineering → model selection → validation → evaluation)
**Key Elements**: Train/val/test split, baseline model, metrics selection, cross-validation, error analysis
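A minimal sketch of the baseline-first habit on synthetic data (in practice the split, features, and metric come from the task):
```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=5, weights=[0.75], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline F1:", round(f1_score(y_val, baseline.predict(X_val)), 3))
print("model F1:   ", round(f1_score(y_val, model.predict(X_val)), 3))  # must beat the baseline
```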
### Validation
**When**: Checking data quality, code quality, model quality before deployment
**Output**: Validation checklist (assertions → edge cases → integration tests → monitoring)
**Key Elements**: Acceptance criteria, test coverage, error handling, boundary conditions
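A minimal sketch of assertion-style data validation (`user_id` and `age` are hypothetical columns):
```python
import pandas as pd

def validate_users(df: pd.DataFrame) -> None:
    """Fail fast on schema/range violations before results are delivered."""
    assert {"user_id", "age"} <= set(df.columns), "missing required columns"
    assert df["user_id"].notna().all(), "null user_id"
    assert df["user_id"].is_unique, "duplicate user_id"
    assert df["age"].between(0, 120).all(), "age out of range"

validate_users(pd.DataFrame({"user_id": [1, 2, 3], "age": [34, 29, 57]}))  # passes silently
```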
## Guardrails
- **Clarify before scaffolding** - Don't guess what user needs; ask clarifying questions first
- **Distinguish causal vs predictive** - Causal inference needs different methods than prediction (RCT/IV vs ML)
- **Make assumptions explicit** - Every scaffold has assumptions (data distribution, user behavior, system constraints)
- **Include validation steps** - Scaffold should include checkpoints to validate work at each stage
- **Provide examples** - Show what good looks like (sample test, sample EDA visualization, sample model evaluation)
- **Surface gaps early** - If scaffold reveals missing data/requirements, flag immediately
- **Avoid premature optimization** - Start with simple scaffold, add complexity only if needed
- **Follow best practices** - TDD: test first, EDA: start with data quality, Modeling: baseline before complex models
## Quick Reference
| Task Type | When to Use | Scaffold Resource |
|-----------|-------------|-------------------|
| **TDD** | Writing/refactoring code | [resources/template.md](resources/template.md) #tdd-scaffold |
| **EDA** | Exploring new dataset | [resources/template.md](resources/template.md) #eda-scaffold |
| **Statistical Analysis** | Hypothesis testing, A/B tests | [resources/template.md](resources/template.md) #statistical-analysis-scaffold |
| **Causal Inference** | Treatment effect estimation | [resources/methodology.md](resources/methodology.md) #causal-inference-methods |
| **Predictive Modeling** | ML model building | [resources/methodology.md](resources/methodology.md) #predictive-modeling-pipeline |
| **Validation** | Quality checks before shipping | [resources/template.md](resources/template.md) #validation-scaffold |
| **Examples** | See what good looks like | [resources/examples/](resources/examples/) |
| **Rubric** | Validate scaffold quality | [resources/evaluators/rubric_code_data_analysis_scaffolds.json](resources/evaluators/rubric_code_data_analysis_scaffolds.json) |


@@ -0,0 +1,314 @@
{
"criteria": [
{
"name": "Scaffold Structure Clarity",
"description": "Is the scaffold structure clear, systematic, and easy to follow?",
"scoring": {
"1": "No clear structure. Random collection of steps/checks without logical flow.",
"2": "Basic structure but steps are vague or out of order. User confused about what to do next.",
"3": "Clear structure with defined steps. User can follow but may need clarification on some steps.",
"4": "Well-organized structure with clear steps, checkpoints, and expected outputs at each stage.",
"5": "Exemplary structure: systematic, numbered steps with clear inputs/outputs, decision points explicit."
},
"red_flags": [
"Steps not numbered or sequenced",
"No clear starting/ending point",
"Validation steps missing",
"User must guess what to do next"
]
},
{
"name": "Coverage Completeness",
"description": "Does the scaffold cover all necessary aspects (happy path, edge cases, validation, etc.)?",
"scoring": {
"1": "Major gaps. Only covers happy path, ignores edge cases/errors/validation.",
"2": "Partial coverage. Addresses main case but misses important edge cases or validation steps.",
"3": "Adequate coverage. Main cases and some edge cases covered. Basic validation included.",
"4": "Comprehensive coverage. Happy path, edge cases, error conditions, validation all included.",
"5": "Exhaustive coverage. All cases, validation at each step, robustness checks, limitations documented."
},
"red_flags": [
"TDD scaffold: No tests for edge cases or errors",
"EDA scaffold: Missing data quality checks",
"Statistical scaffold: No assumption checks",
"Any scaffold: No validation step before delivering"
]
},
{
"name": "Technical Rigor",
"description": "Is the approach technically sound with appropriate methods/tests?",
"scoring": {
"1": "Technically incorrect. Wrong methods, flawed logic, or inappropriate techniques.",
"2": "Questionable rigor. Some techniques correct but others questionable or missing justification.",
"3": "Adequate rigor. Standard techniques applied correctly. Acceptable for routine work.",
"4": "High rigor. Appropriate methods, assumptions checked, sensitivity analysis included.",
"5": "Exemplary rigor. Best practices followed, multiple validation approaches, limitations acknowledged."
},
"red_flags": [
"Causal inference without DAG or identification strategy",
"Statistical test without checking assumptions",
"ML model without train/val/test split (data leakage)",
"TDD without testing error conditions"
]
},
{
"name": "Actionability",
"description": "Can user execute scaffold without further guidance? Are examples concrete?",
"scoring": {
"1": "Not actionable. Vague advice, no concrete steps, no code examples.",
"2": "Somewhat actionable. General direction but user needs to figure out details.",
"3": "Actionable. Clear steps with code snippets. User can execute with minor adjustments.",
"4": "Highly actionable. Complete code examples, data assumptions stated, ready to adapt.",
"5": "Immediately executable. Copy-paste ready examples with inline comments, expected outputs shown."
},
"red_flags": [
"No code examples (just prose descriptions)",
"Code has placeholders without explaining what to fill in",
"No example inputs/outputs",
"Vague instructions ('check assumptions', 'validate results' without saying how)"
]
},
{
"name": "Test Quality (for TDD)",
"description": "For TDD scaffolds: Do tests cover happy path, edge cases, errors, and integration?",
"scoring": {
"1": "Only happy path tests. No edge cases, errors, or integration tests.",
"2": "Happy path + some edge cases. Error handling or integration missing.",
"3": "Happy path, edge cases, basic error tests. Integration tests may be missing.",
"4": "Comprehensive: Happy path, edge cases, error conditions, integration tests all present.",
"5": "Exemplary: Above + property-based tests, test fixtures, mocks for external dependencies."
},
"red_flags": [
"No tests for None/empty input",
"No tests for expected exceptions",
"No tests for state changes/side effects",
"No integration tests for external systems"
],
"applicable_to": ["TDD"]
},
{
"name": "Data Quality Assessment (for EDA)",
"description": "For EDA scaffolds: Are data quality checks (missing, duplicates, outliers, consistency) included?",
"scoring": {
"1": "No data quality checks. Jumps straight to analysis without inspecting data.",
"2": "Minimal checks. Maybe checks missing values but ignores duplicates, outliers, consistency.",
"3": "Basic quality checks. Missing values, duplicates, basic outliers checked.",
"4": "Thorough quality checks. Missing patterns, duplicates, outliers, type consistency, referential integrity.",
"5": "Comprehensive quality framework. All checks + distributions, cardinality, data lineage, validation rules."
},
"red_flags": [
"No check for missing values",
"No check for duplicates",
"No outlier detection",
"Assumes data is clean without validation"
],
"applicable_to": ["EDA", "Statistical Analysis", "Predictive Modeling"]
},
{
"name": "Assumption Documentation",
"description": "Are assumptions explicitly stated and justified?",
"scoring": {
"1": "No assumptions stated. User unaware of what's being assumed.",
"2": "Some assumptions implicit but not documented. User must infer them.",
"3": "Key assumptions stated but not justified or validated.",
"4": "Assumptions explicitly stated with justification. User knows what's assumed and why.",
"5": "Assumptions stated, justified, validated where possible, and sensitivity to violations analyzed."
},
"red_flags": [
"Statistical test applied without stating/checking assumptions",
"Causal claim without stating identification assumptions",
"ML model without documenting train/test split assumptions",
"Function implementation without stating preconditions"
]
},
{
"name": "Validation Steps Included",
"description": "Does scaffold include validation/quality checks before delivering results?",
"scoring": {
"1": "No validation. Results delivered without any quality checks.",
"2": "Informal validation. 'Looks good' without systematic checks.",
"3": "Basic validation. Some checks but not comprehensive or systematic.",
"4": "Systematic validation. Checklist of quality criteria, most items checked.",
"5": "Rigorous validation framework. Multiple validation approaches, robustness checks, edge cases tested."
},
"red_flags": [
"No validation step in workflow",
"No rubric or checklist to assess quality",
"No test suite execution before delivering code",
"No sensitivity analysis for statistical results"
]
},
{
"name": "Code/Analysis Quality",
"description": "Is code well-structured, readable, and following best practices?",
"scoring": {
"1": "Poor quality. Spaghetti code, no structure, hard to understand.",
"2": "Low quality. Works but hard to read, poor naming, no comments.",
"3": "Adequate quality. Readable, basic structure, some comments. Acceptable for prototypes.",
"4": "Good quality. Clean code, good naming, appropriate comments, follows style guide.",
"5": "Excellent quality. Modular, DRY, well-documented, type hints, follows SOLID principles."
},
"red_flags": [
"Magic numbers without explanation",
"Copy-pasted code (not DRY)",
"Functions doing multiple unrelated things",
"No docstrings or comments explaining complex logic"
]
},
{
"name": "Reproducibility",
"description": "Can another person reproduce the analysis/tests with provided information?",
"scoring": {
"1": "Not reproducible. Missing critical information (data, packages, random seeds).",
"2": "Partially reproducible. Some information provided but key details missing.",
"3": "Mostly reproducible. Enough information for skilled practitioner to reproduce with effort.",
"4": "Reproducible. All information provided (data access, package versions, random seeds, parameters).",
"5": "Fully reproducible. Documented environment, requirements.txt, Docker container, or notebook with all steps."
},
"red_flags": [
"No package versions specified",
"Random operations without setting seed",
"Data source not documented or inaccessible",
"No instructions for running tests/analysis"
]
}
],
"task_type_guidance": {
"TDD": {
"description": "Test-Driven Development scaffolds",
"focus_criteria": [
"Test Quality",
"Code/Analysis Quality",
"Validation Steps Included"
],
"target_score": 3.5,
"success_indicators": [
"Tests written before implementation",
"Happy path, edge cases, errors all tested",
"Tests pass and are maintainable",
"Red-Green-Refactor cycle followed"
]
},
"EDA": {
"description": "Exploratory Data Analysis scaffolds",
"focus_criteria": [
"Data Quality Assessment",
"Coverage Completeness",
"Assumption Documentation"
],
"target_score": 3.5,
"success_indicators": [
"Data quality systematically checked",
"Univariate and bivariate analysis completed",
"Insights and recommendations documented",
"Missing values, outliers, distributions analyzed"
]
},
"Statistical Analysis": {
"description": "Hypothesis testing, A/B tests, causal inference",
"focus_criteria": [
"Technical Rigor",
"Assumption Documentation",
"Validation Steps Included"
],
"target_score": 4.0,
"success_indicators": [
"Hypotheses clearly stated",
"Appropriate test selected and justified",
"Assumptions checked (normality, independence, etc.)",
"Effect sizes and confidence intervals reported",
"Sensitivity analysis performed"
]
},
"Predictive Modeling": {
"description": "ML model building and evaluation",
"focus_criteria": [
"Technical Rigor",
"Validation Steps Included",
"Reproducibility"
],
"target_score": 4.0,
"success_indicators": [
"Train/val/test split before preprocessing (no data leakage)",
"Baseline model for comparison",
"Cross-validation performed",
"Error analysis and feature importance computed",
"Model deployment checklist completed"
]
},
"Validation": {
"description": "Data/code/model quality checks",
"focus_criteria": [
"Coverage Completeness",
"Validation Steps Included",
"Technical Rigor"
],
"target_score": 4.0,
"success_indicators": [
"Schema validation (types, ranges, constraints)",
"Referential integrity checked",
"Edge cases tested",
"Monitoring/alerting strategy defined"
]
}
},
"common_failure_modes": [
{
"failure_mode": "Jumping to Implementation Without Scaffold",
"symptoms": "User writes code/analysis immediately without planning structure first.",
"consequences": "Missing edge cases, poor test coverage, incomplete analysis.",
"fix": "Force scaffold creation before implementation. Use template as checklist."
},
{
"failure_mode": "Testing Only Happy Path",
"symptoms": "TDD scaffold has tests for expected usage but none for errors/edge cases.",
"consequences": "Code breaks in production on unexpected inputs.",
"fix": "Require tests for: empty input, None, boundary values, invalid types, expected exceptions."
},
{
"failure_mode": "Skipping Data Quality Checks",
"symptoms": "EDA scaffold jumps to visualization without checking missing values, outliers, duplicates.",
"consequences": "Invalid conclusions based on dirty data.",
"fix": "Mandatory data quality section before any analysis. No exceptions."
},
{
"failure_mode": "Assumptions Not Documented",
"symptoms": "Statistical test applied without stating/checking assumptions (normality, independence, etc.).",
"consequences": "Invalid statistical inference. Wrong conclusions.",
"fix": "Explicit assumption section in scaffold. Check assumptions before applying test."
},
{
"failure_mode": "No Validation Step",
"symptoms": "Scaffold delivers results without any quality check or self-assessment.",
"consequences": "Low-quality outputs, errors not caught.",
"fix": "Mandatory validation step in workflow. Use rubric self-assessment."
},
{
"failure_mode": "Correlation Interpreted as Causation",
"symptoms": "EDA finds correlation, claims causal relationship without causal inference methods.",
"consequences": "Wrong business decisions based on spurious causality.",
"fix": "Distinguish predictive (correlation) from causal questions. Use causal inference methodology if claiming causation."
},
{
"failure_mode": "Data Leakage in ML",
"symptoms": "Preprocessing (scaling, imputation) done before train/test split.",
"consequences": "Overly optimistic model performance. Fails in production.",
"fix": "Scaffold enforces: split first, then preprocess. Fit transformers on train only."
},
{
"failure_mode": "Code Without Tests",
"symptoms": "Implementation provided but no test scaffold or test execution.",
"consequences": "Regressions not caught, bugs in production.",
"fix": "TDD scaffold mandatory for production code. Tests must pass before code review."
}
],
"scale": 5,
"minimum_average_score": 3.5,
"interpretation": {
"1.0-2.0": "Inadequate. Major gaps in structure, coverage, or rigor. Do not use. Revise scaffold.",
"2.0-3.0": "Needs improvement. Basic structure present but incomplete or lacks rigor. Acceptable for learning/practice only.",
"3.0-3.5": "Acceptable. Covers main cases with adequate rigor. Suitable for routine work or prototypes.",
"3.5-4.0": "Good. Comprehensive coverage with good rigor. Suitable for production code/analysis.",
"4.0-5.0": "Excellent. Exemplary structure, rigor, and completeness. Production-ready with best practices."
}
}


@@ -0,0 +1,272 @@
# EDA Example: Customer Churn Analysis
Complete exploratory data analysis for telecom customer churn dataset.
## Task
Explore customer churn dataset to understand:
- What factors correlate with churn?
- Are there data quality issues?
- What features should we engineer for predictive model?
## Dataset
- **Rows**: 7,043 customers
- **Target**: `Churn` (Yes/No)
- **Features**: 20 columns (demographics, account info, usage patterns)
## EDA Scaffold Applied
### 1. Data Overview
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('telecom_churn.csv')
print(f"Shape: {df.shape}")
# Output: (7043, 21)
print(f"Columns: {df.columns.tolist()}")
# ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
# 'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
# 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
# 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
# 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
print(df.dtypes)
# customerID object
# gender object
# SeniorCitizen int64
# tenure int64
# MonthlyCharges float64
# TotalCharges object ← Should be numeric!
# Churn object
print(df.head())
print(df.describe())
```
**Findings**:
- TotalCharges is object type (should be numeric) - needs fixing
- Churn is target variable (26.5% churn rate)
### 2. Data Quality Checks
```python
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(missing_pct[missing_pct > 0])
# No missing values marked as NaN
# But TotalCharges is object - check for empty strings
print((df['TotalCharges'] == ' ').sum())
# Output: 11 rows have space instead of number
# Fix: Convert TotalCharges to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print(df['TotalCharges'].isnull().sum())
# Output: 11 (now properly marked as missing)
# Strategy: Drop 11 rows (< 0.2% of data)
df = df.dropna()
# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Output: 0
# Data consistency checks
print("Tenure vs TotalCharges consistency:")
print(df[['tenure', 'MonthlyCharges', 'TotalCharges']].head())
# tenure=1, Monthly=$29, Total=$29 ✓
# tenure=34, Monthly=$57, Total=$1889 ≈ $57*34 ✓
```
**Findings**:
- 11 rows (0.16%) with missing TotalCharges - dropped
- No duplicates
- TotalCharges ≈ MonthlyCharges × tenure (consistent)
### 3. Univariate Analysis
```python
# Target variable
print(df['Churn'].value_counts(normalize=True))
# No 73.5%
# Yes 26.5%
# Imbalanced but not severely (>20% minority class is workable)
# Numeric variables
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
for col in numeric_cols:
print(f"\n{col}:")
print(f" Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
print(f" Std: {df[col].std():.2f}, Range: [{df[col].min()}, {df[col].max()}]")
# Histogram
df[col].hist(bins=50, edgecolor='black')
plt.title(f'{col} Distribution')
plt.xlabel(col)
plt.show()
# Check outliers
Q1, Q3 = df[col].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = ((df[col] < (Q1 - 1.5*IQR)) | (df[col] > (Q3 + 1.5*IQR))).sum()
print(f" Outliers: {outliers} ({outliers/len(df)*100:.1f}%)")
```
**Findings**:
- **tenure**: Right-skewed (mean=32, median=29). Many new customers (0-12 months).
- **MonthlyCharges**: Bimodal distribution (peaks at ~$20 and ~$80). Suggests customer segments.
- **TotalCharges**: Right-skewed (correlated with tenure). Few outliers (2.3%).
```python
# Categorical variables
cat_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'Contract', 'PaymentMethod']
for col in cat_cols:
print(f"\n{col}: {df[col].nunique()} unique values")
print(df[col].value_counts())
# Bar plot
df[col].value_counts().plot(kind='bar')
plt.title(f'{col} Distribution')
plt.xticks(rotation=45)
plt.show()
```
**Findings**:
- **gender**: Balanced (50/50 male/female)
- **SeniorCitizen**: 16% are senior citizens
- **Contract**: 55% month-to-month, 24% one-year, 21% two-year
- **PaymentMethod**: Electronic check most common (34%)
### 4. Bivariate Analysis (Churn vs Features)
```python
# Churn rate by categorical variables
for col in cat_cols:
churn_rate = df.groupby(col)['Churn'].apply(lambda x: (x=='Yes').mean())
print(f"\n{col} vs Churn:")
print(churn_rate.sort_values(ascending=False))
# Stacked bar chart
pd.crosstab(df[col], df['Churn'], normalize='index').plot(kind='bar', stacked=True)
plt.title(f'Churn Rate by {col}')
plt.ylabel('Proportion')
plt.show()
```
**Key Findings**:
- **Contract**: Month-to-month churn=42.7%, One-year=11.3%, Two-year=2.8% (Strong signal!)
- **SeniorCitizen**: Seniors churn=41.7%, Non-seniors=23.6%
- **PaymentMethod**: Electronic check=45.3% churn, others ~15-18%
- **tenure**: Customers with tenure<12 months churn=47.5%, >60 months=7.9%
```python
# Numeric variables vs Churn
for col in numeric_cols:
plt.figure(figsize=(10, 4))
# Box plot
plt.subplot(1, 2, 1)
    df.boxplot(column=col, by='Churn', ax=plt.gca())  # draw into the current subplot
plt.title(f'{col} by Churn')
# Histogram (overlay)
plt.subplot(1, 2, 2)
df[df['Churn']=='No'][col].hist(bins=30, alpha=0.5, label='No Churn', density=True)
df[df['Churn']=='Yes'][col].hist(bins=30, alpha=0.5, label='Churn', density=True)
plt.legend()
plt.xlabel(col)
plt.title(f'{col} Distribution by Churn')
plt.show()
```
**Key Findings**:
- **tenure**: Churned customers have lower tenure (mean=18 vs 38 months)
- **MonthlyCharges**: Churned customers pay MORE ($74 vs $61/month)
- **TotalCharges**: Churned customers have lower total (correlated with tenure)
```python
# Correlation matrix
numeric_df = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'SeniorCitizen']].copy()
numeric_df['Churn_binary'] = (df['Churn'] == 'Yes').astype(int)
corr = numeric_df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
```
**Key Findings**:
- tenure ↔ TotalCharges: 0.83 (strong positive correlation - expected)
- Churn ↔ tenure: -0.35 (negative: longer tenure → less churn)
- Churn ↔ MonthlyCharges: +0.19 (positive: higher charges → more churn)
- Churn ↔ TotalCharges: -0.20 (negative: driven by tenure)
### 5. Insights & Recommendations
```python
print("\n=== KEY FINDINGS ===")
print("1. Data Quality:")
print(" - 11 rows (<0.2%) dropped due to missing TotalCharges")
print(" - No other quality issues. Data is clean.")
print("")
print("2. Churn Patterns:")
print(" - Overall churn rate: 26.5% (slightly imbalanced)")
print(" - Strongest predictor: Contract type (month-to-month 42.7% vs two-year 2.8%)")
print(" - High-risk segment: New customers (<12mo tenure) with high monthly charges")
print(" - Low churn: Long-term customers (>60mo) on two-year contracts")
print("")
print("3. Feature Importance:")
print(" - **High signal**: Contract, tenure, PaymentMethod, SeniorCitizen")
print(" - **Medium signal**: MonthlyCharges, InternetService")
print(" - **Low signal**: gender, PhoneService (balanced across churn/no-churn)")
print("")
print("\n=== RECOMMENDED ACTIONS ===")
print("1. Feature Engineering:")
print(" - Create 'tenure_bucket' (0-12mo, 12-24mo, 24-60mo, >60mo)")
print(" - Create 'high_charges' flag (MonthlyCharges > $70)")
print(" - Interaction: tenure × Contract (captures switching cost)")
print(" - Payment risk score (Electronic check is risky)")
print("")
print("2. Model Strategy:")
print(" - Use all categorical features (one-hot encode)")
print(" - Baseline: Predict churn for month-to-month + new customers")
print(" - Advanced: Random Forest or Gradient Boosting (handle interactions)")
print(" - Validate with stratified 5-fold CV (preserve 26.5% churn rate)")
print("")
print("3. Business Insights:")
print(" - **Retention program**: Target month-to-month customers < 12mo tenure")
print(" - **Contract incentives**: Offer discounts for one/two-year contracts")
print(" - **Payment method**: Encourage auto-pay (reduce electronic check)")
print(" - **Early warning**: Monitor customers with high MonthlyCharges + short tenure")
```
### 6. Self-Assessment
Using rubric:
- **Clarity** (5/5): Systematic exploration, clear findings at each stage
- **Completeness** (5/5): Data quality, univariate, bivariate, insights all covered
- **Rigor** (5/5): Proper statistical analysis, visualizations, quantified relationships
- **Actionability** (5/5): Specific feature engineering and business recommendations
**Average**: 5.0/5 ✓
This EDA provides solid foundation for predictive modeling and business action.
## Next Steps
1. **Feature engineering**: Implement recommended features
2. **Baseline model**: Logistic regression with top 5 features (see the sketch below)
3. **Advanced models**: Random Forest, XGBoost with feature interactions
4. **Evaluation**: F1-score, precision/recall curves, AUC-ROC
5. **Deployment**: Real-time churn scoring API
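A minimal sketch of step 2, assuming the cleaned `df` from the quality-check section above (the feature list and split settings are illustrative):
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Top features from the EDA above; categorical columns are one-hot encoded
features = pd.get_dummies(
    df[["tenure", "MonthlyCharges", "SeniorCitizen", "Contract", "PaymentMethod"]],
    drop_first=True)
target = (df["Churn"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```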


@@ -0,0 +1,226 @@
# TDD Example: User Authentication
Complete TDD example showing test-first development for authentication function.
## Task
Build a `validate_login(username, password)` function that:
- Returns `True` for valid credentials
- Returns `False` for invalid password
- Raises `ValueError` for missing username/password
- Raises `UserNotFoundError` for nonexistent users
- Logs failed attempts
## Step 1: Write Tests FIRST
```python
# test_auth.py
import logging
import pytest
from auth import validate_login, UserNotFoundError
# HAPPY PATH
def test_valid_credentials():
"""User with correct password should authenticate"""
assert validate_login("alice@example.com", "SecurePass123!") == True
# EDGE CASES
def test_empty_username():
"""Empty username should raise ValueError"""
with pytest.raises(ValueError, match="Username required"):
validate_login("", "password")
def test_empty_password():
"""Empty password should raise ValueError"""
with pytest.raises(ValueError, match="Password required"):
validate_login("alice@example.com", "")
def test_none_credentials():
"""None values should raise ValueError"""
with pytest.raises(ValueError):
validate_login(None, None)
# ERROR CONDITIONS
def test_invalid_password():
"""Wrong password should return False"""
assert validate_login("alice@example.com", "WrongPassword") == False
def test_nonexistent_user():
"""User not in database should raise UserNotFoundError"""
with pytest.raises(UserNotFoundError):
validate_login("nobody@example.com", "anypassword")
def test_case_sensitive_password():
"""Password check should be case-sensitive"""
assert validate_login("alice@example.com", "securepass123!") == False
# STATE/SIDE EFFECTS
def test_failed_attempt_logged(caplog):
"""Failed login should be logged"""
validate_login("alice@example.com", "WrongPassword")
assert "Failed login attempt" in caplog.text
assert "alice@example.com" in caplog.text
def test_successful_login_logged(caplog):
    """Successful login should be logged (INFO records need explicit capture)"""
    with caplog.at_level(logging.INFO):
        validate_login("alice@example.com", "SecurePass123!")
    assert "Successful login" in caplog.text
# INTEGRATION TEST
@pytest.fixture
def mock_database():
"""Mock database with test users"""
return {
"alice@example.com": {
"password_hash": "hashed_SecurePass123!",
"salt": "random_salt_123"
}
}
def test_database_integration(mock_database, monkeypatch):
"""Function should query database correctly"""
def mock_get_user(username):
return mock_database.get(username)
monkeypatch.setattr("auth.get_user_from_db", mock_get_user)
result = validate_login("alice@example.com", "SecurePass123!")
assert result == True
```
## Step 2: Run Tests (They Should FAIL - Red)
```bash
$ pytest test_auth.py
FAILED - ModuleNotFoundError: No module named 'auth'
```
## Step 3: Write Minimal Implementation (Green)
```python
# auth.py
import logging
import hashlib
logger = logging.getLogger(__name__)
class UserNotFoundError(Exception):
pass
def validate_login(username, password):
# Input validation
if not username:
raise ValueError("Username required")
if not password:
raise ValueError("Password required")
# Get user from database
user = get_user_from_db(username)
if user is None:
raise UserNotFoundError(f"User {username} not found")
# Hash password and compare
password_hash = hash_password(password, user['salt'])
is_valid = (password_hash == user['password_hash'])
# Log attempt
if is_valid:
logger.info(f"Successful login for {username}")
else:
logger.warning(f"Failed login attempt for {username}")
return is_valid
def get_user_from_db(username):
# Stub - implement database query
users = {
"alice@example.com": {
"password_hash": hash_password("SecurePass123!", "random_salt_123"),
"salt": "random_salt_123"
}
}
return users.get(username)
def hash_password(password, salt):
# Simplified - use bcrypt/argon2 in production
return hashlib.sha256(f"{password}{salt}".encode()).hexdigest()
```
## Step 4: Run Tests Again (Should PASS - Green)
```bash
$ pytest test_auth.py -v
test_valid_credentials PASSED
test_empty_username PASSED
test_empty_password PASSED
test_none_credentials PASSED
test_invalid_password PASSED
test_nonexistent_user PASSED
test_case_sensitive_password PASSED
test_failed_attempt_logged PASSED
test_successful_login_logged PASSED
test_database_integration PASSED
========== 10 passed in 0.15s ==========
```
## Step 5: Refactor (Keep Tests Green)
```python
# auth.py (refactored for readability)
class AuthenticationService:
def __init__(self, user_repo, password_hasher):
self.user_repo = user_repo
self.password_hasher = password_hasher
self.logger = logging.getLogger(__name__)
def validate_login(self, username, password):
self._validate_inputs(username, password)
user = self._get_user(username)
is_valid = self._check_password(password, user)
self._log_attempt(username, is_valid)
return is_valid
def _validate_inputs(self, username, password):
if not username:
raise ValueError("Username required")
if not password:
raise ValueError("Password required")
def _get_user(self, username):
user = self.user_repo.get_by_username(username)
if user is None:
raise UserNotFoundError(f"User {username} not found")
return user
def _check_password(self, password, user):
password_hash = self.password_hasher.hash(password, user.salt)
return password_hash == user.password_hash
def _log_attempt(self, username, is_valid):
if is_valid:
self.logger.info(f"Successful login for {username}")
else:
self.logger.warning(f"Failed login attempt for {username}")
```
Tests still pass after refactoring, provided the module keeps exposing a plain `validate_login(username, password)` entry point for the existing test suite.
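A minimal sketch of that wrapper (hypothetical adapter names), reusing the earlier `get_user_from_db` and `hash_password` helpers:
```python
# auth.py (continued) - thin wrappers so the original function-based tests stay green
class _UserRecord:
    def __init__(self, password_hash, salt):
        self.password_hash = password_hash
        self.salt = salt

class _DbUserRepo:
    def get_by_username(self, username):
        row = get_user_from_db(username)              # reuse the earlier helper
        return _UserRecord(row["password_hash"], row["salt"]) if row else None

class _Sha256Hasher:
    def hash(self, password, salt):
        return hash_password(password, salt)          # reuse the earlier helper

_service = AuthenticationService(_DbUserRepo(), _Sha256Hasher())

def validate_login(username, password):
    return _service.validate_login(username, password)
```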
## Key Takeaways
1. **Tests written FIRST** define expected behavior
2. **Minimal implementation** to make tests pass
3. **Refactor** with confidence (tests catch regressions)
4. **Comprehensive coverage**: happy path, edge cases, errors, side effects, integration
5. **Fast feedback**: Know immediately if something breaks
## Self-Assessment
Using rubric:
- **Clarity** (5/5): Requirements clearly defined by tests
- **Completeness** (5/5): All cases covered (happy, edge, error, integration)
- **Rigor** (5/5): TDD cycle followed (Red → Green → Refactor)
- **Actionability** (5/5): Tests are executable specification
**Average**: 5.0/5 ✓
This is production-ready test-first code.


@@ -0,0 +1,470 @@
# Code Data Analysis Scaffolds Methodology
Advanced techniques for causal inference, predictive modeling, property-based testing, and complex data analysis.
## Workflow
Copy this checklist and track your progress:
```
Code Data Analysis Scaffolds Progress:
- [ ] Step 1: Clarify task and objectives
- [ ] Step 2: Choose appropriate scaffold type
- [ ] Step 3: Generate scaffold structure
- [ ] Step 4: Validate scaffold completeness
- [ ] Step 5: Deliver scaffold and guide execution
```
**Step 1: Clarify task** - Assess complexity and determine if advanced techniques needed. See [1. When to Use Advanced Methods](#1-when-to-use-advanced-methods).
**Step 2: Choose scaffold** - Select from Causal Inference, Predictive Modeling, Property-Based Testing, or Advanced EDA. See specific sections below.
**Step 3: Generate structure** - Apply advanced scaffold matching task complexity. See [2. Causal Inference Methods](#2-causal-inference-methods), [3. Predictive Modeling Pipeline](#3-predictive-modeling-pipeline), [4. Property-Based Testing](#4-property-based-testing), [5. Advanced EDA Techniques](#5-advanced-eda-techniques).
**Step 4: Validate** - Check assumptions, sensitivity analysis, robustness checks using [6. Advanced Validation Patterns](#6-advanced-validation-patterns).
**Step 5: Deliver** - Present with caveats, limitations, and recommendations for further analysis.
## 1. When to Use Advanced Methods
| Task Characteristic | Standard Template | Advanced Methodology |
|---------------------|-------------------|---------------------|
| **Causal question** | "Does X correlate with Y?" | "Does X cause Y?" → Causal inference needed |
| **Sample size** | < 1000 rows | > 10K rows with complex patterns |
| **Model complexity** | Linear/logistic regression | Ensemble methods, neural nets, feature interactions |
| **Test sophistication** | Unit tests, integration tests | Property-based tests, mutation testing, fuzz testing |
| **Data complexity** | Clean, tabular data | Multi-modal, high-dimensional, unstructured data |
| **Stakes** | Low (exploratory) | High (production ML, regulatory compliance) |
## 2. Causal Inference Methods
Use when the research question is "Does X **cause** Y?", not just "Are X and Y correlated?"
### Causal Inference Scaffold
```python
# CAUSAL INFERENCE SCAFFOLD
# 1. DRAW CAUSAL DAG (Directed Acyclic Graph)
# Explicitly model: Treatment → Outcome, Confounders → Treatment & Outcome
#
# Example:
# Education → Income
# ↑ ↑
# Family Background
#
# Treatment: Education
# Outcome: Income
# Confounder: Family Background (affects both education and income)
# 2. IDENTIFY CONFOUNDERS
confounders = ['age', 'gender', 'family_income', 'region']
# These variables affect BOTH treatment and outcome
# If not controlled, they bias causal estimate
# 3. CHECK IDENTIFICATION ASSUMPTIONS
# For causal effect to be identifiable:
# a) No unmeasured confounders (all variables in DAG observed)
# b) Treatment assignment as-if random conditional on confounders
# c) Positivity: Every unit has nonzero probability of treatment/control
# 4. CHOOSE IDENTIFICATION STRATEGY
# Option A: RCT - Random assignment eliminates confounding. Check balance on confounders.
from scipy import stats
for var in confounders:
_, p = stats.ttest_ind(treatment_group[var], control_group[var])
print(f"{var}: {'' if p > 0.05 else ''} balanced")
# Option B: Regression - Control for confounders. Assumes no unmeasured confounding.
import statsmodels.formula.api as smf
model = smf.ols('outcome ~ treatment + age + gender + family_income', data=df).fit()
treatment_effect = model.params['treatment']
# Option C: Propensity Score Matching - Match treated to similar controls on P(treatment|X).
from sklearn.linear_model import LogisticRegression; from sklearn.neighbors import NearestNeighbors
ps_model = LogisticRegression().fit(df[confounders], df['treatment'])
df['ps'] = ps_model.predict_proba(df[confounders])[:,1]
treated, control = df[df['treatment']==1], df[df['treatment']==0]
nn = NearestNeighbors(n_neighbors=1).fit(control[['ps']])
_, indices = nn.kneighbors(treated[['ps']])
treatment_effect = treated['outcome'].mean() - control.iloc[indices.flatten()]['outcome'].mean()
# Option D: IV - Need instrument Z: affects treatment, not outcome (except through treatment).
from statsmodels.sandbox.regression.gmm import IV2SLS
iv_model = IV2SLS(df['income'], df[['education'] + confounders], df[['instrument'] + confounders]).fit()
# Option E: RDD - Treatment assigned at cutoff. Compare units just above/below threshold.
df['above_cutoff'] = (df['running_var'] >= cutoff).astype(int)
# Use local linear regression around cutoff to estimate effect
# Option F: DiD - Compare treatment vs control, before vs after. Assumes parallel trends.
t_before, t_after = df[(df['group']=='T') & (df['time']=='before')]['y'].mean(), df[(df['group']=='T') & (df['time']=='after')]['y'].mean()
c_before, c_after = df[(df['group']=='C') & (df['time']=='before')]['y'].mean(), df[(df['group']=='C') & (df['time']=='after')]['y'].mean()
did_estimate = (t_after - t_before) - (c_after - c_before)
# 5. SENSITIVITY ANALYSIS
print("\n=== SENSITIVITY CHECKS ===")
print("1. Unmeasured confounding: How strong would confounder need to be to change conclusion?")
print("2. Placebo tests: Check for effect in period before treatment (should be zero)")
print("3. Falsification tests: Check for effect on outcome that shouldn't be affected")
print("4. Robustness: Try different model specifications, subsamples, bandwidths (RDD)")
# 6. REPORT CAUSAL ESTIMATE WITH UNCERTAINTY
print(f"\nCausal Effect: {treatment_effect:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"Interpretation: Treatment X causes {treatment_effect:.1%} change in outcome Y")
print(f"Assumptions: [List key identifying assumptions]")
print(f"Limitations: [Threats to validity]")
```
### Causal Inference Checklist
- [ ] **Causal question clearly stated**: "Does X cause Y?" not "Are X and Y related?"
- [ ] **DAG drawn**: Treatment, outcome, confounders, mediators identified
- [ ] **Identification strategy chosen**: RCT, regression, PS matching, IV, RDD, DiD
- [ ] **Assumptions checked**: No unmeasured confounding, positivity, parallel trends (DiD), etc.
- [ ] **Sensitivity analysis**: Test robustness to violations of assumptions (see the placebo sketch below)
- [ ] **Limitations acknowledged**: Threats to internal/external validity stated
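As flagged in the checklist, a minimal sketch of one sensitivity check - a placebo DiD run on two pre-treatment periods - assuming a DataFrame `df` with `group` ('T'/'C'), `time`, and outcome `y` columns as in the scaffold above ('pre1'/'pre2' are hypothetical pre-period labels):
```python
import pandas as pd

def did(df: pd.DataFrame, before: str, after: str) -> float:
    """Difference-in-differences estimate between two periods."""
    def cell(group, time):
        return df[(df["group"] == group) & (df["time"] == time)]["y"].mean()
    return (cell("T", after) - cell("T", before)) - (cell("C", after) - cell("C", before))

real_effect = did(df, before="before", after="after")
placebo = did(df, before="pre1", after="pre2")   # both periods are pre-treatment
print(f"DiD estimate: {real_effect:.3f}")
print(f"Placebo estimate (should be ~0 if parallel trends holds): {placebo:.3f}")
```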
## 3. Predictive Modeling Pipeline
Use for forecasting, classification, and regression - when the goal is prediction, not causal understanding.
### Predictive Modeling Scaffold
```python
# PREDICTIVE MODELING SCAFFOLD
# 1. DEFINE PREDICTION TASK & METRIC
task = "Predict customer churn (binary classification)"
primary_metric = "F1-score" # Balance precision/recall
secondary_metrics = ["AUC-ROC", "precision", "recall", "accuracy"]
# 2. TRAIN/VAL/TEST SPLIT (before any preprocessing!)
from sklearn.model_selection import train_test_split
# Split: 60% train, 20% validation, 20% test
train_val, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['target'])
train, val = train_test_split(train_val, test_size=0.25, random_state=42, stratify=train_val['target'])
print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
print(f"Class balance - Train: {train['target'].mean():.2%}, Test: {test['target'].mean():.2%}")
# 3. FEATURE ENGINEERING (fit on train, transform train/val/test)
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# Numeric features: impute missing, standardize
numeric_features = ['age', 'income', 'tenure']
num_imputer = SimpleImputer(strategy='median').fit(train[numeric_features])
num_scaler = StandardScaler().fit(num_imputer.transform(train[numeric_features]))
X_train_num = num_scaler.transform(num_imputer.transform(train[numeric_features]))
X_val_num = num_scaler.transform(num_imputer.transform(val[numeric_features]))
X_test_num = num_scaler.transform(num_imputer.transform(test[numeric_features]))
# Categorical features: one-hot encode
from sklearn.preprocessing import OneHotEncoder
cat_features = ['region', 'product_type']
cat_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False).fit(train[cat_features])  # scikit-learn >= 1.2; use sparse=False on older versions
X_train_cat = cat_encoder.transform(train[cat_features])
X_val_cat = cat_encoder.transform(val[cat_features])
X_test_cat = cat_encoder.transform(test[cat_features])
# Combine features
import numpy as np
import pandas as pd  # used below for the feature-importance table
X_train = np.hstack([X_train_num, X_train_cat])
X_val = np.hstack([X_val_num, X_val_cat])
X_test = np.hstack([X_test_num, X_test_cat])
y_train, y_val, y_test = train['target'], val['target'], test['target']
# 4. BASELINE MODEL (always start simple!)
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_f1 = f1_score(y_val, baseline.predict(X_val))
print(f"Baseline F1: {baseline_f1:.3f}")
# 5. MODEL SELECTION & HYPERPARAMETER TUNING
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, precision_score, recall_score
# Try multiple models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}
results = {}
for name, model in models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
y_proba = model.predict_proba(X_val)[:,1]
results[name] = {
'F1': f1_score(y_val, y_pred),
'AUC': roc_auc_score(y_val, y_proba),
'Precision': precision_score(y_val, y_pred),
'Recall': recall_score(y_val, y_pred)
}
print(f"{name}: F1={results[name]['F1']:.3f}, AUC={results[name]['AUC']:.3f}")
# Select best model (highest F1 on validation)
best_model_name = max(results, key=lambda x: results[x]['F1'])
best_model = models[best_model_name]
print(f"\nBest model: {best_model_name}")
# Hyperparameter tuning on best model
if best_model_name == 'Random Forest':
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, None],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(best_model, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
print(f"Best params: {grid_search.best_params_}")
# 6. CROSS-VALIDATION (check for overfitting)
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1 scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.3f}, Std: {cv_scores.std():.3f}")
# 7. FINAL EVALUATION ON TEST SET (only once!)
y_test_pred = best_model.predict(X_test)
y_test_proba = best_model.predict_proba(X_test)[:,1]
test_f1 = f1_score(y_test, y_test_pred)
test_auc = roc_auc_score(y_test, y_test_proba)
print(f"\n=== FINAL TEST PERFORMANCE ===")
print(f"F1: {test_f1:.3f}, AUC: {test_auc:.3f}")
# 8. ERROR ANALYSIS
from sklearn.metrics import confusion_matrix, classification_report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))
# Analyze misclassifications
test_df = test.copy()
test_df['prediction'] = y_test_pred
test_df['prediction_proba'] = y_test_proba
false_positives = test_df[(test_df['target']==0) & (test_df['prediction']==1)]
false_negatives = test_df[(test_df['target']==1) & (test_df['prediction']==0)]
print(f"False Positives: {len(false_positives)}")
print(f"False Negatives: {len(false_negatives)}")
# Inspect these cases to understand failure modes
# 9. FEATURE IMPORTANCE
if hasattr(best_model, 'feature_importances_'):
feature_names = numeric_features + list(cat_encoder.get_feature_names_out(cat_features))
importances = pd.DataFrame({
'feature': feature_names,
'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Features:")
print(importances.head(10))
# 10. MODEL DEPLOYMENT CHECKLIST
print("\n=== DEPLOYMENT READINESS ===")
print(f"✓ Test F1 ({test_f1:.3f}) > Baseline ({baseline_f1:.3f})")
print(f"✓ Cross-validation shows consistent performance (CV std={cv_scores.std():.3f})")
print("✓ Error analysis completed, failure modes understood")
print("✓ Feature importance computed, no surprising features")
print("□ Model serialized and saved")
print("□ Monitoring plan in place (track drift in input features, output distribution)")
print("□ Rollback plan if model underperforms in production")
```
### Predictive Modeling Checklist
- [ ] **Clear prediction task**: Classification, regression, time series forecasting
- [ ] **Appropriate metrics**: Match business objectives (precision vs recall tradeoff, etc.)
- [ ] **Train/val/test split**: Before any preprocessing (no data leakage - see the pipeline sketch after this checklist)
- [ ] **Baseline model**: Simple model for comparison
- [ ] **Feature engineering**: Proper handling of missing values, scaling, encoding
- [ ] **Cross-validation**: k-fold CV to check for overfitting
- [ ] **Model selection**: Compare multiple model types
- [ ] **Hyperparameter tuning**: Grid/random search on validation set
- [ ] **Error analysis**: Understand failure modes, inspect misclassifications
- [ ] **Test set evaluation**: Final performance check (only once!)
- [ ] **Deployment readiness**: Monitoring, rollback plan, model versioning
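As referenced in the checklist, a minimal sketch of a leakage-proof alternative to the hand-rolled preprocessing above: bundle the transformers in a `Pipeline`/`ColumnTransformer` so cross-validation refits them on each training fold (reuses the `train` frame and column names from the scaffold):
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'income', 'tenure']
cat_features = ['region', 'product_type']
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
])
model = Pipeline([('prep', preprocess), ('clf', RandomForestClassifier(random_state=42))])

# Each fold fits imputer/scaler/encoder on its own training part only - no leakage by construction
scores = cross_val_score(model, train[numeric_features + cat_features], train['target'],
                         cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```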
## 4. Property-Based Testing
Use for testing complex logic, data transformations, invariants. Goes beyond example-based tests.
### Property-Based Testing Scaffold
```python
# PROPERTY-BASED TESTING SCAFFOLD
from hypothesis import given, strategies as st
import pytest
# Example: Testing a sort function
def my_sort(lst):
return sorted(lst)
# Property 1: Output length equals input length
@given(st.lists(st.integers()))
def test_sort_preserves_length(lst):
assert len(my_sort(lst)) == len(lst)
# Property 2: Output is sorted (each element <= next element)
@given(st.lists(st.integers()))
def test_sort_is_sorted(lst):
result = my_sort(lst)
for i in range(len(result) - 1):
assert result[i] <= result[i+1]
# Property 3: Output contains same elements as input (multiset equality)
from collections import Counter

@given(st.lists(st.integers()))
def test_sort_preserves_elements(lst):
    result = my_sort(lst)
    assert Counter(result) == Counter(lst)  # same elements, same multiplicities
# Property 4: Idempotence (sorting twice = sorting once)
@given(st.lists(st.integers()))
def test_sort_is_idempotent(lst):
result = my_sort(lst)
assert my_sort(result) == result
# Property 5: Empty input → empty output
def test_sort_empty_list():
assert my_sort([]) == []
# Property 6: Single element → unchanged
@given(st.integers())
def test_sort_single_element(x):
assert my_sort([x]) == [x]
```
### Property-Based Testing Strategies
**For data transformations:**
- Idempotence: `f(f(x)) == f(x)`
- Round-trip: `decode(encode(x)) == x` (see the sketch after these lists)
- Commutativity: `f(g(x)) == g(f(x))`
- Invariants: Properties that never change (e.g., sum after transformation)
**For numeric functions:**
- Boundary conditions: Zero, negative, very large numbers
- Inverse relationships: `f(f_inverse(x)) ≈ x`
- Known identities: `sin²(x) + cos²(x) = 1`
**For string/list operations:**
- Length preservation or predictable change
- Character/element preservation
- Order properties (sorted, reversed)
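A minimal sketch of the round-trip property using Hypothesis and the standard `json` module, restricted to JSON-representable values where exact round-tripping holds:
```python
import json
from hypothesis import given, strategies as st

# JSON-representable values: None, bools, finite floats, text, and nested lists of these
json_values = st.recursive(
    st.none() | st.booleans() | st.floats(allow_nan=False, allow_infinity=False) | st.text(),
    lambda children: st.lists(children),
    max_leaves=20,
)

@given(json_values)
def test_json_round_trip(value):
    assert json.loads(json.dumps(value)) == value   # decode(encode(x)) == x
```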
## 5. Advanced EDA Techniques
For high-dimensional, multi-modal, or complex data.
### Dimensionality Reduction
```python
# PCA: Linear dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
# t-SNE: Non-linear, good for visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.scatter(X_tsne[:,0], X_tsne[:,1], c=y, cmap='viridis'); plt.show()
# UMAP: Faster alternative to t-SNE, preserves global structure
# pip install umap-learn
import umap
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
```
### Cluster Analysis
```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
# Elbow method: Find optimal K
inertias = []
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X_scaled)
inertias.append(kmeans.inertia_)
plt.plot(range(2, 11), inertias); plt.xlabel('K'); plt.ylabel('Inertia'); plt.show()
# Silhouette score: Measure cluster quality
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=42).fit(X_scaled)
score = silhouette_score(X_scaled, kmeans.labels_)
print(f"K={k}: Silhouette={score:.3f}")
# DBSCAN: Density-based clustering (finds arbitrary shapes)
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_scaled)
print(f"Clusters found: {len(set(clusters)) - (1 if -1 in clusters else 0)}")
print(f"Noise points: {(clusters == -1).sum()}")
```
## 6. Advanced Validation Patterns
### Mutation Testing
Tests the quality of your tests by introducing bugs and checking if tests catch them.
```python
# Install: pip install mutmut
# Run: mutmut run --paths-to-mutate=src/
# Check: mutmut results
# Survivors (mutations not caught) indicate weak tests
```
### Fuzz Testing
Generate random/malformed inputs to find edge cases.
```python
from hypothesis import given, strategies as st
@given(st.text())
def test_function_doesnt_crash_on_any_string(s):
result = my_function(s) # Should never raise exception
assert result is not None
```
### Data Validation Framework (Great Expectations)
```python
import great_expectations as gx
# Define expectations
expectation_suite = gx.ExpectationSuite(name="my_data_suite")
expectation_suite.add_expectation(gx.expectations.ExpectColumnToExist(column="user_id"))
expectation_suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id"))
expectation_suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="age", min_value=0, max_value=120))
# Validate data (assumes `context` and `batch_request` were set up earlier, e.g. via
# gx.get_context(); the exact run call differs between Great Expectations versions)
results = context.run_validation(batch_request, expectation_suite)
print(results["success"])  # True if all expectations met
```
## 7. When to Use Each Method
| Research Goal | Method | Key Consideration |
|---------------|--------|-------------------|
| Causal effect estimation | RCT, IV, RDD, DiD | Identify confounders, check assumptions |
| Prediction/forecasting | Supervised ML | Avoid data leakage, validate out-of-sample |
| Pattern discovery | Clustering, PCA, t-SNE | Dimensionality reduction first if high-D |
| Complex logic testing | Property-based testing | Define invariants that must hold |
| Data quality | Great Expectations | Automate checks in pipelines |


@@ -0,0 +1,391 @@
# Code Data Analysis Scaffolds Template
## Workflow
Copy this checklist and track your progress:
```
Code Data Analysis Scaffolds Progress:
- [ ] Step 1: Clarify task and objectives
- [ ] Step 2: Choose appropriate scaffold type
- [ ] Step 3: Generate scaffold structure
- [ ] Step 4: Validate scaffold completeness
- [ ] Step 5: Deliver scaffold and guide execution
```
**Step 1: Clarify task** - Ask context questions to understand task type, constraints, expected outcomes. See [Context Questions](#context-questions).
**Step 2: Choose scaffold** - Select TDD, EDA, Statistical Analysis, or Validation based on task. See [Scaffold Selection Guide](#scaffold-selection-guide).
**Step 3: Generate structure** - Use appropriate scaffold template. See [TDD Scaffold](#tdd-scaffold), [EDA Scaffold](#eda-scaffold), [Statistical Analysis Scaffold](#statistical-analysis-scaffold), or [Validation Scaffold](#validation-scaffold).
**Step 4: Validate completeness** - Check scaffold covers requirements, includes validation steps, makes assumptions explicit. See [Quality Checklist](#quality-checklist).
**Step 5: Deliver and guide** - Present scaffold, highlight next steps, surface any gaps discovered. Execute if user wants help.
## Context Questions
**For all tasks:**
- What are you trying to accomplish? (Specific outcome expected)
- What's the context? (Dataset characteristics, codebase state, existing work)
- Any constraints? (Time, tools, data limitations, performance requirements)
- What does success look like? (Acceptance criteria, quality bar)
**For TDD tasks:**
- What functionality needs tests? (Feature, bug fix, refactor)
- Existing test coverage? (None, partial, comprehensive)
- Test framework preference? (pytest, jest, junit, etc.)
- Integration vs unit tests? (Scope of testing)
**For EDA tasks:**
- What's the dataset? (Size, format, source)
- What questions are you trying to answer? (Exploratory vs. hypothesis-driven)
- Existing knowledge about data? (Schema, distributions, known issues)
- End goal? (Feature engineering, quality assessment, insights)
**For Statistical/Modeling tasks:**
- What's the research question? (Descriptive, predictive, causal)
- Available data? (Sample size, variables, treatment/control)
- Causal or predictive goal? (Understanding why vs. forecasting what)
- Significance level / acceptable error rate?
## Scaffold Selection Guide
| User Says | Task Type | Scaffold to Use |
|-----------|-----------|-----------------|
| "Write tests for..." | TDD | [TDD Scaffold](#tdd-scaffold) |
| "Explore this dataset..." | EDA | [EDA Scaffold](#eda-scaffold) |
| "Analyze the effect of..." / "Does X cause Y?" | Causal Inference | See methodology.md |
| "Predict..." / "Classify..." / "Forecast..." | Predictive Modeling | See methodology.md |
| "Design an A/B test..." / "Compare groups..." | Statistical Analysis | [Statistical Analysis Scaffold](#statistical-analysis-scaffold) |
| "Validate..." / "Check quality..." | Validation | [Validation Scaffold](#validation-scaffold) |
## TDD Scaffold
Use when writing new code, refactoring, or fixing bugs. **Write tests FIRST, then implement.**
### Quick Template
```python
# Test file: test_[module].py
import pytest
from unittest.mock import Mock
from [module] import [function_to_test]
# 1. HAPPY PATH TESTS (expected usage)
def test_[function]_with_valid_input():
"""Test normal, expected behavior"""
result = [function](valid_input)
assert result == expected_output
assert result.property == expected_value
# 2. EDGE CASE TESTS (boundary conditions)
def test_[function]_with_empty_input():
"""Test with empty/minimal input"""
result = [function]([])
assert result == expected_for_empty
def test_[function]_with_maximum_input():
"""Test with large/maximum input"""
result = [function](large_input)
assert result is not None
# 3. ERROR CONDITION TESTS (invalid input, expected failures)
def test_[function]_with_invalid_input():
"""Test proper error handling"""
with pytest.raises(ValueError):
[function](invalid_input)
def test_[function]_with_none_input():
"""Test None handling"""
with pytest.raises(TypeError):
[function](None)
# 4. STATE TESTS (if function modifies state)
def test_[function]_modifies_state_correctly():
"""Test side effects are correct"""
obj = Object()
obj.[function](param)
assert obj.state == expected_state
# 5. INTEGRATION TESTS (if interacting with external systems)
@pytest.fixture
def mock_external_service():
"""Mock external dependencies"""
return Mock(spec=ExternalService)
def test_[function]_with_external_service(mock_external_service):
"""Test integration points"""
result = [function](mock_external_service)
mock_external_service.method.assert_called_once()
assert result == expected_from_integration
```
### Test Data Setup
```python
# conftest.py or test fixtures
@pytest.fixture
def sample_data():
"""Reusable test data"""
return {
"valid": [...],
"edge_case": [...],
"invalid": [...]
}
@pytest.fixture(scope="session")
def database_session():
"""Database for integration tests"""
db = create_test_db()
yield db
db.cleanup()
```
### TDD Cycle
1. **Red**: Write failing test (defines what success looks like)
2. **Green**: Write minimal code to make test pass
3. **Refactor**: Improve code while keeping tests green
4. **Repeat**: Next test case
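A minimal red-to-green pass of the cycle above, using a hypothetical `slugify` function (pytest assumed):
```python
# RED: write the test first; it fails because slugify does not exist yet.
def test_slugify_replaces_spaces_with_hyphens():
    assert slugify("Hello World") == "hello-world"

# GREEN: the smallest implementation that makes the test pass.
def slugify(text: str) -> str:
    return text.strip().lower().replace(" ", "-")

# REFACTOR: e.g. collapse repeated whitespace or add unicode handling;
# the existing test keeps you honest while you change the internals.
```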
## EDA Scaffold
Use when exploring a new dataset. Follow a systematic plan to understand data quality and patterns.
### Quick Template
```python
# 1. DATA OVERVIEW
# Load and inspect
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_[format]('data.csv')
# Basic info
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.dtypes)
print(df.head())
print(df.info())
print(df.describe())
# 2. DATA QUALITY CHECKS
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(missing_pct[missing_pct > 0])
# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Data types consistency
print("Check: Are numeric columns actually numeric?")
print("Check: Are dates parsed correctly?")
print("Check: Are categorical variables encoded properly?")
# 3. UNIVARIATE ANALYSIS
# Numeric: mean, median, std, range, distribution plots, outliers (IQR method)
for col in df.select_dtypes(include=[np.number]).columns:
print(f"{col}: mean={df[col].mean():.2f}, median={df[col].median():.2f}, std={df[col].std():.2f}")
df[col].hist(bins=50); plt.title(f'{col} Distribution'); plt.show()
Q1, Q3 = df[col].quantile([0.25, 0.75])
outliers = ((df[col] < (Q1 - 1.5*(Q3-Q1))) | (df[col] > (Q3 + 1.5*(Q3-Q1)))).sum()
print(f" Outliers: {outliers} ({outliers/len(df)*100:.1f}%)")
# Categorical: value counts, unique values, bar plots
for col in df.select_dtypes(include=['object', 'category']).columns:
print(f"{col}: {df[col].nunique()} unique, most common={df[col].mode()[0]}")
df[col].value_counts().head(10).plot(kind='bar'); plt.show()
# 4. BIVARIATE ANALYSIS
# Correlation heatmap, pairplots, categorical vs numeric boxplots
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm')
sns.pairplot(df[['var1', 'var2', 'var3', 'target']], hue='target'); plt.show()
# For each categorical-numeric pair, create boxplots to see distributions
# 5. INSIGHTS & NEXT STEPS
print("\n=== KEY FINDINGS ===")
print("1. Data quality: [summary]")
print("2. Distributions: [any skewness, outliers]")
print("3. Correlations: [strong relationships found]")
print("4. Missing patterns: [systematic missingness?]")
print("\n=== RECOMMENDED ACTIONS ===")
print("1. Handle missing data: [imputation strategy]")
print("2. Address outliers: [cap, remove, transform]")
print("3. Feature engineering: [ideas based on EDA]")
print("4. Data transformations: [log, standardize, encode]")
```
### EDA Checklist
- [ ] Load data and check shape/dtypes
- [ ] Assess missing values (how much, which variables, patterns?)
- [ ] Check for duplicates
- [ ] Validate data types (numeric, categorical, dates)
- [ ] Univariate analysis (distributions, outliers, summary stats)
- [ ] Bivariate analysis (correlations, relationships with target)
- [ ] Identify data quality issues
- [ ] Document insights and recommended next steps
## Statistical Analysis Scaffold
Use for hypothesis testing, A/B tests, comparing groups.
### Quick Template
```python
# STATISTICAL ANALYSIS SCAFFOLD
# 1. DEFINE RESEARCH QUESTION
question = "Does treatment X improve outcome Y?"
# 2. STATE HYPOTHESES
H0 = "Treatment X has no effect on outcome Y (null hypothesis)"
H1 = "Treatment X improves outcome Y (alternative hypothesis)"
# 3. SET SIGNIFICANCE LEVEL
alpha = 0.05 # 5% significance level (Type I error rate)
power = 0.80 # 80% power (1 - Type II error rate)
# 4. CHECK ASSUMPTIONS (t-test: independence, normality, equal variance)
from scipy import stats
_, p_norm = stats.shapiro(treatment_group) # Normality test
_, p_var = stats.levene(treatment_group, control_group) # Equal variance test
print(f"Normality: p={p_norm:.3f} {'' if p_norm > 0.05 else '✗ use non-parametric'}")
print(f"Equal variance: p={p_var:.3f} {'' if p_var > 0.05 else '✗ use Welch t-test'}")
# 5. PERFORM STATISTICAL TEST
# Choose appropriate test based on data type and assumptions
# For continuous outcome, 2 groups:
statistic, p_value = stats.ttest_ind(treatment_group, control_group)
print(f"t-statistic: {statistic:.3f}, p-value: {p_value:.4f}")
# For categorical outcome, 2 groups:
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['group'], df['outcome'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")
# 6. INTERPRET RESULTS & EFFECT SIZE
if p_value < alpha:
    # Simple pooled SD (assumes roughly equal group sizes)
    pooled_std = ((treatment_group.std() ** 2 + control_group.std() ** 2) / 2) ** 0.5
    cohen_d = (treatment_group.mean() - control_group.mean()) / pooled_std
    effect = "Small" if abs(cohen_d) < 0.5 else "Medium" if abs(cohen_d) < 0.8 else "Large"
    print(f"REJECT H0 (p={p_value:.4f}). Effect size (Cohen's d)={cohen_d:.3f} ({effect})")
else:
print(f"FAIL TO REJECT H0 (p={p_value:.4f}). Insufficient evidence for effect.")
# 7. CONFIDENCE INTERVAL & SENSITIVITY
ci_95 = stats.t.interval(0.95, len(treatment_group)-1, loc=treatment_group.mean(), scale=stats.sem(treatment_group))
print(f"95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
print("Sensitivity: Check without outliers, with non-parametric test, with confounders")
```
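The scaffold declares a target power of 0.80 but does not use it. If statsmodels is available, the required sample size per group can be estimated up front; a sketch assuming a two-sample t-test and a hypothesised medium effect:
```python
from statsmodels.stats.power import TTestIndPower

# Observations per group needed to detect Cohen's d = 0.5
# at alpha = 0.05 with 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {n_per_group:.0f}")
```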
### Statistical Test Selection
| Data Type | # Groups | Test |
|-----------|----------|------|
| Continuous | 2 | t-test (or Welch's if unequal variance) |
| Continuous | 3+ | ANOVA (or Kruskal-Wallis if non-normal) |
| Categorical | 2 | Chi-square or Fisher's exact |
| Ordinal | 2 | Mann-Whitney U |
| Paired/Repeated | 2 | Paired t-test or Wilcoxon signed-rank |
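Two rows of the table in code form, with hypothetical arrays of a continuous outcome (scipy assumed):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(50, 10, 200)
group_b = rng.normal(52, 15, 180)

# Continuous outcome, 2 groups, unequal variances -> Welch's t-test
t_stat, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Non-normal or ordinal outcome -> Mann-Whitney U
u_stat, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Welch p={p_welch:.4f}, Mann-Whitney p={p_mwu:.4f}")
```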
## Validation Scaffold
Use for validating data quality, code quality, or model quality before shipping.
### Data Validation Template
```python
# DATA VALIDATION CHECKLIST
# 1. SCHEMA VALIDATION
expected_columns = ['id', 'timestamp', 'value', 'category']
assert set(df.columns) == set(expected_columns), "Column mismatch"
expected_dtypes = {'id': 'int64', 'timestamp': 'datetime64[ns]', 'value': 'float64', 'category': 'object'}
for col, dtype in expected_dtypes.items():
assert df[col].dtype == dtype, f"{col} type mismatch: expected {dtype}, got {df[col].dtype}"
# 2. RANGE VALIDATION
assert df['value'].min() >= 0, "Negative values found (should be >= 0)"
assert df['value'].max() <= 100, "Values exceed maximum (should be <= 100)"
# 3. UNIQUENESS VALIDATION
assert df['id'].is_unique, "Duplicate IDs found"
# 4. COMPLETENESS VALIDATION
required_fields = ['id', 'value']
for field in required_fields:
missing_pct = df[field].isnull().mean() * 100
assert missing_pct == 0, f"{field} has {missing_pct:.1f}% missing (required field)"
# 5. CONSISTENCY VALIDATION
assert (df['start_date'] <= df['end_date']).all(), "start_date after end_date found"
# 6. REFERENTIAL INTEGRITY
valid_categories = ['A', 'B', 'C']
assert df['category'].isin(valid_categories).all(), "Invalid categories found"
print("✓ All data validations passed")
```
### Code Validation Checklist
- [ ] **Unit tests**: All functions have tests covering happy path, edge cases, errors
- [ ] **Integration tests**: APIs, database interactions tested end-to-end
- [ ] **Test coverage**: ≥80% coverage for critical paths
- [ ] **Error handling**: All exceptions caught and handled gracefully
- [ ] **Input validation**: All user inputs validated before processing
- [ ] **Logging**: Key operations logged for debugging
- [ ] **Documentation**: Functions have docstrings, README updated
- [ ] **Performance**: No obvious performance bottlenecks (profiled if needed)
- [ ] **Security**: No hardcoded secrets, SQL injection protected, XSS prevented
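The coverage item in the checklist above can be enforced automatically in CI. One common way, assuming pytest with the pytest-cov plugin and an illustrative `src/` layout:
```python
# Install: pip install pytest-cov
# Run with a hard floor: pytest --cov=src --cov-fail-under=80
# The run fails if total coverage drops below 80%.
```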
### Model Validation Checklist
- [ ] **Train/val/test split**: Data split before any preprocessing (no data leakage)
- [ ] **Baseline model**: Simple baseline implemented for comparison
- [ ] **Cross-validation**: k-fold CV performed (k≥5)
- [ ] **Metrics**: Appropriate metrics chosen (accuracy, precision/recall, AUC, RMSE, etc.)
- [ ] **Overfitting check**: Training vs validation performance compared
- [ ] **Error analysis**: Failure modes analyzed, edge cases identified
- [ ] **Fairness**: Model checked for bias across sensitive groups
- [ ] **Interpretability**: Feature importance or SHAP values computed
- [ ] **Robustness**: Model tested with perturbed inputs
- [ ] **Monitoring**: Drift detection and performance tracking in place
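Two items from the checklist above, the baseline model and the overfitting check, in a compact sklearn sketch (synthetic data; model choice illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Baseline: any real model should clearly beat this.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()

# Overfitting check: compare train vs validation scores across folds.
cv = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=5, return_train_score=True)
print(f"Baseline accuracy:   {baseline:.3f}")
print(f"Train accuracy:      {cv['train_score'].mean():.3f}")
print(f"Validation accuracy: {cv['test_score'].mean():.3f}  (a large gap suggests overfitting)")
```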
## Quality Checklist
Before delivering, verify:
**Scaffold Structure:**
- [ ] Clear step-by-step process defined
- [ ] Each step has concrete actions (not vague advice)
- [ ] Validation checkpoints included
- [ ] Expected outputs specified
**Completeness:**
- [ ] Covers all requirements from user's task
- [ ] Includes example code/pseudocode where helpful
- [ ] Anticipates edge cases and error conditions
- [ ] Provides decision guidance (when to use which approach)
**Clarity:**
- [ ] Assumptions stated explicitly
- [ ] Technical terms defined or illustrated
- [ ] Success criteria clear
- [ ] Next steps obvious
**Actionability:**
- [ ] User can execute scaffold without further guidance
- [ ] Code snippets are runnable (or nearly runnable)
- [ ] Gaps surfaced early (missing data, unclear requirements)
- [ ] Includes validation/quality checks
**Rubric Score:**
- [ ] Self-assessed with rubric ≥ 3.5 average