# Code Data Analysis Scaffolds Template

## Workflow

Copy this checklist and track your progress:

```
Code Data Analysis Scaffolds Progress:
- [ ] Step 1: Clarify task and objectives
- [ ] Step 2: Choose appropriate scaffold type
- [ ] Step 3: Generate scaffold structure
- [ ] Step 4: Validate scaffold completeness
- [ ] Step 5: Deliver scaffold and guide execution
```

**Step 1: Clarify task** - Ask context questions to understand task type, constraints, and expected outcomes. See [Context Questions](#context-questions).

**Step 2: Choose scaffold** - Select TDD, EDA, Statistical Analysis, or Validation based on the task. See [Scaffold Selection Guide](#scaffold-selection-guide).

**Step 3: Generate structure** - Use the appropriate scaffold template. See [TDD Scaffold](#tdd-scaffold), [EDA Scaffold](#eda-scaffold), [Statistical Analysis Scaffold](#statistical-analysis-scaffold), or [Validation Scaffold](#validation-scaffold).

**Step 4: Validate completeness** - Check that the scaffold covers requirements, includes validation steps, and makes assumptions explicit. See [Quality Checklist](#quality-checklist).

**Step 5: Deliver and guide** - Present the scaffold, highlight next steps, and surface any gaps discovered. Execute if the user wants help.

## Context Questions

**For all tasks:**

- What are you trying to accomplish? (Specific outcome expected)
- What's the context? (Dataset characteristics, codebase state, existing work)
- Any constraints? (Time, tools, data limitations, performance requirements)
- What does success look like? (Acceptance criteria, quality bar)

**For TDD tasks:**

- What functionality needs tests? (Feature, bug fix, refactor)
- Existing test coverage? (None, partial, comprehensive)
- Test framework preference? (pytest, jest, junit, etc.)
- Integration vs. unit tests? (Scope of testing)

**For EDA tasks:**

- What's the dataset? (Size, format, source)
- What questions are you trying to answer? (Exploratory vs. hypothesis-driven)
- Existing knowledge about the data? (Schema, distributions, known issues)
- End goal? (Feature engineering, quality assessment, insights)

**For Statistical/Modeling tasks:**

- What's the research question? (Descriptive, predictive, causal)
- Available data? (Sample size, variables, treatment/control)
- Causal or predictive goal? (Understanding why vs. forecasting what)
- Significance level / acceptable error rate?

## Scaffold Selection Guide

| User Says | Task Type | Scaffold to Use |
|-----------|-----------|-----------------|
| "Write tests for..." | TDD | [TDD Scaffold](#tdd-scaffold) |
| "Explore this dataset..." | EDA | [EDA Scaffold](#eda-scaffold) |
| "Analyze the effect of..." / "Does X cause Y?" | Causal Inference | See methodology.md |
| "Predict..." / "Classify..." / "Forecast..." | Predictive Modeling | See methodology.md |
| "Design an A/B test..." / "Compare groups..." | Statistical Analysis | [Statistical Analysis Scaffold](#statistical-analysis-scaffold) |
| "Validate..." / "Check quality..." | Validation | [Validation Scaffold](#validation-scaffold) |

## TDD Scaffold

Use when writing new code, refactoring, or fixing bugs. **Write tests FIRST, then implement.**

### Quick Template

```python
# Test file: test_[module].py
import pytest
from unittest.mock import Mock
from [module] import [function_to_test]

# 1. HAPPY PATH TESTS (expected usage)
def test_[function]_with_valid_input():
    """Test normal, expected behavior"""
    result = [function](valid_input)
    assert result == expected_output
    assert result.property == expected_value

# 2. EDGE CASE TESTS (boundary conditions)
def test_[function]_with_empty_input():
    """Test with empty/minimal input"""
    result = [function]([])
    assert result == expected_for_empty

def test_[function]_with_maximum_input():
    """Test with large/maximum input"""
    result = [function](large_input)
    assert result is not None

# 3. ERROR CONDITION TESTS (invalid input, expected failures)
def test_[function]_with_invalid_input():
    """Test proper error handling"""
    with pytest.raises(ValueError):
        [function](invalid_input)

def test_[function]_with_none_input():
    """Test None handling"""
    with pytest.raises(TypeError):
        [function](None)

# 4. STATE TESTS (if function modifies state)
def test_[function]_modifies_state_correctly():
    """Test side effects are correct"""
    obj = Object()
    obj.[function](param)
    assert obj.state == expected_state

# 5. INTEGRATION TESTS (if interacting with external systems)
@pytest.fixture
def mock_external_service():
    """Mock external dependencies"""
    return Mock(spec=ExternalService)

def test_[function]_with_external_service(mock_external_service):
    """Test integration points"""
    result = [function](mock_external_service)
    mock_external_service.method.assert_called_once()
    assert result == expected_from_integration
```

### Test Data Setup

```python
# conftest.py or test fixtures
@pytest.fixture
def sample_data():
    """Reusable test data"""
    return {
        "valid": [...],
        "edge_case": [...],
        "invalid": [...]
    }

@pytest.fixture(scope="session")
def database_session():
    """Database for integration tests"""
    db = create_test_db()
    yield db
    db.cleanup()
```

### TDD Cycle

1. **Red**: Write a failing test (defines what success looks like)
2. **Green**: Write minimal code to make the test pass
3. **Refactor**: Improve the code while keeping tests green
4. **Repeat**: Move to the next test case (one Red-Green pass is sketched below)
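To make the cycle concrete, here is a minimal sketch of one Red-Green pass. The `slugify` function and its tests are hypothetical illustrations, not part of the template:

```python
# Red: written first, this test fails because slugify does not exist yet
# (hypothetical example function, used only to illustrate the cycle).
def test_slugify_replaces_spaces_with_hyphens():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_surrounding_whitespace():
    assert slugify("  Hello World  ") == "hello-world"

# Green: the minimal implementation that makes both tests pass.
def slugify(text: str) -> str:
    return text.strip().lower().replace(" ", "-")

# Refactor comes next (e.g., handling punctuation or unicode),
# with the test suite guarding against regressions at every step.
```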
## EDA Scaffold

Use when exploring a new dataset. Follow a systematic plan to understand data quality and patterns.

### Quick Template

```python
# 1. DATA OVERVIEW
# Load and inspect
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_[format]('data.csv')

# Basic info
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.dtypes)
print(df.head())
print(df.info())
print(df.describe())

# 2. DATA QUALITY CHECKS
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(missing_pct[missing_pct > 0])

# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Data type consistency
print("Check: Are numeric columns actually numeric?")
print("Check: Are dates parsed correctly?")
print("Check: Are categorical variables encoded properly?")

# 3. UNIVARIATE ANALYSIS
# Numeric: mean, median, std, range, distribution plots, outliers (IQR method)
for col in df.select_dtypes(include=[np.number]).columns:
    print(f"{col}: mean={df[col].mean():.2f}, median={df[col].median():.2f}, std={df[col].std():.2f}")
    df[col].hist(bins=50); plt.title(f'{col} Distribution'); plt.show()
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    outliers = ((df[col] < (Q1 - 1.5*(Q3-Q1))) | (df[col] > (Q3 + 1.5*(Q3-Q1)))).sum()
    print(f"  Outliers: {outliers} ({outliers/len(df)*100:.1f}%)")

# Categorical: value counts, unique values, bar plots
for col in df.select_dtypes(include=['object', 'category']).columns:
    print(f"{col}: {df[col].nunique()} unique, most common={df[col].mode()[0]}")
    df[col].value_counts().head(10).plot(kind='bar'); plt.show()

# 4. BIVARIATE ANALYSIS
# Correlation heatmap, pairplots, categorical vs. numeric boxplots
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm'); plt.show()
sns.pairplot(df[['var1', 'var2', 'var3', 'target']], hue='target'); plt.show()
# For each categorical-numeric pair, create boxplots to compare distributions

# 5. INSIGHTS & NEXT STEPS
print("\n=== KEY FINDINGS ===")
print("1. Data quality: [summary]")
print("2. Distributions: [any skewness, outliers]")
print("3. Correlations: [strong relationships found]")
print("4. Missing patterns: [systematic missingness?]")
print("\n=== RECOMMENDED ACTIONS ===")
print("1. Handle missing data: [imputation strategy]")
print("2. Address outliers: [cap, remove, transform]")
print("3. Feature engineering: [ideas based on EDA]")
print("4. Data transformations: [log, standardize, encode]")
```

### EDA Checklist

- [ ] Load data and check shape/dtypes
- [ ] Assess missing values (how much, which variables, patterns? see the missingness sketch below)
- [ ] Check for duplicates
- [ ] Validate data types (numeric, categorical, dates)
- [ ] Univariate analysis (distributions, outliers, summary stats)
- [ ] Bivariate analysis (correlations, relationships with target)
- [ ] Identify data quality issues
- [ ] Document insights and recommended next steps
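One way to probe whether missingness is systematic rather than random is to compare other variables across missing and non-missing rows. A minimal sketch using a hypothetical toy frame (the columns `value` and `other` are illustrative, not from the template):

```python
import pandas as pd

# Hypothetical example: 'value' has gaps, 'other' is numeric.
df = pd.DataFrame({
    "value": [1.0, None, 3.0, None, 5.0, 6.0],
    "other": [10,   50,  12,   55,  11,  13],
})

# Flag rows where 'value' is missing, then compare 'other' across the flag.
is_missing = df["value"].isnull()
print(df.groupby(is_missing)["other"].describe())

# If the two groups differ sharply (here 'other' is much larger where
# 'value' is missing), missingness is likely systematic, and naive
# imputation may bias downstream analysis.
```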
## Statistical Analysis Scaffold

Use for hypothesis testing, A/B tests, and comparing groups.

### Quick Template

```python
# STATISTICAL ANALYSIS SCAFFOLD
import numpy as np
import pandas as pd
from scipy import stats

# 1. DEFINE RESEARCH QUESTION
question = "Does treatment X improve outcome Y?"

# 2. STATE HYPOTHESES
H0 = "Treatment X has no effect on outcome Y (null hypothesis)"
H1 = "Treatment X improves outcome Y (alternative hypothesis)"

# 3. SET SIGNIFICANCE LEVEL
alpha = 0.05  # 5% significance level (Type I error rate)
power = 0.80  # 80% power (1 - Type II error rate)

# 4. CHECK ASSUMPTIONS (t-test: independence, normality, equal variance)
_, p_norm = stats.shapiro(treatment_group)  # Normality test (repeat for control_group)
_, p_var = stats.levene(treatment_group, control_group)  # Equal variance test
print(f"Normality: p={p_norm:.3f} {'✓' if p_norm > 0.05 else '✗ use non-parametric'}")
print(f"Equal variance: p={p_var:.3f} {'✓' if p_var > 0.05 else '✗ use Welch t-test'}")

# 5. PERFORM STATISTICAL TEST
# Choose the appropriate test based on data type and assumptions

# For continuous outcome, 2 groups:
statistic, p_value = stats.ttest_ind(treatment_group, control_group)
print(f"t-statistic: {statistic:.3f}, p-value: {p_value:.4f}")

# For categorical outcome, 2 groups:
contingency_table = pd.crosstab(df['group'], df['outcome'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")

# 6. INTERPRET RESULTS & EFFECT SIZE
if p_value < alpha:
    n1, n2 = len(treatment_group), len(control_group)
    pooled_std = np.sqrt(((n1 - 1) * treatment_group.std(ddof=1)**2
                          + (n2 - 1) * control_group.std(ddof=1)**2) / (n1 + n2 - 2))
    cohen_d = (treatment_group.mean() - control_group.mean()) / pooled_std
    effect = ("Negligible" if abs(cohen_d) < 0.2 else
              "Small" if abs(cohen_d) < 0.5 else
              "Medium" if abs(cohen_d) < 0.8 else "Large")
    print(f"REJECT H0 (p={p_value:.4f}). Effect size (Cohen's d)={cohen_d:.3f} ({effect})")
else:
    print(f"FAIL TO REJECT H0 (p={p_value:.4f}). Insufficient evidence for effect.")

# 7. CONFIDENCE INTERVAL & SENSITIVITY
ci_95 = stats.t.interval(0.95, len(treatment_group)-1,
                         loc=treatment_group.mean(),
                         scale=stats.sem(treatment_group))
print(f"95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
print("Sensitivity: Check without outliers, with a non-parametric test, with confounders")
```
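The template sets `alpha` and `power` but never uses them; before collecting data, you can turn those choices into a required sample size. A minimal sketch with statsmodels, assuming a planning effect size of d = 0.5 (the effect size is an assumption you must supply, not something measured here):

```python
# Sample-size estimate for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,            # assumed standardized effect (Cohen's d)
    alpha=0.05,                 # Type I error rate
    power=0.80,                 # 1 - Type II error rate
    alternative="two-sided",
)
print(f"Required n per group: {n_per_group:.0f}")  # ~64 per group
```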
### Statistical Test Selection

| Data Type | # Groups | Test |
|-----------|----------|------|
| Continuous | 2 | t-test (or Welch's if unequal variance) |
| Continuous | 3+ | ANOVA (or Kruskal-Wallis if non-normal) |
| Categorical | 2 | Chi-square or Fisher's exact |
| Ordinal | 2 | Mann-Whitney U |
| Paired/Repeated | 2 | Paired t-test or Wilcoxon signed-rank |

## Validation Scaffold

Use for validating data quality, code quality, or model quality before shipping.

### Data Validation Template

```python
# DATA VALIDATION CHECKLIST

# 1. SCHEMA VALIDATION
expected_columns = ['id', 'timestamp', 'value', 'category']
assert set(df.columns) == set(expected_columns), "Column mismatch"

expected_dtypes = {'id': 'int64', 'timestamp': 'datetime64[ns]',
                   'value': 'float64', 'category': 'object'}
for col, dtype in expected_dtypes.items():
    assert str(df[col].dtype) == dtype, f"{col} type mismatch: expected {dtype}, got {df[col].dtype}"

# 2. RANGE VALIDATION
assert df['value'].min() >= 0, "Negative values found (should be >= 0)"
assert df['value'].max() <= 100, "Values exceed maximum (should be <= 100)"

# 3. UNIQUENESS VALIDATION
assert df['id'].is_unique, "Duplicate IDs found"

# 4. COMPLETENESS VALIDATION
required_fields = ['id', 'value']
for field in required_fields:
    missing_pct = df[field].isnull().mean() * 100
    assert missing_pct == 0, f"{field} has {missing_pct:.1f}% missing (required field)"

# 5. CONSISTENCY VALIDATION (if date-range columns exist)
assert (df['start_date'] <= df['end_date']).all(), "start_date after end_date found"

# 6. REFERENTIAL INTEGRITY
valid_categories = ['A', 'B', 'C']
assert df['category'].isin(valid_categories).all(), "Invalid categories found"

print("✓ All data validations passed")
```
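Bare `assert`s stop at the first violation; for a full data-quality report, it can help to collect every failure before raising. A minimal sketch of that pattern (the `checks` list and its rules are illustrative, matching the hypothetical columns above):

```python
import pandas as pd

def run_validations(df: pd.DataFrame) -> list[str]:
    """Run all checks and return the list of failures instead of
    stopping at the first violation."""
    # Each entry: (description, callable returning True when the check passes).
    checks = [
        ("ids are unique",         lambda: df["id"].is_unique),
        ("no missing values",      lambda: df["value"].notnull().all()),
        ("values within [0, 100]", lambda: df["value"].between(0, 100).all()),
    ]
    return [name for name, check in checks if not check()]

# Usage: raise once, with the complete picture.
# failures = run_validations(df)
# if failures:
#     raise ValueError(f"Data validation failed: {failures}")
```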
### Code Validation Checklist

- [ ] **Unit tests**: All functions have tests covering happy path, edge cases, and errors
- [ ] **Integration tests**: APIs and database interactions tested end-to-end
- [ ] **Test coverage**: ≥80% coverage for critical paths
- [ ] **Error handling**: All exceptions caught and handled gracefully
- [ ] **Input validation**: All user inputs validated before processing
- [ ] **Logging**: Key operations logged for debugging
- [ ] **Documentation**: Functions have docstrings, README updated
- [ ] **Performance**: No obvious performance bottlenecks (profiled if needed)
- [ ] **Security**: No hardcoded secrets, SQL injection protected, XSS prevented

### Model Validation Checklist

- [ ] **Train/val/test split**: Data split before any preprocessing (no data leakage)
- [ ] **Baseline model**: Simple baseline implemented for comparison
- [ ] **Cross-validation**: k-fold CV performed (k ≥ 5)
- [ ] **Metrics**: Appropriate metrics chosen (accuracy, precision/recall, AUC, RMSE, etc.)
- [ ] **Overfitting check**: Training vs. validation performance compared
- [ ] **Error analysis**: Failure modes analyzed, edge cases identified
- [ ] **Fairness**: Model checked for bias across sensitive groups
- [ ] **Interpretability**: Feature importance or SHAP values computed
- [ ] **Robustness**: Model tested with perturbed inputs
- [ ] **Monitoring**: Drift detection and performance tracking in place

## Quality Checklist

Before delivering, verify:

**Scaffold Structure:**
- [ ] Clear step-by-step process defined
- [ ] Each step has concrete actions (not vague advice)
- [ ] Validation checkpoints included
- [ ] Expected outputs specified

**Completeness:**
- [ ] Covers all requirements from the user's task
- [ ] Includes example code/pseudocode where helpful
- [ ] Anticipates edge cases and error conditions
- [ ] Provides decision guidance (when to use which approach)

**Clarity:**
- [ ] Assumptions stated explicitly
- [ ] Technical terms defined or illustrated
- [ ] Success criteria clear
- [ ] Next steps obvious

**Actionability:**
- [ ] User can execute scaffold without further guidance
- [ ] Code snippets are runnable (or nearly runnable)
- [ ] Gaps surfaced early (missing data, unclear requirements)
- [ ] Includes validation/quality checks

**Rubric Score:**
- [ ] Self-assessed with rubric ≥ 3.5 average