Code Data Analysis Scaffolds Template
Workflow
Copy this checklist and track your progress:
Code Data Analysis Scaffolds Progress:
- [ ] Step 1: Clarify task and objectives
- [ ] Step 2: Choose appropriate scaffold type
- [ ] Step 3: Generate scaffold structure
- [ ] Step 4: Validate scaffold completeness
- [ ] Step 5: Deliver scaffold and guide execution
Step 1: Clarify task - Ask context questions to understand task type, constraints, expected outcomes. See Context Questions.
Step 2: Choose scaffold - Select TDD, EDA, Statistical Analysis, or Validation based on task. See Scaffold Selection Guide.
Step 3: Generate structure - Use appropriate scaffold template. See TDD Scaffold, EDA Scaffold, Statistical Analysis Scaffold, or Validation Scaffold.
Step 4: Validate completeness - Check scaffold covers requirements, includes validation steps, makes assumptions explicit. See Quality Checklist.
Step 5: Deliver and guide - Present scaffold, highlight next steps, surface any gaps discovered. Execute if user wants help.
Context Questions
For all tasks:
- What are you trying to accomplish? (Specific outcome expected)
- What's the context? (Dataset characteristics, codebase state, existing work)
- Any constraints? (Time, tools, data limitations, performance requirements)
- What does success look like? (Acceptance criteria, quality bar)
For TDD tasks:
- What functionality needs tests? (Feature, bug fix, refactor)
- Existing test coverage? (None, partial, comprehensive)
- Test framework preference? (pytest, jest, junit, etc.)
- Integration vs unit tests? (Scope of testing)
For EDA tasks:
- What's the dataset? (Size, format, source)
- What questions are you trying to answer? (Exploratory vs. hypothesis-driven)
- Existing knowledge about data? (Schema, distributions, known issues)
- End goal? (Feature engineering, quality assessment, insights)
For Statistical/Modeling tasks:
- What's the research question? (Descriptive, predictive, causal)
- Available data? (Sample size, variables, treatment/control)
- Causal or predictive goal? (Understanding why vs. forecasting what)
- Significance level / acceptable error rate?
Scaffold Selection Guide
| User Says | Task Type | Scaffold to Use |
|---|---|---|
| "Write tests for..." | TDD | TDD Scaffold |
| "Explore this dataset..." | EDA | EDA Scaffold |
| "Analyze the effect of..." / "Does X cause Y?" | Causal Inference | See methodology.md |
| "Predict..." / "Classify..." / "Forecast..." | Predictive Modeling | See methodology.md |
| "Design an A/B test..." / "Compare groups..." | Statistical Analysis | Statistical Analysis Scaffold |
| "Validate..." / "Check quality..." | Validation | Validation Scaffold |
TDD Scaffold
Use when writing new code, refactoring, or fixing bugs. Write tests FIRST, then implement.
Quick Template
# Test file: test_[module].py
import pytest
from unittest.mock import Mock

from [module] import [function_to_test]

# 1. HAPPY PATH TESTS (expected usage)
def test_[function]_with_valid_input():
    """Test normal, expected behavior"""
    result = [function](valid_input)
    assert result == expected_output
    assert result.property == expected_value

# 2. EDGE CASE TESTS (boundary conditions)
def test_[function]_with_empty_input():
    """Test with empty/minimal input"""
    result = [function]([])
    assert result == expected_for_empty

def test_[function]_with_maximum_input():
    """Test with large/maximum input"""
    result = [function](large_input)
    assert result is not None

# 3. ERROR CONDITION TESTS (invalid input, expected failures)
def test_[function]_with_invalid_input():
    """Test proper error handling"""
    with pytest.raises(ValueError):
        [function](invalid_input)

def test_[function]_with_none_input():
    """Test None handling"""
    with pytest.raises(TypeError):
        [function](None)

# 4. STATE TESTS (if function modifies state)
def test_[function]_modifies_state_correctly():
    """Test side effects are correct"""
    obj = Object()
    obj.[function](param)
    assert obj.state == expected_state

# 5. INTEGRATION TESTS (if interacting with external systems)
@pytest.fixture
def mock_external_service():
    """Mock external dependencies"""
    return Mock(spec=ExternalService)

def test_[function]_with_external_service(mock_external_service):
    """Test integration points"""
    result = [function](mock_external_service)
    mock_external_service.method.assert_called_once()
    assert result == expected_from_integration
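When several inputs exercise the same behavior, a parametrized test keeps the list above from ballooning. A minimal sketch using pytest.mark.parametrize; count_items and its cases are hypothetical stand-ins for your own function under test:
import pytest

@pytest.mark.parametrize("raw, expected", [
    ("", 0),          # empty input
    ("a", 1),         # single item
    ("a,b,c", 3),     # typical input
])
def test_count_items_parametrized(raw, expected):
    """One test body covers many cases; each case is reported separately on failure."""
    assert count_items(raw) == expected  # count_items: hypothetical function under test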
Test Data Setup
# conftest.py or test fixtures
@pytest.fixture
def sample_data():
    """Reusable test data"""
    return {
        "valid": [...],
        "edge_case": [...],
        "invalid": [...]
    }

@pytest.fixture(scope="session")
def database_session():
    """Database for integration tests"""
    db = create_test_db()
    yield db
    db.cleanup()
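A test pulls in a fixture simply by naming it as an argument. A minimal usage sketch for the sample_data fixture above; process_records is a hypothetical function under test:
def test_process_records_with_valid_fixture_data(sample_data):
    """pytest injects sample_data by matching the parameter name to the fixture."""
    result = process_records(sample_data["valid"])  # process_records: hypothetical function under test
    assert result is not None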
TDD Cycle
- Red: Write failing test (defines what success looks like)
- Green: Write minimal code to make test pass
- Refactor: Improve code while keeping tests green
- Repeat: Next test case (see the sketch below)
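To make the cycle concrete, a minimal red-green sketch; slugify is a hypothetical function, not part of the template:
# RED: written first, fails because slugify does not exist yet
def test_slugify_replaces_spaces_with_hyphens():
    assert slugify("Hello World") == "hello-world"

# GREEN: the smallest implementation that makes the test pass
def slugify(text):
    return text.strip().lower().replace(" ", "-")

# REFACTOR: e.g., also strip punctuation, while this test stays green; then write the next failing test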
EDA Scaffold
Use when exploring a new dataset. Follow a systematic plan to understand data quality and patterns.
Quick Template
# 1. DATA OVERVIEW
# Load and inspect
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_[format]('data.csv')
# Basic info
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.dtypes)
print(df.head())
print(df.info())
print(df.describe())
# 2. DATA QUALITY CHECKS
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(missing_pct[missing_pct > 0])
# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Data types consistency
print("Check: Are numeric columns actually numeric?")
print("Check: Are dates parsed correctly?")
print("Check: Are categorical variables encoded properly?")
# 3. UNIVARIATE ANALYSIS
# Numeric: mean, median, std, range, distribution plots, outliers (IQR method)
for col in df.select_dtypes(include=[np.number]).columns:
    print(f"{col}: mean={df[col].mean():.2f}, median={df[col].median():.2f}, std={df[col].std():.2f}")
    df[col].hist(bins=50); plt.title(f'{col} Distribution'); plt.show()
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    outliers = ((df[col] < (Q1 - 1.5*(Q3-Q1))) | (df[col] > (Q3 + 1.5*(Q3-Q1)))).sum()
    print(f"  Outliers: {outliers} ({outliers/len(df)*100:.1f}%)")

# Categorical: value counts, unique values, bar plots
for col in df.select_dtypes(include=['object', 'category']).columns:
    print(f"{col}: {df[col].nunique()} unique, most common={df[col].mode()[0]}")
    df[col].value_counts().head(10).plot(kind='bar'); plt.show()
# 4. BIVARIATE ANALYSIS
# Correlation heatmap, pairplots, categorical vs numeric boxplots
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='coolwarm')
sns.pairplot(df[['var1', 'var2', 'var3', 'target']], hue='target'); plt.show()
# For each categorical-numeric pair, create boxplots to see distributions
# 5. INSIGHTS & NEXT STEPS
print("\n=== KEY FINDINGS ===")
print("1. Data quality: [summary]")
print("2. Distributions: [any skewness, outliers]")
print("3. Correlations: [strong relationships found]")
print("4. Missing patterns: [systematic missingness?]")
print("\n=== RECOMMENDED ACTIONS ===")
print("1. Handle missing data: [imputation strategy]")
print("2. Address outliers: [cap, remove, transform]")
print("3. Feature engineering: [ideas based on EDA]")
print("4. Data transformations: [log, standardize, encode]")
EDA Checklist
- Load data and check shape/dtypes
- Assess missing values (how much, which variables, patterns?)
- Check for duplicates
- Validate data types (numeric, categorical, dates)
- Univariate analysis (distributions, outliers, summary stats)
- Bivariate analysis (correlations, relationships with target)
- Identify data quality issues
- Document insights and recommended next steps (see the sketch below)
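If you run this checklist on many datasets, the first few checks are easy to wrap into a helper. A minimal sketch; the function name and output shape are illustrative, not part of the template:
import pandas as pd

def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missingness, and cardinality in one table."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isnull().mean() * 100,
        "n_unique": df.nunique(),
    }).sort_values("missing_pct", ascending=False)

# Usage: print(quality_summary(df).head(20))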
Statistical Analysis Scaffold
Use for hypothesis testing, A/B tests, and comparing groups.
Quick Template
# STATISTICAL ANALYSIS SCAFFOLD
# 1. DEFINE RESEARCH QUESTION
question = "Does treatment X improve outcome Y?"
# 2. STATE HYPOTHESES
H0 = "Treatment X has no effect on outcome Y (null hypothesis)"
H1 = "Treatment X improves outcome Y (alternative hypothesis)"
# 3. SET SIGNIFICANCE LEVEL
alpha = 0.05 # 5% significance level (Type I error rate)
power = 0.80 # 80% power (1 - Type II error rate)
# 4. CHECK ASSUMPTIONS (t-test: independence, normality, equal variance)
from scipy import stats
_, p_norm = stats.shapiro(treatment_group) # Normality test
_, p_var = stats.levene(treatment_group, control_group) # Equal variance test
print(f"Normality: p={p_norm:.3f} {'✓' if p_norm > 0.05 else '✗ use non-parametric'}")
print(f"Equal variance: p={p_var:.3f} {'✓' if p_var > 0.05 else '✗ use Welch t-test'}")
# 5. PERFORM STATISTICAL TEST
# Choose appropriate test based on data type and assumptions
# For continuous outcome, 2 groups:
statistic, p_value = stats.ttest_ind(treatment_group, control_group)
print(f"t-statistic: {statistic:.3f}, p-value: {p_value:.4f}")
# For categorical outcome, 2 groups:
import pandas as pd
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['group'], df['outcome'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square: {chi2:.3f}, p-value: {p_value:.4f}")

# 6. INTERPRET RESULTS & EFFECT SIZE
if p_value < alpha:
    # Pooled standard deviation for Cohen's d
    n1, n2 = len(treatment_group), len(control_group)
    pooled_std = (((n1 - 1) * treatment_group.std()**2 + (n2 - 1) * control_group.std()**2) / (n1 + n2 - 2)) ** 0.5
    cohen_d = (treatment_group.mean() - control_group.mean()) / pooled_std
    # Cohen's conventions: ~0.2 small, ~0.5 medium, ~0.8 large
    effect = "Small" if abs(cohen_d) < 0.5 else "Medium" if abs(cohen_d) < 0.8 else "Large"
    print(f"REJECT H0 (p={p_value:.4f}). Effect size (Cohen's d)={cohen_d:.3f} ({effect})")
else:
    print(f"FAIL TO REJECT H0 (p={p_value:.4f}). Insufficient evidence for effect.")

# 7. CONFIDENCE INTERVAL & SENSITIVITY
ci_95 = stats.t.interval(0.95, len(treatment_group)-1, loc=treatment_group.mean(), scale=stats.sem(treatment_group))
print(f"95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
print("Sensitivity: Check without outliers, with non-parametric test, with confounders")
Statistical Test Selection
| Data Type | # Groups | Test |
|---|---|---|
| Continuous | 2 | t-test (or Welch's if unequal variance) |
| Continuous | 3+ | ANOVA (or Kruskal-Wallis if non-normal) |
| Categorical | 2 | Chi-square or Fisher's exact |
| Ordinal | 2 | Mann-Whitney U |
| Paired/Repeated | 2 | Paired t-test or Wilcoxon signed-rank |
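For the two-group continuous row, the table can be turned into a small dispatch helper. A minimal sketch using scipy.stats; group_a and group_b are placeholders for your samples, and for large samples a Q-Q plot is often more informative than the Shapiro test used here:
from scipy import stats

def compare_two_groups(group_a, group_b, alpha=0.05):
    """Choose Student's t, Welch's t, or Mann-Whitney U from normality and variance checks."""
    normal = stats.shapiro(group_a).pvalue > alpha and stats.shapiro(group_b).pvalue > alpha
    if not normal:
        return "Mann-Whitney U", stats.mannwhitneyu(group_a, group_b)
    equal_var = stats.levene(group_a, group_b).pvalue > alpha
    name = "Student's t-test" if equal_var else "Welch's t-test"
    return name, stats.ttest_ind(group_a, group_b, equal_var=equal_var)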
Validation Scaffold
Use for validating data quality, code quality, or model quality before shipping.
Data Validation Template
# DATA VALIDATION CHECKLIST
# 1. SCHEMA VALIDATION
expected_columns = ['id', 'timestamp', 'value', 'category']
assert set(df.columns) == set(expected_columns), "Column mismatch"
expected_dtypes = {'id': 'int64', 'timestamp': 'datetime64[ns]', 'value': 'float64', 'category': 'object'}
for col, dtype in expected_dtypes.items():
    assert df[col].dtype == dtype, f"{col} type mismatch: expected {dtype}, got {df[col].dtype}"
# 2. RANGE VALIDATION
assert df['value'].min() >= 0, "Negative values found (should be >= 0)"
assert df['value'].max() <= 100, "Values exceed maximum (should be <= 100)"
# 3. UNIQUENESS VALIDATION
assert df['id'].is_unique, "Duplicate IDs found"
# 4. COMPLETENESS VALIDATION
required_fields = ['id', 'value']
for field in required_fields:
    missing_pct = df[field].isnull().mean() * 100
    assert missing_pct == 0, f"{field} has {missing_pct:.1f}% missing (required field)"
# 5. CONSISTENCY VALIDATION
assert (df['start_date'] <= df['end_date']).all(), "start_date after end_date found"
# 6. REFERENTIAL INTEGRITY
valid_categories = ['A', 'B', 'C']
assert df['category'].isin(valid_categories).all(), "Invalid categories found"
print("✓ All data validations passed")
Code Validation Checklist
- Unit tests: All functions have tests covering happy path, edge cases, errors
- Integration tests: APIs, database interactions tested end-to-end
- Test coverage: ≥80% coverage for critical paths
- Error handling: All exceptions caught and handled gracefully
- Input validation: All user inputs validated before processing (see the sketch after this checklist)
- Logging: Key operations logged for debugging
- Documentation: Functions have docstrings, README updated
- Performance: No obvious performance bottlenecks (profiled if needed)
- Security: No hardcoded secrets, SQL injection protected, XSS prevented
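For the input validation, error handling, and logging items, one compact pattern covers all three. A minimal sketch; parse_age and its bounds are illustrative only:
import logging

logger = logging.getLogger(__name__)

def parse_age(raw) -> int:
    """Validate user input before use; log and raise a clear error on bad input."""
    try:
        age = int(raw)
    except (TypeError, ValueError) as exc:
        logger.warning("Rejected age input %r: %s", raw, exc)
        raise ValueError(f"age must be an integer, got {raw!r}") from exc
    if not 0 <= age <= 130:
        raise ValueError(f"age out of range: {age}")
    return age
The coverage item is usually enforced in CI, for example with pytest-cov's --cov-fail-under option.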
Model Validation Checklist
- Train/val/test split: Data split before any preprocessing (no data leakage)
- Baseline model: Simple baseline implemented for comparison (see the sketch after this checklist)
- Cross-validation: k-fold CV performed (k≥5)
- Metrics: Appropriate metrics chosen (accuracy, precision/recall, AUC, RMSE, etc.)
- Overfitting check: Training vs validation performance compared
- Error analysis: Failure modes analyzed, edge cases identified
- Fairness: Model checked for bias across sensitive groups
- Interpretability: Feature importance or SHAP values computed
- Robustness: Model tested with perturbed inputs
- Monitoring: Drift detection and performance tracking in place
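A minimal sketch of the split, baseline, and cross-validation items using scikit-learn, assuming a classification task with features X and labels y already prepared:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Split before computing any preprocessing statistics (avoids leakage into the test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline: any real model should clearly beat this
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")

# 5-fold cross-validation on the training portion only
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")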
Quality Checklist
Before delivering, verify:
Scaffold Structure:
- Clear step-by-step process defined
- Each step has concrete actions (not vague advice)
- Validation checkpoints included
- Expected outputs specified
Completeness:
- Covers all requirements from user's task
- Includes example code/pseudocode where helpful
- Anticipates edge cases and error conditions
- Provides decision guidance (when to use which approach)
Clarity:
- Assumptions stated explicitly
- Technical terms defined or illustrated
- Success criteria clear
- Next steps obvious
Actionability:
- User can execute scaffold without further guidance
- Code snippets are runnable (or nearly runnable)
- Gaps surfaced early (missing data, unclear requirements)
- Includes validation/quality checks
Rubric Score:
- Self-assessed with rubric ≥ 3.5 average