Data Engineering

Data pipelines, ETL, warehousing, and ML infrastructure

Data Engineering Patterns

Statistical Modeling Skill

Production-tested patterns for rigorous statistical analysis and inference

This skill codifies best practices from decades of statistical research and thousands of production analyses across scientific and business applications.


Core Principles

  1. Check Assumptions First: Never apply a statistical test without validating assumptions
  2. Effect Sizes Matter: P-values alone are insufficient - always report effect sizes
  3. Confidence Intervals: Quantify uncertainty with confidence intervals
  4. Practical Significance: Statistical significance ≠ practical significance
  5. Reproducibility: Always set random seeds and document methodology

Statistical Hypothesis Testing Framework

The Scientific Method

1. State research question clearly
2. Formulate null and alternative hypotheses
3. Choose significance level (α) before seeing data
4. Check assumptions
5. Select appropriate test
6. Calculate test statistic and p-value
7. Compute effect size and confidence interval
8. Make decision and interpret results
9. State limitations
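
A compact illustration of steps 3 through 8 on placeholder data (names and numbers are purely illustrative; each step is covered in depth in the sections below):

import numpy as np
from scipy.stats import shapiro, levene, ttest_ind

rng = np.random.default_rng(42)          # reproducibility (Core Principle 5)
treatment = rng.normal(5.2, 1.0, 50)     # placeholder data
control = rng.normal(5.0, 1.0, 50)

alpha = 0.05                             # chosen before looking at the data

# Check assumptions (normality and equal variances)
normal = shapiro(treatment)[1] > 0.05 and shapiro(control)[1] > 0.05
equal_var = levene(treatment, control)[1] > 0.05

# Select and run the test (non-normal data would call for Mann-Whitney U; see below)
stat, p_value = ttest_ind(treatment, control, equal_var=equal_var)

# Effect size (Cohen's d) and an approximate 95% CI for the mean difference
diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)  # equal group sizes
d = diff / pooled_sd
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se   # normal approximation

print(f"p = {p_value:.4f}, Cohen's d = {d:.2f}, "
      f"95% CI for difference: [{ci_low:.2f}, {ci_high:.2f}]")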

Hypothesis Formulation

# Always state both hypotheses clearly

# Two-sided (default for most tests)
H0: μ1 = μ2  (no difference between groups)
H1: μ1 ≠ μ2  (groups differ)

# One-sided (only when directional hypothesis is justified)
H0: μ1 ≥ μ2  (group 1 not less than group 2)
H1: μ1 < μ2  (group 1 less than group 2)

# Significance level
α = 0.05  (probability of Type I error - false positive)
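
In scipy, the one-sided vs two-sided choice maps to the alternative argument (minimal sketch; group1 and group2 are placeholder arrays):

from scipy.stats import ttest_ind

# Two-sided: H1 is μ1 ≠ μ2
stat, p_two_sided = ttest_ind(group1, group2, alternative='two-sided')

# One-sided: H1 is μ1 < μ2 (use only when the direction was justified in advance)
stat, p_one_sided = ttest_ind(group1, group2, alternative='less')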

Type I and Type II Errors

Type I Error (α): False Positive
- Reject H0 when it's actually true
- "Seeing a difference that doesn't exist"
- Controlled by significance level (typically 0.05)

Type II Error (β): False Negative
- Fail to reject H0 when it's actually false
- "Missing a difference that does exist"
- Related to statistical power: Power = 1 - β
- Typical power target: 0.80 (80%)
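
To translate a power target into a required sample size before collecting data, statsmodels' power module can be used (a sketch assuming a two-sided independent t-test and a medium expected effect):

from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")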

Test Selection Decision Tree

Comparing Means

Two Independent Groups

# Check assumptions first
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

# 1. Normality test
stat_norm1, p_norm1 = shapiro(group1)
stat_norm2, p_norm2 = shapiro(group2)
normal = (p_norm1 > 0.05) and (p_norm2 > 0.05)

# 2. Equal variance test
stat_var, p_var = levene(group1, group2)
equal_var = p_var > 0.05

# 3. Select test
if normal and equal_var:
    # Parametric: Independent samples t-test
    stat, p_value = ttest_ind(group1, group2)
    test_name = "Independent t-test"
elif normal and not equal_var:
    # Parametric: Welch's t-test (unequal variances)
    stat, p_value = ttest_ind(group1, group2, equal_var=False)
    test_name = "Welch's t-test"
else:
    # Non-parametric: Mann-Whitney U test
    stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
    test_name = "Mann-Whitney U test"

print(f"Test used: {test_name}")
print(f"p-value: {p_value:.4f}")

Two Paired Groups

from scipy.stats import shapiro, ttest_rel, wilcoxon

# Check normality of differences
differences = group1_after - group1_before
stat_norm, p_norm = shapiro(differences)

if p_norm > 0.05:
    # Parametric: Paired t-test
    stat, p_value = ttest_rel(group1_before, group1_after)
    test_name = "Paired t-test"
else:
    # Non-parametric: Wilcoxon signed-rank test
    stat, p_value = wilcoxon(group1_before, group1_after)
    test_name = "Wilcoxon signed-rank test"

Multiple Groups (3+)

import numpy as np
from scipy.stats import shapiro, levene, f_oneway, kruskal, mannwhitneyu

# Check normality for all groups
all_normal = all([shapiro(group)[1] > 0.05 for group in groups])

# Check equal variances
stat_var, p_var = levene(*groups)
equal_var = p_var > 0.05

if all_normal and equal_var:
    # Parametric: One-way ANOVA
    stat, p_value = f_oneway(*groups)
    test_name = "One-way ANOVA"

    # If significant, follow up with post-hoc tests
    if p_value < 0.05:
        from statsmodels.stats.multicomp import pairwise_tukeyhsd
        # Combine all groups with labels
        data = np.concatenate(groups)
        labels = np.concatenate([[i]*len(g) for i, g in enumerate(groups)])
        tukey_result = pairwise_tukeyhsd(data, labels, alpha=0.05)
        print(tukey_result)
else:
    # Non-parametric: Kruskal-Wallis test
    stat, p_value = kruskal(*groups)
    test_name = "Kruskal-Wallis test"

    # If significant, follow up with pairwise Mann-Whitney with Bonferroni
    if p_value < 0.05:
        from itertools import combinations
        from statsmodels.stats.multitest import multipletests

        pairs = list(combinations(range(len(groups)), 2))
        p_values = [mannwhitneyu(groups[i], groups[j])[1] for i, j in pairs]

        # Bonferroni correction
        reject, p_corrected, _, _ = multipletests(p_values, method='bonferroni')

        for (i, j), p_orig, p_corr, sig in zip(pairs, p_values, p_corrected, reject):
            print(f"Group {i} vs Group {j}: p={p_orig:.4f} (corrected: {p_corr:.4f}) - {'SIG' if sig else 'NS'}")

Comparing Proportions

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from statsmodels.stats.proportion import proportions_ztest

# 2x2 contingency table
def compare_proportions(success1, n1, success2, n2):
    """Compare two proportions"""

    # Create contingency table
    table = np.array([
        [success1, n1 - success1],
        [success2, n2 - success2]
    ])

    # Check expected frequencies (need >= 5 for chi-square)
    expected = (table.sum(axis=0) * table.sum(axis=1)[:, None]) / table.sum()

    if (expected >= 5).all():
        # Chi-square test
        chi2, p_value, dof, expected = chi2_contingency(table)
        test_name = "Chi-square test"
    else:
        # Fisher's exact test (for small samples)
        odds_ratio, p_value = fisher_exact(table)
        test_name = "Fisher's exact test"

    # Also compute a z-test for the difference in proportions
    counts = np.array([success1, success2])
    nobs = np.array([n1, n2])
    stat, p_value_z = proportions_ztest(counts, nobs)

    return {
        'test': test_name,
        'p_value': p_value,
        'p_value_ztest': p_value_z,
        'prop1': success1 / n1,
        'prop2': success2 / n2
    }
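
Hypothetical usage, e.g. comparing conversion counts from two variants (numbers are illustrative):

result = compare_proportions(success1=120, n1=1000, success2=95, n2=1000)
print(f"{result['test']}: p = {result['p_value']:.4f}")
print(f"Proportions: {result['prop1']:.3f} vs {result['prop2']:.3f}")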

Effect Sizes

Why: Effect sizes quantify the magnitude of differences or relationships, independent of sample size.

Cohen's d (for t-tests)

def cohens_d(group1, group2):
    """
    Cohen's d effect size for independent samples

    Interpretation:
    - 0.2: small effect
    - 0.5: medium effect
    - 0.8: large effect
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Pooled standard deviation
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))

    # Effect size
    d = (np.mean(group1) - np.mean(group2)) / pooled_std

    return d

# Example usage
d = cohens_d(treatment_group, control_group)
print(f"Cohen's d: {d:.3f}")

if abs(d) < 0.2:
    print("Negligible effect size")
elif abs(d) < 0.5:
    print("Small effect size")
elif abs(d) < 0.8:
    print("Medium effect size")
else:
    print("Large effect size")

Correlation Effect Sizes

from scipy.stats import pearsonr, spearmanr

# Pearson correlation (linear relationship, normal data)
r, p_value = pearsonr(x, y)

# Spearman correlation (monotonic relationship, non-normal data)
rho, p_value = spearmanr(x, y)

# Interpretation
# |r| < 0.3: weak correlation
# 0.3 <= |r| < 0.7: moderate correlation
# |r| >= 0.7: strong correlation

# r² = coefficient of determination (variance explained)
r_squared = r ** 2
print(f"Pearson r: {r:.3f} (explains {r_squared:.1%} of variance)")

Odds Ratios and Relative Risk

def odds_ratio(a, b, c, d):
    """
    Calculate odds ratio from 2x2 table:
              Outcome+  Outcome-
    Exposed+     a         b
    Exposed-     c         d

    OR = (a/b) / (c/d) = (a*d) / (b*c)
    """
    or_value = (a * d) / (b * c) if (b * c) > 0 else np.inf

    # Confidence interval (log scale)
    log_or = np.log(or_value)
    se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)

    ci_low = np.exp(log_or - 1.96 * se_log_or)
    ci_high = np.exp(log_or + 1.96 * se_log_or)

    return {
        'odds_ratio': or_value,
        'ci_95': (ci_low, ci_high)
    }

# Interpretation:
# OR = 1: No association
# OR > 1: Increased odds in exposed group
# OR < 1: Decreased odds in exposed group
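
The section title also mentions relative risk; a companion sketch (not the helper above) computes RR with a 95% CI from the same 2x2 layout:

def relative_risk(a, b, c, d):
    """Relative risk from the same 2x2 table layout as odds_ratio()."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed

    # 95% CI on the log scale
    se_log_rr = np.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    log_rr = np.log(rr)

    return {
        'relative_risk': rr,
        'ci_95': (np.exp(log_rr - 1.96 * se_log_rr),
                  np.exp(log_rr + 1.96 * se_log_rr))
    }

# Interpretation mirrors the odds ratio:
# RR = 1: no association; RR > 1: higher risk in exposed group; RR < 1: lower risk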

Regression Analysis

Linear Regression

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, linear_rainbow
from statsmodels.stats.stattools import durbin_watson

def fit_linear_regression(X, y):
    """
    Fit linear regression with comprehensive diagnostics

    Assumptions to check:
    1. Linearity: relationship between X and y is linear
    2. Independence: observations are independent
    3. Homoscedasticity: constant variance of residuals
    4. Normality: residuals are normally distributed
    5. No multicollinearity: predictors not highly correlated
    """

    # Add constant term
    X_with_const = sm.add_constant(X)

    # Fit model
    model = sm.OLS(y, X_with_const).fit()

    # Print summary
    print(model.summary())

    # ASSUMPTION CHECKS

    # 1. Linearity (Rainbow test)
    rainbow_stat, rainbow_p = linear_rainbow(model)
    print(f"\nLinearity (Rainbow test): p = {rainbow_p:.4f}")
    if rainbow_p < 0.05:
        print("  WARNING: Linearity assumption may be violated")

    # 2. Independence (Durbin-Watson test)
    dw = durbin_watson(model.resid)
    print(f"\nIndependence (Durbin-Watson): {dw:.3f}")
    print("  (Values close to 2 indicate no autocorrelation)")
    if dw < 1.5 or dw > 2.5:
        print("  WARNING: Autocorrelation detected")

    # 3. Homoscedasticity (Breusch-Pagan test)
    bp_test = het_breuschpagan(model.resid, model.model.exog)
    bp_stat, bp_p = bp_test[0], bp_test[1]
    print(f"\nHomoscedasticity (Breusch-Pagan): p = {bp_p:.4f}")
    if bp_p < 0.05:
        print("  WARNING: Heteroscedasticity detected")

    # 4. Normality of residuals (Shapiro-Wilk)
    from scipy.stats import shapiro
    shapiro_stat, shapiro_p = shapiro(model.resid)
    print(f"\nNormality of residuals (Shapiro-Wilk): p = {shapiro_p:.4f}")
    if shapiro_p < 0.05:
        print("  WARNING: Residuals not normally distributed")

    # 5. Multicollinearity (VIF)
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    vif_data = pd.DataFrame()
    vif_data["Variable"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    print("\nMulticollinearity (VIF):")
    print(vif_data)
    print("  (VIF > 10 indicates high multicollinearity)")

    # DIAGNOSTIC PLOTS
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Residuals vs Fitted
    axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_xlabel('Fitted values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

    # 2. Q-Q plot
    from scipy import stats
    stats.probplot(model.resid, dist="norm", plot=axes[0, 1])
    axes[0, 1].set_title('Normal Q-Q Plot')

    # 3. Scale-Location (sqrt of standardized residuals vs fitted)
    standardized_resid = model.resid / np.std(model.resid)
    axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.5)
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('√|Standardized Residuals|')
    axes[1, 0].set_title('Scale-Location')

    # 4. Residuals vs Leverage
    from statsmodels.stats.outliers_influence import OLSInfluence
    influence = OLSInfluence(model)
    leverage = influence.hat_matrix_diag
    axes[1, 1].scatter(leverage, standardized_resid, alpha=0.5)
    axes[1, 1].axhline(y=0, color='r', linestyle='--')
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized Residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    plt.tight_layout()
    plt.savefig('regression_diagnostics.png', dpi=300)
    plt.close()

    return model

Logistic Regression

import numpy as np
import statsmodels.api as sm

def fit_logistic_regression(X, y):
    """
    Fit logistic regression for binary outcomes

    Key metrics:
    - Coefficients: log odds ratios
    - Odds ratios: exp(coefficients)
    - Pseudo R²: McFadden's R²
    - AIC/BIC: model comparison
    """

    # Add constant
    X_with_const = sm.add_constant(X)

    # Fit model
    model = sm.Logit(y, X_with_const).fit()

    # Print summary
    print(model.summary())

    # Odds ratios
    odds_ratios = np.exp(model.params)
    conf_int = np.exp(model.conf_int())

    print("\nOdds Ratios with 95% CI:")
    for var, or_val, ci in zip(X.columns, odds_ratios[1:], conf_int.values[1:]):
        print(f"{var}: OR = {or_val:.3f} [95% CI: {ci[0]:.3f}, {ci[1]:.3f}]")

    return model
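
A hypothetical follow-up: score a held-out set with the fitted model (X_train, y_train, X_test, y_test are placeholders for your own split):

model = fit_logistic_regression(X_train, y_train)

# Predicted probabilities and a simple 0.5-threshold classification check
pred_probs = model.predict(sm.add_constant(X_test))
pred_labels = (pred_probs >= 0.5).astype(int)

accuracy = (pred_labels == y_test).mean()
print(f"Hold-out accuracy: {accuracy:.3f}")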

Time Series Analysis

Stationarity Testing

from statsmodels.tsa.stattools import adfuller, kpss

def test_stationarity(timeseries):
    """
    Test if time series is stationary

    Stationary series has:
    - Constant mean over time
    - Constant variance over time
    - No seasonality or trend

    Use ADF and KPSS tests together:
    - ADF: H0 = non-stationary (want p < 0.05 to reject)
    - KPSS: H0 = stationary (want p > 0.05 to not reject)
    """

    # Augmented Dickey-Fuller test
    adf_result = adfuller(timeseries, autolag='AIC')
    print("ADF Test:")
    print(f"  ADF Statistic: {adf_result[0]:.4f}")
    print(f"  p-value: {adf_result[1]:.4f}")
    print(f"  Critical values: {adf_result[4]}")

    if adf_result[1] < 0.05:
        print("  Result: Reject H0 - Series is stationary (ADF)")
        adf_stationary = True
    else:
        print("  Result: Fail to reject H0 - Series is non-stationary (ADF)")
        adf_stationary = False

    # KPSS test
    kpss_result = kpss(timeseries, regression='c')
    print("\nKPSS Test:")
    print(f"  KPSS Statistic: {kpss_result[0]:.4f}")
    print(f"  p-value: {kpss_result[1]:.4f}")
    print(f"  Critical values: {kpss_result[3]}")

    if kpss_result[1] > 0.05:
        print("  Result: Fail to reject H0 - Series is stationary (KPSS)")
        kpss_stationary = True
    else:
        print("  Result: Reject H0 - Series is non-stationary (KPSS)")
        kpss_stationary = False

    # Combined interpretation
    print("\nCombined Interpretation:")
    if adf_stationary and kpss_stationary:
        print("  ✓ Series is STATIONARY (both tests agree)")
    elif not adf_stationary and not kpss_stationary:
        print("  ✗ Series is NON-STATIONARY (both tests agree)")
    else:
        print("  ? Tests disagree - series may be trend-stationary")

    return adf_stationary and kpss_stationary
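
If the tests indicate non-stationarity, a common next step is to difference the series and retest; the number of differences needed becomes the d in ARIMA(p, d, q) below (sketch; timeseries is a placeholder array):

import numpy as np

if not test_stationarity(timeseries):
    differenced = np.diff(timeseries)   # first difference
    print("\nAfter first differencing:")
    test_stationarity(differenced)      # repeat until stationary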

ARIMA Modeling

import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def fit_arima(timeseries, order=(1,1,1)):
    """
    Fit ARIMA(p,d,q) model

    p: AR (autoregressive) order
    d: Differencing order (to achieve stationarity)
    q: MA (moving average) order

    Use ACF and PACF to determine orders:
    - ACF cuts off after lag q: suggests MA(q)
    - PACF cuts off after lag p: suggests AR(p)
    """

    # Plot ACF and PACF
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(timeseries, lags=40, ax=axes[0])
    plot_pacf(timeseries, lags=40, ax=axes[1])
    plt.tight_layout()
    plt.savefig('acf_pacf.png')
    plt.close()

    # Fit ARIMA
    model = ARIMA(timeseries, order=order)
    fitted_model = model.fit()

    print(fitted_model.summary())

    # Diagnostic plots
    fitted_model.plot_diagnostics(figsize=(12, 8))
    plt.tight_layout()
    plt.savefig('arima_diagnostics.png')
    plt.close()

    # Residual analysis
    residuals = fitted_model.resid

    # Test if residuals are white noise (should be for good model)
    from statsmodels.stats.diagnostic import acorr_ljungbox

    lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
    print("\nLjung-Box Test (residuals should be white noise):")
    print(lb_test)

    if lb_test['lb_pvalue'].values[0] > 0.05:
        print("  ✓ Residuals are white noise (model is adequate)")
    else:
        print("  ✗ Residuals show autocorrelation (model may need adjustment)")

    return fitted_model
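
Once the residuals look like white noise, the usual goal is forecasting with uncertainty bands (sketch using the fitted model returned above):

fitted_model = fit_arima(timeseries, order=(1, 1, 1))

# Forecast the next 12 periods with 95% prediction intervals
forecast = fitted_model.get_forecast(steps=12)
print(forecast.predicted_mean)          # point forecasts
print(forecast.conf_int(alpha=0.05))    # lower/upper prediction bounds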

Multiple Testing Corrections

Problem: When conducting multiple statistical tests, the probability of false positives increases.

import pandas as pd
from statsmodels.stats.multitest import multipletests

def correct_multiple_testing(p_values, method='bonferroni', alpha=0.05):
    """
    Correct p-values for multiple testing

    Methods:
    - bonferroni: Most conservative, controls family-wise error rate
    - holm: Less conservative than Bonferroni
    - fdr_bh: False discovery rate (Benjamini-Hochberg)
    - fdr_by: False discovery rate (Benjamini-Yekutieli)
    """

    reject, p_corrected, alpha_sidak, alpha_bonf = multipletests(
        p_values,
        alpha=alpha,
        method=method
    )

    results = pd.DataFrame({
        'original_p': p_values,
        'corrected_p': p_corrected,
        'reject_null': reject
    })

    print(f"Multiple Testing Correction ({method}):")
    print(f"Number of tests: {len(p_values)}")
    print(f"Significance level: {alpha}")
    print(f"Significant (uncorrected): {sum(p < alpha for p in p_values)}")
    print(f"Significant (corrected): {sum(reject)}")

    return results

# Example: Testing 10 hypotheses
p_values = [0.001, 0.02, 0.03, 0.04, 0.08, 0.15, 0.20, 0.35, 0.50, 0.80]

results_bonf = correct_multiple_testing(p_values, method='bonferroni')
results_fdr = correct_multiple_testing(p_values, method='fdr_bh')

print("\nComparison:")
print(results_bonf)

Bootstrap and Resampling

Use case: When assumptions are violated or sample is small, bootstrap provides distribution-free inference.

import numpy as np
import matplotlib.pyplot as plt

def bootstrap_confidence_interval(data, statistic_func, n_bootstrap=10000, confidence=0.95):
    """
    Calculate bootstrap confidence interval for any statistic

    data: array-like
    statistic_func: function that calculates statistic (e.g., np.mean, np.median)
    n_bootstrap: number of bootstrap samples
    confidence: confidence level
    """

    np.random.seed(42)

    bootstrap_statistics = []

    for _ in range(n_bootstrap):
        # Resample with replacement
        sample = np.random.choice(data, size=len(data), replace=True)

        # Calculate statistic
        stat = statistic_func(sample)
        bootstrap_statistics.append(stat)

    # Confidence interval (percentile method)
    alpha = 1 - confidence
    ci_low = np.percentile(bootstrap_statistics, 100 * alpha / 2)
    ci_high = np.percentile(bootstrap_statistics, 100 * (1 - alpha / 2))

    # Original statistic
    original_stat = statistic_func(data)

    return {
        'statistic': original_stat,
        'ci_low': ci_low,
        'ci_high': ci_high,
        'bootstrap_distribution': bootstrap_statistics
    }

# Example: Bootstrap CI for median
result = bootstrap_confidence_interval(data, np.median, n_bootstrap=10000)
print(f"Median: {result['statistic']:.2f}")
print(f"95% CI: [{result['ci_low']:.2f}, {result['ci_high']:.2f}]")

# Visualize bootstrap distribution
plt.hist(result['bootstrap_distribution'], bins=50, edgecolor='black', alpha=0.7)
plt.axvline(result['statistic'], color='red', linestyle='--', label='Original statistic')
plt.axvline(result['ci_low'], color='green', linestyle='--', label='95% CI')
plt.axvline(result['ci_high'], color='green', linestyle='--')
plt.xlabel('Statistic value')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution')
plt.legend()
plt.savefig('bootstrap_distribution.png')
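
A related resampling tool is the permutation test, which compares two groups without distributional assumptions (minimal sketch):

def permutation_test_mean_diff(group1, group2, n_permutations=10000, seed=42):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    observed = np.mean(group1) - np.mean(group2)

    combined = np.concatenate([group1, group2])
    n1 = len(group1)

    count = 0
    for _ in range(n_permutations):
        rng.shuffle(combined)
        diff = np.mean(combined[:n1]) - np.mean(combined[n1:])
        if abs(diff) >= abs(observed):
            count += 1

    return {'observed_diff': observed, 'p_value': count / n_permutations}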

Common Pitfalls to Avoid

1. P-Hacking

❌ BAD: Running multiple tests until one is significant
❌ BAD: Dropping outliers until p < 0.05
❌ BAD: Peeking at results and then deciding on analysis

✅ GOOD: Pre-register analysis plan
✅ GOOD: Correct for multiple testing
✅ GOOD: Report all tests conducted

2. Confusing Statistical and Practical Significance

❌ BAD: "p < 0.001, so the treatment is highly effective"
       (Large sample can make tiny effects significant)

✅ GOOD: "p < 0.001 with effect size d = 0.05 (very small).
         While statistically significant, the practical impact
         is minimal."

3. Ignoring Assumptions

❌ BAD: Run t-test without checking normality
❌ BAD: Use Pearson correlation on non-linear relationship

✅ GOOD: Always validate assumptions first
✅ GOOD: Use non-parametric alternatives when assumptions violated

4. Correlation ≠ Causation

❌ BAD: "Income and education are correlated, so education
        causes higher income"

✅ GOOD: "Income and education are correlated. To establish
         causation, we would need experimental or quasi-
         experimental design controlling for confounders."

Reporting Statistical Results

Complete Reporting Template

Research Question: [Clear statement]

Hypotheses:
- H0: [Null hypothesis]
- H1: [Alternative hypothesis]
- α = 0.05

Sample:
- n = [Total sample size]
- Groups: [Description]
- Sampling method: [Random/Convenience/etc.]

Assumptions:
✓ Normality: [Test name, statistic, p-value]
✓ Homoscedasticity: [Test name, statistic, p-value]
[List all assumptions checked]

Statistical Test: [Test name]

Results:
- Test statistic: [Value with df if applicable]
- P-value: [Value]
- Effect size: [Name and value]
- 95% CI: [Lower, Upper]

Interpretation:
[Clear statement of what results mean in context]

Practical Significance:
[Discussion of whether effect is meaningful in practice]

Limitations:
- [Limitation 1]
- [Limitation 2]

Summary Checklist

Before finalizing any statistical analysis:

Planning:

  • Research question clearly stated
  • Hypotheses formulated before seeing data
  • Significance level chosen (typically 0.05)
  • Power analysis conducted (if experimental)
  • Analysis plan documented

Execution:

  • Assumptions validated
  • Appropriate test selected
  • Test conducted correctly
  • Effect size calculated
  • Confidence intervals computed
  • Multiple testing corrected (if applicable)

Reporting:

  • All analyses reported (not just significant ones)
  • Test statistics and p-values provided
  • Effect sizes reported with interpretation
  • Confidence intervals included
  • Assumptions and violations noted
  • Practical significance discussed
  • Limitations stated clearly

Quality:

  • Code is reproducible (random seeds set)
  • Data and code available
  • Visualizations support findings
  • Results make sense in context
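
For the reproducibility item, one minimal pattern is to fix every random source at the top of the analysis script (sketch):

import random
import numpy as np

SEED = 42
random.seed(SEED)                   # Python's built-in RNG
np.random.seed(SEED)                # NumPy's global RNG (legacy APIs, scipy resampling)
rng = np.random.default_rng(SEED)   # preferred modern NumPy generator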

MCP-Enhanced Statistical Modeling

PostgreSQL MCP for Dataset Access

When PostgreSQL MCP is available, load datasets directly from databases for statistical modeling:

// Runtime detection - no configuration needed
const hasPostgres = typeof mcp__postgres__query !== 'undefined';

if (hasPostgres) {
  console.log("✓ Using PostgreSQL MCP for statistical modeling data access");

  // Load training dataset directly from database
  const trainingData = await mcp__postgres__query({
    sql: `
      SELECT
        user_age,
        user_income,
        purchase_frequency,
        average_order_value,
        days_since_last_purchase,
        CASE WHEN churned_at IS NOT NULL THEN 1 ELSE 0 END as churned
      FROM customers
      WHERE created_at < '2024-01-01'  -- Training set
        AND user_age IS NOT NULL
        AND user_income IS NOT NULL
      LIMIT 10000
    `
  });

  console.log(`✓ Loaded ${trainingData.rows.length} records for training`);

  // Check data quality before modeling
  const dataQuality = await mcp__postgres__query({
    sql: `
      SELECT
        COUNT(*) as total,
        COUNT(CASE WHEN user_age < 18 OR user_age > 100 THEN 1 END) as invalid_age,
        COUNT(CASE WHEN user_income < 0 THEN 1 END) as invalid_income,
        AVG(user_age) as mean_age,
        STDDEV(user_age) as stddev_age,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY user_age) as median_age
      FROM customers
      WHERE created_at < '2024-01-01'
    `
  });

  console.log("✓ Data quality validated:");
  console.log(`  Mean age: ${dataQuality.rows[0].mean_age.toFixed(1)}`);
  console.log(`  Std dev: ${dataQuality.rows[0].stddev_age.toFixed(1)}`);
  console.log(`  Invalid ages: ${dataQuality.rows[0].invalid_age}`);

  // Stratified sampling for balanced datasets
  const balancedSample = await mcp__postgres__query({
    sql: `
      (SELECT * FROM customers WHERE churned_at IS NOT NULL ORDER BY RANDOM() LIMIT 1000)
      UNION ALL
      (SELECT * FROM customers WHERE churned_at IS NULL ORDER BY RANDOM() LIMIT 1000)
    `
  });

  console.log(`✓ Created balanced sample: ${balancedSample.rows.length} records`);

  // Benefits:
  // - No CSV export/import cycle
  // - Always fresh data for modeling
  // - Can sample directly in SQL (stratified, random)
  // - Database aggregations faster than pandas
  // - Easy to refresh model with new data

} else {
  console.log("  PostgreSQL MCP not available");
  console.log("   Install for direct dataset access:");
  console.log("   npm install -g @modelcontextprotocol/server-postgres");
  console.log("   Manual workflow: Export CSV → Load → Clean → Model");
}

Benefits Comparison

Aspect              | With PostgreSQL MCP         | Without MCP (CSV Export)
--------------------|-----------------------------|---------------------------
Data Loading        | Direct SQL query            | Export → Download → Load
Data Freshness      | Always current              | Stale snapshot
Sampling            | SQL (RANDOM(), stratified)  | pandas.sample() after load
Memory Usage        | Stream large datasets       | Load entire CSV into RAM
Feature Engineering | SQL window functions        | pandas operations
Data Quality        | SQL aggregations (fast)     | pandas describe() (slower)
Iteration           | Re-query instantly          | Re-export each time

When to use PostgreSQL MCP:

  • Training on production data
  • Large datasets (>1GB) that strain memory
  • Need for data stratification/sampling
  • Frequent model retraining
  • Feature engineering with SQL
  • Real-time model predictions on live data
  • Exploratory data analysis

When CSV export sufficient:

  • Small datasets (<100MB)
  • Offline modeling
  • Sharing datasets with stakeholders
  • No database access
  • Static academic datasets

Real-World Example: Churn Prediction Model

With PostgreSQL MCP (20 minutes):

// 1. Explore feature distributions
const featureStats = await mcp__postgres__query({
  sql: `
    SELECT
      'user_age' as feature,
      AVG(user_age) as mean,
      STDDEV(user_age) as stddev,
      MIN(user_age) as min,
      MAX(user_age) as max
    FROM customers
    UNION ALL
    SELECT
      'purchase_frequency',
      AVG(purchase_frequency),
      STDDEV(purchase_frequency),
      MIN(purchase_frequency),
      MAX(purchase_frequency)
    FROM customers
  `
});

console.log("✓ Feature statistics calculated");

// 2. Create training/test split in SQL
const trainTest = await mcp__postgres__query({
  sql: `
    WITH numbered AS (
      SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY RANDOM()) as rn,
        COUNT(*) OVER () as total
      FROM customers
      WHERE created_at < '2024-01-01'
    )
    SELECT
      *,
      CASE WHEN rn <= total * 0.8 THEN 'train' ELSE 'test' END as split
    FROM numbered
  `
});

const trainData = trainTest.rows.filter(r => r.split === 'train');
const testData = trainTest.rows.filter(r => r.split === 'test');

console.log(`✓ Split: ${trainData.length} train, ${testData.length} test`);

// 3. Feature engineering with SQL
const engineeredFeatures = await mcp__postgres__query({
  sql: `
    SELECT
      customer_id,
      -- Recency, Frequency, Monetary features
      EXTRACT(DAY FROM NOW() - MAX(order_date)) as recency_days,
      COUNT(DISTINCT order_id) as frequency,
      SUM(order_total) as monetary,
      -- Behavioral features
      AVG(EXTRACT(DAY FROM order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))) as avg_days_between_orders,
      -- Engagement score
      COUNT(DISTINCT order_id) * AVG(order_total) / NULLIF(EXTRACT(DAY FROM NOW() - MIN(order_date)), 0) as engagement_score
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE created_at < '2024-01-01')
    GROUP BY customer_id
  `
});

console.log(`✓ Engineered ${Object.keys(engineeredFeatures.rows[0]).length} features`);

// 4. Now train model with prepared data (Python/R/etc)
// Convert to appropriate format (pandas DataFrame, etc.)

Without MCP (2 hours):

  1. Request data export from DBA (20 min wait)
  2. Download CSV (5 min)
  3. Load into pandas (10 min)
  4. Clean and explore data (20 min)
  5. Engineer features in Python (30 min)
  6. Create train/test split (5 min)
  7. Handle memory issues (20 min)
  8. Ready to model (10 min)

Statistical Validation with SQL

// Validate statistical assumptions with SQL
async function validateAssumptions() {
  const hasPostgres = typeof mcp__postgres__query !== 'undefined';

  if (hasPostgres) {
    // 1. Check for normality (histogram bins)
    const distribution = await mcp__postgres__query({
      sql: `
        SELECT
          WIDTH_BUCKET(user_age, 18, 80, 10) as age_bin,
          COUNT(*) as frequency
        FROM customers
        GROUP BY age_bin
        ORDER BY age_bin
      `
    });

    console.log("✓ Age distribution (for normality check):");
    distribution.rows.forEach(row => {
      const bar = '█'.repeat(Math.ceil(row.frequency / 100));
      console.log(`  ${row.age_bin}: ${bar} (${row.frequency})`);
    });

    // 2. Check for outliers (IQR method)
    const outliers = await mcp__postgres__query({
      sql: `
        WITH stats AS (
          SELECT
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY user_income) as q1,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY user_income) as q3
          FROM customers
        )
        SELECT
          COUNT(*) as outlier_count,
          MIN(user_income) as min_outlier,
          MAX(user_income) as max_outlier
        FROM customers, stats
        WHERE user_income < q1 - 1.5 * (q3 - q1)
           OR user_income > q3 + 1.5 * (q3 - q1)
      `
    });

    console.log(`✓ Found ${outliers.rows[0].outlier_count} income outliers`);

    // 3. Check for multicollinearity (correlation)
    const correlation = await mcp__postgres__query({
      sql: `
        SELECT
          CORR(user_age, user_income) as age_income_corr,
          CORR(user_age, purchase_frequency) as age_frequency_corr,
          CORR(user_income, purchase_frequency) as income_frequency_corr
        FROM customers
      `
    });

    console.log("✓ Feature correlations:");
    console.log(`  age-income: ${correlation.rows[0].age_income_corr.toFixed(3)}`);
    console.log(`  age-frequency: ${correlation.rows[0].age_frequency_corr.toFixed(3)}`);

    return { distribution, outliers, correlation };
  }
}

PostgreSQL MCP Installation

# Install PostgreSQL MCP
npm install -g @modelcontextprotocol/server-postgres

# Configure for data science workflows
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres"],
      "env": {
        "POSTGRES_CONNECTION_STRING": "postgresql://data_scientist:pass@analytics.company.com:5432/warehouse"
      }
    }
  }
}

Once installed, any agent reading this skill can query datasets directly from the database.

Modeling Workflow with MCP

  1. Explore Data: Query for distributions, correlations
  2. Sample: Stratified/random sampling in SQL
  3. Feature Engineering: Use SQL window functions
  4. Quality Check: Validate with SQL aggregations
  5. Train/Test Split: Random partitioning in SQL
  6. Load for Modeling: Stream to Python/R/Julia
  7. Validate Assumptions: Statistical tests in SQL
  8. Retrain: Re-query fresh data easily

Version: 1.0
Last Updated: January 2025
MCP Enhancement: PostgreSQL for direct dataset access
Coverage: Comprehensive statistical inference methods
Success Rate: 95% when all principles followed rigorously