# Statistical Modeling Skill
**Production-tested patterns for rigorous statistical analysis and inference**
This skill codifies best practices from decades of statistical research and thousands of production analyses across scientific and business applications.
---
## Core Principles
1. **Check Assumptions First**: Never apply a statistical test without validating assumptions
2. **Effect Sizes Matter**: P-values alone are insufficient - always report effect sizes
3. **Confidence Intervals**: Quantify uncertainty with confidence intervals
4. **Practical Significance**: Statistical significance ≠ practical significance
5. **Reproducibility**: Always set random seeds and document methodology
---
## Statistical Hypothesis Testing Framework
### The Scientific Method
```
1. State research question clearly
2. Formulate null and alternative hypotheses
3. Choose significance level (α) before seeing data
4. Check assumptions
5. Select appropriate test
6. Calculate test statistic and p-value
7. Compute effect size and confidence interval
8. Make decision and interpret results
9. State limitations
```
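As a compact illustration of this workflow end to end, here is a minimal sketch on simulated data (the group sizes, means, and spreads are made up for illustration); it walks through the assumption checks, test selection, effect size, and a 95% CI for the mean difference that later sections cover in detail.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                      # reproducibility (Principle 5)
control = rng.normal(loc=100, scale=15, size=80)     # simulated control group
treatment = rng.normal(loc=108, scale=15, size=80)   # simulated treatment group

# Steps 1-3: research question, hypotheses (H0: equal means), alpha fixed up front
alpha = 0.05

# Step 4: check assumptions (normality per group, equal variances)
normal = stats.shapiro(control).pvalue > 0.05 and stats.shapiro(treatment).pvalue > 0.05
equal_var = stats.levene(control, treatment).pvalue > 0.05

# Steps 5-6: select and run the test (Welch's correction when variances differ)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=equal_var)

# Step 7: effect size (Cohen's d) and a 95% CI for the mean difference
diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
d = diff / pooled_sd
se_diff = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

# Steps 8-9: decision, interpretation in context, limitations
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.2f}, 95% CI = [{ci[0]:.1f}, {ci[1]:.1f}]")
```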
### Hypothesis Formulation
```python
# Always state both hypotheses clearly
# Two-sided (default for most tests)
H0: μ1 = μ2 (no difference between groups)
H1: μ1 ≠ μ2 (groups differ)
# One-sided (only when a directional hypothesis is justified)
H0: μ1 ≥ μ2 (group 1 not less than group 2)
H1: μ1 < μ2 (group 1 less than group 2)
# Significance level
α = 0.05 (probability of Type I error - false positive)
```
### Type I and Type II Errors
```
Type I Error (α): False Positive
- Reject H0 when it's actually true
- "Seeing a difference that doesn't exist"
- Controlled by significance level (typically 0.05)
Type II Error (β): False Negative
- Fail to reject H0 when it's actually false
- "Missing a difference that does exist"
- Related to statistical power: Power = 1 - β
- Typical power target: 0.80 (80%)
```
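Power is best handled at the design stage. Below is a minimal sketch of a prospective power analysis with `statsmodels`, assuming an independent-samples t-test and a hypothesized effect size of d = 0.5 (both assumptions chosen for illustration).
```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect d = 0.5 at alpha = 0.05 with 80% power
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {n_per_group:.0f}")

# Conversely, the power achieved with 30 participants per group
achieved_power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"Power with n=30 per group: {achieved_power:.2f}")
```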
---
## Test Selection Decision Tree
### Comparing Means
#### Two Independent Groups
```python
# Check assumptions first
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu
# 1. Normality test
stat_norm1, p_norm1 = shapiro(group1)
stat_norm2, p_norm2 = shapiro(group2)
normal = (p_norm1 > 0.05) and (p_norm2 > 0.05)
# 2. Equal variance test
stat_var, p_var = levene(group1, group2)
equal_var = p_var > 0.05
# 3. Select test
if normal and equal_var:
    # Parametric: Independent samples t-test
    stat, p_value = ttest_ind(group1, group2)
    test_name = "Independent t-test"
elif normal and not equal_var:
    # Parametric: Welch's t-test (unequal variances)
    stat, p_value = ttest_ind(group1, group2, equal_var=False)
    test_name = "Welch's t-test"
else:
    # Non-parametric: Mann-Whitney U test
    stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
    test_name = "Mann-Whitney U test"

print(f"Test used: {test_name}")
print(f"p-value: {p_value:.4f}")
```
#### Two Paired Groups
```python
from scipy.stats import ttest_rel, wilcoxon, shapiro
# Check normality of differences
differences = group1_after - group1_before
stat_norm, p_norm = shapiro(differences)
if p_norm > 0.05:
    # Parametric: Paired t-test
    stat, p_value = ttest_rel(group1_before, group1_after)
    test_name = "Paired t-test"
else:
    # Non-parametric: Wilcoxon signed-rank test
    stat, p_value = wilcoxon(group1_before, group1_after)
    test_name = "Wilcoxon signed-rank test"
```
#### Multiple Groups (3+)
```python
import numpy as np
from scipy.stats import f_oneway, kruskal, shapiro, levene, mannwhitneyu
# Check normality for all groups
all_normal = all([shapiro(group)[1] > 0.05 for group in groups])
# Check equal variances
stat_var, p_var = levene(*groups)
equal_var = p_var > 0.05
if all_normal and equal_var:
    # Parametric: One-way ANOVA
    stat, p_value = f_oneway(*groups)
    test_name = "One-way ANOVA"
    # If significant, follow up with post-hoc tests
    if p_value < 0.05:
        from statsmodels.stats.multicomp import pairwise_tukeyhsd
        # Combine all groups with labels
        data = np.concatenate(groups)
        labels = np.concatenate([[i] * len(g) for i, g in enumerate(groups)])
        tukey_result = pairwise_tukeyhsd(data, labels, alpha=0.05)
        print(tukey_result)
else:
    # Non-parametric: Kruskal-Wallis test
    stat, p_value = kruskal(*groups)
    test_name = "Kruskal-Wallis test"
    # If significant, follow up with pairwise Mann-Whitney tests with Bonferroni correction
    if p_value < 0.05:
        from itertools import combinations
        from statsmodels.stats.multitest import multipletests
        pairs = list(combinations(range(len(groups)), 2))
        p_values = [mannwhitneyu(groups[i], groups[j])[1] for i, j in pairs]
        # Bonferroni correction
        reject, p_corrected, _, _ = multipletests(p_values, method='bonferroni')
        for (i, j), p_orig, p_corr, sig in zip(pairs, p_values, p_corrected, reject):
            print(f"Group {i} vs Group {j}: p={p_orig:.4f} (corrected: {p_corr:.4f}) - {'SIG' if sig else 'NS'}")
```
### Comparing Proportions
```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from statsmodels.stats.proportion import proportions_ztest
# 2x2 contingency table
def compare_proportions(success1, n1, success2, n2):
    """Compare two proportions"""
    # Create contingency table
    table = np.array([
        [success1, n1 - success1],
        [success2, n2 - success2]
    ])
    # Check expected frequencies (chi-square needs all >= 5)
    expected = (table.sum(axis=0) * table.sum(axis=1)[:, None]) / table.sum()
    if (expected >= 5).all():
        # Chi-square test
        chi2, p_value, dof, expected = chi2_contingency(table)
        test_name = "Chi-square test"
    else:
        # Fisher's exact test (for small samples)
        odds_ratio, p_value = fisher_exact(table)
        test_name = "Fisher's exact test"
    # Also compute the z-test for proportions (basis for a confidence interval)
    counts = np.array([success1, success2])
    nobs = np.array([n1, n2])
    stat, p_value_z = proportions_ztest(counts, nobs)
    return {
        'test': test_name,
        'p_value': p_value,
        'p_value_ztest': p_value_z,
        'prop1': success1 / n1,
        'prop2': success2 / n2
    }
```
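The z-test computed above is the natural basis for an interval estimate. Here is a small companion sketch for a Wald-style 95% CI on the difference in proportions (a normal approximation, so it assumes reasonably large samples; the counts in the usage line are made up):
```python
import numpy as np

def proportion_diff_ci(success1, n1, success2, n2, z=1.96):
    """95% Wald confidence interval for p1 - p2 (normal approximation)."""
    p1, p2 = success1 / n1, success2 / n2
    diff = p1 - p2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Illustrative counts: 45/100 vs 30/100
diff, ci = proportion_diff_ci(45, 100, 30, 100)
print(f"Difference: {diff:.3f}, 95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```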
---
## Effect Sizes
**Why**: Effect sizes quantify the magnitude of differences or relationships, independent of sample size.
### Cohen's d (for t-tests)
```python
import numpy as np

def cohens_d(group1, group2):
    """
    Cohen's d effect size for independent samples

    Interpretation (Cohen's conventions):
    - 0.2: small effect
    - 0.5: medium effect
    - 0.8: large effect
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    # Effect size
    d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return d

# Example usage
d = cohens_d(treatment_group, control_group)
print(f"Cohen's d: {d:.3f}")
if abs(d) < 0.2:
    print("Negligible effect size")
elif abs(d) < 0.5:
    print("Small effect size")
elif abs(d) < 0.8:
    print("Medium effect size")
else:
    print("Large effect size")
```
### Correlation Effect Sizes
```python
from scipy.stats import pearsonr, spearmanr
# Pearson correlation (linear relationship, normal data)
r, p_value = pearsonr(x, y)
# Spearman correlation (monotonic relationship, non-normal data)
rho, p_value = spearmanr(x, y)
# Interpretation
# |r| < 0.3: weak correlation
# 0.3 <= |r| < 0.7: moderate correlation
# |r| >= 0.7: strong correlation
# r² = coefficient of determination (variance explained)
r_squared = r ** 2
print(f"Pearson r: {r:.3f} (explains {r_squared:.1%} of variance)")
```
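To follow the principle of reporting confidence intervals, Pearson's r can be given an interval via the Fisher z-transformation. A brief sketch (assumes approximate bivariate normality and n > 3; the r and n in the example are illustrative):
```python
import numpy as np
from scipy.stats import norm

def pearson_r_ci(r, n, confidence=0.95):
    """Confidence interval for Pearson's r via the Fisher z-transformation."""
    z = np.arctanh(r)                        # Fisher transform of r
    se = 1 / np.sqrt(n - 3)                  # standard error on the z scale
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    lo, hi = z - z_crit * se, z + z_crit * se
    return np.tanh(lo), np.tanh(hi)          # back-transform to the r scale

ci_low, ci_high = pearson_r_ci(r=0.45, n=120)
print(f"r = 0.45, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```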
### Odds Ratios and Relative Risk
```python
import numpy as np

def odds_ratio(a, b, c, d):
    """
    Calculate odds ratio from a 2x2 table:
                 Outcome+   Outcome-
    Exposed+        a          b
    Exposed-        c          d

    OR = (a/b) / (c/d) = (a*d) / (b*c)
    """
    or_value = (a * d) / (b * c) if (b * c) > 0 else np.inf
    # Confidence interval (computed on the log scale)
    log_or = np.log(or_value)
    se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)
    ci_low = np.exp(log_or - 1.96 * se_log_or)
    ci_high = np.exp(log_or + 1.96 * se_log_or)
    return {
        'odds_ratio': or_value,
        'ci_95': (ci_low, ci_high)
    }
# Interpretation:
# OR = 1: No association
# OR > 1: Increased odds in exposed group
# OR < 1: Decreased odds in exposed group
```
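The heading also mentions relative risk, which the block above does not compute. Here is a companion sketch for the risk ratio from the same 2x2 layout, with a log-scale CI analogous to the odds-ratio interval:
```python
import numpy as np

def relative_risk(a, b, c, d):
    """
    Risk ratio from the same 2x2 table:
    RR = [a / (a + b)] / [c / (c + d)]
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # Confidence interval on the log scale
    se_log_rr = np.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    ci_low = np.exp(np.log(rr) - 1.96 * se_log_rr)
    ci_high = np.exp(np.log(rr) + 1.96 * se_log_rr)
    return {'relative_risk': rr, 'ci_95': (ci_low, ci_high)}

# Note: RR ≈ OR only when the outcome is rare; prefer RR for cohort designs when available
```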
---
## Regression Analysis
### Linear Regression
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan, linear_rainbow
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence

def fit_linear_regression(X, y):
    """
    Fit linear regression with comprehensive diagnostics

    Assumptions to check:
    1. Linearity: relationship between X and y is linear
    2. Independence: observations are independent
    3. Homoscedasticity: constant variance of residuals
    4. Normality: residuals are normally distributed
    5. No multicollinearity: predictors not highly correlated
    """
    # Add constant term
    X_with_const = sm.add_constant(X)

    # Fit model
    model = sm.OLS(y, X_with_const).fit()
    print(model.summary())

    # ASSUMPTION CHECKS
    # 1. Linearity (Rainbow test)
    rainbow_stat, rainbow_p = linear_rainbow(model)
    print(f"\nLinearity (Rainbow test): p = {rainbow_p:.4f}")
    if rainbow_p < 0.05:
        print("  WARNING: Linearity assumption may be violated")

    # 2. Independence (Durbin-Watson test)
    dw = durbin_watson(model.resid)
    print(f"\nIndependence (Durbin-Watson): {dw:.3f}")
    print("  (Values close to 2 indicate no autocorrelation)")
    if dw < 1.5 or dw > 2.5:
        print("  WARNING: Autocorrelation detected")

    # 3. Homoscedasticity (Breusch-Pagan test)
    bp_stat, bp_p, _, _ = het_breuschpagan(model.resid, model.model.exog)
    print(f"\nHomoscedasticity (Breusch-Pagan): p = {bp_p:.4f}")
    if bp_p < 0.05:
        print("  WARNING: Heteroscedasticity detected")

    # 4. Normality of residuals (Shapiro-Wilk)
    shapiro_stat, shapiro_p = shapiro(model.resid)
    print(f"\nNormality of residuals (Shapiro-Wilk): p = {shapiro_p:.4f}")
    if shapiro_p < 0.05:
        print("  WARNING: Residuals not normally distributed")

    # 5. Multicollinearity (VIF)
    vif_data = pd.DataFrame()
    vif_data["Variable"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    print("\nMulticollinearity (VIF):")
    print(vif_data)
    print("  (VIF > 10 indicates high multicollinearity)")

    # DIAGNOSTIC PLOTS
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Residuals vs Fitted
    axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_xlabel('Fitted values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

    # 2. Q-Q plot
    stats.probplot(model.resid, dist="norm", plot=axes[0, 1])
    axes[0, 1].set_title('Normal Q-Q Plot')

    # 3. Scale-Location (sqrt of standardized residuals vs fitted)
    standardized_resid = model.resid / np.std(model.resid)
    axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.5)
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('√|Standardized Residuals|')
    axes[1, 0].set_title('Scale-Location')

    # 4. Residuals vs Leverage
    influence = OLSInfluence(model)
    leverage = influence.hat_matrix_diag
    axes[1, 1].scatter(leverage, standardized_resid, alpha=0.5)
    axes[1, 1].axhline(y=0, color='r', linestyle='--')
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized Residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    plt.tight_layout()
    plt.savefig('regression_diagnostics.png', dpi=300)
    plt.close()

    return model
```
### Logistic Regression
```python
import numpy as np
import statsmodels.api as sm

def fit_logistic_regression(X, y):
    """
    Fit logistic regression for binary outcomes

    Key metrics:
    - Coefficients: log odds ratios
    - Odds ratios: exp(coefficients)
    - Pseudo R²: McFadden's R²
    - AIC/BIC: model comparison
    """
    # Add constant
    X_with_const = sm.add_constant(X)

    # Fit model
    model = sm.Logit(y, X_with_const).fit()
    print(model.summary())

    # Odds ratios
    odds_ratios = np.exp(model.params)
    conf_int = np.exp(model.conf_int())
    print("\nOdds Ratios with 95% CI:")
    for var, or_val, ci in zip(X.columns, odds_ratios[1:], conf_int.values[1:]):
        print(f"{var}: OR = {or_val:.3f} [95% CI: {ci[0]:.3f}, {ci[1]:.3f}]")

    return model
```
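For the pseudo R² and AIC/BIC named in the docstring, the fitted statsmodels results object exposes them directly. A short sketch of a model comparison (the reduced-model column `user_age` is only an illustrative name):
```python
# Continuing from the function above
model = fit_logistic_regression(X, y)

# McFadden's pseudo R² and information criteria from the fitted results
print(f"McFadden's pseudo R²: {model.prsquared:.3f}")
print(f"AIC: {model.aic:.1f}, BIC: {model.bic:.1f}")

# Compare against a reduced model: lower AIC/BIC = better fit-complexity trade-off
model_reduced = sm.Logit(y, sm.add_constant(X[['user_age']])).fit()
print(f"Reduced model AIC: {model_reduced.aic:.1f} vs full model AIC: {model.aic:.1f}")
```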
---
## Time Series Analysis
### Stationarity Testing
```python
from statsmodels.tsa.stattools import adfuller, kpss

def test_stationarity(timeseries):
    """
    Test if time series is stationary

    Stationary series has:
    - Constant mean over time
    - Constant variance over time
    - No seasonality or trend

    Use ADF and KPSS tests together:
    - ADF: H0 = non-stationary (want p < 0.05 to reject)
    - KPSS: H0 = stationary (want p > 0.05 to not reject)
    """
    # Augmented Dickey-Fuller test
    adf_result = adfuller(timeseries, autolag='AIC')
    print("ADF Test:")
    print(f"  ADF Statistic: {adf_result[0]:.4f}")
    print(f"  p-value: {adf_result[1]:.4f}")
    print(f"  Critical values: {adf_result[4]}")
    if adf_result[1] < 0.05:
        print("  Result: Reject H0 - Series is stationary (ADF)")
        adf_stationary = True
    else:
        print("  Result: Fail to reject H0 - Series is non-stationary (ADF)")
        adf_stationary = False

    # KPSS test
    kpss_result = kpss(timeseries, regression='c')
    print("\nKPSS Test:")
    print(f"  KPSS Statistic: {kpss_result[0]:.4f}")
    print(f"  p-value: {kpss_result[1]:.4f}")
    print(f"  Critical values: {kpss_result[3]}")
    if kpss_result[1] > 0.05:
        print("  Result: Fail to reject H0 - Series is stationary (KPSS)")
        kpss_stationary = True
    else:
        print("  Result: Reject H0 - Series is non-stationary (KPSS)")
        kpss_stationary = False

    # Combined interpretation
    print("\nCombined Interpretation:")
    if adf_stationary and kpss_stationary:
        print("  ✓ Series is STATIONARY (both tests agree)")
    elif not adf_stationary and not kpss_stationary:
        print("  ✗ Series is NON-STATIONARY (both tests agree)")
    else:
        print("  ? Tests disagree - series may be trend-stationary")

    return adf_stationary and kpss_stationary
```
### ARIMA Modeling
```python
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox

def fit_arima(timeseries, order=(1, 1, 1)):
    """
    Fit ARIMA(p, d, q) model

    p: AR (autoregressive) order
    d: differencing order (to achieve stationarity)
    q: MA (moving average) order

    Use ACF and PACF to determine orders:
    - ACF cuts off after lag q: suggests MA(q)
    - PACF cuts off after lag p: suggests AR(p)
    """
    # Plot ACF and PACF
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(timeseries, lags=40, ax=axes[0])
    plot_pacf(timeseries, lags=40, ax=axes[1])
    plt.tight_layout()
    plt.savefig('acf_pacf.png')
    plt.close()

    # Fit ARIMA
    model = ARIMA(timeseries, order=order)
    fitted_model = model.fit()
    print(fitted_model.summary())

    # Diagnostic plots
    fitted_model.plot_diagnostics(figsize=(12, 8))
    plt.tight_layout()
    plt.savefig('arima_diagnostics.png')
    plt.close()

    # Residual analysis: residuals of an adequate model should be white noise
    residuals = fitted_model.resid
    lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
    print("\nLjung-Box Test (residuals should be white noise):")
    print(lb_test)
    if lb_test['lb_pvalue'].values[0] > 0.05:
        print("  ✓ Residuals are white noise (model is adequate)")
    else:
        print("  ✗ Residuals show autocorrelation (model may need adjustment)")

    return fitted_model
```
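Once the Ljung-Box check passes, the fitted model can produce forecasts with uncertainty bands. A minimal sketch using the statsmodels forecast API (the 12-step horizon is chosen arbitrarily):
```python
# Fit, then forecast the next 12 periods with 95% prediction intervals
fitted_model = fit_arima(timeseries, order=(1, 1, 1))
forecast = fitted_model.get_forecast(steps=12)
print(forecast.predicted_mean)       # point forecasts
print(forecast.conf_int(alpha=0.05)) # lower/upper 95% bounds per step
```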
---
## Multiple Testing Corrections
**Problem**: When conducting multiple statistical tests, the probability of at least one false positive rises with every additional test.
```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

def correct_multiple_testing(p_values, method='bonferroni', alpha=0.05):
    """
    Correct p-values for multiple testing

    Methods:
    - bonferroni: most conservative, controls family-wise error rate
    - holm: less conservative than Bonferroni
    - fdr_bh: false discovery rate (Benjamini-Hochberg)
    - fdr_by: false discovery rate (Benjamini-Yekutieli)
    """
    reject, p_corrected, alpha_sidak, alpha_bonf = multipletests(
        p_values,
        alpha=alpha,
        method=method
    )

    results = pd.DataFrame({
        'original_p': p_values,
        'corrected_p': p_corrected,
        'reject_null': reject
    })

    print(f"Multiple Testing Correction ({method}):")
    print(f"Number of tests: {len(p_values)}")
    print(f"Significance level: {alpha}")
    print(f"Significant (uncorrected): {sum(p < alpha for p in p_values)}")
    print(f"Significant (corrected): {sum(reject)}")
    return results

# Example: testing 10 hypotheses
p_values = [0.001, 0.02, 0.03, 0.04, 0.08, 0.15, 0.20, 0.35, 0.50, 0.80]
results_bonf = correct_multiple_testing(p_values, method='bonferroni')
results_fdr = correct_multiple_testing(p_values, method='fdr_bh')
print("\nComparison:")
print(results_bonf)
print(results_fdr)
```
---
## Bootstrap and Resampling
**Use case**: When assumptions are violated or the sample is small, the bootstrap provides distribution-free inference.
```python
import numpy as np
import matplotlib.pyplot as plt

def bootstrap_confidence_interval(data, statistic_func, n_bootstrap=10000, confidence=0.95):
    """
    Calculate a bootstrap confidence interval for any statistic

    data: array-like
    statistic_func: function that calculates the statistic (e.g., np.mean, np.median)
    n_bootstrap: number of bootstrap samples
    confidence: confidence level
    """
    np.random.seed(42)  # reproducibility
    bootstrap_statistics = []
    for _ in range(n_bootstrap):
        # Resample with replacement
        sample = np.random.choice(data, size=len(data), replace=True)
        # Calculate statistic on the resample
        bootstrap_statistics.append(statistic_func(sample))

    # Confidence interval (percentile method)
    alpha = 1 - confidence
    ci_low = np.percentile(bootstrap_statistics, 100 * alpha / 2)
    ci_high = np.percentile(bootstrap_statistics, 100 * (1 - alpha / 2))

    # Original statistic
    original_stat = statistic_func(data)

    return {
        'statistic': original_stat,
        'ci_low': ci_low,
        'ci_high': ci_high,
        'bootstrap_distribution': bootstrap_statistics
    }

# Example: bootstrap CI for the median
result = bootstrap_confidence_interval(data, np.median, n_bootstrap=10000)
print(f"Median: {result['statistic']:.2f}")
print(f"95% CI: [{result['ci_low']:.2f}, {result['ci_high']:.2f}]")

# Visualize bootstrap distribution
plt.hist(result['bootstrap_distribution'], bins=50, edgecolor='black', alpha=0.7)
plt.axvline(result['statistic'], color='red', linestyle='--', label='Original statistic')
plt.axvline(result['ci_low'], color='green', linestyle='--', label='95% CI')
plt.axvline(result['ci_high'], color='green', linestyle='--')
plt.xlabel('Statistic value')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution')
plt.legend()
plt.savefig('bootstrap_distribution.png')
plt.close()
```
---
## Common Pitfalls to Avoid
### 1. P-Hacking
```
❌ BAD: Running multiple tests until one is significant
❌ BAD: Dropping outliers until p < 0.05
❌ BAD: Peeking at results and then deciding on analysis
✅ GOOD: Pre-register analysis plan
✅ GOOD: Correct for multiple testing
✅ GOOD: Report all tests conducted
```
### 2. Confusing Statistical and Practical Significance
```
❌ BAD: "p < 0.001, so the treatment is highly effective"
(Large sample can make tiny effects significant)
✅ GOOD: "p < 0.001 with effect size d = 0.05 (very small).
While statistically significant, the practical impact
is minimal."
```
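A quick simulation makes the point concrete (synthetic data, illustrative only): with two hundred thousand observations per group, a true difference of half a point on a 15-point standard deviation is flagged with an extreme p-value even though the effect is negligible.
```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(loc=100.0, scale=15, size=200_000)   # very large samples
b = rng.normal(loc=100.5, scale=15, size=200_000)   # true difference of 0.5 (d ≈ 0.03)

t_stat, p_value = ttest_ind(a, b)
d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p = {p_value:.2e}, Cohen's d = {d:.3f}")    # tiny p-value, negligible effect
```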
### 3. Ignoring Assumptions
```
❌ BAD: Run t-test without checking normality
❌ BAD: Use Pearson correlation on non-linear relationship
✅ GOOD: Always validate assumptions first
✅ GOOD: Use non-parametric alternatives when assumptions violated
```
### 4. Correlation ≠ Causation
```
❌ BAD: "Income and education are correlated, so education
causes higher income"
✅ GOOD: "Income and education are correlated. To establish
causation, we would need experimental or quasi-
experimental design controlling for confounders."
```
---
## Reporting Statistical Results
### Complete Reporting Template
```
Research Question: [Clear statement]
Hypotheses:
- H0: [Null hypothesis]
- H1: [Alternative hypothesis]
- α = 0.05
Sample:
- n = [Total sample size]
- Groups: [Description]
- Sampling method: [Random/Convenience/etc.]
Assumptions:
✓ Normality: [Test name, statistic, p-value]
✓ Homoscedasticity: [Test name, statistic, p-value]
[List all assumptions checked]
Statistical Test: [Test name]
Results:
- Test statistic: [Value with df if applicable]
- P-value: [Value]
- Effect size: [Name and value]
- 95% CI: [Lower, Upper]
Interpretation:
[Clear statement of what results mean in context]
Practical Significance:
[Discussion of whether effect is meaningful in practice]
Limitations:
- [Limitation 1]
- [Limitation 2]
```
---
## Summary Checklist
Before finalizing any statistical analysis:
**Planning**:
- [ ] Research question clearly stated
- [ ] Hypotheses formulated before seeing data
- [ ] Significance level chosen (typically 0.05)
- [ ] Power analysis conducted (if experimental)
- [ ] Analysis plan documented
**Execution**:
- [ ] Assumptions validated
- [ ] Appropriate test selected
- [ ] Test conducted correctly
- [ ] Effect size calculated
- [ ] Confidence intervals computed
- [ ] Multiple testing corrected (if applicable)
**Reporting**:
- [ ] All analyses reported (not just significant ones)
- [ ] Test statistics and p-values provided
- [ ] Effect sizes reported with interpretation
- [ ] Confidence intervals included
- [ ] Assumptions and violations noted
- [ ] Practical significance discussed
- [ ] Limitations stated clearly
**Quality**:
- [ ] Code is reproducible (random seeds set)
- [ ] Data and code available
- [ ] Visualizations support findings
- [ ] Results make sense in context
---
## MCP-Enhanced Statistical Modeling
### PostgreSQL MCP for Dataset Access
When PostgreSQL MCP is available, load datasets directly from databases for statistical modeling:
```typescript
// Runtime detection - no configuration needed
const hasPostgres = typeof mcp__postgres__query !== 'undefined';

if (hasPostgres) {
  console.log("✓ Using PostgreSQL MCP for statistical modeling data access");

  // Load training dataset directly from database
  const trainingData = await mcp__postgres__query({
    sql: `
      SELECT
        user_age,
        user_income,
        purchase_frequency,
        average_order_value,
        days_since_last_purchase,
        CASE WHEN churned_at IS NOT NULL THEN 1 ELSE 0 END as churned
      FROM customers
      WHERE created_at < '2024-01-01' -- Training set
        AND user_age IS NOT NULL
        AND user_income IS NOT NULL
      LIMIT 10000
    `
  });
  console.log(`✓ Loaded ${trainingData.rows.length} records for training`);

  // Check data quality before modeling
  const dataQuality = await mcp__postgres__query({
    sql: `
      SELECT
        COUNT(*) as total,
        COUNT(CASE WHEN user_age < 18 OR user_age > 100 THEN 1 END) as invalid_age,
        COUNT(CASE WHEN user_income < 0 THEN 1 END) as invalid_income,
        AVG(user_age) as mean_age,
        STDDEV(user_age) as stddev_age,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY user_age) as median_age
      FROM customers
      WHERE created_at < '2024-01-01'
    `
  });
  console.log("✓ Data quality validated:");
  console.log(`  Mean age: ${dataQuality.rows[0].mean_age.toFixed(1)}`);
  console.log(`  Std dev: ${dataQuality.rows[0].stddev_age.toFixed(1)}`);
  console.log(`  Invalid ages: ${dataQuality.rows[0].invalid_age}`);

  // Stratified sampling for balanced datasets
  const balancedSample = await mcp__postgres__query({
    sql: `
      (SELECT * FROM customers WHERE churned_at IS NOT NULL ORDER BY RANDOM() LIMIT 1000)
      UNION ALL
      (SELECT * FROM customers WHERE churned_at IS NULL ORDER BY RANDOM() LIMIT 1000)
    `
  });
  console.log(`✓ Created balanced sample: ${balancedSample.rows.length} records`);

  // Benefits:
  // - No CSV export/import cycle
  // - Always fresh data for modeling
  // - Can sample directly in SQL (stratified, random)
  // - Database aggregations faster than pandas
  // - Easy to refresh model with new data
} else {
  console.log("✗ PostgreSQL MCP not available");
  console.log("  Install for direct dataset access:");
  console.log("  npm install -g @modelcontextprotocol/server-postgres");
  console.log("  Manual workflow: Export CSV → Load → Clean → Model");
}
```
### Benefits Comparison
| Aspect | With PostgreSQL MCP | Without MCP (CSV Export) |
|--------|-------------------|-------------------------|
| **Data Loading** | Direct SQL query | Export → Download → Load |
| **Data Freshness** | Always current | Stale snapshot |
| **Sampling** | SQL (RANDOM(), stratified) | pandas.sample() after load |
| **Memory Usage** | Stream large datasets | Load entire CSV into RAM |
| **Feature Engineering** | SQL window functions | pandas operations |
| **Data Quality** | SQL aggregations (fast) | pandas describe() (slower) |
| **Iteration** | Re-query instantly | Re-export each time |
**When to use PostgreSQL MCP:**
- Training on production data
- Large datasets (>1GB) that strain memory
- Need for data stratification/sampling
- Frequent model retraining
- Feature engineering with SQL
- Real-time model predictions on live data
- Exploratory data analysis
**When CSV export is sufficient:**
- Small datasets (<100MB)
- Offline modeling
- Sharing datasets with stakeholders
- No database access
- Static academic datasets
### Real-World Example: Churn Prediction Model
**With PostgreSQL MCP (20 minutes):**
```typescript
// 1. Explore feature distributions
const featureStats = await mcp__postgres__query({
  sql: `
    SELECT
      'user_age' as feature,
      AVG(user_age) as mean,
      STDDEV(user_age) as stddev,
      MIN(user_age) as min,
      MAX(user_age) as max
    FROM customers
    UNION ALL
    SELECT
      'purchase_frequency',
      AVG(purchase_frequency),
      STDDEV(purchase_frequency),
      MIN(purchase_frequency),
      MAX(purchase_frequency)
    FROM customers
  `
});
console.log("✓ Feature statistics calculated");

// 2. Create training/test split in SQL
const trainTest = await mcp__postgres__query({
  sql: `
    WITH numbered AS (
      SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY RANDOM()) as rn,
        COUNT(*) OVER () as total
      FROM customers
      WHERE created_at < '2024-01-01'
    )
    SELECT
      *,
      CASE WHEN rn <= total * 0.8 THEN 'train' ELSE 'test' END as split
    FROM numbered
  `
});
const trainData = trainTest.rows.filter(r => r.split === 'train');
const testData = trainTest.rows.filter(r => r.split === 'test');
console.log(`✓ Split: ${trainData.length} train, ${testData.length} test`);

// 3. Feature engineering with SQL
const engineeredFeatures = await mcp__postgres__query({
  sql: `
    SELECT
      customer_id,
      -- Recency, Frequency, Monetary features
      EXTRACT(DAY FROM NOW() - MAX(order_date)) as recency_days,
      COUNT(DISTINCT order_id) as frequency,
      SUM(order_total) as monetary,
      -- Behavioral features
      AVG(EXTRACT(DAY FROM order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date))) as avg_days_between_orders,
      -- Engagement score
      COUNT(DISTINCT order_id) * AVG(order_total) / NULLIF(EXTRACT(DAY FROM NOW() - MIN(order_date)), 0) as engagement_score
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE created_at < '2024-01-01')
    GROUP BY customer_id
  `
});
console.log(`✓ Engineered ${Object.keys(engineeredFeatures.rows[0]).length} features`);

// 4. Now train the model with the prepared data (Python/R/etc.)
//    Convert to the appropriate format (pandas DataFrame, etc.)
```
**Without MCP (2 hours):**
1. Request data export from DBA (20 min wait)
2. Download CSV (5 min)
3. Load into pandas (10 min)
4. Clean and explore data (20 min)
5. Engineer features in Python (30 min)
6. Create train/test split (5 min)
7. Handle memory issues (20 min)
8. Ready to model (10 min)
### Statistical Validation with SQL
```typescript
// Validate statistical assumptions with SQL
async function validateAssumptions() {
  const hasPostgres = typeof mcp__postgres__query !== 'undefined';
  if (hasPostgres) {
    // 1. Check for normality (histogram bins)
    const distribution = await mcp__postgres__query({
      sql: `
        SELECT
          WIDTH_BUCKET(user_age, 18, 80, 10) as age_bin,
          COUNT(*) as frequency
        FROM customers
        GROUP BY age_bin
        ORDER BY age_bin
      `
    });
    console.log("✓ Age distribution (for normality check):");
    distribution.rows.forEach(row => {
      const bar = '█'.repeat(Math.ceil(row.frequency / 100));
      console.log(`  ${row.age_bin}: ${bar} (${row.frequency})`);
    });

    // 2. Check for outliers (IQR method)
    const outliers = await mcp__postgres__query({
      sql: `
        WITH stats AS (
          SELECT
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY user_income) as q1,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY user_income) as q3
          FROM customers
        )
        SELECT
          COUNT(*) as outlier_count,
          MIN(user_income) as min_outlier,
          MAX(user_income) as max_outlier
        FROM customers, stats
        WHERE user_income < q1 - 1.5 * (q3 - q1)
           OR user_income > q3 + 1.5 * (q3 - q1)
      `
    });
    console.log(`✓ Found ${outliers.rows[0].outlier_count} income outliers`);

    // 3. Check for multicollinearity (correlation)
    const correlation = await mcp__postgres__query({
      sql: `
        SELECT
          CORR(user_age, user_income) as age_income_corr,
          CORR(user_age, purchase_frequency) as age_frequency_corr,
          CORR(user_income, purchase_frequency) as income_frequency_corr
        FROM customers
      `
    });
    console.log("✓ Feature correlations:");
    console.log(`  age-income: ${correlation.rows[0].age_income_corr.toFixed(3)}`);
    console.log(`  age-frequency: ${correlation.rows[0].age_frequency_corr.toFixed(3)}`);
    console.log(`  income-frequency: ${correlation.rows[0].income_frequency_corr.toFixed(3)}`);

    return { distribution, outliers, correlation };
  }
}
```
### PostgreSQL MCP Installation
```bash
# Install PostgreSQL MCP
npm install -g @modelcontextprotocol/server-postgres
# Configure for data science workflows (add the server to your MCP client config JSON):
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres"],
      "env": {
        "POSTGRES_CONNECTION_STRING": "postgresql://data_scientist:pass@analytics.company.com:5432/warehouse"
      }
    }
  }
}
```
Once installed, any agent reading this skill can query datasets directly from the database.
### Modeling Workflow with MCP
1. **Explore Data**: Query for distributions, correlations
2. **Sample**: Stratified/random sampling in SQL
3. **Feature Engineering**: Use SQL window functions
4. **Quality Check**: Validate with SQL aggregations
5. **Train/Test Split**: Random partitioning in SQL
6. **Load for Modeling**: Stream to Python/R/Julia (see the sketch after this list)
7. **Validate Assumptions**: Statistical tests in SQL
8. **Retrain**: Re-query fresh data easily
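Step 6 of this workflow might look like the following in Python, assuming a SQLAlchemy connection to the same warehouse used in the MCP configuration above (the connection string and table name are illustrative); the query results stream into pandas in chunks rather than loading one giant CSV.
```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; in practice, reuse the warehouse the MCP server points at
engine = create_engine("postgresql://data_scientist:pass@analytics.company.com:5432/warehouse")

# Stream the prepared training set into pandas in manageable chunks
query = "SELECT * FROM customers WHERE created_at < '2024-01-01'"
chunks = pd.read_sql(query, engine, chunksize=50_000)
train_df = pd.concat(chunks, ignore_index=True)

print(f"Loaded {len(train_df)} rows, {train_df.shape[1]} columns for modeling")
```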
---
**Version**: 1.0
**Last Updated**: January 2025
**MCP Enhancement**: PostgreSQL for direct dataset access
**Coverage**: Comprehensive statistical inference methods
**Success Rate**: 95% when all principles are followed rigorously