# Statistical Modeling Skill

**Production-tested patterns for rigorous statistical analysis and inference**

This skill codifies best practices from decades of statistical research and thousands of production analyses across scientific and business applications.

---

## Core Principles

1. **Check Assumptions First**: Never apply a statistical test without validating assumptions
2. **Effect Sizes Matter**: P-values alone are insufficient; always report effect sizes
3. **Confidence Intervals**: Quantify uncertainty with confidence intervals
4. **Practical Significance**: Statistical significance ≠ practical significance
5. **Reproducibility**: Always set random seeds and document methodology (see the sketch below)
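Principle 5 is cheap to operationalize; a minimal sketch, assuming NumPy is available (the seed value and the logged fields are illustrative choices, not requirements):

```python
import json
import numpy as np

SEED = 42  # illustrative seed; any documented value works
rng = np.random.default_rng(SEED)

# Record the methodology alongside the results so the analysis can be rerun
analysis_log = {
    "seed": SEED,
    "alpha": 0.05,
    "test": "Welch's t-test",
    "numpy_version": np.__version__,
}
print(json.dumps(analysis_log, indent=2))
```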
---

## Statistical Hypothesis Testing Framework

### The Scientific Method

```
1. State research question clearly
2. Formulate null and alternative hypotheses
3. Choose significance level (α) before seeing data
4. Check assumptions
5. Select appropriate test
6. Calculate test statistic and p-value
7. Compute effect size and confidence interval
8. Make decision and interpret results
9. State limitations
```

### Hypothesis Formulation

```
Always state both hypotheses clearly.

Two-sided (default for most tests):
H0: μ1 = μ2 (no difference between groups)
H1: μ1 ≠ μ2 (groups differ)

One-sided (only when a directional hypothesis is justified):
H0: μ1 ≥ μ2 (group 1 not less than group 2)
H1: μ1 < μ2 (group 1 less than group 2)

Significance level:
α = 0.05 (probability of Type I error - false positive)
```

### Type I and Type II Errors

```
Type I Error (α): False Positive
- Reject H0 when it's actually true
- "Seeing a difference that doesn't exist"
- Controlled by significance level (typically 0.05)

Type II Error (β): False Negative
- Fail to reject H0 when it's actually false
- "Missing a difference that does exist"
- Related to statistical power: Power = 1 - β
- Typical power target: 0.80 (80%)
```
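Power, sample size, effect size, and α are linked, so power should be planned before data collection; a minimal sketch using statsmodels' `TTestIndPower` (the medium effect size of 0.5 and the n = 50 are illustrative assumptions):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5)
# with alpha = 0.05 and 80% power in a two-sided independent t-test
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")

# Conversely, the power achieved with a fixed sample size of 50 per group
achieved_power = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05,
                                       alternative='two-sided')
print(f"Power with n=50 per group: {achieved_power:.2f}")
```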
---

## Test Selection Decision Tree

### Comparing Means

#### Two Independent Groups

```python
# Check assumptions first
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

# 1. Normality test
stat_norm1, p_norm1 = shapiro(group1)
stat_norm2, p_norm2 = shapiro(group2)
normal = (p_norm1 > 0.05) and (p_norm2 > 0.05)

# 2. Equal variance test
stat_var, p_var = levene(group1, group2)
equal_var = p_var > 0.05

# 3. Select test
if normal and equal_var:
    # Parametric: Independent samples t-test
    stat, p_value = ttest_ind(group1, group2)
    test_name = "Independent t-test"
elif normal and not equal_var:
    # Parametric: Welch's t-test (unequal variances)
    stat, p_value = ttest_ind(group1, group2, equal_var=False)
    test_name = "Welch's t-test"
else:
    # Non-parametric: Mann-Whitney U test
    stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
    test_name = "Mann-Whitney U test"

print(f"Test used: {test_name}")
print(f"p-value: {p_value:.4f}")
```

#### Two Paired Groups

```python
from scipy.stats import ttest_rel, wilcoxon, shapiro

# Check normality of differences
differences = group1_after - group1_before
stat_norm, p_norm = shapiro(differences)

if p_norm > 0.05:
    # Parametric: Paired t-test
    stat, p_value = ttest_rel(group1_before, group1_after)
    test_name = "Paired t-test"
else:
    # Non-parametric: Wilcoxon signed-rank test
    stat, p_value = wilcoxon(group1_before, group1_after)
    test_name = "Wilcoxon signed-rank test"
```

#### Multiple Groups (3+)

```python
import numpy as np
from scipy.stats import f_oneway, kruskal, shapiro, levene, mannwhitneyu

# Check normality for all groups
all_normal = all([shapiro(group)[1] > 0.05 for group in groups])

# Check equal variances
stat_var, p_var = levene(*groups)
equal_var = p_var > 0.05

if all_normal and equal_var:
    # Parametric: One-way ANOVA
    stat, p_value = f_oneway(*groups)
    test_name = "One-way ANOVA"

    # If significant, follow up with post-hoc tests
    if p_value < 0.05:
        from statsmodels.stats.multicomp import pairwise_tukeyhsd
        # Combine all groups with labels
        data = np.concatenate(groups)
        labels = np.concatenate([[i]*len(g) for i, g in enumerate(groups)])
        tukey_result = pairwise_tukeyhsd(data, labels, alpha=0.05)
        print(tukey_result)
else:
    # Non-parametric: Kruskal-Wallis test
    stat, p_value = kruskal(*groups)
    test_name = "Kruskal-Wallis test"

    # If significant, follow up with pairwise Mann-Whitney with Bonferroni
    if p_value < 0.05:
        from itertools import combinations
        from statsmodels.stats.multitest import multipletests

        pairs = list(combinations(range(len(groups)), 2))
        p_values = [mannwhitneyu(groups[i], groups[j])[1] for i, j in pairs]

        # Bonferroni correction
        reject, p_corrected, _, _ = multipletests(p_values, method='bonferroni')

        for (i, j), p_orig, p_corr, sig in zip(pairs, p_values, p_corrected, reject):
            print(f"Group {i} vs Group {j}: p={p_orig:.4f} (corrected: {p_corr:.4f}) - {'SIG' if sig else 'NS'}")
```
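ANOVA p-values say nothing about magnitude; a minimal sketch of eta squared (η²), computed directly from the same `groups` list used above:

```python
import numpy as np

def eta_squared(groups):
    """Eta squared: proportion of total variance explained by group membership.

    Rough guide: 0.01 small, 0.06 medium, 0.14 large.
    """
    all_data = np.concatenate(groups)
    grand_mean = all_data.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_data - grand_mean) ** 2).sum()
    return ss_between / ss_total

print(f"Eta squared: {eta_squared(groups):.3f}")
```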

### Comparing Proportions

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from statsmodels.stats.proportion import proportions_ztest

# 2x2 contingency table
def compare_proportions(success1, n1, success2, n2):
    """Compare two proportions"""

    # Create contingency table
    table = np.array([
        [success1, n1 - success1],
        [success2, n2 - success2]
    ])

    # Check expected frequencies (need >= 5 for chi-square)
    expected = (table.sum(axis=0) * table.sum(axis=1)[:, None]) / table.sum()

    if (expected >= 5).all():
        # Chi-square test
        chi2, p_value, dof, expected = chi2_contingency(table)
        test_name = "Chi-square test"
    else:
        # Fisher's exact test (for small samples)
        odds_ratio, p_value = fisher_exact(table)
        test_name = "Fisher's exact test"

    # Also compute the z-test for proportions
    counts = np.array([success1, success2])
    nobs = np.array([n1, n2])
    stat, p_value_z = proportions_ztest(counts, nobs)

    return {
        'test': test_name,
        'p_value': p_value,
        'p_value_ztest': p_value_z,
        'prop1': success1 / n1,
        'prop2': success2 / n2
    }
```
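The comparison should also report uncertainty about the size of the difference; a minimal sketch of a Wald 95% confidence interval for the difference in proportions, computed by hand (the counts in the usage line are illustrative):

```python
import numpy as np
from scipy.stats import norm

def proportion_diff_ci(success1, n1, success2, n2, confidence=0.95):
    """Wald confidence interval for p1 - p2."""
    p1, p2 = success1 / n1, success2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = proportion_diff_ci(120, 1000, 150, 1000)
print(f"Difference: {diff:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```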
---

## Effect Sizes

**Why**: Effect sizes quantify the magnitude of differences or relationships, independent of sample size.

### Cohen's d (for t-tests)

```python
import numpy as np

def cohens_d(group1, group2):
    """
    Cohen's d effect size for independent samples

    Interpretation:
    - 0.2: small effect
    - 0.5: medium effect
    - 0.8: large effect
    """
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Pooled standard deviation
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))

    # Effect size
    d = (np.mean(group1) - np.mean(group2)) / pooled_std

    return d

# Example usage
d = cohens_d(treatment_group, control_group)
print(f"Cohen's d: {d:.3f}")

if abs(d) < 0.2:
    print("Negligible effect size")
elif abs(d) < 0.5:
    print("Small effect size")
elif abs(d) < 0.8:
    print("Medium effect size")
else:
    print("Large effect size")
```

### Correlation Effect Sizes

```python
from scipy.stats import pearsonr, spearmanr

# Pearson correlation (linear relationship, normal data)
r, p_value = pearsonr(x, y)

# Spearman correlation (monotonic relationship, non-normal data)
rho, p_value = spearmanr(x, y)

# Interpretation
# |r| < 0.3: weak correlation
# 0.3 <= |r| < 0.7: moderate correlation
# |r| >= 0.7: strong correlation

# r² = coefficient of determination (variance explained)
r_squared = r ** 2
print(f"Pearson r: {r:.3f} (explains {r_squared:.1%} of variance)")
```
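A correlation coefficient should also carry a confidence interval; a minimal sketch using the Fisher z-transformation (assumes `x`, `y`, and `r` from the block above):

```python
import numpy as np
from scipy.stats import norm

def pearson_r_ci(r, n, confidence=0.95):
    """Confidence interval for Pearson r via the Fisher z-transformation."""
    z = np.arctanh(r)                 # Fisher transform of r
    se = 1.0 / np.sqrt(n - 3)         # standard error on the z scale
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    lo = np.tanh(z - z_crit * se)
    hi = np.tanh(z + z_crit * se)
    return lo, hi

lo, hi = pearson_r_ci(r, n=len(x))
print(f"Pearson r: {r:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```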

### Odds Ratios and Relative Risk

```python
import numpy as np

def odds_ratio(a, b, c, d):
    """
    Calculate odds ratio from a 2x2 table:
                 Outcome+   Outcome-
    Exposed+        a          b
    Exposed-        c          d

    OR = (a/b) / (c/d) = (a*d) / (b*c)
    """
    # With a zero cell, a common fix is to add 0.5 to every cell
    # (Haldane-Anscombe correction) before computing the OR and CI
    or_value = (a * d) / (b * c) if (b * c) > 0 else np.inf

    # Confidence interval (log scale)
    log_or = np.log(or_value)
    se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)

    ci_low = np.exp(log_or - 1.96 * se_log_or)
    ci_high = np.exp(log_or + 1.96 * se_log_or)

    return {
        'odds_ratio': or_value,
        'ci_95': (ci_low, ci_high)
    }

# Interpretation:
# OR = 1: No association
# OR > 1: Increased odds in exposed group
# OR < 1: Decreased odds in exposed group
```
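The heading also mentions relative risk, which the block above does not compute; a minimal sketch using the same 2x2 layout (the counts in the usage line are illustrative):

```python
import numpy as np

def relative_risk(a, b, c, d):
    """Relative risk (risk ratio) from the same 2x2 table as odds_ratio().

    RR = [a / (a + b)] / [c / (c + d)]
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed

    # 95% CI on the log scale
    se_log_rr = np.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    ci_low = np.exp(np.log(rr) - 1.96 * se_log_rr)
    ci_high = np.exp(np.log(rr) + 1.96 * se_log_rr)

    return {'relative_risk': rr, 'ci_95': (ci_low, ci_high)}

print(relative_risk(30, 70, 15, 85))
```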
---

## Regression Analysis

### Linear Regression

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, linear_rainbow
from statsmodels.stats.stattools import durbin_watson

def fit_linear_regression(X, y):
    """
    Fit linear regression with comprehensive diagnostics

    Assumptions to check:
    1. Linearity: relationship between X and y is linear
    2. Independence: observations are independent
    3. Homoscedasticity: constant variance of residuals
    4. Normality: residuals are normally distributed
    5. No multicollinearity: predictors not highly correlated
    """

    # Add constant term
    X_with_const = sm.add_constant(X)

    # Fit model
    model = sm.OLS(y, X_with_const).fit()

    # Print summary
    print(model.summary())

    # ASSUMPTION CHECKS

    # 1. Linearity (Rainbow test)
    rainbow_stat, rainbow_p = linear_rainbow(model)
    print(f"\nLinearity (Rainbow test): p = {rainbow_p:.4f}")
    if rainbow_p < 0.05:
        print("  WARNING: Linearity assumption may be violated")

    # 2. Independence (Durbin-Watson test)
    dw = durbin_watson(model.resid)
    print(f"\nIndependence (Durbin-Watson): {dw:.3f}")
    print("  (Values close to 2 indicate no autocorrelation)")
    if dw < 1.5 or dw > 2.5:
        print("  WARNING: Autocorrelation detected")

    # 3. Homoscedasticity (Breusch-Pagan test)
    bp_test = het_breuschpagan(model.resid, model.model.exog)
    bp_stat, bp_p = bp_test[0], bp_test[1]
    print(f"\nHomoscedasticity (Breusch-Pagan): p = {bp_p:.4f}")
    if bp_p < 0.05:
        print("  WARNING: Heteroscedasticity detected")

    # 4. Normality of residuals (Shapiro-Wilk)
    from scipy.stats import shapiro
    shapiro_stat, shapiro_p = shapiro(model.resid)
    print(f"\nNormality of residuals (Shapiro-Wilk): p = {shapiro_p:.4f}")
    if shapiro_p < 0.05:
        print("  WARNING: Residuals not normally distributed")

    # 5. Multicollinearity (VIF)
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    vif_data = pd.DataFrame()
    vif_data["Variable"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    print("\nMulticollinearity (VIF):")
    print(vif_data)
    print("  (VIF > 10 indicates high multicollinearity)")

    # DIAGNOSTIC PLOTS
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Residuals vs Fitted
    axes[0, 0].scatter(model.fittedvalues, model.resid, alpha=0.5)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_xlabel('Fitted values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

    # 2. Q-Q plot
    from scipy import stats
    stats.probplot(model.resid, dist="norm", plot=axes[0, 1])
    axes[0, 1].set_title('Normal Q-Q Plot')

    # 3. Scale-Location (sqrt of standardized residuals vs fitted)
    standardized_resid = model.resid / np.std(model.resid)
    axes[1, 0].scatter(model.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.5)
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('√|Standardized Residuals|')
    axes[1, 0].set_title('Scale-Location')

    # 4. Residuals vs Leverage
    from statsmodels.stats.outliers_influence import OLSInfluence
    influence = OLSInfluence(model)
    leverage = influence.hat_matrix_diag
    axes[1, 1].scatter(leverage, standardized_resid, alpha=0.5)
    axes[1, 1].axhline(y=0, color='r', linestyle='--')
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized Residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    plt.tight_layout()
    plt.savefig('regression_diagnostics.png', dpi=300)
    plt.close()

    return model
```
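If the Breusch-Pagan check above flags heteroscedasticity, one common remedy is to report heteroscedasticity-robust (HC3) standard errors; a minimal sketch, assuming the same `X` and `y` as above:

```python
import statsmodels.api as sm

# Refit the same OLS model but report HC3 robust standard errors,
# which remain valid when the residual variance is not constant
X_with_const = sm.add_constant(X)
robust_model = sm.OLS(y, X_with_const).fit(cov_type='HC3')
print(robust_model.summary())
```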

### Logistic Regression

```python
import numpy as np
import statsmodels.api as sm

def fit_logistic_regression(X, y):
    """
    Fit logistic regression for binary outcomes

    Key metrics:
    - Coefficients: log odds ratios
    - Odds ratios: exp(coefficients)
    - Pseudo R²: McFadden's R²
    - AIC/BIC: model comparison
    """

    # Add constant
    X_with_const = sm.add_constant(X)

    # Fit model
    model = sm.Logit(y, X_with_const).fit()

    # Print summary
    print(model.summary())

    # Odds ratios
    odds_ratios = np.exp(model.params)
    conf_int = np.exp(model.conf_int())

    print("\nOdds Ratios with 95% CI:")
    for var, or_val, ci in zip(X.columns, odds_ratios[1:], conf_int.values[1:]):
        print(f"{var}: OR = {or_val:.3f} [95% CI: {ci[0]:.3f}, {ci[1]:.3f}]")

    return model
```
---

## Time Series Analysis

### Stationarity Testing

```python
from statsmodels.tsa.stattools import adfuller, kpss

def test_stationarity(timeseries):
    """
    Test whether a time series is stationary

    A stationary series has:
    - Constant mean over time
    - Constant variance over time
    - No seasonality or trend

    Use ADF and KPSS tests together:
    - ADF: H0 = non-stationary (want p < 0.05 to reject)
    - KPSS: H0 = stationary (want p > 0.05 to not reject)
    """

    # Augmented Dickey-Fuller test
    adf_result = adfuller(timeseries, autolag='AIC')
    print("ADF Test:")
    print(f"  ADF Statistic: {adf_result[0]:.4f}")
    print(f"  p-value: {adf_result[1]:.4f}")
    print(f"  Critical values: {adf_result[4]}")

    if adf_result[1] < 0.05:
        print("  Result: Reject H0 - Series is stationary (ADF)")
        adf_stationary = True
    else:
        print("  Result: Fail to reject H0 - Series is non-stationary (ADF)")
        adf_stationary = False

    # KPSS test
    kpss_result = kpss(timeseries, regression='c')
    print("\nKPSS Test:")
    print(f"  KPSS Statistic: {kpss_result[0]:.4f}")
    print(f"  p-value: {kpss_result[1]:.4f}")
    print(f"  Critical values: {kpss_result[3]}")

    if kpss_result[1] > 0.05:
        print("  Result: Fail to reject H0 - Series is stationary (KPSS)")
        kpss_stationary = True
    else:
        print("  Result: Reject H0 - Series is non-stationary (KPSS)")
        kpss_stationary = False

    # Combined interpretation
    print("\nCombined Interpretation:")
    if adf_stationary and kpss_stationary:
        print("  ✓ Series is STATIONARY (both tests agree)")
    elif not adf_stationary and not kpss_stationary:
        print("  ✗ Series is NON-STATIONARY (both tests agree)")
    else:
        print("  ? Tests disagree - series may be trend-stationary")

    return adf_stationary and kpss_stationary
```
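If both tests point to non-stationarity, first-order differencing is the usual next step; a minimal sketch, assuming `timeseries` is a pandas Series and `test_stationarity` is the function above:

```python
# First difference: y_t - y_{t-1}; this corresponds to d=1 in ARIMA(p, d, q)
differenced = timeseries.diff().dropna()

# Re-run the stationarity checks on the differenced series
is_stationary = test_stationarity(differenced)

# A second difference (d=2) can be tried if needed, but over-differencing
# inflates variance, so stop as soon as the tests pass
if not is_stationary:
    twice_differenced = differenced.diff().dropna()
    test_stationarity(twice_differenced)
```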

### ARIMA Modeling

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def fit_arima(timeseries, order=(1,1,1)):
    """
    Fit an ARIMA(p,d,q) model

    p: AR (autoregressive) order
    d: Differencing order (to achieve stationarity)
    q: MA (moving average) order

    Use ACF and PACF to determine orders:
    - ACF cuts off after lag q: suggests MA(q)
    - PACF cuts off after lag p: suggests AR(p)
    """

    # Plot ACF and PACF
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    plot_acf(timeseries, lags=40, ax=axes[0])
    plot_pacf(timeseries, lags=40, ax=axes[1])
    plt.tight_layout()
    plt.savefig('acf_pacf.png')
    plt.close()

    # Fit ARIMA
    model = ARIMA(timeseries, order=order)
    fitted_model = model.fit()

    print(fitted_model.summary())

    # Diagnostic plots
    fitted_model.plot_diagnostics(figsize=(12, 8))
    plt.tight_layout()
    plt.savefig('arima_diagnostics.png')
    plt.close()

    # Residual analysis
    residuals = fitted_model.resid

    # Test if residuals are white noise (should be for a good model)
    from statsmodels.stats.diagnostic import acorr_ljungbox

    lb_test = acorr_ljungbox(residuals, lags=[10], return_df=True)
    print("\nLjung-Box Test (residuals should be white noise):")
    print(lb_test)

    if lb_test['lb_pvalue'].values[0] > 0.05:
        print("  ✓ Residuals are white noise (model is adequate)")
    else:
        print("  ✗ Residuals show autocorrelation (model may need adjustment)")

    return fitted_model
```
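Once the residuals look like white noise, the fitted model can forecast with uncertainty bands; a minimal sketch, assuming `fitted_model` is the result of `fit_arima` above (the 12-step horizon is an illustrative choice):

```python
# Forecast the next 12 periods with 95% prediction intervals
forecast = fitted_model.get_forecast(steps=12)
predicted = forecast.predicted_mean
intervals = forecast.conf_int(alpha=0.05)

for step, (point, (low, high)) in enumerate(zip(predicted, intervals.values), start=1):
    print(f"t+{step}: {point:.2f} [95% PI: {low:.2f}, {high:.2f}]")
```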
---

## Multiple Testing Corrections

**Problem**: When conducting multiple statistical tests, the probability of false positives increases.

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

def correct_multiple_testing(p_values, method='bonferroni', alpha=0.05):
    """
    Correct p-values for multiple testing

    Methods:
    - bonferroni: Most conservative, controls family-wise error rate
    - holm: Less conservative than Bonferroni
    - fdr_bh: False discovery rate (Benjamini-Hochberg)
    - fdr_by: False discovery rate (Benjamini-Yekutieli)
    """

    reject, p_corrected, alpha_sidak, alpha_bonf = multipletests(
        p_values,
        alpha=alpha,
        method=method
    )

    results = pd.DataFrame({
        'original_p': p_values,
        'corrected_p': p_corrected,
        'reject_null': reject
    })

    print(f"Multiple Testing Correction ({method}):")
    print(f"Number of tests: {len(p_values)}")
    print(f"Significance level: {alpha}")
    print(f"Significant (uncorrected): {sum(p < alpha for p in p_values)}")
    print(f"Significant (corrected): {sum(reject)}")

    return results

# Example: Testing 10 hypotheses
p_values = [0.001, 0.02, 0.03, 0.04, 0.08, 0.15, 0.20, 0.35, 0.50, 0.80]

results_bonf = correct_multiple_testing(p_values, method='bonferroni')
results_fdr = correct_multiple_testing(p_values, method='fdr_bh')

print("\nComparison:")
print(results_bonf)
```
---

## Bootstrap and Resampling

**Use case**: When assumptions are violated or the sample is small, the bootstrap provides distribution-free inference.

```python
import numpy as np
import matplotlib.pyplot as plt

def bootstrap_confidence_interval(data, statistic_func, n_bootstrap=10000, confidence=0.95):
    """
    Calculate a bootstrap confidence interval for any statistic

    data: array-like
    statistic_func: function that calculates the statistic (e.g., np.mean, np.median)
    n_bootstrap: number of bootstrap samples
    confidence: confidence level
    """

    np.random.seed(42)

    bootstrap_statistics = []

    for _ in range(n_bootstrap):
        # Resample with replacement
        sample = np.random.choice(data, size=len(data), replace=True)

        # Calculate statistic
        stat = statistic_func(sample)
        bootstrap_statistics.append(stat)

    # Confidence interval (percentile method)
    alpha = 1 - confidence
    ci_low = np.percentile(bootstrap_statistics, 100 * alpha / 2)
    ci_high = np.percentile(bootstrap_statistics, 100 * (1 - alpha / 2))

    # Original statistic
    original_stat = statistic_func(data)

    return {
        'statistic': original_stat,
        'ci_low': ci_low,
        'ci_high': ci_high,
        'bootstrap_distribution': bootstrap_statistics
    }

# Example: Bootstrap CI for median
result = bootstrap_confidence_interval(data, np.median, n_bootstrap=10000)
print(f"Median: {result['statistic']:.2f}")
print(f"95% CI: [{result['ci_low']:.2f}, {result['ci_high']:.2f}]")

# Visualize bootstrap distribution
plt.hist(result['bootstrap_distribution'], bins=50, edgecolor='black', alpha=0.7)
plt.axvline(result['statistic'], color='red', linestyle='--', label='Original statistic')
plt.axvline(result['ci_low'], color='green', linestyle='--', label='95% CI')
plt.axvline(result['ci_high'], color='green', linestyle='--')
plt.xlabel('Statistic value')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution')
plt.legend()
plt.savefig('bootstrap_distribution.png')
plt.close()
```
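The same idea extends to comparing groups; a minimal sketch of a percentile bootstrap CI for the difference in means between two samples (assumes `group1` and `group2` are NumPy arrays):

```python
import numpy as np

def bootstrap_mean_difference(group1, group2, n_bootstrap=10000, confidence=0.95, seed=42):
    """Percentile bootstrap CI for mean(group1) - mean(group2)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        resample1 = rng.choice(group1, size=len(group1), replace=True)
        resample2 = rng.choice(group2, size=len(group2), replace=True)
        diffs[i] = resample1.mean() - resample2.mean()
    alpha = 1 - confidence
    ci = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.mean(group1) - np.mean(group2), ci

diff, (lo, hi) = bootstrap_mean_difference(group1, group2)
print(f"Mean difference: {diff:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```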
---

## Common Pitfalls to Avoid

### 1. P-Hacking

```
❌ BAD: Running multiple tests until one is significant
❌ BAD: Dropping outliers until p < 0.05
❌ BAD: Peeking at results and then deciding on analysis

✅ GOOD: Pre-register analysis plan
✅ GOOD: Correct for multiple testing
✅ GOOD: Report all tests conducted
```

### 2. Confusing Statistical and Practical Significance

```
❌ BAD: "p < 0.001, so the treatment is highly effective"
        (Large sample can make tiny effects significant)

✅ GOOD: "p < 0.001 with effect size d = 0.05 (very small).
         While statistically significant, the practical impact
         is minimal."
```

### 3. Ignoring Assumptions

```
❌ BAD: Run t-test without checking normality
❌ BAD: Use Pearson correlation on non-linear relationship

✅ GOOD: Always validate assumptions first
✅ GOOD: Use non-parametric alternatives when assumptions violated
```

### 4. Correlation ≠ Causation

```
❌ BAD: "Income and education are correlated, so education
        causes higher income"

✅ GOOD: "Income and education are correlated. To establish
         causation, we would need experimental or quasi-
         experimental design controlling for confounders."
```

---

## Reporting Statistical Results

### Complete Reporting Template

```
Research Question: [Clear statement]

Hypotheses:
- H0: [Null hypothesis]
- H1: [Alternative hypothesis]
- α = 0.05

Sample:
- n = [Total sample size]
- Groups: [Description]
- Sampling method: [Random/Convenience/etc.]

Assumptions:
✓ Normality: [Test name, statistic, p-value]
✓ Homoscedasticity: [Test name, statistic, p-value]
[List all assumptions checked]

Statistical Test: [Test name]

Results:
- Test statistic: [Value with df if applicable]
- P-value: [Value]
- Effect size: [Name and value]
- 95% CI: [Lower, Upper]

Interpretation:
[Clear statement of what results mean in context]

Practical Significance:
[Discussion of whether effect is meaningful in practice]

Limitations:
- [Limitation 1]
- [Limitation 2]
```

---

## Summary Checklist

Before finalizing any statistical analysis:

**Planning**:
- [ ] Research question clearly stated
- [ ] Hypotheses formulated before seeing data
- [ ] Significance level chosen (typically 0.05)
- [ ] Power analysis conducted (if experimental)
- [ ] Analysis plan documented

**Execution**:
- [ ] Assumptions validated
- [ ] Appropriate test selected
- [ ] Test conducted correctly
- [ ] Effect size calculated
- [ ] Confidence intervals computed
- [ ] Multiple testing corrected (if applicable)

**Reporting**:
- [ ] All analyses reported (not just significant ones)
- [ ] Test statistics and p-values provided
- [ ] Effect sizes reported with interpretation
- [ ] Confidence intervals included
- [ ] Assumptions and violations noted
- [ ] Practical significance discussed
- [ ] Limitations stated clearly

**Quality**:
- [ ] Code is reproducible (random seeds set)
- [ ] Data and code available
- [ ] Visualizations support findings
- [ ] Results make sense in context

---

## MCP-Enhanced Statistical Modeling

### PostgreSQL MCP for Dataset Access

When PostgreSQL MCP is available, load datasets directly from databases for statistical modeling:

```typescript
// Runtime detection - no configuration needed
const hasPostgres = typeof mcp__postgres__query !== 'undefined';

if (hasPostgres) {
  console.log("✓ Using PostgreSQL MCP for statistical modeling data access");

  // Load training dataset directly from database
  const trainingData = await mcp__postgres__query({
    sql: `
      SELECT
        user_age,
        user_income,
        purchase_frequency,
        average_order_value,
        days_since_last_purchase,
        CASE WHEN churned_at IS NOT NULL THEN 1 ELSE 0 END as churned
      FROM customers
      WHERE created_at < '2024-01-01' -- Training set
        AND user_age IS NOT NULL
        AND user_income IS NOT NULL
      LIMIT 10000
    `
  });

  console.log(`✓ Loaded ${trainingData.rows.length} records for training`);

  // Check data quality before modeling
  const dataQuality = await mcp__postgres__query({
    sql: `
      SELECT
        COUNT(*) as total,
        COUNT(CASE WHEN user_age < 18 OR user_age > 100 THEN 1 END) as invalid_age,
        COUNT(CASE WHEN user_income < 0 THEN 1 END) as invalid_income,
        AVG(user_age) as mean_age,
        STDDEV(user_age) as stddev_age,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY user_age) as median_age
      FROM customers
      WHERE created_at < '2024-01-01'
    `
  });

  console.log("✓ Data quality validated:");
  console.log(`  Mean age: ${dataQuality.rows[0].mean_age.toFixed(1)}`);
  console.log(`  Std dev: ${dataQuality.rows[0].stddev_age.toFixed(1)}`);
  console.log(`  Invalid ages: ${dataQuality.rows[0].invalid_age}`);

  // Stratified sampling for balanced datasets
  const balancedSample = await mcp__postgres__query({
    sql: `
      (SELECT * FROM customers WHERE churned_at IS NOT NULL ORDER BY RANDOM() LIMIT 1000)
      UNION ALL
      (SELECT * FROM customers WHERE churned_at IS NULL ORDER BY RANDOM() LIMIT 1000)
    `
  });

  console.log(`✓ Created balanced sample: ${balancedSample.rows.length} records`);

  // Benefits:
  // - No CSV export/import cycle
  // - Always fresh data for modeling
  // - Can sample directly in SQL (stratified, random)
  // - Database aggregations faster than pandas
  // - Easy to refresh model with new data

} else {
  console.log("ℹ️ PostgreSQL MCP not available");
  console.log("  Install for direct dataset access:");
  console.log("  npm install -g @modelcontextprotocol/server-postgres");
  console.log("  Manual workflow: Export CSV → Load → Clean → Model");
}
```

### Benefits Comparison

| Aspect | With PostgreSQL MCP | Without MCP (CSV Export) |
|--------|---------------------|--------------------------|
| **Data Loading** | Direct SQL query | Export → Download → Load |
| **Data Freshness** | Always current | Stale snapshot |
| **Sampling** | SQL (RANDOM(), stratified) | pandas.sample() after load |
| **Memory Usage** | Stream large datasets | Load entire CSV into RAM |
| **Feature Engineering** | SQL window functions | pandas operations |
| **Data Quality** | SQL aggregations (fast) | pandas describe() (slower) |
| **Iteration** | Re-query instantly | Re-export each time |

**When to use PostgreSQL MCP:**
- Training on production data
- Large datasets (>1GB) that strain memory
- Need for data stratification/sampling
- Frequent model retraining
- Feature engineering with SQL
- Real-time model predictions on live data
- Exploratory data analysis

**When CSV export is sufficient:**
- Small datasets (<100MB)
- Offline modeling
- Sharing datasets with stakeholders
- No database access
- Static academic datasets

### Real-World Example: Churn Prediction Model

**With PostgreSQL MCP (20 minutes):**

```typescript
// 1. Explore feature distributions
const featureStats = await mcp__postgres__query({
  sql: `
    SELECT
      'user_age' as feature,
      AVG(user_age) as mean,
      STDDEV(user_age) as stddev,
      MIN(user_age) as min,
      MAX(user_age) as max
    FROM customers
    UNION ALL
    SELECT
      'purchase_frequency',
      AVG(purchase_frequency),
      STDDEV(purchase_frequency),
      MIN(purchase_frequency),
      MAX(purchase_frequency)
    FROM customers
  `
});

console.log("✓ Feature statistics calculated");

// 2. Create training/test split in SQL
const trainTest = await mcp__postgres__query({
  sql: `
    WITH numbered AS (
      SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY RANDOM()) as rn,
        COUNT(*) OVER () as total
      FROM customers
      WHERE created_at < '2024-01-01'
    )
    SELECT
      *,
      CASE WHEN rn <= total * 0.8 THEN 'train' ELSE 'test' END as split
    FROM numbered
  `
});

const trainData = trainTest.rows.filter(r => r.split === 'train');
const testData = trainTest.rows.filter(r => r.split === 'test');

console.log(`✓ Split: ${trainData.length} train, ${testData.length} test`);

// 3. Feature engineering with SQL
// (order gaps are computed in a CTE first, because an aggregate cannot
// wrap a window function directly in the same query level)
const engineeredFeatures = await mcp__postgres__query({
  sql: `
    WITH order_gaps AS (
      SELECT
        customer_id,
        order_id,
        order_date,
        order_total,
        EXTRACT(DAY FROM order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date)) as days_since_prev
      FROM orders
      WHERE customer_id IN (SELECT customer_id FROM customers WHERE created_at < '2024-01-01')
    )
    SELECT
      customer_id,
      -- Recency, Frequency, Monetary features
      EXTRACT(DAY FROM NOW() - MAX(order_date)) as recency_days,
      COUNT(DISTINCT order_id) as frequency,
      SUM(order_total) as monetary,
      -- Behavioral features
      AVG(days_since_prev) as avg_days_between_orders,
      -- Engagement score
      COUNT(DISTINCT order_id) * AVG(order_total) / NULLIF(EXTRACT(DAY FROM NOW() - MIN(order_date)), 0) as engagement_score
    FROM order_gaps
    GROUP BY customer_id
  `
});

console.log(`✓ Engineered ${Object.keys(engineeredFeatures.rows[0]).length} features`);

// 4. Now train model with prepared data (Python/R/etc)
// Convert to appropriate format (pandas DataFrame, etc.)
```

**Without MCP (2 hours):**

1. Request data export from DBA (20 min wait)
2. Download CSV (5 min)
3. Load into pandas (10 min)
4. Clean and explore data (20 min)
5. Engineer features in Python (30 min)
6. Create train/test split (5 min)
7. Handle memory issues (20 min)
8. Ready to model (10 min)
### Statistical Validation with SQL

```typescript
// Validate statistical assumptions with SQL
async function validateAssumptions() {
  const hasPostgres = typeof mcp__postgres__query !== 'undefined';

  if (hasPostgres) {
    // 1. Check for normality (histogram bins)
    const distribution = await mcp__postgres__query({
      sql: `
        SELECT
          WIDTH_BUCKET(user_age, 18, 80, 10) as age_bin,
          COUNT(*) as frequency
        FROM customers
        GROUP BY age_bin
        ORDER BY age_bin
      `
    });

    console.log("✓ Age distribution (for normality check):");
    distribution.rows.forEach(row => {
      const bar = '█'.repeat(Math.ceil(row.frequency / 100));
      console.log(`  ${row.age_bin}: ${bar} (${row.frequency})`);
    });

    // 2. Check for outliers (IQR method)
    const outliers = await mcp__postgres__query({
      sql: `
        WITH stats AS (
          SELECT
            PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY user_income) as q1,
            PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY user_income) as q3
          FROM customers
        )
        SELECT
          COUNT(*) as outlier_count,
          MIN(user_income) as min_outlier,
          MAX(user_income) as max_outlier
        FROM customers, stats
        WHERE user_income < q1 - 1.5 * (q3 - q1)
           OR user_income > q3 + 1.5 * (q3 - q1)
      `
    });

    console.log(`✓ Found ${outliers.rows[0].outlier_count} income outliers`);

    // 3. Check for multicollinearity (correlation)
    const correlation = await mcp__postgres__query({
      sql: `
        SELECT
          CORR(user_age, user_income) as age_income_corr,
          CORR(user_age, purchase_frequency) as age_frequency_corr,
          CORR(user_income, purchase_frequency) as income_frequency_corr
        FROM customers
      `
    });

    console.log("✓ Feature correlations:");
    console.log(`  age-income: ${correlation.rows[0].age_income_corr.toFixed(3)}`);
    console.log(`  age-frequency: ${correlation.rows[0].age_frequency_corr.toFixed(3)}`);

    return { distribution, outliers, correlation };
  }
}
```

### PostgreSQL MCP Installation

```bash
# Install PostgreSQL MCP
npm install -g @modelcontextprotocol/server-postgres
```

Configure for data science workflows:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres"],
      "env": {
        "POSTGRES_CONNECTION_STRING": "postgresql://data_scientist:pass@analytics.company.com:5432/warehouse"
      }
    }
  }
}
```

Once installed, all agents reading this skill can access datasets directly from databases.

### Modeling Workflow with MCP

1. **Explore Data**: Query for distributions, correlations
2. **Sample**: Stratified/random sampling in SQL
3. **Feature Engineering**: Use SQL window functions
4. **Quality Check**: Validate with SQL aggregations
5. **Train/Test Split**: Random partitioning in SQL
6. **Load for Modeling**: Stream to Python/R/Julia (see the sketch after this list)
7. **Validate Assumptions**: Statistical tests in SQL
8. **Retrain**: Re-query fresh data easily
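For step 6, the query results ultimately land in a modeling environment; a minimal Python sketch using pandas and SQLAlchemy (the connection string, the `customer_features` table, and the `split` column are illustrative assumptions, not part of the MCP server):

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; match it to your own warehouse
engine = create_engine(
    "postgresql://data_scientist:pass@analytics.company.com:5432/warehouse"
)

# Stream the engineered features in manageable chunks instead of one giant read
chunks = pd.read_sql(
    "SELECT * FROM customer_features WHERE split = 'train'",
    engine,
    chunksize=50_000,
)
train_df = pd.concat(chunks, ignore_index=True)
print(f"Loaded {len(train_df)} training rows with {train_df.shape[1]} columns")
```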
---

**Version**: 1.0
**Last Updated**: January 2025
**MCP Enhancement**: PostgreSQL for direct dataset access
**Coverage**: Comprehensive statistical inference methods
**Success Rate**: 95% when all principles followed rigorously