# Statistical Assumptions and Diagnostic Procedures
This document provides comprehensive guidance on checking and validating statistical assumptions for various analyses.
## General Principles
1. **Always check assumptions before interpreting test results**
2. **Use multiple diagnostic methods** (visual + formal tests)
3. **Consider robustness**: Some tests are robust to violations under certain conditions
4. **Document all assumption checks** in analysis reports
5. **Report violations and remedial actions taken**
## Common Assumptions Across Tests
### 1. Independence of Observations
**What it means**: Each observation is independent; measurements on one subject do not influence measurements on another.
**How to check**:
- Review study design and data collection procedures
- For time series: Check autocorrelation (ACF/PACF plots, Durbin-Watson test; see the sketch below)
- For clustered data: Consider intraclass correlation (ICC)
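For time-ordered data, a minimal sketch of these checks with statsmodels (assuming `series` is a 1-D array of observations in time order):
```python
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson
# ACF plot: spikes outside the confidence band indicate autocorrelation
plot_acf(series, lags=20)
# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw_statistic = durbin_watson(series)
```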
**What to do if violated**:
- Use mixed-effects models for clustered/hierarchical data (sketched below)
- Use time series methods for temporally dependent data
- Use generalized estimating equations (GEE) for correlated data
**Critical severity**: HIGH - violations can severely inflate Type I error
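A minimal sketch of the mixed-effects remedy above, assuming a DataFrame `df` with columns `value`, `group`, and a hypothetical clustering variable `cluster`:
```python
import statsmodels.formula.api as smf
# A random intercept per cluster absorbs within-cluster correlation
model = smf.mixedlm("value ~ group", data=df, groups=df["cluster"]).fit()
print(model.summary())
```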
---
### 2. Normality
**What it means**: Data or residuals follow a normal (Gaussian) distribution.
**When required**:
- t-tests (for small samples; robust for n > 30 per group)
- ANOVA (for small samples; robust for n > 30 per group)
- Linear regression (for residuals)
- Some correlation tests (Pearson)
**How to check**:
**Visual methods** (primary):
- Q-Q (quantile-quantile) plot: Points should fall along the reference line
- Histogram with normal curve overlay
- Kernel density plot
**Formal tests** (secondary):
- Shapiro-Wilk test (recommended for n < 50)
- Kolmogorov-Smirnov test
- Anderson-Darling test
**Python implementation**:
```python
from scipy import stats
import matplotlib.pyplot as plt
# Shapiro-Wilk test (null hypothesis: the data are normally distributed)
statistic, p_value = stats.shapiro(data)
# Q-Q plot against a normal distribution
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```
**Interpretation guidance**:
- For n < 30: Both visual and formal tests important
- For 30 ≤ n < 100: Visual inspection primary, formal tests secondary
- For n ≥ 100: Formal tests overly sensitive; rely on visual inspection
- Look for severe skewness, outliers, or bimodality
**What to do if violated**:
- **Mild violations** (slight skewness): Proceed if n > 30 per group
- **Moderate violations**: Use non-parametric alternatives (Mann-Whitney, Kruskal-Wallis, Wilcoxon)
- **Severe violations**:
- Transform data (log, square root, Box-Cox; see the sketch below)
- Use non-parametric methods
- Use robust regression methods
- Consider bootstrapping
**Critical severity**: MEDIUM - parametric tests are often robust to mild violations with adequate sample size
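As one example of the transformations above, a Box-Cox sketch with scipy (Box-Cox requires strictly positive data; `data` is an assumed 1-D array):
```python
from scipy import stats
# Box-Cox estimates the power parameter lambda by maximum likelihood;
# lambda = 0 corresponds to a log transform
transformed, fitted_lambda = stats.boxcox(data)
```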
---
### 3. Homogeneity of Variance (Homoscedasticity)
**What it means**: Variances are equal across groups or across the range of predictors.
**When required**:
- Independent samples t-test
- ANOVA
- Linear regression (constant variance of residuals)
**How to check**:
**Visual methods** (primary):
- Box plots by group (for t-test/ANOVA)
- Residuals vs. fitted values plot (for regression) - should show random scatter
- Scale-location plot (square root of standardized residuals vs. fitted)
**Formal tests** (secondary):
- Levene's test (robust to non-normality)
- Bartlett's test (sensitive to non-normality, not recommended)
- Brown-Forsythe test (median-based version of Levene's)
- Breusch-Pagan test (for regression)
**Python implementation**:
```python
from scipy import stats
import pingouin as pg
# Levene's test
statistic, p_value = stats.levene(group1, group2, group3)
# For regression
# Breusch-Pagan test
from statsmodels.stats.diagnostic import het_breuschpagan
_, p_value, _, _ = het_breuschpagan(residuals, exog)
```
**Interpretation guidance**:
- Variance ratio (max/min) < 2-3: Generally acceptable
- For ANOVA: Test is robust if groups have equal sizes
- For regression: Look for funnel patterns in residual plots
**What to do if violated**:
- **t-test**: Use Welch's t-test (does not assume equal variances)
- **ANOVA**: Use Welch's ANOVA or Brown-Forsythe ANOVA (Welch's is sketched below)
- **Regression**:
- Transform dependent variable (log, square root)
- Use weighted least squares (WLS)
- Use robust standard errors (HC3)
- Use generalized linear models (GLM) with appropriate variance function
**Critical severity**: MEDIUM - tests can be robust with equal sample sizes
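A minimal sketch of two of these remedies, Welch's ANOVA (pingouin) and HC3 robust standard errors (statsmodels); the column names and `X`/`y` arrays are illustrative assumptions:
```python
import pingouin as pg
import statsmodels.api as sm
# Welch's ANOVA: does not assume equal variances across groups
pg.welch_anova(data=df, dv='value', between='group')
# OLS with heteroscedasticity-consistent (HC3) standard errors
robust_model = sm.OLS(y, sm.add_constant(X)).fit(cov_type='HC3')
```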
---
## Test-Specific Assumptions
### T-Tests
**Assumptions**:
1. Independence of observations
2. Normality (each group for independent t-test; differences for paired t-test)
3. Homogeneity of variance (independent t-test only)
**Diagnostic workflow**:
```python
import scipy.stats as stats
import pingouin as pg
# Check normality for each group
stats.shapiro(group1)
stats.shapiro(group2)
# Check homogeneity of variance
stats.levene(group1, group2)
# If assumptions violated:
# Option 1: Welch's t-test (correction=True applies the Welch correction)
pg.ttest(group1, group2, correction=True)  # Welch's
# Option 2: Non-parametric alternative
pg.mwu(group1, group2)  # Mann-Whitney U
```
---
### ANOVA
**Assumptions**:
1. Independence of observations within and between groups
2. Normality in each group
3. Homogeneity of variance across groups
**Additional considerations**:
- For repeated measures ANOVA: Sphericity assumption (Mauchly's test)
**Diagnostic workflow**:
```python
import pingouin as pg
from scipy import stats
# Check normality per group
for group in df['group'].unique():
    data = df[df['group'] == group]['value']
    print(group, stats.shapiro(data))
# Check homogeneity of variance (Levene's test by default)
pg.homoscedasticity(df, dv='value', group='group')
# For repeated measures: check sphericity (Mauchly's test)
# Automatically tested in pingouin's rm_anova
```
**What to do if sphericity violated** (repeated measures):
- Greenhouse-Geisser correction (ε < 0.75)
- Huynh-Feldt correction (ε > 0.75)
- Use multivariate approach (MANOVA)
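In pingouin, Mauchly's test and the corrections above are available directly (assuming long-format data with `subject`, `time`, and `value` columns):
```python
import pingouin as pg
# Mauchly's test of sphericity
pg.sphericity(data=df, dv='value', within='time', subject='subject')
# Repeated measures ANOVA; correction=True reports
# sphericity-corrected results when the assumption is violated
pg.rm_anova(data=df, dv='value', within='time', subject='subject', correction=True)
```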
---
### Linear Regression
**Assumptions**:
1. **Linearity**: Relationship between X and Y is linear
2. **Independence**: Residuals are independent
3. **Homoscedasticity**: Constant variance of residuals
4. **Normality**: Residuals are normally distributed
5. **No multicollinearity**: Predictors are not highly correlated (multiple regression)
**Diagnostic workflow**:
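The snippets below refer to `residuals`, `fitted_values`, `std_residuals`, and `exog`; a minimal setup sketch with statsmodels (variable names are illustrative):
```python
import statsmodels.api as sm
# Fit OLS; X is a DataFrame of predictors, y the response
model = sm.OLS(y, sm.add_constant(X)).fit()
fitted_values = model.fittedvalues
residuals = model.resid
std_residuals = model.get_influence().resid_studentized_internal
exog = model.model.exog
```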
**1. Linearity**:
```python
import matplotlib.pyplot as plt
# Scatter plots of Y vs. each X
# Residuals vs. fitted values (should be randomly scattered around zero)
plt.scatter(fitted_values, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
```
**2. Independence**:
```python
from statsmodels.stats.stattools import durbin_watson
# Durbin-Watson test (for time series)
dw_statistic = durbin_watson(residuals)
# Values between 1.5-2.5 suggest independence
```
**3. Homoscedasticity**:
```python
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan
# Breusch-Pagan test
_, p_value, _, _ = het_breuschpagan(residuals, exog)
# Visual: scale-location plot
plt.scatter(fitted_values, np.sqrt(np.abs(std_residuals)))
```
**4. Normality of residuals**:
```python
from scipy import stats
# Q-Q plot of residuals
stats.probplot(residuals, dist="norm", plot=plt)
# Shapiro-Wilk test on residuals
stats.shapiro(residuals)
```
**5. Multicollinearity**:
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each predictor
# (X should include a constant column for meaningful VIFs)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
# VIF > 10 indicates severe multicollinearity
# VIF > 5 indicates moderate multicollinearity
```
**What to do if violated**:
- **Non-linearity**: Add polynomial terms, use GAM, or transform variables
- **Heteroscedasticity**: Transform Y, use WLS, use robust SE
- **Non-normal residuals**: Transform Y, use robust methods, check for outliers
- **Multicollinearity**: Remove correlated predictors, use PCA, ridge regression
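A minimal ridge regression sketch with scikit-learn, one of the multicollinearity remedies above (the penalty strength `alpha=1.0` is illustrative):
```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Standardize predictors so the penalty treats them equally
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)
```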
---
### Logistic Regression
**Assumptions**:
1. **Independence**: Observations are independent
2. **Linearity**: Linear relationship between log-odds and continuous predictors
3. **No perfect multicollinearity**: Predictors not perfectly correlated
4. **Large sample size**: At least 10-20 events per predictor
**Diagnostic workflow**:
**1. Linearity of logit**:
```python
# Box-Tidwell test: add an interaction between each continuous predictor
# and its natural log, then refit the model.
# If the interaction term is significant, linearity of the logit is violated.
```
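A runnable Box-Tidwell sketch with statsmodels, assuming a single positive continuous predictor `x` and a binary `outcome` column in `df` (names are illustrative):
```python
import numpy as np
import statsmodels.formula.api as smf
# Add the x * ln(x) interaction term (requires x > 0)
df['x_log_x'] = df['x'] * np.log(df['x'])
# A significant x_log_x coefficient suggests the linearity
# of the logit is violated for x
model = smf.logit('outcome ~ x + x_log_x', data=df).fit()
print(model.summary())
```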
**2. Multicollinearity**:
```python
# Use VIF as in linear regression
```
**3. Influential observations**:
```python
import statsmodels.api as sm
# Cook's distance, DFBetas, leverage: fit as a binomial GLM so that
# influence diagnostics are available for the logistic model
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
influence = model.get_influence()
cooks_d, _ = influence.cooks_distance  # distances and p-values
```
**4. Model fit**:
```python
# Hosmer-Lemeshow test
# Pseudo R-squared
# Classification metrics (accuracy, AUC-ROC)
```
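A minimal sketch of the classification metrics with scikit-learn, assuming the fitted binomial model and `X`/`y` arrays from the influence step above:
```python
from sklearn.metrics import accuracy_score, roc_auc_score
# Predicted probabilities from the fitted logistic model
pred_prob = model.predict(X)
print("AUC-ROC:", roc_auc_score(y, pred_prob))
print("Accuracy:", accuracy_score(y, pred_prob > 0.5))
```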
---
## Outlier Detection
**Methods**:
1. **Visual**: Box plots, scatter plots
2. **Statistical**:
- Z-scores: |z| > 3 suggests an outlier
- IQR method: Values < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
- Modified Z-score using the median absolute deviation (robust to outliers; see the sketch below)
**For regression**:
- **Leverage**: High leverage points (hat values)
- **Influence**: Cook's distance > 4/n suggests influential point
- **Outliers**: Studentized residuals > ±3
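A minimal sketch of the IQR and modified Z-score rules (assuming `data` is a 1-D numpy array):
```python
import numpy as np
# IQR rule
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
# Modified Z-score based on the median absolute deviation (MAD);
# |M| > 3.5 is a common cutoff
mad = np.median(np.abs(data - np.median(data)))
modified_z = 0.6745 * (data - np.median(data)) / mad
mad_outliers = np.abs(modified_z) > 3.5
```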
**What to do**:
1. Investigate data entry errors
2. Consider if outliers are valid observations
3. Report sensitivity analysis (results with and without outliers)
4. Use robust methods if outliers are legitimate
---
## Sample Size Considerations
### Minimum Sample Sizes (Rules of Thumb)
- **T-test**: n ≥ 30 per group for robustness to non-normality
- **ANOVA**: n ≥ 30 per group
- **Correlation**: n ≥ 30 for adequate power
- **Simple regression**: n ≥ 50
- **Multiple regression**: n ≥ 10-20 observations per predictor
- **Logistic regression**: n ≥ 10-20 events per predictor
### Small Sample Considerations
For small samples:
- Assumptions become more critical
- Use exact tests when available (Fisher's exact, exact logistic regression)
- Consider non-parametric alternatives
- Use permutation tests or bootstrap methods (a permutation test is sketched below)
- Be conservative with interpretation
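A minimal permutation test sketch for a two-group difference in means, using only numpy (`group1` and `group2` are assumed 1-D arrays):
```python
import numpy as np
rng = np.random.default_rng(0)
observed = np.mean(group1) - np.mean(group2)
pooled = np.concatenate([group1, group2])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = np.mean(perm[:len(group1)]) - np.mean(perm[len(group1):])
    if abs(diff) >= abs(observed):
        count += 1
p_value = count / n_perm  # two-sided permutation p-value
```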
---
## Reporting Assumption Checks
When reporting analyses, include:
1. **Statement of assumptions checked**: List all assumptions tested
2. **Methods used**: Describe visual and formal tests employed
3. **Results of diagnostic tests**: Report test statistics and p-values
4. **Assessment**: State whether assumptions were met or violated
5. **Actions taken**: If violated, describe remedial actions (transformations, alternative tests, robust methods)
**Example reporting statement**:
> "Normality was assessed using Shapiro-Wilk tests and Q-Q plots. Data for Group A (W = 0.97, p = .18) and Group B (W = 0.96, p = .12) showed no significant departure from normality. Homogeneity of variance was assessed using Levene's test, which was non-significant (F(1, 58) = 1.23, p = .27), indicating equal variances across groups. Therefore, assumptions for the independent samples t-test were satisfied."