---
name: statistical-analysis
description: "Statistical analysis toolkit. Hypothesis tests (t-test, ANOVA, chi-square), regression, correlation, Bayesian stats, power analysis, assumption checks, APA reporting, for academic research."
---

# Statistical Analysis

## Overview

Statistical analysis is a systematic process for testing hypotheses and quantifying relationships. Conduct hypothesis tests (t-test, ANOVA, chi-square), regression, correlation, and Bayesian analyses with assumption checks and APA reporting. Apply this skill for academic research.

## When to Use This Skill

This skill should be used when:

- Conducting statistical hypothesis tests (t-tests, ANOVA, chi-square)
- Performing regression or correlation analyses
- Running Bayesian statistical analyses
- Checking statistical assumptions and diagnostics
- Calculating effect sizes and conducting power analyses
- Reporting statistical results in APA format
- Analyzing experimental or observational data for research

---

## Core Capabilities

### 1. Test Selection and Planning

- Choose appropriate statistical tests based on research questions and data characteristics
- Conduct a priori power analyses to determine required sample sizes
- Plan analysis strategies, including multiple-comparison corrections

### 2. Assumption Checking

- Automatically verify all relevant assumptions before running tests
- Provide diagnostic visualizations (Q-Q plots, residual plots, box plots)
- Recommend remedial actions when assumptions are violated

### 3. Statistical Testing

- Hypothesis testing: t-tests, ANOVA, chi-square, non-parametric alternatives
- Regression: linear, multiple, logistic, with diagnostics
- Correlations: Pearson, Spearman, with confidence intervals
- Bayesian alternatives: Bayesian t-tests, ANOVA, and regression with Bayes Factors

### 4. Effect Sizes and Interpretation

- Calculate and interpret appropriate effect sizes for all analyses
- Provide confidence intervals for effect estimates
- Distinguish statistical from practical significance

### 5. Professional Reporting

- Generate APA-style statistical reports
- Create publication-ready figures and tables
- Provide complete interpretation with all required statistics

---

## Workflow Decision Tree

Use this decision tree to determine your analysis path:

```
START
│
├─ Need to SELECT a statistical test?
│   └─ YES → See "Test Selection Guide"
│   └─ NO → Continue
│
├─ Ready to check ASSUMPTIONS?
│   └─ YES → See "Assumption Checking"
│   └─ NO → Continue
│
├─ Ready to run ANALYSIS?
│   └─ YES → See "Running Statistical Tests"
│   └─ NO → Continue
│
└─ Need to REPORT results?
    └─ YES → See "Reporting Results"
```

---

## Test Selection Guide

### Quick Reference: Choosing the Right Test

Use `references/test_selection_guide.md` for comprehensive guidance. Quick reference:

**Comparing Two Groups:**
- Independent, continuous, normal → Independent t-test
- Independent, continuous, non-normal → Mann-Whitney U test
- Paired, continuous, normal → Paired t-test
- Paired, continuous, non-normal → Wilcoxon signed-rank test
- Binary outcome → Chi-square or Fisher's exact test

**Comparing 3+ Groups:**
- Independent, continuous, normal → One-way ANOVA
- Independent, continuous, non-normal → Kruskal-Wallis test
- Paired, continuous, normal → Repeated measures ANOVA
- Paired, continuous, non-normal → Friedman test

**Relationships:**
- Two continuous variables → Pearson correlation (normal) or Spearman correlation (non-normal)
- Continuous outcome with predictor(s) → Linear regression
- Binary outcome with predictor(s) → Logistic regression

**Bayesian Alternatives:**
All tests have Bayesian versions that provide:
- Direct probability statements about hypotheses
- Bayes Factors quantifying evidence
- The ability to support the null hypothesis
- See `references/bayesian_statistics.md`

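The two-group branch of this table can be sketched as a small helper function. This is illustrative only, not part of the bundled scripts; the function name and argument names are invented here:

```python
def choose_two_group_test(paired: bool, normal: bool, binary_outcome: bool = False) -> str:
    """Map the two-group quick-reference rules above to a test name.

    Hypothetical helper mirroring the bullet list; not part of the toolkit.
    """
    if binary_outcome:
        return "Chi-square or Fisher's exact test"
    if paired:
        return "Paired t-test" if normal else "Wilcoxon signed-rank test"
    return "Independent t-test" if normal else "Mann-Whitney U test"
```

The same pattern extends naturally to the 3+ group branch (ANOVA vs. Kruskal-Wallis, repeated measures ANOVA vs. Friedman).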
---

## Assumption Checking

### Systematic Assumption Verification

**ALWAYS check assumptions before interpreting test results.**

Use the provided `scripts/assumption_checks.py` module for automated checking:

```python
from scripts.assumption_checks import comprehensive_assumption_check

# Comprehensive check with visualizations
results = comprehensive_assumption_check(
    data=df,
    value_col='score',
    group_col='group',  # Optional: for group comparisons
    alpha=0.05
)
```

This performs:

1. **Outlier detection** (IQR and z-score methods)
2. **Normality testing** (Shapiro-Wilk test + Q-Q plots)
3. **Homogeneity of variance** (Levene's test + box plots)
4. **Interpretation and recommendations**

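These checks rest on standard scipy routines. A minimal, self-contained sketch of the same three checks, independent of the bundled script and run on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(50, 5, size=40)
group_b = rng.normal(52, 5, size=40)

# 1. Outlier detection via the 1.5·IQR rule
q1, q3 = np.percentile(group_a, [25, 75])
iqr = q3 - q1
outliers = group_a[(group_a < q1 - 1.5 * iqr) | (group_a > q3 + 1.5 * iqr)]

# 2. Normality: Shapiro-Wilk test (run per group in practice)
w_a, p_a = stats.shapiro(group_a)

# 3. Homogeneity of variance: Levene's test across groups
levene_stat, levene_p = stats.levene(group_a, group_b)

print(f"Outliers: {outliers.size}, Shapiro p = {p_a:.3f}, Levene p = {levene_p:.3f}")
```

The bundled `comprehensive_assumption_check()` adds the Q-Q plots, box plots, and interpretation on top of calls like these.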
### Individual Assumption Checks

For targeted checks, use individual functions:

```python
from scripts.assumption_checks import (
    check_normality,
    check_normality_per_group,
    check_homogeneity_of_variance,
    check_linearity,
    detect_outliers
)

# Example: Check normality with visualization
result = check_normality(
    data=df['score'],
    name='Test Score',
    alpha=0.05,
    plot=True
)
print(result['interpretation'])
print(result['recommendation'])
```

### What to Do When Assumptions Are Violated

**Normality violated:**
- Mild violation + n > 30 per group → Proceed with parametric test (robust)
- Moderate violation → Use a non-parametric alternative
- Severe violation → Transform the data or use a non-parametric test

**Homogeneity of variance violated:**
- For a t-test → Use Welch's t-test
- For ANOVA → Use Welch's ANOVA or Brown-Forsythe ANOVA
- For regression → Use robust standard errors or weighted least squares

**Linearity violated (regression):**
- Add polynomial terms
- Transform variables
- Use non-linear models or a GAM

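Two of the remedies above are one-line changes in scipy: Welch's t-test (the variance-robust fallback) via `equal_var=False`, and the Mann-Whitney U test when normality is also in doubt. A sketch on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(50, 5, size=30)
group_b = rng.normal(55, 12, size=30)   # deliberately unequal variances

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Non-parametric alternative when normality is also violated
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"Welch t = {t_welch:.2f}, p = {p_welch:.3f}")
print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_mw:.3f}")
```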
See `references/assumptions_and_diagnostics.md` for comprehensive guidance.

---

## Running Statistical Tests

### Python Libraries

Primary libraries for statistical analysis:

- **scipy.stats**: Core statistical tests
- **statsmodels**: Advanced regression and diagnostics
- **pingouin**: User-friendly statistical testing with effect sizes
- **pymc**: Bayesian statistical modeling
- **arviz**: Bayesian visualization and diagnostics

### Example Analyses

#### T-Test with Complete Reporting

```python
import pingouin as pg

# Run independent t-test ('auto' applies Welch's correction when variances differ)
result = pg.ttest(group_a, group_b, correction='auto')

# Extract results
t_stat = result['T'].values[0]
df = result['dof'].values[0]
p_value = result['p-val'].values[0]
cohens_d = result['cohen-d'].values[0]
# Note: CI95% is the confidence interval for the mean difference, not for d
ci_lower, ci_upper = result['CI95%'].values[0]

# Report
print(f"t({df:.0f}) = {t_stat:.2f}, p = {p_value:.3f}")
print(f"Cohen's d = {cohens_d:.2f}, mean difference 95% CI [{ci_lower:.2f}, {ci_upper:.2f}]")
```

#### ANOVA with Post-Hoc Tests

```python
import pingouin as pg

# One-way ANOVA
aov = pg.anova(dv='score', between='group', data=df, detailed=True)
print(aov)

# If significant, conduct post-hoc tests
if aov['p-unc'].values[0] < 0.05:
    posthoc = pg.pairwise_tukey(dv='score', between='group', data=df)
    print(posthoc)

# Effect size: partial eta-squared
eta_p2 = aov['np2'].values[0]
print(f"Partial η² = {eta_p2:.3f}")
```

#### Linear Regression with Diagnostics

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fit model
X = sm.add_constant(X_predictors)  # Add intercept column
model = sm.OLS(y, X).fit()

# Summary
print(model.summary())

# Check multicollinearity (VIF)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)

# Check assumptions
residuals = model.resid
fitted = model.fittedvalues

# Residual plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs fitted
axes[0, 0].scatter(fitted, residuals, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')

# Scale-Location
axes[1, 0].scatter(fitted, np.sqrt(np.abs(residuals / residuals.std())), alpha=0.6)
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized residuals|')
axes[1, 0].set_title('Scale-Location')

# Residuals histogram
axes[1, 1].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()
```

#### Bayesian T-Test

```python
import numpy as np
import pymc as pm
import arviz as az

with pm.Model() as model:
    # Priors
    mu1 = pm.Normal('mu_group1', mu=0, sigma=10)
    mu2 = pm.Normal('mu_group2', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # Likelihood
    y1 = pm.Normal('y1', mu=mu1, sigma=sigma, observed=group_a)
    y2 = pm.Normal('y2', mu=mu2, sigma=sigma, observed=group_b)

    # Derived quantity
    diff = pm.Deterministic('difference', mu1 - mu2)

    # Sample
    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# Summarize
print(az.summary(trace, var_names=['difference']))

# Probability that group1 > group2
prob_greater = np.mean(trace.posterior['difference'].values > 0)
print(f"P(μ₁ > μ₂ | data) = {prob_greater:.3f}")

# Plot posterior
az.plot_posterior(trace, var_names=['difference'], ref_val=0)
```

---

## Effect Sizes

### Always Calculate Effect Sizes

**Effect sizes quantify magnitude, while p-values only indicate the existence of an effect.**

See `references/effect_sizes_and_power.md` for comprehensive guidance.

### Quick Reference: Common Effect Sizes

| Test | Effect Size | Small | Medium | Large |
|------|-------------|-------|--------|-------|
| T-test | Cohen's d | 0.20 | 0.50 | 0.80 |
| ANOVA | η²_p | 0.01 | 0.06 | 0.14 |
| Correlation | r | 0.10 | 0.30 | 0.50 |
| Regression | R² | 0.02 | 0.13 | 0.26 |
| Chi-square | Cramér's V | 0.07 | 0.21 | 0.35 |

**Important**: Benchmarks are guidelines. Context matters!

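Cramér's V is the one entry in this table not computed in the examples below. It follows directly from the chi-square statistic; a sketch on a hypothetical 2×3 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2×3 contingency table of counts
observed = np.array([[30, 20, 10],
                     [10, 25, 25]])

chi2, p, dof, expected = chi2_contingency(observed)

# Cramér's V = sqrt(chi² / (n · (min(rows, cols) − 1)))
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(f"χ²({dof}) = {chi2:.2f}, p = {p:.3f}, Cramér's V = {cramers_v:.2f}")
```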
### Calculating Effect Sizes

Most effect sizes are calculated automatically by pingouin:

```python
# T-test returns Cohen's d
result = pg.ttest(x, y)
d = result['cohen-d'].values[0]

# ANOVA returns partial eta-squared
aov = pg.anova(dv='score', between='group', data=df)
eta_p2 = aov['np2'].values[0]

# Correlation: r is already an effect size
corr = pg.corr(x, y)
r = corr['r'].values[0]
```

### Confidence Intervals for Effect Sizes

Always report CIs to show precision:

```python
from pingouin import compute_effsize_from_t, compute_esci

# For a t-test: compute d from the t statistic, then its CI separately
# (compute_effsize_from_t returns only the point estimate)
d = compute_effsize_from_t(
    t_statistic,
    nx=len(group1),
    ny=len(group2),
    eftype='cohen'
)
ci = compute_esci(stat=d, nx=len(group1), ny=len(group2), eftype='cohen')
print(f"d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

---

## Power Analysis

### A Priori Power Analysis (Study Planning)

Determine the required sample size before data collection:

```python
from statsmodels.stats.power import (
    tt_ind_solve_power,
    FTestAnovaPower
)

# T-test: what n is needed to detect d = 0.5?
n_required = tt_ind_solve_power(
    effect_size=0.5,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Required n per group: {n_required:.0f}")

# ANOVA: what n is needed to detect f = 0.25?
# Note: FTestAnovaPower solves for the TOTAL sample size,
# and the number-of-groups argument is k_groups.
anova_power = FTestAnovaPower()
n_total = anova_power.solve_power(
    effect_size=0.25,
    k_groups=3,
    alpha=0.05,
    power=0.80
)
print(f"Required n per group: {n_total / 3:.0f} (total N = {n_total:.0f})")
```

### Sensitivity Analysis (Post-Study)

Determine what effect size the study could detect:

```python
# With n = 50 per group, what effect could we detect?
detectable_d = tt_ind_solve_power(
    effect_size=None,  # Solve for this
    nobs1=50,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)
print(f"Study could detect d ≥ {detectable_d:.2f}")
```

**Note**: Post-hoc power analysis (computing power for the observed effect after the study) is generally not recommended. Use a sensitivity analysis instead.

See `references/effect_sizes_and_power.md` for detailed guidance.

---

## Reporting Results

### APA Style Statistical Reporting

Follow the guidelines in `references/reporting_standards.md`.

### Essential Reporting Elements

1. **Descriptive statistics**: M, SD, n for all groups/variables
2. **Test statistics**: Test name, statistic, df, exact p-value
3. **Effect sizes**: With confidence intervals
4. **Assumption checks**: Which tests were run, their results, and actions taken
5. **All planned analyses**: Including non-significant findings

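APA p-value formatting (no leading zero, exact values to three decimals, `p < .001` below that threshold) is easy to get wrong by hand. A small helper, illustrative rather than part of the toolkit:

```python
def format_p_apa(p: float) -> str:
    """Format a p-value in APA style: 'p < .001' or e.g. 'p = .042' (no leading zero)."""
    if p < 0.001:
        return "p < .001"
    # Drop the leading zero per APA convention
    return f"p = {p:.3f}".replace("0.", ".", 1)
```

The same convention (no leading zero for quantities bounded by 1) applies to reported r, R², and η²_p values.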
### Example Report Templates

#### Independent T-Test

```
Group A (n = 48, M = 75.2, SD = 8.5) scored significantly higher than
Group B (n = 52, M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77,
95% CI [0.36, 1.18], two-tailed. Assumptions of normality (Shapiro-Wilk:
Group A W = 0.97, p = .18; Group B W = 0.96, p = .12) and homogeneity
of variance (Levene's F(1, 98) = 1.23, p = .27) were satisfied.
```

#### One-Way ANOVA

```
A one-way ANOVA revealed a significant main effect of treatment condition
on test scores, F(2, 147) = 8.45, p < .001, η²_p = .10. Post hoc
comparisons using Tukey's HSD indicated that Condition A (M = 78.2,
SD = 7.3) scored significantly higher than Condition B (M = 71.5,
SD = 8.1, p = .002, d = 0.87) and Condition C (M = 70.1, SD = 7.9,
p < .001, d = 1.07). Conditions B and C did not differ significantly
(p = .52, d = 0.18).
```

#### Multiple Regression

```
Multiple linear regression was conducted to predict exam scores from
study hours, prior GPA, and attendance. The overall model was significant,
F(3, 146) = 45.2, p < .001, R² = .48, adjusted R² = .47. Study hours
(B = 1.80, SE = 0.31, β = .35, t = 5.78, p < .001, 95% CI [1.18, 2.42])
and prior GPA (B = 8.52, SE = 1.95, β = .28, t = 4.37, p < .001,
95% CI [4.66, 12.38]) were significant predictors, while attendance was
not (B = 0.15, SE = 0.12, β = .08, t = 1.25, p = .21, 95% CI [-0.09, 0.39]).
Multicollinearity was not a concern (all VIF < 1.5).
```

#### Bayesian Analysis

```
A Bayesian independent samples t-test was conducted using weakly
informative priors (Normal(0, 1) for the mean difference). The posterior
distribution indicated that Group A scored higher than Group B
(M_diff = 6.8, 95% credible interval [3.2, 10.4]). The Bayes Factor
BF₁₀ = 45.3 provided very strong evidence for a difference between
groups, with a 99.8% posterior probability that Group A's mean exceeded
Group B's mean. Convergence diagnostics were satisfactory (all R̂ < 1.01,
ESS > 1000).
```

---

## Bayesian Statistics

### When to Use Bayesian Methods

Consider Bayesian approaches when:

- You have prior information to incorporate
- You want direct probability statements about hypotheses
- The sample size is small or you are planning sequential data collection
- You need to quantify evidence for the null hypothesis
- The model is complex (hierarchical, missing data)

See `references/bayesian_statistics.md` for comprehensive guidance on:

- Bayes' theorem and interpretation
- Prior specification (informative, weakly informative, non-informative)
- Bayesian hypothesis testing with Bayes Factors
- Credible intervals vs. confidence intervals
- Bayesian t-tests, ANOVA, regression, and hierarchical models
- Model convergence checking and posterior predictive checks

### Key Advantages

1. **Intuitive interpretation**: "Given the data, there is a 95% probability the parameter is in this interval"
2. **Evidence for the null**: Can quantify support for no effect
3. **Flexible**: Optional stopping does not invalidate inference, so data can be analyzed as it arrives
4. **Uncertainty quantification**: Full posterior distribution

---

## Resources

This skill includes comprehensive reference materials:

### References Directory

- **test_selection_guide.md**: Decision tree for choosing appropriate statistical tests
- **assumptions_and_diagnostics.md**: Detailed guidance on checking and handling assumption violations
- **effect_sizes_and_power.md**: Calculating, interpreting, and reporting effect sizes; conducting power analyses
- **bayesian_statistics.md**: Complete guide to Bayesian analysis methods
- **reporting_standards.md**: APA-style reporting guidelines with examples

### Scripts Directory

- **assumption_checks.py**: Automated assumption checking with visualizations
  - `comprehensive_assumption_check()`: Complete workflow
  - `check_normality()`: Normality testing with Q-Q plots
  - `check_homogeneity_of_variance()`: Levene's test with box plots
  - `check_linearity()`: Regression linearity checks
  - `detect_outliers()`: IQR and z-score outlier detection

---

## Best Practices

1. **Pre-register analyses** when possible to distinguish confirmatory from exploratory work
2. **Always check assumptions** before interpreting results
3. **Report effect sizes** with confidence intervals
4. **Report all planned analyses**, including non-significant results
5. **Distinguish statistical from practical significance**
6. **Visualize data** before and after analysis
7. **Check diagnostics** for regression/ANOVA (residual plots, VIF, etc.)
8. **Conduct sensitivity analyses** to assess robustness
9. **Share data and code** for reproducibility
10. **Be transparent** about violations, transformations, and analytic decisions

---

## Common Pitfalls to Avoid

1. **P-hacking**: Don't re-run the analysis in multiple ways until something turns significant
2. **HARKing**: Don't present exploratory findings as confirmatory
3. **Ignoring assumptions**: Check them and report violations
4. **Confusing significance with importance**: p < .05 ≠ meaningful effect
5. **Not reporting effect sizes**: They are essential for interpretation
6. **Cherry-picking results**: Report all planned analyses
7. **Misinterpreting p-values**: A p-value is NOT the probability that the hypothesis is true
8. **Uncorrected multiple comparisons**: Correct for family-wise error when appropriate
9. **Ignoring missing data**: Understand the mechanism (MCAR, MAR, MNAR)
10. **Overinterpreting non-significant results**: Absence of evidence ≠ evidence of absence

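For pitfall 8, statsmodels provides the standard family-wise corrections. A sketch with hypothetical p-values from five planned comparisons:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a family of five planned comparisons
p_values = [0.001, 0.012, 0.034, 0.048, 0.210]

# Holm's step-down method controls family-wise error with more power than Bonferroni
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"p = {p:.3f} → adjusted p = {p_adj:.3f} ({'significant' if sig else 'n.s.'})")
```

Note how comparisons that are individually significant at .05 (e.g. p = .034) can fail to survive correction.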
---

## Getting Started Checklist

When beginning a statistical analysis:

- [ ] Define the research question and hypotheses
- [ ] Determine the appropriate statistical test (use test_selection_guide.md)
- [ ] Conduct a power analysis to determine sample size
- [ ] Load and inspect the data
- [ ] Check for missing data and outliers
- [ ] Verify assumptions using assumption_checks.py
- [ ] Run the primary analysis
- [ ] Calculate effect sizes with confidence intervals
- [ ] Conduct post-hoc tests if needed (with corrections)
- [ ] Create visualizations
- [ ] Write up results following reporting_standards.md
- [ ] Conduct sensitivity analyses
- [ ] Share data and code

---

## Support and Further Reading

For questions about:

- **Test selection**: See references/test_selection_guide.md
- **Assumptions**: See references/assumptions_and_diagnostics.md
- **Effect sizes**: See references/effect_sizes_and_power.md
- **Bayesian methods**: See references/bayesian_statistics.md
- **Reporting**: See references/reporting_standards.md

**Key textbooks**:

- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences*
- Field, A. (2013). *Discovering Statistics Using IBM SPSS Statistics*
- Gelman, A., & Hill, J. (2006). *Data Analysis Using Regression and Multilevel/Hierarchical Models*
- Kruschke, J. K. (2014). *Doing Bayesian Data Analysis*

**Online resources**:

- APA Style Guide: https://apastyle.apa.org/
- Statistical consulting: Cross Validated (stats.stackexchange.com)