# Common Statistical Pitfalls
## P-Value Misinterpretations
### Pitfall 1: P-Value = Probability Hypothesis is True
**Misconception:** p = .05 means 5% chance the null hypothesis is true.
**Reality:** P-value is the probability of observing data this extreme (or more) *if* the null hypothesis is true. It says nothing about the probability the hypothesis is true.
**Correct interpretation:** "If there were truly no effect, we would observe data this extreme only 5% of the time."
### Pitfall 2: Non-Significant = No Effect
**Misconception:** p > .05 proves there's no effect.
**Reality:** Absence of evidence ≠ evidence of absence. Non-significant results may indicate:
- Insufficient statistical power
- True effect too small to detect
- High variability
- Small sample size
**Better approach:**
- Report confidence intervals
- Conduct power analysis
- Consider equivalence testing
### Pitfall 3: Significant = Important
**Misconception:** Statistical significance means practical importance.
**Reality:** With large samples, trivial effects become "significant." A statistically significant 0.1 IQ point difference is meaningless in practice.
**Better approach:**
- Report effect sizes
- Consider practical significance
- Use confidence intervals
### Pitfall 4: P = .049 vs. P = .051
**Misconception:** These are meaningfully different because one crosses the .05 threshold.
**Reality:** These represent nearly identical evidence. The .05 threshold is arbitrary.
**Better approach:**
- Treat p-values as continuous measures of evidence
- Report exact p-values
- Consider context and prior evidence
### Pitfall 5: One-Tailed Tests Without Justification
**Misconception:** One-tailed tests are free extra power.
**Reality:** One-tailed tests assume effects can only go one direction, which is rarely true. They're often used to artificially boost significance.
**When appropriate:** Only when an effect in the untested direction is theoretically impossible or would be acted on the same way as a null result.
## Multiple Comparisons Problems
### Pitfall 6: Multiple Testing Without Correction
**Problem:** Testing 20 independent hypotheses at p < .05 when every null is true gives roughly a 64% chance (1 − 0.95²⁰) of at least one false positive (see the sketch below).
**Examples:**
- Testing many outcomes
- Testing many subgroups
- Conducting multiple interim analyses
- Testing at multiple time points
**Solutions:**
- Bonferroni correction (divide α by number of tests)
- False Discovery Rate (FDR) control
- Prespecify primary outcome
- Treat exploratory analyses as hypothesis-generating
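A minimal sketch of the familywise error arithmetic and two standard corrections, using statsmodels' `multipletests` on simulated p-values (all values here are illustrative):
```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Familywise error rate: chance of at least one false positive across m independent tests
m = 20
print(f"FWER for {m} tests at alpha = .05: {1 - 0.95**m:.2f}")  # ~0.64

# Simulated p-values from 20 tests in which every null hypothesis is true
rng = np.random.default_rng(1)
p_values = rng.uniform(0, 1, size=m)

# Bonferroni controls the familywise error rate; Benjamini-Hochberg controls the FDR
for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, "rejections:", int(reject.sum()))
```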
### Pitfall 7: Subgroup Analysis Fishing
**Problem:** Testing many subgroups until finding significance.
**Why problematic:**
- Inflates false positive rate
- Often reported without disclosure
- "Interaction was significant in women" may be random
**Solutions:**
- Prespecify subgroups
- Use interaction tests, not separate tests
- Require replication
- Correct for multiple comparisons
### Pitfall 8: Outcome Switching
**Problem:** Analyzing many outcomes, reporting only significant ones.
**Detection signs:**
- Secondary outcomes emphasized
- Incomplete outcome reporting
- Discrepancy between registration and publication
**Solutions:**
- Preregister all outcomes
- Report all planned outcomes
- Distinguish primary from secondary
## Sample Size and Power Issues
### Pitfall 9: Underpowered Studies
**Problem:** Small samples have low probability of detecting true effects.
**Consequences:**
- High false negative rate
- Significant results more likely to be false positives
- Overestimated effect sizes (when significant)
**Solutions:**
- Conduct a priori power analysis
- Aim for 80-90% power
- Consider effect size from prior research
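For example, an a priori power analysis for a two-sample t-test can be run with statsmodels; the assumed effect size of d = 0.5 below is purely illustrative:
```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size needed to detect d = 0.5 with 80% power at two-sided alpha = .05
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative="two-sided")
print(f"Required n per group: {n_per_group:.1f}")  # roughly 64
```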
### Pitfall 10: Post-Hoc Power Analysis
**Problem:** Calculating power after seeing results is circular and uninformative.
**Why useless:**
- Non-significant results always have low "post-hoc power"
- It recapitulates the p-value without new information
**Better approach:**
- Calculate confidence intervals
- Plan replication with adequate sample
- Conduct prospective power analysis for future studies
### Pitfall 11: Small Sample Fallacy
**Problem:** Trusting results from very small samples.
**Issues:**
- High sampling variability
- Outliers have large influence
- Assumptions of tests violated
- Confidence intervals very wide
**Guidelines:**
- Be skeptical of n < 30
- Check assumptions carefully
- Consider non-parametric tests
- Replicate findings
## Effect Size Misunderstandings
### Pitfall 12: Ignoring Effect Size
**Problem:** Focusing only on significance, not magnitude.
**Why problematic:**
- Significance ≠ importance
- Can't compare across studies
- Doesn't inform practical decisions
**Solutions:**
- Always report effect sizes
- Use standardized measures (Cohen's d, r, η²)
- Interpret using field conventions
- Consider minimum clinically important difference
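A minimal sketch of computing Cohen's d for two independent groups using the pooled standard deviation (the data are simulated for illustration):
```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, based on the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
treatment = rng.normal(loc=10.5, scale=2.0, size=50)  # hypothetical outcome scores
control = rng.normal(loc=10.0, scale=2.0, size=50)
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```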
### Pitfall 13: Misinterpreting Standardized Effect Sizes
**Problem:** Treating Cohen's d = 0.5 as "medium" without context.
**Reality:**
- Field-specific norms vary
- Some fields have larger typical effects
- Real-world importance depends on context
**Better approach:**
- Compare to effects in same domain
- Consider practical implications
- Look at raw effect sizes too
### Pitfall 14: Confusing Explained Variance with Importance
**Problem:** "Only explains 5% of variance" = unimportant.
**Reality:**
- Height explains ~5% of variation in NBA player salary but is crucial
- Complex phenomena have many small contributors
- Predictive accuracy ≠ causal importance
**Consideration:** Context matters more than percentage alone.
## Correlation and Causation
### Pitfall 15: Correlation Implies Causation
**Problem:** Inferring causation from correlation.
**Alternative explanations:**
- Reverse causation (B causes A, not A causes B)
- Confounding (C causes both A and B)
- Coincidence
- Selection bias
**Criteria for causation:**
- Temporal precedence
- Covariation
- No plausible alternatives
- Ideally: experimental manipulation
### Pitfall 16: Ecological Fallacy
**Problem:** Inferring individual-level relationships from group-level data.
**Example:** The fact that countries with higher chocolate consumption also have more Nobel laureates per capita does not mean eating chocolate helps anyone win a Nobel.
**Why problematic:** Group-level correlations may not hold at individual level.
### Pitfall 17: Simpson's Paradox
**Problem:** Trend appears in groups but reverses when combined (or vice versa).
**Example:** Treatment appears worse overall but better in every subgroup.
**Cause:** Confounding variable distributed differently across groups.
**Solution:** Consider confounders and look at appropriate level of analysis.
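A toy illustration with made-up counts: the treatment does better within each severity stratum but looks worse in the pooled table, because severe cases were far more likely to receive it:
```python
import pandas as pd

# Hypothetical recovery counts; the treatment was given mostly to severe cases
data = pd.DataFrame({
    "group":     ["treatment", "treatment", "control", "control"],
    "severity":  ["mild", "severe", "mild", "severe"],
    "recovered": [18, 56, 70, 12],
    "total":     [20, 80, 80, 20],
})
data["rate"] = data["recovered"] / data["total"]
print(data)  # treatment wins within each stratum: 90% vs 87.5% (mild), 70% vs 60% (severe)

pooled = data.groupby("group")[["recovered", "total"]].sum()
print(pooled["recovered"] / pooled["total"])  # but loses overall: 74% vs 82%
```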
## Regression and Modeling Pitfalls
### Pitfall 18: Overfitting
**Problem:** Model fits sample data well but doesn't generalize.
**Causes:**
- Too many predictors relative to sample size
- Fitting noise rather than signal
- No cross-validation
**Solutions:**
- Use cross-validation
- Penalized regression (LASSO, ridge)
- Independent test set
- Simpler models
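A short sketch contrasting in-sample fit with cross-validated performance when a model is fit to pure noise (scikit-learn, simulated data):
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 40 observations and 30 noise predictors: plenty of room to fit noise
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 30))
y = rng.normal(size=40)  # the outcome is unrelated to every predictor

model = LinearRegression()
print("In-sample R^2:", round(model.fit(X, y).score(X, y), 2))                 # high purely from overfitting
print("5-fold CV R^2:", round(cross_val_score(model, X, y, cv=5).mean(), 2))   # near zero or negative
```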
### Pitfall 19: Extrapolation Beyond Data Range
**Problem:** Predicting outside the range of observed data.
**Why dangerous:**
- Relationships may not hold outside observed range
- Increased uncertainty not reflected in predictions
**Solution:** Only interpolate; avoid extrapolation.
### Pitfall 20: Ignoring Model Assumptions
**Problem:** Using statistical tests without checking assumptions.
**Common violations:**
- Non-normality (for parametric tests)
- Heteroscedasticity (unequal variances)
- Non-independence of observations
- Non-linearity
- Multicollinearity
**Solutions:**
- Check assumptions with diagnostics
- Use robust methods
- Transform data
- Use appropriate non-parametric alternatives
### Pitfall 21: Treating Non-Significant Covariates as Eliminating Confounding
**Problem:** "We controlled for X and it wasn't significant, so it's not a confounder."
**Reality:** A covariate can be an important confounder even when its coefficient is non-significant; confounding depends on its relationship to both the exposure and the outcome, not on its p-value.
**Solution:** Include theoretically important covariates regardless of significance.
### Pitfall 22: Collinearity Masking Effects
**Problem:** When predictors are highly correlated, true effects may appear non-significant.
**Manifestations:**
- Large standard errors
- Unstable coefficients
- Sign changes when adding/removing variables
**Detection:**
- Variance Inflation Factors (VIF)
- Correlation matrices
**Solutions:**
- Remove redundant predictors
- Combine correlated variables
- Use regularization methods
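One common screen is the variance inflation factor, e.g., via statsmodels; in this simulated example, x2 is built to be nearly redundant with x1:
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in zip(range(1, 4), ["x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(X, i), 1))
# VIFs far above the common 5-10 rule of thumb flag x1 and x2 as problematic
```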
## Specific Test Misuses
### Pitfall 23: T-Test for Multiple Groups
**Problem:** Conducting multiple t-tests instead of ANOVA.
**Why wrong:** Inflates Type I error rate dramatically.
**Correct approach:**
- Use ANOVA first
- Follow with planned comparisons or post-hoc tests with correction
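A sketch of that flow with SciPy and statsmodels on simulated data: an omnibus ANOVA first, then Tukey's HSD for pairwise comparisons with familywise error control:
```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2, 30)
b = rng.normal(10.5, 2, 30)
c = rng.normal(12.0, 2, 30)

# Omnibus test across all three groups, instead of three uncorrected t-tests
print(f_oneway(a, b, c))

# Post-hoc pairwise comparisons that control the familywise error rate
values = np.concatenate([a, b, c])
groups = ["a"] * 30 + ["b"] * 30 + ["c"] * 30
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```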
### Pitfall 24: Pearson Correlation for Non-Linear Relationships
**Problem:** Using Pearson's r for curved relationships.
**Why misleading:** r measures linear relationships only.
**Solutions:**
- Check scatterplots first
- Use Spearman's ρ for monotonic relationships
- Consider polynomial or non-linear models
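A quick illustration of why the scatterplot (or a rank-based coefficient) matters: for a monotonic but curved relationship, Pearson's r is attenuated while Spearman's ρ stays near 1 (simulated data):
```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(scale=0.5, size=200)  # strongly monotonic, clearly non-linear

print("Pearson r   :", round(pearsonr(x, y)[0], 2))   # attenuated by the curvature
print("Spearman rho:", round(spearmanr(x, y)[0], 2))  # close to 1 for this monotonic trend
```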
### Pitfall 25: Chi-Square with Small Expected Frequencies
**Problem:** Chi-square test with expected cell counts < 5.
**Why wrong:** Violates test assumptions, p-values inaccurate.
**Solutions:**
- Fisher's exact test
- Combine categories
- Increase sample size
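For a small 2×2 table (hypothetical counts below), Fisher's exact test in SciPy avoids relying on the chi-square approximation:
```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table with small cell counts
table = [[1, 9],
         [6, 3]]

chi2, p_chi2, dof, expected = chi2_contingency(table)
print("Expected counts:\n", expected)           # two expected counts fall below 5
print("Chi-square p-value:", round(p_chi2, 3))  # approximation is questionable here

odds_ratio, p_exact = fisher_exact(table)
print("Fisher's exact p-value:", round(p_exact, 3))
```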
### Pitfall 26: Paired vs. Independent Tests
**Problem:** Using independent samples test for paired data (or vice versa).
**Why wrong:**
- Wastes power (paired data analyzed as independent)
- Violates independence assumption (independent data analyzed as paired)
**Solution:** Match test to design.
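A small sketch of the power lost when paired (pre/post) measurements are analyzed as if they came from independent groups (simulated data):
```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, size=25)           # hypothetical pre-treatment scores
followup = baseline + rng.normal(2, 3, size=25)  # small within-person improvement

# Correct for this design: the paired test uses each person's own change score
print("Paired test p      =", round(ttest_rel(followup, baseline).pvalue, 4))
# Mismatched: treating the two columns as unrelated samples discards the pairing
print("Independent test p =", round(ttest_ind(followup, baseline).pvalue, 4))
```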
## Confidence Interval Misinterpretations
### Pitfall 27: 95% CI = 95% Probability True Value Inside
**Misconception:** "95% chance the true value is in this interval."
**Reality:** The true value either is or isn't in this specific interval. If we repeated the study many times, 95% of resulting intervals would contain the true value.
**Better interpretation:** "We're 95% confident this interval contains the true value."
### Pitfall 28: Overlapping CIs = No Difference
**Problem:** Assuming overlapping confidence intervals mean no significant difference.
**Reality:** Overlapping CIs are less stringent than difference tests. Two CIs can overlap while the difference between groups is significant.
**Guideline:** Whether one group's point estimate falls inside the other group's interval is a better rough check than whether the intervals overlap; when the difference matters, test (or compute a CI for) the difference directly.
### Pitfall 29: Ignoring CI Width
**Problem:** Focusing only on whether CI includes zero, not precision.
**Why important:** Wide CIs indicate high uncertainty. "Significant" effects with huge CIs are less convincing.
**Consider:** Both significance and precision.
## Bayesian vs. Frequentist Confusions
### Pitfall 30: Mixing Bayesian and Frequentist Interpretations
**Problem:** Making Bayesian statements from frequentist analyses.
**Examples:**
- "Probability hypothesis is true" (Bayesian) from p-value (frequentist)
- "Evidence for null" from non-significant result (frequentist can't support null)
**Solution:**
- Be clear about framework
- Use Bayesian methods for Bayesian questions
- Use Bayes factors to compare hypotheses
### Pitfall 31: Ignoring Prior Probability
**Problem:** Treating all hypotheses as equally likely initially.
**Reality:** Extraordinary claims need extraordinary evidence. Prior plausibility matters.
**Consider:**
- Plausibility given existing knowledge
- Mechanism plausibility
- Base rates
## Data Transformation Issues
### Pitfall 32: Dichotomizing Continuous Variables
**Problem:** Splitting continuous variables at arbitrary cutoffs.
**Consequences:**
- Loss of information and power
- Arbitrary distinctions
- Discarding individual differences
**Exceptions:** Clinically meaningful cutoffs with strong justification.
**Better:** Keep continuous or use multiple categories.
### Pitfall 33: Trying Multiple Transformations
**Problem:** Testing many transformations until finding significance.
**Why problematic:** Inflates Type I error, is a form of p-hacking.
**Better approach:**
- Prespecify transformations
- Use theory-driven transformations
- Correct for multiple testing if exploring
## Missing Data Problems
### Pitfall 34: Listwise Deletion by Default
**Problem:** Automatically deleting all cases with any missing data.
**Consequences:**
- Reduced power
- Potential bias if data not missing completely at random (MCAR)
**Better approaches:**
- Multiple imputation
- Maximum likelihood methods
- Analyze missingness patterns
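A sketch of model-based imputation with scikit-learn's `IterativeImputer` as an alternative to listwise deletion (toy data with cells deleted at random); running it several times with `sample_posterior=True` and different seeds is one way to approximate multiple imputation:
```python
import numpy as np
# IterativeImputer is still flagged as experimental and requires this enabling import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]]
X = rng.multivariate_normal(mean=[0, 0, 0], cov=cov, size=200)

X_missing = X.copy()
X_missing[rng.uniform(size=X.shape) < 0.2] = np.nan  # delete ~20% of cells at random

imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X_missing)
print("Remaining NaNs:", int(np.isnan(X_imputed).sum()))  # 0 after imputation
```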
### Pitfall 35: Ignoring Missing Data Mechanisms
**Problem:** Not considering why data are missing.
**Types:**
- MCAR (Missing Completely at Random): deletion is unbiased but wastes power
- MAR (Missing at Random): Can impute
- MNAR (Missing Not at Random): May bias results
**Solution:** Analyze patterns, use appropriate methods, consider sensitivity analyses.
## Publication and Reporting Issues
### Pitfall 36: Selective Reporting
**Problem:** Only reporting significant results or favorable analyses.
**Consequences:**
- Literature appears more consistent than reality
- Meta-analyses biased
- Wasted research effort
**Solutions:**
- Preregistration
- Report all analyses
- Use reporting guidelines (CONSORT, PRISMA, etc.)
### Pitfall 37: Rounding to p < .05
**Problem:** Reporting thresholds instead of exact values (e.g., writing "p < .05" when p = .049, or rounding p = .051 to "p = .05").
**Why problematic:** Obscures how close results sit to the threshold and makes p-hacking harder to detect.
**Better:** Always report exact p-values.
### Pitfall 38: No Data Sharing
**Problem:** Not making data available for verification or reanalysis.
**Consequences:**
- Can't verify results
- Can't include in meta-analyses
- Hinders scientific progress
**Best practice:** Share data unless privacy concerns prohibit.
## Cross-Validation and Generalization
### Pitfall 39: No Cross-Validation
**Problem:** Testing model on same data used to build it.
**Consequence:** Overly optimistic performance estimates.
**Solutions:**
- Split data (train/test)
- K-fold cross-validation
- Independent validation sample
### Pitfall 40: Data Leakage
**Problem:** Information from test set leaking into training.
**Examples:**
- Normalizing before splitting
- Feature selection on full dataset
- Using future information to predict earlier time points (temporal leakage)
**Consequence:** Inflated performance metrics.
**Prevention:** All preprocessing decisions made using only training data.
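One common safeguard is to wrap every preprocessing step in a scikit-learn `Pipeline`, so that scaling and feature selection are refit on the training folds only during cross-validation (simulated data; the exact scores will vary):
```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))   # many pure-noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: features are selected on the full dataset before cross-validation
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Safe: selection and scaling happen inside each training fold
pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"Leaky CV accuracy: {leaky:.2f}   Honest CV accuracy: {honest:.2f}")
# The leaky estimate is typically well above chance; the honest one hovers near 0.50
```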
## Meta-Analysis Pitfalls
### Pitfall 41: Apples and Oranges
**Problem:** Combining studies with different designs, populations, or measures.
**Balance:** Studies must be similar enough to combine meaningfully while the review stays comprehensive.
**Solutions:**
- Clear inclusion criteria
- Subgroup analyses
- Meta-regression for moderators
### Pitfall 42: Ignoring Publication Bias
**Problem:** Published studies overrepresent significant results.
**Consequences:** Overestimated effects in meta-analyses.
**Detection:**
- Funnel plots
- Trim-and-fill
- PET-PEESE
- P-curve analysis
**Solutions:**
- Include unpublished studies
- Register reviews
- Use bias-correction methods
## General Best Practices
1. **Preregister studies** - Distinguish confirmatory from exploratory
2. **Report transparently** - All analyses, not just significant ones
3. **Check assumptions** - Don't blindly apply tests
4. **Use appropriate tests** - Match test to data and design
5. **Report effect sizes** - Not just p-values
6. **Consider practical significance** - Not just statistical
7. **Replicate findings** - One study is rarely definitive
8. **Share data and code** - Enable verification
9. **Use confidence intervals** - Show uncertainty
10. **Think carefully about causal claims** - Most research is correlational