Common Statistical Pitfalls

P-Value Misinterpretations

Pitfall 1: P-Value = Probability Hypothesis is True

Misconception: p = .05 means 5% chance the null hypothesis is true.

Reality: P-value is the probability of observing data this extreme (or more) if the null hypothesis is true. It says nothing about the probability the hypothesis is true.

Correct interpretation: "If there were truly no effect, we would observe data this extreme only 5% of the time."

Pitfall 2: Non-Significant = No Effect

Misconception: p > .05 proves there's no effect.

Reality: Absence of evidence ≠ evidence of absence. Non-significant results may indicate:

  • Insufficient statistical power
  • True effect too small to detect
  • High variability
  • Small sample size

Better approach:

  • Report confidence intervals
  • Conduct power analysis
  • Consider equivalence testing
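
A minimal sketch of the last two points (the group sizes, the tiny true effect, and the ±0.3 equivalence margin are invented for illustration): a confidence interval for the difference and a TOST equivalence test say much more than "p > .05".

```python
# Interpreting a non-significant result: CI for the difference plus a TOST test.
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(1)
control = rng.normal(loc=0.0, scale=1.0, size=40)
treatment = rng.normal(loc=0.1, scale=1.0, size=40)   # tiny true effect

t, p = stats.ttest_ind(treatment, control)            # likely "not significant"

# 95% CI for the mean difference (pooled-variance formula)
n1, n2 = len(treatment), len(control)
diff = treatment.mean() - control.mean()
sp2 = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

# TOST: is the effect small enough to count as practically equivalent to zero?
p_tost, _, _ = ttost_ind(treatment, control, low=-0.3, upp=0.3)

print(f"t-test p = {p:.3f}, 95% CI = ({diff - crit * se:.2f}, {diff + crit * se:.2f})")
print(f"TOST equivalence p = {p_tost:.3f} (margin ±0.3)")
```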

Pitfall 3: Significant = Important

Misconception: Statistical significance means practical importance.

Reality: With large samples, trivial effects become "significant." A statistically significant 0.1 IQ point difference is meaningless in practice.

Better approach:

  • Report effect sizes
  • Consider practical significance
  • Use confidence intervals
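
A small simulation (the 0.02-SD true effect and n = 100,000 per group are invented) showing why effect sizes have to accompany p-values: with enough data a negligible difference becomes "significant".

```python
# A trivial effect becomes "significant" at large n; Cohen's d exposes this.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000
a = rng.normal(0.00, 1.0, n)
b = rng.normal(0.02, 1.0, n)                 # true effect: 0.02 standard deviations

t, p = stats.ttest_ind(b, a)

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd        # Cohen's d

print(f"p = {p:.2g} (almost certainly 'significant')")
print(f"Cohen's d = {d:.3f} (practically negligible)")
```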

Pitfall 4: P = .049 vs. P = .051

Misconception: These are meaningfully different because one crosses the .05 threshold.

Reality: These represent nearly identical evidence. The .05 threshold is arbitrary.

Better approach:

  • Treat p-values as continuous measures of evidence
  • Report exact p-values
  • Consider context and prior evidence

Pitfall 5: One-Tailed Tests Without Justification

Misconception: One-tailed tests are free extra power.

Reality: One-tailed tests assume effects can only go one direction, which is rarely true. They're often used to artificially boost significance.

When appropriate: Only when an effect in the untested direction is theoretically impossible or would be treated exactly like a null result.

Multiple Comparisons Problems

Pitfall 6: Multiple Testing Without Correction

Problem: Testing 20 independent hypotheses, each at α = .05, gives roughly a 64% chance of at least one false positive (1 − 0.95^20 ≈ 0.64).

Examples:

  • Testing many outcomes
  • Testing many subgroups
  • Conducting multiple interim analyses
  • Testing at multiple time points

Solutions:

  • Bonferroni correction (divide α by number of tests)
  • False Discovery Rate (FDR) control
  • Prespecify primary outcome
  • Treat exploratory analyses as hypothesis-generating
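
A small sketch of the arithmetic and of the first two corrections above, applied to invented p-values:

```python
# Familywise error for 20 independent tests, and p-value adjustment.
import numpy as np
from statsmodels.stats.multitest import multipletests

print(1 - 0.95 ** 20)                        # ~0.64 chance of at least one false positive

rng = np.random.default_rng(42)
pvals = rng.uniform(0, 1, size=20)           # p-values are uniform when every null is true
reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
reject_fdr, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("raw 'significant':     ", int((pvals < 0.05).sum()))
print("Bonferroni significant:", int(reject_bonf.sum()))
print("FDR (BH) significant:  ", int(reject_fdr.sum()))
```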

Pitfall 7: Subgroup Analysis Fishing

Problem: Testing many subgroups until finding significance.

Why problematic:

  • Inflates false positive rate
  • Often reported without disclosure
  • "Interaction was significant in women" may be random

Solutions:

  • Prespecify subgroups
  • Use interaction tests, not separate tests
  • Require replication
  • Correct for multiple comparisons
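
A minimal sketch of the interaction-test approach above (variable names and data are invented): fit one model with a treatment × sex term and test that coefficient, rather than testing the treatment separately within each subgroup.

```python
# Subgroup differences should be tested with an interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "female": rng.integers(0, 2, n),
})
df["outcome"] = 0.3 * df["treatment"] + rng.normal(0, 1, n)   # same true effect in both subgroups

# The treatment:female coefficient is the direct test of "does the effect differ by sex?"
model = smf.ols("outcome ~ treatment * female", data=df).fit()
print(model.summary().tables[1])
```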

Pitfall 8: Outcome Switching

Problem: Analyzing many outcomes, reporting only significant ones.

Detection signs:

  • Secondary outcomes emphasized
  • Incomplete outcome reporting
  • Discrepancy between registration and publication

Solutions:

  • Preregister all outcomes
  • Report all planned outcomes
  • Distinguish primary from secondary

Sample Size and Power Issues

Pitfall 9: Underpowered Studies

Problem: Small samples have low probability of detecting true effects.

Consequences:

  • High false negative rate
  • Significant results more likely to be false positives
  • Overestimated effect sizes (when significant)

Solutions:

  • Conduct a priori power analysis
  • Aim for 80-90% power
  • Consider effect size from prior research
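
A minimal sketch of an a priori power calculation (the assumed effect size d = 0.4 is invented; in practice it should come from prior research or the smallest effect worth detecting):

```python
# Sample size needed for a two-group t-test at 80% power.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.4, power=0.80, alpha=0.05)
print(f"Required n per group: {n_per_group:.0f}")    # roughly 100 per group
```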

Pitfall 10: Post-Hoc Power Analysis

Problem: Calculating power after seeing results is circular and uninformative.

Why useless:

  • Non-significant results always have low "post-hoc power"
  • It recapitulates the p-value without new information

Better approach:

  • Calculate confidence intervals
  • Plan replication with adequate sample
  • Conduct prospective power analysis for future studies

Pitfall 11: Small Sample Fallacy

Problem: Trusting results from very small samples.

Issues:

  • High sampling variability
  • Outliers have large influence
  • Assumptions of tests violated
  • Confidence intervals very wide

Guidelines:

  • Be skeptical of n < 30
  • Check assumptions carefully
  • Consider non-parametric tests
  • Replicate findings

Effect Size Misunderstandings

Pitfall 12: Ignoring Effect Size

Problem: Focusing only on significance, not magnitude.

Why problematic:

  • Significance ≠ importance
  • Can't compare across studies
  • Doesn't inform practical decisions

Solutions:

  • Always report effect sizes
  • Use standardized measures (Cohen's d, r, η²)
  • Interpret using field conventions
  • Consider minimum clinically important difference

Pitfall 13: Misinterpreting Standardized Effect Sizes

Problem: Treating Cohen's d = 0.5 as "medium" without context.

Reality:

  • Field-specific norms vary
  • Some fields have larger typical effects
  • Real-world importance depends on context

Better approach:

  • Compare to effects in same domain
  • Consider practical implications
  • Look at raw effect sizes too

Pitfall 14: Confusing Explained Variance with Importance

Problem: "Only explains 5% of variance" = unimportant.

Reality:

  • Height explains ~5% of variation in NBA player salary but is crucial
  • Complex phenomena have many small contributors
  • Predictive accuracy ≠ causal importance

Consideration: Context matters more than percentage alone.

Correlation and Causation

Pitfall 15: Correlation Implies Causation

Problem: Inferring causation from correlation.

Alternative explanations:

  • Reverse causation (B causes A, not A causes B)
  • Confounding (C causes both A and B)
  • Coincidence
  • Selection bias

Criteria for causation:

  • Temporal precedence
  • Covariation
  • Elimination of plausible alternative explanations
  • Ideally: experimental manipulation

Pitfall 16: Ecological Fallacy

Problem: Inferring individual-level relationships from group-level data.

Example: The fact that countries with higher chocolate consumption have more Nobel laureates per capita does not mean that eating chocolate makes an individual more likely to win one.

Why problematic: Group-level correlations may not hold at individual level.

Pitfall 17: Simpson's Paradox

Problem: Trend appears in groups but reverses when combined (or vice versa).

Example: Treatment appears worse overall but better in every subgroup.

Cause: Confounding variable distributed differently across groups.

Solution: Consider confounders and look at appropriate level of analysis.
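
A toy illustration (all counts invented): the treatment has the higher success rate in both severity groups, yet looks worse when the groups are pooled, because it was given mostly to severe cases.

```python
# Simpson's paradox in a small aggregated-vs-stratified comparison.
import pandas as pd

df = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "arm":       ["treatment", "control", "treatment", "control"],
    "successes": [9, 78, 30, 3],
    "total":     [10, 90, 90, 10],
})
df["rate"] = df["successes"] / df["total"]
print(df)                                              # treatment better within each group

overall = df.groupby("arm")[["successes", "total"]].sum()
print(overall["successes"] / overall["total"])         # control looks better overall
```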

Regression and Modeling Pitfalls

Pitfall 18: Overfitting

Problem: Model fits sample data well but doesn't generalize.

Causes:

  • Too many predictors relative to sample size
  • Fitting noise rather than signal
  • No cross-validation

Solutions:

  • Use cross-validation
  • Penalized regression (LASSO, ridge)
  • Independent test set
  • Simpler models
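
A small sketch of the first two solutions above (the dimensions and the single informative predictor are invented): in-sample fit is flattering, cross-validation is not, and a penalized model holds up better.

```python
# Overfitting: in-sample R^2 vs. cross-validated R^2, plus LASSO.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p = 60, 50
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)       # only the first predictor carries signal

ols = LinearRegression().fit(X, y)
print("In-sample R^2:        ", ols.score(X, y))                          # looks great
print("Cross-validated R^2:  ", cross_val_score(ols, X, y, cv=5).mean())  # much worse

lasso = LassoCV(cv=5).fit(X, y)
print("LASSO cross-val. R^2: ", cross_val_score(lasso, X, y, cv=5).mean())
```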

Pitfall 19: Extrapolation Beyond Data Range

Problem: Predicting outside the range of observed data.

Why dangerous:

  • Relationships may not hold outside observed range
  • Increased uncertainty not reflected in predictions

Solution: Only interpolate; avoid extrapolation.

Pitfall 20: Ignoring Model Assumptions

Problem: Using statistical tests without checking assumptions.

Common violations:

  • Non-normality (for parametric tests)
  • Heteroscedasticity (unequal error variances)
  • Non-independence of observations
  • Non-linearity
  • Multicollinearity among predictors

Solutions:

  • Check assumptions with diagnostics
  • Use robust methods
  • Transform data
  • Use appropriate non-parametric alternatives
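
A sketch of the diagnostic route above (the simulated data, with deliberately heteroscedastic errors, are invented): test residual normality and constant variance, then fall back on robust standard errors if needed.

```python
# Checking OLS assumptions and using heteroscedasticity-robust errors.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 0.5 + 0.5 * x)     # error variance grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

w, p_norm = stats.shapiro(fit.resid)         # normality of residuals
bp_p = het_breuschpagan(fit.resid, X)[1]     # constant-variance test
print(f"Shapiro-Wilk p = {p_norm:.3f}, Breusch-Pagan p = {bp_p:.3g}")

# One robust option: heteroscedasticity-consistent (HC3) standard errors
print(fit.get_robustcov_results(cov_type="HC3").summary())
```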

Pitfall 21: Treating Non-Significant Covariates as Eliminating Confounding

Problem: "We controlled for X and it wasn't significant, so it's not a confounder."

Reality: Confounding is about whether a covariate biases the exposure-outcome estimate, not about the covariate's own p-value. A non-significant covariate can still be an important confounder.

Solution: Include theoretically important covariates regardless of significance.

Pitfall 22: Collinearity Masking Effects

Problem: When predictors are highly correlated, true effects may appear non-significant.

Manifestations:

  • Large standard errors
  • Unstable coefficients
  • Sign changes when adding/removing variables

Detection:

  • Variance Inflation Factors (VIF)
  • Correlation matrices

Solutions:

  • Remove redundant predictors
  • Combine correlated variables
  • Use regularization methods
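
A minimal sketch of the VIF check above, with two deliberately near-duplicate predictors (all data invented):

```python
# Variance inflation factors flag the collinear pair.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(11)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # nearly a copy of x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(X, i), 1))
# Rule of thumb: VIF above roughly 5-10 signals problematic collinearity.
```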

Specific Test Misuses

Pitfall 23: T-Test for Multiple Groups

Problem: Conducting multiple t-tests instead of ANOVA.

Why wrong: Inflates Type I error rate dramatically.

Correct approach:

  • Use ANOVA first
  • Follow with planned comparisons or post-hoc tests with correction
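
A minimal sketch of that workflow with three invented groups:

```python
# One-way ANOVA followed by Tukey's HSD, instead of repeated t-tests.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1, 30)
b = rng.normal(0.0, 1, 30)
c = rng.normal(0.8, 1, 30)

F, p = stats.f_oneway(a, b, c)               # omnibus test across all groups
print(f"ANOVA: F = {F:.2f}, p = {p:.4f}")

values = np.concatenate([a, b, c])
labels = ["a"] * 30 + ["b"] * 30 + ["c"] * 30
print(pairwise_tukeyhsd(values, labels))     # pairwise tests with familywise control
```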

Pitfall 24: Pearson Correlation for Non-Linear Relationships

Problem: Using Pearson's r for curved relationships.

Why misleading: r measures linear relationships only.

Solutions:

  • Check scatterplots first
  • Use Spearman's ρ for monotonic relationships
  • Consider polynomial or non-linear models
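
A small illustration of the Spearman suggestion above (the exponential relationship is invented): the association is perfectly monotonic, yet Pearson's r understates it badly.

```python
# Pearson vs. Spearman on a monotonic, strongly non-linear relationship.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = np.exp(x)                                # perfectly monotonic, far from linear

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```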

Pitfall 25: Chi-Square with Small Expected Frequencies

Problem: Chi-square test with expected cell counts < 5.

Why wrong: The chi-square approximation breaks down when expected counts are small, so p-values are inaccurate.

Solutions:

  • Fisher's exact test
  • Combine categories
  • Increase sample size
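
A quick sketch with invented counts showing the switch to Fisher's exact test when expected cell counts are small:

```python
# Small expected counts: prefer Fisher's exact test over chi-square.
from scipy import stats

table = [[1, 9],
         [6, 4]]

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)

print("Expected counts:\n", expected)        # two cells fall below 5
print(f"Chi-square p = {p_chi2:.3f}, Fisher exact p = {p_fisher:.3f}")
```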

Pitfall 26: Paired vs. Independent Tests

Problem: Using independent samples test for paired data (or vice versa).

Why wrong:

  • Wastes power (paired data analyzed as independent)
  • Violates independence assumption (independent data analyzed as paired)

Solution: Match test to design.

Confidence Interval Misinterpretations

Pitfall 27: 95% CI = 95% Probability True Value Inside

Misconception: "95% chance the true value is in this interval."

Reality: The true value either is or isn't in this specific interval. If we repeated the study many times, 95% of resulting intervals would contain the true value.

Better interpretation: "We're 95% confident this interval contains the true value."

Pitfall 28: Overlapping CIs = No Difference

Problem: Assuming overlapping confidence intervals mean no significant difference.

Reality: Requiring non-overlapping CIs is a stricter criterion than a direct test of the difference. Two 95% CIs can overlap while the difference between the groups is still statistically significant.

Guideline: Whether one group's point estimate falls inside the other group's CI is a more useful rough check than whether the intervals overlap; better still, estimate the difference directly with its own CI.

Pitfall 29: Ignoring CI Width

Problem: Focusing only on whether CI includes zero, not precision.

Why important: Wide CIs indicate high uncertainty. "Significant" effects with huge CIs are less convincing.

Consider: Both significance and precision.

Bayesian vs. Frequentist Confusions

Pitfall 30: Mixing Bayesian and Frequentist Interpretations

Problem: Making Bayesian statements from frequentist analyses.

Examples:

  • "Probability hypothesis is true" (Bayesian) from p-value (frequentist)
  • "Evidence for null" from non-significant result (frequentist can't support null)

Solution:

  • Be clear about framework
  • Use Bayesian methods for Bayesian questions
  • Use Bayes factors to compare hypotheses

Pitfall 31: Ignoring Prior Probability

Problem: Treating all hypotheses as equally likely initially.

Reality: Extraordinary claims need extraordinary evidence. Prior plausibility matters.

Consider:

  • Plausibility given existing knowledge
  • Mechanism plausibility
  • Base rates

Data Transformation Issues

Pitfall 32: Dichotomizing Continuous Variables

Problem: Splitting continuous variables at arbitrary cutoffs.

Consequences:

  • Loss of information and power
  • Arbitrary distinctions
  • Discarding individual differences

Exceptions: Clinically meaningful cutoffs with strong justification.

Better: Keep continuous or use multiple categories.

Pitfall 33: Trying Multiple Transformations

Problem: Testing many transformations until finding significance.

Why problematic: Inflates Type I error, is a form of p-hacking.

Better approach:

  • Prespecify transformations
  • Use theory-driven transformations
  • Correct for multiple testing if exploring

Missing Data Problems

Pitfall 34: Listwise Deletion by Default

Problem: Automatically deleting all cases with any missing data.

Consequences:

  • Reduced power
  • Potential bias if data not missing completely at random (MCAR)

Better approaches:

  • Multiple imputation
  • Maximum likelihood methods
  • Analyze missingness patterns
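
A minimal sketch of the imputation route above using scikit-learn's IterativeImputer (the data, the 20% missingness rate, and the five imputations are invented; a full analysis would fit the model on each imputed dataset and pool the estimates):

```python
# Multiple stochastic imputations instead of listwise deletion.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))
X[:, 2] += 0.8 * X[:, 0]                       # related columns help the imputation
X[rng.uniform(size=200) < 0.2, 2] = np.nan     # ~20% missing in the third column

imputations = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(5)                          # five differently seeded draws
]
print("Remaining missing cells:", [int(np.isnan(m).sum()) for m in imputations])
```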

Pitfall 35: Ignoring Missing Data Mechanisms

Problem: Not considering why data are missing.

Types:

  • MCAR (Missing Completely at Random): Deletion is unbiased but wastes power
  • MAR (Missing at Random): Missingness depends only on observed variables; imputation or maximum likelihood handle it well
  • MNAR (Missing Not at Random): Missingness depends on the unobserved values themselves; results may be biased without explicit modeling

Solution: Analyze patterns, use appropriate methods, consider sensitivity analyses.

Publication and Reporting Issues

Pitfall 36: Selective Reporting

Problem: Only reporting significant results or favorable analyses.

Consequences:

  • Literature appears more consistent than reality
  • Meta-analyses biased
  • Wasted research effort

Solutions:

  • Preregistration
  • Report all analyses
  • Use reporting guidelines (CONSORT, PRISMA, etc.)

Pitfall 37: Rounding to p < .05

Problem: Reporting thresholds rather than exact values (e.g., "p < .05" instead of p = .049, or "n.s." instead of p = .051).

Why problematic: Obscures how close values are to the threshold and makes p-hacking harder to detect.

Better: Always report exact p-values.

Pitfall 38: No Data Sharing

Problem: Not making data available for verification or reanalysis.

Consequences:

  • Can't verify results
  • Can't include in meta-analyses
  • Hinders scientific progress

Best practice: Share data unless privacy concerns prohibit.

Cross-Validation and Generalization

Pitfall 39: No Cross-Validation

Problem: Testing model on same data used to build it.

Consequence: Overly optimistic performance estimates.

Solutions:

  • Split data (train/test)
  • K-fold cross-validation
  • Independent validation sample

Pitfall 40: Data Leakage

Problem: Information from test set leaking into training.

Examples:

  • Normalizing before splitting
  • Feature selection on full dataset
  • Using information from the future to predict the past (temporal leakage)

Consequence: Inflated performance metrics.

Prevention: All preprocessing decisions made using only training data.
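
A sketch of that principle with scikit-learn (the pure-noise data and the k = 10 feature selection are invented): preprocessing placed inside a Pipeline is refit on each training fold, while selecting features on the full dataset first leaks information about the held-out folds.

```python
# Keeping preprocessing inside the cross-validation loop.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 500))              # pure noise features
y = rng.integers(0, 2, 100)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression()),
])
print("Pipeline CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())   # near chance

# Wrong: feature selection on the full dataset before cross-validating.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
print("Leaky CV accuracy:   ", cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean())
```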

Meta-Analysis Pitfalls

Pitfall 41: Apples and Oranges

Problem: Combining studies with different designs, populations, or measures.

Balance: Need homogeneity but also comprehensiveness.

Solutions:

  • Clear inclusion criteria
  • Subgroup analyses
  • Meta-regression for moderators

Pitfall 42: Ignoring Publication Bias

Problem: Published studies overrepresent significant results.

Consequences: Overestimated effects in meta-analyses.

Detection:

  • Funnel plots
  • Trim-and-fill
  • PET-PEESE
  • P-curve analysis

Solutions:

  • Include unpublished studies
  • Register reviews
  • Use bias-correction methods

General Best Practices

  1. Preregister studies - Distinguish confirmatory from exploratory
  2. Report transparently - All analyses, not just significant ones
  3. Check assumptions - Don't blindly apply tests
  4. Use appropriate tests - Match test to data and design
  5. Report effect sizes - Not just p-values
  6. Consider practical significance - Not just statistical
  7. Replicate findings - One study is rarely definitive
  8. Share data and code - Enable verification
  9. Use confidence intervals - Show uncertainty
  10. Be careful with causal claims - Most research is correlational