Effect Sizes and Power Analysis

This document provides guidance on calculating, interpreting, and reporting effect sizes, as well as conducting power analyses for study planning.

Why Effect Sizes Matter

  1. Statistical significance ≠ practical significance: a p-value indicates whether an effect is statistically detectable, not how large or important it is
  2. Sample size dependent: With large samples, trivial effects become "significant"
  3. Interpretation: Effect sizes provide magnitude and practical importance
  4. Meta-analysis: Effect sizes enable combining results across studies
  5. Power analysis: Required for sample size determination

Golden rule: ALWAYS report effect sizes alongside p-values.


Effect Sizes by Analysis Type

T-Tests and Mean Differences

Cohen's d (Standardized Mean Difference)

Formula:

  • Independent groups: d = (M₁ - M₂) / SD_pooled
  • Paired groups: d = M_diff / SD_diff

Interpretation (Cohen, 1988):

  • Small: |d| = 0.20
  • Medium: |d| = 0.50
  • Large: |d| = 0.80

Context-dependent interpretation:

  • In education: d = 0.40 is typical for successful interventions
  • In psychology: d = 0.40 is considered meaningful
  • In medicine: Small effect sizes can be clinically important

Python calculation:

import pingouin as pg
import numpy as np

# Independent t-test with effect size
result = pg.ttest(group1, group2, correction=False)
cohens_d = result['cohen-d'].values[0]

# Manual calculation (pooled SD weighted by degrees of freedom)
n1, n2 = len(group1), len(group2)
mean_diff = np.mean(group1) - np.mean(group2)
pooled_var = ((n1 - 1) * np.var(group1, ddof=1) + (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2)
cohens_d = mean_diff / np.sqrt(pooled_var)

# Paired t-test
result = pg.ttest(pre, post, paired=True)
cohens_d = result['cohen-d'].values[0]

Confidence intervals for d:

from pingouin import compute_effsize_from_t, compute_esci

# Effect size from a t statistic, then its 95% CI
d = compute_effsize_from_t(t_statistic, nx=n1, ny=n2, eftype='cohen')
ci_lower, ci_upper = compute_esci(stat=d, nx=n1, ny=n2, eftype='cohen', confidence=0.95)

Hedges' g (Bias-Corrected d)

Why use it: Cohen's d has slight upward bias with small samples (n < 20)

Formula: g = d × correction_factor, where correction_factor = 1 - 3/(4df - 1)

Python calculation:

# pingouin computes Hedges' g directly from the raw data
hedges_g = pg.compute_effsize(group1, group2, eftype='hedges')
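For reference, a minimal manual sketch of the bias correction, assuming Cohen's d and the two group sizes are already in hand:

# Small-sample correction factor, with df = n1 + n2 - 2
correction = 1 - 3 / (4 * (n1 + n2 - 2) - 1)
hedges_g = cohens_d * correction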

Use Hedges' g when:

  • Sample sizes are small (n < 20 per group)
  • Conducting meta-analyses (standard in meta-analysis)

Glass's Δ (Delta)

When to use: When one group is a control with known variability

Formula: Δ = (M₁ - M₂) / SD_control

Use cases:

  • Clinical trials (use control group SD)
  • When treatment affects variability
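
Python calculation (a minimal sketch, assuming treatment and control are arrays of scores and the control group supplies the SD):

import numpy as np

glass_delta = (np.mean(treatment) - np.mean(control)) / np.std(control, ddof=1)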

ANOVA

Eta-squared (η²)

What it measures: Proportion of total variance explained by factor

Formula: η² = SS_effect / SS_total

Interpretation:

  • Small: η² = 0.01 (1% of variance)
  • Medium: η² = 0.06 (6% of variance)
  • Large: η² = 0.14 (14% of variance)

Limitation: In multi-factor designs, each factor's η² depends on how much variance the other factors explain, so values shrink as factors are added and are hard to compare across studies

Python calculation:

import pingouin as pg

# One-way ANOVA (detailed=True returns the sums of squares)
aov = pg.anova(dv='value', between='group', data=df, detailed=True)
eta_squared = aov['SS'][0] / aov['SS'].sum()

# Or read it from pingouin's output directly
# (for a one-way ANOVA, partial eta-squared equals eta-squared)
eta_squared = aov['np2'][0]

Partial Eta-squared (η²_p)

What it measures: Proportion of variance explained by factor, excluding other factors

Formula: η²_p = SS_effect / (SS_effect + SS_error)

Interpretation: Same benchmarks as η²

When to use: Multi-factor ANOVA (standard in factorial designs)

Python calculation:

aov = pg.anova(dv='value', between=['factor1', 'factor2'], data=df)
# pingouin reports partial eta-squared by default
partial_eta_sq = aov['np2']

Omega-squared (ω²)

What it measures: Less biased estimate of population variance explained

Why use it: η² overestimates effect size; ω² provides better population estimate

Formula: ω² = (SS_effect - df_effect × MS_error) / (SS_total + MS_error)

Interpretation: Same benchmarks as η², but typically smaller values

Python calculation:

def omega_squared(aov_table):
    """Omega-squared for the first factor in a detailed ANOVA table
    (e.g., pg.anova(..., detailed=True)), with the residual in the last row."""
    ss_effect = aov_table.loc[0, 'SS']
    ss_total = aov_table['SS'].sum()
    ms_error = aov_table.loc[aov_table.index[-1], 'MS']  # Residual (Within) MS
    df_effect = aov_table.loc[0, 'DF']

    omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    return omega_sq

Cohen's f

What it measures: Effect size for ANOVA (analogous to Cohen's d)

Formula: f = √(η² / (1 - η²))

Interpretation:

  • Small: f = 0.10
  • Medium: f = 0.25
  • Large: f = 0.40

Python calculation:

eta_squared = 0.06  # From ANOVA
cohens_f = np.sqrt(eta_squared / (1 - eta_squared))

Use in power analysis: Required for ANOVA power calculations
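
For example, an η² from the ANOVA can be converted to f and passed to the statsmodels power routine used in the Power Analysis section below (a sketch; solve_power returns the total sample size across groups):

import numpy as np
from statsmodels.stats.power import FTestAnovaPower

eta_squared = 0.06
cohens_f = np.sqrt(eta_squared / (1 - eta_squared))

# Total N needed for a 3-group ANOVA at alpha = .05 and power = .80
n_total = FTestAnovaPower().solve_power(
    effect_size=cohens_f, k_groups=3, alpha=0.05, power=0.80
)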


Correlation

Pearson's r / Spearman's ρ

Interpretation:

  • Small: |r| = 0.10
  • Medium: |r| = 0.30
  • Large: |r| = 0.50

Important notes:

  • r² = coefficient of determination (proportion of variance explained)
  • r = 0.30 means 9% shared variance (0.30² = 0.09)
  • Consider direction (positive/negative) and context

Python calculation:

import pingouin as pg

# Pearson correlation with 95% CI
result = pg.corr(x, y, method='pearson')
r = result['r'].values[0]
ci_lower, ci_upper = result['CI95%'].values[0]

# Spearman correlation
result = pg.corr(x, y, method='spearman')
rho = result['r'].values[0]

Regression

R² (Coefficient of Determination)

What it measures: Proportion of variance in Y explained by model

Interpretation:

  • Small: R² = 0.02
  • Medium: R² = 0.13
  • Large: R² = 0.26

Context-dependent:

  • Physical sciences: R² > 0.90 expected
  • Social sciences: R² > 0.30 considered good
  • Behavior prediction: R² > 0.10 may be meaningful

Python calculation:

import statsmodels.api as sm

# Using statsmodels (add an intercept explicitly)
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
r_squared = model.rsquared
adjusted_r_squared = model.rsquared_adj

# Manual: 1 - SS_residual / SS_total
r_squared_manual = 1 - model.ssr / model.centered_tss

Adjusted R²

Why use it: R² artificially increases when adding predictors; adjusted R² penalizes model complexity

Formula: R²_adj = 1 - (1 - R²) × (n - 1) / (n - k - 1)

When to use: Always report alongside R² for multiple regression
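
A minimal sketch of the formula (using the illustrative values R² = .48, n = 150, k = 3 from the reporting example later in this document):

def adjusted_r_squared(r_squared, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

adj = adjusted_r_squared(0.48, n=150, k=3)  # ≈ .47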


Standardized Regression Coefficients (β)

What it measures: Effect of one-SD change in predictor on outcome (in SD units)

Interpretation: Similar to Cohen's d

  • Small: |β| = 0.10
  • Medium: |β| = 0.30
  • Large: |β| = 0.50

Python calculation:

import statsmodels.api as sm

# Standardize predictors and outcome (z-scores)
X_std = (X - X.mean()) / X.std()
y_std = (y - y.mean()) / y.std()

# Standardized variables have mean zero, so no intercept is needed
model = sm.OLS(y_std, X_std).fit()
beta = model.params

f² (Cohen's f-squared for Regression)

What it measures: Effect size for individual predictors or model comparison

Formula: f² = (R²_AB - R²_A) / (1 - R²_AB)

Where:

  • R²_AB = R² for full model with predictor
  • R²_A = R² for reduced model without predictor

Interpretation:

  • Small: f² = 0.02
  • Medium: f² = 0.15
  • Large: f² = 0.35

Python calculation:

import statsmodels.api as sm

# Compare two nested models (X_reduced omits the predictor of interest)
model_full = sm.OLS(y, X_full).fit()
model_reduced = sm.OLS(y, X_reduced).fit()

r2_full = model_full.rsquared
r2_reduced = model_reduced.rsquared

f_squared = (r2_full - r2_reduced) / (1 - r2_full)

Categorical Data Analysis

Cramér's V

What it measures: Association strength for χ² test (works for any table size)

Formula: V = √(χ² / (n × (k - 1)))

Where k = min(rows, columns)

Interpretation (for tables where the smaller dimension k = 3; benchmarks shrink for larger tables):

  • Small: V = 0.07
  • Medium: V = 0.21
  • Large: V = 0.35

For 2×2 tables: Use phi coefficient (φ)

Python calculation:

from scipy.stats.contingency import association

# Cramér's V
cramers_v = association(contingency_table, method='cramer')

# Phi coefficient (for 2x2 tables, |phi| equals Cramér's V)
phi = association(contingency_table, method='cramer')

Odds Ratio (OR) and Risk Ratio (RR)

For 2×2 contingency tables:

            Outcome +    Outcome -
Exposed         a            b
Unexposed       c            d

Odds Ratio: OR = (a/b) / (c/d) = ad / bc

Interpretation:

  • OR = 1: No association
  • OR > 1: Positive association (increased odds)
  • OR < 1: Negative association (decreased odds)
  • OR = 2: Twice the odds
  • OR = 0.5: Half the odds

Risk Ratio: RR = (a/(a+b)) / (c/(c+d))

When to use:

  • Cohort studies: Use RR (more interpretable)
  • Case-control studies: Use OR (RR not available)
  • Logistic regression: OR is natural output

Python calculation:

import numpy as np
from scipy import stats
import statsmodels.api as sm

# From contingency table
odds_ratio = (a * d) / (b * c)

# Exact test (returns the sample OR and a p-value)
table = np.array([[a, b], [c, d]])
oddsratio, pvalue = stats.fisher_exact(table)

# From logistic regression
model = sm.Logit(y, X).fit()
odds_ratios = np.exp(model.params)  # Exponentiate coefficients
ci = np.exp(model.conf_int())  # Exponentiate CIs
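
The risk ratio is not computed above; a minimal sketch from the same 2×2 cell counts:

# Risk (proportion with the outcome) in each exposure group
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
risk_ratio = risk_exposed / risk_unexposed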

Bayesian Effect Sizes

Bayes Factor (BF)

What it measures: Ratio of evidence for alternative vs. null hypothesis

Interpretation:

  • BF₁₀ = 1: Data equally likely under H₁ and H₀
  • BF₁₀ = 3: Data are 3× more likely under H₁ than H₀ (moderate evidence)
  • BF₁₀ = 10: Data are 10× more likely under H₁ than H₀ (strong evidence)
  • BF₁₀ = 100: Data are 100× more likely under H₁ than H₀ (decisive evidence)
  • BF₁₀ = 0.33: Data are 3× more likely under H₀ than H₁
  • BF₁₀ = 0.10: Data are 10× more likely under H₀ than H₁

Classification (Jeffreys, 1961):

  • 1-3: Anecdotal evidence
  • 3-10: Moderate evidence
  • 10-30: Strong evidence
  • 30-100: Very strong evidence
  • > 100: Decisive evidence

Python calculation:

import pingouin as pg

# Bayesian t-test: pingouin's ttest output includes a Bayes factor (BF10)
result = pg.ttest(group1, group2, correction=False)
bf10 = float(result['BF10'].values[0])

# For other priors or Bayesian ANOVA, use JASP or the BayesFactor R package (via rpy2)

Power Analysis

Concepts

Statistical power: Probability of detecting an effect if it exists (1 - β)

Conventional standards:

  • Power = 0.80 (80% chance of detecting effect)
  • α = 0.05 (5% Type I error rate)

Four interconnected parameters (given any three, the fourth can be solved for; see the sketch after this list):

  1. Sample size (n)
  2. Effect size (d, f, etc.)
  3. Significance level (α)
  4. Power (1 - β)
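
As a sketch of the "solve for the fourth" idea, fixing n, effect size, and α and solving for achieved power with the statsmodels routine used below (assuming a two-sided independent-samples t-test):

from statsmodels.stats.power import tt_ind_solve_power

# Power achieved with n = 64 per group at d = 0.5 and alpha = .05 (≈ .80)
achieved_power = tt_ind_solve_power(
    effect_size=0.5, nobs1=64, alpha=0.05, power=None,
    ratio=1.0, alternative='two-sided'
)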

A Priori Power Analysis (Planning)

Purpose: Determine required sample size before study

Steps:

  1. Specify expected effect size (from literature, pilot data, or minimum meaningful effect)
  2. Set α level (typically 0.05)
  3. Set desired power (typically 0.80)
  4. Calculate required n

Python implementation:

from statsmodels.stats.power import tt_ind_solve_power, FTestAnovaPower

# T-test power analysis (nobs1 is omitted, so it is solved for)
n_required = tt_ind_solve_power(
    effect_size=0.5,  # Cohen's d
    alpha=0.05,
    power=0.80,
    ratio=1.0,  # Equal group sizes
    alternative='two-sided'
)

# ANOVA power analysis (solve_power returns the TOTAL sample size)
anova_power = FTestAnovaPower()
n_total = anova_power.solve_power(
    effect_size=0.25,  # Cohen's f
    k_groups=3,
    alpha=0.05,
    power=0.80
)
n_per_group = n_total / 3

# Correlation power analysis
from pingouin import power_corr
n_required = power_corr(r=0.30, power=0.80, alpha=0.05)

Post Hoc Power Analysis (After Study)

⚠️ CAUTION: Post hoc power is controversial and often not recommended

Why it's problematic:

  • Observed power is a direct function of p-value
  • If p > 0.05, observed power is necessarily low (below roughly .50)
  • Provides no additional information beyond p-value
  • Can be misleading

When it might be acceptable:

  • Study planning for future research
  • Using effect size from multiple studies (not just your own)
  • Explicit goal is sample size for replication

Better alternatives:

  • Report confidence intervals for effect sizes
  • Conduct sensitivity analysis
  • Report minimum detectable effect size

Sensitivity Analysis

Purpose: Determine minimum detectable effect size given study parameters

When to use: After study is complete, to understand study's capability

Python implementation:

# What effect size could we detect with n=50 per group?
detectable_effect = tt_ind_solve_power(
    effect_size=None,  # Solve for this
    nobs1=50,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)

print(f"With n=50 per group, we could detect d ≥ {detectable_effect:.2f}")

Reporting Effect Sizes

APA Style Guidelines

T-test example:

"Group A (M = 75.2, SD = 8.5) scored significantly higher than Group B (M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77, 95% CI [0.36, 1.18]."

ANOVA example:

"There was a significant main effect of treatment condition on test scores, F(2, 87) = 8.45, p < .001, η²p = .16. Post hoc comparisons using Tukey's HSD revealed..."

Correlation example:

"There was a moderate positive correlation between study time and exam scores, r(148) = .42, p < .001, 95% CI [.27, .55]."

Regression example:

"The regression model significantly predicted exam scores, F(3, 146) = 45.2, p < .001, R² = .48. Study hours (β = .52, p < .001) and prior GPA (β = .31, p < .001) were significant predictors."

Bayesian example:

"A Bayesian independent samples t-test provided strong evidence for a difference between groups, BF₁₀ = 23.5, indicating the data are 23.5 times more likely under H₁ than H₀."


Effect Size Pitfalls

  1. Don't only rely on benchmarks: Context matters; small effects can be meaningful
  2. Report confidence intervals: CIs show precision of effect size estimate
  3. Distinguish statistical vs. practical significance: Large n can make trivial effects "significant"
  4. Consider cost-benefit: Even small effects may be valuable if intervention is low-cost
  5. Multiple outcomes: Effect sizes vary across outcomes; report all
  6. Don't cherry-pick: Report effects for all planned analyses
  7. Publication bias: Published effects are often overestimated

Quick Reference Table

Analysis            Effect Size    Small   Medium   Large
T-test              Cohen's d      0.20    0.50     0.80
ANOVA               η², ω²         0.01    0.06     0.14
ANOVA               Cohen's f      0.10    0.25     0.40
Correlation         r, ρ           0.10    0.30     0.50
Regression          R²             0.02    0.13     0.26
Regression          f²             0.02    0.15     0.35
Chi-square          Cramér's V     0.07    0.21     0.35
Chi-square (2×2)    φ              0.10    0.30     0.50

Resources

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum.
  • Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
  • Ellis, P. D. (2010). The Essential Guide to Effect Sizes. Cambridge University Press.