Evidence Hierarchy and Quality Assessment

Traditional Evidence Hierarchy (Medical/Clinical)

Level 1: Systematic Reviews and Meta-Analyses

Description: Comprehensive synthesis of all available evidence on a question.

Strengths:

  • Combines multiple studies for greater power
  • Reduces impact of single-study anomalies
  • Can identify patterns across studies
  • Quantifies overall effect size

Weaknesses:

  • Quality depends on included studies ("garbage in, garbage out")
  • Publication bias can distort findings
  • Heterogeneity may make pooling inappropriate
  • Can mask important differences between studies

Critical evaluation:

  • Was search comprehensive (multiple databases, grey literature)?
  • Were inclusion criteria appropriate and prespecified?
  • Was study quality assessed?
  • Was heterogeneity explored?
  • Was publication bias assessed (funnel plots, fail-safe N)?
  • Were appropriate statistical methods used?
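
The pooling and heterogeneity questions above can be made concrete with a small calculation. What follows is a minimal Python sketch, assuming each study reports an effect estimate (for example a log odds ratio) and its standard error. The function name and the example numbers are hypothetical, and fixed-effect inverse-variance pooling with Cochran's Q and I² is only one common choice, not a method prescribed by this reference.

```python
import math

def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2 (in percent)."""
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "pooled": pooled,
        "se": pooled_se,
        "ci95": (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se),
        "Q": q,
        "I2": i2,
    }

# Hypothetical log odds ratios and standard errors from four studies.
result = pool_fixed_effect([0.10, 0.30, 0.25, 0.60], [0.10, 0.15, 0.20, 0.12])
print(result)
```

A high I² suggests that a single pooled number may hide real differences between studies, which is the situation the "Heterogeneity may make pooling inappropriate" weakness describes. A random-effects model is often considered in that case.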

Level 2: Randomized Controlled Trials (RCTs)

Description: Experimental studies with random assignment to conditions.

Strengths:

  • Gold standard for establishing causation
  • Balances known and unknown confounders across groups, on average
  • Minimizes selection bias
  • Enables causal inference

Weaknesses:

  • May not be ethical or feasible
  • Artificial settings may limit generalizability
  • Often short-term with selected populations
  • Expensive and time-consuming

Critical evaluation:

  • Was randomization adequate (sequence generation, allocation concealment)?
  • Was blinding implemented (participants, providers, assessors)?
  • Was sample size adequate (power analysis)?
  • Was intention-to-treat analysis used?
  • Was attrition rate acceptable and balanced?
  • Are results generalizable?
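
To illustrate the "Was sample size adequate (power analysis)?" question, here is a minimal Python sketch of a sample size calculation. It assumes a two-arm trial comparing means with equal group sizes, a two-sided test, and a standardized effect size (Cohen's d), and it uses the normal approximation rather than the t distribution. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per arm (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

# A medium standardized effect (d = 0.5) at alpha = 0.05 and 80% power.
print(n_per_group(0.5))  # 63
```

Because of the normal approximation, dedicated power software that uses the t distribution will typically report a slightly larger number. A trial that enrolled far fewer participants than a calculation like this suggests is a reason to treat a null result cautiously.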

Level 3: Cohort Studies

Description: Observational studies following groups over time.

Types:

  • Prospective: Follow forward from exposure to outcome
  • Retrospective: Look backward at existing data

Strengths:

  • Can study multiple outcomes
  • Establishes temporal sequence
  • Can calculate incidence and relative risk
  • More feasible than RCTs for many questions

Weaknesses:

  • Susceptible to confounding
  • Selection bias possible
  • Attrition can bias results
  • Cannot prove causation definitively

Critical evaluation:

  • Were cohorts comparable at baseline?
  • Was exposure measured reliably?
  • Was follow-up adequate and complete?
  • Were potential confounders measured and controlled?
  • Was outcome assessment blinded to exposure?
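
Since cohort studies can calculate incidence and relative risk, a short worked example may help. The sketch below is a minimal Python illustration, assuming a simple 2x2 table of counts with no adjustment for confounders; the function name and the counts are hypothetical, and the confidence interval uses the standard log-scale approximation.

```python
import math

def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Relative risk with an approximate 95% CI from cohort counts."""
    risk_exposed = exposed_events / exposed_total        # incidence in exposed
    risk_unexposed = unexposed_events / unexposed_total  # incidence in unexposed
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) for a 2x2 cohort table.
    se_log_rr = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, (lower, upper)

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop the outcome.
rr, ci = relative_risk(30, 100, 10, 100)
print(f"RR = {rr:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

This crude estimate ignores confounding, so it answers only part of the critical evaluation list above; adjusted analyses are usually needed before interpreting the result causally.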

Level 4: Case-Control Studies

Description: Compares people with the outcome (cases) to those without it (controls), looking back at prior exposures.

Strengths:

  • Efficient for rare outcomes
  • Relatively quick and inexpensive
  • Can study multiple exposures
  • Useful for generating hypotheses

Weaknesses:

  • Cannot calculate incidence or relative risk directly (odds ratios are estimated instead)
  • Susceptible to recall bias
  • Selection of controls is challenging
  • Cannot prove causation

Critical evaluation:

  • Were cases and controls defined clearly?
  • Were controls appropriate (same source population)?
  • Was matching appropriate?
  • How was exposure ascertained (records vs. recall)?
  • Were potential confounders controlled?
  • Could recall bias explain findings?
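
Case-control data are usually summarized with an odds ratio rather than a relative risk. The following is a minimal Python sketch, assuming an unmatched design and a plain 2x2 table of counts; the function and argument names are hypothetical, and the interval is the usual log-scale (Woolf) approximation.

```python
import math

def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds ratio with an approximate 95% CI from unmatched case-control counts."""
    or_estimate = (cases_exposed * controls_unexposed) / (
        cases_unexposed * controls_exposed
    )
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / cases_exposed + 1 / cases_unexposed
        + 1 / controls_exposed + 1 / controls_unexposed
    )
    lower = math.exp(math.log(or_estimate) - 1.96 * se_log_or)
    upper = math.exp(math.log(or_estimate) + 1.96 * se_log_or)
    return or_estimate, (lower, upper)

# Hypothetical study: 40 of 100 cases and 20 of 100 controls were exposed.
estimate, ci = odds_ratio(40, 60, 20, 80)
print(f"OR = {estimate:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Matched designs generally call for a matched analysis (for example conditional logistic regression), which this sketch does not cover.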

Level 5: Cross-Sectional Studies

Description: Snapshot observation at a single point in time.

Strengths:

  • Quick and inexpensive
  • Can assess prevalence
  • Useful for hypothesis generation
  • Can study multiple outcomes and exposures

Weaknesses:

  • Cannot establish temporal sequence
  • Cannot determine causation
  • Prevalence-incidence bias
  • Survival bias

Critical evaluation:

  • Was sample representative?
  • Were measures validated?
  • Could reverse causation explain findings?
  • Are confounders acknowledged?
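
Because the main quantity a cross-sectional study can estimate is prevalence, it helps to report that estimate with its uncertainty. Here is a minimal Python sketch, assuming a simple random sample; the function name and the survey counts are hypothetical, and the Wilson score interval is one reasonable choice among several.

```python
import math

def prevalence_wilson(cases, sample_size, z=1.96):
    """Point prevalence with a Wilson score confidence interval."""
    p = cases / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2))
    ) / denom
    return p, (center - half_width, center + half_width)

# Hypothetical survey: 50 of 100 respondents report the condition.
p, ci = prevalence_wilson(50, 100)
print(f"prevalence = {p:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval reflects sampling error only. It does not correct for an unrepresentative sample or for unvalidated measures, which the checklist above asks about separately.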

Level 6: Case Series and Case Reports

Description: Description of observations in clinical practice.

Strengths:

  • Can identify new diseases or effects
  • Hypothesis-generating
  • Details rare phenomena
  • Quick to report

Weaknesses:

  • No control group
  • No statistical inference possible
  • Highly susceptible to bias
  • Cannot establish causation or frequency

Use: Primarily for hypothesis generation and clinical description.

Level 7: Expert Opinion

Description: Statements by recognized authorities.

Strengths:

  • Synthesizes experience
  • Useful when no research available
  • May integrate multiple sources

Weaknesses:

  • Subjective and potentially biased
  • May not reflect current evidence
  • Risk of appeal-to-authority fallacy
  • Individual expertise varies

Use: Lowest level of evidence; should be supported by data when possible.

Nuances and Limitations of Traditional Hierarchy

When Lower-Level Evidence Can Be Strong

  1. Well-designed observational studies with:

    • Large effects (hard to confound)
    • Dose-response relationships
    • Consistent findings across contexts
    • Biological plausibility
    • No plausible confounders
  2. Multiple converging lines of evidence from different study types

  3. Natural experiments approximating randomization

When Higher-Level Evidence Can Be Weak

  1. Poor-quality RCTs with:

    • Inadequate randomization
    • High attrition
    • No blinding even though blinding was feasible
    • Conflicts of interest
  2. Biased meta-analyses:

    • Publication bias
    • Selective inclusion
    • Inappropriate pooling
    • Poor search strategy
  3. Not addressing the right question:

    • Wrong population
    • Wrong comparison
    • Wrong outcome
    • Too artificial to generalize

Alternative: GRADE System

GRADE (Grading of Recommendations Assessment, Development and Evaluation) assesses evidence quality across four levels:

High Quality

Definition: Very confident that true effect is close to estimated effect.

Characteristics:

  • Well-conducted RCTs
  • Overwhelming evidence from observational studies
  • Large, consistent effects
  • No serious limitations

Moderate Quality

Definition: Moderately confident; true effect likely close to estimated, but could be substantially different.

Downgrades from high:

  • Some risk of bias
  • Inconsistency across studies
  • Indirectness (different populations/interventions)
  • Imprecision (wide confidence intervals)
  • Publication bias suspected

Low Quality

Definition: Limited confidence; true effect may be substantially different.

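Typical sources:

  • Serious limitations in the factors listed under Moderate Quality (risk of bias, inconsistency, indirectness, imprecision, publication bias)
  • Observational studies without special strengths (the GRADE starting point for observational designs)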

Very Low Quality

Definition: Very limited confidence; true effect likely substantially different.

Characteristics:

  • Very serious limitations
  • Expert opinion
  • Multiple serious flaws

Study Quality Assessment Criteria

Internal Validity (Bias Control)

Questions:

  • Was randomization adequate?
  • Was allocation concealed?
  • Were groups similar at baseline?
  • Was blinding implemented?
  • Was attrition minimal and balanced?
  • Was intention-to-treat used?
  • Were all outcomes reported?

External Validity (Generalizability)

Questions:

  • Is sample representative of target population?
  • Are inclusion/exclusion criteria too restrictive?
  • Is setting realistic?
  • Are results applicable to other populations?
  • Are effects consistent across subgroups?

Statistical Conclusion Validity

Questions:

  • Was sample size adequate (power)?
  • Were statistical tests appropriate?
  • Were assumptions checked?
  • Were effect sizes and confidence intervals reported?
  • Were multiple comparisons addressed?
  • Was analysis prespecified?
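
As an illustration of the "Were multiple comparisons addressed?" question, the following minimal Python sketch applies the Holm-Bonferroni step-down procedure to a set of p-values. Holm's method is chosen here only as one widely used correction; the function name and the example p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject

# Hypothetical p-values from four outcomes in the same study.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

In this example two results that look significant at an uncorrected 0.05 threshold (0.04 and 0.03) no longer pass after correction, which is the pattern to watch for when a paper reports many outcomes without adjustment.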

Construct Validity (Measurement)

Questions:

  • Were measures validated and reliable?
  • Was outcome defined clearly and appropriately?
  • Were assessors blinded?
  • Were exposures measured accurately?
  • Was timing of measurement appropriate?

Critical Appraisal Tools

For Different Study Types

RCTs:

  • Cochrane Risk of Bias Tool
  • Jadad Scale
  • PEDro Scale (for trials in physical therapy)

Observational Studies:

  • Newcastle-Ottawa Scale
  • ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions)

Diagnostic Studies:

  • QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies)

Systematic Reviews:

  • AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews)

All Study Types:

  • CASP Checklists (Critical Appraisal Skills Programme)

Domain-Specific Considerations

Basic Science Research

Hierarchy differs:

  1. Multiple convergent lines of evidence
  2. Mechanistic understanding
  3. Reproducible experiments
  4. Established theoretical framework

Key considerations:

  • Replication essential
  • Mechanistic plausibility
  • Consistency across model systems
  • Convergence of methods

Psychological Research

Additional concerns:

  • Replication crisis
  • Publication bias particularly problematic
  • Small effect sizes often expected
  • Cultural context matters
  • Measures often indirect (self-report)

Strong evidence includes:

  • Preregistered studies
  • Large samples
  • Multiple measures
  • Behavioral (not just self-report) outcomes
  • Cross-cultural replication

Epidemiology

Causal inference frameworks:

  • Bradford Hill criteria
  • Rothman's causal pies
  • Directed Acyclic Graphs (DAGs)

Strong observational evidence:

  • Dose-response relationships
  • Temporality (exposure precedes outcome)
  • Biological plausibility
  • Specificity
  • Consistency across populations
  • Large effects unlikely to be due to confounding

Social Sciences

Challenges:

  • Complex interventions
  • Context-dependent effects
  • Measurement challenges
  • Ethical constraints on RCTs

Strengthening evidence:

  • Mixed methods
  • Natural experiments
  • Instrumental variables
  • Regression discontinuity designs
  • Multiple operationalizations

Synthesizing Evidence Across Studies

Consistency

Strong evidence:

  • Multiple studies, different investigators
  • Different populations and settings
  • Different research designs converge
  • Different measurement methods

Weak evidence:

  • Single study
  • Only one research group
  • Conflicting results
  • Publication bias evident

Biological/Theoretical Plausibility

Strengthens evidence:

  • Known mechanism
  • Consistent with other knowledge
  • Dose-response relationship
  • Coherent with animal/in vitro data

Weakens evidence:

  • No plausible mechanism
  • Contradicts established knowledge
  • Biological implausibility

Temporality

Essential for causation:

  • Cause must precede effect
  • Cross-sectional studies cannot establish temporal order
  • Reverse causation must be ruled out

Specificity

Moderate indicator:

  • Specific cause → specific effect strengthens causation
  • But lack of specificity doesn't rule out causation
  • Most causes have multiple effects

Strength of Association

Strong evidence:

  • Large effects unlikely to be due to confounding
  • Dose-response relationships
  • All-or-none effects

Caution:

  • Small effects may still be real
  • Large effects can still be confounded

Red Flags in Evidence Quality

Study Design Red Flags

  • No control group
  • Self-selected participants
  • No randomization even though it was feasible
  • No blinding even though it was feasible
  • Very small sample
  • Inappropriate statistical tests

Reporting Red Flags

  • Selective outcome reporting
  • No study registration/protocol
  • Missing methodological details
  • No conflicts of interest statement
  • Cherry-picked citations
  • Results don't match methods

Interpretation Red Flags

  • Causal language from correlational data
  • Claiming "proof"
  • Ignoring limitations
  • Overgeneralizing
  • Spinning negative results
  • Post hoc rationalization

Context Red Flags

  • Industry funding without independence
  • Single study in isolation
  • Contradicts preponderance of evidence
  • No replication
  • Published in predatory journal
  • Press release before peer review

Practical Decision Framework

When Evaluating Evidence, Ask:

  1. What type of study is this? (Design)
  2. How well was it conducted? (Quality)
  3. What does it actually show? (Results)
  4. How likely is bias? (Internal validity)
  5. Does it apply to my question? (External validity)
  6. How does it fit with other evidence? (Context)
  7. Are the conclusions justified? (Interpretation)
  8. What are the limitations? (Uncertainty)

Making Decisions with Imperfect Evidence

High-quality evidence:

  • Strong confidence in acting on findings
  • Reasonable to change practice/policy

Moderate-quality evidence:

  • Provisional conclusions
  • Consider in conjunction with other factors
  • May warrant action depending on stakes

Low-quality evidence:

  • Weak confidence
  • Hypothesis-generating
  • Insufficient for major decisions alone
  • Consider cost/benefit of waiting for better evidence

Very low-quality evidence:

  • Very uncertain
  • Should not drive decisions alone
  • Useful for identifying gaps and research needs

When Evidence is Conflicting

Strategies:

  1. Weight by study quality
  2. Look for systematic differences (population, methods)
  3. Consider publication bias
  4. Update with most recent, rigorous evidence
  5. Conduct/await systematic review
  6. Consider whether the question is well-formed

Communicating Evidence Strength

Avoid:

  • Absolute certainty ("proves")
  • False balance (equal weight to unequal evidence)
  • Ignoring uncertainty
  • Cherry-picking studies

Better:

  • Quantify uncertainty
  • Describe strength of evidence
  • Acknowledge limitations
  • Present range of evidence
  • Distinguish established from emerging findings
  • Be clear about what is/isn't known