Evidence Hierarchy and Quality Assessment
Traditional Evidence Hierarchy (Medical/Clinical)
Level 1: Systematic Reviews and Meta-Analyses
Description: Comprehensive synthesis of all available evidence on a question.
Strengths:
- Combines multiple studies for greater power
- Reduces impact of single-study anomalies
- Can identify patterns across studies
- Quantifies overall effect size
Weaknesses:
- Quality depends on included studies ("garbage in, garbage out")
- Publication bias can distort findings
- Heterogeneity may make pooling inappropriate
- Can mask important differences between studies
Critical evaluation:
- Was search comprehensive (multiple databases, grey literature)?
- Were inclusion criteria appropriate and prespecified?
- Was study quality assessed?
- Was heterogeneity explored?
- Was publication bias assessed (funnel plots, fail-safe N)?
- Were appropriate statistical methods used?
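The pooling and heterogeneity checks above can be sketched numerically. This is a minimal illustration, assuming a fixed-effect inverse-variance model; the function name `pool_fixed_effect` and the input data are hypothetical, and a real review would normally use a dedicated meta-analysis package and consider a random-effects model when heterogeneity is high.

```python
import math


def pool_fixed_effect(effects, variances):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2.

    effects:   per-study effect estimates (e.g. log odds ratios)
    variances: per-study sampling variances of those estimates
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to heterogeneity, in percent.
    i_squared = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0

    return {
        "pooled": pooled,
        "ci_95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "q": q,
        "i_squared": i_squared,
    }


# Hypothetical effect estimates and variances from three studies.
result = pool_fixed_effect([0.10, 0.30, 0.25], [0.04, 0.02, 0.05])
print(result)
```

A high `i_squared` suggests the studies may be too heterogeneous for a single pooled estimate to be meaningful.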
Level 2: Randomized Controlled Trials (RCTs)
Description: Experimental studies with random assignment to conditions.
Strengths:
- Gold standard for establishing causation
- Controls for known and unknown confounders
- Minimizes selection bias
- Enables causal inference
Weaknesses:
- May not be ethical or feasible
- Artificial settings may limit generalizability
- Often short-term with selected populations
- Expensive and time-consuming
Critical evaluation:
- Was randomization adequate (sequence generation, allocation concealment)?
- Was blinding implemented (participants, providers, assessors)?
- Was sample size adequate (power analysis)?
- Was intention-to-treat analysis used?
- Was attrition rate acceptable and balanced?
- Are results generalizable?
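To make the power-analysis question concrete, the sketch below estimates the per-group sample size for a two-arm trial comparing means. It assumes a two-sided test, equal group sizes, and the normal approximation (a t-based calculation gives slightly larger numbers); `n_per_group` is a hypothetical helper, not a standard API.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per arm for a two-sample comparison.

    effect_size: standardized mean difference (Cohen's d) to detect
    alpha:       two-sided type I error rate
    power:       desired probability of detecting the effect
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Compare a medium and a small standardized effect.
print(n_per_group(0.5))
print(n_per_group(0.2))
```

A trial enrolling far fewer participants than this rough figure is likely underpowered for the effect it claims to test.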
Level 3: Cohort Studies
Description: Observational studies following groups over time.
Types:
- Prospective: Follow forward from exposure to outcome
- Retrospective: Look backward at existing data
Strengths:
- Can study multiple outcomes
- Establishes temporal sequence
- Can calculate incidence and relative risk
- More feasible than RCTs for many questions
Weaknesses:
- Susceptible to confounding
- Selection bias possible
- Attrition can bias results
- Cannot prove causation definitively
Critical evaluation:
- Were cohorts comparable at baseline?
- Was exposure measured reliably?
- Was follow-up adequate and complete?
- Were potential confounders measured and controlled?
- Was outcome assessment blinded to exposure?
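Relative risk, mentioned above as a cohort-study measure, can be computed from a 2x2 table. The sketch assumes simple cumulative-incidence data with no adjustment for confounders and uses the usual log-scale normal approximation for the confidence interval; the function name and counts are hypothetical.

```python
import math


def relative_risk(exposed_events, exposed_total,
                  unexposed_events, unexposed_total):
    """Relative risk with an approximate 95% confidence interval."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) for cumulative-incidence data.
    se_log_rr = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, lower, upper


# Hypothetical cohort: 30/100 events among exposed, 10/100 among unexposed.
rr, lower, upper = relative_risk(30, 100, 10, 100)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

This is a crude estimate; a confounded cohort can produce a precise but biased relative risk, so the interval says nothing about bias.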
Level 4: Case-Control Studies
Description: Compare people with outcome (cases) to those without (controls), looking back at exposures.
Strengths:
- Efficient for rare outcomes
- Relatively quick and inexpensive
- Can study multiple exposures
- Useful for generating hypotheses
Weaknesses:
- Cannot calculate incidence
- Susceptible to recall bias
- Selection of controls is challenging
- Cannot prove causation
Critical evaluation:
- Were cases and controls defined clearly?
- Were controls appropriate (same source population)?
- Was matching appropriate?
- How was exposure ascertained (records vs. recall)?
- Were potential confounders controlled?
- Could recall bias explain findings?
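Because case-control studies cannot calculate incidence, their usual effect measure is the odds ratio. The sketch below assumes an unmatched design and uses the Woolf (log-scale) approximation for the interval; the function name and counts are hypothetical.

```python
import math


def odds_ratio(cases_exposed, cases_unexposed,
               controls_exposed, controls_unexposed):
    """Odds ratio with an approximate 95% confidence interval (Woolf method)."""
    or_estimate = (cases_exposed * controls_unexposed) / (
        cases_unexposed * controls_exposed
    )
    # Standard error of log(OR): square root of summed reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / cases_exposed + 1 / cases_unexposed
        + 1 / controls_exposed + 1 / controls_unexposed
    )
    lower = math.exp(math.log(or_estimate) - 1.96 * se_log_or)
    upper = math.exp(math.log(or_estimate) + 1.96 * se_log_or)
    return or_estimate, lower, upper


# Hypothetical study: 40 of 100 cases and 20 of 100 controls were exposed.
estimate, lower, upper = odds_ratio(40, 60, 20, 80)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A matched design would need a matched analysis (for example, conditional logistic regression) rather than this crude calculation.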
Level 5: Cross-Sectional Studies
Description: Snapshot observation at single point in time.
Strengths:
- Quick and inexpensive
- Can assess prevalence
- Useful for hypothesis generation
- Can study multiple outcomes and exposures
Weaknesses:
- Cannot establish temporal sequence
- Cannot determine causation
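- Prevalence-incidence (Neyman) bias: long-duration cases are over-represented
- Survival bias: people who died or recovered before sampling are missed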
Critical evaluation:
- Was sample representative?
- Were measures validated?
- Could reverse causation explain findings?
- Are confounders acknowledged?
Level 6: Case Series and Case Reports
Description: Description of observations in clinical practice.
Strengths:
- Can identify new diseases or effects
- Hypothesis-generating
- Details rare phenomena
- Quick to report
Weaknesses:
- No control group
- No statistical inference possible
- Highly susceptible to bias
- Cannot establish causation or frequency
Use: Primarily for hypothesis generation and clinical description.
Level 7: Expert Opinion
Description: Statements by recognized authorities.
Strengths:
- Synthesizes experience
- Useful when no research available
- May integrate multiple sources
Weaknesses:
- Subjective and potentially biased
- May not reflect current evidence
- Risk of appeal-to-authority fallacy
- Individual expertise varies
Use: Lowest level of evidence; should be supported by data when possible.
Nuances and Limitations of Traditional Hierarchy
When Lower-Level Evidence Can Be Strong
- Well-designed observational studies with:
  - Large effects (hard to confound)
  - Dose-response relationships
  - Consistent findings across contexts
  - Biological plausibility
  - No plausible confounders
- Multiple converging lines of evidence from different study types
- Natural experiments approximating randomization
When Higher-Level Evidence Can Be Weak
- Poor-quality RCTs with:
  - Inadequate randomization
  - High attrition
  - No blinding when feasible
  - Conflicts of interest
- Biased meta-analyses:
  - Publication bias
  - Selective inclusion
  - Inappropriate pooling
  - Poor search strategy
- Not addressing the right question:
  - Wrong population
  - Wrong comparison
  - Wrong outcome
  - Too artificial to generalize
Alternative: GRADE System
GRADE (Grading of Recommendations Assessment, Development and Evaluation) assesses evidence quality across four levels:
High Quality
Definition: Very confident that true effect is close to estimated effect.
Characteristics:
- Well-conducted RCTs
- Overwhelming evidence from observational studies
- Large, consistent effects
- No serious limitations
Moderate Quality
Definition: Moderately confident; true effect likely close to estimated, but could be substantially different.
Downgrades from high:
- Some risk of bias
- Inconsistency across studies
- Indirectness (different populations/interventions)
- Imprecision (wide confidence intervals)
- Publication bias suspected
Low Quality
Definition: Limited confidence; true effect may be substantially different.
Downgrades:
- Serious limitations in above factors
- Observational studies without special strengths
Very Low Quality
Definition: Very limited confidence; true effect likely substantially different.
Characteristics:
- Very serious limitations
- Expert opinion
- Multiple serious flaws
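The four levels can be read as a simple scoring scheme. The sketch below is a deliberately simplified illustration, assuming the common GRADE convention that bodies of RCT evidence start at high and observational evidence starts at low, with each serious concern moving the rating down one level and each special strength (such as a large effect or dose-response gradient) moving it up one. The function and its parameters are hypothetical; real GRADE assessments involve structured judgment, not arithmetic.

```python
LEVELS = {4: "high", 3: "moderate", 2: "low", 1: "very low"}


def grade_rating(design, downgrades=0, upgrades=0):
    """Map a study design plus rating adjustments to a GRADE-style level.

    design:     "rct" or "observational"
    downgrades: levels removed for risk of bias, inconsistency,
                indirectness, imprecision, or publication bias
    upgrades:   levels added for special strengths of observational evidence
    """
    starting_points = {"rct": 4, "observational": 2}
    if design not in starting_points:
        raise ValueError(f"unknown design: {design!r}")
    score = starting_points[design] - downgrades + upgrades
    # Clamp to the four defined levels.
    return LEVELS[max(1, min(4, score))]


print(grade_rating("rct", downgrades=1))
print(grade_rating("observational", upgrades=1))
```

The clamping step reflects that evidence cannot be rated above high or below very low, however many factors apply.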
Study Quality Assessment Criteria
Internal Validity (Bias Control)
Questions:
- Was randomization adequate?
- Was allocation concealed?
- Were groups similar at baseline?
- Was blinding implemented?
- Was attrition minimal and balanced?
- Was intention-to-treat used?
- Were all outcomes reported?
External Validity (Generalizability)
Questions:
- Is sample representative of target population?
- Are inclusion/exclusion criteria too restrictive?
- Is setting realistic?
- Are results applicable to other populations?
- Are effects consistent across subgroups?
Statistical Conclusion Validity
Questions:
- Was sample size adequate (power)?
- Were statistical tests appropriate?
- Were assumptions checked?
- Were effect sizes and confidence intervals reported?
- Were multiple comparisons addressed?
- Was analysis prespecified?
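One way to check whether multiple comparisons were addressed is to re-apply a correction to the reported p-values. The sketch below uses the Holm-Bonferroni step-down procedure as one reasonable choice (the text above does not name a specific method); the function name and p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Holm's step-down procedure controls the family-wise error rate and is
    never less powerful than the plain Bonferroni correction.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Once one test fails, all larger p-values fail too.
    return reject


# Four hypothetical p-values from one study.
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

Results that are nominally significant at 0.05 but do not survive a correction like this deserve extra caution.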
Construct Validity (Measurement)
Questions:
- Were measures validated and reliable?
- Was outcome defined clearly and appropriately?
- Were assessors blinded?
- Were exposures measured accurately?
- Was timing of measurement appropriate?
Critical Appraisal Tools
For Different Study Types
RCTs:
- Cochrane Risk of Bias Tool
- Jadad Scale
- PEDro Scale (for trials in physical therapy)
Observational Studies:
- Newcastle-Ottawa Scale
- ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions)
Diagnostic Studies:
- QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies)
Systematic Reviews:
- AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews)
All Study Types:
- CASP Checklists (Critical Appraisal Skills Programme)
Domain-Specific Considerations
Basic Science Research
Hierarchy differs; strength of evidence rests on:
- Multiple convergent lines of evidence
- Mechanistic understanding
- Reproducible experiments
- Established theoretical framework
Key considerations:
- Replication essential
- Mechanistic plausibility
- Consistency across model systems
- Convergence of methods
Psychological Research
Additional concerns:
- Replication crisis
- Publication bias particularly problematic
- Small effect sizes often expected
- Cultural context matters
- Measures often indirect (self-report)
Strong evidence includes:
- Preregistered studies
- Large samples
- Multiple measures
- Behavioral (not just self-report) outcomes
- Cross-cultural replication
Epidemiology
Causal inference frameworks:
- Bradford Hill criteria
- Rothman's causal pies
- Directed Acyclic Graphs (DAGs)
Strong observational evidence:
- Dose-response relationships
- Temporality (exposure precedes outcome)
- Biological plausibility
- Specificity
- Consistency across populations
- Large effects unlikely to be due to confounding
Social Sciences
Challenges:
- Complex interventions
- Context-dependent effects
- Measurement challenges
- Ethical constraints on RCTs
Strengthening evidence:
- Mixed methods
- Natural experiments
- Instrumental variables
- Regression discontinuity designs
- Multiple operationalizations
Synthesizing Evidence Across Studies
Consistency
Strong evidence:
- Multiple studies, different investigators
- Different populations and settings
- Different research designs converge
- Different measurement methods
Weak evidence:
- Single study
- Only one research group
- Conflicting results
- Publication bias evident
Biological/Theoretical Plausibility
Strengthens evidence:
- Known mechanism
- Consistent with other knowledge
- Dose-response relationship
- Coherent with animal/in vitro data
Weakens evidence:
- No plausible mechanism
- Contradicts established knowledge
- Biological implausibility
Temporality
Essential for causation:
- Cause must precede effect
- Cross-sectional studies cannot establish temporal order
- Reverse causation must be ruled out
Specificity
Moderate indicator:
- Specific cause → specific effect strengthens causation
- But lack of specificity doesn't rule out causation
- Most causes have multiple effects
Strength of Association
Strong evidence:
- Large effects unlikely to be due to confounding
- Dose-response relationships
- All-or-none effects
Caution:
- Small effects may still be real
- Large effects can still be confounded
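The idea that large effects are harder to explain away by confounding can be quantified. The sketch below computes the E-value, a sensitivity measure from the epidemiology literature that is not named in the text above and is introduced here only as an illustration: it is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully account for an observed risk ratio. The function name is hypothetical.

```python
import math


def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate only)."""
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    # Protective effects are inverted so the formula applies symmetrically.
    rr = 1 / risk_ratio if risk_ratio < 1 else risk_ratio
    return rr + math.sqrt(rr * (rr - 1))


# A weak and a strong hypothetical association.
print(round(e_value(1.2), 2))
print(round(e_value(3.0), 2))
```

A small E-value means even modest unmeasured confounding could explain the result, which fits the caution that small effects may be real but are easier to confound.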
Red Flags in Evidence Quality
Study Design Red Flags
- No control group
- Self-selected participants
- No randomization when feasible
- No blinding when feasible
- Very small sample
- Inappropriate statistical tests
Reporting Red Flags
- Selective outcome reporting
- No study registration/protocol
- Missing methodological details
- No conflicts of interest statement
- Cherry-picked citations
- Reported results don't correspond to the stated methods
Interpretation Red Flags
- Causal language from correlational data
- Claiming "proof"
- Ignoring limitations
- Overgeneralizing
- Spinning negative results
- Post hoc rationalization
Context Red Flags
- Industry funding without independent control over design, analysis, or reporting
- Single study in isolation
- Contradicts preponderance of evidence
- No replication
- Published in predatory journal
- Press release before peer review
Practical Decision Framework
When Evaluating Evidence, Ask:
- What type of study is this? (Design)
- How well was it conducted? (Quality)
- What does it actually show? (Results)
- How likely is bias? (Internal validity)
- Does it apply to my question? (External validity)
- How does it fit with other evidence? (Context)
- Are the conclusions justified? (Interpretation)
- What are the limitations? (Uncertainty)
Making Decisions with Imperfect Evidence
High-quality evidence:
- Strong confidence in acting on findings
- Reasonable to change practice/policy
Moderate-quality evidence:
- Provisional conclusions
- Consider in conjunction with other factors
- May warrant action depending on stakes
Low-quality evidence:
- Weak confidence
- Hypothesis-generating
- Insufficient for major decisions alone
- Consider cost/benefit of waiting for better evidence
Very low-quality evidence:
- Very uncertain
- Should not drive decisions alone
- Useful for identifying gaps and research needs
When Evidence is Conflicting
Strategies:
- Weight by study quality
- Look for systematic differences (population, methods)
- Consider publication bias
- Update with most recent, rigorous evidence
- Conduct/await systematic review
- Consider whether the question is well-formed
Communicating Evidence Strength
Avoid:
- Absolute certainty ("proves")
- False balance (equal weight to unequal evidence)
- Ignoring uncertainty
- Cherry-picking studies
Better:
- Quantify uncertainty
- Describe strength of evidence
- Acknowledge limitations
- Present range of evidence
- Distinguish established from emerging findings
- Be clear about what is/isn't known