# Evidence Hierarchy and Quality Assessment

## Traditional Evidence Hierarchy (Medical/Clinical)

### Level 1: Systematic Reviews and Meta-Analyses
**Description:** Comprehensive synthesis of all available evidence on a question.

**Strengths:**
- Combines multiple studies for greater power
- Reduces impact of single-study anomalies
- Can identify patterns across studies
- Quantifies overall effect size

**Weaknesses:**
- Quality depends on included studies ("garbage in, garbage out")
- Publication bias can distort findings
- Heterogeneity may make pooling inappropriate
- Can mask important differences between studies

**Critical evaluation:**
- Was search comprehensive (multiple databases, grey literature)?
- Were inclusion criteria appropriate and prespecified?
- Was study quality assessed?
- Was heterogeneity explored?
- Was publication bias assessed (funnel plots, fail-safe N)?
- Were appropriate statistical methods used?
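
The ideas of pooling and heterogeneity can be made concrete with a minimal sketch. It assumes a fixed-effect, inverse-variance model, which is only one of several pooling approaches and is not appropriate when heterogeneity is high. The function name `pool_fixed_effect` and the example numbers are hypothetical and not taken from any real review.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I^2.

    effects: per-study effect estimates (e.g. log risk ratios).
    std_errors: matching standard errors, all > 0.
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to between-study differences.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "pooled": pooled,
        "se": pooled_se,
        "ci95": (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se),
        "q": q,
        "i2": i2,
    }


if __name__ == "__main__":
    # Hypothetical effect sizes and standard errors for three studies.
    result = pool_fixed_effect([0.30, 0.45, 0.10], [0.10, 0.15, 0.12])
    print(result)
```

A high I² in such a calculation is a prompt to ask whether pooling is appropriate at all, not a value to report and move past.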

### Level 2: Randomized Controlled Trials (RCTs)
**Description:** Experimental studies with random assignment to conditions.

**Strengths:**
- Gold standard for establishing causation
- Controls for known and unknown confounders
- Minimizes selection bias
- Enables causal inference

**Weaknesses:**
- May not be ethical or feasible
- Artificial settings may limit generalizability
- Often short-term with selected populations
- Expensive and time-consuming

**Critical evaluation:**
- Was randomization adequate (sequence generation, allocation concealment)?
- Was blinding implemented (participants, providers, assessors)?
- Was sample size adequate (power analysis)?
- Was intention-to-treat analysis used?
- Was attrition rate acceptable and balanced?
- Are results generalizable?
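
As a rough illustration of the power-analysis question, the sketch below estimates the per-group sample size for a two-arm trial comparing means. It assumes a two-sided test, equal group sizes, a common standard deviation, and the normal approximation (a t-based calculation gives slightly larger numbers). The function name `n_per_group` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate participants needed per arm to detect a mean difference.

    delta: smallest difference in means worth detecting (non-zero).
    sd: assumed common standard deviation of the outcome (> 0).
    alpha: two-sided type I error rate.
    power: desired probability of detecting a true difference of size delta.
    """
    if delta == 0 or sd <= 0:
        raise ValueError("delta must be non-zero and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical trial: detect a 5-point difference when SD is 10.
    print(n_per_group(delta=5, sd=10))
```

Comparing a trial's actual enrollment against a figure like this gives a quick sense of whether a null result reflects no effect or simply too few participants.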

### Level 3: Cohort Studies
**Description:** Observational studies following groups over time.

**Types:**
- **Prospective:** Follow forward from exposure to outcome
- **Retrospective:** Look backward at existing data

**Strengths:**
- Can study multiple outcomes
- Establishes temporal sequence
- Can calculate incidence and relative risk
- More feasible than RCTs for many questions

**Weaknesses:**
- Susceptible to confounding
- Selection bias possible
- Attrition can bias results
- Cannot prove causation definitively

**Critical evaluation:**
- Were cohorts comparable at baseline?
- Was exposure measured reliably?
- Was follow-up adequate and complete?
- Were potential confounders measured and controlled?
- Was outcome assessment blinded to exposure?
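
Because cohort studies follow groups defined by exposure, incidence and relative risk can be computed directly from a 2x2 table. The following is a minimal sketch using the standard log-scale approximation for the 95% confidence interval. The function name `relative_risk` and the counts are hypothetical, and the estimate is unadjusted for confounders.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def relative_risk(exposed_events, exposed_nonevents,
                  unexposed_events, unexposed_nonevents):
    """Return (relative risk, CI lower, CI upper) from cohort counts."""
    if exposed_events <= 0 or unexposed_events <= 0:
        raise ValueError("both groups need at least one event")
    n_exposed = exposed_events + exposed_nonevents
    n_unexposed = unexposed_events + unexposed_nonevents

    # Incidence (risk) in each group over the follow-up period.
    risk_exposed = exposed_events / n_exposed
    risk_unexposed = unexposed_events / n_unexposed
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR), then back-transform the interval.
    se_log = math.sqrt(1 / exposed_events - 1 / n_exposed
                       + 1 / unexposed_events - 1 / n_unexposed)
    lower = math.exp(math.log(rr) - Z_95 * se_log)
    upper = math.exp(math.log(rr) + Z_95 * se_log)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical cohort: 30/100 exposed and 10/100 unexposed develop the outcome.
    print(relative_risk(30, 70, 10, 90))
```

An interval that excludes 1 says nothing about confounding; the critical-evaluation questions still apply to the adjusted analysis.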

### Level 4: Case-Control Studies
**Description:** Compare people with outcome (cases) to those without (controls), looking back at exposures.

**Strengths:**
- Efficient for rare outcomes
- Relatively quick and inexpensive
- Can study multiple exposures
- Useful for generating hypotheses

**Weaknesses:**
- Cannot calculate incidence
- Susceptible to recall bias
- Selection of controls is challenging
- Cannot prove causation

**Critical evaluation:**
- Were cases and controls defined clearly?
- Were controls appropriate (same source population)?
- Was matching appropriate?
- How was exposure ascertained (records vs. recall)?
- Were potential confounders controlled?
- Could recall bias explain findings?
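
Since incidence cannot be calculated in a case-control design, the usual measure of association is the odds ratio. The sketch below computes it from a 2x2 table with a Woolf (log-scale) 95% confidence interval. It assumes an unmatched design with no empty cells; the function name `odds_ratio` and the counts are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio(case_exposed, case_unexposed,
               control_exposed, control_unexposed):
    """Return (odds ratio, CI lower, CI upper) for an unmatched case-control table."""
    cells = (case_exposed, case_unexposed, control_exposed, control_unexposed)
    if any(c <= 0 for c in cells):
        raise ValueError("all four cells must be positive for this approximation")

    # Odds of exposure among cases divided by odds of exposure among controls.
    or_estimate = (case_exposed * control_unexposed) / (case_unexposed * control_exposed)

    # Woolf's method: standard error of log(OR) from the four cell counts.
    se_log = math.sqrt(sum(1 / c for c in cells))
    lower = math.exp(math.log(or_estimate) - Z_95 * se_log)
    upper = math.exp(math.log(or_estimate) + Z_95 * se_log)
    return or_estimate, lower, upper


if __name__ == "__main__":
    # Hypothetical study: 40 of 100 cases and 20 of 100 controls were exposed.
    print(odds_ratio(40, 60, 20, 80))
```

If exposure was ascertained by recall, differential misclassification between cases and controls can inflate this number regardless of how narrow the interval is.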

### Level 5: Cross-Sectional Studies
**Description:** Snapshot observation at single point in time.

**Strengths:**
- Quick and inexpensive
- Can assess prevalence
- Useful for hypothesis generation
- Can study multiple outcomes and exposures

**Weaknesses:**
- Cannot establish temporal sequence
- Cannot determine causation
- Prevalence-incidence bias
- Survival bias

**Critical evaluation:**
- Was sample representative?
- Were measures validated?
- Could reverse causation explain findings?
- Are confounders acknowledged?

### Level 6: Case Series and Case Reports
**Description:** Description of observations in clinical practice.

**Strengths:**
- Can identify new diseases or effects
- Hypothesis-generating
- Details rare phenomena
- Quick to report

**Weaknesses:**
- No control group
- No statistical inference possible
- Highly susceptible to bias
- Cannot establish causation or frequency

**Use:** Primarily for hypothesis generation and clinical description.

### Level 7: Expert Opinion
**Description:** Statements by recognized authorities.

**Strengths:**
- Synthesizes experience
- Useful when no research available
- May integrate multiple sources

**Weaknesses:**
- Subjective and potentially biased
- May not reflect current evidence
- Risk of appeal-to-authority fallacy
- Individual expertise varies

**Use:** Lowest level of evidence; should be supported by data when possible.

## Nuances and Limitations of Traditional Hierarchy

### When Lower-Level Evidence Can Be Strong
1. **Well-designed observational studies** with:
   - Large effects (hard to confound)
   - Dose-response relationships
   - Consistent findings across contexts
   - Biological plausibility
   - No plausible confounders

2. **Multiple converging lines of evidence** from different study types

3. **Natural experiments** approximating randomization

### When Higher-Level Evidence Can Be Weak
1. **Poor-quality RCTs** with:
   - Inadequate randomization
   - High attrition
   - No blinding when feasible
   - Conflicts of interest

2. **Biased meta-analyses**:
   - Publication bias
   - Selective inclusion
   - Inappropriate pooling
   - Poor search strategy

3. **Not addressing the right question**:
   - Wrong population
   - Wrong comparison
   - Wrong outcome
   - Too artificial to generalize

## Alternative: GRADE System

GRADE (Grading of Recommendations Assessment, Development and Evaluation) assesses evidence quality across four levels:

### High Quality
**Definition:** Very confident that true effect is close to estimated effect.

**Characteristics:**
- Well-conducted RCTs
- Overwhelming evidence from observational studies
- Large, consistent effects
- No serious limitations

### Moderate Quality
**Definition:** Moderately confident; true effect likely close to estimated, but could be substantially different.

**Downgrades from high:**
- Some risk of bias
- Inconsistency across studies
- Indirectness (different populations/interventions)
- Imprecision (wide confidence intervals)
- Publication bias suspected

### Low Quality
**Definition:** Limited confidence; true effect may be substantially different.

**Downgrades:**
- Serious limitations in above factors
- Observational studies without special strengths

### Very Low Quality
**Definition:** Very limited confidence; true effect likely substantially different.

**Characteristics:**
- Very serious limitations
- Expert opinion
- Multiple serious flaws
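
The way downgrades move a body of evidence between the four levels can be sketched as simple bookkeeping. This is an illustration only: real GRADE assessments are structured judgments, not arithmetic. The sketch assumes the conventional starting points (randomized trials start at high, observational studies at low) and allows rating up for special strengths such as a large effect or a dose-response gradient. The names `grade_rating`, `LEVELS`, and `DOWNGRADE_DOMAINS` are hypothetical.

```python
LEVELS = ["very low", "low", "moderate", "high"]

# The five downgrade factors listed under "Moderate Quality".
DOWNGRADE_DOMAINS = (
    "risk_of_bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication_bias",
)

# Assumed starting points, expressed as indexes into LEVELS.
START_LEVEL = {"rct": 3, "observational": 1}


def grade_rating(design, downgrades=None, upgrades=0):
    """Return a GRADE-style label for a body of evidence.

    design: "rct" or "observational".
    downgrades: dict mapping a domain to 0 (no concern), 1 (serious),
        or 2 (very serious).
    upgrades: levels to rate up (e.g. large effect, dose-response).
    """
    if design not in START_LEVEL:
        raise ValueError(f"unknown design: {design!r}")
    downgrades = downgrades or {}
    unknown = set(downgrades) - set(DOWNGRADE_DOMAINS)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    if any(v not in (0, 1, 2) for v in downgrades.values()):
        raise ValueError("each downgrade must be 0, 1, or 2")

    score = START_LEVEL[design] - sum(downgrades.values()) + upgrades
    # Clamp to the four defined levels.
    return LEVELS[max(0, min(len(LEVELS) - 1, score))]


if __name__ == "__main__":
    # Hypothetical RCT evidence with wide confidence intervals.
    print(grade_rating("rct", {"imprecision": 1}))
```

Keeping the domains explicit makes it easy to record why a rating moved, which matters more than the final label.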

## Study Quality Assessment Criteria

### Internal Validity (Bias Control)
**Questions:**
- Was randomization adequate?
- Was allocation concealed?
- Were groups similar at baseline?
- Was blinding implemented?
- Was attrition minimal and balanced?
- Was intention-to-treat used?
- Were all outcomes reported?

### External Validity (Generalizability)
**Questions:**
- Is sample representative of target population?
- Are inclusion/exclusion criteria too restrictive?
- Is setting realistic?
- Are results applicable to other populations?
- Are effects consistent across subgroups?

### Statistical Conclusion Validity
**Questions:**
- Was sample size adequate (power)?
- Were statistical tests appropriate?
- Were assumptions checked?
- Were effect sizes and confidence intervals reported?
- Were multiple comparisons addressed?
- Was analysis prespecified?
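
To show what addressing multiple comparisons can look like, here is a minimal sketch of the Holm-Bonferroni step-down procedure, one common way to control the family-wise error rate. Other corrections (plain Bonferroni, false discovery rate methods) may suit a given study better. The function name `holm_bonferroni` and the p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Tests p-values from smallest to largest against progressively
    looser thresholds, stopping at the first one that fails.
    """
    m = len(p_values)
    # Indexes of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1);
        # rank is zero-based, so the divisor is m - rank.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four outcomes in one study.
    print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

In this hypothetical set every p-value is below 0.05 on its own, yet only two survive correction, which is the pattern to watch for when a paper reports many uncorrected outcomes.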

### Construct Validity (Measurement)
**Questions:**
- Were measures validated and reliable?
- Was outcome defined clearly and appropriately?
- Were assessors blinded?
- Were exposures measured accurately?
- Was timing of measurement appropriate?

## Critical Appraisal Tools

### For Different Study Types

**RCTs:**
- Cochrane Risk of Bias Tool
- Jadad Scale
- PEDro Scale (for trials in physical therapy)

**Observational Studies:**
- Newcastle-Ottawa Scale
- ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions)

**Diagnostic Studies:**
- QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies)

**Systematic Reviews:**
- AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews)

**All Study Types:**
- CASP Checklists (Critical Appraisal Skills Programme)

## Domain-Specific Considerations

### Basic Science Research
**Hierarchy differs:**
1. Multiple convergent lines of evidence
2. Mechanistic understanding
3. Reproducible experiments
4. Established theoretical framework

**Key considerations:**
- Replication essential
- Mechanistic plausibility
- Consistency across model systems
- Convergence of methods

### Psychological Research
**Additional concerns:**
- Replication crisis
- Publication bias particularly problematic
- Small effect sizes often expected
- Cultural context matters
- Measures often indirect (self-report)

**Strong evidence includes:**
- Preregistered studies
- Large samples
- Multiple measures
- Behavioral (not just self-report) outcomes
- Cross-cultural replication

### Epidemiology
**Causal inference frameworks:**
- Bradford Hill criteria
- Rothman's causal pies
- Directed Acyclic Graphs (DAGs)

**Strong observational evidence:**
- Dose-response relationships
- Temporality (exposure precedes outcome)
- Biological plausibility
- Specificity
- Consistency across populations
- Large effects unlikely due to confounding

### Social Sciences
**Challenges:**
- Complex interventions
- Context-dependent effects
- Measurement challenges
- Ethical constraints on RCTs

**Strengthening evidence:**
- Mixed methods
- Natural experiments
- Instrumental variables
- Regression discontinuity designs
- Multiple operationalizations

## Synthesizing Evidence Across Studies

### Consistency
**Strong evidence:**
- Multiple studies, different investigators
- Different populations and settings
- Different research designs converge
- Different measurement methods

**Weak evidence:**
- Single study
- Only one research group
- Conflicting results
- Publication bias evident

### Biological/Theoretical Plausibility
**Strengthens evidence:**
- Known mechanism
- Consistent with other knowledge
- Dose-response relationship
- Coherent with animal/in vitro data

**Weakens evidence:**
- No plausible mechanism
- Contradicts established knowledge
- Biological implausibility

### Temporality
**Essential for causation:**
- Cause must precede effect
- Cross-sectional studies cannot establish temporal order
- Reverse causation must be ruled out

### Specificity
**Moderate indicator:**
- Specific cause → specific effect strengthens causation
- But lack of specificity doesn't rule out causation
- Most causes have multiple effects

### Strength of Association
**Strong evidence:**
- Large effects unlikely to be due to confounding
- Dose-response relationships
- All-or-none effects

**Caution:**
- Small effects may still be real
- Large effects can still be confounded

## Red Flags in Evidence Quality

### Study Design Red Flags
- No control group
- Self-selected participants
- No randomization when feasible
- No blinding when feasible
- Very small sample
- Inappropriate statistical tests

### Reporting Red Flags
- Selective outcome reporting
- No study registration/protocol
- Missing methodological details
- No conflicts of interest statement
- Cherry-picked citations
- Results don't match methods

### Interpretation Red Flags
- Causal language from correlational data
- Claiming "proof"
- Ignoring limitations
- Overgeneralizing
- Spinning negative results
- Post hoc rationalization

### Context Red Flags
- Industry funding without independence
- Single study in isolation
- Contradicts preponderance of evidence
- No replication
- Published in predatory journal
- Press release before peer review

## Practical Decision Framework

### When Evaluating Evidence, Ask:

1. **What type of study is this?** (Design)
2. **How well was it conducted?** (Quality)
3. **What does it actually show?** (Results)
4. **How likely is bias?** (Internal validity)
5. **Does it apply to my question?** (External validity)
6. **How does it fit with other evidence?** (Context)
7. **Are the conclusions justified?** (Interpretation)
8. **What are the limitations?** (Uncertainty)

### Making Decisions with Imperfect Evidence

**High-quality evidence:**
- Strong confidence in acting on findings
- Reasonable to change practice/policy

**Moderate-quality evidence:**
- Provisional conclusions
- Consider in conjunction with other factors
- May warrant action depending on stakes

**Low-quality evidence:**
- Weak confidence
- Hypothesis-generating
- Insufficient for major decisions alone
- Consider cost/benefit of waiting for better evidence

**Very low-quality evidence:**
- Very uncertain
- Should not drive decisions alone
- Useful for identifying gaps and research needs

### When Evidence is Conflicting

**Strategies:**
1. Weight by study quality
2. Look for systematic differences (population, methods)
3. Consider publication bias
4. Update with most recent, rigorous evidence
5. Conduct/await systematic review
6. Consider if question is well-formed

## Communicating Evidence Strength

**Avoid:**
- Absolute certainty ("proves")
- False balance (equal weight to unequal evidence)
- Ignoring uncertainty
- Cherry-picking studies

**Better:**
- Quantify uncertainty
- Describe strength of evidence
- Acknowledge limitations
- Present range of evidence
- Distinguish established from emerging findings
- Be clear about what is/isn't known