Literature Database Search Strategies

This document provides comprehensive guidance for searching multiple literature databases systematically and effectively.

Available Databases and Skills

Biomedical & Life Sciences

PubMed / PubMed Central

  • Access: Use gget skill or WebFetch tool
  • Coverage: 35M+ citations in biomedical literature
  • Best for: Clinical studies, biomedical research, genetics, molecular biology
  • Search tips: Use MeSH terms, Boolean operators (AND, OR, NOT), field tags [Title], [Author]
  • Example: "CRISPR"[Title] AND "gene editing"[Title/Abstract] AND 2020:2024[Publication Date]
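As a rough illustration of how the example query could be sent to PubMed programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint. Using E-utilities directly is an assumption made here for illustration (the access routes named above are the gget skill and the WebFetch tool), and `build_pubmed_esearch_url` is a hypothetical helper name. The sketch only constructs the URL and does not perform a network request.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumed access route for this sketch).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_esearch_url(term: str, retmax: int = 50) -> str:
    """Return an ESearch URL that asks PubMed for matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


query = '"CRISPR"[Title] AND "gene editing"[Title/Abstract] AND 2020:2024[Publication Date]'
url = build_pubmed_esearch_url(query)
print(url)
```

Keeping the query string separate from the URL construction makes it easy to record the exact search string in the search documentation.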

bioRxiv / medRxiv

  • Access: Use gget skill or direct API
  • Coverage: Preprints in biology and medicine
  • Best for: Latest unpublished research, cutting-edge findings
  • Note: Not peer-reviewed; interpret findings with caution and check whether a peer-reviewed version exists
  • Search tips: Search by category (bioinformatics, genomics, etc.)

General Scientific Literature

arXiv

  • Access: Direct API access
  • Coverage: Preprints in physics, mathematics, computer science, quantitative biology
  • Best for: Computational methods, bioinformatics algorithms, theoretical work
  • Categories: q-bio (Quantitative Biology), cs.LG (Machine Learning), stat.ML (Machine Learning, Statistics)
  • Search format: cat:q-bio.QM AND ti:"single cell"
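A minimal sketch of turning an arXiv search expression into an API request URL follows. It assumes the public arXiv query endpoint at `export.arxiv.org` and uses the API's `ti:` prefix for title searches; `build_arxiv_query_url` is a hypothetical helper name. Nothing is fetched here.

```python
from urllib.parse import urlencode

# Public arXiv API query endpoint (assumed for this sketch).
ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(search_query: str, start: int = 0, max_results: int = 25) -> str:
    """Return an arXiv API URL for one page of results (the response is an Atom feed)."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_query_url('cat:q-bio.QM AND ti:"single cell"')
print(url)
```

Paging through a larger result set would mean calling the helper again with a higher `start` value.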

Semantic Scholar

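  • Access: Direct API (an API key is optional but recommended for dedicated rate limits)
  • Coverage: 200M+ papers across all fields
  • Best for: Cross-disciplinary searches, citation graphs, paper recommendations
  • Features: Influential citations, paper summaries, related papers
  • Rate limits: About 100 requests/5 minutes; exact limits depend on whether a key is used and change over time, so check the current API documentation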
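To stay within a request budget such as 100 requests per 5 minutes, a client can throttle itself before calling the API. The following is a sketch of a sliding-window limiter; the class name `SlidingWindowLimiter` and its defaults are illustrative assumptions, not part of any Semantic Scholar client library.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any rolling `window_seconds` span."""

    def __init__(self, max_calls=100, window_seconds=300.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the limiter can be tested without waiting
        self._calls = deque()       # timestamps of recent calls, oldest first

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if it is allowed now)."""
        now = self.clock()
        # Forget calls that have fallen out of the rolling window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.window_seconds - (now - self._calls[0])

    def record(self) -> None:
        """Register that a call is being made now."""
        self._calls.append(self.clock())

    def acquire(self, sleep=time.sleep) -> None:
        """Block until a call is allowed, then register it."""
        delay = self.wait_time()
        if delay > 0:
            sleep(delay)
        self.record()


limiter = SlidingWindowLimiter(max_calls=100, window_seconds=300.0)
# Call limiter.acquire() immediately before each API request.
```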

Google Scholar

  • Access: Web scraping (use cautiously) or manual search
  • Coverage: Comprehensive across all fields
  • Best for: Finding highly cited papers, conference proceedings, theses
  • Limitations: No official API, rate limiting
  • Export: Use "Cite" feature for formatted citations

Specialized Databases

ChEMBL / PubChem

  • Access: Use gget skill or bioservices skill
  • Coverage: Chemical compounds, bioactivity data, drug molecules
  • Best for: Drug discovery, chemical biology, medicinal chemistry
  • ChEMBL: 2M+ compounds, bioactivity data
  • PubChem: 110M+ compounds, assay data

UniProt

  • Access: Use gget skill or bioservices skill
  • Coverage: Protein sequence and functional information
  • Best for: Protein research, sequence analysis, functional annotations
  • Search by: Protein name, gene name, organism, function

KEGG (Kyoto Encyclopedia of Genes and Genomes)

  • Access: Use bioservices skill
  • Coverage: Pathways, diseases, drugs, genes
  • Best for: Pathway analysis, systems biology, metabolic research

COSMIC (Catalogue of Somatic Mutations in Cancer)

  • Access: Use gget skill or direct download
  • Coverage: Cancer genomics, somatic mutations
  • Best for: Cancer research, mutation analysis

AlphaFold Database

  • Access: Use gget skill with alphafold command
  • Coverage: 200M+ protein structure predictions
  • Best for: Structural biology, protein modeling

PDB (Protein Data Bank)

  • Access: Use gget or direct API
  • Coverage: Experimental 3D structures of proteins, nucleic acids
  • Best for: Structural biology, drug design, molecular modeling

Citation & Reference Management

OpenAlex

  • Access: Direct API (free, no key required)
  • Coverage: 250M+ works, comprehensive metadata
  • Best for: Citation analysis, author disambiguation, institutional research
  • Features: Open access, excellent for bibliometrics

Dimensions

  • Access: Free tier available
  • Coverage: Publications, grants, patents, clinical trials
  • Best for: Research impact, funding analysis, translational research

Search Strategy Framework

1. Define Research Question (PICO Framework)

For clinical/biomedical reviews:

  • Population: Who is the study about?
  • Intervention: What is being tested?
  • Comparison: What is it compared to?
  • Outcome: What outcomes are measured?

Example: "What is the efficacy of CRISPR-Cas9 gene therapy (I) for treating sickle cell disease (P) compared to standard care (C) in improving patient outcomes (O)?"

2. Develop Search Terms

Primary Concepts

Identify 2-4 main concepts from your research question.

Example:

  • Concept 1: CRISPR, Cas9, gene editing
  • Concept 2: sickle cell disease, SCD, hemoglobin disorders
  • Concept 3: gene therapy, therapeutic editing

List alternative terms, abbreviations, and related concepts.

Tool: Use MeSH (Medical Subject Headings) browser for standardized terms

Boolean Operators

  • AND: Narrows search (must include both terms)
  • OR: Broadens search (includes either term)
  • NOT: Excludes terms

Example: (CRISPR OR Cas9 OR "gene editing") AND ("sickle cell" OR SCD) AND therapy
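The same combination can be assembled from lists of synonyms, which keeps the concept groups reusable across databases. This is a minimal sketch; `quote_term` and `build_boolean_query` are hypothetical helper names.

```python
def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_boolean_query(concepts: list[list[str]]) -> str:
    """OR the synonyms within each concept, then AND the concepts together."""
    groups = []
    for synonyms in concepts:
        joined = " OR ".join(quote_term(t) for t in synonyms)
        # Parentheses are only needed when a concept has several synonyms.
        groups.append(f"({joined})" if len(synonyms) > 1 else joined)
    return " AND ".join(groups)


concepts = [
    ["CRISPR", "Cas9", "gene editing"],
    ["sickle cell", "SCD"],
    ["therapy"],
]
query = build_boolean_query(concepts)
print(query)
# (CRISPR OR Cas9 OR "gene editing") AND ("sickle cell" OR SCD) AND therapy
```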

Wildcards & Truncation

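  • * (or % in some databases): Matches zero or more characters (truncation)
  • ?: Matches exactly one character; support varies by database (PubMed, for example, supports only *)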

Example: genom* matches genomic, genomics, genome
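The wildcard semantics can be checked locally against a small word list. The sketch below uses Python's `fnmatch` module, whose `*` and `?` behave like the truncation and single-character wildcards described above; it illustrates the matching rules only and does not reflect how any particular database implements them.

```python
from fnmatch import fnmatchcase

vocabulary = ["genomic", "genomics", "genome", "gene", "woman", "women"]

# "*" matches any run of characters, so genom* catches every word starting with "genom".
truncation_hits = [w for w in vocabulary if fnmatchcase(w, "genom*")]

# "?" matches exactly one character, so wom?n catches both spellings.
single_char_hits = [w for w in vocabulary if fnmatchcase(w, "wom?n")]

print(truncation_hits)   # ['genomic', 'genomics', 'genome']
print(single_char_hits)  # ['woman', 'women']
```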

3. Set Inclusion/Exclusion Criteria

Inclusion Criteria

  • Date range: e.g., 2015-2024 (last 10 years)
  • Language: English (or specify multilingual)
  • Publication type: Peer-reviewed articles, reviews, preprints
  • Study design: RCTs, cohort studies, meta-analyses
  • Population: Human, animal models, in vitro

Exclusion Criteria

  • Case reports (n<5)
  • Conference abstracts without full text
  • Non-original research (editorials, commentaries)
  • Duplicate publications
  • Retracted articles
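Criteria like these can be encoded as a small function so that every record is judged the same way and the reason for exclusion is captured. The following is a sketch under assumed field names (`Record`, `pub_type`, `retracted` are illustrative, not a schema used by the skill's scripts), covering only a subset of the criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    year: int
    language: str
    pub_type: str          # e.g. "article", "review", "preprint", "editorial"
    retracted: bool = False


ALLOWED_TYPES = {"article", "review", "preprint"}


def exclusion_reason(rec: Record, start_year: int = 2015, end_year: int = 2024):
    """Return the first failed criterion, or None if the record is included."""
    if rec.retracted:
        return "retracted"
    if not (start_year <= rec.year <= end_year):
        return "outside date range"
    if rec.language != "English":
        return "language"
    if rec.pub_type not in ALLOWED_TYPES:
        return "publication type"
    return None


# Made-up records for illustration only.
records = [
    Record("CRISPR editing in SCD", 2021, "English", "article"),
    Record("Early gene therapy commentary", 2012, "English", "editorial"),
    Record("Retracted trial", 2020, "English", "article", retracted=True),
]
decisions = {r.title: exclusion_reason(r) for r in records}
print(decisions)
```

Returning a reason string rather than a bare boolean makes it straightforward to report why each record was excluded.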

4. Database Selection Strategy

Multi-Database Approach

Search at least 3 complementary databases:

  1. Primary database: PubMed (biomedical) or arXiv (computational)
  2. Preprint server: bioRxiv/medRxiv or arXiv
  3. Comprehensive database: Semantic Scholar or Google Scholar
  4. Specialized database: ChEMBL, UniProt, or field-specific

Database-Specific Syntax

| Database | Field Tags | Example |
|---|---|---|
| PubMed | [Title], [Author], [MeSH] | "CRISPR"[Title] AND 2020:2024[DP] |
| arXiv | ti:, au:, cat: | ti:"machine learning" AND cat:q-bio.QM |
| Semantic Scholar | title:, author:, year: | title:"deep learning" year:2020-2024 |

Search Execution Workflow

Phase 1: Initial Search

  1. Run initial search with broad terms
  2. Review first 50 results for relevance
  3. Note common keywords and MeSH terms
  4. Refine search strategy

Phase 2: Refined Search

  1. Execute refined searches across all selected databases
  2. Export results in standard format (RIS, BibTeX, JSON)
  3. Document search strings and date for each database
  4. Record number of results per database

Phase 3: Deduplication

  1. Import all results into a single file
  2. Use search_databases.py --deduplicate to remove duplicates
  3. Identify duplicates by DOI (primary) or title (fallback)
  4. Keep the version with most complete metadata
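The deduplication rules above (DOI first, title as fallback, keep the most complete record) could be sketched as follows. This is an illustrative implementation under assumed record fields (`doi`, `title`), not the actual logic of search_databases.py, and the DOIs shown are made up.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedup_key(record: dict) -> str:
    """Prefer the DOI as the identity of a record; fall back to the normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    return "title:" + normalize_title(record.get("title", ""))


def completeness(record: dict) -> int:
    """Count non-empty fields as a rough measure of metadata completeness."""
    return sum(1 for value in record.values() if value not in (None, "", [], {}))


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one record per key, preferring the one with the most filled-in fields."""
    best: dict[str, dict] = {}
    for rec in records:
        key = dedup_key(rec)
        if key not in best or completeness(rec) > completeness(best[key]):
            best[key] = rec
    return list(best.values())


records = [
    {"doi": "10.1000/ABC", "title": "Gene Editing for SCD", "abstract": ""},
    {"doi": "10.1000/abc", "title": "Gene editing for SCD", "abstract": "Full abstract."},
    {"doi": "", "title": "A Preprint: on Base Editing"},
    {"doi": None, "title": "A preprint on base editing"},
]
unique = deduplicate(records)
print(len(unique))
```

One limitation of this sketch: a record that has a DOI and a copy of the same paper without one receive different keys, so a second pass matching on title alone may still be needed.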

Phase 4: Screening

  1. Title screening: Review titles, exclude obviously irrelevant
  2. Abstract screening: Read abstracts, apply inclusion/exclusion criteria
  3. Full-text screening: Obtain and review full texts
  4. Document reasons for exclusion at each stage
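Recording each exclusion as a (stage, reason) pair makes it simple to report how many records survive each screening stage and why the others were dropped. A minimal sketch, with hypothetical stage names and made-up reasons:

```python
from collections import Counter

STAGES = ("title", "abstract", "full_text")


def summarize_screening(total: int, exclusions: list[tuple[str, str]]) -> dict:
    """Tally exclusions per stage and reason, and count records remaining after each stage."""
    per_stage = Counter(stage for stage, _ in exclusions)
    reasons = Counter(exclusions)
    remaining = total
    flow = {}
    for stage in STAGES:
        remaining -= per_stage[stage]
        flow[stage] = remaining
    return {"remaining_after": flow, "reasons": reasons}


exclusions = [
    ("title", "off topic"),
    ("title", "off topic"),
    ("abstract", "wrong population"),
    ("full_text", "no full text available"),
]
summary = summarize_screening(total=10, exclusions=exclusions)
print(summary["remaining_after"])
# {'title': 8, 'abstract': 7, 'full_text': 6}
```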

Phase 5: Quality Assessment

  1. Assess study quality using appropriate tools:
    • RCTs: Cochrane Risk of Bias tool
    • Observational: Newcastle-Ottawa Scale
    • Systematic reviews: AMSTAR 2
  2. Grade quality of evidence (high, moderate, low, very low)
  3. Consider excluding very low-quality studies

Search Documentation Template

Required Documentation

All searches must be documented for reproducibility:

## Search Strategy

### Database: PubMed
- **Date searched**: 2024-10-25
- **Date range**: 2015-01-01 to 2024-10-25
- **Search string**:

("CRISPR"[Title] OR "Cas9"[Title] OR "gene editing"[Title/Abstract]) AND ("sickle cell disease"[MeSH] OR "SCD"[Title/Abstract]) AND ("gene therapy"[MeSH] OR "therapeutic editing"[Title/Abstract]) AND 2015:2024[Publication Date] AND English[Language]

- **Results**: 247 articles
- **After deduplication**: 189 articles

### Database: bioRxiv
- **Date searched**: 2024-10-25
- **Date range**: 2015-01-01 to 2024-10-25
- **Search string**: "CRISPR" AND "sickle cell" (in title/abstract)
- **Results**: 34 preprints
- **After deduplication**: 28 preprints

### Total Unique Articles
- **Combined results**: 217 unique articles
- **After title screening**: 156 articles
- **After abstract screening**: 89 articles
- **After full-text screening**: 52 articles included in review
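If search details are captured as structured data while searching, the per-database part of a report like the one above can be generated rather than typed by hand. The sketch below is one possible approach; `SearchLogEntry` and `render_search_log` are hypothetical names, the PubMed search string is abbreviated, and the combined total assumes the "after deduplication" counts already reflect cross-database deduplication, as in the template.

```python
from dataclasses import dataclass


@dataclass
class SearchLogEntry:
    database: str
    date_searched: str
    date_range: str
    search_string: str
    results: int
    after_dedup: int


def render_search_log(entries: list[SearchLogEntry]) -> str:
    """Render the search log as Markdown in the layout of the documentation template."""
    lines = ["## Search Strategy", ""]
    for e in entries:
        lines += [
            f"### Database: {e.database}",
            f"- **Date searched**: {e.date_searched}",
            f"- **Date range**: {e.date_range}",
            f"- **Search string**: {e.search_string}",
            f"- **Results**: {e.results}",
            f"- **After deduplication**: {e.after_dedup}",
            "",
        ]
    total_unique = sum(e.after_dedup for e in entries)
    lines += [
        "### Total Unique Articles",
        f"- **Combined results**: {total_unique} unique articles",
    ]
    return "\n".join(lines)


entries = [
    SearchLogEntry("PubMed", "2024-10-25", "2015-01-01 to 2024-10-25",
                   '"CRISPR"[Title] AND "sickle cell disease"[MeSH]', 247, 189),
    SearchLogEntry("bioRxiv", "2024-10-25", "2015-01-01 to 2024-10-25",
                   '"CRISPR" AND "sickle cell" (in title/abstract)', 34, 28),
]
report = render_search_log(entries)
print(report)
```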

Advanced Search Techniques

Citation Chaining

Find papers that cite a key paper:

  • Use Google Scholar "Cited by" feature
  • Use OpenAlex or Semantic Scholar APIs
  • Identifies newer research building on seminal work

Review references in key papers:

  • Extract references from included papers
  • Search for highly cited references
  • Identifies foundational research

Snowball Sampling

  1. Start with 3-5 highly relevant papers
  2. Extract all their references
  3. Check which references are cited by multiple papers
  4. Review those high-overlap references
  5. Repeat for newly identified key papers
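Steps 2 to 4 amount to counting how many seed papers cite each reference. A minimal sketch, assuming reference lists have already been extracted as sets of identifiers (the `seed_*` and `ref*` labels are made up):

```python
from collections import Counter


def shared_references(reference_lists: dict[str, set[str]], min_papers: int = 2) -> list[tuple[str, int]]:
    """Return references cited by at least `min_papers` seed papers, most shared first."""
    counts = Counter()
    for refs in reference_lists.values():
        counts.update(set(refs))  # count each reference once per seed paper
    shared = [(ref, n) for ref, n in counts.items() if n >= min_papers]
    return sorted(shared, key=lambda item: (-item[1], item[0]))


seeds = {
    "seed_a": {"ref1", "ref2", "ref3"},
    "seed_b": {"ref2", "ref3"},
    "seed_c": {"ref3", "ref4"},
}
overlap = shared_references(seeds)
print(overlap)
# [('ref3', 3), ('ref2', 2)]
```

The references at the top of this list are the natural candidates to review next and, if relevant, to use as seeds for another round.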

Follow prolific authors in the field:

  • Search by author name across databases
  • Check author profiles (ORCID, Google Scholar)
  • Review recent publications and preprints

Related Article Features

Many databases suggest related articles:

  • PubMed "Similar articles"
  • Semantic Scholar "Recommended papers"
  • Use to discover papers missed by keyword search

Quality Control Checklist

Before Searching

  • Research question clearly defined
  • PICO criteria established (if applicable)
  • Search terms and synonyms listed
  • Inclusion/exclusion criteria documented
  • Target databases selected (minimum 3)
  • Date range determined

During Searching

  • Search string tested and refined
  • Results exported with complete metadata
  • Search parameters documented
  • Number of results recorded per database
  • Search date recorded

After Searching

  • Duplicates removed
  • Screening protocol followed
  • Reasons for exclusion documented
  • Quality assessment completed
  • All citations verified with verify_citations.py
  • Search methodology documented in review

Common Pitfalls to Avoid

  1. Too narrow search: Missing relevant papers
    • Solution: Include synonyms, related terms, broader concepts
  2. Too broad search: Thousands of irrelevant results
    • Solution: Add specific concepts with AND, use field tags
  3. Single database: Incomplete coverage
    • Solution: Search minimum 3 complementary databases
  4. Ignoring preprints: Missing latest findings
    • Solution: Include bioRxiv, medRxiv, or arXiv
  5. No documentation: Irreproducible search
    • Solution: Document every search string, date, and result count
  6. Manual deduplication: Time-consuming and error-prone
    • Solution: Use search_databases.py script
  7. Unverified citations: Broken DOIs, incorrect metadata
    • Solution: Run verify_citations.py on final reference list
  8. Publication bias: Only including published positive results
    • Solution: Search trial registries, contact authors for unpublished data

Example Multi-Database Search Workflow

# Example workflow using available skills

# 1. Search PubMed via gget
search_term = "CRISPR AND sickle cell disease"
# Use gget search pubmed search_term

# 2. Search bioRxiv
# Use gget search biorxiv search_term

# 3. Search arXiv for computational papers
# Search arXiv with: cat:q-bio AND "CRISPR" AND "sickle cell"

# 4. Search Semantic Scholar via API
# Use semantic scholar API with search query

# 5. Aggregate and deduplicate results
# python search_databases.py combined_results.json --deduplicate --format markdown --output review_papers.md

# 6. Verify all citations
# python verify_citations.py review_papers.md

# 7. Generate final PDF
# python generate_pdf.py review_papers.md --citation-style nature

Resources

MeSH Browser

https://meshb.nlm.nih.gov/search

Boolean Search Tutorial

https://www.ncbi.nlm.nih.gov/books/NBK3827/

Citation Style Guides

See references/citation_styles.md in this skill

PRISMA Guidelines

Preferred Reporting Items for Systematic Reviews and Meta-Analyses: http://www.prisma-statement.org/