Initial commit

Zhongwei Li
2025-11-30 08:30:14 +08:00
commit 1dd5bee3b4
335 changed files with 147360 additions and 0 deletions


@@ -0,0 +1,908 @@
# BibTeX Formatting Guide
Comprehensive guide to BibTeX entry types, required fields, formatting conventions, and best practices.
## Overview
BibTeX is the standard bibliography format for LaTeX documents. Proper formatting ensures:
- Correct citation rendering
- Consistent formatting
- Compatibility with citation styles
- No compilation errors
This guide covers all common entry types and formatting rules.
## Entry Types
### @article - Journal Articles
**Most common entry type** for peer-reviewed journal articles.
**Required fields**:
- `author`: Author names
- `title`: Article title
- `journal`: Journal name
- `year`: Publication year
**Optional fields**:
- `volume`: Volume number
- `number`: Issue number
- `pages`: Page range
- `month`: Publication month
- `doi`: Digital Object Identifier
- `url`: URL
- `note`: Additional notes
**Template**:
```bibtex
@article{CitationKey2024,
author = {Last1, First1 and Last2, First2},
title = {Article Title Here},
journal = {Journal Name},
year = {2024},
volume = {10},
number = {3},
pages = {123--145},
doi = {10.1234/journal.2024.123456},
month = jan
}
```
**Example**:
```bibtex
@article{Jumper2021,
author = {Jumper, John and Evans, Richard and Pritzel, Alexander and others},
title = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
journal = {Nature},
year = {2021},
volume = {596},
number = {7873},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
```
### @book - Books
**For entire books**.
**Required fields**:
- `author` OR `editor`: Author(s) or editor(s)
- `title`: Book title
- `publisher`: Publisher name
- `year`: Publication year
**Optional fields**:
- `volume`: Volume number (if multi-volume)
- `series`: Series name
- `address`: Publisher location
- `edition`: Edition number
- `isbn`: ISBN
- `url`: URL
**Template**:
```bibtex
@book{CitationKey2024,
author = {Last, First},
title = {Book Title},
publisher = {Publisher Name},
year = {2024},
edition = {3},
address = {City, Country},
isbn = {978-0-123-45678-9}
}
```
**Example**:
```bibtex
@book{Kumar2021,
author = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
title = {Robbins and Cotran Pathologic Basis of Disease},
publisher = {Elsevier},
year = {2021},
edition = {10},
address = {Philadelphia, PA},
isbn = {978-0-323-53113-9}
}
```
### @inproceedings - Conference Papers
**For papers in conference proceedings**.
**Required fields**:
- `author`: Author names
- `title`: Paper title
- `booktitle`: Conference/proceedings name
- `year`: Year
**Optional fields**:
- `editor`: Proceedings editor(s)
- `volume`: Volume number
- `series`: Series name
- `pages`: Page range
- `address`: Conference location
- `month`: Conference month
- `organization`: Organizing body
- `publisher`: Publisher
- `doi`: DOI
**Template**:
```bibtex
@inproceedings{CitationKey2024,
author = {Last, First},
title = {Paper Title},
booktitle = {Proceedings of Conference Name},
year = {2024},
pages = {123--145},
address = {City, Country},
month = jun
}
```
**Example**:
```bibtex
@inproceedings{Vaswani2017,
author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
title = {Attention is All You Need},
booktitle = {Advances in Neural Information Processing Systems 30 (NeurIPS 2017)},
year = {2017},
pages = {5998--6008},
address = {Long Beach, CA}
}
```
**Note**: `@conference` is an alias for `@inproceedings`.
### @incollection - Book Chapters
**For chapters in edited books**.
**Required fields**:
- `author`: Chapter author(s)
- `title`: Chapter title
- `booktitle`: Book title
- `publisher`: Publisher name
- `year`: Publication year
**Optional fields**:
- `editor`: Book editor(s)
- `volume`: Volume number
- `series`: Series name
- `type`: Type of section (e.g., "chapter")
- `chapter`: Chapter number
- `pages`: Page range
- `address`: Publisher location
- `edition`: Edition
- `month`: Month
**Template**:
```bibtex
@incollection{CitationKey2024,
author = {Last, First},
title = {Chapter Title},
booktitle = {Book Title},
editor = {Editor, Last and Editor2, Last},
publisher = {Publisher Name},
year = {2024},
pages = {123--145},
chapter = {5}
}
```
**Example**:
```bibtex
@incollection{Brown2020,
author = {Brown, Peter O. and Botstein, David},
title = {Exploring the New World of the Genome with {DNA} Microarrays},
booktitle = {DNA Microarrays: A Molecular Cloning Manual},
editor = {Eisen, Michael B. and Brown, Patrick O.},
publisher = {Cold Spring Harbor Laboratory Press},
year = {2020},
pages = {1--45},
address = {Cold Spring Harbor, NY}
}
```
### @phdthesis - Doctoral Dissertations
**For PhD dissertations and theses**.
**Required fields**:
- `author`: Author name
- `title`: Thesis title
- `school`: Institution
- `year`: Year
**Optional fields**:
- `type`: Type (e.g., "PhD dissertation", "PhD thesis")
- `address`: Institution location
- `month`: Month
- `url`: URL
- `note`: Additional notes
**Template**:
```bibtex
@phdthesis{CitationKey2024,
author = {Last, First},
title = {Dissertation Title},
school = {University Name},
year = {2024},
type = {{PhD} dissertation},
address = {City, State}
}
```
**Example**:
```bibtex
@phdthesis{Johnson2023,
author = {Johnson, Mary L.},
title = {Novel Approaches to Cancer Immunotherapy Using {CRISPR} Technology},
school = {Stanford University},
year = {2023},
type = {{PhD} dissertation},
address = {Stanford, CA}
}
```
**Note**: `@mastersthesis` is similar but for Master's theses.
### @mastersthesis - Master's Theses
**For Master's theses**.
**Required fields**:
- `author`: Author name
- `title`: Thesis title
- `school`: Institution
- `year`: Year
**Template**:
```bibtex
@mastersthesis{CitationKey2024,
author = {Last, First},
title = {Thesis Title},
school = {University Name},
year = {2024}
}
```
### @misc - Miscellaneous
**For items that don't fit other categories** (preprints, datasets, software, websites, etc.).
**Required fields** (standard BibTeX requires none for `@misc`; include these by convention):
- `author` (if known)
- `title`
- `year`
**Optional fields**:
- `howpublished`: Repository, website, format
- `url`: URL
- `doi`: DOI
- `note`: Additional information
- `month`: Month
**Template for preprints**:
```bibtex
@misc{CitationKey2024,
author = {Last, First},
title = {Preprint Title},
year = {2024},
howpublished = {bioRxiv},
doi = {10.1101/2024.01.01.123456},
note = {Preprint}
}
```
**Template for datasets**:
```bibtex
@misc{DatasetName2024,
author = {Last, First},
title = {Dataset Title},
year = {2024},
howpublished = {Zenodo},
doi = {10.5281/zenodo.123456},
note = {Version 1.2}
}
```
**Template for software**:
```bibtex
@misc{SoftwareName2024,
author = {Last, First},
title = {Software Name},
year = {2024},
howpublished = {GitHub},
url = {https://github.com/user/repo},
note = {Version 2.0}
}
```
### @techreport - Technical Reports
**For technical reports**.
**Required fields**:
- `author`: Author name(s)
- `title`: Report title
- `institution`: Institution
- `year`: Year
**Optional fields**:
- `type`: Type of report
- `number`: Report number
- `address`: Institution location
- `month`: Month
**Template**:
```bibtex
@techreport{CitationKey2024,
author = {Last, First},
title = {Report Title},
institution = {Institution Name},
year = {2024},
type = {Technical Report},
number = {TR-2024-01}
}
```
### @unpublished - Unpublished Work
**For unpublished works** (not preprints - use @misc for those).
**Required fields**:
- `author`: Author name(s)
- `title`: Work title
- `note`: Description
**Optional fields**:
- `month`: Month
- `year`: Year
**Template**:
```bibtex
@unpublished{CitationKey2024,
author = {Last, First},
title = {Work Title},
note = {Unpublished manuscript},
year = {2024}
}
```
### @online/@electronic - Online Resources
**For web pages and online-only content**.
**Note**: Not standard BibTeX. `@online` is a biblatex entry type, and biblatex accepts `@electronic` as an alias for it.
**Required fields**:
- `author` OR `organization`
- `title`
- `url`
- `year`
**Template**:
```bibtex
@online{CitationKey2024,
author = {{Organization Name}},
title = {Page Title},
url = {https://example.com/page},
year = {2024},
note = {Accessed: 2024-01-15}
}
```
## Formatting Rules
### Citation Keys
**Convention**: `FirstAuthorYEARkeyword`
**Examples**:
```bibtex
Smith2024protein
Doe2023machine
JohnsonWilliams2024cancer % Multiple authors, no space
NatureEditorial2024 % No author, use publication
WHO2024guidelines % Organization author
```
**Rules**:
- Alphanumeric plus: `-`, `_`, `.`, `:`
- No spaces
- Case-sensitive
- Unique within file
- Descriptive
**Avoid**:
- Special characters: `@`, `#`, `&`, `%`, `$`
- Spaces: use CamelCase or underscores
- Starting with numbers: `2024Smith` (some systems disallow)
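The key rules above can be checked mechanically. The following is a minimal sketch, assuming the character set listed in this section; `is_valid_citation_key` and `find_duplicate_keys` are hypothetical helpers, not part of BibTeX or of the scripts referenced in this guide, and the pattern is deliberately stricter than what BibTeX itself accepts.

```python
import re

# Letters, digits, and the punctuation this guide allows: - _ . :
# Assumption: keys must start with a letter, since some systems
# reject keys that start with a digit.
_KEY_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_.:-]*")


def is_valid_citation_key(key: str) -> bool:
    """Return True if the key follows the conventions in this section."""
    return _KEY_PATTERN.fullmatch(key) is not None


def find_duplicate_keys(keys: list[str]) -> list[str]:
    """Return keys that occur more than once, in first-seen order."""
    seen: set[str] = set()
    dupes: list[str] = []
    for key in keys:
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes
```

Because keys are case-sensitive here, `Smith2024` and `smith2024` count as different keys in `find_duplicate_keys`.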
### Author Names
**Recommended format**: `Last, First Middle`
**Single author**:
```bibtex
author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}
```
**Multiple authors** - separate with `and`:
```bibtex
author = {Smith, John and Doe, Jane}
author = {Smith, John A. and Doe, Jane M. and Johnson, Mary L.}
```
**Many authors** (10+):
```bibtex
author = {Smith, John and Doe, Jane and Johnson, Mary and others}
```
**Special cases**:
```bibtex
% Suffix (Jr., III, etc.)
author = {King, Jr., Martin Luther}
% Organization as author
author = {{World Health Organization}}
% Note: Double braces keep as single entity
% Multiple surnames
author = {Garc{\'i}a-Mart{\'i}nez, Jos{\'e}}
% Particles (van, von, de, etc.)
author = {van der Waals, Johannes}
author = {de Broglie, Louis}
```
**Wrong formats** (don't use):
```bibtex
author = {Smith, J.; Doe, J.} % Semicolons (wrong)
author = {Smith, J., Doe, J.} % Commas (wrong)
author = {Smith, J. & Doe, J.} % Ampersand (wrong)
author = {Smith J} % No comma
```
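The wrong formats above are easy to detect with a few string checks. This is a rough sketch, assuming the recommended `Last, First` form is being enforced; `split_authors` and `author_field_problems` are hypothetical names, and the brace tracking ignores escaped braces such as `\{`.

```python
def split_authors(value: str) -> list[str]:
    """Split an author field on the word 'and' outside of braces."""
    names: list[str] = []
    current: list[str] = []
    depth = 0
    for word in value.split():
        if word == "and" and depth == 0:
            names.append(" ".join(current))
            current = []
            continue
        depth += word.count("{") - word.count("}")
        current.append(word)
    names.append(" ".join(current))
    return names


def author_field_problems(value: str) -> list[str]:
    """Return human-readable problems found in one author field value."""
    if not value.strip():
        return ["empty author field"]
    problems: list[str] = []
    if ";" in value:
        problems.append("semicolon used as separator")
    if "&" in value.replace("\\&", ""):
        problems.append("ampersand used as separator")
    for name in split_authors(value):
        # Heuristic: a fully braced name is treated as an organization.
        if name == "others" or (name.startswith("{") and name.endswith("}")):
            continue
        commas = name.count(",")
        if commas == 0:
            problems.append(f"no comma in name: {name!r}")
        elif commas > 2:
            problems.append(f"too many commas in name: {name!r}")
    return problems
```

The value passed in is the field content without the outer field braces, so an organization author arrives as `{World Health Organization}`. Note that `First Last` names are valid BibTeX; the sketch flags them only because this guide recommends the comma form.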
### Title Capitalization
**Protect capitalization** with braces:
```bibtex
% Proper nouns, acronyms, formulas
title = {{AlphaFold}: Protein Structure Prediction}
title = {Machine Learning for {DNA} Sequencing}
title = {The {Ising} Model in Statistical Physics}
title = {{CRISPR-Cas9} Gene Editing Technology}
```
**Reason**: Sentence-case bibliography styles lowercase everything after the first word of a title. Braces protect the enclosed text from that change.
**Examples**:
```bibtex
% Good
title = {Advances in {COVID-19} Treatment}
title = {Using {Python} for Data Analysis}
title = {The {AlphaFold} Protein Structure Database}
% Will be lowercased by sentence-case styles
title = {Advances in COVID-19 Treatment} % covid-19
title = {Using Python for Data Analysis} % python
```
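Unprotected acronyms can be found with a simple scan. The sketch below is a heuristic under stated assumptions: it flags words outside braces that contain two or more uppercase letters (such as `COVID-19` or `AlphaFold`), and it cannot tell a proper noun like `Python` apart from an ordinary title-cased word. `unprotected_acronyms` is a hypothetical helper name.

```python
import re


def strip_braced_groups(title: str) -> str:
    """Return the title with all text inside braces removed."""
    out: list[str] = []
    depth = 0
    for ch in title:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def unprotected_acronyms(title: str) -> list[str]:
    """List words outside braces that have two or more uppercase letters."""
    visible = strip_braced_groups(title)
    words = re.findall(r"[A-Za-z0-9-]+", visible)
    return [w for w in words if sum(c.isupper() for c in w) >= 2]
```

Each word it returns is a candidate for wrapping in braces; single-capital proper nouns still need a manual check.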
**Whole title protection** (rarely needed):
```bibtex
title = {{This Entire Title Keeps Its Capitalization}}
```
### Page Ranges
**Use en-dash** (double hyphen `--`):
```bibtex
pages = {123--145} % Correct
pages = {1234--1256} % Correct
pages = {e0123456} % Article ID (PLOS, etc.)
pages = {123} % Single page
```
**Wrong**:
```bibtex
pages = {123-145} % Single hyphen (don't use)
pages = {pp. 123-145} % "pp." not needed
pages = {123–145} % Unicode en-dash (may cause issues)
```
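These page-range fixes are mechanical, so they lend themselves to a small normalizer. A minimal sketch, assuming only digit-to-digit ranges should be rewritten; `normalize_pages` is a hypothetical helper and is not taken from the formatter script mentioned in this guide.

```python
import re

# A run of hyphens or a Unicode figure/en/em dash between two digits.
_RANGE_DASH = re.compile(r"(?<=\d)\s*(?:-+|[\u2012-\u2014])\s*(?=\d)")
# A leading "p." or "pp." prefix.
_PP_PREFIX = re.compile(r"^pp?\.\s*", re.IGNORECASE)


def normalize_pages(pages: str) -> str:
    """Return the pages value with 'pp.' removed and ranges written as '--'."""
    text = _PP_PREFIX.sub("", pages.strip())
    return _RANGE_DASH.sub("--", text)
```

Article IDs such as `e0123456` and single pages pass through unchanged because they contain no digit-dash-digit sequence.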
### Month Names
**Use three-letter abbreviations** (unquoted):
```bibtex
month = jan
month = feb
month = mar
month = apr
month = may
month = jun
month = jul
month = aug
month = sep
month = oct
month = nov
month = dec
```
**Or numeric** (the form biblatex expects; classic BibTeX styles print the number as written):
```bibtex
month = {1} % January
month = {12} % December
```
**Or full name in braces**:
```bibtex
month = {January}
```
**The three-letter abbreviations work without braces or quotes** because the standard bibliography styles define them as string macros.
### Journal Names
**Full name** (not abbreviated):
```bibtex
journal = {Nature}
journal = {Science}
journal = {Cell}
journal = {Proceedings of the National Academy of Sciences}
journal = {Journal of the American Chemical Society}
```
**Bibliography style or tooling** may handle abbreviation if needed; most standard BibTeX styles print the `journal` field exactly as written.
**Avoid manual abbreviation**:
```bibtex
% Don't do this in BibTeX file
journal = {Proc. Natl. Acad. Sci. U.S.A.}
% Do this instead
journal = {Proceedings of the National Academy of Sciences}
```
**Exception**: If style requires abbreviations, use full abbreviated form:
```bibtex
journal = {Proc. Natl. Acad. Sci. U.S.A.} % If required by style
```
### DOI Formatting
**Bare DOI** (preferred):
```bibtex
doi = {10.1038/s41586-021-03819-2}
```
**Not**:
```bibtex
doi = {https://doi.org/10.1038/s41586-021-03819-2} % Don't include URL
doi = {doi:10.1038/s41586-021-03819-2} % Don't include prefix
```
**Bibliography styles** that support the `doi` field (and biblatex) typically render it as a link automatically.
**Note**: Do not end the DOI value with a period.
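The DOI rules in this section can be applied automatically. The following sketch assumes the common `10.<registrant>/<suffix>` shape; `normalize_doi` and `looks_like_doi` are hypothetical helpers, and stripping a trailing period is an assumption that could in rare cases damage a DOI that genuinely ends with one.

```python
import re

# Resolver URLs and the "doi:" label that should not appear in the field.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Rough shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Strip resolver/label prefixes and a trailing period from a DOI."""
    doi = _DOI_PREFIX.sub("", raw.strip())
    return doi.rstrip(".")


def looks_like_doi(doi: str) -> bool:
    """Return True if the string has the usual bare-DOI shape."""
    return _DOI_SHAPE.match(doi) is not None
```

A shape check says nothing about whether the DOI resolves; that still requires a lookup against the resolver.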
### URL Formatting
```bibtex
url = {https://www.example.com/article}
```
**Use**:
- When DOI not available
- For web pages
- For supplementary materials
**Don't duplicate**:
```bibtex
% Don't include both if DOI URL is same as url
doi = {10.1038/nature12345}
url = {https://doi.org/10.1038/nature12345} % Redundant!
```
### Special Characters
**Accents and diacritics**:
```bibtex
author = {M{\"u}ller, Hans} % ü
author = {Garc{\'i}a, Jos{\'e}} % í, é
author = {Erd{\H{o}}s, Paul} % ő
author = {Schr{\"o}dinger, Erwin} % ö
```
**Or use UTF-8** (with proper LaTeX setup):
```bibtex
author = {Müller, Hans}
author = {García, José}
```
**Mathematical symbols**:
```bibtex
title = {The $\alpha$-helix Structure}
title = {$\beta$-sheet Prediction}
```
**Chemical formulas**:
```bibtex
title = {H$_2$O Molecular Dynamics}
% Or with the mhchem package:
title = {\ce{H2O} Molecular Dynamics}
```
### Field Order
**Recommended order** (for readability):
```bibtex
@article{Key,
author = {},
title = {},
journal = {},
year = {},
volume = {},
number = {},
pages = {},
doi = {},
url = {},
note = {}
}
```
**Rules**:
- Most important fields first
- Consistent across entries
- Use formatter to standardize
## Best Practices
### 1. Consistent Formatting
Use same format throughout:
- Author name format
- Title capitalization
- Journal names
- Citation key style
### 2. Required Fields
Always include:
- All required fields for entry type
- DOI for modern papers (2000+)
- Volume and pages for articles
- Publisher for books
### 3. Protect Capitalization
Use braces for:
- Proper nouns: `{AlphaFold}`
- Acronyms: `{DNA}`, `{CRISPR}`
- Formulas: `{H2O}`
- Names: `{Python}`, `{R}`
### 4. Complete Author Lists
Include all authors when possible:
- All authors if <10
- Use "and others" for 10+
- Don't abbreviate to "et al." manually
### 5. Use Standard Entry Types
Choose correct entry type:
- Journal article → `@article`
- Book → `@book`
- Conference paper → `@inproceedings`
- Preprint → `@misc`
### 6. Validate Syntax
Check for:
- Balanced braces
- Commas after fields
- Unique citation keys
- Valid entry types
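Of the checks listed above, brace balance is the simplest to automate. This is a minimal sketch, assuming a brace preceded by a single backslash is escaped and should be ignored; `brace_balance_error` is a hypothetical helper, and a real parser would handle more edge cases.

```python
from typing import Optional


def brace_balance_error(text: str) -> Optional[tuple[int, str]]:
    """Return None if braces balance, else (line_number, message)."""
    open_lines: list[int] = []  # line number of each still-open brace
    for lineno, line in enumerate(text.splitlines(), start=1):
        prev = ""
        for ch in line:
            if prev != "\\":
                if ch == "{":
                    open_lines.append(lineno)
                elif ch == "}":
                    if not open_lines:
                        return (lineno, "unmatched closing brace")
                    open_lines.pop()
            prev = ch
    if open_lines:
        return (open_lines[-1], "unclosed opening brace")
    return None
```

When a field is missing its closing brace, the entry's final `}` closes the field instead, so the reported line is often the entry's first line rather than the line with the actual mistake.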
### 7. Use Formatters
Use automated tools:
```bash
python scripts/format_bibtex.py references.bib
```
Benefits:
- Consistent formatting
- Catch syntax errors
- Standardize field order
- Fix common issues
## Common Mistakes
### 1. Wrong Author Separator
**Wrong**:
```bibtex
author = {Smith, J.; Doe, J.} % Semicolon
author = {Smith, J., Doe, J.} % Comma
author = {Smith, J. & Doe, J.} % Ampersand
```
**Correct**:
```bibtex
author = {Smith, John and Doe, Jane}
```
### 2. Missing Commas
**Wrong**:
```bibtex
@article{Smith2024,
author = {Smith, John} % Missing comma!
title = {Title}
}
```
**Correct**:
```bibtex
@article{Smith2024,
author = {Smith, John}, % Comma after each field
title = {Title}
}
```
### 3. Unprotected Capitalization
**Wrong**:
```bibtex
title = {Machine Learning with Python}
% "Python" will become "python" in sentence-case styles
```
**Correct**:
```bibtex
title = {Machine Learning with {Python}}
```
### 4. Single Hyphen in Pages
**Wrong**:
```bibtex
pages = {123-145} % Single hyphen
```
**Correct**:
```bibtex
pages = {123--145} % Double hyphen (en-dash)
```
### 5. Redundant "pp." in Pages
**Wrong**:
```bibtex
pages = {pp. 123--145}
```
**Correct**:
```bibtex
pages = {123--145}
```
### 6. DOI with URL Prefix
**Wrong**:
```bibtex
doi = {https://doi.org/10.1038/nature12345}
doi = {doi:10.1038/nature12345}
```
**Correct**:
```bibtex
doi = {10.1038/nature12345}
```
## Example Complete Bibliography
```bibtex
% Journal article
@article{Jumper2021,
author = {Jumper, John and Evans, Richard and Pritzel, Alexander and others},
title = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
journal = {Nature},
year = {2021},
volume = {596},
number = {7873},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
% Book
@book{Kumar2021,
author = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
title = {Robbins and Cotran Pathologic Basis of Disease},
publisher = {Elsevier},
year = {2021},
edition = {10},
address = {Philadelphia, PA},
isbn = {978-0-323-53113-9}
}
% Conference paper
@inproceedings{Vaswani2017,
author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
title = {Attention is All You Need},
booktitle = {Advances in Neural Information Processing Systems 30 (NeurIPS 2017)},
year = {2017},
pages = {5998--6008}
}
% Book chapter
@incollection{Brown2020,
author = {Brown, Peter O. and Botstein, David},
title = {Exploring the New World of the Genome with {DNA} Microarrays},
booktitle = {DNA Microarrays: A Molecular Cloning Manual},
editor = {Eisen, Michael B. and Brown, Patrick O.},
publisher = {Cold Spring Harbor Laboratory Press},
year = {2020},
pages = {1--45}
}
% PhD thesis
@phdthesis{Johnson2023,
author = {Johnson, Mary L.},
title = {Novel Approaches to Cancer Immunotherapy},
school = {Stanford University},
year = {2023},
type = {{PhD} dissertation}
}
% Preprint
@misc{Zhang2024,
author = {Zhang, Yi and Chen, Li and Wang, Hui},
title = {Novel Therapeutic Targets in {Alzheimer}'s Disease},
year = {2024},
howpublished = {bioRxiv},
doi = {10.1101/2024.01.001},
note = {Preprint}
}
% Dataset
@misc{AlphaFoldDB2021,
author = {{DeepMind} and {EMBL-EBI}},
title = {{AlphaFold} Protein Structure Database},
year = {2021},
howpublished = {Database},
url = {https://alphafold.ebi.ac.uk/},
doi = {10.1093/nar/gkab1061}
}
```
## Summary
BibTeX formatting essentials:
- **Choose correct entry type** (@article, @book, etc.)
- **Include all required fields**
- **Use `and` for multiple authors**
- **Protect capitalization** with braces
- **Use `--` for page ranges**
- **Include DOI** for modern papers
- **Validate syntax** before compilation
Use formatting tools to ensure consistency:
```bash
python scripts/format_bibtex.py references.bib
```
Properly formatted BibTeX ensures correct, consistent citations across all bibliography styles!


@@ -0,0 +1,794 @@
# Citation Validation Guide
Comprehensive guide to validating citation accuracy, completeness, and formatting in BibTeX files.
## Overview
Citation validation ensures:
- All citations are accurate and complete
- DOIs resolve correctly
- Required fields are present
- No duplicate entries
- Proper formatting and syntax
- Links are accessible
Validation should be performed:
- After extracting metadata
- Before manuscript submission
- After manual edits to BibTeX files
- Periodically for maintained bibliographies
## Validation Categories
### 1. DOI Verification
**Purpose**: Ensure DOIs are valid and resolve correctly.
#### What to Check
**DOI format**:
```
Valid: 10.1038/s41586-021-03819-2
Valid: 10.1126/science.aam9317
Invalid: 10.1038/invalid
Invalid: doi:10.1038/... (should omit "doi:" prefix in BibTeX)
```
**DOI resolution**:
- DOI should resolve via https://doi.org/
- Should redirect to actual article
- Should not return 404 or error
**Metadata consistency**:
- CrossRef metadata should match BibTeX
- Author names should align
- Title should match
- Year should match
#### How to Validate
**Manual check**:
1. Copy DOI from BibTeX
2. Visit https://doi.org/10.1038/nature12345
3. Verify it redirects to correct article
4. Check metadata matches
**Automated check** (recommended):
```bash
python scripts/validate_citations.py references.bib --check-dois
```
**Process**:
1. Extract all DOIs from BibTeX file
2. Query doi.org resolver for each
3. Query CrossRef API for metadata
4. Compare metadata with BibTeX entry
5. Report discrepancies
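Step 4 of the process above (comparing metadata) might look like the sketch below. It assumes the record is the `message` object of a CrossRef works response, with `title` as a list, `author` as a list of objects with a `family` name, and `issued` holding `date-parts`; `compare_with_crossref` is a hypothetical helper, no network call is made, and names written with LaTeX accent commands will not match their Unicode form.

```python
import re


def _norm(text: str) -> str:
    """Lowercase, drop braces and punctuation, and collapse whitespace."""
    plain = text.replace("{", "").replace("}", "")
    return " ".join(re.sub(r"[^\w\s]", " ", plain).lower().split())


def compare_with_crossref(bib: dict, record: dict) -> list[str]:
    """Return the names of fields that disagree between BibTeX and CrossRef."""
    mismatches: list[str] = []

    cr_title = (record.get("title") or [""])[0]
    if cr_title and _norm(bib.get("title", "")) != _norm(cr_title):
        mismatches.append("title")

    date_parts = record.get("issued", {}).get("date-parts") or [[None]]
    cr_year = date_parts[0][0] if date_parts[0] else None
    if cr_year is not None and bib.get("year", "").strip() != str(cr_year):
        mismatches.append("year")

    cr_first = (record.get("author") or [{}])[0].get("family", "")
    bib_first = bib.get("author", "").split(" and ")[0].split(",")[0]
    if cr_first and _norm(bib_first) != _norm(cr_first):
        mismatches.append("first_author")

    return mismatches
```

Fields that CrossRef does not supply are skipped rather than reported, so an empty record produces no mismatches.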
#### Common Issues
**Broken DOIs**:
- Typos in DOI
- Publisher changed DOI (rare)
- Article retracted
- Solution: Find correct DOI from publisher site
**Mismatched metadata**:
- BibTeX has old/incorrect information
- Solution: Re-extract metadata from CrossRef
**Missing DOIs**:
- Older articles may not have DOIs
- Acceptable for pre-2000 publications
- Add URL or PMID instead
### 2. Required Fields
**Purpose**: Ensure all necessary information is present.
#### Required by Entry Type
**@article**:
```bibtex
author % REQUIRED
title % REQUIRED
journal % REQUIRED
year % REQUIRED
volume % Highly recommended
pages % Highly recommended
doi % Highly recommended for modern papers
```
**@book**:
```bibtex
author OR editor % REQUIRED (at least one)
title % REQUIRED
publisher % REQUIRED
year % REQUIRED
isbn % Recommended
```
**@inproceedings**:
```bibtex
author % REQUIRED
title % REQUIRED
booktitle % REQUIRED (conference/proceedings name)
year % REQUIRED
pages % Recommended
```
**@incollection** (book chapter):
```bibtex
author % REQUIRED
title % REQUIRED (chapter title)
booktitle % REQUIRED (book title)
publisher % REQUIRED
year % REQUIRED
editor % Recommended
pages % Recommended
```
**@phdthesis**:
```bibtex
author % REQUIRED
title % REQUIRED
school % REQUIRED
year % REQUIRED
```
**@misc** (preprints, datasets, etc.):
```bibtex
author % REQUIRED
title % REQUIRED
year % REQUIRED
howpublished % Recommended (bioRxiv, Zenodo, etc.)
doi OR url % At least one required
```
#### Validation Script
```bash
python scripts/validate_citations.py references.bib --check-required-fields
```
**Output**:
```
Error: Entry 'Smith2024' missing required field 'journal'
Error: Entry 'Doe2023' missing required field 'year'
Warning: Entry 'Jones2022' missing recommended field 'volume'
```
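A required-field check reduces to a lookup table plus a loop. The sketch below encodes the requirements listed above for the entry types where they match standard BibTeX; it is an illustration only, not the implementation of `validate_citations.py`, and `missing_required_fields` is a hypothetical name.

```python
# Required fields per entry type; "a|b" means at least one of a or b.
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": ["author|editor", "title", "publisher", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "incollection": ["author", "title", "booktitle", "publisher", "year"],
    "phdthesis": ["author", "title", "school", "year"],
}


def missing_required_fields(entry_type: str, fields: dict) -> list[str]:
    """Return the requirements that the entry's fields do not satisfy."""
    missing: list[str] = []
    for requirement in REQUIRED_FIELDS.get(entry_type.lower(), []):
        alternatives = requirement.split("|")
        if not any(fields.get(name, "").strip() for name in alternatives):
            missing.append(requirement)
    return missing
```

Entry types that are not in the table return an empty list, so unknown types are neither accepted nor rejected by this check. A field that is present but blank counts as missing.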
### 3. Author Name Formatting
**Purpose**: Ensure consistent, correct author name formatting.
#### Proper Format
**Recommended BibTeX format**:
```bibtex
author = {Last1, First1 and Last2, First2 and Last3, First3}
```
**Examples**:
```bibtex
% Correct
author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}
author = {Smith, John and Doe, Jane}
author = {Smith, John and Doe, Jane and Johnson, Mary}
% For many authors
author = {Smith, John and Doe, Jane and others}
% Incorrect
author = {John Smith} % First Last format (not recommended)
author = {Smith, J.; Doe, J.} % Semicolon separator (wrong)
author = {Smith J, Doe J} % Missing commas
```
#### Special Cases
**Suffixes (Jr., III, etc.)**:
```bibtex
author = {King, Jr., Martin Luther}
```
**Multiple surnames (hyphenated)**:
```bibtex
author = {Smith-Jones, Mary}
```
**Van, von, de, etc.**:
```bibtex
author = {van der Waals, Johannes}
author = {de Broglie, Louis}
```
**Organizations as authors**:
```bibtex
author = {{World Health Organization}}
% Double braces treat as single author
```
#### Validation Checks
**Automated validation**:
```bash
python scripts/validate_citations.py references.bib --check-authors
```
**Checks for**:
- Proper separator (`and`, not `&`, `;`, or `,`)
- Comma placement
- Empty author fields
- Malformed names
### 4. Data Consistency
**Purpose**: Ensure all fields contain valid, reasonable values.
#### Year Validation
**Valid years**:
```bibtex
year = {2024} % Current/recent
year = {1953} % Watson & Crick DNA structure (historical)
year = {1665} % Hooke's Micrographia (very old)
```
**Invalid years**:
```bibtex
year = {24} % Two digits (ambiguous)
year = {202} % Typo
year = {2099} % Far future (beyond next year, even for in-press work)
year = {0} % Obviously wrong
```
**Check**:
- Four digits
- Reasonable range (1600-current+1)
- Not all zeros
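The three year checks translate directly into code. A minimal sketch, assuming the 1600 lower bound and the "current year plus one" upper bound stated above; `year_problem` is a hypothetical helper, and the optional `today` parameter exists only to make the check reproducible.

```python
from datetime import date
from typing import Optional


def year_problem(value: str, today: Optional[date] = None) -> Optional[str]:
    """Return a description of what is wrong with a year, or None if fine."""
    today = today or date.today()
    text = value.strip()
    if not (len(text) == 4 and text.isascii() and text.isdigit()):
        return "year must be four digits"
    year = int(text)
    if year < 1600:
        return "year before 1600"
    if year > today.year + 1:
        return "year is in the future"
    return None
```

The all-zeros case needs no separate branch: `0000` has four digits but falls below the lower bound.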
#### Volume/Number Validation
```bibtex
volume = {123} % Numeric
volume = {12} % Valid
number = {3} % Valid
number = {S1} % Supplement issue (valid)
```
**Invalid**:
```bibtex
volume = {Vol. 123} % Should be just number
number = {Issue 3} % Should be just number
```
#### Page Range Validation
**Correct format**:
```bibtex
pages = {123--145} % En-dash (two hyphens)
pages = {e0123456} % PLOS-style article ID
pages = {123} % Single page
```
**Incorrect format**:
```bibtex
pages = {123-145} % Single hyphen (use --)
pages = {pp. 123-145} % Remove "pp."
pages = {123–145} % Unicode en-dash (may cause issues)
```
#### URL Validation
**Check**:
- URLs are accessible (return 200 status)
- HTTPS when available
- No obvious typos
- Permanent links (not temporary)
**Valid**:
```bibtex
url = {https://www.nature.com/articles/nature12345}
url = {https://arxiv.org/abs/2103.14030}
```
**Questionable**:
```bibtex
url = {http://...} % HTTP instead of HTTPS
url = {file:///...} % Local file path
url = {bit.ly/...} % URL shortener (not permanent)
```
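The static parts of URL validation (scheme and host) can be checked without any network access. This sketch uses the standard library's `urllib.parse`; the shortener list is a small illustrative sample rather than an authoritative one, `url_warnings` is a hypothetical helper, and it does not test whether the URL is actually reachable.

```python
from urllib.parse import urlparse

# Illustrative sample only; extend as needed.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co"}


def url_warnings(url: str) -> list[str]:
    """Return warnings about the scheme and host of a url field value."""
    # Without a scheme, urlparse needs a leading "//" to find the host.
    parsed = urlparse(url if "://" in url else "//" + url)
    scheme = parsed.scheme.lower()
    host = (parsed.hostname or "").lower()

    warnings: list[str] = []
    if scheme == "":
        warnings.append("missing scheme")
    elif scheme == "http":
        warnings.append("uses HTTP instead of HTTPS")
    elif scheme != "https":
        warnings.append(f"unsupported scheme: {scheme}")
    if host in SHORTENER_HOSTS:
        warnings.append("URL shortener (not permanent)")
    return warnings
```

Checking for a 200 status would be a separate step that issues an HTTP request for each URL that passes these checks.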
### 5. Duplicate Detection
**Purpose**: Find and remove duplicate entries.
#### Types of Duplicates
**Exact duplicates** (same DOI):
```bibtex
@article{Smith2024a,
doi = {10.1038/nature12345},
...
}
@article{Smith2024b,
doi = {10.1038/nature12345}, % Same DOI!
...
}
```
**Near duplicates** (similar title/authors):
```bibtex
@article{Smith2024,
title = {Machine Learning for Drug Discovery},
...
}
@article{Smith2024method,
title = {Machine learning for drug discovery}, % Same, different case
...
}
```
**Preprint + Published**:
```bibtex
@misc{Smith2023arxiv,
title = {AlphaFold Results},
howpublished = {arXiv},
...
}
@article{Smith2024,
title = {AlphaFold Results}, % Same paper, now published
journal = {Nature},
...
}
% Keep published version only
```
#### Detection Methods
**By DOI** (most reliable):
- Same DOI = exact duplicate
- Keep one, remove other
**By title similarity**:
- Normalize: lowercase, remove punctuation
- Calculate similarity (e.g., Levenshtein distance)
- Flag if >90% similar
**By author-year-title**:
- Same first author + year + similar title
- Likely duplicate
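The DOI and title-similarity methods can be combined in a short pairwise scan. Note the substitution: this sketch uses `difflib.SequenceMatcher` from the standard library as the similarity measure instead of Levenshtein distance, so the exact scores differ from an edit-distance implementation. The entry dictionaries with `key`, `title`, and optional `doi` are an assumed shape, and `find_duplicates` is a hypothetical helper.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase a title, drop braces and punctuation, collapse whitespace."""
    text = re.sub(r"[{}]", "", title).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def title_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def find_duplicates(entries: list[dict], threshold: float = 0.9) -> list[tuple]:
    """Return (key_a, key_b, reason) for each suspected duplicate pair."""
    found = []
    for i, first in enumerate(entries):
        for second in entries[i + 1:]:
            doi_a = first.get("doi", "").lower()
            doi_b = second.get("doi", "").lower()
            if doi_a and doi_a == doi_b:
                found.append((first["key"], second["key"], "same_doi"))
            elif title_similarity(first["title"], second["title"]) > threshold:
                found.append((first["key"], second["key"], "similar_title"))
    return found
```

The pairwise loop is quadratic in the number of entries, which is fine for a typical bibliography but slow for very large files. DOIs are compared case-insensitively.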
**Automated detection**:
```bash
python scripts/validate_citations.py references.bib --check-duplicates
```
**Output**:
```
Warning: Possible duplicate entries:
- Smith2024a (DOI: 10.1038/nature12345)
- Smith2024b (DOI: 10.1038/nature12345)
Recommendation: Keep one entry, remove the other.
```
### 6. Format and Syntax
**Purpose**: Ensure valid BibTeX syntax.
#### Common Syntax Errors
**Missing commas**:
```bibtex
@article{Smith2024,
author = {Smith, John} % Missing comma!
title = {Title}
}
% Should be:
author = {Smith, John}, % Comma after each field
```
**Unbalanced braces**:
```bibtex
title = {Title with {Protected} Text % Missing closing brace
% Should be:
title = {Title with {Protected} Text}
```
**Missing closing brace for entry**:
```bibtex
@article{Smith2024,
author = {Smith, John},
title = {Title}
% Missing closing brace!
% Should end with:
}
```
**Invalid characters in keys**:
```bibtex
@article{Smith&Doe2024, % & not allowed in key
...
}
% Use:
@article{SmithDoe2024,
...
}
```
#### BibTeX Syntax Rules
**Entry structure**:
```bibtex
@TYPE{citationkey,
field1 = {value1},
field2 = {value2},
...
fieldN = {valueN}
}
```
**Citation keys**:
- Alphanumeric and some punctuation (-, _, ., :)
- No spaces
- Case-sensitive
- Unique within file
**Field values**:
- Enclosed in {braces} or "quotes"
- Braces preferred for complex text
- Numbers can be unquoted: `year = 2024`
**Special characters**:
- `{` and `}` for grouping
- `\` for LaTeX commands
- Protect capitalization: `{AlphaFold}`
- Accents: `{\"u}`, `{\'e}`, `{\aa}`
#### Validation
```bash
python scripts/validate_citations.py references.bib --check-syntax
```
**Checks**:
- Valid BibTeX structure
- Balanced braces
- Proper commas
- Valid entry types
- Unique citation keys
## Validation Workflow
### Step 1: Basic Validation
Run comprehensive validation:
```bash
python scripts/validate_citations.py references.bib
```
**Checks all**:
- DOI resolution
- Required fields
- Author formatting
- Data consistency
- Duplicates
- Syntax
### Step 2: Review Report
Examine validation report:
```json
{
"total_entries": 150,
"valid_entries": 140,
"errors": [
{
"entry": "Smith2024",
"error": "missing_required_field",
"field": "journal",
"severity": "high"
},
{
"entry": "Doe2023",
"error": "invalid_doi",
"doi": "10.1038/broken",
"severity": "high"
}
],
"warnings": [
{
"entry": "Jones2022",
"warning": "missing_recommended_field",
"field": "volume",
"severity": "medium"
}
],
"duplicates": [
{
"entries": ["Smith2024a", "Smith2024b"],
"reason": "same_doi",
"doi": "10.1038/nature12345"
}
]
}
```
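A report in this shape is easy to condense into one line per finding for review. The sketch below assumes exactly the JSON structure shown above (`total_entries`, `valid_entries`, `errors`, `warnings`, `duplicates`); `summarize_report` is a hypothetical helper, and the real report format may differ.

```python
import json


def summarize_report(report: dict) -> list[str]:
    """Turn a validation report into short, human-readable lines."""
    total = report.get("total_entries", 0)
    valid = report.get("valid_entries", 0)
    lines = [f"{valid}/{total} entries valid"]
    for item in report.get("errors", []):
        lines.append(f"ERROR {item['entry']}: {item['error']}")
    for item in report.get("warnings", []):
        lines.append(f"WARNING {item['entry']}: {item['warning']}")
    for item in report.get("duplicates", []):
        keys = ", ".join(item["entries"])
        lines.append(f"DUPLICATE {keys} ({item['reason']})")
    return lines


def summarize_report_text(text: str) -> list[str]:
    """Parse a JSON report string and summarize it."""
    return summarize_report(json.loads(text))
```

Errors come first in the output, which matches the priority order used when fixing issues.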
### Step 3: Fix Issues
**High-priority** (errors):
1. Add missing required fields
2. Fix broken DOIs
3. Remove duplicates
4. Correct syntax errors
**Medium-priority** (warnings):
1. Add recommended fields
2. Improve author formatting
3. Fix page ranges
**Low-priority**:
1. Standardize formatting
2. Add URLs for accessibility
### Step 4: Auto-Fix
Use auto-fix for safe corrections:
```bash
python scripts/validate_citations.py references.bib \
--auto-fix \
--output fixed_references.bib
```
**Auto-fix can**:
- Fix page range format (- to --)
- Remove "pp." from pages
- Standardize author separators
- Fix common syntax errors
- Normalize field order
**Auto-fix cannot**:
- Add missing information
- Find correct DOIs
- Determine which duplicate to keep
- Fix semantic errors
### Step 5: Manual Review
Review auto-fixed file:
```bash
# Check what changed
diff references.bib fixed_references.bib
# Review specific entries that had errors
grep -A 10 "Smith2024" fixed_references.bib
```
### Step 6: Re-Validate
Validate after fixes:
```bash
python scripts/validate_citations.py fixed_references.bib --verbose
```
Should show:
```
✓ All DOIs valid
✓ All required fields present
✓ No duplicates found
✓ Syntax valid
✓ 150/150 entries valid
```
## Validation Checklist
Use this checklist before final submission:
### DOI Validation
- [ ] All DOIs resolve correctly
- [ ] Metadata matches between BibTeX and CrossRef
- [ ] No broken or invalid DOIs
### Completeness
- [ ] All entries have required fields
- [ ] Modern papers (2000+) have DOIs
- [ ] Authors properly formatted
- [ ] Journals/conferences properly named
### Consistency
- [ ] Years are 4-digit numbers
- [ ] Page ranges use -- not -
- [ ] Volume/number are numeric
- [ ] URLs are accessible
### Duplicates
- [ ] No entries with same DOI
- [ ] No near-duplicate titles
- [ ] Preprints updated to published versions
### Formatting
- [ ] Valid BibTeX syntax
- [ ] Balanced braces
- [ ] Proper commas
- [ ] Unique citation keys
### Final Checks
- [ ] Bibliography compiles without errors
- [ ] All citations in text appear in bibliography
- [ ] All bibliography entries cited in text
- [ ] Citation style matches journal requirements
## Best Practices
### 1. Validate Early and Often
```bash
# After extraction
python scripts/extract_metadata.py --doi ... --output refs.bib
python scripts/validate_citations.py refs.bib
# After manual edits
python scripts/validate_citations.py refs.bib
# Before submission
python scripts/validate_citations.py refs.bib --strict
```
### 2. Use Automated Tools
Don't validate manually - use scripts:
- Faster
- More comprehensive
- Catches errors humans miss
- Generates reports
### 3. Keep Backup
```bash
# Before auto-fix
cp references.bib references_backup.bib
# Run auto-fix
python scripts/validate_citations.py references.bib \
--auto-fix \
--output references_fixed.bib
# Review changes
diff references.bib references_fixed.bib
# If satisfied, replace
mv references_fixed.bib references.bib
```
### 4. Fix High-Priority First
**Priority order**:
1. Syntax errors (prevent compilation)
2. Missing required fields (incomplete citations)
3. Broken DOIs (broken links)
4. Duplicates (confusion, wasted space)
5. Missing recommended fields
6. Formatting inconsistencies
### 5. Document Exceptions
For entries that can't be fixed:
```bibtex
@article{Old1950,
author = {Smith, John},
title = {Title},
journal = {Obscure Journal},
year = {1950},
volume = {12},
pages = {34--56},
note = {DOI not available for publications before 2000}
}
```
### 6. Validate Against Journal Requirements
Different journals have different requirements:
- Citation style (numbered, author-year)
- Abbreviations (journal names)
- Maximum reference count
- Format (BibTeX, EndNote, manual)
Check journal author guidelines!
## Common Validation Issues
### Issue 1: Metadata Mismatch
**Problem**: BibTeX says 2023, CrossRef says 2024.
**Cause**:
- Online-first vs print publication
- Correction/update
- Extraction error
**Solution**:
1. Check actual article
2. Use more recent/accurate date
3. Update BibTeX entry
4. Re-validate
### Issue 2: Special Characters
**Problem**: LaTeX compilation fails on special characters.
**Cause**:
- Accented characters (é, ü, ñ)
- Chemical formulas (H₂O)
- Math symbols (α, β, ±)
**Solution**:
```bibtex
% Use LaTeX commands
author = {M{\"u}ller, Hans},            % Müller
title = {Study of H\textsubscript{2}O}, % H₂O
% Or keep UTF-8 in the .bib file and process it with biblatex and biber
```
### Issue 3: Incomplete Extraction
**Problem**: Extracted metadata missing fields.
**Cause**:
- Source doesn't provide all metadata
- Extraction error
- Incomplete record
**Solution**:
1. Check original article
2. Manually add missing fields
3. Use alternative source (PubMed vs CrossRef)
### Issue 4: Cannot Find Duplicate
**Problem**: Same paper appears twice, not detected.
**Cause**:
- Different DOIs (should be rare)
- Different titles (abbreviated, typo)
- Different citation keys
**Solution**:
- Manual search for author + year
- Check for similar titles
- Remove manually
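The manual check for similar titles can be partly automated. The following is a minimal sketch, assuming the entries have already been parsed into a mapping of citation key to title; the function names and the 0.9 threshold are illustrative and are not taken from the project's scripts.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize_title(title: str) -> str:
    """Lowercase a title and strip braces, punctuation, and extra spaces."""
    title = re.sub(r"[{}]", "", title.lower())
    title = re.sub(r"[^a-z0-9 ]+", " ", title)
    return " ".join(title.split())


def find_near_duplicates(entries, threshold=0.9):
    """Return (key_a, key_b, similarity) for titles that look alike.

    `entries` maps citation key -> title. The threshold is an assumed
    default; tune it against your own bibliography.
    """
    pairs = []
    for (key_a, title_a), (key_b, title_b) in combinations(entries.items(), 2):
        ratio = SequenceMatcher(
            None, normalize_title(title_a), normalize_title(title_b)
        ).ratio()
        if ratio >= threshold:
            pairs.append((key_a, key_b, round(ratio, 2)))
    return pairs


entries = {
    "Jumper2021": "Highly Accurate Protein Structure Prediction with {AlphaFold}",
    "Jumper2021a": "Highly accurate protein structure prediction with AlphaFold.",
    "Vaswani2017": "Attention is All You Need",
}
for key_a, key_b, ratio in find_near_duplicates(entries):
    print(f"Possible duplicate: {key_a} vs {key_b} (similarity {ratio})")
```

The normalization drops non-ASCII letters, so titles with accented characters may need a gentler cleanup step. Flagged pairs should still be reviewed by hand before anything is deleted.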
## Summary
Validation ensures citation quality:
- **Accuracy**: DOIs resolve, metadata correct
- **Completeness**: All required fields present
- **Consistency**: Proper formatting throughout
- **No duplicates**: Each paper cited once
- **Valid syntax**: BibTeX compiles without errors
**Always validate** before final submission!
Use automated tools:
```bash
python scripts/validate_citations.py references.bib
```
Follow workflow:
1. Extract metadata
2. Validate
3. Fix errors
4. Re-validate
5. Submit


@@ -0,0 +1,725 @@
# Google Scholar Search Guide
Comprehensive guide to searching Google Scholar for academic papers, including advanced search operators, filtering strategies, and metadata extraction.
## Overview
Google Scholar offers some of the broadest coverage of academic literature across all disciplines:
- **Coverage**: 100+ million scholarly documents (an estimate; Google does not publish an official count)
- **Scope**: All academic disciplines
- **Content types**: Journal articles, books, theses, conference papers, preprints, patents, court opinions
- **Citation tracking**: "Cited by" links for forward citation tracking
- **Accessibility**: Free to use, no account required
## Basic Search
### Simple Keyword Search
Search for papers containing specific terms anywhere in the document (title, abstract, full text):
```
CRISPR gene editing
machine learning protein folding
climate change impact agriculture
quantum computing algorithms
```
**Tips**:
- Use specific technical terms
- Include key acronyms and abbreviations
- Start broad, then refine
- Check spelling of technical terms
### Exact Phrase Search
Use quotation marks to search for exact phrases:
```
"deep learning"
"CRISPR-Cas9"
"systematic review"
"randomized controlled trial"
```
**When to use**:
- Technical terms that must appear together
- Proper names
- Specific methodologies
- Exact titles
## Advanced Search Operators
### Author Search
Find papers by specific authors:
```
author:LeCun
author:"Geoffrey Hinton"
author:Church synthetic biology
```
**Variations**:
- Single last name: `author:Smith`
- Full name in quotes: `author:"Jane Smith"`
- Author + topic: `author:Doudna CRISPR`
**Tips**:
- Authors may publish under different name variations
- Try with and without middle initials
- Consider name changes (marriage, etc.)
- Use quotation marks for full names
### Title Search
Search only in article titles:
```
intitle:transformer
intitle:"attention mechanism"
intitle:review climate change
```
**Use cases**:
- Finding papers specifically about a topic
- More precise than full-text search
- Reduces irrelevant results
- Good for finding reviews or methods
### Source (Journal) Search
Search within specific journals or conferences:
```
source:Nature
source:"Nature Communications"
source:NeurIPS
source:"Journal of Machine Learning Research"
```
**Applications**:
- Track publications in top-tier venues
- Find papers in specialized journals
- Identify conference-specific work
- Verify publication venue
### Exclusion Operator
Exclude terms from results:
```
machine learning -survey
CRISPR -patent
climate change -news
deep learning -tutorial -review
```
**Common exclusions**:
- `-survey`: Exclude survey papers
- `-review`: Exclude review articles
- `-patent`: Exclude patents
- `-book`: Exclude books
- `-news`: Exclude news articles
- `-tutorial`: Exclude tutorials
### OR Operator
Search for papers containing any of multiple terms:
```
"machine learning" OR "deep learning"
CRISPR OR "gene editing"
"climate change" OR "global warming"
```
**Best practices**:
- OR must be uppercase
- Combine synonyms
- Include acronyms and spelled-out versions
- Use with exact phrases
### Wildcard Search
Use asterisk (*) as wildcard for unknown words:
```
"machine * learning"
"CRISPR * editing"
"* neural network"
```
**Note**: Limited wildcard support in Google Scholar compared to other databases.
## Advanced Filtering
### Year Range
Filter by publication year:
**Using interface**:
- Click "Since [year]" on left sidebar
- Select custom range
**Using search operators**:
```
# No year-range operator exists in the query syntax
# Use the interface, or the as_ylo / as_yhi URL parameters
```
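As a rough sketch of the URL-parameter route, the year range can be set with Google Scholar's `as_ylo` (earliest year) and `as_yhi` (latest year) parameters. The helper name `scholar_url` is hypothetical and only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode


def scholar_url(query, year_start=None, year_end=None):
    """Build a Google Scholar search URL with an optional year range."""
    params = {"q": query}
    if year_start is not None:
        params["as_ylo"] = year_start  # earliest publication year
    if year_end is not None:
        params["as_yhi"] = year_end  # latest publication year
    return "https://scholar.google.com/scholar?" + urlencode(params)


print(scholar_url('"quantum computing"', 2020, 2024))
```

Opening the printed URL in a browser gives the same result as choosing a custom range in the sidebar.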
**In script**:
```bash
python scripts/search_google_scholar.py "quantum computing" \
--year-start 2020 \
--year-end 2024
```
### Sorting Options
**By relevance** (default):
- Google's algorithm determines relevance
- Considers citations, author reputation, publication venue
- Generally good for most searches
**By date**:
- Most recent papers first
- Good for fast-moving fields
- May miss highly cited older papers
- Click "Sort by date" in interface
**By citation count** (via script):
```bash
python scripts/search_google_scholar.py "transformers" \
--sort-by citations \
--limit 50
```
### Language Filtering
**In interface**:
- Settings → Languages
- Select preferred languages
**Default**: English and papers with English abstracts
## Search Strategies
### Finding Seminal Papers
Identify highly influential papers in a field:
1. **Search by topic** with broad terms
2. **Sort by citations** (most cited first; use the script, since the web interface has no citation sort)
3. **Look for review articles** for comprehensive overviews
4. **Check publication dates** for foundational vs recent work
**Example**:
```
"generative adversarial networks"
# Sort by citations
# Top results: original GAN paper (Goodfellow et al., 2014), key variants
```
### Finding Recent Work
Stay current with latest research:
1. **Search by topic**
2. **Filter to recent years** (last 1-2 years)
3. **Sort by date** for newest first
4. **Set up alerts** for ongoing tracking
**Example**:
```bash
python scripts/search_google_scholar.py "AlphaFold protein structure" \
--year-start 2023 \
--year-end 2024 \
--limit 50
```
### Finding Review Articles
Get comprehensive overviews of a field:
```
intitle:review "machine learning"
"systematic review" CRISPR
intitle:survey "natural language processing"
```
**Indicators**:
- "review", "survey", "perspective" in title
- Often highly cited
- Published in review journals (Nature Reviews, Trends, etc.)
- Comprehensive reference lists
### Citation Chain Search
**Forward citations** (papers citing a key paper):
1. Find seminal paper
2. Click "Cited by X"
3. See all papers that cite it
4. Identify how field has developed
**Backward citations** (references in a key paper):
1. Find recent review or important paper
2. Check its reference list
3. Identify foundational work
4. Trace development of ideas
**Example workflow**:
```
# Find original transformer paper
"Attention is all you need" author:Vaswani
# Check "Cited by 120,000+"
# See evolution: BERT, GPT, T5, etc.
# Check references in original paper
# Find RNN, LSTM, attention mechanism origins
```
### Comprehensive Literature Search
For thorough coverage (e.g., systematic reviews):
1. **Generate synonym list**:
- Main terms + alternatives
- Acronyms + spelled out
- US vs UK spelling
2. **Use OR operators**:
```
("machine learning" OR "deep learning" OR "neural networks")
```
3. **Combine multiple concepts**:
```
("machine learning" OR "deep learning") ("drug discovery" OR "drug development")
```
4. **Search without date filters** initially:
- Get total landscape
- Filter later if too many results
5. **Export results** for systematic analysis:
```bash
python scripts/search_google_scholar.py \
'"machine learning" OR "deep learning" drug discovery' \
--limit 500 \
--output comprehensive_search.json
```
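Building the combined query by hand gets error-prone once the synonym lists grow. A small sketch of the same idea in Python follows; `or_group` and `build_query` are hypothetical helper names, not part of the project's scripts.

```python
def or_group(terms):
    """Join synonyms into a parenthesized OR group, quoting multi-word terms."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*concepts):
    """Combine concept groups; a space between groups acts as AND."""
    return " ".join(or_group(terms) for terms in concepts)


query = build_query(
    ["machine learning", "deep learning"],
    ["drug discovery", "drug development"],
)
print(query)
```

Each list is one concept with its synonyms, so adding a US/UK spelling variant only means appending one string to the relevant list.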
## Extracting Citation Information
### From Google Scholar Results Page
Each result shows:
- **Title**: Paper title (linked to full text if available)
- **Authors**: Author list (often truncated)
- **Source**: Journal/conference, year, publisher
- **Cited by**: Number of citations + link to citing papers
- **Related articles**: Link to similar papers
- **All versions**: Different versions of the same paper
### Export Options
**Manual export**:
1. Click "Cite" under paper
2. Select BibTeX format
3. Copy citation
**Limitations**:
- One paper at a time
- Manual process
- Time-consuming for many papers
**Automated export** (using script):
```bash
# Search and export to BibTeX
python scripts/search_google_scholar.py "quantum computing" \
--limit 50 \
--format bibtex \
--output quantum_papers.bib
```
### Metadata Available
From Google Scholar you can typically extract:
- Title
- Authors (may be incomplete)
- Year
- Source (journal/conference)
- Citation count
- Link to full text (when available)
- Link to PDF (when available)
**Note**: Metadata quality varies:
- Some fields may be missing
- Author names may be incomplete
- Need to verify with DOI lookup for accuracy
## Rate Limiting and Access
### Rate Limits
Google Scholar has rate limiting to prevent automated scraping:
**Symptoms of rate limiting**:
- CAPTCHA challenges
- Temporary IP blocks
- 429 "Too Many Requests" errors
**Best practices**:
1. **Add delays between requests**: 2-5 seconds minimum
2. **Limit query volume**: Don't search hundreds of queries rapidly
3. **Use the `scholarly` library**: Python client for Google Scholar with built-in proxy support (delays between requests are still needed)
4. **Rotate User-Agents**: Appear as different browsers
5. **Consider proxies**: For large-scale searches (use ethically)
**In our scripts**:
```python
import random
import time
# Automatic rate limiting built in
time.sleep(random.uniform(3, 7))  # Random delay 3-7 seconds
```
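The random delay can also be wrapped in a small generator so that every loop over queries is paced the same way. This is a sketch under stated assumptions: the name `throttled` and the injectable `sleep` argument are illustrative and not how the project's scripts are necessarily written.

```python
import random
import time


def throttled(items, min_delay=3.0, max_delay=7.0, sleep=time.sleep):
    """Yield items one at a time, pausing a random interval between them.

    No pause happens before the first item. `sleep` is injectable so the
    generator can be exercised in tests without actually waiting.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield item


# Zero delays here only to keep the demonstration fast.
for query in throttled(["CRISPR", "AlphaFold"], min_delay=0.0, max_delay=0.0):
    print("searching:", query)
```

Randomizing the pause avoids a fixed request rhythm, and keeping the delay in one place makes it easy to lengthen if CAPTCHAs start to appear.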
### Ethical Considerations
**DO**:
- Respect rate limits
- Use reasonable delays
- Cache results (don't re-query)
- Use official APIs when available
- Attribute data properly
**DON'T**:
- Scrape aggressively
- Use multiple IPs to bypass limits
- Violate terms of service
- Burden servers unnecessarily
- Use data commercially without permission
### Institutional Access
**Benefits of institutional access**:
- Access to full-text PDFs through library subscriptions
- Better download capabilities
- Integration with library systems
- Link resolver to full text
**Setup**:
- Google Scholar → Settings → Library links
- Add your institution
- Links appear in search results
## Tips and Best Practices
### Search Optimization
1. **Start simple, then refine**:
```
# Too specific initially (with the year filter also set to 2023-2024)
intitle:"deep learning" intitle:review source:Nature
# Better approach
deep learning review
# Review results
# Add intitle:, source:, year filters as needed
```
2. **Use multiple search strategies**:
- Keyword search
- Author search for known experts
- Citation chaining from key papers
- Source search in top journals
3. **Check spelling and variations**:
- Color vs colour
- Optimization vs optimisation
- Tumor vs tumour
- Try common misspellings if few results
4. **Combine operators strategically**:
```
# Good combination (set the year range to 2015-2024 in the sidebar)
author:Church intitle:"synthetic biology"
# Finds papers by a specific author with the topic in the title
```
### Result Evaluation
1. **Check citation counts**:
- High citations indicate influence
- Recent papers may have low citations but be important
- Citation counts vary by field
2. **Verify publication venue**:
- Peer-reviewed journals vs preprints
- Conference proceedings
- Book chapters
- Technical reports
3. **Check for full text access**:
- [PDF] link on right side
- "All X versions" may have open access version
- Check institutional access
- Try author's website or ResearchGate
4. **Look for review articles**:
- Comprehensive overviews
- Good starting point for new topics
- Extensive reference lists
### Managing Results
1. **Use citation manager integration**:
- Export to BibTeX
- Import to Zotero, Mendeley, EndNote
- Maintain organized library
2. **Set up alerts** for ongoing research:
- Google Scholar → Alerts
- Get emails for new papers matching query
- Track specific authors or topics
3. **Create collections**:
- Save papers to Google Scholar Library
- Organize by project or topic
- Add labels and notes
4. **Export systematically**:
```bash
# Save search results for later analysis
python scripts/search_google_scholar.py "your topic" \
--output topic_papers.json
# Can re-process later without re-searching
python scripts/extract_metadata.py \
--input topic_papers.json \
--output topic_refs.bib
```
## Advanced Techniques
### Boolean Logic Combinations
Combine multiple operators for precise searches:
```
# Reviews on a specific topic by known authors (set year range 2020-2024 in the sidebar)
intitle:review "machine learning" ("drug discovery" OR "drug development") (author:Horvath OR author:Bengio)
# Method papers excluding reviews
intitle:method "protein folding" -review -survey
# Papers in selected journals only (set year range 2022-2024 in the sidebar)
(source:Nature OR source:Science OR source:Cell) CRISPR
```
### Finding Open Access Papers
```
# Search with generic terms
machine learning
# Filter by "All versions" which often includes preprints
# Look for green [PDF] links (often open access)
# Check arXiv, bioRxiv versions
```
**In script**:
```bash
python scripts/search_google_scholar.py "topic" \
--open-access-only \
--output open_access_papers.json
```
### Tracking Research Impact
**For a specific paper**:
1. Find the paper
2. Click "Cited by X"
3. Analyze citing papers:
- How is it being used?
- What fields cite it?
- Recent vs older citations?
**For an author**:
1. Search `author:LastName`
2. Open the author's Google Scholar profile (if one exists) and check h-index and i10-index
3. View citation history graph
4. Identify most influential papers
**For a topic**:
1. Search topic
2. Sort by citations
3. Identify seminal papers (highly cited, older)
4. Check recent highly-cited papers (emerging important work)
### Finding Preprints and Early Work
```
# arXiv papers
source:arxiv "deep learning"
# bioRxiv papers
source:biorxiv CRISPR
# All preprint servers
("arxiv" OR "biorxiv" OR "medrxiv") your topic
```
**Note**: Preprints are not peer-reviewed. Always check if published version exists.
## Common Issues and Solutions
### Too Many Results
**Problem**: Search returns 100,000+ results, overwhelming.
**Solutions**:
1. Add more specific terms
2. Use `intitle:` to search only titles
3. Filter by recent years
4. Add exclusions (e.g., `-review`)
5. Search within specific journals
### Too Few Results
**Problem**: Search returns 0-10 results, suspiciously few.
**Solutions**:
1. Remove restrictive operators
2. Try synonyms and related terms
3. Check spelling
4. Broaden year range
5. Use OR for alternative terms
### Irrelevant Results
**Problem**: Results don't match intent.
**Solutions**:
1. Use exact phrases with quotes
2. Add more specific context terms
3. Use `intitle:` for title-only search
4. Exclude common irrelevant terms
5. Combine multiple specific terms
### CAPTCHA or Rate Limiting
**Problem**: Google Scholar shows CAPTCHA or blocks access.
**Solutions**:
1. Wait several minutes before continuing
2. Reduce query frequency
3. Use longer delays in scripts (5-10 seconds)
4. Switch to different IP/network
5. Consider using institutional access
### Missing Metadata
**Problem**: Author names, year, or venue missing from results.
**Solutions**:
1. Click through to see full details
2. Check "All versions" for better metadata
3. Look up by DOI if available
4. Extract metadata from CrossRef/PubMed instead
5. Manually verify from paper PDF
### Duplicate Results
**Problem**: Same paper appears multiple times.
**Solutions**:
1. Click "All X versions" to see consolidated view
2. Choose version with best metadata
3. Use deduplication in post-processing:
```bash
python scripts/format_bibtex.py results.bib \
--deduplicate \
--output clean_results.bib
```
## Integration with Scripts
### search_google_scholar.py Usage
**Basic search**:
```bash
python scripts/search_google_scholar.py "machine learning drug discovery"
```
**With year filter**:
```bash
python scripts/search_google_scholar.py "CRISPR" \
--year-start 2020 \
--year-end 2024 \
--limit 100
```
**Sort by citations**:
```bash
python scripts/search_google_scholar.py "transformers" \
--sort-by citations \
--limit 50
```
**Export to BibTeX**:
```bash
python scripts/search_google_scholar.py "quantum computing" \
--format bibtex \
--output quantum.bib
```
**Export to JSON for later processing**:
```bash
python scripts/search_google_scholar.py "topic" \
--format json \
--output results.json
# Later: extract full metadata
python scripts/extract_metadata.py \
--input results.json \
--output references.bib
```
### Batch Searching
For multiple topics:
```bash
# Create file with search queries (queries.txt)
# One query per line
# Search each query
while IFS= read -r query; do
  python scripts/search_google_scholar.py "$query" \
    --limit 50 \
    --output "${query// /_}.json"
  sleep 10  # Delay between queries
done < queries.txt
```
## Summary
Google Scholar is one of the most comprehensive academic search engines, providing:
- **Broad coverage**: All disciplines, an estimated 100M+ documents
- **Free access**: No account or subscription required
- **Citation tracking**: "Cited by" for impact analysis
- **Multiple formats**: Articles, books, theses, patents
- **Full-text search**: Not just abstracts
Key strategies:
- Use advanced operators for precision
- Combine author, title, source searches
- Track citations for impact
- Export systematically to citation manager
- Respect rate limits and access policies
- Verify metadata with CrossRef/PubMed
For biomedical research, complement with PubMed for MeSH terms and curated metadata.


@@ -0,0 +1,870 @@
# Metadata Extraction Guide
Comprehensive guide to extracting accurate citation metadata from DOIs, PMIDs, arXiv IDs, and URLs using various APIs and services.
## Overview
Accurate metadata is essential for proper citations. This guide covers:
- Identifying paper identifiers (DOI, PMID, arXiv ID)
- Querying metadata APIs (CrossRef, PubMed, arXiv, DataCite)
- Required BibTeX fields by entry type
- Handling edge cases and special situations
- Validating extracted metadata
## Paper Identifiers
### DOI (Digital Object Identifier)
**Format**: `10.XXXX/suffix`
**Examples**:
```
10.1038/s41586-021-03819-2 # Nature article
10.1126/science.aam9317 # Science article
10.1016/j.cell.2023.01.001 # Cell article
10.1371/journal.pone.0123456 # PLOS ONE article
```
**Properties**:
- Permanent identifier
- Most reliable for metadata
- Resolves to current location
- Publisher-assigned
**Where to find**:
- First page of article
- Article webpage
- CrossRef, Google Scholar, PubMed
- Usually prominent on publisher site
### PMID (PubMed ID)
**Format**: Number of 1 to 8 digits (recent articles have 8)
**Examples**:
```
34265844
28445112
35476778
```
**Properties**:
- Specific to PubMed database
- Biomedical literature only
- Assigned by NCBI
- Permanent identifier
**Where to find**:
- PubMed search results
- Article page on PubMed
- Often in article PDF footer
- PMC (PubMed Central) pages
### PMCID (PubMed Central ID)
**Format**: PMC followed by numbers
**Examples**:
```
PMC8287551
PMC7456789
```
**Properties**:
- Free full-text articles in PMC
- Subset of PubMed articles
- Open access or author manuscripts
### arXiv ID
**Format**: `YYMM.NNNNN` (four-digit `YYMM.NNNN` for 2007-2014 submissions) or the older `archive/YYMMNNN`
**Examples**:
```
2103.14030 # New format (since 2007)
2401.12345 # 2024 submission
arXiv:hep-th/9901001 # Old format
```
**Properties**:
- Preprints (not peer-reviewed)
- Physics, math, CS, q-bio, etc.
- Version tracking (v1, v2, etc.)
- Free, open access
**Where to find**:
- arXiv.org
- Often cited before publication
- Paper PDF header
### Other Identifiers
**ISBN** (Books):
```
978-0-12-345678-9
0-123-45678-9
```
**arXiv category**:
```
cs.LG # Computer Science - Machine Learning
q-bio.QM # Quantitative Biology - Quantitative Methods
math.ST # Mathematics - Statistics
```
## Metadata APIs
### CrossRef API
**Primary source for DOIs** - Most comprehensive metadata for journal articles.
**Base URL**: `https://api.crossref.org/works/`
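**No API key required**, but the polite pool is recommended:
- Add a contact email (`mailto:`) to the User-Agent header, or pass it as a `mailto` query parameter
- Gets more reliable service
- Rate limits still apply; they are reported in the `X-Rate-Limit-Limit` and `X-Rate-Limit-Interval` response headers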
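A minimal sketch of a polite-pool request using only the standard library is shown here. The tool name `citation-tools/0.1` and the helper `crossref_request` are placeholders chosen for illustration; the function builds the request without sending it.

```python
from urllib.parse import quote
from urllib.request import Request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_request(doi, email):
    """Build (but do not send) a CrossRef request that identifies the caller."""
    # Hypothetical tool name; the mailto address is what identifies the caller.
    user_agent = f"citation-tools/0.1 (mailto:{email})"
    return Request(
        CROSSREF_WORKS + quote(doi, safe="/"),
        headers={"User-Agent": user_agent},
    )


request = crossref_request("10.1038/s41586-021-03819-2", "you@example.org")
print(request.full_url)
print(request.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen` performs the actual lookup. Use a real, monitored address so CrossRef can reach you if your client misbehaves.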
#### Basic DOI Lookup
**Request**:
```
GET https://api.crossref.org/works/10.1038/s41586-021-03819-2
```
**Response** (simplified):
```json
{
  "message": {
    "DOI": "10.1038/s41586-021-03819-2",
    "title": ["Highly accurate protein structure prediction with AlphaFold"],
    "author": [
      {"given": "John", "family": "Jumper"},
      {"given": "Richard", "family": "Evans"}
    ],
    "container-title": ["Nature"],
    "volume": "596",
    "issue": "7873",
    "page": "583-589",
    "published-print": {"date-parts": [[2021, 8, 26]]},
    "publisher": "Springer Science and Business Media LLC",
    "type": "journal-article",
    "ISSN": ["0028-0836"]
  }
}
```
#### Fields Available
**Always present**:
- `DOI`: Digital Object Identifier
- `title`: Article title (array)
- `type`: Content type (journal-article, book-chapter, etc.)
**Usually present**:
- `author`: Array of author objects
- `container-title`: Journal/book title
- `published-print` or `published-online`: Publication date
- `volume`, `issue`, `page`: Publication details
- `publisher`: Publisher name
**Sometimes present**:
- `abstract`: Article abstract
- `subject`: Subject categories
- `ISSN`: Journal ISSN
- `ISBN`: Book ISBN
- `reference`: Reference list
- `is-referenced-by-count`: Citation count
#### Content Types
CrossRef `type` field values:
- `journal-article`: Journal articles
- `book-chapter`: Book chapters
- `book`: Books
- `proceedings-article`: Conference papers
- `posted-content`: Preprints
- `dataset`: Research datasets
- `report`: Technical reports
- `dissertation`: Theses/dissertations
### PubMed E-utilities API
**Specialized for biomedical literature** - Curated metadata with MeSH terms.
**Base URL**: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/`
**API key recommended** (free):
- Higher rate limits
- Better performance
#### PMID to Metadata
**Step 1: EFetch for full record**
```
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
db=pubmed&
id=34265844&
retmode=xml&
api_key=YOUR_KEY
```
**Response**: XML with comprehensive metadata
**Step 2: Parse XML**
Key fields:
```xml
<PubmedArticle>
<MedlineCitation>
<PMID>34265844</PMID>
<Article>
<ArticleTitle>Highly accurate protein structure prediction with AlphaFold</ArticleTitle>
<AuthorList>
<Author><LastName>Jumper</LastName><ForeName>John</ForeName></Author>
</AuthorList>
<Journal>
<Title>Nature</Title>
<JournalIssue>
<Volume>596</Volume>
<Issue>7873</Issue>
<PubDate><Year>2021</Year></PubDate>
</JournalIssue>
</Journal>
<Pagination><MedlinePgn>583-589</MedlinePgn></Pagination>
<Abstract><AbstractText>Abstract text here</AbstractText></Abstract>
</Article>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
<ArticleId IdType="pmc">PMC8287551</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
```
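The parsing step can be sketched with the standard library's `xml.etree.ElementTree`. This is an illustration rather than the project's parser: the embedded sample is a trimmed, hand-written record, and the function names are hypothetical. A real EFetch response wraps records in a `PubmedArticleSet` root, so production code would loop over each `PubmedArticle` element.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample record (not a verbatim API response).
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <ArticleTitle>Title here</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
      <Journal>
        <Title>Nature</Title>
        <JournalIssue><PubDate><Year>2021</Year></PubDate></JournalIssue>
      </Journal>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
"""


def format_author(author):
    """Return 'Last, First', or just 'Last' when no forename is given."""
    last = author.findtext("LastName")
    first = author.findtext("ForeName")
    return f"{last}, {first}" if first else last


def parse_pubmed_article(xml_text):
    """Extract a few citation fields from one <PubmedArticle> element."""
    root = ET.fromstring(xml_text)
    article = root.find("MedlineCitation/Article")
    authors = [
        format_author(a)
        for a in article.findall("AuthorList/Author")
        if a.findtext("LastName")  # skips collective (group) authors
    ]
    return {
        "pmid": root.findtext("MedlineCitation/PMID"),
        "title": article.findtext("ArticleTitle"),
        "author": " and ".join(authors),
        "journal": article.findtext("Journal/Title"),
        "year": article.findtext("Journal/JournalIssue/PubDate/Year"),
        "doi": root.findtext("PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
    }


print(parse_pubmed_article(SAMPLE))
```

Fields that are absent come back as `None`, which makes missing metadata easy to spot before formatting the entry.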
#### Unique PubMed Fields
**MeSH Terms**: Controlled vocabulary
```xml
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D003920">Diabetes Mellitus</DescriptorName>
</MeshHeading>
</MeshHeadingList>
```
**Publication Types**:
```xml
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D016449">Randomized Controlled Trial</PublicationType>
</PublicationTypeList>
```
**Grant Information**:
```xml
<GrantList>
<Grant>
<GrantID>R01-123456</GrantID>
<Agency>NIAID NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
```
### arXiv API
**Preprints in physics, math, CS, q-bio** - Free, open access.
**Base URL**: `http://export.arxiv.org/api/query`
**No API key required**
#### arXiv ID to Metadata
**Request**:
```
GET http://export.arxiv.org/api/query?id_list=2103.14030
```
**Response**: Atom XML (simplified; field values are placeholders)
```xml
<entry>
<id>http://arxiv.org/abs/2103.14030v2</id>
<title>Preprint title here</title>
<author><name>First Author</name></author>
<author><name>Second Author</name></author>
<published>2021-03-26T17:47:17Z</published>
<updated>2021-07-01T16:51:46Z</updated>
<summary>Abstract text here...</summary>
<arxiv:doi>10.XXXX/published-version-doi</arxiv:doi>
<category term="q-bio.BM" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>
```
#### Key Fields
- `id`: arXiv URL
- `title`: Preprint title
- `author`: Author list
- `published`: First version date
- `updated`: Latest version date
- `summary`: Abstract
- `arxiv:doi`: DOI if published
- `arxiv:journal_ref`: Journal reference if published
- `category`: arXiv categories
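Reading these fields requires namespace-aware parsing, because the feed mixes the Atom namespace with arXiv's own. The sketch below assumes a hand-written feed with placeholder values (the title, authors, timestamp, and DOI are invented for illustration), and `parse_arxiv_entry` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written feed with placeholder values, not a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2401.12345v2</id>
    <title>Placeholder preprint
      title</title>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <published>2024-01-22T00:00:00Z</published>
    <arxiv:doi>10.1234/journal.2024.123456</arxiv:doi>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""


def parse_arxiv_entry(feed_xml):
    """Extract citation fields from the first <entry> of an arXiv Atom feed."""
    root = ET.fromstring(feed_xml)
    entry = root.find("atom:entry", NS)
    abs_url = entry.findtext("atom:id", default="", namespaces=NS)
    # Split 'http://arxiv.org/abs/<id>v<N>' into the bare ID and version.
    match = re.match(r"https?://arxiv\.org/abs/(.+?)(?:v(\d+))?$", abs_url)
    title = entry.findtext("atom:title", default="", namespaces=NS)
    return {
        "arxiv_id": match.group(1) if match else None,
        "version": int(match.group(2)) if match and match.group(2) else None,
        "title": " ".join(title.split()),  # titles may contain line breaks
        "authors": [
            a.findtext("atom:name", namespaces=NS)
            for a in entry.findall("atom:author", NS)
        ],
        "year": entry.findtext("atom:published", default="", namespaces=NS)[:4],
        "doi": entry.findtext("arxiv:doi", namespaces=NS),  # None if unpublished
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
    }


print(parse_arxiv_entry(SAMPLE_FEED))
```

A non-empty `doi` is the signal that a published version exists and that the citation should point at the journal article instead.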
#### Version Tracking
arXiv tracks versions:
- `v1`: Initial submission
- `v2`, `v3`, etc.: Revisions
**Always check** if preprint has been published in journal (use DOI if available).
### DataCite API
**Research datasets, software, other outputs** - Assigns DOIs to non-traditional scholarly works.
**Base URL**: `https://api.datacite.org/dois/`
**Similar to CrossRef** but for datasets, software, code, etc.
**Request**:
```
GET https://api.datacite.org/dois/10.5281/zenodo.1234567
```
**Response**: JSON with metadata for dataset/software
## Required BibTeX Fields
### @article (Journal Articles)
**Required**:
- `author`: Author names
- `title`: Article title
- `journal`: Journal name
- `year`: Publication year
**Optional but recommended**:
- `volume`: Volume number
- `number`: Issue number
- `pages`: Page range (e.g., 123--145)
- `doi`: Digital Object Identifier
- `url`: URL if no DOI
- `month`: Publication month
**Example**:
```bibtex
@article{Smith2024,
author = {Smith, John and Doe, Jane},
title = {Novel Approach to Protein Folding},
journal = {Nature},
year = {2024},
volume = {625},
number = {8001},
pages = {123--145},
doi = {10.1038/nature12345}
}
```
### @book (Books)
**Required**:
- `author` or `editor`: Author(s) or editor(s)
- `title`: Book title
- `publisher`: Publisher name
- `year`: Publication year
**Optional but recommended**:
- `edition`: Edition number (if not first)
- `address`: Publisher location
- `isbn`: ISBN
- `url`: URL
- `series`: Series name
**Example**:
```bibtex
@book{Kumar2021,
author = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
title = {Robbins and Cotran Pathologic Basis of Disease},
publisher = {Elsevier},
year = {2021},
edition = {10},
isbn = {978-0-323-53113-9}
}
```
### @inproceedings (Conference Papers)
**Required**:
- `author`: Author names
- `title`: Paper title
- `booktitle`: Conference/proceedings name
- `year`: Year
**Optional but recommended**:
- `pages`: Page range
- `organization`: Organizing body
- `publisher`: Publisher
- `address`: Conference location
- `month`: Conference month
- `doi`: DOI if available
**Example**:
```bibtex
@inproceedings{Vaswani2017,
author = {Vaswani, Ashish and Shazeer, Noam and others},
title = {Attention is All You Need},
booktitle = {Advances in Neural Information Processing Systems},
year = {2017},
pages = {5998--6008},
volume = {30}
}
```
### @incollection (Book Chapters)
**Required**:
- `author`: Chapter author(s)
- `title`: Chapter title
- `booktitle`: Book title
- `publisher`: Publisher name
- `year`: Publication year
**Optional but recommended**:
- `editor`: Book editor(s)
- `pages`: Chapter page range
- `chapter`: Chapter number
- `edition`: Edition
- `address`: Publisher location
**Example**:
```bibtex
@incollection{Brown2020,
author = {Brown, Patrick O. and Botstein, David},
title = {Exploring the New World of the Genome with {DNA} Microarrays},
booktitle = {DNA Microarrays: A Molecular Cloning Manual},
editor = {Eisen, Michael B. and Brown, Patrick O.},
publisher = {Cold Spring Harbor Laboratory Press},
year = {2020},
pages = {1--45}
}
```
### @phdthesis (Dissertations)
**Required**:
- `author`: Author name
- `title`: Thesis title
- `school`: Institution
- `year`: Year
**Optional**:
- `type`: Type (e.g., "PhD dissertation")
- `address`: Institution location
- `month`: Month
- `url`: URL
**Example**:
```bibtex
@phdthesis{Johnson2023,
author = {Johnson, Mary L.},
title = {Novel Approaches to Cancer Immunotherapy},
school = {Stanford University},
year = {2023},
type = {{PhD} dissertation}
}
```
### @misc (Preprints, Software, Datasets)
**Required**:
- `author`: Author(s)
- `title`: Title
- `year`: Year
**For preprints, add**:
- `howpublished`: Repository (e.g., "bioRxiv")
- `doi`: Preprint DOI
- `note`: Preprint ID
**Example (preprint)**:
```bibtex
@misc{Zhang2024,
author = {Zhang, Yi and Chen, Li and Wang, Hui},
title = {Novel Therapeutic Targets in Alzheimer's Disease},
year = {2024},
howpublished = {bioRxiv},
doi = {10.1101/2024.01.001},
note = {Preprint}
}
```
**Example (software)**:
```bibtex
@misc{AlphaFold2021,
author = {DeepMind},
title = {{AlphaFold} Protein Structure Database},
year = {2021},
howpublished = {Software},
url = {https://alphafold.ebi.ac.uk/},
doi = {10.5281/zenodo.5123456}
}
```
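The required-field lists in this section translate directly into a lookup table. The following sketch checks a parsed entry against them; it encodes this guide's lists (including `author`/`title`/`year` for `@misc`), and the names `REQUIRED_FIELDS` and `missing_fields` are illustrative rather than taken from the project's validator.

```python
# Each tuple is a group of alternatives; one field from the group suffices.
REQUIRED_FIELDS = {
    "article": [("author",), ("title",), ("journal",), ("year",)],
    "book": [("author", "editor"), ("title",), ("publisher",), ("year",)],
    "inproceedings": [("author",), ("title",), ("booktitle",), ("year",)],
    "incollection": [
        ("author",), ("title",), ("booktitle",), ("publisher",), ("year",),
    ],
    "phdthesis": [("author",), ("title",), ("school",), ("year",)],
    "misc": [("author",), ("title",), ("year",)],
}


def missing_fields(entry_type, fields):
    """Return the required field groups that an entry does not satisfy."""
    missing = []
    for alternatives in REQUIRED_FIELDS.get(entry_type.lower(), []):
        if not any(fields.get(name, "").strip() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


entry = {"editor": "Eisen, Michael B.", "title": "A Book", "year": "2020"}
print(missing_fields("book", entry))
```

Unknown entry types return an empty list here, so a stricter checker might want to flag them separately.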
## Extraction Workflows
### From DOI
**Best practice** - Most reliable source:
```bash
# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2
# Multiple DOIs
python scripts/extract_metadata.py \
--doi 10.1038/nature12345 \
--doi 10.1126/science.abc1234 \
--output refs.bib
```
**Process**:
1. Query CrossRef API with DOI
2. Parse JSON response
3. Extract required fields
4. Determine entry type (@article, @book, etc.)
5. Format as BibTeX
6. Validate completeness
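Steps 2 through 5 can be sketched as a single function that takes the `message` object of a CrossRef response. This is an assumption-laden illustration, not the logic of `extract_metadata.py`: the type mapping, the key scheme (first family name plus year), and the function names are all choices made for this example.

```python
# Assumed mapping from CrossRef `type` values to BibTeX entry types.
ENTRY_TYPES = {
    "journal-article": "article",
    "book": "book",
    "book-chapter": "incollection",
    "proceedings-article": "inproceedings",
    "dissertation": "phdthesis",
}


def first(values, default=""):
    """CrossRef wraps titles in arrays; take the first item if present."""
    return values[0] if values else default


def crossref_to_bibtex(message):
    """Format the `message` object of a CrossRef response as a BibTeX entry."""
    entry_type = ENTRY_TYPES.get(message.get("type"), "misc")
    authors = [a for a in message.get("author", []) if a.get("family")]
    names = [
        f"{a['family']}, {a['given']}" if a.get("given") else a["family"]
        for a in authors
    ]
    issued = message.get("published-print") or message.get("published-online") or {}
    year = str(issued["date-parts"][0][0]) if issued.get("date-parts") else ""
    container_field = "journal" if entry_type == "article" else "booktitle"
    fields = {
        "author": " and ".join(names),
        "title": first(message.get("title", [])),
        container_field: first(message.get("container-title", [])),
        "year": year,
        "volume": message.get("volume", ""),
        "number": message.get("issue", ""),
        "pages": message.get("page", "").replace("-", "--"),
        "doi": message.get("DOI", ""),
    }
    family = authors[0]["family"] if authors else ""
    key = ("".join(ch for ch in family if ch.isalnum()) or "Unknown") + year
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    return f"@{entry_type}{{{key},\n{body}\n}}"


message = {
    "type": "journal-article",
    "DOI": "10.1234/journal.2024.123456",
    "title": ["Article Title Here"],
    "author": [
        {"given": "John", "family": "Smith"},
        {"given": "Jane", "family": "Doe"},
    ],
    "container-title": ["Journal Name"],
    "volume": "10",
    "issue": "3",
    "page": "123-145",
    "published-print": {"date-parts": [[2024, 1, 15]]},
}
print(crossref_to_bibtex(message))
```

The sketch does not protect capitalization in titles (for example `{AlphaFold}`) or escape special characters, so its output still needs the validation step.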
### From PMID
**For biomedical literature**:
```bash
# Single PMID
python scripts/extract_metadata.py --pmid 34265844
# Multiple PMIDs
python scripts/extract_metadata.py \
--pmid 34265844 \
--pmid 28445112 \
--output refs.bib
```
**Process**:
1. Query PubMed EFetch with PMID
2. Parse XML response
3. Extract metadata including MeSH terms
4. Check for DOI in response
5. If DOI exists, optionally query CrossRef for additional metadata
6. Format as BibTeX
### From arXiv ID
**For preprints**:
```bash
python scripts/extract_metadata.py --arxiv 2103.14030
```
**Process**:
1. Query arXiv API with ID
2. Parse Atom XML response
3. Check for published version (DOI in response)
4. If published: Use DOI and CrossRef
5. If not published: Use preprint metadata
6. Format as @misc with preprint note
**Important**: Always check if preprint has been published!
### From URL
**When you only have URL**:
```bash
python scripts/extract_metadata.py \
--url "https://www.nature.com/articles/s41586-021-03819-2"
```
**Process**:
1. Parse URL and match it against known patterns
2. Identify type (DOI, PMID, arXiv)
3. Extract identifier from URL
4. Query appropriate API
5. Format as BibTeX
**URL patterns**:
```
# DOI URLs
https://doi.org/10.1038/nature12345
https://dx.doi.org/10.1126/science.abc123
https://www.nature.com/articles/s41586-021-03819-2
# PubMed URLs
https://pubmed.ncbi.nlm.nih.gov/34265844/
https://www.ncbi.nlm.nih.gov/pubmed/34265844
# arXiv URLs
https://arxiv.org/abs/2103.14030
https://arxiv.org/pdf/2103.14030.pdf
```
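A hedged sketch of how these patterns might be matched is given below; `identifier_from_url` is a hypothetical helper, not the script's actual implementation. Publisher pages such as the nature.com example are deliberately left unrecognized, since turning them into a DOI needs publisher-specific rules.

```python
import re
from urllib.parse import unquote, urlparse


def identifier_from_url(url):
    """Return (kind, identifier) for a recognized URL, else (None, None)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = unquote(parsed.path)

    if host in {"doi.org", "dx.doi.org"}:
        return "doi", path.lstrip("/")

    is_pubmed = host == "pubmed.ncbi.nlm.nih.gov" or (
        host.endswith("ncbi.nlm.nih.gov") and path.startswith("/pubmed/")
    )
    if is_pubmed:
        match = re.search(r"/(\d+)/?$", path)
        if match:
            return "pmid", match.group(1)

    if host.endswith("arxiv.org"):
        # Drop an optional version suffix (v2) and a trailing .pdf
        match = re.match(r"/(?:abs|pdf)/(.+?)(?:v\d+)?(?:\.pdf)?$", path)
        if match:
            return "arxiv", match.group(1)

    return None, None


for url in [
    "https://doi.org/10.1038/nature12345",
    "https://pubmed.ncbi.nlm.nih.gov/34265844/",
    "https://arxiv.org/pdf/2103.14030.pdf",
]:
    print(url, "->", identifier_from_url(url))
```

Returning a `(None, None)` pair for unknown hosts lets the caller fall back to scraping the page for a DOI or reporting the URL for manual handling.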
### Batch Processing
**From file with mixed identifiers**:
```bash
# Create file with one identifier per line
# identifiers.txt:
# 10.1038/nature12345
# 34265844
# 2103.14030
# https://doi.org/10.1126/science.abc123
python scripts/extract_metadata.py \
--input identifiers.txt \
--output references.bib
```
**Process**:
- Script auto-detects identifier type
- Queries appropriate API
- Combines all into single BibTeX file
- Handles errors gracefully
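Auto-detection of the identifier type can be approximated with a few regular expressions. The sketch below is one possible approach, not the script's confirmed logic; the patterns are deliberately loose, and the order matters because a bare number could otherwise be mistaken for something else.

```python
import re

# Checked in order; the first matching pattern wins.
PATTERNS = [
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("pmcid", re.compile(r"^PMC\d+$", re.IGNORECASE)),
    (
        "arxiv",
        re.compile(
            r"^(?:arXiv:)?(?:\d{4}\.\d{4,5}|[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$",
            re.IGNORECASE,
        ),
    ),
    ("pmid", re.compile(r"^\d{1,8}$")),  # PMIDs are plain 1 to 8 digit numbers
]


def classify_identifier(text):
    """Guess whether a line holds a URL, DOI, PMCID, arXiv ID, or PMID."""
    text = text.strip()
    if text.lower().startswith(("http://", "https://")):
        return "url"
    for kind, pattern in PATTERNS:
        if pattern.match(text):
            return kind
    return "unknown"


for line in ["10.1038/nature12345", "34265844", "2103.14030",
             "https://doi.org/10.1126/science.abc123"]:
    print(f"{line} -> {classify_identifier(line)}")
```

Lines classified as `unknown` are best written to a separate report rather than silently dropped, so that nothing disappears from the batch.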
## Special Cases and Edge Cases
### Preprints Later Published
**Issue**: Preprint cited, but journal version now available.
**Solution**:
1. Check arXiv metadata for DOI field
2. If DOI present, use published version
3. Update citation to journal article
4. Note preprint version in comments if needed
**Example**:
```bibtex
% Originally cited as a preprint
% Published as:
@article{Jumper2021,
author = {Jumper, John and Evans, Richard and others},
title = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
journal = {Nature},
year = {2021},
volume = {596},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
```
### Multiple Authors (et al.)
**Issue**: Many authors (10+).
**BibTeX practice**:
- Include all authors if <10
- Use "and others" for 10+
- Or list all (journals vary)
**Example**:
```bibtex
@article{LargeCollaboration2024,
author = {First, Author and Second, Author and Third, Author and others},
...
}
```
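The threshold rule above can be applied mechanically when building the `author` field. In this sketch the function name and the choice to keep the first three names are assumptions taken from the example, not a BibTeX requirement.

```python
def format_author_field(authors, max_listed=10, keep=3):
    """Join authors with ' and '; truncate to 'and others' at max_listed or more.

    authors: list of names already in 'Last, First' form.
    """
    if len(authors) < max_listed:
        return " and ".join(authors)
    return " and ".join(authors[:keep] + ["others"])
```

For example, a list of twelve names yields the first three followed by `and others`, which most bibliography styles render as "et al.".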
### Author Name Variations
**Issue**: Authors publish under different name formats.
**Standardization**:
```
# Common variations
John Smith
John A. Smith
John Andrew Smith
J. A. Smith
Smith, J.
Smith, J. A.
# BibTeX format (recommended)
author = {Smith, John A.}
```
**Extraction preference**:
1. Use full name if available
2. Include middle initial if available
3. Format: Last, First Middle
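A naive normalizer for the `Last, First Middle` format is sketched below. It assumes the final whitespace-separated token is the surname, so it will mishandle particles ("van der Berg"), suffixes ("Jr.") and family-name-first conventions; treat it as a starting point only.

```python
def normalize_author(name):
    """Convert 'First Middle Last' to 'Last, First Middle'; keep 'Last, First' as is."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        last, _, given = name.partition(",")
        return f"{last.strip()}, {given.strip()}" if given.strip() else last.strip()
    parts = name.split(" ")
    if len(parts) == 1:
        return parts[0]
    # Assumption: the last token is the family name.
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```

Names that already contain a comma are trusted, which keeps hand-corrected entries stable across repeated runs.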
### No DOI Available
**Issue**: Older papers or books without DOIs.
**Solutions**:
1. Use PMID if available (biomedical)
2. Use ISBN for books
3. Use URL to stable source
4. Include full publication details
**Example**:
```bibtex
@article{OldPaper1995,
author = {Author, Name},
title = {Title Here},
journal = {Journal Name},
year = {1995},
volume = {123},
pages = {45--67},
url = {https://stable-url-here},
note = {PMID: 12345678}
}
```
### Conference Papers vs Journal Articles
**Issue**: Same work published in both.
**Best practice**:
- Cite journal version if both available
- Journal version is archival
- Conference version for timeliness
**If citing conference**:
```bibtex
@inproceedings{Smith2024conf,
author = {Smith, John},
title = {Title},
booktitle = {Proceedings of NeurIPS 2024},
year = {2024}
}
```
**If citing journal**:
```bibtex
@article{Smith2024journal,
author = {Smith, John},
title = {Title},
journal = {Journal of Machine Learning Research},
year = {2024}
}
```
### Book Chapters vs Edited Collections
**Extract correctly**:
- Chapter: Use `@incollection`
- Whole book: Use `@book`
- Book editor: List in `editor` field
- Chapter author: List in `author` field
### Datasets and Software
**Use @misc** with appropriate fields:
```bibtex
@misc{DatasetName2024,
author = {Author, Name},
title = {Dataset Title},
year = {2024},
howpublished = {Zenodo},
doi = {10.5281/zenodo.123456},
note = {Version 1.2}
}
```
## Validation After Extraction
Always validate extracted metadata:
```bash
python scripts/validate_citations.py extracted_refs.bib
```
**Check**:
- All required fields present
- DOI resolves correctly
- Author names formatted consistently
- Year is reasonable (4 digits)
- Journal/publisher names correct
- Page ranges use -- not -
- Special characters handled properly
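Some of these checks are easy to express in code. The sketch below is not how `validate_citations.py` is implemented (its internals are not shown in this guide); `check_entry` is a hypothetical helper that covers required `@article` fields, the 4-digit year, the page-range dash, and the rough shape of a DOI. It does not resolve DOIs over the network.

```python
import re

# Required fields per entry type; only @article is listed in this sketch.
REQUIRED_FIELDS = {"article": ("author", "title", "journal", "year")}


def check_entry(entry_type, fields):
    """Return a list of human-readable problems found in one BibTeX entry."""
    problems = []
    for name in REQUIRED_FIELDS.get(entry_type, ()):
        if not fields.get(name, "").strip():
            problems.append(f"missing required field: {name}")

    year = fields.get("year", "").strip()
    if year and not re.fullmatch(r"\d{4}", year):
        problems.append(f"year is not 4 digits: {year!r}")

    pages = fields.get("pages", "").strip()
    if re.fullmatch(r"\d+-\d+", pages):
        problems.append("page range should use -- instead of -")

    doi = fields.get("doi", "").strip()
    if doi and not re.fullmatch(r"10\.\d{4,9}/\S+", doi):
        problems.append(f"doi does not look like a DOI: {doi!r}")

    return problems
```

An empty list means the entry passed these structural checks; accuracy of names and titles still needs a spot-check against the publication.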
## Best Practices
### 1. Prefer DOI When Available
DOIs provide:
- Permanent identifier
- Best metadata source
- Publisher-verified information
- Resolvable link
### 2. Verify Automatically Extracted Metadata
Spot-check:
- Author names match publication
- Title matches (including capitalization)
- Year is correct
- Journal name is complete
### 3. Handle Special Characters
**LaTeX special characters**:
- Protect capitalization: `{AlphaFold}`
- Handle accents: `M{\"u}ller` or use Unicode
- Chemical formulas: `H$_2$O` or `\ce{H2O}` (the latter requires the `mhchem` package)
### 4. Use Consistent Citation Keys
**Convention**: `FirstAuthorYEARkeyword`
```
Smith2024protein
Doe2023machine
Johnson2024cancer
```
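Keys in this convention can be generated from the first author's surname, the year, and the first meaningful title word. The stopword list and the ASCII folding below are assumptions made for this sketch.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "with", "to"}


def make_citation_key(first_author_last, year, title):
    """Build a key of the form FirstAuthorYEARkeyword."""
    # Fold accented letters to ASCII (e.g. 'ü' becomes 'u') so keys stay portable.
    folded = unicodedata.normalize("NFKD", first_author_last)
    ascii_last = folded.encode("ascii", "ignore").decode("ascii")
    last = re.sub(r"[^A-Za-z]", "", ascii_last)

    words = re.findall(r"[a-z]+", title.lower())
    keyword = next((w for w in words if w not in STOPWORDS), "")
    return f"{last}{year}{keyword}"
```

Two papers by the same first author in the same year can still collide on the keyword, so a uniqueness check (for example, appending `a`, `b`) is a sensible follow-up.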
### 5. Include DOI for Modern Papers
Most papers published after ~2000 have a DOI; include it whenever one exists:
```bibtex
doi = {10.1038/nature12345}
```
### 6. Document Source
For non-standard sources, add note:
```bibtex
note = {Preprint, not peer-reviewed}
note = {Technical report}
note = {Dataset accompanying [citation]}
```
## Summary
Metadata extraction workflow:
1. **Identify**: Determine identifier type (DOI, PMID, arXiv, URL)
2. **Query**: Use appropriate API (CrossRef, PubMed, arXiv)
3. **Extract**: Parse response for required fields
4. **Format**: Create properly formatted BibTeX entry
5. **Validate**: Check completeness and accuracy
6. **Verify**: Spot-check critical citations
**Use scripts** to automate:
- `extract_metadata.py`: Universal extractor
- `doi_to_bibtex.py`: Quick DOI conversion
- `validate_citations.py`: Verify accuracy
**Always validate** extracted metadata before final submission!

View File

@@ -0,0 +1,839 @@
# PubMed Search Guide
Comprehensive guide to searching PubMed for biomedical and life sciences literature, including MeSH terms, field tags, advanced search strategies, and E-utilities API usage.
## Overview
PubMed is the premier database for biomedical literature:
- **Coverage**: 35+ million citations
- **Scope**: Biomedical and life sciences
- **Sources**: MEDLINE, life science journals, online books
- **Authority**: Maintained by National Library of Medicine (NLM) / NCBI
- **Access**: Free, no account required
- **Updates**: Daily with new citations
- **Curation**: High-quality metadata, MeSH indexing
## Basic Search
### Simple Keyword Search
PubMed automatically maps terms to MeSH and searches multiple fields:
```
diabetes
CRISPR gene editing
Alzheimer's disease treatment
cancer immunotherapy
```
**Automatic Features**:
- Automatic MeSH mapping
- Plural/singular variants
- Abbreviation expansion
- Spell checking
### Exact Phrase Search
Use quotation marks for exact phrases:
```
"CRISPR-Cas9"
"systematic review"
"randomized controlled trial"
"machine learning"
```
## MeSH (Medical Subject Headings)
### What is MeSH?
MeSH is a controlled vocabulary thesaurus for indexing biomedical literature:
- **Hierarchical structure**: Organized in tree structures
- **Consistent indexing**: Same concept always tagged the same way
- **Comprehensive**: Covers diseases, drugs, anatomy, techniques, etc.
- **Professional curation**: NLM indexers assign MeSH terms
### Finding MeSH Terms
**MeSH Browser**: https://meshb.nlm.nih.gov/search
**Example**:
```
Search: "heart attack"
MeSH term: "Myocardial Infarction"
```
**In PubMed**:
1. Search with keyword
2. Check "MeSH Terms" in left sidebar
3. Select relevant MeSH terms
4. Add to search
### Using MeSH in Searches
**Basic MeSH search**:
```
"Diabetes Mellitus"[MeSH]
"CRISPR-Cas Systems"[MeSH]
"Alzheimer Disease"[MeSH]
"Neoplasms"[MeSH]
```
**MeSH with subheadings**:
```
"Diabetes Mellitus/drug therapy"[MeSH]
"Neoplasms/genetics"[MeSH]
"Heart Failure/prevention and control"[MeSH]
```
**Common subheadings**:
- `/drug therapy`: Drug treatment
- `/diagnosis`: Diagnostic aspects
- `/genetics`: Genetic aspects
- `/epidemiology`: Occurrence and distribution
- `/prevention and control`: Prevention methods
- `/etiology`: Causes
- `/surgery`: Surgical treatment
- `/metabolism`: Metabolic aspects
### MeSH Explosion
By default, MeSH searches include narrower terms (explosion):
```
"Neoplasms"[MeSH]
# Includes: Breast Neoplasms, Lung Neoplasms, etc.
```
**Disable explosion** (exact term only):
```
"Neoplasms"[MeSH:NoExp]
```
### MeSH Major Topic
Search only where MeSH term is a major focus:
```
"Diabetes Mellitus"[MeSH Major Topic]
# Only papers where diabetes is main topic
```
## Field Tags
Field tags specify which part of the record to search.
### Common Field Tags
**Title and Abstract**:
```
cancer[Title] # In title only
treatment[Title/Abstract] # In title or abstract
"machine learning"[Title/Abstract]
```
**Author**:
```
"Smith J"[Author]
"Doudna JA"[Author]
"Collins FS"[Author]
```
**Author - Full Name**:
```
"Smith, John"[Full Author Name]
```
**Journal**:
```
"Nature"[Journal]
"Science"[Journal]
"New England Journal of Medicine"[Journal]
"Nat Commun"[Journal] # Abbreviated form
```
**Publication Date**:
```
2023[Publication Date]
2020:2024[Publication Date] # Date range
2023/01/01:2023/12/31[Publication Date]
```
**Date Created**:
```
2023[Date - Create] # When the citation record was created
```
**Publication Type**:
```
"Review"[Publication Type]
"Clinical Trial"[Publication Type]
"Meta-Analysis"[Publication Type]
"Randomized Controlled Trial"[Publication Type]
```
**Language**:
```
English[Language]
French[Language]
```
**DOI**:
```
10.1038/nature12345[DOI]
```
**PMID (PubMed ID)**:
```
12345678[PMID]
```
**Article ID**:
```
PMC1234567[PMC] # PubMed Central ID
```
### Less Common But Useful Tags
```
humans[MeSH Terms] # Only human studies
animals[MeSH Terms] # Only animal studies
"United States"[Place of Publication]
nih[Grant Number] # NIH-funded research
"Female"[MeSH Terms] # Female subjects
"Aged, 80 and over"[MeSH Terms] # Elderly subjects
```
## Boolean Operators
Combine search terms with Boolean logic.
### AND
Both terms must be present (PubMed's default when terms are entered without an operator):
```
diabetes AND treatment
"CRISPR-Cas9" AND "gene editing"
cancer AND immunotherapy AND "clinical trial"[Publication Type]
```
### OR
At least one of the terms must be present:
```
"heart attack" OR "myocardial infarction"
diabetes OR "diabetes mellitus"
CRISPR OR Cas9 OR "gene editing"
```
**Use case**: Synonyms and related terms
### NOT
Exclude terms:
```
cancer NOT review
diabetes NOT animal
"machine learning" NOT "deep learning"
```
**Caution**: May exclude relevant papers that mention both terms.
### Combining Operators
Use parentheses for complex logic:
```
(diabetes OR "diabetes mellitus") AND (treatment OR therapy)
("CRISPR" OR "gene editing") AND ("therapeutic" OR "therapy")
AND 2020:2024[Publication Date]
(cancer OR neoplasm) AND (immunotherapy OR "immune checkpoint inhibitor")
AND ("clinical trial"[Publication Type] OR "randomized controlled trial"[Publication Type])
```
## Advanced Search Builder
**Access**: https://pubmed.ncbi.nlm.nih.gov/advanced/
**Features**:
- Visual query builder
- Add multiple query boxes
- Select field tags from dropdowns
- Combine with AND/OR/NOT
- Preview results
- Shows final query string
- Save queries
**Workflow**:
1. Add search terms in separate boxes
2. Select field tags
3. Choose Boolean operators
4. Preview results
5. Refine as needed
6. Copy final query string
7. Use in scripts or save
**Example built query**:
```
#1: "Diabetes Mellitus, Type 2"[MeSH]
#2: "Metformin"[MeSH]
#3: "Clinical Trial"[Publication Type]
#4: 2020:2024[Publication Date]
#5: #1 AND #2 AND #3 AND #4
```
## Filters and Limits
### Article Types
```
"Review"[Publication Type]
"Systematic Review"[Publication Type]
"Meta-Analysis"[Publication Type]
"Clinical Trial"[Publication Type]
"Randomized Controlled Trial"[Publication Type]
"Case Reports"[Publication Type]
"Comparative Study"[Publication Type]
```
### Species
```
humans[MeSH Terms]
mice[MeSH Terms]
rats[MeSH Terms]
```
### Sex
```
"Female"[MeSH Terms]
"Male"[MeSH Terms]
```
### Age Groups
```
"Infant"[MeSH Terms]
"Child"[MeSH Terms]
"Adolescent"[MeSH Terms]
"Adult"[MeSH Terms]
"Aged"[MeSH Terms]
"Aged, 80 and over"[MeSH Terms]
```
### Text Availability
```
free full text[Filter] # Free full-text available
```
### Journal Categories
```
"Journal Article"[Publication Type]
```
## E-utilities API
NCBI provides programmatic access via E-utilities (Entrez Programming Utilities).
### Overview
**Base URL**: `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/`
**Main Tools**:
- **ESearch**: Search and retrieve PMIDs
- **EFetch**: Retrieve full records
- **ESummary**: Retrieve document summaries
- **ELink**: Find related articles
- **EInfo**: Database statistics
**No API key required**, but one is recommended for:
- Higher rate limits (10/sec vs 3/sec)
- Better performance
- Identifying your project
**Get API key**: https://www.ncbi.nlm.nih.gov/account/
### ESearch - Search PubMed
Retrieve PMIDs for a query.
**Endpoint**: `/esearch.fcgi`
**Parameters**:
- `db`: Database (pubmed)
- `term`: Search query
- `retmax`: Maximum results (default 20, max 10000)
- `retstart`: Starting position (for pagination)
- `sort`: Sort order (relevance, pub_date, author)
- `retmode`: Response format (xml by default, or json)
- `api_key`: Your API key (optional but recommended)
**Example URL**:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
db=pubmed&
term=diabetes+AND+treatment&
retmax=100&
retmode=json&
api_key=YOUR_API_KEY
```
**Response**:
```json
{
"esearchresult": {
"count": "250000",
"retmax": "100",
"idlist": ["12345678", "12345679", ...]
}
}
```
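A request URL like the one above can be assembled with `urllib.parse.urlencode`, which also handles escaping of quotes and brackets in field-tagged queries. The helper names are hypothetical; the parameters are the ones listed in this section.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, retmax=20, retstart=0, api_key=None):
    """Build an ESearch URL for PubMed that asks for a JSON response."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "retstart": retstart,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    return BASE_URL + "esearch.fcgi?" + urlencode(params)


def parse_esearch_response(body):
    """Return (total_count, list_of_pmids) from an ESearch JSON body."""
    result = json.loads(body)["esearchresult"]
    return int(result["count"]), result["idlist"]


url = build_esearch_url("diabetes AND treatment", retmax=100)
```

The URL can be fetched with `urllib.request.urlopen(url)` and the decoded body passed to `parse_esearch_response`. To page through results, repeat the call with `retstart` increased by `retmax` until `retstart` reaches the total count.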
### EFetch - Retrieve Records
Get full metadata for PMIDs.
**Endpoint**: `/efetch.fcgi`
**Parameters**:
- `db`: Database (pubmed)
- `id`: Comma-separated PMIDs
- `retmode`: Format (xml, text; EFetch does not return JSON for PubMed)
- `rettype`: Type (abstract, medline, uilist; omit for the full XML record)
- `api_key`: Your API key
**Example URL**:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
db=pubmed&
id=12345678,12345679&
retmode=xml&
api_key=YOUR_API_KEY
```
**Response**: XML with complete metadata including:
- Title
- Authors (with affiliations)
- Abstract
- Journal
- Publication date
- DOI
- PMID, PMCID
- MeSH terms
- Keywords
### ESummary - Get Summaries
Lighter-weight alternative to EFetch.
**Example**:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?
db=pubmed&
id=12345678&
retmode=json&
api_key=YOUR_API_KEY
```
**Returns**: Key metadata without full abstract and details.
### ELink - Find Related Articles
Find related articles or links to other databases.
**Example**:
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?
dbfrom=pubmed&
db=pubmed&
id=12345678&
linkname=pubmed_pubmed_citedin
```
**Link types**:
- `pubmed_pubmed`: Related articles
- `pubmed_pubmed_citedin`: Papers citing this article (citing articles in PMC only)
- `pubmed_pmc`: PMC full-text versions
- `pubmed_protein`: Related protein records
### Rate Limiting
**Without API key**:
- 3 requests per second
- Requests above the limit are rejected with an error
**With API key**:
- 10 requests per second
- Better for programmatic access
**Best practice**:
```python
import time
time.sleep(0.34) # ~3 requests/second
# or
time.sleep(0.11) # ~10 requests/second with API key
```
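A fixed `time.sleep` before every call works, but it also waits when the previous request was already slow. The sketch below sleeps only for the remaining part of the interval; the `RateLimiter` class is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Typical use is `limiter = RateLimiter(3)` without an API key, or `RateLimiter(10)` with one, followed by `limiter.wait()` immediately before each request.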
### API Key Usage
**Get API key**:
1. Create NCBI account: https://www.ncbi.nlm.nih.gov/account/
2. Settings → API Key Management
3. Create new API key
4. Copy key
**Use in requests**:
```
&api_key=YOUR_API_KEY_HERE
```
**Store securely**:
```bash
# In environment variable
export NCBI_API_KEY="your_key_here"
# In a Python script, read the key from the environment
python - << 'EOF'
import os
api_key = os.getenv('NCBI_API_KEY')
EOF
```
## Search Strategies
### Comprehensive Systematic Search
For systematic reviews and meta-analyses:
```
# 1. Identify key concepts
Concept 1: Diabetes
Concept 2: Treatment
Concept 3: Outcomes
# 2. Find MeSH terms and synonyms
Concept 1: "Diabetes Mellitus"[MeSH] OR diabetes OR diabetic
Concept 2: "Drug Therapy"[MeSH] OR treatment OR therapy OR medication
Concept 3: "Treatment Outcome"[MeSH] OR outcome OR efficacy OR effectiveness
# 3. Combine with AND
("Diabetes Mellitus"[MeSH] OR diabetes OR diabetic)
AND ("Drug Therapy"[MeSH] OR treatment OR therapy OR medication)
AND ("Treatment Outcome"[MeSH] OR outcome OR efficacy OR effectiveness)
# 4. Add filters
AND 2015:2024[Publication Date]
AND ("Clinical Trial"[Publication Type] OR "Randomized Controlled Trial"[Publication Type])
AND English[Language]
AND humans[MeSH Terms]
```
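The concept-block pattern above (OR within a concept, AND across concepts, then filters) is easy to generate programmatically, which helps keep parentheses balanced. The function names here are hypothetical.

```python
def or_group(terms):
    """Join synonyms for one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(concepts, filters=()):
    """AND together one OR-group per concept, followed by any filters.

    concepts: list of lists of terms (MeSH terms and text words).
    filters: ready-made filter strings; a filter that contains OR must
             already be parenthesized by the caller.
    """
    parts = [or_group(terms) for terms in concepts]
    parts.extend(filters)
    return " AND ".join(parts)


query = build_query(
    concepts=[
        ['"Diabetes Mellitus"[MeSH]', "diabetes", "diabetic"],
        ['"Drug Therapy"[MeSH]', "treatment", "therapy", "medication"],
    ],
    filters=["2015:2024[Publication Date]", "English[Language]"],
)
```

Saving the generated string alongside the search date gives a reproducible record of the strategy.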
### Finding Clinical Trials
```
# Specific disease + clinical trials
"Alzheimer Disease"[MeSH]
AND ("Clinical Trial"[Publication Type]
OR "Randomized Controlled Trial"[Publication Type])
AND 2020:2024[Publication Date]
# Specific drug trials
"Metformin"[MeSH]
AND "Diabetes Mellitus, Type 2"[MeSH]
AND "Randomized Controlled Trial"[Publication Type]
```
### Finding Reviews
```
# Systematic reviews on topic
"CRISPR-Cas Systems"[MeSH]
AND ("Systematic Review"[Publication Type] OR "Meta-Analysis"[Publication Type])
# Reviews in high-impact journals
cancer immunotherapy
AND "Review"[Publication Type]
AND ("Nature"[Journal] OR "Science"[Journal] OR "Cell"[Journal])
```
### Finding Recent Papers
```
# Papers from a single year (here 2024)
"machine learning"[Title/Abstract]
AND "drug discovery"[Title/Abstract]
AND 2024[Publication Date]
# Recent papers in specific journal
"CRISPR"[Title/Abstract]
AND "Nature"[Journal]
AND 2023:2024[Publication Date]
```
### Author Tracking
```
# Specific author's recent work
"Doudna JA"[Author] AND 2020:2024[Publication Date]
# Author + topic
"Church GM"[Author] AND "synthetic biology"[Title/Abstract]
```
### High-Quality Evidence
```
# Meta-analyses and systematic reviews
(diabetes OR "diabetes mellitus")
AND (treatment OR therapy)
AND ("Meta-Analysis"[Publication Type] OR "Systematic Review"[Publication Type])
# RCTs only
cancer immunotherapy
AND "Randomized Controlled Trial"[Publication Type]
AND 2020:2024[Publication Date]
```
## Script Integration
### search_pubmed.py Usage
**Basic search**:
```bash
python scripts/search_pubmed.py "diabetes treatment"
```
**With MeSH terms**:
```bash
python scripts/search_pubmed.py \
--query '"Diabetes Mellitus"[MeSH] AND "Drug Therapy"[MeSH]'
```
**Date range filter**:
```bash
python scripts/search_pubmed.py "CRISPR" \
--date-start 2020-01-01 \
--date-end 2024-12-31 \
--limit 200
```
**Publication type filter**:
```bash
python scripts/search_pubmed.py "cancer immunotherapy" \
--publication-types "Clinical Trial,Randomized Controlled Trial" \
--limit 100
```
**Export to BibTeX**:
```bash
python scripts/search_pubmed.py "Alzheimer's disease" \
--limit 100 \
--format bibtex \
--output alzheimers.bib
```
**Complex query from file**:
```bash
# Save complex query in query.txt
cat > query.txt << 'EOF'
("Diabetes Mellitus, Type 2"[MeSH] OR "diabetes"[Title/Abstract])
AND ("Metformin"[MeSH] OR "metformin"[Title/Abstract])
AND "Randomized Controlled Trial"[Publication Type]
AND 2015:2024[Publication Date]
AND English[Language]
EOF
# Run search
python scripts/search_pubmed.py --query-file query.txt --limit 500
```
### Batch Searches
```bash
# Search multiple topics
TOPICS=("diabetes treatment" "cancer immunotherapy" "CRISPR gene editing")
for topic in "${TOPICS[@]}"; do
python scripts/search_pubmed.py "$topic" \
--limit 100 \
--output "${topic// /_}.json"
sleep 1
done
```
### Extract Metadata
```bash
# Search returns PMIDs
python scripts/search_pubmed.py "topic" --output results.json
# Extract full metadata
python scripts/extract_metadata.py \
--input results.json \
--output references.bib
```
## Tips and Best Practices
### Search Construction
1. **Start with MeSH terms**:
- Use MeSH Browser to find correct terms
- More precise than keyword search
- Captures indexed papers on a topic regardless of the authors' terminology
2. **Include text word variants**:
```
# Better coverage
("Diabetes Mellitus"[MeSH] OR diabetes OR diabetic)
```
3. **Use field tags appropriately**:
- `[MeSH]` for standardized concepts
- `[Title/Abstract]` for specific terms
- `[Author]` for known authors
- `[Journal]` for specific venues
4. **Build incrementally**:
```
# Step 1: Basic search
diabetes
# Step 2: Add specificity
"Diabetes Mellitus, Type 2"[MeSH]
# Step 3: Add treatment
"Diabetes Mellitus, Type 2"[MeSH] AND "Metformin"[MeSH]
# Step 4: Add study type
"Diabetes Mellitus, Type 2"[MeSH] AND "Metformin"[MeSH]
AND "Clinical Trial"[Publication Type]
# Step 5: Add date range
... AND 2020:2024[Publication Date]
```
### Optimizing Results
1. **Too many results**: Add filters
- Restrict publication type
- Narrow date range
- Add more specific MeSH terms
- Use Major Topic: `[MeSH Major Topic]`
2. **Too few results**: Broaden search
- Remove restrictive filters
- Use OR for synonyms
- Expand date range
- Use MeSH explosion (default)
3. **Irrelevant results**: Refine terms
- Use more specific MeSH terms
- Add exclusions with NOT
- Use Title field instead of all fields
- Add MeSH subheadings
### Quality Control
1. **Document search strategy**:
- Save exact query string
- Record search date
- Note number of results
- Save filters used
2. **Export systematically**:
- Use consistent file naming
- Export to JSON for flexibility
- Convert to BibTeX as needed
- Keep original search results
3. **Validate retrieved citations**:
```bash
python scripts/validate_citations.py pubmed_results.bib
```
### Staying Current
1. **Set up search alerts**:
- PubMed → Save search
- Receive email updates
- Daily, weekly, or monthly
2. **Track specific journals**:
```
"Nature"[Journal] AND CRISPR[Title]
```
3. **Follow key authors**:
```
"Church GM"[Author]
```
## Common Issues and Solutions
### Issue: MeSH Term Not Found
**Solution**:
- Check spelling
- Use MeSH Browser
- Try related terms
- Use text word search as fallback
### Issue: Zero Results
**Solution**:
- Remove filters
- Check query syntax
- Use OR for broader search
- Try synonyms
### Issue: Poor Quality Results
**Solution**:
- Add publication type filters
- Restrict to recent years
- Use MeSH Major Topic
- Filter by journal quality
### Issue: Duplicates from Different Sources
**Solution**:
```bash
python scripts/format_bibtex.py results.bib \
--deduplicate \
--output clean.bib
```
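The idea behind deduplication can be sketched as follows. This is not the implementation of `format_bibtex.py`; it assumes entries are already parsed into dictionaries and uses the DOI as the primary key, falling back to a normalized title plus year.

```python
import re


def dedup_key(entry):
    """Return a hashable key identifying the work an entry refers to."""
    doi = entry.get("doi", "").strip().lower()
    if doi:
        return ("doi", doi)
    # Lowercase the title and collapse punctuation/braces so that
    # 'with {AlphaFold}' and 'with AlphaFold' compare equal.
    title = re.sub(r"[^a-z0-9]+", " ", entry.get("title", "").lower()).strip()
    return ("title", title, entry.get("year", ""))


def deduplicate(entries):
    """Keep the first entry seen for each key, preserving input order."""
    seen = set()
    unique = []
    for entry in entries:
        key = dedup_key(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```

One limitation of this sketch: a record with a DOI and a duplicate of it without one get different keys, so mixed cases need a second pass on titles.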
### Issue: API Rate Limiting
**Solution**:
- Get API key (increases limit to 10/sec)
- Add delays in scripts
- Process in batches
- Run large jobs during off-peak hours (NCBI asks for weekends or 9 PM to 5 AM US Eastern time on weekdays)
## Summary
PubMed provides authoritative biomedical literature search:
- **Curated content**: MeSH indexing, quality control
- **Precise search**: Field tags, MeSH terms, filters
- **Programmatic access**: E-utilities API
- **Free access**: No subscription required
- **Comprehensive**: 35M+ citations, daily updates
Key strategies:
- Use MeSH terms for precise searching
- Combine with text words for comprehensive coverage
- Apply appropriate field tags
- Filter by publication type and date
- Use E-utilities API for automation
- Document search strategy for reproducibility
For broader coverage across disciplines, complement with Google Scholar.