Metadata Extraction Guide
Comprehensive guide to extracting accurate citation metadata from DOIs, PMIDs, arXiv IDs, and URLs using various APIs and services.
Overview
Accurate metadata is essential for proper citations. This guide covers:
- Identifying paper identifiers (DOI, PMID, arXiv ID)
- Querying metadata APIs (CrossRef, PubMed, arXiv, DataCite)
- Required BibTeX fields by entry type
- Handling edge cases and special situations
- Validating extracted metadata
Paper Identifiers
DOI (Digital Object Identifier)
Format: 10.XXXX/suffix
Examples:
10.1038/s41586-021-03819-2 # Nature article
10.1126/science.aam9317 # Science article
10.1016/j.cell.2023.01.001 # Cell article
10.1371/journal.pone.0123456 # PLOS ONE article
Properties:
- Permanent identifier
- Most reliable for metadata
- Resolves to current location
- Publisher-assigned
Where to find:
- First page of article
- Article webpage
- CrossRef, Google Scholar, PubMed
- Usually prominent on publisher site
PMID (PubMed ID)
Format: Integer of 1 to 8 digits, no prefix (recent records are typically 8 digits)
Examples:
34265844
28445112
35476778
Properties:
- Specific to PubMed database
- Biomedical literature only
- Assigned by NCBI
- Permanent identifier
Where to find:
- PubMed search results
- Article page on PubMed
- Often in article PDF footer
- PMC (PubMed Central) pages
PMCID (PubMed Central ID)
Format: PMC followed by numbers
Examples:
PMC8287551
PMC7456789
Properties:
- Free full-text articles in PMC
- Subset of PubMed articles
- Open access or author manuscripts
arXiv ID
Format: YYMM.NNNN or YYMM.NNNNN (new scheme; five digits since 2015) or archive/YYMMNNN (old scheme)
Examples:
2103.14030 # New format (since 2007)
2401.12345 # 2024 submission
arXiv:hep-th/9901001 # Old format
Properties:
- Preprints (not peer-reviewed)
- Physics, math, CS, q-bio, etc.
- Version tracking (v1, v2, etc.)
- Free, open access
Where to find:
- arXiv.org
- Often cited before publication
- Paper PDF header
Other Identifiers
ISBN (Books):
978-0-12-345678-9
0-123-45678-9
arXiv category:
cs.LG # Computer Science - Machine Learning
q-bio.QM # Quantitative Biology - Quantitative Methods
math.ST # Mathematics - Statistics
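The formats above can be recognized with a few regular expressions. The following is a minimal sketch, not the logic of any script in this guide: the function name `classify_identifier` and the exact patterns are assumptions, and the patterns are deliberately loose (for example, any bare integer of up to 8 digits is treated as a PMID).

```python
import re

# Loose patterns based on the formats described above (assumptions, not a spec).
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
PMID_RE = re.compile(r"^\d{1,8}$")
ARXIV_NEW_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
ARXIV_OLD_RE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def classify_identifier(raw):
    """Return 'doi', 'pmcid', 'arxiv', 'pmid', or 'unknown'."""
    text = raw.strip()
    # Accept the optional "arXiv:" prefix used in citations.
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
    if DOI_RE.match(text):
        return "doi"
    if PMCID_RE.match(text):
        return "pmcid"
    if ARXIV_NEW_RE.match(text) or ARXIV_OLD_RE.match(text):
        return "arxiv"
    if PMID_RE.match(text):
        return "pmid"
    return "unknown"


for example in ["10.1038/s41586-021-03819-2", "34265844", "PMC8287551",
                "2103.14030", "arXiv:hep-th/9901001"]:
    print(example, "->", classify_identifier(example))
```

The arXiv check runs before the PMID check so that a dotted ID is never mistaken for a number; a bare 8-digit string is still ambiguous in principle and is resolved here in favor of PMID.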
Metadata APIs
CrossRef API
Primary source for DOIs - Most comprehensive metadata for journal articles.
Base URL: https://api.crossref.org/works/
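No API key required, but the polite pool is recommended:
- Add a contact email to the User-Agent header (or as a mailto query parameter)
- Polite-pool requests generally get more reliable service
- Rate limits still apply; back off when the API returns HTTP 429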
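As a rough illustration of the polite-pool convention, the sketch below builds (but does not send) a request with a contact email in the User-Agent header. The tool name `metadata-extractor/0.1`, the helper name, and the email address are placeholders, not values taken from this guide's scripts.

```python
import urllib.parse
import urllib.request

CROSSREF_BASE = "https://api.crossref.org/works/"


def build_crossref_request(doi, contact_email):
    """Build a GET request for one DOI with a polite-pool User-Agent."""
    # Percent-encode unusual characters but keep "/" so the path stays readable.
    url = CROSSREF_BASE + urllib.parse.quote(doi, safe="/")
    # Placeholder tool name; the mailto address is what identifies the caller.
    user_agent = f"metadata-extractor/0.1 (mailto:{contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_crossref_request("10.1038/s41586-021-03819-2", "you@example.org")
print(req.full_url)
print(req.get_header("User-agent"))
```

Sending the request is then a matter of `urllib.request.urlopen(req, timeout=...)`, ideally wrapped in retry logic that waits before retrying on HTTP 429.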
Basic DOI Lookup
Request:
GET https://api.crossref.org/works/10.1038/s41586-021-03819-2
Response (simplified):
{
"message": {
"DOI": "10.1038/s41586-021-03819-2",
"title": ["Article title here"],
"author": [
{"given": "John", "family": "Smith"},
{"given": "Jane", "family": "Doe"}
],
"container-title": ["Nature"],
"volume": "595",
"issue": "7865",
"page": "123-128",
"published-print": {"date-parts": [[2021, 7, 1]]},
"publisher": "Springer Nature",
"type": "journal-article",
"ISSN": ["0028-0836"]
}
}
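One way to turn the `message` object above into BibTeX-style fields is sketched below. It works on a hard-coded copy of the simplified response, so it runs offline. The helper name `crossref_to_fields` is hypothetical, and the fallback to a `name` key for organizational authors is an assumption about the payload.

```python
import re

# Copy of the simplified response shown above (the "message" object only).
SAMPLE_MESSAGE = {
    "DOI": "10.1038/s41586-021-03819-2",
    "title": ["Article title here"],
    "author": [
        {"given": "John", "family": "Smith"},
        {"given": "Jane", "family": "Doe"},
    ],
    "container-title": ["Nature"],
    "volume": "595",
    "issue": "7865",
    "page": "123-128",
    "published-print": {"date-parts": [[2021, 7, 1]]},
}


def crossref_to_fields(message):
    """Map a CrossRef 'message' dict to a dict of BibTeX field values."""
    authors = []
    for person in message.get("author", []):
        family, given = person.get("family", ""), person.get("given", "")
        # Assumption: organizational authors may only carry a "name" key.
        authors.append(f"{family}, {given}" if family and given
                       else family or person.get("name", ""))
    # Prefer the print date, then fall back to the online date.
    date = message.get("published-print") or message.get("published-online") or {}
    parts = date.get("date-parts") or [[None]]
    year = parts[0][0] if parts[0] else None
    fields = {
        "author": " and ".join(a for a in authors if a),
        "title": (message.get("title") or [""])[0],
        "journal": (message.get("container-title") or [""])[0],
        "year": str(year) if year else "",
        "volume": message.get("volume", ""),
        "number": message.get("issue", ""),
        # BibTeX page ranges use a double hyphen.
        "pages": re.sub(r"-+", "--", message.get("page", "")),
        "doi": message.get("DOI", ""),
    }
    # Drop empty values so missing metadata is not written as blank fields.
    return {name: value for name, value in fields.items() if value}


print(crossref_to_fields(SAMPLE_MESSAGE))
```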
Fields Available
Always present:
- DOI: Digital Object Identifier
- title: Article title (array)
- type: Content type (journal-article, book-chapter, etc.)
Usually present:
- author: Array of author objects
- container-title: Journal/book title
- published-print or published-online: Publication date
- volume, issue, page: Publication details
- publisher: Publisher name
Sometimes present:
- abstract: Article abstract
- subject: Subject categories
- ISSN: Journal ISSN
- ISBN: Book ISBN
- reference: Reference list
- is-referenced-by-count: Citation count
Content Types
CrossRef type field values:
- journal-article: Journal articles
- book-chapter: Book chapters
- book: Books
- proceedings-article: Conference papers
- posted-content: Preprints
- dataset: Research datasets
- report: Technical reports
- dissertation: Theses/dissertations
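The type values above can drive the choice of BibTeX entry type. The mapping below is one plausible sketch; the choices of `techreport` for reports and `phdthesis` for all dissertations are assumptions, and unknown types fall back to `misc`.

```python
# Assumed mapping from CrossRef "type" values to BibTeX entry types.
CROSSREF_TO_BIBTEX = {
    "journal-article": "article",
    "book-chapter": "incollection",
    "book": "book",
    "proceedings-article": "inproceedings",
    "posted-content": "misc",      # preprints
    "dataset": "misc",
    "report": "techreport",        # assumption: standard BibTeX @techreport
    "dissertation": "phdthesis",   # assumption: may actually be a master's thesis
}


def bibtex_entry_type(crossref_type):
    """Return the BibTeX entry type for a CrossRef type, defaulting to misc."""
    return CROSSREF_TO_BIBTEX.get(crossref_type, "misc")


print(bibtex_entry_type("journal-article"))
print(bibtex_entry_type("some-future-type"))
```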
PubMed E-utilities API
Specialized for biomedical literature - Curated metadata with MeSH terms.
Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
API key recommended (free):
- Higher rate limits (10 requests/second with a key, 3 without)
- More reliable service for batch lookups
PMID to Metadata
Step 1: EFetch for full record
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
db=pubmed&
id=34265844&
retmode=xml&
api_key=YOUR_KEY
Response: XML with comprehensive metadata
Step 2: Parse XML
Key fields:
<PubmedArticle>
<MedlineCitation>
<PMID>34265844</PMID>
<Article>
<ArticleTitle>Title here</ArticleTitle>
<AuthorList>
<Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
</AuthorList>
<Journal>
<Title>Nature</Title>
<JournalIssue>
<Volume>595</Volume>
<Issue>7865</Issue>
<PubDate><Year>2021</Year></PubDate>
</JournalIssue>
</Journal>
<Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
<Abstract><AbstractText>Abstract text here</AbstractText></Abstract>
</Article>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
<ArticleId IdType="pmc">PMC8287551</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
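The key fields above can be pulled out with the standard library's ElementTree. This is a sketch run against a trimmed copy of the sample record; `parse_pubmed_article` is a hypothetical helper, and a real EFetch response wraps records in a `PubmedArticleSet` root, which the code allows for.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample record shown above.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <ArticleTitle>Title here</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
      <Journal>
        <Title>Nature</Title>
        <JournalIssue>
          <Volume>595</Volume><Issue>7865</Issue>
          <PubDate><Year>2021</Year></PubDate>
        </JournalIssue>
      </Journal>
      <Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
      <ArticleId IdType="pmc">PMC8287551</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
"""


def parse_pubmed_article(xml_text):
    """Extract core citation fields from the first PubmedArticle element."""
    root = ET.fromstring(xml_text.strip())
    # Real responses have a PubmedArticleSet root; the sample does not.
    art = root if root.tag == "PubmedArticle" else root.find(".//PubmedArticle")
    base = "./MedlineCitation/Article"
    title_el = art.find(f"{base}/ArticleTitle")
    authors = []
    for author in art.findall(f"{base}/AuthorList/Author"):
        last = author.findtext("LastName", default="")
        fore = author.findtext("ForeName", default="")
        if last:
            authors.append(f"{last}, {fore}" if fore else last)
    ids = {el.get("IdType"): el.text
           for el in art.findall("./PubmedData/ArticleIdList/ArticleId")}
    return {
        "pmid": art.findtext("./MedlineCitation/PMID", default=""),
        # itertext() keeps text inside inline markup such as <i> or <sub>.
        "title": "".join(title_el.itertext()) if title_el is not None else "",
        "authors": authors,
        "journal": art.findtext(f"{base}/Journal/Title", default=""),
        "year": art.findtext(f"{base}/Journal/JournalIssue/PubDate/Year", default=""),
        "volume": art.findtext(f"{base}/Journal/JournalIssue/Volume", default=""),
        "pages": art.findtext(f"{base}/Pagination/MedlinePgn", default=""),
        "doi": ids.get("doi", ""),
        "pmcid": ids.get("pmc", ""),
    }


print(parse_pubmed_article(SAMPLE_XML))
```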
Unique PubMed Fields
MeSH Terms: Controlled vocabulary
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D003920">Diabetes Mellitus</DescriptorName>
</MeshHeading>
</MeshHeadingList>
Publication Types:
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D016449">Randomized Controlled Trial</PublicationType>
</PublicationTypeList>
Grant Information:
<GrantList>
<Grant>
<GrantID>R01-123456</GrantID>
<Agency>NIAID NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
arXiv API
Preprints in physics, math, CS, q-bio - Free, open access.
Base URL: http://export.arxiv.org/api/query
No API key required
arXiv ID to Metadata
Request:
GET http://export.arxiv.org/api/query?id_list=2103.14030
Response: Atom XML (simplified; the field values below are illustrative)
<entry>
<id>http://arxiv.org/abs/2103.14030v2</id>
<title>Highly accurate protein structure prediction with AlphaFold</title>
<author><name>John Jumper</name></author>
<author><name>Richard Evans</name></author>
<published>2021-03-26T17:47:17Z</published>
<updated>2021-07-01T16:51:46Z</updated>
<summary>Abstract text here...</summary>
<arxiv:doi>10.1038/s41586-021-03819-2</arxiv:doi>
<category term="q-bio.BM" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>
Key Fields
- id: arXiv URL
- title: Preprint title
- author: Author list
- published: First version date
- updated: Latest version date
- summary: Abstract
- arxiv:doi: DOI if published
- arxiv:journal_ref: Journal reference if published
- category: arXiv categories
Version Tracking
arXiv tracks versions:
- v1: Initial submission
- v2, v3, etc.: Revisions
Always check whether a preprint has been published in a journal (use the DOI if available).
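A sketch of reading these fields, including the version suffix and the published-DOI check, follows. It parses an offline feed built from the illustrative entry above; `parse_arxiv_entry` is a hypothetical helper. The two namespace URIs are the standard Atom namespace and the arXiv schema URI that appears in the sample.

```python
import re
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom",
      "arxiv": "http://arxiv.org/schemas/atom"}

# Illustrative feed wrapping the sample entry shown above.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2103.14030v2</id>
    <title>Example preprint
      title</title>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <published>2021-03-26T17:47:17Z</published>
    <arxiv:doi>10.1038/s41586-021-03819-2</arxiv:doi>
    <category term="q-bio.BM" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>
"""


def parse_arxiv_entry(feed_xml):
    """Extract citation fields from the first entry of an arXiv Atom feed."""
    entry = ET.fromstring(feed_xml.strip()).find("atom:entry", NS)
    abs_url = entry.findtext("atom:id", default="", namespaces=NS)
    # Split "2103.14030v2" into the bare ID and the version suffix.
    match = re.match(r"(.+?)(v\d+)?$", abs_url.rsplit("/abs/", 1)[-1])
    title = entry.findtext("atom:title", default="", namespaces=NS)
    return {
        "arxiv_id": match.group(1),
        "version": match.group(2) or "",
        # Titles are often wrapped across lines; collapse the whitespace.
        "title": " ".join(title.split()),
        "authors": [a.findtext("atom:name", default="", namespaces=NS)
                    for a in entry.findall("atom:author", NS)],
        "year": entry.findtext("atom:published", default="", namespaces=NS)[:4],
        # None means no published version is recorded in the feed.
        "doi": entry.findtext("arxiv:doi", namespaces=NS),
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
    }


record = parse_arxiv_entry(SAMPLE_FEED)
source = "CrossRef (published)" if record["doi"] else "arXiv (preprint)"
print(record["arxiv_id"], record["version"], "->", source)
```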
DataCite API
Research datasets, software, other outputs - Assigns DOIs to non-traditional scholarly works.
Base URL: https://api.datacite.org/dois/
Similar to CrossRef but for datasets, software, code, etc.
Request:
GET https://api.datacite.org/dois/10.5281/zenodo.1234567
Response: JSON with metadata for dataset/software
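The sketch below shows how such a response might be mapped to `@misc` fields. The payload is a hypothetical, trimmed example: the nesting under `data.attributes` and the attribute names (`titles`, `creators`, `publicationYear`, `types.resourceTypeGeneral`) are assumptions about the response layout and should be checked against a real response before use.

```python
# Hypothetical, trimmed payload; attribute names are assumptions.
SAMPLE_PAYLOAD = {
    "data": {
        "id": "10.5281/zenodo.1234567",
        "attributes": {
            "doi": "10.5281/zenodo.1234567",
            "titles": [{"title": "Example Dataset"}],
            "creators": [{"name": "Author, Name"}],
            "publicationYear": 2024,
            "types": {"resourceTypeGeneral": "Dataset"},
        },
    }
}


def datacite_to_fields(payload):
    """Map an assumed DataCite payload to BibTeX @misc field values."""
    attrs = payload.get("data", {}).get("attributes", {})
    titles = attrs.get("titles") or [{}]
    creators = [c.get("name", "") for c in attrs.get("creators", [])]
    year = attrs.get("publicationYear")
    fields = {
        "author": " and ".join(name for name in creators if name),
        "title": titles[0].get("title", ""),
        "year": str(year) if year else "",
        "doi": attrs.get("doi", ""),
        # Record the resource type (Dataset, Software, ...) as a note.
        "note": (attrs.get("types") or {}).get("resourceTypeGeneral", ""),
    }
    return {name: value for name, value in fields.items() if value}


print(datacite_to_fields(SAMPLE_PAYLOAD))
```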
Required BibTeX Fields
@article (Journal Articles)
Required:
- author: Author names
- title: Article title
- journal: Journal name
- year: Publication year
Optional but recommended:
- volume: Volume number
- number: Issue number
- pages: Page range (e.g., 123--145)
- doi: Digital Object Identifier
- url: URL if no DOI
- month: Publication month
Example:
@article{Smith2024,
author = {Smith, John and Doe, Jane},
title = {Novel Approach to Protein Folding},
journal = {Nature},
year = {2024},
volume = {625},
number = {8001},
pages = {123--145},
doi = {10.1038/nature12345}
}
@book (Books)
Required:
- author or editor: Author(s) or editor(s)
- title: Book title
- publisher: Publisher name
- year: Publication year
Optional but recommended:
- edition: Edition number (if not first)
- address: Publisher location
- isbn: ISBN
- url: URL
- series: Series name
Example:
@book{Kumar2021,
author = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
title = {Robbins and Cotran Pathologic Basis of Disease},
publisher = {Elsevier},
year = {2021},
edition = {10},
isbn = {978-0-323-53113-9}
}
@inproceedings (Conference Papers)
Required:
- author: Author names
- title: Paper title
- booktitle: Conference/proceedings name
- year: Year
Optional but recommended:
- pages: Page range
- organization: Organizing body
- publisher: Publisher
- address: Conference location
- month: Conference month
- doi: DOI if available
Example:
@inproceedings{Vaswani2017,
author = {Vaswani, Ashish and Shazeer, Noam and others},
title = {Attention is All You Need},
booktitle = {Advances in Neural Information Processing Systems},
year = {2017},
pages = {5998--6008},
volume = {30}
}
@incollection (Book Chapters)
Required:
- author: Chapter author(s)
- title: Chapter title
- booktitle: Book title
- publisher: Publisher name
- year: Publication year
Optional but recommended:
- editor: Book editor(s)
- pages: Chapter page range
- chapter: Chapter number
- edition: Edition
- address: Publisher location
Example:
@incollection{Brown2020,
author = {Brown, Peter O. and Botstein, David},
title = {Exploring the New World of the Genome with {DNA} Microarrays},
booktitle = {DNA Microarrays: A Molecular Cloning Manual},
editor = {Eisen, Michael B. and Brown, Patrick O.},
publisher = {Cold Spring Harbor Laboratory Press},
year = {2020},
pages = {1--45}
}
@phdthesis (Dissertations)
Required:
- author: Author name
- title: Thesis title
- school: Institution
- year: Year
Optional:
- type: Type (e.g., "PhD dissertation")
- address: Institution location
- month: Month
- url: URL
Example:
@phdthesis{Johnson2023,
author = {Johnson, Mary L.},
title = {Novel Approaches to Cancer Immunotherapy},
school = {Stanford University},
year = {2023},
type = {{PhD} dissertation}
}
@misc (Preprints, Software, Datasets)
Required:
- author: Author(s)
- title: Title
- year: Year
For preprints, add:
- howpublished: Repository (e.g., "bioRxiv")
- doi: Preprint DOI
- note: Preprint ID
Example (preprint):
@misc{Zhang2024,
author = {Zhang, Yi and Chen, Li and Wang, Hui},
title = {Novel Therapeutic Targets in Alzheimer's Disease},
year = {2024},
howpublished = {bioRxiv},
doi = {10.1101/2024.01.001},
note = {Preprint}
}
Example (software):
@misc{AlphaFold2021,
author = {DeepMind},
title = {{AlphaFold} Protein Structure Database},
year = {2021},
howpublished = {Software},
url = {https://alphafold.ebi.ac.uk/},
doi = {10.5281/zenodo.5123456}
}
Extraction Workflows
From DOI
Best practice - Most reliable source:
# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2
# Multiple DOIs
python scripts/extract_metadata.py \
--doi 10.1038/nature12345 \
--doi 10.1126/science.abc1234 \
--output refs.bib
Process:
- Query CrossRef API with DOI
- Parse JSON response
- Extract required fields
- Determine entry type (@article, @book, etc.)
- Format as BibTeX
- Validate completeness
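The "Format as BibTeX" step can be as small as the sketch below. `format_bibtex` is a hypothetical helper, not the formatter used by `extract_metadata.py`; it assumes field values are already escaped and simply aligns the equals signs for readability.

```python
def format_bibtex(entry_type, key, fields):
    """Render one BibTeX entry from a dict of already-escaped field values."""
    width = max(len(name) for name in fields)
    body = ",\n".join(
        f"  {name.ljust(width)} = {{{value}}}" for name, value in fields.items()
    )
    return f"@{entry_type}{{{key},\n{body}\n}}"


entry = format_bibtex(
    "article",
    "Smith2024",
    {
        "author": "Smith, John and Doe, Jane",
        "title": "Novel Approach to Protein Folding",
        "journal": "Nature",
        "year": "2024",
        "pages": "123--145",
        "doi": "10.1038/nature12345",
    },
)
print(entry)
```

Fields are emitted in dictionary order, so building the dict in the order author, title, journal, year keeps the output consistent with the examples in this guide.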
From PMID
For biomedical literature:
# Single PMID
python scripts/extract_metadata.py --pmid 34265844
# Multiple PMIDs
python scripts/extract_metadata.py \
--pmid 34265844 \
--pmid 28445112 \
--output refs.bib
Process:
- Query PubMed EFetch with PMID
- Parse XML response
- Extract metadata including MeSH terms
- Check for DOI in response
- If DOI exists, optionally query CrossRef for additional metadata
- Format as BibTeX
From arXiv ID
For preprints:
python scripts/extract_metadata.py --arxiv 2103.14030
Process:
- Query arXiv API with ID
- Parse Atom XML response
- Check for published version (DOI in response)
- If published: Use DOI and CrossRef
- If not published: Use preprint metadata
- Format as @misc with preprint note
Important: Always check if preprint has been published!
From URL
When you only have a URL:
python scripts/extract_metadata.py \
--url "https://www.nature.com/articles/s41586-021-03819-2"
Process:
- Identify the URL type (DOI resolver, PubMed, arXiv, publisher page)
- Extract the identifier from the URL
- Query appropriate API
- Format as BibTeX
URL patterns:
# DOI URLs
https://doi.org/10.1038/nature12345
https://dx.doi.org/10.1126/science.abc123
https://www.nature.com/articles/s41586-021-03819-2 # Publisher page; DOI suffix appears in the path
# PubMed URLs
https://pubmed.ncbi.nlm.nih.gov/34265844/
https://www.ncbi.nlm.nih.gov/pubmed/34265844
# arXiv URLs
https://arxiv.org/abs/2103.14030
https://arxiv.org/pdf/2103.14030.pdf
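For the resolver, PubMed, and arXiv patterns above, the identifier can be read straight from the URL. The sketch below is one possible approach with a hypothetical function name. Publisher landing pages (such as the nature.com example) are not handled, since mapping them to a DOI needs publisher-specific rules or a page fetch.

```python
import re
from urllib.parse import unquote, urlparse


def identifier_from_url(url):
    """Return (kind, identifier) for known URL patterns, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = unquote(parsed.path)

    if host in ("doi.org", "dx.doi.org"):
        return ("doi", path.lstrip("/"))

    if host == "pubmed.ncbi.nlm.nih.gov":
        match = re.match(r"^/(\d+)/?$", path)
        return ("pmid", match.group(1)) if match else None

    if host.endswith("ncbi.nlm.nih.gov"):
        match = re.match(r"^/pubmed/(\d+)/?$", path)
        return ("pmid", match.group(1)) if match else None

    if host.endswith("arxiv.org"):
        # Handles /abs/<id> and /pdf/<id>.pdf
        match = re.match(r"^/(?:abs|pdf)/(.+?)(?:\.pdf)?$", path)
        return ("arxiv", match.group(1)) if match else None

    # Publisher landing pages need publisher-specific handling (not covered).
    return None


for link in [
    "https://doi.org/10.1038/nature12345",
    "https://pubmed.ncbi.nlm.nih.gov/34265844/",
    "https://arxiv.org/pdf/2103.14030.pdf",
]:
    print(link, "->", identifier_from_url(link))
```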
Batch Processing
From file with mixed identifiers:
# Create file with one identifier per line
# identifiers.txt:
# 10.1038/nature12345
# 34265844
# 2103.14030
# https://doi.org/10.1126/science.abc123
python scripts/extract_metadata.py \
--input identifiers.txt \
--output references.bib
Process:
- Script auto-detects identifier type
- Queries appropriate API
- Combines all into single BibTeX file
- Handles errors gracefully
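"Handles errors gracefully" could mean something like the loop below, where one failed lookup is recorded and the rest of the batch continues. This is a sketch of the pattern only: `process_batch` and the `fetch` callable are hypothetical, and `fake_fetch` stands in for the real API lookups.

```python
def process_batch(lines, fetch):
    """Run fetch() on each identifier, collecting results and failures."""
    entries, failures, seen = [], [], set()
    for raw in lines:
        identifier = raw.strip()
        # Skip blank lines, comment lines, and duplicates.
        if not identifier or identifier.startswith("#") or identifier in seen:
            continue
        seen.add(identifier)
        try:
            entries.append(fetch(identifier))
        except Exception as exc:  # keep going; report the failure at the end
            failures.append((identifier, str(exc)))
    return entries, failures


def fake_fetch(identifier):
    """Stand-in for a real lookup; fails on one made-up identifier."""
    if identifier == "bad-id":
        raise ValueError("unrecognized identifier")
    return f"@misc{{{identifier}}}"


entries, failures = process_batch(
    ["10.1038/nature12345", "", "bad-id", "34265844", "34265844"], fake_fetch
)
print(len(entries), "entries;", len(failures), "failures:", failures)
```

Returning the failures alongside the entries lets the caller write the successful entries to the output file and still list what needs manual attention.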
Special Cases and Edge Cases
Preprints Later Published
Issue: Preprint cited, but journal version now available.
Solution:
- Check arXiv metadata for DOI field
- If DOI present, use published version
- Update citation to journal article
- Note preprint version in comments if needed
Example:
% Originally: arXiv:2103.14030
% Published as:
@article{Jumper2021,
author = {Jumper, John and Evans, Richard and others},
title = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
journal = {Nature},
year = {2021},
volume = {596},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
Multiple Authors (et al.)
Issue: Many authors (10+).
BibTeX practice:
- Include all authors if <10
- Use "and others" for 10+
- Or list all (journals vary)
Example:
@article{LargeCollaboration2024,
author = {First, Author and Second, Author and Third, Author and others},
...
}
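The threshold rule above can be written as a small helper. This is a sketch: the function name is hypothetical, and keeping the first three names before "others" is an assumption taken from the example entry rather than a BibTeX requirement.

```python
def bibtex_author_field(authors, threshold=10, keep=3):
    """Join authors with ' and '; truncate with 'others' at the threshold."""
    if len(authors) < threshold:
        return " and ".join(authors)
    # "others" is the BibTeX keyword that styles render as "et al."
    return " and ".join(authors[:keep] + ["others"])


short_list = ["Smith, John", "Doe, Jane"]
long_list = [f"Author{i}, A." for i in range(1, 13)]
print(bibtex_author_field(short_list))
print(bibtex_author_field(long_list))
```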
Author Name Variations
Issue: Authors publish under different name formats.
Standardization:
# Common variations
John Smith
John A. Smith
John Andrew Smith
J. A. Smith
Smith, J.
Smith, J. A.
# BibTeX format (recommended)
author = {Smith, John A.}
Extraction preference:
- Use full name if available
- Include middle initial if available
- Format: Last, First Middle
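A naive normalizer for the "Last, First Middle" format might look like the sketch below. The helper name is hypothetical, and it assumes the final word is the family name, so multi-word surnames (for example "van der Berg") and suffixes such as "Jr." need separate handling.

```python
def to_last_first(name):
    """Convert 'First Middle Last' to 'Last, First Middle' (naive)."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        # Already in "Last, First" order; just tidy the spacing.
        last, _, rest = name.partition(",")
        return f"{last.strip()}, {rest.strip()}" if rest.strip() else last.strip()
    parts = name.split(" ")
    if len(parts) == 1:
        return parts[0]
    # Assumption: the final word is the family name.
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


for variant in ["John Smith", "John A. Smith", "J. A. Smith", "Smith, J. A."]:
    print(variant, "->", to_last_first(variant))
```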
No DOI Available
Issue: Older papers or books without DOIs.
Solutions:
- Use PMID if available (biomedical)
- Use ISBN for books
- Use URL to stable source
- Include full publication details
Example:
@article{OldPaper1995,
author = {Author, Name},
title = {Title Here},
journal = {Journal Name},
year = {1995},
volume = {123},
pages = {45--67},
url = {https://stable-url-here},
note = {PMID: 12345678}
}
Conference Papers vs Journal Articles
Issue: Same work published in both.
Best practice:
- Cite journal version if both available
- Journal version is archival
- Conference version for timeliness
If citing conference:
@inproceedings{Smith2024conf,
author = {Smith, John},
title = {Title},
booktitle = {Proceedings of NeurIPS 2024},
year = {2024}
}
If citing journal:
@article{Smith2024journal,
author = {Smith, John},
title = {Title},
journal = {Journal of Machine Learning Research},
year = {2024}
}
Book Chapters vs Edited Collections
Extract correctly:
- Chapter: Use @incollection
- Whole book: Use @book
- Book editor: List in editor field
- Chapter author: List in author field
Datasets and Software
Use @misc with appropriate fields:
@misc{DatasetName2024,
author = {Author, Name},
title = {Dataset Title},
year = {2024},
howpublished = {Zenodo},
doi = {10.5281/zenodo.123456},
note = {Version 1.2}
}
Validation After Extraction
Always validate extracted metadata:
python scripts/validate_citations.py extracted_refs.bib
Check:
- All required fields present
- DOI resolves correctly
- Author names formatted consistently
- Year is reasonable (4 digits)
- Journal/publisher names correct
- Page ranges use -- not -
- Special characters handled properly
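Several of these checks are mechanical. The sketch below covers required fields (per the entry-type tables earlier in this guide), the four-digit year, and page-range hyphens. It is not the implementation of `validate_citations.py`; `validate_entry` and its message strings are hypothetical, and DOI resolution is left out because it needs network access.

```python
import re

# Required fields per entry type; a tuple lists acceptable alternatives.
REQUIRED = {
    "article": [("author",), ("title",), ("journal",), ("year",)],
    "book": [("author", "editor"), ("title",), ("publisher",), ("year",)],
    "inproceedings": [("author",), ("title",), ("booktitle",), ("year",)],
    "incollection": [("author",), ("title",), ("booktitle",),
                     ("publisher",), ("year",)],
    "phdthesis": [("author",), ("title",), ("school",), ("year",)],
    "misc": [("author",), ("title",), ("year",)],
}


def validate_entry(entry_type, fields):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for alternatives in REQUIRED.get(entry_type, []):
        if not any(fields.get(name) for name in alternatives):
            problems.append("missing " + " or ".join(alternatives))
    year = fields.get("year", "")
    if year and not re.fullmatch(r"\d{4}", year):
        problems.append("year is not 4 digits")
    # A single hyphen that is not part of "--" marks a badly formed range.
    if re.search(r"(?<!-)-(?!-)", fields.get("pages", "")):
        problems.append("pages should use -- for ranges")
    return problems


print(validate_entry("article", {
    "author": "Smith, John", "title": "Title", "year": "2024", "pages": "123-145",
}))
```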
Best Practices
1. Prefer DOI When Available
DOIs provide:
- Permanent identifier
- Best metadata source
- Publisher-verified information
- Resolvable link
2. Verify Automatically Extracted Metadata
Spot-check:
- Author names match publication
- Title matches (including capitalization)
- Year is correct
- Journal name is complete
3. Handle Special Characters
LaTeX special characters:
- Protect capitalization: {AlphaFold}
- Handle accents: M{\"u}ller or use Unicode
- Chemical formulas: H$_2$O or \ce{H2O}
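A partial sketch of the first two rules is shown below. The escape and accent tables are small, illustrative subsets chosen for this example, and `escape_bibtex_value` is a hypothetical helper. It should not be applied to values that already contain LaTeX math or braces, since it would escape the underscore in H$_2$O.

```python
import re

# Illustrative subsets only; extend as needed.
LATEX_ESCAPES = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_"}
ACCENTS = {"ü": r'{\"u}', "ö": r'{\"o}', "ä": r'{\"a}',
           "é": r"{\'e}", "è": r"{\`e}", "ñ": r"{\~n}"}


def escape_bibtex_value(text, protect=()):
    """Escape special characters and brace-protect the given words."""
    out = "".join(LATEX_ESCAPES.get(ch, ACCENTS.get(ch, ch)) for ch in text)
    for word in protect:
        # A function replacement avoids backslash handling in the template.
        out = re.sub(rf"\b{re.escape(word)}\b",
                     lambda m: "{" + m.group(0) + "}", out)
    return out


print(escape_bibtex_value("Müller & Sons on AlphaFold", protect=["AlphaFold"]))
```

If the bibliography is processed with a Unicode-aware toolchain, the accent table can be skipped and the characters left as they are.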
4. Use Consistent Citation Keys
Convention: FirstAuthorYEARkeyword
Smith2024protein
Doe2023machine
Johnson2024cancer
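One way to generate keys in this style is sketched below. The convention does not say how the keyword is chosen, so this helper (a hypothetical `make_citation_key`) assumes the first title word that is not in a small stopword list; the stopword list itself is also an assumption.

```python
import re
import unicodedata

# Assumed stopword list; adjust to taste.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "to", "and", "with"}


def make_citation_key(first_author, year, title):
    """Build a FirstAuthorYEARkeyword key from 'Last, First', year, title."""
    family = first_author.split(",")[0].strip()
    # Strip accents so the key stays ASCII (e.g. "Müller" becomes "Muller").
    ascii_family = (unicodedata.normalize("NFKD", family)
                    .encode("ascii", "ignore").decode("ascii"))
    ascii_family = re.sub(r"[^A-Za-z]", "", ascii_family)
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((w for w in words if w not in STOPWORDS), "")
    return f"{ascii_family}{year}{keyword}"


print(make_citation_key("Smith, John", 2024, "Protein Folding at Scale"))
print(make_citation_key("Müller, Anna", 2023, "The Machine Learning Handbook"))
```

Keys built this way can still collide (two papers by the same first author in the same year with the same leading word), so a final uniqueness check over the whole .bib file is advisable.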
5. Include DOI for Modern Papers
Most journal articles published after ~2000 have a DOI; include it whenever one exists:
doi = {10.1038/nature12345}
6. Document Source
For non-standard sources, add note:
note = {Preprint, not peer-reviewed}
note = {Technical report}
note = {Dataset accompanying [citation]}
Summary
Metadata extraction workflow:
- Identify: Determine identifier type (DOI, PMID, arXiv, URL)
- Query: Use appropriate API (CrossRef, PubMed, arXiv)
- Extract: Parse response for required fields
- Format: Create properly formatted BibTeX entry
- Validate: Check completeness and accuracy
- Verify: Spot-check critical citations
Use scripts to automate:
- extract_metadata.py: Universal extractor
- doi_to_bibtex.py: Quick DOI conversion
- validate_citations.py: Verify accuracy
Always validate extracted metadata before final submission!