Metadata Extraction Guide

Comprehensive guide to extracting accurate citation metadata from DOIs, PMIDs, arXiv IDs, and URLs using various APIs and services.

Overview

Accurate metadata is essential for proper citations. This guide covers:

  • Identifying paper identifiers (DOI, PMID, arXiv ID)
  • Querying metadata APIs (CrossRef, PubMed, arXiv, DataCite)
  • Required BibTeX fields by entry type
  • Handling edge cases and special situations
  • Validating extracted metadata

Paper Identifiers

DOI (Digital Object Identifier)

Format: 10.XXXX/suffix

Examples:

10.1038/s41586-021-03819-2    # Nature article
10.1126/science.aam9317       # Science article
10.1016/j.cell.2023.01.001    # Cell article
10.1371/journal.pone.0123456  # PLOS ONE article

Properties:

  • Permanent identifier
  • Most reliable for metadata
  • Resolves to current location
  • Publisher-assigned

Where to find:

  • First page of article
  • Article webpage
  • CrossRef, Google Scholar, PubMed
  • Usually prominent on publisher site

PMID (PubMed ID)

Format: integer of 1 to 8 digits (8 digits for recent articles)

Examples:

34265844
28445112
35476778

Properties:

  • Specific to PubMed database
  • Biomedical literature only
  • Assigned by NCBI
  • Permanent identifier

Where to find:

  • PubMed search results
  • Article page on PubMed
  • Often in article PDF footer
  • PMC (PubMed Central) pages

PMCID (PubMed Central ID)

Format: PMC followed by numbers

Examples:

PMC8287551
PMC7456789

Properties:

  • Free full-text articles in PMC
  • Subset of PubMed articles
  • Open access or author manuscripts

arXiv ID

Format: YYMM.NNNNN (YYMM.NNNN for April 2007 through December 2014) or archive/YYMMNNN (before April 2007)

Examples:

2103.14030        # New format (since 2007)
2401.12345        # 2024 submission
arXiv:hep-th/9901001  # Old format

Properties:

  • Preprints (not peer-reviewed)
  • Physics, math, CS, q-bio, etc.
  • Version tracking (v1, v2, etc.)
  • Free, open access

Where to find:

  • arXiv.org
  • Often cited before publication
  • Paper PDF header

Other Identifiers

ISBN (Books):

978-0-12-345678-9
0-123-45678-9

arXiv category:

cs.LG    # Computer Science - Machine Learning
q-bio.QM # Quantitative Biology - Quantitative Methods
math.ST  # Mathematics - Statistics

Metadata APIs

CrossRef API

Primary source for DOIs - Most comprehensive metadata for journal articles.

Base URL: https://api.crossref.org/works/

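No API key required, but the polite pool is recommended:

  • Add a contact email to the User-Agent header (or a mailto query parameter)
  • Gets more reliable service
  • Rate limits still apply; the current limits are reported in the X-Rate-Limit-Limit and X-Rate-Limit-Interval response headers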

Basic DOI Lookup

Request:

GET https://api.crossref.org/works/10.1038/s41586-021-03819-2

Response (simplified):

{
  "message": {
    "DOI": "10.1038/s41586-021-03819-2",
    "title": ["Article title here"],
    "author": [
      {"given": "John", "family": "Smith"},
      {"given": "Jane", "family": "Doe"}
    ],
    "container-title": ["Nature"],
    "volume": "596",
    "issue": "7873",
    "page": "583-589",
    "published-print": {"date-parts": [[2021, 8, 26]]},
    "publisher": "Springer Nature",
    "type": "journal-article",
    "ISSN": ["0028-0836"]
  }
}
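A response of this shape can be reduced to the handful of fields a citation needs. The sketch below uses only the Python standard library; the function names `fetch_crossref_message` and `summarize_crossref` and the User-Agent string are hypothetical, and the code assumes the response layout shown above.

```python
import json
import urllib.parse
import urllib.request


def fetch_crossref_message(doi: str, mailto: str) -> dict:
    """Fetch the CrossRef record for a DOI (network call; mailto opts into the polite pool)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    request = urllib.request.Request(
        url,
        # Hypothetical client name; only the mailto part matters to CrossRef.
        headers={"User-Agent": f"metadata-extraction-sketch/0.1 (mailto:{mailto})"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["message"]


def _author_name(author: dict) -> str:
    """Format one CrossRef author object as 'Family, Given'."""
    family, given = author.get("family"), author.get("given")
    if family and given:
        return f"{family}, {given}"
    # Organizational authors carry a single "name" field instead.
    return family or author.get("name", "")


def summarize_crossref(message: dict) -> dict:
    """Pick citation fields out of a CrossRef 'message' object; missing fields become None."""
    date = message.get("published-print") or message.get("published-online") or {}
    date_parts = date.get("date-parts") or [[None]]
    return {
        "doi": message.get("DOI"),
        "type": message.get("type"),
        "title": (message.get("title") or [None])[0],
        "authors": [_author_name(a) for a in message.get("author", [])],
        "journal": (message.get("container-title") or [None])[0],
        "year": date_parts[0][0] if date_parts[0] else None,
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "pages": message.get("page"),
    }
```

Keeping the parsing separate from the network call makes the parser easy to exercise against a saved response.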

Fields Available

Always present:

  • DOI: Digital Object Identifier
  • title: Article title (array)
  • type: Content type (journal-article, book-chapter, etc.)

Usually present:

  • author: Array of author objects
  • container-title: Journal/book title
  • published-print or published-online: Publication date
  • volume, issue, page: Publication details
  • publisher: Publisher name

Sometimes present:

  • abstract: Article abstract
  • subject: Subject categories
  • ISSN: Journal ISSN
  • ISBN: Book ISBN
  • reference: Reference list
  • is-referenced-by-count: Citation count

Content Types

CrossRef type field values:

  • journal-article: Journal articles
  • book-chapter: Book chapters
  • book: Books
  • proceedings-article: Conference papers
  • posted-content: Preprints
  • dataset: Research datasets
  • report: Technical reports
  • dissertation: Theses/dissertations
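One plausible way to turn these type values into BibTeX entry types is a lookup table. The mapping below is an assumption for illustration, not a rule defined by CrossRef or BibTeX; in particular, sending `report` to `@techreport` and falling back to `@misc` are choices made for this sketch.

```python
# Hypothetical mapping from CrossRef "type" values to BibTeX entry types.
CROSSREF_TO_BIBTEX = {
    "journal-article": "article",
    "book-chapter": "incollection",
    "book": "book",
    "proceedings-article": "inproceedings",
    "posted-content": "misc",   # preprints
    "dataset": "misc",
    "report": "techreport",
    "dissertation": "phdthesis",
}


def bibtex_entry_type(crossref_type: str) -> str:
    """Return the BibTeX entry type for a CrossRef type, defaulting to 'misc'."""
    return CROSSREF_TO_BIBTEX.get(crossref_type, "misc")
```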

PubMed E-utilities API

Specialized for biomedical literature - Curated metadata with MeSH terms.

Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/

API key recommended (free):

  • Higher rate limits (10 requests per second with a key, 3 without)
  • Better performance

PMID to Metadata

Step 1: EFetch for full record

GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
  db=pubmed&
  id=34265844&
  retmode=xml&
  api_key=YOUR_KEY

Response: XML with comprehensive metadata

Step 2: Parse XML

Key fields:

<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <ArticleTitle>Title here</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
      <Journal>
        <Title>Nature</Title>
        <JournalIssue>
          <Volume>596</Volume>
          <Issue>7873</Issue>
          <PubDate><Year>2021</Year></PubDate>
        </JournalIssue>
      </Journal>
      <Pagination><MedlinePgn>583-589</MedlinePgn></Pagination>
      <Abstract><AbstractText>Abstract text here</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
      <ArticleId IdType="pmc">PMC8287551</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
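The record above can be parsed with the standard library's ElementTree. This is a minimal sketch under the assumption that the XML follows the structure shown; `parse_pubmed_article` is a hypothetical name, and real records may carry a `MedlineDate` instead of `Year`, in which case `year` comes back as `None`.

```python
import xml.etree.ElementTree as ET


def parse_pubmed_article(xml_text: str) -> dict:
    """Extract citation fields from the first PubmedArticle in an EFetch response."""
    root = ET.fromstring(xml_text)
    article = root if root.tag == "PubmedArticle" else root.find(".//PubmedArticle")
    if article is None:
        raise ValueError("no PubmedArticle element found")

    def text(path: str):
        node = article.find(path)
        # itertext() keeps titles that contain inline markup such as <i>.
        return "".join(node.itertext()).strip() if node is not None else None

    authors = []
    for author in article.findall("MedlineCitation/Article/AuthorList/Author"):
        last, fore = author.findtext("LastName"), author.findtext("ForeName")
        if last:
            authors.append(f"{last}, {fore}" if fore else last)

    # Only the article's own ID list; cited references have their own ArticleIdList.
    ids = {
        node.get("IdType"): (node.text or "").strip()
        for node in article.findall("PubmedData/ArticleIdList/ArticleId")
    }
    return {
        "pmid": text("MedlineCitation/PMID"),
        "title": text("MedlineCitation/Article/ArticleTitle"),
        "authors": authors,
        "journal": text("MedlineCitation/Article/Journal/Title"),
        "year": text("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"),
        "volume": text("MedlineCitation/Article/Journal/JournalIssue/Volume"),
        "issue": text("MedlineCitation/Article/Journal/JournalIssue/Issue"),
        "pages": text("MedlineCitation/Article/Pagination/MedlinePgn"),
        "doi": ids.get("doi"),
        "pmcid": ids.get("pmc"),
    }
```

The explicit `PubmedData/ArticleIdList` path is deliberate: a looser `.//ArticleId` search could pick up the DOIs of cited references instead of the article's own.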

Unique PubMed Fields

MeSH Terms: Controlled vocabulary

<MeshHeadingList>
  <MeshHeading>
    <DescriptorName UI="D003920">Diabetes Mellitus</DescriptorName>
  </MeshHeading>
</MeshHeadingList>

Publication Types:

<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D016449">Randomized Controlled Trial</PublicationType>
</PublicationTypeList>

Grant Information:

<GrantList>
  <Grant>
    <GrantID>R01-123456</GrantID>
    <Agency>NIAID NIH HHS</Agency>
    <Country>United States</Country>
  </Grant>
</GrantList>

arXiv API

Preprints in physics, math, CS, q-bio - Free, open access.

Base URL: http://export.arxiv.org/api/query

No API key required

arXiv ID to Metadata

Request:

GET http://export.arxiv.org/api/query?id_list=2103.14030

Response: Atom XML (simplified; the field values are illustrative and do not reproduce the actual record for this ID)

<entry>
  <id>http://arxiv.org/abs/2103.14030v2</id>
  <title>Highly accurate protein structure prediction with AlphaFold</title>
  <author><name>John Jumper</name></author>
  <author><name>Richard Evans</name></author>
  <published>2021-03-26T17:47:17Z</published>
  <updated>2021-07-01T16:51:46Z</updated>
  <summary>Abstract text here...</summary>
  <arxiv:doi>10.1038/s41586-021-03819-2</arxiv:doi>
  <category term="q-bio.BM" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>

Key Fields

  • id: arXiv URL
  • title: Preprint title
  • author: Author list
  • published: First version date
  • updated: Latest version date
  • summary: Abstract
  • arxiv:doi: DOI if published
  • arxiv:journal_ref: Journal reference if published
  • category: arXiv categories
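These fields can be read from the Atom feed with ElementTree, provided the XML namespaces are passed explicitly. The sketch below assumes the feed declares the standard Atom namespace and the arXiv namespace shown in the `scheme` attribute above; `parse_arxiv_feed` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def parse_arxiv_feed(feed_xml: str) -> list:
    """Return one dict per <entry> in an arXiv API Atom feed."""
    entries = []
    for entry in ET.fromstring(feed_xml).findall("atom:entry", NS):
        id_url = entry.findtext("atom:id", default="", namespaces=NS)
        # Split ".../abs/2103.14030v2" into the bare ID and the version number.
        match = re.search(r"/abs/(.+?)(?:v(\d+))?$", id_url.strip())
        title = entry.findtext("atom:title", default="", namespaces=NS)
        entries.append({
            "arxiv_id": match.group(1) if match else None,
            "version": int(match.group(2)) if match and match.group(2) else None,
            "title": " ".join(title.split()),  # titles may be wrapped across lines
            "authors": [
                name.text.strip()
                for name in entry.findall("atom:author/atom:name", NS)
                if name.text
            ],
            "published": entry.findtext("atom:published", namespaces=NS),
            "updated": entry.findtext("atom:updated", namespaces=NS),
            "summary": entry.findtext("atom:summary", namespaces=NS),
            "doi": entry.findtext("arxiv:doi", namespaces=NS),
            "journal_ref": entry.findtext("arxiv:journal_ref", namespaces=NS),
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        })
    return entries
```

A non-empty `doi` or `journal_ref` in the result is the signal that a published version may exist and should be cited instead of the preprint.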

Version Tracking

arXiv tracks versions:

  • v1: Initial submission
  • v2, v3, etc.: Revisions

Always check whether a preprint has since been published in a journal; if so, cite the published version using its DOI.

DataCite API

Research datasets, software, other outputs - Assigns DOIs to non-traditional scholarly works.

Base URL: https://api.datacite.org/dois/

Similar to CrossRef but for datasets, software, code, etc.

Request:

GET https://api.datacite.org/dois/10.5281/zenodo.1234567

Response: JSON with metadata for dataset/software

Required BibTeX Fields

@article (Journal Articles)

Required:

  • author: Author names
  • title: Article title
  • journal: Journal name
  • year: Publication year

Optional but recommended:

  • volume: Volume number
  • number: Issue number
  • pages: Page range (e.g., 123--145)
  • doi: Digital Object Identifier
  • url: URL if no DOI
  • month: Publication month

Example:

@article{Smith2024,
  author  = {Smith, John and Doe, Jane},
  title   = {Novel Approach to Protein Folding},
  journal = {Nature},
  year    = {2024},
  volume  = {625},
  number  = {8001},
  pages   = {123--145},
  doi     = {10.1038/nature12345}
}

@book (Books)

Required:

  • author or editor: Author(s) or editor(s)
  • title: Book title
  • publisher: Publisher name
  • year: Publication year

Optional but recommended:

  • edition: Edition number (if not first)
  • address: Publisher location
  • isbn: ISBN
  • url: URL
  • series: Series name

Example:

@book{Kumar2021,
  author    = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
  title     = {Robbins and Cotran Pathologic Basis of Disease},
  publisher = {Elsevier},
  year      = {2021},
  edition   = {10},
  isbn      = {978-0-323-53113-9}
}

@inproceedings (Conference Papers)

Required:

  • author: Author names
  • title: Paper title
  • booktitle: Conference/proceedings name
  • year: Year

Optional but recommended:

  • pages: Page range
  • organization: Organizing body
  • publisher: Publisher
  • address: Conference location
  • month: Conference month
  • doi: DOI if available

Example:

@inproceedings{Vaswani2017,
  author    = {Vaswani, Ashish and Shazeer, Noam and others},
  title     = {Attention is All You Need},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2017},
  pages     = {5998--6008},
  volume    = {30}
}

@incollection (Book Chapters)

Required:

  • author: Chapter author(s)
  • title: Chapter title
  • booktitle: Book title
  • publisher: Publisher name
  • year: Publication year

Optional but recommended:

  • editor: Book editor(s)
  • pages: Chapter page range
  • chapter: Chapter number
  • edition: Edition
  • address: Publisher location

Example:

@incollection{Brown2020,
  author    = {Brown, Patrick O. and Botstein, David},
  title     = {Exploring the New World of the Genome with {DNA} Microarrays},
  booktitle = {DNA Microarrays: A Molecular Cloning Manual},
  editor    = {Eisen, Michael B. and Brown, Patrick O.},
  publisher = {Cold Spring Harbor Laboratory Press},
  year      = {2020},
  pages     = {1--45}
}

@phdthesis (Dissertations)

Required:

  • author: Author name
  • title: Thesis title
  • school: Institution
  • year: Year

Optional:

  • type: Type (e.g., "PhD dissertation")
  • address: Institution location
  • month: Month
  • url: URL

Example:

@phdthesis{Johnson2023,
  author = {Johnson, Mary L.},
  title  = {Novel Approaches to Cancer Immunotherapy},
  school = {Stanford University},
  year   = {2023},
  type   = {{PhD} dissertation}
}

@misc (Preprints, Software, Datasets)

Required:

  • author: Author(s)
  • title: Title
  • year: Year

For preprints, add:

  • howpublished: Repository (e.g., "bioRxiv")
  • doi: Preprint DOI
  • note: Preprint ID

Example (preprint):

@misc{Zhang2024,
  author       = {Zhang, Yi and Chen, Li and Wang, Hui},
  title        = {Novel Therapeutic Targets in Alzheimer's Disease},
  year         = {2024},
  howpublished = {bioRxiv},
  doi          = {10.1101/2024.01.001},
  note         = {Preprint}
}

Example (software):

@misc{AlphaFold2021,
  author       = {DeepMind},
  title        = {{AlphaFold} Protein Structure Database},
  year         = {2021},
  howpublished = {Software},
  url          = {https://alphafold.ebi.ac.uk/},
  doi          = {10.5281/zenodo.5123456}
}

Extraction Workflows

From DOI

Best practice - Most reliable source:

# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2

# Multiple DOIs
python scripts/extract_metadata.py \
  --doi 10.1038/nature12345 \
  --doi 10.1126/science.abc1234 \
  --output refs.bib

Process:

  1. Query CrossRef API with DOI
  2. Parse JSON response
  3. Extract required fields
  4. Determine entry type (@article, @book, etc.)
  5. Format as BibTeX
  6. Validate completeness
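The formatting step can be sketched as a small function that takes already extracted fields and emits an entry in the layout used throughout this guide. This is an illustration only, not the code in `extract_metadata.py`; the names `format_bibtex` and `normalize_pages` are hypothetical.

```python
import re


def normalize_pages(pages: str) -> str:
    """Rewrite '123-145' or '123–145' as the BibTeX range '123--145'."""
    return re.sub(r"\s*[-\u2013]+\s*", "--", pages.strip())


def format_bibtex(entry_type: str, key: str, fields: dict) -> str:
    """Render a BibTeX entry, skipping empty fields and aligning the '=' signs."""
    present = {
        name: str(value)
        for name, value in fields.items()
        if value not in (None, "")
    }
    if "pages" in present:
        present["pages"] = normalize_pages(present["pages"])
    width = max((len(name) for name in present), default=0)
    lines = [f"  {name:<{width}} = {{{value}}}" for name, value in present.items()]
    return f"@{entry_type}{{{key},\n" + ",\n".join(lines) + "\n}"


if __name__ == "__main__":
    print(format_bibtex("article", "Smith2024", {
        "author": "Smith, John and Doe, Jane",
        "title": "Novel Approach to Protein Folding",
        "journal": "Nature",
        "year": 2024,
        "pages": "123-145",
    }))
```

Fields are emitted in the order they appear in the dictionary, so the caller controls the field order of the entry.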

From PMID

For biomedical literature:

# Single PMID
python scripts/extract_metadata.py --pmid 34265844

# Multiple PMIDs
python scripts/extract_metadata.py \
  --pmid 34265844 \
  --pmid 28445112 \
  --output refs.bib

Process:

  1. Query PubMed EFetch with PMID
  2. Parse XML response
  3. Extract metadata including MeSH terms
  4. Check for DOI in response
  5. If DOI exists, optionally query CrossRef for additional metadata
  6. Format as BibTeX

From arXiv ID

For preprints:

python scripts/extract_metadata.py --arxiv 2103.14030

Process:

  1. Query arXiv API with ID
  2. Parse Atom XML response
  3. Check for published version (DOI in response)
  4. If published: query CrossRef with the DOI and format as the journal entry (@article)
  5. If not published: use the preprint metadata
  6. Format unpublished preprints as @misc with a preprint note

Important: Always check if preprint has been published!

From URL

When you only have a URL:

python scripts/extract_metadata.py \
  --url "https://www.nature.com/articles/s41586-021-03819-2"

Process:

  1. Parse the URL into host and path
  2. Identify the identifier type (DOI, PMID, arXiv)
  3. Extract the identifier from the URL
  4. Query the appropriate API
  5. Format as BibTeX

URL patterns:

# DOI URLs
https://doi.org/10.1038/nature12345
https://dx.doi.org/10.1126/science.abc123
# Publisher URLs (DOI derived from the article path or page metadata)
https://www.nature.com/articles/s41586-021-03819-2

# PubMed URLs
https://pubmed.ncbi.nlm.nih.gov/34265844/
https://www.ncbi.nlm.nih.gov/pubmed/34265844

# arXiv URLs
https://arxiv.org/abs/2103.14030
https://arxiv.org/pdf/2103.14030.pdf
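The URL patterns above can be matched by host and path. The following sketch is an assumption about how such matching might look, not the script's actual behavior; `identifier_from_url` is a hypothetical name, and the nature.com branch assumes that the article slug equals the DOI suffix under the 10.1038 prefix, which other publishers do not guarantee.

```python
import re
from typing import Optional, Tuple
from urllib.parse import unquote, urlparse


def identifier_from_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (identifier_type, identifier) for a recognized URL, else None."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    path = unquote(parsed.path)

    if host in ("doi.org", "dx.doi.org", "www.doi.org"):
        doi = path.lstrip("/")
        return ("doi", doi) if doi.startswith("10.") else None

    if host.endswith("ncbi.nlm.nih.gov"):
        # Matches /34265844/ (pubmed.ncbi...) and /pubmed/34265844 (www.ncbi...).
        match = re.search(r"^/(?:pubmed/)?(\d+)/?$", path)
        return ("pmid", match.group(1)) if match else None

    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        # Drops a trailing version (v2) and a .pdf extension if present.
        match = re.match(r"^/(?:abs|pdf)/(.+?)(?:v\d+)?(?:\.pdf)?$", path)
        return ("arxiv", match.group(1)) if match else None

    if host.endswith("nature.com"):
        # Assumption: Nature article slugs map directly onto 10.1038/<slug>.
        match = re.match(r"^/articles/([^/]+)/?$", path)
        return ("doi", "10.1038/" + match.group(1)) if match else None

    return None
```

For publishers without such a predictable path, a fallback would be needed, for example reading the DOI from the page's metadata; that is outside this sketch.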

Batch Processing

From file with mixed identifiers:

# Create file with one identifier per line
# identifiers.txt:
#   10.1038/nature12345
#   34265844
#   2103.14030
#   https://doi.org/10.1126/science.abc123

python scripts/extract_metadata.py \
  --input identifiers.txt \
  --output references.bib

Process:

  • Script auto-detects identifier type
  • Queries appropriate API
  • Combines all into single BibTeX file
  • Handles errors gracefully

Special Cases and Edge Cases

Preprints Later Published

Issue: Preprint cited, but journal version now available.

Solution:

  1. Check arXiv metadata for DOI field
  2. If DOI present, use published version
  3. Update citation to journal article
  4. Note preprint version in comments if needed

Example (the arXiv ID is an illustrative placeholder):

% Originally: arXiv:2103.14030
% Published as:
@article{Jumper2021,
  author  = {Jumper, John and Evans, Richard and others},
  title   = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
  journal = {Nature},
  year    = {2021},
  volume  = {596},
  pages   = {583--589},
  doi     = {10.1038/s41586-021-03819-2}
}

Multiple Authors (et al.)

Issue: Many authors (10+).

BibTeX practice:

  • Include all authors if <10
  • Use "and others" for 10+
  • Or list all (journals vary)

Example:

@article{LargeCollaboration2024,
  author = {First, Author and Second, Author and Third, Author and others},
  ...
}
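The truncation rule above is simple to express in code. In this sketch the threshold of 10 comes from the practice described here, while the function name `bibtex_author_field` and the choice to keep the first three names are assumptions.

```python
def bibtex_author_field(authors: list, max_listed: int = 10, keep: int = 3) -> str:
    """Join authors with ' and '; for max_listed or more, keep a few and add 'others'."""
    if len(authors) < max_listed:
        return " and ".join(authors)
    return " and ".join(authors[:keep] + ["others"])
```

BibTeX styles render the literal `others` as "et al.", so it should only ever appear as the last name in the list.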

Author Name Variations

Issue: Authors publish under different name formats.

Standardization:

# Common variations
John Smith
John A. Smith
John Andrew Smith
J. A. Smith
Smith, J.
Smith, J. A.

# BibTeX format (recommended)
author = {Smith, John A.}

Extraction preference:

  1. Use full name if available
  2. Include middle initial if available
  3. Format: Last, First Middle

No DOI Available

Issue: Older papers or books without DOIs.

Solutions:

  1. Use PMID if available (biomedical)
  2. Use ISBN for books
  3. Use URL to stable source
  4. Include full publication details

Example:

@article{OldPaper1995,
  author  = {Author, Name},
  title   = {Title Here},
  journal = {Journal Name},
  year    = {1995},
  volume  = {123},
  pages   = {45--67},
  url     = {https://stable-url-here},
  note    = {PMID: 12345678}
}

Conference Papers vs Journal Articles

Issue: Same work published in both.

Best practice:

  • Cite journal version if both available
  • Journal version is archival
  • Conference version for timeliness

If citing conference:

@inproceedings{Smith2024conf,
  author    = {Smith, John},
  title     = {Title},
  booktitle = {Proceedings of NeurIPS 2024},
  year      = {2024}
}

If citing journal:

@article{Smith2024journal,
  author  = {Smith, John},
  title   = {Title},
  journal = {Journal of Machine Learning Research},
  year    = {2024}
}

Book Chapters vs Edited Collections

Extract correctly:

  • Chapter: Use @incollection
  • Whole book: Use @book
  • Book editor: List in editor field
  • Chapter author: List in author field

Datasets and Software

Use @misc with appropriate fields:

@misc{DatasetName2024,
  author       = {Author, Name},
  title        = {Dataset Title},
  year         = {2024},
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.123456},
  note         = {Version 1.2}
}

Validation After Extraction

Always validate extracted metadata:

python scripts/validate_citations.py extracted_refs.bib

Check:

  • All required fields present
  • DOI resolves correctly
  • Author names formatted consistently
  • Year is reasonable (4 digits)
  • Journal/publisher names correct
  • Page ranges use -- not -
  • Special characters handled properly
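Several of these checks can be automated without network access. The sketch below is not the implementation of `validate_citations.py`; the function name `validate_entry` and the message wording are assumptions, the required-field table restates the lists given earlier in this guide, and the DOI check is syntactic only (it does not confirm that the DOI resolves).

```python
import re

# Required fields per entry type, as listed earlier in this guide.
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": ["title", "publisher", "year"],  # plus author or editor, checked below
    "inproceedings": ["author", "title", "booktitle", "year"],
    "incollection": ["author", "title", "booktitle", "publisher", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "misc": ["author", "title", "year"],
}


def validate_entry(entry_type: str, fields: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name in REQUIRED_FIELDS.get(entry_type, []):
        value = fields.get(name)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {name}")
    if entry_type == "book" and not (fields.get("author") or fields.get("editor")):
        problems.append("missing required field: author or editor")

    year = str(fields.get("year") or "").strip()
    if year and not re.fullmatch(r"\d{4}", year):
        problems.append(f"year is not 4 digits: {year}")

    pages = str(fields.get("pages") or "")
    if re.search(r"(?<!-)-(?!-)", pages):  # a lone hyphen, not part of '--'
        problems.append("page range should use -- instead of -")

    doi = str(fields.get("doi") or "").strip()
    if doi and not re.match(r"^10\.\d{4,9}/\S+$", doi):
        problems.append(f"DOI does not look valid: {doi}")
    return problems
```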

Best Practices

1. Prefer DOI When Available

DOIs provide:

  • Permanent identifier
  • Best metadata source
  • Publisher-verified information
  • Resolvable link

2. Verify Automatically Extracted Metadata

Spot-check:

  • Author names match publication
  • Title matches (including capitalization)
  • Year is correct
  • Journal name is complete

3. Handle Special Characters

LaTeX special characters:

  • Protect capitalization: {AlphaFold}
  • Handle accents: M{\"u}ller or use Unicode
  • Chemical formulas: H$_2$O or \ce{H2O}
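The first two conventions can be applied mechanically. The sketch below is illustrative only: `latex_accents` covers a small subset of accents, `protect_capitals` is a heuristic that braces all-caps words and words with interior capitals, and neither function is idempotent (running `protect_capitals` on an already braced title would add a second pair of braces). Both names are hypothetical.

```python
import re
import unicodedata

# Combining marks mapped to LaTeX accent commands (a small illustrative subset).
ACCENT_COMMANDS = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis / umlaut
}

# All-caps words (DNA) or words with a capital after a lowercase letter (AlphaFold, mRNA).
CAPS_WORD = re.compile(r"\b(?:[A-Z]{2,}|\w*[a-z]\w*[A-Z]\w*)\b")


def latex_accents(text: str) -> str:
    """Replace supported accented characters with LaTeX commands, e.g. ü -> {\\"u}."""
    pieces = []
    for char in text:
        decomposed = unicodedata.normalize("NFD", char)
        if len(decomposed) == 2 and decomposed[1] in ACCENT_COMMANDS:
            pieces.append("{\\" + ACCENT_COMMANDS[decomposed[1]] + decomposed[0] + "}")
        else:
            pieces.append(char)
    return "".join(pieces)


def protect_capitals(title: str) -> str:
    """Wrap acronyms and mixed-case names in braces so BibTeX styles keep their case."""
    return CAPS_WORD.sub(lambda match: "{" + match.group(0) + "}", title)
```

Ordinary capitalized words such as "Protein" are left alone, since only the title style should decide their case.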

4. Use Consistent Citation Keys

Convention: FirstAuthorYEARkeyword

Smith2024protein
Doe2023machine
Johnson2024cancer
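Keys in this style can be generated from the first author's family name, the year, and the first significant title word. The function name `make_citation_key` and the stop-word list below are assumptions for this sketch; a real project would also need to handle key collisions.

```python
import re
import unicodedata

# Words skipped when picking the keyword (illustrative, not exhaustive).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "to", "and", "with"}


def make_citation_key(family_name: str, year, title: str) -> str:
    """Build a FirstAuthorYEARkeyword key using ASCII letters and digits only."""
    # Strip accents (Müller -> Muller), then drop anything that is not a letter.
    ascii_name = unicodedata.normalize("NFKD", family_name).encode("ascii", "ignore").decode()
    family = re.sub(r"[^A-Za-z]", "", ascii_name)
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((word for word in words if word not in STOPWORDS), "")
    return f"{family}{year}{keyword}"
```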

5. Include DOI for Modern Papers

Most journal articles published since ~2000 have a DOI; include it whenever one exists:

doi = {10.1038/nature12345}

6. Document Source

For non-standard sources, add note:

note = {Preprint, not peer-reviewed}
note = {Technical report}
note = {Dataset accompanying [citation]}

Summary

Metadata extraction workflow:

  1. Identify: Determine identifier type (DOI, PMID, arXiv, URL)
  2. Query: Use appropriate API (CrossRef, PubMed, arXiv)
  3. Extract: Parse response for required fields
  4. Format: Create properly formatted BibTeX entry
  5. Validate: Check completeness and accuracy
  6. Verify: Spot-check critical citations

Use scripts to automate:

  • extract_metadata.py: Universal extractor
  • doi_to_bibtex.py: Quick DOI conversion
  • validate_citations.py: Verify accuracy

Always validate extracted metadata before final submission!