zhongwei/gh-k-dense-ai-claude-scientific-writer-claude-scientific-writer

Zhongwei Li 74bee324ab Initial commit

2025-11-30 08:30:18 +08:00

Metadata Extraction Guide

Comprehensive guide to extracting accurate citation metadata from DOIs, PMIDs, arXiv IDs, and URLs using various APIs and services.

Overview

Accurate metadata is essential for proper citations. This guide covers:

Identifying paper identifiers (DOI, PMID, arXiv ID)
Querying metadata APIs (CrossRef, PubMed, arXiv, DataCite)
Required BibTeX fields by entry type
Handling edge cases and special situations
Validating extracted metadata

Paper Identifiers

DOI (Digital Object Identifier)

Format: 10.XXXX/suffix

Examples:

10.1038/s41586-021-03819-2    # Nature article
10.1126/science.aam9317       # Science article
10.1016/j.cell.2023.01.001    # Cell article
10.1371/journal.pone.0123456  # PLOS ONE article

Properties:

Permanent identifier
Most reliable for metadata
Resolves to current location
Publisher-assigned

Where to find:

First page of article
Article webpage
CrossRef, Google Scholar, PubMed
Usually prominent on publisher site

PMID (PubMed ID)

Format: integer of 1 to 8 digits (recent records are typically 8 digits)

Examples:

34265844
28445112
35476778

Properties:

Specific to PubMed database
Biomedical literature only
Assigned by NCBI
Permanent identifier

Where to find:

PubMed search results
Article page on PubMed
Often in article PDF footer
PMC (PubMed Central) pages

PMCID (PubMed Central ID)

Format: PMC followed by numbers

Examples:

PMC8287551
PMC7456789

Properties:

Free full-text articles in PMC
Subset of PubMed articles
Open access or author manuscripts

arXiv ID

Format: YYMM.NNNNN (YYMM.NNNN for IDs issued April 2007 through 2014) or archive/YYMMNNN (before April 2007)

Examples:

2103.14030        # New format (since 2007)
2401.12345        # 2024 submission
arXiv:hep-th/9901001  # Old format

Properties:

Preprints (not peer-reviewed)
Physics, math, CS, q-bio, etc.
Version tracking (v1, v2, etc.)
Free, open access

Where to find:

arXiv.org
Often cited before publication
Paper PDF header

Other Identifiers

ISBN (Books):

978-0-12-345678-9
0-123-45678-9

arXiv category:

cs.LG    # Computer Science - Machine Learning
q-bio.QM # Quantitative Biology - Quantitative Methods
math.ST  # Mathematics - Statistics
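
The identifier formats above can be told apart with a few regular expressions. The following is a minimal sketch, not the logic of any script in this repository: `classify_identifier` and the patterns are hypothetical and deliberately loose (they check shape only, not whether the identifier exists).

```python
import re

# Loose shape checks; each pattern is an approximation chosen for this sketch.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
ARXIV_NEW_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
ARXIV_OLD_RE = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")
PMID_RE = re.compile(r"^\d{1,8}$")


def classify_identifier(raw: str) -> str:
    """Return 'doi', 'pmcid', 'arxiv', 'pmid', or 'unknown'."""
    text = raw.strip()
    # Accept the optional "arXiv:" prefix used in citations.
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
    if DOI_RE.match(text):
        return "doi"
    if PMCID_RE.match(text):
        return "pmcid"
    if ARXIV_NEW_RE.match(text) or ARXIV_OLD_RE.match(text):
        return "arxiv"
    # Checked last: a bare number is the least specific shape.
    if PMID_RE.match(text):
        return "pmid"
    return "unknown"
```

A bare number is treated as a PMID only after the more specific patterns fail, which keeps the check order unambiguous.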

Metadata APIs

CrossRef API

Primary source for DOIs - Most comprehensive metadata for journal articles.

Base URL: https://api.crossref.org/works/

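No API key required, but the polite pool is recommended:

Add a contact email to the User-Agent header (or as a mailto query parameter)
Requests are routed to a more reliable pool of servers
Rate limits still apply; check the X-Rate-Limit-Limit and X-Rate-Limit-Interval response headers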
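
A polite-pool request can be prepared with the standard library alone. This sketch only builds the request object and does not send it; the tool name `metadata-extractor/0.1`, the function name, and the example email are placeholders, not values taken from this repository.

```python
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def build_crossref_request(doi: str, contact_email: str) -> urllib.request.Request:
    """Build (but do not send) a CrossRef lookup that identifies the caller."""
    # Keep "/" literal so the DOI prefix and suffix stay readable in the path.
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    user_agent = f"metadata-extractor/0.1 (mailto:{contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_crossref_request("10.1038/s41586-021-03819-2", "you@example.org")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the lookup.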

Basic DOI Lookup

Request:

GET https://api.crossref.org/works/10.1038/s41586-021-03819-2

Response (simplified):

{
  "message": {
    "DOI": "10.1038/s41586-021-03819-2",
    "title": ["Article title here"],
    "author": [
      {"given": "John", "family": "Smith"},
      {"given": "Jane", "family": "Doe"}
    ],
    "container-title": ["Nature"],
    "volume": "595",
    "issue": "7865",
    "page": "123-128",
    "published-print": {"date-parts": [[2021, 7, 1]]},
    "publisher": "Springer Nature",
    "type": "journal-article",
    "ISSN": ["0028-0836"]
  }
}
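
One way to reduce a response of this shape to BibTeX-ready fields is sketched below. `crossref_to_fields` is a hypothetical helper written for illustration; it assumes the `message` object has already been pulled out of the decoded JSON.

```python
import re


def first(values, default=""):
    """Return the first item of a list, or a default when the list is empty."""
    return values[0] if values else default


def crossref_to_fields(message: dict) -> dict:
    authors = []
    for person in message.get("author", []):
        family = person.get("family", "")
        given = person.get("given", "")
        if family and given:
            authors.append(f"{family}, {given}")
        else:
            # Organisational authors may carry "name" instead of given/family.
            authors.append(family or person.get("name", ""))
    date = message.get("published-print") or message.get("published-online") or {}
    date_parts = first(date.get("date-parts", []), [])
    return {
        "doi": message.get("DOI", ""),
        "title": first(message.get("title", [])),
        "author": " and ".join(authors),
        "journal": first(message.get("container-title", [])),
        "year": str(date_parts[0]) if date_parts else "",
        "volume": message.get("volume", ""),
        "number": message.get("issue", ""),
        # BibTeX page ranges use a double hyphen.
        "pages": re.sub(r"-+", "--", message.get("page", "")),
    }
```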

Fields Available

Always present:

DOI: Digital Object Identifier
title: Article title (array; occasionally empty, so check before indexing)
type: Content type (journal-article, book-chapter, etc.)

Usually present:

author: Array of author objects
container-title: Journal/book title
published-print or published-online: Publication date
volume, issue, page: Publication details
publisher: Publisher name

Sometimes present:

abstract: Article abstract
subject: Subject categories
ISSN: Journal ISSN
ISBN: Book ISBN
reference: Reference list
is-referenced-by-count: Citation count

Content Types

CrossRef type field values:

journal-article: Journal articles
book-chapter: Book chapters
book: Books
proceedings-article: Conference papers
posted-content: Preprints
dataset: Research datasets
report: Technical reports
dissertation: Theses/dissertations
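
These type values map fairly directly onto BibTeX entry types. The table below is one plausible mapping, not an official one: `@techreport` is a standard BibTeX type that this guide does not otherwise cover, and falling back to `@misc` for unlisted types is an assumption of this sketch.

```python
# Hypothetical mapping from CrossRef "type" values to BibTeX entry types.
CROSSREF_TO_BIBTEX = {
    "journal-article": "article",
    "book-chapter": "incollection",
    "book": "book",
    "proceedings-article": "inproceedings",
    "posted-content": "misc",   # preprints
    "dataset": "misc",
    "report": "techreport",
    "dissertation": "phdthesis",
}


def bibtex_entry_type(crossref_type: str) -> str:
    """Pick a BibTeX entry type, defaulting to misc for anything unlisted."""
    return CROSSREF_TO_BIBTEX.get(crossref_type, "misc")
```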

PubMed E-utilities API

Specialized for biomedical literature - Curated metadata with MeSH terms.

Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/

API key recommended (free):

Higher rate limits (10 requests per second with a key, 3 without)
Better performance

PMID to Metadata

Step 1: EFetch for full record

GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
  db=pubmed&
  id=34265844&
  retmode=xml&
  api_key=YOUR_KEY

Response: XML with comprehensive metadata
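
The request URL above can be assembled with `urllib.parse.urlencode`, which also handles several PMIDs in one call (EFetch accepts a comma-separated `id` list). `efetch_url` is a hypothetical helper name used for this sketch.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids, api_key=None) -> str:
    """Build an EFetch URL for one or more PMIDs; api_key is optional."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": "xml",
    }
    if api_key:
        params["api_key"] = api_key
    return f"{EFETCH}?{urlencode(params)}"


print(efetch_url([34265844]))
```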

Step 2: Parse XML

Key fields:

<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <ArticleTitle>Title here</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
      <Journal>
        <Title>Nature</Title>
        <JournalIssue>
          <Volume>595</Volume>
          <Issue>7865</Issue>
          <PubDate><Year>2021</Year></PubDate>
        </JournalIssue>
      </Journal>
      <Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
      <Abstract><AbstractText>Abstract text here</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
      <ArticleId IdType="pmc">PMC8287551</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
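
A record with this layout can be read with `xml.etree.ElementTree`. The sketch below parses a trimmed copy of the example; `parse_pubmed_article` is a hypothetical helper. A real EFetch response wraps records in a `PubmedArticleSet` root, so in practice one would loop over `root.iter("PubmedArticle")`.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<PubmedArticle>
  <MedlineCitation>
    <PMID>34265844</PMID>
    <Article>
      <ArticleTitle>Title here</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
      </AuthorList>
      <Journal>
        <Title>Nature</Title>
        <JournalIssue>
          <Volume>595</Volume><Issue>7865</Issue>
          <PubDate><Year>2021</Year></PubDate>
        </JournalIssue>
      </Journal>
      <Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>"""


def parse_pubmed_article(article: ET.Element) -> dict:
    def text(path: str) -> str:
        node = article.find(path)
        # itertext() keeps text inside inline markup such as <i> or <sub>.
        return "".join(node.itertext()).strip() if node is not None else ""

    authors = []
    for author in article.findall("MedlineCitation/Article/AuthorList/Author"):
        last = author.findtext("LastName", "")
        fore = author.findtext("ForeName", "")
        if last:  # collective authors have no LastName and are skipped here
            authors.append(f"{last}, {fore}".rstrip(", "))
    # Restrict to PubmedData so IDs of cited references are not picked up.
    ids = {
        node.get("IdType"): (node.text or "").strip()
        for node in article.findall("PubmedData/ArticleIdList/ArticleId")
    }
    base = "MedlineCitation/Article/"
    return {
        "pmid": text("MedlineCitation/PMID"),
        "title": text(base + "ArticleTitle"),
        "authors": authors,
        "journal": text(base + "Journal/Title"),
        "volume": text(base + "Journal/JournalIssue/Volume"),
        "number": text(base + "Journal/JournalIssue/Issue"),
        "year": text(base + "Journal/JournalIssue/PubDate/Year"),
        "pages": text(base + "Pagination/MedlinePgn"),
        "doi": ids.get("doi", ""),
    }


record = parse_pubmed_article(ET.fromstring(SAMPLE_XML))
```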

Unique PubMed Fields

MeSH Terms: Controlled vocabulary

<MeshHeadingList>
  <MeshHeading>
    <DescriptorName UI="D003920">Diabetes Mellitus</DescriptorName>
  </MeshHeading>
</MeshHeadingList>

Publication Types:

<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D016449">Randomized Controlled Trial</PublicationType>
</PublicationTypeList>

Grant Information:

<GrantList>
  <Grant>
    <GrantID>R01-123456</GrantID>
    <Agency>NIAID NIH HHS</Agency>
    <Country>United States</Country>
  </Grant>
</GrantList>

arXiv API

Preprints in physics, math, CS, q-bio - Free, open access.

Base URL: http://export.arxiv.org/api/query

No API key required

arXiv ID to Metadata

Request:

GET http://export.arxiv.org/api/query?id_list=2103.14030

Response: Atom XML (simplified; the values shown are illustrative and may not match the live record for this ID)

<entry>
  <id>http://arxiv.org/abs/2103.14030v2</id>
  <title>Highly accurate protein structure prediction with AlphaFold</title>
  <author><name>John Jumper</name></author>
  <author><name>Richard Evans</name></author>
  <published>2021-03-26T17:47:17Z</published>
  <updated>2021-07-01T16:51:46Z</updated>
  <summary>Abstract text here...</summary>
  <arxiv:doi>10.1038/s41586-021-03819-2</arxiv:doi>
  <category term="q-bio.BM" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>

Key Fields

id: arXiv URL
title: Preprint title
author: Author list
published: First version date
updated: Latest version date
summary: Abstract
arxiv:doi: DOI if published
arxiv:journal_ref: Journal reference if published
category: arXiv categories
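
The fields listed above live in two XML namespaces (Atom and arXiv), which `ElementTree` needs spelled out. The sketch below parses a hand-written placeholder feed, not a real API response; `parse_arxiv_entry` and all sample values are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Placeholder feed written for this example only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2401.12345v2</id>
    <title>Example
      preprint title</title>
    <author><name>Jane Doe</name></author>
    <author><name>John Smith</name></author>
    <published>2024-01-22T10:00:00Z</published>
    <updated>2024-03-01T09:30:00Z</updated>
    <arxiv:doi>10.1234/example.5678</arxiv:doi>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""


def parse_arxiv_entry(entry: ET.Element) -> dict:
    raw_id = entry.findtext("atom:id", "", NS).rsplit("/abs/", 1)[-1]
    # Split "2401.12345v2" into the bare ID and the version number.
    match = re.fullmatch(r"(.+?)(?:v(\d+))?", raw_id)
    version = match.group(2)
    return {
        "arxiv_id": match.group(1),
        "version": int(version) if version else None,
        # Titles are often wrapped across lines; collapse the whitespace.
        "title": " ".join(entry.findtext("atom:title", "", NS).split()),
        "authors": [a.findtext("atom:name", "", NS)
                    for a in entry.findall("atom:author", NS)],
        "year": entry.findtext("atom:published", "", NS)[:4],
        "doi": entry.findtext("arxiv:doi", None, NS),  # None if unpublished
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
    }


feed = ET.fromstring(SAMPLE_FEED)
info = parse_arxiv_entry(feed.find("atom:entry", NS))
```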

Version Tracking

arXiv tracks versions:

v1: Initial submission
v2, v3, etc.: Revisions

Always check whether the preprint has since been published in a journal (use the DOI if available).

DataCite API

Research datasets, software, other outputs - Assigns DOIs to non-traditional scholarly works.

Base URL: https://api.datacite.org/dois/

Similar to CrossRef but for datasets, software, code, etc.

Request:

GET https://api.datacite.org/dois/10.5281/zenodo.1234567

Response: JSON with metadata for dataset/software
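
As a rough illustration, the DataCite response can be mapped to `@misc` fields as below. The key names (`data.attributes.creators`, `titles`, `publicationYear`, `publisher`, `version`) follow the DataCite JSON layout as understood here and should be checked against a live response; `datacite_to_fields` and the sample payload are hypothetical.

```python
def datacite_to_fields(payload: dict) -> dict:
    attrs = payload["data"]["attributes"]
    creators = [c.get("name", "") for c in attrs.get("creators", [])]
    titles = attrs.get("titles", [])
    publisher = attrs.get("publisher", "")
    if isinstance(publisher, dict):  # some responses give an object here
        publisher = publisher.get("name", "")
    return {
        "author": " and ".join(creators),
        "title": titles[0]["title"] if titles else "",
        "year": str(attrs.get("publicationYear", "")),
        "howpublished": publisher,
        "doi": attrs.get("doi", ""),
        "note": f"Version {attrs['version']}" if attrs.get("version") else "",
    }


# Hand-written payload for illustration, not a real record.
sample = {"data": {"attributes": {
    "doi": "10.5281/zenodo.1234567",
    "creators": [{"name": "Author, Name"}],
    "titles": [{"title": "Dataset Title"}],
    "publisher": "Zenodo",
    "publicationYear": 2024,
    "version": "1.2",
}}}
fields = datacite_to_fields(sample)
```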

Required BibTeX Fields

@article (Journal Articles)

Required:

author: Author names
title: Article title
journal: Journal name
year: Publication year

Optional but recommended:

volume: Volume number
number: Issue number
pages: Page range (e.g., 123--145)
doi: Digital Object Identifier
url: URL if no DOI
month: Publication month

Example:

@article{Smith2024,
  author  = {Smith, John and Doe, Jane},
  title   = {Novel Approach to Protein Folding},
  journal = {Nature},
  year    = {2024},
  volume  = {625},
  number  = {8001},
  pages   = {123--145},
  doi     = {10.1038/nature12345}
}
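
An entry in this layout can be generated from a plain dictionary. `format_bibtex` is a hypothetical helper for this sketch; it drops empty fields and aligns the equals signs the way the examples in this guide do.

```python
def format_bibtex(entry_type: str, key: str, fields: dict) -> str:
    """Render one BibTeX entry, skipping fields with empty values."""
    present = {name: value for name, value in fields.items() if value}
    width = max((len(name) for name in present), default=0)
    body = [f"  {name.ljust(width)} = {{{value}}}" for name, value in present.items()]
    return "\n".join([f"@{entry_type}{{{key},", ",\n".join(body), "}"])


print(format_bibtex("article", "Smith2024", {
    "author": "Smith, John and Doe, Jane",
    "title": "Novel Approach to Protein Folding",
    "journal": "Nature",
    "year": "2024",
    "pages": "123--145",
    "month": "",  # empty, so it is left out of the entry
}))
```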

@book (Books)

Required:

author or editor: Author(s) or editor(s)
title: Book title
publisher: Publisher name
year: Publication year

Optional but recommended:

edition: Edition number (if not first)
address: Publisher location
isbn: ISBN
url: URL
series: Series name

Example:

@book{Kumar2021,
  author    = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
  title     = {Robbins and Cotran Pathologic Basis of Disease},
  publisher = {Elsevier},
  year      = {2021},
  edition   = {10},
  isbn      = {978-0-323-53113-9}
}

@inproceedings (Conference Papers)

Required:

author: Author names
title: Paper title
booktitle: Conference/proceedings name
year: Year

Optional but recommended:

pages: Page range
organization: Organizing body
publisher: Publisher
address: Conference location
month: Conference month
doi: DOI if available

Example:

@inproceedings{Vaswani2017,
  author    = {Vaswani, Ashish and Shazeer, Noam and others},
  title     = {Attention is All You Need},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2017},
  pages     = {5998--6008},
  volume    = {30}
}

@incollection (Book Chapters)

Required:

author: Chapter author(s)
title: Chapter title
booktitle: Book title
publisher: Publisher name
year: Publication year

Optional but recommended:

editor: Book editor(s)
pages: Chapter page range
chapter: Chapter number
edition: Edition
address: Publisher location

Example:

@incollection{Brown2020,
  author    = {Brown, Peter O. and Botstein, David},
  title     = {Exploring the New World of the Genome with {DNA} Microarrays},
  booktitle = {DNA Microarrays: A Molecular Cloning Manual},
  editor    = {Eisen, Michael B. and Brown, Patrick O.},
  publisher = {Cold Spring Harbor Laboratory Press},
  year      = {2020},
  pages     = {1--45}
}

@phdthesis (Dissertations)

Required:

author: Author name
title: Thesis title
school: Institution
year: Year

Optional:

type: Type (e.g., "PhD dissertation")
address: Institution location
month: Month
url: URL

Example:

@phdthesis{Johnson2023,
  author = {Johnson, Mary L.},
  title  = {Novel Approaches to Cancer Immunotherapy},
  school = {Stanford University},
  year   = {2023},
  type   = {{PhD} dissertation}
}

@misc (Preprints, Software, Datasets)

Required:

author: Author(s)
title: Title
year: Year

For preprints, add:

howpublished: Repository (e.g., "bioRxiv")
doi: Preprint DOI
note: Preprint ID

Example (preprint):

@misc{Zhang2024,
  author       = {Zhang, Yi and Chen, Li and Wang, Hui},
  title        = {Novel Therapeutic Targets in Alzheimer's Disease},
  year         = {2024},
  howpublished = {bioRxiv},
  doi          = {10.1101/2024.01.001},
  note         = {Preprint}
}

Example (software):

@misc{AlphaFold2021,
  author       = {DeepMind},
  title        = {{AlphaFold} Protein Structure Database},
  year         = {2021},
  howpublished = {Software},
  url          = {https://alphafold.ebi.ac.uk/},
  doi          = {10.5281/zenodo.5123456}
}

Extraction Workflows

From DOI

Best practice - Most reliable source:

# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2

# Multiple DOIs
python scripts/extract_metadata.py \
  --doi 10.1038/nature12345 \
  --doi 10.1126/science.abc1234 \
  --output refs.bib

Process:

Query CrossRef API with DOI
Parse JSON response
Extract required fields
Determine entry type (@article, @book, etc.)
Format as BibTeX
Validate completeness

From PMID

For biomedical literature:

# Single PMID
python scripts/extract_metadata.py --pmid 34265844

# Multiple PMIDs
python scripts/extract_metadata.py \
  --pmid 34265844 \
  --pmid 28445112 \
  --output refs.bib

Process:

Query PubMed EFetch with PMID
Parse XML response
Extract metadata including MeSH terms
Check for DOI in response
If DOI exists, optionally query CrossRef for additional metadata
Format as BibTeX

From arXiv ID

For preprints:

python scripts/extract_metadata.py --arxiv 2103.14030

Process:

Query arXiv API with ID
Parse Atom XML response
Check for published version (DOI in response)
If published: Use DOI and CrossRef
If not published: Use preprint metadata
Format as @misc with preprint note

Important: Always check if preprint has been published!
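
The published-versus-preprint decision in this process comes down to whether the arXiv record carries a DOI. A minimal sketch, assuming the record has already been parsed into a dict with hypothetical keys `arxiv_id` and `doi`:

```python
def plan_arxiv_citation(record: dict) -> dict:
    """Decide which API to query next and which entry type to emit."""
    doi = record.get("doi")
    if doi:
        # Published: cite the journal version via CrossRef.
        return {"entry_type": "article", "lookup": "crossref", "identifier": doi}
    # Not published: keep the preprint metadata and mark it as such.
    return {
        "entry_type": "misc",
        "lookup": "arxiv",
        "identifier": record["arxiv_id"],
        "note": "Preprint",
    }
```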

From URL

When you only have a URL:

python scripts/extract_metadata.py \
  --url "https://www.nature.com/articles/s41586-021-03819-2"

Process:

Parse URL into host and path
Identify type (DOI, PMID, arXiv) from the host and path
Extract identifier from URL
Query appropriate API
Format as BibTeX

URL patterns:

# DOI URLs
https://doi.org/10.1038/nature12345
https://dx.doi.org/10.1126/science.abc123
https://www.nature.com/articles/s41586-021-03819-2

# PubMed URLs
https://pubmed.ncbi.nlm.nih.gov/34265844/
https://www.ncbi.nlm.nih.gov/pubmed/34265844

# arXiv URLs
https://arxiv.org/abs/2103.14030
https://arxiv.org/pdf/2103.14030.pdf
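
Most of these patterns can be recognised from the host and path alone. The sketch below is one possible approach, with `identifier_from_url` as a hypothetical helper. Publisher pages such as the nature.com example carry only part of the DOI, so this sketch reports them as unknown rather than guessing the prefix.

```python
import re
from urllib.parse import unquote, urlparse


def identifier_from_url(url: str) -> tuple:
    """Return (kind, identifier); kind is 'doi', 'pmid', 'arxiv', or 'unknown'."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path
    if host in ("doi.org", "dx.doi.org"):
        return ("doi", unquote(path.lstrip("/")))
    is_pubmed = host == "pubmed.ncbi.nlm.nih.gov" or (
        host.endswith("ncbi.nlm.nih.gov") and path.startswith("/pubmed/")
    )
    if is_pubmed:
        match = re.search(r"\d+", path)
        if match:
            return ("pmid", match.group(0))
    if host.endswith("arxiv.org"):
        # Covers /abs/<id> and /pdf/<id>, with or without a .pdf suffix.
        match = re.match(r"^/(?:abs|pdf)/(.+?)(?:\.pdf)?$", path)
        if match:
            return ("arxiv", match.group(1))
    return ("unknown", url)
```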

Batch Processing

From file with mixed identifiers:

# Create file with one identifier per line
# identifiers.txt:
#   10.1038/nature12345
#   34265844
#   2103.14030
#   https://doi.org/10.1126/science.abc123

python scripts/extract_metadata.py \
  --input identifiers.txt \
  --output references.bib

Process:

Script auto-detects identifier type
Queries appropriate API
Combines all into single BibTeX file
Handles errors gracefully
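
"Handles errors gracefully" can mean collecting failures per line instead of aborting the whole batch. The loop below sketches that idea and is not the behaviour of `extract_metadata.py`; `process_batch` is hypothetical, and `resolve` stands for whatever function turns one identifier into a BibTeX string.

```python
def process_batch(lines, resolve):
    """Resolve each identifier; return (entries, errors) without stopping early."""
    entries, errors = [], []
    for line_no, raw in enumerate(lines, start=1):
        identifier = raw.strip()
        if not identifier or identifier.startswith("#"):
            continue  # skip blank lines and comments
        try:
            entries.append(resolve(identifier))
        except Exception as exc:  # keep going; report the failure afterwards
            errors.append((line_no, identifier, str(exc)))
    return entries, errors
```

The returned error list keeps the line number, so a failed lookup can be traced back to the input file.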

Special Cases and Edge Cases

Preprints Later Published

Issue: Preprint cited, but journal version now available.

Solution:

Check arXiv metadata for DOI field
If DOI present, use published version
Update citation to journal article
Note preprint version in comments if needed

Example:

% Originally: arXiv:2103.14030
% Published as:
@article{Jumper2021,
  author  = {Jumper, John and Evans, Richard and others},
  title   = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
  journal = {Nature},
  year    = {2021},
  volume  = {596},
  pages   = {583--589},
  doi     = {10.1038/s41586-021-03819-2}
}

Multiple Authors (et al.)

Issue: Many authors (10+).

BibTeX practice:

Include all authors if <10
Use "and others" for 10+
Or list all (journals vary)

Example:

@article{LargeCollaboration2024,
  author = {First, Author and Second, Author and Third, Author and others},
  ...
}
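
The rule above (list everyone below ten authors, otherwise truncate with "and others") is straightforward to apply mechanically. In this sketch the function name and the default of keeping three names are assumptions; the threshold of ten follows the text.

```python
def bibtex_author_field(authors, max_listed=10, keep=3):
    """Join authors with ' and '; truncate long lists with 'and others'."""
    if len(authors) < max_listed:
        return " and ".join(authors)
    return " and ".join(list(authors[:keep]) + ["others"])
```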

Author Name Variations

Issue: Authors publish under different name formats.

Standardization:

# Common variations
John Smith
John A. Smith
John Andrew Smith
J. A. Smith
Smith, J.
Smith, J. A.

# BibTeX format (recommended)
author = {Smith, John A.}

Extraction preference:

Use full name if available
Include middle initial if available
Format: Last, First Middle
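
A simple normaliser for the variations listed above might look like this. It is a sketch with a known limitation: for names given as "First Last" it treats the final word as the family name, so multi-word surnames (for example "van der Berg") need manual handling. `to_bibtex_name` is a hypothetical helper.

```python
def to_bibtex_name(name: str) -> str:
    """Convert 'John A. Smith' or 'Smith, J. A.' to 'Last, First Middle'."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        last, _, rest = name.partition(",")
        return f"{last.strip()}, {rest.strip()}"
    parts = name.split(" ")
    if len(parts) == 1:
        return parts[0]
    # Assumption: the final word is the family name.
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```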

No DOI Available

Issue: Older papers or books without DOIs.

Solutions:

Use PMID if available (biomedical)
Use ISBN for books
Use URL to stable source
Include full publication details

Example:

@article{OldPaper1995,
  author  = {Author, Name},
  title   = {Title Here},
  journal = {Journal Name},
  year    = {1995},
  volume  = {123},
  pages   = {45--67},
  url     = {https://stable-url-here},
  note    = {PMID: 12345678}
}

Conference Papers vs Journal Articles

Issue: Same work published in both.

Best practice:

Cite journal version if both available
Journal version is archival
Conference version for timeliness

If citing conference:

@inproceedings{Smith2024conf,
  author    = {Smith, John},
  title     = {Title},
  booktitle = {Proceedings of NeurIPS 2024},
  year      = {2024}
}

If citing journal:

@article{Smith2024journal,
  author  = {Smith, John},
  title   = {Title},
  journal = {Journal of Machine Learning Research},
  year    = {2024}
}

Book Chapters vs Edited Collections

Extract correctly:

Chapter: Use @incollection
Whole book: Use @book
Book editor: List in editor field
Chapter author: List in author field

Datasets and Software

Use @misc with appropriate fields:

@misc{DatasetName2024,
  author       = {Author, Name},
  title        = {Dataset Title},
  year         = {2024},
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.123456},
  note         = {Version 1.2}
}

Validation After Extraction

Always validate extracted metadata:

python scripts/validate_citations.py extracted_refs.bib

Check:

All required fields present
DOI resolves correctly
Author names formatted consistently
Year is reasonable (4 digits)
Journal/publisher names correct
Page ranges use -- not -
Special characters handled properly
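
Several of these checks need no network access. The sketch below covers required fields, the year format, and page-range hyphens; it is not the implementation of `validate_citations.py`, and `validate_entry` is a hypothetical name. The required-field table follows the "Required BibTeX Fields" section of this guide.

```python
import re

REQUIRED = {
    "article": ("author", "title", "journal", "year"),
    "book": ("title", "publisher", "year"),  # plus author or editor, checked below
    "inproceedings": ("author", "title", "booktitle", "year"),
    "incollection": ("author", "title", "booktitle", "publisher", "year"),
    "phdthesis": ("author", "title", "school", "year"),
    "misc": ("author", "title", "year"),
}


def validate_entry(entry_type: str, fields: dict) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for name in REQUIRED.get(entry_type, ()):
        if not fields.get(name):
            problems.append(f"missing required field: {name}")
    if entry_type == "book" and not (fields.get("author") or fields.get("editor")):
        problems.append("missing required field: author or editor")
    year = fields.get("year", "")
    if year and not re.fullmatch(r"\d{4}", year):
        problems.append(f"year is not four digits: {year}")
    # A single hyphen between digits suggests an unconverted page range.
    if re.search(r"\d-\d", fields.get("pages", "")):
        problems.append("pages should use -- for ranges")
    return problems
```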

Best Practices

1. Prefer DOI When Available

DOIs provide:

Permanent identifier
Best metadata source
Publisher-verified information
Resolvable link

2. Verify Automatically Extracted Metadata

Spot-check:

Author names match publication
Title matches (including capitalization)
Year is correct
Journal name is complete

3. Handle Special Characters

LaTeX special characters:

Protect capitalization: {AlphaFold}
Handle accents: M{\"u}ller or use Unicode
Chemical formulas: H$_2$O or \ce{H2O} (the latter requires the mhchem package)
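
Two of these steps can be automated: escaping characters that LaTeX treats specially, and wrapping known proper nouns in braces. The helpers below are hypothetical, and the escape table is intentionally small. Escaping is meant for text fields such as title and journal, not for doi or url.

```python
import re

LATEX_ESCAPES = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_"}


def escape_latex(text: str) -> str:
    """Escape a few characters that would otherwise break a LaTeX build."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def protect_terms(title: str, terms) -> str:
    """Wrap each term in braces unless it is already wrapped."""
    for term in terms:
        pattern = r"(?<!\{)" + re.escape(term) + r"(?!\})"
        title = re.sub(pattern, lambda m: "{" + m.group(0) + "}", title)
    return title
```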

4. Use Consistent Citation Keys

Convention: FirstAuthorYEARkeyword

Smith2024protein
Doe2023machine
Johnson2024cancer
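
Keys in this convention can be generated from the first author, year, and title. In this sketch the keyword is simply the first title word that is not a stopword; that choice, the stopword list, and the name `make_citation_key` are assumptions made for illustration.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "the", "of", "on", "in", "to", "for", "and", "with"}


def make_citation_key(first_author: str, year: str, title: str) -> str:
    """Build FirstAuthorYEARkeyword from an author in 'Last, First' form."""
    family = first_author.split(",")[0]
    # Fold accents to ASCII so keys stay portable (Müller -> Muller).
    family = unicodedata.normalize("NFKD", family).encode("ascii", "ignore").decode("ascii")
    family = re.sub(r"[^A-Za-z]", "", family)
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((word for word in words if word not in STOPWORDS), "")
    return f"{family}{year}{keyword}"
```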

5. Include DOI for Modern Papers

Most journal articles published after ~2000 have a DOI; include it whenever one exists:

doi = {10.1038/nature12345}

6. Document Source

For non-standard sources, add note:

note = {Preprint, not peer-reviewed}
note = {Technical report}
note = {Dataset accompanying [citation]}

Summary

Metadata extraction workflow:

Identify: Determine identifier type (DOI, PMID, arXiv, URL)
Query: Use appropriate API (CrossRef, PubMed, arXiv, DataCite)
Extract: Parse response for required fields
Format: Create properly formatted BibTeX entry
Validate: Check completeness and accuracy
Verify: Spot-check critical citations

Use scripts to automate:

extract_metadata.py: Universal extractor
doi_to_bibtex.py: Quick DOI conversion
validate_citations.py: Verify accuracy

Always validate extracted metadata before final submission!
