gh-k-dense-ai-claude-scient…/skills/citation-management/references/citation_validation.md
2025-11-30 08:30:14 +08:00


Citation Validation Guide

Comprehensive guide to validating citation accuracy, completeness, and formatting in BibTeX files.

Overview

Citation validation ensures:

  • All citations are accurate and complete
  • DOIs resolve correctly
  • Required fields are present
  • No duplicate entries
  • Proper formatting and syntax
  • Links are accessible

Validation should be performed:

  • After extracting metadata
  • Before manuscript submission
  • After manual edits to BibTeX files
  • Periodically for maintained bibliographies

Validation Categories

1. DOI Verification

Purpose: Ensure DOIs are valid and resolve correctly.

What to Check

DOI format:

Valid:   10.1038/s41586-021-03819-2
Valid:   10.1126/science.aam9317
Invalid: 10.1038/invalid (well-formed, but not a registered DOI)
Invalid: doi:10.1038/... (should omit "doi:" prefix in BibTeX)
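
The format rules above can be sketched as a small shape check. This is a minimal illustration, not the logic of validate_citations.py: the regular expression is a common approximation of modern DOIs, and `normalize_doi` and `looks_like_doi` are hypothetical helper names. A shape check cannot tell whether a DOI is actually registered, so a value such as 10.1038/invalid still passes it and has to be caught by the resolution step.

```python
import re

# Approximate shape of a modern DOI: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.[0-9]{4,9}/\S+$")

# Prefixes that are often pasted in front of the bare DOI.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and one common prefix so only the bare DOI remains."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi


def looks_like_doi(raw: str) -> bool:
    """Return True if the normalized value has the shape of a DOI."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))
```

Storing the output of `normalize_doi` in the `doi` field keeps the "doi:" prefix out of the BibTeX file.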

DOI resolution:

  • DOI should resolve via https://doi.org/
  • Should redirect to actual article
  • Should not return 404 or error

Metadata consistency:

  • CrossRef metadata should match BibTeX
  • Author names should align
  • Title should match
  • Year should match

How to Validate

Manual check:

  1. Copy DOI from BibTeX
  2. Visit https://doi.org/ followed by the DOI (e.g., https://doi.org/10.1038/nature12345)
  3. Verify it redirects to correct article
  4. Check metadata matches

Automated check (recommended):

python scripts/validate_citations.py references.bib --check-dois

Process:

  1. Extract all DOIs from BibTeX file
  2. Query doi.org resolver for each
  3. Query CrossRef API for metadata
  4. Compare metadata with BibTeX entry
  5. Report discrepancies
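
Steps 3 to 5 of this process might look roughly like the sketch below. It assumes the public CrossRef REST endpoint `https://api.crossref.org/works/<DOI>` and its usual JSON layout (`title` as a list, `author` as a list of `given`/`family` objects, `issued.date-parts` for the date). The function names are hypothetical, and the real script may compare more fields or apply fuzzier matching.

```python
import json
import urllib.parse
import urllib.request


def fetch_crossref(doi: str) -> dict:
    """Fetch the CrossRef record for a DOI (network call)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["message"]


def compare_with_crossref(entry: dict, record: dict) -> list[str]:
    """Return discrepancies between a BibTeX entry (dict of strings) and a CrossRef record."""
    problems = []

    # CrossRef stores the title as a list; compare case- and whitespace-insensitively.
    cr_title = (record.get("title") or [""])[0]
    bib_title = entry.get("title", "")
    if " ".join(bib_title.lower().split()) != " ".join(cr_title.lower().split()):
        problems.append(f"title differs: CrossRef has {cr_title!r}")

    # The year is the first element of issued.date-parts.
    date_parts = record.get("issued", {}).get("date-parts") or [[None]]
    cr_year = date_parts[0][0] if date_parts[0] else None
    if cr_year is not None and str(cr_year) != entry.get("year", "").strip():
        problems.append(f"year differs: CrossRef has {cr_year}")

    # Compare only the first author's family name.
    authors = record.get("author") or []
    if authors:
        cr_family = authors[0].get("family", "")
        bib_family = entry.get("author", "").split(" and ")[0].split(",")[0].strip()
        if cr_family.lower() != bib_family.lower():
            problems.append(f"first author differs: CrossRef has {cr_family!r}")

    return problems
```

The comparison is deliberately strict on the year and loose on the title. Titles that contain protective braces or LaTeX commands would need to be cleaned before comparing.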

Common Issues

Broken DOIs:

  • Typos in DOI
  • Publisher changed DOI (rare)
  • Article retracted
  • Solution: Find correct DOI from publisher site

Mismatched metadata:

  • BibTeX has old/incorrect information
  • Solution: Re-extract metadata from CrossRef

Missing DOIs:

  • Older articles may not have DOIs
  • Acceptable for pre-2000 publications
  • Add URL or PMID instead

2. Required Fields

Purpose: Ensure all necessary information is present.

Required by Entry Type

@article:

author   % REQUIRED
title    % REQUIRED
journal  % REQUIRED
year     % REQUIRED
volume   % Highly recommended
pages    % Highly recommended
doi      % Highly recommended for modern papers

@book:

author OR editor  % REQUIRED (at least one)
title            % REQUIRED
publisher        % REQUIRED
year             % REQUIRED
isbn             % Recommended

@inproceedings:

author     % REQUIRED
title      % REQUIRED
booktitle  % REQUIRED (conference/proceedings name)
year       % REQUIRED
pages      % Recommended

@incollection (book chapter):

author     % REQUIRED
title      % REQUIRED (chapter title)
booktitle  % REQUIRED (book title)
publisher  % REQUIRED
year       % REQUIRED
editor     % Recommended
pages      % Recommended

@phdthesis:

author  % REQUIRED
title   % REQUIRED
school  % REQUIRED
year    % REQUIRED

@misc (preprints, datasets, etc.):

author  % REQUIRED
title   % REQUIRED
year    % REQUIRED
howpublished  % Recommended (bioRxiv, Zenodo, etc.)
doi OR url    % At least one required
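
The per-type rules above translate naturally into a lookup table. The following is a minimal sketch under the assumption that each entry has already been parsed into an entry type and a dict of fields. `REQUIRED_FIELDS` and `missing_required_fields` are illustrative names, not part of the script's documented interface.

```python
# Required fields per entry type, as listed in this guide.
# A tuple means "at least one of these must be present".
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": [("author", "editor"), "title", "publisher", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "incollection": ["author", "title", "booktitle", "publisher", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "misc": ["author", "title", "year", ("doi", "url")],
}


def missing_required_fields(entry_type: str, fields: dict) -> list[str]:
    """Return the required fields that are absent or empty for this entry."""
    present = {name.lower() for name, value in fields.items() if str(value).strip()}
    missing = []
    for requirement in REQUIRED_FIELDS.get(entry_type.lower(), []):
        if isinstance(requirement, tuple):
            # Alternative group: satisfied if any member is present.
            if not present.intersection(requirement):
                missing.append(" or ".join(requirement))
        elif requirement not in present:
            missing.append(requirement)
    return missing
```

Unknown entry types return an empty list here; a stricter checker could report them instead.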

Validation Script

python scripts/validate_citations.py references.bib --check-required-fields

Output:

Error: Entry 'Smith2024' missing required field 'journal'
Error: Entry 'Doe2023' missing required field 'year'
Warning: Entry 'Jones2022' missing recommended field 'volume'

3. Author Name Formatting

Purpose: Ensure consistent, correct author name formatting.

Proper Format

Recommended BibTeX format:

author = {Last1, First1 and Last2, First2 and Last3, First3}

Examples:

% Correct
author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}
author = {Smith, John and Doe, Jane}
author = {Smith, John and Doe, Jane and Johnson, Mary}

% For many authors
author = {Smith, John and Doe, Jane and others}

% Incorrect
author = {John Smith}  % First Last format (not recommended)
author = {Smith, J.; Doe, J.}  % Semicolon separator (wrong)
author = {Smith J, Doe J}  % Comma used as separator; no comma within names

Special Cases

Suffixes (Jr., III, etc.):

author = {King, Jr., Martin Luther}

Multiple surnames (hyphenated):

author = {Smith-Jones, Mary}

Van, von, de, etc.:

author = {van der Waals, Johannes}
author = {de Broglie, Louis}

Organizations as authors:

author = {{World Health Organization}}
% Double braces make BibTeX treat the name as a single author

Validation Checks

Automated validation:

python scripts/validate_citations.py references.bib --check-authors

Checks for:

  • Proper separator ("and", not "&", ";", or ",")
  • Comma placement
  • Empty author fields
  • Malformed names
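
A few of these checks can be approximated with simple string tests. The sketch below is an assumption about how such a check might work, not the script's actual logic, and `check_author_field` is a hypothetical name. It cannot reliably detect a comma used as an author separator (as in "Smith J, Doe J"), because that string has the same shape as a single "Last, First" name.

```python
import re


def check_author_field(author: str) -> list[str]:
    """Return formatting problems found in a BibTeX author field."""
    if not author.strip():
        return ["empty author field"]

    problems = []
    if ";" in author or "&" in author:
        problems.append("use 'and' between authors, not ';' or '&'")

    # Authors are separated by the word "and" surrounded by whitespace.
    for name in re.split(r"\s+and\s+", author.strip()):
        name = name.strip()
        if name == "others":
            continue  # "and others" is the BibTeX way to truncate a list
        if name.startswith("{") and name.endswith("}"):
            continue  # braced organization name, treated as one author
        if not name:
            problems.append("empty name between separators")
        elif "," not in name:
            problems.append(f"'{name}' is not in 'Last, First' form")
        elif name.count(",") > 2:
            problems.append(f"'{name}' has too many commas")
    return problems
```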

4. Data Consistency

Purpose: Ensure all fields contain valid, reasonable values.

Year Validation

Valid years:

year = {2024}    % Current/recent
year = {1953}    % Watson & Crick DNA structure (historical)
year = {1665}    % Hooke's Micrographia (very old)

Invalid years:

year = {24}      % Two digits (ambiguous)
year = {202}     % Typo
year = {2099}    % Too far in the future (current year + 1 is acceptable for accepted/in-press work)
year = {0}       % Obviously wrong

Check:

  • Four digits
  • Reasonable range (1600 to current year + 1)
  • Not all zeros
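
These three checks fit in a few lines. The sketch below assumes the year arrives as a string; `check_year` is a hypothetical helper, and the `current_year` parameter exists only so the upper bound can be fixed in tests.

```python
import datetime
import re
from typing import Optional


def check_year(value: str, current_year: Optional[int] = None) -> Optional[str]:
    """Return a problem description, or None if the year looks plausible."""
    if current_year is None:
        current_year = datetime.date.today().year

    text = value.strip()
    if not re.fullmatch(r"[0-9]{4}", text):
        return "year must be exactly four digits"

    year = int(text)
    # "0000" has four digits but falls outside the range, so it is rejected here too.
    if not 1600 <= year <= current_year + 1:
        return f"year {year} is outside 1600-{current_year + 1}"
    return None
```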

Volume/Number Validation

volume = {123}      % Numeric
volume = {12}       % Valid
number = {3}        % Valid
number = {S1}       % Supplement issue (valid)

Invalid:

volume = {Vol. 123}  % Should be just number
number = {Issue 3}   % Should be just number

Page Range Validation

Correct format:

pages = {123--145}    % En-dash (two hyphens)
pages = {e0123456}    % PLOS-style article ID
pages = {123}         % Single page

Incorrect format:

pages = {123-145}     % Single hyphen (use --)
pages = {pp. 123-145} % Remove "pp."
pages = {123–145}     % Unicode en-dash (may cause issues)
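
The incorrect forms above can usually be normalized mechanically. Here is a minimal sketch, with `normalize_pages` as a hypothetical helper name. It assumes that any hyphen or Unicode dash in the field separates two page tokens, which may not hold for article IDs that themselves contain hyphens.

```python
import re


def normalize_pages(pages: str) -> str:
    """Normalize a BibTeX pages value to the 'start--end' convention."""
    text = pages.strip()
    # Drop a leading "p." or "pp." label.
    text = re.sub(r"^pp?\.\s*", "", text, flags=re.IGNORECASE)
    # Replace hyphen runs, en-dashes (U+2013), and em-dashes (U+2014) with "--".
    text = re.sub(r"\s*(?:-+|\u2013|\u2014)\s*", "--", text)
    return text
```

Values without a dash, such as a single page or a PLOS-style article ID, pass through unchanged.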

URL Validation

Check:

  • URLs are accessible (return 200 status)
  • HTTPS when available
  • No obvious typos
  • Permanent links (not temporary)

Valid:

url = {https://www.nature.com/articles/nature12345}
url = {https://arxiv.org/abs/2103.14030}

Questionable:

url = {http://...}  % HTTP instead of HTTPS
url = {file:///...} % Local file path
url = {bit.ly/...}  % URL shortener (not permanent)
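
The "questionable" patterns can be flagged without any network access. The sketch below uses only `urllib.parse` from the standard library; `classify_url` is a hypothetical helper and the shortener list is illustrative, not exhaustive. It does not check accessibility (the 200 status), which would require an HTTP request.

```python
from urllib.parse import urlparse

# Illustrative list of URL-shortener hosts; extend as needed.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def classify_url(url: str) -> list[str]:
    """Return warnings for a URL field, based on its scheme and host only."""
    warnings = []
    parsed = urlparse(url.strip())

    if parsed.scheme == "file":
        warnings.append("local file path")
    elif parsed.scheme == "http":
        warnings.append("HTTP instead of HTTPS")
    elif parsed.scheme != "https":
        warnings.append("missing or unsupported scheme")

    # Without a scheme, urlparse puts the host at the start of the path.
    host = (parsed.hostname or parsed.path.split("/")[0]).lower()
    if host in SHORTENER_HOSTS:
        warnings.append("URL shortener (not permanent)")
    return warnings
```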

5. Duplicate Detection

Purpose: Find and remove duplicate entries.

Types of Duplicates

Exact duplicates (same DOI):

@article{Smith2024a,
  doi = {10.1038/nature12345},
  ...
}

@article{Smith2024b,
  doi = {10.1038/nature12345},  % Same DOI!
  ...
}

Near duplicates (similar title/authors):

@article{Smith2024,
  title = {Machine Learning for Drug Discovery},
  ...
}

@article{Smith2024method,
  title = {Machine learning for drug discovery},  % Same, different case
  ...
}

Preprint + Published:

@misc{Smith2023arxiv,
  title = {AlphaFold Results},
  howpublished = {arXiv},
  ...
}

@article{Smith2024,
  title = {AlphaFold Results},  % Same paper, now published
  journal = {Nature},
  ...
}
% Keep published version only

Detection Methods

By DOI (most reliable):

  • Same DOI = exact duplicate
  • Keep one, remove other

By title similarity:

  • Normalize: lowercase, remove punctuation
  • Calculate similarity (e.g., Levenshtein distance)
  • Flag if >90% similar

By author-year-title:

  • Same first author + year + similar title
  • Likely duplicate
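
The DOI and title methods can be combined in one pass over all pairs of entries. This sketch swaps Levenshtein distance for `difflib.SequenceMatcher` from the standard library, which gives a comparable 0 to 1 similarity ratio without extra dependencies. The `find_duplicates` name and the input layout (citation key mapped to a dict of fields) are assumptions for illustration.

```python
import string
from difflib import SequenceMatcher
from itertools import combinations


def normalize_title(title: str) -> str:
    """Lowercase, remove ASCII punctuation (including braces), collapse whitespace."""
    no_punct = title.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def find_duplicates(entries: dict, threshold: float = 0.9) -> list:
    """Return (key_a, key_b, reason) tuples for likely duplicate pairs."""
    found = []
    for (key_a, a), (key_b, b) in combinations(entries.items(), 2):
        # DOIs are case-insensitive, so compare them lowercased.
        doi_a = a.get("doi", "").strip().lower()
        doi_b = b.get("doi", "").strip().lower()
        if doi_a and doi_a == doi_b:
            found.append((key_a, key_b, "same_doi"))
            continue

        title_a = normalize_title(a.get("title", ""))
        title_b = normalize_title(b.get("title", ""))
        if title_a and title_b:
            ratio = SequenceMatcher(None, title_a, title_b).ratio()
            if ratio > threshold:
                found.append((key_a, key_b, "similar_title"))
    return found
```

Comparing every pair is quadratic in the number of entries, which is usually fine for a bibliography of a few hundred items.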

Automated detection:

python scripts/validate_citations.py references.bib --check-duplicates

Output:

Warning: Possible duplicate entries:
  - Smith2024a (DOI: 10.1038/nature12345)
  - Smith2024b (DOI: 10.1038/nature12345)
  Recommendation: Keep one entry, remove the other.

6. Format and Syntax

Purpose: Ensure valid BibTeX syntax.

Common Syntax Errors

Missing commas:

@article{Smith2024,
  author = {Smith, John}   % Missing comma!
  title = {Title}
}
% Should be:
  author = {Smith, John},  % Comma after each field

Unbalanced braces:

title = {Title with {Protected} Text  % Missing closing brace
% Should be:
title = {Title with {Protected} Text}

Missing closing brace for entry:

@article{Smith2024,
  author = {Smith, John},
  title = {Title}
  % Missing closing brace!
% Should end with:
}

Invalid characters in keys:

@article{Smith&Doe2024,  % & not allowed in key
  ...
}
% Use:
@article{SmithDoe2024,
  ...
}

BibTeX Syntax Rules

Entry structure:

@TYPE{citationkey,
  field1 = {value1},
  field2 = {value2},
  ...
  fieldN = {valueN}
}

Citation keys:

  • Alphanumeric and some punctuation (-, _, ., :)
  • No spaces
  • Case-sensitive
  • Unique within file

Field values:

  • Enclosed in {braces} or "quotes"
  • Braces preferred for complex text
  • Numbers can be unquoted: year = 2024

Special characters:

  • { and } for grouping
  • \ for LaTeX commands
  • Protect capitalization: {AlphaFold}
  • Accents: {\"u}, {\'e}, {\aa}

Validation

python scripts/validate_citations.py references.bib --check-syntax

Checks:

  • Valid BibTeX structure
  • Balanced braces
  • Proper commas
  • Valid entry types
  • Unique citation keys
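
Two of these checks, balanced braces and unique keys, can be approximated on the raw file text without a full parser. The sketch below is a simplification with hypothetical function names: it skips backslash-escaped braces but does not account for braces inside comments or quoted values, so a real BibTeX parser is the safer choice for production use.

```python
import re
from collections import Counter


def check_braces(text: str) -> list[str]:
    """Report unmatched braces in BibTeX source text."""
    problems = []
    depth = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        for i, ch in enumerate(line):
            # Ignore non-brace characters and escaped braces such as \{ or \}.
            if ch not in "{}" or (i > 0 and line[i - 1] == "\\"):
                continue
            if ch == "{":
                depth += 1
            elif depth == 0:
                problems.append(f"unexpected '}}' on line {line_no}")
            else:
                depth -= 1
    if depth > 0:
        problems.append(f"{depth} unclosed '{{'")
    return problems


def duplicate_keys(text: str) -> list[str]:
    """Return citation keys that are defined more than once."""
    keys = re.findall(r"@\w+\s*\{\s*([^,\s}]+)\s*,", text)
    return [key for key, count in Counter(keys).items() if count > 1]
```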

Validation Workflow

Step 1: Basic Validation

Run comprehensive validation:

python scripts/validate_citations.py references.bib

Checks all:

  • DOI resolution
  • Required fields
  • Author formatting
  • Data consistency
  • Duplicates
  • Syntax

Step 2: Review Report

Examine validation report:

{
  "total_entries": 150,
  "valid_entries": 140,
  "errors": [
    {
      "entry": "Smith2024",
      "error": "missing_required_field",
      "field": "journal",
      "severity": "high"
    },
    {
      "entry": "Doe2023",
      "error": "invalid_doi",
      "doi": "10.1038/broken",
      "severity": "high"
    }
  ],
  "warnings": [
    {
      "entry": "Jones2022",
      "warning": "missing_recommended_field",
      "field": "volume",
      "severity": "medium"
    }
  ],
  "duplicates": [
    {
      "entries": ["Smith2024a", "Smith2024b"],
      "reason": "same_doi",
      "doi": "10.1038/nature12345"
    }
  ]
}
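
A report with this shape is easy to triage programmatically. The sketch below assumes the report has already been loaded into a dict (for example with `json.load`) and uses only the field names shown in the example above. How the script emits the report is not specified here, and `entries_by_severity` is a hypothetical helper.

```python
from collections import defaultdict


def entries_by_severity(report: dict) -> dict:
    """Group citation keys from errors and warnings by their severity label."""
    grouped = defaultdict(list)
    for item in report.get("errors", []) + report.get("warnings", []):
        grouped[item["severity"]].append(item["entry"])
    return dict(grouped)
```

Applied to the example report, this yields the keys Smith2024 and Doe2023 under "high" and Jones2022 under "medium", which matches the priority order used when fixing issues.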

Step 3: Fix Issues

High-priority (errors):

  1. Add missing required fields
  2. Fix broken DOIs
  3. Remove duplicates
  4. Correct syntax errors

Medium-priority (warnings):

  1. Add recommended fields
  2. Improve author formatting
  3. Fix page ranges

Low-priority:

  1. Standardize formatting
  2. Add URLs for accessibility

Step 4: Auto-Fix

Use auto-fix for safe corrections:

python scripts/validate_citations.py references.bib \
  --auto-fix \
  --output fixed_references.bib

Auto-fix can:

  • Fix page range format (- to --)
  • Remove "pp." from pages
  • Standardize author separators
  • Fix common syntax errors
  • Normalize field order

Auto-fix cannot:

  • Add missing information
  • Find correct DOIs
  • Determine which duplicate to keep
  • Fix semantic errors

Step 5: Manual Review

Review auto-fixed file:

# Check what changed
diff references.bib fixed_references.bib

# Review specific entries that had errors
grep -A 10 "Smith2024" fixed_references.bib

Step 6: Re-Validate

Validate after fixes:

python scripts/validate_citations.py fixed_references.bib --verbose

Should show:

✓ All DOIs valid
✓ All required fields present
✓ No duplicates found
✓ Syntax valid
✓ 150/150 entries valid

Validation Checklist

Use this checklist before final submission:

DOI Validation

  • All DOIs resolve correctly
  • Metadata matches between BibTeX and CrossRef
  • No broken or invalid DOIs

Completeness

  • All entries have required fields
  • Modern papers (2000+) have DOIs
  • Authors properly formatted
  • Journals/conferences properly named

Consistency

  • Years are 4-digit numbers
  • Page ranges use -- not -
  • Volume/number contain no labels such as "Vol." or "Issue" (supplement identifiers like S1 are fine)
  • URLs are accessible

Duplicates

  • No entries with same DOI
  • No near-duplicate titles
  • Preprints updated to published versions

Formatting

  • Valid BibTeX syntax
  • Balanced braces
  • Proper commas
  • Unique citation keys

Final Checks

  • Bibliography compiles without errors
  • All citations in text appear in bibliography
  • All bibliography entries cited in text
  • Citation style matches journal requirements

Best Practices

1. Validate Early and Often

# After extraction
python scripts/extract_metadata.py --doi ... --output refs.bib
python scripts/validate_citations.py refs.bib

# After manual edits
python scripts/validate_citations.py refs.bib

# Before submission
python scripts/validate_citations.py refs.bib --strict

2. Use Automated Tools

Prefer scripts over purely manual validation:

  • Faster
  • More comprehensive
  • Catches errors humans miss
  • Generates reports

3. Keep Backup

# Before auto-fix
cp references.bib references_backup.bib

# Run auto-fix
python scripts/validate_citations.py references.bib \
  --auto-fix \
  --output references_fixed.bib

# Review changes
diff references.bib references_fixed.bib

# If satisfied, replace
mv references_fixed.bib references.bib

4. Fix High-Priority First

Priority order:

  1. Syntax errors (prevent compilation)
  2. Missing required fields (incomplete citations)
  3. Broken DOIs (broken links)
  4. Duplicates (confusion, wasted space)
  5. Missing recommended fields
  6. Formatting inconsistencies

5. Document Exceptions

For entries that can't be fixed:

@article{Old1950,
  author = {Smith, John},
  title = {Title},
  journal = {Obscure Journal},
  year = {1950},
  volume = {12},
  pages = {34--56},
  note = {No DOI available for this publication}
}

6. Validate Against Journal Requirements

Different journals have different requirements:

  • Citation style (numbered, author-year)
  • Abbreviations (journal names)
  • Maximum reference count
  • Format (BibTeX, EndNote, manual)

Check journal author guidelines!

Common Validation Issues

Issue 1: Metadata Mismatch

Problem: BibTeX says 2023, CrossRef says 2024.

Cause:

  • Online-first vs print publication
  • Correction/update
  • Extraction error

Solution:

  1. Check actual article
  2. Use the year of the version you are citing (usually the final published version)
  3. Update BibTeX entry
  4. Re-validate

Issue 2: Special Characters

Problem: LaTeX compilation fails on special characters.

Cause:

  • Accented characters (é, ü, ñ)
  • Chemical formulas (H₂O)
  • Math symbols (α, β, ±)

Solution:

% Use LaTeX commands
author = {M{\"u}ller, Hans}  % Müller
title = {Study of H\textsubscript{2}O}  % H₂O
% Or use UTF-8 with proper LaTeX packages
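
For the accented-character case, the conversion to LaTeX commands can be partly automated. The sketch below relies on Unicode decomposition from the standard `unicodedata` module and covers only five common accents. `latex_escape_accents` is a hypothetical helper, and characters outside the mapping (subscripts, Greek letters, ±) are left untouched and still need manual handling.

```python
import unicodedata

# Combining mark -> LaTeX accent command character (a deliberately small subset).
COMBINING_TO_LATEX = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis / umlaut
}


def latex_escape_accents(text: str) -> str:
    """Replace precomposed accented letters with forms such as {\\"u}."""
    out = []
    # Compose first so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in COMBINING_TO_LATEX:
            base, mark = decomposed
            out.append("{\\" + COMBINING_TO_LATEX[mark] + base + "}")
        else:
            out.append(ch)
    return "".join(out)
```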

Issue 3: Incomplete Extraction

Problem: Extracted metadata missing fields.

Cause:

  • Source doesn't provide all metadata
  • Extraction error
  • Incomplete record

Solution:

  1. Check original article
  2. Manually add missing fields
  3. Use alternative source (PubMed vs CrossRef)

Issue 4: Cannot Find Duplicate

Problem: Same paper appears twice, not detected.

Cause:

  • Different DOIs (should be rare)
  • Different titles (abbreviated, typo)
  • Different citation keys

Solution:

  • Manual search for author + year
  • Check for similar titles
  • Remove manually

Summary

Validation ensures citation quality:

Accuracy: DOIs resolve, metadata correct
Completeness: All required fields present
Consistency: Proper formatting throughout
No duplicates: Each paper cited once
Valid syntax: BibTeX compiles without errors

Always validate before final submission!

Use automated tools:

python scripts/validate_citations.py references.bib

Follow workflow:

  1. Extract metadata
  2. Validate
  3. Fix errors
  4. Re-validate
  5. Submit