Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:14 +08:00
commit 1dd5bee3b4
335 changed files with 147360 additions and 0 deletions

View File

@@ -0,0 +1,794 @@
# Citation Validation Guide
Comprehensive guide to validating citation accuracy, completeness, and formatting in BibTeX files.
## Overview
Citation validation ensures:
- All citations are accurate and complete
- DOIs resolve correctly
- Required fields are present
- No duplicate entries
- Proper formatting and syntax
- Links are accessible
Validation should be performed:
- After extracting metadata
- Before manuscript submission
- After manual edits to BibTeX files
- Periodically for maintained bibliographies
## Validation Categories
### 1. DOI Verification
**Purpose**: Ensure DOIs are valid and resolve correctly.
#### What to Check
**DOI format**:
```
Valid: 10.1038/s41586-021-03819-2
Valid: 10.1126/science.aam9317
Invalid: 10.1038/invalid
Invalid: doi:10.1038/... (should omit "doi:" prefix in BibTeX)
```
**DOI resolution**:
- DOI should resolve via https://doi.org/
- Should redirect to actual article
- Should not return 404 or error
**Metadata consistency**:
- CrossRef metadata should match BibTeX
- Author names should align
- Title should match
- Year should match
#### How to Validate
**Manual check**:
1. Copy DOI from BibTeX
2. Visit https://doi.org/10.1038/nature12345
3. Verify it redirects to correct article
4. Check metadata matches
**Automated check** (recommended):
```bash
python scripts/validate_citations.py references.bib --check-dois
```
**Process**:
1. Extract all DOIs from BibTeX file
2. Query doi.org resolver for each
3. Query CrossRef API for metadata
4. Compare metadata with BibTeX entry
5. Report discrepancies
#### Common Issues
**Broken DOIs**:
- Typos in DOI
- Publisher changed DOI (rare)
- Article retracted
- Solution: Find correct DOI from publisher site
**Mismatched metadata**:
- BibTeX has old/incorrect information
- Solution: Re-extract metadata from CrossRef
**Missing DOIs**:
- Older articles may not have DOIs
- Acceptable for pre-2000 publications
- Add URL or PMID instead
### 2. Required Fields
**Purpose**: Ensure all necessary information is present.
#### Required by Entry Type
**@article**:
```bibtex
author % REQUIRED
title % REQUIRED
journal % REQUIRED
year % REQUIRED
volume % Highly recommended
pages % Highly recommended
doi % Highly recommended for modern papers
```
**@book**:
```bibtex
author OR editor % REQUIRED (at least one)
title % REQUIRED
publisher % REQUIRED
year % REQUIRED
isbn % Recommended
```
**@inproceedings**:
```bibtex
author % REQUIRED
title % REQUIRED
booktitle % REQUIRED (conference/proceedings name)
year % REQUIRED
pages % Recommended
```
**@incollection** (book chapter):
```bibtex
author % REQUIRED
title % REQUIRED (chapter title)
booktitle % REQUIRED (book title)
publisher % REQUIRED
year % REQUIRED
editor % Recommended
pages % Recommended
```
**@phdthesis**:
```bibtex
author % REQUIRED
title % REQUIRED
school % REQUIRED
year % REQUIRED
```
**@misc** (preprints, datasets, etc.):
```bibtex
author % REQUIRED
title % REQUIRED
year % REQUIRED
howpublished % Recommended (bioRxiv, Zenodo, etc.)
doi OR url % At least one required
```
#### Validation Script
```bash
python scripts/validate_citations.py references.bib --check-required-fields
```
**Output**:
```
Error: Entry 'Smith2024' missing required field 'journal'
Error: Entry 'Doe2023' missing required field 'year'
Warning: Entry 'Jones2022' missing recommended field 'volume'
```
### 3. Author Name Formatting
**Purpose**: Ensure consistent, correct author name formatting.
#### Proper Format
**Recommended BibTeX format**:
```bibtex
author = {Last1, First1 and Last2, First2 and Last3, First3}
```
**Examples**:
```bibtex
% Correct
author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}
author = {Smith, John and Doe, Jane}
author = {Smith, John and Doe, Jane and Johnson, Mary}
% For many authors
author = {Smith, John and Doe, Jane and others}
% Incorrect
author = {John Smith} % First Last format (not recommended)
author = {Smith, J.; Doe, J.} % Semicolon separator (wrong)
author = {Smith J, Doe J} % Missing commas
```
#### Special Cases
**Suffixes (Jr., III, etc.)**:
```bibtex
author = {King, Jr., Martin Luther}
```
**Multiple surnames (hyphenated)**:
```bibtex
author = {Smith-Jones, Mary}
```
**Van, von, de, etc.**:
```bibtex
author = {van der Waals, Johannes}
author = {de Broglie, Louis}
```
**Organizations as authors**:
```bibtex
author = {{World Health Organization}}
% Double braces treat as single author
```
#### Validation Checks
**Automated validation**:
```bash
python scripts/validate_citations.py references.bib --check-authors
```
**Checks for**:
- Proper separator (and, not &, ; , etc.)
- Comma placement
- Empty author fields
- Malformed names
### 4. Data Consistency
**Purpose**: Ensure all fields contain valid, reasonable values.
#### Year Validation
**Valid years**:
```bibtex
year = {2024} % Current/recent
year = {1953} % Watson & Crick DNA structure (historical)
year = {1665} % Hooke's Micrographia (very old)
```
**Invalid years**:
```bibtex
year = {24} % Two digits (ambiguous)
year = {202} % Typo
year = {2025} % Future (unless accepted/in press)
year = {0} % Obviously wrong
```
**Check**:
- Four digits
- Reasonable range (1600-current+1)
- Not all zeros
#### Volume/Number Validation
```bibtex
volume = {123} % Numeric
volume = {12} % Valid
number = {3} % Valid
number = {S1} % Supplement issue (valid)
```
**Invalid**:
```bibtex
volume = {Vol. 123} % Should be just number
number = {Issue 3} % Should be just number
```
#### Page Range Validation
**Correct format**:
```bibtex
pages = {123--145} % En-dash (two hyphens)
pages = {e0123456} % PLOS-style article ID
pages = {123} % Single page
```
**Incorrect format**:
```bibtex
pages = {123-145} % Single hyphen (use --)
pages = {pp. 123-145} % Remove "pp."
pages = {123145} % Unicode en-dash (may cause issues)
```
#### URL Validation
**Check**:
- URLs are accessible (return 200 status)
- HTTPS when available
- No obvious typos
- Permanent links (not temporary)
**Valid**:
```bibtex
url = {https://www.nature.com/articles/nature12345}
url = {https://arxiv.org/abs/2103.14030}
```
**Questionable**:
```bibtex
url = {http://...} % HTTP instead of HTTPS
url = {file:///...} % Local file path
url = {bit.ly/...} % URL shortener (not permanent)
```
### 5. Duplicate Detection
**Purpose**: Find and remove duplicate entries.
#### Types of Duplicates
**Exact duplicates** (same DOI):
```bibtex
@article{Smith2024a,
doi = {10.1038/nature12345},
...
}
@article{Smith2024b,
doi = {10.1038/nature12345}, % Same DOI!
...
}
```
**Near duplicates** (similar title/authors):
```bibtex
@article{Smith2024,
title = {Machine Learning for Drug Discovery},
...
}
@article{Smith2024method,
title = {Machine learning for drug discovery}, % Same, different case
...
}
```
**Preprint + Published**:
```bibtex
@misc{Smith2023arxiv,
title = {AlphaFold Results},
howpublished = {arXiv},
...
}
@article{Smith2024,
title = {AlphaFold Results}, % Same paper, now published
journal = {Nature},
...
}
% Keep published version only
```
#### Detection Methods
**By DOI** (most reliable):
- Same DOI = exact duplicate
- Keep one, remove other
**By title similarity**:
- Normalize: lowercase, remove punctuation
- Calculate similarity (e.g., Levenshtein distance)
- Flag if >90% similar
**By author-year-title**:
- Same first author + year + similar title
- Likely duplicate
**Automated detection**:
```bash
python scripts/validate_citations.py references.bib --check-duplicates
```
**Output**:
```
Warning: Possible duplicate entries:
- Smith2024a (DOI: 10.1038/nature12345)
- Smith2024b (DOI: 10.1038/nature12345)
Recommendation: Keep one entry, remove the other.
```
### 6. Format and Syntax
**Purpose**: Ensure valid BibTeX syntax.
#### Common Syntax Errors
**Missing commas**:
```bibtex
@article{Smith2024,
author = {Smith, John} % Missing comma!
title = {Title}
}
% Should be:
author = {Smith, John}, % Comma after each field
```
**Unbalanced braces**:
```bibtex
title = {Title with {Protected} Text % Missing closing brace
% Should be:
title = {Title with {Protected} Text}
```
**Missing closing brace for entry**:
```bibtex
@article{Smith2024,
author = {Smith, John},
title = {Title}
% Missing closing brace!
% Should end with:
}
```
**Invalid characters in keys**:
```bibtex
@article{Smith&Doe2024, % & not allowed in key
...
}
% Use:
@article{SmithDoe2024,
...
}
```
#### BibTeX Syntax Rules
**Entry structure**:
```bibtex
@TYPE{citationkey,
field1 = {value1},
field2 = {value2},
...
fieldN = {valueN}
}
```
**Citation keys**:
- Alphanumeric and some punctuation (-, _, ., :)
- No spaces
- Case-sensitive
- Unique within file
**Field values**:
- Enclosed in {braces} or "quotes"
- Braces preferred for complex text
- Numbers can be unquoted: `year = 2024`
**Special characters**:
- `{` and `}` for grouping
- `\` for LaTeX commands
- Protect capitalization: `{AlphaFold}`
- Accents: `{\"u}`, `{\'e}`, `{\aa}`
#### Validation
```bash
python scripts/validate_citations.py references.bib --check-syntax
```
**Checks**:
- Valid BibTeX structure
- Balanced braces
- Proper commas
- Valid entry types
- Unique citation keys
## Validation Workflow
### Step 1: Basic Validation
Run comprehensive validation:
```bash
python scripts/validate_citations.py references.bib
```
**Checks all**:
- DOI resolution
- Required fields
- Author formatting
- Data consistency
- Duplicates
- Syntax
### Step 2: Review Report
Examine validation report:
```json
{
"total_entries": 150,
"valid_entries": 140,
"errors": [
{
"entry": "Smith2024",
"error": "missing_required_field",
"field": "journal",
"severity": "high"
},
{
"entry": "Doe2023",
"error": "invalid_doi",
"doi": "10.1038/broken",
"severity": "high"
}
],
"warnings": [
{
"entry": "Jones2022",
"warning": "missing_recommended_field",
"field": "volume",
"severity": "medium"
}
],
"duplicates": [
{
"entries": ["Smith2024a", "Smith2024b"],
"reason": "same_doi",
"doi": "10.1038/nature12345"
}
]
}
```
### Step 3: Fix Issues
**High-priority** (errors):
1. Add missing required fields
2. Fix broken DOIs
3. Remove duplicates
4. Correct syntax errors
**Medium-priority** (warnings):
1. Add recommended fields
2. Improve author formatting
3. Fix page ranges
**Low-priority**:
1. Standardize formatting
2. Add URLs for accessibility
### Step 4: Auto-Fix
Use auto-fix for safe corrections:
```bash
python scripts/validate_citations.py references.bib \
--auto-fix \
--output fixed_references.bib
```
**Auto-fix can**:
- Fix page range format (- to --)
- Remove "pp." from pages
- Standardize author separators
- Fix common syntax errors
- Normalize field order
**Auto-fix cannot**:
- Add missing information
- Find correct DOIs
- Determine which duplicate to keep
- Fix semantic errors
### Step 5: Manual Review
Review auto-fixed file:
```bash
# Check what changed
diff references.bib fixed_references.bib
# Review specific entries that had errors
grep -A 10 "Smith2024" fixed_references.bib
```
### Step 6: Re-Validate
Validate after fixes:
```bash
python scripts/validate_citations.py fixed_references.bib --verbose
```
Should show:
```
✓ All DOIs valid
✓ All required fields present
✓ No duplicates found
✓ Syntax valid
✓ 150/150 entries valid
```
## Validation Checklist
Use this checklist before final submission:
### DOI Validation
- [ ] All DOIs resolve correctly
- [ ] Metadata matches between BibTeX and CrossRef
- [ ] No broken or invalid DOIs
### Completeness
- [ ] All entries have required fields
- [ ] Modern papers (2000+) have DOIs
- [ ] Authors properly formatted
- [ ] Journals/conferences properly named
### Consistency
- [ ] Years are 4-digit numbers
- [ ] Page ranges use -- not -
- [ ] Volume/number are numeric
- [ ] URLs are accessible
### Duplicates
- [ ] No entries with same DOI
- [ ] No near-duplicate titles
- [ ] Preprints updated to published versions
### Formatting
- [ ] Valid BibTeX syntax
- [ ] Balanced braces
- [ ] Proper commas
- [ ] Unique citation keys
### Final Checks
- [ ] Bibliography compiles without errors
- [ ] All citations in text appear in bibliography
- [ ] All bibliography entries cited in text
- [ ] Citation style matches journal requirements
## Best Practices
### 1. Validate Early and Often
```bash
# After extraction
python scripts/extract_metadata.py --doi ... --output refs.bib
python scripts/validate_citations.py refs.bib
# After manual edits
python scripts/validate_citations.py refs.bib
# Before submission
python scripts/validate_citations.py refs.bib --strict
```
### 2. Use Automated Tools
Don't validate manually - use scripts:
- Faster
- More comprehensive
- Catches errors humans miss
- Generates reports
### 3. Keep Backup
```bash
# Before auto-fix
cp references.bib references_backup.bib
# Run auto-fix
python scripts/validate_citations.py references.bib \
--auto-fix \
--output references_fixed.bib
# Review changes
diff references.bib references_fixed.bib
# If satisfied, replace
mv references_fixed.bib references.bib
```
### 4. Fix High-Priority First
**Priority order**:
1. Syntax errors (prevent compilation)
2. Missing required fields (incomplete citations)
3. Broken DOIs (broken links)
4. Duplicates (confusion, wasted space)
5. Missing recommended fields
6. Formatting inconsistencies
### 5. Document Exceptions
For entries that can't be fixed:
```bibtex
@article{Old1950,
author = {Smith, John},
title = {Title},
journal = {Obscure Journal},
year = {1950},
volume = {12},
pages = {34--56},
note = {DOI not available for publications before 2000}
}
```
### 6. Validate Against Journal Requirements
Different journals have different requirements:
- Citation style (numbered, author-year)
- Abbreviations (journal names)
- Maximum reference count
- Format (BibTeX, EndNote, manual)
Check journal author guidelines!
## Common Validation Issues
### Issue 1: Metadata Mismatch
**Problem**: BibTeX says 2023, CrossRef says 2024.
**Cause**:
- Online-first vs print publication
- Correction/update
- Extraction error
**Solution**:
1. Check actual article
2. Use more recent/accurate date
3. Update BibTeX entry
4. Re-validate
### Issue 2: Special Characters
**Problem**: LaTeX compilation fails on special characters.
**Cause**:
- Accented characters (é, ü, ñ)
- Chemical formulas (H₂O)
- Math symbols (α, β, ±)
**Solution**:
```bibtex
% Use LaTeX commands
author = {M{\"u}ller, Hans} % Müller
title = {Study of H\textsubscript{2}O} % H₂O
% Or use UTF-8 with proper LaTeX packages
```
### Issue 3: Incomplete Extraction
**Problem**: Extracted metadata missing fields.
**Cause**:
- Source doesn't provide all metadata
- Extraction error
- Incomplete record
**Solution**:
1. Check original article
2. Manually add missing fields
3. Use alternative source (PubMed vs CrossRef)
### Issue 4: Cannot Find Duplicate
**Problem**: Same paper appears twice, not detected.
**Cause**:
- Different DOIs (should be rare)
- Different titles (abbreviated, typo)
- Different citation keys
**Solution**:
- Manual search for author + year
- Check for similar titles
- Remove manually
## Summary
Validation ensures citation quality:
**Accuracy**: DOIs resolve, metadata correct
**Completeness**: All required fields present
**Consistency**: Proper formatting throughout
**No duplicates**: Each paper cited once
**Valid syntax**: BibTeX compiles without errors
**Always validate** before final submission!
Use automated tools:
```bash
python scripts/validate_citations.py references.bib
```
Follow workflow:
1. Extract metadata
2. Validate
3. Fix errors
4. Re-validate
5. Submit