Extract Structured Data from Scientific PDFs
A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.
Overview
This skill provides an end-to-end workflow for:
- Organizing PDF literature and metadata from various sources
- Filtering relevant papers based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)
Quick Start
1. Installation
Create a conda environment:
conda env create -f environment.yml
conda activate pdf_extraction
Or install with pip:
pip install -r requirements.txt
2. Setup API Keys
Set your Anthropic API key:
export ANTHROPIC_API_KEY='your-api-key-here'
For geographic validation (optional):
export GEONAMES_USERNAME='your-geonames-username'
3. Run the Skill
The easiest way is to use the skill through Claude Code:
claude-code
Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.
Documentation
The skill includes comprehensive reference documentation:
- references/setup_guide.md - Installation and configuration
- references/workflow_guide.md - Complete step-by-step workflow with examples
- references/validation_guide.md - Validation methodology and metrics interpretation
- references/api_reference.md - External API integration details
Manual Workflow
You can also run the scripts manually:
Step 1: Organize Metadata
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source path/to/library.bib \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
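For orientation, the metadata step reads standard bibliography fields. A minimal sketch of inspecting a BibTeX source with the bibtexparser library (the script's actual output structure may differ):
import bibtexparser
# Each entry is a plain dict of BibTeX fields keyed by field name.
with open("path/to/library.bib") as f:
    db = bibtexparser.load(f)
for entry in db.entries:
    print(entry.get("ID"), entry.get("title"), entry.get("doi", "no DOI"))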
Step 2: Filter Papers (Optional)
First, customize the filtering prompt in scripts/02_filter_abstracts.py for your use case.
Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
Option B: Local Model via Ollama (FREE)
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend ollama \
--ollama-model llama3.1:8b \
--output filtered_papers.json
Recommended Ollama models:
- llama3.1:8b - Good balance (8GB RAM)
- mistral:7b - Fast, good for simple filtering
- qwen2.5:7b - Good multilingual support
- llama3.1:70b - Better accuracy (64GB RAM)
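Before launching a filtering run, it can help to confirm the Ollama server is up and the model is pulled. A quick sketch, assuming Ollama's default port 11434:
import requests
# Ollama lists locally available models at /api/tags.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("llama3.1:8b pulled:", any(m.startswith("llama3.1") for m in models))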
Step 3: Extract Data from PDFs
First, create your extraction schema by copying and customizing assets/schema_template.json.
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
Step 4: Repair JSON
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
Step 5: Validate with APIs
First, create your API configuration by copying and customizing assets/api_config_template.json.
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
Step 6: Export
# For Python/pandas
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--flatten \
--output results
# For R
python scripts/06_export_database.py \
--input validated_data.json \
--format r \
--flatten \
--output results
# For CSV
python scripts/06_export_database.py \
--input validated_data.json \
--format csv \
--flatten \
--output results.csv
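Once exported, the results load directly into your analysis environment. A minimal sketch for the Python and CSV exports (file names are illustrative; use whatever the export script actually wrote):
import pandas as pd
df = pd.read_pickle("results.pkl")  # from --format python
# df = pd.read_csv("results.csv")   # from --format csv
print(df.shape)
print(df.head())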
Validation & Quality Assurance (Optional but Recommended)
Validate extraction quality using precision and recall metrics:
Step 7: Prepare Validation Set
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
Sampling strategies:
- random - Random sample
- stratified - Sample by extraction characteristics
- diverse - Maximize diversity
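As a rough illustration of what stratified sampling means here (the grouping key is hypothetical; the script defines its own strata):
import random
from collections import defaultdict

def stratified_sample(papers, key, n):
    # Group papers by an extraction characteristic, then sample
    # (roughly) evenly from each group.
    groups = defaultdict(list)
    for p in papers:
        groups[key(p)].append(p)
    per_group = max(1, n // len(groups))
    sample = []
    for members in groups.values():
        sample.extend(random.sample(members, min(per_group, len(members))))
    return sample[:n]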
Step 8: Manual Annotation
- Open validation_set.json
- For each sampled paper:
  - Read the PDF
  - Fill in the ground_truth field with the correct extraction
  - Add annotator name and annotation_date
  - Use notes for ambiguous cases
- Save the file
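One annotated entry might look roughly like this (fields beyond ground_truth, annotator, annotation_date, and notes are illustrative; keep whatever structure the script emits):
entry = {
    "paper_id": "smith2023",   # illustrative identifier
    "extraction": {},          # Claude's output, left as-is
    "ground_truth": {},        # filled in manually from the PDF
    "annotator": "J. Doe",
    "annotation_date": "2025-01-15",
    "notes": "units in Table 2 ambiguous",
}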
Step 9: Calculate Metrics
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
This produces:
- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 Score: Harmonic mean of precision and recall
- Per-field metrics: Accuracy by field type
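A worked example of these metrics on a single field, treating extracted and ground-truth items as sets:
extracted = {"Apis mellifera", "Bombus terrestris", "Vespa crabro"}
truth = {"Apis mellifera", "Bombus terrestris", "Osmia bicornis"}

tp = len(extracted & truth)                          # 2 true positives
precision = tp / len(extracted)                      # 2/3 ~ 0.67
recall = tp / len(truth)                             # 2/3 ~ 0.67
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.67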
Use these metrics to:
- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate and improve schema
- Report quality in publications
Customization
Creating Your Extraction Schema
- Copy assets/schema_template.json to my_schema.json
- Customize the following sections:
  - objective: What you're extracting
  - system_context: Your scientific domain
  - instructions: Step-by-step guidance for Claude
  - output_schema: JSON schema defining your data structure
  - output_example: Example of the desired output
See assets/example_flower_visitors_schema.json for a real-world example.
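A minimal sketch of the shape such a schema might take, written here as a Python dict and saved as JSON (contents are placeholders; follow the template and the flower-visitors example for the real structure):
import json

schema = {
    "objective": "Extract plant-pollinator interaction records",
    "system_context": "You are an expert in pollination ecology.",
    "instructions": ["Read the methods and results sections.",
                     "Record each interaction as one object."],
    "output_schema": {"type": "array", "items": {"type": "object"}},
    "output_example": [{"plant_species": "Rosa canina",
                        "visitor_species": "Apis mellifera"}],
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)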
Configuring API Validation
- Copy assets/api_config_template.json to my_api_config.json
- Map your schema fields to appropriate validation APIs
- See available APIs in scripts/05_validate_with_apis.py and references/api_reference.md
See assets/example_api_config_ecology.json for an ecology example.
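The mapping itself can be as simple as field name to API name. An illustrative sketch (keys and API identifiers are assumptions; check the template for the real format):
import json

api_config = {
    "plant_species": {"api": "gbif"},   # taxonomy via GBIF
    "country": {"api": "geonames"},     # geography via GeoNames
}

with open("my_api_config.json", "w") as f:
    json.dump(api_config, f, indent=2)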
Cost Estimation
PDF processing costs approximately 1,500-3,000 tokens per page:
- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers
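A back-of-the-envelope estimator using the per-paper range above (the per-token price is an assumption; check current Anthropic pricing):
def estimate_cost(n_papers, tokens_per_paper=(20_000, 30_000),
                  usd_per_mtok=3.0):  # assumed input price in $/M tokens
    lo, hi = (n_papers * t / 1e6 * usd_per_mtok for t in tokens_per_paper)
    return lo, hi

print(estimate_cost(100))  # ~(6.0, 9.0) USD at the assumed price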
Tips to reduce costs:
- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with --use-caching
- Use batch processing (--method batches)
- Consider using Haiku for simpler extractions
Supported Data Sources
Bibliography Formats
- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs
Output Formats
- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database
Validation APIs
- Biology: GBIF, World Flora Online, NCBI Gene
- Geography: GeoNames, OpenStreetMap Nominatim
- Chemistry: PubChem
- Medicine: (extensible - add your own)
Examples
See the beetle flower visitors repository for a real-world example of this workflow in action.
Troubleshooting
PDF Size Limits
- Maximum file size: 32MB
- Maximum pages: 100
- Solution: Use chunked processing for larger PDFs (see the sketch below)
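One way (a sketch) to chunk an oversized PDF with the pypdf library before extraction:
from pypdf import PdfReader, PdfWriter

def split_pdf(path, max_pages=100):
    reader = PdfReader(path)
    for start in range(0, len(reader.pages), max_pages):
        writer = PdfWriter()
        for page in reader.pages[start:start + max_pages]:
            writer.add_page(page)
        out = f"{path.rsplit('.', 1)[0]}_part{start // max_pages + 1}.pdf"
        with open(out, "wb") as f:
            writer.write(f)

split_pdf("big_paper.pdf")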
JSON Parsing Errors
- The json-repair library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues
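What the repair step leans on, in miniature: the json-repair library fixes common model-output issues such as trailing commas or unquoted keys.
from json_repair import repair_json

broken = '{"species": "Apis mellifera", "count": 3,}'  # trailing comma
fixed = repair_json(broken)
print(fixed)  # a valid JSON string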
API Rate Limits
- Add delays between requests (implemented in scripts)
- Use batch processing when available
- Check specific API documentation for limits
Contributing
To add support for additional validation APIs:
- Add a validator function to scripts/05_validate_with_apis.py
- Register it in the API_VALIDATORS dictionary
- Update api_config_template.json with examples
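A sketch of what a new validator could look like (the signature expected by the script may differ; mirror the existing validators in scripts/05_validate_with_apis.py):
import requests

def validate_gbif_species(name):
    # Ask GBIF's species-match endpoint for the best match.
    resp = requests.get("https://api.gbif.org/v1/species/match",
                        params={"name": name}, timeout=10)
    resp.raise_for_status()
    match = resp.json()
    return match if match.get("matchType") != "NONE" else None

# Then register it, e.g.: API_VALIDATORS["gbif"] = validate_gbif_species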
Citation
If you use this skill in your research, please cite:
@software{pdf_extraction_skill,
title = {Extract Structured Data from Scientific PDFs},
author = {Your Name},
year = {2025},
url = {https://github.com/your-repo}
}
License
MIT License - see LICENSE file for details