# Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

## Overview

This skill provides an end-to-end workflow for:

- Organizing PDF literature and metadata from various sources
- Filtering relevant papers based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)
## Quick Start

### 1. Installation

Create a conda environment:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

Or install with pip:

```bash
pip install -r requirements.txt
```

### 2. Set Up API Keys

Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

For geographic validation (optional):

```bash
export GEONAMES_USERNAME='your-geonames-username'
```
### 3. Run the Skill

The easiest way to use the skill is through Claude Code:

```bash
claude-code
```

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.
## Documentation

The skill includes comprehensive reference documentation:

- `references/setup_guide.md` - Installation and configuration
- `references/workflow_guide.md` - Complete step-by-step workflow with examples
- `references/validation_guide.md` - Validation methodology and metrics interpretation
- `references/api_reference.md` - External API integration details
## Manual Workflow

You can also run the scripts manually:

### Step 1: Organize Metadata

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```
### Step 2: Filter Papers (Optional)

First, customize the filtering prompt in `scripts/02_filter_abstracts.py` for your use case.

**Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)**

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Local Model via Ollama (FREE)**

```bash
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

Recommended Ollama models:

- `llama3.1:8b` - Good balance (8GB RAM)
- `mistral:7b` - Fast, good for simple filtering
- `qwen2.5:7b` - Good multilingual support
- `llama3.1:70b` - Better accuracy (64GB RAM)
### Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing `assets/schema_template.json`.

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```
### Step 4: Repair JSON

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```
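To see what this step is doing under the hood, here is a minimal standalone sketch using the `json-repair` library (the script also validates against your schema; the malformed string below is purely illustrative):

```python
import json
from json_repair import repair_json

# A typical malformed model response: trailing comma, missing closing brace (illustrative)
raw = '{"species": "Apis mellifera", "count": 3,'

# repair_json returns a syntactically valid JSON string
fixed = repair_json(raw)
record = json.loads(fixed)
print(record)  # expected: {'species': 'Apis mellifera', 'count': 3}
```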
### Step 5: Validate with APIs

First, create your API configuration by copying and customizing `assets/api_config_template.json`.

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```
### Step 6: Export

```bash
# For Python/pandas
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results

# For R
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results

# For CSV
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```
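The exported results can then be loaded directly for analysis. A minimal sketch, assuming the Python export produced a pandas pickle named `results.pkl` and the CSV export produced `results.csv` (the exact filenames depend on how the export script resolves `--output`):

```python
import pandas as pd

# Load the flattened table from the Python export
# (the .pkl filename is an assumption; check the export script's actual output)
df = pd.read_pickle("results.pkl")

# Or work from the CSV export instead
df_csv = pd.read_csv("results.csv")

print(df.shape)
print(df.columns.tolist())
```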
### Validation & Quality Assurance (Optional but Recommended)

Validate extraction quality using precision and recall metrics:

#### Step 7: Prepare Validation Set

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Sampling strategies:

- `random` - Random sample
- `stratified` - Sample by extraction characteristics
- `diverse` - Maximize diversity
#### Step 8: Manual Annotation

1. Open `validation_set.json`
2. For each sampled paper:
   - Read the PDF
   - Fill in the `ground_truth` field with the correct extraction (see the sketch below)
   - Add the `annotator` name and `annotation_date`
   - Use `notes` for ambiguous cases
3. Save the file

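To make the annotation target concrete, here is a sketch of one annotated entry. Only the `ground_truth`, `annotator`, `annotation_date`, and `notes` fields come from the list above; every other key and all values are hypothetical, so follow the structure that `07_prepare_validation_set.py` actually produces:

```python
# Hypothetical annotated entry; match the real structure of validation_set.json
annotated_entry = {
    "paper_id": "smith_2019",                       # hypothetical identifier field
    "extraction": {"species": ["Apis mellifera"]},  # what the model extracted (hypothetical shape)
    "ground_truth": {"species": ["Apis mellifera", "Bombus terrestris"]},
    "annotator": "Your Name",
    "annotation_date": "2025-01-15",
    "notes": "Second species only mentioned in a figure caption",
}
```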
#### Step 9: Calculate Metrics

```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

This produces the following metrics (a small worked example follows the list):

- **Precision**: % of extracted items that are correct
- **Recall**: % of true items that were extracted
- **F1 Score**: Harmonic mean of precision and recall
- **Per-field metrics**: Accuracy by field type

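For reference, here is a minimal sketch of how these numbers are computed for a single field, treating each extracted or annotated value as an item (this illustrates the definitions only, not the exact logic of `08_calculate_validation_metrics.py`):

```python
# Illustrative values for one field across the validation set
extracted = {"Apis mellifera", "Bombus terrestris", "Vespa crabro"}        # model output
ground_truth = {"Apis mellifera", "Bombus terrestris", "Osmia bicornis"}   # annotator's record

true_positives = len(extracted & ground_truth)

precision = true_positives / len(extracted)      # correct / everything extracted
recall = true_positives / len(ground_truth)      # correct / everything that should be there
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```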
Use these metrics to:

- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate on and improve the schema
- Report quality in publications
## Customization

### Creating Your Extraction Schema

1. Copy `assets/schema_template.json` to `my_schema.json`
2. Customize the following sections:
   - `objective`: What you're extracting
   - `system_context`: Your scientific domain
   - `instructions`: Step-by-step guidance for Claude
   - `output_schema`: JSON schema defining your data structure
   - `output_example`: Example of desired output

See `assets/example_flower_visitors_schema.json` for a real-world example.

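For orientation, here is a minimal sketch of what a customized schema might contain, written as a Python dict and saved to `my_schema.json`. The top-level keys follow the sections listed above; all field names and values are illustrative and are not the template's actual contents:

```python
import json

# Illustrative schema; adapt each section to your own study system
my_schema = {
    "objective": "Extract host plant records for each insect species mentioned in the paper",
    "system_context": "You are assisting with a review of plant-insect interactions",
    "instructions": [
        "Read the full paper, including tables and figure captions",
        "Report one record per insect-plant pair",
        "Use null when a value is not reported",
    ],
    "output_schema": {
        "type": "object",
        "properties": {
            "records": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "insect_species": {"type": "string"},
                        "plant_species": {"type": "string"},
                        "locality": {"type": ["string", "null"]},
                    },
                },
            }
        },
    },
    "output_example": {
        "records": [
            {"insect_species": "Apis mellifera", "plant_species": "Trifolium repens", "locality": "Portugal"}
        ]
    },
}

with open("my_schema.json", "w") as f:
    json.dump(my_schema, f, indent=2)
```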
### Configuring API Validation

1. Copy `assets/api_config_template.json` to `my_api_config.json`
2. Map your schema fields to appropriate validation APIs
3. See available APIs in `scripts/05_validate_with_apis.py` and `references/api_reference.md`

See `assets/example_api_config_ecology.json` for an ecology example.
## Cost Estimation

PDF processing consumes approximately 1,500-3,000 tokens per page (a back-of-envelope calculation follows the list):

- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers

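The arithmetic is easy to adapt to your own corpus. A back-of-envelope sketch using the figures above (the $3 per million input tokens is an assumed Sonnet input price; check current pricing):

```python
# Back-of-envelope cost estimate from the figures above
tokens_per_paper = 25_000        # roughly mid-range for a 10-page paper (see list above)
n_papers = 100
price_per_m_input = 3.00         # USD per million input tokens (assumption; check current pricing)

total_tokens = tokens_per_paper * n_papers            # 2,500,000 tokens
cost = total_tokens / 1_000_000 * price_per_m_input   # ~$7.50
print(f"{total_tokens / 1e6:.1f}M tokens -> ~${cost:.2f}")
```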
Tips to reduce costs:

- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with `--use-caching`
- Use batch processing (`--method batches`)
- Consider using Haiku for simpler extractions
## Supported Data Sources

### Bibliography Formats

- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs

### Output Formats

- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database

### Validation APIs

- **Biology**: GBIF, World Flora Online, NCBI Gene
- **Geography**: GeoNames, OpenStreetMap Nominatim
- **Chemistry**: PubChem
- **Medicine**: (extensible - add your own)
## Examples

See the [beetle flower visitors repository](https://github.com/brunoasm/ARE_2026_beetle_flower_visitors) for a real-world example of this workflow in action.
## Troubleshooting

### PDF Size Limits

- Maximum file size: 32 MB
- Maximum pages: 100
- Solution: use chunked processing for larger PDFs (see the sketch below)

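One way to chunk a large file is to split it before extraction. A minimal sketch using `pypdf` (not part of the skill's scripts; shown here only as one possible approach) that writes 100-page chunks:

```python
from pypdf import PdfReader, PdfWriter

MAX_PAGES = 100  # per-request page limit noted above

reader = PdfReader("large_paper.pdf")
n_pages = len(reader.pages)

for start in range(0, n_pages, MAX_PAGES):
    writer = PdfWriter()
    for i in range(start, min(start + MAX_PAGES, n_pages)):
        writer.add_page(reader.pages[i])
    chunk_name = f"large_paper_part{start // MAX_PAGES + 1}.pdf"
    with open(chunk_name, "wb") as f:
        writer.write(f)
    print(f"wrote {chunk_name}")
```

Very large scans may still exceed the 32 MB limit even after splitting by pages; check the chunk sizes before submitting them.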
### JSON Parsing Errors

- The `json-repair` library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues
### API Rate Limits

- Add delays between requests (already implemented in the scripts)
- Use batch processing when available
- Check each API's documentation for its limits
## Contributing

To add support for additional validation APIs:

1. Add a validator function to `scripts/05_validate_with_apis.py` (see the sketch below)
2. Register it in the `API_VALIDATORS` dictionary
3. Update `api_config_template.json` with examples

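As an illustration only, a new validator might look roughly like this; the function signature, return shape, and registration pattern are assumptions, so match whatever `scripts/05_validate_with_apis.py` actually expects:

```python
import requests

def validate_species_gbif(name: str) -> dict:
    """Check a species name against the GBIF backbone taxonomy."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
        timeout=10,
    )
    resp.raise_for_status()
    match = resp.json()
    return {
        "input": name,
        "valid": match.get("matchType") != "NONE",
        "accepted_name": match.get("scientificName"),
    }

# In 05_validate_with_apis.py, register the function in the existing dictionary, e.g.:
# API_VALIDATORS["gbif_species"] = validate_species_gbif

if __name__ == "__main__":
    print(validate_species_gbif("Apis mellifera"))
```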
## Citation

If you use this skill in your research, please cite:

```bibtex
@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}
```

## License

MIT License - see LICENSE file for details