# Extract Structured Data from Scientific PDFs
A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.
## Overview
This skill provides an end-to-end workflow for:
- Organizing PDF literature and metadata from various sources
- Filtering papers for relevance based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)
## Quick Start
### 1. Installation
Create a conda environment:
```bash
conda env create -f environment.yml
conda activate pdf_extraction
```
Or install with pip:
```bash
pip install -r requirements.txt
```
### 2. Setup API Keys
Set your Anthropic API key:
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```
For geographic validation (optional):
```bash
export GEONAMES_USERNAME='your-geonames-username'
```
### 3. Run the Skill
The easiest way is to use the skill through Claude Code:
```bash
claude
```
Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.
## Documentation
The skill includes comprehensive reference documentation:
- `references/setup_guide.md` - Installation and configuration
- `references/workflow_guide.md` - Complete step-by-step workflow with examples
- `references/validation_guide.md` - Validation methodology and metrics interpretation
- `references/api_reference.md` - External API integration details
## Manual Workflow
You can also run the scripts manually:
### Step 1: Organize Metadata
```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```
### Step 2: Filter Papers (Optional)
First, customize the filtering prompt in `scripts/02_filter_abstracts.py` for your use case.
**Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)**
```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```
**Option B: Local Model via Ollama (FREE)**
```bash
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```
Recommended Ollama models:
- `llama3.1:8b` - Good balance (8GB RAM)
- `mistral:7b` - Fast, good for simple filtering
- `qwen2.5:7b` - Good multilingual support
- `llama3.1:70b` - Better accuracy (64GB RAM)
### Step 3: Extract Data from PDFs
First, create your extraction schema by copying and customizing `assets/schema_template.json`.
```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```
### Step 4: Repair JSON
```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```
### Step 5: Validate with APIs
First, create your API configuration by copying and customizing `assets/api_config_template.json`.
```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```
### Step 6: Export
```bash
# For Python/pandas
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results

# For R
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results

# For CSV
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```
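The exported tables load directly into your analysis environment. For example, in Python (the CSV path matches the command above; the pickle filename from `--format python` depends on the export script, so treat it as an assumption):
```python
import pandas as pd

# Load the flattened CSV produced by the --format csv command above
df = pd.read_csv("results.csv")

# If you exported with --format python, the output is a pickled DataFrame;
# adjust the filename to whatever the export script actually wrote.
# df = pd.read_pickle("results.pkl")

print(df.shape)
print(df.columns.tolist())
```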
### Validation & Quality Assurance (Optional but Recommended)
Validate extraction quality using precision and recall metrics:
#### Step 7: Prepare Validation Set
```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```
Sampling strategies:
- `random` - Random sample
- `stratified` - Sample by extraction characteristics
- `diverse` - Maximize diversity
#### Step 8: Manual Annotation
1. Open `validation_set.json`
2. For each sampled paper:
- Read the PDF
- Fill in `ground_truth` field with correct extraction
- Add `annotator` name and `annotation_date`
- Use `notes` for ambiguous cases
3. Save the file
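For orientation, a single annotated record might look roughly like the sketch below. Only `ground_truth`, `annotator`, `annotation_date`, and `notes` come from the steps above; the remaining field names are hypothetical, since the actual record layout is defined by `07_prepare_validation_set.py`:
```python
# Hypothetical shape of one annotated entry in validation_set.json
annotated_entry = {
    "paper_id": "smith_2021",                        # hypothetical identifier field
    "extraction": {"visitors": ["Apis mellifera"]},  # what the model extracted
    "ground_truth": {"visitors": ["Apis mellifera", "Bombus terrestris"]},
    "annotator": "Jane Doe",
    "annotation_date": "2025-01-15",
    "notes": "Second species only reported in a figure caption.",
}
```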
#### Step 9: Calculate Metrics
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```
This produces:
- **Precision**: % of extracted items that are correct
- **Recall**: % of true items that were extracted
- **F1 Score**: Harmonic mean of precision and recall
- **Per-field metrics**: Accuracy by field type

Use these metrics to:
- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate and improve schema
- Report quality in publications
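For reference, the standard definitions behind these numbers, as a self-contained sketch (illustrative only, not the exact implementation in `08_calculate_validation_metrics.py`):
```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 18 correct extractions, 2 spurious, 4 missed
print(precision_recall_f1(18, 2, 4))  # (0.9, 0.818..., 0.857...)
```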
## Customization
### Creating Your Extraction Schema
1. Copy `assets/schema_template.json` to `my_schema.json`
2. Customize the following sections:
- `objective`: What you're extracting
- `system_context`: Your scientific domain
- `instructions`: Step-by-step guidance for Claude
- `output_schema`: JSON schema defining your data structure
- `output_example`: Example of desired output

See `assets/example_flower_visitors_schema.json` for a real-world example.
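A minimal sketch of a customized schema, written out from Python (all values here are hypothetical; `assets/schema_template.json` defines the authoritative structure and field set):
```python
import json

# Hypothetical values for an ecology-style extraction; adapt to your domain.
schema = {
    "objective": "Extract insect-plant visitation records reported in each paper",
    "system_context": "You are an expert in pollination ecology reading primary literature.",
    "instructions": [
        "Read the methods and results sections carefully.",
        "Record every visitor-plant pair, with locality and date when given.",
        "Use null for fields that are not reported.",
    ],
    "output_schema": {
        "type": "object",
        "properties": {"records": {"type": "array", "items": {"type": "object"}}},
    },
    "output_example": {"records": []},
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```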
### Configuring API Validation
1. Copy `assets/api_config_template.json` to `my_api_config.json`
2. Map your schema fields to appropriate validation APIs
3. See available APIs in `scripts/05_validate_with_apis.py` and `references/api_reference.md`

See `assets/example_api_config_ecology.json` for an ecology example.
## Cost Estimation
PDF processing consumes roughly 1,500-3,000 tokens per page:
- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers
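To sanity-check a budget before a large run, a rough back-of-envelope calculation (token counts from above; the ~$3 per million input tokens rate for Sonnet is an assumption, so verify current Anthropic pricing):
```python
# Input-token cost only, ignoring output tokens, caching, and batch discounts.
tokens_per_paper = (20_000, 30_000)   # per-paper estimate from above (~10 pages each)
n_papers = 100
price_per_million_input = 3.0         # assumed USD rate; check current pricing

for tpp in tokens_per_paper:
    total_tokens = tpp * n_papers
    cost = total_tokens / 1_000_000 * price_per_million_input
    print(f"{total_tokens:,} tokens -> ~${cost:.2f}")
# 2,000,000 tokens -> ~$6.00
# 3,000,000 tokens -> ~$9.00
```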

Tips to reduce costs:
- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with `--use-caching`
- Use batch processing (`--method batches`)
- Consider using Haiku for simpler extractions
## Supported Data Sources
### Bibliography Formats
- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs
### Output Formats
- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database
### Validation APIs
- **Biology**: GBIF, World Flora Online, NCBI Gene
- **Geography**: GeoNames, OpenStreetMap Nominatim
- **Chemistry**: PubChem
- **Medicine**: (extensible - add your own)
## Examples
See the [beetle flower visitors repository](https://github.com/brunoasm/ARE_2026_beetle_flower_visitors) for a real-world example of this workflow in action.
## Troubleshooting
### PDF Size Limits
- Maximum file size: 32MB
- Maximum pages: 100
- Solution: Use chunked processing for larger PDFs
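If you need to split an oversized PDF into smaller pieces yourself, one option is the `pypdf` package (an assumption about tooling here, not a dependency of the pipeline's scripts):
```python
from pypdf import PdfReader, PdfWriter

def split_pdf(path: str, pages_per_chunk: int = 50) -> list[str]:
    """Split a large PDF into sequentially numbered chunks below the page limit."""
    reader = PdfReader(path)
    chunk_paths = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
            writer.add_page(reader.pages[i])
        out_path = f"{path.rsplit('.', 1)[0]}_part{start // pages_per_chunk + 1}.pdf"
        with open(out_path, "wb") as f:
            writer.write(f)
        chunk_paths.append(out_path)
    return chunk_paths
```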
### JSON Parsing Errors
- The `json-repair` library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues
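Step 4 already runs `json-repair` across all outputs; for spot-checking a single stubborn response by hand, a quick snippet (assuming the `json-repair` package's `repair_json` helper):
```python
import json

from json_repair import repair_json

# A typical model slip: trailing comma and a missing closing brace
broken = '{"species": ["Apis mellifera", "Bombus terrestris",], "count": 7'
fixed = repair_json(broken)   # best-effort repair into a valid JSON string
data = json.loads(fixed)
print(data)
```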
### API Rate Limits
- Add delays between requests (implemented in scripts)
- Use batch processing when available
- Check specific API documentation for limits
## Contributing
To add support for additional validation APIs:
1. Add validator function to `scripts/05_validate_with_apis.py`
2. Register in `API_VALIDATORS` dictionary
3. Update `api_config_template.json` with examples
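The exact validator signature is defined in `scripts/05_validate_with_apis.py`; as a rough illustration of the registration pattern (function name, endpoint, and return shape here are hypothetical):
```python
import requests

def validate_country_name(value: str) -> dict:
    """Hypothetical validator: check a country name against the public REST Countries API."""
    resp = requests.get(f"https://restcountries.com/v3.1/name/{value}", timeout=10)
    return {"value": value, "valid": resp.ok}

# In the real script, register the function in its API_VALIDATORS dictionary
# so that api_config entries can reference it by name, e.g.:
# API_VALIDATORS["rest_countries"] = validate_country_name
```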
## Citation
If you use this skill in your research, please cite:
```bibtex
@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}
```
## License
MIT License - see LICENSE file for details