# Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

## Overview

This skill provides an end-to-end workflow for:
- Organizing PDF literature and metadata from various sources
- Filtering relevant papers based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)

## Quick Start

### 1. Installation

Create a conda environment:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

Or install with pip:

```bash
pip install -r requirements.txt
```

### 2. Set Up API Keys

Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

For geographic validation (optional):
```bash
export GEONAMES_USERNAME='your-geonames-username'
```
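
If you want to confirm the keys are visible before starting a long run, a quick check from Python works; the variable names match the exports above, and the check itself is just a convenience sketch:

```python
import os

# Confirm credentials are present before starting a run.
for var, required in [("ANTHROPIC_API_KEY", True), ("GEONAMES_USERNAME", False)]:
    if os.environ.get(var):
        print(f"{var} is set")
    elif required:
        raise SystemExit(f"{var} is missing - export it before running the pipeline")
    else:
        print(f"{var} not set (optional, only needed for geographic validation)")
```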

### 3. Run the Skill

The easiest way to use the skill is through Claude Code:

```bash
claude-code
```

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.

## Documentation

The skill includes comprehensive reference documentation:

- `references/setup_guide.md` - Installation and configuration
- `references/workflow_guide.md` - Complete step-by-step workflow with examples
- `references/validation_guide.md` - Validation methodology and metrics interpretation
- `references/api_reference.md` - External API integration details

## Manual Workflow

You can also run the scripts manually:

### Step 1: Organize Metadata

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### Step 2: Filter Papers (Optional)

First, customize the filtering prompt in `scripts/02_filter_abstracts.py` for your use case.

**Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)**
```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Local Model via Ollama (FREE)**
```bash
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

Recommended Ollama models:
- `llama3.1:8b` - Good balance (8GB RAM)
- `mistral:7b` - Fast, good for simple filtering
- `qwen2.5:7b` - Good multilingual support
- `llama3.1:70b` - Better accuracy (64GB RAM)
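
Before launching the filtering script against Ollama, it can help to confirm the local server is running and the model has been pulled. A minimal sketch, assuming Ollama's default endpoint at `http://localhost:11434` and the `requests` package:

```python
import requests

# Ollama's default local endpoint; adjust if you run the server elsewhere.
OLLAMA_URL = "http://localhost:11434"

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)
if not any(name.startswith("llama3.1:8b") for name in models):
    print("llama3.1:8b not found - run: ollama pull llama3.1:8b")
```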

### Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing `assets/schema_template.json`.

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```

### Step 4: Repair JSON

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```

### Step 5: Validate with APIs

First, create your API configuration by copying and customizing `assets/api_config_template.json`.

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```

### Step 6: Export

```bash
# For Python/pandas
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results

# For R
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results

# For CSV
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```
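
Once the export finishes, the Python output can be inspected directly with pandas. A minimal sketch; the exact file names written by the export script are an assumption here (a pickle for `--format python`, a CSV for `--format csv`), so adjust them to whatever the script reports:

```python
import pandas as pd

# Load the flattened export; file names are assumptions based on the
# --output values used above.
df = pd.read_pickle("results.pkl")   # from --format python
# df = pd.read_csv("results.csv")    # from --format csv

print(df.shape)
print(df.columns.tolist())
print(df.head())
```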

### Validation & Quality Assurance (Optional but Recommended)

Validate extraction quality using precision and recall metrics:

#### Step 7: Prepare Validation Set

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Sampling strategies:
- `random` - Random sample
- `stratified` - Sample by extraction characteristics
- `diverse` - Maximize diversity

#### Step 8: Manual Annotation

1. Open `validation_set.json`
2. For each sampled paper:
   - Read the PDF
   - Fill in the `ground_truth` field with the correct extraction
   - Add the `annotator` name and `annotation_date`
   - Use `notes` for ambiguous cases
3. Save the file
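
For orientation, a completed entry might look roughly like the sketch below. The surrounding record structure is an assumption (check the file produced by Step 7); only the `ground_truth`, `annotator`, `annotation_date`, and `notes` fields are taken from the steps above:

```python
# Hypothetical shape of one annotated record in validation_set.json.
annotated_entry = {
    "paper_id": "smith_2021",                            # assumed identifier field
    "extracted": {"species": ["Apis mellifera"]},        # Claude's output, left untouched
    "ground_truth": {"species": ["Apis mellifera", "Bombus terrestris"]},
    "annotator": "A. Researcher",
    "annotation_date": "2025-01-15",
    "notes": "Second species only mentioned in a table footnote.",
}
```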

#### Step 9: Calculate Metrics

```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

This produces:
- **Precision**: % of extracted items that are correct
- **Recall**: % of true items that were extracted
- **F1 Score**: Harmonic mean of precision and recall (see the sketch below)
- **Per-field metrics**: Accuracy by field type
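
The metrics themselves are standard; for reference, this is how they reduce to counts of matched items (a sketch of the arithmetic, not the script's internals):

```python
# Items extracted by Claude vs. items in the manual ground truth for one field.
extracted = {"Apis mellifera", "Bombus terrestris", "Vespa crabro"}
truth = {"Apis mellifera", "Bombus terrestris", "Osmia bicornis"}

tp = len(extracted & truth)   # correct extractions
fp = len(extracted - truth)   # extracted but wrong
fn = len(truth - extracted)   # missed by extraction

precision = tp / (tp + fp)    # 2/3
recall = tp / (tp + fn)       # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```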

Use these metrics to:
- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate and improve the schema
- Report quality in publications

## Customization

### Creating Your Extraction Schema

1. Copy `assets/schema_template.json` to `my_schema.json`
2. Customize the following sections:
   - `objective`: What you're extracting
   - `system_context`: Your scientific domain
   - `instructions`: Step-by-step guidance for Claude
   - `output_schema`: JSON schema defining your data structure
   - `output_example`: Example of desired output

See `assets/example_flower_visitors_schema.json` for a real-world example.
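
As a starting point, the sketch below writes a bare-bones schema containing only the five sections listed above; everything else (field names, wording) is illustrative and should come from `assets/schema_template.json`, not from this snippet:

```python
import json

# Minimal illustrative schema; the five top-level keys come from the list
# above, all values are placeholders for your own study design.
schema = {
    "objective": "Extract flower-visitor records from pollination papers",
    "system_context": "You are an expert in plant-insect interactions.",
    "instructions": [
        "Read the full PDF, including tables and appendices.",
        "Report one record per plant-visitor pair.",
    ],
    "output_schema": {
        "type": "object",
        "properties": {
            "records": {"type": "array", "items": {"type": "object"}},
        },
    },
    "output_example": {"records": []},
}

with open("my_schema.json", "w") as fh:
    json.dump(schema, fh, indent=2)
```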

### Configuring API Validation

1. Copy `assets/api_config_template.json` to `my_api_config.json`
2. Map your schema fields to appropriate validation APIs
3. See the available APIs in `scripts/05_validate_with_apis.py` and `references/api_reference.md`

See `assets/example_api_config_ecology.json` for an ecology example.

## Cost Estimation

PDF processing uses approximately 1,500-3,000 tokens per page:

- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers (see the sketch below)
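
The figures above follow from simple multiplication. A back-of-envelope sketch, assuming roughly $3 per million input tokens for Sonnet (an assumption; check current pricing before budgeting):

```python
# Back-of-envelope cost estimate; the per-token price is an assumption,
# the page and token counts come from the figures above.
papers = 100
pages_per_paper = 10
tokens_per_page = 2500       # midpoint of the 1,500-3,000 range
price_per_m_input = 3.00     # assumed USD per million input tokens

total_tokens = papers * pages_per_paper * tokens_per_page
cost = total_tokens / 1_000_000 * price_per_m_input
print(f"{total_tokens:,} tokens ~ ${cost:.2f}")   # 2,500,000 tokens ~ $7.50
```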

Tips to reduce costs:
- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with `--use-caching`
- Use batch processing (`--method batches`)
- Consider using Haiku for simpler extractions

## Supported Data Sources

### Bibliography Formats
- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs

### Output Formats
- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database

### Validation APIs
- **Biology**: GBIF, World Flora Online, NCBI Gene
- **Geography**: GeoNames, OpenStreetMap Nominatim
- **Chemistry**: PubChem
- **Medicine**: (extensible - add your own)
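
For a sense of what these validators do, the sketch below checks a species name against GBIF's public name-matching endpoint (a real, keyless API); this is only an illustration, not the skill's own validator code:

```python
import requests

# GBIF species name matching - no API key required.
name = "Apis melifera"   # deliberately misspelled to show fuzzy matching
resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": name},
    timeout=10,
)
resp.raise_for_status()
match = resp.json()
print(match.get("matchType"), match.get("scientificName"), match.get("confidence"))
```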

## Examples

See the [beetle flower visitors repository](https://github.com/brunoasm/ARE_2026_beetle_flower_visitors) for a real-world example of this workflow in action.

## Troubleshooting

### PDF Size Limits
- Maximum file size: 32MB
- Maximum pages: 100
- Solution: Use chunked processing for larger PDFs, as sketched below
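
One way to chunk an oversized PDF before extraction, assuming the `pypdf` package (not a dependency of this skill, just a convenient tool for splitting):

```python
from pypdf import PdfReader, PdfWriter

# Split a large PDF into parts that stay under the 100-page limit.
reader = PdfReader("large_paper.pdf")
chunk_size = 90
num_pages = len(reader.pages)

for chunk_start in range(0, num_pages, chunk_size):
    writer = PdfWriter()
    for i in range(chunk_start, min(chunk_start + chunk_size, num_pages)):
        writer.add_page(reader.pages[i])
    part = chunk_start // chunk_size + 1
    with open(f"large_paper_part{part}.pdf", "wb") as fh:
        writer.write(fh)
```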

### JSON Parsing Errors
- The `json-repair` library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues
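
If you want to inspect a failing output by hand, the same `json-repair` library can be called directly; a minimal sketch:

```python
from json_repair import repair_json

# Typical model-output problems: trailing comma and an unquoted key.
broken = '{"species": ["Apis mellifera",], notes: "seen on Rosa"}'
fixed = repair_json(broken)
print(fixed)  # a valid JSON string that json.loads() will accept
```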

### API Rate Limits
- Add delays between requests (implemented in the scripts)
- Use batch processing when available
- Check specific API documentation for limits

## Contributing

To add support for additional validation APIs:
1. Add a validator function to `scripts/05_validate_with_apis.py`
2. Register it in the `API_VALIDATORS` dictionary
3. Update `api_config_template.json` with examples (a sketch of the pattern follows)
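
As an illustration of the pattern, a new validator might look like the sketch below. The exact signature expected by `05_validate_with_apis.py` is an assumption here, so mirror one of the existing validators in that file rather than copying this verbatim:

```python
import requests

# Shown only so the sketch runs standalone; in the real script this
# dictionary already exists in scripts/05_validate_with_apis.py.
API_VALIDATORS = {}

def validate_mineral_name(value: str) -> dict:
    """Hypothetical validator: look a value up in a placeholder public API."""
    resp = requests.get(
        "https://api.example.org/minerals",  # placeholder endpoint, not a real service
        params={"q": value},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json()
    return {"input": value, "valid": bool(hits), "match": hits[0] if hits else None}

# Step 2 from the list above: register the function under a name that
# my_api_config.json can reference.
API_VALIDATORS["mineral_name"] = validate_mineral_name
```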

## Citation

If you use this skill in your research, please cite:

```bibtex
@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}
```

## License

MIT License - see the LICENSE file for details