Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

Overview

This skill provides an end-to-end workflow for:

  • Organizing PDF literature and metadata from various sources
  • Filtering relevant papers based on abstract content (optional)
  • Extracting structured data from full PDFs using Claude's vision capabilities
  • Repairing and validating JSON outputs
  • Enriching data with external scientific databases
  • Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)

Quick Start

1. Installation

Create a conda environment:

conda env create -f environment.yml
conda activate pdf_extraction

Or install with pip:

pip install -r requirements.txt

2. Setup API Keys

Set your Anthropic API key:

export ANTHROPIC_API_KEY='your-api-key-here'

For geographic validation (optional):

export GEONAMES_USERNAME='your-geonames-username'

3. Run the Skill

The easiest way is to use the skill through Claude Code:

claude

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.

Documentation

The skill includes comprehensive reference documentation:

  • references/setup_guide.md - Installation and configuration
  • references/workflow_guide.md - Complete step-by-step workflow with examples
  • references/validation_guide.md - Validation methodology and metrics interpretation
  • references/api_reference.md - External API integration details

Manual Workflow

You can also run the scripts manually:

Step 1: Organize Metadata

python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source path/to/library.bib \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json

Step 2: Filter Papers (Optional)

First, customize the filtering prompt in scripts/02_filter_abstracts.py for your use case.

Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json

Option B: Local Model via Ollama (FREE)

# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend ollama \
  --ollama-model llama3.1:8b \
  --output filtered_papers.json

Recommended Ollama models:

  • llama3.1:8b - Good balance (8GB RAM)
  • mistral:7b - Fast, good for simple filtering
  • qwen2.5:7b - Good multilingual support
  • llama3.1:70b - Better accuracy (64GB RAM)
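
Before running the Ollama option, it can help to confirm the local server is reachable and the model has been pulled. A minimal sketch, assuming Ollama's default endpoint (http://localhost:11434) and the third-party requests package; neither is required by the script itself:

# Optional sanity check for the Ollama backend.
# Assumes Ollama's default port (11434) and the "requests" package.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Locally available Ollama models:", models)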

Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing assets/schema_template.json.

python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json

Step 4: Repair JSON

python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json

Step 5: Validate with APIs

First, create your API configuration by copying and customizing assets/api_config_template.json.

python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json

Step 6: Export

# For Python/pandas
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --flatten \
  --output results

# For R
python scripts/06_export_database.py \
  --input validated_data.json \
  --format r \
  --flatten \
  --output results

# For CSV
python scripts/06_export_database.py \
  --input validated_data.json \
  --format csv \
  --flatten \
  --output results.csv
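
Once exported, the results can be loaded straight into your analysis environment. A minimal sketch, assuming the Python export produced a pandas pickle named results.pkl and the CSV export produced results.csv; confirm the actual filenames written by 06_export_database.py:

# Load the exported data for analysis.
# The filenames below (results.pkl, results.csv) are assumptions -- check the export output.
import pandas as pd

df = pd.read_pickle("results.pkl")    # from --format python
# df = pd.read_csv("results.csv")     # from --format csv
print(df.shape)
print(df.columns.tolist())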

Steps 7-9 validate extraction quality using precision and recall metrics:

Step 7: Prepare Validation Set

python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json

Sampling strategies:

  • random - Random sample
  • stratified - Sample by extraction characteristics
  • diverse - Maximize diversity

Step 8: Manual Annotation

  1. Open validation_set.json
  2. For each sampled paper:
    • Read the PDF
    • Fill in ground_truth field with correct extraction
    • Add annotator name and annotation_date
    • Use notes for ambiguous cases
  3. Save the file
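
For orientation, a single annotated record might look roughly like the sketch below. Only the ground_truth, annotator, annotation_date, and notes fields come from the steps above; the surrounding keys and all values are invented, so follow the structure that 07_prepare_validation_set.py actually writes:

# Hypothetical shape of one annotated entry in validation_set.json (values invented).
annotated_entry = {
    "paper_id": "smith_2020",                       # assumed identifier field
    "extracted": {"species": ["Apis mellifera"]},   # what the model extracted
    "ground_truth": {"species": ["Apis mellifera", "Bombus terrestris"]},  # filled in by hand
    "annotator": "Jane Doe",
    "annotation_date": "2025-01-15",
    "notes": "Second species appears only in a figure caption.",
}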

Step 9: Calculate Metrics

python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt

This produces:

  • Precision: % of extracted items that are correct
  • Recall: % of true items that were extracted
  • F1 Score: Harmonic mean of precision and recall
  • Per-field metrics: Accuracy by field type
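
As a quick sanity check on the reported numbers, the three headline metrics are related by the standard F1 formula; the values below are made up purely for illustration:

# Illustrative arithmetic only -- not part of the pipeline.
precision = 0.90   # 90% of extracted items were correct
recall = 0.80      # 80% of true items were extracted
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.847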

Use these metrics to:

  • Identify weak points in extraction prompts
  • Compare models (Haiku vs Sonnet vs Ollama)
  • Iterate and improve schema
  • Report quality in publications

Customization

Creating Your Extraction Schema

  1. Copy assets/schema_template.json to my_schema.json
  2. Customize the following sections:
    • objective: What you're extracting
    • system_context: Your scientific domain
    • instructions: Step-by-step guidance for Claude
    • output_schema: JSON schema defining your data structure
    • output_example: Example of desired output

See assets/example_flower_visitors_schema.json for a real-world example.
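
For a rough sense of the shape, a customized schema might look like the sketch below. Only the five top-level keys come from the list above; every value is invented, so start from schema_template.json rather than from this snippet:

# Hypothetical minimal schema, saved as my_schema.json.
import json

my_schema = {
    "objective": "Extract flower-visitor records from ecology papers",
    "system_context": "You are an expert in pollination ecology.",
    "instructions": "For each paper, list every visitor species and the plant it was recorded on.",
    "output_schema": {
        "type": "object",
        "properties": {"records": {"type": "array", "items": {"type": "object"}}},
    },
    "output_example": {"records": [{"visitor": "Apis mellifera", "plant": "Trifolium pratense"}]},
}

with open("my_schema.json", "w") as f:
    json.dump(my_schema, f, indent=2)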

Configuring API Validation

  1. Copy assets/api_config_template.json to my_api_config.json
  2. Map your schema fields to appropriate validation APIs
  3. See available APIs in scripts/05_validate_with_apis.py and references/api_reference.md

See assets/example_api_config_ecology.json for an ecology example.
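
In the same spirit, a field-to-API mapping might look roughly like this; the key layout is an assumption (only the API names appear elsewhere in this README), so mirror api_config_template.json rather than this sketch:

# Hypothetical API configuration, saved as my_api_config.json.
import json

api_config = {
    "field_mappings": {
        "visitor_species": {"api": "gbif", "type": "taxonomic_name"},
        "study_location": {"api": "geonames", "type": "place_name"},
    }
}

with open("my_api_config.json", "w") as f:
    json.dump(api_config, f, indent=2)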

Cost Estimation

PDF processing consumes approximately 1,500-3,000 tokens per page:

  • 10-page paper: ~20,000-30,000 tokens
  • 100 papers: ~2-3M tokens
  • With Sonnet 4.5: ~$6-9 for 100 papers
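
Those figures follow from simple arithmetic. A back-of-envelope sketch using the approximate numbers above (the per-million price is an approximation for Sonnet input tokens; confirm current pricing before budgeting):

# Rough cost estimate using the approximate figures quoted above.
pages_per_paper = 10
tokens_per_page = 2500          # midpoint of the 1,500-3,000 range
n_papers = 100
price_per_million_usd = 3.00    # approximate Sonnet input price; check current pricing

total_tokens = pages_per_paper * tokens_per_page * n_papers   # 2,500,000 tokens
print(f"~${total_tokens / 1e6 * price_per_million_usd:.2f}")  # ~$7.50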

Tips to reduce costs:

  • Use abstract filtering (Step 2) to reduce full PDF processing
  • Enable prompt caching with --use-caching
  • Use batch processing (--method batches)
  • Consider using Haiku for simpler extractions

Supported Data Sources

Bibliography Formats

  • BibTeX (Zotero, JabRef, etc.)
  • RIS (Mendeley, EndNote, etc.)
  • Directory of PDFs
  • List of DOIs

Output Formats

  • Python (pandas DataFrame pickle)
  • R (RDS file)
  • CSV
  • JSON
  • Excel
  • SQLite database

Validation APIs

  • Biology: GBIF, World Flora Online, NCBI Gene
  • Geography: GeoNames, OpenStreetMap Nominatim
  • Chemistry: PubChem
  • Medicine: (extensible - add your own)

Examples

See the beetle flower visitors repository for a real-world example of this workflow in action.

Troubleshooting

PDF Size Limits

  • Maximum file size: 32MB
  • Maximum pages: 100
  • Solution: Use chunked processing for larger PDFs
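
One way to chunk an oversized PDF before extraction is sketched below; it assumes the third-party pypdf package, which is not necessarily what the pipeline uses internally:

# Split a large PDF into chunks of at most 100 pages.
# Assumes the "pypdf" package; the pipeline's own chunking may differ.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("large_paper.pdf")
n_pages = len(reader.pages)
for start in range(0, n_pages, 100):
    writer = PdfWriter()
    for i in range(start, min(start + 100, n_pages)):
        writer.add_page(reader.pages[i])
    with open(f"large_paper_part{start // 100 + 1}.pdf", "wb") as f:
        writer.write(f)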

JSON Parsing Errors

  • The json-repair library handles most common issues
  • Check your schema validation
  • Review Claude's analysis output for clues
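
If an output file still will not parse, the json-repair package mentioned above can be applied by hand. A minimal sketch (04_repair_json.py also performs schema-aware cleanup beyond this, and the input filename here is hypothetical):

# Manual repair of a malformed model response using json-repair.
from json_repair import repair_json
import json

with open("broken_response.txt") as f:   # hypothetical file holding a malformed response
    broken = f.read()

fixed = repair_json(broken)   # returns a repaired JSON string
data = json.loads(fixed)
print(type(data))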

API Rate Limits

  • Add delays between requests (implemented in scripts)
  • Use batch processing when available
  • Check specific API documentation for limits

Contributing

To add support for additional validation APIs:

  1. Add validator function to scripts/05_validate_with_apis.py
  2. Register in API_VALIDATORS dictionary
  3. Update api_config_template.json with examples
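
As a sketch of steps 1 and 2, a new validator might look like the following; the function signature, return shape, and endpoint are all assumptions, so match the existing validators in scripts/05_validate_with_apis.py:

# Hypothetical validator added to scripts/05_validate_with_apis.py.
import requests

def validate_with_mydb(value: str) -> dict:
    # Look up a value against a hypothetical external database.
    resp = requests.get("https://api.example.org/lookup", params={"q": value}, timeout=10)
    resp.raise_for_status()
    hits = resp.json().get("results", [])
    return {"valid": bool(hits), "matched": hits[0] if hits else None}

API_VALIDATORS["mydb"] = validate_with_mydb   # register under a new key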

Citation

If you use this skill in your research, please cite:

@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}

License

MIT License - see LICENSE file for details