# Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

## Overview

This skill provides an end-to-end workflow for:

- Organizing PDF literature and metadata from various sources
- Filtering relevant papers based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)

## Quick Start

### 1. Installation

Create a conda environment:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

Or install with pip:

```bash
pip install -r requirements.txt
```

### 2. Set Up API Keys

Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

For geographic validation (optional):

```bash
export GEONAMES_USERNAME='your-geonames-username'
```

### 3. Run the Skill

The easiest way is to use the skill through Claude Code:

```bash
claude-code
```

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.

## Documentation

The skill includes comprehensive reference documentation:

- `references/setup_guide.md` - Installation and configuration
- `references/workflow_guide.md` - Complete step-by-step workflow with examples
- `references/validation_guide.md` - Validation methodology and metrics interpretation
- `references/api_reference.md` - External API integration details

## Manual Workflow

You can also run the scripts manually:

### Step 1: Organize Metadata

```bash
python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source path/to/library.bib \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json
```

### Step 2: Filter Papers (Optional)

First, customize the filtering prompt in `scripts/02_filter_abstracts.py` for your use case.

**Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)**

```bash
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json
```

**Option B: Local Model via Ollama (FREE)**

```bash
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend ollama \
  --ollama-model llama3.1:8b \
  --output filtered_papers.json
```

Recommended Ollama models:

- `llama3.1:8b` - Good balance (8GB RAM)
- `mistral:7b` - Fast, good for simple filtering
- `qwen2.5:7b` - Good multilingual support
- `llama3.1:70b` - Better accuracy (64GB RAM)

### Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing `assets/schema_template.json`.

```bash
python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json
```
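Conceptually, this step sends each PDF to Claude as a base64-encoded document block together with a prompt built from your schema. The sketch below is illustrative only, not the script's actual code: the model name, prompt wording, and file paths are placeholders, and the real script additionally handles batch submission (`--method batches`) and prompt caching.

```python
# Minimal single-PDF extraction sketch (illustrative; see
# scripts/03_extract_from_pdfs.py for the real implementation).
import base64
import json
import os

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Load the PDF and the extraction schema you created from the template.
with open("paper.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
with open("my_schema.json") as f:
    schema = json.load(f)

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            # The PDF itself, passed as a document content block.
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            # The extraction instructions derived from your schema.
            {"type": "text",
             "text": "Extract data from this paper following this schema:\n"
                     + json.dumps(schema)},
        ],
    }],
)

# Raw JSON-ish output; Step 4 repairs and validates it.
print(response.content[0].text)
```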
### Step 4: Repair JSON

```bash
python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json
```

### Step 5: Validate with APIs

First, create your API configuration by copying and customizing `assets/api_config_template.json`.

```bash
python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json
```

### Step 6: Export

```bash
# For Python/pandas
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --flatten \
  --output results

# For R
python scripts/06_export_database.py \
  --input validated_data.json \
  --format r \
  --flatten \
  --output results

# For CSV
python scripts/06_export_database.py \
  --input validated_data.json \
  --format csv \
  --flatten \
  --output results.csv
```

### Validation & Quality Assurance (Optional but Recommended)

Validate extraction quality using precision and recall metrics:

#### Step 7: Prepare Validation Set

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json
```

Sampling strategies:

- `random` - Random sample
- `stratified` - Sample by extraction characteristics
- `diverse` - Maximize diversity

#### Step 8: Manual Annotation

1. Open `validation_set.json`
2. For each sampled paper:
   - Read the PDF
   - Fill in the `ground_truth` field with the correct extraction
   - Add `annotator` name and `annotation_date`
   - Use `notes` for ambiguous cases
3. Save the file

#### Step 9: Calculate Metrics

```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt
```

This produces:

- **Precision**: % of extracted items that are correct
- **Recall**: % of true items that were extracted
- **F1 Score**: Harmonic mean of precision and recall
- **Per-field metrics**: Accuracy by field type
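Concretely, these headline scores reduce to simple ratios over true-positive (TP), false-positive (FP), and false-negative (FN) counts. The sketch below shows only the definitions; the per-field matching logic in `scripts/08_calculate_validation_metrics.py` is more involved.

```python
# Metric definitions used in the validation report (sketch only).
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct / all extracted
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # correct / all ground truth
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)           # harmonic mean
    return precision, recall, f1

# Example: 18 correct extractions, 2 spurious, 4 missed
p, r, f1 = precision_recall_f1(tp=18, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# precision=0.90 recall=0.82 F1=0.86
```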
Use these metrics to:

- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate and improve schema
- Report quality in publications

## Customization

### Creating Your Extraction Schema

1. Copy `assets/schema_template.json` to `my_schema.json`
2. Customize the following sections:
   - `objective`: What you're extracting
   - `system_context`: Your scientific domain
   - `instructions`: Step-by-step guidance for Claude
   - `output_schema`: JSON schema defining your data structure
   - `output_example`: Example of desired output

See `assets/example_flower_visitors_schema.json` for a real-world example.

### Configuring API Validation

1. Copy `assets/api_config_template.json` to `my_api_config.json`
2. Map your schema fields to appropriate validation APIs
3. See available APIs in `scripts/05_validate_with_apis.py` and `references/api_reference.md`

See `assets/example_api_config_ecology.json` for an ecology example.

## Cost Estimation

PDF processing costs approximately 1,500-3,000 tokens per page:

- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers

Tips to reduce costs:

- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with `--use-caching`
- Use batch processing (`--method batches`)
- Consider using Haiku for simpler extractions

## Supported Data Sources

### Bibliography Formats

- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs

### Output Formats

- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database

### Validation APIs

- **Biology**: GBIF, World Flora Online, NCBI Gene
- **Geography**: GeoNames, OpenStreetMap Nominatim
- **Chemistry**: PubChem
- **Medicine**: (extensible - add your own)

## Examples

See the [beetle flower visitors repository](https://github.com/brunoasm/ARE_2026_beetle_flower_visitors) for a real-world example of this workflow in action.

## Troubleshooting

### PDF Size Limits

- Maximum file size: 32MB
- Maximum pages: 100
- Solution: Use chunked processing for larger PDFs

### JSON Parsing Errors

- The `json-repair` library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues

### API Rate Limits

- Add delays between requests (implemented in scripts)
- Use batch processing when available
- Check specific API documentation for limits

## Contributing

To add support for additional validation APIs:

1. Add a validator function to `scripts/05_validate_with_apis.py`
2. Register it in the `API_VALIDATORS` dictionary
3. Update `api_config_template.json` with examples

## Citation

If you use this skill in your research, please cite:

```bibtex
@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}
```

## License

MIT License - see LICENSE file for details