Setup Guide for PDF Data Extraction
Installation
Using Conda (Recommended)
Create a dedicated environment for the extraction pipeline:
conda env create -f environment.yml
conda activate pdf_extraction
Using pip
pip install -r requirements.txt
Required Dependencies
Core Dependencies
- anthropic>=0.40.0 - Anthropic API client
- pybtex>=0.24.0 - BibTeX file handling
- rispy>=0.6.0 - RIS file handling
- json-repair>=0.25.0 - JSON repair and validation
- jsonschema>=4.20.0 - JSON schema validation
- pandas>=2.0.0 - Data processing
- requests>=2.31.0 - HTTP requests for APIs
Export Dependencies
- openpyxl>=3.1.0 - Excel export
- pyreadr>=0.5.0 - R RDS export
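The minimum versions listed above can also be checked from Python. The following is a minimal sketch, not part of the pipeline itself: the helper names (`version_tuple`, `installed_version`, `find_problems`) are hypothetical, and the version comparison is deliberately simplified (it only looks at the leading numeric parts, so pre-release suffixes are ignored).

```python
from importlib import metadata

# Minimum versions copied from the dependency lists above.
MINIMUM_VERSIONS = {
    "anthropic": "0.40.0",
    "pybtex": "0.24.0",
    "rispy": "0.6.0",
    "json-repair": "0.25.0",
    "jsonschema": "4.20.0",
    "pandas": "2.0.0",
    "requests": "2.31.0",
    "openpyxl": "3.1.0",
    "pyreadr": "0.5.0",
}


def version_tuple(version):
    """Turn '2.31.0' into (2, 31, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that '2.0' and '2.0.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_problems(minimums, lookup=installed_version):
    """List human-readable problems; an empty list means all minimums are met."""
    problems = []
    for package, minimum in minimums.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed (need >={minimum})")
        elif version_tuple(found) < version_tuple(minimum):
            problems.append(f"{package}: found {found}, need >={minimum}")
    return problems


if __name__ == "__main__":
    for line in find_problems(MINIMUM_VERSIONS) or ["All dependencies OK"]:
        print(line)
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in normal use the default is enough.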
API Keys Setup
Anthropic API Key (Required for Claude backends)
export ANTHROPIC_API_KEY='your-api-key-here'
Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
GeoNames Username (Optional - for geographic validation)
- Register at https://www.geonames.org/login
- Enable web services in your account
- Set environment variable:
export GEONAMES_USERNAME='your-username'
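A script can confirm that these variables are visible before any extraction starts. The sketch below assumes only the two variable names given above; `check_credentials` is a hypothetical helper, not a function provided by the pipeline. It treats the Anthropic key as required and the GeoNames username as optional.

```python
import os


def check_credentials(env=None):
    """Return (errors, warnings) for the environment variables described above."""
    env = os.environ if env is None else env
    errors, warnings = [], []
    # Required when a Claude backend is used.
    if not env.get("ANTHROPIC_API_KEY", "").strip():
        errors.append("ANTHROPIC_API_KEY is not set (required for Claude backends)")
    # Optional; only used for geographic validation.
    if not env.get("GEONAMES_USERNAME", "").strip():
        warnings.append("GEONAMES_USERNAME is not set (optional, used for geographic validation)")
    return errors, warnings


if __name__ == "__main__":
    errors, warnings = check_credentials()
    for message in warnings:
        print("warning:", message)
    for message in errors:
        print("error:", message)
```

This only checks that the variables are present; it does not confirm that the key or username is accepted by the remote service.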
Local Model Setup (Ollama)
For free, private, offline abstract filtering:
Installation
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from https://ollama.com/download
Pulling Models
# Recommended models
ollama pull llama3.1:8b # Good balance (8GB RAM)
ollama pull mistral:7b # Fast, simple filtering
ollama pull qwen2.5:7b # Multilingual support
ollama pull llama3.1:70b # Best accuracy (64GB RAM)
Starting Ollama Server
The server usually starts automatically after installation. To start it manually:
ollama serve
The server runs at http://localhost:11434 by default.
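To check from Python that the server is reachable and see which models have been pulled, one option is to query the `/api/tags` endpoint at the default address. This is a rough sketch using only the standard library; `model_names` and `list_local_models` are hypothetical helper names, and the response is assumed to be a JSON object with a `models` list whose entries carry a `name` field.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address stated above


def model_names(payload):
    """Extract model names from a parsed /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(base_url=OLLAMA_URL, timeout=5.0):
    """Return the pulled model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama server not reachable; try running: ollama serve")
    else:
        print("Available models:", ", ".join(models) or "(none pulled yet)")
```

Returning `None` for an unreachable server and an empty list for a running server with no models keeps the two situations distinguishable.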
Verifying Installation
Test that all components are properly installed:
# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests; print('All core dependencies OK')"
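# Test Anthropic client setup (confirms the key is set; does not call the API)
python -c "import os; from anthropic import Anthropic; assert os.environ.get('ANTHROPIC_API_KEY'), 'ANTHROPIC_API_KEY is not set'; client = Anthropic(); print('API key found, client created')"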
# Test Ollama (if using)
curl http://localhost:11434/api/tags
Directory Structure
The skill works with PDFs and metadata organized in any of the following ways:
Option A: Reference Manager Export
project/
├── library.bib    # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
Option B: Simple Directory
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
Option C: DOI List
project/
└── dois.txt # One DOI per line
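As an illustration of how the three layouts differ, the sketch below guesses which option a project folder follows. It is not the skill's own detection logic: `detect_layout` and `read_dois` are hypothetical helpers, and the check assumes the file and folder names shown above (`pdfs/`, a `.bib` file at the top level, `dois.txt`).

```python
from pathlib import Path


def read_dois(path):
    """Read one DOI per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def detect_layout(project):
    """Return 'A', 'B', 'C', or None for the three layouts described above."""
    project = Path(project)
    pdf_dir = project / "pdfs"
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    has_bib = any(project.glob("*.bib"))
    if has_pdfs and has_bib:
        return "A"  # reference manager export plus PDFs
    if has_pdfs:
        return "B"  # PDFs only
    if (project / "dois.txt").is_file():
        return "C"  # DOI list
    return None


if __name__ == "__main__":
    print("Detected layout:", detect_layout("project"))
```

Only BibTeX exports are recognized for Option A here; a reference manager export in another format would need an extra check.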
Next Steps
After installation, proceed to the workflow guide to start extracting data from your PDFs.
See: references/workflow_guide.md