# Setup Guide for PDF Data Extraction

## Installation

### Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

### Using pip

```bash
pip install -r requirements.txt
```

## Required Dependencies

### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs

### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export

## API Keys Setup

### Anthropic API Key (Required for Claude backends)

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Add to your shell profile (`~/.bashrc`, `~/.zshrc`) for persistence:

```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```

### GeoNames Username (Optional - for geographic validation)

1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set environment variable:

```bash
export GEONAMES_USERNAME='your-username'
```
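
A script can check these variables up front and fail with a clear message instead of partway through a run. The following is a minimal sketch, not part of the skill itself: the function name `check_credentials` and the report keys are hypothetical, and only the two variable names come from this guide.

```python
import os
from typing import Mapping

# Variable names taken from this guide; the grouping is an assumption.
REQUIRED_VARS = ("ANTHROPIC_API_KEY",)   # needed for Claude backends
OPTIONAL_VARS = ("GEONAMES_USERNAME",)   # only for geographic validation


def check_credentials(env: Mapping[str, str]) -> dict[str, list[str]]:
    """Report which variables are unset or contain only whitespace."""
    def missing(names):
        return [name for name in names if not env.get(name, "").strip()]

    return {
        "missing_required": missing(REQUIRED_VARS),
        "missing_optional": missing(OPTIONAL_VARS),
    }


if __name__ == "__main__":
    report = check_credentials(os.environ)
    for name in report["missing_required"]:
        print(f"ERROR: {name} is not set")
    for name in report["missing_optional"]:
        print(f"note: {name} is not set (optional)")
```

This only confirms that the variables are present; it does not verify that the key or username is accepted by the remote service.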

## Local Model Setup (Ollama)

Ollama provides free, private, offline abstract filtering.

### Installation

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download the installer from https://ollama.com/download

### Pulling Models

```bash
# Recommended models
ollama pull llama3.1:8b   # Good balance (8GB RAM)
ollama pull mistral:7b    # Fast, simple filtering
ollama pull qwen2.5:7b    # Multilingual support
ollama pull llama3.1:70b  # Best accuracy (64GB RAM)
```

### Starting Ollama Server

The server usually starts automatically. To start it manually:

```bash
ollama serve
```

The server runs at http://localhost:11434 by default.
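
To confirm from Python that the server is reachable and see which models have been pulled, the `/api/tags` endpoint can be queried with the standard library. This is a sketch under the assumption that the response is a JSON object with a `models` list whose entries carry a `name` field; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this guide


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    try:
        print(list_local_models())
    except (OSError, ValueError) as exc:
        # OSError covers refused connections and timeouts;
        # ValueError covers a response that is not valid JSON.
        print(f"Ollama server not usable at {OLLAMA_URL}: {exc}")
```

An empty list means the server is running but no models have been pulled yet.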

## Verifying Installation

Test that all components are properly installed:

```bash
# Test core Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests; print('Core dependencies OK')"

# Test Anthropic client setup (checks that the key is set; makes no API call, so the key itself is not validated)
python -c "import os; from anthropic import Anthropic; assert os.environ.get('ANTHROPIC_API_KEY'), 'ANTHROPIC_API_KEY is not set'; Anthropic(); print('Anthropic client initialized')"

# Test Ollama (if using)
curl http://localhost:11434/api/tags
```
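
The import check above confirms that the packages load, but not that they meet the minimum versions listed under Required Dependencies. A rough version check can be sketched with `importlib.metadata`. The helper names are hypothetical, and the comparison looks only at the leading numeric components of each version string, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata

# Minimum versions copied from the "Required Dependencies" section.
MINIMUMS = {
    "anthropic": "0.40.0", "pybtex": "0.24.0", "rispy": "0.6.0",
    "json-repair": "0.25.0", "jsonschema": "4.20.0", "pandas": "2.0.0",
    "requests": "2.31.0", "openpyxl": "3.1.0", "pyreadr": "0.5.0",
}


def version_tuple(version: str) -> tuple[int, ...]:
    """Keep only the leading dotted numbers, e.g. '2.1.0rc1' gives (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions after zero-padding both to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_versions(minimums: dict[str, str] = MINIMUMS) -> dict[str, str]:
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    issues = check_versions()
    print(issues if issues else "All minimum versions satisfied")
```

For strict comparison of pre-release or post-release versions, a dedicated version parser would be a better fit than this simplified tuple comparison.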

## Directory Structure

The skill works with PDFs and metadata organized in any of the following ways:

### Option A: Reference Manager Export
```
project/
├── library.bib          # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```

### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```

### Option C: DOI List
```
project/
└── dois.txt             # One DOI per line
```
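
As an illustration of how the three layouts differ, the sketch below classifies a project directory by looking for the files shown above. It is not the skill's own detection logic: the function names are hypothetical, and treating a `.ris` file the same as a `.bib` export is an assumption based on the RIS dependency listed earlier.

```python
from pathlib import Path


def detect_layout(project: Path) -> str:
    """Classify a project directory as 'A', 'B', 'C', or 'unknown'."""
    pdf_dir = project / "pdfs"
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    # Assumption: any .bib or .ris file counts as a reference manager export.
    has_refs = any(project.glob("*.bib")) or any(project.glob("*.ris"))
    if has_pdfs and has_refs:
        return "A"  # reference manager export plus PDFs
    if has_pdfs:
        return "B"  # PDFs only
    if (project / "dois.txt").is_file():
        return "C"  # DOI list
    return "unknown"


def read_dois(path: Path) -> list[str]:
    """Read one DOI per line, skipping blank lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Calling `detect_layout(Path("project"))` returns a single letter, which a driver script could use to choose between parsing the bibliography, scanning the PDF folder, or reading `dois.txt`.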

## Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: `references/workflow_guide.md`