# Setup Guide for PDF Data Extraction

## Installation

### Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

### Using pip

```bash
pip install -r requirements.txt
```

## Required Dependencies

### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs

### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export
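
The minimum versions listed above can be checked against the active environment with a short script. The following is a minimal sketch using only the standard library. The `MINIMUMS` table mirrors the lists above, and the helper names (`parse_version`, `check_package`) are illustrative, not part of the skill. The comparison is deliberately naive: it looks only at the leading numeric components, so pre-release tags are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency lists above.
MINIMUMS = {
    "anthropic": "0.40.0",
    "pybtex": "0.24.0",
    "rispy": "0.6.0",
    "json-repair": "0.25.0",
    "jsonschema": "4.20.0",
    "pandas": "2.0.0",
    "requests": "2.31.0",
    "openpyxl": "3.1.0",
    "pyreadr": "0.5.0",
}


def parse_version(text):
    """Return the leading numeric components, padded to at least three."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_package(name, minimum):
    """Return 'ok', 'too old', or 'missing' for one distribution."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return "missing"
    return "ok" if parse_version(installed) >= parse_version(minimum) else "too old"


if __name__ == "__main__":
    for name, minimum in MINIMUMS.items():
        print(f"{name:<12} >= {minimum:<7} {check_package(name, minimum)}")
```

Any package reported as `missing` or `too old` can be fixed by re-running the installation step for your environment.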

## API Keys Setup

### Anthropic API Key (Required for Claude backends)

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

To make the key persist across sessions, add it to your shell profile (`~/.bashrc` or `~/.zshrc`). The example below uses `~/.bashrc`:

```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```

### GeoNames Username (Optional - for geographic validation)

1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set the environment variable:

```bash
export GEONAMES_USERNAME='your-username'
```
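
To confirm that both variables are visible to Python without printing their values, a small check like the one below may help. This is a sketch: `ENV_VARS` and `check_env` are hypothetical names introduced for illustration, and the required/optional flags simply restate the headings above.

```python
import os

# Variable names from this guide; the flag marks whether each one is required.
ENV_VARS = {
    "ANTHROPIC_API_KEY": True,   # required for Claude backends
    "GEONAMES_USERNAME": False,  # optional, geographic validation only
}


def check_env(environ=os.environ):
    """Map each variable to 'set', 'missing (required)', or 'missing (optional)'."""
    report = {}
    for name, required in ENV_VARS.items():
        if environ.get(name, "").strip():
            report[name] = "set"
        else:
            report[name] = "missing (required)" if required else "missing (optional)"
    return report


if __name__ == "__main__":
    for name, status in check_env().items():
        print(f"{name}: {status}")
```

The check only reports whether a value is present. It does not contact either service, so a mistyped key still shows as `set`.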

## Local Model Setup (Ollama)

Ollama runs models locally, which provides free, private, offline abstract filtering.

### Installation

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download from https://ollama.com/download

### Pulling Models

```bash
# Recommended models
ollama pull llama3.1:8b      # Good balance (8GB RAM)
ollama pull mistral:7b       # Fast, simple filtering
ollama pull qwen2.5:7b       # Multilingual support
ollama pull llama3.1:70b     # Best accuracy (64GB RAM)
```

### Starting Ollama Server

The server usually starts automatically after installation. To start it manually:

```bash
ollama serve
```

The server runs at http://localhost:11434 by default.
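
To check from Python whether the server is reachable and which models have been pulled, the `/api/tags` endpoint can be queried. The sketch below uses only the standard library. It assumes the response is a JSON object with a `models` list whose entries carry a `name` field, and the function names are illustrative.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this guide


def parse_model_names(payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return the pulled model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama server not reachable; try `ollama serve`")
    else:
        print("Available models:", ", ".join(models) or "(none pulled yet)")
```

Returning `None` for an unreachable server, rather than raising, keeps the check usable in setups where Ollama is optional.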

## Verifying Installation

Test that all components are properly installed:

```bash
# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests; print('All dependencies OK')"

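# Check that the Anthropic API key is set (this does not verify the key with the API)
python -c "import os; from anthropic import Anthropic; assert os.environ.get('ANTHROPIC_API_KEY'), 'ANTHROPIC_API_KEY is not set'; Anthropic(); print('API key found')"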

# Test Ollama (if using)
curl http://localhost:11434/api/tags
```

## Directory Structure

The skill accepts PDFs and metadata organized in any of the following layouts:

### Option A: Reference Manager Export
```
project/
├── library.bib              # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```

### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```

### Option C: DOI List
```
project/
└── dois.txt                 # One DOI per line
```
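
The three layouts above can be told apart by checking which files are present. The following is a hypothetical sketch, not the skill's actual detection logic. `detect_layout` and `read_dois` are illustrative names, and the file names (`pdfs/`, `*.bib`, `dois.txt`) are taken from the examples above.

```python
from pathlib import Path


def detect_layout(project_dir):
    """Guess which layout (A, B, or C above) a project directory follows."""
    root = Path(project_dir)
    pdf_dir = root / "pdfs"
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    has_bib = any(root.glob("*.bib"))

    if has_pdfs and has_bib:
        return "A: reference manager export"
    if has_pdfs:
        return "B: simple directory"
    if (root / "dois.txt").is_file():
        return "C: DOI list"
    return "unknown"


def read_dois(path):
    """Read one DOI per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    print(detect_layout("project"))
```

A project that contains both PDFs and a DOI list is reported by its PDF layout here. That ordering is an arbitrary choice for the sketch.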

## Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: `references/workflow_guide.md`