# Setup Guide for PDF Data Extraction
## Installation
### Using Conda (Recommended)
Create a dedicated environment for the extraction pipeline:
```bash
conda env create -f environment.yml
conda activate pdf_extraction
```
### Using pip
```bash
pip install -r requirements.txt
```
## Required Dependencies
### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs
### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export
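
If you need to recreate `requirements.txt`, the pins above can be collected into one file. The following is a sketch assembled from the list in this guide; the file shipped with the skill may differ:
```
# Assembled from the dependency list in this guide; the shipped requirements.txt may differ
anthropic>=0.40.0
pybtex>=0.24.0
rispy>=0.6.0
json-repair>=0.25.0
jsonschema>=4.20.0
pandas>=2.0.0
requests>=2.31.0
openpyxl>=3.1.0
pyreadr>=0.5.0
```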
## API Keys Setup
### Anthropic API Key (Required for Claude backends)
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```
Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:
```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```
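To confirm the variable is visible in your current shell, a quick check using standard shell tests is:
```bash
# Prints which case applies; uses only standard bash
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is missing"
```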
### GeoNames Username (Optional - for geographic validation)
1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set environment variable:
```bash
export GEONAMES_USERNAME='your-username'
```
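To confirm that web services are enabled for your account, you can issue a test query against the public GeoNames `searchJSON` endpoint (the search term below is arbitrary):
```bash
# Test query against the GeoNames searchJSON web service
curl "http://api.geonames.org/searchJSON?q=London&maxRows=1&username=$GEONAMES_USERNAME"
```
A JSON response with a `geonames` array indicates the username is active; an error message usually means web services have not been enabled yet.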
## Local Model Setup (Ollama)
For free, private, offline abstract filtering:
### Installation
**macOS:**
```bash
brew install ollama
```
**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
**Windows:**
Download from https://ollama.com/download
### Pulling Models
```bash
# Recommended models
ollama pull llama3.1:8b # Good balance (8GB RAM)
ollama pull mistral:7b # Fast, simple filtering
ollama pull qwen2.5:7b # Multilingual support
ollama pull llama3.1:70b # Best accuracy (64GB RAM)
```
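To confirm the downloads, list the locally available models:
```bash
ollama list
```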
### Starting Ollama Server
The server usually starts automatically, but it can also be started manually:
```bash
ollama serve
```
The server runs at http://localhost:11434 by default.
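Once the server is up, a minimal smoke test against the local API looks like this (assuming `llama3.1:8b` has already been pulled):
```bash
# Minimal generation request; assumes llama3.1:8b is available locally
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Reply with OK.",
  "stream": false
}'
```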
## Verifying Installation
Test that all components are properly installed:
```bash
# Test core Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests; print('All dependencies OK')"
# Test that the Anthropic client can be created (requires ANTHROPIC_API_KEY to be set)
python -c "from anthropic import Anthropic; client = Anthropic(); print('Anthropic client OK')"
# Test Ollama (if using)
curl http://localhost:11434/api/tags
```
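The export dependencies can be checked the same way:
```bash
# Test export dependencies (Excel and RDS output)
python -c "import openpyxl, pyreadr; print('Export dependencies OK')"
```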
## Directory Structure
The skill works with PDFs and metadata organized in any of the following ways:
### Option A: Reference Manager Export
```
project/
├── library.bib # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```
### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```
### Option C: DOI List
```
project/
└── dois.txt # One DOI per line
```
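The `dois.txt` file is plain text with one DOI per line, for example (placeholder identifiers shown, not real DOIs):
```
10.1234/example.2020.001
10.5678/example.2021.042
```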
## Next Steps
After installation, proceed to the workflow guide to start extracting data from your PDFs.
See: `references/workflow_guide.md`