Initial commit
skills/extract_from_pdfs/references/setup_guide.md
@@ -0,0 +1,147 @@
# Setup Guide for PDF Data Extraction

## Installation

### Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

### Using pip

If you prefer pip, install the dependencies into an existing Python environment (ideally a virtual environment):

```bash
pip install -r requirements.txt
```

## Required Dependencies

### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs

### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export
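
The minimum versions listed above can be checked against the active environment. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `at_least`, `check_dependencies`) are hypothetical and not part of the skill, and the comparison only looks at the numeric parts of a version string, so pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency lists above.
MINIMUM_VERSIONS = {
    "anthropic": "0.40.0",
    "pybtex": "0.24.0",
    "rispy": "0.6.0",
    "json-repair": "0.25.0",
    "jsonschema": "4.20.0",
    "pandas": "2.0.0",
    "requests": "2.31.0",
    "openpyxl": "3.1.0",
    "pyreadr": "0.5.0",
}


def parse_version(text):
    """Turn '2.31.0' into (2, 31, 0), keeping only the leading digits of each part."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def at_least(installed, minimum):
    """Return True if the installed version is greater than or equal to the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums, get_version=version):
    """Map each package name to 'ok', 'missing', or 'too old (found X)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if at_least(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old (found {installed})"
    return report


if __name__ == "__main__":
    for name, status in check_dependencies(MINIMUM_VERSIONS).items():
        print(f"{name}: {status}")
```

Running the script inside the `pdf_extraction` environment prints one status line per package, which makes it easy to spot a package that is missing or older than the stated minimum.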

## API Keys Setup

### Anthropic API Key (Required for Claude backends)

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

To persist the key across sessions, append it to your shell profile (`~/.bashrc` is shown here; use `~/.zshrc` for zsh):

```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```
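
Inside Python code, the exported key can be read from the environment with an explicit error when it is missing. This is a small sketch, not the skill's own loader; `require_env` and `mask` are hypothetical helper names introduced for illustration.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise with a setup hint."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell (see the commands above) and retry."
        )
    return value


def mask(secret, visible=4):
    """Hide all but the last few characters so the value is safe to print."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    key = require_env("ANTHROPIC_API_KEY")
    print("ANTHROPIC_API_KEY found:", mask(key))
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later, in the middle of an extraction run.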

### GeoNames Username (Optional - for geographic validation)

1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set the environment variable:

```bash
export GEONAMES_USERNAME='your-username'
```
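
Once the username is set, a single lookup confirms that web services are enabled for the account. The sketch below assumes the public GeoNames `searchJSON` endpoint and its `q`, `maxRows`, and `username` query parameters; the function names are hypothetical, and the way the skill itself calls GeoNames may differ.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public GeoNames JSON search service.
GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_search_url(place, username, max_rows=1):
    """Build a GeoNames search URL with properly encoded query parameters."""
    query = urlencode({"q": place, "maxRows": max_rows, "username": username})
    return f"{GEONAMES_SEARCH_URL}?{query}"


def lookup_place(place, username=None, timeout=10):
    """Return the list of matches for a place name (requires network access)."""
    username = username or os.environ.get("GEONAMES_USERNAME")
    if not username:
        raise RuntimeError("GEONAMES_USERNAME is not set")
    with urlopen(build_search_url(place, username), timeout=timeout) as response:
        payload = json.load(response)
    # GeoNames reports account problems in a "status" object instead of results.
    if "status" in payload:
        raise RuntimeError(payload["status"].get("message", "GeoNames request failed"))
    return payload.get("geonames", [])


if __name__ == "__main__":
    for match in lookup_place("Nairobi"):
        print(match.get("name"), match.get("countryName"))
```

If the call raises an error that mentions the account, web services are most likely not yet enabled (step 2 above).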

## Local Model Setup (Ollama)

Ollama runs models locally, which makes abstract filtering free, private, and available offline.

### Installation

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download the installer from https://ollama.com/download

### Pulling Models

```bash
# Recommended models
ollama pull llama3.1:8b      # Good balance (8GB RAM)
ollama pull mistral:7b       # Fast, simple filtering
ollama pull qwen2.5:7b       # Multilingual support
ollama pull llama3.1:70b     # Best accuracy (64GB RAM)
```

### Starting Ollama Server

The server usually starts automatically after installation. To start it manually:

```bash
ollama serve
```

The server runs at http://localhost:11434 by default.
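
A script can confirm that the server is reachable and see which models have been pulled by querying the `/api/tags` endpoint on that address. This is a minimal sketch using only the standard library; it assumes the response is a JSON object with a `models` list whose entries carry a `name` field, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload):
    """Extract sorted model names from a parsed /api/tags response."""
    return sorted(model["name"] for model in tags_payload.get("models", []))


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Return the names of pulled models, or None if the server is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return model_names(json.load(response))
    except OSError:
        # Covers refused connections and timeouts (URLError subclasses OSError).
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable; try running 'ollama serve'.")
    elif not models:
        print("Ollama is running, but no models are pulled yet.")
    else:
        print("Available models:", ", ".join(models))
```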

## Verifying Installation

Test that all components are properly installed:

```bash
# Test Python dependencies (core and export)
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests, openpyxl, pyreadr; print('All dependencies OK')"

# Check that the Anthropic API key is set (this does not verify the key with the API)
python -c "import os; from anthropic import Anthropic; assert os.environ.get('ANTHROPIC_API_KEY'), 'ANTHROPIC_API_KEY is not set'; client = Anthropic(); print('API key is set')"

# Test Ollama (if using)
curl http://localhost:11434/api/tags
```

## Directory Structure

The skill accepts PDFs and metadata organized in any of the following layouts:

### Option A: Reference Manager Export
```
project/
├── library.bib          # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```

### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```

### Option C: DOI List
```
project/
└── dois.txt             # One DOI per line
```
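
To make the three layouts concrete, the sketch below guesses which option a project directory follows and reads a DOI list. It is an illustration only: `detect_layout`, `read_dois`, and the returned labels are hypothetical names, and the check for a `.bib` file next to `pdfs/` is an assumption about how Option A could be recognized, not the skill's actual logic.

```python
from pathlib import Path


def detect_layout(project_dir):
    """Guess which of the layouts described above a project directory uses."""
    project = Path(project_dir)
    pdf_dir = project / "pdfs"
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    if has_pdfs and any(project.glob("*.bib")):
        return "reference_manager_export"  # Option A
    if has_pdfs:
        return "simple_directory"  # Option B
    if (project / "dois.txt").is_file():
        return "doi_list"  # Option C
    return "unknown"


def read_dois(path):
    """Read one DOI per line, skipping blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    layout = detect_layout("project")
    print("Detected layout:", layout)
    if layout == "doi_list":
        print(read_dois(Path("project") / "dois.txt"))
```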

## Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: `references/workflow_guide.md`