Setup Guide for PDF Data Extraction

Installation

Using conda

Create a dedicated conda environment for the extraction pipeline:

conda env create -f environment.yml
conda activate pdf_extraction

Using pip

pip install -r requirements.txt

Required Dependencies

Core Dependencies

  • anthropic>=0.40.0 - Anthropic API client
  • pybtex>=0.24.0 - BibTeX file handling
  • rispy>=0.6.0 - RIS file handling
  • json-repair>=0.25.0 - JSON repair and validation
  • jsonschema>=4.20.0 - JSON schema validation
  • pandas>=2.0.0 - Data processing
  • requests>=2.31.0 - HTTP requests for APIs

Export Dependencies

  • openpyxl>=3.1.0 - Excel export
  • pyreadr>=0.5.0 - R RDS export
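
To confirm that the installed packages meet the minimum versions listed above, a check along the following lines may help. It is a sketch, not part of the skill: the `MINIMUMS` table copies the version floors from this guide, the helper names (`parse_version`, `is_older`, `check_minimums`) are hypothetical, and the version parsing is deliberately simple (leading numeric components only, no pre-release ordering).

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency lists in this guide.
MINIMUMS = {
    "anthropic": "0.40.0",
    "pybtex": "0.24.0",
    "rispy": "0.6.0",
    "json-repair": "0.25.0",
    "jsonschema": "4.20.0",
    "pandas": "2.0.0",
    "requests": "2.31.0",
    "openpyxl": "3.1.0",
    "pyreadr": "0.5.0",
}


def parse_version(text):
    """Turn '2.31.0' into (2, 31, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed, minimum):
    """True if `installed` is strictly below `minimum` (zero-padded compare)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_minimums(minimums, get_version=metadata.version):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if is_older(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    issues = check_minimums(MINIMUMS)
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    if not issues:
        print("All listed packages meet the minimum versions")
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment; in normal use the default `importlib.metadata.version` lookup is enough.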

API Keys Setup

Anthropic API Key (Required for Claude backends)

export ANTHROPIC_API_KEY='your-api-key-here'

To persist the key across sessions, add it to your shell profile (~/.bashrc is shown here; use ~/.zshrc if your shell is zsh):

echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc

GeoNames Username (Optional - for geographic validation)

  1. Register at https://www.geonames.org/login
  2. Enable web services in your account
  3. Set environment variable:
export GEONAMES_USERNAME='your-username'
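
Before running the pipeline, it can be useful to confirm that these variables are visible to Python. The sketch below assumes only the two variable names given in this section; the helper `missing_variables` and the placeholder list are illustrative, not part of the skill. It treats the placeholder strings from the examples above as "not set", since copying them verbatim is an easy mistake.

```python
import os

REQUIRED_VARS = ["ANTHROPIC_API_KEY"]   # needed for Claude backends
OPTIONAL_VARS = ["GEONAMES_USERNAME"]   # only for geographic validation

# Placeholder values used in this guide's examples; they are not real credentials.
PLACEHOLDERS = {"your-api-key-here", "your-username"}


def missing_variables(names, env=None):
    """Return names that are unset, blank, or still set to a placeholder."""
    env = os.environ if env is None else env
    missing = []
    for name in names:
        value = env.get(name, "").strip()
        if not value or value in PLACEHOLDERS:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_variables(REQUIRED_VARS):
        print(f"Missing required variable: {name}")
    for name in missing_variables(OPTIONAL_VARS):
        print(f"Optional variable not set: {name}")
```

This only checks that a value is present; it does not contact either service, so a revoked or mistyped key would still pass.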

Local Model Setup (Ollama)

For free, private, offline abstract filtering:

Installation

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from https://ollama.com/download

Pulling Models

# Recommended models
ollama pull llama3.1:8b      # Good balance (8GB RAM)
ollama pull mistral:7b       # Fast, simple filtering
ollama pull qwen2.5:7b       # Multilingual support
ollama pull llama3.1:70b     # Best accuracy (64GB RAM)

Starting Ollama Server

The Ollama server usually starts automatically after installation. To start it manually:

ollama serve

The server runs at http://localhost:11434 by default.
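
To check from Python that the server is reachable and see which models have been pulled, something like the following may work. It is a minimal sketch that assumes the default address above and that the `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this guide


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", []) if "name" in m]


def list_local_models(base_url=OLLAMA_URL, timeout=3.0):
    """Return pulled model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable; try running `ollama serve`")
    else:
        print("Pulled models:", ", ".join(models) or "(none)")
```

Only the standard library is used here so the check runs even before the pipeline's own dependencies are installed.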

Verifying Installation

Test that all components are properly installed:

# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests; print('Core dependencies OK')"

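# Check that the Anthropic API key is set and the client can be created (no API request is made, so the key itself is not verified)
python -c "import os; from anthropic import Anthropic; assert os.environ.get('ANTHROPIC_API_KEY'), 'ANTHROPIC_API_KEY is not set'; client = Anthropic(); print('API key found; client created')"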

# Test Ollama (if using)
curl http://localhost:11434/api/tags

Directory Structure

The skill works with PDFs and metadata organized in any of the following layouts:

Option A: Reference Manager Export

project/
├── library.bib              # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...

Option B: Simple Directory

project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...

Option C: DOI List

project/
└── dois.txt                 # One DOI per line
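
The three layouts above can be told apart with a few filesystem checks. The sketch below is illustrative only: `detect_layout` is a hypothetical helper, it assumes the folder and file names shown in the examples (`pdfs/`, `dois.txt`, a `.bib` file at the project root), and it also accepts a `.ris` export for Option A, which is an assumption based on the RIS support listed under dependencies. The skill itself may locate inputs differently.

```python
from pathlib import Path


def detect_layout(project_dir):
    """Classify a project folder as option 'A', 'B', 'C', or None."""
    root = Path(project_dir)
    pdf_dir = root / "pdfs"
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    # Assumption: a .ris export counts as reference-manager metadata too.
    has_metadata = any(root.glob("*.bib")) or any(root.glob("*.ris"))
    if has_pdfs and has_metadata:
        return "A"  # reference manager export plus PDFs
    if has_pdfs:
        return "B"  # PDFs only
    if (root / "dois.txt").is_file():
        return "C"  # DOI list
    return None  # none of the layouts in this guide


if __name__ == "__main__":
    print(detect_layout("project"))
```

Checking in the order A, B, C means a folder that contains both PDFs and a DOI list is treated as a PDF layout, which seems the safer default when files are already on disk.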

Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: references/workflow_guide.md