zhongwei/gh-brunoasm-my-claude-skills-general-skills

Zhongwei Li 69617b598e Initial commit

2025-11-29 18:02:40 +08:00

Setup Guide for PDF Data Extraction

Installation

Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

conda env create -f environment.yml
conda activate pdf_extraction

Using pip

pip install -r requirements.txt

Required Dependencies

Core Dependencies

anthropic>=0.40.0 - Anthropic API client
pybtex>=0.24.0 - BibTeX file handling
rispy>=0.6.0 - RIS file handling
json-repair>=0.25.0 - JSON repair and validation
jsonschema>=4.20.0 - JSON schema validation
pandas>=2.0.0 - Data processing
requests>=2.31.0 - HTTP requests for APIs

Export Dependencies

openpyxl>=3.1.0 - Excel export
pyreadr>=0.5.0 - R RDS export
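The minimum versions listed above can be checked against the active environment. The following is a minimal sketch, assuming the distribution names match the list exactly; `MINIMUM_VERSIONS`, `version_tuple`, `is_older`, and `check_versions` are hypothetical helper names, and the version parsing is deliberately simplified (it reads only the leading numeric parts and is not a full PEP 440 comparison).

```python
from importlib import metadata

# Minimum versions copied from the dependency lists in this guide.
MINIMUM_VERSIONS = {
    "anthropic": "0.40.0",
    "pybtex": "0.24.0",
    "rispy": "0.6.0",
    "json-repair": "0.25.0",
    "jsonschema": "4.20.0",
    "pandas": "2.0.0",
    "requests": "2.31.0",
    "openpyxl": "3.1.0",
    "pyreadr": "0.5.0",
}


def version_tuple(version):
    """Turn '2.31.0' into (2, 31, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # e.g. '0rc1': keep the 0, ignore the suffix
            break
    return tuple(parts)


def is_older(installed, minimum):
    """True if the installed version is below the minimum (zero-padded compare)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_versions(minimums=MINIMUM_VERSIONS):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if is_older(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    for package, problem in check_versions().items():
        print(f"{package}: {problem}")
```

An empty result means every listed package is present at or above its minimum. For strict comparisons, including pre-releases, the `packaging` library would be a more robust choice than this hand-rolled parser.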

API Keys Setup

Anthropic API Key (Required for Claude backends)

export ANTHROPIC_API_KEY='your-api-key-here'

Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:

echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc

GeoNames Username (Optional - for geographic validation)

Register at https://www.geonames.org/login
Enable web services in your account
Set environment variable:

export GEONAMES_USERNAME='your-username'
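Before running the pipeline, it can help to confirm that both variables are visible to Python. The sketch below is illustrative only: `check_credentials` is a hypothetical helper, and the required/optional flags simply mirror the wording of this guide. It reports status without printing the secret values.

```python
import os

# Variable names taken from this guide. True marks a variable as required
# (Anthropic key for Claude backends); GeoNames is optional.
CREDENTIALS = {
    "ANTHROPIC_API_KEY": True,
    "GEONAMES_USERNAME": False,
}

# Placeholder strings used in the export commands above.
PLACEHOLDERS = {"your-api-key-here", "your-username"}


def check_credentials(environ=None):
    """Return a list of (name, status) pairs; never exposes the values."""
    environ = os.environ if environ is None else environ
    report = []
    for name, required in CREDENTIALS.items():
        value = environ.get(name, "").strip()
        if not value:
            status = "MISSING (required)" if required else "not set (optional)"
        elif value in PLACEHOLDERS:
            status = "still set to the placeholder value"
        else:
            status = "set"
        report.append((name, status))
    return report


if __name__ == "__main__":
    for name, status in check_credentials():
        print(f"{name}: {status}")
```

If a variable shows as missing in a new terminal, the `export` line was probably added to a profile file that the current shell does not read.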

Local Model Setup (Ollama)

For free, private, offline abstract filtering, run a local model with Ollama:

Installation

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from https://ollama.com/download

Pulling Models

# Recommended models
ollama pull llama3.1:8b      # Good balance (8GB RAM)
ollama pull mistral:7b       # Fast, simple filtering
ollama pull qwen2.5:7b       # Multilingual support
ollama pull llama3.1:70b     # Best accuracy (64GB RAM)

Starting Ollama Server

The Ollama server usually starts automatically after installation. If it is not running, start it manually:

ollama serve

The server runs at http://localhost:11434 by default.
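To confirm from Python that the server is reachable and see which models have been pulled, something like the following could be used. It is a sketch under assumptions: it queries the `/api/tags` endpoint on the default address and assumes the response is a JSON object with a `models` list whose entries carry a `name` field. `model_names` and `list_local_models` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address stated in this guide


def model_names(payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Return the pulled model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama server not reachable; try running: ollama serve")
    elif not models:
        print("Server is running, but no models have been pulled yet")
    else:
        print("Available models:", ", ".join(models))
```

Returning `None` for an unreachable server and an empty list for a running server with no models keeps the two failure modes distinguishable.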

Verifying Installation

Test that all components are properly installed:

# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, jsonschema, pandas, requests, openpyxl, pyreadr; print('All dependencies OK')"

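# Check that the Anthropic API key is set and the client can be created (this does not call the API, so it cannot confirm the key is valid)
python -c "import os; from anthropic import Anthropic; assert os.environ.get('ANTHROPIC_API_KEY'), 'ANTHROPIC_API_KEY is not set'; Anthropic(); print('API key found, client created')"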

# Test Ollama (if using)
curl http://localhost:11434/api/tags

Directory Structure

The skill works with PDFs and metadata organized in any of the following layouts:

Option A: Reference Manager Export

project/
├── library.bib              # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...

Option B: Simple Directory

project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...

Option C: DOI List

project/
└── dois.txt                 # One DOI per line
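As an illustration of how the three layouts differ, the sketch below guesses which option a project directory follows. It is not the skill's own detection logic: `detect_layout` is a hypothetical helper, the `pdfs/` and `dois.txt` names are taken from the trees above, and treating a `.ris` file like a `.bib` export is an assumption based on the RIS dependency listed earlier.

```python
from pathlib import Path


def detect_layout(project_dir):
    """Return 'A', 'B', or 'C' for the layouts above, or None if none match."""
    project = Path(project_dir)
    pdf_dir = project / "pdfs"

    # Options A and B both need at least one PDF under pdfs/.
    has_pdfs = pdf_dir.is_dir() and any(pdf_dir.glob("*.pdf"))
    # Option A additionally has a reference manager export next to pdfs/.
    has_export = any(project.glob("*.bib")) or any(project.glob("*.ris"))

    if has_pdfs and has_export:
        return "A"
    if has_pdfs:
        return "B"
    if (project / "dois.txt").is_file():
        return "C"
    return None


if __name__ == "__main__":
    layout = detect_layout(".")
    print(f"Detected layout: {layout or 'none of the documented options'}")
```

When PDFs and a DOI list are both present, this sketch prefers the PDF-based layouts; that ordering is a design choice of the example, not something the guide specifies.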

Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: references/workflow_guide.md
