Complete Workflow Guide

This guide provides step-by-step instructions for the complete PDF extraction pipeline.

Overview

The pipeline consists of 6 main steps plus optional validation:

  1. Organize Metadata - Standardize PDF and metadata organization
  2. Filter Papers - Identify relevant papers by abstract (optional)
  3. Extract Data - Extract structured data from PDFs
  4. Repair JSON - Validate and repair JSON outputs
  5. Validate with APIs - Enrich with external databases
  6. Export - Convert to analysis format

Optional: Steps 7-9 for quality validation

Step 1: Organize Metadata

Standardize PDF organization and metadata from various sources.

From BibTeX (Zotero, JabRef, etc.)

python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source path/to/library.bib \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json

From RIS (Mendeley, EndNote, etc.)

python scripts/01_organize_metadata.py \
  --source-type ris \
  --source path/to/library.ris \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json

From PDF Directory

python scripts/01_organize_metadata.py \
  --source-type directory \
  --source path/to/pdfs \
  --output metadata.json

From DOI List

python scripts/01_organize_metadata.py \
  --source-type doi_list \
  --source dois.txt \
  --output metadata.json
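
The DOI list is just a text file; a minimal dois.txt, assuming the script expects one DOI per line (the DOIs below are placeholders):

10.1234/example.0001
10.5678/example.0002
10.9999/example.0003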

Outputs:

  • metadata.json - Standardized metadata file
  • organized_pdfs/ - Renamed PDFs (if --organize-pdfs used)
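
The exact structure of metadata.json depends on your source, but each record typically pairs bibliographic details with the resolved PDF path. A hypothetical entry (every field name here is illustrative, not guaranteed by the script):

[
  {
    "_note": "illustrative field names only",
    "id": "smith_2020",
    "title": "Example paper title",
    "authors": ["Smith, J.", "Doe, A."],
    "year": 2020,
    "doi": "10.1234/example.0001",
    "abstract": "...",
    "pdf_path": "organized_pdfs/smith_2020.pdf"
  }
]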

Step 2: Filter Papers by Abstract (Optional)

Filter papers by analyzing abstracts to reduce PDF processing costs.

Backend Selection

Option A: Claude Haiku (Fast & Cheap)

  • Cost: ~$0.25 per million input tokens
  • Speed: Very fast with batches API
  • Accuracy: Good for most filtering tasks

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json

Option B: Claude Sonnet (More Accurate)

  • Cost: ~$3 per million input tokens
  • Speed: Fast with batches API
  • Accuracy: Higher for complex criteria

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-sonnet \
  --use-batches \
  --output filtered_papers.json

Option C: Local Ollama (FREE & Private)

  • Cost: $0 (runs locally)
  • Speed: Depends on hardware
  • Accuracy: Good with llama3.1:8b or better

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend ollama \
  --ollama-model llama3.1:8b \
  --output filtered_papers.json
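
If the model is not yet available locally, pull it first with the standard Ollama CLI:

ollama pull llama3.1:8b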

Before running: Customize the filtering prompt in scripts/02_filter_abstracts.py (line 74) to match your criteria.

Outputs:

  • filtered_papers.json - Papers marked as relevant/irrelevant

Step 3: Extract Data from PDFs

Extract structured data using Claude's PDF vision capabilities.

Schema Preparation

  1. Copy the schema template:

cp assets/schema_template.json my_schema.json

  2. Customize it for your domain:
    • Update objective with your extraction goal
    • Define the output_schema structure
    • Add domain-specific instructions
    • Provide an output_example

See assets/example_flower_visitors_schema.json for a real-world example.
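
As a rough sketch only (the template in assets/schema_template.json is authoritative; the keys below are inferred from the list above and the values are invented), a customized schema might look like:

{
  "objective": "Extract flower visitor records from pollination studies",
  "instructions": "Report one record per plant-visitor pair; use null for missing values.",
  "output_schema": {
    "records": [
      {
        "plant_species": "string",
        "visitor_species": "string",
        "location": "string"
      }
    ]
  },
  "output_example": {
    "records": [
      {
        "plant_species": "Rosa canina",
        "visitor_species": "Apis mellifera",
        "location": "Bavaria, Germany"
      }
    ]
  }
}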

Run Extraction

python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json

Processing methods:

  • batches - Most efficient for many PDFs
  • base64 - Sequential processing

Optional flags:

  • --filter-results filtered_papers.json - Only process relevant papers
  • --test - Process only 3 PDFs for testing
  • --model claude-3-5-sonnet-20241022 - Change model

Outputs:

  • extracted_data.json - Raw extraction results with token counts

Step 4: Repair and Validate JSON

Repair malformed JSON and validate against schema.

python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json

Optional flags:

  • --strict - Reject records that fail validation

Outputs:

  • cleaned_data.json - Repaired and validated extractions

Step 5: Validate with External APIs

Enrich data using external scientific databases.

API Configuration

  1. Copy the API config template:

cp assets/api_config_template.json my_api_config.json

  2. Map fields to validation APIs:
    • gbif_taxonomy - GBIF for biological taxonomy
    • wfo_plants - World Flora Online for plant names
    • geonames - GeoNames for locations (requires account)
    • geocode - OpenStreetMap for geocoding (free)
    • pubchem - PubChem for chemical compounds
    • ncbi_gene - NCBI Gene database

See assets/example_api_config_ecology.json for an ecology example.
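
Conceptually, the config maps each extracted field to one of the validators above. A hypothetical mapping (the key layout is illustrative; the template and the ecology example show the real structure):

{
  "_note": "illustrative structure only",
  "plant_species": {"api": "wfo_plants"},
  "visitor_species": {"api": "gbif_taxonomy"},
  "location": {"api": "geocode"}
}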

Run Validation

python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json

Optional flags:

  • --skip-validation - Skip API calls, only structure data

Outputs:

  • validated_data.json - Data enriched with validated taxonomy, geography, etc.

Step 6: Export to Analysis Format

Convert the validated data to a format suited to your analysis environment.

Python (pandas)

python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --flatten \
  --output results

Outputs:

  • results.pkl - pandas DataFrame
  • results.py - Loading script
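
Since results.pkl is a pickled pandas DataFrame, loading it by hand is a one-liner (the generated results.py should do the equivalent):

import pandas as pd

# Load the DataFrame pickled by the export script
df = pd.read_pickle("results.pkl")
print(df.head())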

R

python scripts/06_export_database.py \
  --input validated_data.json \
  --format r \
  --flatten \
  --output results

Outputs:

  • results.rds - R data frame
  • results.R - Loading script

CSV

python scripts/06_export_database.py \
  --input validated_data.json \
  --format csv \
  --flatten \
  --output results.csv

Excel

python scripts/06_export_database.py \
  --input validated_data.json \
  --format excel \
  --flatten \
  --output results.xlsx

SQLite Database

python scripts/06_export_database.py \
  --input validated_data.json \
  --format sqlite \
  --flatten \
  --output results.db

Outputs:

  • results.db - SQLite database
  • results.sql - Example queries
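
You can query the database directly from Python with the standard library; the table name below is a guess, so check results.sql for the actual schema:

import sqlite3

con = sqlite3.connect("results.db")
# "records" is a hypothetical table name; see results.sql for the real one
for row in con.execute("SELECT * FROM records LIMIT 5"):
    print(row)
con.close()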

Flags:

  • --flatten - Flatten nested JSON for tabular format
  • --include-metadata - Include paper metadata in output

Cost Estimation

Example: 100 papers, 10 pages each

With Filtering (Recommended):

  1. Filter (Haiku): 100 abstracts × 500 tokens × $0.25/M ≈ $0.01
  2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = $3.75
  3. Total: ~$3.76

Without Filtering:

  1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = $7.50

With Local Ollama:

  1. Filter (Ollama): $0
  2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = $3.75
  3. Total: ~$3.75
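
These estimates are simple arithmetic; a quick back-of-envelope check in Python, using the rates and token counts assumed above:

# Back-of-envelope estimate using the assumptions above
HAIKU_PER_M = 0.25   # USD per million input tokens
SONNET_PER_M = 3.00

papers, pages, relevant = 100, 10, 50
abstract_tokens, page_tokens = 500, 2500

filter_cost = papers * abstract_tokens * HAIKU_PER_M / 1e6
extract_cost = relevant * pages * page_tokens * SONNET_PER_M / 1e6
print(f"filter ~${filter_cost:.2f}, extract ~${extract_cost:.2f}, "
      f"total ~${filter_cost + extract_cost:.2f}")
# filter ~$0.01, extract ~$3.75, total ~$3.76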

Token Usage by Step

  • Abstract (~200 words): ~500 tokens
  • PDF page (text-heavy): ~1,500-3,000 tokens
  • Extraction prompt: ~500-1,000 tokens
  • Schema/context: ~500-1,000 tokens

Tips to reduce costs:

  • Use abstract filtering (Step 2)
  • Use Haiku for filtering instead of Sonnet
  • Use local Ollama for filtering (free)
  • Enable prompt caching with --use-caching
  • Process in batches with --use-batches

Common Issues

PDF Not Found

Check PDF paths in metadata.json match actual file locations.
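
A quick existence check from Python (this assumes metadata.json is a list of records with a pdf_path field, which is hypothetical; adjust to your layout):

import json
from pathlib import Path

with open("metadata.json") as f:
    records = json.load(f)

# "pdf_path" is a hypothetical field name; adapt to your metadata
missing = [r for r in records if not Path(r.get("pdf_path", "")).exists()]
print(f"{len(missing)} of {len(records)} PDFs not found")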

JSON Parsing Errors

Run Step 4 (repair JSON) - the json_repair library handles most issues.
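
For spot checks you can also call the json_repair library directly:

from json_repair import repair_json

# Fixes trailing commas, unquoted keys, and similar issues
fixed = repair_json('{"species": "Apis mellifera", }')
print(fixed)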

API Rate Limits

The scripts include delays between requests, but check each API's documentation for its specific limits.

Ollama Connection Error

Ensure the Ollama server is running: ollama serve

Next Steps

For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.

See: references/validation_guide.md