Initial commit
406 skills/extract_from_pdfs/references/api_reference.md Normal file
@@ -0,0 +1,406 @@
# External API Validation Reference

## Overview

Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.

## Available APIs

### Biological Taxonomy

#### GBIF (Global Biodiversity Information Facility)

**Use for:** General biological taxonomy (animals, plants, fungi, etc.)

**Function:** `validate_gbif_taxonomy(scientific_name)`

**Returns:**
- Matched canonical name
- Full scientific name with authority
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
- GBIF ID
- Match confidence and type
- Taxonomic status

**Example:**
```python
validate_gbif_taxonomy("Apis mellifera")
# Returns (a misspelled input such as "Apis melifera" would come back
# with match_type "FUZZY" and lower confidence instead):
{
    "matched_name": "Apis mellifera",
    "scientific_name": "Apis mellifera Linnaeus, 1758",
    "rank": "SPECIES",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Apidae",
    "genus": "Apis",
    "gbif_id": 1340278,
    "confidence": 100,
    "match_type": "EXACT"
}
```

**No API key required** - Free and unlimited

**Documentation:** https://www.gbif.org/developer/species
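
For orientation, the lookup can be reproduced directly against GBIF's public species-match endpoint. The sketch below is illustrative only; `gbif_match_sketch` is a hypothetical name, not the pipeline's implementation in `scripts/05_validate_with_apis.py`:

```python
from typing import Optional

import requests


def gbif_match_sketch(scientific_name: str) -> Optional[dict]:
    """Query GBIF's free species-match endpoint (no API key needed)."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    if data.get("matchType") == "NONE":
        return None  # GBIF found no usable match
    return {
        "matched_name": data.get("canonicalName"),
        "scientific_name": data.get("scientificName"),
        "rank": data.get("rank"),
        "gbif_id": data.get("usageKey"),
        "confidence": data.get("confidence"),
        "match_type": data.get("matchType"),
    }
```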

#### World Flora Online (WFO)

**Use for:** Plant taxonomy specifically

**Function:** `validate_wfo_plant(scientific_name)`

**Returns:**
- Matched name
- Scientific name with authors
- Family
- WFO ID
- Taxonomic status

**Example:**
```python
validate_wfo_plant("Magnolia grandiflora")
# Returns:
{
    "matched_name": "Magnolia grandiflora",
    "scientific_name": "Magnolia grandiflora L.",
    "authors": "L.",
    "family": "Magnoliaceae",
    "wfo_id": "wfo-0000988234",
    "status": "Accepted"
}
```

**No API key required** - Free

**Documentation:** http://www.worldfloraonline.org/

### Geography

#### GeoNames

**Use for:** Location validation and standardization

**Function:** `validate_geonames(location, country=None)`

**Returns:**
- Matched place name
- Country name and code
- Administrative divisions (state, province)
- Latitude/longitude
- GeoNames ID

**Example:**
```python
validate_geonames("São Paulo", country="BR")
# Returns:
{
    "matched_name": "São Paulo",
    "country": "Brazil",
    "country_code": "BR",
    "admin1": "São Paulo",
    "admin2": None,
    "latitude": "-23.5475",
    "longitude": "-46.63611",
    "geonames_id": 3448439
}
```

**Requires free account:** Register at https://www.geonames.org/login

**Setup:**
1. Create account
2. Enable web services in account settings
3. Set environment variable: `export GEONAMES_USERNAME='your-username'`

**Rate limit:** The free tier allows roughly 20,000 credits per day (with an hourly cap of about 1,000); check your account page for current quotas

**Documentation:** https://www.geonames.org/export/web-services.html
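
GeoNames exposes this lookup through its `searchJSON` endpoint. A minimal illustrative sketch, not the script's actual code; it assumes `GEONAMES_USERNAME` is set as described above:

```python
import os
from typing import Optional

import requests


def geonames_search_sketch(location: str, country: str = None) -> Optional[dict]:
    """Look up a place via GeoNames' searchJSON endpoint (username required)."""
    params = {
        "q": location,
        "maxRows": 1,
        "username": os.environ["GEONAMES_USERNAME"],
    }
    if country:
        params["country"] = country  # two-letter ISO code, e.g. "BR"
    resp = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=10)
    results = resp.json().get("geonames", [])
    if not results:
        return None
    top = results[0]
    return {
        "matched_name": top.get("name"),
        "country": top.get("countryName"),
        "country_code": top.get("countryCode"),
        "admin1": top.get("adminName1"),
        "latitude": top.get("lat"),
        "longitude": top.get("lng"),
        "geonames_id": top.get("geonameId"),
    }
```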

#### OpenStreetMap Nominatim

**Use for:** Geocoding addresses to coordinates

**Function:** `geocode_location(address)`

**Returns:**
- Display name (formatted address)
- Latitude/longitude
- OSM type and ID
- Place rank

**Example:**
```python
geocode_location("Field Museum, Chicago, IL")
# Returns:
{
    "display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
    "latitude": "41.8662",
    "longitude": "-87.6169",
    "osm_type": "way",
    "osm_id": 54856789,
    "place_rank": 30
}
```

**No API key required** - Free

**Important:** Add 1-second delays between requests (implemented in script)

**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/
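
Nominatim's `/search` endpoint returns these fields directly. An illustrative sketch, not the script's actual code; Nominatim's usage policy requires an identifying User-Agent, and the contact address below is a placeholder:

```python
import time
from typing import Optional

import requests


def nominatim_geocode_sketch(address: str) -> Optional[dict]:
    """Geocode a free-text address via the public Nominatim instance."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "jsonv2", "limit": 1},
        headers={"User-Agent": "pdf-extraction-pipeline/0.1 (contact@example.org)"},
        timeout=10,
    )
    results = resp.json()
    time.sleep(1.0)  # usage policy: at most 1 request per second
    if not results:
        return None
    top = results[0]
    return {
        "display_name": top.get("display_name"),
        "latitude": top.get("lat"),
        "longitude": top.get("lon"),
        "osm_type": top.get("osm_type"),
        "osm_id": top.get("osm_id"),
        "place_rank": top.get("place_rank"),
    }
```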

### Chemistry

#### PubChem

**Use for:** Chemical compound validation

**Function:** `validate_pubchem_compound(compound_name)`

**Returns:**
- PubChem CID (compound ID)
- Molecular formula
- PubChem URL

**Example:**
```python
validate_pubchem_compound("aspirin")
# Returns:
{
    "cid": 2244,
    "molecular_formula": "C9H8O4",
    "pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
}
```

**No API key required** - Free

**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
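
PubChem's PUG REST interface resolves names to CIDs in a single request. An illustrative sketch, not the script's actual code:

```python
from typing import Optional
from urllib.parse import quote

import requests


def pubchem_lookup_sketch(compound_name: str) -> Optional[dict]:
    """Resolve a compound name to a CID and molecular formula via PUG REST."""
    url = (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
        f"{quote(compound_name)}/property/MolecularFormula/JSON"
    )
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return None  # PUG REST answers 404 when the name doesn't resolve
    props = resp.json()["PropertyTable"]["Properties"][0]
    return {
        "cid": props["CID"],
        "molecular_formula": props["MolecularFormula"],
        "pubchem_url": f"https://pubchem.ncbi.nlm.nih.gov/compound/{props['CID']}",
    }
```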

### Genetics

#### NCBI Gene

**Use for:** Gene validation

**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`

**Returns:**
- NCBI Gene ID
- NCBI URL

**Example:**
```python
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
# Returns:
{
    "gene_id": "672",
    "ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
}
```

**No API key required** - Free

**Rate limit:** Max 3 requests/second

**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/
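
The lookup maps onto NCBI's E-utilities `esearch` endpoint. An illustrative sketch, not the script's actual code:

```python
from typing import Optional

import requests


def ncbi_gene_search_sketch(gene_symbol: str, organism: str = None) -> Optional[dict]:
    """Find a gene ID via E-utilities esearch (max 3 requests/second without a key)."""
    term = f"{gene_symbol}[sym]"
    if organism:
        term += f" AND {organism}[orgn]"
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "gene", "term": term, "retmode": "json"},
        timeout=10,
    )
    ids = resp.json()["esearchresult"]["idlist"]
    if not ids:
        return None
    return {
        "gene_id": ids[0],
        "ncbi_url": f"https://www.ncbi.nlm.nih.gov/gene/{ids[0]}",
    }
```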

## Configuration

### API Config File Structure

Create `my_api_config.json` based on `assets/api_config_template.json`:

```json
{
    "field_mappings": {
        "species": {
            "api": "gbif_taxonomy",
            "output_field": "validated_species",
            "description": "Validate species names against GBIF"
        },
        "location": {
            "api": "geocode",
            "output_field": "coordinates"
        }
    },

    "nested_field_mappings": {
        "records.plant_species": {
            "api": "wfo_plants",
            "output_field": "validated_plant_taxonomy"
        },
        "records.location": {
            "api": "geocode",
            "output_field": "coordinates"
        }
    }
}
```

### Field Mapping Parameters

**Required:**
- `api` - API name (see list above)
- `output_field` - Name for validated data

**Optional:**
- `description` - Documentation
- `extra_params` - Additional API-specific parameters

## Adding Custom APIs

To add a new validation API:

1. **Create validator function** in `scripts/05_validate_with_apis.py`:

```python
def validate_custom_api(value: str, extra_param: str = None) -> Optional[Dict]:
    """
    Validate value using custom API.

    Args:
        value: The value to validate
        extra_param: Optional additional parameter

    Returns:
        Dictionary with validated data or None if not found
    """
    try:
        # Make API request
        response = requests.get(f"https://api.example.com/{value}")
        if response.status_code == 200:
            data = response.json()
            return {
                'validated_value': data.get('canonical_name'),
                'api_id': data.get('id'),
                'additional_info': data.get('info')
            }
    except Exception as e:
        print(f"Custom API error: {e}")

    return None
```

2. **Register in API_VALIDATORS** dictionary:

```python
API_VALIDATORS = {
    'gbif_taxonomy': validate_gbif_taxonomy,
    'wfo_plants': validate_wfo_plant,
    # ... existing validators ...
    'custom_api': validate_custom_api,  # Add here
}
```

3. **Use in config file:**

```json
{
    "field_mappings": {
        "your_field": {
            "api": "custom_api",
            "output_field": "validated_field",
            "extra_params": {
                "extra_param": "value"
            }
        }
    }
}
```

## Rate Limiting

The script implements rate limiting to respect API usage policies:

**Default delays (built into script):**
- GeoNames: 0.5 seconds
- Nominatim: 1.0 second (required)
- WFO: 1.0 second
- Others: 0.5 seconds

**Modify delays if needed** in `scripts/05_validate_with_apis.py`:

```python
# In main() function
if not args.skip_validation:
    time.sleep(0.5)  # Adjust this value
```

## Error Handling

APIs may fail for various reasons:

**Common errors:**
- Connection timeout
- Rate limit exceeded
- Invalid API key
- Malformed query
- No match found

**Script behavior:**
- Continues processing on error
- Logs the error to the console
- Sets the validated field to None
- Preserves the original extracted value

**Retry logic:**
- 3 retries with exponential backoff
- Implemented for network errors
- Not for "no match found" errors
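
Conceptually, the retry behavior looks like the following sketch (illustrative, not the script's exact code):

```python
import time

import requests


def with_retries(validator, *args, max_retries: int = 3, base_delay: float = 1.0, **kwargs):
    """Call a validator, retrying network failures with exponential backoff (1s, 2s, 4s).

    A validator that simply finds no match returns None without raising,
    so "no match found" is never retried.
    """
    for attempt in range(max_retries + 1):
        try:
            return validator(*args, **kwargs)
        except (requests.ConnectionError, requests.Timeout) as exc:
            if attempt == max_retries:
                print(f"Giving up after {max_retries} retries: {exc}")
                return None
            time.sleep(base_delay * 2 ** attempt)
```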

## Best Practices

1. **Start with test run:**
   ```bash
   python scripts/05_validate_with_apis.py \
       --input cleaned_data.json \
       --apis my_api_config.json \
       --skip-validation \
       --output test_structure.json
   ```

2. **Validate subset first:**
   - Test on 10 papers before full run
   - Verify API connections work
   - Check output structure

3. **Monitor API usage:**
   - Track request counts for paid APIs
   - Respect rate limits
   - Consider caching results (see the sketch after this list)

4. **Handle failures gracefully:**
   - Original data is never lost
   - Can re-run validation separately
   - Manually fix failed validations if needed

5. **Optimize API calls:**
   - Only validate fields that need standardization
   - Use cached results when re-running
   - Batch similar queries when possible
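
For the caching suggestion above, a minimal sketch of a JSON-file cache wrapped around any validator; `api_cache.json` and `cached_call` are hypothetical names, not part of the pipeline:

```python
import json
from pathlib import Path

CACHE_PATH = Path("api_cache.json")  # hypothetical cache location


def load_cache() -> dict:
    """Read the cache from disk, or start empty."""
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}


def cached_call(key: str, validator, cache: dict):
    """Return the cached result for key, calling the validator only on a miss."""
    if key not in cache:
        cache[key] = validator(key)
        CACHE_PATH.write_text(json.dumps(cache))  # persist after each new result
    return cache[key]

# Usage: cached_call("Apis mellifera", validate_gbif_taxonomy, load_cache())
```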

## Troubleshooting

### GeoNames "Service disabled" error
- Check account email is verified
- Enable web services in account settings
- Wait up to 1 hour after enabling

### Nominatim rate limit errors
- Script includes 1-second delays
- Don't run multiple instances
- Consider using local Nominatim instance

### NCBI errors
- Reduce request frequency
- Add longer delays
- Use E-utilities API key (optional, increases limit)

### No matches found
- Check spelling and formatting
- Try variations of name
- Some names may not be in database
- Consider manual curation for important cases

147 skills/extract_from_pdfs/references/setup_guide.md Normal file
@@ -0,0 +1,147 @@
# Setup Guide for PDF Data Extraction

## Installation

### Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

### Using pip

```bash
pip install -r requirements.txt
```

## Required Dependencies

### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs

### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export

## API Keys Setup

### Anthropic API Key (Required for Claude backends)

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:

```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```

### GeoNames Username (Optional - for geographic validation)

1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set environment variable:

```bash
export GEONAMES_USERNAME='your-username'
```

## Local Model Setup (Ollama)

For free, private, offline abstract filtering:

### Installation

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download from https://ollama.com/download

### Pulling Models

```bash
# Recommended models
ollama pull llama3.1:8b    # Good balance (8GB RAM)
ollama pull mistral:7b     # Fast, simple filtering
ollama pull qwen2.5:7b     # Multilingual support
ollama pull llama3.1:70b   # Best accuracy (64GB RAM)
```

### Starting Ollama Server

The server usually starts automatically, but it can be started manually:

```bash
ollama serve
```

The server runs at http://localhost:11434 by default.

## Verifying Installation

Test that all components are properly installed:

```bash
# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, pandas; print('All dependencies OK')"

# Test that the Anthropic client initializes (checks ANTHROPIC_API_KEY is set,
# not that the key itself is valid)
python -c "from anthropic import Anthropic; client = Anthropic(); print('Client initialized')"

# Test Ollama (if using)
curl http://localhost:11434/api/tags
```

## Directory Structure

The skill will work with PDFs and metadata organized in various ways:

### Option A: Reference Manager Export
```
project/
├── library.bib          # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```

### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```

### Option C: DOI List
```
project/
└── dois.txt             # One DOI per line
```

## Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: `references/workflow_guide.md`

329 skills/extract_from_pdfs/references/validation_guide.md Normal file
@@ -0,0 +1,329 @@
# Validation and Quality Assurance Guide

## Overview

Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.

## When to Validate

- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Test Haiku vs Sonnet vs Ollama
- **For publication** - Report extraction quality metrics

## Recommended Sample Sizes

- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Step 7: Prepare Validation Set

Sample papers for manual annotation using one of three strategies.

### Random Sampling (General Quality)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy random \
    --output validation_set.json
```

Provides overall quality estimate but may miss rare cases.

### Stratified Sampling (Identify Weaknesses)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)

Best for identifying weak points in extraction.

### Diverse Sampling (Comprehensive)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy diverse \
    --output validation_set.json
```

Maximizes diversity across different paper types.

## Step 8: Manual Annotation

### Annotation Process

1. **Open validation file:**
   ```bash
   # Use your preferred JSON editor
   code validation_set.json   # VS Code
   vim validation_set.json    # Vim
   ```

2. **For each paper in `validation_papers`:**
   - Locate and read the original PDF
   - Extract data according to the schema
   - Fill the `ground_truth` field with correct extraction
   - The structure should match `automated_extraction`

3. **Fill metadata fields:**
   - `annotator`: Your name
   - `annotation_date`: YYYY-MM-DD
   - `notes`: Any ambiguous cases or comments

### Annotation Tips

**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation

**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction

**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes

**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability

### Example Annotation

```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"  // Automated missed this
        },
        {
          "species": "Bombus terrestris",  // Automated missed this record
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```

## Step 9: Calculate Validation Metrics

### Basic Metrics Calculation

```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

### Advanced Options

**Fuzzy string matching:**
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --fuzzy-strings \
    --output validation_metrics.json
```

Normalizes whitespace and case for string comparisons.

**Numeric tolerance:**
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --numeric-tolerance 0.01 \
    --output validation_metrics.json
```

Allows small differences in numeric values.

**Ordered list comparison:**
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --list-order-matters \
    --output validation_metrics.json
```

Treats lists as ordered sequences instead of sets.

## Understanding the Metrics

### Precision
**Definition:** Of the items extracted, what percentage are correct?

**Formula:** TP / (TP + FP)

**Example:** Extracted 10 species, 8 were correct → Precision = 80%

**High precision, low recall:** Conservative extraction (misses data)

### Recall
**Definition:** Of the true items, what percentage were extracted?

**Formula:** TP / (TP + FN)

**Example:** Paper has 12 species, 8 were correctly extracted → Recall = 67%

**Low precision, high recall:** Liberal extraction (includes errors)

### F1 Score
**Definition:** Harmonic mean of precision and recall

**Formula:** 2 × (Precision × Recall) / (Precision + Recall)

**Use:** Single metric balancing precision and recall
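
In code, the three metrics and the worked examples above reduce to a few lines (illustrative sketch):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The worked examples above: 8 of 10 extracted species were correct,
# and the paper actually contained 12 species (so 4 were missed).
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```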

### Field-Level Metrics

Metrics are calculated for each field type:

**Boolean fields:**
- True positives, false positives, false negatives

**Numeric fields:**
- Exact match or within tolerance

**String fields:**
- Exact or fuzzy match

**List fields:**
- Set-based comparison (default); see the sketch after this list
- Items in both (TP), in automated only (FP), in truth only (FN)

**Nested objects:**
- Recursive field-by-field comparison
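
Conceptually, the default set-based comparison for list fields works like this sketch (illustrative, not the script's exact code):

```python
def compare_list_field(automated: list, truth: list) -> dict:
    """Set-based comparison for list fields (the default; order is ignored)."""
    auto_set, truth_set = set(automated), set(truth)
    return {
        "tp": len(auto_set & truth_set),  # items in both
        "fp": len(auto_set - truth_set),  # in automated only
        "fn": len(truth_set - auto_set),  # in ground truth only
    }


compare_list_field(["Apis mellifera"], ["Apis mellifera", "Bombus terrestris"])
# {'tp': 1, 'fp': 0, 'fn': 1}  -- matches the annotation example above
```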

## Interpreting Results

### Validation Report Structure

```
OVERALL METRICS
Papers evaluated: 20
Precision: 87.3%
Recall: 79.2%
F1 Score: 83.1%

METRICS BY FIELD
Field        Precision   Recall   F1
species      95.2%       89.1%    92.0%
location     82.3%       75.4%    78.7%
method       91.0%       68.2%    77.9%

COMMON ISSUES
Fields with low recall (missed information):
- method: 68.2% recall, 12 missed items

Fields with low precision (incorrect extractions):
- location: 82.3% precision, 8 incorrect items
```

### Using Results to Improve

**Low Recall (Missing Information):**
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in prompt
- Consider using more capable model (Haiku → Sonnet)

**Low Precision (Incorrect Extractions):**
- Add validation rules to prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria

**Field-Specific Issues:**
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples

## Inter-Rater Reliability (Optional)

For critical applications, have multiple annotators:

1. **Split validation set:**
   - 10 papers: Single annotator
   - 10 papers: Both annotators independently

2. **Calculate agreement:**
   ```bash
   python scripts/08_calculate_validation_metrics.py \
       --annotations annotator1.json \
       --compare-with annotator2.json \
       --output agreement_metrics.json
   ```

3. **Resolve disagreements:**
   - Discuss discrepancies
   - Establish interpretation guidelines
   - Re-annotate if needed

## Iterative Improvement Workflow

1. **Baseline:** Run extraction with initial schema
2. **Validate:** Calculate metrics on sample
3. **Analyze:** Identify weak fields and error patterns
4. **Revise:** Update schema, prompts, or model
5. **Re-extract:** Run extraction with improvements
6. **Re-validate:** Calculate new metrics
7. **Compare:** Check if metrics improved
8. **Repeat:** Until acceptable quality achieved

## Reporting Validation in Publications

Include in methods section:

```
Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).
```

Consider reporting:
- Sample size and sampling strategy
- Overall precision, recall, F1
- Field-level metrics for key fields
- Confidence intervals
- Common error types

328 skills/extract_from_pdfs/references/workflow_guide.md Normal file
@@ -0,0 +1,328 @@
# Complete Workflow Guide

This guide provides step-by-step instructions for the complete PDF extraction pipeline.

## Overview

The pipeline consists of 6 main steps plus optional validation:

1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format

**Optional:** Steps 7-9 for quality validation

## Step 1: Organize Metadata

Standardize PDF organization and metadata from various sources.

### From BibTeX (Zotero, JabRef, etc.)

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### From RIS (Mendeley, EndNote, etc.)

```bash
python scripts/01_organize_metadata.py \
    --source-type ris \
    --source path/to/library.ris \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### From PDF Directory

```bash
python scripts/01_organize_metadata.py \
    --source-type directory \
    --source path/to/pdfs \
    --output metadata.json
```

### From DOI List

```bash
python scripts/01_organize_metadata.py \
    --source-type doi_list \
    --source dois.txt \
    --output metadata.json
```

**Outputs:**
- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (if `--organize-pdfs` used)

## Step 2: Filter Papers (Optional but Recommended)

Filter papers by analyzing abstracts to reduce PDF processing costs.

### Backend Selection

**Option A: Claude Haiku (Fast & Cheap)**
- Cost: ~$0.25 per million input tokens
- Speed: Very fast with batches API
- Accuracy: Good for most filtering tasks

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Claude Sonnet (More Accurate)**
- Cost: ~$3 per million input tokens
- Speed: Fast with batches API
- Accuracy: Higher for complex criteria

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-sonnet \
    --use-batches \
    --output filtered_papers.json
```

**Option C: Local Ollama (FREE & Private)**
- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.

**Outputs:**
- `filtered_papers.json` - Papers marked as relevant/irrelevant

## Step 3: Extract Data from PDFs

Extract structured data using Claude's PDF vision capabilities.

### Schema Preparation

1. Copy schema template:
   ```bash
   cp assets/schema_template.json my_schema.json
   ```

2. Customize for your domain:
   - Update `objective` with your extraction goal
   - Define `output_schema` structure
   - Add domain-specific `instructions`
   - Provide an `output_example`

See `assets/example_flower_visitors_schema.json` for a real-world example.

### Run Extraction

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```

**Processing methods:**
- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing

**Optional flags:**
- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model

**Outputs:**
- `extracted_data.json` - Raw extraction results with token counts

## Step 4: Repair and Validate JSON

Repair malformed JSON and validate against schema.

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```

**Optional flags:**
- `--strict` - Reject records that fail validation

**Outputs:**
- `cleaned_data.json` - Repaired and validated extractions

## Step 5: Validate with External APIs

Enrich data using external scientific databases.

### API Configuration

1. Copy API config template:
   ```bash
   cp assets/api_config_template.json my_api_config.json
   ```

2. Map fields to validation APIs:
   - `gbif_taxonomy` - GBIF for biological taxonomy
   - `wfo_plants` - World Flora Online for plant names
   - `geonames` - GeoNames for locations (requires account)
   - `geocode` - OpenStreetMap for geocoding (free)
   - `pubchem` - PubChem for chemical compounds
   - `ncbi_gene` - NCBI Gene database

See `assets/example_api_config_ecology.json` for an ecology example.

### Run Validation

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```

**Optional flags:**
- `--skip-validation` - Skip API calls, only structure data

**Outputs:**
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.

## Step 6: Export to Analysis Format

Convert to format for your analysis environment.

### Python (pandas)

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results
```

Outputs:
- `results.pkl` - pandas DataFrame
- `results.py` - Loading script
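
Loading the exported DataFrame later is a one-liner with pandas:

```python
import pandas as pd

df = pd.read_pickle("results.pkl")  # DataFrame written by the export step
print(df.head())
```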

### R

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results
```

Outputs:
- `results.rds` - R data frame
- `results.R` - Loading script

### CSV

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```

### Excel

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format excel \
    --flatten \
    --output results.xlsx
```

### SQLite Database

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format sqlite \
    --flatten \
    --output results.db
```

Outputs:
- `results.db` - SQLite database
- `results.sql` - Example queries
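
The database can also be queried from Python's standard library; table names depend on the export, so list them first (illustrative):

```python
import sqlite3

con = sqlite3.connect("results.db")
# Table names depend on the export, so discover them before querying
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)
con.close()
```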

**Flags:**
- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output

## Cost Estimation

### Example: 100 papers, 10 pages each

**With Filtering (Recommended):**
1. Filter (Haiku): 100 abstracts × 500 tokens × $0.25/M = **$0.01**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.76**

**Without Filtering:**
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**

**With Local Ollama:**
1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**

### Token Usage by Step
- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens
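
The arithmetic above reduces to one formula; a small helper (illustrative) makes it easy to re-run with your own numbers:

```python
def input_cost(n_docs: int, tokens_per_doc: int, usd_per_million_tokens: float) -> float:
    """Estimated input-token cost in USD."""
    return n_docs * tokens_per_doc * usd_per_million_tokens / 1_000_000


# The scenarios above:
print(input_cost(100, 500, 0.25))        # 0.0125 -> filtering 100 abstracts with Haiku
print(input_cost(50, 10 * 2_500, 3.0))   # 3.75   -> extracting 50 filtered papers with Sonnet
print(input_cost(100, 10 * 2_500, 3.0))  # 7.5    -> extracting all 100 papers unfiltered
```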

**Tips to reduce costs:**
- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Common Issues

### PDF Not Found
Check PDF paths in metadata.json match actual file locations.

### JSON Parsing Errors
Run Step 4 (repair JSON) - the json_repair library handles most issues.

### API Rate Limits
Scripts include delays, but check specific API documentation for limits.

### Ollama Connection Error
Ensure Ollama server is running: `ollama serve`

## Next Steps

For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.

See: `references/validation_guide.md`