Initial commit

Zhongwei Li
2025-11-29 18:02:40 +08:00
commit 69617b598e
25 changed files with 5790 additions and 0 deletions


@@ -0,0 +1,406 @@
# External API Validation Reference
## Overview
Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.
## Available APIs
### Biological Taxonomy
#### GBIF (Global Biodiversity Information Facility)
**Use for:** General biological taxonomy (animals, plants, fungi, etc.)
**Function:** `validate_gbif_taxonomy(scientific_name)`
**Returns:**
- Matched canonical name
- Full scientific name with authority
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
- GBIF ID
- Match confidence and type
- Taxonomic status
**Example:**
```python
validate_gbif_taxonomy("Apis mellifera")
# Returns:
{
    "matched_name": "Apis mellifera",
    "scientific_name": "Apis mellifera Linnaeus, 1758",
    "rank": "SPECIES",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Apidae",
    "genus": "Apis",
    "gbif_id": 1340278,
    "confidence": 100,
    "match_type": "EXACT"
}
```
**No API key required** - Free and unlimited
**Documentation:** https://www.gbif.org/developer/species
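The validator's exact implementation lives in `scripts/05_validate_with_apis.py`; as a rough sketch of the kind of request it might make, GBIF's public species-match endpoint can be queried directly (the field mapping below is illustrative, not the script's actual output):
```python
from typing import Optional

import requests

def gbif_match(scientific_name: str) -> Optional[dict]:
    """Query GBIF's species-match service and return a few key fields, or None."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    if data.get("matchType") == "NONE":
        return None
    return {
        "matched_name": data.get("canonicalName"),
        "scientific_name": data.get("scientificName"),
        "rank": data.get("rank"),
        "gbif_id": data.get("usageKey"),
        "confidence": data.get("confidence"),
        "match_type": data.get("matchType"),
    }

print(gbif_match("Apis mellifera"))
```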
#### World Flora Online (WFO)
**Use for:** Plant taxonomy specifically
**Function:** `validate_wfo_plant(scientific_name)`
**Returns:**
- Matched name
- Scientific name with authors
- Family
- WFO ID
- Taxonomic status
**Example:**
```python
validate_wfo_plant("Magnolia grandiflora")
# Returns:
{
    "matched_name": "Magnolia grandiflora",
    "scientific_name": "Magnolia grandiflora L.",
    "authors": "L.",
    "family": "Magnoliaceae",
    "wfo_id": "wfo-0000988234",
    "status": "Accepted"
}
```
**No API key required** - Free
**Documentation:** http://www.worldfloraonline.org/
### Geography
#### GeoNames
**Use for:** Location validation and standardization
**Function:** `validate_geonames(location, country=None)`
**Returns:**
- Matched place name
- Country name and code
- Administrative divisions (state, province)
- Latitude/longitude
- GeoNames ID
**Example:**
```python
validate_geonames("São Paulo", country="BR")
# Returns:
{
    "matched_name": "São Paulo",
    "country": "Brazil",
    "country_code": "BR",
    "admin1": "São Paulo",
    "admin2": None,
    "latitude": "-23.5475",
    "longitude": "-46.63611",
    "geonames_id": 3448439
}
```
**Requires free account:** Register at https://www.geonames.org/login
**Setup:**
1. Create account
2. Enable web services in account settings
3. Set environment variable: `export GEONAMES_USERNAME='your-username'`
**Rate limit:** Free accounts have daily and hourly credit limits; see the GeoNames web services documentation for current quotas
**Documentation:** https://www.geonames.org/export/web-services.html
#### OpenStreetMap Nominatim
**Use for:** Geocoding addresses to coordinates
**Function:** `geocode_location(address)`
**Returns:**
- Display name (formatted address)
- Latitude/longitude
- OSM type and ID
- Place rank
**Example:**
```python
geocode_location("Field Museum, Chicago, IL")
# Returns:
{
    "display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
    "latitude": "41.8662",
    "longitude": "-87.6169",
    "osm_type": "way",
    "osm_id": 54856789,
    "place_rank": 30
}
```
**No API key required** - Free
**Important:** Add 1-second delays between requests (implemented in script)
**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/
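A rough sketch of a geocoding call against the public Nominatim endpoint, including the delay and the identifying `User-Agent` header its usage policy requires (this is not necessarily how `geocode_location` is implemented, and the contact address is a placeholder):
```python
import time
from typing import Optional

import requests

def nominatim_geocode(address: str) -> Optional[dict]:
    """Geocode a free-text address via Nominatim, keeping to 1 request per second."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json", "limit": 1},
        # Nominatim's usage policy requires an identifying User-Agent; this value is a placeholder.
        headers={"User-Agent": "pdf-extraction-pipeline (contact@example.org)"},
        timeout=10,
    )
    resp.raise_for_status()
    time.sleep(1.0)  # respect the 1 request/second policy
    results = resp.json()
    if not results:
        return None
    hit = results[0]
    return {
        "display_name": hit.get("display_name"),
        "latitude": hit.get("lat"),
        "longitude": hit.get("lon"),
        "osm_type": hit.get("osm_type"),
        "osm_id": hit.get("osm_id"),
        "place_rank": hit.get("place_rank"),
    }
```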
### Chemistry
#### PubChem
**Use for:** Chemical compound validation
**Function:** `validate_pubchem_compound(compound_name)`
**Returns:**
- PubChem CID (compound ID)
- Molecular formula
- PubChem URL
**Example:**
```python
validate_pubchem_compound("aspirin")
# Returns:
{
    "cid": 2244,
    "molecular_formula": "C9H8O4",
    "pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
}
```
**No API key required** - Free
**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
### Genetics
#### NCBI Gene
**Use for:** Gene validation
**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`
**Returns:**
- NCBI Gene ID
- NCBI URL
**Example:**
```python
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
# Returns:
{
    "gene_id": "672",
    "ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
}
```
**No API key required** - Free
**Rate limit:** Max 3 requests/second
**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/
## Configuration
### API Config File Structure
Create `my_api_config.json` based on `assets/api_config_template.json`:
```json
{
  "field_mappings": {
    "species": {
      "api": "gbif_taxonomy",
      "output_field": "validated_species",
      "description": "Validate species names against GBIF"
    },
    "location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  },
  "nested_field_mappings": {
    "records.plant_species": {
      "api": "wfo_plants",
      "output_field": "validated_plant_taxonomy"
    },
    "records.location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  }
}
```
### Field Mapping Parameters
**Required:**
- `api` - API name (see list above)
- `output_field` - Name for validated data
**Optional:**
- `description` - Documentation
- `extra_params` - Additional API-specific parameters
## Adding Custom APIs
To add a new validation API:
1. **Create validator function** in `scripts/05_validate_with_apis.py`:
```python
from typing import Dict, Optional  # if not already imported in the script

import requests

def validate_custom_api(value: str, extra_param: str = None) -> Optional[Dict]:
    """
    Validate value using custom API.

    Args:
        value: The value to validate
        extra_param: Optional additional parameter

    Returns:
        Dictionary with validated data or None if not found
    """
    try:
        # Make API request (the optional parameter is passed as a query string here)
        params = {"extra": extra_param} if extra_param else None
        response = requests.get(f"https://api.example.com/{value}", params=params, timeout=10)
        if response.status_code == 200:
            data = response.json()
            return {
                'validated_value': data.get('canonical_name'),
                'api_id': data.get('id'),
                'additional_info': data.get('info')
            }
    except Exception as e:
        print(f"Custom API error: {e}")
    return None
```
2. **Register in API_VALIDATORS** dictionary:
```python
API_VALIDATORS = {
    'gbif_taxonomy': validate_gbif_taxonomy,
    'wfo_plants': validate_wfo_plant,
    # ... existing validators ...
    'custom_api': validate_custom_api,  # Add here
}
```
3. **Use in config file:**
```json
{
  "field_mappings": {
    "your_field": {
      "api": "custom_api",
      "output_field": "validated_field",
      "extra_params": {
        "extra_param": "value"
      }
    }
  }
}
```
## Rate Limiting
The script implements rate limiting to respect API usage policies:
**Default delays (built into script):**
- GeoNames: 0.5 seconds
- Nominatim: 1.0 second (required)
- WFO: 1.0 second
- Others: 0.5 seconds
**Modify delays if needed** in `scripts/05_validate_with_apis.py`:
```python
# In main() function
if not args.skip_validation:
    time.sleep(0.5)  # Adjust this value
```
## Error Handling
APIs may fail for various reasons:
**Common errors:**
- Connection timeout
- Rate limit exceeded
- Invalid API key
- Malformed query
- No match found
**Script behavior:**
- Continues processing on error
- Logs error to console
- Sets validated field to None
- Original extracted value preserved
**Retry logic** (sketched after this list):
- 3 retries with exponential backoff
- Implemented for network errors
- Not for "no match found" errors
## Best Practices
1. **Start with test run:**
```bash
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--skip-validation \
--output test_structure.json
```
2. **Validate subset first:**
- Test on 10 papers before full run
- Verify API connections work
- Check output structure
3. **Monitor API usage:**
- Track request counts for paid APIs
- Respect rate limits
- Consider caching results (see the sketch after this list)
4. **Handle failures gracefully:**
- Original data is never lost
- Can re-run validation separately
- Manually fix failed validations if needed
5. **Optimize API calls:**
- Only validate fields that need standardization
- Use cached results when re-running
- Batch similar queries when possible
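One way to implement the caching suggested above is a small JSON file keyed by API name and query value. A sketch, with the cache path chosen arbitrarily for illustration:
```python
import json
from pathlib import Path
from typing import Optional

CACHE_FILE = Path("api_cache.json")  # arbitrary location, not part of the pipeline
cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def cached_validate(api_name: str, value: str, validator) -> Optional[dict]:
    """Return a cached result if present; otherwise call the validator and store its result."""
    key = f"{api_name}:{value}"
    if key not in cache:
        cache[key] = validator(value)
        CACHE_FILE.write_text(json.dumps(cache, ensure_ascii=False, indent=2))
    return cache[key]

# Example: cached_validate("gbif_taxonomy", "Apis mellifera", validate_gbif_taxonomy)
```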
## Troubleshooting
### GeoNames "Service disabled" error
- Check account email is verified
- Enable web services in account settings
- Wait up to 1 hour after enabling
### Nominatim rate limit errors
- Script includes 1-second delays
- Don't run multiple instances
- Consider using local Nominatim instance
### NCBI errors
- Reduce request frequency
- Add longer delays
- Use E-utilities API key (optional, increases limit)
### No matches found
- Check spelling and formatting
- Try variations of name
- Some names may not be in database
- Consider manual curation for important cases


@@ -0,0 +1,147 @@
# Setup Guide for PDF Data Extraction
## Installation
### Using Conda (Recommended)
Create a dedicated environment for the extraction pipeline:
```bash
conda env create -f environment.yml
conda activate pdf_extraction
```
### Using pip
```bash
pip install -r requirements.txt
```
## Required Dependencies
### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs
### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export
## API Keys Setup
### Anthropic API Key (Required for Claude backends)
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```
Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:
```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```
### GeoNames Username (Optional - for geographic validation)
1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set environment variable:
```bash
export GEONAMES_USERNAME='your-username'
```
## Local Model Setup (Ollama)
For free, private, offline abstract filtering:
### Installation
**macOS:**
```bash
brew install ollama
```
**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
**Windows:**
Download from https://ollama.com/download
### Pulling Models
```bash
# Recommended models
ollama pull llama3.1:8b # Good balance (8GB RAM)
ollama pull mistral:7b # Fast, simple filtering
ollama pull qwen2.5:7b # Multilingual support
ollama pull llama3.1:70b # Best accuracy (64GB RAM)
```
### Starting Ollama Server
The server usually starts automatically; if it is not running, start it manually:
```bash
ollama serve
```
The server runs at http://localhost:11434 by default.
## Verifying Installation
Test that all components are properly installed:
```bash
# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, pandas; print('All dependencies OK')"
# Test that the Anthropic API key is set (constructs a client; no request is made)
python -c "from anthropic import Anthropic; client = Anthropic(); print('API key found')"
# Test Ollama (if using)
curl http://localhost:11434/api/tags
```
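If the `curl` check succeeds, you can also confirm that a pulled model actually generates text. A small sketch against the local REST API, assuming you pulled `llama3.1:8b`:
```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Reply with the single word OK.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # a short completion confirms the model is usable
```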
## Directory Structure
The skill works with PDFs and metadata organized in any of the following ways:
### Option A: Reference Manager Export
```
project/
├── library.bib # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```
### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```
### Option C: DOI List
```
project/
└── dois.txt # One DOI per line
```
## Next Steps
After installation, proceed to the workflow guide to start extracting data from your PDFs.
See: `references/workflow_guide.md`


@@ -0,0 +1,329 @@
# Validation and Quality Assurance Guide
## Overview
Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.
## When to Validate
- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Test Haiku vs Sonnet vs Ollama
- **For publication** - Report extraction quality metrics
## Recommended Sample Sizes
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers
## Step 7: Prepare Validation Set
Sample papers for manual annotation using one of three strategies.
### Random Sampling (General Quality)
```bash
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy random \
--output validation_set.json
```
Provides overall quality estimate but may miss rare cases.
### Stratified Sampling (Identify Weaknesses)
```bash
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
```
Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)
Best for identifying weak points in extraction.
### Diverse Sampling (Comprehensive)
```bash
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy diverse \
--output validation_set.json
```
Maximizes diversity across different paper types.
## Step 8: Manual Annotation
### Annotation Process
1. **Open validation file:**
```bash
# Use your preferred JSON editor
code validation_set.json # VS Code
vim validation_set.json # Vim
```
2. **For each paper in `validation_papers`:**
- Locate and read the original PDF
- Extract data according to the schema
- Fill the `ground_truth` field with correct extraction
- The structure should match `automated_extraction`
3. **Fill metadata fields:**
- `annotator`: Your name
- `annotation_date`: YYYY-MM-DD
- `notes`: Any ambiguous cases or comments
### Annotation Tips
**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation
**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction
**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes
**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability
### Example Annotation
```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"    // Automated missed this
        },
        {
          "species": "Bombus terrestris",  // Automated missed this record
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```
## Step 9: Calculate Validation Metrics
### Basic Metrics Calculation
```bash
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
```
### Advanced Options
**Fuzzy string matching:**
```bash
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--fuzzy-strings \
--output validation_metrics.json
```
Normalizes whitespace and case for string comparisons.
**Numeric tolerance:**
```bash
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--numeric-tolerance 0.01 \
--output validation_metrics.json
```
Allows small differences in numeric values.
**Ordered list comparison:**
```bash
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--list-order-matters \
--output validation_metrics.json
```
Treats lists as ordered sequences instead of sets.
## Understanding the Metrics
### Precision
**Definition:** Of the items extracted, what percentage are correct?
**Formula:** TP / (TP + FP)
**Example:** Extracted 10 species, 8 were correct → Precision = 80%
**High precision, low recall:** Conservative extraction (misses data)
### Recall
**Definition:** Of the true items, what percentage were extracted?
**Formula:** TP / (TP + FN)
**Example:** Paper has 12 species, extracted 8 → Recall = 67%
**Low precision, high recall:** Liberal extraction (includes errors)
### F1 Score
**Definition:** Harmonic mean of precision and recall
**Formula:** 2 × (Precision × Recall) / (Precision + Recall)
**Use:** Single metric balancing precision and recall
### Field-Level Metrics
Metrics are calculated for each field type:
**Boolean fields:**
- True positives, false positives, false negatives
**Numeric fields:**
- Exact match or within tolerance
**String fields:**
- Exact or fuzzy match
**List fields:**
- Set-based comparison (default)
- Items in both (TP), in automated only (FP), in truth only (FN); see the sketch after this list
**Nested objects:**
- Recursive field-by-field comparison
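As an illustration of the set-based list comparison and the precision/recall/F1 formulas above (a simplified sketch, not the logic of `scripts/08_calculate_validation_metrics.py` itself):
```python
def list_field_metrics(automated: list, truth: list) -> dict:
    """Compare two list-valued fields as sets and derive precision, recall, and F1."""
    auto_set, truth_set = set(automated), set(truth)
    tp = len(auto_set & truth_set)   # items in both
    fp = len(auto_set - truth_set)   # in automated extraction only
    fn = len(truth_set - auto_set)   # in ground truth only
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}

# Automated found 3 species, ground truth lists 4, and 2 overlap:
print(list_field_metrics(
    ["Apis mellifera", "Bombus terrestris", "Vespa crabro"],
    ["Apis mellifera", "Bombus terrestris", "Apis cerana", "Xylocopa violacea"],
))
```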
## Interpreting Results
### Validation Report Structure
```
OVERALL METRICS
Papers evaluated: 20
Precision: 87.3%
Recall: 79.2%
F1 Score: 83.1%

METRICS BY FIELD
Field        Precision   Recall   F1
species      95.2%       89.1%    92.0%
location     82.3%       75.4%    78.7%
method       91.0%       68.2%    77.9%

COMMON ISSUES
Fields with low recall (missed information):
- method: 68.2% recall, 12 missed items
Fields with low precision (incorrect extractions):
- location: 82.3% precision, 8 incorrect items
```
### Using Results to Improve
**Low Recall (Missing Information):**
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in prompt
- Consider using more capable model (Haiku → Sonnet)
**Low Precision (Incorrect Extractions):**
- Add validation rules to prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria
**Field-Specific Issues:**
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples
## Inter-Rater Reliability (Optional)
For critical applications, have multiple annotators:
1. **Split validation set:**
- 10 papers: Single annotator
- 10 papers: Both annotators independently
2. **Calculate agreement:**
```bash
python scripts/08_calculate_validation_metrics.py \
--annotations annotator1.json \
--compare-with annotator2.json \
--output agreement_metrics.json
```
3. **Resolve disagreements:**
- Discuss discrepancies
- Establish interpretation guidelines
- Re-annotate if needed
## Iterative Improvement Workflow
1. **Baseline:** Run extraction with initial schema
2. **Validate:** Calculate metrics on sample
3. **Analyze:** Identify weak fields and error patterns
4. **Revise:** Update schema, prompts, or model
5. **Re-extract:** Run extraction with improvements
6. **Re-validate:** Calculate new metrics
7. **Compare:** Check if metrics improved
8. **Repeat:** Until acceptable quality achieved
## Reporting Validation in Publications
Include in methods section:
```
Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).
```
Consider reporting:
- Sample size and sampling strategy
- Overall precision, recall, F1
- Field-level metrics for key fields
- Confidence intervals (see the sketch below)
- Common error types
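If you report confidence intervals as in the example above, one common choice is the Wilson score interval for a proportion. A sketch, assuming you have item-level counts (e.g. correct items and total extracted items for precision):
```python
from math import sqrt

def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a proportion, e.g. precision = TP / (TP + FP)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (center - half, center + half)

# Example: 96 correct items out of 110 extracted
print(wilson_ci(96, 110))
```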


@@ -0,0 +1,328 @@
# Complete Workflow Guide
This guide provides step-by-step instructions for the complete PDF extraction pipeline.
## Overview
The pipeline consists of 6 main steps plus optional validation:
1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format
**Optional:** Steps 7-9 for quality validation
## Step 1: Organize Metadata
Standardize PDF organization and metadata from various sources.
### From BibTeX (Zotero, JabRef, etc.)
```bash
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source path/to/library.bib \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
```
### From RIS (Mendeley, EndNote, etc.)
```bash
python scripts/01_organize_metadata.py \
--source-type ris \
--source path/to/library.ris \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
```
### From PDF Directory
```bash
python scripts/01_organize_metadata.py \
--source-type directory \
--source path/to/pdfs \
--output metadata.json
```
### From DOI List
```bash
python scripts/01_organize_metadata.py \
--source-type doi_list \
--source dois.txt \
--output metadata.json
```
**Outputs:**
- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (if --organize-pdfs used)
## Step 2: Filter Papers (Optional but Recommended)
Filter papers by analyzing abstracts to reduce PDF processing costs.
### Backend Selection
**Option A: Claude Haiku (Fast & Cheap)**
- Cost: ~$0.25 per million input tokens
- Speed: Very fast with batches API
- Accuracy: Good for most filtering tasks
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
```
**Option B: Claude Sonnet (More Accurate)**
- Cost: ~$3 per million input tokens
- Speed: Fast with batches API
- Accuracy: Higher for complex criteria
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-sonnet \
--use-batches \
--output filtered_papers.json
```
**Option C: Local Ollama (FREE & Private)**
- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend ollama \
--ollama-model llama3.1:8b \
--output filtered_papers.json
```
**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.
**Outputs:**
- `filtered_papers.json` - Papers marked as relevant/irrelevant
## Step 3: Extract Data from PDFs
Extract structured data using Claude's PDF vision capabilities.
### Schema Preparation
1. Copy schema template:
```bash
cp assets/schema_template.json my_schema.json
```
2. Customize for your domain:
- Update `objective` with your extraction goal
- Define `output_schema` structure
- Add domain-specific `instructions`
- Provide an `output_example`
See `assets/example_flower_visitors_schema.json` for a real-world example.
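For orientation only, a schema containing the four sections listed above might look like the sketch below; the field names and type notation are illustrative assumptions, so follow the shipped template and example for the authoritative structure:
```python
import json

# Illustrative sketch only -- copy assets/schema_template.json for the real structure.
my_schema = {
    "objective": "Extract flower-visitor records from ecology papers",
    "output_schema": {
        "has_relevant_data": "boolean",
        "records": [{"species": "string", "location": "string"}],
    },
    "instructions": "Report one record per species-location combination; use null for missing fields.",
    "output_example": {
        "has_relevant_data": True,
        "records": [{"species": "Apis mellifera", "location": "Brazil"}],
    },
}

with open("my_schema.json", "w") as f:
    json.dump(my_schema, f, indent=2)
```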
### Run Extraction
```bash
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
```
**Processing methods:**
- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing
**Optional flags:**
- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model
**Outputs:**
- `extracted_data.json` - Raw extraction results with token counts
## Step 4: Repair and Validate JSON
Repair malformed JSON and validate against schema.
```bash
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
```
**Optional flags:**
- `--strict` - Reject records that fail validation
**Outputs:**
- `cleaned_data.json` - Repaired and validated extractions
## Step 5: Validate with External APIs
Enrich data using external scientific databases.
### API Configuration
1. Copy API config template:
```bash
cp assets/api_config_template.json my_api_config.json
```
2. Map fields to validation APIs:
- `gbif_taxonomy` - GBIF for biological taxonomy
- `wfo_plants` - World Flora Online for plant names
- `geonames` - GeoNames for locations (requires account)
- `geocode` - OpenStreetMap for geocoding (free)
- `pubchem` - PubChem for chemical compounds
- `ncbi_gene` - NCBI Gene database
See `assets/example_api_config_ecology.json` for an ecology example.
### Run Validation
```bash
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
```
**Optional flags:**
- `--skip-validation` - Skip API calls, only structure data
**Outputs:**
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.
## Step 6: Export to Analysis Format
Convert to format for your analysis environment.
### Python (pandas)
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--flatten \
--output results
```
Outputs:
- `results.pkl` - pandas DataFrame
- `results.py` - Loading script
### R
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format r \
--flatten \
--output results
```
Outputs:
- `results.rds` - R data frame
- `results.R` - Loading script
### CSV
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format csv \
--flatten \
--output results.csv
```
### Excel
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format excel \
--flatten \
--output results.xlsx
```
### SQLite Database
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format sqlite \
--flatten \
--output results.db
```
Outputs:
- `results.db` - SQLite database
- `results.sql` - Example queries
**Flags:**
- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output
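Once exported, the results load directly in your analysis environment; for example, with the Python or CSV outputs above:
```python
import pandas as pd

df = pd.read_pickle("results.pkl")   # from --format python
# df = pd.read_csv("results.csv")    # from --format csv
print(df.shape)
print(df.head())
```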
## Cost Estimation
### Example: 100 papers, 10 pages each
**With Filtering (Recommended):**
1. Filter (Haiku): 100 abstracts × ~1,000 tokens (abstract + prompt) × $0.25/M = **$0.03**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.78**
**Without Filtering:**
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**
**With Local Ollama:**
1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**
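The arithmetic behind these estimates is easy to adapt to your own corpus; a sketch using the per-token prices quoted above:
```python
# Rough input-token cost estimate for the filtered pipeline above.
papers = 100
relevant_fraction = 0.5
tokens_per_abstract = 1_000      # abstract plus filtering prompt
tokens_per_pdf_page = 2_500
pages_per_paper = 10

haiku_per_mtok = 0.25            # filtering
sonnet_per_mtok = 3.00           # extraction

filter_cost = papers * tokens_per_abstract / 1e6 * haiku_per_mtok
extract_cost = papers * relevant_fraction * pages_per_paper * tokens_per_pdf_page / 1e6 * sonnet_per_mtok
print(f"Filtering:  ${filter_cost:.2f}")   # ≈ $0.03
print(f"Extraction: ${extract_cost:.2f}")  # ≈ $3.75
```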
### Token Usage by Step
- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens
**Tips to reduce costs:**
- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`
## Common Issues
### PDF Not Found
Check PDF paths in metadata.json match actual file locations.
### JSON Parsing Errors
Run Step 4 (repair JSON) - the json_repair library handles most issues.
### API Rate Limits
Scripts include delays, but check specific API documentation for limits.
### Ollama Connection Error
Ensure Ollama server is running: `ollama serve`
## Next Steps
For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.
See: `references/validation_guide.md`