Initial commit

skills/extract_from_pdfs/README.md

# Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

## Overview

This skill provides an end-to-end workflow for:
- Organizing PDF literature and metadata from various sources
- Filtering relevant papers based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)

## Quick Start

### 1. Installation

Create a conda environment:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

Or install with pip:

```bash
pip install -r requirements.txt
```

### 2. Set Up API Keys

Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

For geographic validation (optional):

```bash
export GEONAMES_USERNAME='your-geonames-username'
```

### 3. Run the Skill

The easiest way to use the skill is through Claude Code:

```bash
claude-code
```

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.

## Documentation

The skill includes comprehensive reference documentation:

- `references/setup_guide.md` - Installation and configuration
- `references/workflow_guide.md` - Complete step-by-step workflow with examples
- `references/validation_guide.md` - Validation methodology and metrics interpretation
- `references/api_reference.md` - External API integration details

## Manual Workflow

You can also run the scripts manually:

### Step 1: Organize Metadata

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### Step 2: Filter Papers (Optional)

First, customize the filtering prompt in `scripts/02_filter_abstracts.py` for your use case.

**Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)**

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Local Model via Ollama (FREE)**

```bash
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

Recommended Ollama models:
- `llama3.1:8b` - Good balance (8GB RAM)
- `mistral:7b` - Fast, good for simple filtering
- `qwen2.5:7b` - Good multilingual support
- `llama3.1:70b` - Better accuracy (64GB RAM)

### Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing `assets/schema_template.json`.

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```

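For orientation, this is roughly what a single extraction call looks like under the hood: the PDF is sent to Claude as a base64 document block together with the prompt built from your schema. A minimal sketch, not the script's actual code; the model ID, file path, and prompt wording are placeholders:

```python
import base64
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("pdfs/Smith2020.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-5",          # example model ID
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text", "text": "Extract records following my_schema.json and return JSON only."},
        ],
    }],
)
print(response.content[0].text)
```
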
### Step 4: Repair JSON

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```

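Step 4 leans on the `json-repair` library to fix truncated or slightly malformed model output before schema validation. A minimal sketch of the idea, assuming the raw model response is in `raw_text`:

```python
import json
from json_repair import repair_json

# Example of malformed model output (stray comma, unbalanced brackets)
raw_text = '{"has_relevant_data": true, "records": [{"subject": "Apis mellifera",]}'

repaired = repair_json(raw_text)   # returns a syntactically valid JSON string
data = json.loads(repaired)
print(data["records"])
```
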
### Step 5: Validate with APIs

First, create your API configuration by copying and customizing `assets/api_config_template.json`.

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```

### Step 6: Export

```bash
# For Python/pandas
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results

# For R
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results

# For CSV
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```

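After export, the Python output can be loaded straight into pandas for analysis. A short example, assuming the `python` export writes a pickled DataFrame named `results.pkl` (the exact filename depends on the export script):

```python
import pandas as pd

df = pd.read_pickle("results.pkl")      # flattened table: one row per extracted record
print(df.shape)
print(df.head())
print(df["country"].value_counts())     # example column from the flower-visitor schema
```
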
### Validation & Quality Assurance (Optional but Recommended)

Validate extraction quality using precision and recall metrics:

#### Step 7: Prepare Validation Set

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Sampling strategies:
- `random` - Random sample
- `stratified` - Sample by extraction characteristics
- `diverse` - Maximize diversity

#### Step 8: Manual Annotation

1. Open `validation_set.json`
2. For each sampled paper:
   - Read the PDF
   - Fill in the `ground_truth` field with the correct extraction
   - Add `annotator` name and `annotation_date`
   - Use `notes` for ambiguous cases
3. Save the file

#### Step 9: Calculate Metrics

```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

This produces:
- **Precision**: % of extracted items that are correct
- **Recall**: % of true items that were extracted
- **F1 Score**: Harmonic mean of precision and recall
- **Per-field metrics**: Accuracy by field type

Use these metrics to:
- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate and improve the schema
- Report quality in publications

## Customization

### Creating Your Extraction Schema

1. Copy `assets/schema_template.json` to `my_schema.json`
2. Customize the following sections:
   - `objective`: What you're extracting
   - `system_context`: Your scientific domain
   - `instructions`: Step-by-step guidance for Claude
   - `output_schema`: JSON schema defining your data structure
   - `output_example`: Example of desired output

See `assets/example_flower_visitors_schema.json` for a real-world example.

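Because `output_schema` is standard JSON Schema, you can sanity-check your customized schema against a hand-written example record before running any extraction. A small sketch using the `jsonschema` package from the dependency list (the field names follow the template):

```python
import json
from jsonschema import validate, ValidationError

with open("my_schema.json") as f:
    schema = json.load(f)["output_schema"]

example = {"has_relevant_data": True,
           "records": [{"subject": "Apis mellifera", "value": 42.5}]}

try:
    validate(instance=example, schema=schema)
    print("Example conforms to the schema")
except ValidationError as e:
    print(f"Schema problem: {e.message}")
```
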
### Configuring API Validation

1. Copy `assets/api_config_template.json` to `my_api_config.json`
2. Map your schema fields to appropriate validation APIs
3. See available APIs in `scripts/05_validate_with_apis.py` and `references/api_reference.md`

See `assets/example_api_config_ecology.json` for an ecology example.

## Cost Estimation

PDF processing uses approximately 1,500-3,000 tokens per page:

- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers

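The estimate above is simple arithmetic on input tokens, which you can adapt to your own collection size and current pricing. A quick back-of-envelope check, using the ~$3/M-token Sonnet rate quoted in this skill's cost notes:

```python
pages_per_paper = 10
tokens_per_page = 2_500            # midpoint of the 1,500-3,000 range
papers = 100
price_per_million = 3.0            # USD per million input tokens (Sonnet, per the notes above)

total_tokens = pages_per_paper * tokens_per_page * papers   # 2,500,000 tokens
cost = total_tokens / 1_000_000 * price_per_million
print(f"{total_tokens:,} tokens -> ~${cost:.2f}")            # ~$7.50, within the $6-9 range
```
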
Tips to reduce costs:
- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with `--use-caching`
- Use batch processing (`--method batches`)
- Consider using Haiku for simpler extractions

## Supported Data Sources

### Bibliography Formats
- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs

### Output Formats
- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database

### Validation APIs
- **Biology**: GBIF, World Flora Online, NCBI Gene
- **Geography**: GeoNames, OpenStreetMap Nominatim
- **Chemistry**: PubChem
- **Medicine**: (extensible - add your own)

## Examples

See the [beetle flower visitors repository](https://github.com/brunoasm/ARE_2026_beetle_flower_visitors) for a real-world example of this workflow in action.

## Troubleshooting

### PDF Size Limits
- Maximum file size: 32MB
- Maximum pages: 100
- Solution: Use chunked processing for larger PDFs

### JSON Parsing Errors
- The `json-repair` library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues

### API Rate Limits
- Add delays between requests (implemented in scripts)
- Use batch processing when available
- Check specific API documentation for limits

## Contributing

To add support for additional validation APIs:
1. Add validator function to `scripts/05_validate_with_apis.py`
2. Register in `API_VALIDATORS` dictionary
3. Update `api_config_template.json` with examples

## Citation

If you use this skill in your research, please cite:

```bibtex
@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}
```

## License

MIT License - see LICENSE file for details

skills/extract_from_pdfs/SKILL.md

---
name: extract-from-pdfs
description: This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.
---

# Extract Structured Data from Scientific PDFs

## Purpose

Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.

**Core capabilities:**
- Organize metadata from BibTeX, RIS, directories, or DOI lists
- Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
- Extract structured data from PDFs with customizable schemas
- Repair and validate JSON outputs automatically
- Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
- Calculate precision/recall metrics for quality assurance
- Export to Python, R, CSV, Excel, or SQLite

## When to Use This Skill

Use when:
- Conducting systematic literature reviews requiring data extraction
- Building databases from scientific publications
- Converting PDF collections to structured datasets
- Validating extraction quality with ground truth metrics
- Comparing extraction approaches (different models, prompts)

Do not use for:
- Single PDF summarization (use basic PDF reading instead)
- Full-text PDF search (use document search tools)
- PDF editing or manipulation

## Getting Started

### 1. Initial Setup

Read the setup guide for installation and configuration:

```bash
cat references/setup_guide.md
```

Key setup steps:
- Install dependencies: `conda env create -f environment.yml`
- Set API keys: `export ANTHROPIC_API_KEY='your-key'`
- Optional: Install Ollama for free local filtering

### 2. Define Extraction Requirements

**Ask the user:**
- Research domain and extraction goals
- How PDFs are organized (reference manager, directory, DOI list)
- Approximate collection size
- Preferred analysis environment (Python, R, etc.)

**Request 2-3 example PDFs** from the user to analyze structure and design the schema.

### 3. Design Extraction Schema

Create a custom schema from the template:

```bash
cp assets/schema_template.json my_schema.json
```

Customize it for the specific domain:
- Set `objective` describing what to extract
- Define `output_schema` with field types and descriptions
- Add domain-specific `instructions` for Claude
- Provide `output_example` showing the desired format

See `assets/example_flower_visitors_schema.json` for a real-world ecology example.

## Workflow Execution

### Complete Pipeline

Run the 6-step pipeline (plus optional validation):

```bash
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source library.bib \
    --pdf-dir pdfs/ \
    --output metadata.json

# Step 2: Filter papers (optional - recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json

# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json

# Step 4: Repair JSON
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json

# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json

# Step 6: Export to analysis format
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --output results
```

### Validation (Optional but Recommended)

Calculate extraction quality metrics:

```bash
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json

# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper

# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

Validation produces precision, recall, and F1 metrics per field and overall.

## Detailed Documentation

Access comprehensive guides in the `references/` directory:

**Setup and installation:**
```bash
cat references/setup_guide.md
```

**Complete workflow with examples:**
```bash
cat references/workflow_guide.md
```

**Validation methodology:**
```bash
cat references/validation_guide.md
```

**API integration details:**
```bash
cat references/api_reference.md
```

## Customization

### Schema Customization

Modify `my_schema.json` to match the research domain:

1. **Objective:** Describe what data to extract
2. **Instructions:** Step-by-step extraction guidance
3. **Output schema:** JSON schema defining structure
4. **Important notes:** Domain-specific rules
5. **Examples:** Show desired output format

Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.

### API Configuration

Configure external database validation in `my_api_config.json` by mapping extracted fields to validation APIs:

- `gbif_taxonomy` - Biological taxonomy
- `wfo_plants` - Plant names specifically
- `geonames` - Geographic locations
- `geocode` - Address to coordinates
- `pubchem` - Chemical compounds
- `ncbi_gene` - Gene identifiers

See `assets/example_api_config_ecology.json` for an ecology-specific example.

### Filtering Customization

Edit the filtering criteria in `scripts/02_filter_abstracts.py` (line 74), replacing the TODO section with domain-specific criteria:

- What constitutes primary data vs review?
- What data types are relevant?
- What scope (geographic, temporal, taxonomic) is needed?

Use conservative criteria (when in doubt, include the paper) to avoid false negatives; a sketch of what such criteria can look like follows.

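A hedged illustration of the kind of criteria block that could replace the TODO in `scripts/02_filter_abstracts.py` (the variable name and exact prompt structure in that script may differ; this is only an example of conservative, domain-specific wording):

```python
# Hypothetical filtering criteria for a flower-visitation meta-analysis.
FILTERING_CRITERIA = """
Include the paper if the abstract suggests ANY of the following:
- primary field or experimental observations of flower visitors or pollinators
- visitation records for named plant taxa, at any geographic scope
Exclude the paper only if it is clearly a review, model-only study, or off-topic.
When in doubt, INCLUDE the paper.
"""
```
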
## Cost Optimization

**Backend selection for filtering (Step 2):**
- Ollama (local): $0 - Best for privacy and high volume
- Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
- Sonnet (API): ~$3/M tokens - Best for complex filtering

**Typical costs for 100 papers:**
- With filtering (Haiku + Sonnet): ~$4
- With local Ollama + Sonnet: ~$3.75
- Without filtering (Sonnet only): ~$7.50

**Optimization strategies:**
- Use abstract filtering to reduce PDF processing
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Quality Assurance

**Validation workflow provides:**
- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 score: Harmonic mean of precision and recall
- Per-field metrics: Identify weak fields

**Use metrics to:**
- Establish baseline extraction quality
- Compare different approaches (models, prompts, schemas)
- Identify areas for improvement
- Report extraction quality in publications

**Recommended sample sizes:**
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Iterative Improvement

1. Run initial extraction with a baseline schema
2. Validate on a sample using Steps 7-9
3. Analyze field-level metrics and error patterns
4. Revise the schema, prompts, or model selection
5. Re-extract and re-validate
6. Compare metrics to verify improvement
7. Repeat until acceptable quality is achieved

See `references/validation_guide.md` for detailed guidance on interpreting metrics and improving extraction quality.

## Available Scripts

**Data organization:**
- `scripts/01_organize_metadata.py` - Standardize PDFs and metadata

**Filtering:**
- `scripts/02_filter_abstracts.py` - Filter by abstract (Haiku/Sonnet/Ollama)

**Extraction:**
- `scripts/03_extract_from_pdfs.py` - Extract from PDFs with Claude vision

**Processing:**
- `scripts/04_repair_json.py` - Repair and validate JSON
- `scripts/05_validate_with_apis.py` - Enrich with external databases
- `scripts/06_export_database.py` - Export to analysis formats

**Validation:**
- `scripts/07_prepare_validation_set.py` - Sample papers for annotation
- `scripts/08_calculate_validation_metrics.py` - Calculate P/R/F1 metrics

## Assets

**Templates:**
- `assets/schema_template.json` - Blank extraction schema template
- `assets/api_config_template.json` - API validation configuration template

**Examples:**
- `assets/example_flower_visitors_schema.json` - Ecology extraction example
- `assets/example_api_config_ecology.json` - Ecology API validation example

skills/extract_from_pdfs/assets/api_config_template.json

{
  "_comment": "Configuration for API validation in step 05",
  "_instructions": [
    "Specify which external APIs to use for validating/enriching each field",
    "Available APIs:",
    " - gbif_taxonomy: GBIF for biological taxonomy",
    " - wfo_plants: World Flora Online for plant names",
    " - geonames: GeoNames for geographic locations (requires account)",
    " - geocode: OpenStreetMap Nominatim for geocoding",
    " - pubchem: PubChem for chemical compounds",
    " - ncbi_gene: NCBI Gene database",
    "Customize the field_mappings below based on your extraction schema"
  ],

  "field_mappings": {
    "_example_species_field": {
      "api": "gbif_taxonomy",
      "output_field": "validated_species",
      "description": "Validate species names against GBIF"
    },

    "_example_location_field": {
      "api": "geocode",
      "output_field": "geocoded_location",
      "description": "Geocode location to lat/lon coordinates"
    },

    "_example_compound_field": {
      "api": "pubchem",
      "output_field": "validated_compound",
      "description": "Validate chemical compound names"
    }
  },

  "nested_field_mappings": {
    "_comment": "For fields nested in 'records' array",
    "_example": "records.species would validate the 'species' field within each record",

    "records.species": {
      "api": "gbif_taxonomy",
      "output_field": "validated_species"
    },

    "records.location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  },

  "api_specific_settings": {
    "geonames": {
      "_note": "Requires free account at geonames.org",
      "_setup": "Set GEONAMES_USERNAME environment variable"
    },

    "rate_limits": {
      "_comment": "Be respectful of API rate limits",
      "default_delay_seconds": 0.5,
      "nominatim_delay_seconds": 1.0
    }
  }
}

skills/extract_from_pdfs/assets/example_api_config_ecology.json

{
  "_comment": "Example API configuration for ecology/flower visitor research",
  "_note": "This shows how to validate taxonomic and geographic data",

  "field_mappings": {
    "plant_species": {
      "api": "wfo_plants",
      "output_field": "validated_plant_taxonomy",
      "description": "Validate plant species names against World Flora Online"
    },

    "country": {
      "api": "geonames",
      "output_field": "validated_country",
      "description": "Validate and standardize country names"
    }
  },

  "nested_field_mappings": {
    "_comment": "These apply to fields within the 'records' array",

    "records.plant_species": {
      "api": "wfo_plants",
      "output_field": "validated_plant_taxonomy",
      "extra_params": {}
    },

    "records.country": {
      "api": "geonames",
      "output_field": "geocoded_country"
    },

    "records.locality": {
      "api": "geocode",
      "output_field": "coordinates",
      "description": "Get coordinates for field sites"
    }
  },

  "validation_rules": {
    "plant_species": {
      "required": true,
      "validate_taxonomy": true,
      "accept_genus_only": false
    },

    "visitors": {
      "type": "array",
      "min_items": 1,
      "validate_items": false,
      "_note": "Visitor names as-written, not validated against taxonomy"
    },

    "location_completeness": {
      "require_country": true,
      "require_coordinates": false,
      "_note": "Country is required but exact coordinates are optional"
    }
  },

  "api_settings": {
    "retry_on_failure": true,
    "max_retries": 3,
    "timeout_seconds": 10,

    "rate_limits": {
      "wfo": 1.0,
      "geonames": 0.5,
      "nominatim": 1.0
    }
  }
}

skills/extract_from_pdfs/assets/example_flower_visitors_schema.json

{
  "_comment": "Example extraction schema based on flower visitor ecology research",
  "_note": "This is a real-world example. Copy and adapt for your own domain.",

  "objective": "carefully analyze this paper and extract empirical observations of flower visitors",

  "system_context": "You are a scientific research assistant specializing in analyzing papers about plant-pollinator interactions. Your task is to analyze scientific papers and extract structured data for a meta-analysis of flower visitation.",

  "instructions": [
    "Determine if the paper contains any empirical observations of flower visitors",
    "If empirical observations are present, extract all records of flower visitors",
    "Each record should represent observations of one plant species in one locality"
  ],

  "analysis_steps": [
    "1. Identify and quote relevant sections of the paper that contain empirical primary observations of flower visitors. If there is no primary data, explain why and do not create any records.",
    "2. List out each plant species mentioned in these observations. Consider species as the smallest taxonomic unit for plants. If there are multiple varieties or subspecies, summarize all records for the same species as a single record.",
    "3. For each plant species, extract the required information: location, method of observation, time of observation, and list of ALL flower visitors (be comprehensive)",
    "4. Assess whether any visitors or pollinators are beetles (Coleoptera). For each visitor, classify as 'Beetle' or 'Non-beetle'",
    "5. Evaluate whether the methods are unbiased by checking observation times and methods",
    "6. Double-check your findings for accuracy and completeness"
  ],

  "important_notes": [
    "Only include PRIMARY observations from the paper. Do not consider secondary data or citations",
    "If a record involves more than one plant species or country, separate it into multiple records",
    "Do not add any variables to the output that are not explicitly listed in the schema",
    "Do not use external information to update taxonomic names. List names as they appear in the source",
    "If anything is unknown, use 'none' or empty lists as appropriate",
    "Always include all records in the response, even if it ends up being extremely long"
  ],

  "output_schema": {
    "type": "object",
    "properties": {
      "has_primary_visitor_data": {
        "type": "boolean",
        "description": "Whether there are primary observations about flower visitors in this study"
      },
      "has_visitor_notes": {
        "type": "string",
        "description": "Brief explanation of evidence supporting the assessment"
      },
      "response_truncated": {
        "type": "boolean",
        "description": "Whether there were too many records to retrieve comprehensively"
      },
      "noteworthy_beetle_fact": {
        "type": "string",
        "description": "One or two sentences summarizing noteworthy facts about beetles discovered in this study"
      },
      "beetle_pollen_feeders": {
        "type": "boolean",
        "description": "Whether the paper mentions any beetle pollen feeder as adult"
      },
      "beetle_nectar_feeders": {
        "type": "boolean",
        "description": "Whether the paper mentions any beetle drinking nectar as adult"
      },
      "beetle_florivores": {
        "type": "boolean",
        "description": "Whether beetles damage flower parts other than pollen and nectar"
      },
      "beetle_larval_breeding": {
        "type": "boolean",
        "description": "Whether beetle larvae feed on parts of the same plant visited by adults"
      },
      "records": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "country": {
              "type": "string",
              "description": "Country name"
            },
            "state_province": {
              "type": "string",
              "description": "State or province name"
            },
            "locality": {
              "type": "string",
              "description": "Specific location of the observation"
            },
            "plant_species": {
              "type": "string",
              "description": "Plant species name"
            },
            "method": {
              "type": "string",
              "description": "One-sentence description of observation methods"
            },
            "observation_time": {
              "type": "array",
              "items": {
                "type": "string",
                "enum": ["day", "night", "dawn", "dusk"]
              },
              "description": "List of observation times"
            },
            "visitors": {
              "type": "array",
              "items": {
                "type": "string"
              },
              "description": "List of all flower visitors observed"
            },
            "beetle_families": {
              "type": "array",
              "items": {
                "type": "string"
              },
              "description": "List of beetle families mentioned as flower visitors"
            },
            "beetle_visitors": {
              "type": "boolean",
              "description": "Whether beetles were found as flower visitors"
            },
            "beetle_pollinators": {
              "type": "boolean",
              "description": "Whether beetles were found as significant pollinators"
            },
            "methods_unbiased": {
              "type": "boolean",
              "description": "Whether methods appear to be unbiased"
            },
            "methods_biased_reasoning": {
              "type": "string",
              "description": "One-sentence explanation for bias assessment"
            }
          },
          "required": ["country", "plant_species", "method", "visitors"]
        }
      }
    },
    "required": ["has_primary_visitor_data", "records"]
  },

  "output_example": {
    "has_primary_visitor_data": true,
    "has_visitor_notes": "Paper reports direct field observations of flower visitors across multiple species",
    "response_truncated": false,
    "noteworthy_beetle_fact": "Beetles from the family Scarabaeidae were observed as frequent visitors and effective pollen carriers",
    "beetle_pollen_feeders": true,
    "beetle_nectar_feeders": false,
    "beetle_florivores": false,
    "beetle_larval_breeding": false,
    "records": [
      {
        "country": "Brazil",
        "state_province": "São Paulo",
        "locality": "Parque Estadual da Serra do Mar",
        "plant_species": "Magnolia ovata",
        "method": "Direct observation of floral visitors during anthesis over 3 days",
        "observation_time": ["day", "night"],
        "visitors": [
          "Cyclocephala paraguayensis (Coleoptera: Scarabaeidae)",
          "Apis mellifera (Hymenoptera: Apidae)",
          "Trigona spinipes (Hymenoptera: Apidae)"
        ],
        "beetle_families": ["Scarabaeidae"],
        "beetle_visitors": true,
        "beetle_pollinators": true,
        "methods_unbiased": true,
        "methods_biased_reasoning": "Observations conducted during both day and night, allowing detection of nocturnal visitors"
      }
    ]
  }
}

skills/extract_from_pdfs/assets/schema_template.json

{
  "_comment": "This is a template extraction schema. Customize for your specific use case.",
  "_instructions": "Fill in the sections below with your specific extraction requirements.",

  "objective": "carefully analyze this paper and extract [DESCRIBE YOUR DATA TYPE, e.g., 'empirical observations of X', 'experimental measurements of Y', etc.]",

  "system_context": "You are a scientific research assistant specializing in [YOUR DOMAIN, e.g., 'ecology', 'chemistry', 'medicine', etc.]. Your task is to analyze scientific papers and extract structured data for systematic review and meta-analysis.",

  "instructions": [
    "Determine if the paper contains [YOUR CRITERIA, e.g., 'primary empirical data']",
    "If present, extract all [YOUR RECORD TYPE, e.g., 'observation records', 'measurements', 'outcomes']",
    "For each record, extract the following information: [LIST KEY FIELDS]"
  ],

  "analysis_steps": [
    "1. Identify and quote relevant sections containing [YOUR DATA TYPE]",
    "2. List out each [RECORD UNIT, e.g., 'species', 'compound', 'patient cohort']",
    "3. For each unit, extract required information and quote supporting text",
    "4. [ADD DOMAIN-SPECIFIC VALIDATION STEPS]",
    "5. Double-check for accuracy and completeness"
  ],

  "important_notes": [
    "Only include PRIMARY data from this paper, not secondary sources",
    "If a record involves multiple [UNITS], separate into individual records",
    "Do not add fields not in the schema",
    "Use 'none' or empty lists for unknown values",
    "List names exactly as they appear in the source"
  ],

  "output_schema": {
    "type": "object",
    "properties": {
      "has_relevant_data": {
        "type": "boolean",
        "description": "Whether the paper contains the target data type"
      },
      "data_description": {
        "type": "string",
        "description": "Brief explanation of what data is present"
      },
      "records": {
        "type": "array",
        "description": "List of extracted records",
        "items": {
          "type": "object",
          "properties": {
            "_comment": "CUSTOMIZE THESE FIELDS FOR YOUR USE CASE",
            "location": {
              "type": "string",
              "description": "Geographic location (if applicable)"
            },
            "subject": {
              "type": "string",
              "description": "Main subject of the record (species, compound, etc.)"
            },
            "measurement_type": {
              "type": "string",
              "description": "Type of measurement or observation"
            },
            "value": {
              "type": ["number", "string"],
              "description": "Measured or observed value"
            },
            "units": {
              "type": "string",
              "description": "Units of measurement"
            },
            "method": {
              "type": "string",
              "description": "Brief description of methodology"
            },
            "sample_size": {
              "type": "integer",
              "description": "Sample size if applicable"
            },
            "notes": {
              "type": "string",
              "description": "Additional relevant notes"
            }
          },
          "required": ["subject"]
        }
      }
    },
    "required": ["has_relevant_data", "records"]
  },

  "output_example": {
    "has_relevant_data": true,
    "data_description": "Paper reports 5 observation records across 3 locations",
    "records": [
      {
        "location": "Example Location",
        "subject": "Example Subject",
        "measurement_type": "Example Type",
        "value": 42.5,
        "units": "mg/L",
        "method": "Brief methodology description",
        "sample_size": 20,
        "notes": "Any relevant notes"
      }
    ]
  }
}

skills/extract_from_pdfs/environment.yml

name: pdf_extraction
channels:
  - conda-forge
  - defaults
dependencies:
  - python>=3.10
  - pip>=23.0
  - pip:
    - anthropic>=0.40.0
    - pybtex>=0.24.0
    - rispy>=0.6.0
    - json-repair>=0.25.0
    - jsonschema>=4.20.0
    - pandas>=2.0.0
    - openpyxl>=3.1.0
    - pyreadr>=0.5.0
    - requests>=2.31.0

skills/extract_from_pdfs/references/api_reference.md

# External API Validation Reference

## Overview

Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.

## Available APIs

### Biological Taxonomy

#### GBIF (Global Biodiversity Information Facility)

**Use for:** General biological taxonomy (animals, plants, fungi, etc.)

**Function:** `validate_gbif_taxonomy(scientific_name)`

**Returns:**
- Matched canonical name
- Full scientific name with authority
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
- GBIF ID
- Match confidence and type
- Taxonomic status

**Example:**
```python
validate_gbif_taxonomy("Apis melifera")
# Returns:
{
    "matched_name": "Apis mellifera",
    "scientific_name": "Apis mellifera Linnaeus, 1758",
    "rank": "SPECIES",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Apidae",
    "genus": "Apis",
    "gbif_id": 1340278,
    "confidence": 100,
    "match_type": "EXACT"
}
```

**No API key required** - Free and unlimited

**Documentation:** https://www.gbif.org/developer/species

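If you want to see how such a validator can be built (or debug one), GBIF's species match endpoint can be queried directly with `requests`. A minimal sketch, not the script's exact implementation; the response keys used here (`canonicalName`, `usageKey`, `matchType`, `confidence`) are part of the public GBIF API:

```python
import requests

def gbif_match(name: str) -> dict | None:
    """Query the GBIF backbone and return a small summary, or None if no match."""
    r = requests.get("https://api.gbif.org/v1/species/match",
                     params={"name": name}, timeout=10)
    r.raise_for_status()
    data = r.json()
    if data.get("matchType") == "NONE":
        return None
    return {"matched_name": data.get("canonicalName"),
            "gbif_id": data.get("usageKey"),
            "match_type": data.get("matchType"),
            "confidence": data.get("confidence")}

print(gbif_match("Apis melifera"))
```
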
#### World Flora Online (WFO)

**Use for:** Plant taxonomy specifically

**Function:** `validate_wfo_plant(scientific_name)`

**Returns:**
- Matched name
- Scientific name with authors
- Family
- WFO ID
- Taxonomic status

**Example:**
```python
validate_wfo_plant("Magnolia grandiflora")
# Returns:
{
    "matched_name": "Magnolia grandiflora",
    "scientific_name": "Magnolia grandiflora L.",
    "authors": "L.",
    "family": "Magnoliaceae",
    "wfo_id": "wfo-0000988234",
    "status": "Accepted"
}
```

**No API key required** - Free

**Documentation:** http://www.worldfloraonline.org/

### Geography

#### GeoNames

**Use for:** Location validation and standardization

**Function:** `validate_geonames(location, country=None)`

**Returns:**
- Matched place name
- Country name and code
- Administrative divisions (state, province)
- Latitude/longitude
- GeoNames ID

**Example:**
```python
validate_geonames("São Paulo", country="BR")
# Returns:
{
    "matched_name": "São Paulo",
    "country": "Brazil",
    "country_code": "BR",
    "admin1": "São Paulo",
    "admin2": None,
    "latitude": "-23.5475",
    "longitude": "-46.63611",
    "geonames_id": 3448439
}
```

**Requires free account:** Register at https://www.geonames.org/login

**Setup:**
1. Create an account
2. Enable web services in account settings
3. Set the environment variable: `export GEONAMES_USERNAME='your-username'`

**Rate limit:** The free tier enforces hourly and daily credit limits; keep request rates modest

**Documentation:** https://www.geonames.org/export/web-services.html

#### OpenStreetMap Nominatim

**Use for:** Geocoding addresses to coordinates

**Function:** `geocode_location(address)`

**Returns:**
- Display name (formatted address)
- Latitude/longitude
- OSM type and ID
- Place rank

**Example:**
```python
geocode_location("Field Museum, Chicago, IL")
# Returns:
{
    "display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
    "latitude": "41.8662",
    "longitude": "-87.6169",
    "osm_type": "way",
    "osm_id": 54856789,
    "place_rank": 30
}
```

**No API key required** - Free

**Important:** Add 1-second delays between requests (implemented in script)

**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/

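The Nominatim search endpoint is likewise easy to call directly when debugging geocoding results. A minimal sketch (not the script's exact code); Nominatim's usage policy requires an identifying `User-Agent` and at most one request per second:

```python
import requests

def nominatim_geocode(address: str) -> dict | None:
    """Return the top Nominatim hit for an address, or None."""
    r = requests.get("https://nominatim.openstreetmap.org/search",
                     params={"q": address, "format": "json", "limit": 1},
                     headers={"User-Agent": "pdf-extraction-skill (your-email@example.org)"},
                     timeout=10)
    r.raise_for_status()
    hits = r.json()
    if not hits:
        return None
    top = hits[0]
    return {"display_name": top["display_name"],
            "latitude": top["lat"],
            "longitude": top["lon"]}

print(nominatim_geocode("Field Museum, Chicago, IL"))
```
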
### Chemistry

#### PubChem

**Use for:** Chemical compound validation

**Function:** `validate_pubchem_compound(compound_name)`

**Returns:**
- PubChem CID (compound ID)
- Molecular formula
- PubChem URL

**Example:**
```python
validate_pubchem_compound("aspirin")
# Returns:
{
    "cid": 2244,
    "molecular_formula": "C9H8O4",
    "pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
}
```

**No API key required** - Free

**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest

### Genetics

#### NCBI Gene

**Use for:** Gene validation

**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`

**Returns:**
- NCBI Gene ID
- NCBI URL

**Example:**
```python
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
# Returns:
{
    "gene_id": "672",
    "ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
}
```

**No API key required** - Free

**Rate limit:** Max 3 requests/second

**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/

## Configuration

### API Config File Structure

Create `my_api_config.json` based on `assets/api_config_template.json`:

```json
{
  "field_mappings": {
    "species": {
      "api": "gbif_taxonomy",
      "output_field": "validated_species",
      "description": "Validate species names against GBIF"
    },
    "location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  },

  "nested_field_mappings": {
    "records.plant_species": {
      "api": "wfo_plants",
      "output_field": "validated_plant_taxonomy"
    },
    "records.location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  }
}
```

### Field Mapping Parameters

**Required:**
- `api` - API name (see list above)
- `output_field` - Name for validated data

**Optional:**
- `description` - Documentation
- `extra_params` - Additional API-specific parameters

## Adding Custom APIs

To add a new validation API:

1. **Create a validator function** in `scripts/05_validate_with_apis.py`:

```python
# Uses `requests` and `typing.Optional`/`Dict`, which are already imported at the top of the script.
def validate_custom_api(value: str, extra_param: str = None) -> Optional[Dict]:
    """
    Validate value using a custom API.

    Args:
        value: The value to validate
        extra_param: Optional additional parameter

    Returns:
        Dictionary with validated data or None if not found
    """
    try:
        # Make API request
        response = requests.get(f"https://api.example.com/{value}")
        if response.status_code == 200:
            data = response.json()
            return {
                'validated_value': data.get('canonical_name'),
                'api_id': data.get('id'),
                'additional_info': data.get('info')
            }
    except Exception as e:
        print(f"Custom API error: {e}")

    return None
```

2. **Register it in the `API_VALIDATORS` dictionary:**

```python
API_VALIDATORS = {
    'gbif_taxonomy': validate_gbif_taxonomy,
    'wfo_plants': validate_wfo_plant,
    # ... existing validators ...
    'custom_api': validate_custom_api,  # Add here
}
```

3. **Use it in the config file:**

```json
{
  "field_mappings": {
    "your_field": {
      "api": "custom_api",
      "output_field": "validated_field",
      "extra_params": {
        "extra_param": "value"
      }
    }
  }
}
```

## Rate Limiting

The script implements rate limiting to respect API usage policies.

**Default delays (built into script):**
- GeoNames: 0.5 seconds
- Nominatim: 1.0 second (required)
- WFO: 1.0 second
- Others: 0.5 seconds

**Modify delays if needed** in `scripts/05_validate_with_apis.py`:

```python
# In main() function
if not args.skip_validation:
    time.sleep(0.5)  # Adjust this value
```

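If you need finer-grained throttling than a single `time.sleep`, a small per-API limiter is a common pattern. This is a generic sketch, not code from the script; the delay values mirror the defaults listed above:

```python
import time

class RateLimiter:
    """Enforce a minimum delay per API name between consecutive calls."""
    def __init__(self, delays: dict[str, float], default: float = 0.5):
        self.delays = delays
        self.default = default
        self.last_call: dict[str, float] = {}

    def wait(self, api: str) -> None:
        min_gap = self.delays.get(api, self.default)
        elapsed = time.monotonic() - self.last_call.get(api, 0.0)
        if elapsed < min_gap:
            time.sleep(min_gap - elapsed)
        self.last_call[api] = time.monotonic()

limiter = RateLimiter({"nominatim": 1.0, "wfo": 1.0, "geonames": 0.5})
limiter.wait("nominatim")  # call before each request to that API
```
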
## Error Handling

APIs may fail for various reasons:

**Common errors:**
- Connection timeout
- Rate limit exceeded
- Invalid API key
- Malformed query
- No match found

**Script behavior:**
- Continues processing on error
- Logs error to console
- Sets validated field to None
- Original extracted value preserved

**Retry logic:**
- 3 retries with exponential backoff
- Implemented for network errors
- Not for "no match found" errors

## Best Practices

1. **Start with a test run:**
```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --skip-validation \
    --output test_structure.json
```

2. **Validate a subset first:**
   - Test on 10 papers before the full run
   - Verify API connections work
   - Check output structure

3. **Monitor API usage:**
   - Track request counts for paid APIs
   - Respect rate limits
   - Consider caching results

4. **Handle failures gracefully:**
   - Original data is never lost
   - Validation can be re-run separately
   - Manually fix failed validations if needed

5. **Optimize API calls:**
   - Only validate fields that need standardization
   - Use cached results when re-running
   - Batch similar queries when possible

## Troubleshooting

### GeoNames "Service disabled" error
- Check that the account email is verified
- Enable web services in account settings
- Wait up to 1 hour after enabling

### Nominatim rate limit errors
- The script includes 1-second delays
- Don't run multiple instances
- Consider using a local Nominatim instance

### NCBI errors
- Reduce request frequency
- Add longer delays
- Use an E-utilities API key (optional, increases the limit)

### No matches found
- Check spelling and formatting
- Try variations of the name
- Some names may not be in the database
- Consider manual curation for important cases

skills/extract_from_pdfs/references/setup_guide.md

# Setup Guide for PDF Data Extraction

## Installation

### Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

### Using pip

```bash
pip install -r requirements.txt
```

## Required Dependencies

### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs

### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export

## API Keys Setup

### Anthropic API Key (Required for Claude backends)

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:

```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```

### GeoNames Username (Optional - for geographic validation)

1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set the environment variable:

```bash
export GEONAMES_USERNAME='your-username'
```

## Local Model Setup (Ollama)

For free, private, offline abstract filtering:

### Installation

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download from https://ollama.com/download

### Pulling Models

```bash
# Recommended models
ollama pull llama3.1:8b     # Good balance (8GB RAM)
ollama pull mistral:7b      # Fast, simple filtering
ollama pull qwen2.5:7b      # Multilingual support
ollama pull llama3.1:70b    # Best accuracy (64GB RAM)
```

### Starting Ollama Server

Usually auto-starts, but can be manually started:

```bash
ollama serve
```

The server runs at http://localhost:11434 by default.

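From Python, the same local server can be reached over its REST API, which is how an `ollama` backend typically talks to it. A small sketch for checking that a model responds (the endpoint and payload follow Ollama's documented `/api/generate` interface; the prompt is just an example):

```python
import requests

payload = {
    "model": "llama3.1:8b",
    "prompt": "Reply with the single word OK.",
    "stream": False,            # return one JSON object instead of a stream
}
r = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
r.raise_for_status()
print(r.json()["response"])
```
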
## Verifying Installation

Test that all components are properly installed:

```bash
# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, pandas; print('All dependencies OK')"

# Test Anthropic client setup (verifies the SDK loads and an API key is set)
python -c "import os; from anthropic import Anthropic; client = Anthropic(); print('Anthropic client OK')"

# Test Ollama (if using)
curl http://localhost:11434/api/tags
```

## Directory Structure

The skill will work with PDFs and metadata organized in various ways:

### Option A: Reference Manager Export
```
project/
├── library.bib          # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```

### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```

### Option C: DOI List
```
project/
└── dois.txt             # One DOI per line
```

## Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: `references/workflow_guide.md`

skills/extract_from_pdfs/references/validation_guide.md

# Validation and Quality Assurance Guide

## Overview

Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.

## When to Validate

- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Test Haiku vs Sonnet vs Ollama
- **For publication** - Report extraction quality metrics

## Recommended Sample Sizes

- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Step 7: Prepare Validation Set

Sample papers for manual annotation using one of three strategies.

### Random Sampling (General Quality)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy random \
    --output validation_set.json
```

Provides an overall quality estimate but may miss rare cases.

### Stratified Sampling (Identify Weaknesses)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)

Best for identifying weak points in extraction.

### Diverse Sampling (Comprehensive)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy diverse \
    --output validation_set.json
```

Maximizes diversity across different paper types.

## Step 8: Manual Annotation

### Annotation Process

1. **Open the validation file:**
   ```bash
   # Use your preferred JSON editor
   code validation_set.json    # VS Code
   vim validation_set.json     # Vim
   ```

2. **For each paper in `validation_papers`:**
   - Locate and read the original PDF
   - Extract data according to the schema
   - Fill the `ground_truth` field with the correct extraction
   - The structure should match `automated_extraction`

3. **Fill metadata fields:**
   - `annotator`: Your name
   - `annotation_date`: YYYY-MM-DD
   - `notes`: Any ambiguous cases or comments

### Annotation Tips

**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation

**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction

**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes

**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability

### Example Annotation

In this example, the automated extraction missed `state_province` on the first record and missed the second record entirely; the ground truth supplies both, and the `notes` field documents the differences.

```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"
        },
        {
          "species": "Bombus terrestris",
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```

|
||||
|
||||
## Step 9: Calculate Validation Metrics
|
||||
|
||||
### Basic Metrics Calculation
|
||||
|
||||
```bash
|
||||
python scripts/08_calculate_validation_metrics.py \
|
||||
--annotations validation_set.json \
|
||||
--output validation_metrics.json \
|
||||
--report validation_report.txt
|
||||
```
|
||||
|
||||
### Advanced Options
|
||||
|
||||
**Fuzzy string matching:**
|
||||
```bash
|
||||
python scripts/08_calculate_validation_metrics.py \
|
||||
--annotations validation_set.json \
|
||||
--fuzzy-strings \
|
||||
--output validation_metrics.json
|
||||
```
|
||||
|
||||
Normalizes whitespace and case for string comparisons.
|
||||
|
||||
**Numeric tolerance:**
|
||||
```bash
|
||||
python scripts/08_calculate_validation_metrics.py \
|
||||
--annotations validation_set.json \
|
||||
--numeric-tolerance 0.01 \
|
||||
--output validation_metrics.json
|
||||
```
|
||||
|
||||
Allows small differences in numeric values.
|
||||
|
||||
**Ordered list comparison:**
|
||||
```bash
|
||||
python scripts/08_calculate_validation_metrics.py \
|
||||
--annotations validation_set.json \
|
||||
--list-order-matters \
|
||||
--output validation_metrics.json
|
||||
```
|
||||
|
||||
Treats lists as ordered sequences instead of sets.
|
||||
|
||||
## Understanding the Metrics
|
||||
|
||||
### Precision
|
||||
**Definition:** Of the items extracted, what percentage are correct?
|
||||
|
||||
**Formula:** TP / (TP + FP)
|
||||
|
||||
**Example:** Extracted 10 species, 8 were correct → Precision = 80%
|
||||
|
||||
**High precision, low recall:** Conservative extraction (misses data)
|
||||
|
||||
### Recall
|
||||
**Definition:** Of the true items, what percentage were extracted?
|
||||
|
||||
**Formula:** TP / (TP + FN)
|
||||
|
||||
**Example:** Paper has 12 species, extracted 8 → Recall = 67%
|
||||
|
||||
**Low precision, high recall:** Liberal extraction (includes errors)
|
||||
|
||||
### F1 Score
|
||||
**Definition:** Harmonic mean of precision and recall
|
||||
|
||||
**Formula:** 2 × (Precision × Recall) / (Precision + Recall)
|
||||
|
||||
**Use:** Single metric balancing precision and recall
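
For a quick feel of how the three numbers interact, the sketch below computes them from raw true-positive, false-positive, and false-negative counts; the counts mirror the examples above and are otherwise made up:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 10 items extracted, 8 correct, paper actually contains 12 -> (0.80, 0.67, 0.73)
print(precision_recall_f1(tp=8, fp=2, fn=4))
```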
|
||||
|
||||
### Field-Level Metrics
|
||||
|
||||
Metrics are calculated for each field type:
|
||||
|
||||
**Boolean fields:**
|
||||
- True positives, false positives, false negatives
|
||||
|
||||
**Numeric fields:**
|
||||
- Exact match or within tolerance
|
||||
|
||||
**String fields:**
|
||||
- Exact or fuzzy match
|
||||
|
||||
**List fields:**
|
||||
- Set-based comparison (default)
|
||||
- Items in both (TP), in automated only (FP), in truth only (FN); see the sketch after this list
|
||||
|
||||
**Nested objects:**
|
||||
- Recursive field-by-field comparison
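
A minimal sketch of the set-based list comparison mentioned above (the real logic lives in `scripts/08_calculate_validation_metrics.py` and may differ in detail):

```python
def compare_list_field(automated: list, truth: list) -> dict:
    """Set-based comparison of a list field (order ignored)."""
    auto_set, truth_set = set(automated), set(truth)
    return {
        "tp": len(auto_set & truth_set),   # items in both
        "fp": len(auto_set - truth_set),   # extracted but not in ground truth
        "fn": len(truth_set - auto_set),   # in ground truth but missed
    }

print(compare_list_field(["Apis mellifera"], ["Apis mellifera", "Bombus terrestris"]))
# -> {'tp': 1, 'fp': 0, 'fn': 1}
```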
|
||||
|
||||
## Interpreting Results
|
||||
|
||||
### Validation Report Structure
|
||||
|
||||
```
|
||||
OVERALL METRICS
|
||||
Papers evaluated: 20
|
||||
Precision: 87.3%
|
||||
Recall: 79.2%
|
||||
F1 Score: 83.1%
|
||||
|
||||
METRICS BY FIELD
|
||||
Field Precision Recall F1
|
||||
species 95.2% 89.1% 92.0%
|
||||
location 82.3% 75.4% 78.7%
|
||||
method 91.0% 68.2% 77.9%
|
||||
|
||||
COMMON ISSUES
|
||||
Fields with low recall (missed information):
|
||||
- method: 68.2% recall, 12 missed items
|
||||
|
||||
Fields with low precision (incorrect extractions):
|
||||
- location: 82.3% precision, 8 incorrect items
|
||||
```
|
||||
|
||||
### Using Results to Improve
|
||||
|
||||
**Low Recall (Missing Information):**
|
||||
- Review extraction prompt instructions
|
||||
- Add examples of the missed pattern
|
||||
- Emphasize completeness in prompt
|
||||
- Consider using a more capable model (Haiku → Sonnet)
|
||||
|
||||
**Low Precision (Incorrect Extractions):**
|
||||
- Add validation rules to prompt
|
||||
- Provide clearer field definitions
|
||||
- Add negative examples
|
||||
- Tighten extraction criteria
|
||||
|
||||
**Field-Specific Issues:**
|
||||
- Identify problematic field types
|
||||
- Revise schema definitions
|
||||
- Add field-specific instructions
|
||||
- Update examples
|
||||
|
||||
## Inter-Rater Reliability (Optional)
|
||||
|
||||
For critical applications, have multiple annotators:
|
||||
|
||||
1. **Split validation set:**
|
||||
- 10 papers: Single annotator
|
||||
- 10 papers: Both annotators independently
|
||||
|
||||
2. **Calculate agreement:**
|
||||
```bash
|
||||
python scripts/08_calculate_validation_metrics.py \
|
||||
--annotations annotator1.json \
|
||||
--compare-with annotator2.json \
|
||||
--output agreement_metrics.json
|
||||
```
|
||||
|
||||
3. **Resolve disagreements:**
|
||||
- Discuss discrepancies
|
||||
- Establish interpretation guidelines
|
||||
- Re-annotate if needed
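
For a quick agreement check on a single boolean field before running the comparison script, Cohen's kappa can be computed directly. The sketch below assumes you have already pulled the two annotators' `has_relevant_data` judgments into parallel lists; the labels shown are made up:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators' boolean labels on the same papers."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

print(cohens_kappa([True, True, False, True], [True, False, False, True]))  # -> 0.5
```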
|
||||
|
||||
## Iterative Improvement Workflow
|
||||
|
||||
1. **Baseline:** Run extraction with initial schema
|
||||
2. **Validate:** Calculate metrics on sample
|
||||
3. **Analyze:** Identify weak fields and error patterns
|
||||
4. **Revise:** Update schema, prompts, or model
|
||||
5. **Re-extract:** Run extraction with improvements
|
||||
6. **Re-validate:** Calculate new metrics
|
||||
7. **Compare:** Check if metrics improved
|
||||
8. **Repeat:** Until acceptable quality achieved
|
||||
|
||||
## Reporting Validation in Publications
|
||||
|
||||
Include in methods section:
|
||||
|
||||
```
|
||||
Extraction quality was assessed on a stratified random sample of
|
||||
20 papers. Automated extraction achieved 87.3% precision (95% CI:
|
||||
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
|
||||
score of 83.1%. Field-level metrics ranged from 77.9% (method
|
||||
descriptions) to 92.0% (species names).
|
||||
```
|
||||
|
||||
Consider reporting:
|
||||
- Sample size and sampling strategy
|
||||
- Overall precision, recall, F1
|
||||
- Field-level metrics for key fields
|
||||
- Confidence intervals (a bootstrap sketch follows below)
|
||||
- Common error types
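
The confidence intervals quoted in the example can be obtained by bootstrap resampling of per-paper scores. A minimal sketch follows; the per-paper precision values are illustrative, and the metrics script does not necessarily report CIs itself:

```python
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of per-paper scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]

per_paper_precision = [0.90, 0.80, 1.00, 0.75, 0.95, 0.85, 0.90, 1.00]  # illustrative
print(bootstrap_ci(per_paper_precision))
```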
|
||||
328
skills/extract_from_pdfs/references/workflow_guide.md
Normal file
328
skills/extract_from_pdfs/references/workflow_guide.md
Normal file
@@ -0,0 +1,328 @@
|
||||
# Complete Workflow Guide
|
||||
|
||||
This guide provides step-by-step instructions for the complete PDF extraction pipeline.
|
||||
|
||||
## Overview
|
||||
|
||||
The pipeline consists of 6 main steps plus optional validation:
|
||||
|
||||
1. **Organize Metadata** - Standardize PDF and metadata organization
|
||||
2. **Filter Papers** - Identify relevant papers by abstract (optional)
|
||||
3. **Extract Data** - Extract structured data from PDFs
|
||||
4. **Repair JSON** - Validate and repair JSON outputs
|
||||
5. **Validate with APIs** - Enrich with external databases
|
||||
6. **Export** - Convert to analysis format
|
||||
|
||||
**Optional:** Steps 7-9 for quality validation
|
||||
|
||||
## Step 1: Organize Metadata
|
||||
|
||||
Standardize PDF organization and metadata from various sources.
|
||||
|
||||
### From BibTeX (Zotero, JabRef, etc.)
|
||||
|
||||
```bash
|
||||
python scripts/01_organize_metadata.py \
|
||||
--source-type bibtex \
|
||||
--source path/to/library.bib \
|
||||
--pdf-dir path/to/pdfs \
|
||||
--organize-pdfs \
|
||||
--output metadata.json
|
||||
```
|
||||
|
||||
### From RIS (Mendeley, EndNote, etc.)
|
||||
|
||||
```bash
|
||||
python scripts/01_organize_metadata.py \
|
||||
--source-type ris \
|
||||
--source path/to/library.ris \
|
||||
--pdf-dir path/to/pdfs \
|
||||
--organize-pdfs \
|
||||
--output metadata.json
|
||||
```
|
||||
|
||||
### From PDF Directory
|
||||
|
||||
```bash
|
||||
python scripts/01_organize_metadata.py \
|
||||
--source-type directory \
|
||||
--source path/to/pdfs \
|
||||
--output metadata.json
|
||||
```
|
||||
|
||||
### From DOI List
|
||||
|
||||
```bash
|
||||
python scripts/01_organize_metadata.py \
|
||||
--source-type doi_list \
|
||||
--source dois.txt \
|
||||
--output metadata.json
|
||||
```
|
||||
|
||||
**Outputs:**
|
||||
- `metadata.json` - Standardized metadata file
|
||||
- `organized_pdfs/` - Renamed PDFs (if --organize-pdfs used)
|
||||
|
||||
## Step 2: Filter Papers (Optional but Recommended)
|
||||
|
||||
Filter papers by analyzing abstracts to reduce PDF processing costs.
|
||||
|
||||
### Backend Selection
|
||||
|
||||
**Option A: Claude Haiku (Fast & Cheap)**
|
||||
- Cost: ~$0.25 per million input tokens
|
||||
- Speed: Very fast with batches API
|
||||
- Accuracy: Good for most filtering tasks
|
||||
|
||||
```bash
|
||||
python scripts/02_filter_abstracts.py \
|
||||
--metadata metadata.json \
|
||||
--backend anthropic-haiku \
|
||||
--use-batches \
|
||||
--output filtered_papers.json
|
||||
```
|
||||
|
||||
**Option B: Claude Sonnet (More Accurate)**
|
||||
- Cost: ~$3 per million input tokens
|
||||
- Speed: Fast with batches API
|
||||
- Accuracy: Higher for complex criteria
|
||||
|
||||
```bash
|
||||
python scripts/02_filter_abstracts.py \
|
||||
--metadata metadata.json \
|
||||
--backend anthropic-sonnet \
|
||||
--use-batches \
|
||||
--output filtered_papers.json
|
||||
```
|
||||
|
||||
**Option C: Local Ollama (FREE & Private)**
|
||||
- Cost: $0 (runs locally)
|
||||
- Speed: Depends on hardware
|
||||
- Accuracy: Good with llama3.1:8b or better
|
||||
|
||||
```bash
|
||||
python scripts/02_filter_abstracts.py \
|
||||
--metadata metadata.json \
|
||||
--backend ollama \
|
||||
--ollama-model llama3.1:8b \
|
||||
--output filtered_papers.json
|
||||
```
|
||||
|
||||
**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.
|
||||
|
||||
**Outputs:**
|
||||
- `filtered_papers.json` - Papers marked as relevant/irrelevant
|
||||
|
||||
## Step 3: Extract Data from PDFs
|
||||
|
||||
Extract structured data using Claude's PDF vision capabilities.
|
||||
|
||||
### Schema Preparation
|
||||
|
||||
1. Copy schema template:
|
||||
```bash
|
||||
cp assets/schema_template.json my_schema.json
|
||||
```
|
||||
|
||||
2. Customize for your domain:
|
||||
- Update `objective` with your extraction goal
|
||||
- Define `output_schema` structure
|
||||
- Add domain-specific `instructions`
|
||||
- Provide an `output_example`
|
||||
|
||||
See `assets/example_flower_visitors_schema.json` for a real-world example.
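
If you prefer to generate the schema programmatically, a minimal sketch is shown below. The top-level keys match those read by `scripts/03_extract_from_pdfs.py` (`system_context`, `objective`, `instructions`, `output_schema`, `output_example`); the field names inside are placeholders for your own domain:

```python
import json

schema = {
    "system_context": "You are a scientific research assistant extracting ecological records.",
    "objective": "extract every flower-visitor observation reported in the paper",
    "instructions": [
        "Read the full paper, including tables and figure captions.",
        "Record one entry per species-location observation.",
    ],
    "output_schema": {
        "has_relevant_data": "boolean",
        "records": [{"species": "string", "location": "string"}],
    },
    "output_example": {
        "has_relevant_data": True,
        "records": [{"species": "Apis mellifera", "location": "Brazil"}],
    },
}

with open("my_schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, indent=2, ensure_ascii=False)
```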
|
||||
|
||||
### Run Extraction
|
||||
|
||||
```bash
|
||||
python scripts/03_extract_from_pdfs.py \
|
||||
--metadata filtered_papers.json \
|
||||
--schema my_schema.json \
|
||||
--method batches \
|
||||
--output extracted_data.json
|
||||
```
|
||||
|
||||
**Processing methods:**
|
||||
- `batches` - Most efficient for many PDFs
|
||||
- `base64` - Sequential processing
|
||||
|
||||
**Optional flags:**
|
||||
- `--filter-results filtered_papers.json` - Only process relevant papers
|
||||
- `--test` - Process only 3 PDFs for testing
|
||||
- `--model claude-3-5-sonnet-20241022` - Change model
|
||||
|
||||
**Outputs:**
|
||||
- `extracted_data.json` - Raw extraction results with token counts
|
||||
|
||||
## Step 4: Repair and Validate JSON
|
||||
|
||||
Repair malformed JSON and validate against schema.
|
||||
|
||||
```bash
|
||||
python scripts/04_repair_json.py \
|
||||
--input extracted_data.json \
|
||||
--schema my_schema.json \
|
||||
--output cleaned_data.json
|
||||
```
|
||||
|
||||
**Optional flags:**
|
||||
- `--strict` - Reject records that fail validation
|
||||
|
||||
**Outputs:**
|
||||
- `cleaned_data.json` - Repaired and validated extractions
|
||||
|
||||
## Step 5: Validate with External APIs
|
||||
|
||||
Enrich data using external scientific databases.
|
||||
|
||||
### API Configuration
|
||||
|
||||
1. Copy API config template:
|
||||
```bash
|
||||
cp assets/api_config_template.json my_api_config.json
|
||||
```
|
||||
|
||||
2. Map fields to validation APIs:
|
||||
- `gbif_taxonomy` - GBIF for biological taxonomy
|
||||
- `wfo_plants` - World Flora Online for plant names
|
||||
- `geonames` - GeoNames for locations (requires account)
|
||||
- `geocode` - OpenStreetMap for geocoding (free)
|
||||
- `pubchem` - PubChem for chemical compounds
|
||||
- `ncbi_gene` - NCBI Gene database
|
||||
|
||||
See `assets/example_api_config_ecology.json` for an ecology example.
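
The authoritative structure is defined by `assets/api_config_template.json`; purely as an illustration of the idea, a field-to-API mapping might look something like the sketch below, where both the layout and the field names are assumptions rather than the template's actual format:

```python
import json

# Hypothetical mapping: each extracted field points at one of the validators listed above.
api_config = {
    "species": "gbif_taxonomy",    # check species names against GBIF
    "location": "geocode",         # geocode free-text locations via OpenStreetMap
    "state_province": "geonames",  # requires GEONAMES_USERNAME
}

with open("my_api_config.json", "w", encoding="utf-8") as f:
    json.dump(api_config, f, indent=2)
```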
|
||||
|
||||
### Run Validation
|
||||
|
||||
```bash
|
||||
python scripts/05_validate_with_apis.py \
|
||||
--input cleaned_data.json \
|
||||
--apis my_api_config.json \
|
||||
--output validated_data.json
|
||||
```
|
||||
|
||||
**Optional flags:**
|
||||
- `--skip-validation` - Skip API calls, only structure data
|
||||
|
||||
**Outputs:**
|
||||
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.
|
||||
|
||||
## Step 6: Export to Analysis Format
|
||||
|
||||
Convert to format for your analysis environment.
|
||||
|
||||
### Python (pandas)
|
||||
|
||||
```bash
|
||||
python scripts/06_export_database.py \
|
||||
--input validated_data.json \
|
||||
--format python \
|
||||
--flatten \
|
||||
--output results
|
||||
```
|
||||
|
||||
Outputs:
|
||||
- `results.pkl` - pandas DataFrame
|
||||
- `results.py` - Loading script
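
Loading the exported DataFrame back into a Python session is a one-liner (pandas is already a dependency of the pipeline):

```python
import pandas as pd

df = pd.read_pickle("results.pkl")
print(df.head())
```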
|
||||
|
||||
### R
|
||||
|
||||
```bash
|
||||
python scripts/06_export_database.py \
|
||||
--input validated_data.json \
|
||||
--format r \
|
||||
--flatten \
|
||||
--output results
|
||||
```
|
||||
|
||||
Outputs:
|
||||
- `results.rds` - R data frame
|
||||
- `results.R` - Loading script
|
||||
|
||||
### CSV
|
||||
|
||||
```bash
|
||||
python scripts/06_export_database.py \
|
||||
--input validated_data.json \
|
||||
--format csv \
|
||||
--flatten \
|
||||
--output results.csv
|
||||
```
|
||||
|
||||
### Excel
|
||||
|
||||
```bash
|
||||
python scripts/06_export_database.py \
|
||||
--input validated_data.json \
|
||||
--format excel \
|
||||
--flatten \
|
||||
--output results.xlsx
|
||||
```
|
||||
|
||||
### SQLite Database
|
||||
|
||||
```bash
|
||||
python scripts/06_export_database.py \
|
||||
--input validated_data.json \
|
||||
--format sqlite \
|
||||
--flatten \
|
||||
--output results.db
|
||||
```
|
||||
|
||||
Outputs:
|
||||
- `results.db` - SQLite database
|
||||
- `results.sql` - Example queries
|
||||
|
||||
**Flags:**
|
||||
- `--flatten` - Flatten nested JSON for tabular format
|
||||
- `--include-metadata` - Include paper metadata in output
|
||||
|
||||
## Cost Estimation
|
||||
|
||||
### Example: 100 papers, 10 pages each
|
||||
|
||||
**With Filtering (Recommended):**
|
||||
1. Filter (Haiku): 100 abstracts × 500 tokens × $0.25/M = **$0.01**
|
||||
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
|
||||
3. **Total: ~$3.76**
|
||||
|
||||
**Without Filtering:**
|
||||
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**
|
||||
|
||||
**With Local Ollama:**
|
||||
1. Filter (Ollama): **$0**
|
||||
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
|
||||
3. **Total: ~$3.75**
|
||||
|
||||
### Token Usage by Step
|
||||
- Abstract (~200 words): ~500 tokens
|
||||
- PDF page (text-heavy): ~1,500-3,000 tokens
|
||||
- Extraction prompt: ~500-1,000 tokens
|
||||
- Schema/context: ~500-1,000 tokens
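
A back-of-the-envelope estimate can be scripted from these per-step figures; the defaults below are the same rough assumptions used in the examples above (10 pages per paper, roughly half the papers relevant), not measured values:

```python
def estimate_cost(n_papers, pages_per_paper=10, relevant_fraction=0.5,
                  tokens_per_page=2500, tokens_per_abstract=500,
                  haiku_per_m=0.25, sonnet_per_m=3.0, use_filtering=True):
    """Rough input-token cost estimate in USD for filtering plus extraction."""
    cost = 0.0
    papers_to_extract = n_papers
    if use_filtering:
        cost += n_papers * tokens_per_abstract * haiku_per_m / 1e6
        papers_to_extract = int(n_papers * relevant_fraction)
    cost += papers_to_extract * pages_per_paper * tokens_per_page * sonnet_per_m / 1e6
    return round(cost, 2)

print(estimate_cost(100))                       # with filtering  -> ~3.76
print(estimate_cost(100, use_filtering=False))  # extraction only -> 7.5
```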
|
||||
|
||||
**Tips to reduce costs:**
|
||||
- Use abstract filtering (Step 2)
|
||||
- Use Haiku for filtering instead of Sonnet
|
||||
- Use local Ollama for filtering (free)
|
||||
- Enable prompt caching with `--use-caching`
|
||||
- Process in batches with `--use-batches`
|
||||
|
||||
## Common Issues
|
||||
|
||||
### PDF Not Found
|
||||
Check PDF paths in metadata.json match actual file locations.
|
||||
|
||||
### JSON Parsing Errors
|
||||
Run Step 4 (repair JSON) - the json_repair library handles most issues.
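
If you need to repair a single response by hand, the same library the script uses can be called directly (`repair_json` is the function imported by `scripts/04_repair_json.py`):

```python
from json_repair import repair_json

broken = '{"has_relevant_data": true, "records": [{"species": "Apis mellifera",]}'
print(repair_json(broken))  # returns a best-effort valid JSON string
```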
|
||||
|
||||
### API Rate Limits
|
||||
Scripts include delays, but check specific API documentation for limits.
|
||||
|
||||
### Ollama Connection Error
|
||||
Ensure Ollama server is running: `ollama serve`
|
||||
|
||||
## Next Steps
|
||||
|
||||
For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.
|
||||
|
||||
See: `references/validation_guide.md`
|
||||
26
skills/extract_from_pdfs/requirements.txt
Normal file
26
skills/extract_from_pdfs/requirements.txt
Normal file
@@ -0,0 +1,26 @@
|
||||
# Core dependencies for PDF extraction skill
|
||||
|
||||
# Anthropic API
|
||||
anthropic>=0.40.0
|
||||
|
||||
# PDF and bibliography handling
|
||||
pybtex>=0.24.0
|
||||
rispy>=0.6.0
|
||||
|
||||
# JSON processing and repair
|
||||
json-repair>=0.25.0
|
||||
jsonschema>=4.20.0
|
||||
|
||||
# Data processing and export
|
||||
pandas>=2.0.0
|
||||
openpyxl>=3.1.0 # For Excel export
|
||||
pyreadr>=0.5.0 # For R RDS export
|
||||
|
||||
# API requests
|
||||
requests>=2.31.0
|
||||
|
||||
# Optional: For enhanced functionality
|
||||
# Uncomment if needed:
|
||||
# numpy>=1.24.0
|
||||
# matplotlib>=3.7.0
|
||||
# seaborn>=0.12.0
|
||||
310
skills/extract_from_pdfs/scripts/01_organize_metadata.py
Normal file
310
skills/extract_from_pdfs/scripts/01_organize_metadata.py
Normal file
@@ -0,0 +1,310 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Organize PDFs and metadata from various sources (BibTeX, RIS, directory, DOI list).
|
||||
Standardizes file naming and creates a unified metadata JSON for downstream processing.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
import re
|
||||
|
||||
try:
|
||||
from pybtex.database.input import bibtex
|
||||
BIBTEX_AVAILABLE = True
|
||||
except ImportError:
|
||||
BIBTEX_AVAILABLE = False
|
||||
print("Warning: pybtex not installed. BibTeX support disabled.")
|
||||
|
||||
try:
|
||||
import rispy
|
||||
RIS_AVAILABLE = True
|
||||
except ImportError:
|
||||
RIS_AVAILABLE = False
|
||||
print("Warning: rispy not installed. RIS support disabled.")
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Organize PDFs and metadata from various sources'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--source-type',
|
||||
choices=['bibtex', 'ris', 'directory', 'doi_list'],
|
||||
required=True,
|
||||
help='Type of source data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--source',
|
||||
required=True,
|
||||
help='Path to source file (BibTeX/RIS file, directory, or DOI list)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--pdf-dir',
|
||||
help='Directory containing PDFs (for bibtex/ris with relative paths)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='metadata.json',
|
||||
help='Output metadata JSON file'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--organize-pdfs',
|
||||
action='store_true',
|
||||
help='Copy PDFs to standardized directory structure'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--pdf-output-dir',
|
||||
default='organized_pdfs',
|
||||
help='Directory for organized PDFs'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_bibtex_metadata(bib_path: Path, pdf_base_dir: Optional[Path] = None) -> List[Dict]:
|
||||
"""Load metadata from BibTeX file"""
|
||||
if not BIBTEX_AVAILABLE:
|
||||
raise ImportError("pybtex is required for BibTeX support. Install with: pip install pybtex")
|
||||
|
||||
parser = bibtex.Parser()
|
||||
bib_data = parser.parse_file(str(bib_path))
|
||||
|
||||
metadata = []
|
||||
for key, entry in bib_data.entries.items():
|
||||
record = {
|
||||
'id': key,
|
||||
'type': entry.type,
|
||||
'title': entry.fields.get('title', ''),
|
||||
'year': entry.fields.get('year', ''),
|
||||
'doi': entry.fields.get('doi', ''),
|
||||
'abstract': entry.fields.get('abstract', ''),
|
||||
'journal': entry.fields.get('journal', ''),
|
||||
'authors': ', '.join(
|
||||
[' '.join([p for p in person.last_names + person.first_names])
|
||||
for person in entry.persons.get('author', [])]
|
||||
),
|
||||
'keywords': entry.fields.get('keywords', ''),
|
||||
'pdf_path': None
|
||||
}
|
||||
|
||||
# Extract PDF path from file field
|
||||
if 'file' in entry.fields:
|
||||
file_field = entry.fields['file']
|
||||
if file_field.startswith('{') and file_field.endswith('}'):
|
||||
file_field = file_field[1:-1]
|
||||
|
||||
for file_entry in file_field.split(';'):
|
||||
parts = file_entry.strip().split(':')
|
||||
if len(parts) >= 3 and parts[2].lower() == 'application/pdf':
|
||||
pdf_path = parts[1].strip()
|
||||
if pdf_base_dir:
|
||||
pdf_path = str(pdf_base_dir / pdf_path)
|
||||
record['pdf_path'] = pdf_path
|
||||
break
|
||||
|
||||
metadata.append(record)
|
||||
|
||||
print(f"Loaded {len(metadata)} entries from BibTeX file")
|
||||
return metadata
|
||||
|
||||
|
||||
def load_ris_metadata(ris_path: Path, pdf_base_dir: Optional[Path] = None) -> List[Dict]:
|
||||
"""Load metadata from RIS file"""
|
||||
if not RIS_AVAILABLE:
|
||||
raise ImportError("rispy is required for RIS support. Install with: pip install rispy")
|
||||
|
||||
with open(ris_path, 'r', encoding='utf-8') as f:
|
||||
entries = rispy.load(f)
|
||||
|
||||
metadata = []
|
||||
for i, entry in enumerate(entries):
|
||||
# Generate ID from first author and year or use index
|
||||
        first_author = (entry.get('authors') or ['Unknown'])[0]
|
||||
year = entry.get('year', 'NoYear')
|
||||
entry_id = f"{first_author.split()[-1]}{year}_{i}"
|
||||
|
||||
record = {
|
||||
'id': entry_id,
|
||||
'type': entry.get('type_of_reference', 'article'),
|
||||
'title': entry.get('title', ''),
|
||||
'year': str(entry.get('year', '')),
|
||||
'doi': entry.get('doi', ''),
|
||||
'abstract': entry.get('abstract', ''),
|
||||
'journal': entry.get('journal_name', ''),
|
||||
'authors': '; '.join(entry.get('authors', [])),
|
||||
'keywords': '; '.join(entry.get('keywords', [])),
|
||||
'pdf_path': None
|
||||
}
|
||||
|
||||
# Try to find PDF in standard locations
|
||||
if pdf_base_dir:
|
||||
# Common patterns: FirstAuthorYear.pdf, doi_cleaned.pdf, etc.
|
||||
pdf_candidates = [
|
||||
f"{entry_id}.pdf",
|
||||
f"{first_author.split()[-1]}_{year}.pdf"
|
||||
]
|
||||
if record['doi']:
|
||||
safe_doi = re.sub(r'[^\w\-_]', '_', record['doi'])
|
||||
pdf_candidates.append(f"{safe_doi}.pdf")
|
||||
|
||||
for candidate in pdf_candidates:
|
||||
pdf_path = pdf_base_dir / candidate
|
||||
if pdf_path.exists():
|
||||
record['pdf_path'] = str(pdf_path)
|
||||
break
|
||||
|
||||
metadata.append(record)
|
||||
|
||||
print(f"Loaded {len(metadata)} entries from RIS file")
|
||||
return metadata
|
||||
|
||||
|
||||
def load_directory_metadata(dir_path: Path) -> List[Dict]:
|
||||
"""Load metadata by scanning directory for PDFs"""
|
||||
pdf_files = list(dir_path.glob('**/*.pdf'))
|
||||
|
||||
metadata = []
|
||||
for pdf_path in pdf_files:
|
||||
# Generate ID from filename
|
||||
entry_id = pdf_path.stem
|
||||
|
||||
record = {
|
||||
'id': entry_id,
|
||||
'type': 'article',
|
||||
'title': entry_id.replace('_', ' '),
|
||||
'year': '',
|
||||
'doi': '',
|
||||
'abstract': '',
|
||||
'journal': '',
|
||||
'authors': '',
|
||||
'keywords': '',
|
||||
'pdf_path': str(pdf_path)
|
||||
}
|
||||
|
||||
# Try to extract DOI from filename if present
|
||||
doi_match = re.search(r'10\.\d{4,}/[^\s]+', entry_id)
|
||||
if doi_match:
|
||||
record['doi'] = doi_match.group(0)
|
||||
|
||||
metadata.append(record)
|
||||
|
||||
print(f"Found {len(metadata)} PDFs in directory")
|
||||
return metadata
|
||||
|
||||
|
||||
def load_doi_list_metadata(doi_list_path: Path) -> List[Dict]:
|
||||
"""Load metadata from a list of DOIs (will need to fetch metadata separately)"""
|
||||
with open(doi_list_path, 'r') as f:
|
||||
dois = [line.strip() for line in f if line.strip()]
|
||||
|
||||
metadata = []
|
||||
for doi in dois:
|
||||
safe_doi = re.sub(r'[^\w\-_]', '_', doi)
|
||||
record = {
|
||||
'id': safe_doi,
|
||||
'type': 'article',
|
||||
'title': '',
|
||||
'year': '',
|
||||
'doi': doi,
|
||||
'abstract': '',
|
||||
'journal': '',
|
||||
'authors': '',
|
||||
'keywords': '',
|
||||
'pdf_path': None
|
||||
}
|
||||
metadata.append(record)
|
||||
|
||||
print(f"Loaded {len(metadata)} DOIs")
|
||||
print("Note: You'll need to fetch full metadata and PDFs separately")
|
||||
return metadata
|
||||
|
||||
|
||||
def organize_pdfs(metadata: List[Dict], output_dir: Path) -> List[Dict]:
|
||||
"""Copy and rename PDFs to standardized directory structure"""
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
organized_metadata = []
|
||||
stats = {'copied': 0, 'missing': 0, 'total': len(metadata)}
|
||||
|
||||
for record in metadata:
|
||||
if record['pdf_path'] and Path(record['pdf_path']).exists():
|
||||
source_path = Path(record['pdf_path'])
|
||||
dest_path = output_dir / f"{record['id']}.pdf"
|
||||
|
||||
try:
|
||||
shutil.copy2(source_path, dest_path)
|
||||
record['pdf_path'] = str(dest_path)
|
||||
stats['copied'] += 1
|
||||
except Exception as e:
|
||||
print(f"Error copying {source_path}: {e}")
|
||||
stats['missing'] += 1
|
||||
else:
|
||||
if record['pdf_path']:
|
||||
print(f"PDF not found: {record['pdf_path']}")
|
||||
stats['missing'] += 1
|
||||
|
||||
organized_metadata.append(record)
|
||||
|
||||
print(f"\nPDF Organization Summary:")
|
||||
print(f" Total entries: {stats['total']}")
|
||||
print(f" PDFs copied: {stats['copied']}")
|
||||
print(f" PDFs missing: {stats['missing']}")
|
||||
|
||||
return organized_metadata
|
||||
|
||||
|
||||
def save_metadata(metadata: List[Dict], output_path: Path):
|
||||
"""Save metadata to JSON file"""
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(metadata, f, indent=2, ensure_ascii=False)
|
||||
|
||||
print(f"\nMetadata saved to: {output_path}")
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
source_path = Path(args.source)
|
||||
pdf_base_dir = Path(args.pdf_dir) if args.pdf_dir else None
|
||||
output_path = Path(args.output)
|
||||
|
||||
# Load metadata based on source type
|
||||
if args.source_type == 'bibtex':
|
||||
metadata = load_bibtex_metadata(source_path, pdf_base_dir)
|
||||
elif args.source_type == 'ris':
|
||||
metadata = load_ris_metadata(source_path, pdf_base_dir)
|
||||
elif args.source_type == 'directory':
|
||||
metadata = load_directory_metadata(source_path)
|
||||
elif args.source_type == 'doi_list':
|
||||
metadata = load_doi_list_metadata(source_path)
|
||||
else:
|
||||
raise ValueError(f"Unknown source type: {args.source_type}")
|
||||
|
||||
# Organize PDFs if requested
|
||||
if args.organize_pdfs:
|
||||
pdf_output_dir = Path(args.pdf_output_dir)
|
||||
metadata = organize_pdfs(metadata, pdf_output_dir)
|
||||
|
||||
# Save metadata
|
||||
save_metadata(metadata, output_path)
|
||||
|
||||
# Print summary statistics
|
||||
total = len(metadata)
|
||||
with_pdfs = sum(1 for r in metadata if r['pdf_path'])
|
||||
with_abstracts = sum(1 for r in metadata if r['abstract'])
|
||||
with_dois = sum(1 for r in metadata if r['doi'])
|
||||
|
||||
print(f"\nMetadata Summary:")
|
||||
print(f" Total entries: {total}")
|
||||
print(f" With PDFs: {with_pdfs}")
|
||||
print(f" With abstracts: {with_abstracts}")
|
||||
print(f" With DOIs: {with_dois}")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
468
skills/extract_from_pdfs/scripts/02_filter_abstracts.py
Normal file
468
skills/extract_from_pdfs/scripts/02_filter_abstracts.py
Normal file
@@ -0,0 +1,468 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Filter papers based on abstract content using Claude API or local models.
|
||||
Reduces processing costs by identifying relevant papers before full PDF extraction.
|
||||
This script template needs to be customized with your specific filtering criteria.
|
||||
|
||||
Supports:
|
||||
- Claude Haiku (cheap, fast API option)
|
||||
- Claude Sonnet (more accurate API option)
|
||||
- Local models via Ollama (free, private, requires local setup)
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
from anthropic import Anthropic
|
||||
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
|
||||
from anthropic.types.messages.batch_create_params import Request
|
||||
|
||||
try:
|
||||
import requests
|
||||
REQUESTS_AVAILABLE = True
|
||||
except ImportError:
|
||||
REQUESTS_AVAILABLE = False
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Filter papers by analyzing abstracts with Claude or local models',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Backend options:
|
||||
  anthropic-haiku : Claude 3.5 Haiku (cheap, fast, ~$0.25/million input tokens)
|
||||
anthropic-sonnet : Claude 3.5 Sonnet (more accurate, ~$3/million input tokens)
|
||||
ollama : Local model via Ollama (free, requires local setup)
|
||||
|
||||
Local model setup (Ollama):
|
||||
1. Install Ollama: https://ollama.com
|
||||
2. Pull a model: ollama pull llama3.1:8b
|
||||
3. Run server: ollama serve (usually starts automatically)
|
||||
4. Use --backend ollama --ollama-model llama3.1:8b
|
||||
|
||||
Recommended models for Ollama:
|
||||
- llama3.1:8b (good balance)
|
||||
- llama3.1:70b (better accuracy, needs more RAM)
|
||||
- mistral:7b (fast, good for simple filtering)
|
||||
- qwen2.5:7b (good multilingual support)
|
||||
"""
|
||||
)
|
||||
parser.add_argument(
|
||||
'--metadata',
|
||||
required=True,
|
||||
help='Input metadata JSON file from step 01'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='filtered_papers.json',
|
||||
help='Output JSON file with filter results'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--backend',
|
||||
choices=['anthropic-haiku', 'anthropic-sonnet', 'ollama'],
|
||||
default='anthropic-haiku',
|
||||
help='Model backend to use (default: anthropic-haiku for cost efficiency)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--ollama-model',
|
||||
default='llama3.1:8b',
|
||||
help='Ollama model name (default: llama3.1:8b)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--ollama-url',
|
||||
default='http://localhost:11434',
|
||||
help='Ollama server URL (default: http://localhost:11434)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--use-batches',
|
||||
action='store_true',
|
||||
help='Use Anthropic Batches API (only for anthropic backends)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--test',
|
||||
action='store_true',
|
||||
help='Run in test mode (process only 10 records)'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_metadata(metadata_path: Path) -> List[Dict]:
|
||||
"""Load metadata from JSON file"""
|
||||
with open(metadata_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_existing_results(output_path: Path) -> Dict:
|
||||
"""Load existing filter results if available"""
|
||||
if output_path.exists():
|
||||
with open(output_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
return {}
|
||||
|
||||
|
||||
def save_results(results: Dict, output_path: Path):
|
||||
"""Save filter results to JSON file"""
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(results, f, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
def create_filter_prompt(title: str, abstract: str) -> str:
|
||||
"""
|
||||
Create the filtering prompt.
|
||||
|
||||
TODO: CUSTOMIZE THIS PROMPT FOR YOUR SPECIFIC USE CASE
|
||||
|
||||
This is a template. Replace the example criteria with your own.
|
||||
"""
|
||||
return f"""You are analyzing scientific literature to identify relevant papers for a research project.
|
||||
|
||||
<title>
|
||||
{title}
|
||||
</title>
|
||||
|
||||
<abstract>
|
||||
{abstract}
|
||||
</abstract>
|
||||
|
||||
Your task is to determine if this paper meets the following criteria:
|
||||
|
||||
TODO: Replace these example criteria with your own:
|
||||
|
||||
1. Does the paper contain PRIMARY empirical data (not review/meta-analysis)?
|
||||
2. Does the paper report [YOUR SPECIFIC DATA TYPE, e.g., "field observations", "experimental measurements", "clinical outcomes"]?
|
||||
3. Is the geographic/temporal/taxonomic scope relevant to [YOUR STUDY SYSTEM]?
|
||||
|
||||
Important considerations:
|
||||
- Be conservative: when in doubt, include the paper (false positives are better than false negatives)
|
||||
- Distinguish between primary data and citations of others' work
|
||||
- Consider whether the abstract suggests the full paper likely contains the data of interest
|
||||
|
||||
Provide your determination as a JSON object with these boolean fields:
|
||||
1. "has_relevant_data": true if the paper likely contains the data type of interest
|
||||
2. "is_primary_research": true if the paper reports original empirical data
|
||||
3. "meets_scope": true if the study system/scope is relevant
|
||||
|
||||
Also provide:
|
||||
4. "confidence": your confidence level (high/medium/low)
|
||||
5. "reasoning": brief explanation (1-2 sentences)
|
||||
|
||||
Wrap your response in <output> tags. Example:
|
||||
<output>
|
||||
{{
|
||||
"has_relevant_data": true,
|
||||
"is_primary_research": true,
|
||||
"meets_scope": true,
|
||||
"confidence": "high",
|
||||
"reasoning": "Abstract explicitly mentions field observations of the target phenomenon in the relevant geographic region."
|
||||
}}
|
||||
</output>
|
||||
|
||||
Base your determination solely on the title and abstract provided."""
|
||||
|
||||
|
||||
def extract_json_from_xml(text: str) -> Dict:
|
||||
"""Extract JSON from XML output tags in Claude's response"""
|
||||
import re
|
||||
|
||||
match = re.search(r'<output>\s*(\{.*?\})\s*</output>', text, re.DOTALL)
|
||||
if match:
|
||||
json_str = match.group(1)
|
||||
try:
|
||||
return json.loads(json_str)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Failed to parse JSON: {e}")
|
||||
print(f"JSON string: {json_str}")
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def filter_paper_ollama(record: Dict, ollama_url: str, ollama_model: str) -> Dict:
|
||||
"""Use local Ollama model to filter a single paper"""
|
||||
if not REQUESTS_AVAILABLE:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': 'requests library not available. Install with: pip install requests'
|
||||
}
|
||||
|
||||
if not record.get('title') or not record.get('abstract'):
|
||||
return {
|
||||
'status': 'skipped',
|
||||
'reason': 'missing_title_or_abstract'
|
||||
}
|
||||
|
||||
max_retries = 3
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
            # Call Ollama's native chat API endpoint (/api/chat)
|
||||
response = requests.post(
|
||||
f"{ollama_url}/api/chat",
|
||||
json={
|
||||
"model": ollama_model,
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a scientific literature analyst specializing in identifying relevant papers for systematic reviews and meta-analyses."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": create_filter_prompt(record['title'], record['abstract'])
|
||||
}
|
||||
],
|
||||
"stream": False,
|
||||
"options": {
|
||||
"temperature": 0,
|
||||
"num_predict": 2048
|
||||
}
|
||||
},
|
||||
timeout=60
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
content = data.get('message', {}).get('content', '')
|
||||
result = extract_json_from_xml(content)
|
||||
|
||||
if result:
|
||||
return {
|
||||
'status': 'success',
|
||||
'filter_result': result,
|
||||
'model_used': ollama_model
|
||||
}
|
||||
else:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': 'failed_to_parse_json',
|
||||
'raw_response': content[:500]
|
||||
}
|
||||
else:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': f'Ollama API error: {response.status_code} {response.text[:200]}'
|
||||
}
|
||||
|
||||
except requests.exceptions.ConnectionError:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': f'Cannot connect to Ollama at {ollama_url}. Make sure Ollama is running: ollama serve'
|
||||
}
|
||||
except Exception as e:
|
||||
if attempt == max_retries - 1:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': str(e)
|
||||
}
|
||||
time.sleep(2 ** attempt)
|
||||
|
||||
|
||||
def filter_paper_direct(client: Anthropic, record: Dict, model: str) -> Dict:
|
||||
"""Use Claude API directly to filter a single paper"""
|
||||
if not record.get('title') or not record.get('abstract'):
|
||||
return {
|
||||
'status': 'skipped',
|
||||
'reason': 'missing_title_or_abstract'
|
||||
}
|
||||
|
||||
max_retries = 3
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
response = client.messages.create(
|
||||
model=model,
|
||||
max_tokens=2048,
|
||||
temperature=0,
|
||||
system="You are a scientific literature analyst specializing in identifying relevant papers for systematic reviews and meta-analyses.",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": create_filter_prompt(record['title'], record['abstract'])
|
||||
}]
|
||||
)
|
||||
|
||||
result = extract_json_from_xml(response.content[0].text)
|
||||
if result:
|
||||
return {
|
||||
'status': 'success',
|
||||
'filter_result': result,
|
||||
'model_used': model
|
||||
}
|
||||
else:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': 'failed_to_parse_json'
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
if attempt == max_retries - 1:
|
||||
return {
|
||||
'status': 'error',
|
||||
'reason': str(e)
|
||||
}
|
||||
time.sleep(2 ** attempt)
|
||||
|
||||
|
||||
def filter_papers_batch(client: Anthropic, records: List[Dict], model: str) -> Dict[str, Dict]:
|
||||
"""Use Claude Batches API to filter multiple papers efficiently"""
|
||||
requests = []
|
||||
|
||||
for record in records:
|
||||
if not record.get('title') or not record.get('abstract'):
|
||||
continue
|
||||
|
||||
requests.append(Request(
|
||||
custom_id=record['id'],
|
||||
params=MessageCreateParamsNonStreaming(
|
||||
model=model,
|
||||
max_tokens=2048,
|
||||
temperature=0,
|
||||
system="You are a scientific literature analyst specializing in identifying relevant papers for systematic reviews and meta-analyses.",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": create_filter_prompt(record['title'], record['abstract'])
|
||||
}]
|
||||
)
|
||||
))
|
||||
|
||||
if not requests:
|
||||
print("No papers to process (missing titles or abstracts)")
|
||||
return {}
|
||||
|
||||
# Create batch
|
||||
print(f"Creating batch with {len(requests)} requests...")
|
||||
message_batch = client.messages.batches.create(requests=requests)
|
||||
print(f"Batch created: {message_batch.id}")
|
||||
|
||||
# Poll for completion
|
||||
while message_batch.processing_status == "in_progress":
|
||||
print("Waiting for batch processing...")
|
||||
time.sleep(30)
|
||||
message_batch = client.messages.batches.retrieve(message_batch.id)
|
||||
|
||||
# Process results
|
||||
results = {}
|
||||
if message_batch.processing_status == "ended":
|
||||
print("Batch completed. Processing results...")
|
||||
for result in client.messages.batches.results(message_batch.id):
|
||||
if result.result.type == "succeeded":
|
||||
filter_result = extract_json_from_xml(
|
||||
result.result.message.content[0].text
|
||||
)
|
||||
if filter_result:
|
||||
results[result.custom_id] = {
|
||||
'status': 'success',
|
||||
'filter_result': filter_result
|
||||
}
|
||||
else:
|
||||
results[result.custom_id] = {
|
||||
'status': 'error',
|
||||
'reason': 'failed_to_parse_json'
|
||||
}
|
||||
else:
|
||||
results[result.custom_id] = {
|
||||
'status': 'error',
|
||||
'reason': f"{result.result.type}: {getattr(result.result, 'error', 'unknown error')}"
|
||||
}
|
||||
else:
|
||||
print(f"Batch failed with status: {message_batch.processing_status}")
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def get_model_name(backend: str) -> str:
|
||||
"""Get the appropriate model name for the backend"""
|
||||
if backend == 'anthropic-haiku':
|
||||
return 'claude-3-5-haiku-20241022'
|
||||
elif backend == 'anthropic-sonnet':
|
||||
return 'claude-3-5-sonnet-20241022'
|
||||
return backend
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Backend-specific setup
|
||||
client = None
|
||||
if args.backend.startswith('anthropic'):
|
||||
if not os.getenv('ANTHROPIC_API_KEY'):
|
||||
raise ValueError("Please set ANTHROPIC_API_KEY environment variable for Anthropic backends")
|
||||
client = Anthropic()
|
||||
model = get_model_name(args.backend)
|
||||
print(f"Using Anthropic backend: {model}")
|
||||
elif args.backend == 'ollama':
|
||||
if args.use_batches:
|
||||
print("Warning: Batches API not available for Ollama. Processing sequentially.")
|
||||
args.use_batches = False
|
||||
print(f"Using Ollama backend: {args.ollama_model} at {args.ollama_url}")
|
||||
print("Make sure Ollama is running: ollama serve")
|
||||
|
||||
# Load metadata
|
||||
metadata = load_metadata(Path(args.metadata))
|
||||
print(f"Loaded {len(metadata)} metadata records")
|
||||
|
||||
# Apply test mode if specified
|
||||
if args.test:
|
||||
metadata = metadata[:10]
|
||||
print(f"Test mode: processing {len(metadata)} records")
|
||||
|
||||
# Load existing results
|
||||
output_path = Path(args.output)
|
||||
results = load_existing_results(output_path)
|
||||
print(f"Loaded {len(results)} existing results")
|
||||
|
||||
# Identify papers to process
|
||||
to_process = [r for r in metadata if r['id'] not in results]
|
||||
print(f"Papers to process: {len(to_process)}")
|
||||
|
||||
if not to_process:
|
||||
print("All papers already processed!")
|
||||
return
|
||||
|
||||
# Process papers based on backend
|
||||
if args.backend == 'ollama':
|
||||
print("Processing papers with Ollama...")
|
||||
for record in to_process:
|
||||
print(f"Processing: {record['id']}")
|
||||
result = filter_paper_ollama(record, args.ollama_url, args.ollama_model)
|
||||
results[record['id']] = result
|
||||
save_results(results, output_path)
|
||||
# No sleep needed for local models
|
||||
elif args.use_batches:
|
||||
print("Using Batches API...")
|
||||
batch_results = filter_papers_batch(client, to_process, model)
|
||||
results.update(batch_results)
|
||||
else:
|
||||
print("Processing papers sequentially with Anthropic API...")
|
||||
for record in to_process:
|
||||
print(f"Processing: {record['id']}")
|
||||
result = filter_paper_direct(client, record, model)
|
||||
results[record['id']] = result
|
||||
save_results(results, output_path)
|
||||
time.sleep(1) # Rate limiting
|
||||
|
||||
# Save final results
|
||||
save_results(results, output_path)
|
||||
|
||||
# Print summary statistics
|
||||
total = len(results)
|
||||
successful = sum(1 for r in results.values() if r.get('status') == 'success')
|
||||
relevant = sum(
|
||||
1 for r in results.values()
|
||||
if r.get('status') == 'success' and r.get('filter_result', {}).get('has_relevant_data')
|
||||
)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print("Filtering Summary")
|
||||
print(f"{'='*60}")
|
||||
print(f"Total papers processed: {total}")
|
||||
print(f"Successfully analyzed: {successful}")
|
||||
print(f"Papers with relevant data: {relevant}")
|
||||
print(f"Relevance rate: {relevant/successful*100:.1f}%" if successful > 0 else "N/A")
|
||||
print(f"\nResults saved to: {output_path}")
|
||||
print(f"\nNext step: Review results and proceed to PDF extraction for relevant papers")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
478
skills/extract_from_pdfs/scripts/03_extract_from_pdfs.py
Normal file
478
skills/extract_from_pdfs/scripts/03_extract_from_pdfs.py
Normal file
@@ -0,0 +1,478 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Extract structured data from PDFs using Claude API.
|
||||
Supports multiple PDF processing methods and prompt caching for efficiency.
|
||||
This script template needs to be customized with your specific extraction schema.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional
|
||||
import re
|
||||
|
||||
from anthropic import Anthropic
|
||||
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
|
||||
from anthropic.types.messages.batch_create_params import Request
|
||||
|
||||
|
||||
# Configuration
|
||||
BATCH_SIZE = 5
|
||||
SIMULTANEOUS_BATCHES = 4
|
||||
BATCH_CHECK_INTERVAL = 30
|
||||
BATCH_SUBMISSION_INTERVAL = 20
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Extract structured data from PDFs using Claude'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--metadata',
|
||||
required=True,
|
||||
help='Input metadata JSON file (from step 01 or 02)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--schema',
|
||||
required=True,
|
||||
help='JSON file defining extraction schema and prompts'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='extracted_data.json',
|
||||
help='Output JSON file with extraction results'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--method',
|
||||
choices=['base64', 'files_api', 'batches'],
|
||||
default='batches',
|
||||
help='PDF processing method (default: batches)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--use-caching',
|
||||
action='store_true',
|
||||
help='Enable prompt caching (reduces costs by ~90%% for repeated queries)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--test',
|
||||
action='store_true',
|
||||
help='Run in test mode (process only 3 PDFs)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--model',
|
||||
default='claude-3-5-sonnet-20241022',
|
||||
help='Claude model to use'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--filter-results',
|
||||
help='Optional: JSON file with filter results from step 02 (only process relevant papers)'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_metadata(metadata_path: Path) -> List[Dict]:
|
||||
"""Load metadata from JSON file"""
|
||||
with open(metadata_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_schema(schema_path: Path) -> Dict:
|
||||
"""Load extraction schema definition"""
|
||||
with open(schema_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_filter_results(filter_path: Path) -> Dict:
|
||||
"""Load filter results from step 02"""
|
||||
with open(filter_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_existing_results(output_path: Path) -> Dict:
|
||||
"""Load existing extraction results if available"""
|
||||
if output_path.exists():
|
||||
with open(output_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
return {}
|
||||
|
||||
|
||||
def save_results(results: Dict, output_path: Path):
|
||||
"""Save extraction results to JSON file"""
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(results, f, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
def create_extraction_prompt(schema: Dict) -> str:
|
||||
"""
|
||||
Create extraction prompt from schema definition.
|
||||
|
||||
The schema JSON should contain:
|
||||
- system_context: Description of the analysis task
|
||||
- instructions: Step-by-step analysis instructions
|
||||
- output_schema: JSON schema for the output
|
||||
- output_example: Example of desired output
|
||||
|
||||
TODO: Customize schema.json for your specific use case
|
||||
"""
|
||||
prompt_parts = []
|
||||
|
||||
# Add objective
|
||||
if 'objective' in schema:
|
||||
prompt_parts.append(f"Your objective is to {schema['objective']}\n")
|
||||
|
||||
# Add instructions
|
||||
if 'instructions' in schema:
|
||||
prompt_parts.append("Please follow these steps:\n")
|
||||
for i, instruction in enumerate(schema['instructions'], 1):
|
||||
prompt_parts.append(f"{i}. {instruction}")
|
||||
prompt_parts.append("")
|
||||
|
||||
# Add analysis framework
|
||||
if 'analysis_steps' in schema:
|
||||
prompt_parts.append("<analysis_framework>")
|
||||
for step in schema['analysis_steps']:
|
||||
prompt_parts.append(f"- {step}")
|
||||
prompt_parts.append("</analysis_framework>\n")
|
||||
prompt_parts.append(
|
||||
"Your analysis must be wrapped within <analysis> tags. "
|
||||
"Be thorough and explicit in your reasoning.\n"
|
||||
)
|
||||
|
||||
# Add output schema explanation
|
||||
if 'output_schema' in schema:
|
||||
prompt_parts.append("<output_schema>")
|
||||
prompt_parts.append(json.dumps(schema['output_schema'], indent=2))
|
||||
prompt_parts.append("</output_schema>\n")
|
||||
|
||||
# Add output example
|
||||
if 'output_example' in schema:
|
||||
prompt_parts.append("<output_example>")
|
||||
prompt_parts.append(json.dumps(schema['output_example'], indent=2))
|
||||
prompt_parts.append("</output_example>\n")
|
||||
|
||||
# Add important notes
|
||||
if 'important_notes' in schema:
|
||||
prompt_parts.append("Important considerations:")
|
||||
for note in schema['important_notes']:
|
||||
prompt_parts.append(f"- {note}")
|
||||
prompt_parts.append("")
|
||||
|
||||
# Add final instruction
|
||||
prompt_parts.append(
|
||||
"After your analysis, provide the final output in the following JSON format, "
|
||||
"wrapped in <output> tags. The output must be valid, parseable JSON.\n"
|
||||
)
|
||||
|
||||
return "\n".join(prompt_parts)
|
||||
|
||||
|
||||
def extract_json_from_response(text: str) -> Optional[Dict]:
|
||||
"""Extract JSON from XML output tags in Claude's response"""
|
||||
match = re.search(r'<output>\s*(\{.*?\})\s*</output>', text, re.DOTALL)
|
||||
if match:
|
||||
json_str = match.group(1)
|
||||
try:
|
||||
return json.loads(json_str)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Failed to parse JSON: {e}")
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
def extract_analysis_from_response(text: str) -> Optional[str]:
|
||||
"""Extract analysis from XML tags in Claude's response"""
|
||||
match = re.search(r'<analysis>(.*?)</analysis>', text, re.DOTALL)
|
||||
if match:
|
||||
return match.group(1).strip()
|
||||
return None
|
||||
|
||||
|
||||
def process_pdf_base64(
|
||||
client: Anthropic,
|
||||
pdf_path: Path,
|
||||
schema: Dict,
|
||||
model: str
|
||||
) -> Dict:
|
||||
"""Process a single PDF using base64 encoding (direct upload)"""
|
||||
if not pdf_path.exists():
|
||||
return {
|
||||
'status': 'error',
|
||||
'error': f'PDF not found: {pdf_path}'
|
||||
}
|
||||
|
||||
# Check file size (32MB limit)
|
||||
file_size = pdf_path.stat().st_size
|
||||
if file_size > 32 * 1024 * 1024:
|
||||
return {
|
||||
'status': 'error',
|
||||
'error': f'PDF exceeds 32MB limit: {file_size / 1024 / 1024:.1f}MB'
|
||||
}
|
||||
|
||||
try:
|
||||
# Read and encode PDF
|
||||
with open(pdf_path, 'rb') as f:
|
||||
pdf_data = base64.b64encode(f.read()).decode('utf-8')
|
||||
|
||||
# Create message
|
||||
response = client.messages.create(
|
||||
model=model,
|
||||
max_tokens=16384,
|
||||
temperature=0,
|
||||
system=schema.get('system_context', 'You are a scientific research assistant.'),
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "document",
|
||||
"source": {
|
||||
"type": "base64",
|
||||
"media_type": "application/pdf",
|
||||
"data": pdf_data
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": create_extraction_prompt(schema)
|
||||
}
|
||||
]
|
||||
}]
|
||||
)
|
||||
|
||||
response_text = response.content[0].text
|
||||
|
||||
return {
|
||||
'status': 'success',
|
||||
'extracted_data': extract_json_from_response(response_text),
|
||||
'analysis': extract_analysis_from_response(response_text),
|
||||
'input_tokens': response.usage.input_tokens,
|
||||
'output_tokens': response.usage.output_tokens
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
'status': 'error',
|
||||
'error': str(e)
|
||||
}
|
||||
|
||||
|
||||
def process_pdfs_batch(
|
||||
client: Anthropic,
|
||||
records: List[tuple],
|
||||
schema: Dict,
|
||||
model: str
|
||||
) -> Dict[str, Dict]:
|
||||
"""Process multiple PDFs using Batches API for efficiency"""
|
||||
all_results = {}
|
||||
|
||||
for window_start in range(0, len(records), SIMULTANEOUS_BATCHES * BATCH_SIZE):
|
||||
window_records = records[window_start:window_start + (SIMULTANEOUS_BATCHES * BATCH_SIZE)]
|
||||
print(f"\nProcessing window starting at index {window_start} ({len(window_records)} PDFs)")
|
||||
|
||||
active_batches = {}
|
||||
|
||||
for batch_start in range(0, len(window_records), BATCH_SIZE):
|
||||
batch_records = window_records[batch_start:batch_start + BATCH_SIZE]
|
||||
requests = []
|
||||
|
||||
for record_id, pdf_data in batch_records:
|
||||
requests.append(Request(
|
||||
custom_id=record_id,
|
||||
params=MessageCreateParamsNonStreaming(
|
||||
model=model,
|
||||
max_tokens=16384,
|
||||
temperature=0,
|
||||
system=schema.get('system_context', 'You are a scientific research assistant.'),
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "document",
|
||||
"source": {
|
||||
"type": "base64",
|
||||
"media_type": "application/pdf",
|
||||
"data": pdf_data
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": create_extraction_prompt(schema)
|
||||
}
|
||||
]
|
||||
}]
|
||||
)
|
||||
))
|
||||
|
||||
try:
|
||||
message_batch = client.messages.batches.create(requests=requests)
|
||||
print(f"Created batch {message_batch.id} with {len(requests)} requests")
|
||||
                active_batches[message_batch.id] = {r["custom_id"] for r in requests}  # Request is a TypedDict, so index by key
|
||||
time.sleep(BATCH_SUBMISSION_INTERVAL)
|
||||
except Exception as e:
|
||||
print(f"Error creating batch: {e}")
|
||||
|
||||
# Wait for batches
|
||||
window_results = wait_for_batches(client, list(active_batches.keys()), schema)
|
||||
all_results.update(window_results)
|
||||
|
||||
return all_results
|
||||
|
||||
|
||||
def wait_for_batches(
|
||||
client: Anthropic,
|
||||
batch_ids: List[str],
|
||||
schema: Dict
|
||||
) -> Dict[str, Dict]:
|
||||
"""Wait for batches to complete and return results"""
|
||||
print(f"\nWaiting for {len(batch_ids)} batches to complete...")
|
||||
|
||||
incomplete = set(batch_ids)
|
||||
|
||||
while incomplete:
|
||||
time.sleep(BATCH_CHECK_INTERVAL)
|
||||
|
||||
for batch_id in list(incomplete):
|
||||
batch = client.messages.batches.retrieve(batch_id)
|
||||
if batch.processing_status != "in_progress":
|
||||
incomplete.remove(batch_id)
|
||||
print(f"Batch {batch_id} completed: {batch.processing_status}")
|
||||
|
||||
# Collect results
|
||||
results = {}
|
||||
for batch_id in batch_ids:
|
||||
batch = client.messages.batches.retrieve(batch_id)
|
||||
if batch.processing_status == "ended":
|
||||
for result in client.messages.batches.results(batch_id):
|
||||
if result.result.type == "succeeded":
|
||||
response_text = result.result.message.content[0].text
|
||||
results[result.custom_id] = {
|
||||
'status': 'success',
|
||||
'extracted_data': extract_json_from_response(response_text),
|
||||
'analysis': extract_analysis_from_response(response_text),
|
||||
'input_tokens': result.result.message.usage.input_tokens,
|
||||
'output_tokens': result.result.message.usage.output_tokens
|
||||
}
|
||||
else:
|
||||
results[result.custom_id] = {
|
||||
'status': 'error',
|
||||
'error': str(getattr(result.result, 'error', 'Unknown error'))
|
||||
}
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Check for API key
|
||||
if not os.getenv('ANTHROPIC_API_KEY'):
|
||||
raise ValueError("Please set ANTHROPIC_API_KEY environment variable")
|
||||
|
||||
client = Anthropic()
|
||||
|
||||
# Load inputs
|
||||
metadata = load_metadata(Path(args.metadata))
|
||||
schema = load_schema(Path(args.schema))
|
||||
print(f"Loaded {len(metadata)} metadata records")
|
||||
|
||||
# Filter by relevance if filter results provided
|
||||
if args.filter_results:
|
||||
filter_results = load_filter_results(Path(args.filter_results))
|
||||
relevant_ids = {
|
||||
id for id, result in filter_results.items()
|
||||
if result.get('status') == 'success'
|
||||
and result.get('filter_result', {}).get('has_relevant_data')
|
||||
}
|
||||
metadata = [r for r in metadata if r['id'] in relevant_ids]
|
||||
print(f"Filtered to {len(metadata)} relevant papers")
|
||||
|
||||
# Apply test mode
|
||||
if args.test:
|
||||
metadata = metadata[:3]
|
||||
print(f"Test mode: processing {len(metadata)} PDFs")
|
||||
|
||||
# Load existing results
|
||||
output_path = Path(args.output)
|
||||
results = load_existing_results(output_path)
|
||||
print(f"Loaded {len(results)} existing results")
|
||||
|
||||
# Prepare PDFs to process
|
||||
to_process = []
|
||||
for record in metadata:
|
||||
if record['id'] in results:
|
||||
continue
|
||||
if not record.get('pdf_path'):
|
||||
print(f"Skipping {record['id']}: no PDF path")
|
||||
continue
|
||||
pdf_path = Path(record['pdf_path'])
|
||||
if not pdf_path.exists():
|
||||
print(f"Skipping {record['id']}: PDF not found")
|
||||
continue
|
||||
|
||||
# Read and encode PDF
|
||||
try:
|
||||
with open(pdf_path, 'rb') as f:
|
||||
pdf_data = base64.b64encode(f.read()).decode('utf-8')
|
||||
to_process.append((record['id'], pdf_data))
|
||||
except Exception as e:
|
||||
print(f"Error reading {pdf_path}: {e}")
|
||||
|
||||
print(f"PDFs to process: {len(to_process)}")
|
||||
|
||||
if not to_process:
|
||||
print("All PDFs already processed!")
|
||||
return
|
||||
|
||||
# Process PDFs
|
||||
if args.method == 'batches':
|
||||
print("Using Batches API...")
|
||||
batch_results = process_pdfs_batch(client, to_process, schema, args.model)
|
||||
results.update(batch_results)
|
||||
else:
|
||||
print("Processing PDFs sequentially...")
|
||||
for record_id, pdf_data in to_process:
|
||||
print(f"Processing: {record_id}")
|
||||
# Sequential mode re-reads the PDF from its path inside process_pdf_base64
|
||||
record = next(r for r in metadata if r['id'] == record_id)
|
||||
result = process_pdf_base64(
|
||||
client, Path(record['pdf_path']), schema, args.model
|
||||
)
|
||||
results[record_id] = result
|
||||
save_results(results, output_path)
|
||||
time.sleep(2)
|
||||
|
||||
# Save final results
|
||||
save_results(results, output_path)
|
||||
|
||||
# Print summary
|
||||
total = len(results)
|
||||
successful = sum(1 for r in results.values() if r.get('status') == 'success')
|
||||
total_input_tokens = sum(
|
||||
r.get('input_tokens', 0) for r in results.values()
|
||||
if r.get('status') == 'success'
|
||||
)
|
||||
total_output_tokens = sum(
|
||||
r.get('output_tokens', 0) for r in results.values()
|
||||
if r.get('status') == 'success'
|
||||
)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print("Extraction Summary")
|
||||
print(f"{'='*60}")
|
||||
print(f"Total PDFs processed: {total}")
|
||||
print(f"Successful extractions: {successful}")
|
||||
print(f"Failed extractions: {total - successful}")
|
||||
print(f"\nToken usage:")
|
||||
print(f" Input tokens: {total_input_tokens:,}")
|
||||
print(f" Output tokens: {total_output_tokens:,}")
|
||||
print(f" Total tokens: {total_input_tokens + total_output_tokens:,}")
|
||||
print(f"\nResults saved to: {output_path}")
|
||||
print(f"\nNext step: Repair and validate JSON outputs")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
227
skills/extract_from_pdfs/scripts/04_repair_json.py
Normal file
@@ -0,0 +1,227 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Repair and validate JSON extractions using json_repair library.
|
||||
Handles common JSON parsing issues and validates against schema.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Dict, Any, Optional
|
||||
import jsonschema
|
||||
|
||||
try:
|
||||
from json_repair import repair_json
|
||||
JSON_REPAIR_AVAILABLE = True
|
||||
except ImportError:
|
||||
JSON_REPAIR_AVAILABLE = False
|
||||
print("Warning: json_repair not installed. Install with: pip install json-repair")
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Repair and validate JSON extractions'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--input',
|
||||
required=True,
|
||||
help='Input JSON file with extraction results from step 03'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='cleaned_extractions.json',
|
||||
help='Output JSON file with cleaned results'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--schema',
|
||||
help='Optional: JSON schema file for validation'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--strict',
|
||||
action='store_true',
|
||||
help='Strict mode: reject records that fail validation'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_results(input_path: Path) -> Dict:
|
||||
"""Load extraction results from JSON file"""
|
||||
with open(input_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_schema(schema_path: Path) -> Dict:
|
||||
"""Load JSON schema for validation"""
|
||||
with open(schema_path, 'r', encoding='utf-8') as f:
|
||||
schema_data = json.load(f)
|
||||
return schema_data.get('output_schema', schema_data)
|
||||
|
||||
|
||||
def save_results(results: Dict, output_path: Path):
|
||||
"""Save cleaned results to JSON file"""
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(results, f, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
def repair_json_data(data: Any) -> tuple[Any, bool]:
|
||||
"""
|
||||
Attempt to repair JSON data using json_repair library.
|
||||
Returns (repaired_data, success)
|
||||
"""
|
||||
if not JSON_REPAIR_AVAILABLE:
|
||||
return data, True # Skip repair if library not available
|
||||
|
||||
try:
|
||||
# Convert to JSON string and back to repair
|
||||
json_str = json.dumps(data)
|
||||
repaired_str = repair_json(json_str, return_objects=False)
|
||||
repaired_data = json.loads(repaired_str)
|
||||
return repaired_data, True
|
||||
except Exception as e:
|
||||
print(f"Failed to repair JSON: {e}")
|
||||
return data, False
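# Illustrative example (hypothetical values, not output captured from a run):
# json_repair can close unterminated objects and strip trailing commas, e.g.
#   repair_json('{"species": "Picea abies", "count": 3,', return_objects=False)
#   -> '{"species": "Picea abies", "count": 3}'
# Exact output may differ between json_repair versions.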
|
||||
|
||||
|
||||
def validate_against_schema(data: Any, schema: Dict) -> tuple[bool, Optional[str]]:
|
||||
"""
|
||||
Validate data against JSON schema.
|
||||
Returns (is_valid, error_message)
|
||||
"""
|
||||
try:
|
||||
jsonschema.validate(instance=data, schema=schema)
|
||||
return True, None
|
||||
except jsonschema.exceptions.ValidationError as e:
|
||||
return False, str(e)
|
||||
except Exception as e:
|
||||
return False, f"Validation error: {str(e)}"
|
||||
|
||||
|
||||
def clean_extraction_result(
|
||||
result: Dict,
|
||||
schema: Optional[Dict] = None,
|
||||
strict: bool = False
|
||||
) -> Dict:
|
||||
"""
|
||||
Clean and validate a single extraction result.
|
||||
|
||||
Returns updated result with:
|
||||
- extracted_data: replaced with the repaired JSON when a repair was applied
|
||||
- validation_status: 'valid', 'invalid', or 'repaired'
|
||||
- validation_errors: List of validation errors if any
|
||||
"""
|
||||
if result.get('status') != 'success':
|
||||
return result # Skip non-successful results
|
||||
|
||||
extracted_data = result.get('extracted_data')
|
||||
if not extracted_data:
|
||||
result['validation_status'] = 'invalid'
|
||||
result['validation_errors'] = ['No extracted data found']
|
||||
if strict:
|
||||
result['status'] = 'failed_validation'
|
||||
return result
|
||||
|
||||
# Try to repair JSON
|
||||
repaired_data, repair_success = repair_json_data(extracted_data)
|
||||
|
||||
# Validate against schema if provided
|
||||
validation_errors = []
|
||||
if schema:
|
||||
is_valid, error_msg = validate_against_schema(repaired_data, schema)
|
||||
if not is_valid:
|
||||
validation_errors.append(error_msg)
|
||||
if strict:
|
||||
result['status'] = 'failed_validation'
|
||||
|
||||
# Update result
|
||||
if repaired_data != extracted_data and repair_success:
|
||||
result['extracted_data'] = repaired_data
|
||||
result['validation_status'] = 'repaired'
|
||||
elif validation_errors:
|
||||
result['validation_status'] = 'invalid'
|
||||
else:
|
||||
result['validation_status'] = 'valid'
|
||||
|
||||
if validation_errors:
|
||||
result['validation_errors'] = validation_errors
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Load inputs
|
||||
results = load_results(Path(args.input))
|
||||
print(f"Loaded {len(results)} extraction results")
|
||||
|
||||
schema = None
|
||||
if args.schema:
|
||||
schema = load_schema(Path(args.schema))
|
||||
print(f"Loaded validation schema from {args.schema}")
|
||||
|
||||
# Clean each result
|
||||
cleaned_results = {}
|
||||
stats = {
|
||||
'total': len(results),
|
||||
'valid': 0,
|
||||
'repaired': 0,
|
||||
'invalid': 0,
|
||||
'failed': 0
|
||||
}
|
||||
|
||||
for record_id, result in results.items():
|
||||
cleaned_result = clean_extraction_result(result, schema, args.strict)
|
||||
cleaned_results[record_id] = cleaned_result
|
||||
|
||||
# Update statistics
|
||||
if cleaned_result.get('status') == 'success':
|
||||
status = cleaned_result.get('validation_status', 'unknown')
|
||||
if status == 'valid':
|
||||
stats['valid'] += 1
|
||||
elif status == 'repaired':
|
||||
stats['repaired'] += 1
|
||||
elif status == 'invalid':
|
||||
stats['invalid'] += 1
|
||||
else:
|
||||
stats['failed'] += 1
|
||||
|
||||
# Save cleaned results
|
||||
output_path = Path(args.output)
|
||||
save_results(cleaned_results, output_path)
|
||||
|
||||
# Print summary
|
||||
print(f"\n{'='*60}")
|
||||
print("JSON Repair and Validation Summary")
|
||||
print(f"{'='*60}")
|
||||
print(f"Total records: {stats['total']}")
|
||||
print(f"Valid JSON: {stats['valid']}")
|
||||
print(f"Repaired JSON: {stats['repaired']}")
|
||||
print(f"Invalid JSON: {stats['invalid']}")
|
||||
print(f"Failed extractions: {stats['failed']}")
|
||||
|
||||
if schema:
|
||||
validation_rate = (stats['valid'] + stats['repaired']) / max(stats['total'], 1) * 100
|
||||
print(f"\nValidation rate: {validation_rate:.1f}%")
|
||||
|
||||
print(f"\nCleaned results saved to: {output_path}")
|
||||
|
||||
# Print examples of validation errors
|
||||
if stats['invalid'] > 0:
|
||||
print(f"\nShowing first 3 validation errors:")
|
||||
error_count = 0
|
||||
for record_id, result in cleaned_results.items():
|
||||
if result.get('validation_errors'):
|
||||
print(f"\n{record_id}:")
|
||||
for error in result['validation_errors'][:2]:
|
||||
print(f" - {error[:200]}")
|
||||
error_count += 1
|
||||
if error_count >= 3:
|
||||
break
|
||||
|
||||
print(f"\nNext step: Validate and enrich data with external APIs")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
390
skills/extract_from_pdfs/scripts/05_validate_with_apis.py
Normal file
@@ -0,0 +1,390 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Validate and enrich extracted data using external API databases.
|
||||
Supports common scientific databases for taxonomy, geography, chemistry, etc.
|
||||
|
||||
This script template includes examples for common databases. Customize for your needs.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional, Any
|
||||
import requests
|
||||
from urllib.parse import quote
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Validate and enrich data with external APIs'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--input',
|
||||
required=True,
|
||||
help='Input JSON file with cleaned extraction results from step 04'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='validated_data.json',
|
||||
help='Output JSON file with validated and enriched data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--apis',
|
||||
required=True,
|
||||
help='JSON configuration file specifying which APIs to use and for which fields'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--skip-validation',
|
||||
action='store_true',
|
||||
help='Skip API calls, only load and structure data'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_results(input_path: Path) -> Dict:
|
||||
"""Load extraction results from JSON file"""
|
||||
with open(input_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_api_config(config_path: Path) -> Dict:
|
||||
"""Load API configuration"""
|
||||
with open(config_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def save_results(results: Dict, output_path: Path):
|
||||
"""Save validated results to JSON file"""
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(results, f, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
# ==============================================================================
|
||||
# Taxonomy validation functions
|
||||
# ==============================================================================
|
||||
|
||||
def validate_gbif_taxonomy(scientific_name: str) -> Optional[Dict]:
|
||||
"""
|
||||
Validate taxonomic name using GBIF (Global Biodiversity Information Facility).
|
||||
Returns standardized taxonomy if found.
|
||||
"""
|
||||
url = f"https://api.gbif.org/v1/species/match?name={quote(scientific_name)}"
|
||||
|
||||
try:
|
||||
response = requests.get(url, timeout=10)
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
if data.get('matchType') != 'NONE':
|
||||
return {
|
||||
'matched_name': data.get('canonicalName', scientific_name),
|
||||
'scientific_name': data.get('scientificName'),
|
||||
'rank': data.get('rank'),
|
||||
'kingdom': data.get('kingdom'),
|
||||
'phylum': data.get('phylum'),
|
||||
'class': data.get('class'),
|
||||
'order': data.get('order'),
|
||||
'family': data.get('family'),
|
||||
'genus': data.get('genus'),
|
||||
'gbif_id': data.get('usageKey'),
|
||||
'confidence': data.get('confidence'),
|
||||
'match_type': data.get('matchType'),
|
||||
'status': data.get('status')
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"GBIF API error for '{scientific_name}': {e}")
|
||||
|
||||
return None
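# Illustrative usage (field values below are hypothetical placeholders, not a
# recorded GBIF response):
#   validate_gbif_taxonomy("Picea abies")
#   -> {'matched_name': 'Picea abies', 'rank': 'SPECIES', 'family': 'Pinaceae',
#       'gbif_id': 1234567, 'match_type': 'EXACT', ...}
# A None return means either no match or a network/API error.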
|
||||
|
||||
|
||||
def validate_wfo_plant(scientific_name: str) -> Optional[Dict]:
|
||||
"""
|
||||
Validate plant name using World Flora Online.
|
||||
Returns standardized plant taxonomy if found.
|
||||
"""
|
||||
# WFO requires name parsing - this is a simplified example
|
||||
url = f"http://www.worldfloraonline.org/api/1.0/search?query={quote(scientific_name)}"
|
||||
|
||||
try:
|
||||
response = requests.get(url, timeout=10)
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
if data.get('results'):
|
||||
first_result = data['results'][0]
|
||||
return {
|
||||
'matched_name': first_result.get('name'),
|
||||
'scientific_name': first_result.get('scientificName'),
|
||||
'authors': first_result.get('authors'),
|
||||
'family': first_result.get('family'),
|
||||
'wfo_id': first_result.get('wfoId'),
|
||||
'status': first_result.get('status')
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"WFO API error for '{scientific_name}': {e}")
|
||||
|
||||
return None
|
||||
|
||||
|
||||
# ==============================================================================
|
||||
# Geography validation functions
|
||||
# ==============================================================================
|
||||
|
||||
def validate_geonames(location: str, country: Optional[str] = None) -> Optional[Dict]:
|
||||
"""
|
||||
Validate location using GeoNames.
|
||||
Note: Requires free GeoNames account and username.
|
||||
Set GEONAMES_USERNAME environment variable.
|
||||
"""
|
||||
import os
|
||||
username = os.getenv('GEONAMES_USERNAME')
|
||||
if not username:
|
||||
print("Warning: GEONAMES_USERNAME not set. Skipping GeoNames validation.")
|
||||
return None
|
||||
|
||||
url = f"http://api.geonames.org/searchJSON?q={quote(location)}&maxRows=1&username={username}"
|
||||
if country:
|
||||
url += f"&country={country[:2]}" # Country code
|
||||
|
||||
try:
|
||||
response = requests.get(url, timeout=10)
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
if data.get('geonames'):
|
||||
place = data['geonames'][0]
|
||||
return {
|
||||
'matched_name': place.get('name'),
|
||||
'country': place.get('countryName'),
|
||||
'country_code': place.get('countryCode'),
|
||||
'admin1': place.get('adminName1'),
|
||||
'admin2': place.get('adminName2'),
|
||||
'latitude': place.get('lat'),
|
||||
'longitude': place.get('lng'),
|
||||
'geonames_id': place.get('geonameId')
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"GeoNames API error for '{location}': {e}")
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def geocode_location(address: str) -> Optional[Dict]:
|
||||
"""
|
||||
Geocode an address using OpenStreetMap Nominatim (free, no API key needed).
|
||||
Please use responsibly - add delays between calls.
|
||||
"""
|
||||
url = f"https://nominatim.openstreetmap.org/search?q={quote(address)}&format=json&limit=1"
|
||||
headers = {'User-Agent': 'Scientific-PDF-Extraction/1.0'}
|
||||
|
||||
try:
|
||||
time.sleep(1) # Be nice to OSM
|
||||
response = requests.get(url, headers=headers, timeout=10)
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
if data:
|
||||
place = data[0]
|
||||
return {
|
||||
'display_name': place.get('display_name'),
|
||||
'latitude': place.get('lat'),
|
||||
'longitude': place.get('lon'),
|
||||
'osm_type': place.get('osm_type'),
|
||||
'osm_id': place.get('osm_id'),
|
||||
'place_rank': place.get('place_rank')
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"Nominatim error for '{address}': {e}")
|
||||
|
||||
return None
|
||||
|
||||
|
||||
# ==============================================================================
|
||||
# Chemistry validation functions
|
||||
# ==============================================================================
|
||||
|
||||
def validate_pubchem_compound(compound_name: str) -> Optional[Dict]:
|
||||
"""
|
||||
Validate chemical compound using PubChem.
|
||||
Returns standardized compound information.
|
||||
"""
|
||||
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{quote(compound_name)}/JSON"
|
||||
|
||||
try:
|
||||
response = requests.get(url, timeout=10)
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
if 'PC_Compounds' in data and data['PC_Compounds']:
|
||||
compound = data['PC_Compounds'][0]
|
||||
return {
|
||||
'cid': compound['id']['id']['cid'],
|
||||
# Note: the first 'props' entry is not guaranteed to be the molecular formula
'molecular_formula': compound.get('props', [{}])[0].get('value', {}).get('sval'),
|
||||
'pubchem_url': f"https://pubchem.ncbi.nlm.nih.gov/compound/{compound['id']['id']['cid']}"
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"PubChem API error for '{compound_name}': {e}")
|
||||
|
||||
return None
|
||||
|
||||
|
||||
# ==============================================================================
|
||||
# Gene/Protein validation functions
|
||||
# ==============================================================================
|
||||
|
||||
def validate_ncbi_gene(gene_symbol: str, organism: Optional[str] = None) -> Optional[Dict]:
|
||||
"""
|
||||
Validate gene using NCBI Gene database.
|
||||
"""
|
||||
query = gene_symbol
|
||||
if organism:
|
||||
query += f"[Gene Name] AND {organism}[Organism]"
|
||||
|
||||
search_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term={quote(query)}&retmode=json"
|
||||
|
||||
try:
|
||||
response = requests.get(search_url, timeout=10)
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
if data.get('esearchresult', {}).get('idlist'):
|
||||
gene_id = data['esearchresult']['idlist'][0]
|
||||
return {
|
||||
'gene_id': gene_id,
|
||||
'ncbi_url': f"https://www.ncbi.nlm.nih.gov/gene/{gene_id}"
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"NCBI Gene API error for '{gene_symbol}': {e}")
|
||||
|
||||
return None
|
||||
|
||||
|
||||
# ==============================================================================
|
||||
# Main validation orchestration
|
||||
# ==============================================================================
|
||||
|
||||
API_VALIDATORS = {
|
||||
'gbif_taxonomy': validate_gbif_taxonomy,
|
||||
'wfo_plants': validate_wfo_plant,
|
||||
'geonames': validate_geonames,
|
||||
'geocode': geocode_location,
|
||||
'pubchem': validate_pubchem_compound,
|
||||
'ncbi_gene': validate_ncbi_gene
|
||||
}
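# A minimal example of the --apis config file consumed by this script. Field names
# such as "species", "location", and "compound" are placeholders for fields in your
# own extraction schema; each "api" value must be a key of API_VALIDATORS above.
#
#   {
#     "field_mappings": {
#       "species": {"api": "gbif_taxonomy", "output_field": "validated_species"},
#       "location": {"api": "geocode", "output_field": "geocoded_location"},
#       "compound": {"api": "pubchem"}
#     }
#   }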
|
||||
|
||||
|
||||
def validate_field(value: Any, api_name: str, extra_params: Dict = None) -> Optional[Dict]:
|
||||
"""
|
||||
Validate a single field value using the specified API.
|
||||
"""
|
||||
if not value or value == 'none' or value == '':
|
||||
return None
|
||||
|
||||
validator = API_VALIDATORS.get(api_name)
|
||||
if not validator:
|
||||
print(f"Unknown API: {api_name}")
|
||||
return None
|
||||
|
||||
try:
|
||||
if extra_params:
|
||||
return validator(value, **extra_params)
|
||||
else:
|
||||
return validator(value)
|
||||
except Exception as e:
|
||||
print(f"Validation error for {api_name} with value '{value}': {e}")
|
||||
return None
|
||||
|
||||
|
||||
def process_record(
|
||||
record_data: Dict,
|
||||
api_config: Dict,
|
||||
skip_validation: bool = False
|
||||
) -> Dict:
|
||||
"""
|
||||
Process a single record, validating specified fields.
|
||||
|
||||
api_config should map field names to API names:
|
||||
{
|
||||
"field_mappings": {
|
||||
"species": {"api": "gbif_taxonomy", "output_field": "validated_species"},
|
||||
"location": {"api": "geocode", "output_field": "geocoded_location"}
|
||||
}
|
||||
}
|
||||
"""
|
||||
if skip_validation:
|
||||
return record_data
|
||||
|
||||
field_mappings = api_config.get('field_mappings', {})
|
||||
|
||||
for field_name, field_config in field_mappings.items():
|
||||
api_name = field_config.get('api')
|
||||
output_field = field_config.get('output_field', f'validated_{field_name}')
|
||||
extra_params = field_config.get('extra_params', {})
|
||||
|
||||
# Handle nested fields (e.g., 'records.species')
|
||||
if '.' in field_name:
|
||||
# This is a simplified example - you'd need to implement proper nested access
|
||||
continue
|
||||
|
||||
value = record_data.get(field_name)
|
||||
if value:
|
||||
validated = validate_field(value, api_name, extra_params)
|
||||
if validated:
|
||||
record_data[output_field] = validated
|
||||
|
||||
return record_data
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Load inputs
|
||||
results = load_results(Path(args.input))
|
||||
api_config = load_api_config(Path(args.apis))
|
||||
print(f"Loaded {len(results)} extraction results")
|
||||
|
||||
# Process each result
|
||||
validated_results = {}
|
||||
stats = {'total': 0, 'validated': 0, 'failed': 0}
|
||||
|
||||
for record_id, result in results.items():
|
||||
if result.get('status') != 'success':
|
||||
validated_results[record_id] = result
|
||||
stats['failed'] += 1
|
||||
continue
|
||||
|
||||
stats['total'] += 1
|
||||
|
||||
# Get extracted data
|
||||
extracted_data = result.get('extracted_data', {})
|
||||
|
||||
# Process/validate the data
|
||||
validated_data = process_record(
|
||||
extracted_data.copy(),
|
||||
api_config,
|
||||
args.skip_validation
|
||||
)
|
||||
|
||||
# Update result
|
||||
result['validated_data'] = validated_data
|
||||
validated_results[record_id] = result
|
||||
stats['validated'] += 1
|
||||
|
||||
# Rate limiting
|
||||
if not args.skip_validation:
|
||||
time.sleep(0.5)
|
||||
|
||||
# Save results
|
||||
output_path = Path(args.output)
|
||||
save_results(validated_results, output_path)
|
||||
|
||||
# Print summary
|
||||
print(f"\n{'='*60}")
|
||||
print("Validation and Enrichment Summary")
|
||||
print(f"{'='*60}")
|
||||
print(f"Total records: {len(results)}")
|
||||
print(f"Successfully validated: {stats['validated']}")
|
||||
print(f"Failed extractions: {stats['failed']}")
|
||||
print(f"\nResults saved to: {output_path}")
|
||||
print(f"\nNext step: Export to analysis format")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
345
skills/extract_from_pdfs/scripts/06_export_database.py
Normal file
@@ -0,0 +1,345 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Export validated data to various analysis formats.
|
||||
Supports Python (pandas/SQLite), R (RDS/CSV), Excel, and more.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import csv
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any
|
||||
import sys
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Export validated data to analysis format'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--input',
|
||||
required=True,
|
||||
help='Input JSON file with validated data from step 05'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--format',
|
||||
choices=['python', 'r', 'csv', 'json', 'excel', 'sqlite'],
|
||||
required=True,
|
||||
help='Output format'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
required=True,
|
||||
help='Output file path (without extension for some formats)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--flatten',
|
||||
action='store_true',
|
||||
help='Flatten nested JSON structures for tabular formats'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--include-metadata',
|
||||
action='store_true',
|
||||
help='Include original paper metadata in output'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_results(input_path: Path) -> Dict:
|
||||
"""Load validated results from JSON file"""
|
||||
with open(input_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def flatten_dict(d: Dict, parent_key: str = '', sep: str = '_') -> Dict:
|
||||
"""
|
||||
Flatten nested dictionary structure.
|
||||
Useful for converting JSON to tabular format.
|
||||
"""
|
||||
items = []
|
||||
for k, v in d.items():
|
||||
new_key = f"{parent_key}{sep}{k}" if parent_key else k
|
||||
if isinstance(v, dict):
|
||||
items.extend(flatten_dict(v, new_key, sep=sep).items())
|
||||
elif isinstance(v, list):
|
||||
# Convert lists to comma-separated strings
|
||||
if v and isinstance(v[0], dict):
|
||||
# List of dicts - create numbered columns
|
||||
for i, item in enumerate(v):
|
||||
items.extend(flatten_dict(item, f"{new_key}_{i}", sep=sep).items())
|
||||
else:
|
||||
# Simple list
|
||||
items.append((new_key, ', '.join(str(x) for x in v)))
|
||||
else:
|
||||
items.append((new_key, v))
|
||||
return dict(items)
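# Worked example:
#   flatten_dict({'site': {'name': 'Plot A', 'coords': [52.5, 13.4]}, 'year': 2020})
#   -> {'site_name': 'Plot A', 'site_coords': '52.5, 13.4', 'year': 2020}
# Nested dicts become prefixed keys; simple lists become comma-separated strings.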
|
||||
|
||||
|
||||
def extract_records(results: Dict, flatten: bool = False, include_metadata: bool = False) -> List[Dict]:
|
||||
"""
|
||||
Extract records from results structure.
|
||||
Returns a list of dictionaries suitable for tabular export.
|
||||
"""
|
||||
records = []
|
||||
|
||||
for paper_id, result in results.items():
|
||||
if result.get('status') != 'success':
|
||||
continue
|
||||
|
||||
# Get the validated data (or fall back to extracted data)
|
||||
data = result.get('validated_data', result.get('extracted_data', {}))
|
||||
|
||||
if not data:
|
||||
continue
|
||||
|
||||
# Check if data contains nested records or is a single record
|
||||
if 'records' in data and isinstance(data['records'], list):
|
||||
# Multiple records per paper
|
||||
for record in data['records']:
|
||||
record_dict = record.copy() if isinstance(record, dict) else {'value': record}
|
||||
|
||||
# Add paper-level fields
|
||||
if include_metadata:
|
||||
record_dict['paper_id'] = paper_id
|
||||
for key in data:
|
||||
if key != 'records':
|
||||
record_dict[f'paper_{key}'] = data[key]
|
||||
|
||||
if flatten:
|
||||
record_dict = flatten_dict(record_dict)
|
||||
|
||||
records.append(record_dict)
|
||||
else:
|
||||
# Single record per paper
|
||||
record_dict = data.copy()
|
||||
if include_metadata:
|
||||
record_dict['paper_id'] = paper_id
|
||||
|
||||
if flatten:
|
||||
record_dict = flatten_dict(record_dict)
|
||||
|
||||
records.append(record_dict)
|
||||
|
||||
return records
|
||||
|
||||
|
||||
def export_to_csv(records: List[Dict], output_path: Path):
|
||||
"""Export to CSV format"""
|
||||
if not records:
|
||||
print("No records to export")
|
||||
return
|
||||
|
||||
# Get all possible field names
|
||||
fieldnames = set()
|
||||
for record in records:
|
||||
fieldnames.update(record.keys())
|
||||
fieldnames = sorted(fieldnames)
|
||||
|
||||
with open(output_path, 'w', newline='', encoding='utf-8') as f:
|
||||
writer = csv.DictWriter(f, fieldnames=fieldnames)
|
||||
writer.writeheader()
|
||||
writer.writerows(records)
|
||||
|
||||
print(f"Exported {len(records)} records to CSV: {output_path}")
|
||||
|
||||
|
||||
def export_to_json(records: List[Dict], output_path: Path):
|
||||
"""Export to JSON format"""
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(records, f, indent=2, ensure_ascii=False)
|
||||
|
||||
print(f"Exported {len(records)} records to JSON: {output_path}")
|
||||
|
||||
|
||||
def export_to_python(records: List[Dict], output_path: Path):
|
||||
"""Export to Python format (pandas DataFrame pickle)"""
|
||||
try:
|
||||
import pandas as pd
|
||||
except ImportError:
|
||||
print("Error: pandas is required for Python export. Install with: pip install pandas")
|
||||
sys.exit(1)
|
||||
|
||||
df = pd.DataFrame(records)
|
||||
|
||||
# Save as pickle
|
||||
pickle_path = output_path.with_suffix('.pkl')
|
||||
df.to_pickle(pickle_path)
|
||||
print(f"Exported {len(records)} records to pandas pickle: {pickle_path}")
|
||||
|
||||
# Also create a Python script to load it
|
||||
script_path = output_path.with_suffix('.py')
|
||||
script_content = f'''#!/usr/bin/env python3
|
||||
"""
|
||||
Data loading script
|
||||
Generated by extract_from_pdfs skill
|
||||
"""
|
||||
|
||||
import pandas as pd
|
||||
|
||||
# Load the data
|
||||
df = pd.read_pickle('{pickle_path.name}')
|
||||
|
||||
print(f"Loaded {{len(df)}} records")
|
||||
print(f"Columns: {{list(df.columns)}}")
|
||||
print("\\nFirst few rows:")
|
||||
print(df.head())
|
||||
|
||||
# Example analyses:
|
||||
# df.describe()
|
||||
# df.groupby('some_column').size()
|
||||
# df.to_csv('output.csv', index=False)
|
||||
'''
|
||||
|
||||
with open(script_path, 'w') as f:
|
||||
f.write(script_content)
|
||||
|
||||
print(f"Created loading script: {script_path}")
|
||||
|
||||
|
||||
def export_to_r(records: List[Dict], output_path: Path):
|
||||
"""Export to R format (RDS file)"""
|
||||
try:
|
||||
import pandas as pd
|
||||
import pyreadr
|
||||
except ImportError:
|
||||
print("Error: pandas and pyreadr are required for R export.")
|
||||
print("Install with: pip install pandas pyreadr")
|
||||
sys.exit(1)
|
||||
|
||||
df = pd.DataFrame(records)
|
||||
|
||||
# Save as RDS
|
||||
rds_path = output_path.with_suffix('.rds')
|
||||
pyreadr.write_rds(rds_path, df)
|
||||
print(f"Exported {len(records)} records to RDS: {rds_path}")
|
||||
|
||||
# Also create an R script to load it
|
||||
script_path = output_path.with_suffix('.R')
|
||||
script_content = f'''# Data loading script
|
||||
# Generated by extract_from_pdfs skill
|
||||
|
||||
# Load the data
|
||||
data <- readRDS('{rds_path.name}')
|
||||
|
||||
cat(sprintf("Loaded %d records\\n", nrow(data)))
|
||||
cat(sprintf("Columns: %s\\n", paste(colnames(data), collapse=", ")))
|
||||
cat("\\nFirst few rows:\\n")
|
||||
print(head(data))
|
||||
|
||||
# Example analyses:
|
||||
# summary(data)
|
||||
# table(data$some_column)
|
||||
# write.csv(data, 'output.csv', row.names=FALSE)
|
||||
'''
|
||||
|
||||
with open(script_path, 'w') as f:
|
||||
f.write(script_content)
|
||||
|
||||
print(f"Created loading script: {script_path}")
|
||||
|
||||
|
||||
def export_to_excel(records: List[Dict], output_path: Path):
|
||||
"""Export to Excel format"""
|
||||
try:
|
||||
import pandas as pd
|
||||
except ImportError:
|
||||
print("Error: pandas is required for Excel export. Install with: pip install pandas openpyxl")
|
||||
sys.exit(1)
|
||||
|
||||
df = pd.DataFrame(records)
|
||||
|
||||
# Save as Excel
|
||||
excel_path = output_path.with_suffix('.xlsx')
|
||||
df.to_excel(excel_path, index=False, engine='openpyxl')
|
||||
print(f"Exported {len(records)} records to Excel: {excel_path}")
|
||||
|
||||
|
||||
def export_to_sqlite(records: List[Dict], output_path: Path):
|
||||
"""Export to SQLite database"""
|
||||
try:
|
||||
import pandas as pd
|
||||
import sqlite3
|
||||
except ImportError:
|
||||
print("Error: pandas is required for SQLite export. Install with: pip install pandas")
|
||||
sys.exit(1)
|
||||
|
||||
df = pd.DataFrame(records)
|
||||
|
||||
# Create database
|
||||
db_path = output_path.with_suffix('.db')
|
||||
conn = sqlite3.connect(db_path)
|
||||
|
||||
# Write to database
|
||||
table_name = 'extracted_data'
|
||||
df.to_sql(table_name, conn, if_exists='replace', index=False)
|
||||
|
||||
conn.close()
|
||||
print(f"Exported {len(records)} records to SQLite database: {db_path}")
|
||||
print(f"Table name: {table_name}")
|
||||
|
||||
# Create SQL script with example queries
|
||||
sql_script_path = output_path.with_suffix('.sql')
|
||||
sql_content = f'''-- Example SQL queries for {db_path.name}
|
||||
-- Generated by extract_from_pdfs skill
|
||||
|
||||
-- View all records
|
||||
SELECT * FROM {table_name} LIMIT 10;
|
||||
|
||||
-- Count total records
|
||||
SELECT COUNT(*) as total_records FROM {table_name};
|
||||
|
||||
-- Example: Group by a column (adjust column name as needed)
|
||||
-- SELECT column_name, COUNT(*) as count
|
||||
-- FROM {table_name}
|
||||
-- GROUP BY column_name
|
||||
-- ORDER BY count DESC;
|
||||
'''
|
||||
|
||||
with open(sql_script_path, 'w') as f:
|
||||
f.write(sql_content)
|
||||
|
||||
print(f"Created SQL example script: {sql_script_path}")
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Load validated results
|
||||
results = load_results(Path(args.input))
|
||||
print(f"Loaded {len(results)} results")
|
||||
|
||||
# Extract records
|
||||
records = extract_records(
|
||||
results,
|
||||
flatten=args.flatten,
|
||||
include_metadata=args.include_metadata
|
||||
)
|
||||
print(f"Extracted {len(records)} records")
|
||||
|
||||
if not records:
|
||||
print("No records to export. Check your data.")
|
||||
return
|
||||
|
||||
# Export based on format
|
||||
output_path = Path(args.output)
|
||||
|
||||
if args.format == 'csv':
|
||||
export_to_csv(records, output_path)
|
||||
elif args.format == 'json':
|
||||
export_to_json(records, output_path)
|
||||
elif args.format == 'python':
|
||||
export_to_python(records, output_path)
|
||||
elif args.format == 'r':
|
||||
export_to_r(records, output_path)
|
||||
elif args.format == 'excel':
|
||||
export_to_excel(records, output_path)
|
||||
elif args.format == 'sqlite':
|
||||
export_to_sqlite(records, output_path)
|
||||
|
||||
print(f"\nExport complete!")
|
||||
print(f"Your data is ready for analysis in {args.format.upper()} format.")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
280
skills/extract_from_pdfs/scripts/07_prepare_validation_set.py
Normal file
@@ -0,0 +1,280 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Prepare a validation set for evaluating extraction quality.
|
||||
|
||||
This script helps you:
|
||||
1. Sample a subset of papers for manual annotation
|
||||
2. Set up a structured annotation file
|
||||
3. Guide the annotation process
|
||||
|
||||
The validation set is used to calculate precision and recall metrics.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import random
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any
|
||||
import sys
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Prepare validation set for extraction quality evaluation',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Workflow:
|
||||
1. This script samples papers from your extraction results
|
||||
2. It creates an annotation template based on your schema
|
||||
3. You manually annotate the sampled papers with ground truth
|
||||
4. Use 08_calculate_validation_metrics.py to compare automated vs. manual extraction
|
||||
|
||||
Sampling strategies:
|
||||
random : Random sample (good for overall quality)
|
||||
stratified: Sample by extraction characteristics (good for identifying weaknesses)
|
||||
diverse : Sample to maximize diversity (good for comprehensive evaluation)
|
||||
"""
|
||||
)
|
||||
parser.add_argument(
|
||||
'--extraction-results',
|
||||
required=True,
|
||||
help='JSON file with extraction results from step 03 or 04'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--schema',
|
||||
required=True,
|
||||
help='Extraction schema JSON file used in step 03'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='validation_set.json',
|
||||
help='Output file for validation annotations'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--sample-size',
|
||||
type=int,
|
||||
default=20,
|
||||
help='Number of papers to sample (default: 20, recommended: 20-50)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--strategy',
|
||||
choices=['random', 'stratified', 'diverse'],
|
||||
default='random',
|
||||
help='Sampling strategy (default: random)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--seed',
|
||||
type=int,
|
||||
default=42,
|
||||
help='Random seed for reproducibility'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_results(results_path: Path) -> Dict:
|
||||
"""Load extraction results"""
|
||||
with open(results_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def load_schema(schema_path: Path) -> Dict:
|
||||
"""Load extraction schema"""
|
||||
with open(schema_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def sample_random(results: Dict, sample_size: int, seed: int) -> List[str]:
|
||||
"""Random sampling strategy"""
|
||||
# Only sample from successful extractions
|
||||
successful = [
|
||||
paper_id for paper_id, result in results.items()
|
||||
if result.get('status') == 'success' and result.get('extracted_data')
|
||||
]
|
||||
|
||||
if len(successful) < sample_size:
|
||||
print(f"Warning: Only {len(successful)} successful extractions available")
|
||||
sample_size = len(successful)
|
||||
|
||||
random.seed(seed)
|
||||
return random.sample(successful, sample_size)
|
||||
|
||||
|
||||
def sample_stratified(results: Dict, sample_size: int, seed: int) -> List[str]:
|
||||
"""
|
||||
Stratified sampling: sample papers with different characteristics
|
||||
E.g., papers with many records vs. few records, different data completeness
|
||||
"""
|
||||
successful = {}
|
||||
for paper_id, result in results.items():
|
||||
if result.get('status') == 'success' and result.get('extracted_data'):
|
||||
data = result['extracted_data']
|
||||
# Count records if present
|
||||
num_records = len(data.get('records', [])) if 'records' in data else 0
|
||||
successful[paper_id] = num_records
|
||||
|
||||
if not successful:
|
||||
print("No successful extractions found")
|
||||
return []
|
||||
|
||||
# Create strata based on number of records
|
||||
strata = {
|
||||
'zero': [],
|
||||
'few': [], # 1-2 records
|
||||
'medium': [], # 3-5 records
|
||||
'many': [] # 6+ records
|
||||
}
|
||||
|
||||
for paper_id, count in successful.items():
|
||||
if count == 0:
|
||||
strata['zero'].append(paper_id)
|
||||
elif count <= 2:
|
||||
strata['few'].append(paper_id)
|
||||
elif count <= 5:
|
||||
strata['medium'].append(paper_id)
|
||||
else:
|
||||
strata['many'].append(paper_id)
|
||||
|
||||
# Sample proportionally from each stratum
|
||||
random.seed(seed)
|
||||
sampled = []
|
||||
total_papers = len(successful)
|
||||
|
||||
for stratum_name, papers in strata.items():
|
||||
if not papers:
|
||||
continue
|
||||
# Sample proportionally, at least 1 from each non-empty stratum
|
||||
stratum_sample_size = max(1, int(len(papers) / total_papers * sample_size))
|
||||
stratum_sample_size = min(stratum_sample_size, len(papers))
|
||||
sampled.extend(random.sample(papers, stratum_sample_size))
|
||||
|
||||
# If we haven't reached sample_size, add more randomly
|
||||
if len(sampled) < sample_size:
|
||||
remaining = [p for p in successful.keys() if p not in sampled]
|
||||
additional = min(sample_size - len(sampled), len(remaining))
|
||||
sampled.extend(random.sample(remaining, additional))
|
||||
|
||||
return sampled[:sample_size]
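# Worked example: with 100 successful papers split as zero=10, few=40, medium=30,
# many=20 and sample_size=20, the proportional draw is 2 + 8 + 6 + 4 = 20 papers,
# with at least one paper taken from every non-empty stratum.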
|
||||
|
||||
|
||||
def sample_diverse(results: Dict, sample_size: int, seed: int) -> List[str]:
|
||||
"""
|
||||
Diverse sampling: maximize diversity in sampled papers
|
||||
This is a simplified version - could be enhanced with actual diversity metrics
|
||||
"""
|
||||
# For now, use stratified sampling as a proxy for diversity
|
||||
return sample_stratified(results, sample_size, seed)
|
||||
|
||||
|
||||
def create_annotation_template(
|
||||
sampled_ids: List[str],
|
||||
results: Dict,
|
||||
schema: Dict
|
||||
) -> Dict:
|
||||
"""
|
||||
Create annotation template for manual validation.
|
||||
|
||||
Structure:
|
||||
{
|
||||
"paper_id": {
|
||||
"automated_extraction": {...},
|
||||
"ground_truth": null, # To be filled manually
|
||||
"notes": "",
|
||||
"annotator": "",
|
||||
"annotation_date": ""
|
||||
}
|
||||
}
|
||||
"""
|
||||
template = {
|
||||
"_instructions": {
|
||||
"overview": "This is a validation annotation file. For each paper, review the PDF and fill in the ground_truth field with the correct extraction.",
|
||||
"steps": [
|
||||
"1. Read the PDF for each paper_id",
|
||||
"2. Extract data according to the schema, filling the 'ground_truth' field",
|
||||
"3. The 'ground_truth' should have the same structure as 'automated_extraction'",
|
||||
"4. Add your name in 'annotator' and date in 'annotation_date'",
|
||||
"5. Use 'notes' field for any comments or ambiguities",
|
||||
"6. Once complete, use 08_calculate_validation_metrics.py to compare"
|
||||
],
|
||||
"schema_reference": schema.get('output_schema', {}),
|
||||
"tips": [
|
||||
"Be thorough: extract ALL relevant information, even if automated extraction missed it",
|
||||
"Be precise: use exact values as they appear in the paper",
|
||||
"Be consistent: follow the same schema structure",
|
||||
"Mark ambiguous cases in notes field"
|
||||
]
|
||||
},
|
||||
"validation_papers": {}
|
||||
}
|
||||
|
||||
for paper_id in sampled_ids:
|
||||
result = results[paper_id]
|
||||
template["validation_papers"][paper_id] = {
|
||||
"automated_extraction": result.get('extracted_data', {}),
|
||||
"ground_truth": None, # To be filled by annotator
|
||||
"notes": "",
|
||||
"annotator": "",
|
||||
"annotation_date": "",
|
||||
"_pdf_path": None, # Will try to find from metadata
|
||||
"_extraction_metadata": {
|
||||
"extraction_status": result.get('status'),
|
||||
"validation_status": result.get('validation_status'),
|
||||
"has_analysis": bool(result.get('analysis'))
|
||||
}
|
||||
}
|
||||
|
||||
return template
|
||||
|
||||
|
||||
def save_template(template: Dict, output_path: Path):
|
||||
"""Save annotation template to JSON file"""
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(template, f, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Load inputs
|
||||
results = load_results(Path(args.extraction_results))
|
||||
schema = load_schema(Path(args.schema))
|
||||
print(f"Loaded {len(results)} extraction results")
|
||||
|
||||
# Sample papers
|
||||
if args.strategy == 'random':
|
||||
sampled = sample_random(results, args.sample_size, args.seed)
|
||||
elif args.strategy == 'stratified':
|
||||
sampled = sample_stratified(results, args.sample_size, args.seed)
|
||||
elif args.strategy == 'diverse':
|
||||
sampled = sample_diverse(results, args.sample_size, args.seed)
|
||||
|
||||
print(f"Sampled {len(sampled)} papers using '{args.strategy}' strategy")
|
||||
|
||||
# Create annotation template
|
||||
template = create_annotation_template(sampled, results, schema)
|
||||
|
||||
# Save template
|
||||
output_path = Path(args.output)
|
||||
save_template(template, output_path)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print("Validation Set Preparation Complete")
|
||||
print(f"{'='*60}")
|
||||
print(f"Annotation file created: {output_path}")
|
||||
print(f"Papers to annotate: {len(sampled)}")
|
||||
print(f"\nNext steps:")
|
||||
print(f"1. Open {output_path} in a text editor")
|
||||
print(f"2. For each paper, read the PDF and fill in the 'ground_truth' field")
|
||||
print(f"3. Follow the schema structure shown in '_instructions'")
|
||||
print(f"4. Save your annotations")
|
||||
print(f"5. Run: python 08_calculate_validation_metrics.py --annotations {output_path}")
|
||||
print(f"\nTips for efficient annotation:")
|
||||
print(f"- Work in batches of 5-10 papers")
|
||||
print(f"- Use the automated extraction as a starting point to check")
|
||||
print(f"- Document any ambiguous cases in the notes field")
|
||||
print(f"- Consider having 2+ annotators for inter-rater reliability")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
513
skills/extract_from_pdfs/scripts/08_calculate_validation_metrics.py
Normal file
@@ -0,0 +1,513 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Calculate validation metrics (precision, recall, F1) for extraction quality.
|
||||
|
||||
Compares automated extraction against ground truth annotations to evaluate:
|
||||
- Field-level precision and recall
|
||||
- Record-level accuracy
|
||||
- Overall extraction quality
|
||||
|
||||
Handles different data types appropriately:
|
||||
- Boolean: exact match
|
||||
- Numeric: exact match or tolerance
|
||||
- String: exact match or fuzzy matching
|
||||
- Lists: set-based precision/recall
|
||||
- Nested objects: recursive comparison
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Any, Tuple, Optional
|
||||
from collections import defaultdict
|
||||
import sys
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""Parse command line arguments"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Calculate validation metrics for extraction quality',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Metrics calculated:
|
||||
Precision : Of extracted items, how many are correct?
|
||||
Recall : Of true items, how many were extracted?
|
||||
F1 Score : Harmonic mean of precision and recall
|
||||
Accuracy : Overall correctness (for boolean/categorical fields)
|
||||
|
||||
Field type handling:
|
||||
Boolean/Categorical : Exact match
|
||||
Numeric : Exact match or within tolerance
|
||||
String : Exact match or fuzzy (normalized)
|
||||
Lists : Set-based precision/recall
|
||||
Nested objects : Recursive field-by-field comparison
|
||||
|
||||
Output:
|
||||
- Overall metrics
|
||||
- Per-field metrics
|
||||
- Per-paper detailed comparison
|
||||
- Common error patterns
|
||||
"""
|
||||
)
|
||||
parser.add_argument(
|
||||
'--annotations',
|
||||
required=True,
|
||||
help='Annotation file from 07_prepare_validation_set.py (with ground truth filled in)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
default='validation_metrics.json',
|
||||
help='Output file for detailed metrics'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--report',
|
||||
default='validation_report.txt',
|
||||
help='Human-readable validation report'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--numeric-tolerance',
|
||||
type=float,
|
||||
default=0.0,
|
||||
help='Tolerance for numeric comparisons (default: 0.0 for exact match)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--fuzzy-strings',
|
||||
action='store_true',
|
||||
help='Use fuzzy string matching (normalize whitespace, case)'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--list-order-matters',
|
||||
action='store_true',
|
||||
help='Consider order in list comparisons (default: treat as sets)'
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def load_annotations(annotations_path: Path) -> Dict:
|
||||
"""Load annotations file"""
|
||||
with open(annotations_path, 'r', encoding='utf-8') as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def normalize_string(s: str, fuzzy: bool = False) -> str:
|
||||
"""Normalize string for comparison"""
|
||||
if not isinstance(s, str):
|
||||
return str(s)
|
||||
if fuzzy:
|
||||
return ' '.join(s.lower().split())
|
||||
return s
|
||||
|
||||
|
||||
def compare_boolean(automated: Any, truth: Any) -> Dict[str, int]:
|
||||
"""Compare boolean values"""
|
||||
if automated == truth:
|
||||
return {'tp': 1, 'fp': 0, 'fn': 0, 'tn': 0}
|
||||
elif automated and not truth:
|
||||
return {'tp': 0, 'fp': 1, 'fn': 0, 'tn': 0}
|
||||
elif not automated and truth:
|
||||
return {'tp': 0, 'fp': 0, 'fn': 1, 'tn': 0}
|
||||
else:
|
||||
return {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 1}
|
||||
|
||||
|
||||
def compare_numeric(automated: Any, truth: Any, tolerance: float = 0.0) -> bool:
|
||||
"""Compare numeric values with optional tolerance"""
|
||||
try:
|
||||
a = float(automated) if automated is not None else None
|
||||
t = float(truth) if truth is not None else None
|
||||
|
||||
if a is None and t is None:
|
||||
return True
|
||||
if a is None or t is None:
|
||||
return False
|
||||
|
||||
if tolerance > 0:
|
||||
return abs(a - t) <= tolerance
|
||||
else:
|
||||
return a == t
|
||||
except (ValueError, TypeError):
|
||||
return automated == truth
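# Example: compare_numeric(3.14, 3.1416, tolerance=0.01) -> True (difference 0.0016),
# while compare_numeric(3.14, 3.1416) -> False because the default is an exact match.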
|
||||
|
||||
|
||||
def compare_string(automated: Any, truth: Any, fuzzy: bool = False) -> bool:
|
||||
"""Compare string values"""
|
||||
if automated is None and truth is None:
|
||||
return True
|
||||
if automated is None or truth is None:
|
||||
return False
|
||||
|
||||
a = normalize_string(automated, fuzzy)
|
||||
t = normalize_string(truth, fuzzy)
|
||||
return a == t
|
||||
|
||||
|
||||
def compare_list(
|
||||
automated: List,
|
||||
truth: List,
|
||||
order_matters: bool = False,
|
||||
fuzzy: bool = False
|
||||
) -> Dict[str, int]:
|
||||
"""
|
||||
Compare lists and calculate precision/recall.
|
||||
|
||||
Returns counts of true positives, false positives, and false negatives.
|
||||
"""
|
||||
if automated is None:
|
||||
automated = []
|
||||
if truth is None:
|
||||
truth = []
|
||||
|
||||
if not isinstance(automated, list):
|
||||
automated = [automated]
|
||||
if not isinstance(truth, list):
|
||||
truth = [truth]
|
||||
|
||||
if order_matters:
|
||||
# Ordered comparison
|
||||
tp = sum(1 for a, t in zip(automated, truth) if compare_string(a, t, fuzzy))
|
||||
fp = max(0, len(automated) - len(truth))
|
||||
fn = max(0, len(truth) - len(automated))
|
||||
else:
|
||||
# Set-based comparison
|
||||
if fuzzy:
|
||||
auto_set = {normalize_string(x, fuzzy) for x in automated}
|
||||
truth_set = {normalize_string(x, fuzzy) for x in truth}
|
||||
else:
|
||||
auto_set = set(automated)
|
||||
truth_set = set(truth)
|
||||
|
||||
tp = len(auto_set & truth_set) # Intersection
|
||||
fp = len(auto_set - truth_set) # In automated but not in truth
|
||||
fn = len(truth_set - auto_set) # In truth but not in automated
|
||||
|
||||
return {'tp': tp, 'fp': fp, 'fn': fn}
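# Worked example (set-based, order_matters=False):
#   compare_list(['oak', 'pine', 'birch'], ['pine', 'birch', 'alder'])
#   -> {'tp': 2, 'fp': 1, 'fn': 1}   # 'oak' is a false positive, 'alder' a false negative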
|
||||
|
||||
|
||||
def calculate_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
|
||||
"""Calculate precision, recall, and F1 from counts"""
|
||||
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
|
||||
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
|
||||
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
|
||||
|
||||
return {
|
||||
'precision': precision,
|
||||
'recall': recall,
|
||||
'f1': f1,
|
||||
'tp': tp,
|
||||
'fp': fp,
|
||||
'fn': fn
|
||||
}
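# Worked example: calculate_metrics(tp=8, fp=2, fn=4)
#   precision = 8/10 = 0.80, recall = 8/12 = 0.667 (rounded), F1 = 0.727 (rounded)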
|
||||
|
||||
|
||||
def compare_field(
|
||||
automated: Any,
|
||||
truth: Any,
|
||||
field_name: str,
|
||||
config: Dict
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Compare a single field between automated and ground truth.
|
||||
|
||||
Returns metrics appropriate for the field type.
|
||||
"""
|
||||
# Determine field type
|
||||
if isinstance(truth, bool):
|
||||
return compare_boolean(automated, truth)
|
||||
elif isinstance(truth, (int, float)):
|
||||
match = compare_numeric(automated, truth, config['numeric_tolerance'])
|
||||
return {'tp': 1 if match else 0, 'fp': 0 if match else 1, 'fn': 0 if match else 1}
|
||||
elif isinstance(truth, str):
|
||||
match = compare_string(automated, truth, config['fuzzy_strings'])
|
||||
return {'tp': 1 if match else 0, 'fp': 0 if match else 1, 'fn': 0 if match else 1}
|
||||
elif isinstance(truth, list):
|
||||
return compare_list(automated, truth, config['list_order_matters'], config['fuzzy_strings'])
|
||||
elif isinstance(truth, dict):
|
||||
# Recursive comparison for nested objects
|
||||
return compare_nested(automated or {}, truth, config)
|
||||
elif truth is None:
|
||||
# Field should be empty/null
|
||||
if automated is None or automated == "" or automated == []:
|
||||
return {'tp': 1, 'fp': 0, 'fn': 0}
|
||||
else:
|
||||
return {'tp': 0, 'fp': 1, 'fn': 0}
|
||||
else:
|
||||
# Fallback to exact match
|
||||
match = automated == truth
|
||||
return {'tp': 1 if match else 0, 'fp': 0 if match else 1, 'fn': 0 if match else 1}
|
||||
|
||||
|
||||
def compare_nested(automated: Dict, truth: Dict, config: Dict) -> Dict[str, int]:
|
||||
"""Recursively compare nested objects"""
|
||||
total_counts = {'tp': 0, 'fp': 0, 'fn': 0}
|
||||
|
||||
all_fields = set(automated.keys()) | set(truth.keys())
|
||||
|
||||
for field in all_fields:
|
||||
auto_val = automated.get(field)
|
||||
truth_val = truth.get(field)
|
||||
|
||||
field_counts = compare_field(auto_val, truth_val, field, config)
|
||||
|
||||
for key in ['tp', 'fp', 'fn']:
|
||||
total_counts[key] += field_counts.get(key, 0)
|
||||
|
||||
return total_counts
|
||||
|
||||
|
||||
def evaluate_paper(
|
||||
paper_id: str,
|
||||
automated: Dict,
|
||||
truth: Dict,
|
||||
config: Dict
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Evaluate extraction for a single paper.
|
||||
|
||||
Returns field-level and overall metrics.
|
||||
"""
|
||||
if truth is None:
|
||||
return {
|
||||
'status': 'not_annotated',
|
||||
'message': 'Ground truth not provided'
|
||||
}
|
||||
|
||||
field_metrics = {}
|
||||
all_fields = set(automated.keys()) | set(truth.keys())
|
||||
|
||||
for field in all_fields:
|
||||
if field == 'records':
|
||||
# Special handling for records arrays
|
||||
auto_records = automated.get('records', [])
|
||||
truth_records = truth.get('records', [])
|
||||
|
||||
# Overall record count comparison (records are usually dicts, which are unhashable,
# so compare list lengths rather than set membership)
record_counts = {
'tp': min(len(auto_records), len(truth_records)),
'fp': max(0, len(auto_records) - len(truth_records)),
'fn': max(0, len(truth_records) - len(auto_records))
}
|
||||
|
||||
# Detailed record-level comparison
|
||||
record_details = []
|
||||
for i, (auto_rec, truth_rec) in enumerate(zip(auto_records, truth_records)):
|
||||
rec_comparison = compare_nested(auto_rec, truth_rec, config)
|
||||
record_details.append({
|
||||
'record_index': i,
|
||||
'metrics': calculate_metrics(**rec_comparison)
|
||||
})
|
||||
|
||||
field_metrics['records'] = {
|
||||
'count_metrics': calculate_metrics(**record_counts),
|
||||
'record_details': record_details
|
||||
}
|
||||
else:
|
||||
auto_val = automated.get(field)
|
||||
truth_val = truth.get(field)
|
||||
counts = compare_field(auto_val, truth_val, field, config)
|
||||
field_metrics[field] = calculate_metrics(**counts)
|
||||
|
||||
# Calculate overall metrics
|
||||
total_tp = sum(
|
||||
m.get('tp', 0) if isinstance(m, dict) and 'tp' in m
|
||||
else m.get('count_metrics', {}).get('tp', 0)
|
||||
for m in field_metrics.values()
|
||||
)
|
||||
total_fp = sum(
|
||||
m.get('fp', 0) if isinstance(m, dict) and 'fp' in m
|
||||
else m.get('count_metrics', {}).get('fp', 0)
|
||||
for m in field_metrics.values()
|
||||
)
|
||||
total_fn = sum(
|
||||
m.get('fn', 0) if isinstance(m, dict) and 'fn' in m
|
||||
else m.get('count_metrics', {}).get('fn', 0)
|
||||
for m in field_metrics.values()
|
||||
)
|
||||
|
||||
overall = calculate_metrics(total_tp, total_fp, total_fn)
|
||||
|
||||
return {
|
||||
'status': 'evaluated',
|
||||
'field_metrics': field_metrics,
|
||||
'overall': overall
|
||||
}
|
||||
|
||||
|
||||
def aggregate_metrics(paper_evaluations: Dict[str, Dict]) -> Dict[str, Any]:
    """Aggregate metrics across all papers"""
    # Collect field-level metrics
    field_aggregates = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})

    evaluated_papers = [
        p for p in paper_evaluations.values()
        if p.get('status') == 'evaluated'
    ]

    for paper_eval in evaluated_papers:
        for field, metrics in paper_eval.get('field_metrics', {}).items():
            if isinstance(metrics, dict):
                if 'tp' in metrics:
                    # Simple field
                    field_aggregates[field]['tp'] += metrics['tp']
                    field_aggregates[field]['fp'] += metrics['fp']
                    field_aggregates[field]['fn'] += metrics['fn']
                elif 'count_metrics' in metrics:
                    # Records field
                    field_aggregates[field]['tp'] += metrics['count_metrics']['tp']
                    field_aggregates[field]['fp'] += metrics['count_metrics']['fp']
                    field_aggregates[field]['fn'] += metrics['count_metrics']['fn']

    # Calculate metrics for each field
    field_metrics = {}
    for field, counts in field_aggregates.items():
        field_metrics[field] = calculate_metrics(**counts)

    # Overall aggregated metrics
    total_tp = sum(counts['tp'] for counts in field_aggregates.values())
    total_fp = sum(counts['fp'] for counts in field_aggregates.values())
    total_fn = sum(counts['fn'] for counts in field_aggregates.values())

    overall = calculate_metrics(total_tp, total_fp, total_fn)

    return {
        'overall': overall,
        'by_field': field_metrics,
        'num_papers_evaluated': len(evaluated_papers)
    }


def generate_report(
    paper_evaluations: Dict[str, Dict],
    aggregated: Dict,
    output_path: Path
):
    """Generate human-readable validation report"""
    lines = []
    lines.append("="*80)
    lines.append("EXTRACTION VALIDATION REPORT")
    lines.append("="*80)
    lines.append("")

    # Overall summary
    lines.append("OVERALL METRICS")
    lines.append("-"*80)
    overall = aggregated['overall']
    lines.append(f"Papers evaluated: {aggregated['num_papers_evaluated']}")
    lines.append(f"Precision: {overall['precision']:.2%}")
    lines.append(f"Recall: {overall['recall']:.2%}")
    lines.append(f"F1 Score: {overall['f1']:.2%}")
    lines.append(f"True Positives: {overall['tp']}")
    lines.append(f"False Positives: {overall['fp']}")
    lines.append(f"False Negatives: {overall['fn']}")
    lines.append("")

    # Per-field metrics
    lines.append("METRICS BY FIELD")
    lines.append("-"*80)
    lines.append(f"{'Field':<30} {'Precision':>10} {'Recall':>10} {'F1':>10}")
    lines.append("-"*80)

    for field, metrics in sorted(aggregated['by_field'].items()):
        lines.append(
            f"{field:<30} "
            f"{metrics['precision']:>9.1%} "
            f"{metrics['recall']:>9.1%} "
            f"{metrics['f1']:>9.1%}"
        )
    lines.append("")

    # Top errors
    lines.append("COMMON ISSUES")
    lines.append("-"*80)

    # Fields with low recall (missed information)
    low_recall = [
        (field, metrics) for field, metrics in aggregated['by_field'].items()
        if metrics['recall'] < 0.7 and metrics['fn'] > 0
    ]
    if low_recall:
        lines.append("\nFields with low recall (missed information):")
        for field, metrics in sorted(low_recall, key=lambda x: x[1]['recall']):
            lines.append(f"  - {field}: {metrics['recall']:.1%} recall, {metrics['fn']} missed items")

    # Fields with low precision (incorrect extractions)
    low_precision = [
        (field, metrics) for field, metrics in aggregated['by_field'].items()
        if metrics['precision'] < 0.7 and metrics['fp'] > 0
    ]
    if low_precision:
        lines.append("\nFields with low precision (incorrect extractions):")
        for field, metrics in sorted(low_precision, key=lambda x: x[1]['precision']):
            lines.append(f"  - {field}: {metrics['precision']:.1%} precision, {metrics['fp']} incorrect items")

    lines.append("")
    lines.append("="*80)

    # Write report
    report_text = "\n".join(lines)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(report_text)

    # Also print to console
    print(report_text)


def main():
    args = parse_args()

    # Load annotations
    annotations = load_annotations(Path(args.annotations))
    validation_papers = annotations.get('validation_papers', {})

    print(f"Loaded {len(validation_papers)} validation papers")

    # Check how many have ground truth
    annotated = sum(1 for p in validation_papers.values() if p.get('ground_truth') is not None)
    print(f"Papers with ground truth: {annotated}")

    if annotated == 0:
        print("\nError: No ground truth annotations found!")
        print("Please fill in the 'ground_truth' field for each paper in the annotation file.")
        sys.exit(1)

    # Configuration for comparisons
    config = {
        'numeric_tolerance': args.numeric_tolerance,
        'fuzzy_strings': args.fuzzy_strings,
        'list_order_matters': args.list_order_matters
    }

    # Evaluate each paper
    paper_evaluations = {}
    for paper_id, paper_data in validation_papers.items():
        automated = paper_data.get('automated_extraction', {})
        truth = paper_data.get('ground_truth')

        evaluation = evaluate_paper(paper_id, automated, truth, config)
        paper_evaluations[paper_id] = evaluation

        if evaluation['status'] == 'evaluated':
            overall = evaluation['overall']
            print(f"{paper_id}: P={overall['precision']:.2%} R={overall['recall']:.2%} F1={overall['f1']:.2%}")

    # Aggregate metrics
    aggregated = aggregate_metrics(paper_evaluations)

    # Save detailed metrics
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    detailed_output = {
        'summary': aggregated,
        'by_paper': paper_evaluations,
        'config': config
    }

    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(detailed_output, f, indent=2, ensure_ascii=False)

    print(f"\nDetailed metrics saved to: {output_path}")

    # Generate report
    report_path = Path(args.report)
    generate_report(paper_evaluations, aggregated, report_path)
    print(f"Validation report saved to: {report_path}")


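# Example invocation (a sketch only: the script filename below is illustrative,
# and the flag names assume parse_args() defines --annotations, --output,
# --report, --numeric-tolerance, --fuzzy-strings and --list-order-matters to
# match the args.* attributes used above):
#
#   python scripts/validate_extraction.py \
#       --annotations validation/annotations.json \
#       --output validation/metrics.json \
#       --report validation/report.txt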
if __name__ == '__main__':
    main()
152
skills/think_deeply/README.md
Normal file
@@ -0,0 +1,152 @@
# Deep Thinking Protocol - Claude Skill

A custom Claude skill that prevents automatic agreement or disagreement by enforcing deeper analysis and multi-perspective thinking.

## Overview

This skill makes Claude pause and think more deeply when responding to questions or statements that might trigger quick agreement or disagreement. Instead of reflexively validating user assumptions, Claude will:

- Reframe questions to expose underlying concerns
- Present multiple valid perspectives
- Identify context-dependent factors
- Provide nuanced, well-reasoned recommendations

## What Problem Does This Solve?

Without this skill, Claude may:

- Quickly agree with user statements without thorough analysis
- Accept embedded assumptions without questioning them
- Provide binary yes/no answers without exploring nuances
- Miss important context or alternative perspectives
- Validate the user's framing even when broader analysis would be more helpful

## How It Works

When Claude encounters questions or statements that could lead to automatic responses, this skill triggers a structured thinking process:

1. **Pause and Recognize**: Identify what's really being asked
2. **Reframe**: Transform the question into a broader investigation
3. **Map the Landscape**: Consider multiple perspectives, trade-offs, and context
4. **Structured Response**: Deliver analysis using a clear framework
5. **Avoid Anti-Patterns**: Resist reflexive agreement/disagreement

## Installation

### For Claude.ai (Web/Mobile Apps)

1. Download this repository as a ZIP file
2. Ensure the ZIP structure matches the layout below (see the packaging sketch after this list):
   ```
   claude_rethink.zip
   └── claude_rethink/
       ├── Skill.md
       └── README.md
   ```
3. Go to Claude.ai Settings > Capabilities > Skills
4. Click "Upload Skill" and select the ZIP file
5. Enable the "Deep Thinking Protocol" skill
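
If your download does not already match the layout in step 2, you can repackage it yourself. This is a minimal sketch using standard command-line tools; the `claude_rethink` folder name simply mirrors the structure shown above:

```bash
# Repackage the skill so Skill.md and README.md sit inside claude_rethink/
mkdir -p claude_rethink
cp Skill.md README.md claude_rethink/
zip -r claude_rethink.zip claude_rethink/
```

The resulting `claude_rethink.zip` is what you upload in step 4.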

### For Claude API

Place the `Skill.md` file in your skills directory according to your API integration setup. Consult the Claude API documentation for skill configuration.

## Usage Examples

### Example 1: Technology Choice

**User asks:** "React is better than Vue for this project, right?"

**Without skill:** "Yes, React would be a great choice!"

**With skill:** Claude will:
- Reframe: Identify what makes a framework "better" for the specific context
- Analyze: Compare React and Vue across multiple dimensions
- Consider: Team experience, project complexity, timeline, hiring needs
- Recommend: Provide context-dependent guidance with clear reasoning

### Example 2: Architectural Decisions

**User states:** "Obviously using microservices is the modern way to build applications."

**Without skill:** "You're right, microservices are definitely the modern approach!"

**With skill:** Claude will:
- Challenge: Question whether "modern" equals "appropriate"
- Explore: Analyze microservices vs. monolith trade-offs
- Contextualize: Consider team size, operational maturity, actual scale needs
- Advise: Suggest architecture based on specific requirements, not trends

### Example 3: Binary Choices

**User asks:** "Should I use TypeScript or JavaScript?"

**Without skill:** "TypeScript is the better choice - use TypeScript!"

**With skill:** Claude will:
- Expand: Transform binary choice into a spectrum of considerations
- Compare: Analyze benefits and trade-offs of each option
- Identify: Determine which factors matter for this specific situation
- Guide: Provide decision framework rather than simple directive

## When the Skill Activates

The skill triggers when Claude detects:

- Confirmation-seeking questions ("Is X the best?", "Should I do Y?")
- Leading statements ("Obviously A is better than B")
- Binary choice questions ("Which is better, X or Y?")
- Assumption-laden questions
- Situations prompting quick validation
- Polarizing statements

## Benefits

- **Deeper Analysis**: Forces consideration of multiple perspectives
- **Better Decisions**: Users receive context-dependent guidance
- **Reduced Bias**: Prevents confirmation bias from reflexive agreement
- **Learning**: Users understand trade-offs and decision factors
- **Intellectual Honesty**: Promotes truth-seeking over validation

## Configuration

The skill works out-of-the-box with no configuration needed. Claude will automatically apply the Deep Thinking Protocol when appropriate.

To adjust when the skill triggers, you can modify the description field in `Skill.md`:

```yaml
description: Engage deeper analysis when responding to user statements or questions requiring confirmation, preventing automatic agreement or disagreement
```

Make this description more specific to narrow triggering, or more general to broaden it.

## Version History

- **1.0.0** (2025-01-17): Initial release
  - Core deep thinking protocol
  - Structured response framework
  - Multiple usage examples

## Contributing

To improve this skill:

1. Test with various question types and note where it helps or hinders
2. Identify patterns where the skill should (or shouldn't) trigger
3. Refine the structured response framework
4. Add more examples for specific domains

## License

This skill is provided as-is for use with Claude. Modify and distribute freely.

## Support

For issues or questions:

- Review the `Skill.md` file for the complete protocol
- Test with different question phrasings
- Check the Claude skill documentation: https://support.claude.com/en/articles/12512198-how-to-create-custom-skills

## Related Resources

- [Claude Skills Documentation](https://support.claude.com/en/articles/12512198-how-to-create-custom-skills)
- [Professional Objectivity in AI Interactions](https://www.anthropic.com/research)
212
skills/think_deeply/SKILL.md
Normal file
@@ -0,0 +1,212 @@
---
name: thinking-deeply
description: Engages structured analysis to explore multiple perspectives and context dependencies before responding. Use when users ask confirmation-seeking questions, make leading statements, request binary choices, or when feeling inclined to quickly agree or disagree without thorough consideration.
---

# Thinking Deeply

## Purpose

This skill activates when you're about to respond to user statements, questions, or requests that could lead to automatic agreement or disagreement without thorough consideration. It enforces a structured thinking process to ensure responses are well-reasoned and consider multiple perspectives.

## When This Skill Activates

This skill should trigger in these scenarios:

1. **Confirmation-seeking questions**: "Is X the best approach?", "Should I do Y?", "Don't you think Z?" Any confirmation-seeking phrasing qualifies, regardless of its topic.
2. **Leading statements**: "Obviously A is better than B", "It's clear that..."
3. **Binary choice questions**: "Which is better, X or Y?"
4. **Assumption-laden questions**: Questions that contain embedded assumptions
5. **Quick validation requests**: Situations where you feel inclined to immediately agree or disagree
6. **Polarizing statements**: Strong claims that might trigger reflexive agreement/disagreement

## Core Protocol

When this skill activates, follow this structured approach:

### 1. PAUSE AND RECOGNIZE

First, identify why you're being triggered:
- What is the user actually asking or claiming?
- What assumptions are embedded in their question/statement?
- Am I feeling inclined to quickly agree or disagree?

### 2. REFRAME THE QUESTION

Transform the original query into a broader, more neutral investigation:
- Extract the core concern or goal beneath the surface question
- Identify what the user is really trying to achieve or understand
- Reformulate as an open exploration rather than a yes/no question

### 3. MAP THE LANDSCAPE

Before responding, systematically consider:

**Multiple Perspectives:**
- What are 3-5 different valid approaches or viewpoints?
- What would advocates of different positions say?
- What factors might I be initially overlooking?

**Context Dependencies:**
- Under what conditions might different answers be correct?
- What information is missing that would change the answer?
- What are the user's specific constraints, goals, and context?

**Trade-offs and Nuances:**
- What are the advantages and disadvantages of each option?
- What hidden costs or benefits exist?
- What second-order effects should be considered?

### 4. STRUCTURED RESPONSE FORMAT

Deliver your response using this framework:

**a) Acknowledge and Reframe:**
"Let me think about this more deeply. Rather than [original framing], I think the key question is: [reframed question]."

**b) Present Multiple Dimensions:**
Outline 2-4 relevant perspectives, approaches, or considerations:
- Option/Perspective A: [explanation, pros, cons, when it applies]
- Option/Perspective B: [explanation, pros, cons, when it applies]
- Option/Perspective C: [explanation, pros, cons, when it applies]

**c) Context-Dependent Analysis:**
"The best path forward depends on:"
- Factor 1: [how it influences the decision]
- Factor 2: [how it influences the decision]
- Factor 3: [how it influences the decision]

**d) Thoughtful Recommendation:**
Only after the above analysis, provide a nuanced recommendation:
"Given [context], I'd lean towards [option] because [reasoning]. However, [important caveats or alternative scenarios]."

### 5. ANTI-PATTERNS TO AVOID

Do NOT do these things when this skill is active:

❌ Immediately agreeing with the user's framing
❌ Giving a simple yes/no without exploration
❌ Reflexively validating or rejecting the premise
❌ Accepting embedded assumptions without examination
❌ Presenting only one perspective as "correct"
❌ Using language like "obviously," "clearly," "definitely" without justification
❌ Skipping the reframing step
❌ Forgetting to identify context dependencies

## Examples

### Example 1: Confirmation-Seeking Question

**User:** "React is better than Vue for this project, right?"

**Without Skill (Bad):**
"Yes, React would be a great choice for your project!"

**With Skill (Good):**
"Let me think about this more carefully. Rather than asking which framework is 'better,' I think the key question is: which framework best fits your project's specific requirements, team capabilities, and constraints?

Let me consider several dimensions:

**For React:**
- Larger ecosystem and job market
- More flexibility but requires more decisions
- Better for complex, large-scale applications
- Steeper learning curve

**For Vue:**
- Gentler learning curve, excellent documentation
- More opinionated structure (can be good or bad)
- Strong for rapid development
- Smaller ecosystem but still robust

**The best choice depends on:**
- Team experience: Does your team already know React or Vue?
- Project complexity: How large and complex will this grow?
- Timeline: How quickly do you need to ship?
- Hiring needs: Will you need to hire developers for this?

Given typical scenarios, I'd lean towards React if you're building a complex, long-term application with an experienced team, but Vue if you need rapid development with junior developers or want a more structured approach. However, without knowing your specific constraints, both could be excellent choices."

### Example 2: Leading Statement

**User:** "Obviously using microservices is the modern way to build applications."

**Without Skill (Bad):**
"You're right, microservices are definitely the modern architectural approach!"

**With Skill (Good):**
"Let me reconsider this more deeply. Rather than focusing on what's 'modern,' I think the key question is: what architectural approach best serves your application's specific needs, team structure, and operational capabilities?

**Microservices strengths:**
- Independent scaling and deployment
- Technology diversity
- Team autonomy
- Fault isolation

**Microservices challenges:**
- Operational complexity
- Distributed system challenges (networking, consistency, debugging)
- Requires mature DevOps practices
- Can be over-engineering for smaller applications

**Monolith strengths:**
- Simpler deployment and debugging
- Easier local development
- Better for smaller teams
- Lower operational overhead

**The architecture depends on:**
- Scale: What's your actual traffic and scaling needs?
- Team size: Do you have teams that need independence?
- Operational maturity: Do you have robust DevOps/monitoring?
- Application boundaries: Are there clear service boundaries?

Many successful modern applications use well-designed monoliths or modular monoliths. Microservices solve specific organizational and scaling problems but introduce complexity. I'd recommend starting with a modular monolith unless you have clear organizational needs for microservices, like multiple independent teams or proven scaling challenges."

### Example 3: Binary Choice

**User:** "Should I use TypeScript or JavaScript?"

**Without Skill (Bad):**
"TypeScript is the better choice - use TypeScript!"

**With Skill (Good):**
"Let me think about this more thoroughly. Rather than a simple either/or, I think the key question is: what level of type safety and tooling support provides the best balance of productivity and maintenance for your specific project and team?

**TypeScript advantages:**
- Catch errors at compile time
- Better IDE support and autocomplete
- Self-documenting code through types
- Easier refactoring in large codebases

**TypeScript trade-offs:**
- Additional build step and tooling complexity
- Learning curve for the type system
- Can slow down prototyping
- Sometimes fights you on dynamic patterns

**JavaScript advantages:**
- Simpler setup, no compilation needed
- Faster for quick prototypes and scripts
- Full flexibility for dynamic patterns
- Lower barrier to entry

**This decision depends on:**
- Project lifespan: Is this a long-term maintained codebase?
- Team experience: Is your team comfortable with TypeScript?
- Project size: How many lines of code and contributors?
- Change frequency: Will this need extensive refactoring?

For a long-term project with multiple contributors, I'd strongly recommend TypeScript - the upfront cost pays dividends in maintenance. For small scripts, prototypes, or solo projects with infrequent changes, JavaScript may be more pragmatic. There's also a middle ground: start with JavaScript and gradually adopt TypeScript as the project grows."

## Success Criteria

You've successfully applied this skill when:

✓ You've reframed the question to reveal deeper concerns
✓ You've presented multiple valid perspectives or approaches
✓ You've identified context-dependent factors that influence the answer
✓ You've avoided automatic agreement or disagreement
✓ You've provided nuanced reasoning rather than definitive pronouncements
✓ The user has more clarity about trade-offs and decision factors

## Remember

The goal is not to be contrarian or to always present counterarguments. The goal is to **think deeply and comprehensively** before responding, ensuring that your answer serves the user's actual needs rather than simply validating their initial framing.