---
name: extract-from-pdfs
description: This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.
---

# Extract Structured Data from Scientific PDFs

## Purpose

Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.

**Core capabilities:**
- Organize metadata from BibTeX, RIS, directories, or DOI lists
- Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
- Extract structured data from PDFs with customizable schemas
- Repair and validate JSON outputs automatically
- Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
- Calculate precision/recall metrics for quality assurance
- Export to Python, R, CSV, Excel, or SQLite

## When to Use This Skill

Use when:
- Conducting systematic literature reviews requiring data extraction
- Building databases from scientific publications
- Converting PDF collections to structured datasets
- Validating extraction quality with ground truth metrics
- Comparing extraction approaches (different models, prompts)

Do not use for:
- Single PDF summarization (use basic PDF reading instead)
- Full-text PDF search (use document search tools)
- PDF editing or manipulation

## Getting Started

### 1. Initial Setup

Read the setup guide for installation and configuration:

```bash
cat references/setup_guide.md
```

Key setup steps:
- Install dependencies: `conda env create -f environment.yml`
- Set API keys: `export ANTHROPIC_API_KEY='your-key'`
- Optional: Install Ollama for free local filtering

### 2. Define Extraction Requirements

**Ask the user:**
- Research domain and extraction goals
- How PDFs are organized (reference manager, directory, DOI list)
- Approximate collection size
- Preferred analysis environment (Python, R, etc.)

**Provide 2-3 example PDFs** to analyze structure and design schema.

### 3. Design Extraction Schema

Create custom schema from template:

```bash
cp assets/schema_template.json my_schema.json
```

Customize for the specific domain:
- Set `objective` describing what to extract
- Define `output_schema` with field types and descriptions
- Add domain-specific `instructions` for Claude
- Provide `output_example` showing desired format

See `assets/example_flower_visitors_schema.json` for a real-world ecology example.
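As a sketch of the shape such a schema can take (the field names below are illustrative guesses, not the template's actual keys — consult `assets/schema_template.json` for the real structure):

```json
{
  "objective": "Extract flower-visitor interaction records from ecology papers.",
  "instructions": [
    "Record each plant-visitor pair reported in tables or text.",
    "Leave counts null when the paper reports presence only."
  ],
  "output_schema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "plant_species": {"type": "string"},
        "visitor_species": {"type": "string"},
        "visit_count": {"type": ["integer", "null"]}
      },
      "required": ["plant_species", "visitor_species"]
    }
  },
  "output_example": [
    {"plant_species": "Trifolium pratense", "visitor_species": "Bombus terrestris", "visit_count": 12}
  ]
}
```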

## Workflow Execution

### Complete Pipeline

Run the 6-step pipeline (plus optional validation):

```bash
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source library.bib \
    --pdf-dir pdfs/ \
    --output metadata.json

# Step 2: Filter papers (optional - recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json

# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json

# Step 4: Repair JSON
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json

# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json

# Step 6: Export to analysis format
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --output results
```
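Each step's output file can be inspected directly before running the next. A minimal sketch of consuming the extraction output, assuming each entry holds paper metadata plus a list of extracted records — the actual field names depend on your schema:

```python
import json

# Hypothetical shape of extracted_data.json entries; real keys come from
# the extraction schema, so adjust field names accordingly.
sample = json.loads("""
[{"doi": "10.1234/example", "records": [
    {"plant_species": "Trifolium pratense", "visitor_species": "Bombus terrestris"}
]}]
""")

# Flatten paper-level entries into one row per extracted record
rows = [{"doi": paper["doi"], **rec}
        for paper in sample
        for rec in paper["records"]]
print(len(rows), rows[0]["visitor_species"])
```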

### Validation (Optional but Recommended)

Calculate extraction quality metrics:

```bash
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json

# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper

# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

Validation produces precision, recall, and F1 metrics per field and overall.

## Detailed Documentation

Access comprehensive guides in the `references/` directory:

**Setup and installation:**
```bash
cat references/setup_guide.md
```

**Complete workflow with examples:**
```bash
cat references/workflow_guide.md
```

**Validation methodology:**
```bash
cat references/validation_guide.md
```

**API integration details:**
```bash
cat references/api_reference.md
```

## Customization

### Schema Customization

Modify `my_schema.json` to match the research domain:

1. **Objective:** Describe what data to extract
2. **Instructions:** Step-by-step extraction guidance
3. **Output schema:** JSON schema defining structure
4. **Important notes:** Domain-specific rules
5. **Examples:** Show desired output format

Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.

### API Configuration

Configure external database validation in `my_api_config.json`:

Map extracted fields to validation APIs:
- `gbif_taxonomy` - Biological taxonomy
- `wfo_plants` - Plant names specifically
- `geonames` - Geographic locations
- `geocode` - Address to coordinates
- `pubchem` - Chemical compounds
- `ncbi_gene` - Gene identifiers

See `assets/example_api_config_ecology.json` for an ecology-specific example.
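For instance, `gbif_taxonomy` checks extracted names against GBIF's public species-match endpoint. A sketch of the request involved — the endpoint is real, but whether the script builds it exactly this way is an assumption:

```python
from urllib.parse import urlencode

def gbif_match_url(name: str) -> str:
    # GBIF's name-matching endpoint; the JSON response includes the match
    # type, canonical name, and taxon keys usable for enrichment.
    return "https://api.gbif.org/v1/species/match?" + urlencode({"name": name})

url = gbif_match_url("Apis mellifera")
```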

### Filtering Customization

Edit filtering criteria in `scripts/02_filter_abstracts.py` (line 74):

Replace TODO section with domain-specific criteria:
- What constitutes primary data vs review?
- What data types are relevant?
- What scope (geographic, temporal, taxonomic) is needed?

Use conservative criteria (when in doubt, include paper) to avoid false negatives.
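One way to encode that conservatism is to drop a paper only on a clear rejection and include everything else, uncertain or unparseable labels included. A sketch (the label strings are hypothetical, not the script's actual output format):

```python
def keep_paper(label: str) -> bool:
    """Conservative inclusion: exclude only on an unambiguous 'no'."""
    # "yes", "maybe", or anything unexpected all pass through to extraction
    return label.strip().lower() != "no"

decisions = [keep_paper(x) for x in ["yes", "No", "uncertain", "???"]]
```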

## Cost Optimization

**Backend selection for filtering (Step 2):**
- Ollama (local): $0 - Best for privacy and high volume
- Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
- Sonnet (API): ~$3/M tokens - Best for complex filtering

**Typical costs for 100 papers:**
- With filtering (Haiku + Sonnet): ~$4
- With local Ollama + Sonnet: ~$3.75
- Without filtering (Sonnet only): ~$7.50
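These totals follow from simple per-token arithmetic. A back-of-envelope sketch — the token counts and keep rate below are assumptions chosen to roughly reproduce the figures above, not measured values:

```python
HAIKU_PER_M = 0.25   # USD per million tokens (approximate)
SONNET_PER_M = 3.00

papers = 100
abstract_tokens = 2_000   # assumed tokens per abstract
pdf_tokens = 25_000       # assumed tokens per full PDF
keep_rate = 0.5           # assumed fraction of papers passing the filter

filter_cost = papers * abstract_tokens * HAIKU_PER_M / 1e6
extract_cost = papers * keep_rate * pdf_tokens * SONNET_PER_M / 1e6
print(f"filter ~${filter_cost:.2f} + extraction ~${extract_cost:.2f}")
```

Setting `keep_rate = 1.0` and skipping the filter pass recovers the ~$7.50 no-filtering figure under the same assumptions.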

**Optimization strategies:**
- Use abstract filtering to reduce PDF processing
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Quality Assurance

**Validation workflow provides:**
- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 score: Harmonic mean of precision and recall
- Per-field metrics: Identify weak fields
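For set-valued fields these definitions reduce to a few lines. A minimal sketch of the computation — not the validation script's actual implementation, which also aggregates across papers and fields:

```python
def prf1(extracted, truth):
    """Precision, recall, and F1 for one field, comparing value sets."""
    extracted, truth = set(extracted), set(truth)
    tp = len(extracted & truth)                            # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One correct species, one false positive, one miss -> P = R = F1 = 0.5
p, r, f = prf1({"Apis mellifera", "Bombus terrestris"},
               {"Apis mellifera", "Osmia bicornis"})
```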

**Use metrics to:**
- Establish baseline extraction quality
- Compare different approaches (models, prompts, schemas)
- Identify areas for improvement
- Report extraction quality in publications

**Recommended sample sizes:**
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers
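Reproducibility matters here: if the validation set ever needs regenerating, a fixed seed keeps the draw identical. A simple random-sampling sketch — the pipeline's `stratified` strategy additionally balances across groups, which this omits:

```python
import random

def sample_for_validation(paper_ids, n, seed=42):
    # Sort first so the draw is deterministic regardless of input order
    rng = random.Random(seed)
    return rng.sample(sorted(paper_ids), min(n, len(paper_ids)))

subset = sample_for_validation({f"paper_{i:03d}" for i in range(150)}, 20)
```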

## Iterative Improvement

1. Run initial extraction with baseline schema
2. Validate on sample using Steps 7-9
3. Analyze field-level metrics and error patterns
4. Revise schema, prompts, or model selection
5. Re-extract and re-validate
6. Compare metrics to verify improvement
7. Repeat until acceptable quality achieved

See `references/validation_guide.md` for detailed guidance on interpreting metrics and improving extraction quality.

## Available Scripts

**Data organization:**
- `scripts/01_organize_metadata.py` - Standardize PDFs and metadata

**Filtering:**
- `scripts/02_filter_abstracts.py` - Filter by abstract (Haiku/Sonnet/Ollama)

**Extraction:**
- `scripts/03_extract_from_pdfs.py` - Extract from PDFs with Claude vision

**Processing:**
- `scripts/04_repair_json.py` - Repair and validate JSON
- `scripts/05_validate_with_apis.py` - Enrich with external databases
- `scripts/06_export_database.py` - Export to analysis formats

**Validation:**
- `scripts/07_prepare_validation_set.py` - Sample papers for annotation
- `scripts/08_calculate_validation_metrics.py` - Calculate P/R/F1 metrics

## Assets

**Templates:**
- `assets/schema_template.json` - Blank extraction schema template
- `assets/api_config_template.json` - API validation configuration template

**Examples:**
- `assets/example_flower_visitors_schema.json` - Ecology extraction example
- `assets/example_api_config_ecology.json` - Ecology API validation example