---
name: extract-from-pdfs
description: This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.
---
# Extract Structured Data from Scientific PDFs
## Purpose
Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.
**Core capabilities:**
- Organize metadata from BibTeX, RIS, directories, or DOI lists
- Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
- Extract structured data from PDFs with customizable schemas
- Repair and validate JSON outputs automatically
- Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
- Calculate precision/recall metrics for quality assurance
- Export to Python, R, CSV, Excel, or SQLite
## When to Use This Skill
Use when:
- Conducting systematic literature reviews requiring data extraction
- Building databases from scientific publications
- Converting PDF collections to structured datasets
- Validating extraction quality with ground truth metrics
- Comparing extraction approaches (different models, prompts)
Do not use for:
- Single PDF summarization (use basic PDF reading instead)
- Full-text PDF search (use document search tools)
- PDF editing or manipulation
## Getting Started
### 1. Initial Setup
Read the setup guide for installation and configuration:
```bash
cat references/setup_guide.md
```
Key setup steps:
- Install dependencies: `conda env create -f environment.yml`
- Set API keys: `export ANTHROPIC_API_KEY='your-key'`
- Optional: Install Ollama for free local filtering
### 2. Define Extraction Requirements
**Ask the user:**
- Research domain and extraction goals
- How PDFs are organized (reference manager, directory, DOI list)
- Approximate collection size
- Preferred analysis environment (Python, R, etc.)
**Ask for 2-3 example PDFs** from the user to analyze document structure and design the schema.
### 3. Design Extraction Schema
Create custom schema from template:
```bash
cp assets/schema_template.json my_schema.json
```
Customize for the specific domain:
- Set `objective` describing what to extract
- Define `output_schema` with field types and descriptions
- Add domain-specific `instructions` for Claude
- Provide `output_example` showing desired format
See `assets/example_flower_visitors_schema.json` for a real-world ecology example.
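For orientation, the sketch below shows what a filled-in schema might contain. The four top-level keys match the template fields described above; the specific field names, types, and example values are illustrative assumptions, so treat the bundled template and ecology example as authoritative.
```python
# Sketch: write a minimal custom schema. The top-level keys follow the template
# described above; the field contents are illustrative only.
import json

schema = {
    "objective": "Extract plant-pollinator interaction records from each paper.",
    "instructions": [
        "Read the methods and results sections.",
        "Record one entry per plant-visitor pair reported in the paper.",
        "Leave a field null if the paper does not report it.",
    ],
    "output_schema": {
        "type": "object",
        "properties": {
            "plant_species": {"type": "string", "description": "Latin binomial of the plant"},
            "visitor_species": {"type": "string", "description": "Latin binomial of the visitor"},
            "country": {"type": ["string", "null"], "description": "Study country, if reported"},
        },
        "required": ["plant_species", "visitor_species"],
    },
    "output_example": {
        "plant_species": "Salvia pratensis",
        "visitor_species": "Bombus terrestris",
        "country": "Germany",
    },
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```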
## Workflow Execution
### Complete Pipeline
Run the 6-step pipeline (plus optional validation):
```bash
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source library.bib \
--pdf-dir pdfs/ \
--output metadata.json
# Step 2: Filter papers (optional but recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
# Step 4: Repair JSON
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
# Step 6: Export to analysis format
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--output results
```
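To spot-check intermediate output before exporting, the cleaned or validated JSON can be inspected directly. A minimal sketch, assuming the file holds one record per paper (the exact structure depends on the schema):
```python
# Sketch: quick inspection of pipeline output. Assumes validated_data.json
# contains a list (or a dict keyed by paper ID) of extracted records; adjust
# to the structure your schema actually produces.
import json

with open("validated_data.json") as f:
    data = json.load(f)

records = data if isinstance(data, list) else list(data.values())
print(f"Loaded {len(records)} records")
if records:
    print(json.dumps(records[0], indent=2)[:500])  # preview the first record
```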
### Validation (Optional but Recommended)
Calculate extraction quality metrics:
```bash
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper
# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
```
Validation produces precision, recall, and F1 metrics per field and overall.
## Detailed Documentation
Access comprehensive guides in the `references/` directory:
**Setup and installation:**
```bash
cat references/setup_guide.md
```
**Complete workflow with examples:**
```bash
cat references/workflow_guide.md
```
**Validation methodology:**
```bash
cat references/validation_guide.md
```
**API integration details:**
```bash
cat references/api_reference.md
```
## Customization
### Schema Customization
Modify `my_schema.json` to match the research domain:
1. **Objective:** Describe what data to extract
2. **Instructions:** Step-by-step extraction guidance
3. **Output schema:** JSON schema defining structure
4. **Important notes:** Domain-specific rules
5. **Examples:** Show desired output format
Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.
### API Configuration
Configure external database validation in `my_api_config.json` by mapping extracted fields to validation APIs:
- `gbif_taxonomy` - Biological taxonomy
- `wfo_plants` - Plant names specifically
- `geonames` - Geographic locations
- `geocode` - Address to coordinates
- `pubchem` - Chemical compounds
- `ncbi_gene` - Gene identifiers
See `assets/example_api_config_ecology.json` for an ecology-specific example.
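As a rough illustration, a field-to-API mapping might look like the sketch below. The configuration structure shown here is hypothetical; follow `assets/api_config_template.json` for the actual format. Only the API names are taken from the list above.
```python
# Sketch: a hypothetical field-to-API mapping. The real configuration keys and
# layout are defined by assets/api_config_template.json; only the API names
# below come from the supported list.
import json

api_config = {
    "plant_species": "gbif_taxonomy",   # validate taxon names against GBIF
    "study_location": "geonames",       # resolve place names
    "compound_name": "pubchem",         # check chemical identifiers
}

with open("my_api_config.json", "w") as f:
    json.dump(api_config, f, indent=2)
```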
### Filtering Customization
Edit filtering criteria in `scripts/02_filter_abstracts.py` (line 74):
Replace the TODO section with domain-specific criteria:
- What constitutes primary data vs review?
- What data types are relevant?
- What scope (geographic, temporal, taxonomic) is needed?
Use conservative criteria (when in doubt, include the paper) to avoid false negatives.
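For example, the replacement criteria might read something like the sketch below (the variable name and placement are hypothetical; the TODO section in the script determines where the text actually goes):
```python
# Sketch: example filtering criteria to substitute for the TODO section in
# scripts/02_filter_abstracts.py. The variable name is hypothetical; only the
# criteria text matters.
FILTER_CRITERIA = """
Include the paper if the abstract indicates:
- primary field or experimental data (not a review or meta-analysis)
- observations of flower-visiting insects on identified plant species
- a study located anywhere in Europe, from any time period

When in doubt, include the paper (prefer false positives over false negatives).
"""
```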
## Cost Optimization
**Backend selection for filtering (Step 2):**
- Ollama (local): $0 - Best for privacy and high volume
- Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
- Sonnet (API): ~$3/M tokens - Best for complex filtering
**Typical costs for 100 papers:**
- With filtering (Haiku + Sonnet): ~$4
- With local Ollama + Sonnet: ~$3.75
- Without filtering (Sonnet only): ~$7.50
**Optimization strategies:**
- Use abstract filtering to reduce PDF processing
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`
## Quality Assurance
**Validation workflow provides:**
- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 score: Harmonic mean of precision and recall (see the sketch below)
- Per-field metrics: Identify weak fields
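A minimal sketch of how these metrics are computed for a single field, assuming a simple set comparison of extracted versus ground-truth values (the validation scripts handle matching and per-field aggregation themselves):
```python
# Sketch: per-field precision, recall, and F1 from sets of extracted vs.
# ground-truth values. Simplified illustration only.
extracted = {"Bombus terrestris", "Apis mellifera", "Episyrphus balteatus"}
ground_truth = {"Bombus terrestris", "Apis mellifera", "Bombus lapidarius"}

true_positives = len(extracted & ground_truth)       # 2 correct extractions
precision = true_positives / len(extracted)          # 2/3: correct share of extracted items
recall = true_positives / len(ground_truth)          # 2/3: share of true items recovered
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean = 2/3

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```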
**Use metrics to:**
- Establish baseline extraction quality
- Compare different approaches (models, prompts, schemas)
- Identify areas for improvement
- Report extraction quality in publications
**Recommended sample sizes:**
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers
## Iterative Improvement
1. Run initial extraction with baseline schema
2. Validate on sample using Steps 7-9
3. Analyze field-level metrics and error patterns
4. Revise schema, prompts, or model selection
5. Re-extract and re-validate
6. Compare metrics to verify improvement
7. Repeat until acceptable quality is achieved
See `references/validation_guide.md` for detailed guidance on interpreting metrics and improving extraction quality.
## Available Scripts
**Data organization:**
- `scripts/01_organize_metadata.py` - Standardize PDFs and metadata
**Filtering:**
- `scripts/02_filter_abstracts.py` - Filter by abstract (Haiku/Sonnet/Ollama)
**Extraction:**
- `scripts/03_extract_from_pdfs.py` - Extract from PDFs with Claude vision
**Processing:**
- `scripts/04_repair_json.py` - Repair and validate JSON
- `scripts/05_validate_with_apis.py` - Enrich with external databases
- `scripts/06_export_database.py` - Export to analysis formats
**Validation:**
- `scripts/07_prepare_validation_set.py` - Sample papers for annotation
- `scripts/08_calculate_validation_metrics.py` - Calculate P/R/F1 metrics
## Assets
**Templates:**
- `assets/schema_template.json` - Blank extraction schema template
- `assets/api_config_template.json` - API validation configuration template
**Examples:**
- `assets/example_flower_visitors_schema.json` - Ecology extraction example
- `assets/example_api_config_ecology.json` - Ecology API validation example