# Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

## Overview

This skill provides an end-to-end workflow for:
- Organizing PDF literature and metadata from various sources
- Filtering relevant papers based on abstract content (optional)
- Extracting structured data from full PDFs using Claude's vision capabilities
- Repairing and validating JSON outputs
- Enriching data with external scientific databases
- Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)

## Quick Start

### 1. Installation

Create a conda environment:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

Or install with pip:

```bash
pip install -r requirements.txt
```

### 2. Set Up API Keys

Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

For geographic validation (optional):
```bash
export GEONAMES_USERNAME='your-geonames-username'
```
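
If you want to confirm the keys are visible before starting a long run, a quick check from Python works; the variable names match the exports above, and the check itself is just a convenience sketch:

```python
import os

# Confirm credentials are present before starting a run.
for var, required in [("ANTHROPIC_API_KEY", True), ("GEONAMES_USERNAME", False)]:
    if os.environ.get(var):
        print(f"{var} is set")
    elif required:
        raise SystemExit(f"{var} is missing - export it before running the pipeline")
    else:
        print(f"{var} not set (optional, only needed for geographic validation)")
```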

### 3. Run the Skill

The easiest way to use the skill is through Claude Code:

```bash
claude-code
```

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.

## Documentation

The skill includes comprehensive reference documentation:

- `references/setup_guide.md` - Installation and configuration
- `references/workflow_guide.md` - Complete step-by-step workflow with examples
- `references/validation_guide.md` - Validation methodology and metrics interpretation
- `references/api_reference.md` - External API integration details

## Manual Workflow

You can also run the scripts manually:

### Step 1: Organize Metadata

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### Step 2: Filter Papers (Optional)

First, customize the filtering prompt in `scripts/02_filter_abstracts.py` for your use case.

**Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)**
```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Local Model via Ollama (FREE)**
```bash
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

Recommended Ollama models:
- `llama3.1:8b` - Good balance (8GB RAM)
- `mistral:7b` - Fast, good for simple filtering
- `qwen2.5:7b` - Good multilingual support
- `llama3.1:70b` - Better accuracy (64GB RAM)
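
Before launching the filtering script against Ollama, it can help to confirm the local server is running and the model has been pulled. A minimal sketch, assuming Ollama's default endpoint at `http://localhost:11434` and the `requests` package:

```python
import requests

# Ollama's default local endpoint; adjust if you run the server elsewhere.
OLLAMA_URL = "http://localhost:11434"

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)
if not any(name.startswith("llama3.1:8b") for name in models):
    print("llama3.1:8b not found - run: ollama pull llama3.1:8b")
```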

### Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing `assets/schema_template.json`.

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```

### Step 4: Repair JSON

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```

### Step 5: Validate with APIs

First, create your API configuration by copying and customizing `assets/api_config_template.json`.

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```

### Step 6: Export

```bash
# For Python/pandas
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results

# For R
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results

# For CSV
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```
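
Once the export finishes, the Python output can be inspected directly with pandas. A minimal sketch; the exact file names written by the export script are an assumption here (a pickle for `--format python`, a CSV for `--format csv`), so adjust them to whatever the script reports:

```python
import pandas as pd

# Load the flattened export; file names are assumptions based on the
# --output values used above.
df = pd.read_pickle("results.pkl")   # from --format python
# df = pd.read_csv("results.csv")    # from --format csv

print(df.shape)
print(df.columns.tolist())
print(df.head())
```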

### Validation & Quality Assurance (Optional but Recommended)

Validate extraction quality using precision and recall metrics:

#### Step 7: Prepare Validation Set

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Sampling strategies:
- `random` - Random sample
- `stratified` - Sample by extraction characteristics
- `diverse` - Maximize diversity

#### Step 8: Manual Annotation

1. Open `validation_set.json`
2. For each sampled paper:
   - Read the PDF
   - Fill in the `ground_truth` field with the correct extraction
   - Add the `annotator` name and `annotation_date`
   - Use `notes` for ambiguous cases
3. Save the file
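
For orientation, a completed entry might look roughly like the sketch below. The surrounding record structure is an assumption (check the file produced by Step 7); only the `ground_truth`, `annotator`, `annotation_date`, and `notes` fields are taken from the steps above:

```python
# Hypothetical shape of one annotated record in validation_set.json.
annotated_entry = {
    "paper_id": "smith_2021",                            # assumed identifier field
    "extracted": {"species": ["Apis mellifera"]},        # Claude's output, left untouched
    "ground_truth": {"species": ["Apis mellifera", "Bombus terrestris"]},
    "annotator": "A. Researcher",
    "annotation_date": "2025-01-15",
    "notes": "Second species only mentioned in a table footnote.",
}
```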

#### Step 9: Calculate Metrics

```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

This produces:
- **Precision**: % of extracted items that are correct
- **Recall**: % of true items that were extracted
- **F1 Score**: Harmonic mean of precision and recall (see the sketch below)
- **Per-field metrics**: Accuracy by field type
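
The metrics themselves are standard; for reference, this is how they reduce to counts of matched items (a sketch of the arithmetic, not the script's internals):

```python
# Items extracted by Claude vs. items in the manual ground truth for one field.
extracted = {"Apis mellifera", "Bombus terrestris", "Vespa crabro"}
truth = {"Apis mellifera", "Bombus terrestris", "Osmia bicornis"}

tp = len(extracted & truth)   # correct extractions
fp = len(extracted - truth)   # extracted but wrong
fn = len(truth - extracted)   # missed by extraction

precision = tp / (tp + fp)    # 2/3
recall = tp / (tp + fn)       # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```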

Use these metrics to:
- Identify weak points in extraction prompts
- Compare models (Haiku vs Sonnet vs Ollama)
- Iterate and improve the schema
- Report quality in publications

## Customization

### Creating Your Extraction Schema

1. Copy `assets/schema_template.json` to `my_schema.json`
2. Customize the following sections:
   - `objective`: What you're extracting
   - `system_context`: Your scientific domain
   - `instructions`: Step-by-step guidance for Claude
   - `output_schema`: JSON schema defining your data structure
   - `output_example`: Example of desired output

See `assets/example_flower_visitors_schema.json` for a real-world example.
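
As a starting point, the sketch below writes a bare-bones schema containing only the five sections listed above; everything else (field names, wording) is illustrative and should come from `assets/schema_template.json`, not from this snippet:

```python
import json

# Minimal illustrative schema; the five top-level keys come from the list
# above, all values are placeholders for your own study design.
schema = {
    "objective": "Extract flower-visitor records from pollination papers",
    "system_context": "You are an expert in plant-insect interactions.",
    "instructions": [
        "Read the full PDF, including tables and appendices.",
        "Report one record per plant-visitor pair.",
    ],
    "output_schema": {
        "type": "object",
        "properties": {
            "records": {"type": "array", "items": {"type": "object"}},
        },
    },
    "output_example": {"records": []},
}

with open("my_schema.json", "w") as fh:
    json.dump(schema, fh, indent=2)
```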

### Configuring API Validation

1. Copy `assets/api_config_template.json` to `my_api_config.json`
2. Map your schema fields to appropriate validation APIs
3. See the available APIs in `scripts/05_validate_with_apis.py` and `references/api_reference.md`

See `assets/example_api_config_ecology.json` for an ecology example.

## Cost Estimation

PDF processing uses approximately 1,500-3,000 tokens per page:

- 10-page paper: ~20,000-30,000 tokens
- 100 papers: ~2-3M tokens
- With Sonnet 4.5: ~$6-9 for 100 papers (see the sketch below)
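
The figures above follow from simple multiplication. A back-of-envelope sketch, assuming roughly $3 per million input tokens for Sonnet (an assumption; check current pricing before budgeting):

```python
# Back-of-envelope cost estimate; the per-token price is an assumption,
# the page and token counts come from the figures above.
papers = 100
pages_per_paper = 10
tokens_per_page = 2500       # midpoint of the 1,500-3,000 range
price_per_m_input = 3.00     # assumed USD per million input tokens

total_tokens = papers * pages_per_paper * tokens_per_page
cost = total_tokens / 1_000_000 * price_per_m_input
print(f"{total_tokens:,} tokens ~ ${cost:.2f}")   # 2,500,000 tokens ~ $7.50
```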

Tips to reduce costs:
- Use abstract filtering (Step 2) to reduce full PDF processing
- Enable prompt caching with `--use-caching`
- Use batch processing (`--method batches`)
- Consider using Haiku for simpler extractions

## Supported Data Sources

### Bibliography Formats
- BibTeX (Zotero, JabRef, etc.)
- RIS (Mendeley, EndNote, etc.)
- Directory of PDFs
- List of DOIs

### Output Formats
- Python (pandas DataFrame pickle)
- R (RDS file)
- CSV
- JSON
- Excel
- SQLite database

### Validation APIs
- **Biology**: GBIF, World Flora Online, NCBI Gene
- **Geography**: GeoNames, OpenStreetMap Nominatim
- **Chemistry**: PubChem
- **Medicine**: (extensible - add your own)
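
For a sense of what these validators do, the sketch below checks a species name against GBIF's public name-matching endpoint (a real, keyless API); this is only an illustration, not the skill's own validator code:

```python
import requests

# GBIF species name matching - no API key required.
name = "Apis melifera"   # deliberately misspelled to show fuzzy matching
resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": name},
    timeout=10,
)
resp.raise_for_status()
match = resp.json()
print(match.get("matchType"), match.get("scientificName"), match.get("confidence"))
```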

## Examples

See the [beetle flower visitors repository](https://github.com/brunoasm/ARE_2026_beetle_flower_visitors) for a real-world example of this workflow in action.

## Troubleshooting

### PDF Size Limits
- Maximum file size: 32MB
- Maximum pages: 100
- Solution: Use chunked processing for larger PDFs, as sketched below
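
One way to chunk an oversized PDF before extraction, assuming the `pypdf` package (not a dependency of this skill, just a convenient tool for splitting):

```python
from pypdf import PdfReader, PdfWriter

# Split a large PDF into parts that stay under the 100-page limit.
reader = PdfReader("large_paper.pdf")
chunk_size = 90
num_pages = len(reader.pages)

for chunk_start in range(0, num_pages, chunk_size):
    writer = PdfWriter()
    for i in range(chunk_start, min(chunk_start + chunk_size, num_pages)):
        writer.add_page(reader.pages[i])
    part = chunk_start // chunk_size + 1
    with open(f"large_paper_part{part}.pdf", "wb") as fh:
        writer.write(fh)
```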

### JSON Parsing Errors
- The `json-repair` library handles most common issues
- Check your schema validation
- Review Claude's analysis output for clues
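
If you want to inspect a failing output by hand, the same `json-repair` library can be called directly; a minimal sketch:

```python
from json_repair import repair_json

# Typical model-output problems: trailing comma and an unquoted key.
broken = '{"species": ["Apis mellifera",], notes: "seen on Rosa"}'
fixed = repair_json(broken)
print(fixed)  # a valid JSON string that json.loads() will accept
```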

### API Rate Limits
- Add delays between requests (implemented in the scripts)
- Use batch processing when available
- Check specific API documentation for limits

## Contributing

To add support for additional validation APIs:
1. Add a validator function to `scripts/05_validate_with_apis.py`
2. Register it in the `API_VALIDATORS` dictionary
3. Update `api_config_template.json` with examples (a sketch of the pattern follows)
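
As an illustration of the pattern, a new validator might look like the sketch below. The exact signature expected by `05_validate_with_apis.py` is an assumption here, so mirror one of the existing validators in that file rather than copying this verbatim:

```python
import requests

# Shown only so the sketch runs standalone; in the real script this
# dictionary already exists in scripts/05_validate_with_apis.py.
API_VALIDATORS = {}

def validate_mineral_name(value: str) -> dict:
    """Hypothetical validator: look a value up in a placeholder public API."""
    resp = requests.get(
        "https://api.example.org/minerals",  # placeholder endpoint, not a real service
        params={"q": value},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json()
    return {"input": value, "valid": bool(hits), "match": hits[0] if hits else None}

# Step 2 from the list above: register the function under a name that
# my_api_config.json can reference.
API_VALIDATORS["mineral_name"] = validate_mineral_name
```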

## Citation

If you use this skill in your research, please cite:

```bibtex
@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}
```

## License

MIT License - see the LICENSE file for details