# Complete Workflow Guide
This guide provides step-by-step instructions for the complete PDF extraction pipeline.
## Overview
The pipeline consists of 6 main steps plus optional validation:
1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format
**Optional:** Steps 7-9 for quality validation
## Step 1: Organize Metadata
Standardize PDF organization and metadata from various sources.
### From BibTeX (Zotero, JabRef, etc.)
```bash
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source path/to/library.bib \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
```
### From RIS (Mendeley, EndNote, etc.)
```bash
python scripts/01_organize_metadata.py \
--source-type ris \
--source path/to/library.ris \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
```
### From PDF Directory
```bash
python scripts/01_organize_metadata.py \
--source-type directory \
--source path/to/pdfs \
--output metadata.json
```
### From DOI List
```bash
python scripts/01_organize_metadata.py \
--source-type doi_list \
--source dois.txt \
--output metadata.json
```
**Outputs:**
- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (only if `--organize-pdfs` is used)
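If you want to spot-check the organized metadata before filtering, a minimal sketch like the one below works; it assumes `metadata.json` is a list of records with a `pdf_path`-style field, so adjust the key names to whatever the script actually writes.
```python
# Optional sanity check: confirm each record's PDF exists on disk.
# Assumes a list of records; the key name "pdf_path" is illustrative --
# inspect metadata.json for the actual field names the script produces.
import json
from pathlib import Path

with open("metadata.json") as f:
    records = json.load(f)

missing = [r for r in records if not Path(r.get("pdf_path", "")).exists()]
print(f"{len(records)} records, {len(missing)} with missing PDF files")
```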
## Step 2: Filter Papers (Optional but Recommended)
Filter papers by analyzing abstracts to reduce PDF processing costs.
### Backend Selection
**Option A: Claude Haiku (Fast & Cheap)**
- Cost: ~$0.25 per million input tokens
- Speed: Very fast with batches API
- Accuracy: Good for most filtering tasks
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
```
**Option B: Claude Sonnet (More Accurate)**
- Cost: ~$3 per million input tokens
- Speed: Fast with batches API
- Accuracy: Higher for complex criteria
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-sonnet \
--use-batches \
--output filtered_papers.json
```
**Option C: Local Ollama (FREE & Private)**
- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend ollama \
--ollama-model llama3.1:8b \
--output filtered_papers.json
```
**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.
**Outputs:**
- `filtered_papers.json` - Papers marked as relevant/irrelevant
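To see how aggressive the filter was, a quick count like the sketch below is enough; it assumes the output is a list of records with a boolean relevance flag, so check the actual key name in your file.
```python
# Count papers kept by the filter. The "relevant" key is an assumption --
# open filtered_papers.json to confirm the real field name.
import json

with open("filtered_papers.json") as f:
    papers = json.load(f)

kept = sum(1 for p in papers if p.get("relevant"))
print(f"{kept} of {len(papers)} papers marked relevant")
```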
## Step 3: Extract Data from PDFs
Extract structured data using Claude's PDF vision capabilities.
### Schema Preparation
1. Copy schema template:
```bash
cp assets/schema_template.json my_schema.json
```
2. Customize for your domain:
- Update `objective` with your extraction goal
- Define `output_schema` structure
- Add domain-specific `instructions`
- Provide an `output_example`
See `assets/example_flower_visitors_schema.json` for a real-world example.
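As a rough illustration of those four pieces, the sketch below writes a minimal schema file; the contents are placeholders, and `assets/schema_template.json` remains the authoritative structure.
```python
# Illustrative schema only -- mirror assets/schema_template.json for the
# authoritative layout; all values below are placeholders.
import json

schema = {
    "objective": "Extract study locations and reported sample sizes.",
    "output_schema": {
        "study_location": "string",
        "sample_size": "integer",
    },
    "instructions": [
        "Only report values stated explicitly in the paper.",
        "Use null for fields that are not reported.",
    ],
    "output_example": {"study_location": "Yunnan, China", "sample_size": 120},
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```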
### Run Extraction
```bash
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
```
**Processing methods:**
- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing
**Optional flags:**
- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model
**Outputs:**
- `extracted_data.json` - Raw extraction results with token counts
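Since the results carry token counts, you can total them before moving on; this sketch assumes a list of per-paper records with `input_tokens`/`output_tokens` fields, so match the key names to what the script actually records.
```python
# Tally recorded token usage; the structure and key names are assumptions.
import json

with open("extracted_data.json") as f:
    results = json.load(f)

tokens_in = sum(r.get("input_tokens", 0) for r in results)
tokens_out = sum(r.get("output_tokens", 0) for r in results)
print(f"{tokens_in:,} input tokens, {tokens_out:,} output tokens")
```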
## Step 4: Repair and Validate JSON
Repair malformed JSON and validate against schema.
```bash
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
```
**Optional flags:**
- `--strict` - Reject records that fail validation
**Outputs:**
- `cleaned_data.json` - Repaired and validated extractions
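For a standalone picture of what the repair does, here is a minimal sketch using the `json_repair` library mentioned under Common Issues; the actual script adds schema validation on top.
```python
# Minimal illustration of JSON repair (the pipeline also validates the result
# against your schema).
from json_repair import repair_json  # pip install json-repair

broken = '{"species": "Apis mellifera", "count": 12,}'  # trailing comma
print(repair_json(broken))  # prints a valid JSON string
```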
## Step 5: Validate with External APIs
Enrich data using external scientific databases.
### API Configuration
1. Copy API config template:
```bash
cp assets/api_config_template.json my_api_config.json
```
2. Map fields to validation APIs:
- `gbif_taxonomy` - GBIF for biological taxonomy
- `wfo_plants` - World Flora Online for plant names
- `geonames` - GeoNames for locations (requires account)
- `geocode` - OpenStreetMap for geocoding (free)
- `pubchem` - PubChem for chemical compounds
- `ncbi_gene` - NCBI Gene database
See `assets/example_api_config_ecology.json` for an ecology example.
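To get a feel for what the `gbif_taxonomy` check does, here is a standalone call to GBIF's public species-match endpoint; it is only a sketch of the idea, not the validation script's internal code.
```python
# Standalone example of the kind of lookup the gbif_taxonomy validator runs.
import requests

resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Apis mellifera"},
    timeout=30,
)
match = resp.json()
print(match.get("scientificName"), match.get("matchType"))
```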
### Run Validation
```bash
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
```
**Optional flags:**
- `--skip-validation` - Skip API calls, only structure data
**Outputs:**
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.
## Step 6: Export to Analysis Format
Convert the validated data to a format for your analysis environment.
### Python (pandas)
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--flatten \
--output results
```
**Outputs:**
- `results.pkl` - pandas DataFrame
- `results.py` - Loading script
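Loading the exported DataFrame afterwards is a one-liner (the generated `results.py` loading script presumably does much the same):
```python
# Load the exported DataFrame for analysis.
import pandas as pd

df = pd.read_pickle("results.pkl")
print(df.shape)
print(df.head())
```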
### R
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format r \
--flatten \
--output results
```
**Outputs:**
- `results.rds` - R data frame
- `results.R` - Loading script
### CSV
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format csv \
--flatten \
--output results.csv
```
### Excel
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format excel \
--flatten \
--output results.xlsx
```
### SQLite Database
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format sqlite \
--flatten \
--output results.db
```
**Outputs:**
- `results.db` - SQLite database
- `results.sql` - Example queries
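You can query the exported database directly from Python's standard library; table names depend on your schema, so list them first rather than guessing.
```python
# Inspect the exported SQLite database; list tables before querying, since
# their names depend on your schema.
import sqlite3

con = sqlite3.connect("results.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)
con.close()
```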
**Flags:**
- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output
## Cost Estimation
### Example: 100 papers, 10 pages each
**With Filtering (Recommended):**
1. Filter (Haiku): 100 abstracts × ~1,000 tokens (abstract + prompt) × $0.25/M = **$0.03**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.78**
**Without Filtering:**
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**
**With Local Ollama:**
1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**
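The same arithmetic as a small helper, using the per-million-token rates quoted above (update the rates if pricing changes):
```python
# Reproduce the cost arithmetic above; rates are USD per million input tokens
# as quoted in this guide and may change.
def estimate_cost(n_docs, tokens_per_doc, rate_per_million):
    return n_docs * tokens_per_doc * rate_per_million / 1_000_000

filtering = estimate_cost(100, 1_000, 0.25)       # abstract + prompt, Haiku
extraction = estimate_cost(50, 10 * 2_500, 3.00)  # 10 pages per paper, Sonnet
print(f"filter ~${filtering:.2f}, extract ~${extraction:.2f}, "
      f"total ~${filtering + extraction:.2f}")
```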
### Token Usage by Step
- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens
**Tips to reduce costs:**
- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`
## Common Issues
### PDF Not Found
Check that the PDF paths in `metadata.json` match the actual file locations.
### JSON Parsing Errors
Run Step 4 (repair JSON); the `json_repair` library handles most issues.
### API Rate Limits
The scripts include delays between requests, but check each API's documentation for its specific limits.
### Ollama Connection Error
Ensure the Ollama server is running: `ollama serve`
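A quick connectivity check from Python, assuming Ollama's default local port (11434):
```python
# Check whether a local Ollama server is reachable on the default port.
import requests

try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is up; local models:", models)
except requests.ConnectionError:
    print("Cannot reach Ollama -- start it with `ollama serve`")
```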
## Next Steps
For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.
See: `references/validation_guide.md`