# Complete Workflow Guide

This guide provides step-by-step instructions for the complete PDF extraction pipeline.

## Overview

The pipeline consists of 6 main steps plus optional validation:

1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format

**Optional:** Steps 7-9 for quality validation

## Step 1: Organize Metadata

Standardize PDF organization and metadata from various sources.

### From BibTeX (Zotero, JabRef, etc.)

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### From RIS (Mendeley, EndNote, etc.)

```bash
python scripts/01_organize_metadata.py \
    --source-type ris \
    --source path/to/library.ris \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### From PDF Directory

```bash
python scripts/01_organize_metadata.py \
    --source-type directory \
    --source path/to/pdfs \
    --output metadata.json
```

### From DOI List

```bash
python scripts/01_organize_metadata.py \
    --source-type doi_list \
    --source dois.txt \
    --output metadata.json
```

**Outputs:**
- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (if `--organize-pdfs` used)
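
As a quick sanity check before filtering, you can load the file and count entries. The snippet below assumes `metadata.json` is a JSON array whose items carry a title and a PDF path under field names like the ones shown; adjust the keys to whatever the script actually writes.

```python
import json

# Assumed structure: a list of entries, each with (roughly) title/doi/pdf_path.
with open("metadata.json") as f:
    entries = json.load(f)

print(f"{len(entries)} papers in metadata.json")
missing_pdfs = [e for e in entries if not e.get("pdf_path")]
print(f"{len(missing_pdfs)} entries without a PDF path")
```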

## Step 2: Filter Papers (Optional but Recommended)

Filter papers by analyzing abstracts to reduce PDF processing costs.

### Backend Selection

**Option A: Claude Haiku (Fast & Cheap)**
- Cost: ~$0.25 per million input tokens
- Speed: Very fast with the batches API
- Accuracy: Good for most filtering tasks

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Claude Sonnet (More Accurate)**
- Cost: ~$3 per million input tokens
- Speed: Fast with the batches API
- Accuracy: Higher for complex criteria

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-sonnet \
    --use-batches \
    --output filtered_papers.json
```

**Option C: Local Ollama (FREE & Private)**
- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.
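
For instance, the criteria might look something like the following. This is purely illustrative; the prompt text and the `FILTER_PROMPT` name are hypothetical, so edit the real prompt inside the script rather than copying this verbatim.

```python
# Hypothetical filtering criteria for an ecology review; replace with your own
# inclusion/exclusion rules in scripts/02_filter_abstracts.py.
FILTER_PROMPT = """You are screening papers for a review of plant-pollinator interactions.

Mark the paper RELEVANT only if the abstract reports:
- field observations of flower visitors, AND
- identification of the visitors at least to family level.

Otherwise mark it IRRELEVANT (e.g. reviews, modelling-only or methods-only papers).

Answer with a single word: RELEVANT or IRRELEVANT."""
```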

**Outputs:**
- `filtered_papers.json` - Papers marked as relevant/irrelevant

## Step 3: Extract Data from PDFs

Extract structured data using Claude's PDF vision capabilities.

### Schema Preparation

1. Copy schema template:
```bash
cp assets/schema_template.json my_schema.json
```

2. Customize for your domain:
- Update `objective` with your extraction goal
- Define `output_schema` structure
- Add domain-specific `instructions`
- Provide an `output_example`

See `assets/example_flower_visitors_schema.json` for a real-world example.
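
For orientation, a minimal schema built around those four fields might look like the sketch below. This is an assumed illustration, not the contents of the actual template; the placeholder field names and values should be replaced with your own.

```python
import json

# Assumed minimal shape of a schema file. The four top-level keys match the
# checklist above; everything inside them is a made-up placeholder.
schema = {
    "objective": "Extract flower-visitor interaction records reported in each paper.",
    "output_schema": {
        "records": [
            {
                "plant_species": "string",
                "visitor_species": "string",
                "location": "string",
                "year": "integer or null",
            }
        ]
    },
    "instructions": [
        "Only report interactions observed in the study itself, not cited ones.",
        "Use null for fields the paper does not report.",
    ],
    "output_example": {
        "records": [
            {
                "plant_species": "Salvia pratensis",
                "visitor_species": "Bombus terrestris",
                "location": "Jena, Germany",
                "year": 2019,
            }
        ]
    },
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```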

### Run Extraction

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```

**Processing methods:**
- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing

**Optional flags:**
- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model

**Outputs:**
- `extracted_data.json` - Raw extraction results with token counts
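
To check the run before moving on, you can tally those token counts. The snippet assumes `extracted_data.json` is a JSON array with per-paper usage fields named roughly as shown; adjust the keys to what the script actually writes.

```python
import json

# Assumed layout: a list of per-paper results with input/output token counts.
with open("extracted_data.json") as f:
    results = json.load(f)

total = sum(r.get("input_tokens", 0) + r.get("output_tokens", 0) for r in results)
print(f"{len(results)} papers extracted, ~{total:,} tokens used")
```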

## Step 4: Repair and Validate JSON

Repair malformed JSON and validate it against your schema.

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```

**Optional flags:**
- `--strict` - Reject records that fail validation

**Outputs:**
- `cleaned_data.json` - Repaired and validated extractions
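
This step relies on the `json_repair` library (see Common Issues below). If you ever need to rescue a single malformed model response by hand, a minimal standalone sketch looks like this:

```python
from json_repair import repair_json  # pip install json-repair

# A typical malformed model output: trailing comma and unclosed brackets.
broken = '{"records": [{"plant_species": "Salvia pratensis", "year": 2019,'

fixed = repair_json(broken)  # best-effort repair, returned as a JSON string
print(fixed)
```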

## Step 5: Validate with External APIs

Enrich the data using external scientific databases.

### API Configuration

1. Copy the API config template:
```bash
cp assets/api_config_template.json my_api_config.json
```

2. Map fields to validation APIs:
- `gbif_taxonomy` - GBIF for biological taxonomy
- `wfo_plants` - World Flora Online for plant names
- `geonames` - GeoNames for locations (requires an account)
- `geocode` - OpenStreetMap for geocoding (free)
- `pubchem` - PubChem for chemical compounds
- `ncbi_gene` - NCBI Gene database

See `assets/example_api_config_ecology.json` for an ecology example.
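
To get a feel for what this enrichment returns, here is a standalone call to GBIF's public species-match endpoint, the kind of lookup the `gbif_taxonomy` mapping refers to. This is illustrative only, not the project's own validation code.

```python
import requests

# GBIF's name-matching endpoint resolves a raw name string to accepted taxonomy.
resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Bombus terrestris"},
    timeout=30,
)
resp.raise_for_status()
match = resp.json()
print(match.get("scientificName"), match.get("family"), match.get("matchType"))
```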

### Run Validation

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```

**Optional flags:**
- `--skip-validation` - Skip API calls and only structure the data

**Outputs:**
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.

## Step 6: Export to Analysis Format

Convert the validated data to a format suited to your analysis environment.

### Python (pandas)

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results
```

**Outputs:**
- `results.pkl` - pandas DataFrame
- `results.py` - Loading script
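
The generated `results.py` is the intended loader, but the pickle can also be read directly (use a pandas version compatible with the one that wrote it):

```python
import pandas as pd

# Load the exported DataFrame; if the pickle is incompatible with your
# pandas install, fall back to the CSV export instead.
df = pd.read_pickle("results.pkl")
print(df.shape)
print(df.head())
```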

### R

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results
```

**Outputs:**
- `results.rds` - R data frame
- `results.R` - Loading script

### CSV

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```

### Excel

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format excel \
    --flatten \
    --output results.xlsx
```

### SQLite Database

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format sqlite \
    --flatten \
    --output results.db
```

**Outputs:**
- `results.db` - SQLite database
- `results.sql` - Example queries
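
The bundled `results.sql` contains example queries; from Python you can also open the database with the standard library. The table name written by the export script is not documented here, so list the tables first and adjust the assumed name in the query.

```python
import sqlite3

conn = sqlite3.connect("results.db")

# Discover the table name(s) the export script actually created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
print(tables)

# Example query against an assumed table name; replace it with one from the list.
for row in conn.execute("SELECT * FROM results LIMIT 5"):
    print(row)

conn.close()
```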

**Flags:**
- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output

## Cost Estimation

### Example: 100 papers, 10 pages each

**With Filtering (Recommended):**
1. Filter (Haiku): 100 abstracts × 500 tokens × $0.25/M = **$0.01**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.76**

**Without Filtering:**
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**

**With Local Ollama:**
1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**
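
These are back-of-the-envelope figures based on input tokens only; the helper below redoes the arithmetic so you can plug in your own counts and current prices (the per-million rates are the ones assumed above and will change over time).

```python
# Quick cost estimator mirroring the arithmetic above (input tokens only).
HAIKU_PER_M_TOKENS = 0.25   # assumed $/million input tokens for filtering
SONNET_PER_M_TOKENS = 3.00  # assumed $/million input tokens for extraction

def estimate(n_abstracts, n_relevant, pages_per_paper,
             abstract_tokens=500, page_tokens=2500):
    filter_cost = n_abstracts * abstract_tokens * HAIKU_PER_M_TOKENS / 1e6
    extract_cost = n_relevant * pages_per_paper * page_tokens * SONNET_PER_M_TOKENS / 1e6
    return filter_cost, extract_cost

f_cost, e_cost = estimate(n_abstracts=100, n_relevant=50, pages_per_paper=10)
print(f"Filter: ${f_cost:.2f}  Extract: ${e_cost:.2f}  Total: ${f_cost + e_cost:.2f}")
```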

### Token Usage by Step
- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens

**Tips to reduce costs:**
- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Common Issues

### PDF Not Found
Check that the PDF paths in `metadata.json` match the actual file locations.

### JSON Parsing Errors
Run Step 4 (repair JSON); the `json_repair` library handles most issues.

### API Rate Limits
The scripts include delays, but check each API's documentation for its specific limits.

### Ollama Connection Error
Ensure the Ollama server is running: `ollama serve`

## Next Steps

For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.

See: `references/validation_guide.md`