# Complete Workflow Guide
This guide provides step-by-step instructions for the complete PDF extraction pipeline.
## Overview
The pipeline consists of 6 main steps plus optional validation:
1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format
**Optional:** Steps 7-9 for quality validation
## Step 1: Organize Metadata
Standardize PDF organization and metadata from various sources.
### From BibTeX (Zotero, JabRef, etc.)
```bash
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source path/to/library.bib \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
```
### From RIS (Mendeley, EndNote, etc.)
```bash
python scripts/01_organize_metadata.py \
--source-type ris \
--source path/to/library.ris \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
```
### From PDF Directory
```bash
python scripts/01_organize_metadata.py \
--source-type directory \
--source path/to/pdfs \
--output metadata.json
```
### From DOI List
```bash
python scripts/01_organize_metadata.py \
--source-type doi_list \
--source dois.txt \
--output metadata.json
```
**Outputs:**
- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (only if `--organize-pdfs` is used)
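If you want to spot-check the organized metadata before filtering, a minimal sketch like the one below works; it assumes `metadata.json` is a list of records with a `pdf_path`-style field, so adjust the key names to whatever the script actually writes.
```python
# Optional sanity check: confirm each record's PDF exists on disk.
# Assumes a list of records; the key name "pdf_path" is illustrative --
# inspect metadata.json for the actual field names the script produces.
import json
from pathlib import Path

with open("metadata.json") as f:
    records = json.load(f)

missing = [r for r in records if not Path(r.get("pdf_path", "")).exists()]
print(f"{len(records)} records, {len(missing)} with missing PDF files")
```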
## Step 2: Filter Papers (Optional but Recommended)
Filter papers by analyzing abstracts to reduce PDF processing costs.
### Backend Selection
**Option A: Claude Haiku (Fast & Cheap)**
- Cost: ~$0.25 per million input tokens
- Speed: Very fast with batches API
- Accuracy: Good for most filtering tasks
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
```
**Option B: Claude Sonnet (More Accurate)**
- Cost: ~$3 per million input tokens
- Speed: Fast with batches API
- Accuracy: Higher for complex criteria
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-sonnet \
--use-batches \
--output filtered_papers.json
```
**Option C: Local Ollama (FREE & Private)**
- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better
```bash
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend ollama \
--ollama-model llama3.1:8b \
--output filtered_papers.json
```
**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.
**Outputs:**
- `filtered_papers.json` - Papers marked as relevant/irrelevant
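To see how aggressive the filter was, a quick count like the sketch below is enough; it assumes the output is a list of records with a boolean relevance flag, so check the actual key name in your file.
```python
# Count papers kept by the filter. The "relevant" key is an assumption --
# open filtered_papers.json to confirm the real field name.
import json

with open("filtered_papers.json") as f:
    papers = json.load(f)

kept = sum(1 for p in papers if p.get("relevant"))
print(f"{kept} of {len(papers)} papers marked relevant")
```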
## Step 3: Extract Data from PDFs
Extract structured data using Claude's PDF vision capabilities.
### Schema Preparation
1. Copy schema template:
```bash
cp assets/schema_template.json my_schema.json
```
2. Customize for your domain:
- Update `objective` with your extraction goal
- Define `output_schema` structure
- Add domain-specific `instructions`
- Provide an `output_example`
See `assets/example_flower_visitors_schema.json` for a real-world example.
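As a rough illustration of those four pieces, the sketch below writes a minimal schema file; the contents are placeholders, and `assets/schema_template.json` remains the authoritative structure.
```python
# Illustrative schema only -- mirror assets/schema_template.json for the
# authoritative layout; all values below are placeholders.
import json

schema = {
    "objective": "Extract study locations and reported sample sizes.",
    "output_schema": {
        "study_location": "string",
        "sample_size": "integer",
    },
    "instructions": [
        "Only report values stated explicitly in the paper.",
        "Use null for fields that are not reported.",
    ],
    "output_example": {"study_location": "Yunnan, China", "sample_size": 120},
}

with open("my_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```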
### Run Extraction
```bash
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
```
**Processing methods:**
- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing
**Optional flags:**
- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model
**Outputs:**
- `extracted_data.json` - Raw extraction results with token counts
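Since the results carry token counts, you can total them before moving on; this sketch assumes a list of per-paper records with `input_tokens`/`output_tokens` fields, so match the key names to what the script actually records.
```python
# Tally recorded token usage; the structure and key names are assumptions.
import json

with open("extracted_data.json") as f:
    results = json.load(f)

tokens_in = sum(r.get("input_tokens", 0) for r in results)
tokens_out = sum(r.get("output_tokens", 0) for r in results)
print(f"{tokens_in:,} input tokens, {tokens_out:,} output tokens")
```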
## Step 4: Repair and Validate JSON
Repair malformed JSON and validate against schema.
```bash
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
```
**Optional flags:**
- `--strict` - Reject records that fail validation
**Outputs:**
- `cleaned_data.json` - Repaired and validated extractions
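For a standalone picture of what the repair does, here is a minimal sketch using the `json_repair` library mentioned under Common Issues; the actual script adds schema validation on top.
```python
# Minimal illustration of JSON repair (the pipeline also validates the result
# against your schema).
from json_repair import repair_json  # pip install json-repair

broken = '{"species": "Apis mellifera", "count": 12,}'  # trailing comma
print(repair_json(broken))  # prints a valid JSON string
```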
## Step 5: Validate with External APIs
Enrich data using external scientific databases.
### API Configuration
1. Copy API config template:
```bash
cp assets/api_config_template.json my_api_config.json
```
2. Map fields to validation APIs:
- `gbif_taxonomy` - GBIF for biological taxonomy
- `wfo_plants` - World Flora Online for plant names
- `geonames` - GeoNames for locations (requires account)
- `geocode` - OpenStreetMap for geocoding (free)
- `pubchem` - PubChem for chemical compounds
- `ncbi_gene` - NCBI Gene database
See `assets/example_api_config_ecology.json` for an ecology example.
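To get a feel for what the `gbif_taxonomy` check does, here is a standalone call to GBIF's public species-match endpoint; it is only a sketch of the idea, not the validation script's internal code.
```python
# Standalone example of the kind of lookup the gbif_taxonomy validator runs.
import requests

resp = requests.get(
    "https://api.gbif.org/v1/species/match",
    params={"name": "Apis mellifera"},
    timeout=30,
)
match = resp.json()
print(match.get("scientificName"), match.get("matchType"))
```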
### Run Validation
```bash
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
```
**Optional flags:**
- `--skip-validation` - Skip API calls, only structure data
**Outputs:**
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.
## Step 6: Export to Analysis Format
Convert the validated data to a format for your analysis environment.
### Python (pandas)
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--flatten \
--output results
```
**Outputs:**
- `results.pkl` - pandas DataFrame
- `results.py` - Loading script
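Loading the exported DataFrame afterwards is a one-liner (the generated `results.py` loading script presumably does much the same):
```python
# Load the exported DataFrame for analysis.
import pandas as pd

df = pd.read_pickle("results.pkl")
print(df.shape)
print(df.head())
```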
### R
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format r \
--flatten \
--output results
```
**Outputs:**
- `results.rds` - R data frame
- `results.R` - Loading script
### CSV
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format csv \
--flatten \
--output results.csv
```
### Excel
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format excel \
--flatten \
--output results.xlsx
```
### SQLite Database
```bash
python scripts/06_export_database.py \
--input validated_data.json \
--format sqlite \
--flatten \
--output results.db
```
**Outputs:**
- `results.db` - SQLite database
- `results.sql` - Example queries
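You can query the exported database directly from Python's standard library; table names depend on your schema, so list them first rather than guessing.
```python
# Inspect the exported SQLite database; list tables before querying, since
# their names depend on your schema.
import sqlite3

con = sqlite3.connect("results.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)
con.close()
```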
**Flags:**
- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output
## Cost Estimation
### Example: 100 papers, 10 pages each
**With Filtering (Recommended):**
1. Filter (Haiku): 100 abstracts × ~1,000 tokens (abstract + prompt) × $0.25/M = **$0.03**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.78**
**Without Filtering:**
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**
**With Local Ollama:**
1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**
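The same arithmetic as a small helper, using the per-million-token rates quoted above (update the rates if pricing changes):
```python
# Reproduce the cost arithmetic above; rates are USD per million input tokens
# as quoted in this guide and may change.
def estimate_cost(n_docs, tokens_per_doc, rate_per_million):
    return n_docs * tokens_per_doc * rate_per_million / 1_000_000

filtering = estimate_cost(100, 1_000, 0.25)       # abstract + prompt, Haiku
extraction = estimate_cost(50, 10 * 2_500, 3.00)  # 10 pages per paper, Sonnet
print(f"filter ~${filtering:.2f}, extract ~${extraction:.2f}, "
      f"total ~${filtering + extraction:.2f}")
```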
### Token Usage by Step
- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens
**Tips to reduce costs:**
- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`
## Common Issues
### PDF Not Found
Check that the PDF paths in `metadata.json` match the actual file locations.
### JSON Parsing Errors
Run Step 4 (repair JSON); the `json_repair` library handles most issues.
### API Rate Limits
The scripts include delays between requests, but check each API's documentation for its specific limits.
### Ollama Connection Error
Ensure the Ollama server is running: `ollama serve`
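A quick connectivity check from Python, assuming Ollama's default local port (11434):
```python
# Check whether a local Ollama server is reachable on the default port.
import requests

try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is up; local models:", models)
except requests.ConnectionError:
    print("Cannot reach Ollama -- start it with `ollama serve`")
```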
## Next Steps
For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.
See: `references/validation_guide.md`