# Complete Workflow Guide

This guide provides step-by-step instructions for the complete PDF extraction pipeline.

## Overview

The pipeline consists of 6 main steps plus optional validation:

1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format

**Optional:** Steps 7-9 for quality validation

## Step 1: Organize Metadata

Standardize PDF organization and metadata from various sources.

### From BibTeX (Zotero, JabRef, etc.)

```bash
python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source path/to/library.bib \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json
```

### From RIS (Mendeley, EndNote, etc.)

```bash
python scripts/01_organize_metadata.py \
  --source-type ris \
  --source path/to/library.ris \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json
```

### From PDF Directory

```bash
python scripts/01_organize_metadata.py \
  --source-type directory \
  --source path/to/pdfs \
  --output metadata.json
```

### From DOI List

```bash
python scripts/01_organize_metadata.py \
  --source-type doi_list \
  --source dois.txt \
  --output metadata.json
```

**Outputs:**

- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (if `--organize-pdfs` used)

## Step 2: Filter Papers (Optional but Recommended)

Filter papers by analyzing abstracts to reduce PDF processing costs.

### Backend Selection

**Option A: Claude Haiku (Fast & Cheap)**

- Cost: ~$0.25 per million input tokens
- Speed: Very fast with batches API
- Accuracy: Good for most filtering tasks

```bash
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json
```

**Option B: Claude Sonnet (More Accurate)**

- Cost: ~$3 per million input tokens
- Speed: Fast with batches API
- Accuracy: Higher for complex criteria

```bash
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-sonnet \
  --use-batches \
  --output filtered_papers.json
```

**Option C: Local Ollama (FREE & Private)**

- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better

```bash
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend ollama \
  --ollama-model llama3.1:8b \
  --output filtered_papers.json
```

**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.

**Outputs:**

- `filtered_papers.json` - Papers marked as relevant/irrelevant

## Step 3: Extract Data from PDFs

Extract structured data using Claude's PDF vision capabilities.

### Schema Preparation

1. Copy the schema template:

   ```bash
   cp assets/schema_template.json my_schema.json
   ```

2. Customize it for your domain:
   - Update `objective` with your extraction goal
   - Define the `output_schema` structure
   - Add domain-specific `instructions`
   - Provide an `output_example`

See `assets/example_flower_visitors_schema.json` for a real-world example.
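To make the customization step concrete, here is a minimal sketch of what a customized `my_schema.json` could contain. The top-level field names follow the list above (`objective`, `output_schema`, `instructions`, `output_example`); the record fields and values are invented for illustration and are not taken from the template or the example file.

```python
import json

# Illustrative only: top-level keys follow the customization checklist above;
# the nested structure and example values are invented and may differ from
# assets/schema_template.json.
my_schema = {
    "objective": "Extract plant-visitor interaction records from each paper.",
    "output_schema": {
        "records": [
            {
                "plant_species": "string",
                "visitor_species": "string",
                "location": "string",
                "year": "integer or null",
            }
        ]
    },
    "instructions": [
        "Only report interactions observed in the study itself, not cited work.",
        "Use null for fields the paper does not report.",
    ],
    "output_example": {
        "records": [
            {
                "plant_species": "Salvia pratensis",
                "visitor_species": "Bombus terrestris",
                "location": "Bavaria, Germany",
                "year": 2019,
            }
        ]
    },
}

# Write the schema to the file passed to --schema in the extraction step.
with open("my_schema.json", "w", encoding="utf-8") as fh:
    json.dump(my_schema, fh, indent=2, ensure_ascii=False)
```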
### Run Extraction

```bash
python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json
```

**Processing methods:**

- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing

**Optional flags:**

- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model

**Outputs:**

- `extracted_data.json` - Raw extraction results with token counts

## Step 4: Repair and Validate JSON

Repair malformed JSON and validate it against the schema.

```bash
python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json
```

**Optional flags:**

- `--strict` - Reject records that fail validation

**Outputs:**

- `cleaned_data.json` - Repaired and validated extractions

## Step 5: Validate with External APIs

Enrich data using external scientific databases.

### API Configuration

1. Copy the API config template:

   ```bash
   cp assets/api_config_template.json my_api_config.json
   ```

2. Map fields to validation APIs:
   - `gbif_taxonomy` - GBIF for biological taxonomy
   - `wfo_plants` - World Flora Online for plant names
   - `geonames` - GeoNames for locations (requires account)
   - `geocode` - OpenStreetMap for geocoding (free)
   - `pubchem` - PubChem for chemical compounds
   - `ncbi_gene` - NCBI Gene database

See `assets/example_api_config_ecology.json` for an ecology example.

### Run Validation

```bash
python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json
```

**Optional flags:**

- `--skip-validation` - Skip API calls, only structure data

**Outputs:**

- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.

## Step 6: Export to Analysis Format

Convert the validated data to a format for your analysis environment. A short loading sketch for the Python export follows this section.

### Python (pandas)

```bash
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --flatten \
  --output results
```

Outputs:

- `results.pkl` - pandas DataFrame
- `results.py` - Loading script

### R

```bash
python scripts/06_export_database.py \
  --input validated_data.json \
  --format r \
  --flatten \
  --output results
```

Outputs:

- `results.rds` - R data frame
- `results.R` - Loading script

### CSV

```bash
python scripts/06_export_database.py \
  --input validated_data.json \
  --format csv \
  --flatten \
  --output results.csv
```

### Excel

```bash
python scripts/06_export_database.py \
  --input validated_data.json \
  --format excel \
  --flatten \
  --output results.xlsx
```

### SQLite Database

```bash
python scripts/06_export_database.py \
  --input validated_data.json \
  --format sqlite \
  --flatten \
  --output results.db
```

Outputs:

- `results.db` - SQLite database
- `results.sql` - Example queries

**Flags:**

- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output
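If you use the Python export, the result can be loaded directly with pandas. A minimal sketch, assuming `results.pkl` was produced by the command above with `--flatten` and holds a flattened DataFrame (the generated `results.py` script remains the authoritative loader):

```python
import pandas as pd

# Load the exported DataFrame written by scripts/06_export_database.py
# with --format python --flatten.
df = pd.read_pickle("results.pkl")

# Quick sanity checks on the flattened table.
print(df.shape)              # number of extracted rows and columns
print(df.columns.tolist())   # flattened field names
print(df.head())
```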
## Cost Estimation

### Example: 100 papers, 10 pages each

**With Filtering (Recommended):**

1. Filter (Haiku): 100 abstracts × 500 tokens × $0.25/M = **$0.01**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.76**

**Without Filtering:**

1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**

**With Local Ollama:**

1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**

### Token Usage by Step

- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens

**Tips to reduce costs:**

- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Common Issues

### PDF Not Found

Check that the PDF paths in `metadata.json` match the actual file locations (a quick check script is sketched at the end of this guide).

### JSON Parsing Errors

Run Step 4 (repair JSON) - the json_repair library handles most issues.

### API Rate Limits

Scripts include delays, but check the specific API documentation for limits.

### Ollama Connection Error

Ensure the Ollama server is running: `ollama serve`

## Next Steps

For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.

See: `references/validation_guide.md`
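For the "PDF Not Found" issue above, one quick way to spot broken paths is to scan `metadata.json` for strings that look like PDF paths and test whether they exist on disk. This is a minimal sketch, not part of the pipeline scripts; it only assumes the metadata is plain JSON with file paths stored as strings, whatever the field names are.

```python
import json
from pathlib import Path

def iter_strings(node):
    """Yield every string value found anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    elif isinstance(node, str):
        yield node

with open("metadata.json", encoding="utf-8") as fh:
    metadata = json.load(fh)

# Collect string values that look like PDF paths but do not exist on disk.
missing = [s for s in iter_strings(metadata)
           if s.lower().endswith(".pdf") and not Path(s).exists()]

print(f"{len(missing)} referenced PDF(s) not found")
for path in missing:
    print(" -", path)
```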