# CLI Operations Reference
Complete guide for operating CocoIndex flows using the CLI.
## Overview
The CocoIndex CLI (`cocoindex` command) provides tools for managing and inspecting flows. Most commands require an `APP_TARGET` argument specifying where flow definitions are located.
## Environment Setup
### Environment Variables
Create a `.env` file in the project directory:
```bash
# Database connection (required)
COCOINDEX_DATABASE_URL=postgresql://user:password@localhost/cocoindex_db
# Optional: App namespace for organizing flows
COCOINDEX_APP_NAMESPACE=dev
# Optional: Global concurrency limits
COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS=50
COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES=524288000 # 500MB
# Optional: LLM API keys (if using LLM functions)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
VOYAGE_API_KEY=pa-...
```
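Before running any command, it can help to confirm that the `.env` file actually defines the required variable. The following is a minimal sketch, not part of CocoIndex: `parse_env_text`, `missing_required`, and `check_env_file` are hypothetical helper names, and the parser is deliberately simplified (it handles plain `KEY=VALUE` lines and trailing ` # comment` text only, so it may differ from what the CLI's own loader accepts).

```python
from pathlib import Path

# The only variable marked "required" in the example above.
REQUIRED_VARS = ("COCOINDEX_DATABASE_URL",)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and full-line comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "524288000 # 500MB".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value.strip("'\"")
    return values


def missing_required(env: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_env_file(path: str = ".env") -> list[str]:
    """Read an env file from disk and report missing required variables."""
    return missing_required(parse_env_text(Path(path).read_text(encoding="utf-8")))
```

Calling `check_env_file()` from the project directory returns an empty list when the database URL is present, and a list of missing names otherwise.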
### Loading Environment Files
```bash
# Default: loads .env from current directory
cocoindex <command> ...
# Specify custom env file
cocoindex --env-file path/to/.env <command> ...
# Specify app directory
cocoindex --app-dir /path/to/project <command> ...
```
## APP_TARGET Format
The `APP_TARGET` tells the CLI where flow definitions are located:
### Python Module
```bash
# Load from module name
cocoindex update main
# Load from package module
cocoindex update my_package.flows
```
### Python File
```bash
# Load from file path
cocoindex update main.py
# Load from nested file
cocoindex update path/to/flows.py
```
### Specific Flow
```bash
# Target specific flow in module
cocoindex update main:MyFlowName
# Target specific flow in file
cocoindex update path/to/flows.py:MyFlowName
```
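The three forms above share one shape: a module name or file path, optionally followed by `:` and a flow name. The sketch below only illustrates that shape; `AppTarget` and `parse_app_target` are hypothetical names, the `.py` suffix check is an assumed heuristic, and the CLI's real parser may behave differently (for example, this version would misread a Windows path that contains a drive-letter colon and no flow name).

```python
from typing import NamedTuple, Optional


class AppTarget(NamedTuple):
    app: str                   # module name or file path
    flow_name: Optional[str]   # None means "all flows in the app"
    is_file: bool              # heuristic: True when the app part ends in .py


def parse_app_target(target: str) -> AppTarget:
    """Split 'app[:FlowName]' into its parts (illustrative only)."""
    app, sep, flow_name = target.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the module or file.
        app, flow_name = target, ""
    return AppTarget(app, flow_name or None, app.endswith(".py"))
```

For instance, `parse_app_target("path/to/flows.py:MyFlowName")` yields the file path and the flow name separately, while `parse_app_target("main")` yields a module name with no flow selected.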
## Core Commands
### setup - Initialize Flow Resources
Create all persistent backends needed by flows (database tables, collections, etc.).
```bash
# Setup all flows
cocoindex setup main.py
# Setup specific flow
cocoindex setup main.py:MyFlow
```
**What it does:**
- Creates internal storage tables in Postgres
- Creates target resources (database tables, vector collections, graph structures)
- Updates schemas if flow definition changed
- No-op if already set up and no changes needed
**When to use:**
- First time running a flow
- After modifying flow structure (new fields, new targets)
- After dropping flows to recreate resources
### update - Build/Update Target Data
Run transformations and update target data based on current source data.
```bash
# One-time update
cocoindex update main.py
# One-time update with setup
cocoindex update --setup main.py
# One-time update specific flow
cocoindex update main.py:TextEmbedding
# Force reexport even if no changes
cocoindex update --reexport main.py
```
**What it does:**
- Reads source data
- Applies transformations
- Updates target databases
- Uses incremental processing (only processes changed data)
**Options:**
- `--setup` - Run setup first if needed
- `--reexport` - Reexport all data even if unchanged (useful after data loss)
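
To make "incremental processing" concrete, here is a toy model of the idea: remember a fingerprint of each source row, and re-run the transformation only for rows whose fingerprint changed. This is an illustration under simplifying assumptions, not CocoIndex's actual implementation; `fingerprint` and `incremental_update` are hypothetical names, and plain dictionaries stand in for the source, the internal state, and the target.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Stable content hash used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def incremental_update(sources, seen, target, transform):
    """Reprocess only new or changed rows; drop rows that left the source.

    sources: {key: content} as currently read from the source
    seen:    {key: fingerprint} remembered from the previous run (mutated)
    target:  {key: transformed content} (mutated)
    Returns the keys that were actually reprocessed.
    """
    processed = []
    for key, content in sources.items():
        digest = fingerprint(content)
        if seen.get(key) == digest:
            continue  # unchanged since last run: skip the transformation
        target[key] = transform(content)
        seen[key] = digest
        processed.append(key)
    # Rows that disappeared from the source are removed from the target too.
    for key in list(seen):
        if key not in sources:
            seen.pop(key)
            target.pop(key, None)
    return processed
```

In this model, a second run over unchanged input reprocesses nothing, which is the behavior `--reexport` overrides on the export side.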
### update -L - Live Update Mode
Continuously monitor source changes and update targets.
```bash
# Live update mode
cocoindex update main.py -L
# Live update with setup
cocoindex update --setup main.py -L
# Live update with reexport on initial update
cocoindex update --reexport main.py -L
```
**What it does:**
- Performs initial one-time update
- Continuously monitors source changes
- Automatically processes updates
- Runs until aborted (Ctrl-C)
**Requires:**
- At least one source with change capture enabled:
  - `refresh_interval` parameter on source
  - Source-specific change capture (Postgres notifications, S3 events, etc.)
**Example with refresh interval:**
```python
import datetime

import cocoindex

# Inside the flow definition function:
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="documents"),
    refresh_interval=datetime.timedelta(minutes=1),  # Check every minute
)
```
### drop - Remove Flow Resources
Remove all persistent backends owned by flows.
```bash
# Drop all flows
cocoindex drop main.py
# Drop specific flow
cocoindex drop main.py:MyFlow
```
**What it does:**
- Drops internal storage tables
- Drops target resources (tables, collections, graphs)
- Cleans up all persistent data
**Warning:** This is destructive and cannot be undone!
### show - Inspect Flow Definition
Display flow structure and statistics.
```bash
# Show flow structure
cocoindex show main.py:MyFlow
# Show all flows
cocoindex show main.py
```
**What it shows:**
- Flow name and structure
- Sources configured
- Transformations defined
- Targets and their schemas
- Current statistics (if flow is set up)
### evaluate - Test Flow Without Updating
Run transformations and dump results to files without updating targets.
```bash
# Evaluate flow
cocoindex evaluate main.py:MyFlow
# Specify output directory
cocoindex evaluate main.py:MyFlow --output-dir ./eval_results
# Disable cache
cocoindex evaluate main.py:MyFlow --no-cache
```
**What it does:**
- Runs transformations
- Saves results to files (JSON, CSV, etc.)
- Does NOT update targets
- Uses existing cache by default
**When to use:**
- Testing flow logic before running full update
- Debugging transformation issues
- Inspecting intermediate data
- Validating output format
**Options:**
- `--output-dir PATH` - Directory for output files (default: `eval_{flow_name}_{timestamp}`)
- `--no-cache` - Disable reading from cache (still doesn't write to cache)
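
Because the default output directory name includes a timestamp, repeated runs leave several `eval_{flow_name}_*` directories behind. The sketch below picks the most recently modified one and lists its files. It assumes nothing about the timestamp format (it sorts by modification time instead), and `latest_eval_dir` and `list_outputs` are hypothetical helper names. Note that the glob would also match a different flow whose name starts with `{flow_name}_`.

```python
from pathlib import Path
from typing import Optional


def latest_eval_dir(flow_name: str, base: str = ".") -> Optional[Path]:
    """Return the newest eval_{flow_name}_* directory under base, if any."""
    candidates = [p for p in Path(base).glob(f"eval_{flow_name}_*") if p.is_dir()]
    # Sort by modification time rather than parsing the timestamp suffix.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def list_outputs(directory: Path) -> list[str]:
    """List the file names that evaluate wrote into a directory."""
    return sorted(p.name for p in directory.iterdir() if p.is_file())
```

A typical use is `list_outputs(latest_eval_dir("MyFlow"))` right after an `evaluate` run, after checking that the result is not `None`.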
## Complete Workflow Examples
### First-Time Setup and Indexing
```bash
# 1. Setup flow resources
cocoindex setup main.py
# 2. Run initial indexing
cocoindex update main.py
# 3. Verify results
cocoindex show main.py
```
### Development Workflow
```bash
# 1. Test with evaluate (no side effects)
cocoindex evaluate main.py:MyFlow --output-dir ./test_output
# 2. If the output looks good, set up and update
cocoindex update --setup main.py:MyFlow
# 3. Check results
cocoindex show main.py:MyFlow
```
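The same three steps can be scripted so that a failing step stops the sequence. This is a sketch only: it uses just the commands and flags shown above, assumes the CLI signals failure with a non-zero exit code, and passes any interactive prompts through to the terminal. `dev_workflow_commands` and `run_all` are hypothetical names.

```python
import subprocess


def dev_workflow_commands(app_target: str, output_dir: str) -> list[list[str]]:
    """Build the evaluate -> update --setup -> show command sequence."""
    return [
        ["cocoindex", "evaluate", app_target, "--output-dir", output_dir],
        ["cocoindex", "update", "--setup", app_target],
        ["cocoindex", "show", app_target],
    ]


def run_all(commands, runner=subprocess.run) -> int:
    """Run commands in order; stop at the first non-zero exit code."""
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            return result.returncode  # assumed to mean the step failed
    return 0
```

Calling `run_all(dev_workflow_commands("main.py:MyFlow", "./test_output"))` runs the workflow. The `runner` parameter exists so the sequencing logic can be exercised without invoking the real CLI.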
### Production Live Updates
```bash
# Run with live updates and auto-setup
cocoindex update --setup main.py -L
```
### Rebuild After Changes
```bash
# Drop old resources
cocoindex drop main.py
# Setup with new definition
cocoindex setup main.py
# Reindex everything
cocoindex update --reexport main.py
```
### Multiple Flows
```bash
# Setup all flows
cocoindex setup main.py
# Update specific flows
cocoindex update main.py:CodeEmbedding
cocoindex update main.py:DocumentEmbedding
# Show all flows
cocoindex show main.py
```
## Common Issues and Solutions
### Issue: "Flow not found"
**Problem:** CLI can't find the flow definition.
**Solutions:**
```bash
# Make sure APP_TARGET is correct
cocoindex show main.py # Should list flows
# Use --app-dir if not in project root
cocoindex --app-dir /path/to/project show main.py
# Check flow name is correct
cocoindex show main.py:CorrectFlowName
```
### Issue: "Database connection failed"
**Problem:** Can't connect to Postgres.
**Solutions:**
```bash
# Check that .env defines the connection URL
grep COCOINDEX_DATABASE_URL .env
# Test connection (the variable must be exported in your shell for this to work)
psql "$COCOINDEX_DATABASE_URL"
# Use --env-file if .env is elsewhere
cocoindex --env-file /path/to/.env update main.py
```
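When `psql` is not installed, a quick check of the URL's parts and of basic network reachability can narrow the problem down. The sketch below uses only the Python standard library; `describe_database_url` and `can_reach` are hypothetical helper names, and 5432 is used as the fallback because it is the default Postgres port. A successful TCP connection only shows that something is listening, not that the credentials or database name are valid.

```python
import socket
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Break a postgresql:// URL into its parts (password deliberately omitted)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # default Postgres port when none is given
        "database": parts.path.lstrip("/"),
    }


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the example URL from the environment section, `describe_database_url` reports host `localhost`, port `5432`, and database `cocoindex_db`, which can then be passed to `can_reach`.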
### Issue: "Schema mismatch"
**Problem:** Flow definition changed but resources not updated.
**Solution:**
```bash
# Re-run setup to update schemas
cocoindex setup main.py
# Then update data
cocoindex update main.py
```
### Issue: "Live update exits immediately"
**Problem:** No change capture mechanisms enabled.
**Solution:**
Add `refresh_interval` to the source, or use source-specific change capture:
```python
import datetime

import cocoindex

# Inside the flow definition function:
data_scope["docs"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="docs"),
    refresh_interval=datetime.timedelta(seconds=30),  # Add this
)
```
## Advanced Options
### Global Options
```bash
# Show version
cocoindex --version
# Show help
cocoindex --help
cocoindex update --help
# Specify app directory
cocoindex --app-dir /custom/path update main
# Custom env file
cocoindex --env-file prod.env update main
```
### Performance Tuning
Set environment variables for concurrency:
```bash
# In .env file
COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS=100
COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES=1073741824 # 1GB
```
Or per-source in code:
```python
data_scope["docs"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="docs"),
    max_inflight_rows=50,
    max_inflight_bytes=500 * 1024 * 1024,  # 500MB
)
```
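The byte limits in this document are written as raw integers computed with binary units (1 MB taken as 1024 × 1024 bytes). A small helper makes the arithmetic explicit and avoids typing long numbers by hand. The names `mib`, `gib`, and `env_lines` are hypothetical and exist only for this sketch.

```python
MIB = 1024 * 1024   # bytes in one binary megabyte
GIB = 1024 * MIB    # bytes in one binary gigabyte


def mib(n: int) -> int:
    """Convert binary megabytes to bytes."""
    return n * MIB


def gib(n: int) -> int:
    """Convert binary gigabytes to bytes."""
    return n * GIB


def env_lines(max_rows: int, max_bytes: int) -> str:
    """Format the two global concurrency settings as .env lines."""
    return (
        f"COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS={max_rows}\n"
        f"COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES={max_bytes}"
    )
```

Here `mib(500)` gives 524288000 and `gib(1)` gives 1073741824, matching the values used in the `.env` examples.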
## Best Practices
1. **Use evaluate before update** - Test flow logic without side effects
2. **Always setup before first update** - Or use `--setup` flag
3. **Use live updates in production** - Keeps targets always fresh
4. **Set app namespace** - Organize flows across environments (dev/staging/prod)
5. **Monitor with show** - Regularly check flow statistics
6. **Version control .env.example** - Document required environment variables
7. **Use specific flow targets** - For selective updates: `main.py:FlowName`
8. **Setup after definition changes** - Ensures schemas match flow definition
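
For practice 6, one way to keep `.env.example` in step with the real `.env` is to generate it by blanking every value while preserving comments and ordering. This is a minimal sketch with a hypothetical helper name, `to_env_example`; it treats any line containing `=` that is not a comment as a variable assignment, and it does not write files itself.

```python
def to_env_example(env_text: str) -> str:
    """Return env_text with every value removed, keeping comments and order."""
    out = []
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            out.append(line)  # keep blanks and comments unchanged
            continue
        key = stripped.split("=", 1)[0].strip()
        out.append(f"{key}=")  # drop the value so no secret is committed
    return "\n".join(out) + "\n"
```

The result can be written to `.env.example` and committed, while `.env` itself stays out of version control.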