2025-11-29 18:14:46 +08:00

CLI Operations Reference

Complete guide for operating CocoIndex flows using the CLI.

Overview

The CocoIndex CLI (cocoindex command) provides tools for managing and inspecting flows. Most commands require an APP_TARGET argument specifying where flow definitions are located.

Environment Setup

Environment Variables

Create a .env file in the project directory:

# Database connection (required)
COCOINDEX_DATABASE_URL=postgresql://user:password@localhost/cocoindex_db

# Optional: App namespace for organizing flows
COCOINDEX_APP_NAMESPACE=dev

# Optional: Global concurrency limits
COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS=50
COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES=524288000  # 500 MiB

# Optional: LLM API keys (if using LLM functions)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
VOYAGE_API_KEY=pa-...
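
Before running any command, it can help to confirm that the file actually defines the required variable. The following is a minimal sketch and not part of CocoIndex: `parse_env_text` and `missing_required` are hypothetical helpers, and they only understand plain KEY=VALUE lines (no quoting or variable expansion), which is narrower than what a real .env loader accepts.

```python
REQUIRED_KEYS = ("COCOINDEX_DATABASE_URL",)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and full-line comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "524288000  # 500MB".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_required(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """\
# Database connection (required)
COCOINDEX_DATABASE_URL=postgresql://user:password@localhost/cocoindex_db
COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES=524288000  # 500MB
"""
parsed = parse_env_text(sample)
print(missing_required(parsed))  # prints []
```

In a real project, `sample` would be replaced by the contents of the .env file, for example `pathlib.Path(".env").read_text()`.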

Loading Environment Files

# Default: loads .env from current directory
cocoindex <command> ...

# Specify custom env file
cocoindex --env-file path/to/.env <command> ...

# Load the app from another directory (defaults to the current directory)
cocoindex --app-dir /path/to/project <command> ...

APP_TARGET Format

The APP_TARGET tells the CLI where flow definitions are located:

Python Module

# Load from module name
cocoindex update main

# Load from package module
cocoindex update my_package.flows

Python File

# Load from file path
cocoindex update main.py

# Load from nested file
cocoindex update path/to/flows.py

Specific Flow

# Target specific flow in module
cocoindex update main:MyFlowName

# Target specific flow in file
cocoindex update path/to/flows.py:MyFlowName
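
The three forms above share one shape: a module name or file path, optionally followed by a colon and a flow name. The sketch below illustrates that shape only. `parse_app_target` and `AppTarget` are hypothetical names, not the CLI's own parser, and the split on the last colon is an assumption that would misread a Windows path with a drive letter and no flow name.

```python
from typing import NamedTuple, Optional


class AppTarget(NamedTuple):
    location: str              # module name or file path
    flow_name: Optional[str]   # None when no specific flow is requested
    is_file: bool              # True when the location looks like a .py file


def parse_app_target(target: str) -> AppTarget:
    """Split 'location[:FlowName]' on the last colon (illustrative only)."""
    location, sep, flow_name = target.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the location.
        location, flow_name = target, ""
    return AppTarget(location, flow_name or None, location.endswith(".py"))


for example in ("main", "my_package.flows", "main.py", "path/to/flows.py:MyFlowName"):
    print(parse_app_target(example))
```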

Core Commands

setup - Initialize Flow Resources

Create all persistent backends needed by flows (database tables, collections, etc.).

# Setup all flows
cocoindex setup main.py

# Setup specific flow
cocoindex setup main.py:MyFlow

What it does:

  • Creates internal storage tables in Postgres
  • Creates target resources (database tables, vector collections, graph structures)
  • Updates schemas if flow definition changed
  • No-op if already set up and no changes needed

When to use:

  • First time running a flow
  • After modifying flow structure (new fields, new targets)
  • After dropping flows to recreate resources

update - Build/Update Target Data

Run transformations and update target data based on current source data.

# One-time update
cocoindex update main.py

# One-time update with setup
cocoindex update --setup main.py

# One-time update specific flow
cocoindex update main.py:TextEmbedding

# Force reexport even if no changes
cocoindex update --reexport main.py

What it does:

  • Reads source data
  • Applies transformations
  • Updates target databases
  • Uses incremental processing (only processes changed data)

Options:

  • --setup - Run setup first if needed
  • --reexport - Reexport all data even if unchanged (useful after data loss in a target)

update -L - Live Update Mode

Continuously monitor source changes and update targets.

# Live update mode
cocoindex update main.py -L

# Live update with setup
cocoindex update --setup main.py -L

# Live update with reexport on initial update
cocoindex update --reexport main.py -L

What it does:

  • Performs initial one-time update
  • Continuously monitors source changes
  • Automatically processes updates
  • Runs until aborted (Ctrl-C)

Requires:

  • At least one source with change capture enabled:
    • refresh_interval parameter on source
    • Source-specific change capture (Postgres notifications, S3 events, etc.)

Example with refresh interval:

# Inside the flow definition; needs `import datetime` and `import cocoindex`
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="documents"),
    refresh_interval=datetime.timedelta(minutes=1),  # Check every minute
)
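
When update is launched from a script (a scheduler or a deployment job, say), the flags from the two sections above can be assembled programmatically. This is a sketch under assumptions: `build_update_command` is a hypothetical helper, and it uses only the flags documented here (`--setup`, `--reexport`, `-L`) in the same order as the examples.

```python
import shlex


def build_update_command(app_target, *, setup=False, reexport=False, live=False):
    """Return the argv list for a `cocoindex update` invocation."""
    argv = ["cocoindex", "update"]
    if setup:
        argv.append("--setup")
    if reexport:
        argv.append("--reexport")
    argv.append(app_target)
    if live:
        argv.append("-L")  # keep running and watch sources for changes
    return argv


# One-time update of a single flow, running setup first.
print(shlex.join(build_update_command("main.py:TextEmbedding", setup=True)))
# Live update of every flow in main.py.
print(shlex.join(build_update_command("main.py", live=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with unusual paths.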

drop - Remove Flow Resources

Remove all persistent backends owned by flows.

# Drop all flows
cocoindex drop main.py

# Drop specific flow
cocoindex drop main.py:MyFlow

What it does:

  • Drops internal storage tables
  • Drops target resources (tables, collections, graphs)
  • Cleans up all persistent data

Warning: This is destructive and cannot be undone!

show - Inspect Flow Definition

Display flow structure and statistics.

# Show flow structure
cocoindex show main.py:MyFlow

# Show all flows
cocoindex show main.py

What it shows:

  • Flow name and structure
  • Sources configured
  • Transformations defined
  • Targets and their schemas
  • Current statistics (if flow is set up)

evaluate - Test Flow Without Updating

Run transformations and dump results to files without updating targets.

# Evaluate flow
cocoindex evaluate main.py:MyFlow

# Specify output directory
cocoindex evaluate main.py:MyFlow --output-dir ./eval_results

# Disable cache
cocoindex evaluate main.py:MyFlow --no-cache

What it does:

  • Runs transformations
  • Saves results to files (JSON, CSV, etc.)
  • Does NOT update targets
  • Uses existing cache by default

When to use:

  • Testing flow logic before running full update
  • Debugging transformation issues
  • Inspecting intermediate data
  • Validating output format

Options:

  • --output-dir PATH - Directory for output files (default: eval_{flow_name}_{timestamp})
  • --no-cache - Disable reading from cache (still doesn't write to cache)

Complete Workflow Examples

First-Time Setup and Indexing

# 1. Setup flow resources
cocoindex setup main.py

# 2. Run initial indexing
cocoindex update main.py

# 3. Verify results
cocoindex show main.py
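
The same three steps can be driven from a script that stops at the first failure. The sketch below is one possible shape, not an official wrapper: `run_steps`, `run_with_subprocess`, and `dry_run` are hypothetical names, and it assumes the `cocoindex` command is on PATH and that each step can run without interactive input.

```python
import subprocess
from typing import Callable, Sequence

FIRST_TIME_STEPS = [
    ["cocoindex", "setup", "main.py"],
    ["cocoindex", "update", "main.py"],
    ["cocoindex", "show", "main.py"],
]


def run_steps(steps, run: Callable[[Sequence[str]], int]) -> int:
    """Run steps in order; stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for argv in steps:
        if run(argv) != 0:
            break
        completed += 1
    return completed


def run_with_subprocess(argv: Sequence[str]) -> int:
    """Execute one step for real and return its exit code."""
    return subprocess.run(list(argv)).returncode


def dry_run(argv: Sequence[str]) -> int:
    """Print the command instead of executing it."""
    print("would run:", " ".join(argv))
    return 0


if __name__ == "__main__":
    # Swap dry_run for run_with_subprocess to execute the commands.
    run_steps(FIRST_TIME_STEPS, dry_run)
```

Passing the runner as a parameter keeps the sequencing logic testable without a database or a CocoIndex installation.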

Development Workflow

# 1. Test with evaluate (targets are not modified)
cocoindex evaluate main.py:MyFlow --output-dir ./test_output

# 2. If the output looks good, set up and update
cocoindex update --setup main.py:MyFlow

# 3. Check results
cocoindex show main.py:MyFlow

Production Live Updates

# Run with live updates and auto-setup
cocoindex update --setup main.py -L

Rebuild After Changes

# Drop old resources
cocoindex drop main.py

# Setup with new definition
cocoindex setup main.py

# Reindex everything
cocoindex update --reexport main.py

Multiple Flows

# Setup all flows
cocoindex setup main.py

# Update specific flows
cocoindex update main.py:CodeEmbedding
cocoindex update main.py:DocumentEmbedding

# Show all flows
cocoindex show main.py

Common Issues and Solutions

Issue: "Flow not found"

Problem: CLI can't find the flow definition.

Solutions:

# Make sure APP_TARGET is correct
cocoindex show main.py  # Should list flows

# Use --app-dir if not in project root
cocoindex --app-dir /path/to/project show main.py

# Check flow name is correct
cocoindex show main.py:CorrectFlowName

Issue: "Database connection failed"

Problem: Can't connect to Postgres.

Solutions:

# Check that .env defines the connection string
grep COCOINDEX_DATABASE_URL .env

# Test the connection (export the variable in your shell first; psql does not read .env)
psql "$COCOINDEX_DATABASE_URL"

# Use --env-file if .env is elsewhere
cocoindex --env-file /path/to/.env update main.py
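
A malformed connection string is a common cause of this error, so a quick structural check can save a round trip to the database. The sketch below compares a URL against the shape used in this guide (`postgresql://user:password@host/database`). `check_database_url` and `mask_password` are hypothetical helpers; the checks are deliberately stricter than what Postgres itself accepts, so a reported item is a hint to look closer rather than proof of an error.

```python
from urllib.parse import urlsplit


def check_database_url(url: str) -> list[str]:
    """Return a list of parts that differ from the expected URL shape."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems


def mask_password(url: str) -> str:
    """Hide the password so the URL can be printed or logged."""
    password = urlsplit(url).password
    if password is None:
        return url
    return url.replace(f":{password}@", ":***@", 1)


url = "postgresql://user:password@localhost/cocoindex_db"
print(mask_password(url))       # prints postgresql://user:***@localhost/cocoindex_db
print(check_database_url(url))  # prints []
```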

Issue: "Schema mismatch"

Problem: The flow definition changed but the target resources were not updated.

Solution:

# Re-run setup to update schemas
cocoindex setup main.py

# Then update data
cocoindex update main.py

Issue: "Live update exits immediately"

Problem: No source has a change capture mechanism enabled.

Solution: Add refresh_interval or use source-specific change capture:

# Inside the flow definition; needs `import datetime` and `import cocoindex`
data_scope["docs"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="docs"),
    refresh_interval=datetime.timedelta(seconds=30),  # Add this
)

Advanced Options

Global Options

# Show version
cocoindex --version

# Show help
cocoindex --help
cocoindex update --help

# Specify app directory
cocoindex --app-dir /custom/path update main

# Custom env file
cocoindex --env-file prod.env update main

Performance Tuning

Set environment variables for concurrency:

# In .env file
COCOINDEX_SOURCE_MAX_INFLIGHT_ROWS=100
COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES=1073741824  # 1 GiB

Or per-source in code:

data_scope["docs"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="docs"),
    max_inflight_rows=50,
    max_inflight_bytes=500*1024*1024  # 500 MiB
)
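
The byte limits in this section are binary multiples (1024-based), which is easy to get wrong when typing the raw number into a .env file. A small sketch of the arithmetic, with `mib` and `gib` as hypothetical helper names:

```python
def mib(n: int) -> int:
    """Convert mebibytes to bytes."""
    return n * 1024 * 1024


def gib(n: int) -> int:
    """Convert gibibytes to bytes."""
    return n * 1024 ** 3


print(f"COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES={mib(500)}")  # 524288000
print(f"COCOINDEX_SOURCE_MAX_INFLIGHT_BYTES={gib(1)}")    # 1073741824
```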

Best Practices

  1. Use evaluate before update - Test flow logic without modifying targets
  2. Always run setup before the first update - Or pass the --setup flag
  3. Use live updates in production - Keeps targets always fresh
  4. Set app namespace - Organize flows across environments (dev/staging/prod)
  5. Monitor with show - Regularly check flow statistics
  6. Version control .env.example - Document required environment variables
  7. Use specific flow targets - For selective updates: main.py:FlowName
  8. Re-run setup after definition changes - Ensures schemas match the flow definition