API Operations Reference

Guide for operating CocoIndex flows programmatically using Python APIs.

Overview

CocoIndex flows can be operated through Python APIs, providing programmatic control over setup, updates, and queries. This is useful for embedding flows in applications, automating workflows, or building custom tools.

Basic Setup

Initialization

from dotenv import load_dotenv
import cocoindex

# Load environment variables
load_dotenv()

# Initialize CocoIndex
cocoindex.init()

Flow Definition

@cocoindex.flow_def(name="MyFlow")
def my_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Flow definition
    pass

The decorator returns a cocoindex.Flow object that can be used for operations.

Flow Operations

Setup Flow

Create persistent backends (tables, collections, etc.) for the flow.

# Basic setup
my_flow.setup()

# With progress output
my_flow.setup(report_to_stdout=True)

# Async version
await my_flow.setup_async(report_to_stdout=True)

When to use:

  • Before first update
  • After modifying flow structure
  • After dropping flow to recreate resources

Setup All Flows

# Setup all flows at once
cocoindex.setup_all_flows(report_to_stdout=True)

Drop Flow

Remove all persistent backends owned by the flow.

# Drop flow
my_flow.drop()

# With progress output
my_flow.drop(report_to_stdout=True)

# Async version
await my_flow.drop_async(report_to_stdout=True)

Note: After dropping, the Flow object is still valid and can be set up again.

Drop All Flows

# Drop all flows
cocoindex.drop_all_flows(report_to_stdout=True)

Close Flow

Remove the flow from the current process's memory (persistent data is unaffected).

my_flow.close()
# After this, my_flow is invalid and should not be used

Update Operations

One-Time Update

Build or update target data based on current source data.

# Basic update
stats = my_flow.update()
print(f"Processed {stats.total_rows} rows")

# With reexport (force reprocess even if unchanged)
stats = my_flow.update(reexport_targets=True)

# Async version
stats = await my_flow.update_async()
stats = await my_flow.update_async(reexport_targets=True)

Returns: Statistics about processed data

Note: Multiple calls to update() can run simultaneously. CocoIndex will automatically combine them efficiently.
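To illustrate, overlapping updates can simply be launched together. This is a pure-asyncio sketch in which run_update stands in for my_flow.update_async():

```python
import asyncio

async def run_update(name: str) -> str:
    # Stand-in for my_flow.update_async(); in CocoIndex, overlapping
    # update calls are combined efficiently rather than duplicating work.
    await asyncio.sleep(0.01)
    return f"{name} done"

async def main() -> list[str]:
    # Launch two updates concurrently and wait for both to finish.
    return await asyncio.gather(run_update("first"), run_update("second"))

print(asyncio.run(main()))  # ['first done', 'second done']
```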

Live Update

Continuously monitor source changes and update targets.

import cocoindex

# Create live updater
updater = cocoindex.FlowLiveUpdater(
    my_flow,
    cocoindex.FlowLiveUpdaterOptions(
        live_mode=True,        # Enable live updates
        print_stats=True,      # Print progress
        reexport_targets=False # If True, reexport targets on the first update, even if unchanged
    )
)

# Start the updater
updater.start()

# Your application logic here
# (updater runs in background threads)

# Wait for completion
updater.wait()

# Print final stats
print(updater.update_stats())

As a Context Manager

with cocoindex.FlowLiveUpdater(my_flow) as updater:
    # Updater starts automatically
    # Your application logic here
    pass
# Updater aborts and waits automatically

# Async version
async with cocoindex.FlowLiveUpdater(my_flow) as updater:
    # Your application logic
    pass

Monitoring Status Updates

updater = cocoindex.FlowLiveUpdater(my_flow)
updater.start()

while True:
    # Block until next status update
    updates = updater.next_status_updates()

    # Check which sources were updated
    for source in updates.updated_sources:
        print(f"Source {source} has new data")
        # Trigger downstream operations

    # Check if updater stopped
    if not updates.active_sources:
        print("All sources stopped")
        break

# Async version
while True:
    updates = await updater.next_status_updates_async()
    # ... same logic

Control Methods

# Start updater
updater.start()
await updater.start_async()

# Abort updater
updater.abort()

# Wait for completion
updater.wait()
await updater.wait_async()

# Get current stats
stats = updater.update_stats()

Evaluate Flow

Run transformations without updating targets (for testing).

# Evaluate and dump results
my_flow.evaluate_and_dump(
    cocoindex.EvaluateAndDumpOptions(
        output_dir="./eval_output",
        use_cache=True  # Use existing cache (but don't update it)
    )
)

Use cases:

  • Testing flow logic
  • Debugging transformations
  • Inspecting intermediate data

Query Operations

Transform Flows

Transform flows enable reusable transformation logic for both indexing and querying.

from numpy.typing import NDArray
import numpy as np

# Define transform flow
@cocoindex.transform_flow()
def text_to_embedding(
    text: cocoindex.DataSlice[str]
) -> cocoindex.DataSlice[NDArray[np.float32]]:
    """Convert text to embedding vector."""
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

# Use in indexing flow
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder, data_scope):
    # ... setup source ...
    with data_scope["documents"].row() as doc:
        doc["embedding"] = text_to_embedding(doc["content"])
        # ... collect and export ...

# Use for querying (evaluate with input)
query_embedding = text_to_embedding.eval("search query text")
# query_embedding is now a numpy array
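Since the evaluated result is a plain numpy array, it can be compared directly, for example by cosine similarity. A minimal sketch, using hard-coded vectors in place of text_to_embedding.eval(...) outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for text_to_embedding.eval(...) outputs:
query_vec = np.array([1.0, 0.0, 1.0], dtype=np.float32)
doc_vec = np.array([1.0, 1.0, 0.0], dtype=np.float32)

print(round(cosine_similarity(query_vec, doc_vec), 3))  # 0.5
```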

Query Handlers

Attach query logic to flows for easy query execution.

import os
import functools
from psycopg_pool import ConnectionPool
from pgvector.psycopg import register_vector

@functools.cache
def connection_pool():
    return ConnectionPool(os.environ["COCOINDEX_DATABASE_URL"])

# Register query handler
@my_flow.query_handler(
    result_fields=cocoindex.QueryHandlerResultFields(
        embedding=["embedding"],  # Field name(s) containing embeddings
        score="score"             # Field name for similarity score
    )
)
def search(query: str) -> cocoindex.QueryOutput:
    """Search for documents matching query."""

    # Get table name for this flow's export
    table_name = cocoindex.utils.get_target_default_name(my_flow, "doc_embeddings")

    # Compute query embedding using transform flow
    query_vector = text_to_embedding.eval(query)

    # Execute query
    with connection_pool().connection() as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT filename, text, embedding, embedding <=> %s AS distance
                FROM {table_name}
                ORDER BY distance
                LIMIT 10
                """,
                (query_vector,)
            )

            return cocoindex.QueryOutput(
                query_info=cocoindex.QueryInfo(
                    embedding=query_vector,
                    similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY
                ),
                results=[
                    {
                        "filename": row[0],
                        "text": row[1],
                        "embedding": row[2],
                        "score": 1.0 - row[3]  # Convert distance to similarity
                    }
                    for row in cur.fetchall()
                ]
            )

# Call the query handler
results = search("machine learning algorithms")
for result in results.results:
    print(f"[{result['score']:.3f}] {result['filename']}: {result['text']}")

Query with Qdrant

from qdrant_client import QdrantClient
import functools

@functools.cache
def get_qdrant_client():
    return QdrantClient(url="http://localhost:6334", prefer_grpc=True)

@my_flow.query_handler(
    result_fields=cocoindex.QueryHandlerResultFields(
        embedding=["embedding"],
        score="score"
    )
)
def search_qdrant(query: str) -> cocoindex.QueryOutput:
    client = get_qdrant_client()

    # Get query embedding
    query_embedding = text_to_embedding.eval(query)

    # Search Qdrant
    search_results = client.search(
        collection_name="my_collection",
        query_vector=("text_embedding", query_embedding),
        limit=10,
        with_vectors=True  # include stored vectors so result.vector is populated
    )

    return cocoindex.QueryOutput(
        query_info=cocoindex.QueryInfo(
            embedding=query_embedding,
            similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY
        ),
        results=[
            {
                "text": result.payload["text"],
                "embedding": result.vector,
                "score": result.score
            }
            for result in search_results
        ]
    )

Application Integration Patterns

Pattern 1: Simple Application with Update

from dotenv import load_dotenv
import cocoindex

# Initialize
load_dotenv()
cocoindex.init()

# Define flow
@cocoindex.flow_def(name="MyApp")
def my_app_flow(flow_builder, data_scope):
    # ... flow definition ...
    pass

def main():
    # Ensure flow is set up and data is fresh
    stats = my_app_flow.update()
    print(f"Updated index: {stats}")

    # Run application logic
    while True:
        query = input("Search: ")
        if not query:
            break
        results = search(query)
        for result in results.results:
            print(f"  {result['score']:.3f}: {result['text']}")

if __name__ == "__main__":
    main()

Pattern 2: Web Application with Live Updates

from fastapi import FastAPI
import cocoindex
from dotenv import load_dotenv

load_dotenv()
cocoindex.init()

@cocoindex.flow_def(name="WebAppFlow")
def web_app_flow(flow_builder, data_scope):
    # ... flow definition ...
    pass

# Create FastAPI app
app = FastAPI()

# Global updater
updater = None

@app.on_event("startup")
async def startup():
    global updater
    # Start live updater in background
    updater = cocoindex.FlowLiveUpdater(
        web_app_flow,
        cocoindex.FlowLiveUpdaterOptions(live_mode=True, print_stats=True)
    )
    await updater.start_async()
    print("Live updater started")

@app.on_event("shutdown")
async def shutdown():
    global updater
    if updater:
        updater.abort()
        await updater.wait_async()
        print("Live updater stopped")

@app.get("/search")
async def search_endpoint(q: str):
    results = search(q)
    return {
        "query": q,
        "results": results.results
    }

Pattern 3: Batch Processing

import cocoindex
from dotenv import load_dotenv

load_dotenv()
cocoindex.init()

@cocoindex.flow_def(name="BatchProcessor")
def batch_flow(flow_builder, data_scope):
    # ... flow definition ...
    pass

def process_batch():
    """Run as scheduled job (cron, etc.)"""
    # Setup if needed (no-op if already set up)
    batch_flow.setup()

    # Run update
    stats = batch_flow.update()

    # Log results
    print(f"Batch completed: {stats.total_rows} rows processed")

    return stats

if __name__ == "__main__":
    process_batch()

Pattern 4: React to Updates

import cocoindex

@cocoindex.flow_def(name="ReactiveFlow")
def reactive_flow(flow_builder, data_scope):
    # ... flow definition ...
    pass

async def run_with_reactions():
    """Monitor updates and trigger downstream actions."""
    async with cocoindex.FlowLiveUpdater(reactive_flow) as updater:
        while True:
            updates = await updater.next_status_updates_async()

            # React to specific source updates
            if "products" in updates.updated_sources:
                await rebuild_product_index()

            if "customers" in updates.updated_sources:
                await refresh_customer_cache()

            # Exit when updater stops
            if not updates.active_sources:
                break

async def rebuild_product_index():
    print("Rebuilding product index...")
    # Custom logic

async def refresh_customer_cache():
    print("Refreshing customer cache...")
    # Custom logic

Error Handling

Handling Update Errors

try:
    stats = my_flow.update()
except cocoindex.CocoIndexError as e:
    print(f"Update failed: {e}")
    # Handle error (log, retry, alert, etc.)
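For scheduled jobs, a retry wrapper with exponential backoff is a common way to absorb transient failures. This is a generic sketch; flaky_update is a hypothetical stand-in for my_flow.update:

```python
import time

def update_with_retry(run_update, max_attempts: int = 3, base_delay: float = 1.0):
    """Call run_update, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_update()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts; propagate the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo: a stand-in update that fails twice, then succeeds.
calls = {"count": 0}

def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(update_with_retry(flaky_update, base_delay=0.01))  # ok
```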

Graceful Shutdown

import signal
import sys

updater = None

def signal_handler(sig, frame):
    print("Shutting down gracefully...")
    if updater:
        updater.abort()
        updater.wait()
    print("Shutdown complete")
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)

updater = cocoindex.FlowLiveUpdater(my_flow)
updater.start()
updater.wait()

Best Practices

  1. Always call cocoindex.init() - Initialize before using any CocoIndex APIs
  2. Load environment variables - Use dotenv or similar to load configuration
  3. Use context managers - For live updaters to ensure cleanup
  4. Cache expensive resources - Use @functools.cache for database pools, clients
  5. Handle signals - Gracefully shutdown live updaters on SIGINT/SIGTERM
  6. Separate concerns - Keep flow definitions, queries, and application logic separate
  7. Use transform flows - Share logic between indexing and querying
  8. Monitor update stats - Log and track processing statistics
  9. Test with evaluate - Use evaluate_and_dump for testing before updates