Files
gh-cwensel-arcaneum/commands/corpus.md
2025-11-29 18:17:12 +08:00

2.6 KiB

description, argument-hint
description argument-hint
Manage dual-index corpora <create|sync> [options]

Manage corpora that combine both vector search (Qdrant) and full-text search (MeiliSearch) for the same content.

Subcommands:

  • create: Create both Qdrant collection and MeiliSearch index
  • sync: Index directory to both systems simultaneously

Common Options:

  • --json: Output in JSON format

Create Options:

  • name: Corpus name (required)
  • --type: Corpus type - code or pdf (required)
  • --models: Embedding models, comma-separated (default: stella,jina)

Sync Options:

  • directory: Directory path to index (required)
  • --corpus: Corpus name (required)
  • --models: Embedding models (default: stella,jina)
  • --file-types: File extensions to index (e.g., .py,.md)

Examples:

/corpus create MyDocs --type pdf --models stella
/corpus sync ~/Documents --corpus MyDocs
/corpus create CodeBase --type code
/corpus sync ~/projects --corpus CodeBase --file-types .py,.js,.md

Execution:

cd ${CLAUDE_PLUGIN_ROOT}
arc corpus $ARGUMENTS

What Is a Corpus?

A corpus combines two search systems:

  1. Vector search (Qdrant): Semantic similarity, concept matching
  2. Full-text search (MeiliSearch): Keyword, phrase, boolean operators

This enables hybrid search strategies:

  • Broad semantic discovery (vector search)
  • Precise keyword refinement (full-text search)
  • Combined results for best of both worlds

When to Use Corpus vs Collection:

Use Corpus When:

  • Need both semantic and keyword search
  • Users search different ways (concepts vs exact terms)
  • Want fast keyword filtering of semantic results
  • Building search UIs with multiple search modes

Use Collection When:

  • Only need semantic search
  • Working with embeddings/vectors directly
  • Integrating with existing vector workflows
  • MeiliSearch not available/needed

How Sync Works:

  1. Discovers files in directory (respects .gitignore for code)
  2. Chunks content appropriately (PDFs vs code)
  3. Generates embeddings with specified models
  4. Uploads to Qdrant (vector search)
  5. Indexes to MeiliSearch (full-text search)
  6. Both indexes share same document IDs and metadata

Performance:

Corpus sync is approximately 2x slower than single-system indexing due to dual upload, but still efficient:

  • PDFs: ~5-15/minute
  • Source files: 50-100 files/second

Related Commands:

  • /collection create - Create vector-only collection
  • /index pdf - Index PDFs to vector only
  • /index code - Index code to vector only
  • /search semantic - Search vector index
  • /search text - Search full-text index

Implementation:

  • RDR-009: Dual indexing strategy
  • RDR-006: Claude Code integration