gh-cwensel-arcaneum/commands/corpus.md at b923023331732db3b4c05012ed4abff8c74fda27

Zhongwei Li b923023331 Initial commit

2025-11-29 18:17:12 +08:00

2.6 KiB

description	argument-hint
Manage dual-index corpora	<create|sync> [options]

Manage corpora that combine both vector search (Qdrant) and full-text search (MeiliSearch) for the same content.

Subcommands:

create: Create both a Qdrant collection and a MeiliSearch index
sync: Index a directory into both systems in one run

Common Options:

--json: Output in JSON format

Create Options:

name: Corpus name (required)
--type: Corpus type - code or pdf (required)
--models: Embedding models, comma-separated (default: stella,jina)

Sync Options:

directory: Directory path to index (required)
--corpus: Corpus name (required)
--models: Embedding models, comma-separated (default: stella,jina)
--file-types: File extensions to index, comma-separated (e.g., .py,.md)

Examples:

/corpus create MyDocs --type pdf --models stella
/corpus sync ~/Documents --corpus MyDocs
/corpus create CodeBase --type code
/corpus sync ~/projects --corpus CodeBase --file-types .py,.js,.md
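
The options and examples above map onto a two-subcommand parser. The following is a minimal sketch using Python's standard argparse module, not the actual arc implementation; the function names build_parser and split_csv are hypothetical and exist only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented create/sync options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="corpus")

    # --json is documented as a common option, so a shared parent parser carries it.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--json", action="store_true", help="Output in JSON format")

    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create", parents=[common])
    create.add_argument("name", help="Corpus name")
    create.add_argument("--type", choices=["code", "pdf"], required=True)
    create.add_argument("--models", default="stella,jina")

    sync = sub.add_parser("sync", parents=[common])
    sync.add_argument("directory", help="Directory path to index")
    sync.add_argument("--corpus", required=True)
    sync.add_argument("--models", default="stella,jina")
    sync.add_argument("--file-types", default=None)
    return parser


def split_csv(value):
    """Split a comma-separated option value into a list; None or '' yields []."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["sync", "~/projects", "--corpus", "CodeBase", "--file-types", ".py,.js,.md"]
    )
    print(args.command, args.corpus, split_csv(args.models), split_csv(args.file_types))
```

Keeping --models and --file-types as raw strings and splitting them afterwards matches the comma-separated form shown in the examples.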

Execution:

cd "${CLAUDE_PLUGIN_ROOT}"
arc corpus $ARGUMENTS

What Is a Corpus?

A corpus combines two search systems:

Vector search (Qdrant): Semantic similarity, concept matching
Full-text search (MeiliSearch): Keyword, phrase, boolean operators

This enables hybrid search strategies:

Broad semantic discovery (vector search)
Precise keyword refinement (full-text search)
Combined results that draw on both semantic and keyword matches
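
One way to combine the two result lists is reciprocal rank fusion (RRF). The source does not say how Arcaneum merges results, so treat this as an illustrative sketch of one common technique rather than the tool's behavior. The document IDs are made up, and k=60 is a conventional default for RRF, not a documented setting. The second helper shows keyword filtering of semantic results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def keyword_filter(semantic_ids, keyword_ids):
    """Keep semantic hits that also matched the keyword query, preserving order."""
    allowed = set(keyword_ids)
    return [doc_id for doc_id in semantic_ids if doc_id in allowed]


if __name__ == "__main__":
    semantic_hits = ["doc-3", "doc-1", "doc-7"]  # hypothetical vector search results
    keyword_hits = ["doc-1", "doc-9", "doc-3"]   # hypothetical full-text results
    print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
    # ['doc-1', 'doc-3', 'doc-9', 'doc-7']
    print(keyword_filter(semantic_hits, keyword_hits))
    # ['doc-3', 'doc-1']
```

Both helpers rely on the two systems returning the same document IDs for the same content, which is what makes merging by ID possible.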

When to Use Corpus vs Collection:

Use Corpus When:

Need both semantic and keyword search
Users search different ways (concepts vs exact terms)
Want fast keyword filtering of semantic results
Building search UIs with multiple search modes

Use Collection When:

Only need semantic search
Working with embeddings/vectors directly
Integrating with existing vector workflows
MeiliSearch is not available or not needed

How Sync Works:

Discovers files in the directory (respects .gitignore for code)
Chunks content according to its type (PDF vs. code)
Generates embeddings with the specified models
Uploads to Qdrant (vector search)
Indexes to MeiliSearch (full-text search)
Both indexes share the same document IDs and metadata
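
The steps above can be sketched with in-memory stand-ins. This is a simplified illustration, not Arcaneum's code: plain dicts replace Qdrant and MeiliSearch, fake_embed is a placeholder for a real embedding model, fixed-size character chunking stands in for type-aware chunking, and deriving IDs with uuid5 is an assumption about how shared IDs could be generated.

```python
import hashlib
import uuid


def chunk_text(text, size=200):
    """Split text into fixed-size character chunks (a stand-in for real chunking)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(chunk, dims=4):
    """Placeholder embedding: a few hash bytes scaled to [0, 1]. Not a real model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


def chunk_id(path, index):
    """Deterministic ID from path and chunk index, so both systems can share it."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{path}#{index}"))


def sync_files(files, vector_store, text_index, chunk_size=200):
    """Write every chunk to both stores under the same ID and metadata."""
    for path, text in files.items():
        for index, chunk in enumerate(chunk_text(text, chunk_size)):
            doc_id = chunk_id(path, index)
            metadata = {"file_path": path, "chunk_index": index}
            # Stand-in for the vector upload (Qdrant in the real system).
            vector_store[doc_id] = {"vector": fake_embed(chunk), "payload": metadata}
            # Stand-in for the full-text document (MeiliSearch in the real system).
            text_index[doc_id] = {"id": doc_id, "content": chunk, **metadata}


if __name__ == "__main__":
    vectors, documents = {}, {}
    sync_files({"a.py": "x" * 450, "b.md": "hello"}, vectors, documents)
    print(len(vectors), vectors.keys() == documents.keys())
```

Because the IDs are derived from the path and chunk index, re-running the sync overwrites existing entries instead of duplicating them, and a hit from either store can be looked up in the other.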

Performance:

Corpus sync is approximately 2x slower than single-system indexing because every document is uploaded to both systems. Typical throughput:

PDFs: ~5-15 files/minute
Source files: 50-100 files/second
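
For planning, the rates above translate into a rough time range. This sketch assumes the quoted figures describe corpus sync throughput and that time scales linearly with file count; real throughput depends on hardware, file sizes, and the embedding models chosen. The function name estimate_sync_seconds is hypothetical.

```python
def estimate_sync_seconds(pdf_count=0, source_file_count=0):
    """Return (fastest, slowest) sync time in seconds from the quoted rate ranges."""
    pdf_per_minute = (5, 15)       # slow and fast ends of the PDF range
    source_per_second = (50, 100)  # slow and fast ends of the source-file range

    fastest = pdf_count / pdf_per_minute[1] * 60 + source_file_count / source_per_second[1]
    slowest = pdf_count / pdf_per_minute[0] * 60 + source_file_count / source_per_second[0]
    return fastest, slowest


if __name__ == "__main__":
    fastest, slowest = estimate_sync_seconds(pdf_count=30, source_file_count=5000)
    print(f"{fastest:.0f}s to {slowest:.0f}s")  # 170s to 460s
```

At these rates PDFs dominate: 30 PDFs account for 120 to 360 seconds, while 5000 source files add only 50 to 100 seconds.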

Related Commands:

/collection create - Create vector-only collection
/index pdf - Index PDFs into the vector index only
/index code - Index code into the vector index only
/search semantic - Search vector index
/search text - Search full-text index

Implementation:

RDR-009: Dual indexing strategy
RDR-006: Claude Code integration
