Initial commit

Zhongwei Li
2025-11-29 18:17:12 +08:00
commit b923023331
12 changed files with 860 additions and 0 deletions

commands/collection.md (new file, 58 lines)
---
description: Manage Qdrant collections
argument-hint: <create|list|info|delete> [options]
---
Manage Qdrant vector collections for storing embeddings.
**Subcommands:**
- create: Create a new collection with specified embedding model
- list: List all collections in Qdrant
- info: Show detailed information about a collection
- delete: Delete a collection permanently
**Common Options:**
- --json: Output in JSON format for programmatic use
**Examples:**
```text
/collection create MyDocs --model stella --type pdf
/collection list
/collection info MyDocs
/collection delete MyDocs --confirm
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc collection $ARGUMENTS
```
**Collection Types:**
Collections should specify a type (pdf or code) to ensure content matches:
- pdf: For document collections (PDFs, text files)
- code: For source code repositories
**Available Models:**
- stella: 1024D, best for documents and PDFs (default for PDF collections)
- jina-code: 768D, optimized for source code (default for code)
- bge: 1024D, general purpose
- modernbert: 1024D, newer general-purpose model
**Related Commands:**
- /index pdf - Index PDFs into a collection
- /index code - Index source code into a collection
- /corpus create - Create both vector and full-text indexes
**Implementation:**
- Defined in RDR-003 (Collection Creation)
- Enhanced in RDR-006 (Claude Code Integration)
- Type enforcement added in arcaneum-122

commands/config.md (new file, 45 lines)
---
description: Configuration and cache management
argument-hint: <show-cache-dir|clear-cache> [options]
---
Manage Arcaneum configuration and model cache.
**Subcommands:**
- show-cache-dir: Display cache locations and sizes
- clear-cache: Clear model cache to free disk space
**Arguments:**
- --confirm: Confirm cache deletion (required for clear-cache)
**Examples:**
```text
/config show-cache-dir
/config clear-cache --confirm
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc config $ARGUMENTS
```
**Note:** The model cache (~/.arcaneum/models/) stores downloaded embedding models.
First-time indexing downloads ~1-2GB of models which are then reused. I'll show you:
- Current cache directory locations
- Size of each directory (models, data)
- Free disk space information
Use clear-cache when models are corrupted or to free disk space (models will
be re-downloaded on next use).
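For reference, the figures that show-cache-dir reports can be approximated directly with standard tools (the paths below are the defaults named above):

```shell
# Approximate what `arc config show-cache-dir` reports: cache sizes plus free space
du -sh ~/.arcaneum/models ~/.arcaneum/data 2>/dev/null || true
free_space=$(df -h "$HOME" | tail -n 1)
echo "Free space: $free_space"
```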
**Related:**
- Models auto-downloaded to ~/.arcaneum/models/
- Data stored in ~/.arcaneum/data/
- Implemented in arcaneum-157 and arcaneum-162

commands/container.md (new file, 60 lines)
---
description: Manage container services (Qdrant, MeiliSearch)
argument-hint: <start|stop|status|logs|restart|reset> [options]
---
Manage Docker container services for Qdrant and MeiliSearch.
**Subcommands:**
- start: Start all services
- stop: Stop all services
- status: Show service status and health
- logs: View service logs
- restart: Restart services
- reset: Delete all data and reset (WARNING: destructive)
**Arguments:**
- --follow, -f: Follow log output (logs command only)
- --tail <n>: Number of log lines to show (logs command, default: 100)
- --confirm: Confirm data deletion (reset command only)
**Examples:**
```text
/container start
/container status
/container logs
/container logs --follow
/container stop
/container restart
/container reset --confirm
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc container $ARGUMENTS
```
**Note:** Container management commands check Docker availability and provide
helpful error messages if Docker is not running. I'll show you:
- Service startup confirmation with URLs
- Health check status for Qdrant
- Data directory locations and sizes
- Log output for debugging
**Data Locations:**
- Qdrant: ~/.arcaneum/data/qdrant/
- Snapshots: ~/.arcaneum/data/qdrant_snapshots/
- All data persists across container restarts
**Related:**
- Implemented in arcaneum-167, renamed in arcaneum-169
- Replaces old scripts/qdrant-manage.sh
- Part of simplified Docker management (arcaneum-158)

commands/corpus.md (new file, 97 lines)
---
description: Manage dual-index corpora
argument-hint: <create|sync> [options]
---
Manage corpora that combine both vector search (Qdrant) and full-text search (MeiliSearch) for the same content.
**Subcommands:**
- create: Create both Qdrant collection and MeiliSearch index
- sync: Index directory to both systems simultaneously
**Common Options:**
- --json: Output in JSON format
**Create Options:**
- name: Corpus name (required)
- --type: Corpus type - code or pdf (required)
- --models: Embedding models, comma-separated (default: stella,jina)
**Sync Options:**
- directory: Directory path to index (required)
- --corpus: Corpus name (required)
- --models: Embedding models (default: stella,jina)
- --file-types: File extensions to index (e.g., .py,.md)
**Examples:**
```text
/corpus create MyDocs --type pdf --models stella
/corpus sync ~/Documents --corpus MyDocs
/corpus create CodeBase --type code
/corpus sync ~/projects --corpus CodeBase --file-types .py,.js,.md
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc corpus $ARGUMENTS
```
**What Is a Corpus?**
A corpus combines two search systems:
1. **Vector search** (Qdrant): Semantic similarity, concept matching
2. **Full-text search** (MeiliSearch): Keyword, phrase, boolean operators
This enables hybrid search strategies:
- Broad semantic discovery (vector search)
- Precise keyword refinement (full-text search)
- Combined results for best of both worlds
**When to Use Corpus vs Collection:**
**Use Corpus When:**
- Need both semantic and keyword search
- Users search different ways (concepts vs exact terms)
- Want fast keyword filtering of semantic results
- Building search UIs with multiple search modes
**Use Collection When:**
- Only need semantic search
- Working with embeddings/vectors directly
- Integrating with existing vector workflows
- MeiliSearch not available/needed
**How Sync Works:**
1. Discovers files in directory (respects .gitignore for code)
2. Chunks content appropriately (PDFs vs code)
3. Generates embeddings with specified models
4. Uploads to Qdrant (vector search)
5. Indexes to MeiliSearch (full-text search)
6. Both indexes share same document IDs and metadata
**Performance:**
Corpus sync is approximately 2x slower than single-system indexing due to dual upload, but still efficient:
- PDFs: ~5-15 per minute
- Source files: 50-100 files/second
**Related Commands:**
- /collection create - Create vector-only collection
- /index pdf - Index PDFs to vector only
- /index code - Index code to vector only
- /search semantic - Search vector index
- /search text - Search full-text index
**Implementation:**
- RDR-009: Dual indexing strategy
- RDR-006: Claude Code integration

commands/doctor.md (new file, 40 lines)
---
description: Verify Arcaneum setup and prerequisites
argument-hint: [--verbose] [--json]
---
Check that all Arcaneum prerequisites are met and the system is ready for use.
**Arguments:**
- --verbose: Show detailed diagnostic information
- --json: Output JSON format
**Checks Performed:**
- Python version (>= 3.12 required)
- Required Python dependencies installed
- Qdrant server connectivity and health
- MeiliSearch server connectivity (if configured)
- Embedding model availability
- Write permissions for temporary files
- Environment variable configuration
**Examples:**
```text
/doctor
/doctor --verbose
/doctor --json
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc doctor $ARGUMENTS
```
**Note:** This diagnostic command helps troubleshoot setup issues. I'll run all
checks and present a summary showing which requirements are met (✅) and which
need attention (❌), along with specific guidance for fixing any problems found.

commands/index.md (new file, 199 lines)
---
description: Index content into collections
argument-hint: <pdf|code|markdown> [<path> | --from-file <file>] [options]
---
Index PDFs, markdown, or source code into Qdrant collections for semantic search.
**Subcommands:**
- pdf: Index PDF documents (with OCR support)
- markdown: Index markdown files (with frontmatter extraction)
- code: Index source code repositories (git-aware)
**Common Options:**
- --collection: Target collection (required)
- --from-file: Read file paths from list (one per line, or "-" for stdin)
- --model: Embedding model (auto-selected by content type)
- --workers: Parallel workers (default: 4)
- --force: Force reindex all files
- --randomize: Randomize file processing order (useful for parallel indexing)
- --no-gpu: Disable GPU acceleration (GPU enabled by default)
- --verbose: Show detailed progress (suppress library warnings)
- --debug: Show all library warnings including transformers
- --json: Output in JSON format
**PDF Indexing Options:**
- --no-ocr: Disable OCR (enabled by default for scanned PDFs)
- --ocr-language: OCR language code (default: eng)
- --ocr-workers: Parallel OCR workers (default: cpu_count)
- --normalize-only: Skip markdown conversion, only normalize whitespace
- --preserve-images: Extract images for multimodal search
- --process-priority: Process scheduling priority (low, normal, high)
- --embedding-batch-size: Batch size for embeddings (auto-tuned if not specified)
- --offline: Use cached models only (no network)
**Markdown Indexing Options:**
- --chunk-size: Target chunk size in tokens (overrides model default)
- --chunk-overlap: Overlap between chunks in tokens
- --recursive/--no-recursive: Search subdirectories recursively (default: recursive)
- --exclude: Patterns to exclude (e.g., node_modules, .obsidian)
- --offline: Use cached models only (no network)
**Source Code Indexing Options:**
- --depth: Git discovery depth (traverse subdirectories)
**Examples:**
```text
# Basic indexing (GPU enabled by default)
/index pdf ~/Documents/Research --collection PDFs --model stella
/index markdown ~/notes --collection Notes --model stella
/index code ~/projects/myapp --collection MyCode --model jina-code
# Index from file list
/index pdf --from-file /path/to/pdf_list.txt --collection PDFs
/index markdown --from-file /path/to/md_list.txt --collection Notes
# Index from stdin (pipe file paths)
find ~/Documents -name "*.pdf" | /index pdf --from-file - --collection PDFs
ls ~/notes/*.md | /index markdown --from-file - --collection Notes
# With options
/index markdown ~/docs --collection Docs --chunk-size 512 --verbose
/index pdf ~/scanned-docs --collection Scans --no-ocr --offline
# Force CPU-only mode (disable GPU)
/index pdf ~/Documents/Research --collection PDFs --model stella --no-gpu
# Parallel indexing from multiple terminals (randomize order)
/index pdf ~/Documents/Research --collection PDFs --randomize
/index markdown ~/notes --collection Notes --randomize
# Debug mode (show all warnings)
/index pdf ~/Documents/Research --collection PDFs --model stella --debug
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc index $ARGUMENTS
```
**File List Format (--from-file):**
When using `--from-file`, provide a text file with one file path per line:
```text
# Comments are supported (lines starting with #)
/absolute/path/to/file1.pdf
relative/path/to/file2.md
/another/file3.pdf
# Empty lines are ignored
```
Features:
- Supports both absolute and relative paths
- Relative paths resolved from current directory
- Comments (lines starting with #) and empty lines are skipped
- Non-existent files trigger a warning, but processing continues
- Files with the wrong extension are filtered out with a warning
- Use "-" to read from stdin
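As a sketch, a list file in this format can be generated with `find` and handed to `--from-file` (the temp-directory paths here are illustrative):

```shell
# Build a --from-file list: comments, blank lines, and real paths
demo=$(mktemp -d)
touch "$demo/a.pdf" "$demo/b.pdf"
{
  echo "# PDFs gathered automatically"   # comment line, skipped by the parser
  echo ""                                # blank line, also skipped
  find "$demo" -name '*.pdf'
} > "$demo/list.txt"
grep -c '\.pdf$' "$demo/list.txt"        # two real entries
# then: arc index pdf --from-file "$demo/list.txt" --collection PDFs
```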
**How It Works:**
**PDF Indexing:**
1. Extract text from PDFs (PyMuPDF + pdfplumber fallback)
2. Auto-trigger OCR for scanned PDFs (< 100 chars extracted)
3. Chunk text with 15% overlap for context
4. Generate embeddings (stella default: 1024D for documents)
5. Upload to Qdrant with metadata (file path, page numbers)
6. Incremental: Skips unchanged files (file hash metadata check)
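The auto-OCR trigger in step 2 can be approximated by hand with poppler's `pdftotext` (the path below is a placeholder; this mirrors the documented threshold, not the actual implementation):

```shell
# If a PDF yields fewer than ~100 extractable characters, treat it as scanned
pdf="/path/to/doc.pdf"   # placeholder path
if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf" ]; then
  chars=$(pdftotext "$pdf" - 2>/dev/null | wc -c)
  if [ "$chars" -lt 100 ]; then
    verdict="likely scanned: OCR would be triggered"
  else
    verdict="text layer present ($chars chars)"
  fi
else
  verdict="skipped: pdftotext or file unavailable"
fi
echo "$verdict"
```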
**Markdown Indexing:**
1. Discover markdown files (.md, .markdown extensions)
2. Extract YAML frontmatter (title, author, tags, category, etc.)
3. Semantic chunking preserving document structure (headers, code blocks)
4. Generate embeddings (stella default: 1024D for documents)
5. Upload to Qdrant with metadata (file path, frontmatter fields, header context)
6. Incremental: Skips unchanged files (SHA256 content hash check)
**Source Code Indexing:**
1. Discover git repositories in directory tree
2. Extract git metadata (project, branch, commit)
3. Parse code with tree-sitter (AST-aware chunking, 15+ languages)
4. Generate embeddings (jina-code default: 768D for code)
5. Upload to Qdrant with metadata (git info, file path, language)
6. Multi-branch support: project#branch identifier
7. Incremental: Skips unchanged commits (metadata-based sync)
**Default Models:**
- PDFs: stella (1024D, document-optimized)
- Markdown: stella (1024D, document-optimized)
- Source: jina-code (768D, code-optimized)
**Performance:**
- PDF: ~10-30 PDFs/minute (depends on OCR workload)
- Markdown: ~50-100 files/minute (depends on file size)
- Source: 100-200 files/second (depends on file size)
- Batch upload: 100-200 chunks per batch
- Parallel workers: 4 (adjustable with --workers for PDF/source)
- **GPU acceleration**: 1.5-3x speedup (enabled by default, use --no-gpu to disable)
**GPU Acceleration:**
GPU acceleration is **enabled by default** for faster embedding generation:
- **Apple Silicon**: MPS (Metal Performance Shaders) backend
- **NVIDIA GPUs**: CUDA backend
- **CPU fallback**: Automatic if GPU unavailable
- **Disable GPU**: Use --no-gpu flag (for thermal/battery concerns)
**Compatible models** (verified with GPU support):
- stella (recommended for PDFs/markdown) - Full MPS support
- jina-code (recommended for source code) - Full MPS support
- bge-small, bge-base - CoreML support
**Offline Mode:**
Use --offline for corporate proxies or SSL issues:
- Requires models pre-downloaded: `arc models download`
- No network calls during indexing
- Fails if model not cached
**Related Commands:**
- /collection create - Create collection before indexing
- /search semantic - Search indexed content
- /corpus create - Create both vector + full-text indexes
**Debug Mode:**
Use --debug to troubleshoot indexing issues:
- Shows all library warnings (including HuggingFace transformers)
- Displays detailed stack traces
- Helps diagnose model loading or GPU issues
- Use --verbose for user-facing progress without library warnings
**Implementation:**
- RDR-004: PDF bulk indexing
- RDR-005: Git-aware source code indexing
- RDR-013: Performance optimization with GPU acceleration
- RDR-014: Markdown content indexing
- RDR-006: Claude Code integration

commands/models.md (new file, 96 lines)
---
description: Manage embedding models
argument-hint: <list> [options]
---
Manage and view available embedding models for vector search.
**Subcommands:**
- list: List all available embedding models with details
**Options:**
- --json: Output in JSON format
**Examples:**
```text
/models list
/models list --json
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc models $ARGUMENTS
```
**Available Models:**
The list command shows:
- Model name (for --model flags)
- Dimensions (vector size)
- Backend (fastembed, sentence-transformers)
- Best use case (PDFs, code, general)
- Model ID (HuggingFace identifier)
**Current Models:**
**For Documents/PDFs:**
- **stella** (1024D): Best for documents, PDFs, general text
- **bge-large** (1024D): General purpose, high quality
- **modernbert** (1024D): Newer general-purpose model
**For Source Code:**
- **jina-code** (768D): Optimized for code, cross-language
- **jina-v2-code** (768D): Alternative code model
**For General Use:**
- **bge** (1024D): High-quality general embeddings
- **bge-small** (384D): Faster, smaller, lower quality
**Model Selection Tips:**
1. **Match content type:**
- PDFs/docs → stella or modernbert
- Source code → jina-code
- Mixed → stella or bge
2. **Consider dimensions:**
- Higher dimensions (1024D) = better quality, more storage
- Lower dimensions (384D, 768D) = faster, less storage
3. **Backend matters:**
- fastembed: Faster, optimized, limited models
- sentence-transformers: More models, HuggingFace ecosystem
4. **Collection consistency:**
- Use same model for all documents in a collection
- Cannot mix dimensions in one vector space
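Because dimensions cannot be mixed, it can help to check a collection's configured vectors before indexing with a different model. A sketch against Qdrant's REST API (assumes a local instance on the default port; `MyDocs` is a placeholder name):

```shell
# Read back a collection's named vectors and their sizes from Qdrant
qdrant="http://localhost:6333"
if curl -sf "$qdrant/healthz" >/dev/null 2>&1; then
  # .result.config.params.vectors lists each named vector and its dimension
  curl -s "$qdrant/collections/MyDocs"
else
  echo "Qdrant not reachable at $qdrant"
fi
```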
**Downloading Models:**
Models auto-download on first use (~1-2GB):
- Cached in ~/.arcaneum/models/
- Reused across indexing operations
- Use --offline flag to require cached models
**Pre-download for offline use:**
```bash
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('jinaai/jina-embeddings-v2-base-code')"
```
**Related Commands:**
- /collection create - Create collection with specific model
- /index pdf - Index with model selection
- /index code - Index with model selection
**Implementation:**
- RDR-002: Embedding client architecture
- RDR-006: Model listing CLI
- arcaneum-142: Multi-backend support

commands/search.md (new file, 92 lines)
---
description: Search across collections
argument-hint: <semantic|text> <query> [options]
---
Search your indexed content using vector-based semantic search or keyword-based full-text search.
**Subcommands:**
- semantic: Vector-based semantic search (Qdrant)
- text: Keyword-based full-text search (MeiliSearch)
**Common Options:**
- --limit: Number of results to return (default: 10)
- --offset: Number of results to skip for pagination (default: 0)
- --filter: Metadata filter (key=value or JSON)
- --json: Output in JSON format
- --verbose: Show detailed information
**Semantic Search Options:**
- --collection: Collection to search (required)
- --vector-name: Vector name (auto-detected if not specified)
- --score-threshold: Minimum similarity score
**Full-Text Search Options:**
- --index: MeiliSearch index name (required)
**Examples:**
```text
# Basic semantic search
/search semantic "authentication logic" --collection MyCode --limit 5
# Full-text keyword search
/search text "def authenticate" --index MyCode-fulltext
# Search with score threshold
/search semantic "fraud detection patterns" --collection PDFs --score-threshold 0.7
# Pagination: Get second page of results
/search semantic "machine learning" --collection Papers --limit 10 --offset 10
# Pagination: Get third page with JSON output
/search semantic "neural networks" --collection Papers --limit 10 --offset 20 --json
```
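The limit/offset pattern above generalizes to paging: the offset advances by the page size on each request. A sketch that just prints the commands for the first three pages (collection name is illustrative):

```shell
# Page through results 10 at a time by advancing --offset
pages=$(for offset in 0 10 20; do
  echo "arc search semantic 'neural networks' --collection Papers --limit 10 --offset $offset"
done)
echo "$pages"
```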
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc search $ARGUMENTS
```
**When to Use Each:**
**Semantic Search** (vector-based):
- Finding conceptually similar code/documents
- Cross-language semantic matching
- "What does this do" or "How do I" questions
- Fuzzy concept matching
**Full-Text Search** (keyword-based):
- Exact keyword or phrase matching
- Function/variable name search
- Quoted phrase search
- Boolean operators (AND, OR, NOT)
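For comparison, the same full-text index can be queried over MeiliSearch's search endpoint directly (assumes a local instance on the default port and the index name from the example above):

```shell
# Direct phrase query against MeiliSearch; quotes force exact-phrase matching
meili="http://localhost:7700"
if curl -sf "$meili/health" >/dev/null 2>&1; then
  curl -s -X POST "$meili/indexes/MyCode-fulltext/search" \
    -H 'Content-Type: application/json' \
    --data '{"q": "\"def authenticate\"", "limit": 5}'
else
  echo "MeiliSearch not reachable at $meili"
fi
```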
**Result Format:**
Both commands show:
- Relevance score (similarity for semantic, rank for text)
- Source file path
- Matching content snippet
- Metadata (git info for code, page numbers for PDFs)
**Related Commands:**
- /collection list - See available collections
- /index pdf - Index PDFs for searching
- /index code - Index code for searching
**Implementation:**
- RDR-007: Semantic search via Qdrant
- RDR-012: Full-text search via MeiliSearch
- RDR-006: Claude Code integration

commands/store.md (new file, 81 lines)
---
description: Store agent-generated content for long-term memory
argument-hint: <file | - for stdin> --collection <name> [options]
---
Store agent-generated content (research, analysis, synthesized information) with rich
metadata. Content is persisted to disk for re-indexing and full-text retrieval, then
indexed to Qdrant for semantic search.
**Storage Location:** `~/.arcaneum/agent-memory/{collection}/`
**Options:**
- --collection: Target collection (required)
- --model: Embedding model (default: stella for documents)
- --title: Document title (added to frontmatter)
- --category: Document category (e.g., research, security, analysis)
- --tags: Comma-separated tags
- --metadata: Additional metadata as JSON
- --chunk-size: Target chunk size in tokens (overrides model default)
- --chunk-overlap: Overlap between chunks in tokens
- --verbose: Show detailed progress
- --json: Output in JSON format
**Examples:**
```text
/store analysis.md --collection Memory --title "Security Analysis" --category security
/store - --collection Research --title "Findings" --tags "research,important"
```
**Execution:**
```bash
cd ${CLAUDE_PLUGIN_ROOT}
arc store $ARGUMENTS
```
**How It Works:**
1. Accept content from file or stdin (`-`)
2. Extract/add rich metadata (title, category, tags, custom fields)
3. Semantic chunking preserving document structure
4. Generate embeddings (stella default: 1024D for documents)
5. Upload to Qdrant with metadata
6. Persist to disk: `~/.arcaneum/agent-memory/{collection}/{date}_{agent}_{slug}.md`
7. Generate YAML frontmatter with injection metadata (injection_id, injected_at, injected_by)
**Persistence:**
Content is always persisted for durability. This enables:
- Re-indexing: Update embeddings without losing original content
- Full-text retrieval: Access complete original documents
- Audit trail: Track what was stored and when (injection_id, timestamps)
**Filename Format:**
`YYYYMMDD_agent_slug.md` (e.g., `20251030_claude_security-analysis.md`)
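The filename pattern can be reproduced with standard tools (the slug derivation here, lowercasing and hyphenating the title, is an assumption about the implementation):

```shell
# Reconstruct the YYYYMMDD_agent_slug.md naming scheme
date_part=$(date +%Y%m%d)
agent="claude"
slug=$(echo "Security Analysis" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
echo "${date_part}_${agent}_${slug}.md"
```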
**Use Cases:**
- AI agents storing research findings
- Preserving analysis results
- Collecting synthesized information
- Building knowledge bases from agent workflows
**Default Model:**
- stella (1024D, document-optimized)
**Related Commands:**
- /collection create - Create collection before storing (use --type markdown)
- /search semantic - Search stored content
- /index markdown - For indexing existing markdown directories (different use case)
**Implementation:**
- RDR-014: Markdown content indexing
- arcaneum-204: Direct injection persistence module