gh-cwensel-arcaneum/commands/index.md at master

Zhongwei Li b923023331 Initial commit

2025-11-29 18:17:12 +08:00

6.7 KiB

description: Index content into collections
argument-hint: <pdf|code|markdown> [<path> | --from-file <file>] [options]

Index PDFs, markdown, or source code into Qdrant collections for semantic search.

Subcommands:

pdf: Index PDF documents (with OCR support)
markdown: Index markdown files (with frontmatter extraction)
code: Index source code repositories (git-aware)

Common Options:

--collection: Target collection (required)
--from-file: Read file paths from list (one per line, or "-" for stdin)
--model: Embedding model (auto-selected by content type)
--workers: Parallel workers (default: 4)
--force: Force reindex all files
--randomize: Randomize file processing order (useful for parallel indexing)
--no-gpu: Disable GPU acceleration (GPU enabled by default)
--verbose: Show detailed progress (suppress library warnings)
--debug: Show all library warnings including transformers
--json: Output in JSON format

PDF Indexing Options:

--no-ocr: Disable OCR (enabled by default for scanned PDFs)
--ocr-language: OCR language code (default: eng)
--ocr-workers: Parallel OCR workers (default: cpu_count)
--normalize-only: Skip markdown conversion, only normalize whitespace
--preserve-images: Extract images for multimodal search
--process-priority: Process scheduling priority (low, normal, high)
--embedding-batch-size: Batch size for embeddings (auto-tuned if not specified)
--offline: Use cached models only (no network)

Markdown Indexing Options:

--chunk-size: Target chunk size in tokens (overrides model default)
--chunk-overlap: Overlap between chunks in tokens
--recursive/--no-recursive: Search subdirectories recursively (default: recursive)
--exclude: Patterns to exclude (e.g., node_modules, .obsidian)
--offline: Use cached models only (no network)

Source Code Indexing Options:

--depth: Git discovery depth (how many subdirectory levels to traverse when looking for repositories)

Examples:

# Basic indexing (GPU enabled by default)
/index pdf ~/Documents/Research --collection PDFs --model stella
/index markdown ~/notes --collection Notes --model stella
/index code ~/projects/myapp --collection MyCode --model jina-code

# Index from file list
/index pdf --from-file /path/to/pdf_list.txt --collection PDFs
/index markdown --from-file /path/to/md_list.txt --collection Notes

# Index from stdin (pipe file paths; shell pipes go to the underlying arc CLI, not the slash command)
find ~/Documents -name "*.pdf" | arc index pdf --from-file - --collection PDFs
ls ~/notes/*.md | arc index markdown --from-file - --collection Notes

# With options
/index markdown ~/docs --collection Docs --chunk-size 512 --verbose
/index pdf ~/scanned-docs --collection Scans --no-ocr --offline

# Force CPU-only mode (disable GPU)
/index pdf ~/Documents/Research --collection PDFs --model stella --no-gpu

# Parallel indexing from multiple terminals (randomize order)
/index pdf ~/Documents/Research --collection PDFs --randomize
/index markdown ~/notes --collection Notes --randomize

# Debug mode (show all warnings)
/index pdf ~/Documents/Research --collection PDFs --model stella --debug

Execution:

cd ${CLAUDE_PLUGIN_ROOT}
arc index $ARGUMENTS

File List Format (--from-file):

When using --from-file, provide a text file with one file path per line:

# Comments are supported (lines starting with #)
/absolute/path/to/file1.pdf
relative/path/to/file2.md
/another/file3.pdf

# Empty lines are ignored

Features:

Supports both absolute and relative paths
Relative paths resolved from current directory
Comments (lines starting with #) and empty lines are skipped
Nonexistent files produce a warning, and processing continues
Files with the wrong extension for the subcommand are skipped with a warning
Use "-" to read from stdin
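
The rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the code `arc` actually runs: the function name `read_file_list`, its parameters, and the order of the extension and existence checks are assumptions.

```python
from pathlib import Path


def read_file_list(lines, allowed_suffixes, base_dir=None):
    """Return (paths, warnings) for a --from-file style listing.

    Hypothetical helper: names and check order are illustrative only.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    paths, warnings = [], []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # empty lines and comments are skipped
        path = Path(entry).expanduser()
        if not path.is_absolute():
            path = base / path  # relative paths resolve from the base directory
        if path.suffix.lower() not in allowed_suffixes:
            warnings.append(f"skipped (wrong extension): {path}")
            continue
        if not path.is_file():
            warnings.append(f"skipped (not found): {path}")
            continue  # warn, then keep processing the remaining entries
        paths.append(path)
    return paths, warnings
```

For the "-" case, the same function could be fed `sys.stdin` instead of an open file, since both yield one line per iteration.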

How It Works:

PDF Indexing:

Extract text from PDFs (PyMuPDF + pdfplumber fallback)
Auto-trigger OCR for scanned PDFs (< 100 chars extracted)
Chunk text with 15% overlap for context
Generate embeddings (stella default: 1024D for documents)
Upload to Qdrant with metadata (file path, page numbers)
Incremental: Skips unchanged files (file hash metadata check)
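
A minimal sketch of the OCR trigger and the overlapping chunker described above. The 100-character threshold and the 15% overlap come from the text; the function names, the whitespace stripping, and the choice to chunk a plain token list are assumptions made for illustration.

```python
OCR_TRIGGER_CHARS = 100  # from the text: fewer than 100 extracted characters
OVERLAP_RATIO = 0.15     # from the text: 15% overlap between chunks


def needs_ocr(extracted_text, threshold=OCR_TRIGGER_CHARS):
    """Treat a PDF as scanned when text extraction yields almost nothing."""
    return len(extracted_text.strip()) < threshold


def chunk_with_overlap(tokens, chunk_size, overlap_ratio=OVERLAP_RATIO):
    """Split a token list into chunks that share a fraction of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # always move forward
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks
```

With a chunk size of 20, the overlap is 3 tokens, so each chunk starts 17 tokens after the previous one and repeats its last 3 tokens.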

Markdown Indexing:

Discover markdown files (.md, .markdown extensions)
Extract YAML frontmatter (title, author, tags, category, etc.)
Semantic chunking preserving document structure (headers, code blocks)
Generate embeddings (stella default: 1024D for documents)
Upload to Qdrant with metadata (file path, frontmatter fields, header context)
Incremental: Skips unchanged files (SHA256 content hash check)
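
The frontmatter split and the SHA256 skip check might look roughly like this. It is a standard-library sketch under stated assumptions: the helper names are hypothetical, the hash is taken over the whole file text, and `stored_hashes` stands in for whatever metadata the real indexer keeps in Qdrant. Parsing the frontmatter text into fields would additionally need a YAML parser.

```python
import hashlib


def split_frontmatter(text):
    """Return (frontmatter_text, body); frontmatter_text is '' when absent."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # no closing delimiter: treat everything as body


def content_hash(text):
    """SHA256 hex digest of the file content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(path, text, stored_hashes):
    """True when the file is new or its content hash has changed."""
    return stored_hashes.get(path) != content_hash(text)
```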

Source Code Indexing:

Discover git repositories in directory tree
Extract git metadata (project, branch, commit)
Parse code with tree-sitter (AST-aware chunking, 15+ languages)
Generate embeddings (jina-code default: 768D for code)
Upload to Qdrant with metadata (git info, file path, language)
Multi-branch support: project#branch identifier
Incremental: Skips unchanged commits (metadata-based sync)
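
Repository discovery and the project#branch identifier can be illustrated as follows. This is a sketch, not the tool's implementation: the function names are hypothetical, and the meaning given to `max_depth` here (number of directory levels below the root that are searched) is an assumption about how --depth behaves.

```python
import os
from pathlib import Path


def discover_git_repos(root, max_depth=None):
    """Find directories under root that contain a .git entry."""
    root = Path(root)
    repos = []
    for dirpath, dirnames, _ in os.walk(root):
        current = Path(dirpath)
        depth = len(current.relative_to(root).parts)
        if (current / ".git").exists():
            repos.append(current)
            dirnames[:] = []  # do not descend into a repository
            continue
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # stop descending past the depth limit
    return sorted(repos)


def branch_identifier(project, branch):
    """Composite identifier used to keep branches of one project apart."""
    return f"{project}#{branch}"
```

Pruning `dirnames` in place is what keeps `os.walk` from entering nested directories once a repository or the depth limit is reached.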

Default Models:

PDFs: stella (1024D, document-optimized)
Markdown: stella (1024D, document-optimized)
Source: jina-code (768D, code-optimized)

Performance:

PDF: ~10-30 PDFs/minute (depends on OCR workload)
Markdown: ~50-100 files/minute (depends on file size)
Source: 100-200 files/second (depends on file size)
Batch upload: 100-200 chunks per batch
Parallel workers: 4 (adjustable with --workers for PDF/source)
GPU acceleration: 1.5-3x speedup (enabled by default, use --no-gpu to disable)
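
Batch upload amounts to grouping chunks before each upload call. A minimal sketch, assuming a hypothetical `batched` helper and a batch size of 100 taken from the low end of the range above:

```python
from itertools import islice


def batched(items, batch_size=100):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch would then be handed to a single upload request, so 250 chunks result in three requests rather than 250.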

GPU Acceleration:

GPU acceleration is enabled by default for faster embedding generation:

Apple Silicon: MPS (Metal Performance Shaders) backend
NVIDIA GPUs: CUDA backend
CPU fallback: Automatic if GPU unavailable
Disable GPU: Use --no-gpu flag (for thermal/battery concerns)

Compatible models (verified with GPU support):

stella (recommended for PDFs/markdown) - Full MPS support
jina-code (recommended for source code) - Full MPS support
bge-small, bge-base - CoreML support
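
The backend selection above can be sketched as a small decision function. The function names are hypothetical, and checking CUDA before MPS is an assumption; `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are the PyTorch calls used for detection when PyTorch is installed.

```python
def pick_device(no_gpu=False, cuda_available=False, mps_available=False):
    """Choose an embedding device name from availability flags."""
    if no_gpu:
        return "cpu"  # --no-gpu wins over any detected hardware
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"  # automatic fallback when no GPU backend is found


def detect_device(no_gpu=False):
    """Probe PyTorch for GPU backends; fall back to CPU if it is missing."""
    if no_gpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent in older releases
    return pick_device(
        cuda_available=torch.cuda.is_available(),
        mps_available=bool(mps and mps.is_available()),
    )
```

Keeping the decision in `pick_device` separate from the probing makes the fallback order easy to check without GPU hardware.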

Offline Mode:

Use --offline when corporate proxies or SSL issues block model downloads:

Requires models pre-downloaded: arc models download
No network calls during indexing
Fails if model not cached

Related Commands:

/collection create - Create collection before indexing
/search semantic - Search indexed content
/corpus create - Create both vector + full-text indexes

Debug Mode:

Use --debug to troubleshoot indexing issues:

Shows all library warnings (including HuggingFace transformers)
Displays detailed stack traces
Helps diagnose model loading or GPU issues
Use --verbose for user-facing progress without library warnings

Implementation:

RDR-004: PDF bulk indexing
RDR-005: Git-aware source code indexing
RDR-013: Performance optimization with GPU acceleration
RDR-014: Markdown content indexing
RDR-006: Claude Code integration
