200 lines
6.7 KiB
Markdown
200 lines
6.7 KiB
Markdown
---
|
|
description: Index content into collections
|
|
argument-hint: <pdf|code|markdown> [<path> | --from-file <file>] [options]
|
|
---
|
|
|
|
Index PDFs, markdown, or source code into Qdrant collections for semantic search.
|
|
|
|
**Subcommands:**
|
|
|
|
- pdf: Index PDF documents (with OCR support)
|
|
- markdown: Index markdown files (with frontmatter extraction)
|
|
- code: Index source code repositories (git-aware)
|
|
|
|
**Common Options:**
|
|
|
|
- --collection: Target collection (required)
|
|
- --from-file: Read file paths from list (one per line, or "-" for stdin)
|
|
- --model: Embedding model (auto-selected by content type)
|
|
- --workers: Parallel workers (default: 4)
|
|
- --force: Force reindex all files
|
|
- --randomize: Randomize file processing order (useful for parallel indexing)
|
|
- --no-gpu: Disable GPU acceleration (GPU enabled by default)
|
|
- --verbose: Show detailed progress (suppress library warnings)
|
|
- --debug: Show all library warnings including transformers
|
|
- --json: Output in JSON format
|
|
|
|
**PDF Indexing Options:**
|
|
|
|
- --no-ocr: Disable OCR (enabled by default for scanned PDFs)
|
|
- --ocr-language: OCR language code (default: eng)
|
|
- --ocr-workers: Parallel OCR workers (default: cpu_count)
|
|
- --normalize-only: Skip markdown conversion, only normalize whitespace
|
|
- --preserve-images: Extract images for multimodal search
|
|
- --process-priority: Process scheduling priority (low, normal, high)
|
|
- --embedding-batch-size: Batch size for embeddings (auto-tuned if not specified)
|
|
- --offline: Use cached models only (no network)
|
|
|
|
**Markdown Indexing Options:**
|
|
|
|
- --chunk-size: Target chunk size in tokens (overrides model default)
|
|
- --chunk-overlap: Overlap between chunks in tokens
|
|
- --recursive/--no-recursive: Search subdirectories recursively (default: recursive)
|
|
- --exclude: Patterns to exclude (e.g., node_modules, .obsidian)
|
|
- --offline: Use cached models only (no network)
|
|
|
|
**Source Code Indexing Options:**
|
|
|
|
- --depth: Git discovery depth (traverse subdirectories)
|
|
|
|
**Examples:**
|
|
|
|
```text
|
|
# Basic indexing (GPU enabled by default)
|
|
/index pdf ~/Documents/Research --collection PDFs --model stella
|
|
/index markdown ~/notes --collection Notes --model stella
|
|
/index code ~/projects/myapp --collection MyCode --model jina-code
|
|
|
|
# Index from file list
|
|
/index pdf --from-file /path/to/pdf_list.txt --collection PDFs
|
|
/index markdown --from-file /path/to/md_list.txt --collection Notes
|
|
|
|
# Index from stdin (pipe file paths)
|
|
find ~/Documents -name "*.pdf" | /index pdf --from-file - --collection PDFs
|
|
ls ~/notes/*.md | /index markdown --from-file - --collection Notes
|
|
|
|
# With options
|
|
/index markdown ~/docs --collection Docs --chunk-size 512 --verbose
|
|
/index pdf ~/scanned-docs --collection Scans --no-ocr --offline
|
|
|
|
# Force CPU-only mode (disable GPU)
|
|
/index pdf ~/Documents/Research --collection PDFs --model stella --no-gpu
|
|
|
|
# Parallel indexing from multiple terminals (randomize order)
|
|
/index pdf ~/Documents/Research --collection PDFs --randomize
|
|
/index markdown ~/notes --collection Notes --randomize
|
|
|
|
# Debug mode (show all warnings)
|
|
/index pdf ~/Documents/Research --collection PDFs --model stella --debug
|
|
```
|
|
|
|
**Execution:**
|
|
|
|
```bash
|
|
cd ${CLAUDE_PLUGIN_ROOT}
|
|
arc index $ARGUMENTS
|
|
```
|
|
|
|
**File List Format (--from-file):**
|
|
|
|
When using `--from-file`, provide a text file with one file path per line:
|
|
|
|
```text
|
|
# Comments are supported (lines starting with #)
|
|
/absolute/path/to/file1.pdf
|
|
relative/path/to/file2.md
|
|
/another/file3.pdf
|
|
|
|
# Empty lines are ignored
|
|
```
|
|
|
|
Features:
|
|
|
|
- Supports both absolute and relative paths
|
|
- Relative paths resolved from current directory
|
|
- Comments (lines starting with #) and empty lines are skipped
|
|
- Non-existent files are warned about but processing continues
|
|
- Wrong file extensions are filtered with warnings
|
|
- Use "-" to read from stdin
|
|
|
|
**How It Works:**
|
|
|
|
**PDF Indexing:**
|
|
|
|
1. Extract text from PDFs (PyMuPDF + pdfplumber fallback)
|
|
2. Auto-trigger OCR for scanned PDFs (< 100 chars extracted)
|
|
3. Chunk text with 15% overlap for context
|
|
4. Generate embeddings (stella default: 1024D for documents)
|
|
5. Upload to Qdrant with metadata (file path, page numbers)
|
|
6. Incremental: Skips unchanged files (file hash metadata check)
|
|
|
|
**Markdown Indexing:**
|
|
|
|
1. Discover markdown files (.md, .markdown extensions)
|
|
2. Extract YAML frontmatter (title, author, tags, category, etc.)
|
|
3. Semantic chunking preserving document structure (headers, code blocks)
|
|
4. Generate embeddings (stella default: 1024D for documents)
|
|
5. Upload to Qdrant with metadata (file path, frontmatter fields, header context)
|
|
6. Incremental: Skips unchanged files (SHA256 content hash check)
|
|
|
|
**Source Code Indexing:**
|
|
|
|
1. Discover git repositories in directory tree
|
|
2. Extract git metadata (project, branch, commit)
|
|
3. Parse code with tree-sitter (AST-aware chunking, 15+ languages)
|
|
4. Generate embeddings (jina-code default: 768D for code)
|
|
5. Upload to Qdrant with metadata (git info, file path, language)
|
|
6. Multi-branch support: project#branch identifier
|
|
7. Incremental: Skips unchanged commits (metadata-based sync)
|
|
|
|
**Default Models:**
|
|
|
|
- PDFs: stella (1024D, document-optimized)
|
|
- Markdown: stella (1024D, document-optimized)
|
|
- Source: jina-code (768D, code-optimized)
|
|
|
|
**Performance:**
|
|
|
|
- PDF: ~10-30 PDFs/minute (depends on OCR workload)
|
|
- Markdown: ~50-100 files/minute (depends on file size)
|
|
- Source: 100-200 files/second (depends on file size)
|
|
- Batch upload: 100-200 chunks per batch
|
|
- Parallel workers: 4 (adjustable with --workers for PDF/source)
|
|
- **GPU acceleration**: 1.5-3x speedup (enabled by default, use --no-gpu to disable)
|
|
|
|
**GPU Acceleration:**
|
|
|
|
GPU acceleration is **enabled by default** for faster embedding generation:
|
|
|
|
- **Apple Silicon**: MPS (Metal Performance Shaders) backend
|
|
- **NVIDIA GPUs**: CUDA backend
|
|
- **CPU fallback**: Automatic if GPU unavailable
|
|
- **Disable GPU**: Use --no-gpu flag (for thermal/battery concerns)
|
|
|
|
**Compatible models** (verified with GPU support):
|
|
|
|
- stella (recommended for PDFs/markdown) - Full MPS support
|
|
- jina-code (recommended for source code) - Full MPS support
|
|
- bge-small, bge-base - CoreML support
|
|
|
|
**Offline Mode:**
|
|
|
|
Use --offline for corporate proxies or SSL issues:
|
|
|
|
- Requires models pre-downloaded: `arc models download`
|
|
- No network calls during indexing
|
|
- Fails if model not cached
|
|
|
|
**Related Commands:**
|
|
|
|
- /collection create - Create collection before indexing
|
|
- /search semantic - Search indexed content
|
|
- /corpus create - Create both vector + full-text indexes
|
|
|
|
**Debug Mode:**
|
|
|
|
Use --debug to troubleshoot indexing issues:
|
|
|
|
- Shows all library warnings (including HuggingFace transformers)
|
|
- Displays detailed stack traces
|
|
- Helps diagnose model loading or GPU issues
|
|
- Use --verbose for user-facing progress without library warnings
|
|
|
|
**Implementation:**
|
|
|
|
- RDR-004: PDF bulk indexing
|
|
- RDR-005: Git-aware source code indexing
|
|
- RDR-013: Performance optimization with GPU acceleration
|
|
- RDR-014: Markdown content indexing
|
|
- RDR-006: Claude Code integration
|