--- description: Index content into collections argument-hint: [ | --from-file ] [options] --- Index PDFs, markdown, or source code into Qdrant collections for semantic search. **Subcommands:** - pdf: Index PDF documents (with OCR support) - markdown: Index markdown files (with frontmatter extraction) - code: Index source code repositories (git-aware) **Common Options:** - --collection: Target collection (required) - --from-file: Read file paths from list (one per line, or "-" for stdin) - --model: Embedding model (auto-selected by content type) - --workers: Parallel workers (default: 4) - --force: Force reindex all files - --randomize: Randomize file processing order (useful for parallel indexing) - --no-gpu: Disable GPU acceleration (GPU enabled by default) - --verbose: Show detailed progress (suppress library warnings) - --debug: Show all library warnings including transformers - --json: Output in JSON format **PDF Indexing Options:** - --no-ocr: Disable OCR (enabled by default for scanned PDFs) - --ocr-language: OCR language code (default: eng) - --ocr-workers: Parallel OCR workers (default: cpu_count) - --normalize-only: Skip markdown conversion, only normalize whitespace - --preserve-images: Extract images for multimodal search - --process-priority: Process scheduling priority (low, normal, high) - --embedding-batch-size: Batch size for embeddings (auto-tuned if not specified) - --offline: Use cached models only (no network) **Markdown Indexing Options:** - --chunk-size: Target chunk size in tokens (overrides model default) - --chunk-overlap: Overlap between chunks in tokens - --recursive/--no-recursive: Search subdirectories recursively (default: recursive) - --exclude: Patterns to exclude (e.g., node_modules, .obsidian) - --offline: Use cached models only (no network) **Source Code Indexing Options:** - --depth: Git discovery depth (traverse subdirectories) **Examples:** ```text # Basic indexing (GPU enabled by default) /index pdf ~/Documents/Research --collection PDFs --model stella /index markdown ~/notes --collection Notes --model stella /index code ~/projects/myapp --collection MyCode --model jina-code # Index from file list /index pdf --from-file /path/to/pdf_list.txt --collection PDFs /index markdown --from-file /path/to/md_list.txt --collection Notes # Index from stdin (pipe file paths) find ~/Documents -name "*.pdf" | /index pdf --from-file - --collection PDFs ls ~/notes/*.md | /index markdown --from-file - --collection Notes # With options /index markdown ~/docs --collection Docs --chunk-size 512 --verbose /index pdf ~/scanned-docs --collection Scans --no-ocr --offline # Force CPU-only mode (disable GPU) /index pdf ~/Documents/Research --collection PDFs --model stella --no-gpu # Parallel indexing from multiple terminals (randomize order) /index pdf ~/Documents/Research --collection PDFs --randomize /index markdown ~/notes --collection Notes --randomize # Debug mode (show all warnings) /index pdf ~/Documents/Research --collection PDFs --model stella --debug ``` **Execution:** ```bash cd ${CLAUDE_PLUGIN_ROOT} arc index $ARGUMENTS ``` **File List Format (--from-file):** When using `--from-file`, provide a text file with one file path per line: ```text # Comments are supported (lines starting with #) /absolute/path/to/file1.pdf relative/path/to/file2.md /another/file3.pdf # Empty lines are ignored ``` Features: - Supports both absolute and relative paths - Relative paths resolved from current directory - Comments (lines starting with #) and empty lines are skipped - Non-existent files are warned about but processing continues - Wrong file extensions are filtered with warnings - Use "-" to read from stdin **How It Works:** **PDF Indexing:** 1. Extract text from PDFs (PyMuPDF + pdfplumber fallback) 2. Auto-trigger OCR for scanned PDFs (< 100 chars extracted) 3. Chunk text with 15% overlap for context 4. Generate embeddings (stella default: 1024D for documents) 5. Upload to Qdrant with metadata (file path, page numbers) 6. Incremental: Skips unchanged files (file hash metadata check) **Markdown Indexing:** 1. Discover markdown files (.md, .markdown extensions) 2. Extract YAML frontmatter (title, author, tags, category, etc.) 3. Semantic chunking preserving document structure (headers, code blocks) 4. Generate embeddings (stella default: 1024D for documents) 5. Upload to Qdrant with metadata (file path, frontmatter fields, header context) 6. Incremental: Skips unchanged files (SHA256 content hash check) **Source Code Indexing:** 1. Discover git repositories in directory tree 2. Extract git metadata (project, branch, commit) 3. Parse code with tree-sitter (AST-aware chunking, 15+ languages) 4. Generate embeddings (jina-code default: 768D for code) 5. Upload to Qdrant with metadata (git info, file path, language) 6. Multi-branch support: project#branch identifier 7. Incremental: Skips unchanged commits (metadata-based sync) **Default Models:** - PDFs: stella (1024D, document-optimized) - Markdown: stella (1024D, document-optimized) - Source: jina-code (768D, code-optimized) **Performance:** - PDF: ~10-30 PDFs/minute (depends on OCR workload) - Markdown: ~50-100 files/minute (depends on file size) - Source: 100-200 files/second (depends on file size) - Batch upload: 100-200 chunks per batch - Parallel workers: 4 (adjustable with --workers for PDF/source) - **GPU acceleration**: 1.5-3x speedup (enabled by default, use --no-gpu to disable) **GPU Acceleration:** GPU acceleration is **enabled by default** for faster embedding generation: - **Apple Silicon**: MPS (Metal Performance Shaders) backend - **NVIDIA GPUs**: CUDA backend - **CPU fallback**: Automatic if GPU unavailable - **Disable GPU**: Use --no-gpu flag (for thermal/battery concerns) **Compatible models** (verified with GPU support): - stella (recommended for PDFs/markdown) - Full MPS support - jina-code (recommended for source code) - Full MPS support - bge-small, bge-base - CoreML support **Offline Mode:** Use --offline for corporate proxies or SSL issues: - Requires models pre-downloaded: `arc models download` - No network calls during indexing - Fails if model not cached **Related Commands:** - /collection create - Create collection before indexing - /search semantic - Search indexed content - /corpus create - Create both vector + full-text indexes **Debug Mode:** Use --debug to troubleshoot indexing issues: - Shows all library warnings (including HuggingFace transformers) - Displays detailed stack traces - Helps diagnose model loading or GPU issues - Use --verbose for user-facing progress without library warnings **Implementation:** - RDR-004: PDF bulk indexing - RDR-005: Git-aware source code indexing - RDR-013: Performance optimization with GPU acceleration - RDR-014: Markdown content indexing - RDR-006: Claude Code integration