Files
2025-11-29 18:17:12 +08:00

6.7 KiB

description, argument-hint
description argument-hint
Index content into collections <pdf|code|markdown> [<path> | --from-file <file>] [options]

Index PDFs, markdown, or source code into Qdrant collections for semantic search.

Subcommands:

  • pdf: Index PDF documents (with OCR support)
  • markdown: Index markdown files (with frontmatter extraction)
  • code: Index source code repositories (git-aware)

Common Options:

  • --collection: Target collection (required)
  • --from-file: Read file paths from list (one per line, or "-" for stdin)
  • --model: Embedding model (auto-selected by content type)
  • --workers: Parallel workers (default: 4)
  • --force: Force reindex all files
  • --randomize: Randomize file processing order (useful for parallel indexing)
  • --no-gpu: Disable GPU acceleration (GPU enabled by default)
  • --verbose: Show detailed progress (suppress library warnings)
  • --debug: Show all library warnings including transformers
  • --json: Output in JSON format

PDF Indexing Options:

  • --no-ocr: Disable OCR (enabled by default for scanned PDFs)
  • --ocr-language: OCR language code (default: eng)
  • --ocr-workers: Parallel OCR workers (default: cpu_count)
  • --normalize-only: Skip markdown conversion, only normalize whitespace
  • --preserve-images: Extract images for multimodal search
  • --process-priority: Process scheduling priority (low, normal, high)
  • --embedding-batch-size: Batch size for embeddings (auto-tuned if not specified)
  • --offline: Use cached models only (no network)

Markdown Indexing Options:

  • --chunk-size: Target chunk size in tokens (overrides model default)
  • --chunk-overlap: Overlap between chunks in tokens
  • --recursive/--no-recursive: Search subdirectories recursively (default: recursive)
  • --exclude: Patterns to exclude (e.g., node_modules, .obsidian)
  • --offline: Use cached models only (no network)

Source Code Indexing Options:

  • --depth: Git discovery depth (traverse subdirectories)

Examples:

# Basic indexing (GPU enabled by default)
/index pdf ~/Documents/Research --collection PDFs --model stella
/index markdown ~/notes --collection Notes --model stella
/index code ~/projects/myapp --collection MyCode --model jina-code

# Index from file list
/index pdf --from-file /path/to/pdf_list.txt --collection PDFs
/index markdown --from-file /path/to/md_list.txt --collection Notes

# Index from stdin (pipe file paths)
find ~/Documents -name "*.pdf" | /index pdf --from-file - --collection PDFs
ls ~/notes/*.md | /index markdown --from-file - --collection Notes

# With options
/index markdown ~/docs --collection Docs --chunk-size 512 --verbose
/index pdf ~/scanned-docs --collection Scans --no-ocr --offline

# Force CPU-only mode (disable GPU)
/index pdf ~/Documents/Research --collection PDFs --model stella --no-gpu

# Parallel indexing from multiple terminals (randomize order)
/index pdf ~/Documents/Research --collection PDFs --randomize
/index markdown ~/notes --collection Notes --randomize

# Debug mode (show all warnings)
/index pdf ~/Documents/Research --collection PDFs --model stella --debug

Execution:

cd ${CLAUDE_PLUGIN_ROOT}
arc index $ARGUMENTS

File List Format (--from-file):

When using --from-file, provide a text file with one file path per line:

# Comments are supported (lines starting with #)
/absolute/path/to/file1.pdf
relative/path/to/file2.md
/another/file3.pdf

# Empty lines are ignored

Features:

  • Supports both absolute and relative paths
  • Relative paths resolved from current directory
  • Comments (lines starting with #) and empty lines are skipped
  • Non-existent files are warned about but processing continues
  • Wrong file extensions are filtered with warnings
  • Use "-" to read from stdin

How It Works:

PDF Indexing:

  1. Extract text from PDFs (PyMuPDF + pdfplumber fallback)
  2. Auto-trigger OCR for scanned PDFs (< 100 chars extracted)
  3. Chunk text with 15% overlap for context
  4. Generate embeddings (stella default: 1024D for documents)
  5. Upload to Qdrant with metadata (file path, page numbers)
  6. Incremental: Skips unchanged files (file hash metadata check)

Markdown Indexing:

  1. Discover markdown files (.md, .markdown extensions)
  2. Extract YAML frontmatter (title, author, tags, category, etc.)
  3. Semantic chunking preserving document structure (headers, code blocks)
  4. Generate embeddings (stella default: 1024D for documents)
  5. Upload to Qdrant with metadata (file path, frontmatter fields, header context)
  6. Incremental: Skips unchanged files (SHA256 content hash check)

Source Code Indexing:

  1. Discover git repositories in directory tree
  2. Extract git metadata (project, branch, commit)
  3. Parse code with tree-sitter (AST-aware chunking, 15+ languages)
  4. Generate embeddings (jina-code default: 768D for code)
  5. Upload to Qdrant with metadata (git info, file path, language)
  6. Multi-branch support: project#branch identifier
  7. Incremental: Skips unchanged commits (metadata-based sync)

Default Models:

  • PDFs: stella (1024D, document-optimized)
  • Markdown: stella (1024D, document-optimized)
  • Source: jina-code (768D, code-optimized)

Performance:

  • PDF: ~10-30 PDFs/minute (depends on OCR workload)
  • Markdown: ~50-100 files/minute (depends on file size)
  • Source: 100-200 files/second (depends on file size)
  • Batch upload: 100-200 chunks per batch
  • Parallel workers: 4 (adjustable with --workers for PDF/source)
  • GPU acceleration: 1.5-3x speedup (enabled by default, use --no-gpu to disable)

GPU Acceleration:

GPU acceleration is enabled by default for faster embedding generation:

  • Apple Silicon: MPS (Metal Performance Shaders) backend
  • NVIDIA GPUs: CUDA backend
  • CPU fallback: Automatic if GPU unavailable
  • Disable GPU: Use --no-gpu flag (for thermal/battery concerns)

Compatible models (verified with GPU support):

  • stella (recommended for PDFs/markdown) - Full MPS support
  • jina-code (recommended for source code) - Full MPS support
  • bge-small, bge-base - CoreML support

Offline Mode:

Use --offline for corporate proxies or SSL issues:

  • Requires models pre-downloaded: arc models download
  • No network calls during indexing
  • Fails if model not cached

Related Commands:

  • /collection create - Create collection before indexing
  • /search semantic - Search indexed content
  • /corpus create - Create both vector + full-text indexes

Debug Mode:

Use --debug to troubleshoot indexing issues:

  • Shows all library warnings (including HuggingFace transformers)
  • Displays detailed stack traces
  • Helps diagnose model loading or GPU issues
  • Use --verbose for user-facing progress without library warnings

Implementation:

  • RDR-004: PDF bulk indexing
  • RDR-005: Git-aware source code indexing
  • RDR-013: Performance optimization with GPU acceleration
  • RDR-014: Markdown content indexing
  • RDR-006: Claude Code integration