2025-11-29 18:47:25 +08:00

Guide for writing mcp-vector-search configuration files

mcp-vector-search Configuration Guide

Comprehensive reference for writing .mcp-vector-search/config.edn configuration files.

Overview

mcp-vector-search indexes documents using semantic embeddings and provides a search tool via the Model Context Protocol. Configuration controls which files are indexed, how they're processed, and what metadata is extracted.

Configuration File Location

Create a configuration file at one of these locations (first found is used):

  • Project-specific: .mcp-vector-search/config.edn (in project root)
  • Global: ~/.mcp-vector-search/config.edn (user home directory)

The server reads the configuration at startup and indexes all specified sources.
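
The lookup order described above can be sketched in Clojure as follows. This is a minimal illustration, not the server's actual code: `candidate-config-files` and `find-config` are hypothetical names, and the project root and home directory are passed in explicitly rather than discovered.

```clojure
(require '[clojure.java.io :as io])

(defn candidate-config-files
  "Candidate config files in lookup order: project-specific first, then global.
  Hypothetical helper for illustration only."
  [project-root home-dir]
  [(io/file project-root ".mcp-vector-search" "config.edn")
   (io/file home-dir ".mcp-vector-search" "config.edn")])

(defn find-config
  "Returns the first candidate that exists as a regular file, or nil if none does."
  [project-root home-dir]
  (->> (candidate-config-files project-root home-dir)
       (filter (fn [^java.io.File f] (.isFile f)))
       first))
```

Because the project-specific file is listed first, it would shadow a global file whenever both exist.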

Basic Configuration Structure

{:description "Custom search tool description"  ; optional
 :watch? true                                   ; optional, enable file watching
 :sources [{:path "/docs/**/*.md"
            :name "Documentation"               ; optional
            :ingest :whole-document             ; optional, defaults to :whole-document
            :watch? true                        ; optional, overrides global :watch?
            :custom-key "custom-value"}]}       ; any additional keys become metadata

Top-level keys:

  • :description - Custom description for the search tool (optional)
  • :watch? - Enable automatic re-indexing when files change (optional, default: false)
  • :sources - Vector of source configuration maps (required)
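
As a rough sketch of how these keys might be read and defaulted, assuming the file is plain EDN parsed with `clojure.edn`. The function name `parse-config` and the error message are hypothetical and not part of mcp-vector-search.

```clojure
(require '[clojure.edn :as edn])

(defn parse-config
  "Parses EDN config text, checks the required :sources key, and applies
  the documented default for :watch?. Illustrative only."
  [edn-text]
  (let [config (edn/read-string edn-text)]
    (when-not (sequential? (:sources config))
      (throw (ex-info ":sources is required and must be a sequence of source maps"
                      {:config config})))
    ;; :watch? defaults to false when omitted; :description stays optional.
    (merge {:watch? false} config)))
```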

Path Specifications

Filesystem vs Classpath Sources

Filesystem sources (:path key):

  • Use absolute paths with leading /
  • Support file watching for automatic re-indexing
  • Files are read from the filesystem

{:path "/docs/**/*.md"}

Classpath sources (:class-path key):

  • Use relative paths without leading /
  • File watching not available (read-only resources)
  • Resources discovered from classpath (JARs, resource directories)
  • Useful when embedding mcp-vector-search as a library

{:class-path "docs/**/*.md"}

Important: Sources must specify exactly one of :path or :class-path.
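
The "exactly one of" rule can be expressed as a small check. This is a sketch only: `source-kind` and its error messages are hypothetical, and the real server may validate sources differently.

```clojure
(defn source-kind
  "Returns :filesystem or :classpath for a source map, and throws when the
  source has both or neither of :path and :class-path."
  [source]
  (let [path?       (contains? source :path)
        class-path? (contains? source :class-path)]
    (cond
      (and path? class-path?)
      (throw (ex-info "Source has both :path and :class-path" {:source source}))

      path?       :filesystem
      class-path? :classpath

      :else
      (throw (ex-info "Source needs one of :path or :class-path" {:source source})))))
```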

Glob Patterns

Single-level glob (*):

  • Matches any characters within a single directory level
  • Does not match path separators

{:path "/docs/*.md"}           ; matches /docs/README.md
                                ; does NOT match /docs/api/guide.md

Recursive glob (**):

  • Matches any characters across zero or more directory levels
  • Includes path separators

{:path "/docs/**/*.md"}         ; matches /docs/README.md
                                ; matches /docs/api/guide.md
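
To make the two glob forms concrete, here is one way they could be translated into Java regular expressions. This is an assumption about the matching semantics, based only on the examples above, and not the server's implementation; `glob->regex` and `glob-token->regex` are hypothetical names, and the sketch handles only `*`, `**`, and literal dots.

```clojure
(require '[clojure.string :as str])

(def glob-token->regex
  "Regex fragment for each glob token handled by this sketch."
  {"**/" "(?:.*/)?" ; zero or more whole directory levels
   "**"  ".*"       ; anything, including path separators
   "*"   "[^/]*"    ; anything except a path separator
   "."   "\\."})    ; literal dot

(defn glob->regex
  "Translates a simple glob into a java.util.regex.Pattern.
  Longer tokens come first in the alternation so `**/` wins over `*`."
  [glob]
  (re-pattern (str/replace glob #"\*\*/|\*\*|\*|\." glob-token->regex)))
```

With this translation, `(re-matches (glob->regex "/docs/*.md") "/docs/api/guide.md")` returns nil, while the `/docs/**/*.md` pattern matches both `/docs/README.md` and `/docs/api/guide.md`.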

Named Captures

Extract metadata from file paths using named regex groups:

{:path "/docs/(?<category>[^/]+)/*.md"}

Syntax: (?<name>pattern)

  • name - Metadata key (converted to keyword)
  • pattern - Java regular expression; inside an EDN string, backslashes must be doubled (e.g., \\d+ for the regex \d+)

For file /docs/api/functions.md:

  • Captures: {:category "api"}

Multiple captures example:

{:path "/(?<project>[^/]+)/(?<version>v\\d+)/(?<file>.+\\.clj)"}

For file /myapp/v1/core.clj:

  • Captures: {:project "myapp", :version "v1", :file "core.clj"}
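
The capture behaviour can be reproduced with plain Java regex named groups. The sketch below is illustrative and assumes a pure-regex path spec (no glob tokens); `path-captures` is a hypothetical helper, and the group names are passed in explicitly because older JDKs offer no API for listing them.

```clojure
(defn path-captures
  "Matches path against pattern-string and returns a map of keywordized
  group names to captured strings, or nil when the path does not match."
  [pattern-string group-names path]
  (let [m (re-matcher (re-pattern pattern-string) path)]
    (when (.matches m)
      (into {}
            (map (fn [group-name]
                   [(keyword group-name) (.group m ^String group-name)]))
            group-names))))

;; Backslashes are doubled because the pattern is written as a string,
;; exactly as it would be in config.edn.
(path-captures "/(?<project>[^/]+)/(?<version>v\\d+)/(?<file>.+\\.clj)"
               ["project" "version" "file"]
               "/myapp/v1/core.clj")
;; => {:project "myapp", :version "v1", :file "core.clj"}
```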

Path Specification Examples

;; All markdown files recursively
{:path "/docs/**/*.md"}

;; Single directory level
{:path "/docs/*.md"}

;; Capture directory name
{:path "/docs/(?<category>[^/]+)/*.md"}

;; Multiple captures
{:path "/(?<project>[^/]+)/(?<version>[^/]+)/**/*.clj"}

;; Literal file
{:path "/docs/README.md"}

;; Classpath resource (no leading /)
{:class-path "docs/**/*.md"}

Ingest Pipeline Strategies

Ingest strategies control how documents are processed, embedded, and stored. Set via the :ingest key.

:whole-document (default)

Embeds and stores the entire file content as a single segment.

{:sources [{:path "/docs/**/*.md"
            :ingest :whole-document}]}

Characteristics:

  • One segment per file
  • Both embedding and storage use full content
  • Straightforward choice for most use cases

Use when: You want to search across complete documents and return full content.

:namespace-doc

For Clojure source files: embeds only the namespace docstring but stores the full file content.

{:sources [{:path "/src/**/*.clj"
            :ingest :namespace-doc}]}

Requirements:

  • File must contain a valid ns form
  • Namespace must have a docstring

Characteristics:

  • Embedding uses only namespace docstring
  • Storage includes full file content
  • One segment per file
  • Adds :namespace to metadata (e.g., {:namespace "my.app.core"})

Use when: You want to search Clojure namespaces by their documentation while still returning the complete source code.

:file-path

Embeds the full content but stores only the file path.

{:sources [{:path "/docs/**/*.md"
            :ingest :file-path}]}

Characteristics:

  • Embedding uses full file content
  • Storage contains only the file path
  • One segment per file
  • Reduces memory footprint for large document sets

Use when:

  • You only need to discover which files match a query
  • You want to reduce memory usage
  • Your client will read file content separately

:code-analysis

Analyzes Clojure and Java source files using clj-kondo to extract code elements (vars, macros, namespaces, classes, methods, fields, constructors). Creates one searchable segment per code element.

{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}

Configuration options:

{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only        ; :all (default) | :public-only
            :element-types #{:var :macro}}]} ; optional filter

:visibility - Controls which elements to include:

  • :all (default) - Include all elements regardless of visibility
  • :public-only - Include only public elements
    • Clojure: Excludes vars with ^:private or {:private true} metadata
    • Java: Excludes members with private or protected access modifiers

:element-types (optional) - Set of element types to include:

  • Valid types: :var, :macro, :namespace, :class, :method, :field, :constructor
  • If omitted: Include all element types
  • If provided: Only include specified types
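
The combined effect of :visibility and :element-types can be read as a single predicate over extracted elements. This sketch assumes each element is a map carrying keyword values under :visibility and :element-type; the actual representation inside the server may differ, and `keep-element?` is a hypothetical name.

```clojure
(defn keep-element?
  "True when element passes both the visibility filter and the optional
  element-type filter of the source options."
  [{:keys [visibility element-types] :or {visibility :all}} element]
  (and (or (= visibility :all)
           (= :public (:visibility element)))
       (or (nil? element-types)
           (contains? element-types (:element-type element)))))

;; Example: keep only public vars and macros.
(filter #(keep-element? {:visibility :public-only
                         :element-types #{:var :macro}} %)
        [{:element-name "my.ns/f" :element-type :var   :visibility :public}
         {:element-name "my.ns/g" :element-type :var   :visibility :private}
         {:element-name "my.ns/m" :element-type :macro :visibility :public}])
;; keeps my.ns/f and my.ns/m
```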

Characteristics:

  • Multiple segments per file (one per code element)
  • Embedding uses docstring if present, otherwise element name
  • Stored content is the complete clj-kondo analysis map for the element, serialized as an EDN string
  • Supports both Clojure (.clj, .cljs, .cljc) and Java (.java) files

Segment metadata:

  • :element-type - Type of code element (var, macro, namespace, class, method, field, constructor)
  • :element-name - Qualified name (e.g., "my.ns/my-fn" or "com.example.MyClass.myMethod")
  • :language - Source language (clojure or java)
  • :namespace - Containing namespace (Clojure) or package (Java)
  • :visibility - Access level (public, private, or protected)

Use when: You want to search code by its documentation or explore an API, finding functions, classes, and methods by their purpose rather than by file name.

Examples:

;; Search all code elements
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}

;; Search only public API
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only}]}

;; Search only vars and macros
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :element-types #{:var :macro}}]}

;; Java source code analysis
{:sources [{:path "/src/**/*.java"
            :ingest :code-analysis
            :visibility :public-only}]}

:chunked

Splits documents into smaller segments using LangChain4j's recursive text splitter. Enables better semantic search for large documents.

{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 512
            :chunk-overlap 100}]}

Configuration:

  • :chunk-size - Maximum characters per chunk (default: 512)
  • :chunk-overlap - Characters to overlap between chunks (default: 100)

Note: LangChain4j's recursive splitter prioritizes semantic boundaries (such as paragraph breaks) over exact overlap amounts. Adjacent chunks may overlap by less than :chunk-overlap when a split falls on a paragraph boundary.

Characteristics:

  • Multiple segments per file
  • Each chunk is embedded and stored independently
  • All chunks from the same file share the same :doc-id for batch removal during updates
  • Chunk metadata includes: :chunk-index (position), :chunk-count (total chunks), :chunk-offset (character offset)
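
To show how :chunk-size, :chunk-overlap, and the chunk metadata relate, here is a deliberately simplified fixed-window chunker. It is not LangChain4j's recursive splitter (it ignores paragraph and sentence boundaries), and `naive-chunks` is a hypothetical name; treat it as a sketch of the bookkeeping only.

```clojure
(defn naive-chunks
  "Splits text into windows of at most chunk-size characters, where each
  window starts (chunk-size - chunk-overlap) characters after the previous one.
  Returns maps carrying the chunk metadata keys plus the chunk :content."
  [text chunk-size chunk-overlap]
  {:pre [(pos? chunk-size) (< -1 chunk-overlap chunk-size)]}
  (let [n       (count text)
        step    (- chunk-size chunk-overlap)
        ;; Collect start offsets until a window reaches the end of the text.
        offsets (loop [off 0 acc []]
                  (let [acc (conj acc off)]
                    (if (>= (+ off chunk-size) n)
                      acc
                      (recur (+ off step) acc))))]
    (mapv (fn [i off]
            {:chunk-index  i
             :chunk-count  (count offsets)
             :chunk-offset off
             :content      (subs text off (min n (+ off chunk-size)))})
          (range) offsets)))

(naive-chunks "abcdefghij" 6 2)
;; => [{:chunk-index 0, :chunk-count 2, :chunk-offset 0, :content "abcdef"}
;;     {:chunk-index 1, :chunk-count 2, :chunk-offset 4, :content "efghij"}]
```

The two chunks share the characters "ef", which is the overlap of 2 requested in the call.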

Chunk sizing guidance:

  • Smaller chunks (256-512 chars): Better for precise fact-based retrieval
  • Larger chunks (1024+ chars): Better for broader context
  • Overlap (10-20%): Recommended to preserve context at chunk boundaries

Use when: You have large documents and need precise, fact-based retrieval of specific information that would otherwise be buried in lengthy content.

Examples:

;; Fine-grained retrieval for technical docs
{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 384
            :chunk-overlap 75}]}

;; Broader context for narrative content
{:sources [{:path "/articles/**/*.md"
            :ingest :chunked
            :chunk-size 1024
            :chunk-overlap 200}]}

;; Compare strategies for different content types
{:sources [
  ;; Small reference docs - whole document works well
  {:path "/api-reference/**/*.md"
   :ingest :whole-document}

  ;; Large guides - chunking improves precision
  {:path "/guides/**/*.md"
   :ingest :chunked
   :chunk-size 512
   :chunk-overlap 100}]}

Metadata System

Metadata comes from two sources:

  1. Base metadata: Any additional keys in the source map (except :path, :class-path, :name, :ingest, :watch?)
  2. Captures: Values extracted from named groups in the path spec

{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "my-project"
            :type "documentation"}]}

For a file /docs/api/functions.md:

  • Metadata: {:project "my-project", :type "documentation", :category "api"}

The :name key, if provided, is also added to metadata.
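
The way base metadata, :name, and captures combine can be sketched as a map merge. This is an illustration under assumptions: `source-metadata` and `non-metadata-keys` are hypothetical names, and the precedence when a capture and a source key share a name (captures win here) is a guess, not documented behavior.

```clojure
(def non-metadata-keys
  "Source keys that are excluded from base metadata."
  #{:path :class-path :name :ingest :watch?})

(defn source-metadata
  "Combines base metadata from the source map with :name (if present)
  and the captures extracted from a matching file path."
  [source captures]
  (merge (apply dissoc source non-metadata-keys) ; base metadata
         (select-keys source [:name])            ; :name is added back when present
         captures))                              ; assumption: captures win on conflict

(source-metadata {:path "/docs/(?<category>[^/]+)/*.md"
                  :project "my-project"
                  :type "documentation"}
                 {:category "api"})
;; => {:project "my-project", :type "documentation", :category "api"}
```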

System-added metadata:

  • :doc-id - File path (used for watch updates/deletes)
  • :file-id - File path
  • :segment-id - Unique segment identifier

Strategy-specific metadata:

  • :namespace-doc adds: :namespace
  • :code-analysis adds: :element-type, :element-name, :language, :namespace, :visibility
  • :chunked adds: :chunk-index, :chunk-count, :chunk-offset

File Watching

File watching is optional. When enabled, the server automatically re-indexes files as they change.

Configuration:

  • Global :watch? true enables watching for all sources
  • Per-source :watch? true/false overrides global setting
  • Only available for filesystem sources (:path), not classpath sources

{:watch? true  ; enable globally
 :sources [
   {:path "/docs/**/*.md"}         ; watched (global setting)
   {:path "/src/**/*.clj"
    :watch? false}                  ; not watched (override)
   {:path "/notes/**/*.txt"
    :watch? true}]}                 ; watched (explicit)
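
The override rules amount to a short lookup: use the per-source :watch? if present, otherwise the global value, otherwise false, and never watch classpath sources. A minimal sketch, with `watched?` as a hypothetical helper name:

```clojure
(defn watched?
  "True when source should be watched under config's global setting."
  [config source]
  (and (contains? source :path) ; classpath sources are never watched
       (boolean (get source :watch? (get config :watch? false)))))

(let [config {:watch? true
              :sources [{:path "/docs/**/*.md"}
                        {:path "/src/**/*.clj" :watch? false}
                        {:path "/notes/**/*.txt" :watch? true}]}]
  (mapv #(watched? config %) (:sources config)))
;; => [true false true]
```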

Behavior:

  • Events are debounced (500ms) to avoid excessive re-indexing
  • File created → index new file
  • File modified → remove old embeddings by :doc-id, re-index
  • File deleted → remove embeddings by :doc-id
  • Recursive watching for directories with ** glob
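
The behavior above can be modeled as two pure steps: collapse bursts of events, then map each surviving event to index operations. The sketch below assumes a trailing debounce (an event is dropped if another event for the same path follows within the window) and uses hypothetical names and event shapes (`debounce-events`, `reindex-plan`, `:kind`, `:at-ms`); the server's real event handling may differ.

```clojure
(defn debounce-events
  "Trailing debounce: per path, keeps only events that are not followed by
  another event for the same path within window-ms. Result is time-ordered."
  [events window-ms]
  (->> (vals (group-by :path events))
       (mapcat (fn [path-events]
                 (->> (partition 2 1 [nil] (sort-by :at-ms path-events))
                      (filter (fn [[event following]]
                                (or (nil? following)
                                    (>= (- (:at-ms following) (:at-ms event))
                                        window-ms))))
                      (map first))))
       (sort-by :at-ms)))

(defn reindex-plan
  "Maps one file event to the index operations listed above."
  [{:keys [kind path]}]
  (case kind
    :created  [[:index path]]
    :modified [[:remove-by-doc-id path] [:index path]]
    :deleted  [[:remove-by-doc-id path]]))
```

For example, two modifications of the same file 100 ms apart would collapse into one under a 500 ms window, so the file is removed and re-indexed once rather than twice.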

Complete Examples

Minimal Configuration

{:sources [{:path "/Users/me/docs/**/*.md"}]}

Multi-Source with Metadata

{:description "Project documentation and code search"
 :sources [
   {:path "/Users/me/project/docs/**/*.md"
    :name "Documentation"
    :type "docs"}

   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc
    :name "Source Code"
    :type "code"}]}

Metadata Extraction with Captures

{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "myapp"
            :type "documentation"}]}

Code Analysis with Filtering

{:sources [
   {:path "/src/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :element-types #{:var :macro}}

   {:path "/src/**/*.java"
    :ingest :code-analysis
    :visibility :public-only}]}

Chunked Large Documents

{:sources [
   {:path "/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100}]}

Mixed Filesystem and Classpath

{:sources [
   ;; Filesystem documentation
   {:path "/Users/me/docs/**/*.md"
    :source "local"}

   ;; Bundled library documentation from classpath
   {:class-path "lib-docs/**/*.md"
    :source "library"}

   ;; Clojure source from classpath
   {:class-path "my/app/**/*.clj"
    :ingest :namespace-doc
    :source "library-code"}]}

File Watching Enabled

{:watch? true
 :sources [
   {:path "/Users/me/project/docs/**/*.md"}
   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc}]}

Complex Multi-Strategy Configuration

{:description "Comprehensive project search"
 :watch? true
 :sources [
   ;; API reference - small docs, keep whole
   {:path "/docs/api/**/*.md"
    :ingest :whole-document
    :category "api-reference"}

   ;; User guides - large docs, chunk them
   {:path "/docs/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100
    :category "guides"}

   ;; Public API code (capture named "module" to avoid clashing with the
   ;; :namespace metadata that :code-analysis adds)
   {:path "/src/(?<module>[^/]+)/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :category "code"}

   ;; README files - whole document
   {:path "/(?<project>[^/]+)/README.md"
    :ingest :whole-document
    :category "readme"}]}

Tips and Best Practices

Path specifications:

  • Use absolute paths for filesystem sources (start with /)
  • Use relative paths for classpath sources (no leading /)
  • Named captures are powerful for extracting structured metadata
  • Test path patterns with a small subset first

Ingest strategies:

  • Start with :whole-document for most use cases
  • Use :namespace-doc for Clojure codebases to search by documentation
  • Use :code-analysis when you need fine-grained API discovery
  • Use :chunked for large documents (more than about 1000 characters)
  • Use :file-path when you need to minimize memory usage

Metadata:

  • Add meaningful metadata to enable filtering during search
  • Use consistent naming conventions for metadata keys
  • Captures are great for hierarchical organization (project, version, category)

File watching:

  • Enable globally with :watch? true for development
  • Disable for production or when working with static content
  • Override per-source as needed
  • Only works with filesystem sources, not classpath

Performance:

  • Smaller chunk sizes create more segments (more memory, more precise search)
  • Larger chunk sizes create fewer segments (less memory, broader context)
  • :file-path strategy significantly reduces memory usage
  • Consider the trade-off between search precision and resource usage
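
For a rough feel of the memory trade-off, the number of segments per document can be estimated from the chunk settings. The formula below assumes a plain sliding window where each chunk starts (chunk-size - chunk-overlap) characters after the previous one; the real splitter respects paragraph boundaries, so actual counts will vary. `estimated-chunks` is a hypothetical helper.

```clojure
(defn estimated-chunks
  "Estimates how many chunks a document of doc-chars characters yields:
  ceil((doc-chars - overlap) / (chunk-size - overlap)), at least 1."
  [doc-chars chunk-size chunk-overlap]
  (let [step (- chunk-size chunk-overlap)]
    (max 1 (long (Math/ceil (/ (double (- doc-chars chunk-overlap)) step))))))

(estimated-chunks 10000 512 100)   ;; => 25
(estimated-chunks 10000 1024 200)  ;; => 12
```

Under this estimate, halving the chunk size roughly doubles the number of segments, and therefore the number of embeddings held by the index.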