zhongwei/gh-hugoduncan-mcp-vector-search-claude-plugin-plugins-config-guide


Zhongwei Li 34e99e046e Initial commit

2025-11-29 18:47:25 +08:00

15 KiB

Guide for writing mcp-vector-search configuration files

mcp-vector-search Configuration Guide

Comprehensive reference for writing .mcp-vector-search/config.edn configuration files.

Overview

mcp-vector-search indexes documents using semantic embeddings and provides a search tool via the Model Context Protocol. Configuration controls which files are indexed, how they're processed, and what metadata is extracted.

Configuration File Location

Create a configuration file at one of these locations (first found is used):

Project-specific: .mcp-vector-search/config.edn (in project root)
Global: ~/.mcp-vector-search/config.edn (user home directory)

The server reads the configuration at startup and indexes all specified sources.
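The lookup order above can be sketched in Clojure as follows. This is a minimal illustration and not the server's actual code: `config-candidates` and `find-config` are hypothetical names, and passing the project root in as an argument is an assumption about how it would be determined.

```clojure
(require '[clojure.java.io :as io])

(defn config-candidates
  "Candidate config files in lookup order: project-specific, then global.
  Hypothetical helper; project-root is supplied by the caller."
  [project-root]
  [(io/file project-root ".mcp-vector-search" "config.edn")
   (io/file (System/getProperty "user.home") ".mcp-vector-search" "config.edn")])

(defn find-config
  "Returns the first candidate that exists on disk, or nil if none does."
  [project-root]
  (first (filter #(.exists ^java.io.File %) (config-candidates project-root))))
```

Because the project-specific file comes first in the vector, it shadows the global file whenever both exist.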

Basic Configuration Structure

{:description "Custom search tool description"  ; optional
 :watch? true                                   ; optional, enable file watching
 :sources [{:path "/docs/**/*.md"
            :name "Documentation"               ; optional
            :ingest :whole-document             ; optional, defaults to :whole-document
            :watch? true                        ; optional, overrides global :watch?
            :custom-key "custom-value"}]}       ; any additional keys become metadata

Top-level keys:

:description - Custom description for the search tool (optional)
:watch? - Enable automatic re-indexing when files change (optional, default: false)
:sources - Vector of source configuration maps (required)
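As a rough sketch of how these top-level keys could be read and defaulted, the snippet below parses config text with `clojure.edn` and applies the documented default for `:watch?`. The function name `parse-config` and the error message are hypothetical; the real server may validate differently.

```clojure
(require '[clojure.edn :as edn])

(defn parse-config
  "Parses EDN config text, checks that :sources is present, and fills in
  the documented default for :watch?. Hypothetical helper."
  [text]
  (let [config (edn/read-string text)]
    (when-not (contains? config :sources)
      (throw (ex-info "config is missing the required :sources key"
                      {:config config})))
    (merge {:watch? false} config)))

(parse-config "{:sources [{:path \"/docs/**/*.md\"}]}")
;; => {:watch? false, :sources [{:path "/docs/**/*.md"}]}
```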

Path Specifications

Filesystem vs Classpath Sources

Filesystem sources (:path key):

Use absolute paths with leading /
Support file watching for automatic re-indexing
Files are read from the filesystem

{:path "/docs/**/*.md"}

Classpath sources (:class-path key):

Use relative paths without leading /
File watching not available (read-only resources)
Resources discovered from classpath (JARs, resource directories)
Useful when embedding mcp-vector-search as a library

{:class-path "docs/**/*.md"}

Important: Sources must specify exactly one of :path or :class-path.
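The "exactly one of" rule can be expressed as a small check. This is an illustrative sketch only: `source-kind` and its return values `:filesystem` and `:classpath` are hypothetical names, not part of mcp-vector-search.

```clojure
(defn source-kind
  "Classifies a source map as :filesystem or :classpath and rejects maps
  that have both keys or neither. Hypothetical helper."
  [source]
  (let [path?       (contains? source :path)
        class-path? (contains? source :class-path)]
    (cond
      (and path? class-path?)
      (throw (ex-info "source has both :path and :class-path" {:source source}))

      path?       :filesystem
      class-path? :classpath

      :else
      (throw (ex-info "source needs :path or :class-path" {:source source})))))
```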

Glob Patterns

Single-level glob (*):

Matches any characters within a single directory level
Does not match path separators

{:path "/docs/*.md"}           ; matches /docs/README.md
                                ; does NOT match /docs/api/guide.md

Recursive glob (**):

Matches any characters across multiple directory levels
Includes path separators

{:path "/docs/**/*.md"}         ; matches /docs/README.md
                                ; matches /docs/api/guide.md
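One way to reason about the two glob forms is to translate them into regular expressions. The sketch below is an assumption about equivalent behavior, not the library's implementation: `glob->regex` is a hypothetical helper that handles only `*`, `**`, and literal dots, and it does not handle named captures.

```clojure
(require '[clojure.string :as str])

(defn glob->regex
  "Translates a plain glob into a java.util.regex.Pattern.
  `*` stays within one directory level; `**/` spans zero or more levels."
  [glob]
  (re-pattern
   (str/replace glob #"\*\*/|\*\*|\*|\."
                (fn [token]
                  (case token
                    "**/" "(?:.*/)?"   ; zero or more whole directories
                    "**"  ".*"         ; anything, including separators
                    "*"   "[^/]*"      ; anything except a separator
                    "."   "\\.")))))   ; literal dot

(boolean (re-matches (glob->regex "/docs/*.md") "/docs/api/guide.md"))
;; => false
(boolean (re-matches (glob->regex "/docs/**/*.md") "/docs/api/guide.md"))
;; => true
```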

Named Captures

Extract metadata from file paths using named regex groups:

{:path "/docs/(?<category>[^/]+)/*.md"}

Syntax: (?<name>pattern)

name - Metadata key (converted to keyword)
pattern - Java regular expression

For file /docs/api/functions.md:

Captures: {:category "api"}

Multiple captures example:

{:path "/(?<project>[^/]+)/(?<version>v\\d+)/(?<file>.+\\.clj)"}

For file /myapp/v1/core.clj:

Captures: {:project "myapp", :version "v1", :file "core.clj"}
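Named groups are a standard `java.util.regex` feature, so the capture behavior can be tried at a REPL. The helper `path-captures` below is hypothetical and takes the group names explicitly; how mcp-vector-search discovers them internally is not shown here. Note that a Clojure regex literal needs a single backslash (`\d`) where the EDN string in the config needs two (`\\d`).

```clojure
(defn path-captures
  "Matches path against re and returns the named groups listed in
  group-names as a keyword-keyed map, or nil when the path does not match."
  [re group-names path]
  (let [m (re-matcher re path)]
    (when (.matches m)
      (into {}
            (map (fn [n] [(keyword n) (.group m ^String n)]))
            group-names))))

(path-captures #"/(?<project>[^/]+)/(?<version>v\d+)/(?<file>.+\.clj)"
               ["project" "version" "file"]
               "/myapp/v1/core.clj")
;; => {:project "myapp", :version "v1", :file "core.clj"}
```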

Path Specification Examples

;; All markdown files recursively
{:path "/docs/**/*.md"}

;; Single directory level
{:path "/docs/*.md"}

;; Capture directory name
{:path "/docs/(?<category>[^/]+)/*.md"}

;; Multiple captures
{:path "/(?<project>[^/]+)/(?<version>[^/]+)/**/*.clj"}

;; Literal file
{:path "/docs/README.md"}

;; Classpath resource (no leading /)
{:class-path "docs/**/*.md"}

Ingest Pipeline Strategies

Ingest strategies control how documents are processed, embedded, and stored. Set via the :ingest key.

:whole-document (default)

Embeds and stores the entire file content as a single segment.

{:sources [{:path "/docs/**/*.md"
            :ingest :whole-document}]}

Characteristics:

One segment per file
Both embedding and storage use full content
Simple and straightforward for most use cases

Use when: You want to search across complete documents and return full content.

:namespace-doc

For Clojure source files - embeds only the namespace docstring but stores the full file content.

{:sources [{:path "/src/**/*.clj"
            :ingest :namespace-doc}]}

Requirements:

File must contain a valid ns form
Namespace must have a docstring
Adds :namespace to metadata (e.g., {:namespace "my.app.core"})

Characteristics:

Embedding uses only namespace docstring
Storage includes full file content
One segment per file

Use when: You want to search Clojure namespaces by their documentation while still returning the complete source code.
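To make the split between embedded and stored content concrete, the sketch below pulls the namespace name and docstring out of source text. It is a simplified assumption about the extraction: `ns-docstring` is a hypothetical helper that only recognizes a docstring given as the third element of the `ns` form, and the actual strategy may parse files differently.

```clojure
(require '[clojure.edn :as edn])

(defn ns-docstring
  "Reads the first form of source-text and, if it is an ns form with a
  docstring, returns the namespace name and the docstring."
  [source-text]
  (let [form (edn/read-string source-text)]
    (when (and (seq? form)
               (= 'ns (first form))
               (string? (nth form 2 nil)))
      {:namespace (str (second form))
       :docstring (nth form 2)})))

(ns-docstring
 "(ns my.app.core\n  \"Core request handlers.\"\n  (:require [clojure.string :as str]))")
;; => {:namespace "my.app.core", :docstring "Core request handlers."}
```

In this picture, only the `:docstring` value would be embedded, while the whole file text would be stored and returned with search results.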

:file-path

Embeds the full content but stores only the file path.

{:sources [{:path "/docs/**/*.md"
            :ingest :file-path}]}

Characteristics:

Embedding uses full file content
Storage contains only the file path
One segment per file
Reduces memory footprint for large document sets

Use when:

You only need to discover which files match a query
You want to reduce memory usage
Your client will read file content separately

:code-analysis

Analyzes Clojure and Java source files using clj-kondo to extract code elements (vars, macros, namespaces, classes, methods, fields, constructors). Creates one searchable segment per code element.

{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}

Configuration options:

{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only        ; :all (default) | :public-only
            :element-types #{:var :macro}}]} ; optional filter

:visibility - Controls which elements to include:

:all (default) - Include all elements regardless of visibility
:public-only - Include only public elements
- Clojure: Excludes vars with ^:private or {:private true} metadata
- Java: Excludes members with private or protected access modifiers

:element-types (optional) - Set of element types to include:

Valid types: :var, :macro, :namespace, :class, :method, :field, :constructor
If omitted: Include all element types
If provided: Only include specified types
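The combined effect of the two options can be sketched as a predicate over extracted elements. Everything here is illustrative: `keep-element?` is a hypothetical function, and representing an element's `:visibility` and `:element-type` as keywords is an assumption about the internal data shape.

```clojure
(defn keep-element?
  "True when element passes both the :visibility and :element-types
  filters of a source. Missing options mean no filtering."
  [{:keys [visibility element-types] :or {visibility :all}} element]
  (and (or (= :all visibility)
           (= :public (:visibility element)))
       (or (nil? element-types)
           (contains? element-types (:element-type element)))))

(def elements
  [{:element-name "my.ns/run"    :element-type :var   :visibility :public}
   {:element-name "my.ns/helper" :element-type :var   :visibility :private}
   {:element-name "my.ns/with-x" :element-type :macro :visibility :public}
   {:element-name "my.ns"        :element-type :namespace :visibility :public}])

(mapv :element-name
      (filter #(keep-element? {:visibility :public-only
                               :element-types #{:var :macro}} %)
              elements))
;; => ["my.ns/run" "my.ns/with-x"]
```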

Characteristics:

Multiple segments per file (one per code element)
Embedding uses docstring if present, otherwise element name
Content stores complete clj-kondo analysis map as EDN string
Supports both Clojure (.clj, .cljs, .cljc) and Java (.java) files

Segment metadata:

:element-type - Type of code element (var, macro, namespace, class, method, field, constructor)
:element-name - Qualified name (e.g., "my.ns/my-fn" or "com.example.MyClass.myMethod")
:language - Source language (clojure or java)
:namespace - Containing namespace (Clojure) or package (Java)
:visibility - Access level (public, private, or protected)

Use when: You want to search code by documentation or API discovery, finding functions/classes/methods based on their purpose rather than file names.

Examples:

;; Search all code elements
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}

;; Search only public API
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only}]}

;; Search only vars and macros
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :element-types #{:var :macro}}]}

;; Java source code analysis
{:sources [{:path "/src/**/*.java"
            :ingest :code-analysis
            :visibility :public-only}]}

:chunked

Splits documents into smaller segments using LangChain4j's recursive text splitter. Enables better semantic search for large documents.

{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 512
            :chunk-overlap 100}]}

Configuration:

:chunk-size - Maximum characters per chunk (default: 512)
:chunk-overlap - Characters to overlap between chunks (default: 100)

Note: LangChain4j's recursive paragraph splitter prioritizes semantic boundaries (paragraph breaks) over exact overlap amounts. Adjacent chunks may have less overlap than specified if splitting at a paragraph boundary.

Characteristics:

Multiple segments per file
Each chunk is embedded and stored independently
All chunks from the same file share the same :doc-id for batch removal during updates
Chunk metadata includes: :chunk-index (position), :chunk-count (total chunks), :chunk-offset (character offset)

Chunk sizing guidance:

Smaller chunks (256-512 chars): Better for precise fact-based retrieval
Larger chunks (1024+ chars): Better for broader context
Overlap (10-20%): Recommended to preserve context at chunk boundaries

Use when: You have large documents and need precise fact-based retrieval where specific information may be buried in lengthy content.
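To get a feel for how `:chunk-size` and `:chunk-overlap` interact, the sketch below uses a plain fixed-window splitter. This is deliberately not LangChain4j's recursive splitter, which prefers paragraph boundaries; `fixed-chunks` is a hypothetical stand-in that only illustrates the arithmetic and the chunk metadata keys described above.

```clojure
(defn fixed-chunks
  "Splits text into windows of at most size characters, each starting
  (size - overlap) characters after the previous one."
  [text size overlap]
  {:pre [(pos? size) (<= 0 overlap) (< overlap size)]}
  (let [n    (count text)
        step (- size overlap)
        raw  (loop [offset 0, acc []]
               (let [end (min n (+ offset size))
                     acc (conj acc {:chunk-offset offset
                                    :text (subs text offset end)})]
                 (if (>= end n)
                   acc
                   (recur (+ offset step) acc))))]
    ;; Attach position and total count once all chunks are known.
    (vec (map-indexed (fn [i chunk]
                        (assoc chunk :chunk-index i :chunk-count (count raw)))
                      raw))))

(mapv :chunk-offset (fixed-chunks (apply str (repeat 2000 \x)) 512 100))
;; => [0 412 824 1236 1648]
```

With the default settings, a 2000-character document yields five chunks under this simplified scheme; the real splitter may produce a different count because it breaks at semantic boundaries.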

Examples:

;; Fine-grained retrieval for technical docs
{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 384
            :chunk-overlap 75}]}

;; Broader context for narrative content
{:sources [{:path "/articles/**/*.md"
            :ingest :chunked
            :chunk-size 1024
            :chunk-overlap 200}]}

;; Compare strategies for different content types
{:sources [
  ;; Small reference docs - whole document works well
  {:path "/api-reference/**/*.md"
   :ingest :whole-document}

  ;; Large guides - chunking improves precision
  {:path "/guides/**/*.md"
   :ingest :chunked
   :chunk-size 512
   :chunk-overlap 100}]}

Metadata System

Metadata comes from two sources:

Base metadata: Any additional keys in the source map (except :path, :class-path, :name, :ingest, :watch?)
Captures: Values extracted from named groups in the path spec

{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "my-project"
            :type "documentation"}]}

For a file /docs/api/functions.md:

Metadata: {:project "my-project", :type "documentation", :category "api"}

The :name key, if provided, is also added to metadata.
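The way the two sources combine can be sketched as a map merge. This is an assumption-laden illustration: `source-metadata` is a hypothetical helper, letting captures win on a key collision is a guess, and the sketch does not address whether strategy options such as `:chunk-size` are treated as metadata.

```clojure
(def reserved-keys
  "Source keys that configure indexing rather than describe the content.
  :name is not listed because it is also added to metadata."
  #{:path :class-path :ingest :watch?})

(defn source-metadata
  "Combines the non-reserved keys of a source map with path captures."
  [source captures]
  (merge (apply dissoc source reserved-keys) captures))

(source-metadata {:path "/docs/(?<category>[^/]+)/*.md"
                  :project "my-project"
                  :type "documentation"}
                 {:category "api"})
;; => {:project "my-project", :type "documentation", :category "api"}
```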

System-added metadata:

:doc-id - File path (used for watch updates/deletes)
:file-id - File path
:segment-id - Unique segment identifier

Strategy-specific metadata:

:namespace-doc adds: :namespace
:code-analysis adds: :element-type, :element-name, :language, :namespace, :visibility
:chunked adds: :chunk-index, :chunk-count, :chunk-offset

File Watching

Optional file watching system for automatic re-indexing when files change.

Configuration:

Global :watch? true enables watching for all sources
Per-source :watch? true/false overrides global setting
Only available for filesystem sources (:path), not classpath sources

{:watch? true  ; enable globally
 :sources [
   {:path "/docs/**/*.md"}         ; watched (global setting)
   {:path "/src/**/*.clj"
    :watch? false}                  ; not watched (override)
   {:path "/notes/**/*.txt"
    :watch? true}]}                 ; watched (explicit)
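The override rules amount to a short lookup: the per-source value if present, otherwise the global value, otherwise false, and never for classpath sources. The function `watched?` below is a hypothetical sketch of that resolution.

```clojure
(defn watched?
  "True when source should be watched under config. Only filesystem
  sources (:path) can be watched; per-source :watch? beats the global one."
  [config source]
  (boolean
   (and (contains? source :path)
        (get source :watch? (get config :watch? false)))))

(let [config {:watch? true
              :sources [{:path "/docs/**/*.md"}
                        {:path "/src/**/*.clj" :watch? false}
                        {:path "/notes/**/*.txt" :watch? true}]}]
  (mapv #(watched? config %) (:sources config)))
;; => [true false true]
```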

Behavior:

Events are debounced (500ms) to avoid excessive re-indexing
File created → index new file
File modified → remove old embeddings by :doc-id, re-index
File deleted → remove embeddings by :doc-id
Directories matched by a ** glob are watched recursively
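The behavior above can be modeled with two pure functions: one that collapses bursts of events per file, and one that maps each surviving event to index operations. Both are hypothetical sketches under stated assumptions. The event shape (`:path`, `:kind`, `:at-ms`), the operation names, and the trailing-edge interpretation of debouncing are illustrative choices, not the server's internals.

```clojure
(defn debounce-events
  "Drops an event when another event for the same path follows it within
  window-ms, so a burst of changes collapses to its last event."
  [window-ms events]
  (->> (vals (group-by :path events))
       (mapcat (fn [path-events]
                 (keep (fn [[event following]]
                         (when (or (nil? following)
                                   (>= (- (:at-ms following) (:at-ms event))
                                       window-ms))
                           event))
                       ;; Pair each event with its successor (nil for the last).
                       (partition 2 1 [nil] (sort-by :at-ms path-events)))))
       (sort-by :at-ms)))

(defn actions-for
  "Index operations for one file event, following the list above."
  [kind]
  (case kind
    :created  [:index]
    :modified [:remove-by-doc-id :index]
    :deleted  [:remove-by-doc-id]))

(mapv :at-ms
      (debounce-events 500 [{:path "a.md" :kind :modified :at-ms 0}
                            {:path "a.md" :kind :modified :at-ms 300}
                            {:path "a.md" :kind :modified :at-ms 1000}
                            {:path "b.md" :kind :deleted  :at-ms 100}]))
;; => [100 300 1000]
```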

Complete Examples

Basic Documentation Search

{:sources [{:path "/Users/me/docs/**/*.md"}]}

Multi-Source with Metadata

{:description "Project documentation and code search"
 :sources [
   {:path "/Users/me/project/docs/**/*.md"
    :name "Documentation"
    :type "docs"}

   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc
    :name "Source Code"
    :type "code"}]}

Metadata Extraction with Captures

{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "myapp"
            :type "documentation"}]}

Code Analysis with Filtering

{:sources [
   {:path "/src/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :element-types #{:var :macro}}

   {:path "/src/**/*.java"
    :ingest :code-analysis
    :visibility :public-only}]}

Chunked Large Documents

{:sources [
   {:path "/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100}]}

Mixed Filesystem and Classpath

{:sources [
   ;; Filesystem documentation
   {:path "/Users/me/docs/**/*.md"
    :source "local"}

   ;; Bundled library documentation from classpath
   {:class-path "lib-docs/**/*.md"
    :source "library"}

   ;; Clojure source from classpath
   {:class-path "my/app/**/*.clj"
    :ingest :namespace-doc
    :source "library-code"}]}

File Watching Enabled

{:watch? true
 :sources [
   {:path "/Users/me/project/docs/**/*.md"}
   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc}]}

Complex Multi-Strategy Configuration

{:description "Comprehensive project search"
 :watch? true
 :sources [
   ;; API reference - small docs, keep whole
   {:path "/docs/api/**/*.md"
    :ingest :whole-document
    :category "api-reference"}

   ;; User guides - large docs, chunk them
   {:path "/docs/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100
    :category "guides"}

   ;; Public API code; the capture is named "module" because :code-analysis
   ;; already adds its own :namespace metadata key
   {:path "/src/(?<module>[^/]+)/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :category "code"}

   ;; README files - whole document
   {:path "/(?<project>[^/]+)/README.md"
    :ingest :whole-document
    :category "readme"}]}

Tips and Best Practices

Path specifications:

Use absolute paths for filesystem sources (start with /)
Use relative paths for classpath sources (no leading /)
Use named captures to turn path segments into structured metadata
Test path patterns with a small subset first

Ingest strategies:

Start with :whole-document for most use cases
Use :namespace-doc for Clojure codebases to search by documentation
Use :code-analysis when you need fine-grained API discovery
Use :chunked for large documents (>1000 chars)
Use :file-path when you need to minimize memory usage

Metadata:

Add meaningful metadata to enable filtering during search
Use consistent naming conventions for metadata keys
Use captures to encode hierarchical organization (project, version, category)

File watching:

Enable globally with :watch? true for development
Disable for production or when working with static content
Override per-source as needed
Only works with filesystem sources, not classpath

Performance:

Smaller chunk sizes create more segments (more memory, more precise search)
Larger chunk sizes create fewer segments (less memory, broader context)
:file-path strategy significantly reduces memory usage
Consider the trade-off between search precision and resource usage
