mcp-vector-search Configuration Guide
Comprehensive reference for writing .mcp-vector-search/config.edn configuration files.
Overview
mcp-vector-search indexes documents using semantic embeddings and provides a search tool via the Model Context Protocol. Configuration controls which files are indexed, how they're processed, and what metadata is extracted.
Configuration File Location
Create a configuration file at one of these locations (first found is used):
- Project-specific: .mcp-vector-search/config.edn (in project root)
- Global: ~/.mcp-vector-search/config.edn (user home directory)
The server reads the configuration at startup and indexes all specified sources.
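The lookup order above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function names are hypothetical, and it assumes the project root is the directory the server was started from.

```clojure
(require '[clojure.java.io :as io])

(defn config-candidates
  "Candidate config files in lookup order: project-specific first, then global.
  Hypothetical helper; both directories are passed in explicitly."
  [project-root user-home]
  [(io/file project-root ".mcp-vector-search" "config.edn")
   (io/file user-home ".mcp-vector-search" "config.edn")])

(defn find-config
  "Returns the first candidate that exists as a regular file, or nil."
  [project-root user-home]
  (first (filter (fn [^java.io.File f] (.isFile f))
                 (config-candidates project-root user-home))))

;; Assumption: the project root is the current working directory.
(find-config (System/getProperty "user.dir")
             (System/getProperty "user.home"))
```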
Basic Configuration Structure
{:description "Custom search tool description" ; optional
 :watch? true                                  ; optional, enable file watching
 :sources [{:path "/docs/**/*.md"
            :name "Documentation"              ; optional
            :ingest :whole-document            ; optional, defaults to :whole-document
            :watch? true                       ; optional, overrides global :watch?
            :custom-key "custom-value"}]}      ; any additional keys become metadata
Top-level keys:
- :description - Custom description for the search tool (optional)
- :watch? - Enable automatic re-indexing when files change (optional, default: false)
- :sources - Array of source configurations (required)
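As a rough illustration of the top-level keys, the sketch below parses a config string, checks the one required key, and fills in the documented default for :watch?. The parse-config function is hypothetical and is not part of mcp-vector-search; only clojure.edn/read-string is a real library call.

```clojure
(require '[clojure.edn :as edn])

(defn parse-config
  "Parses config text as EDN, checks the required key, and applies the
  documented default for :watch?. Hypothetical helper, not the server's API."
  [text]
  (let [config (edn/read-string text)]
    (when-not (contains? config :sources)
      (throw (ex-info "Config must contain :sources" {:config config})))
    (merge {:watch? false} config)))

(parse-config "{:sources [{:path \"/docs/**/*.md\"}]}")
;; => {:watch? false, :sources [{:path "/docs/**/*.md"}]}
```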
Path Specifications
Filesystem vs Classpath Sources
Filesystem sources (:path key):
- Use absolute paths with leading /
- Support file watching for automatic re-indexing
- Files are read from the filesystem
{:path "/docs/**/*.md"}
Classpath sources (:class-path key):
- Use relative paths without leading /
- File watching not available (read-only resources)
- Resources discovered from classpath (JARs, resource directories)
- Useful when embedding mcp-vector-search as a library
{:class-path "docs/**/*.md"}
Important: Sources must specify exactly one of :path or :class-path.
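The "exactly one of" rule can be expressed as a small check. This is a sketch under the assumption that a source is a plain map; source-kind is a hypothetical name, and the server may report the error differently.

```clojure
(defn source-kind
  "Returns :path or :class-path when exactly one of them is present.
  Throws when a source has neither key or both keys."
  [source]
  (let [present (filter #(contains? source %) [:path :class-path])]
    (if (= 1 (count present))
      (first present)
      (throw (ex-info "Source needs exactly one of :path or :class-path"
                      {:source source})))))

(source-kind {:path "/docs/**/*.md"})       ; => :path
(source-kind {:class-path "docs/**/*.md"})  ; => :class-path
```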
Glob Patterns
Single-level glob (*):
- Matches any characters within a single directory level
- Does not match path separators
{:path "/docs/*.md"} ; matches /docs/README.md
; does NOT match /docs/api/guide.md
Recursive glob (**):
- Matches any characters across multiple directory levels
- Includes path separators
{:path "/docs/**/*.md"} ; matches /docs/README.md
; matches /docs/api/guide.md
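One way to read the two glob rules is as a translation to a regular expression. The sketch below is an illustration only and the server's matcher may differ; glob->regex is a hypothetical name. It treats a `**/` segment as "zero or more directories", which is an assumption made so that /docs/**/*.md also matches /docs/README.md, as in the example above.

```clojure
(import 'java.util.regex.Pattern)

(defn glob->regex
  "Translates a glob into a java.util.regex.Pattern.
  *   -> any characters except the path separator
  **/ -> zero or more whole directory levels (assumption)
  **  -> any characters, including path separators
  Everything else is matched literally."
  [glob]
  (->> (re-seq #"\*\*/|\*\*|\*|[^*]+" glob)
       (map (fn [token]
              (case token
                "**/" "(?:.*/)?"
                "**"  ".*"
                "*"   "[^/]*"
                (Pattern/quote token))))
       (apply str)
       re-pattern))

(re-matches (glob->regex "/docs/*.md") "/docs/api/guide.md")    ; => nil
(re-matches (glob->regex "/docs/**/*.md") "/docs/api/guide.md") ; => "/docs/api/guide.md"
```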
Named Captures
Extract metadata from file paths using named regex groups:
{:path "/docs/(?<category>[^/]+)/*.md"}
Syntax: (?<name>pattern)
- name - Metadata key (converted to keyword)
- pattern - Java regular expression
For file /docs/api/functions.md:
- Captures: {:category "api"}
Multiple captures example:
{:path "/(?<project>[^/]+)/(?<version>v\\d+)/(?<file>.+\\.clj)"}
For file /myapp/v1/core.clj:
- Captures: {:project "myapp", :version "v1", :file "core.clj"}
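The multiple-captures example can be reproduced directly with Java's regex engine. The sketch below is not the server's implementation: capture-names and extract-captures are hypothetical helpers, and the code assumes the path spec is a pure regular expression, so specs that mix in glob wildcards such as * or ** are not handled here.

```clojure
(defn capture-names
  "Finds the names of all (?<name>...) groups in a path spec."
  [spec]
  (map second (re-seq #"\(\?<([A-Za-z][A-Za-z0-9]*)>" spec)))

(defn extract-captures
  "Matches the whole path against spec and returns a map of keyword -> captured
  text, or nil when the path does not match."
  [spec path]
  (let [matcher (re-matcher (re-pattern spec) path)]
    (when (.matches matcher)
      (into {}
            (map (fn [^String n] [(keyword n) (.group matcher n)]))
            (capture-names spec)))))

(extract-captures "/(?<project>[^/]+)/(?<version>v\\d+)/(?<file>.+\\.clj)"
                  "/myapp/v1/core.clj")
;; => {:project "myapp", :version "v1", :file "core.clj"}
```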
Path Specification Examples
;; All markdown files recursively
{:path "/docs/**/*.md"}
;; Single directory level
{:path "/docs/*.md"}
;; Capture directory name
{:path "/docs/(?<category>[^/]+)/*.md"}
;; Multiple captures
{:path "/(?<project>[^/]+)/(?<version>[^/]+)/**/*.clj"}
;; Literal file
{:path "/docs/README.md"}
;; Classpath resource (no leading /)
{:class-path "docs/**/*.md"}
Ingest Pipeline Strategies
Ingest strategies control how documents are processed, embedded, and stored. Set via the :ingest key.
:whole-document (default)
Embeds and stores the entire file content as a single segment.
{:sources [{:path "/docs/**/*.md"
            :ingest :whole-document}]}
Characteristics:
- One segment per file
- Both embedding and storage use full content
- Simple and straightforward for most use cases
Use when: You want to search across complete documents and return full content.
:namespace-doc
For Clojure source files - embeds only the namespace docstring but stores the full file content.
{:sources [{:path "/src/**/*.clj"
            :ingest :namespace-doc}]}
Requirements:
- File must contain a valid ns form
- Namespace must have a docstring
- Adds :namespace to metadata (e.g., {:namespace "my.app.core"})
Characteristics:
- Embedding uses only namespace docstring
- Storage includes full file content
- One segment per file
Use when: You want to search Clojure namespaces by their documentation while still returning the complete source code.
:file-path
Embeds the full content but stores only the file path.
{:sources [{:path "/docs/**/*.md"
            :ingest :file-path}]}
Characteristics:
- Embedding uses full file content
- Storage contains only the file path
- One segment per file
- Reduces memory footprint for large document sets
Use when:
- You only need to discover which files match a query
- You want to reduce memory usage
- Your client will read file content separately
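The three single-segment strategies described so far (:whole-document, :namespace-doc, :file-path) differ only in which text is embedded and which is stored. A compact way to see this, as a sketch with a hypothetical single-segment function and a hypothetical file map whose :ns-doc value is assumed to have been extracted already:

```clojure
(defn single-segment
  "Returns the text to embed and the text to store for one file.
  `file` is a hypothetical map of :path, :content, and :ns-doc
  (the namespace docstring, assumed to be extracted already)."
  [strategy {:keys [path content ns-doc]}]
  (case strategy
    :whole-document {:embed content :store content}
    :namespace-doc  {:embed ns-doc  :store content}
    :file-path      {:embed content :store path}))

(single-segment :file-path
                {:path "/docs/intro.md" :content "# Intro ..."})
;; => {:embed "# Intro ...", :store "/docs/intro.md"}
```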
:code-analysis
Analyzes Clojure and Java source files using clj-kondo to extract code elements (vars, namespaces, classes, methods, fields, macros). Creates one searchable segment per code element.
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}
Configuration options:
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only          ; :all (default) | :public-only
            :element-types #{:var :macro}}]}  ; optional filter
:visibility - Controls which elements to include:
- :all (default) - Include all elements regardless of visibility
- :public-only - Include only public elements
  - Clojure: Excludes vars with ^:private or {:private true} metadata
  - Java: Excludes members with private or protected access modifiers
:element-types (optional) - Set of element types to include:
- Valid types: :var, :macro, :namespace, :class, :method, :field, :constructor
- If omitted: Include all element types
- If provided: Only include specified types
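The two filters combine as a logical AND. The predicate below is a sketch of that rule, not the server's code: keep-element? is a hypothetical name, and it assumes each code element is a map carrying keyword values under :visibility and :element-type, which may not match the real representation.

```clojure
(defn keep-element?
  "True when `element` passes the source's :visibility and :element-types
  filters. A missing :visibility defaults to :all; a missing :element-types
  means every element type is accepted."
  [{:keys [visibility element-types] :or {visibility :all}} element]
  (and (or (= :all visibility)
           (= :public (:visibility element)))
       (or (nil? element-types)
           (contains? element-types (:element-type element)))))

(def source {:visibility :public-only :element-types #{:var :macro}})

(keep-element? source {:element-type :var :visibility :public})    ; => true
(keep-element? source {:element-type :var :visibility :private})   ; => false
(keep-element? source {:element-type :class :visibility :public})  ; => false
```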
Characteristics:
- Multiple segments per file (one per code element)
- Embedding uses docstring if present, otherwise element name
- Content stores complete clj-kondo analysis map as EDN string
- Supports both Clojure (.clj, .cljs, .cljc) and Java (.java) files
Segment metadata:
- :element-type - Type of code element (var, macro, namespace, class, method, field, constructor)
- :element-name - Qualified name (e.g., "my.ns/my-fn" or "com.example.MyClass.myMethod")
- :language - Source language (clojure or java)
- :namespace - Containing namespace (Clojure) or package (Java)
- :visibility - Access level (public, private, or protected)
Use when: You want to search code by documentation or API discovery, finding functions/classes/methods based on their purpose rather than file names.
Examples:
;; Search all code elements
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis}]}
;; Search only public API
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :visibility :public-only}]}
;; Search only vars and macros
{:sources [{:path "/src/**/*.clj"
            :ingest :code-analysis
            :element-types #{:var :macro}}]}
;; Java source code analysis
{:sources [{:path "/src/**/*.java"
            :ingest :code-analysis
            :visibility :public-only}]}
:chunked
Splits documents into smaller segments using LangChain4j's recursive text splitter. Enables better semantic search for large documents.
{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 512
            :chunk-overlap 100}]}
Configuration:
- :chunk-size - Maximum characters per chunk (default: 512)
- :chunk-overlap - Characters to overlap between chunks (default: 100)
Note: LangChain4j's recursive paragraph splitter prioritizes semantic boundaries (paragraph breaks) over exact overlap amounts. Adjacent chunks may have less overlap than specified if splitting at a paragraph boundary.
Characteristics:
- Multiple segments per file
- Each chunk is embedded and stored independently
- All chunks from the same file share the same :doc-id for batch removal during updates
- Chunk metadata includes: :chunk-index (position), :chunk-count (total chunks), :chunk-offset (character offset)
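To make the chunk metadata concrete, here is a deliberately naive fixed-window splitter. It is not LangChain4j's recursive splitter, which prefers paragraph boundaries and may therefore produce different offsets and smaller overlaps. The naive-chunks function is hypothetical, and the zero-based :chunk-index is an assumption.

```clojure
(defn naive-chunks
  "Splits text into windows of at most chunk-size characters, where each
  window starts (chunk-size - chunk-overlap) characters after the previous one."
  [text chunk-size chunk-overlap]
  {:pre [(< -1 chunk-overlap chunk-size)]}
  (let [step    (- chunk-size chunk-overlap)
        n       (count text)
        ;; A text that fits in one chunk gets a single offset.
        offsets (if (<= n chunk-size)
                  [0]
                  (range 0 (- n chunk-overlap) step))
        total   (count offsets)]
    (map-indexed
      (fn [i offset]
        {:chunk-index  i        ; assumption: zero-based position
         :chunk-count  total
         :chunk-offset offset
         :text         (subs text offset (min n (+ offset chunk-size)))})
      offsets)))

;; 1000 characters with the default settings -> offsets 0, 412, 824
(map :chunk-offset (naive-chunks (apply str (repeat 1000 "a")) 512 100))
```

With the defaults, each window advances by 412 characters, so a 1000-character file yields three chunks, and the last one is shorter than :chunk-size.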
Chunk sizing guidance:
- Smaller chunks (256-512 chars): Better for precise fact-based retrieval
- Larger chunks (1024+ chars): Better for broader context
- Overlap (10-20%): Recommended to preserve context at chunk boundaries
Use when: You have large documents and need precise fact-based retrieval where specific information may be buried in lengthy content.
Examples:
;; Fine-grained retrieval for technical docs
{:sources [{:path "/docs/**/*.md"
            :ingest :chunked
            :chunk-size 384
            :chunk-overlap 75}]}
;; Broader context for narrative content
{:sources [{:path "/articles/**/*.md"
            :ingest :chunked
            :chunk-size 1024
            :chunk-overlap 200}]}
;; Compare strategies for different content types
{:sources [
  ;; Small reference docs - whole document works well
  {:path "/api-reference/**/*.md"
   :ingest :whole-document}
  ;; Large guides - chunking improves precision
  {:path "/guides/**/*.md"
   :ingest :chunked
   :chunk-size 512
   :chunk-overlap 100}]}
Metadata System
Metadata comes from two sources:
- Base metadata: Any additional keys in the source map (except :path, :class-path, :name, :ingest, :watch?)
- Captures: Values extracted from named groups in the path spec
{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "my-project"
            :type "documentation"}]}
For a file /docs/api/functions.md:
- Metadata: {:project "my-project", :type "documentation", :category "api"}
The :name key, if provided, is also added to metadata.
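The merge described above can be sketched in a few lines. The function name source-metadata is hypothetical, and the precedence shown (captures override base metadata on a key collision) is an assumption, since this guide does not state which side wins.

```clojure
(def config-keys
  "Source keys that configure indexing rather than describe the document."
  #{:path :class-path :name :ingest :watch?})

(defn source-metadata
  "Combines base metadata, path captures, and :name (when present)."
  [source captures]
  (merge (apply dissoc source config-keys)  ; base metadata
         captures                           ; assumption: captures win on conflict
         (select-keys source [:name])))

(source-metadata {:path "/docs/(?<category>[^/]+)/*.md"
                  :project "my-project"
                  :type "documentation"}
                 {:category "api"})
;; => {:project "my-project", :type "documentation", :category "api"}
```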
System-added metadata:
- :doc-id - File path (used for watch updates/deletes)
- :file-id - File path
- :segment-id - Unique segment identifier
Strategy-specific metadata:
- :namespace-doc adds: :namespace
- :code-analysis adds: :element-type, :element-name, :language, :namespace, :visibility
- :chunked adds: :chunk-index, :chunk-count, :chunk-offset
File Watching
Optional file watching system for automatic re-indexing when files change.
Configuration:
- Global :watch? true enables watching for all sources
- Per-source :watch? true/false overrides global setting
- Only available for filesystem sources (:path), not classpath sources
{:watch? true                   ; enable globally
 :sources [
   {:path "/docs/**/*.md"}      ; watched (global setting)
   {:path "/src/**/*.clj"
    :watch? false}              ; not watched (override)
   {:path "/notes/**/*.txt"
    :watch? true}]}             ; watched (explicit)
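The override rules reduce to a single lookup with fallbacks. The sketch below uses a hypothetical watched? predicate; it assumes classpath sources are never watched even when :watch? is set, which follows from the statement that watching is only available for :path sources.

```clojure
(defn watched?
  "True when a source should be watched: it must be a filesystem source, and
  its own :watch? (if present) takes priority over the global :watch?."
  [config source]
  (boolean (and (contains? source :path)
                (get source :watch? (get config :watch? false)))))

(def config {:watch? true})

(watched? config {:path "/docs/**/*.md"})                 ; => true
(watched? config {:path "/src/**/*.clj" :watch? false})   ; => false
(watched? config {:class-path "docs/**/*.md"})            ; => false
```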
Behavior:
- Events are debounced (500ms) to avoid excessive re-indexing
- File created → index new file
- File modified → remove old embeddings by :doc-id, re-index
- File deleted → remove embeddings by :doc-id
- Recursive watching for directories with ** glob
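The create/modify/delete handling can be modelled on an in-memory vector of segments. This is a sketch only: the event shape, apply-event, remove-doc, and the index-file argument are all hypothetical, and the 500ms debouncing is left out.

```clojure
(defn remove-doc
  "Drops every segment whose :doc-id equals doc-id."
  [segments doc-id]
  (into [] (remove #(= doc-id (:doc-id %))) segments))

(defn apply-event
  "Returns the updated segments after one file event.
  index-file is a hypothetical function: path -> seq of segments for that file."
  [segments index-file {:keys [kind path]}]
  (case kind
    :created  (into segments (index-file path))
    :modified (into (remove-doc segments path) (index-file path))
    :deleted  (remove-doc segments path)))

;; Stand-in indexer that produces one segment per file.
(defn index-file [path] [{:doc-id path :text (str "contents of " path)}])

(-> []
    (apply-event index-file {:kind :created :path "/docs/a.md"})
    (apply-event index-file {:kind :modified :path "/docs/a.md"})
    (apply-event index-file {:kind :deleted :path "/docs/a.md"}))
;; => []
```

Because chunked and code-analysis sources produce several segments per file, removing by :doc-id rather than by segment is what keeps a modified file from leaving stale segments behind.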
Complete Examples
Basic Documentation Search
{:sources [{:path "/Users/me/docs/**/*.md"}]}
Multi-Source with Metadata
{:description "Project documentation and code search"
 :sources [
   {:path "/Users/me/project/docs/**/*.md"
    :name "Documentation"
    :type "docs"}
   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc
    :name "Source Code"
    :type "code"}]}
Metadata Extraction with Captures
{:sources [{:path "/docs/(?<category>[^/]+)/*.md"
            :project "myapp"
            :type "documentation"}]}
Code Analysis with Filtering
{:sources [
  {:path "/src/**/*.clj"
   :ingest :code-analysis
   :visibility :public-only
   :element-types #{:var :macro}}
  {:path "/src/**/*.java"
   :ingest :code-analysis
   :visibility :public-only}]}
Chunked Large Documents
{:sources [
  {:path "/guides/**/*.md"
   :ingest :chunked
   :chunk-size 512
   :chunk-overlap 100}]}
Mixed Filesystem and Classpath
{:sources [
  ;; Filesystem documentation
  {:path "/Users/me/docs/**/*.md"
   :source "local"}
  ;; Bundled library documentation from classpath
  {:class-path "lib-docs/**/*.md"
   :source "library"}
  ;; Clojure source from classpath
  {:class-path "my/app/**/*.clj"
   :ingest :namespace-doc
   :source "library-code"}]}
File Watching Enabled
{:watch? true
 :sources [
   {:path "/Users/me/project/docs/**/*.md"}
   {:path "/Users/me/project/src/**/*.clj"
    :ingest :namespace-doc}]}
Complex Multi-Strategy Configuration
{:description "Comprehensive project search"
 :watch? true
 :sources [
   ;; API reference - small docs, keep whole
   {:path "/docs/api/**/*.md"
    :ingest :whole-document
    :category "api-reference"}
   ;; User guides - large docs, chunk them
   {:path "/docs/guides/**/*.md"
    :ingest :chunked
    :chunk-size 512
    :chunk-overlap 100
    :category "guides"}
   ;; Public API code
   {:path "/src/(?<namespace>[^/]+)/**/*.clj"
    :ingest :code-analysis
    :visibility :public-only
    :category "code"}
   ;; README files - whole document
   {:path "/(?<project>[^/]+)/README.md"
    :ingest :whole-document
    :category "readme"}]}
Tips and Best Practices
Path specifications:
- Use absolute paths for filesystem sources (start with /)
- Use relative paths for classpath sources (no leading /)
- Named captures are powerful for extracting structured metadata
- Test path patterns with a small subset first
Ingest strategies:
- Start with :whole-document for most use cases
- Use :namespace-doc for Clojure codebases to search by documentation
- Use :code-analysis when you need fine-grained API discovery
- Use :chunked for large documents (>1000 chars)
- Use :file-path when you need to minimize memory usage
Metadata:
- Add meaningful metadata to enable filtering during search
- Use consistent naming conventions for metadata keys
- Captures are great for hierarchical organization (project, version, category)
File watching:
- Enable globally with :watch? true for development
- Disable for production or when working with static content
- Override per-source as needed
- Only works with filesystem sources, not classpath
Performance:
- Smaller chunk sizes create more segments (more memory, more precise search)
- Larger chunk sizes create fewer segments (less memory, broader context)
- :file-path strategy significantly reduces memory usage