Files
2025-11-30 08:30:10 +08:00

13 KiB

Matchms Similarity Functions Reference

This document provides detailed information about all similarity scoring methods available in matchms.

Overview

Matchms provides multiple similarity functions for comparing mass spectra. Use calculate_scores() to compute pairwise similarities between reference and query spectra collections.

from matchms import calculate_scores
from matchms.similarity import CosineGreedy

scores = calculate_scores(references=library_spectra,
                         queries=query_spectra,
                         similarity_function=CosineGreedy())

Peak-Based Similarity Functions

These functions compare mass spectra based on their peak patterns (m/z and intensity values).

CosineGreedy

Description: Calculates cosine similarity between two spectra using a fast greedy matching algorithm. Peaks are matched within a specified tolerance, and similarity is computed based on matched peak intensities.

When to use:

  • Fast similarity calculations for large datasets
  • General-purpose spectral matching
  • When speed is prioritized over mathematically optimal matching

Parameters:

  • tolerance (float, default=0.1): Maximum m/z difference for peak matching (Daltons)
  • mz_power (float, default=0.0): Exponent for m/z weighting (0 = no weighting)
  • intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms.similarity import CosineGreedy

similarity_func = CosineGreedy(tolerance=0.1, mz_power=0.0, intensity_power=1.0)
scores = calculate_scores(references, queries, similarity_func)

Output: Similarity score between 0.0 and 1.0, plus number of matched peaks.


CosineHungarian

Description: Calculates cosine similarity using the Hungarian algorithm for optimal peak matching. Provides mathematically optimal peak assignments but is slower than CosineGreedy.

When to use:

  • When optimal peak matching is required
  • High-quality reference library comparisons
  • Research requiring reproducible, mathematically rigorous results

Parameters:

  • tolerance (float, default=0.1): Maximum m/z difference for peak matching
  • mz_power (float, default=0.0): Exponent for m/z weighting
  • intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms.similarity import CosineHungarian

similarity_func = CosineHungarian(tolerance=0.1)
scores = calculate_scores(references, queries, similarity_func)

Output: Optimal similarity score between 0.0 and 1.0, plus matched peaks.

Note: Slower than CosineGreedy; use for smaller datasets or when accuracy is critical.


ModifiedCosine

Description: Extends cosine similarity by accounting for precursor m/z differences. Allows peaks to match after applying a mass shift based on the difference between precursor masses. Useful for comparing spectra of related compounds (isotopes, adducts, analogs).

When to use:

  • Comparing spectra from different precursor masses
  • Identifying structural analogs or derivatives
  • Cross-ionization mode comparisons
  • When precursor mass differences are meaningful

Parameters:

  • tolerance (float, default=0.1): Maximum m/z difference for peak matching after shift
  • mz_power (float, default=0.0): Exponent for m/z weighting
  • intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms.similarity import ModifiedCosine

similarity_func = ModifiedCosine(tolerance=0.1)
scores = calculate_scores(references, queries, similarity_func)

Requirements: Both spectra must have valid precursor_mz metadata.


NeutralLossesCosine

Description: Calculates similarity based on neutral loss patterns rather than fragment m/z values. Neutral losses are derived by subtracting fragment m/z from precursor m/z. Particularly useful for identifying compounds with similar fragmentation patterns.

When to use:

  • Comparing fragmentation patterns across different precursor masses
  • Identifying compounds with similar neutral loss profiles
  • Complementary to regular cosine scoring
  • Metabolite identification and classification

Parameters:

  • tolerance (float, default=0.1): Maximum neutral loss difference for matching
  • mz_power (float, default=0.0): Exponent for loss value weighting
  • intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms.similarity import NeutralLossesCosine
from matchms.filtering import add_losses

# First add losses to spectra
spectra_with_losses = [add_losses(s) for s in spectra]

similarity_func = NeutralLossesCosine(tolerance=0.1)
scores = calculate_scores(references_with_losses, queries_with_losses, similarity_func)

Requirements:

  • Both spectra must have valid precursor_mz metadata
  • Use add_losses() filter to compute neutral losses before scoring

Structural Similarity Functions

These functions compare molecular structures rather than spectral peaks.

FingerprintSimilarity

Description: Calculates similarity between molecular fingerprints derived from chemical structures (SMILES or InChI). Supports multiple fingerprint types and similarity metrics.

When to use:

  • Structural similarity without spectral data
  • Combining structural and spectral similarity
  • Pre-filtering candidates before spectral matching
  • Structure-activity relationship studies

Parameters:

  • fingerprint_type (str, default="daylight"): Type of fingerprint
    • "daylight": Daylight fingerprint
    • "morgan1", "morgan2", "morgan3": Morgan fingerprints with radius 1, 2, or 3
  • similarity_measure (str, default="jaccard"): Similarity metric
    • "jaccard": Jaccard index (intersection / union)
    • "dice": Dice coefficient (2 * intersection / (size1 + size2))
    • "cosine": Cosine similarity

Example:

from matchms.similarity import FingerprintSimilarity
from matchms.filtering import add_fingerprint

# Add fingerprints to spectra
spectra_with_fps = [add_fingerprint(s, fingerprint_type="morgan2", nbits=2048)
                    for s in spectra]

similarity_func = FingerprintSimilarity(similarity_measure="jaccard")
scores = calculate_scores(references_with_fps, queries_with_fps, similarity_func)

Requirements:

  • Spectra must have valid SMILES or InChI metadata
  • Use add_fingerprint() filter to compute fingerprints
  • Requires rdkit library

Metadata-Based Similarity Functions

These functions compare metadata fields rather than spectral or structural data.

MetadataMatch

Description: Compares user-defined metadata fields between spectra. Supports exact matching for categorical data and tolerance-based matching for numerical data.

When to use:

  • Filtering by experimental conditions (collision energy, retention time)
  • Instrument-specific matching
  • Combining metadata constraints with spectral similarity
  • Custom metadata-based filtering

Parameters:

  • field (str): Metadata field name to compare
  • matching_type (str, default="exact"): Matching method
    • "exact": Exact string/value match
    • "difference": Absolute difference for numerical values
    • "relative_difference": Relative difference for numerical values
  • tolerance (float, optional): Maximum difference for numerical matching

Example (Exact matching):

from matchms.similarity import MetadataMatch

# Match by instrument type
similarity_func = MetadataMatch(field="instrument_type", matching_type="exact")
scores = calculate_scores(references, queries, similarity_func)

Example (Numerical matching):

# Match retention time within 0.5 minutes
similarity_func = MetadataMatch(field="retention_time",
                                matching_type="difference",
                                tolerance=0.5)
scores = calculate_scores(references, queries, similarity_func)

Output: Returns 1.0 (match) or 0.0 (no match) for exact matching. For numerical matching, returns similarity score based on difference.


PrecursorMzMatch

Description: Binary matching based on precursor m/z values. Returns True/False based on whether precursor masses match within specified tolerance.

When to use:

  • Pre-filtering spectral libraries by precursor mass
  • Fast mass-based candidate selection
  • Combining with other similarity metrics
  • Isobaric compound identification

Parameters:

  • tolerance (float, default=0.1): Maximum m/z difference for matching
  • tolerance_type (str, default="Dalton"): Tolerance unit
    • "Dalton": Absolute mass difference
    • "ppm": Parts per million (relative)

Example:

from matchms.similarity import PrecursorMzMatch

# Match precursor within 0.1 Da
similarity_func = PrecursorMzMatch(tolerance=0.1, tolerance_type="Dalton")
scores = calculate_scores(references, queries, similarity_func)

# Match precursor within 10 ppm
similarity_func = PrecursorMzMatch(tolerance=10, tolerance_type="ppm")
scores = calculate_scores(references, queries, similarity_func)

Output: 1.0 (match) or 0.0 (no match)

Requirements: Both spectra must have valid precursor_mz metadata.


ParentMassMatch

Description: Binary matching based on parent mass (neutral mass) values. Similar to PrecursorMzMatch but uses calculated parent mass instead of precursor m/z.

When to use:

  • Comparing spectra from different ionization modes
  • Adduct-independent matching
  • Neutral mass-based library searches

Parameters:

  • tolerance (float, default=0.1): Maximum mass difference for matching
  • tolerance_type (str, default="Dalton"): Tolerance unit ("Dalton" or "ppm")

Example:

from matchms.similarity import ParentMassMatch

similarity_func = ParentMassMatch(tolerance=0.1, tolerance_type="Dalton")
scores = calculate_scores(references, queries, similarity_func)

Output: 1.0 (match) or 0.0 (no match)

Requirements: Both spectra must have valid parent_mass metadata.


Combining Multiple Similarity Functions

Combine multiple similarity metrics for robust compound identification:

from matchms import calculate_scores
from matchms.similarity import CosineGreedy, ModifiedCosine, FingerprintSimilarity

# Calculate multiple similarity scores
cosine_scores = calculate_scores(refs, queries, CosineGreedy())
modified_cosine_scores = calculate_scores(refs, queries, ModifiedCosine())
fingerprint_scores = calculate_scores(refs, queries, FingerprintSimilarity())

# Combine scores with weights
for i, query in enumerate(queries):
    for j, ref in enumerate(refs):
        combined_score = (0.5 * cosine_scores.scores[j, i] +
                         0.3 * modified_cosine_scores.scores[j, i] +
                         0.2 * fingerprint_scores.scores[j, i])

Accessing Scores Results

The Scores object provides multiple methods to access results:

# Get best matches for a query
best_matches = scores.scores_by_query(query_spectrum, sort=True)[:10]

# Get scores as numpy array
score_array = scores.scores

# Get scores as pandas DataFrame
import pandas as pd
df = scores.to_dataframe()

# Filter by threshold
high_scores = [(i, j, score) for i, j, score in scores.to_list() if score > 0.7]

# Save scores
scores.to_json("scores.json")
scores.to_pickle("scores.pkl")

Performance Considerations

Fast methods (large datasets):

  • CosineGreedy
  • PrecursorMzMatch
  • ParentMassMatch

Slow methods (smaller datasets or high accuracy):

  • CosineHungarian
  • ModifiedCosine (slower than CosineGreedy)
  • NeutralLossesCosine
  • FingerprintSimilarity (requires fingerprint computation)

Recommendation: For large-scale library searches, use PrecursorMzMatch to pre-filter candidates, then apply CosineGreedy or ModifiedCosine to filtered results.

Common Similarity Workflows

Standard Library Matching

from matchms.similarity import CosineGreedy

scores = calculate_scores(library_spectra, query_spectra,
                         CosineGreedy(tolerance=0.1))

Multi-Metric Matching

from matchms.similarity import CosineGreedy, ModifiedCosine, FingerprintSimilarity

# Spectral similarity
cosine = calculate_scores(refs, queries, CosineGreedy())
modified = calculate_scores(refs, queries, ModifiedCosine())

# Structural similarity
fingerprint = calculate_scores(refs, queries, FingerprintSimilarity())

Precursor-Filtered Matching

from matchms.similarity import PrecursorMzMatch, CosineGreedy

# First filter by precursor mass
mass_filter = calculate_scores(refs, queries, PrecursorMzMatch(tolerance=0.1))

# Then calculate cosine only for matching precursors
cosine_scores = calculate_scores(refs, queries, CosineGreedy())

Further Reading

For detailed API documentation, parameter descriptions, and mathematical formulations, see: https://matchms.readthedocs.io/en/latest/api/matchms.similarity.html