zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

Matchms Similarity Functions Reference

This document provides detailed information about all similarity scoring methods available in matchms.

Overview

Matchms provides multiple similarity functions for comparing mass spectra. Use calculate_scores() to compute pairwise similarities between a collection of reference spectra and a collection of query spectra.

from matchms import calculate_scores
from matchms.similarity import CosineGreedy

scores = calculate_scores(references=library_spectra,
                         queries=query_spectra,
                         similarity_function=CosineGreedy())

Peak-Based Similarity Functions

These functions compare mass spectra based on their peak patterns (m/z and intensity values).

CosineGreedy

Description: Calculates cosine similarity between two spectra using a fast greedy matching algorithm. Peaks are matched within a specified tolerance, and similarity is computed based on matched peak intensities.

When to use:

Fast similarity calculations for large datasets
General-purpose spectral matching
When speed is prioritized over mathematically optimal matching

Parameters:

tolerance (float, default=0.1): Maximum m/z difference for peak matching (Daltons)
mz_power (float, default=0.0): Exponent for m/z weighting (0 = no weighting)
intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms import calculate_scores
from matchms.similarity import CosineGreedy

similarity_func = CosineGreedy(tolerance=0.1, mz_power=0.0, intensity_power=1.0)
scores = calculate_scores(references, queries, similarity_func)

Output: Similarity score between 0.0 and 1.0, plus number of matched peaks.
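The greedy matching described above can be sketched in plain Python. This is an illustrative stand-in, not the matchms implementation: the function name greedy_cosine and the representation of a spectrum as a list of (m/z, intensity) tuples are assumptions made for the example.

```python
import math

def greedy_cosine(spec1, spec2, tolerance=0.1, mz_power=0.0, intensity_power=1.0):
    """Return (score, n_matches) for two spectra given as lists of (mz, intensity)."""
    def weight(mz, intensity):
        return (mz ** mz_power) * (intensity ** intensity_power)

    # Collect every candidate pair within tolerance, with its weighted product.
    candidates = []
    for i, (mz1, int1) in enumerate(spec1):
        for j, (mz2, int2) in enumerate(spec2):
            if abs(mz1 - mz2) <= tolerance:
                candidates.append((weight(mz1, int1) * weight(mz2, int2), i, j))

    # Greedy step: accept the largest products first, using each peak at most once.
    candidates.sort(reverse=True)
    used1, used2 = set(), set()
    total, n_matches = 0.0, 0
    for product, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += product
            n_matches += 1

    norm1 = math.sqrt(sum(weight(mz, it) ** 2 for mz, it in spec1))
    norm2 = math.sqrt(sum(weight(mz, it) ** 2 for mz, it in spec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0, 0
    return total / (norm1 * norm2), n_matches

# Hypothetical toy spectra: two peaks agree within 0.1 Da, one does not.
reference = [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)]
query = [(100.05, 0.4), (150.02, 0.2), (300.0, 1.0)]
score, n_matches = greedy_cosine(reference, query)
self_score, _ = greedy_cosine(reference, reference)
```

Here two peak pairs match, and the unmatched intense query peak at m/z 300 lowers the score through the normalization term. Comparing a spectrum with itself gives a score of 1.0.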

CosineHungarian

Description: Calculates cosine similarity using the Hungarian algorithm for optimal peak matching. Provides mathematically optimal peak assignments but is slower than CosineGreedy.

When to use:

When optimal peak matching is required
High-quality reference library comparisons
Research requiring reproducible, mathematically rigorous results

Parameters:

tolerance (float, default=0.1): Maximum m/z difference for peak matching
mz_power (float, default=0.0): Exponent for m/z weighting
intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms import calculate_scores
from matchms.similarity import CosineHungarian

similarity_func = CosineHungarian(tolerance=0.1)
scores = calculate_scores(references, queries, similarity_func)

Output: Optimal similarity score between 0.0 and 1.0, plus the number of matched peaks.

Note: Slower than CosineGreedy; use for smaller datasets or when accuracy is critical.
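To see why an optimal assignment can beat a greedy one, consider the small sketch below. It uses an exhaustive search as a stand-in for the Hungarian algorithm (which reaches the same optimum far more efficiently); the function names and the toy peaks are assumptions for illustration only, and the totals are unnormalized sums of matched intensity products.

```python
def greedy_total(spec1, spec2, tolerance=0.1):
    """Sum of intensity products when the largest products are accepted first."""
    pairs = sorted(
        ((int1 * int2, i, j)
         for i, (mz1, int1) in enumerate(spec1)
         for j, (mz2, int2) in enumerate(spec2)
         if abs(mz1 - mz2) <= tolerance),
        reverse=True,
    )
    used1, used2, total = set(), set(), 0.0
    for product, i, j in pairs:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += product
    return total

def optimal_total(spec1, spec2, tolerance=0.1, i=0, used=frozenset()):
    """Best achievable sum, found by trying every one-to-one assignment."""
    if i == len(spec1):
        return 0.0
    # Option 1: leave peak i of spec1 unmatched.
    best = optimal_total(spec1, spec2, tolerance, i + 1, used)
    # Option 2: match peak i with any free peak of spec2 within tolerance.
    mz1, int1 = spec1[i]
    for j, (mz2, int2) in enumerate(spec2):
        if j not in used and abs(mz1 - mz2) <= tolerance:
            rest = optimal_total(spec1, spec2, tolerance, i + 1, used | {j})
            best = max(best, int1 * int2 + rest)
    return best

# Hypothetical crowded region: the strongest pair blocks two good pairs.
spec_a = [(100.00, 1.0), (100.15, 0.9)]
spec_b = [(100.08, 1.0), (99.95, 0.9)]
greedy_sum = greedy_total(spec_a, spec_b)
optimal_sum = optimal_total(spec_a, spec_b)
```

The greedy pass takes the single strongest pair (product 1.0) and leaves the other two peaks without partners, while the exhaustive search finds two pairs worth 0.9 each. Differences like this only arise when several peaks fall within the tolerance of one another.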

ModifiedCosine

Description: Extends cosine similarity by accounting for precursor m/z differences. Allows peaks to match after applying a mass shift based on the difference between precursor masses. Useful for comparing spectra of related compounds (isotopes, adducts, analogs).

When to use:

Comparing spectra from different precursor masses
Identifying structural analogs or derivatives
Cross-ionization mode comparisons
When precursor mass differences are meaningful

Parameters:

tolerance (float, default=0.1): Maximum m/z difference for peak matching after shift
mz_power (float, default=0.0): Exponent for m/z weighting
intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms import calculate_scores
from matchms.similarity import ModifiedCosine

similarity_func = ModifiedCosine(tolerance=0.1)
scores = calculate_scores(references, queries, similarity_func)

Requirements: Both spectra must have valid precursor_mz metadata.
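The mass-shift idea can be illustrated with a short sketch. It is a simplified stand-in, not the matchms code: the function name, the tuple-based spectra, and the example masses are assumptions. A peak pair counts as a candidate if it matches either directly or after shifting by the precursor m/z difference.

```python
import math

def modified_cosine(peaks1, precursor1, peaks2, precursor2, tolerance=0.1):
    """Return (score, n_matches); peaks are lists of (mz, intensity)."""
    shift = precursor1 - precursor2
    candidates = []
    for i, (mz1, int1) in enumerate(peaks1):
        for j, (mz2, int2) in enumerate(peaks2):
            direct = abs(mz1 - mz2) <= tolerance
            shifted = abs(mz1 - (mz2 + shift)) <= tolerance
            if direct or shifted:
                candidates.append((int1 * int2, i, j))

    # Greedy assignment: each peak is used at most once.
    candidates.sort(reverse=True)
    used1, used2 = set(), set()
    total, n_matches = 0.0, 0
    for product, i, j in candidates:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += product
            n_matches += 1

    norm = (math.sqrt(sum(it ** 2 for _, it in peaks1))
            * math.sqrt(sum(it ** 2 for _, it in peaks2)))
    return (total / norm if norm else 0.0), n_matches

# Hypothetical compound and an analog that is 14 Da heavier.
compound = [(100.0, 1.0), (250.0, 0.5)]  # precursor m/z 300.0
analog = [(100.0, 1.0), (264.0, 0.5)]    # precursor m/z 314.0

# Passing equal precursors gives a zero shift, i.e. a plain cosine comparison.
plain_score, plain_matches = modified_cosine(compound, 300.0, analog, 300.0)
shifted_score, shifted_matches = modified_cosine(compound, 300.0, analog, 314.0)
```

Without the shift only the shared fragment at m/z 100 matches. With the shift, the fragment that carries the modification (250 versus 264) matches as well, so the score rises.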

NeutralLossesCosine

Description: Calculates similarity based on neutral loss patterns rather than fragment m/z values. Neutral losses are derived by subtracting fragment m/z from precursor m/z. Particularly useful for identifying compounds with similar fragmentation patterns.

When to use:

Comparing fragmentation patterns across different precursor masses
Identifying compounds with similar neutral loss profiles
Complementary to regular cosine scoring
Metabolite identification and classification

Parameters:

tolerance (float, default=0.1): Maximum neutral loss difference for matching
mz_power (float, default=0.0): Exponent for loss value weighting
intensity_power (float, default=1.0): Exponent for intensity weighting

Example:

from matchms import calculate_scores
from matchms.filtering import add_losses
from matchms.similarity import NeutralLossesCosine

# Add losses to both collections before scoring
references_with_losses = [add_losses(s) for s in references]
queries_with_losses = [add_losses(s) for s in queries]

similarity_func = NeutralLossesCosine(tolerance=0.1)
scores = calculate_scores(references_with_losses, queries_with_losses, similarity_func)

Requirements:

Both spectra must have valid precursor_mz metadata
Use add_losses() filter to compute neutral losses before scoring
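How neutral losses are derived can be shown with a few lines of plain Python. The helper below is a hypothetical sketch (its name, the loss window arguments, and the toy spectra are assumptions), not the add_losses() filter itself.

```python
def neutral_losses(peaks, precursor_mz, loss_mz_from=0.0, loss_mz_to=1000.0):
    """Convert (mz, intensity) fragments into sorted (loss, intensity) pairs."""
    losses = []
    for mz, intensity in peaks:
        loss = precursor_mz - mz
        # Fragments above the precursor would give negative losses; skip them.
        if loss_mz_from <= loss <= loss_mz_to:
            losses.append((loss, intensity))
    return sorted(losses)

# Two hypothetical compounds with different precursors and different fragments...
peaks_a = [(100.0, 1.0), (250.0, 0.5)]  # precursor m/z 300.0
peaks_b = [(114.0, 1.0), (264.0, 0.5)]  # precursor m/z 314.0

# ...but identical neutral losses (50 and 200).
losses_a = neutral_losses(peaks_a, 300.0)
losses_b = neutral_losses(peaks_b, 314.0)
```

No fragment m/z is shared between the two spectra, so a plain cosine score would be zero, yet the loss patterns are identical. This is the situation in which a neutral-loss comparison adds information.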

Structural Similarity Functions

These functions compare molecular structures rather than spectral peaks.

FingerprintSimilarity

Description: Calculates similarity between molecular fingerprints derived from chemical structures (SMILES or InChI). Supports multiple fingerprint types and similarity metrics.

When to use:

Structural similarity without spectral data
Combining structural and spectral similarity
Pre-filtering candidates before spectral matching
Structure-activity relationship studies

Parameters:

fingerprint_type (str, default="daylight"): Type of fingerprint; set when calling add_fingerprint(), as in the example below
- "daylight": Daylight fingerprint
- "morgan1", "morgan2", "morgan3": Morgan fingerprints with radius 1, 2, or 3
similarity_measure (str, default="jaccard"): Similarity metric passed to FingerprintSimilarity
- "jaccard": Jaccard index (intersection / union)
- "dice": Dice coefficient (2 * intersection / (size1 + size2))
- "cosine": Cosine similarity

Example:

from matchms import calculate_scores
from matchms.filtering import add_fingerprint
from matchms.similarity import FingerprintSimilarity

# Add fingerprints to both collections
references_with_fps = [add_fingerprint(s, fingerprint_type="morgan2", nbits=2048)
                       for s in references]
queries_with_fps = [add_fingerprint(s, fingerprint_type="morgan2", nbits=2048)
                    for s in queries]

similarity_func = FingerprintSimilarity(similarity_measure="jaccard")
scores = calculate_scores(references_with_fps, queries_with_fps, similarity_func)

Requirements:

Spectra must have valid SMILES or InChI metadata
Use add_fingerprint() filter to compute fingerprints
Requires rdkit library
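The three similarity measures listed above differ only in how they normalize the number of shared bits. The sketch below computes them on fingerprints represented as sets of on-bit positions; the bit positions are made up for illustration and no rdkit call is involved.

```python
import math

def jaccard(fp1, fp2):
    """Shared bits divided by bits set in either fingerprint."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

def dice(fp1, fp2):
    """Twice the shared bits divided by the total number of set bits."""
    total = len(fp1) + len(fp2)
    return 2 * len(fp1 & fp2) / total if total else 0.0

def cosine(fp1, fp2):
    """Shared bits divided by the geometric mean of the set-bit counts."""
    denom = math.sqrt(len(fp1) * len(fp2))
    return len(fp1 & fp2) / denom if denom else 0.0

# Hypothetical fingerprints: 4 and 9 bits set, 3 of them shared.
fp_small = {1, 5, 9, 12}
fp_large = {1, 5, 12, 20, 33, 40, 41, 42, 50}

jaccard_score = jaccard(fp_small, fp_large)  # 3 / 10
dice_score = dice(fp_small, fp_large)        # 6 / 13
cosine_score = cosine(fp_small, fp_large)    # 3 / sqrt(36)
```

For the same pair of fingerprints the Jaccard index is the strictest of the three, so thresholds are not interchangeable between measures.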

Metadata-Based Similarity Functions

These functions compare metadata fields rather than spectral or structural data.

MetadataMatch

Description: Compares user-defined metadata fields between spectra. Supports exact matching for categorical data and tolerance-based matching for numerical data.

When to use:

Filtering by experimental conditions (collision energy, retention time)
Instrument-specific matching
Combining metadata constraints with spectral similarity
Custom metadata-based filtering

Parameters:

field (str): Metadata field name to compare
matching_type (str, default="equal_match"): Matching method
- "equal_match": Exact string/value match
- "difference": Absolute difference for numerical values, compared against tolerance
tolerance (float, default=0.1): Maximum difference for numerical matching (only used with "difference")

Example (Exact matching):

from matchms import calculate_scores
from matchms.similarity import MetadataMatch

# Match by instrument type
similarity_func = MetadataMatch(field="instrument_type", matching_type="equal_match")
scores = calculate_scores(references, queries, similarity_func)

Example (Numerical matching):

# Match retention time within 0.5 minutes
similarity_func = MetadataMatch(field="retention_time",
                                matching_type="difference",
                                tolerance=0.5)
scores = calculate_scores(references, queries, similarity_func)

Output: Returns True (match) or False (no match). With "equal_match" the two field values must be identical; with "difference" the absolute difference between the numerical values must not exceed tolerance.
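The two matching behaviours described above can be sketched on plain metadata dictionaries. The helper names and the metadata values are hypothetical; the sketch only illustrates the comparison logic, not the MetadataMatch class.

```python
def fields_equal(meta1, meta2, field):
    """Categorical comparison: both entries must exist and be identical."""
    value1, value2 = meta1.get(field), meta2.get(field)
    if value1 is None or value2 is None:
        return False
    return value1 == value2

def fields_within_tolerance(meta1, meta2, field, tolerance):
    """Numerical comparison: absolute difference must not exceed tolerance."""
    value1, value2 = meta1.get(field), meta2.get(field)
    if value1 is None or value2 is None:
        return False
    return abs(value1 - value2) <= tolerance

# Hypothetical metadata for one reference and two queries.
reference = {"instrument_type": "LC-ESI-QTOF", "retention_time": 5.2}
query_close = {"instrument_type": "LC-ESI-QTOF", "retention_time": 5.6}
query_far = {"instrument_type": "GC-EI-TOF", "retention_time": 6.0}

same_instrument = fields_equal(reference, query_close, "instrument_type")
other_instrument = fields_equal(reference, query_far, "instrument_type")
rt_close = fields_within_tolerance(reference, query_close, "retention_time", 0.5)
rt_far = fields_within_tolerance(reference, query_far, "retention_time", 0.5)
```

A missing field is treated as a non-match in this sketch, which is one reasonable choice when metadata is incomplete.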

PrecursorMzMatch

Description: Binary matching based on precursor m/z values. Returns True/False based on whether precursor masses match within specified tolerance.

When to use:

Pre-filtering spectral libraries by precursor mass
Fast mass-based candidate selection
Combining with other similarity metrics
Isobaric compound identification

Parameters:

tolerance (float, default=0.1): Maximum m/z difference for matching
tolerance_type (str, default="Dalton"): Tolerance unit
- "Dalton": Absolute mass difference
- "ppm": Parts per million (relative)

Example:

from matchms import calculate_scores
from matchms.similarity import PrecursorMzMatch

# Match precursor within 0.1 Da
similarity_func = PrecursorMzMatch(tolerance=0.1, tolerance_type="Dalton")
scores = calculate_scores(references, queries, similarity_func)

# Match precursor within 10 ppm
similarity_func = PrecursorMzMatch(tolerance=10, tolerance_type="ppm")
scores = calculate_scores(references, queries, similarity_func)

Output: True (match) or False (no match); the boolean behaves as 1.0 or 0.0 when used in arithmetic.

Requirements: Both spectra must have valid precursor_mz metadata.
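The difference between the two tolerance units is easiest to see on concrete numbers. The sketch below assumes that the ppm difference is taken relative to the mean of the two m/z values, which is one common convention; the function name is hypothetical.

```python
def precursor_mz_match(mz1, mz2, tolerance=0.1, tolerance_type="Dalton"):
    """Return True if two precursor m/z values agree within the tolerance."""
    difference = abs(mz1 - mz2)
    if tolerance_type == "Dalton":
        return difference <= tolerance
    if tolerance_type == "ppm":
        # Assumption: relative difference with respect to the mean m/z.
        mean_mz = (mz1 + mz2) / 2
        return difference / mean_mz * 1e6 <= tolerance
    raise ValueError(f"Unknown tolerance_type: {tolerance_type}")

# The same 0.004 Da difference at two different masses.
high_mass_da = precursor_mz_match(500.000, 500.004, 0.1, "Dalton")
high_mass_ppm = precursor_mz_match(500.000, 500.004, 10, "ppm")   # about 8 ppm
low_mass_da = precursor_mz_match(100.000, 100.004, 0.1, "Dalton")
low_mass_ppm = precursor_mz_match(100.000, 100.004, 10, "ppm")    # about 40 ppm
```

An absolute tolerance treats both pairs alike, whereas a ppm tolerance becomes stricter at low m/z. That makes ppm the more natural choice when instrument accuracy is specified in relative terms.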

ParentMassMatch

Description: Binary matching based on parent mass (neutral mass) values. Similar to PrecursorMzMatch but uses calculated parent mass instead of precursor m/z.

When to use:

Comparing spectra from different ionization modes
Adduct-independent matching
Neutral mass-based library searches

Parameters:

tolerance (float, default=0.1): Maximum mass difference for matching
tolerance_type (str, default="Dalton"): Tolerance unit ("Dalton" or "ppm")

Example:

from matchms import calculate_scores
from matchms.similarity import ParentMassMatch

similarity_func = ParentMassMatch(tolerance=0.1, tolerance_type="Dalton")
scores = calculate_scores(references, queries, similarity_func)

Output: True (match) or False (no match); the boolean behaves as 1.0 or 0.0 when used in arithmetic.

Requirements: Both spectra must have valid parent_mass metadata.
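Why parent mass helps across ionization modes can be shown with a small calculation. The sketch below handles only two singly charged adducts and uses a hypothetical helper name and a hypothetical compound; in practice the parent mass is expected to be present in the metadata already.

```python
PROTON_MASS = 1.007276  # Da

def parent_mass_from_precursor(precursor_mz, adduct):
    """Neutral mass for two simple singly charged adducts (illustrative only)."""
    if adduct == "[M+H]+":
        return precursor_mz - PROTON_MASS
    if adduct == "[M-H]-":
        return precursor_mz + PROTON_MASS
    raise ValueError(f"Adduct not covered by this sketch: {adduct}")

# Hypothetical compound with neutral mass 180.0634 measured in both modes.
mz_positive = 180.0634 + PROTON_MASS  # [M+H]+
mz_negative = 180.0634 - PROTON_MASS  # [M-H]-

mass_positive = parent_mass_from_precursor(mz_positive, "[M+H]+")
mass_negative = parent_mass_from_precursor(mz_negative, "[M-H]-")

precursors_match = abs(mz_positive - mz_negative) <= 0.1
parent_masses_match = abs(mass_positive - mass_negative) <= 0.1
```

The precursor m/z values differ by about 2 Da and fail a 0.1 Da check, while the derived parent masses agree.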

Combining Multiple Similarity Functions

Combine multiple similarity metrics for robust compound identification:

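from matchms import calculate_scores
from matchms.similarity import CosineGreedy, ModifiedCosine, FingerprintSimilarity

# Calculate multiple similarity scores
# (FingerprintSimilarity expects fingerprints added beforehand with add_fingerprint())
cosine_scores = calculate_scores(refs, queries, CosineGreedy())
modified_cosine_scores = calculate_scores(refs, queries, ModifiedCosine())
fingerprint_scores = calculate_scores(refs, queries, FingerprintSimilarity())

# Combine scores with weights, keeping one value per (reference, query) pair.
# CosineGreedy and ModifiedCosine return a score plus a match count, so the
# score field is selected here; its exact name can differ between matchms versions.
combined_scores = {}
for i, query in enumerate(queries):
    for j, ref in enumerate(refs):
        combined_scores[(j, i)] = (0.5 * cosine_scores.scores[j, i]["score"] +
                                   0.3 * modified_cosine_scores.scores[j, i]["score"] +
                                   0.2 * fingerprint_scores.scores[j, i])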
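The weighting step itself is independent of matchms and can be checked on small matrices. In the sketch below the score values and the helper name are made up; rows are references and columns are queries, matching the [j, i] indexing used above.

```python
def combine_scores(score_matrices, weights):
    """Weighted sum of equally shaped score matrices (lists of rows)."""
    n_refs = len(score_matrices[0])
    n_queries = len(score_matrices[0][0])
    return [
        [sum(w * matrix[j][i] for w, matrix in zip(weights, score_matrices))
         for i in range(n_queries)]
        for j in range(n_refs)
    ]

# Hypothetical scores for 2 references (rows) and 2 queries (columns).
cosine = [[0.9, 0.1], [0.2, 0.8]]
modified = [[1.0, 0.0], [0.5, 0.5]]
fingerprint = [[0.5, 0.5], [0.0, 1.0]]

combined = combine_scores([cosine, modified, fingerprint], weights=(0.5, 0.3, 0.2))

# Index of the best-scoring reference for each query.
best_reference = [
    max(range(len(combined)), key=lambda j: combined[j][i])
    for i in range(len(combined[0]))
]
```

Because the weights sum to 1 and each input lies between 0 and 1, the combined value stays in the same range. The weights themselves are a modelling choice and should be tuned for the data at hand.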

Accessing Scores Results

The Scores object provides multiple methods to access results:

# Get best matches for a query
best_matches = scores.scores_by_query(query_spectrum, sort=True)[:10]

# Get scores as numpy array
score_array = scores.scores

# Get scores as pandas DataFrame
import pandas as pd
df = scores.to_dataframe()

# Filter by threshold
high_scores = [(i, j, score) for i, j, score in scores.to_list() if score > 0.7]

# Save scores
scores.to_json("scores.json")
scores.to_pickle("scores.pkl")
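Ranking and thresholding can also be done on a plain score matrix, which is useful when the available Scores methods differ between matchms versions. The helpers and the numbers below are hypothetical; rows are references and columns are queries.

```python
def top_matches(score_matrix, query_index, k=10):
    """Return up to k (reference_index, score) pairs, best first."""
    column = [(row[query_index], ref_index)
              for ref_index, row in enumerate(score_matrix)]
    column.sort(reverse=True)
    return [(ref_index, score) for score, ref_index in column[:k]]

def pairs_above(score_matrix, threshold):
    """Return (reference_index, query_index, score) for scores above threshold."""
    return [(j, i, score)
            for j, row in enumerate(score_matrix)
            for i, score in enumerate(row)
            if score > threshold]

# Hypothetical scores for 3 references and 2 queries.
score_matrix = [
    [0.95, 0.10],
    [0.40, 0.72],
    [0.71, 0.30],
]

best_for_first_query = top_matches(score_matrix, query_index=0, k=2)
high_scores = pairs_above(score_matrix, 0.7)
```

With structured scores (for example score plus matched-peak count), the score field would need to be extracted first before applying helpers like these.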

Performance Considerations

Fast methods (large datasets):

CosineGreedy
PrecursorMzMatch
ParentMassMatch

Slow methods (smaller datasets or high accuracy):

CosineHungarian
ModifiedCosine (slower than CosineGreedy)
NeutralLossesCosine
FingerprintSimilarity (requires fingerprint computation)

Recommendation: For large-scale library searches, use PrecursorMzMatch to pre-filter candidates, then apply CosineGreedy or ModifiedCosine to filtered results.

Common Similarity Workflows

Standard Library Matching

from matchms import calculate_scores
from matchms.similarity import CosineGreedy

scores = calculate_scores(library_spectra, query_spectra,
                         CosineGreedy(tolerance=0.1))

Multi-Metric Matching

from matchms import calculate_scores
from matchms.similarity import CosineGreedy, ModifiedCosine, FingerprintSimilarity

# Spectral similarity
cosine = calculate_scores(refs, queries, CosineGreedy())
modified = calculate_scores(refs, queries, ModifiedCosine())

# Structural similarity
fingerprint = calculate_scores(refs, queries, FingerprintSimilarity())

Precursor-Filtered Matching

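from matchms import calculate_scores
from matchms.similarity import PrecursorMzMatch, CosineGreedy

# First flag reference/query pairs whose precursor m/z values agree
mass_filter = calculate_scores(refs, queries, PrecursorMzMatch(tolerance=0.1))

# Then calculate cosine scores; this call still scores every pair,
# so keep only the pairs flagged in mass_filter when reading the results
cosine_scores = calculate_scores(refs, queries, CosineGreedy())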
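The saving from pre-filtering comes from running the expensive comparison only on pairs that pass the cheap precursor check. The sketch below shows that control flow with plain Python; the function name, the precursor values, and the placeholder scoring function are all assumptions for illustration.

```python
def filtered_scores(ref_precursors, query_precursors, score_pair, tolerance=0.1):
    """Call score_pair(j, i) only for pairs whose precursor m/z values agree."""
    results = {}
    for j, ref_mz in enumerate(ref_precursors):
        for i, query_mz in enumerate(query_precursors):
            if abs(ref_mz - query_mz) <= tolerance:
                results[(j, i)] = score_pair(j, i)
    return results

# Hypothetical precursor m/z values for 3 references and 2 queries.
ref_precursors = [300.00, 450.20, 300.05]
query_precursors = [300.02, 451.00]

calls = []

def expensive_score(j, i):
    """Placeholder for a spectral comparison; records how often it runs."""
    calls.append((j, i))
    return 1.0

results = filtered_scores(ref_precursors, query_precursors, expensive_score)
```

Only 2 of the 6 possible pairs reach the scoring function here. On a real library the fraction is typically much smaller, which is where the speed-up comes from.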
