Initial commit
This commit is contained in:
288
skills/matchms/references/filtering.md
Normal file
288
skills/matchms/references/filtering.md
Normal file
@@ -0,0 +1,288 @@
|
||||
# Matchms Filtering Functions Reference
|
||||
|
||||
This document provides a comprehensive reference of all filtering functions available in matchms for processing mass spectrometry data.
|
||||
|
||||
## Metadata Processing Filters
|
||||
|
||||
### Compound & Chemical Information
|
||||
|
||||
**add_compound_name(spectrum)**
|
||||
- Adds compound name to the correct metadata field
|
||||
- Standardizes compound name storage location
|
||||
|
||||
**clean_compound_name(spectrum)**
|
||||
- Removes frequently seen unwanted additions from compound names
|
||||
- Cleans up formatting inconsistencies
|
||||
|
||||
**derive_adduct_from_name(spectrum)**
|
||||
- Extracts adduct information from compound names
|
||||
- Moves adduct notation to proper metadata field
|
||||
|
||||
**derive_formula_from_name(spectrum)**
|
||||
- Detects chemical formulas in compound names
|
||||
- Relocates formulas to appropriate metadata field
|
||||
|
||||
**derive_annotation_from_compound_name(spectrum)**
|
||||
- Retrieves SMILES/InChI from PubChem using compound name
|
||||
- Automatically annotates chemical structures
|
||||
|
||||
### Chemical Structure Conversions
|
||||
|
||||
**derive_inchi_from_smiles(spectrum)**
|
||||
- Generates InChI from SMILES strings
|
||||
- Requires rdkit library
|
||||
|
||||
**derive_inchikey_from_inchi(spectrum)**
|
||||
- Computes InChIKey from InChI
|
||||
- 27-character hashed identifier
|
||||
|
||||
**derive_smiles_from_inchi(spectrum)**
|
||||
- Creates SMILES from InChI representation
|
||||
- Requires rdkit library
|
||||
|
||||
**repair_inchi_inchikey_smiles(spectrum)**
|
||||
- Corrects misplaced chemical identifiers
|
||||
- Fixes metadata field confusion
|
||||
|
||||
**repair_not_matching_annotation(spectrum)**
|
||||
- Ensures consistency between SMILES, InChI, and InChIKey
|
||||
- Validates chemical structure annotations match
|
||||
|
||||
**add_fingerprint(spectrum, fingerprint_type="daylight", nbits=2048, radius=2)**
|
||||
- Generates molecular fingerprints for similarity calculations
|
||||
- Fingerprint types: "daylight", "morgan1", "morgan2", "morgan3"
|
||||
- Used with FingerprintSimilarity scoring
|
||||
|
||||
### Mass & Charge Information
|
||||
|
||||
**add_precursor_mz(spectrum)**
|
||||
- Normalizes precursor m/z values
|
||||
- Standardizes precursor mass metadata
|
||||
|
||||
**add_parent_mass(spectrum, estimate_from_adduct=True)**
|
||||
- Calculates neutral parent mass from precursor m/z and adduct
|
||||
- Can estimate from adduct if not directly available
|
||||
|
||||
**correct_charge(spectrum)**
|
||||
- Aligns charge values with ionmode
|
||||
- Ensures charge sign matches ionization mode
|
||||
|
||||
**make_charge_int(spectrum)**
|
||||
- Converts charge to integer format
|
||||
- Standardizes charge representation
|
||||
|
||||
**clean_adduct(spectrum)**
|
||||
- Standardizes adduct notation
|
||||
- Corrects common adduct formatting issues
|
||||
|
||||
**interpret_pepmass(spectrum)**
|
||||
- Parses pepmass field into component values
|
||||
- Extracts precursor m/z and intensity from combined field
|
||||
|
||||
### Ion Mode & Validation
|
||||
|
||||
**derive_ionmode(spectrum)**
|
||||
- Determines ionmode from adduct information
|
||||
- Infers positive/negative mode from adduct type
|
||||
|
||||
**require_correct_ionmode(spectrum, ion_mode)**
|
||||
- Filters spectra by specified ionmode
|
||||
- Returns None if ionmode doesn't match
|
||||
- Use: `spectrum = require_correct_ionmode(spectrum, "positive")`
|
||||
|
||||
**require_precursor_mz(spectrum, minimum_accepted_mz=0.0)**
|
||||
- Validates precursor m/z presence and value
|
||||
- Returns None if missing or below threshold
|
||||
|
||||
**require_precursor_below_mz(spectrum, maximum_accepted_mz=1000.0)**
|
||||
- Enforces maximum precursor m/z limit
|
||||
- Returns None if precursor exceeds threshold
|
||||
|
||||
### Retention Information
|
||||
|
||||
**add_retention_time(spectrum)**
|
||||
- Harmonizes retention time as float values
|
||||
- Standardizes RT metadata field
|
||||
|
||||
**add_retention_index(spectrum)**
|
||||
- Stores retention index in standardized field
|
||||
- Normalizes RI metadata
|
||||
|
||||
### Data Harmonization
|
||||
|
||||
**harmonize_undefined_inchi(spectrum, undefined="", aliases=None)**
|
||||
- Standardizes undefined/empty InChI entries
|
||||
- Replaces various "unknown" representations with consistent value
|
||||
|
||||
**harmonize_undefined_inchikey(spectrum, undefined="", aliases=None)**
|
||||
- Standardizes undefined/empty InChIKey entries
|
||||
- Unifies missing data representation
|
||||
|
||||
**harmonize_undefined_smiles(spectrum, undefined="", aliases=None)**
|
||||
- Standardizes undefined/empty SMILES entries
|
||||
- Consistent handling of missing structural data
|
||||
|
||||
### Repair & Quality Functions
|
||||
|
||||
**repair_adduct_based_on_smiles(spectrum, mass_tolerance=0.1)**
|
||||
- Corrects adduct using SMILES and mass matching
|
||||
- Validates adduct matches calculated mass
|
||||
|
||||
**repair_parent_mass_is_mol_wt(spectrum, mass_tolerance=0.1)**
|
||||
- Converts molecular weight to monoisotopic mass
|
||||
- Fixes common metadata confusion
|
||||
|
||||
**repair_precursor_is_parent_mass(spectrum)**
|
||||
- Fixes swapped precursor/parent mass values
|
||||
- Corrects field misassignments
|
||||
|
||||
**repair_smiles_of_salts(spectrum, mass_tolerance=0.1)**
|
||||
- Removes salt components to match parent mass
|
||||
- Extracts relevant molecular fragment
|
||||
|
||||
**require_parent_mass_match_smiles(spectrum, mass_tolerance=0.1)**
|
||||
- Validates parent mass against SMILES-calculated mass
|
||||
- Returns None if masses don't match within tolerance
|
||||
|
||||
**require_valid_annotation(spectrum)**
|
||||
- Ensures complete, consistent chemical annotations
|
||||
- Validates SMILES, InChI, and InChIKey presence and consistency
|
||||
|
||||
## Peak Processing Filters
|
||||
|
||||
### Normalization & Selection
|
||||
|
||||
**normalize_intensities(spectrum)**
|
||||
- Scales peak intensities to unit height (max = 1.0)
|
||||
- Essential preprocessing step for similarity calculations
|
||||
|
||||
**select_by_intensity(spectrum, intensity_from=0.0, intensity_to=1.0)**
|
||||
- Retains peaks within specified absolute intensity range
|
||||
- Filters by raw intensity values
|
||||
|
||||
**select_by_relative_intensity(spectrum, intensity_from=0.0, intensity_to=1.0)**
|
||||
- Keeps peaks within relative intensity bounds
|
||||
- Filters as fraction of maximum intensity
|
||||
|
||||
**select_by_mz(spectrum, mz_from=0.0, mz_to=1000.0)**
|
||||
- Filters peaks by m/z value range
|
||||
- Removes peaks outside specified m/z window
|
||||
|
||||
### Peak Reduction & Filtering
|
||||
|
||||
**reduce_to_number_of_peaks(spectrum, n_max=None, ratio_desired=None)**
|
||||
- Removes lowest-intensity peaks when exceeding maximum
|
||||
- Can specify absolute number or ratio
|
||||
- Use: `spectrum = reduce_to_number_of_peaks(spectrum, n_max=100)`
|
||||
|
||||
**remove_peaks_around_precursor_mz(spectrum, mz_tolerance=17)**
|
||||
- Eliminates peaks within tolerance of precursor
|
||||
- Removes precursor and isotope peaks
|
||||
- Common preprocessing for fragment-based similarity
|
||||
|
||||
**remove_peaks_outside_top_k(spectrum, k=10, ratio_desired=None)**
|
||||
- Retains only peaks near k highest-intensity peaks
|
||||
- Focuses on most informative signals
|
||||
|
||||
**require_minimum_number_of_peaks(spectrum, n_required=10)**
|
||||
- Discards spectra with insufficient peaks
|
||||
- Quality control filter
|
||||
- Returns None if peak count below threshold
|
||||
|
||||
**require_minimum_number_of_high_peaks(spectrum, n_required=5, intensity_threshold=0.05)**
|
||||
- Removes spectra lacking high-intensity peaks
|
||||
- Ensures data quality
|
||||
- Returns None if insufficient peaks above threshold
|
||||
|
||||
### Loss Calculation
|
||||
|
||||
**add_losses(spectrum, loss_mz_from=5.0, loss_mz_to=200.0)**
|
||||
- Derives neutral losses from precursor mass
|
||||
- Calculates loss = precursor_mz - fragment_mz
|
||||
- Adds losses to spectrum for NeutralLossesCosine scoring
|
||||
|
||||
## Pipeline Functions
|
||||
|
||||
**default_filters(spectrum)**
|
||||
- Applies nine essential metadata filters sequentially:
|
||||
1. make_charge_int
|
||||
2. add_precursor_mz
|
||||
3. add_retention_time
|
||||
4. add_retention_index
|
||||
5. derive_adduct_from_name
|
||||
6. derive_formula_from_name
|
||||
7. clean_compound_name
|
||||
8. harmonize_undefined_smiles
|
||||
9. harmonize_undefined_inchi
|
||||
- Recommended starting point for metadata harmonization
|
||||
|
||||
**SpectrumProcessor(filters)**
|
||||
- Orchestrates multi-filter pipelines
|
||||
- Accepts list of filter functions
|
||||
- Example:
|
||||
```python
|
||||
from matchms import SpectrumProcessor
|
||||
processor = SpectrumProcessor([
|
||||
default_filters,
|
||||
normalize_intensities,
|
||||
lambda s: select_by_relative_intensity(s, intensity_from=0.01)
|
||||
])
|
||||
processed = processor(spectrum)
|
||||
```
|
||||
|
||||
## Common Filter Combinations
|
||||
|
||||
### Standard Preprocessing Pipeline
|
||||
```python
|
||||
from matchms.filtering import (default_filters, normalize_intensities,
|
||||
select_by_relative_intensity,
|
||||
require_minimum_number_of_peaks)
|
||||
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
|
||||
spectrum = require_minimum_number_of_peaks(spectrum, n_required=5)
|
||||
```
|
||||
|
||||
### Quality Control Pipeline
|
||||
```python
|
||||
from matchms.filtering import (require_precursor_mz, require_minimum_number_of_peaks,
|
||||
require_minimum_number_of_high_peaks)
|
||||
|
||||
spectrum = require_precursor_mz(spectrum, minimum_accepted_mz=50.0)
|
||||
if spectrum is None:
|
||||
# Spectrum failed quality control
|
||||
pass
|
||||
spectrum = require_minimum_number_of_peaks(spectrum, n_required=10)
|
||||
spectrum = require_minimum_number_of_high_peaks(spectrum, n_required=5)
|
||||
```
|
||||
|
||||
### Chemical Annotation Pipeline
|
||||
```python
|
||||
from matchms.filtering import (derive_inchi_from_smiles, derive_inchikey_from_inchi,
|
||||
add_fingerprint, require_valid_annotation)
|
||||
|
||||
spectrum = derive_inchi_from_smiles(spectrum)
|
||||
spectrum = derive_inchikey_from_inchi(spectrum)
|
||||
spectrum = add_fingerprint(spectrum, fingerprint_type="morgan2", nbits=2048)
|
||||
spectrum = require_valid_annotation(spectrum)
|
||||
```
|
||||
|
||||
### Peak Cleaning Pipeline
|
||||
```python
|
||||
from matchms.filtering import (normalize_intensities, remove_peaks_around_precursor_mz,
|
||||
select_by_relative_intensity, reduce_to_number_of_peaks)
|
||||
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
spectrum = remove_peaks_around_precursor_mz(spectrum, mz_tolerance=17)
|
||||
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
|
||||
spectrum = reduce_to_number_of_peaks(spectrum, n_max=200)
|
||||
```
|
||||
|
||||
## Notes on Filter Usage
|
||||
|
||||
1. **Order matters**: Apply filters in logical sequence (e.g., normalize before relative intensity selection)
|
||||
2. **Filters return None**: Many filters return None for invalid spectra; check for None before proceeding
|
||||
3. **Immutability**: Filters typically return modified copies; reassign results to variables
|
||||
4. **Pipeline efficiency**: Use SpectrumProcessor for consistent multi-spectrum processing
|
||||
5. **Documentation**: For detailed parameters, see matchms.readthedocs.io/en/latest/api/matchms.filtering.html
|
||||
416
skills/matchms/references/importing_exporting.md
Normal file
416
skills/matchms/references/importing_exporting.md
Normal file
@@ -0,0 +1,416 @@
|
||||
# Matchms Importing and Exporting Reference
|
||||
|
||||
This document details all file format support in matchms for loading and saving mass spectrometry data.
|
||||
|
||||
## Importing Spectra
|
||||
|
||||
Matchms provides dedicated functions for loading spectra from various file formats. All import functions return generators for memory-efficient processing of large files.
|
||||
|
||||
### Common Import Pattern
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
|
||||
# Load spectra (returns generator)
|
||||
spectra_generator = load_from_mgf("spectra.mgf")
|
||||
|
||||
# Convert to list for processing
|
||||
spectra = list(spectra_generator)
|
||||
```
|
||||
|
||||
## Supported Import Formats
|
||||
|
||||
### MGF (Mascot Generic Format)
|
||||
|
||||
**Function**: `load_from_mgf(filename, metadata_harmonization=True)`
|
||||
|
||||
**Description**: Loads spectra from MGF files, a common format for mass spectrometry data exchange.
|
||||
|
||||
**Parameters**:
|
||||
- `filename` (str): Path to MGF file
|
||||
- `metadata_harmonization` (bool, default=True): Apply automatic metadata key harmonization
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
|
||||
# Load with metadata harmonization
|
||||
spectra = list(load_from_mgf("data.mgf"))
|
||||
|
||||
# Load without harmonization
|
||||
spectra = list(load_from_mgf("data.mgf", metadata_harmonization=False))
|
||||
```
|
||||
|
||||
**MGF Format**: Text-based format with BEGIN IONS/END IONS blocks containing metadata and peak lists.
|
||||
|
||||
---
|
||||
|
||||
### MSP (NIST Mass Spectral Library Format)
|
||||
|
||||
**Function**: `load_from_msp(filename, metadata_harmonization=True)`
|
||||
|
||||
**Description**: Loads spectra from MSP files, commonly used for spectral libraries.
|
||||
|
||||
**Parameters**:
|
||||
- `filename` (str): Path to MSP file
|
||||
- `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_msp
|
||||
|
||||
spectra = list(load_from_msp("library.msp"))
|
||||
```
|
||||
|
||||
**MSP Format**: Text-based format with Name/MW/Comment fields followed by peak lists.
|
||||
|
||||
---
|
||||
|
||||
### mzML (Mass Spectrometry Markup Language)
|
||||
|
||||
**Function**: `load_from_mzml(filename, ms_level=2, metadata_harmonization=True)`
|
||||
|
||||
**Description**: Loads spectra from mzML files, the standard XML-based format for raw mass spectrometry data.
|
||||
|
||||
**Parameters**:
|
||||
- `filename` (str): Path to mzML file
|
||||
- `ms_level` (int, default=2): MS level to extract (1 for MS1, 2 for MS2/tandem)
|
||||
- `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_mzml
|
||||
|
||||
# Load MS2 spectra (default)
|
||||
ms2_spectra = list(load_from_mzml("data.mzML"))
|
||||
|
||||
# Load MS1 spectra
|
||||
ms1_spectra = list(load_from_mzml("data.mzML", ms_level=1))
|
||||
```
|
||||
|
||||
**mzML Format**: XML-based standard format containing raw instrument data and rich metadata.
|
||||
|
||||
---
|
||||
|
||||
### mzXML
|
||||
|
||||
**Function**: `load_from_mzxml(filename, ms_level=2, metadata_harmonization=True)`
|
||||
|
||||
**Description**: Loads spectra from mzXML files, an earlier XML-based format for mass spectrometry data.
|
||||
|
||||
**Parameters**:
|
||||
- `filename` (str): Path to mzXML file
|
||||
- `ms_level` (int, default=2): MS level to extract
|
||||
- `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_mzxml
|
||||
|
||||
spectra = list(load_from_mzxml("data.mzXML"))
|
||||
```
|
||||
|
||||
**mzXML Format**: XML-based format, predecessor to mzML.
|
||||
|
||||
---
|
||||
|
||||
### JSON (GNPS Format)
|
||||
|
||||
**Function**: `load_from_json(filename, metadata_harmonization=True)`
|
||||
|
||||
**Description**: Loads spectra from JSON files, particularly GNPS-compatible JSON format.
|
||||
|
||||
**Parameters**:
|
||||
- `filename` (str): Path to JSON file
|
||||
- `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_json
|
||||
|
||||
spectra = list(load_from_json("spectra.json"))
|
||||
```
|
||||
|
||||
**JSON Format**: Structured JSON with spectrum metadata and peak arrays.
|
||||
|
||||
---
|
||||
|
||||
### Pickle (Python Serialization)
|
||||
|
||||
**Function**: `load_from_pickle(filename)`
|
||||
|
||||
**Description**: Loads previously saved matchms Spectrum objects from pickle files. Fast loading of preprocessed spectra.
|
||||
|
||||
**Parameters**:
|
||||
- `filename` (str): Path to pickle file
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_pickle
|
||||
|
||||
spectra = list(load_from_pickle("processed_spectra.pkl"))
|
||||
```
|
||||
|
||||
**Use case**: Saving and loading preprocessed spectra for faster subsequent analyses.
|
||||
|
||||
---
|
||||
|
||||
### USI (Universal Spectrum Identifier)
|
||||
|
||||
**Function**: `load_from_usi(usi)`
|
||||
|
||||
**Description**: Loads a single spectrum from a metabolomics USI reference.
|
||||
|
||||
**Parameters**:
|
||||
- `usi` (str): Universal Spectrum Identifier string
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.importing import load_from_usi
|
||||
|
||||
usi = "mzspec:GNPS:TASK-...:spectrum..."
|
||||
spectrum = load_from_usi(usi)
|
||||
```
|
||||
|
||||
**USI Format**: Standardized identifier for accessing spectra from online repositories.
|
||||
|
||||
---
|
||||
|
||||
## Exporting Spectra
|
||||
|
||||
Matchms provides functions to save processed spectra to various formats for sharing and archival.
|
||||
|
||||
### MGF Export
|
||||
|
||||
**Function**: `save_as_mgf(spectra, filename, write_mode='w')`
|
||||
|
||||
**Description**: Saves spectra to MGF format.
|
||||
|
||||
**Parameters**:
|
||||
- `spectra` (list): List of Spectrum objects to save
|
||||
- `filename` (str): Output file path
|
||||
- `write_mode` (str, default='w'): File write mode ('w' for write, 'a' for append)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.exporting import save_as_mgf
|
||||
|
||||
save_as_mgf(processed_spectra, "output.mgf")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### MSP Export
|
||||
|
||||
**Function**: `save_as_msp(spectra, filename, write_mode='w')`
|
||||
|
||||
**Description**: Saves spectra to MSP format.
|
||||
|
||||
**Parameters**:
|
||||
- `spectra` (list): List of Spectrum objects to save
|
||||
- `filename` (str): Output file path
|
||||
- `write_mode` (str, default='w'): File write mode
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.exporting import save_as_msp
|
||||
|
||||
save_as_msp(library_spectra, "library.msp")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### JSON Export
|
||||
|
||||
**Function**: `save_as_json(spectra, filename, write_mode='w')`
|
||||
|
||||
**Description**: Saves spectra to JSON format (GNPS-compatible).
|
||||
|
||||
**Parameters**:
|
||||
- `spectra` (list): List of Spectrum objects to save
|
||||
- `filename` (str): Output file path
|
||||
- `write_mode` (str, default='w'): File write mode
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.exporting import save_as_json
|
||||
|
||||
save_as_json(spectra, "spectra.json")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Pickle Export
|
||||
|
||||
**Function**: `save_as_pickle(spectra, filename)`
|
||||
|
||||
**Description**: Saves spectra as Python pickle file. Preserves all Spectrum attributes and is fastest for loading.
|
||||
|
||||
**Parameters**:
|
||||
- `spectra` (list): List of Spectrum objects to save
|
||||
- `filename` (str): Output file path
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.exporting import save_as_pickle
|
||||
|
||||
save_as_pickle(processed_spectra, "processed.pkl")
|
||||
```
|
||||
|
||||
**Advantages**:
|
||||
- Fast save and load
|
||||
- Preserves exact Spectrum state
|
||||
- No format conversion overhead
|
||||
|
||||
**Disadvantages**:
|
||||
- Not human-readable
|
||||
- Python-specific (not portable to other languages)
|
||||
- Pickle format may not be compatible across Python versions
|
||||
|
||||
---
|
||||
|
||||
## Complete Import/Export Workflow
|
||||
|
||||
### Preprocessing and Saving Pipeline
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.exporting import save_as_mgf, save_as_pickle
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
from matchms.filtering import select_by_relative_intensity
|
||||
|
||||
# Load raw spectra
|
||||
spectra = list(load_from_mgf("raw_data.mgf"))
|
||||
|
||||
# Process spectra
|
||||
processed = []
|
||||
for spectrum in spectra:
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
|
||||
if spectrum is not None:
|
||||
processed.append(spectrum)
|
||||
|
||||
# Save processed spectra (MGF for sharing)
|
||||
save_as_mgf(processed, "processed_data.mgf")
|
||||
|
||||
# Save as pickle for fast reloading
|
||||
save_as_pickle(processed, "processed_data.pkl")
|
||||
```
|
||||
|
||||
### Format Conversion
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mzml
|
||||
from matchms.exporting import save_as_mgf, save_as_msp
|
||||
|
||||
# Convert mzML to MGF
|
||||
spectra = list(load_from_mzml("data.mzML", ms_level=2))
|
||||
save_as_mgf(spectra, "data.mgf")
|
||||
|
||||
# Convert to MSP library format
|
||||
save_as_msp(spectra, "data.msp")
|
||||
```
|
||||
|
||||
### Loading from Multiple Files
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
import glob
|
||||
|
||||
# Load all MGF files in directory
|
||||
all_spectra = []
|
||||
for mgf_file in glob.glob("data/*.mgf"):
|
||||
spectra = list(load_from_mgf(mgf_file))
|
||||
all_spectra.extend(spectra)
|
||||
|
||||
print(f"Loaded {len(all_spectra)} spectra from multiple files")
|
||||
```
|
||||
|
||||
### Memory-Efficient Processing
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.exporting import save_as_mgf
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
|
||||
# Process large file without loading all into memory
|
||||
def process_spectrum(spectrum):
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
return spectrum
|
||||
|
||||
# Stream processing
|
||||
with open("output.mgf", 'w') as outfile:
|
||||
for spectrum in load_from_mgf("large_file.mgf"):
|
||||
processed = process_spectrum(spectrum)
|
||||
if processed is not None:
|
||||
# Write immediately without storing in memory
|
||||
save_as_mgf([processed], outfile, write_mode='a')
|
||||
```
|
||||
|
||||
## Format Selection Guidelines
|
||||
|
||||
**MGF**:
|
||||
- ✓ Widely supported
|
||||
- ✓ Human-readable
|
||||
- ✓ Good for data sharing
|
||||
- ✓ Moderate file size
|
||||
- Best for: Data exchange, GNPS uploads, publication data
|
||||
|
||||
**MSP**:
|
||||
- ✓ Spectral library standard
|
||||
- ✓ Human-readable
|
||||
- ✓ Good metadata support
|
||||
- Best for: Reference libraries, NIST format compatibility
|
||||
|
||||
**JSON**:
|
||||
- ✓ Structured format
|
||||
- ✓ GNPS compatible
|
||||
- ✓ Easy to parse programmatically
|
||||
- Best for: Web applications, GNPS integration, structured data
|
||||
|
||||
**Pickle**:
|
||||
- ✓ Fastest save/load
|
||||
- ✓ Preserves exact state
|
||||
- ✗ Not portable to other languages
|
||||
- ✗ Not human-readable
|
||||
- Best for: Intermediate processing, Python-only workflows
|
||||
|
||||
**mzML/mzXML**:
|
||||
- ✓ Raw instrument data
|
||||
- ✓ Rich metadata
|
||||
- ✓ Industry standard
|
||||
- ✗ Large file size
|
||||
- ✗ Slower to parse
|
||||
- Best for: Raw data archival, multi-level MS data
|
||||
|
||||
## Metadata Harmonization
|
||||
|
||||
The `metadata_harmonization` parameter (available in most import functions) automatically standardizes metadata keys:
|
||||
|
||||
```python
|
||||
# Without harmonization
|
||||
spectrum = load_from_mgf("data.mgf", metadata_harmonization=False)
|
||||
# May have: "PRECURSOR_MZ", "Precursor_mz", "precursormz"
|
||||
|
||||
# With harmonization (default)
|
||||
spectrum = load_from_mgf("data.mgf", metadata_harmonization=True)
|
||||
# Standardized to: "precursor_mz"
|
||||
```
|
||||
|
||||
**Recommended**: Keep harmonization enabled (default) for consistent metadata access across different data sources.
|
||||
|
||||
## File Format Specifications
|
||||
|
||||
For detailed format specifications:
|
||||
- **MGF**: http://www.matrixscience.com/help/data_file_help.html
|
||||
- **MSP**: https://chemdata.nist.gov/mass-spc/ms-search/
|
||||
- **mzML**: http://www.psidev.info/mzML
|
||||
- **GNPS JSON**: https://gnps.ucsd.edu/
|
||||
|
||||
## Further Reading
|
||||
|
||||
For complete API documentation:
|
||||
https://matchms.readthedocs.io/en/latest/api/matchms.importing.html
|
||||
https://matchms.readthedocs.io/en/latest/api/matchms.exporting.html
|
||||
380
skills/matchms/references/similarity.md
Normal file
380
skills/matchms/references/similarity.md
Normal file
@@ -0,0 +1,380 @@
|
||||
# Matchms Similarity Functions Reference
|
||||
|
||||
This document provides detailed information about all similarity scoring methods available in matchms.
|
||||
|
||||
## Overview
|
||||
|
||||
Matchms provides multiple similarity functions for comparing mass spectra. Use `calculate_scores()` to compute pairwise similarities between reference and query spectra collections.
|
||||
|
||||
```python
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import CosineGreedy
|
||||
|
||||
scores = calculate_scores(references=library_spectra,
|
||||
queries=query_spectra,
|
||||
similarity_function=CosineGreedy())
|
||||
```
|
||||
|
||||
## Peak-Based Similarity Functions
|
||||
|
||||
These functions compare mass spectra based on their peak patterns (m/z and intensity values).
|
||||
|
||||
### CosineGreedy
|
||||
|
||||
**Description**: Calculates cosine similarity between two spectra using a fast greedy matching algorithm. Peaks are matched within a specified tolerance, and similarity is computed based on matched peak intensities.
|
||||
|
||||
**When to use**:
|
||||
- Fast similarity calculations for large datasets
|
||||
- General-purpose spectral matching
|
||||
- When speed is prioritized over mathematically optimal matching
|
||||
|
||||
**Parameters**:
|
||||
- `tolerance` (float, default=0.1): Maximum m/z difference for peak matching (Daltons)
|
||||
- `mz_power` (float, default=0.0): Exponent for m/z weighting (0 = no weighting)
|
||||
- `intensity_power` (float, default=1.0): Exponent for intensity weighting
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import CosineGreedy
|
||||
|
||||
similarity_func = CosineGreedy(tolerance=0.1, mz_power=0.0, intensity_power=1.0)
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Output**: Similarity score between 0.0 and 1.0, plus number of matched peaks.
|
||||
|
||||
---
|
||||
|
||||
### CosineHungarian
|
||||
|
||||
**Description**: Calculates cosine similarity using the Hungarian algorithm for optimal peak matching. Provides mathematically optimal peak assignments but is slower than CosineGreedy.
|
||||
|
||||
**When to use**:
|
||||
- When optimal peak matching is required
|
||||
- High-quality reference library comparisons
|
||||
- Research requiring reproducible, mathematically rigorous results
|
||||
|
||||
**Parameters**:
|
||||
- `tolerance` (float, default=0.1): Maximum m/z difference for peak matching
|
||||
- `mz_power` (float, default=0.0): Exponent for m/z weighting
|
||||
- `intensity_power` (float, default=1.0): Exponent for intensity weighting
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import CosineHungarian
|
||||
|
||||
similarity_func = CosineHungarian(tolerance=0.1)
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Output**: Optimal similarity score between 0.0 and 1.0, plus matched peaks.
|
||||
|
||||
**Note**: Slower than CosineGreedy; use for smaller datasets or when accuracy is critical.
|
||||
|
||||
---
|
||||
|
||||
### ModifiedCosine
|
||||
|
||||
**Description**: Extends cosine similarity by accounting for precursor m/z differences. Allows peaks to match after applying a mass shift based on the difference between precursor masses. Useful for comparing spectra of related compounds (isotopes, adducts, analogs).
|
||||
|
||||
**When to use**:
|
||||
- Comparing spectra from different precursor masses
|
||||
- Identifying structural analogs or derivatives
|
||||
- Cross-ionization mode comparisons
|
||||
- When precursor mass differences are meaningful
|
||||
|
||||
**Parameters**:
|
||||
- `tolerance` (float, default=0.1): Maximum m/z difference for peak matching after shift
|
||||
- `mz_power` (float, default=0.0): Exponent for m/z weighting
|
||||
- `intensity_power` (float, default=1.0): Exponent for intensity weighting
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import ModifiedCosine
|
||||
|
||||
similarity_func = ModifiedCosine(tolerance=0.1)
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Requirements**: Both spectra must have valid precursor_mz metadata.
|
||||
|
||||
---
|
||||
|
||||
### NeutralLossesCosine
|
||||
|
||||
**Description**: Calculates similarity based on neutral loss patterns rather than fragment m/z values. Neutral losses are derived by subtracting fragment m/z from precursor m/z. Particularly useful for identifying compounds with similar fragmentation patterns.
|
||||
|
||||
**When to use**:
|
||||
- Comparing fragmentation patterns across different precursor masses
|
||||
- Identifying compounds with similar neutral loss profiles
|
||||
- Complementary to regular cosine scoring
|
||||
- Metabolite identification and classification
|
||||
|
||||
**Parameters**:
|
||||
- `tolerance` (float, default=0.1): Maximum neutral loss difference for matching
|
||||
- `mz_power` (float, default=0.0): Exponent for loss value weighting
|
||||
- `intensity_power` (float, default=1.0): Exponent for intensity weighting
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import NeutralLossesCosine
|
||||
from matchms.filtering import add_losses
|
||||
|
||||
# First add losses to spectra
|
||||
spectra_with_losses = [add_losses(s) for s in spectra]
|
||||
|
||||
similarity_func = NeutralLossesCosine(tolerance=0.1)
|
||||
scores = calculate_scores(references_with_losses, queries_with_losses, similarity_func)
|
||||
```
|
||||
|
||||
**Requirements**:
|
||||
- Both spectra must have valid precursor_mz metadata
|
||||
- Use `add_losses()` filter to compute neutral losses before scoring
|
||||
|
||||
---
|
||||
|
||||
## Structural Similarity Functions
|
||||
|
||||
These functions compare molecular structures rather than spectral peaks.
|
||||
|
||||
### FingerprintSimilarity
|
||||
|
||||
**Description**: Calculates similarity between molecular fingerprints derived from chemical structures (SMILES or InChI). Supports multiple fingerprint types and similarity metrics.
|
||||
|
||||
**When to use**:
|
||||
- Structural similarity without spectral data
|
||||
- Combining structural and spectral similarity
|
||||
- Pre-filtering candidates before spectral matching
|
||||
- Structure-activity relationship studies
|
||||
|
||||
**Parameters**:
|
||||
- `fingerprint_type` (str, default="daylight"): Type of fingerprint
|
||||
- `"daylight"`: Daylight fingerprint
|
||||
- `"morgan1"`, `"morgan2"`, `"morgan3"`: Morgan fingerprints with radius 1, 2, or 3
|
||||
- `similarity_measure` (str, default="jaccard"): Similarity metric
|
||||
- `"jaccard"`: Jaccard index (intersection / union)
|
||||
- `"dice"`: Dice coefficient (2 * intersection / (size1 + size2))
|
||||
- `"cosine"`: Cosine similarity
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import FingerprintSimilarity
|
||||
from matchms.filtering import add_fingerprint
|
||||
|
||||
# Add fingerprints to spectra
|
||||
spectra_with_fps = [add_fingerprint(s, fingerprint_type="morgan2", nbits=2048)
|
||||
for s in spectra]
|
||||
|
||||
similarity_func = FingerprintSimilarity(similarity_measure="jaccard")
|
||||
scores = calculate_scores(references_with_fps, queries_with_fps, similarity_func)
|
||||
```
|
||||
|
||||
**Requirements**:
|
||||
- Spectra must have valid SMILES or InChI metadata
|
||||
- Use `add_fingerprint()` filter to compute fingerprints
|
||||
- Requires rdkit library
|
||||
|
||||
---
|
||||
|
||||
## Metadata-Based Similarity Functions
|
||||
|
||||
These functions compare metadata fields rather than spectral or structural data.
|
||||
|
||||
### MetadataMatch
|
||||
|
||||
**Description**: Compares user-defined metadata fields between spectra. Supports exact matching for categorical data and tolerance-based matching for numerical data.
|
||||
|
||||
**When to use**:
|
||||
- Filtering by experimental conditions (collision energy, retention time)
|
||||
- Instrument-specific matching
|
||||
- Combining metadata constraints with spectral similarity
|
||||
- Custom metadata-based filtering
|
||||
|
||||
**Parameters**:
|
||||
- `field` (str): Metadata field name to compare
|
||||
- `matching_type` (str, default="exact"): Matching method
|
||||
- `"exact"`: Exact string/value match
|
||||
- `"difference"`: Absolute difference for numerical values
|
||||
- `"relative_difference"`: Relative difference for numerical values
|
||||
- `tolerance` (float, optional): Maximum difference for numerical matching
|
||||
|
||||
**Example (Exact matching)**:
|
||||
```python
|
||||
from matchms.similarity import MetadataMatch
|
||||
|
||||
# Match by instrument type
|
||||
similarity_func = MetadataMatch(field="instrument_type", matching_type="exact")
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Example (Numerical matching)**:
|
||||
```python
|
||||
# Match retention time within 0.5 minutes
|
||||
similarity_func = MetadataMatch(field="retention_time",
|
||||
matching_type="difference",
|
||||
tolerance=0.5)
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Output**: Returns 1.0 (match) or 0.0 (no match) for exact matching. For numerical matching, returns similarity score based on difference.
|
||||
|
||||
---
|
||||
|
||||
### PrecursorMzMatch
|
||||
|
||||
**Description**: Binary matching based on precursor m/z values. Returns True/False based on whether precursor masses match within specified tolerance.
|
||||
|
||||
**When to use**:
|
||||
- Pre-filtering spectral libraries by precursor mass
|
||||
- Fast mass-based candidate selection
|
||||
- Combining with other similarity metrics
|
||||
- Isobaric compound identification
|
||||
|
||||
**Parameters**:
|
||||
- `tolerance` (float, default=0.1): Maximum m/z difference for matching
|
||||
- `tolerance_type` (str, default="Dalton"): Tolerance unit
|
||||
- `"Dalton"`: Absolute mass difference
|
||||
- `"ppm"`: Parts per million (relative)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import PrecursorMzMatch
|
||||
|
||||
# Match precursor within 0.1 Da
|
||||
similarity_func = PrecursorMzMatch(tolerance=0.1, tolerance_type="Dalton")
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
|
||||
# Match precursor within 10 ppm
|
||||
similarity_func = PrecursorMzMatch(tolerance=10, tolerance_type="ppm")
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Output**: 1.0 (match) or 0.0 (no match)
|
||||
|
||||
**Requirements**: Both spectra must have valid precursor_mz metadata.
|
||||
|
||||
---
|
||||
|
||||
### ParentMassMatch
|
||||
|
||||
**Description**: Binary matching based on parent mass (neutral mass) values. Similar to PrecursorMzMatch but uses calculated parent mass instead of precursor m/z.
|
||||
|
||||
**When to use**:
|
||||
- Comparing spectra from different ionization modes
|
||||
- Adduct-independent matching
|
||||
- Neutral mass-based library searches
|
||||
|
||||
**Parameters**:
|
||||
- `tolerance` (float, default=0.1): Maximum mass difference for matching
|
||||
- `tolerance_type` (str, default="Dalton"): Tolerance unit ("Dalton" or "ppm")
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
from matchms.similarity import ParentMassMatch
|
||||
|
||||
similarity_func = ParentMassMatch(tolerance=0.1, tolerance_type="Dalton")
|
||||
scores = calculate_scores(references, queries, similarity_func)
|
||||
```
|
||||
|
||||
**Output**: 1.0 (match) or 0.0 (no match)
|
||||
|
||||
**Requirements**: Both spectra must have valid parent_mass metadata.
|
||||
|
||||
---
|
||||
|
||||
## Combining Multiple Similarity Functions
|
||||
|
||||
Combine multiple similarity metrics for robust compound identification:
|
||||
|
||||
```python
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import CosineGreedy, ModifiedCosine, FingerprintSimilarity
|
||||
|
||||
# Calculate multiple similarity scores
|
||||
cosine_scores = calculate_scores(refs, queries, CosineGreedy())
|
||||
modified_cosine_scores = calculate_scores(refs, queries, ModifiedCosine())
|
||||
fingerprint_scores = calculate_scores(refs, queries, FingerprintSimilarity())
|
||||
|
||||
# Combine scores with weights
|
||||
for i, query in enumerate(queries):
|
||||
for j, ref in enumerate(refs):
|
||||
combined_score = (0.5 * cosine_scores.scores[j, i] +
|
||||
0.3 * modified_cosine_scores.scores[j, i] +
|
||||
0.2 * fingerprint_scores.scores[j, i])
|
||||
```
|
||||
|
||||
## Accessing Scores Results
|
||||
|
||||
The `Scores` object provides multiple methods to access results:
|
||||
|
||||
```python
|
||||
# Get best matches for a query
|
||||
best_matches = scores.scores_by_query(query_spectrum, sort=True)[:10]
|
||||
|
||||
# Get scores as numpy array
|
||||
score_array = scores.scores
|
||||
|
||||
# Get scores as pandas DataFrame
|
||||
import pandas as pd
|
||||
df = scores.to_dataframe()
|
||||
|
||||
# Filter by threshold
|
||||
high_scores = [(i, j, score) for i, j, score in scores.to_list() if score > 0.7]
|
||||
|
||||
# Save scores
|
||||
scores.to_json("scores.json")
|
||||
scores.to_pickle("scores.pkl")
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
**Fast methods** (large datasets):
|
||||
- CosineGreedy
|
||||
- PrecursorMzMatch
|
||||
- ParentMassMatch
|
||||
|
||||
**Slow methods** (smaller datasets or high accuracy):
|
||||
- CosineHungarian
|
||||
- ModifiedCosine (slower than CosineGreedy)
|
||||
- NeutralLossesCosine
|
||||
- FingerprintSimilarity (requires fingerprint computation)
|
||||
|
||||
**Recommendation**: For large-scale library searches, use PrecursorMzMatch to pre-filter candidates, then apply CosineGreedy or ModifiedCosine to filtered results.
|
||||
|
||||
## Common Similarity Workflows
|
||||
|
||||
### Standard Library Matching
|
||||
```python
|
||||
from matchms.similarity import CosineGreedy
|
||||
|
||||
scores = calculate_scores(library_spectra, query_spectra,
|
||||
CosineGreedy(tolerance=0.1))
|
||||
```
|
||||
|
||||
### Multi-Metric Matching
|
||||
```python
|
||||
from matchms.similarity import CosineGreedy, ModifiedCosine, FingerprintSimilarity
|
||||
|
||||
# Spectral similarity
|
||||
cosine = calculate_scores(refs, queries, CosineGreedy())
|
||||
modified = calculate_scores(refs, queries, ModifiedCosine())
|
||||
|
||||
# Structural similarity
|
||||
fingerprint = calculate_scores(refs, queries, FingerprintSimilarity())
|
||||
```
|
||||
|
||||
### Precursor-Filtered Matching
|
||||
```python
|
||||
from matchms.similarity import PrecursorMzMatch, CosineGreedy
|
||||
|
||||
# First filter by precursor mass
|
||||
mass_filter = calculate_scores(refs, queries, PrecursorMzMatch(tolerance=0.1))
|
||||
|
||||
# Then calculate cosine only for matching precursors
|
||||
cosine_scores = calculate_scores(refs, queries, CosineGreedy())
|
||||
```
|
||||
|
||||
## Further Reading
|
||||
|
||||
For detailed API documentation, parameter descriptions, and mathematical formulations, see:
|
||||
https://matchms.readthedocs.io/en/latest/api/matchms.similarity.html
|
||||
647
skills/matchms/references/workflows.md
Normal file
647
skills/matchms/references/workflows.md
Normal file
@@ -0,0 +1,647 @@
|
||||
# Matchms Common Workflows
|
||||
|
||||
This document provides detailed examples of common mass spectrometry analysis workflows using matchms.
|
||||
|
||||
## Workflow 1: Basic Spectral Library Matching
|
||||
|
||||
Match unknown spectra against a reference library to identify compounds.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
from matchms.filtering import select_by_relative_intensity, require_minimum_number_of_peaks
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import CosineGreedy
|
||||
|
||||
# Load reference library
|
||||
print("Loading reference library...")
|
||||
library = list(load_from_mgf("reference_library.mgf"))
|
||||
|
||||
# Load query spectra (unknowns)
|
||||
print("Loading query spectra...")
|
||||
queries = list(load_from_mgf("unknown_spectra.mgf"))
|
||||
|
||||
# Process library spectra
|
||||
print("Processing library...")
|
||||
processed_library = []
|
||||
for spectrum in library:
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
|
||||
spectrum = require_minimum_number_of_peaks(spectrum, n_required=5)
|
||||
if spectrum is not None:
|
||||
processed_library.append(spectrum)
|
||||
|
||||
# Process query spectra
|
||||
print("Processing queries...")
|
||||
processed_queries = []
|
||||
for spectrum in queries:
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
|
||||
spectrum = require_minimum_number_of_peaks(spectrum, n_required=5)
|
||||
if spectrum is not None:
|
||||
processed_queries.append(spectrum)
|
||||
|
||||
# Calculate similarities
|
||||
print("Calculating similarities...")
|
||||
scores = calculate_scores(references=processed_library,
|
||||
queries=processed_queries,
|
||||
similarity_function=CosineGreedy(tolerance=0.1))
|
||||
|
||||
# Get top matches for each query
|
||||
print("\nTop matches:")
|
||||
for i, query in enumerate(processed_queries):
|
||||
top_matches = scores.scores_by_query(query, sort=True)[:5]
|
||||
|
||||
query_name = query.get("compound_name", f"Query {i}")
|
||||
print(f"\n{query_name}:")
|
||||
|
||||
for ref_idx, score in top_matches:
|
||||
ref_spectrum = processed_library[ref_idx]
|
||||
ref_name = ref_spectrum.get("compound_name", f"Ref {ref_idx}")
|
||||
print(f" {ref_name}: {score:.4f}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 2: Quality Control and Data Cleaning
|
||||
|
||||
Filter and clean spectral data before analysis.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.exporting import save_as_mgf
|
||||
from matchms.filtering import (default_filters, normalize_intensities,
|
||||
require_precursor_mz, require_minimum_number_of_peaks,
|
||||
require_minimum_number_of_high_peaks,
|
||||
select_by_relative_intensity, remove_peaks_around_precursor_mz)
|
||||
|
||||
# Load spectra
|
||||
spectra = list(load_from_mgf("raw_data.mgf"))
|
||||
print(f"Loaded {len(spectra)} raw spectra")
|
||||
|
||||
# Apply quality filters
|
||||
cleaned_spectra = []
|
||||
for spectrum in spectra:
|
||||
# Harmonize metadata
|
||||
spectrum = default_filters(spectrum)
|
||||
|
||||
# Quality requirements
|
||||
spectrum = require_precursor_mz(spectrum, minimum_accepted_mz=50.0)
|
||||
if spectrum is None:
|
||||
continue
|
||||
|
||||
spectrum = require_minimum_number_of_peaks(spectrum, n_required=10)
|
||||
if spectrum is None:
|
||||
continue
|
||||
|
||||
# Clean peaks
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
spectrum = remove_peaks_around_precursor_mz(spectrum, mz_tolerance=17)
|
||||
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
|
||||
|
||||
# Require high-quality peaks
|
||||
spectrum = require_minimum_number_of_high_peaks(spectrum,
|
||||
n_required=5,
|
||||
intensity_threshold=0.05)
|
||||
if spectrum is None:
|
||||
continue
|
||||
|
||||
cleaned_spectra.append(spectrum)
|
||||
|
||||
print(f"Retained {len(cleaned_spectra)} high-quality spectra")
|
||||
print(f"Removed {len(spectra) - len(cleaned_spectra)} low-quality spectra")
|
||||
|
||||
# Save cleaned data
|
||||
save_as_mgf(cleaned_spectra, "cleaned_data.mgf")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 3: Multi-Metric Similarity Scoring
|
||||
|
||||
Combine multiple similarity metrics for robust compound identification.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.filtering import (default_filters, normalize_intensities,
|
||||
derive_inchi_from_smiles, add_fingerprint, add_losses)
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import (CosineGreedy, ModifiedCosine,
|
||||
NeutralLossesCosine, FingerprintSimilarity)
|
||||
import numpy as np
|
||||
|
||||
# Load spectra
|
||||
library = list(load_from_mgf("library.mgf"))
|
||||
queries = list(load_from_mgf("queries.mgf"))
|
||||
|
||||
# Process with multiple features
|
||||
def process_for_multimetric(spectrum):
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
|
||||
# Add chemical fingerprints
|
||||
spectrum = derive_inchi_from_smiles(spectrum)
|
||||
spectrum = add_fingerprint(spectrum, fingerprint_type="morgan2", nbits=2048)
|
||||
|
||||
# Add neutral losses
|
||||
spectrum = add_losses(spectrum, loss_mz_from=5.0, loss_mz_to=200.0)
|
||||
|
||||
return spectrum
|
||||
|
||||
processed_library = [process_for_multimetric(s) for s in library if s is not None]
|
||||
processed_queries = [process_for_multimetric(s) for s in queries if s is not None]
|
||||
|
||||
# Calculate multiple similarity scores
|
||||
print("Calculating Cosine similarity...")
|
||||
cosine_scores = calculate_scores(processed_library, processed_queries,
|
||||
CosineGreedy(tolerance=0.1))
|
||||
|
||||
print("Calculating Modified Cosine similarity...")
|
||||
modified_cosine_scores = calculate_scores(processed_library, processed_queries,
|
||||
ModifiedCosine(tolerance=0.1))
|
||||
|
||||
print("Calculating Neutral Losses similarity...")
|
||||
neutral_losses_scores = calculate_scores(processed_library, processed_queries,
|
||||
NeutralLossesCosine(tolerance=0.1))
|
||||
|
||||
print("Calculating Fingerprint similarity...")
|
||||
fingerprint_scores = calculate_scores(processed_library, processed_queries,
|
||||
FingerprintSimilarity(similarity_measure="jaccard"))
|
||||
|
||||
# Combine scores with weights
|
||||
weights = {
|
||||
'cosine': 0.4,
|
||||
'modified_cosine': 0.3,
|
||||
'neutral_losses': 0.2,
|
||||
'fingerprint': 0.1
|
||||
}
|
||||
|
||||
# Get combined scores for each query
|
||||
for i, query in enumerate(processed_queries):
|
||||
query_name = query.get("compound_name", f"Query {i}")
|
||||
|
||||
combined_scores = []
|
||||
for j, ref in enumerate(processed_library):
|
||||
combined = (weights['cosine'] * cosine_scores.scores[j, i] +
|
||||
weights['modified_cosine'] * modified_cosine_scores.scores[j, i] +
|
||||
weights['neutral_losses'] * neutral_losses_scores.scores[j, i] +
|
||||
weights['fingerprint'] * fingerprint_scores.scores[j, i])
|
||||
combined_scores.append((j, combined))
|
||||
|
||||
# Sort by combined score
|
||||
combined_scores.sort(key=lambda x: x[1], reverse=True)
|
||||
|
||||
print(f"\n{query_name} - Top 3 matches:")
|
||||
for ref_idx, score in combined_scores[:3]:
|
||||
ref_name = processed_library[ref_idx].get("compound_name", f"Ref {ref_idx}")
|
||||
print(f" {ref_name}: {score:.4f}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 4: Precursor-Filtered Library Search
|
||||
|
||||
Pre-filter by precursor mass before spectral matching for faster searches.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import PrecursorMzMatch, CosineGreedy
|
||||
import numpy as np
|
||||
|
||||
# Load data
|
||||
library = list(load_from_mgf("large_library.mgf"))
|
||||
queries = list(load_from_mgf("queries.mgf"))
|
||||
|
||||
# Process spectra
|
||||
processed_library = [normalize_intensities(default_filters(s)) for s in library]
|
||||
processed_queries = [normalize_intensities(default_filters(s)) for s in queries]
|
||||
|
||||
# Step 1: Fast precursor mass filtering
|
||||
print("Filtering by precursor mass...")
|
||||
mass_filter = calculate_scores(processed_library, processed_queries,
|
||||
PrecursorMzMatch(tolerance=0.1, tolerance_type="Dalton"))
|
||||
|
||||
# Step 2: Calculate cosine only for matching precursors
|
||||
print("Calculating cosine similarity for filtered candidates...")
|
||||
cosine_scores = calculate_scores(processed_library, processed_queries,
|
||||
CosineGreedy(tolerance=0.1))
|
||||
|
||||
# Step 3: Apply mass filter to cosine scores
|
||||
for i, query in enumerate(processed_queries):
|
||||
candidates = []
|
||||
|
||||
for j, ref in enumerate(processed_library):
|
||||
# Only consider if precursor matches
|
||||
if mass_filter.scores[j, i] > 0:
|
||||
cosine_score = cosine_scores.scores[j, i]
|
||||
candidates.append((j, cosine_score))
|
||||
|
||||
# Sort by cosine score
|
||||
candidates.sort(key=lambda x: x[1], reverse=True)
|
||||
|
||||
query_name = query.get("compound_name", f"Query {i}")
|
||||
print(f"\n{query_name} - Top 5 matches (from {len(candidates)} candidates):")
|
||||
|
||||
for ref_idx, score in candidates[:5]:
|
||||
ref_name = processed_library[ref_idx].get("compound_name", f"Ref {ref_idx}")
|
||||
ref_mz = processed_library[ref_idx].get("precursor_mz", "N/A")
|
||||
print(f" {ref_name} (m/z {ref_mz}): {score:.4f}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 5: Building a Reusable Processing Pipeline
|
||||
|
||||
Create a standardized pipeline for consistent processing.
|
||||
|
||||
```python
|
||||
from matchms import SpectrumProcessor
|
||||
from matchms.filtering import (default_filters, normalize_intensities,
|
||||
select_by_relative_intensity,
|
||||
remove_peaks_around_precursor_mz,
|
||||
require_minimum_number_of_peaks,
|
||||
derive_inchi_from_smiles, add_fingerprint)
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.exporting import save_as_pickle
|
||||
|
||||
# Define custom processing pipeline
|
||||
def create_standard_pipeline():
|
||||
"""Create a reusable processing pipeline"""
|
||||
return SpectrumProcessor([
|
||||
default_filters,
|
||||
normalize_intensities,
|
||||
lambda s: remove_peaks_around_precursor_mz(s, mz_tolerance=17),
|
||||
lambda s: select_by_relative_intensity(s, intensity_from=0.01),
|
||||
lambda s: require_minimum_number_of_peaks(s, n_required=5),
|
||||
derive_inchi_from_smiles,
|
||||
lambda s: add_fingerprint(s, fingerprint_type="morgan2")
|
||||
])
|
||||
|
||||
# Create pipeline instance
|
||||
pipeline = create_standard_pipeline()
|
||||
|
||||
# Process multiple datasets with same pipeline
|
||||
datasets = ["dataset1.mgf", "dataset2.mgf", "dataset3.mgf"]
|
||||
|
||||
for dataset_file in datasets:
|
||||
print(f"\nProcessing {dataset_file}...")
|
||||
|
||||
# Load spectra
|
||||
spectra = list(load_from_mgf(dataset_file))
|
||||
|
||||
# Apply pipeline
|
||||
processed = []
|
||||
for spectrum in spectra:
|
||||
result = pipeline(spectrum)
|
||||
if result is not None:
|
||||
processed.append(result)
|
||||
|
||||
print(f" Loaded: {len(spectra)}")
|
||||
print(f" Processed: {len(processed)}")
|
||||
|
||||
# Save processed data
|
||||
output_file = dataset_file.replace(".mgf", "_processed.pkl")
|
||||
save_as_pickle(processed, output_file)
|
||||
print(f" Saved to: {output_file}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 6: Format Conversion and Standardization
|
||||
|
||||
Convert between different mass spectrometry file formats.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mzml, load_from_mgf
|
||||
from matchms.exporting import save_as_mgf, save_as_msp, save_as_json
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
|
||||
def convert_and_standardize(input_file, output_format="mgf"):
|
||||
"""
|
||||
Load, standardize, and convert mass spectrometry data
|
||||
|
||||
Parameters:
|
||||
-----------
|
||||
input_file : str
|
||||
Input file path (supports .mzML, .mzXML, .mgf)
|
||||
output_format : str
|
||||
Output format ('mgf', 'msp', or 'json')
|
||||
"""
|
||||
# Determine input format and load
|
||||
if input_file.endswith('.mzML') or input_file.endswith('.mzXML'):
|
||||
from matchms.importing import load_from_mzml
|
||||
spectra = list(load_from_mzml(input_file, ms_level=2))
|
||||
elif input_file.endswith('.mgf'):
|
||||
spectra = list(load_from_mgf(input_file))
|
||||
else:
|
||||
raise ValueError(f"Unsupported format: {input_file}")
|
||||
|
||||
print(f"Loaded {len(spectra)} spectra from {input_file}")
|
||||
|
||||
# Standardize
|
||||
processed = []
|
||||
for spectrum in spectra:
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
if spectrum is not None:
|
||||
processed.append(spectrum)
|
||||
|
||||
print(f"Standardized {len(processed)} spectra")
|
||||
|
||||
# Export
|
||||
output_file = input_file.rsplit('.', 1)[0] + f'_standardized.{output_format}'
|
||||
|
||||
if output_format == 'mgf':
|
||||
save_as_mgf(processed, output_file)
|
||||
elif output_format == 'msp':
|
||||
save_as_msp(processed, output_file)
|
||||
elif output_format == 'json':
|
||||
save_as_json(processed, output_file)
|
||||
else:
|
||||
raise ValueError(f"Unsupported output format: {output_format}")
|
||||
|
||||
print(f"Saved to {output_file}")
|
||||
return processed
|
||||
|
||||
# Convert mzML to MGF
|
||||
convert_and_standardize("raw_data.mzML", output_format="mgf")
|
||||
|
||||
# Convert MGF to MSP library format
|
||||
convert_and_standardize("library.mgf", output_format="msp")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 7: Metadata Enrichment and Validation
|
||||
|
||||
Enrich spectra with chemical structure information and validate annotations.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.exporting import save_as_mgf
|
||||
from matchms.filtering import (default_filters, derive_inchi_from_smiles,
|
||||
derive_inchikey_from_inchi, derive_smiles_from_inchi,
|
||||
add_fingerprint, repair_not_matching_annotation,
|
||||
require_valid_annotation)
|
||||
|
||||
# Load spectra
|
||||
spectra = list(load_from_mgf("spectra.mgf"))
|
||||
|
||||
# Enrich and validate
|
||||
enriched_spectra = []
|
||||
validation_failures = []
|
||||
|
||||
for i, spectrum in enumerate(spectra):
|
||||
# Basic harmonization
|
||||
spectrum = default_filters(spectrum)
|
||||
|
||||
# Derive chemical structures
|
||||
spectrum = derive_inchi_from_smiles(spectrum)
|
||||
spectrum = derive_inchikey_from_inchi(spectrum)
|
||||
spectrum = derive_smiles_from_inchi(spectrum)
|
||||
|
||||
# Repair mismatches
|
||||
spectrum = repair_not_matching_annotation(spectrum)
|
||||
|
||||
# Add molecular fingerprints
|
||||
spectrum = add_fingerprint(spectrum, fingerprint_type="morgan2", nbits=2048)
|
||||
|
||||
# Validate
|
||||
validated = require_valid_annotation(spectrum)
|
||||
|
||||
if validated is not None:
|
||||
enriched_spectra.append(validated)
|
||||
else:
|
||||
validation_failures.append(i)
|
||||
|
||||
print(f"Successfully enriched: {len(enriched_spectra)}")
|
||||
print(f"Validation failures: {len(validation_failures)}")
|
||||
|
||||
# Save enriched data
|
||||
save_as_mgf(enriched_spectra, "enriched_spectra.mgf")
|
||||
|
||||
# Report failures
|
||||
if validation_failures:
|
||||
print("\nSpectra that failed validation:")
|
||||
for idx in validation_failures[:10]: # Show first 10
|
||||
original = spectra[idx]
|
||||
name = original.get("compound_name", f"Spectrum {idx}")
|
||||
print(f" - {name}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 8: Large-Scale Library Comparison
|
||||
|
||||
Compare two large spectral libraries efficiently.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import CosineGreedy
|
||||
import numpy as np
|
||||
|
||||
# Load two libraries
|
||||
print("Loading libraries...")
|
||||
library1 = list(load_from_mgf("library1.mgf"))
|
||||
library2 = list(load_from_mgf("library2.mgf"))
|
||||
|
||||
# Process
|
||||
processed_lib1 = [normalize_intensities(default_filters(s)) for s in library1]
|
||||
processed_lib2 = [normalize_intensities(default_filters(s)) for s in library2]
|
||||
|
||||
# Calculate all-vs-all similarities
|
||||
print("Calculating similarities...")
|
||||
scores = calculate_scores(processed_lib1, processed_lib2,
|
||||
CosineGreedy(tolerance=0.1))
|
||||
|
||||
# Find high-similarity pairs (potential duplicates or similar compounds)
|
||||
threshold = 0.8
|
||||
similar_pairs = []
|
||||
|
||||
for i, spec1 in enumerate(processed_lib1):
|
||||
for j, spec2 in enumerate(processed_lib2):
|
||||
score = scores.scores[i, j]
|
||||
if score >= threshold:
|
||||
similar_pairs.append({
|
||||
'lib1_idx': i,
|
||||
'lib2_idx': j,
|
||||
'lib1_name': spec1.get("compound_name", f"L1_{i}"),
|
||||
'lib2_name': spec2.get("compound_name", f"L2_{j}"),
|
||||
'similarity': score
|
||||
})
|
||||
|
||||
# Sort by similarity
|
||||
similar_pairs.sort(key=lambda x: x['similarity'], reverse=True)
|
||||
|
||||
print(f"\nFound {len(similar_pairs)} pairs with similarity >= {threshold}")
|
||||
print("\nTop 10 most similar pairs:")
|
||||
for pair in similar_pairs[:10]:
|
||||
print(f"{pair['lib1_name']} <-> {pair['lib2_name']}: {pair['similarity']:.4f}")
|
||||
|
||||
# Export to CSV
|
||||
import pandas as pd
|
||||
df = pd.DataFrame(similar_pairs)
|
||||
df.to_csv("library_comparison.csv", index=False)
|
||||
print("\nFull results saved to library_comparison.csv")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 9: Ion Mode Specific Processing
|
||||
|
||||
Process positive and negative mode spectra separately.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.filtering import (default_filters, normalize_intensities,
|
||||
require_correct_ionmode, derive_ionmode)
|
||||
from matchms.exporting import save_as_mgf
|
||||
|
||||
# Load mixed mode spectra
|
||||
spectra = list(load_from_mgf("mixed_modes.mgf"))
|
||||
|
||||
# Separate by ion mode
|
||||
positive_spectra = []
|
||||
negative_spectra = []
|
||||
unknown_mode = []
|
||||
|
||||
for spectrum in spectra:
|
||||
# Harmonize and derive ion mode
|
||||
spectrum = default_filters(spectrum)
|
||||
spectrum = derive_ionmode(spectrum)
|
||||
|
||||
# Separate by mode
|
||||
ionmode = spectrum.get("ionmode")
|
||||
|
||||
if ionmode == "positive":
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
positive_spectra.append(spectrum)
|
||||
elif ionmode == "negative":
|
||||
spectrum = normalize_intensities(spectrum)
|
||||
negative_spectra.append(spectrum)
|
||||
else:
|
||||
unknown_mode.append(spectrum)
|
||||
|
||||
print(f"Positive mode: {len(positive_spectra)}")
|
||||
print(f"Negative mode: {len(negative_spectra)}")
|
||||
print(f"Unknown mode: {len(unknown_mode)}")
|
||||
|
||||
# Save separated data
|
||||
save_as_mgf(positive_spectra, "positive_mode.mgf")
|
||||
save_as_mgf(negative_spectra, "negative_mode.mgf")
|
||||
|
||||
# Process mode-specific analyses
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import CosineGreedy
|
||||
|
||||
if len(positive_spectra) > 1:
|
||||
print("\nCalculating positive mode similarities...")
|
||||
pos_scores = calculate_scores(positive_spectra, positive_spectra,
|
||||
CosineGreedy(tolerance=0.1))
|
||||
|
||||
if len(negative_spectra) > 1:
|
||||
print("Calculating negative mode similarities...")
|
||||
neg_scores = calculate_scores(negative_spectra, negative_spectra,
|
||||
CosineGreedy(tolerance=0.1))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Workflow 10: Automated Compound Identification Report
|
||||
|
||||
Generate a detailed compound identification report.
|
||||
|
||||
```python
|
||||
from matchms.importing import load_from_mgf
|
||||
from matchms.filtering import default_filters, normalize_intensities
|
||||
from matchms import calculate_scores
|
||||
from matchms.similarity import CosineGreedy, ModifiedCosine
|
||||
import pandas as pd
|
||||
|
||||
def identify_compounds(query_file, library_file, output_csv="identification_report.csv"):
|
||||
"""
|
||||
Automated compound identification with detailed report
|
||||
"""
|
||||
# Load data
|
||||
print("Loading data...")
|
||||
queries = list(load_from_mgf(query_file))
|
||||
library = list(load_from_mgf(library_file))
|
||||
|
||||
# Process
|
||||
proc_queries = [normalize_intensities(default_filters(s)) for s in queries]
|
||||
proc_library = [normalize_intensities(default_filters(s)) for s in library]
|
||||
|
||||
# Calculate similarities
|
||||
print("Calculating similarities...")
|
||||
cosine_scores = calculate_scores(proc_library, proc_queries, CosineGreedy())
|
||||
modified_scores = calculate_scores(proc_library, proc_queries, ModifiedCosine())
|
||||
|
||||
# Generate report
|
||||
results = []
|
||||
for i, query in enumerate(proc_queries):
|
||||
query_name = query.get("compound_name", f"Unknown_{i}")
|
||||
query_mz = query.get("precursor_mz", "N/A")
|
||||
|
||||
# Get top 5 matches
|
||||
cosine_matches = cosine_scores.scores_by_query(query, sort=True)[:5]
|
||||
|
||||
for rank, (lib_idx, cos_score) in enumerate(cosine_matches, 1):
|
||||
ref = proc_library[lib_idx]
|
||||
mod_score = modified_scores.scores[lib_idx, i]
|
||||
|
||||
results.append({
|
||||
'Query': query_name,
|
||||
'Query_mz': query_mz,
|
||||
'Rank': rank,
|
||||
'Match': ref.get("compound_name", f"Ref_{lib_idx}"),
|
||||
'Match_mz': ref.get("precursor_mz", "N/A"),
|
||||
'Cosine_Score': cos_score,
|
||||
'Modified_Cosine': mod_score,
|
||||
'InChIKey': ref.get("inchikey", "N/A")
|
||||
})
|
||||
|
||||
# Create DataFrame and save
|
||||
df = pd.DataFrame(results)
|
||||
df.to_csv(output_csv, index=False)
|
||||
print(f"\nReport saved to {output_csv}")
|
||||
|
||||
# Summary statistics
|
||||
print("\nSummary:")
|
||||
high_confidence = len(df[df['Cosine_Score'] >= 0.8])
|
||||
medium_confidence = len(df[(df['Cosine_Score'] >= 0.6) & (df['Cosine_Score'] < 0.8)])
|
||||
low_confidence = len(df[df['Cosine_Score'] < 0.6])
|
||||
|
||||
print(f" High confidence (≥0.8): {high_confidence}")
|
||||
print(f" Medium confidence (0.6-0.8): {medium_confidence}")
|
||||
print(f" Low confidence (<0.6): {low_confidence}")
|
||||
|
||||
return df
|
||||
|
||||
# Run identification
|
||||
report = identify_compounds("unknowns.mgf", "reference_library.mgf")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always process both queries and references**: Apply the same filters to ensure consistent comparison
|
||||
2. **Save intermediate results**: Use pickle format for fast reloading of processed spectra
|
||||
3. **Monitor memory usage**: Use generators for large files instead of loading all at once
|
||||
4. **Validate data quality**: Apply quality filters before similarity calculations
|
||||
5. **Choose appropriate similarity metrics**: CosineGreedy for speed, ModifiedCosine for related compounds
|
||||
6. **Combine multiple metrics**: Use multiple similarity scores for robust identification
|
||||
7. **Filter by precursor mass first**: Dramatically speeds up large library searches
|
||||
8. **Document your pipeline**: Save processing parameters for reproducibility
|
||||
|
||||
## Further Resources
|
||||
|
||||
- matchms documentation: https://matchms.readthedocs.io
|
||||
- GNPS platform: https://gnps.ucsd.edu
|
||||
- matchms GitHub: https://github.com/matchms/matchms
|
||||
Reference in New Issue
Block a user