Matchms Importing and Exporting Reference

This document details all file format support in matchms for loading and saving mass spectrometry data.

Importing Spectra

Matchms provides dedicated functions for loading spectra from various file formats. Most import functions return generators, which allows memory-efficient processing of large files; load_from_usi, described below, returns a single spectrum instead.

Common Import Pattern

from matchms.importing import load_from_mgf

# Load spectra (returns generator)
spectra_generator = load_from_mgf("spectra.mgf")

# Convert to list for processing
spectra = list(spectra_generator)
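
Because the loaders yield spectra lazily, the first few entries can be inspected without reading the whole file. The following is a minimal sketch using only the standard library; preview is a hypothetical helper (not part of matchms), and fake_loader stands in for a call such as load_from_mgf("spectra.mgf").

```python
from itertools import islice


def preview(spectra_iter, n=5):
    """Return at most n items from an iterator without consuming the rest."""
    return list(islice(spectra_iter, n))


def fake_loader(count):
    """Stand-in for a matchms loader: yields items one at a time."""
    for i in range(count):
        yield f"spectrum_{i}"


# Only three items are ever generated, even though the source is large.
first_three = preview(fake_loader(1_000_000), n=3)
print(first_three)
```

With matchms installed, the same helper could be called as preview(load_from_mgf("spectra.mgf"), n=3).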

Supported Import Formats

MGF (Mascot Generic Format)

Function: load_from_mgf(filename, metadata_harmonization=True)

Description: Loads spectra from MGF files, a common format for mass spectrometry data exchange.

Parameters:

  • filename (str): Path to MGF file
  • metadata_harmonization (bool, default=True): Apply automatic metadata key harmonization

Example:

from matchms.importing import load_from_mgf

# Load with metadata harmonization
spectra = list(load_from_mgf("data.mgf"))

# Load without harmonization
spectra = list(load_from_mgf("data.mgf", metadata_harmonization=False))

MGF Format: Text-based format with BEGIN IONS/END IONS blocks containing metadata and peak lists.
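
To make the block structure concrete, here is a minimal sketch of a parser for MGF-style text, written with the standard library only. It illustrates the layout described above and is not how matchms reads MGF files; parse_mgf_text is a hypothetical helper and the sample values are invented for illustration.

```python
def parse_mgf_text(text):
    """Parse MGF-style text into a list of (metadata, peaks) tuples."""
    records = []
    metadata, peaks = None, None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            metadata, peaks = {}, []
        elif line == "END IONS":
            if metadata is not None:
                records.append((metadata, peaks))
            metadata, peaks = None, None
        elif metadata is None:
            continue  # ignore anything outside a BEGIN IONS/END IONS block
        elif "=" in line:
            # Metadata lines have the form KEY=value
            key, value = line.split("=", 1)
            metadata[key.strip()] = value.strip()
        else:
            # Peak lines hold an m/z value followed by an intensity
            mz, intensity = line.split()[:2]
            peaks.append((float(mz), float(intensity)))
    return records


sample = """\
BEGIN IONS
TITLE=example spectrum
PEPMASS=301.14
CHARGE=1+
100.0 10.0
150.5 250.0
END IONS
"""

records = parse_mgf_text(sample)
print(records)
```

For real data, prefer load_from_mgf, which also builds Spectrum objects and applies metadata harmonization.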


MSP (NIST Mass Spectral Library Format)

Function: load_from_msp(filename, metadata_harmonization=True)

Description: Loads spectra from MSP files, commonly used for spectral libraries.

Parameters:

  • filename (str): Path to MSP file
  • metadata_harmonization (bool, default=True): Apply automatic metadata harmonization

Example:

from matchms.importing import load_from_msp

spectra = list(load_from_msp("library.msp"))

MSP Format: Text-based format with Name/MW/Comment fields followed by peak lists.


mzML (Mass Spectrometry Markup Language)

Function: load_from_mzml(filename, ms_level=2, metadata_harmonization=True)

Description: Loads spectra from mzML files, the standard XML-based format for raw mass spectrometry data.

Parameters:

  • filename (str): Path to mzML file
  • ms_level (int, default=2): MS level to extract (1 for MS1, 2 for MS2/tandem)
  • metadata_harmonization (bool, default=True): Apply automatic metadata harmonization

Example:

from matchms.importing import load_from_mzml

# Load MS2 spectra (default)
ms2_spectra = list(load_from_mzml("data.mzML"))

# Load MS1 spectra
ms1_spectra = list(load_from_mzml("data.mzML", ms_level=1))

mzML Format: XML-based standard format containing raw instrument data and rich metadata.
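
To make the ms_level parameter concrete: in mzML, each spectrum element records its MS level in a cvParam with accession MS:1000511. The sketch below counts spectra per level using only the standard library. It illustrates the file structure and is not how matchms reads mzML; count_by_ms_level is a hypothetical helper, and the embedded document is a heavily trimmed example rather than a valid mzML file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = "http://psi.hupo.org/ms/mzml"
MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS term "ms level"


def count_by_ms_level(xml_text):
    """Count <spectrum> elements per MS level in mzML-style XML."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for spectrum in root.iter(f"{{{MZML_NS}}}spectrum"):
        for param in spectrum.findall(f"{{{MZML_NS}}}cvParam"):
            if param.get("accession") == MS_LEVEL_ACCESSION:
                counts[int(param.get("value"))] += 1
    return dict(counts)


sample = """\
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
"""

levels = count_by_ms_level(sample)
print(levels)
```

A count like this can help decide whether ms_level=1 or ms_level=2 is the right choice before loading a file with load_from_mzml.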


mzXML

Function: load_from_mzxml(filename, ms_level=2, metadata_harmonization=True)

Description: Loads spectra from mzXML files, an earlier XML-based format for mass spectrometry data.

Parameters:

  • filename (str): Path to mzXML file
  • ms_level (int, default=2): MS level to extract
  • metadata_harmonization (bool, default=True): Apply automatic metadata harmonization

Example:

from matchms.importing import load_from_mzxml

spectra = list(load_from_mzxml("data.mzXML"))

mzXML Format: XML-based format, predecessor to mzML.


JSON (GNPS Format)

Function: load_from_json(filename, metadata_harmonization=True)

Description: Loads spectra from JSON files, particularly GNPS-compatible JSON format.

Parameters:

  • filename (str): Path to JSON file
  • metadata_harmonization (bool, default=True): Apply automatic metadata harmonization

Example:

from matchms.importing import load_from_json

spectra = list(load_from_json("spectra.json"))

JSON Format: Structured JSON with spectrum metadata and peak arrays.


Pickle (Python Serialization)

Function: load_from_pickle(filename)

Description: Loads previously saved matchms Spectrum objects from a pickle file. This is a fast way to reload preprocessed spectra.

Parameters:

  • filename (str): Path to pickle file

Example:

from matchms.importing import load_from_pickle

spectra = list(load_from_pickle("processed_spectra.pkl"))

Use case: Saving and loading preprocessed spectra for faster subsequent analyses.


USI (Universal Spectrum Identifier)

Function: load_from_usi(usi)

Description: Loads a single spectrum from a metabolomics USI reference.

Parameters:

  • usi (str): Universal Spectrum Identifier string

Example:

from matchms.importing import load_from_usi

usi = "mzspec:GNPS:TASK-...:spectrum..."
spectrum = load_from_usi(usi)

USI Format: Standardized identifier for accessing spectra from online repositories.


Exporting Spectra

Matchms provides functions to save processed spectra to various formats for sharing and archival.

MGF Export

Function: save_as_mgf(spectra, filename, write_mode='w')

Description: Saves spectra to MGF format.

Parameters:

  • spectra (list): List of Spectrum objects to save
  • filename (str): Output file path
  • write_mode (str, default='w'): File write mode ('w' for write, 'a' for append)

Example:

from matchms.exporting import save_as_mgf

save_as_mgf(processed_spectra, "output.mgf")

MSP Export

Function: save_as_msp(spectra, filename, write_mode='w')

Description: Saves spectra to MSP format.

Parameters:

  • spectra (list): List of Spectrum objects to save
  • filename (str): Output file path
  • write_mode (str, default='w'): File write mode

Example:

from matchms.exporting import save_as_msp

save_as_msp(library_spectra, "library.msp")

JSON Export

Function: save_as_json(spectra, filename, write_mode='w')

Description: Saves spectra to JSON format (GNPS-compatible).

Parameters:

  • spectra (list): List of Spectrum objects to save
  • filename (str): Output file path
  • write_mode (str, default='w'): File write mode

Example:

from matchms.exporting import save_as_json

save_as_json(spectra, "spectra.json")

Pickle Export

Function: save_as_pickle(spectra, filename)

Description: Saves spectra as a Python pickle file. Pickle preserves all Spectrum attributes and is the fastest format to load.

Parameters:

  • spectra (list): List of Spectrum objects to save
  • filename (str): Output file path

Example:

from matchms.exporting import save_as_pickle

save_as_pickle(processed_spectra, "processed.pkl")

Advantages:

  • Fast save and load
  • Preserves exact Spectrum state
  • No format conversion overhead

Disadvantages:

  • Not human-readable
  • Python-specific (not portable to other languages)
  • Pickle files may not load across different Python or matchms versions
  • Unsafe to load from untrusted sources, since unpickling can execute arbitrary code
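
The "preserves exact state" point can be seen with plain Python data. The sketch below uses only the standard library; the record dictionary is an invented stand-in for a spectrum, not a matchms object. It compares a pickle round trip with a JSON round trip.

```python
import json
import pickle

# Invented stand-in for a spectrum: peaks stored as a tuple of tuples.
record = {"precursor_mz": 301.14, "peaks": ((100.0, 10.0), (150.5, 250.0))}

via_pickle = pickle.loads(pickle.dumps(record))
via_json = json.loads(json.dumps(record))

# Pickle restores the original Python types exactly.
print(via_pickle == record)
# JSON has no tuple type, so the peaks come back as nested lists.
print(type(via_json["peaks"]).__name__)
```

Text-based formats involve a similar conversion step, which is why pickle is convenient for intermediate results within a single Python environment.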

Complete Import/Export Workflow

Preprocessing and Saving Pipeline

from matchms.importing import load_from_mgf
from matchms.exporting import save_as_mgf, save_as_pickle
from matchms.filtering import default_filters, normalize_intensities
from matchms.filtering import select_by_relative_intensity

# Load raw spectra
spectra = list(load_from_mgf("raw_data.mgf"))

# Process spectra
processed = []
for spectrum in spectra:
    spectrum = default_filters(spectrum)
    spectrum = normalize_intensities(spectrum)
    spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
    if spectrum is not None:
        processed.append(spectrum)

# Save processed spectra (MGF for sharing)
save_as_mgf(processed, "processed_data.mgf")

# Save as pickle for fast reloading
save_as_pickle(processed, "processed_data.pkl")

Format Conversion

from matchms.importing import load_from_mzml
from matchms.exporting import save_as_mgf, save_as_msp

# Convert mzML to MGF
spectra = list(load_from_mzml("data.mzML", ms_level=2))
save_as_mgf(spectra, "data.mgf")

# Convert to MSP library format
save_as_msp(spectra, "data.msp")

Loading from Multiple Files

from matchms.importing import load_from_mgf
import glob

# Load all MGF files in directory
all_spectra = []
for mgf_file in glob.glob("data/*.mgf"):
    spectra = list(load_from_mgf(mgf_file))
    all_spectra.extend(spectra)

print(f"Loaded {len(all_spectra)} spectra from multiple files")
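
When the combined files are too large to hold in a list, the per-file generators can be chained lazily instead. This is a minimal sketch under stated assumptions: iter_spectra is a hypothetical helper that accepts any loader function, and line_loader is a stand-in used here so the example runs without matchms or real MGF files.

```python
import tempfile
from itertools import chain
from pathlib import Path


def iter_spectra(directory, pattern, loader):
    """Lazily yield items from every matching file, in sorted path order."""
    paths = sorted(Path(directory).glob(pattern))
    return chain.from_iterable(loader(str(path)) for path in paths)


def line_loader(path):
    """Stand-in loader: yields one item per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield line.strip()


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.mgf").write_text("s1\ns2\n", encoding="utf-8")
    Path(tmp, "b.mgf").write_text("s3\n", encoding="utf-8")
    loaded = list(iter_spectra(tmp, "*.mgf", line_loader))

print(loaded)
```

With matchms installed, iter_spectra("data", "*.mgf", load_from_mgf) would yield spectra file by file; sorting the paths keeps the order reproducible across runs.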

Memory-Efficient Processing

from matchms.importing import load_from_mgf
from matchms.exporting import save_as_mgf
from matchms.filtering import default_filters, normalize_intensities

# Process large file without loading all into memory
def process_spectrum(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = normalize_intensities(spectrum)
    return spectrum

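# Stream processing: write each spectrum as soon as it is processed
output_path = "output.mgf"
open(output_path, 'w').close()  # start from an empty file so reruns do not accumulate

for spectrum in load_from_mgf("large_file.mgf"):
    processed = process_spectrum(spectrum)
    if processed is not None:
        # save_as_mgf expects a file path, not an open file handle;
        # write_mode='a' appends, as described under MGF Export
        save_as_mgf([processed], output_path, write_mode='a')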
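
Saving one spectrum per call keeps memory use minimal but issues one save call per spectrum. A middle ground is to collect fixed-size batches from the generator and save each batch at once. The sketch below uses only the standard library; batched is a hypothetical helper, and range(7) stands in for a stream of processed spectra.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# range(7) stands in for a generator of processed spectra.
batches = list(batched(range(7), 3))
print(batches)
```

Each batch is an ordinary list, so it could be passed to a save function in append mode, and at most `size` spectra are held in memory at a time.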

Format Selection Guidelines

MGF:

  • ✓ Widely supported
  • ✓ Human-readable
  • ✓ Good for data sharing
  • ✓ Moderate file size
  • Best for: Data exchange, GNPS uploads, publication data

MSP:

  • ✓ Spectral library standard
  • ✓ Human-readable
  • ✓ Good metadata support
  • Best for: Reference libraries, NIST format compatibility

JSON:

  • ✓ Structured format
  • ✓ GNPS compatible
  • ✓ Easy to parse programmatically
  • Best for: Web applications, GNPS integration, structured data

Pickle:

  • ✓ Fastest save/load
  • ✓ Preserves exact state
  • ✗ Not portable to other languages
  • ✗ Not human-readable
  • Best for: Intermediate processing, Python-only workflows

mzML/mzXML:

  • ✓ Raw instrument data
  • ✓ Rich metadata
  • ✓ Industry standard
  • ✗ Large file size
  • ✗ Slower to parse
  • Best for: Raw data archival, multi-level MS data
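
In scripts that accept several of these formats, the import function can be selected from the file extension. This is a minimal sketch: the function names are the ones documented above, while the extension mapping and the loader_name_for helper are assumptions made for illustration (for example, ".pkl" for pickle files).

```python
from pathlib import Path

# Assumed mapping from file extension to the matchms.importing function name.
LOADER_BY_SUFFIX = {
    ".mgf": "load_from_mgf",
    ".msp": "load_from_msp",
    ".mzml": "load_from_mzml",
    ".mzxml": "load_from_mzxml",
    ".json": "load_from_json",
    ".pkl": "load_from_pickle",
}


def loader_name_for(path):
    """Return the import function name for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(loader_name_for("data/run01.mzML"))
```

With matchms installed, the function itself could then be looked up with getattr(matchms.importing, loader_name_for(path)) and called with the path.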

Metadata Harmonization

The metadata_harmonization parameter (available in most import functions) automatically standardizes metadata keys:

from matchms.importing import load_from_mgf

# Without harmonization
spectra = list(load_from_mgf("data.mgf", metadata_harmonization=False))
# Keys stay as written in the file, e.g. "PRECURSOR_MZ", "Precursor_mz", "precursormz"

# With harmonization (default)
spectra = list(load_from_mgf("data.mgf", metadata_harmonization=True))
# Standardized to: "precursor_mz"

Recommended: Keep harmonization enabled (default) for consistent metadata access across different data sources.
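
The idea behind key harmonization can be sketched in a few lines. This is an illustration only, not the mapping matchms actually applies: harmonize_keys and its alias table are hypothetical and cover just the precursor m/z variants mentioned above.

```python
def harmonize_keys(metadata, aliases=None):
    """Return a copy of metadata with lowercased, alias-resolved keys."""
    if aliases is None:
        # Hypothetical alias table; a real one would cover many more keys.
        aliases = {"precursormz": "precursor_mz"}
    harmonized = {}
    for key, value in metadata.items():
        normalized = key.strip().lower().replace(" ", "_")
        harmonized[aliases.get(normalized, normalized)] = value
    return harmonized


variants = ["PRECURSOR_MZ", "Precursor_mz", "precursormz"]
results = [harmonize_keys({key: 301.14}) for key in variants]
print(results)
```

All three spellings end up under the same key, which is what makes downstream code independent of the source file's conventions.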

File Format Specifications

For detailed format specifications, refer to the official documentation of each format.

Further Reading

For complete API documentation:

  • https://matchms.readthedocs.io/en/latest/api/matchms.importing.html
  • https://matchms.readthedocs.io/en/latest/api/matchms.exporting.html