# Matchms Importing and Exporting Reference This document details all file format support in matchms for loading and saving mass spectrometry data. ## Importing Spectra Matchms provides dedicated functions for loading spectra from various file formats. All import functions return generators for memory-efficient processing of large files. ### Common Import Pattern ```python from matchms.importing import load_from_mgf # Load spectra (returns generator) spectra_generator = load_from_mgf("spectra.mgf") # Convert to list for processing spectra = list(spectra_generator) ``` ## Supported Import Formats ### MGF (Mascot Generic Format) **Function**: `load_from_mgf(filename, metadata_harmonization=True)` **Description**: Loads spectra from MGF files, a common format for mass spectrometry data exchange. **Parameters**: - `filename` (str): Path to MGF file - `metadata_harmonization` (bool, default=True): Apply automatic metadata key harmonization **Example**: ```python from matchms.importing import load_from_mgf # Load with metadata harmonization spectra = list(load_from_mgf("data.mgf")) # Load without harmonization spectra = list(load_from_mgf("data.mgf", metadata_harmonization=False)) ``` **MGF Format**: Text-based format with BEGIN IONS/END IONS blocks containing metadata and peak lists. --- ### MSP (NIST Mass Spectral Library Format) **Function**: `load_from_msp(filename, metadata_harmonization=True)` **Description**: Loads spectra from MSP files, commonly used for spectral libraries. **Parameters**: - `filename` (str): Path to MSP file - `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization **Example**: ```python from matchms.importing import load_from_msp spectra = list(load_from_msp("library.msp")) ``` **MSP Format**: Text-based format with Name/MW/Comment fields followed by peak lists. --- ### mzML (Mass Spectrometry Markup Language) **Function**: `load_from_mzml(filename, ms_level=2, metadata_harmonization=True)` **Description**: Loads spectra from mzML files, the standard XML-based format for raw mass spectrometry data. **Parameters**: - `filename` (str): Path to mzML file - `ms_level` (int, default=2): MS level to extract (1 for MS1, 2 for MS2/tandem) - `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization **Example**: ```python from matchms.importing import load_from_mzml # Load MS2 spectra (default) ms2_spectra = list(load_from_mzml("data.mzML")) # Load MS1 spectra ms1_spectra = list(load_from_mzml("data.mzML", ms_level=1)) ``` **mzML Format**: XML-based standard format containing raw instrument data and rich metadata. --- ### mzXML **Function**: `load_from_mzxml(filename, ms_level=2, metadata_harmonization=True)` **Description**: Loads spectra from mzXML files, an earlier XML-based format for mass spectrometry data. **Parameters**: - `filename` (str): Path to mzXML file - `ms_level` (int, default=2): MS level to extract - `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization **Example**: ```python from matchms.importing import load_from_mzxml spectra = list(load_from_mzxml("data.mzXML")) ``` **mzXML Format**: XML-based format, predecessor to mzML. --- ### JSON (GNPS Format) **Function**: `load_from_json(filename, metadata_harmonization=True)` **Description**: Loads spectra from JSON files, particularly GNPS-compatible JSON format. **Parameters**: - `filename` (str): Path to JSON file - `metadata_harmonization` (bool, default=True): Apply automatic metadata harmonization **Example**: ```python from matchms.importing import load_from_json spectra = list(load_from_json("spectra.json")) ``` **JSON Format**: Structured JSON with spectrum metadata and peak arrays. --- ### Pickle (Python Serialization) **Function**: `load_from_pickle(filename)` **Description**: Loads previously saved matchms Spectrum objects from pickle files. Fast loading of preprocessed spectra. **Parameters**: - `filename` (str): Path to pickle file **Example**: ```python from matchms.importing import load_from_pickle spectra = list(load_from_pickle("processed_spectra.pkl")) ``` **Use case**: Saving and loading preprocessed spectra for faster subsequent analyses. --- ### USI (Universal Spectrum Identifier) **Function**: `load_from_usi(usi)` **Description**: Loads a single spectrum from a metabolomics USI reference. **Parameters**: - `usi` (str): Universal Spectrum Identifier string **Example**: ```python from matchms.importing import load_from_usi usi = "mzspec:GNPS:TASK-...:spectrum..." spectrum = load_from_usi(usi) ``` **USI Format**: Standardized identifier for accessing spectra from online repositories. --- ## Exporting Spectra Matchms provides functions to save processed spectra to various formats for sharing and archival. ### MGF Export **Function**: `save_as_mgf(spectra, filename, write_mode='w')` **Description**: Saves spectra to MGF format. **Parameters**: - `spectra` (list): List of Spectrum objects to save - `filename` (str): Output file path - `write_mode` (str, default='w'): File write mode ('w' for write, 'a' for append) **Example**: ```python from matchms.exporting import save_as_mgf save_as_mgf(processed_spectra, "output.mgf") ``` --- ### MSP Export **Function**: `save_as_msp(spectra, filename, write_mode='w')` **Description**: Saves spectra to MSP format. **Parameters**: - `spectra` (list): List of Spectrum objects to save - `filename` (str): Output file path - `write_mode` (str, default='w'): File write mode **Example**: ```python from matchms.exporting import save_as_msp save_as_msp(library_spectra, "library.msp") ``` --- ### JSON Export **Function**: `save_as_json(spectra, filename, write_mode='w')` **Description**: Saves spectra to JSON format (GNPS-compatible). **Parameters**: - `spectra` (list): List of Spectrum objects to save - `filename` (str): Output file path - `write_mode` (str, default='w'): File write mode **Example**: ```python from matchms.exporting import save_as_json save_as_json(spectra, "spectra.json") ``` --- ### Pickle Export **Function**: `save_as_pickle(spectra, filename)` **Description**: Saves spectra as Python pickle file. Preserves all Spectrum attributes and is fastest for loading. **Parameters**: - `spectra` (list): List of Spectrum objects to save - `filename` (str): Output file path **Example**: ```python from matchms.exporting import save_as_pickle save_as_pickle(processed_spectra, "processed.pkl") ``` **Advantages**: - Fast save and load - Preserves exact Spectrum state - No format conversion overhead **Disadvantages**: - Not human-readable - Python-specific (not portable to other languages) - Pickle format may not be compatible across Python versions --- ## Complete Import/Export Workflow ### Preprocessing and Saving Pipeline ```python from matchms.importing import load_from_mgf from matchms.exporting import save_as_mgf, save_as_pickle from matchms.filtering import default_filters, normalize_intensities from matchms.filtering import select_by_relative_intensity # Load raw spectra spectra = list(load_from_mgf("raw_data.mgf")) # Process spectra processed = [] for spectrum in spectra: spectrum = default_filters(spectrum) spectrum = normalize_intensities(spectrum) spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01) if spectrum is not None: processed.append(spectrum) # Save processed spectra (MGF for sharing) save_as_mgf(processed, "processed_data.mgf") # Save as pickle for fast reloading save_as_pickle(processed, "processed_data.pkl") ``` ### Format Conversion ```python from matchms.importing import load_from_mzml from matchms.exporting import save_as_mgf, save_as_msp # Convert mzML to MGF spectra = list(load_from_mzml("data.mzML", ms_level=2)) save_as_mgf(spectra, "data.mgf") # Convert to MSP library format save_as_msp(spectra, "data.msp") ``` ### Loading from Multiple Files ```python from matchms.importing import load_from_mgf import glob # Load all MGF files in directory all_spectra = [] for mgf_file in glob.glob("data/*.mgf"): spectra = list(load_from_mgf(mgf_file)) all_spectra.extend(spectra) print(f"Loaded {len(all_spectra)} spectra from multiple files") ``` ### Memory-Efficient Processing ```python from matchms.importing import load_from_mgf from matchms.exporting import save_as_mgf from matchms.filtering import default_filters, normalize_intensities # Process large file without loading all into memory def process_spectrum(spectrum): spectrum = default_filters(spectrum) spectrum = normalize_intensities(spectrum) return spectrum # Stream processing with open("output.mgf", 'w') as outfile: for spectrum in load_from_mgf("large_file.mgf"): processed = process_spectrum(spectrum) if processed is not None: # Write immediately without storing in memory save_as_mgf([processed], outfile, write_mode='a') ``` ## Format Selection Guidelines **MGF**: - ✓ Widely supported - ✓ Human-readable - ✓ Good for data sharing - ✓ Moderate file size - Best for: Data exchange, GNPS uploads, publication data **MSP**: - ✓ Spectral library standard - ✓ Human-readable - ✓ Good metadata support - Best for: Reference libraries, NIST format compatibility **JSON**: - ✓ Structured format - ✓ GNPS compatible - ✓ Easy to parse programmatically - Best for: Web applications, GNPS integration, structured data **Pickle**: - ✓ Fastest save/load - ✓ Preserves exact state - ✗ Not portable to other languages - ✗ Not human-readable - Best for: Intermediate processing, Python-only workflows **mzML/mzXML**: - ✓ Raw instrument data - ✓ Rich metadata - ✓ Industry standard - ✗ Large file size - ✗ Slower to parse - Best for: Raw data archival, multi-level MS data ## Metadata Harmonization The `metadata_harmonization` parameter (available in most import functions) automatically standardizes metadata keys: ```python # Without harmonization spectrum = load_from_mgf("data.mgf", metadata_harmonization=False) # May have: "PRECURSOR_MZ", "Precursor_mz", "precursormz" # With harmonization (default) spectrum = load_from_mgf("data.mgf", metadata_harmonization=True) # Standardized to: "precursor_mz" ``` **Recommended**: Keep harmonization enabled (default) for consistent metadata access across different data sources. ## File Format Specifications For detailed format specifications: - **MGF**: http://www.matrixscience.com/help/data_file_help.html - **MSP**: https://chemdata.nist.gov/mass-spc/ms-search/ - **mzML**: http://www.psidev.info/mzML - **GNPS JSON**: https://gnps.ucsd.edu/ ## Further Reading For complete API documentation: https://matchms.readthedocs.io/en/latest/api/matchms.importing.html https://matchms.readthedocs.io/en/latest/api/matchms.exporting.html