# Proteomics and Metabolomics File Formats Reference

This reference covers file formats specific to proteomics, metabolomics, lipidomics, and related omics workflows.

## Mass Spectrometry-Based Proteomics

### .mzML - Mass Spectrometry Markup Language
**Description:** Standard XML format for MS data
**Typical Data:** MS1 and MS2 spectra, retention times, intensities
**Use Cases:** Proteomics, metabolomics pipelines
**Python Libraries:**
- `pymzml`: `pymzml.run.Reader('file.mzML')`
- `pyteomics.mzml`: `pyteomics.mzml.read('file.mzML')`
- `pyopenms`: OpenMS Python bindings
**EDA Approach:**
- Scan count and MS level distribution
- Total ion chromatogram (TIC) analysis
- Base peak chromatogram (BPC)
- m/z coverage and resolution
- Retention time range
- Precursor selection patterns
- Data completeness
- Quality control metrics (lock mass, standards)

### .mzXML - Legacy MS XML Format
**Description:** Older XML-based MS format
**Typical Data:** Mass spectra with metadata
**Use Cases:** Legacy proteomics data
**Python Libraries:**
- `pyteomics.mzxml`
- `pyopenms`: `MzXMLFile` reader (`pymzml` targets mzML, so convert mzXML to mzML before using it)
**EDA Approach:**
- Similar to mzML
- Format version compatibility
- Conversion quality validation
- Metadata preservation check

### .mzIdentML - Peptide Identification Format
**Description:** PSI standard for peptide identifications
**Typical Data:** Peptide-spectrum matches, proteins, scores
**Use Cases:** Search engine results, proteomics workflows
**Python Libraries:**
- `pyteomics.mzid`
- `pyopenms`: MzIdentML support
**EDA Approach:**
- PSM count and score distribution
- FDR calculation and filtering
- Modification analysis
- Missed cleavage statistics
- Protein inference results
- Search parameters validation
- Decoy hit analysis
- Rank-1 vs lower ranks

### .pepXML - Trans-Proteomic Pipeline Peptide XML
**Description:** TPP format for peptide identifications
**Typical Data:** Search results with statistical validation
**Use Cases:** Proteomics database search output
**Python Libraries:**
- `pyteomics.pepxml`
**EDA Approach:**
- Search engine comparison
- Score distributions (XCorr, expect value, etc.)
- Charge state analysis
- Modification frequencies
- PeptideProphet probabilities
- Protein coverage
- Spectral counting

### .protXML - Protein Inference Results
**Description:** TPP protein-level identifications
**Typical Data:** Protein groups, probabilities, peptides
**Use Cases:** Protein-level analysis
**Python Libraries:**
- `pyteomics.protxml`
**EDA Approach:**
- Protein group statistics
- Parsimonious protein sets
- ProteinProphet probabilities
- Coverage and peptide count per protein
- Unique vs shared peptides
- Protein molecular weight distribution
- GO term enrichment preparation

### .pride.xml - PRIDE XML Format
**Description:** Proteomics Identifications Database format
**Typical Data:** Complete proteomics experiment data
**Use Cases:** Public data deposition (legacy)
**Python Libraries:**
- Custom XML parsers (for example `xml.etree.ElementTree` or `lxml`)
**EDA Approach:**
- Experiment metadata extraction
- Identification completeness
- Cross-linking to spectra
- Protocol information
- Instrument details

### .tsv / .csv (Proteomics)
**Description:** Tab- or comma-separated proteomics results
**Typical Data:** Peptide or protein quantification tables
**Use Cases:** MaxQuant, Proteome Discoverer, Skyline output
**Python Libraries:**
- `pandas`: `pd.read_csv()` or `pd.read_table()`
**EDA Approach:**
- Identification counts
- Quantitative value distributions
- Missing value patterns
- Intensity-based analysis
- Label-free quantification assessment
- Isobaric tag ratio analysis
- Coefficient of variation
- Batch effects

### .msf - Thermo MSF Database
**Description:** Proteome Discoverer results database
**Typical Data:** SQLite database with search results
**Use Cases:** Thermo Proteome Discoverer workflows
**Python Libraries:**
- `sqlite3`: Database access
- Custom MSF parsers
**EDA Approach:**
- Database schema exploration
- Peptide and protein tables
- Score thresholds
- Quantification data
- Processing node information
- Confidence levels

### .pdResult - Proteome Discoverer Result
**Description:** Proteome Discoverer study results
**Typical Data:** Comprehensive search and quantification
**Use Cases:** PD study exports
**Python Libraries:**
- Vendor tools for conversion
- Export to TSV for Python analysis
**EDA Approach:**
- Study design validation
- Result filtering criteria
- Quantitative comparison groups
- Imputation strategies

### .pep.xml - Peptide Summary
**Description:** Compact peptide identification format (`.pep.xml` is the usual extension for pepXML files)
**Typical Data:** Peptide sequences, modifications, scores
**Use Cases:** Downstream analysis input
**Python Libraries:**
- `pyteomics.pepxml`: XML parsing
**EDA Approach:**
- Unique peptide counting
- PTM site localization
- Retention time predictability
- Charge state preferences

## Quantitative Proteomics

### .sky - Skyline Document
**Description:** Skyline targeted proteomics document
**Typical Data:** Transition lists, chromatograms, results
**Use Cases:** Targeted proteomics (SRM/MRM/PRM)
**Python Libraries:**
- Python access is limited; Skyline's command-line interface (`SkylineCmd`) can script report exports
- Export to CSV for analysis
**EDA Approach:**
- Transition selection validation
- Chromatographic peak quality
- Interference detection
- Retention time consistency
- Calibration curve assessment
- Replicate correlation
- LOD/LOQ determination

### .sky.zip - Zipped Skyline Document
**Description:** Skyline document with external files
**Typical Data:** Complete Skyline analysis
**Use Cases:** Sharing Skyline projects
**Python Libraries:**
- `zipfile`: Extract for processing
**EDA Approach:**
- Document structure
- External file references
- Result export and analysis

### .wiff - SCIEX WIFF Format
**Description:** SCIEX instrument data with quantitation
**Typical Data:** LC-MS/MS with MRM transitions
**Use Cases:** SCIEX QTRAP, TripleTOF data
**Python Libraries:**
- Vendor tools (limited Python access)
- Conversion to mzML
**EDA Approach:**
- MRM transition performance
- Dwell time optimization
- Cycle time analysis
- Peak integration quality

### .raw (Thermo)
**Description:** Thermo raw instrument file
**Typical Data:** Full MS data from Orbitrap, Q Exactive
**Use Cases:** Label-free and TMT quantification
**Python Libraries:**
- `pymsfilereader`: Python bindings for Thermo MSFileReader (Windows)
- `ThermoRawFileParser`: Cross-platform CLI for conversion to open formats
**EDA Approach:**
- MS1 and MS2 acquisition rates
- AGC target and fill times
- Resolution settings
- Isolation window validation
- SPS ion selection (TMT)
- Contamination assessment

### .d (Agilent)
**Description:** Agilent data directory
**Typical Data:** LC-MS and GC-MS data
**Use Cases:** Agilent instrument workflows
**Python Libraries:**
- Community parsers
- Export to mzML
**EDA Approach:**
- Method consistency
- Calibration status
- Sequence run information
- Retention time stability

## Metabolomics and Lipidomics

### .mzML (Metabolomics)
**Description:** Standard MS format for metabolomics
**Typical Data:** Full scan MS, targeted MS/MS
**Use Cases:** Untargeted and targeted metabolomics
**Python Libraries:**
- Same as proteomics mzML tools
**EDA Approach:**
- Feature detection quality
- Mass accuracy assessment
- Retention time alignment
- Blank subtraction
- QC sample consistency
- Isotope pattern validation
- Adduct formation analysis
- In-source fragmentation check

### .cdf / .netCDF - ANDI-MS
**Description:** Analytical Data Interchange for MS
**Typical Data:** GC-MS, LC-MS chromatography data
**Use Cases:** Metabolomics, GC-MS workflows
**Python Libraries:**
- `netCDF4`: Low-level access
- `pyopenms`: CDF support
- `xcms` via R integration
**EDA Approach:**
- TIC and extracted ion chromatograms
- Peak detection across samples
- Retention index calculation
- Mass spectral matching
- Library search preparation

### .msp - Mass Spectral Format (NIST)
**Description:** NIST spectral library format
**Typical Data:** Reference mass spectra
**Use Cases:** Metabolite identification, library matching
**Python Libraries:**
- `matchms`: Spectral matching
- Custom MSP parsers
**EDA Approach:**
- Library coverage
- Metadata completeness (InChI, SMILES)
- Spectral quality metrics
- Collision energy standardization
- Precursor type annotation

### .mgf (Metabolomics)
**Description:** Mascot Generic Format for MS/MS
**Typical Data:** MS/MS spectra for metabolite ID
**Use Cases:** Spectral library searching
**Python Libraries:**
- `matchms`: Metabolomics spectral analysis
- `pyteomics.mgf`
**EDA Approach:**
- Spectrum quality filtering
- Precursor isolation purity
- Fragment m/z accuracy
- Neutral loss patterns
- MS/MS completeness

### .nmrML - NMR Markup Language
**Description:** Standard XML format for NMR metabolomics
**Typical Data:** 1D/2D NMR spectra with metadata
**Use Cases:** NMR-based metabolomics
**Python Libraries:**
- `nmrml2isa`: Format conversion
- Custom XML parsers
**EDA Approach:**
- Spectral quality metrics
- Binning consistency
- Reference compound validation
- pH and temperature effects
- Metabolite identification confidence

### .json (Metabolomics)
**Description:** JSON format for metabolomics results
**Typical Data:** Feature tables, annotations, metadata
**Use Cases:** GNPS, MetaboAnalyst, web tools
**Python Libraries:**
- `json`: Standard library
- `pandas`: JSON normalization
**EDA Approach:**
- Feature annotation coverage
- GNPS clustering results
- Molecular networking statistics
- Adduct and in-source fragment linkage
- Putative identification confidence

### .txt (Metabolomics Tables)
**Description:** Tab-delimited feature tables
**Typical Data:** m/z, RT, intensities across samples
**Use Cases:** MZmine, XCMS, MS-DIAL output
**Python Libraries:**
- `pandas`: Text file reading
**EDA Approach:**
- Feature count and quality
- Missing value imputation
- Data normalization assessment
- Batch correction validation
- PCA and clustering for QC
- Fold change calculations
- Statistical test preparation

### .featureXML - OpenMS Feature Format
**Description:** OpenMS detected features
**Typical Data:** LC-MS features with quality scores
**Use Cases:** OpenMS workflows
**Python Libraries:**
- `pyopenms`: FeatureXML support
**EDA Approach:**
- Feature detection parameters
- Quality metrics per feature
- Isotope pattern fitting
- Charge state assignment
- FWHM and asymmetry

### .consensusXML - OpenMS Consensus Features
**Description:** Linked features across samples
**Typical Data:** Aligned features with group info
**Use Cases:** Multi-sample LC-MS analysis
**Python Libraries:**
- `pyopenms`: ConsensusXML reading
**EDA Approach:**
- Feature correspondence quality
- Retention time alignment
- Missing value patterns
- Intensity normalization needs
- Batch-wise feature agreement

### .idXML - OpenMS Identification Format
**Description:** Peptide/metabolite identifications
**Typical Data:** MS/MS identifications with scores
**Use Cases:** OpenMS ID workflows
**Python Libraries:**
- `pyopenms`: IdXML support
**EDA Approach:**
- Identification rate
- Score distribution
- Spectral match quality
- False discovery assessment
- Annotation transfer validation

## Lipidomics-Specific Formats

### .lcb - LipidCreator Batch
**Description:** LipidCreator transition list
**Typical Data:** Lipid transitions for targeted MS
**Use Cases:** Targeted lipidomics
**Python Libraries:**
- Export to CSV for processing
**EDA Approach:**
- Transition coverage per lipid class
- Retention time prediction
- Collision energy optimization
- Class-specific fragmentation patterns

### .mzTab - Proteomics/Metabolomics Tabular Format
**Description:** PSI tabular summary format
**Typical Data:** Protein/peptide/metabolite quantification
**Use Cases:** Publication and data sharing
**Python Libraries:**
- `pyteomics.mztab`
- `pandas` for TSV-like structure
**EDA Approach:**
- Data completeness
- Metadata section validation
- Quantification method
- Identification confidence
- Software and parameters
- Quality metrics summary

### .csv (LipidSearch, LipidMatch)
**Description:** Lipid identification results
**Typical Data:** Lipid annotations, grades, intensities
**Use Cases:** Lipidomics software output
**Python Libraries:**
- `pandas`: CSV reading
**EDA Approach:**
- Lipid class distribution
- Identification grade/confidence
- Fatty acid composition analysis
- Double bond and chain length patterns
- Intensity correlations
- Normalization to internal standards

### .sdf (Metabolomics)
**Description:** Structure data file for metabolites
**Typical Data:** Chemical structures with properties
**Use Cases:** Metabolite database creation
**Python Libraries:**
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
**EDA Approach:**
- Structure validation
- Property calculation (logP, MW, TPSA)
- Molecular formula consistency
- Tautomer enumeration
- Retention time prediction features

### .mol (Metabolomics)
**Description:** Single molecule structure files
**Typical Data:** Metabolite chemical structure
**Use Cases:** Structure-based searches
**Python Libraries:**
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
**EDA Approach:**
- Structure correctness
- Stereochemistry validation
- Charge state
- Implicit hydrogen handling

## Data Processing and Analysis

### .h5 / .hdf5 (Omics)
**Description:** HDF5 for large omics datasets
**Typical Data:** Feature matrices, spectra, metadata
**Use Cases:** Large-scale studies, cloud computing
**Python Libraries:**
- `h5py`: HDF5 access
- `anndata`: For single-cell proteomics
**EDA Approach:**
- Dataset organization
- Chunking and compression
- Metadata structure
- Efficient data access patterns
- Sample and feature annotations

### .Rdata / .rds - R Objects
**Description:** Serialized R analysis objects
**Typical Data:** Processed omics results from R packages
**Use Cases:** xcms, CAMERA, MSnbase workflows
**Python Libraries:**
- `pyreadr`: `pyreadr.read_r('file.Rdata')`
- `rpy2`: R-Python integration
**EDA Approach:**
- Object structure exploration
- Data extraction
- Method parameter review
- Conversion to Python-native formats

### .mzTab-M - Metabolomics mzTab
**Description:** mzTab specific to metabolomics
**Typical Data:** Small molecule quantification
**Use Cases:** Metabolomics data sharing
**Python Libraries:**
- `pyteomics.mztab`: Can parse mzTab-M
**EDA Approach:**
- Small molecule evidence
- Feature quantification
- Database references (HMDB, KEGG, etc.)
- Adduct and charge annotation
- MS level information

### .parquet (Omics)
**Description:** Columnar storage for large tables
**Typical Data:** Feature matrices, metadata
**Use Cases:** Efficient big data omics
**Python Libraries:**
- `pandas`: `pd.read_parquet()`
- `pyarrow`: Direct parquet access
**EDA Approach:**
- Compression efficiency
- Column-wise statistics
- Partition structure
- Schema validation
- Fast filtering and aggregation

### .pkl (Omics Models)
**Description:** Pickled Python objects
**Typical Data:** ML models, processed data
**Use Cases:** Workflow intermediate storage
**Python Libraries:**
- `pickle`: Standard serialization
- `joblib`: Enhanced pickling
**EDA Approach:**
- Object type and structure
- Model parameters
- Feature importance (if ML model)
- Data shapes and types
- Deserialization validation

### .zarr (Omics)
**Description:** Chunked, compressed array storage
**Typical Data:** Multi-dimensional omics data
**Use Cases:** Cloud-optimized analysis
**Python Libraries:**
- `zarr`: Array storage
**EDA Approach:**
- Chunk optimization
- Compression codecs
- Multi-scale data
- Parallel access patterns
- Metadata annotations
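
Several of the tabular formats above (proteomics `.tsv`/`.csv`, metabolomics `.txt` feature tables, lipidomics `.csv`) list missing value patterns and coefficient of variation among their first EDA checks. The sketch below shows one way those two checks might look using only the standard library. The table contents, the column names (`feature_id`, `sample_A`, and so on), and the helper `summarize_features` are hypothetical, and treating empty cells and `NA`/`NaN` as missing is an assumption; real exports from MaxQuant, MZmine, or LipidSearch use their own layouts and missing-value encodings.

```python
import csv
import io
import statistics

# Hypothetical tab-delimited feature table; all names and values are illustrative.
EXAMPLE_TABLE = (
    "feature_id\tmz\trt\tsample_A\tsample_B\tsample_C\n"
    "F1\t181.0707\t65.2\t1000\t1100\t900\n"
    "F2\t256.2402\t310.8\t500\t\t520\n"
    "F3\t760.5851\t540.1\t\tNA\t200\n"
)

# Assumption: these strings mark a missing intensity.
MISSING_TOKENS = {"", "NA", "NAN"}


def summarize_features(handle, sample_columns):
    """Return per-feature missing fraction and CV (%) across the sample columns."""
    summary = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        values = []
        for column in sample_columns:
            cell = (row.get(column) or "").strip()
            if cell.upper() not in MISSING_TOKENS:
                values.append(float(cell))
        missing_fraction = 1 - len(values) / len(sample_columns)
        # CV needs at least two observations and a non-zero mean.
        if len(values) >= 2 and statistics.mean(values) != 0:
            cv_percent = 100 * statistics.stdev(values) / statistics.mean(values)
        else:
            cv_percent = None
        summary[row["feature_id"]] = {
            "missing_fraction": missing_fraction,
            "cv_percent": cv_percent,
        }
    return summary


samples = ["sample_A", "sample_B", "sample_C"]
summary = summarize_features(io.StringIO(EXAMPLE_TABLE), samples)
for feature_id, stats in summary.items():
    print(feature_id, stats)
```

For real files, the same logic is usually shorter with `pandas` (`isna().mean()` for missingness, `std() / mean()` for CV); the standard-library version is used here only to keep the sketch dependency-free.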