# Proteomics and Metabolomics File Formats Reference

This reference covers file formats specific to proteomics, metabolomics, lipidomics, and related omics workflows.

## Mass Spectrometry-Based Proteomics

### .mzML - Mass Spectrometry Markup Language
**Description:** Standard XML format for MS data
**Typical Data:** MS1 and MS2 spectra, retention times, intensities
**Use Cases:** Proteomics, metabolomics pipelines
**Python Libraries:**
- `pymzml`: `pymzml.run.Reader('file.mzML')`
- `pyteomics.mzml`: `pyteomics.mzml.read('file.mzML')`
- `pyopenms`: OpenMS Python bindings
**EDA Approach:**
- Scan count and MS level distribution
- Total ion chromatogram (TIC) analysis
- Base peak chromatogram (BPC)
- m/z coverage and resolution
- Retention time range
- Precursor selection patterns
- Data completeness
- Quality control metrics (lock mass, standards)
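
The scan-count, MS-level, and TIC checks above can be sketched with the standard library alone, which is handy when none of the listed packages is installed. This is a minimal sketch, not a replacement for `pymzml` or `pyteomics`: it assumes each `<spectrum>` carries `cvParam` entries for ms level (`MS:1000511`), total ion current (`MS:1000285`), and scan start time (`MS:1000016`). The names `summarize_mzml` and `DEMO_MZML` are hypothetical, and the demo document is heavily truncated.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# PSI-MS controlled vocabulary accessions found in mzML <cvParam> elements.
MS_LEVEL = "MS:1000511"
TOTAL_ION_CURRENT = "MS:1000285"
SCAN_START_TIME = "MS:1000016"


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def summarize_mzml(source):
    """Return (MS-level counts, [(scan start time, TIC), ...]) for an mzML source."""
    levels = Counter()
    tic = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "spectrum":
            continue
        params = {
            p.get("accession"): p.get("value")
            for p in elem.iter()
            if local_name(p.tag) == "cvParam"
        }
        if MS_LEVEL in params:
            levels[int(params[MS_LEVEL])] += 1
        if SCAN_START_TIME in params and TOTAL_ION_CURRENT in params:
            tic.append((float(params[SCAN_START_TIME]),
                        float(params[TOTAL_ION_CURRENT])))
        elem.clear()  # release the spectrum; real files hold thousands of them
    return levels, tic


# Hypothetical, truncated mzML used only to exercise the function.
DEMO_MZML = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml"><run><spectrumList>
<spectrum id="scan=1">
  <cvParam accession="MS:1000511" name="ms level" value="1"/>
  <cvParam accession="MS:1000285" name="total ion current" value="1500.0"/>
  <scanList><scan><cvParam accession="MS:1000016" name="scan start time" value="0.5"/></scan></scanList>
</spectrum>
<spectrum id="scan=2">
  <cvParam accession="MS:1000511" name="ms level" value="2"/>
  <cvParam accession="MS:1000285" name="total ion current" value="300.0"/>
  <scanList><scan><cvParam accession="MS:1000016" name="scan start time" value="0.6"/></scan></scanList>
</spectrum>
</spectrumList></run></mzML>"""

levels, tic = summarize_mzml(io.BytesIO(DEMO_MZML))
print(dict(levels), tic)
```

Scan start times may be stored in minutes or seconds depending on the converter, so check the `unitName` attribute before comparing retention time ranges across files.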

### .mzXML - Legacy MS XML Format
**Description:** Older XML-based MS format
**Typical Data:** Mass spectra with metadata
**Use Cases:** Legacy proteomics data
**Python Libraries:**
- `pyteomics.mzxml`: `pyteomics.mzxml.read('file.mzXML')`
- `pyopenms`: `MzXMLFile` reader (note that `pymzml` targets mzML only)
**EDA Approach:**
- Similar to mzML
- Format version compatibility
- Conversion quality validation
- Metadata preservation check

### .mzIdentML - Peptide Identification Format
**Description:** PSI standard for peptide identifications
**Typical Data:** Peptide-spectrum matches, proteins, scores
**Use Cases:** Search engine results, proteomics workflows
**Python Libraries:**
- `pyteomics.mzid`
- `pyopenms`: MzIdentML support
**EDA Approach:**
- PSM count and score distribution
- FDR calculation and filtering
- Modification analysis
- Missed cleavage statistics
- Protein inference results
- Search parameters validation
- Decoy hit analysis
- Rank-1 vs lower ranks
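
The FDR and decoy-hit items above usually rest on target-decoy counting. The sketch below shows one common convention (FDR estimated as decoys divided by targets above a score threshold, then converted to q-values). It assumes higher scores are better and that each PSM has already been flagged as target or decoy; `target_decoy_qvalues` and the example scores are hypothetical and not tied to any particular search engine.

```python
def target_decoy_qvalues(psms):
    """Compute q-values for PSMs given as (score, is_decoy) pairs.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR if the threshold were placed at this PSM.
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which this PSM would still be accepted.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvalues)]


# Hypothetical PSM scores; True marks a decoy hit.
demo_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
scored = target_decoy_qvalues(demo_psms)
accepted = [p for p in scored if not p[1] and p[2] <= 0.01]
print(len(accepted), "target PSMs at 1% FDR")
```

If the search engine reports e-values or other scores where lower is better, negate the score before calling the function.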

### .pepXML - Trans-Proteomic Pipeline Peptide XML
**Description:** TPP format for peptide identifications
**Typical Data:** Search results with statistical validation
**Use Cases:** Proteomics database search output
**Python Libraries:**
- `pyteomics.pepxml`
**EDA Approach:**
- Search engine comparison
- Score distributions (XCorr, expect value, etc.)
- Charge state analysis
- Modification frequencies
- PeptideProphet probabilities
- Protein coverage
- Spectral counting

### .protXML - Protein Inference Results
**Description:** TPP protein-level identifications
**Typical Data:** Protein groups, probabilities, peptides
**Use Cases:** Protein-level analysis
**Python Libraries:**
- `pyteomics.protxml`
**EDA Approach:**
- Protein group statistics
- Parsimonious protein sets
- ProteinProphet probabilities
- Coverage and peptide count per protein
- Unique vs shared peptides
- Protein molecular weight distribution
- GO term enrichment preparation

### .pride.xml - PRIDE XML Format
**Description:** Proteomics Identifications Database format
**Typical Data:** Complete proteomics experiment data
**Use Cases:** Public data deposition (legacy)
**Python Libraries:**
- Custom XML parsers (`xml.etree.ElementTree`, `lxml`); `pyteomics` does not appear to ship a dedicated PRIDE XML reader
**EDA Approach:**
- Experiment metadata extraction
- Identification completeness
- Cross-linking to spectra
- Protocol information
- Instrument details

### .tsv / .csv (Proteomics)
**Description:** Tab or comma-separated proteomics results
**Typical Data:** Peptide or protein quantification tables
**Use Cases:** MaxQuant, Proteome Discoverer, Skyline output
**Python Libraries:**
- `pandas`: `pd.read_csv()` or `pd.read_table()`
**EDA Approach:**
- Identification counts
- Quantitative value distributions
- Missing value patterns
- Intensity-based analysis
- Label-free quantification assessment
- Isobaric tag ratio analysis
- Coefficient of variation
- Batch effects
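
Missing-value patterns and the coefficient of variation can be checked per protein before any modelling. The sketch below uses only the standard library; with `pandas` the same logic is a few column operations. The column names (`Protein`, `Intensity_1`, ...) are hypothetical, and treating empty cells, `NaN`, and `0` all as "not quantified" is an assumption that should be matched to the exporting software.

```python
import csv
import io
import statistics


def replicate_stats(rows, intensity_columns, id_column="Protein"):
    """Return {id: (missing fraction, CV in percent or None)} per table row."""
    results = {}
    for row in rows:
        values = []
        for col in intensity_columns:
            raw = (row.get(col) or "").strip()
            # Assumption: empty, "NaN", and 0 all mean "not quantified".
            if raw and raw.lower() != "nan" and float(raw) > 0:
                values.append(float(raw))
        missing = 1 - len(values) / len(intensity_columns)
        cv = None
        if len(values) >= 2:  # a CV needs at least two measurements
            cv = 100 * statistics.stdev(values) / statistics.mean(values)
        results[row[id_column]] = (missing, cv)
    return results


# Hypothetical three-replicate table.
DEMO_TSV = (
    "Protein\tIntensity_1\tIntensity_2\tIntensity_3\n"
    "P1\t100\t110\t90\n"
    "P2\t200\t0\tNaN\n"
)
columns = ["Intensity_1", "Intensity_2", "Intensity_3"]
stats = replicate_stats(csv.DictReader(io.StringIO(DEMO_TSV), delimiter="\t"), columns)
for protein, (missing, cv) in stats.items():
    print(protein, f"missing={missing:.2f}", "CV=n/a" if cv is None else f"CV={cv:.1f}%")
```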

### .msf - Thermo MSF Database
**Description:** Proteome Discoverer results database
**Typical Data:** SQLite database with search results
**Use Cases:** Thermo Proteome Discoverer workflows
**Python Libraries:**
- `sqlite3`: Database access
- Custom MSF parsers
**EDA Approach:**
- Database schema exploration
- Peptide and protein tables
- Score thresholds
- Quantification data
- Processing node information
- Confidence levels
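
Because the file is a SQLite database, schema exploration can start from `sqlite_master` without knowing any table names in advance. The sketch below is a generic SQLite inspection helper; the table names in the demo (`Peptides`, `Proteins`) are hypothetical stand-ins, since the actual MSF schema varies between Proteome Discoverer versions.

```python
import sqlite3


def table_row_counts(conn):
    """Map every table in a SQLite database to its row count."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    counts = {}
    for name in names:
        quoted = '"' + name.replace('"', '""') + '"'  # quote the identifier safely
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Demo on an in-memory database with hypothetical tables.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Peptides (PeptideID INTEGER, Sequence TEXT)")
demo.execute("CREATE TABLE Proteins (ProteinID INTEGER, Description TEXT)")
demo.executemany("INSERT INTO Peptides VALUES (?, ?)", [(1, "PEPTIDEK"), (2, "SAMPLER")])
counts = table_row_counts(demo)
print(counts)
```

For a real file, opening it read-only avoids accidental modification: `sqlite3.connect("file:results.msf?mode=ro", uri=True)`, where `results.msf` is a placeholder path.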

### .pdResult - Proteome Discoverer Result
**Description:** Proteome Discoverer study results
**Typical Data:** Comprehensive search and quantification
**Use Cases:** PD study exports
**Python Libraries:**
- Vendor tools for conversion
- Export to TSV for Python analysis
**EDA Approach:**
- Study design validation
- Result filtering criteria
- Quantitative comparison groups
- Imputation strategies

### .pep.xml - Peptide Summary
**Description:** pepXML peptide identifications under the commonly used `.pep.xml` file extension
**Typical Data:** Peptide sequences, modifications, scores
**Use Cases:** Downstream analysis input
**Python Libraries:**
- `pyteomics.pepxml`: `pyteomics.pepxml.read('file.pep.xml')`
**EDA Approach:**
- Unique peptide counting
- PTM site localization
- Retention time predictability
- Charge state preferences

## Quantitative Proteomics

### .sky - Skyline Document
**Description:** Skyline targeted proteomics document
**Typical Data:** Transition lists, chromatograms, results
**Use Cases:** Targeted proteomics (SRM/MRM/PRM)
**Python Libraries:**
- Standard XML parsers (`xml.etree.ElementTree`, `lxml`): the `.sky` document is XML; dedicated Python support is limited
- Export to CSV for analysis
**EDA Approach:**
- Transition selection validation
- Chromatographic peak quality
- Interference detection
- Retention time consistency
- Calibration curve assessment
- Replicate correlation
- LOD/LOQ determination
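
Calibration curve assessment and LOD/LOQ determination can be sketched from exported peak areas. The example below fits an unweighted straight line and applies one common convention (LOD = 3.3·σ/slope, LOQ = 10·σ/slope, with σ taken as the residual standard deviation). This is an assumption, not Skyline's own method; weighted regression or blank-based estimates may be more appropriate for a given assay. All names and numbers are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit; returns (slope, intercept, residual SD)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_sd = math.sqrt(residual_ss / (n - 2))  # two fitted parameters
    return slope, intercept, residual_sd


def lod_loq(slope, residual_sd):
    """LOD and LOQ in concentration units under the 3.3/10 sigma convention."""
    return 3.3 * residual_sd / slope, 10 * residual_sd / slope


# Hypothetical calibration points: concentration vs. peak area.
concentration = [1, 2, 5, 10, 20]
peak_area = [2.1, 3.9, 10.2, 19.8, 40.1]
slope, intercept, residual_sd = fit_line(concentration, peak_area)
lod, loq = lod_loq(slope, residual_sd)
print(f"slope={slope:.3f} intercept={intercept:.3f} LOD={lod:.3f} LOQ={loq:.3f}")
```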

### .sky.zip - Zipped Skyline Document
**Description:** Skyline document with external files
**Typical Data:** Complete Skyline analysis
**Use Cases:** Sharing Skyline projects
**Python Libraries:**
- `zipfile`: Extract for processing
**EDA Approach:**
- Document structure
- External file references
- Result export and analysis
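
Inspecting the archive before extracting it shows which document and companion files it contains. The sketch below lists members and their uncompressed sizes with `zipfile`; the member names in the demo (`study.sky`, `study.skyd`) are hypothetical and built in memory only to make the example runnable.

```python
import io
import zipfile


def list_skyline_archive(source):
    """Return (list of .sky documents, {member name: uncompressed size})."""
    with zipfile.ZipFile(source) as archive:
        sizes = {info.filename: info.file_size for info in archive.infolist()}
    documents = [name for name in sizes if name.lower().endswith(".sky")]
    return documents, sizes


# Build a small in-memory archive with hypothetical member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("study.sky", "<srm_settings/>")
    archive.writestr("study.skyd", b"\x00" * 16)
buffer.seek(0)

documents, sizes = list_skyline_archive(buffer)
print(documents, sizes)
```

`list_skyline_archive` also accepts a file path, since `zipfile.ZipFile` takes either a path or a file-like object.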

### .wiff - SCIEX WIFF Format
**Description:** SCIEX instrument data with quantitation
**Typical Data:** LC-MS/MS with MRM transitions
**Use Cases:** SCIEX QTRAP, TripleTOF data
**Python Libraries:**
- Vendor tools (limited Python access)
- Conversion to mzML
**EDA Approach:**
- MRM transition performance
- Dwell time optimization
- Cycle time analysis
- Peak integration quality
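
Dwell time and cycle time are linked by simple arithmetic: for unscheduled MRM, one cycle is roughly the sum of every transition's dwell time plus the pause between transitions, and the number of data points across a peak is the peak width divided by that cycle time. The sketch below illustrates the relationship; the 5 ms pause and the example method are assumptions, not instrument defaults.

```python
def cycle_time_s(dwell_times_ms, pause_ms=5.0):
    """Cycle time in seconds for one pass over all transitions.

    pause_ms is an assumed inter-transition pause; check the acquisition method.
    """
    return sum(dwell + pause_ms for dwell in dwell_times_ms) / 1000.0


def points_across_peak(peak_width_s, cycle_s):
    """Approximate number of data points sampled across a chromatographic peak."""
    return peak_width_s / cycle_s


# Hypothetical method: 50 transitions with 20 ms dwell each.
cycle = cycle_time_s([20.0] * 50)
points = points_across_peak(15.0, cycle)
print(f"cycle time = {cycle:.2f} s, points across a 15 s peak = {points:.1f}")
```

Scheduled MRM reduces the number of concurrent transitions, so the same calculation should then be applied per retention time window rather than to the whole transition list.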

### .raw (Thermo)
**Description:** Thermo raw instrument file
**Typical Data:** Full MS data from Orbitrap, Q Exactive
**Use Cases:** Label-free and TMT quantification
**Python Libraries:**
- `pymsfilereader`: Python bindings for Thermo MSFileReader (Windows only)
- `ThermoRawFileParser`: Cross-platform command-line converter (not a Python library); convert to mzML or MGF, then read with the mzML tools above
**EDA Approach:**
- MS1 and MS2 acquisition rates
- AGC target and fill times
- Resolution settings
- Isolation window validation
- SPS ion selection (TMT)
- Contamination assessment

### .d (Agilent)
**Description:** Agilent data directory
**Typical Data:** LC-MS and GC-MS data
**Use Cases:** Agilent instrument workflows
**Python Libraries:**
- Community parsers
- Export to mzML
**EDA Approach:**
- Method consistency
- Calibration status
- Sequence run information
- Retention time stability

## Metabolomics and Lipidomics

### .mzML (Metabolomics)
**Description:** Standard MS format for metabolomics
**Typical Data:** Full scan MS, targeted MS/MS
**Use Cases:** Untargeted and targeted metabolomics
**Python Libraries:**
- Same as proteomics mzML tools
**EDA Approach:**
- Feature detection quality
- Mass accuracy assessment
- Retention time alignment
- Blank subtraction
- QC sample consistency
- Isotope pattern validation
- Adduct formation analysis
- In-source fragmentation check
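
Mass accuracy assessment and adduct analysis both reduce to comparing an observed m/z with a theoretical one. The sketch below computes adduct m/z values from a neutral monoisotopic mass and reports the error in ppm. The adduct table covers only three singly charged ions and is an illustrative subset; the observed value in the demo is hypothetical.

```python
PROTON_MASS = 1.007276466  # mass of a proton in Da

# adduct label -> (number of charges, mass shift in Da); illustrative subset only
ADDUCTS = {
    "[M+H]+": (1, PROTON_MASS),
    "[M+Na]+": (1, 22.989218),   # sodium atom minus one electron
    "[M-H]-": (1, -PROTON_MASS),
}


def adduct_mz(neutral_mass, adduct):
    """Theoretical m/z of an adduct ion for a neutral monoisotopic mass."""
    charges, shift = ADDUCTS[adduct]
    return (neutral_mass + shift) / charges


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Glucose (C6H12O6), monoisotopic mass 180.063388 Da; observed m/z is hypothetical.
theoretical = adduct_mz(180.063388, "[M+H]+")
error = ppm_error(181.0712, theoretical)
print(f"theoretical m/z = {theoretical:.5f}, error = {error:.2f} ppm")
```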

### .cdf / .netCDF - ANDI-MS
**Description:** Analytical Data Interchange for MS
**Typical Data:** GC-MS, LC-MS chromatography data
**Use Cases:** Metabolomics, GC-MS workflows
**Python Libraries:**
- `netCDF4`: Low-level access
- `pyopenms`: CDF support
- `xcms` via R integration
**EDA Approach:**
- TIC and extracted ion chromatograms
- Peak detection across samples
- Retention index calculation
- Mass spectral matching
- Library search preparation

### .msp - Mass Spectral Format (NIST)
**Description:** NIST spectral library format
**Typical Data:** Reference mass spectra
**Use Cases:** Metabolite identification, library matching
**Python Libraries:**
- `matchms`: Spectral matching
- Custom MSP parsers
**EDA Approach:**
- Library coverage
- Metadata completeness (InChI, SMILES)
- Spectral quality metrics
- Collision energy standardization
- Precursor type annotation
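
A custom MSP parser of the kind mentioned above can be quite small. The sketch below assumes the common layout of `Key: value` metadata lines followed by peak lines, with records separated by blank lines; MSP dialects differ between vendors, so treat it as a starting point. The function names and the demo library are hypothetical.

```python
def parse_msp(text):
    """Parse MSP text into [{'meta': {...}, 'peaks': [(mz, intensity), ...]}, ...]."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current record
            continue
        if current is None:
            current = {"meta": {}, "peaks": []}
            records.append(current)
        if ":" in line and not line[0].isdigit():
            key, value = line.split(":", 1)
            current["meta"][key.strip().lower()] = value.strip()
        else:
            # Peak lines hold "m/z intensity" pairs, sometimes several split by ';'.
            for pair in line.split(";"):
                fields = pair.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return records


def metadata_coverage(records, key):
    """Fraction of records with a non-empty value for the given metadata key."""
    return sum(1 for r in records if r["meta"].get(key)) / len(records)


# Hypothetical two-entry library.
DEMO_MSP = """Name: Compound A
SMILES: CCO
Precursor_type: [M+H]+
Num Peaks: 2
45.03 100
27.02 35

Name: Compound B
Num Peaks: 1
91.05 999
"""
library = parse_msp(DEMO_MSP)
print(len(library), "spectra; SMILES coverage:", metadata_coverage(library, "smiles"))
```

Comparing the parsed peak count with each record's `num peaks` value is a cheap integrity check before any spectral matching.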

### .mgf (Metabolomics)
**Description:** Mascot Generic Format for MS/MS
**Typical Data:** MS/MS spectra for metabolite ID
**Use Cases:** Spectral library searching
**Python Libraries:**
- `matchms`: Metabolomics spectral analysis
- `pyteomics.mgf`
**EDA Approach:**
- Spectrum quality filtering
- Precursor isolation purity
- Fragment m/z accuracy
- Neutral loss patterns
- MS/MS completeness

### .nmrML - NMR Markup Language
**Description:** Standard XML format for NMR metabolomics
**Typical Data:** 1D/2D NMR spectra with metadata
**Use Cases:** NMR-based metabolomics
**Python Libraries:**
- `nmrml2isa`: Format conversion
- Custom XML parsers
**EDA Approach:**
- Spectral quality metrics
- Binning consistency
- Reference compound validation
- pH and temperature effects
- Metabolite identification confidence

### .json (Metabolomics)
**Description:** JSON format for metabolomics results
**Typical Data:** Feature tables, annotations, metadata
**Use Cases:** GNPS, MetaboAnalyst, web tools
**Python Libraries:**
- `json`: Standard library
- `pandas`: JSON normalization
**EDA Approach:**
- Feature annotation coverage
- GNPS clustering results
- Molecular networking statistics
- Adduct and in-source fragment linkage
- Putative identification confidence

### .txt (Metabolomics Tables)
**Description:** Tab-delimited feature tables
**Typical Data:** m/z, RT, intensities across samples
**Use Cases:** MZmine, XCMS, MS-DIAL output
**Python Libraries:**
- `pandas`: Text file reading
**EDA Approach:**
- Feature count and quality
- Missing value imputation
- Data normalization assessment
- Batch correction validation
- PCA and clustering for QC
- Fold change calculations
- Statistical test preparation

### .featureXML - OpenMS Feature Format
**Description:** OpenMS detected features
**Typical Data:** LC-MS features with quality scores
**Use Cases:** OpenMS workflows
**Python Libraries:**
- `pyopenms`: FeatureXML support
**EDA Approach:**
- Feature detection parameters
- Quality metrics per feature
- Isotope pattern fitting
- Charge state assignment
- FWHM and asymmetry

### .consensusXML - OpenMS Consensus Features
**Description:** Linked features across samples
**Typical Data:** Aligned features with group info
**Use Cases:** Multi-sample LC-MS analysis
**Python Libraries:**
- `pyopenms`: ConsensusXML reading
**EDA Approach:**
- Feature correspondence quality
- Retention time alignment
- Missing value patterns
- Intensity normalization needs
- Batch-wise feature agreement

### .idXML - OpenMS Identification Format
**Description:** Peptide/metabolite identifications
**Typical Data:** MS/MS identifications with scores
**Use Cases:** OpenMS ID workflows
**Python Libraries:**
- `pyopenms`: IdXML support
**EDA Approach:**
- Identification rate
- Score distribution
- Spectral match quality
- False discovery assessment
- Annotation transfer validation

## Lipidomics-Specific Formats

### .lcb - LipidCreator Batch
**Description:** LipidCreator transition list
**Typical Data:** Lipid transitions for targeted MS
**Use Cases:** Targeted lipidomics
**Python Libraries:**
- Export to CSV for processing
**EDA Approach:**
- Transition coverage per lipid class
- Retention time prediction
- Collision energy optimization
- Class-specific fragmentation patterns

### .mzTab - Proteomics/Metabolomics Tabular Format
**Description:** PSI tabular summary format
**Typical Data:** Protein/peptide/metabolite quantification
**Use Cases:** Publication and data sharing
**Python Libraries:**
- `pyteomics.mztab`
- `pandas` for TSV-like structure
**EDA Approach:**
- Data completeness
- Metadata section validation
- Quantification method
- Identification confidence
- Software and parameters
- Quality metrics summary
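
An mzTab file mixes several tables in one tab-separated text file, distinguished by a line prefix (`MTD` for metadata, `PRH`/`PRT` for the protein header and rows, `PSH`/`PSM` for PSMs, `SMH`/`SML` for small molecules, `COM` for comments). The sketch below splits a file into those sections with the standard library so that metadata and completeness can be checked; `read_mztab` and the demo content are hypothetical, and `pyteomics.mztab` remains the more complete option.

```python
import csv
import io
from collections import defaultdict

# Header-line prefix -> prefix of the data rows that follow it.
HEADER_TO_ROWS = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(handle):
    """Split mzTab text into (metadata dict, {row prefix: list of row dicts})."""
    metadata = {}
    headers = {}
    tables = defaultdict(list)
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields or fields[0] == "COM":
            continue  # skip blank lines and comments
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_TO_ROWS:
            headers[HEADER_TO_ROWS[prefix]] = fields[1:]
        elif prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, dict(tables)


# Hypothetical minimal file.
DEMO_MZTAB = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tdescription\tDemo study\n"
    "\n"
    "PRH\taccession\tbest_search_engine_score[1]\n"
    "PRT\tP12345\t0.001\n"
    "PRT\tQ67890\t0.02\n"
)
metadata, tables = read_mztab(io.StringIO(DEMO_MZTAB))
print(metadata["mzTab-version"], {name: len(rows) for name, rows in tables.items()})
```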

### .csv (LipidSearch, LipidMatch)
**Description:** Lipid identification results
**Typical Data:** Lipid annotations, grades, intensities
**Use Cases:** Lipidomics software output
**Python Libraries:**
- `pandas`: CSV reading
**EDA Approach:**
- Lipid class distribution
- Identification grade/confidence
- Fatty acid composition analysis
- Double bond and chain length patterns
- Intensity correlations
- Normalization to internal standards
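
Class distribution and chain-length or double-bond patterns require splitting each lipid name into its parts. The sketch below handles simple shorthand such as `PC 34:1` or `PC(16:0/18:1)` by summing every `carbons:double bonds` pair it finds. Real exports use several naming dialects (ether prefixes, hydroxylation suffixes, adduct labels), so the regular expressions here are an assumption to adapt, and the example names are hypothetical.

```python
import re
from collections import Counter

CHAIN = re.compile(r"(\d+):(\d+)")       # carbons:double_bonds for each chain
LIPID_CLASS = re.compile(r"^\s*([A-Za-z]+)")


def parse_lipid(name):
    """Return class, total carbons, and total double bonds, or None if unparsable."""
    cls = LIPID_CLASS.match(name)
    chains = [(int(c), int(d)) for c, d in CHAIN.findall(name)]
    if not cls or not chains:
        return None
    return {
        "class": cls.group(1),
        "carbons": sum(c for c, _ in chains),
        "double_bonds": sum(d for _, d in chains),
    }


# Hypothetical annotations in mixed notation.
names = ["PC 34:1", "PC(16:0/18:1)", "TG 52:2", "Cer d18:1/16:0", "unknown feature"]
parsed = [parse_lipid(n) for n in names]
class_counts = Counter(p["class"] for p in parsed if p)
print(class_counts)
```

Names that return `None` are worth reviewing by hand, since they often point to unannotated features or a notation the pattern does not cover.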

### .sdf (Metabolomics)
**Description:** Structure data file for metabolites
**Typical Data:** Chemical structures with properties
**Use Cases:** Metabolite database creation
**Python Libraries:**
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
**EDA Approach:**
- Structure validation
- Property calculation (logP, MW, TPSA)
- Molecular formula consistency
- Tautomer enumeration
- Retention time prediction features
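
Before loading a large SDF into RDKit, a lightweight pass over the text can confirm the record count and which data fields are present. The sketch below splits records on the `$$$$` terminator, reads atom and bond counts from a V2000 counts line, and collects `> <FIELD>` data items. It performs no chemistry validation (that remains RDKit's job), and the demo record and field names are hypothetical.

```python
import re


def parse_record(lines):
    """Extract title, V2000 atom/bond counts, and data fields from one SDF record."""
    record = {"title": lines[0].strip() if lines else "",
              "n_atoms": None, "n_bonds": None, "fields": {}}
    if len(lines) > 3 and "V2000" in lines[3]:
        # The counts line is fixed-width: atoms in columns 1-3, bonds in 4-6.
        record["n_atoms"] = int(lines[3][0:3])
        record["n_bonds"] = int(lines[3][3:6])
    field = None
    for line in lines:
        if line.startswith(">"):
            match = re.search(r"<([^>]+)>", line)
            field = match.group(1) if match else None
            if field:
                record["fields"][field] = []
        elif field is not None:
            if line.strip():
                record["fields"][field].append(line.strip())
            else:
                field = None  # a blank line ends the data item
    record["fields"] = {k: "\n".join(v) for k, v in record["fields"].items()}
    return record


def parse_sdf(text):
    """Split SDF text on '$$$$' terminators and parse each record."""
    records, lines = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(lines))
            lines = []
        else:
            lines.append(line)
    return records


# Hypothetical single-atom record, joined line by line to keep the layout exact.
DEMO_SDF = "\n".join([
    "water", "  demo", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <FORMULA>", "H2O", "", "> <DEMO_ID>", "DEMO0001", "", "$$$$", "",
])
records = parse_sdf(DEMO_SDF)
print(records[0]["title"], records[0]["n_atoms"], records[0]["fields"])
```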

### .mol (Metabolomics)
**Description:** Single molecule structure files
**Typical Data:** Metabolite chemical structure
**Use Cases:** Structure-based searches
**Python Libraries:**
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
**EDA Approach:**
- Structure correctness
- Stereochemistry validation
- Charge state
- Implicit hydrogen handling

## Data Processing and Analysis

### .h5 / .hdf5 (Omics)
**Description:** HDF5 for large omics datasets
**Typical Data:** Feature matrices, spectra, metadata
**Use Cases:** Large-scale studies, cloud computing
**Python Libraries:**
- `h5py`: HDF5 access
- `anndata`: For single-cell proteomics
**EDA Approach:**
- Dataset organization
- Chunking and compression
- Metadata structure
- Efficient data access patterns
- Sample and feature annotations

### .Rdata / .rds - R Objects
**Description:** Serialized R analysis objects
**Typical Data:** Processed omics results from R packages
**Use Cases:** xcms, CAMERA, MSnbase workflows
**Python Libraries:**
- `pyreadr`: `pyreadr.read_r('file.Rdata')` (reads data frames and similar simple objects; S4 objects such as xcms results need `rpy2`)
- `rpy2`: R-Python integration
**EDA Approach:**
- Object structure exploration
- Data extraction
- Method parameter review
- Conversion to Python-native formats

### .mzTab-M - Metabolomics mzTab
**Description:** mzTab specific to metabolomics
**Typical Data:** Small molecule quantification
**Use Cases:** Metabolomics data sharing
**Python Libraries:**
- `pyteomics.mztab`: Can parse mzTab-M
**EDA Approach:**
- Small molecule evidence
- Feature quantification
- Database references (HMDB, KEGG, etc.)
- Adduct and charge annotation
- MS level information

### .parquet (Omics)
**Description:** Columnar storage for large tables
**Typical Data:** Feature matrices, metadata
**Use Cases:** Efficient big data omics
**Python Libraries:**
- `pandas`: `pd.read_parquet()`
- `pyarrow`: Direct parquet access
**EDA Approach:**
- Compression efficiency
- Column-wise statistics
- Partition structure
- Schema validation
- Fast filtering and aggregation

### .pkl (Omics Models)
**Description:** Pickled Python objects
**Typical Data:** ML models, processed data
**Use Cases:** Workflow intermediate storage
**Python Libraries:**
- `pickle`: Standard serialization
- `joblib`: Enhanced pickling
**EDA Approach:**
- Object type and structure
- Model parameters
- Feature importance (if ML model)
- Data shapes and types
- Deserialization validation
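
Deserialization validation matters because unpickling can execute arbitrary code, so a `.pkl` file from an untrusted source should never be loaded directly. The sketch below follows the "restricting globals" pattern from the Python `pickle` documentation: a subclass of `pickle.Unpickler` that only resolves an explicit allow-list of classes. The allow-list shown is a hypothetical minimum; real ML models reference many more classes, in which case loading only files you produced yourself is the safer policy.

```python
import io
import pickle

# Hypothetical allow-list: (module, name) pairs the unpickler may resolve.
ALLOWED_GLOBALS = {
    ("builtins", "set"),
    ("builtins", "frozenset"),
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not on the allow-list."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data):
    """Load pickled bytes, rejecting references to non-allow-listed classes."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and numbers load without touching find_class at all.
payload = pickle.dumps({"features": ["mz", "rt"], "threshold": 0.05})
restored = safe_loads(payload)
print(type(restored).__name__, restored)
```

After a successful load, checking `type(obj)` and the shapes of any arrays against what the workflow expects covers the remaining items in the list above.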

### .zarr (Omics)
**Description:** Chunked, compressed array storage
**Typical Data:** Multi-dimensional omics data
**Use Cases:** Cloud-optimized analysis
**Python Libraries:**
- `zarr`: Array storage
**EDA Approach:**
- Chunk optimization
- Compression codecs
- Multi-scale data
- Parallel access patterns
- Metadata annotations