Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

@@ -0,0 +1,517 @@
# Proteomics and Metabolomics File Formats Reference
This reference covers file formats specific to proteomics, metabolomics, lipidomics, and related omics workflows.
## Mass Spectrometry-Based Proteomics
### .mzML - Mass Spectrometry Markup Language
**Description:** HUPO-PSI standard XML format for mass spectrometry data
**Typical Data:** MS1 and MS2 spectra, retention times, intensities
**Use Cases:** Proteomics, metabolomics pipelines
**Python Libraries:**
- `pymzml`: `pymzml.run.Reader('file.mzML')`
- `pyteomics.mzml`: `pyteomics.mzml.read('file.mzML')`
- `pyopenms`: `MzMLFile().load('file.mzML', exp)` into an `MSExperiment`
**EDA Approach:**
- Scan count and MS level distribution
- Total ion chromatogram (TIC) analysis
- Base peak chromatogram (BPC)
- m/z coverage and resolution
- Retention time range
- Precursor selection patterns
- Data completeness
- Quality control metrics (lock mass, standards)
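
The first two checks in the list above (MS level distribution and TIC) can be sketched independently of the reader. This is a minimal sketch that assumes each spectrum arrives as a dictionary with `'ms level'` and `'intensity array'` keys, which is the layout `pyteomics.mzml.read` yields; `summarize_spectra` is a hypothetical helper name, and the example spectra are made up.

```python
from collections import Counter


def summarize_spectra(spectra):
    """Count scans per MS level and collect one TIC value per MS1 scan.

    `spectra` is any iterable of dicts carrying 'ms level' and
    'intensity array' (assumed layout, as yielded by pyteomics.mzml.read).
    """
    level_counts = Counter()
    ms1_tic = []
    for spectrum in spectra:
        level = spectrum["ms level"]
        level_counts[level] += 1
        if level == 1:
            # TIC of a scan = sum of all intensities recorded in that scan
            ms1_tic.append(float(sum(spectrum["intensity array"])))
    return dict(level_counts), ms1_tic


# Tiny in-memory example standing in for a parsed file (made-up values)
example = [
    {"ms level": 1, "intensity array": [10.0, 20.0, 5.0]},
    {"ms level": 2, "intensity array": [3.0, 1.0]},
    {"ms level": 1, "intensity array": [7.0, 8.0]},
]
counts, tic = summarize_spectra(example)
```

With a real file, passing the reader object returned by `pyteomics.mzml.read('file.mzML')` in place of `example` should work the same way, since it iterates over spectrum dictionaries.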
### .mzXML - Legacy MS XML Format
**Description:** Older XML-based MS format
**Typical Data:** Mass spectra with metadata
**Use Cases:** Legacy proteomics data
**Python Libraries:**
- `pyteomics.mzxml`
- `pyopenms`: `MzXMLFile` support
**EDA Approach:**
- Similar to mzML
- Format version compatibility
- Conversion quality validation
- Metadata preservation check
### .mzIdentML - Peptide Identification Format
**Description:** PSI standard for peptide identifications
**Typical Data:** Peptide-spectrum matches, proteins, scores
**Use Cases:** Search engine results, proteomics workflows
**Python Libraries:**
- `pyteomics.mzid`
- `pyopenms`: MzIdentML support
**EDA Approach:**
- PSM count and score distribution
- FDR calculation and filtering
- Modification analysis
- Missed cleavage statistics
- Protein inference results
- Search parameters validation
- Decoy hit analysis
- Rank-1 vs lower ranks
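
The FDR calculation and decoy hit analysis listed above can be illustrated with a simple target-decoy estimate. This is a sketch under stated assumptions, not the method of any particular search engine: higher scores are better, each PSM is already flagged as target or decoy, and FDR is estimated as decoys divided by targets at each score threshold. `target_decoy_qvalues` is a hypothetical helper, and the example scores are made up.

```python
def target_decoy_qvalues(psms):
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    `psms` is an iterable of (score, is_decoy) pairs where a higher score
    is better. FDR at each threshold is estimated as decoys / targets.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: the smallest FDR at which this PSM would still be accepted,
    # computed as a running minimum from the worst score upwards
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvalues)]


# Made-up scores: four targets and one decoy
example = [(9.1, False), (8.7, False), (8.0, True), (7.5, False), (6.2, False)]
scored = target_decoy_qvalues(example)
accepted = [p for p in scored if p[2] <= 0.25 and not p[1]]
```

Filtering on the q-value rather than the raw FDR keeps the accepted set monotonic in score, which is why the running minimum is taken.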
### .pepXML - Trans-Proteomic Pipeline Peptide XML
**Description:** TPP format for peptide identifications
**Typical Data:** Search results with statistical validation
**Use Cases:** Proteomics database search output
**Python Libraries:**
- `pyteomics.pepxml`
**EDA Approach:**
- Search engine comparison
- Score distributions (XCorr, expect value, etc.)
- Charge state analysis
- Modification frequencies
- PeptideProphet probabilities
- Protein coverage
- Spectral counting
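
Charge state analysis from the list above reduces to counting once the PSMs are parsed. The sketch below assumes each PSM is a dictionary with an `'assumed_charge'` key and a `'search_hit'` list whose entries may carry `'num_missed_cleavages'`, which matches the dictionaries `pyteomics.pepxml.read` produces as far as I know; `charge_and_missed_cleavages` is a hypothetical helper and the example records are made up.

```python
from collections import Counter


def charge_and_missed_cleavages(psms):
    """Tally precursor charge states and missed cleavages of the first hit."""
    charges = Counter()
    missed = Counter()
    for psm in psms:
        charges[psm["assumed_charge"]] += 1
        hits = psm.get("search_hit", [])
        # Use the first listed hit (normally rank 1) when it reports the value
        if hits and "num_missed_cleavages" in hits[0]:
            missed[hits[0]["num_missed_cleavages"]] += 1
    return charges, missed


# Made-up records in the assumed layout
example_psms = [
    {"assumed_charge": 2, "search_hit": [{"peptide": "PEPTIDEK", "num_missed_cleavages": 0}]},
    {"assumed_charge": 2, "search_hit": [{"peptide": "PEPKTIDER", "num_missed_cleavages": 1}]},
    {"assumed_charge": 3, "search_hit": []},
]
charges, missed = charge_and_missed_cleavages(example_psms)
```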
### .protXML - Protein Inference Results
**Description:** TPP protein-level identifications
**Typical Data:** Protein groups, probabilities, peptides
**Use Cases:** Protein-level analysis
**Python Libraries:**
- `pyteomics.protxml`
**EDA Approach:**
- Protein group statistics
- Parsimonious protein sets
- ProteinProphet probabilities
- Coverage and peptide count per protein
- Unique vs shared peptides
- Protein molecular weight distribution
- GO term enrichment preparation
### .pride.xml - PRIDE XML Format
**Description:** XML format of the PRIDE (PRoteomics IDEntifications) database
**Typical Data:** Complete proteomics experiment data
**Use Cases:** Public data deposition (legacy)
**Python Libraries:**
- `xml.etree.ElementTree` or `lxml`: Custom XML parsing (no dedicated `pyteomics` reader for PRIDE XML is known)
**EDA Approach:**
- Experiment metadata extraction
- Identification completeness
- Cross-linking to spectra
- Protocol information
- Instrument details
### .tsv / .csv (Proteomics)
**Description:** Tab or comma-separated proteomics results
**Typical Data:** Peptide or protein quantification tables
**Use Cases:** MaxQuant, Proteome Discoverer, Skyline output
**Python Libraries:**
- `pandas`: `pd.read_csv('file.csv')`, or `pd.read_csv('file.tsv', sep='\t')` / `pd.read_table('file.tsv')` for tab-separated files
**EDA Approach:**
- Identification counts
- Quantitative value distributions
- Missing value patterns
- Intensity-based analysis
- Label-free quantification assessment
- Isobaric tag ratio analysis
- Coefficient of variation
- Batch effects
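
Missing value patterns and the coefficient of variation from the list above can be computed per protein as follows. This sketch uses only the standard library so it runs anywhere; a `pandas` DataFrame would do the same job in practice. The column names (`Protein`, `S1`, `S2`, `S3`) and values are hypothetical, and treating empty cells, `NaN`, and `0` as missing is an assumption (some tools, MaxQuant among them, write `0` for missing intensities).

```python
import csv
import io
import statistics


def per_protein_qc(tsv_text, sample_columns, id_column="Protein"):
    """Return {protein: {'missing': n, 'cv': float or None}} from TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    report = {}
    for row in reader:
        values = []
        for col in sample_columns:
            cell = row[col].strip()
            # Assumption: empty, "NaN", and non-positive cells count as missing
            if cell and cell.lower() != "nan" and float(cell) > 0:
                values.append(float(cell))
        cv = None
        if len(values) >= 2:
            # CV = sample standard deviation / mean of the observed values
            cv = statistics.stdev(values) / statistics.mean(values)
        report[row[id_column]] = {
            "missing": len(sample_columns) - len(values),
            "cv": cv,
        }
    return report


# Made-up table with one complete and one mostly missing protein
table = "Protein\tS1\tS2\tS3\nP1\t100\t110\t90\nP2\t50\t\t0\n"
report = per_protein_qc(table, ["S1", "S2", "S3"])
```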
### .msf - Thermo MSF Database
**Description:** Proteome Discoverer results database
**Typical Data:** SQLite database with search results
**Use Cases:** Thermo Proteome Discoverer workflows
**Python Libraries:**
- `sqlite3`: Database access
- Custom MSF parsers
**EDA Approach:**
- Database schema exploration
- Peptide and protein tables
- Score thresholds
- Quantification data
- Processing node information
- Confidence levels
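
Because the file is a SQLite database, schema exploration can start from `sqlite_master`. The sketch below lists every table with its row count and makes no assumption about table names, which differ between Proteome Discoverer versions; `list_tables_with_counts` is a hypothetical helper.

```python
import sqlite3


def list_tables_with_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names come from the file itself, so quote them defensively
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `list_tables_with_counts('results.msf')` on a copy of the file, rather than the original, avoids any risk of touching the vendor database.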
### .pdResult - Proteome Discoverer Result
**Description:** Proteome Discoverer study results
**Typical Data:** Comprehensive search and quantification
**Use Cases:** PD study exports
**Python Libraries:**
- Vendor tools for conversion
- Export to TSV for Python analysis
**EDA Approach:**
- Study design validation
- Result filtering criteria
- Quantitative comparison groups
- Imputation strategies
### .pep.xml - Peptide Summary
**Description:** pepXML file saved with the `.pep.xml` double extension that TPP tools commonly write
**Typical Data:** Peptide sequences, modifications, scores
**Use Cases:** Downstream analysis input
**Python Libraries:**
- `pyteomics.pepxml`: Same reader as for `.pepXML`
**EDA Approach:**
- Unique peptide counting
- PTM site localization
- Retention time predictability
- Charge state preferences
## Quantitative Proteomics
### .sky - Skyline Document
**Description:** Skyline targeted proteomics document
**Typical Data:** Transition lists, chromatograms, results
**Use Cases:** Targeted proteomics (SRM/MRM/PRM)
**Python Libraries:**
- `xml.etree.ElementTree`: The `.sky` document itself is XML; no official Python API is known
- Export to CSV for analysis
**EDA Approach:**
- Transition selection validation
- Chromatographic peak quality
- Interference detection
- Retention time consistency
- Calibration curve assessment
- Replicate correlation
- LOD/LOQ determination
### .sky.zip - Zipped Skyline Document
**Description:** Skyline document with external files
**Typical Data:** Complete Skyline analysis
**Use Cases:** Sharing Skyline projects
**Python Libraries:**
- `zipfile`: Extract for processing
**EDA Approach:**
- Document structure
- External file references
- Result export and analysis
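
Document structure can be inspected without extracting anything. The sketch below lists the archive members and picks out the `.sky` document; `describe_skyline_archive` is a hypothetical helper, and it assumes only that the archive is a regular ZIP file.

```python
import zipfile


def describe_skyline_archive(path):
    """List archive members with sizes and identify the .sky document(s)."""
    with zipfile.ZipFile(path) as archive:
        entries = [(info.filename, info.file_size) for info in archive.infolist()]
    # The Skyline document is the member ending in .sky; the rest are
    # companion files stored alongside it
    documents = [name for name, _ in entries if name.lower().endswith(".sky")]
    return {"documents": documents, "entries": entries}
```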
### .wiff - SCIEX WIFF Format
**Description:** SCIEX instrument data with quantitation
**Typical Data:** LC-MS/MS with MRM transitions
**Use Cases:** SCIEX QTRAP, TripleTOF data
**Python Libraries:**
- Vendor tools (limited Python access)
- Conversion to mzML (e.g., with ProteoWizard `msconvert`)
**EDA Approach:**
- MRM transition performance
- Dwell time optimization
- Cycle time analysis
- Peak integration quality
### .raw (Thermo)
**Description:** Thermo raw instrument file
**Typical Data:** Full MS data from Orbitrap, Q Exactive
**Use Cases:** Label-free and TMT quantification
**Python Libraries:**
- `pymsfilereader`: Python bindings to Thermo MSFileReader (Windows only)
- `ThermoRawFileParser`: Cross-platform command-line converter to mzML/MGF (not a Python library)
**EDA Approach:**
- MS1 and MS2 acquisition rates
- AGC target and fill times
- Resolution settings
- Isolation window validation
- SPS ion selection (TMT)
- Contamination assessment
### .d (Agilent)
**Description:** Agilent data directory
**Typical Data:** LC-MS and GC-MS data
**Use Cases:** Agilent instrument workflows
**Python Libraries:**
- Community parsers
- Export to mzML
**EDA Approach:**
- Method consistency
- Calibration status
- Sequence run information
- Retention time stability
## Metabolomics and Lipidomics
### .mzML (Metabolomics)
**Description:** Standard MS format for metabolomics
**Typical Data:** Full scan MS, targeted MS/MS
**Use Cases:** Untargeted and targeted metabolomics
**Python Libraries:**
- Same as proteomics mzML tools
**EDA Approach:**
- Feature detection quality
- Mass accuracy assessment
- Retention time alignment
- Blank subtraction
- QC sample consistency
- Isotope pattern validation
- Adduct formation analysis
- In-source fragmentation check
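
Mass accuracy assessment and adduct analysis both rest on simple arithmetic. The sketch below computes a ppm error and the expected m/z of a singly protonated ion; the function names are hypothetical, and only the `[M+H]+` adduct is covered.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (rounded)


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def protonated_mz(neutral_monoisotopic_mass):
    """Expected m/z of the singly charged [M+H]+ ion."""
    return neutral_monoisotopic_mass + PROTON_MASS


# Glucose, C6H12O6, monoisotopic mass 180.06339 Da
expected = protonated_mz(180.06339)
# Hypothetical observed value, for illustration only
error = ppm_error(181.0710, expected)
```

Other adducts follow the same pattern with a different mass shift and, for multiply charged ions, a division by the charge.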
### .cdf / .netCDF - ANDI-MS
**Description:** Analytical Data Interchange for MS
**Typical Data:** GC-MS, LC-MS chromatography data
**Use Cases:** Metabolomics, GC-MS workflows
**Python Libraries:**
- `netCDF4`: Low-level access
- `pyopenms`: CDF support varies by OpenMS version; confirm before relying on it
- `rpy2`: Call R's `xcms` from Python
**EDA Approach:**
- TIC and extracted ion chromatograms
- Peak detection across samples
- Retention index calculation
- Mass spectral matching
- Library search preparation
### .msp - Mass Spectral Format (NIST)
**Description:** NIST spectral library format
**Typical Data:** Reference mass spectra
**Use Cases:** Metabolite identification, library matching
**Python Libraries:**
- `matchms`: Spectral matching
- Custom MSP parsers
**EDA Approach:**
- Library coverage
- Metadata completeness (InChI, SMILES)
- Spectral quality metrics
- Collision energy standardization
- Precursor type annotation
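
For the "custom MSP parsers" mentioned above, a minimal sketch could look like the following. MSP dialects differ between tools, so this assumes the common layout: `Key: value` metadata lines, peak lines holding `m/z intensity` pairs (optionally several per line separated by `;`), and a blank line between records. `parse_msp` is a hypothetical helper, and the two example records are made up.

```python
def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def parse_msp(lines):
    """Yield one dict per MSP record: lower-cased metadata plus 'peaks'."""
    record = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record
            if record is not None:
                yield record
                record = None
            continue
        if record is None:
            record = {"peaks": []}
        if ":" in line and not _is_number(line.split()[0]):
            # Metadata line such as "Name: ..." or "Num Peaks: 3"
            key, _, value = line.partition(":")
            record[key.strip().lower()] = value.strip()
        else:
            # Peak line: "mz intensity", possibly several pairs split by ';'
            for pair in line.split(";"):
                fields = pair.split()
                if len(fields) >= 2:
                    record["peaks"].append((float(fields[0]), float(fields[1])))
    if record is not None:
        yield record


# Made-up library text covering both peak layouts
msp_text = """Name: Example compound A
PrecursorMZ: 181.0707
Num Peaks: 2
85.0284 100
127.0390 45

Name: Example compound B
Num Peaks: 3
41 58; 42 120; 43 999
"""
records = list(parse_msp(msp_text.splitlines()))
```

Comparing `len(record['peaks'])` against the declared `num peaks` value is a cheap integrity check for each record.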
### .mgf (Metabolomics)
**Description:** Mascot Generic Format for MS/MS
**Typical Data:** MS/MS spectra for metabolite ID
**Use Cases:** Spectral library searching
**Python Libraries:**
- `matchms`: Metabolomics spectral analysis
- `pyteomics.mgf`
**EDA Approach:**
- Spectrum quality filtering
- Precursor isolation purity
- Fragment m/z accuracy
- Neutral loss patterns
- MS/MS completeness
### .nmrML - NMR Markup Language
**Description:** Standard XML format for NMR metabolomics
**Typical Data:** 1D/2D NMR spectra with metadata
**Use Cases:** NMR-based metabolomics
**Python Libraries:**
- `nmrml2isa`: Format conversion
- Custom XML parsers
**EDA Approach:**
- Spectral quality metrics
- Binning consistency
- Reference compound validation
- pH and temperature effects
- Metabolite identification confidence
### .json (Metabolomics)
**Description:** JSON format for metabolomics results
**Typical Data:** Feature tables, annotations, metadata
**Use Cases:** GNPS, MetaboAnalyst, web tools
**Python Libraries:**
- `json`: Standard library
- `pandas`: JSON normalization
**EDA Approach:**
- Feature annotation coverage
- GNPS clustering results
- Molecular networking statistics
- Adduct and in-source fragment linkage
- Putative identification confidence
### .txt (Metabolomics Tables)
**Description:** Tab-delimited feature tables
**Typical Data:** m/z, RT, intensities across samples
**Use Cases:** MZmine, XCMS, MS-DIAL output
**Python Libraries:**
- `pandas`: `pd.read_csv('file.txt', sep='\t')`
**EDA Approach:**
- Feature count and quality
- Missing value imputation
- Data normalization assessment
- Batch correction validation
- PCA and clustering for QC
- Fold change calculations
- Statistical test preparation
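
The fold change calculation listed above is often done on a log2 scale. The sketch below compares group means for one feature; the pseudocount that guards against zero intensities is an assumption of this example, not a convention of any of the tools named here, and `log2_fold_change` is a hypothetical helper.

```python
import math
import statistics


def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """log2 of (mean case + pseudocount) over (mean control + pseudocount)."""
    case_mean = statistics.mean(case_values)
    control_mean = statistics.mean(control_values)
    return math.log2(case_mean + pseudocount) - math.log2(control_mean + pseudocount)


# Made-up intensities for a single feature in two groups
lfc = log2_fold_change([7.0, 7.0], [3.0, 3.0])
```

A result of 1.0 corresponds to a doubling and -1.0 to a halving, which makes up- and down-regulation symmetric around zero.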
### .featureXML - OpenMS Feature Format
**Description:** OpenMS detected features
**Typical Data:** LC-MS features with quality scores
**Use Cases:** OpenMS workflows
**Python Libraries:**
- `pyopenms`: FeatureXML support
**EDA Approach:**
- Feature detection parameters
- Quality metrics per feature
- Isotope pattern fitting
- Charge state assignment
- FWHM and asymmetry
### .consensusXML - OpenMS Consensus Features
**Description:** Linked features across samples
**Typical Data:** Aligned features with group info
**Use Cases:** Multi-sample LC-MS analysis
**Python Libraries:**
- `pyopenms`: ConsensusXML reading
**EDA Approach:**
- Feature correspondence quality
- Retention time alignment
- Missing value patterns
- Intensity normalization needs
- Batch-wise feature agreement
### .idXML - OpenMS Identification Format
**Description:** Peptide/metabolite identifications
**Typical Data:** MS/MS identifications with scores
**Use Cases:** OpenMS ID workflows
**Python Libraries:**
- `pyopenms`: IdXML support
**EDA Approach:**
- Identification rate
- Score distribution
- Spectral match quality
- False discovery assessment
- Annotation transfer validation
## Lipidomics-Specific Formats
### .lcb - LipidCreator Batch
**Description:** LipidCreator transition list
**Typical Data:** Lipid transitions for targeted MS
**Use Cases:** Targeted lipidomics
**Python Libraries:**
- Export to CSV for processing
**EDA Approach:**
- Transition coverage per lipid class
- Retention time prediction
- Collision energy optimization
- Class-specific fragmentation patterns
### .mzTab - Proteomics/Metabolomics Tabular Format
**Description:** PSI tabular summary format
**Typical Data:** Protein/peptide/metabolite quantification
**Use Cases:** Publication and data sharing
**Python Libraries:**
- `pyteomics.mztab`
- `pandas` for TSV-like structure
**EDA Approach:**
- Data completeness
- Metadata section validation
- Quantification method
- Identification confidence
- Software and parameters
- Quality metrics summary
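
Metadata section validation can begin by splitting the file on its line prefixes. In mzTab every line starts with a three-letter code such as `MTD` (metadata), `PRH`/`PRT` (protein header and rows), or `PSH`/`PSM`. The sketch below collects the metadata and counts rows per prefix; `split_mztab` is a hypothetical helper, and the example lines are heavily truncated and do not form a valid mzTab file.

```python
from collections import Counter


def split_mztab(lines):
    """Return (metadata dict, Counter of line prefixes) for mzTab lines."""
    metadata = {}
    row_counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            # Metadata lines are "MTD<TAB>key<TAB>value"
            metadata[fields[1]] = fields[2]
        row_counts[prefix] += 1
    return metadata, row_counts


# Truncated, made-up lines for illustration only
example_lines = [
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tmzTab-mode\tSummary",
    "PRH\taccession\tdescription",
    "PRT\tP12345\texample protein",
    "PRT\tP67890\tanother example",
]
metadata, row_counts = split_mztab(example_lines)
```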
### .csv (LipidSearch, LipidMatch)
**Description:** Lipid identification results
**Typical Data:** Lipid annotations, grades, intensities
**Use Cases:** Lipidomics software output
**Python Libraries:**
- `pandas`: CSV reading
**EDA Approach:**
- Lipid class distribution
- Identification grade/confidence
- Fatty acid composition analysis
- Double bond and chain length patterns
- Intensity correlations
- Normalization to internal standards
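
Class distribution and chain length or double bond patterns can be derived from the lipid names themselves. The sketch below assumes sum-composition shorthand such as `PC 34:1`; other notations (for example molecular species written with parentheses and underscores) are deliberately rejected rather than guessed at. `parse_sum_composition` is a hypothetical helper and the example names are illustrative.

```python
import re
from collections import Counter

# Matches sum-composition shorthand such as "PC 34:1" or "TG 52:3"
_SPECIES = re.compile(
    r"^(?P<cls>[A-Za-z][A-Za-z0-9-]*)\s+(?P<carbons>\d+):(?P<db>\d+)$"
)


def parse_sum_composition(name):
    """Return (lipid class, total carbons, double bonds) or None."""
    match = _SPECIES.match(name.strip())
    if match is None:
        return None
    return match["cls"], int(match["carbons"]), int(match["db"])


# Illustrative names; the last one uses a notation this sketch does not cover
names = ["PC 34:1", "PC 36:2", "TG 52:3", "LPC 18:0", "PC(16:0_18:1)"]
parsed = [parse_sum_composition(n) for n in names]
class_counts = Counter(p[0] for p in parsed if p is not None)
```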
### .sdf (Metabolomics)
**Description:** Structure data file for metabolites
**Typical Data:** Chemical structures with properties
**Use Cases:** Metabolite database creation
**Python Libraries:**
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
**EDA Approach:**
- Structure validation
- Property calculation (logP, MW, TPSA)
- Molecular formula consistency
- Tautomer enumeration
- Retention time prediction features
### .mol (Metabolomics)
**Description:** Single molecule structure files
**Typical Data:** Metabolite chemical structure
**Use Cases:** Structure-based searches
**Python Libraries:**
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
**EDA Approach:**
- Structure correctness
- Stereochemistry validation
- Charge state
- Implicit hydrogen handling
## Data Processing and Analysis
### .h5 / .hdf5 (Omics)
**Description:** HDF5 for large omics datasets
**Typical Data:** Feature matrices, spectra, metadata
**Use Cases:** Large-scale studies, cloud computing
**Python Libraries:**
- `h5py`: HDF5 access
- `anndata`: HDF5-based `.h5ad` containers, e.g. for single-cell proteomics
**EDA Approach:**
- Dataset organization
- Chunking and compression
- Metadata structure
- Efficient data access patterns
- Sample and feature annotations
### .Rdata / .rds - R Objects
**Description:** Serialized R analysis objects
**Typical Data:** Processed omics results from R packages
**Use Cases:** xcms, CAMERA, MSnbase workflows
**Python Libraries:**
- `pyreadr`: `pyreadr.read_r('file.Rdata')` (data frames and other simple tabular objects; S4 objects such as xcms results need `rpy2`)
- `rpy2`: R-Python integration
**EDA Approach:**
- Object structure exploration
- Data extraction
- Method parameter review
- Conversion to Python-native formats
### .mzTab-M - Metabolomics mzTab
**Description:** mzTab specific to metabolomics
**Typical Data:** Small molecule quantification
**Use Cases:** Metabolomics data sharing
**Python Libraries:**
- `pyteomics.mztab`: Can parse mzTab-M
**EDA Approach:**
- Small molecule evidence
- Feature quantification
- Database references (HMDB, KEGG, etc.)
- Adduct and charge annotation
- MS level information
### .parquet (Omics)
**Description:** Columnar storage for large tables
**Typical Data:** Feature matrices, metadata
**Use Cases:** Efficient big data omics
**Python Libraries:**
- `pandas`: `pd.read_parquet()`
- `pyarrow`: Direct parquet access
**EDA Approach:**
- Compression efficiency
- Column-wise statistics
- Partition structure
- Schema validation
- Fast filtering and aggregation
### .pkl (Omics Models)
**Description:** Pickled Python objects
**Typical Data:** ML models, processed data
**Use Cases:** Workflow intermediate storage
**Python Libraries:**
- `pickle`: Standard serialization (load only files from trusted sources)
- `joblib`: Enhanced pickling
**EDA Approach:**
- Object type and structure
- Model parameters
- Feature importance (if ML model)
- Data shapes and types
- Deserialization validation
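
The checks above can be combined into a small inspection helper. This is a sketch, assuming the file comes from a trusted source, because unpickling can execute arbitrary code. `describe_pickle` is a hypothetical name, and the `get_params` branch assumes a scikit-learn style estimator.

```python
import pickle


def describe_pickle(path):
    """Load a pickle from a TRUSTED source and summarize what it holds."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)  # never call this on untrusted files
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "shape"):
        # Array-like objects such as NumPy arrays or pandas DataFrames
        summary["shape"] = tuple(obj.shape)
    elif hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        summary["keys"] = sorted(map(str, obj.keys()))
    if hasattr(obj, "get_params"):
        # Assumed scikit-learn style estimator interface
        summary["params"] = obj.get_params()
    return summary
```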
### .zarr (Omics)
**Description:** Chunked, compressed array storage
**Typical Data:** Multi-dimensional omics data
**Use Cases:** Cloud-optimized analysis
**Python Libraries:**
- `zarr`: Array storage
**EDA Approach:**
- Chunk optimization
- Compression codecs
- Multi-scale data
- Parallel access patterns
- Metadata annotations