Initial commit
This commit is contained in:
@@ -0,0 +1,664 @@
|
||||
# Bioinformatics and Genomics File Formats Reference
|
||||
|
||||
This reference covers file formats used in genomics, transcriptomics, sequence analysis, and related bioinformatics applications.
|
||||
|
||||
## Sequence Data Formats
|
||||
|
||||
### .fasta / .fa / .fna - FASTA Format
|
||||
**Description:** Text-based format for nucleotide or protein sequences
|
||||
**Typical Data:** DNA, RNA, or protein sequences with headers
|
||||
**Use Cases:** Sequence storage, BLAST searches, alignments
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `SeqIO.parse('file.fasta', 'fasta')`
|
||||
- `pyfaidx`: Fast indexed FASTA access
|
||||
- `screed`: Fast sequence parsing
|
||||
**EDA Approach:**
|
||||
- Sequence count and length distribution
|
||||
- GC content analysis
|
||||
- N content (ambiguous bases)
|
||||
- Sequence ID parsing
|
||||
- Duplicate detection
|
||||
- Quality metrics for assemblies (N50, L50)
|
||||
|
||||
### .fastq / .fq - FASTQ Format
|
||||
**Description:** Sequence data with base quality scores
|
||||
**Typical Data:** Raw sequencing reads with Phred quality scores
|
||||
**Use Cases:** NGS data, quality control, read mapping
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `SeqIO.parse('file.fastq', 'fastq')`
|
||||
- `pysam`: Fast FASTQ/BAM operations
|
||||
- `HTSeq`: Sequencing data analysis
|
||||
**EDA Approach:**
|
||||
- Read count and length distribution
|
||||
- Quality score distribution (per-base, per-read)
|
||||
- GC content and bias
|
||||
- Duplicate rate estimation
|
||||
- Adapter contamination detection
|
||||
- k-mer frequency analysis
|
||||
- Encoding format validation (Phred33/64)
|
||||
|
||||
### .sam - Sequence Alignment/Map
|
||||
**Description:** Tab-delimited text format for alignments
|
||||
**Typical Data:** Aligned sequencing reads with mapping quality
|
||||
**Use Cases:** Read alignment storage, variant calling
|
||||
**Python Libraries:**
|
||||
- `pysam`: `pysam.AlignmentFile('file.sam', 'r')`
|
||||
- `HTSeq`: `HTSeq.SAM_Reader('file.sam')`
|
||||
**EDA Approach:**
|
||||
- Mapping rate and quality distribution
|
||||
- Coverage analysis
|
||||
- Insert size distribution (paired-end)
|
||||
- Alignment flags distribution
|
||||
- CIGAR string patterns
|
||||
- Mismatch and indel rates
|
||||
- Duplicate and supplementary alignment counts
|
||||
|
||||
### .bam - Binary Alignment/Map
|
||||
**Description:** Compressed binary version of SAM
|
||||
**Typical Data:** Aligned reads in compressed format
|
||||
**Use Cases:** Efficient storage and processing of alignments
|
||||
**Python Libraries:**
|
||||
- `pysam`: Full BAM support with indexing
|
||||
- `bamnostic`: Pure Python BAM reader
|
||||
**EDA Approach:**
|
||||
- Same as SAM plus:
|
||||
- Compression ratio analysis
|
||||
- Index file (.bai) validation
|
||||
- Chromosome-wise statistics
|
||||
- Strand bias detection
|
||||
- Read group analysis
|
||||
|
||||
### .cram - CRAM Format
|
||||
**Description:** Highly compressed alignment format
|
||||
**Typical Data:** Reference-compressed aligned reads
|
||||
**Use Cases:** Long-term storage, space-efficient archives
|
||||
**Python Libraries:**
|
||||
- `pysam`: CRAM support (requires reference)
|
||||
- Reference genome must be accessible
|
||||
**EDA Approach:**
|
||||
- Compression efficiency vs BAM
|
||||
- Reference dependency validation
|
||||
- Lossy vs lossless compression assessment
|
||||
- Decompression performance
|
||||
- Similar alignment metrics as BAM
|
||||
|
||||
### .bed - Browser Extensible Data
|
||||
**Description:** Tab-delimited format for genomic features
|
||||
**Typical Data:** Genomic intervals (chr, start, end) with annotations
|
||||
**Use Cases:** Peak calling, variant annotation, genome browsing
|
||||
**Python Libraries:**
|
||||
- `pybedtools`: `pybedtools.BedTool('file.bed')`
|
||||
- `pyranges`: `pyranges.read_bed('file.bed')`
|
||||
- `pandas`: Simple BED reading
|
||||
**EDA Approach:**
|
||||
- Feature count and size distribution
|
||||
- Chromosome distribution
|
||||
- Strand bias
|
||||
- Score distribution (if present)
|
||||
- Overlap and proximity analysis
|
||||
- Coverage statistics
|
||||
- Gap analysis between features
|
||||
|
||||
### .bedGraph - BED with Graph Data
|
||||
**Description:** BED format with per-base signal values
|
||||
**Typical Data:** Continuous-valued genomic data (coverage, signals)
|
||||
**Use Cases:** Coverage tracks, ChIP-seq signals, methylation
|
||||
**Python Libraries:**
|
||||
- `pyBigWig`: Can convert to bigWig
|
||||
- `pybedtools`: BedGraph operations
|
||||
**EDA Approach:**
|
||||
- Signal distribution statistics
|
||||
- Genome coverage percentage
|
||||
- Signal dynamics (peaks, valleys)
|
||||
- Chromosome-wise signal patterns
|
||||
- Quantile analysis
|
||||
- Zero-coverage regions
|
||||
|
||||
### .bigWig / .bw - Binary BigWig
|
||||
**Description:** Indexed binary format for genome-wide signal data
|
||||
**Typical Data:** Continuous genomic signals (compressed and indexed)
|
||||
**Use Cases:** Efficient genome browser tracks, large-scale data
|
||||
**Python Libraries:**
|
||||
- `pyBigWig`: `pyBigWig.open('file.bw')`
|
||||
- `pybbi`: BigWig/BigBed interface
|
||||
**EDA Approach:**
|
||||
- Signal statistics extraction
|
||||
- Zoom level analysis
|
||||
- Regional signal extraction
|
||||
- Efficient genome-wide summaries
|
||||
- Compression efficiency
|
||||
- Index structure analysis
|
||||
|
||||
### .bigBed / .bb - Binary BigBed
|
||||
**Description:** Indexed binary BED format
|
||||
**Typical Data:** Genomic features (compressed and indexed)
|
||||
**Use Cases:** Large feature sets, genome browsers
|
||||
**Python Libraries:**
|
||||
- `pybbi`: BigBed reading
|
||||
- `pybigtools`: Modern BigBed interface
|
||||
**EDA Approach:**
|
||||
- Feature density analysis
|
||||
- Efficient interval queries
|
||||
- Zoom level validation
|
||||
- Index performance metrics
|
||||
- Feature size statistics
|
||||
|
||||
### .gff / .gff3 - General Feature Format
|
||||
**Description:** Tab-delimited format for genomic annotations
|
||||
**Typical Data:** Gene models, transcripts, exons, regulatory elements
|
||||
**Use Cases:** Genome annotation, gene prediction
|
||||
**Python Libraries:**
|
||||
- `BCBio.GFF`: Biopython GFF module
|
||||
- `gffutils`: `gffutils.create_db('file.gff3')`
|
||||
- `pyranges`: GFF support
|
||||
**EDA Approach:**
|
||||
- Feature type distribution (gene, exon, CDS, etc.)
|
||||
- Gene structure validation
|
||||
- Strand balance
|
||||
- Hierarchical relationship validation
|
||||
- Phase validation for CDS
|
||||
- Attribute completeness
|
||||
- Gene model statistics (introns, exons per gene)
|
||||
|
||||
### .gtf - Gene Transfer Format
|
||||
**Description:** GFF2-based format for gene annotations
|
||||
**Typical Data:** Gene and transcript annotations
|
||||
**Use Cases:** RNA-seq analysis, gene quantification
|
||||
**Python Libraries:**
|
||||
- `pyranges`: `pyranges.read_gtf('file.gtf')`
|
||||
- `gffutils`: GTF database creation
|
||||
- `HTSeq`: GTF reading for counts
|
||||
**EDA Approach:**
|
||||
- Transcript isoform analysis
|
||||
- Gene structure completeness
|
||||
- Exon number distribution
|
||||
- Transcript length distribution
|
||||
- TSS and TES analysis
|
||||
- Biotype distribution
|
||||
- Overlapping gene detection
|
||||
|
||||
### .vcf - Variant Call Format
|
||||
**Description:** Text format for genetic variants
|
||||
**Typical Data:** SNPs, indels, structural variants with annotations
|
||||
**Use Cases:** Variant calling, population genetics, GWAS
|
||||
**Python Libraries:**
|
||||
- `pysam`: `pysam.VariantFile('file.vcf')`
|
||||
- `cyvcf2`: Fast VCF parsing
|
||||
- `PyVCF`: Older but comprehensive
|
||||
**EDA Approach:**
|
||||
- Variant count by type (SNP, indel, SV)
|
||||
- Quality score distribution
|
||||
- Allele frequency spectrum
|
||||
- Transition/transversion ratio
|
||||
- Heterozygosity rates
|
||||
- Missing genotype analysis
|
||||
- Hardy-Weinberg equilibrium
|
||||
- Annotation completeness (if annotated)
|
||||
|
||||
### .bcf - Binary VCF
|
||||
**Description:** Compressed binary variant format
|
||||
**Typical Data:** Same as VCF but binary
|
||||
**Use Cases:** Efficient variant storage and processing
|
||||
**Python Libraries:**
|
||||
- `pysam`: Full BCF support
|
||||
- `cyvcf2`: Optimized BCF reading
|
||||
**EDA Approach:**
|
||||
- Same as VCF plus:
|
||||
- Compression efficiency
|
||||
- Indexing validation
|
||||
- Read performance metrics
|
||||
|
||||
### .gvcf - Genomic VCF
|
||||
**Description:** VCF with reference confidence blocks
|
||||
**Typical Data:** All positions (variant and non-variant)
|
||||
**Use Cases:** Joint genotyping workflows, GATK
|
||||
**Python Libraries:**
|
||||
- `pysam`: GVCF support
|
||||
- Standard VCF parsers
|
||||
**EDA Approach:**
|
||||
- Reference block analysis
|
||||
- Coverage uniformity
|
||||
- Variant density
|
||||
- Genotype quality across genome
|
||||
- Reference confidence distribution
|
||||
|
||||
## RNA-Seq and Expression Data
|
||||
|
||||
### .counts - Gene Count Matrix
|
||||
**Description:** Tab-delimited gene expression counts
|
||||
**Typical Data:** Gene IDs with read counts per sample
|
||||
**Use Cases:** RNA-seq quantification, differential expression
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv('file.counts', sep='\t')`
|
||||
- `scanpy` (for single-cell): `sc.read_csv()`
|
||||
**EDA Approach:**
|
||||
- Library size distribution
|
||||
- Detection rate (genes per sample)
|
||||
- Zero-inflation analysis
|
||||
- Count distribution (log scale)
|
||||
- Outlier sample detection
|
||||
- Correlation between replicates
|
||||
- PCA for sample relationships
|
||||
|
||||
### .tpm / .fpkm - Normalized Expression
|
||||
**Description:** Normalized gene expression values
|
||||
**Typical Data:** TPM (transcripts per million) or FPKM values
|
||||
**Use Cases:** Cross-sample comparison, visualization
|
||||
**Python Libraries:**
|
||||
- `pandas`: Standard CSV reading
|
||||
- `anndata`: For integrated analysis
|
||||
**EDA Approach:**
|
||||
- Expression distribution
|
||||
- Highly expressed gene identification
|
||||
- Sample clustering
|
||||
- Batch effect detection
|
||||
- Coefficient of variation analysis
|
||||
- Dynamic range assessment
|
||||
|
||||
### .mtx - Matrix Market Format
|
||||
**Description:** Sparse matrix format (common in single-cell)
|
||||
**Typical Data:** Sparse count matrices (cells × genes)
|
||||
**Use Cases:** Single-cell RNA-seq, large sparse matrices
|
||||
**Python Libraries:**
|
||||
- `scipy.io`: `scipy.io.mmread('file.mtx')`
|
||||
- `scanpy`: `sc.read_mtx('file.mtx')`
|
||||
**EDA Approach:**
|
||||
- Sparsity analysis
|
||||
- Cell and gene filtering thresholds
|
||||
- Doublet detection metrics
|
||||
- Mitochondrial fraction
|
||||
- UMI count distribution
|
||||
- Gene detection per cell
|
||||
|
||||
### .h5ad - Anndata Format
|
||||
**Description:** HDF5-based annotated data matrix
|
||||
**Typical Data:** Expression matrix with metadata (cells, genes)
|
||||
**Use Cases:** Single-cell RNA-seq analysis with Scanpy
|
||||
**Python Libraries:**
|
||||
- `scanpy`: `sc.read_h5ad('file.h5ad')`
|
||||
- `anndata`: Direct AnnData manipulation
|
||||
**EDA Approach:**
|
||||
- Cell and gene counts
|
||||
- Metadata completeness
|
||||
- Layer availability (raw, normalized)
|
||||
- Embedding presence (PCA, UMAP)
|
||||
- QC metrics distribution
|
||||
- Batch information
|
||||
- Cell type annotation coverage
|
||||
|
||||
### .loom - Loom Format
|
||||
**Description:** HDF5-based format for omics data
|
||||
**Typical Data:** Expression matrices with metadata
|
||||
**Use Cases:** Single-cell data, RNA velocity analysis
|
||||
**Python Libraries:**
|
||||
- `loompy`: `loompy.connect('file.loom')`
|
||||
- `scanpy`: Can import loom files
|
||||
**EDA Approach:**
|
||||
- Layer analysis (spliced, unspliced)
|
||||
- Row and column attribute exploration
|
||||
- Graph connectivity analysis
|
||||
- Cluster assignments
|
||||
- Velocity-specific metrics
|
||||
|
||||
### .rds - R Data Serialization
|
||||
**Description:** R object storage (often Seurat objects)
|
||||
**Typical Data:** R analysis results, especially single-cell
|
||||
**Use Cases:** R-Python data exchange
|
||||
**Python Libraries:**
|
||||
- `pyreadr`: `pyreadr.read_r('file.rds')`
|
||||
- `rpy2`: For full R integration
|
||||
- Conversion tools to AnnData
|
||||
**EDA Approach:**
|
||||
- Object type identification
|
||||
- Data structure exploration
|
||||
- Metadata extraction
|
||||
- Conversion validation
|
||||
|
||||
## Alignment and Assembly Formats
|
||||
|
||||
### .maf - Multiple Alignment Format
|
||||
**Description:** Text format for multiple sequence alignments
|
||||
**Typical Data:** Genome-wide or local multiple alignments
|
||||
**Use Cases:** Comparative genomics, conservation analysis
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `AlignIO.parse('file.maf', 'maf')`
|
||||
- `bx-python`: MAF-specific tools
|
||||
**EDA Approach:**
|
||||
- Alignment block statistics
|
||||
- Species coverage
|
||||
- Gap analysis
|
||||
- Conservation scoring
|
||||
- Alignment quality metrics
|
||||
- Block length distribution
|
||||
|
||||
### .axt - Pairwise Alignment Format
|
||||
**Description:** Pairwise alignment format (UCSC)
|
||||
**Typical Data:** Pairwise genomic alignments
|
||||
**Use Cases:** Genome comparison, synteny analysis
|
||||
**Python Libraries:**
|
||||
- Custom parsers (simple format)
|
||||
- `bx-python`: AXT support
|
||||
**EDA Approach:**
|
||||
- Alignment score distribution
|
||||
- Identity percentage
|
||||
- Syntenic block identification
|
||||
- Gap size analysis
|
||||
- Coverage statistics
|
||||
|
||||
### .chain - Chain Alignment Format
|
||||
**Description:** Genome coordinate mapping chains
|
||||
**Typical Data:** Coordinate transformations between genome builds
|
||||
**Use Cases:** Liftover, coordinate conversion
|
||||
**Python Libraries:**
|
||||
- `pyliftover`: Chain file usage
|
||||
- Custom parsers for chain format
|
||||
**EDA Approach:**
|
||||
- Chain score distribution
|
||||
- Coverage of source genome
|
||||
- Gap analysis
|
||||
- Inversion detection
|
||||
- Mapping quality assessment
|
||||
|
||||
### .psl - Pattern Space Layout
|
||||
**Description:** BLAT/BLAST alignment format
|
||||
**Typical Data:** Alignment results from BLAT
|
||||
**Use Cases:** Transcript mapping, similarity searches
|
||||
**Python Libraries:**
|
||||
- Custom parsers (tab-delimited)
|
||||
- `pybedtools`: Can handle PSL
|
||||
**EDA Approach:**
|
||||
- Match percentage distribution
|
||||
- Gap statistics
|
||||
- Query coverage
|
||||
- Multiple mapping analysis
|
||||
- Alignment quality metrics
|
||||
|
||||
## Genome Assembly and Annotation
|
||||
|
||||
### .agp - Assembly Golden Path
|
||||
**Description:** Assembly structure description
|
||||
**Typical Data:** Scaffold composition, gap information
|
||||
**Use Cases:** Genome assembly representation
|
||||
**Python Libraries:**
|
||||
- Custom parsers (simple tab-delimited)
|
||||
- Assembly analysis tools
|
||||
**EDA Approach:**
|
||||
- Scaffold statistics (N50, L50)
|
||||
- Gap type and size distribution
|
||||
- Component length analysis
|
||||
- Assembly contiguity metrics
|
||||
- Unplaced contig analysis
|
||||
|
||||
### .scaffolds / .contigs - Assembly Sequences
|
||||
**Description:** Assembled sequences (usually FASTA)
|
||||
**Typical Data:** Assembled genomic sequences
|
||||
**Use Cases:** Genome assembly output
|
||||
**Python Libraries:**
|
||||
- Same as FASTA format
|
||||
- Assembly-specific tools (QUAST)
|
||||
**EDA Approach:**
|
||||
- Assembly statistics (N50, N90, etc.)
|
||||
- Length distribution
|
||||
- Coverage analysis
|
||||
- Gap (N) content
|
||||
- Duplication assessment
|
||||
- BUSCO completeness (if annotations available)
|
||||
|
||||
### .2bit - Compressed Genome Format
|
||||
**Description:** UCSC compact genome format
|
||||
**Typical Data:** Reference genomes (highly compressed)
|
||||
**Use Cases:** Efficient genome storage and access
|
||||
**Python Libraries:**
|
||||
- `py2bit`: `py2bit.open('file.2bit')`
|
||||
- `twobitreader`: Alternative reader
|
||||
**EDA Approach:**
|
||||
- Compression efficiency
|
||||
- Random access performance
|
||||
- Sequence extraction validation
|
||||
- Masked region analysis
|
||||
- N content and distribution
|
||||
|
||||
### .sizes - Chromosome Sizes
|
||||
**Description:** Simple format with chromosome lengths
|
||||
**Typical Data:** Tab-delimited chromosome names and sizes
|
||||
**Use Cases:** Genome browsers, coordinate validation
|
||||
**Python Libraries:**
|
||||
- Simple file reading with pandas
|
||||
- Built into many genomic tools
|
||||
**EDA Approach:**
|
||||
- Genome size calculation
|
||||
- Chromosome count
|
||||
- Size distribution
|
||||
- Karyotype validation
|
||||
- Completeness check against reference
|
||||
|
||||
## Phylogenetics and Evolution
|
||||
|
||||
### .nwk / .newick - Newick Tree Format
|
||||
**Description:** Parenthetical tree representation
|
||||
**Typical Data:** Phylogenetic trees with branch lengths
|
||||
**Use Cases:** Evolutionary analysis, tree visualization
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `Phylo.read('file.nwk', 'newick')`
|
||||
- `ete3`: `ete3.Tree('file.nwk')`
|
||||
- `dendropy`: Phylogenetic computing
|
||||
**EDA Approach:**
|
||||
- Tree structure analysis (tips, internal nodes)
|
||||
- Branch length distribution
|
||||
- Tree balance metrics
|
||||
- Ultrametricity check
|
||||
- Bootstrap support analysis
|
||||
- Topology validation
|
||||
|
||||
### .nexus - Nexus Format
|
||||
**Description:** Rich format for phylogenetic data
|
||||
**Typical Data:** Alignments, trees, character matrices
|
||||
**Use Cases:** Phylogenetic software interchange
|
||||
**Python Libraries:**
|
||||
- `Biopython`: Nexus support
|
||||
- `dendropy`: Comprehensive Nexus handling
|
||||
**EDA Approach:**
|
||||
- Data block analysis
|
||||
- Character type distribution
|
||||
- Tree block validation
|
||||
- Taxa consistency
|
||||
- Command block parsing
|
||||
- Format compliance checking
|
||||
|
||||
### .phylip - PHYLIP Format
|
||||
**Description:** Sequence alignment format (strict/relaxed)
|
||||
**Typical Data:** Multiple sequence alignments
|
||||
**Use Cases:** Phylogenetic analysis input
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `AlignIO.read('file.phy', 'phylip')`
|
||||
- `dendropy`: PHYLIP support
|
||||
**EDA Approach:**
|
||||
- Alignment dimensions
|
||||
- Sequence length uniformity
|
||||
- Gap position analysis
|
||||
- Informative site calculation
|
||||
- Format variant detection (strict vs relaxed)
|
||||
|
||||
### .paml - PAML Output
|
||||
**Description:** Output from PAML phylogenetic software
|
||||
**Typical Data:** Evolutionary model results, dN/dS ratios
|
||||
**Use Cases:** Molecular evolution analysis
|
||||
**Python Libraries:**
|
||||
- Custom parsers for specific PAML programs
|
||||
- `Biopython`: Basic PAML parsing
|
||||
**EDA Approach:**
|
||||
- Model parameter extraction
|
||||
- Likelihood values
|
||||
- dN/dS ratio distribution
|
||||
- Branch-specific results
|
||||
- Convergence assessment
|
||||
|
||||
## Protein and Structure Data
|
||||
|
||||
### .embl - EMBL Format
|
||||
**Description:** Rich sequence annotation format
|
||||
**Typical Data:** Sequences with extensive annotations
|
||||
**Use Cases:** Sequence databases, genome records
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `SeqIO.read('file.embl', 'embl')`
|
||||
**EDA Approach:**
|
||||
- Feature annotation completeness
|
||||
- Sequence length and type
|
||||
- Reference information
|
||||
- Cross-reference validation
|
||||
- Feature overlap analysis
|
||||
|
||||
### .genbank / .gb / .gbk - GenBank Format
|
||||
**Description:** NCBI's sequence annotation format
|
||||
**Typical Data:** Annotated sequences with features
|
||||
**Use Cases:** Sequence databases, annotation transfer
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `SeqIO.parse('file.gb', 'genbank')`
|
||||
**EDA Approach:**
|
||||
- Feature type distribution
|
||||
- CDS analysis (start codons, stops)
|
||||
- Translation validation
|
||||
- Annotation completeness
|
||||
- Source organism extraction
|
||||
- Reference and publication info
|
||||
- Locus tag consistency
|
||||
|
||||
### .sff - Standard Flowgram Format
|
||||
**Description:** 454/Roche sequencing data format
|
||||
**Typical Data:** Raw pyrosequencing flowgrams
|
||||
**Use Cases:** Legacy 454 sequencing data
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `SeqIO.parse('file.sff', 'sff')`
|
||||
- Platform-specific tools
|
||||
**EDA Approach:**
|
||||
- Read count and length
|
||||
- Flowgram signal quality
|
||||
- Key sequence detection
|
||||
- Adapter trimming validation
|
||||
- Quality score distribution
|
||||
|
||||
### .hdf5 (Genomics Specific)
|
||||
**Description:** HDF5 for genomics (10X, Hi-C, etc.)
|
||||
**Typical Data:** High-throughput genomics data
|
||||
**Use Cases:** 10X Genomics, spatial transcriptomics
|
||||
**Python Libraries:**
|
||||
- `h5py`: Low-level access
|
||||
- `scanpy`: For 10X data
|
||||
- `cooler`: For Hi-C data
|
||||
**EDA Approach:**
|
||||
- Dataset structure exploration
|
||||
- Barcode statistics
|
||||
- UMI counting
|
||||
- Feature-barcode matrix analysis
|
||||
- Spatial coordinates (if applicable)
|
||||
|
||||
### .cool / .mcool - Cooler Format
|
||||
**Description:** HDF5-based Hi-C contact matrices
|
||||
**Typical Data:** Chromatin interaction matrices
|
||||
**Use Cases:** 3D genome analysis, Hi-C data
|
||||
**Python Libraries:**
|
||||
- `cooler`: `cooler.Cooler('file.cool')`
|
||||
- `hicstraw`: For .hic format
|
||||
**EDA Approach:**
|
||||
- Resolution analysis
|
||||
- Contact matrix statistics
|
||||
- Distance decay curves
|
||||
- Compartment analysis
|
||||
- TAD boundary detection
|
||||
- Balance factor validation
|
||||
|
||||
### .hic - Hi-C Binary Format
|
||||
**Description:** Juicer binary Hi-C format
|
||||
**Typical Data:** Multi-resolution Hi-C matrices
|
||||
**Use Cases:** Hi-C analysis with Juicer tools
|
||||
**Python Libraries:**
|
||||
- `hicstraw`: `hicstraw.HiCFile('file.hic')`
|
||||
- `straw`: C++ library with Python bindings
|
||||
**EDA Approach:**
|
||||
- Available resolutions
|
||||
- Normalization methods
|
||||
- Contact statistics
|
||||
- Chromosomal interactions
|
||||
- Quality metrics
|
||||
|
||||
### .bw (ChIP-seq / ATAC-seq specific)
|
||||
**Description:** BigWig files for epigenomics
|
||||
**Typical Data:** Coverage or enrichment signals
|
||||
**Use Cases:** ChIP-seq, ATAC-seq, DNase-seq
|
||||
**Python Libraries:**
|
||||
- `pyBigWig`: Standard bigWig access
|
||||
**EDA Approach:**
|
||||
- Peak enrichment patterns
|
||||
- Background signal analysis
|
||||
- Sample correlation
|
||||
- Signal-to-noise ratio
|
||||
- Library complexity metrics
|
||||
|
||||
### .narrowPeak / .broadPeak - ENCODE Peak Formats
|
||||
**Description:** BED-based formats for peaks
|
||||
**Typical Data:** Peak calls with scores and p-values
|
||||
**Use Cases:** ChIP-seq peak calling output
|
||||
**Python Libraries:**
|
||||
- `pybedtools`: BED-compatible
|
||||
- Custom parsers for peak-specific fields
|
||||
**EDA Approach:**
|
||||
- Peak count and width distribution
|
||||
- Signal value distribution
|
||||
- Q-value and p-value analysis
|
||||
- Peak summit analysis
|
||||
- Overlap with known features
|
||||
- Motif enrichment preparation
|
||||
|
||||
### .wig - Wiggle Format
|
||||
**Description:** Dense continuous genomic data
|
||||
**Typical Data:** Coverage or signal tracks
|
||||
**Use Cases:** Genome browser visualization
|
||||
**Python Libraries:**
|
||||
- `pyBigWig`: Can convert to bigWig
|
||||
- Custom parsers for wiggle format
|
||||
**EDA Approach:**
|
||||
- Signal statistics
|
||||
- Coverage metrics
|
||||
- Format variant (fixedStep vs variableStep)
|
||||
- Span parameter analysis
|
||||
- Conversion efficiency to bigWig
|
||||
|
||||
### .ab1 - Sanger Sequencing Trace
|
||||
**Description:** Binary chromatogram format
|
||||
**Typical Data:** Sanger sequencing traces
|
||||
**Use Cases:** Capillary sequencing validation
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `SeqIO.read('file.ab1', 'abi')`
|
||||
- `tracy` tools: For quality assessment
|
||||
**EDA Approach:**
|
||||
- Base calling quality
|
||||
- Trace quality scores
|
||||
- Mixed base detection
|
||||
- Primer and vector detection
|
||||
- Read length and quality region
|
||||
- Heterozygosity detection
|
||||
|
||||
### .scf - Standard Chromatogram Format
|
||||
**Description:** Sanger sequencing chromatogram
|
||||
**Typical Data:** Base calls and confidence values
|
||||
**Use Cases:** Sequencing trace analysis
|
||||
**Python Libraries:**
|
||||
- `Biopython`: SCF format support
|
||||
**EDA Approach:**
|
||||
- Similar to AB1 format
|
||||
- Quality score profiles
|
||||
- Peak height ratios
|
||||
- Signal-to-noise metrics
|
||||
|
||||
### .idx - Index Files (Generic)
|
||||
**Description:** Index files for various formats
|
||||
**Typical Data:** Fast random access indices
|
||||
**Use Cases:** Efficient data access (BAM, VCF, etc.)
|
||||
**Python Libraries:**
|
||||
- Format-specific libraries handle indices
|
||||
- `pysam`: Auto-handles BAI, CSI indices
|
||||
**EDA Approach:**
|
||||
- Index completeness validation
|
||||
- Binning strategy analysis
|
||||
- Access performance metrics
|
||||
- Index size vs data size ratio
|
||||
@@ -0,0 +1,664 @@
|
||||
# Chemistry and Molecular File Formats Reference
|
||||
|
||||
This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.
|
||||
|
||||
## Structure File Formats
|
||||
|
||||
### .pdb - Protein Data Bank
|
||||
**Description:** Standard format for 3D structures of biological macromolecules
|
||||
**Typical Data:** Atomic coordinates, residue information, secondary structure, crystal structure data
|
||||
**Use Cases:** Protein structure analysis, molecular visualization, docking studies
|
||||
**Python Libraries:**
|
||||
- `Biopython`: `Bio.PDB`
|
||||
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
|
||||
- `PyMOL`: `pymol.cmd.load('file.pdb')`
|
||||
- `ProDy`: `prody.parsePDB('file.pdb')`
|
||||
**EDA Approach:**
|
||||
- Structure validation (bond lengths, angles, clashes)
|
||||
- Secondary structure analysis
|
||||
- B-factor distribution
|
||||
- Missing residues/atoms detection
|
||||
- Ramachandran plots for validation
|
||||
- Surface area and volume calculations
|
||||
|
||||
### .cif - Crystallographic Information File
|
||||
**Description:** Structured data format for crystallographic information
|
||||
**Typical Data:** Unit cell parameters, atomic coordinates, symmetry operations, experimental data
|
||||
**Use Cases:** Crystal structure determination, structural biology, materials science
|
||||
**Python Libraries:**
|
||||
- `gemmi`: `gemmi.cif.read_file('file.cif')`
|
||||
- `PyCifRW`: `CifFile.ReadCif('file.cif')`
|
||||
- `Biopython`: `Bio.PDB.MMCIFParser()`
|
||||
**EDA Approach:**
|
||||
- Data completeness check
|
||||
- Resolution and quality metrics
|
||||
- Unit cell parameter analysis
|
||||
- Symmetry group validation
|
||||
- Atomic displacement parameters
|
||||
- R-factors and validation metrics
|
||||
|
||||
### .mol - MDL Molfile
|
||||
**Description:** Chemical structure file format by MDL/Accelrys
|
||||
**Typical Data:** 2D/3D coordinates, atom types, bond orders, charges
|
||||
**Use Cases:** Chemical database storage, cheminformatics, drug design
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
|
||||
- `Open Babel`: `pybel.readfile('mol', 'file.mol')`
|
||||
- `ChemoPy`: For descriptor calculation
|
||||
**EDA Approach:**
|
||||
- Molecular property calculation (MW, logP, TPSA)
|
||||
- Functional group analysis
|
||||
- Ring system detection
|
||||
- Stereochemistry validation
|
||||
- 2D/3D coordinate consistency
|
||||
- Valence and charge validation
|
||||
|
||||
### .mol2 - Tripos Mol2
|
||||
**Description:** Complete 3D molecular structure format with atom typing
|
||||
**Typical Data:** Coordinates, SYBYL atom types, bond types, charges, substructures
|
||||
**Use Cases:** Molecular docking, QSAR studies, drug discovery
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.MolFromMol2File('file.mol2')`
|
||||
- `Open Babel`: `pybel.readfile('mol2', 'file.mol2')`
|
||||
- `MDAnalysis`: Can parse mol2 topology
|
||||
**EDA Approach:**
|
||||
- Atom type distribution
|
||||
- Partial charge analysis
|
||||
- Bond type statistics
|
||||
- Substructure identification
|
||||
- Conformational analysis
|
||||
- Energy minimization status check
|
||||
|
||||
### .sdf - Structure Data File
|
||||
**Description:** Multi-structure file format with associated data
|
||||
**Typical Data:** Multiple molecular structures with properties/annotations
|
||||
**Use Cases:** Chemical databases, virtual screening, compound libraries
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
|
||||
- `Open Babel`: `pybel.readfile('sdf', 'file.sdf')`
|
||||
- `PandasTools` (RDKit): For DataFrame integration
|
||||
**EDA Approach:**
|
||||
- Dataset size and diversity metrics
|
||||
- Property distribution analysis (MW, logP, etc.)
|
||||
- Structural diversity (Tanimoto similarity)
|
||||
- Missing data assessment
|
||||
- Outlier detection in properties
|
||||
- Scaffold analysis
|
||||
|
||||
### .xyz - XYZ Coordinates
|
||||
**Description:** Simple Cartesian coordinate format
|
||||
**Typical Data:** Atom types and 3D coordinates
|
||||
**Use Cases:** Quantum chemistry, geometry optimization, molecular dynamics
|
||||
**Python Libraries:**
|
||||
- `ASE`: `ase.io.read('file.xyz')`
|
||||
- `Open Babel`: `pybel.readfile('xyz', 'file.xyz')`
|
||||
- `cclib`: For parsing QM outputs with xyz
|
||||
**EDA Approach:**
|
||||
- Geometry analysis (bond lengths, angles, dihedrals)
|
||||
- Center of mass calculation
|
||||
- Moment of inertia
|
||||
- Molecular size metrics
|
||||
- Coordinate validation
|
||||
- Symmetry detection
|
||||
|
||||
### .smi / .smiles - SMILES String
|
||||
**Description:** Line notation for chemical structures
|
||||
**Typical Data:** Text representation of molecular structure
|
||||
**Use Cases:** Chemical databases, literature mining, data exchange
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.MolFromSmiles(smiles)`
|
||||
- `Open Babel`: Can parse SMILES
|
||||
- `DeepChem`: For ML on SMILES
|
||||
**EDA Approach:**
|
||||
- SMILES syntax validation
|
||||
- Descriptor calculation from SMILES
|
||||
- Fingerprint generation
|
||||
- Substructure searching
|
||||
- Tautomer enumeration
|
||||
- Stereoisomer handling
|
||||
|
||||
### .pdbqt - AutoDock PDBQT
|
||||
**Description:** Modified PDB format for AutoDock docking
|
||||
**Typical Data:** Coordinates, partial charges, atom types for docking
|
||||
**Use Cases:** Molecular docking, virtual screening
|
||||
**Python Libraries:**
|
||||
- `Meeko`: For PDBQT preparation
|
||||
- `Open Babel`: Can read PDBQT
|
||||
- `ProDy`: Limited PDBQT support
|
||||
**EDA Approach:**
|
||||
- Charge distribution analysis
|
||||
- Rotatable bond identification
|
||||
- Atom type validation
|
||||
- Coordinate quality check
|
||||
- Hydrogen placement validation
|
||||
- Torsion definition analysis
|
||||
|
||||
### .mae - Maestro Format
|
||||
**Description:** Schrödinger's proprietary molecular structure format
|
||||
**Typical Data:** Structures, properties, annotations from Schrödinger suite
|
||||
**Use Cases:** Drug discovery, molecular modeling with Schrödinger tools
|
||||
**Python Libraries:**
|
||||
- `schrodinger.structure`: Requires Schrödinger installation
|
||||
- Custom parsers for basic reading
|
||||
**EDA Approach:**
|
||||
- Property extraction and analysis
|
||||
- Structure quality metrics
|
||||
- Conformer analysis
|
||||
- Docking score distributions
|
||||
- Ligand efficiency metrics
|
||||
|
||||
### .gro - GROMACS Coordinate File
|
||||
**Description:** Molecular structure file for GROMACS MD simulations
|
||||
**Typical Data:** Atom positions, velocities, box vectors
|
||||
**Use Cases:** Molecular dynamics simulations, GROMACS workflows
|
||||
**Python Libraries:**
|
||||
- `MDAnalysis`: `Universe('file.gro')`
|
||||
- `MDTraj`: `mdtraj.load_gro('file.gro')`
|
||||
- `GromacsWrapper`: For GROMACS integration
|
||||
**EDA Approach:**
|
||||
- System composition analysis
|
||||
- Box dimension validation
|
||||
- Atom position distribution
|
||||
- Velocity distribution (if present)
|
||||
- Density calculation
|
||||
- Solvation analysis
|
||||
|
||||
## Computational Chemistry Output Formats
|
||||
|
||||
### .log - Gaussian Log File
|
||||
**Description:** Output from Gaussian quantum chemistry calculations
|
||||
**Typical Data:** Energies, geometries, frequencies, orbitals, populations
|
||||
**Use Cases:** QM calculations, geometry optimization, frequency analysis
|
||||
**Python Libraries:**
|
||||
- `cclib`: `cclib.io.ccread('file.log')`
|
||||
- `GaussianRunPack`: For Gaussian workflows
|
||||
- Custom parsers with regex
|
||||
**EDA Approach:**
|
||||
- Convergence analysis
|
||||
- Energy profile extraction
|
||||
- Vibrational frequency analysis
|
||||
- Orbital energy levels
|
||||
- Population analysis (Mulliken, NBO)
|
||||
- Thermochemistry data extraction
|
||||
|
||||
### .out - Quantum Chemistry Output
|
||||
**Description:** Generic output file from various QM packages
|
||||
**Typical Data:** Calculation results, energies, properties
|
||||
**Use Cases:** QM calculations across different software
|
||||
**Python Libraries:**
|
||||
- `cclib`: Universal parser for QM outputs
|
||||
- `ASE`: Can read some output formats
|
||||
**EDA Approach:**
|
||||
- Software-specific parsing
|
||||
- Convergence criteria check
|
||||
- Energy and gradient trends
|
||||
- Basis set and method validation
|
||||
- Computational cost analysis
|
||||
|
||||
### .wfn / .wfx - Wavefunction Files
|
||||
**Description:** Wavefunction data for quantum chemical analysis
|
||||
**Typical Data:** Molecular orbitals, basis sets, density matrices
|
||||
**Use Cases:** Electron density analysis, QTAIM analysis
|
||||
**Python Libraries:**
|
||||
- `Multiwfn`: Interface via Python
|
||||
- `Horton`: For wavefunction analysis
|
||||
- Custom parsers for specific formats
|
||||
**EDA Approach:**
|
||||
- Orbital population analysis
|
||||
- Electron density distribution
|
||||
- Critical point analysis (QTAIM)
|
||||
- Molecular orbital visualization
|
||||
- Bonding analysis
|
||||
|
||||
### .fchk - Gaussian Formatted Checkpoint
|
||||
**Description:** Formatted checkpoint file from Gaussian
|
||||
**Typical Data:** Complete wavefunction data, results, geometry
|
||||
**Use Cases:** Post-processing Gaussian calculations
|
||||
**Python Libraries:**
|
||||
- `cclib`: Can parse fchk files
|
||||
- `GaussView` Python API (if available)
|
||||
- Custom parsers
|
||||
**EDA Approach:**
|
||||
- Wavefunction quality assessment
|
||||
- Property extraction
|
||||
- Basis set information
|
||||
- Gradient and Hessian analysis
|
||||
- Natural orbital analysis
|
||||
|
||||
### .cube - Gaussian Cube File
|
||||
**Description:** Volumetric data on a 3D grid
|
||||
**Typical Data:** Electron density, molecular orbitals, ESP on grid
|
||||
**Use Cases:** Visualization of volumetric properties
|
||||
**Python Libraries:**
|
||||
- `cclib`: `cclib.io.ccread('file.cube')`
|
||||
- `ase.io`: `ase.io.read('file.cube')`
|
||||
- `pyquante`: For cube file manipulation
|
||||
**EDA Approach:**
|
||||
- Grid dimension and spacing analysis
|
||||
- Value distribution statistics
|
||||
- Isosurface value determination
|
||||
- Integration over volume
|
||||
- Comparison between different cubes
|
||||
|
||||
## Molecular Dynamics Formats
|
||||
|
||||
### .dcd - Binary Trajectory
|
||||
**Description:** Binary trajectory format (CHARMM, NAMD)
|
||||
**Typical Data:** Time series of atomic coordinates
|
||||
**Use Cases:** MD trajectory analysis
|
||||
**Python Libraries:**
|
||||
- `MDAnalysis`: `Universe(topology, 'traj.dcd')`
|
||||
- `MDTraj`: `mdtraj.load_dcd('traj.dcd', top='topology.pdb')`
|
||||
- `PyTraj` (Amber): Limited support
|
||||
**EDA Approach:**
|
||||
- RMSD/RMSF analysis
|
||||
- Trajectory length and frame count
|
||||
- Coordinate range and drift
|
||||
- Periodic boundary handling
|
||||
- File integrity check
|
||||
- Time step validation
|
||||
|
||||
### .xtc - Compressed Trajectory
|
||||
**Description:** GROMACS compressed trajectory format
|
||||
**Typical Data:** Compressed coordinates from MD simulations
|
||||
**Use Cases:** Space-efficient MD trajectory storage
|
||||
**Python Libraries:**
|
||||
- `MDAnalysis`: `Universe(topology, 'traj.xtc')`
|
||||
- `MDTraj`: `mdtraj.load_xtc('traj.xtc', top='topology.pdb')`
|
||||
**EDA Approach:**
|
||||
- Compression ratio assessment
|
||||
- Precision loss evaluation
|
||||
- RMSD over time
|
||||
- Structural stability metrics
|
||||
- Sampling frequency analysis
|
||||
|
||||
### .trr - GROMACS Trajectory
|
||||
**Description:** Full precision GROMACS trajectory
|
||||
**Typical Data:** Coordinates, velocities, forces from MD
|
||||
**Use Cases:** High-precision MD analysis
|
||||
**Python Libraries:**
|
||||
- `MDAnalysis`: Full support
|
||||
- `MDTraj`: Can read trr files
|
||||
- `GromacsWrapper`
|
||||
**EDA Approach:**
|
||||
- Full system dynamics analysis
|
||||
- Energy conservation check (with velocities)
|
||||
- Force analysis
|
||||
- Temperature and pressure validation
|
||||
- System equilibration assessment
|
||||
|
||||
### .nc / .netcdf - Amber NetCDF Trajectory
|
||||
**Description:** Network Common Data Form trajectory
|
||||
**Typical Data:** MD coordinates, velocities, forces
|
||||
**Use Cases:** Amber MD simulations, large trajectory storage
|
||||
**Python Libraries:**
|
||||
- `MDAnalysis`: NetCDF support
|
||||
- `PyTraj`: Native Amber analysis
|
||||
- `netCDF4`: Low-level access
|
||||
**EDA Approach:**
|
||||
- Metadata extraction
|
||||
- Trajectory statistics
|
||||
- Time series analysis
|
||||
- Replica exchange analysis
|
||||
- Multi-dimensional data extraction
|
||||
|
||||
### .top - GROMACS Topology
|
||||
**Description:** Molecular topology for GROMACS
|
||||
**Typical Data:** Atom types, bonds, angles, force field parameters
|
||||
**Use Cases:** MD simulation setup and analysis
|
||||
**Python Libraries:**
|
||||
- `ParmEd`: `parmed.load_file('system.top')`
|
||||
- `MDAnalysis`: Can parse topology
|
||||
- Custom parsers for specific fields
|
||||
**EDA Approach:**
|
||||
- Force field parameter validation
|
||||
- System composition
|
||||
- Bond/angle/dihedral distribution
|
||||
- Charge neutrality check
|
||||
- Molecule type enumeration
|
||||
|
||||
### .psf - Protein Structure File (CHARMM)
|
||||
**Description:** Topology file for CHARMM/NAMD
|
||||
**Typical Data:** Atom connectivity, types, charges
|
||||
**Use Cases:** CHARMM/NAMD MD simulations
|
||||
**Python Libraries:**
|
||||
- `MDAnalysis`: Native PSF support
|
||||
- `ParmEd`: Can read PSF files
|
||||
**EDA Approach:**
|
||||
- Topology validation
|
||||
- Connectivity analysis
|
||||
- Charge distribution
|
||||
- Atom type statistics
|
||||
- Segment analysis
|
||||
|
||||
### .prmtop - Amber Parameter/Topology
|
||||
**Description:** Amber topology and parameter file
|
||||
**Typical Data:** System topology, force field parameters
|
||||
**Use Cases:** Amber MD simulations
|
||||
**Python Libraries:**
|
||||
- `ParmEd`: `parmed.load_file('system.prmtop')`
|
||||
- `PyTraj`: Native Amber support
|
||||
**EDA Approach:**
|
||||
- Force field completeness
|
||||
- Parameter validation
|
||||
- System size and composition
|
||||
- Periodic box information
|
||||
- Atom mask creation for analysis
|
||||
|
||||
### .inpcrd / .rst7 - Amber Coordinates
|
||||
**Description:** Amber coordinate/restart file
|
||||
**Typical Data:** Atomic coordinates, velocities, box info
|
||||
**Use Cases:** Starting coordinates for Amber MD
|
||||
**Python Libraries:**
|
||||
- `ParmEd`: Works with prmtop
|
||||
- `PyTraj`: Amber coordinate reading
|
||||
**EDA Approach:**
|
||||
- Coordinate validity
|
||||
- System initialization check
|
||||
- Box vector validation
|
||||
- Velocity distribution (if restart)
|
||||
- Energy minimization status
|
||||
|
||||
## Spectroscopy and Analytical Data
|
||||
|
||||
### .jcamp / .jdx - JCAMP-DX
|
||||
**Description:** Joint Committee on Atomic and Molecular Physical Data eXchange
|
||||
**Typical Data:** Spectroscopic data (IR, NMR, MS, UV-Vis)
|
||||
**Use Cases:** Spectroscopy data exchange and archiving
|
||||
**Python Libraries:**
|
||||
- `jcamp`: `jcamp.jcamp_reader('file.jdx')`
|
||||
- `nmrglue`: For NMR JCAMP files
|
||||
- Custom parsers for specific subtypes
|
||||
**EDA Approach:**
|
||||
- Peak detection and analysis
|
||||
- Baseline correction assessment
|
||||
- Signal-to-noise calculation
|
||||
- Spectral range validation
|
||||
- Integration analysis
|
||||
- Comparison with reference spectra
|
||||
|
||||
### .mzML - Mass Spectrometry Markup Language
|
||||
**Description:** Standard XML format for mass spectrometry data
|
||||
**Typical Data:** MS/MS spectra, chromatograms, metadata
|
||||
**Use Cases:** Proteomics, metabolomics, mass spectrometry workflows
|
||||
**Python Libraries:**
|
||||
- `pymzml`: `pymzml.run.Reader('file.mzML')`
|
||||
- `pyteomics`: `pyteomics.mzml.read('file.mzML')`
|
||||
- `MSFileReader` wrappers
|
||||
**EDA Approach:**
|
||||
- Scan count and types
|
||||
- MS level distribution
|
||||
- Retention time range
|
||||
- m/z range and resolution
|
||||
- Peak intensity distribution
|
||||
- Data completeness
|
||||
- Quality control metrics
|
||||
|
||||
### .mzXML - Mass Spectrometry XML
|
||||
**Description:** Open XML format for MS data
|
||||
**Typical Data:** Mass spectra, retention times, peak lists
|
||||
**Use Cases:** Legacy MS data, metabolomics
|
||||
**Python Libraries:**
|
||||
- `pymzml`: Can read mzXML
|
||||
- `pyteomics.mzxml`
|
||||
- `lxml` for direct XML parsing
|
||||
**EDA Approach:**
|
||||
- Similar to mzML
|
||||
- Version compatibility check
|
||||
- Conversion quality assessment
|
||||
- Peak picking validation
|
||||
|
||||
### .raw - Vendor Raw Data
|
||||
**Description:** Proprietary instrument data files (Thermo, Bruker, etc.)
|
||||
**Typical Data:** Raw instrument signals, unprocessed data
|
||||
**Use Cases:** Direct instrument data access
|
||||
**Python Libraries:**
|
||||
- `pymsfilereader`: For Thermo RAW files
|
||||
- `ThermoRawFileParser`: CLI wrapper
|
||||
- Vendor-specific APIs (Thermo, Bruker Compass)
|
||||
**EDA Approach:**
|
||||
- Instrument method extraction
|
||||
- Raw signal quality
|
||||
- Calibration status
|
||||
- Scan function analysis
|
||||
- Chromatographic quality metrics
|
||||
|
||||
### .d - Agilent Data Directory
|
||||
**Description:** Agilent's data folder structure
|
||||
**Typical Data:** LC-MS, GC-MS data and metadata
|
||||
**Use Cases:** Agilent instrument data processing
|
||||
**Python Libraries:**
|
||||
- `agilent-reader`: Community tools
|
||||
- `Chemstation` Python integration
|
||||
- Custom directory parsing
|
||||
**EDA Approach:**
|
||||
- Directory structure validation
|
||||
- Method parameter extraction
|
||||
- Signal file integrity
|
||||
- Calibration curve analysis
|
||||
- Sequence information extraction
|
||||
|
||||
### .fid - NMR Free Induction Decay
|
||||
**Description:** Raw NMR time-domain data
|
||||
**Typical Data:** Time-domain NMR signal
|
||||
**Use Cases:** NMR processing and analysis
|
||||
**Python Libraries:**
|
||||
- `nmrglue`: `nmrglue.bruker.read_fid('fid')`
|
||||
- `nmrstarlib`: For NMR-STAR files
|
||||
**EDA Approach:**
|
||||
- Signal decay analysis
|
||||
- Noise level assessment
|
||||
- Acquisition parameter validation
|
||||
- Apodization function selection
|
||||
- Zero-filling optimization
|
||||
- Phasing parameter estimation
|
||||
|
||||
### .ft - NMR Frequency-Domain Data
|
||||
**Description:** Processed NMR spectrum
|
||||
**Typical Data:** Frequency-domain NMR data
|
||||
**Use Cases:** NMR analysis and interpretation
|
||||
**Python Libraries:**
|
||||
- `nmrglue`: Comprehensive NMR support
|
||||
- `pyNMR`: For processing
|
||||
**EDA Approach:**
|
||||
- Peak picking and integration
|
||||
- Chemical shift calibration
|
||||
- Multiplicity analysis
|
||||
- Coupling constant extraction
|
||||
- Spectral quality metrics
|
||||
- Reference compound identification
|
||||
|
||||
### .spc - Spectroscopy File
|
||||
**Description:** Thermo Galactic spectroscopy format
|
||||
**Typical Data:** IR, Raman, UV-Vis spectra
|
||||
**Use Cases:** Spectroscopic data from various instruments
|
||||
**Python Libraries:**
|
||||
- `spc`: `spc.File('file.spc')`
|
||||
- Custom parsers for binary format
|
||||
**EDA Approach:**
|
||||
- Spectral resolution
|
||||
- Wavelength/wavenumber range
|
||||
- Baseline characterization
|
||||
- Peak identification
|
||||
- Derivative spectra calculation
|
||||
|
||||
## Chemical Database Formats
|
||||
|
||||
### .inchi - International Chemical Identifier
|
||||
**Description:** Text identifier for chemical substances
|
||||
**Typical Data:** Layered chemical structure representation
|
||||
**Use Cases:** Chemical database keys, structure searching
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.MolFromInchi(inchi)`
|
||||
- `Open Babel`: InChI conversion
|
||||
**EDA Approach:**
|
||||
- InChI validation
|
||||
- Layer analysis
|
||||
- Stereochemistry verification
|
||||
- InChI key generation
|
||||
- Structure round-trip validation
|
||||
|
||||
### .cdx / .cdxml - ChemDraw Exchange
|
||||
**Description:** ChemDraw drawing file format
|
||||
**Typical Data:** 2D chemical structures with annotations
|
||||
**Use Cases:** Chemical drawing, publication figures
|
||||
**Python Libraries:**
|
||||
- `RDKit`: Can import some CDXML
|
||||
- `Open Babel`: Limited support
|
||||
- `ChemDraw` Python API (commercial)
|
||||
**EDA Approach:**
|
||||
- Structure extraction
|
||||
- Annotation preservation
|
||||
- Style consistency
|
||||
- 2D coordinate validation
|
||||
|
||||
### .cml - Chemical Markup Language
|
||||
**Description:** XML-based chemical structure format
|
||||
**Typical Data:** Chemical structures, reactions, properties
|
||||
**Use Cases:** Semantic chemical data representation
|
||||
**Python Libraries:**
|
||||
- `RDKit`: CML support
|
||||
- `Open Babel`: Good CML support
|
||||
- `lxml`: For XML parsing
|
||||
**EDA Approach:**
|
||||
- XML schema validation
|
||||
- Namespace handling
|
||||
- Property extraction
|
||||
- Reaction scheme analysis
|
||||
- Metadata completeness
|
||||
|
||||
### .rxn - MDL Reaction File
|
||||
**Description:** Chemical reaction structure file
|
||||
**Typical Data:** Reactants, products, reaction arrows
|
||||
**Use Cases:** Reaction databases, synthesis planning
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.ReactionFromRxnFile('file.rxn')`
|
||||
- `Open Babel`: Reaction support
|
||||
**EDA Approach:**
|
||||
- Reaction balancing validation
|
||||
- Atom mapping analysis
|
||||
- Reagent identification
|
||||
- Stereochemistry changes
|
||||
- Reaction classification
|
||||
|
||||
### .rdf - Reaction Data File
|
||||
**Description:** Multi-reaction file format
|
||||
**Typical Data:** Multiple reactions with data
|
||||
**Use Cases:** Reaction databases
|
||||
**Python Libraries:**
|
||||
- `RDKit`: RDF reading capabilities
|
||||
- Custom parsers
|
||||
**EDA Approach:**
|
||||
- Reaction yield statistics
|
||||
- Condition analysis
|
||||
- Success rate patterns
|
||||
- Reagent frequency analysis
|
||||
|
||||
## Computational Output and Data
|
||||
|
||||
### .hdf5 / .h5 - Hierarchical Data Format
|
||||
**Description:** Container for scientific data arrays
|
||||
**Typical Data:** Large arrays, metadata, hierarchical organization
|
||||
**Use Cases:** Large dataset storage, computational results
|
||||
**Python Libraries:**
|
||||
- `h5py`: `h5py.File('file.h5', 'r')`
|
||||
- `pytables`: Advanced HDF5 interface
|
||||
- `pandas`: Can read HDF5
|
||||
**EDA Approach:**
|
||||
- Dataset structure exploration
|
||||
- Array shape and dtype analysis
|
||||
- Metadata extraction
|
||||
- Memory-efficient data sampling
|
||||
- Chunk optimization analysis
|
||||
- Compression ratio assessment
|
||||
|
||||
### .pkl / .pickle - Python Pickle
|
||||
**Description:** Serialized Python objects
|
||||
**Typical Data:** Any Python object (molecules, dataframes, models)
|
||||
**Use Cases:** Intermediate data storage, model persistence
|
||||
**Python Libraries:**
|
||||
- `pickle`: Built-in serialization
|
||||
- `joblib`: Enhanced pickling for large arrays
|
||||
- `dill`: Extended pickle support
|
||||
**EDA Approach:**
|
||||
- Object type inspection
|
||||
- Size and complexity analysis
|
||||
- Version compatibility check
|
||||
- Security validation (trusted source)
|
||||
- Deserialization testing
|
||||
|
||||
### .npy / .npz - NumPy Arrays
|
||||
**Description:** NumPy array binary format
|
||||
**Typical Data:** Numerical arrays (coordinates, features, matrices)
|
||||
**Use Cases:** Fast numerical data I/O
|
||||
**Python Libraries:**
|
||||
- `numpy`: `np.load('file.npy')`
|
||||
- Direct memory mapping for large files
|
||||
**EDA Approach:**
|
||||
- Array shape and dimensions
|
||||
- Data type and precision
|
||||
- Statistical summary (mean, std, range)
|
||||
- Missing value detection
|
||||
- Outlier identification
|
||||
- Memory footprint analysis
|
||||
|
||||
### .mat - MATLAB Data File
|
||||
**Description:** MATLAB workspace data
|
||||
**Typical Data:** Arrays, structures from MATLAB
|
||||
**Use Cases:** MATLAB-Python data exchange
|
||||
**Python Libraries:**
|
||||
- `scipy.io`: `scipy.io.loadmat('file.mat')`
|
||||
- `h5py`: For v7.3 MAT files
|
||||
**EDA Approach:**
|
||||
- Variable extraction and types
|
||||
- Array dimension analysis
|
||||
- Structure field exploration
|
||||
- MATLAB version compatibility
|
||||
- Data type conversion validation
|
||||
|
||||
### .csv - Comma-Separated Values
|
||||
**Description:** Tabular data in text format
|
||||
**Typical Data:** Chemical properties, experimental data, descriptors
|
||||
**Use Cases:** Data exchange, analysis, machine learning
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv('file.csv')`
|
||||
- `csv`: Built-in module
|
||||
- `polars`: Fast CSV reading
|
||||
**EDA Approach:**
|
||||
- Data types inference
|
||||
- Missing value patterns
|
||||
- Statistical summaries
|
||||
- Correlation analysis
|
||||
- Distribution visualization
|
||||
- Outlier detection
|
||||
|
||||
### .json - JavaScript Object Notation
|
||||
**Description:** Structured text data format
|
||||
**Typical Data:** Chemical properties, metadata, API responses
|
||||
**Use Cases:** Data interchange, configuration, web APIs
|
||||
**Python Libraries:**
|
||||
- `json`: Built-in JSON support
|
||||
- `pandas`: `pd.read_json()`
|
||||
- `ujson`: Faster JSON parsing
|
||||
**EDA Approach:**
|
||||
- Schema validation
|
||||
- Nesting depth analysis
|
||||
- Key-value distribution
|
||||
- Data type consistency
|
||||
- Array length statistics
|
||||
|
||||
### .parquet - Apache Parquet
|
||||
**Description:** Columnar storage format
|
||||
**Typical Data:** Large tabular datasets efficiently
|
||||
**Use Cases:** Big data, efficient columnar analytics
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_parquet('file.parquet')`
|
||||
- `pyarrow`: Direct parquet access
|
||||
- `fastparquet`: Alternative implementation
|
||||
**EDA Approach:**
|
||||
- Column statistics from metadata
|
||||
- Partition analysis
|
||||
- Compression efficiency
|
||||
- Row group structure
|
||||
- Fast sampling for large files
|
||||
- Schema evolution tracking
|
||||
@@ -0,0 +1,518 @@
|
||||
# General Scientific Data Formats Reference
|
||||
|
||||
This reference covers general-purpose scientific data formats used across multiple disciplines.
|
||||
|
||||
## Numerical and Array Data
|
||||
|
||||
### .npy - NumPy Array
|
||||
**Description:** Binary NumPy array format
|
||||
**Typical Data:** N-dimensional arrays of any data type
|
||||
**Use Cases:** Fast I/O for numerical data, intermediate results
|
||||
**Python Libraries:**
|
||||
- `numpy`: `np.load('file.npy')`, `np.save()`
|
||||
- Memory-mapped access: `np.load('file.npy', mmap_mode='r')`
|
||||
**EDA Approach:**
|
||||
- Array shape and dimensionality
|
||||
- Data type and precision
|
||||
- Statistical summary (mean, std, min, max, percentiles)
|
||||
- Missing or invalid values (NaN, inf)
|
||||
- Memory footprint
|
||||
- Value distribution and histogram
|
||||
- Sparsity analysis
|
||||
- Correlation structure (if 2D)
|
||||
|
||||
### .npz - Compressed NumPy Archive
|
||||
**Description:** Multiple NumPy arrays in one file
|
||||
**Typical Data:** Collections of related arrays
|
||||
**Use Cases:** Saving multiple arrays together, compressed storage
|
||||
**Python Libraries:**
|
||||
- `numpy`: `np.load('file.npz')` returns dict-like object
|
||||
- `np.savez()` or `np.savez_compressed()`
|
||||
**EDA Approach:**
|
||||
- List of contained arrays
|
||||
- Individual array analysis
|
||||
- Relationships between arrays
|
||||
- Total file size and compression ratio
|
||||
- Naming conventions
|
||||
- Data consistency checks
|
||||
|
||||
### .csv - Comma-Separated Values
|
||||
**Description:** Plain text tabular data
|
||||
**Typical Data:** Experimental measurements, results tables
|
||||
**Use Cases:** Universal data exchange, spreadsheet export
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv('file.csv')`
|
||||
- `csv`: Built-in module
|
||||
- `polars`: High-performance CSV reading
|
||||
- `numpy`: `np.loadtxt()` or `np.genfromtxt()`
|
||||
**EDA Approach:**
|
||||
- Row and column counts
|
||||
- Data type inference
|
||||
- Missing value patterns and frequency
|
||||
- Column statistics (numeric: mean, std; categorical: frequencies)
|
||||
- Outlier detection
|
||||
- Correlation matrix
|
||||
- Duplicate row detection
|
||||
- Header and index validation
|
||||
- Encoding issues detection
|
||||
|
||||
### .tsv / .tab - Tab-Separated Values
|
||||
**Description:** Tab-delimited tabular data
|
||||
**Typical Data:** Similar to CSV but tab-separated
|
||||
**Use Cases:** Bioinformatics, text processing output
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv('file.tsv', sep='\t')`
|
||||
**EDA Approach:**
|
||||
- Same as CSV format
|
||||
- Tab vs space validation
|
||||
- Quote handling
|
||||
|
||||
### .xlsx / .xls - Excel Spreadsheets
|
||||
**Description:** Microsoft Excel binary/XML formats
|
||||
**Typical Data:** Tabular data with formatting, formulas
|
||||
**Use Cases:** Lab notebooks, data entry, reports
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_excel('file.xlsx')`
|
||||
- `openpyxl`: Full Excel file manipulation
|
||||
- `xlrd`: Reading .xls (legacy)
|
||||
**EDA Approach:**
|
||||
- Sheet enumeration and names
|
||||
- Per-sheet data analysis
|
||||
- Formula evaluation
|
||||
- Merged cells handling
|
||||
- Hidden rows/columns
|
||||
- Data validation rules
|
||||
- Named ranges
|
||||
- Formatting-only cells detection
|
||||
|
||||
### .json - JavaScript Object Notation
|
||||
**Description:** Hierarchical text data format
|
||||
**Typical Data:** Nested data structures, metadata
|
||||
**Use Cases:** API responses, configuration, results
|
||||
**Python Libraries:**
|
||||
- `json`: Built-in module
|
||||
- `pandas`: `pd.read_json()`
|
||||
- `ujson`: Faster JSON parsing
|
||||
**EDA Approach:**
|
||||
- Schema inference
|
||||
- Nesting depth
|
||||
- Key-value distribution
|
||||
- Array lengths
|
||||
- Data type consistency
|
||||
- Missing keys
|
||||
- Duplicate detection
|
||||
- Size and complexity metrics
|
||||
|
||||
### .xml - Extensible Markup Language
|
||||
**Description:** Hierarchical markup format
|
||||
**Typical Data:** Structured data with metadata
|
||||
**Use Cases:** Standards-based data exchange, APIs
|
||||
**Python Libraries:**
|
||||
- `lxml`: `lxml.etree.parse()`
|
||||
- `xml.etree.ElementTree`: Built-in XML
|
||||
- `xmltodict`: Convert XML to dict
|
||||
**EDA Approach:**
|
||||
- Schema/DTD validation
|
||||
- Element hierarchy and depth
|
||||
- Namespace handling
|
||||
- Attribute vs element content
|
||||
- CDATA sections
|
||||
- Text content extraction
|
||||
- Sibling and child counts
|
||||
|
||||
### .yaml / .yml - YAML
|
||||
**Description:** Human-readable data serialization
|
||||
**Typical Data:** Configuration, metadata, parameters
|
||||
**Use Cases:** Experiment configurations, pipelines
|
||||
**Python Libraries:**
|
||||
- `yaml`: `yaml.safe_load()` or `yaml.load()`
|
||||
- `ruamel.yaml`: YAML 1.2 support
|
||||
**EDA Approach:**
|
||||
- Configuration structure
|
||||
- Data type handling
|
||||
- List and dict depth
|
||||
- Anchor and alias usage
|
||||
- Multi-document files
|
||||
- Comments preservation
|
||||
- Validation against schema
|
||||
|
||||
### .toml - TOML Configuration
|
||||
**Description:** Configuration file format
|
||||
**Typical Data:** Settings, parameters
|
||||
**Use Cases:** Python package configuration, settings
|
||||
**Python Libraries:**
|
||||
- `tomli` / `tomllib`: TOML reading (tomllib in Python 3.11+)
|
||||
- `toml`: Reading and writing
|
||||
**EDA Approach:**
|
||||
- Section structure
|
||||
- Key-value pairs
|
||||
- Data type inference
|
||||
- Nested table validation
|
||||
- Required vs optional fields
|
||||
|
||||
### .ini - INI Configuration
|
||||
**Description:** Simple configuration format
|
||||
**Typical Data:** Application settings
|
||||
**Use Cases:** Legacy configurations, simple settings
|
||||
**Python Libraries:**
|
||||
- `configparser`: Built-in INI parser
|
||||
**EDA Approach:**
|
||||
- Section enumeration
|
||||
- Key-value extraction
|
||||
- Type conversion
|
||||
- Comment handling
|
||||
- Case sensitivity
|
||||
|
||||
## Binary and Compressed Data
|
||||
|
||||
### .hdf5 / .h5 - Hierarchical Data Format 5
|
||||
**Description:** Container for large scientific datasets
|
||||
**Typical Data:** Multi-dimensional arrays, metadata, groups
|
||||
**Use Cases:** Large datasets, multi-modal data, parallel I/O
|
||||
**Python Libraries:**
|
||||
- `h5py`: `h5py.File('file.h5', 'r')`
|
||||
- `pytables`: Advanced HDF5 interface
|
||||
- `pandas`: HDF5 storage via HDFStore
|
||||
**EDA Approach:**
|
||||
- Group and dataset hierarchy
|
||||
- Dataset shapes and dtypes
|
||||
- Attributes and metadata
|
||||
- Compression and chunking strategy
|
||||
- Memory-efficient sampling
|
||||
- Dataset relationships
|
||||
- File size and efficiency
|
||||
- Access patterns optimization
|
||||
|
||||
### .zarr - Chunked Array Storage
|
||||
**Description:** Cloud-optimized chunked arrays
|
||||
**Typical Data:** Large N-dimensional arrays
|
||||
**Use Cases:** Cloud storage, parallel computing, streaming
|
||||
**Python Libraries:**
|
||||
- `zarr`: `zarr.open('file.zarr')`
|
||||
- `xarray`: Zarr backend support
|
||||
**EDA Approach:**
|
||||
- Array metadata and dimensions
|
||||
- Chunk size optimization
|
||||
- Compression codec and ratio
|
||||
- Synchronizer and store type
|
||||
- Multi-scale hierarchies
|
||||
- Parallel access performance
|
||||
- Attribute metadata
|
||||
|
||||
### .gz / .gzip - Gzip Compressed
|
||||
**Description:** Compressed data files
|
||||
**Typical Data:** Any compressed text or binary
|
||||
**Use Cases:** Compression for storage/transfer
|
||||
**Python Libraries:**
|
||||
- `gzip`: Built-in gzip module
|
||||
- `pandas`: Automatic gzip handling in read functions
|
||||
**EDA Approach:**
|
||||
- Compression ratio
|
||||
- Original file type detection
|
||||
- Decompression validation
|
||||
- Header information
|
||||
- Multi-member archives
|
||||
|
||||
### .bz2 - Bzip2 Compressed
|
||||
**Description:** Bzip2 compression
|
||||
**Typical Data:** Highly compressed files
|
||||
**Use Cases:** Better compression than gzip
|
||||
**Python Libraries:**
|
||||
- `bz2`: Built-in bz2 module
|
||||
- Automatic handling in pandas
|
||||
**EDA Approach:**
|
||||
- Compression efficiency
|
||||
- Decompression time
|
||||
- Content validation
|
||||
|
||||
### .zip - ZIP Archive
|
||||
**Description:** Archive with multiple files
|
||||
**Typical Data:** Collections of files
|
||||
**Use Cases:** File distribution, archiving
|
||||
**Python Libraries:**
|
||||
- `zipfile`: Built-in ZIP support
|
||||
- `pandas`: Can read zipped CSVs
|
||||
**EDA Approach:**
|
||||
- Archive member listing
|
||||
- Compression method per file
|
||||
- Total vs compressed size
|
||||
- Directory structure
|
||||
- File type distribution
|
||||
- Extraction validation
|
||||
|
||||
### .tar / .tar.gz - TAR Archive
|
||||
**Description:** Unix tape archive
|
||||
**Typical Data:** Multiple files and directories
|
||||
**Use Cases:** Software distribution, backups
|
||||
**Python Libraries:**
|
||||
- `tarfile`: Built-in TAR support
|
||||
**EDA Approach:**
|
||||
- Member file listing
|
||||
- Compression (if .tar.gz, .tar.bz2)
|
||||
- Directory structure
|
||||
- Permissions preservation
|
||||
- Extraction testing
|
||||
|
||||
## Time Series and Waveform Data
|
||||
|
||||
### .wav - Waveform Audio
|
||||
**Description:** Audio waveform data
|
||||
**Typical Data:** Acoustic signals, audio recordings
|
||||
**Use Cases:** Acoustic analysis, ultrasound, signal processing
|
||||
**Python Libraries:**
|
||||
- `scipy.io.wavfile`: `scipy.io.wavfile.read()`
|
||||
- `wave`: Built-in module
|
||||
- `soundfile`: Enhanced audio I/O
|
||||
**EDA Approach:**
|
||||
- Sample rate and duration
|
||||
- Bit depth and channels
|
||||
- Amplitude distribution
|
||||
- Spectral analysis (FFT)
|
||||
- Signal-to-noise ratio
|
||||
- Clipping detection
|
||||
- Frequency content
|
||||
|
||||
### .mat - MATLAB Data
|
||||
**Description:** MATLAB workspace variables
|
||||
**Typical Data:** Arrays, structures, cells
|
||||
**Use Cases:** MATLAB-Python interoperability
|
||||
**Python Libraries:**
|
||||
- `scipy.io`: `scipy.io.loadmat()`
|
||||
- `h5py`: For MATLAB v7.3 files (HDF5-based)
|
||||
- `mat73`: Pure Python for v7.3
|
||||
**EDA Approach:**
|
||||
- Variable names and types
|
||||
- Array dimensions
|
||||
- Structure field exploration
|
||||
- Cell array handling
|
||||
- Sparse matrix detection
|
||||
- MATLAB version compatibility
|
||||
- Metadata extraction
|
||||
|
||||
### .edf - European Data Format
|
||||
**Description:** Time series data (especially medical)
|
||||
**Typical Data:** EEG, physiological signals
|
||||
**Use Cases:** Medical signal storage
|
||||
**Python Libraries:**
|
||||
- `pyedflib`: EDF/EDF+ reading and writing
|
||||
- `mne`: Neurophysiology data (supports EDF)
|
||||
**EDA Approach:**
|
||||
- Signal count and names
|
||||
- Sampling frequencies
|
||||
- Signal ranges and units
|
||||
- Recording duration
|
||||
- Annotation events
|
||||
- Data quality (saturation, noise)
|
||||
- Patient/study information
|
||||
|
||||
### .csv (Time Series)
|
||||
**Description:** CSV with timestamp column
|
||||
**Typical Data:** Time-indexed measurements
|
||||
**Use Cases:** Sensor data, monitoring, experiments
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv()` with `parse_dates`
|
||||
**EDA Approach:**
|
||||
- Temporal range and resolution
|
||||
- Sampling regularity
|
||||
- Missing time points
|
||||
- Trend and seasonality
|
||||
- Stationarity tests
|
||||
- Autocorrelation
|
||||
- Anomaly detection
|
||||
|
||||
## Geospatial and Environmental Data
|
||||
|
||||
### .shp - Shapefile
|
||||
**Description:** Geospatial vector data
|
||||
**Typical Data:** Geographic features (points, lines, polygons)
|
||||
**Use Cases:** GIS analysis, spatial data
|
||||
**Python Libraries:**
|
||||
- `geopandas`: `gpd.read_file('file.shp')`
|
||||
- `fiona`: Lower-level shapefile access
|
||||
- `pyshp`: Pure Python shapefile reader
|
||||
**EDA Approach:**
|
||||
- Geometry type and count
|
||||
- Coordinate reference system
|
||||
- Bounding box
|
||||
- Attribute table analysis
|
||||
- Geometry validity
|
||||
- Spatial distribution
|
||||
- Multi-part features
|
||||
- Associated files (.shx, .dbf, .prj)
|
||||
|
||||
### .geojson - GeoJSON
|
||||
**Description:** JSON format for geographic data
|
||||
**Typical Data:** Features with geometry and properties
|
||||
**Use Cases:** Web mapping, spatial analysis
|
||||
**Python Libraries:**
|
||||
- `geopandas`: Native GeoJSON support
|
||||
- `json`: Parse as JSON then process
|
||||
**EDA Approach:**
|
||||
- Feature count and types
|
||||
- CRS specification
|
||||
- Bounding box calculation
|
||||
- Property schema
|
||||
- Geometry complexity
|
||||
- Nesting structure
|
||||
|
||||
### .tif / .tiff (Geospatial)
|
||||
**Description:** GeoTIFF with spatial reference
|
||||
**Typical Data:** Satellite imagery, DEMs, rasters
|
||||
**Use Cases:** Remote sensing, terrain analysis
|
||||
**Python Libraries:**
|
||||
- `rasterio`: `rasterio.open('file.tif')`
|
||||
- `gdal`: Geospatial Data Abstraction Library
|
||||
- `xarray` with `rioxarray`: N-D geospatial arrays
|
||||
**EDA Approach:**
|
||||
- Raster dimensions and resolution
|
||||
- Band count and descriptions
|
||||
- Coordinate reference system
|
||||
- Geotransform parameters
|
||||
- NoData value handling
|
||||
- Pixel value distribution
|
||||
- Histogram analysis
|
||||
- Overviews and pyramids
|
||||
|
||||
### .nc / .netcdf - Network Common Data Form
|
||||
**Description:** Self-describing array-based data
|
||||
**Typical Data:** Climate, atmospheric, oceanographic data
|
||||
**Use Cases:** Scientific datasets, model output
|
||||
**Python Libraries:**
|
||||
- `netCDF4`: `netCDF4.Dataset('file.nc')`
|
||||
- `xarray`: `xr.open_dataset('file.nc')`
|
||||
**EDA Approach:**
|
||||
- Variable enumeration
|
||||
- Dimension analysis
|
||||
- Time series properties
|
||||
- Spatial coverage
|
||||
- Attribute metadata (CF conventions)
|
||||
- Coordinate systems
|
||||
- Chunking and compression
|
||||
- Data quality flags
|
||||
|
||||
### .grib / .grib2 - Gridded Binary
|
||||
**Description:** Meteorological data format
|
||||
**Typical Data:** Weather forecasts, climate data
|
||||
**Use Cases:** Numerical weather prediction
|
||||
**Python Libraries:**
|
||||
- `pygrib`: GRIB file reading
|
||||
- `xarray` with `cfgrib`: GRIB to xarray
|
||||
**EDA Approach:**
|
||||
- Message inventory
|
||||
- Parameter and level types
|
||||
- Spatial grid specification
|
||||
- Temporal coverage
|
||||
- Ensemble members
|
||||
- Forecast vs analysis
|
||||
- Data packing and precision
|
||||
|
||||
### .hdf4 - HDF4 Format
|
||||
**Description:** Older HDF format
|
||||
**Typical Data:** NASA Earth Science data
|
||||
**Use Cases:** Satellite data (MODIS, etc.)
|
||||
**Python Libraries:**
|
||||
- `pyhdf`: HDF4 access
|
||||
- `gdal`: Can read HDF4
|
||||
**EDA Approach:**
|
||||
- Scientific dataset listing
|
||||
- Vdata and attributes
|
||||
- Dimension scales
|
||||
- Metadata extraction
|
||||
- Quality flags
|
||||
- Conversion to HDF5 or NetCDF
|
||||
|
||||
## Specialized Scientific Formats
|
||||
|
||||
### .fits - Flexible Image Transport System
|
||||
**Description:** Astronomy data format
|
||||
**Typical Data:** Images, tables, spectra from telescopes
|
||||
**Use Cases:** Astronomical observations
|
||||
**Python Libraries:**
|
||||
- `astropy.io.fits`: `fits.open('file.fits')`
|
||||
- `fitsio`: Alternative FITS library
|
||||
**EDA Approach:**
|
||||
- HDU (Header Data Unit) structure
|
||||
- Image dimensions and WCS
|
||||
- Header keyword analysis
|
||||
- Table column descriptions
|
||||
- Data type and scaling
|
||||
- FITS convention compliance
|
||||
- Checksum validation
|
||||
|
||||
### .asdf - Advanced Scientific Data Format
|
||||
**Description:** Next-gen data format for astronomy
|
||||
**Typical Data:** Complex hierarchical scientific data
|
||||
**Use Cases:** James Webb Space Telescope data
|
||||
**Python Libraries:**
|
||||
- `asdf`: `asdf.open('file.asdf')`
|
||||
**EDA Approach:**
|
||||
- Tree structure exploration
|
||||
- Schema validation
|
||||
- Internal vs external arrays
|
||||
- Compression methods
|
||||
- YAML metadata
|
||||
- Version compatibility
|
||||
|
||||
### .root - ROOT Data Format
|
||||
**Description:** CERN ROOT framework format
|
||||
**Typical Data:** High-energy physics data
|
||||
**Use Cases:** Particle physics experiments
|
||||
**Python Libraries:**
|
||||
- `uproot`: Pure Python ROOT reading
|
||||
- `ROOT`: Official PyROOT bindings
|
||||
**EDA Approach:**
|
||||
- TTree structure
|
||||
- Branch types and entries
|
||||
- Histogram inventory
|
||||
- Event loop statistics
|
||||
- File compression
|
||||
- Split level analysis
|
||||
|
||||
### .txt - Plain Text Data
|
||||
**Description:** Generic text-based data
|
||||
**Typical Data:** Tab/space-delimited, custom formats
|
||||
**Use Cases:** Simple data exchange, logs
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv()` with custom delimiters
|
||||
- `numpy`: `np.loadtxt()`, `np.genfromtxt()`
|
||||
- Built-in file reading
|
||||
**EDA Approach:**
|
||||
- Format detection (delimiter, header)
|
||||
- Data type inference
|
||||
- Comment line handling
|
||||
- Missing value codes
|
||||
- Column alignment
|
||||
- Encoding detection
|
||||
|
||||
### .dat - Generic Data File
|
||||
**Description:** Binary or text data
|
||||
**Typical Data:** Instrument output, custom formats
|
||||
**Use Cases:** Various scientific instruments
|
||||
**Python Libraries:**
|
||||
- Format-specific: requires knowledge of structure
|
||||
- `numpy`: `np.fromfile()` for binary
|
||||
- `struct`: Parse binary structures
|
||||
**EDA Approach:**
|
||||
- Binary vs text determination
|
||||
- Header detection
|
||||
- Record structure inference
|
||||
- Endianness
|
||||
- Data type patterns
|
||||
- Validation with documentation
|
||||
|
||||
### .log - Log Files
|
||||
**Description:** Text logs from software/instruments
|
||||
**Typical Data:** Timestamped events, messages
|
||||
**Use Cases:** Troubleshooting, experiment tracking
|
||||
**Python Libraries:**
|
||||
- Built-in file reading
|
||||
- `pandas`: Structured log parsing
|
||||
- Regular expressions for parsing
|
||||
**EDA Approach:**
|
||||
- Log level distribution
|
||||
- Timestamp parsing
|
||||
- Error and warning frequency
|
||||
- Event sequencing
|
||||
- Pattern recognition
|
||||
- Anomaly detection
|
||||
- Session boundaries
|
||||
@@ -0,0 +1,620 @@
|
||||
# Microscopy and Imaging File Formats Reference
|
||||
|
||||
This reference covers file formats used in microscopy, medical imaging, remote sensing, and scientific image analysis.
|
||||
|
||||
## Microscopy-Specific Formats
|
||||
|
||||
### .tif / .tiff - Tagged Image File Format
|
||||
**Description:** Flexible image format supporting multiple pages and metadata
|
||||
**Typical Data:** Microscopy images, z-stacks, time series, multi-channel
|
||||
**Use Cases:** Fluorescence microscopy, confocal imaging, biological imaging
|
||||
**Python Libraries:**
|
||||
- `tifffile`: `tifffile.imread('file.tif')` - Microscopy TIFF support
|
||||
- `PIL/Pillow`: `Image.open('file.tif')` - Basic TIFF
|
||||
- `scikit-image`: `io.imread('file.tif')`
|
||||
- `AICSImageIO`: Multi-format microscopy reader
|
||||
**EDA Approach:**
|
||||
- Image dimensions and bit depth
|
||||
- Multi-page/z-stack analysis
|
||||
- Metadata extraction (OME-TIFF)
|
||||
- Channel analysis and intensity distributions
|
||||
- Temporal dynamics (time-lapse)
|
||||
- Pixel size and spatial calibration
|
||||
- Histogram analysis per channel
|
||||
- Dynamic range utilization
|
||||
|
||||
### .nd2 - Nikon NIS-Elements
|
||||
**Description:** Proprietary Nikon microscope format
|
||||
**Typical Data:** Multi-dimensional microscopy (XYZCT)
|
||||
**Use Cases:** Nikon microscope data, confocal, widefield
|
||||
**Python Libraries:**
|
||||
- `nd2reader`: `ND2Reader('file.nd2')`
|
||||
- `pims`: `pims.ND2_Reader('file.nd2')`
|
||||
- `AICSImageIO`: Universal reader
|
||||
**EDA Approach:**
|
||||
- Experiment metadata extraction
|
||||
- Channel configurations
|
||||
- Time-lapse frame analysis
|
||||
- Z-stack depth and spacing
|
||||
- XY stage positions
|
||||
- Laser settings and power
|
||||
- Pixel binning information
|
||||
- Acquisition timestamps
|
||||
|
||||
### .lif - Leica Image Format
|
||||
**Description:** Leica microscope proprietary format
|
||||
**Typical Data:** Multi-experiment, multi-dimensional images
|
||||
**Use Cases:** Leica confocal and widefield data
|
||||
**Python Libraries:**
|
||||
- `readlif`: `readlif.LifFile('file.lif')`
|
||||
- `AICSImageIO`: LIF support
|
||||
- `python-bioformats`: Via Bio-Formats
|
||||
**EDA Approach:**
|
||||
- Multiple experiment detection
|
||||
- Image series enumeration
|
||||
- Metadata per experiment
|
||||
- Channel and timepoint structure
|
||||
- Physical dimensions extraction
|
||||
- Objective and detector information
|
||||
- Scan settings analysis
|
||||
|
||||
### .czi - Carl Zeiss Image
|
||||
**Description:** Zeiss microscope format
|
||||
**Typical Data:** Multi-dimensional microscopy with rich metadata
|
||||
**Use Cases:** Zeiss confocal, lightsheet, widefield
|
||||
**Python Libraries:**
|
||||
- `czifile`: `czifile.CziFile('file.czi')`
|
||||
- `AICSImageIO`: CZI support
|
||||
- `pylibCZIrw`: Official Zeiss library
|
||||
**EDA Approach:**
|
||||
- Scene and position analysis
|
||||
- Mosaic tile structure
|
||||
- Channel wavelength information
|
||||
- Acquisition mode detection
|
||||
- Scaling and calibration
|
||||
- Instrument configuration
|
||||
- ROI definitions
|
||||
|
||||
### .oib / .oif - Olympus Image Format
|
||||
**Description:** Olympus microscope formats
|
||||
**Typical Data:** Confocal and multiphoton imaging
|
||||
**Use Cases:** Olympus FluoView data
|
||||
**Python Libraries:**
|
||||
- `AICSImageIO`: OIB/OIF support
|
||||
- `python-bioformats`: Via Bio-Formats
|
||||
**EDA Approach:**
|
||||
- Directory structure validation (OIF)
|
||||
- Metadata file parsing
|
||||
- Channel configuration
|
||||
- Scan parameters
|
||||
- Objective and filter information
|
||||
- PMT settings
|
||||
|
||||
### .vsi - Olympus VSI
|
||||
**Description:** Olympus slide scanner format
|
||||
**Typical Data:** Whole slide imaging, large mosaics
|
||||
**Use Cases:** Virtual microscopy, pathology
|
||||
**Python Libraries:**
|
||||
- `openslide-python`: `openslide.OpenSlide('file.vsi')`
|
||||
- `AICSImageIO`: VSI support
|
||||
**EDA Approach:**
|
||||
- Pyramid level analysis
|
||||
- Tile structure and overlap
|
||||
- Macro and label images
|
||||
- Magnification levels
|
||||
- Whole slide statistics
|
||||
- Region detection
|
||||
|
||||
### .ims - Imaris Format
|
||||
**Description:** Bitplane Imaris HDF5-based format
|
||||
**Typical Data:** Large 3D/4D microscopy datasets
|
||||
**Use Cases:** 3D rendering, time-lapse analysis
|
||||
**Python Libraries:**
|
||||
- `h5py`: Direct HDF5 access
|
||||
- `imaris_ims_file_reader`: Specialized reader
|
||||
**EDA Approach:**
|
||||
- Resolution level analysis
|
||||
- Time point structure
|
||||
- Channel organization
|
||||
- Dataset hierarchy
|
||||
- Thumbnail generation
|
||||
- Memory-mapped access strategies
|
||||
- Chunking optimization
|
||||
|
||||
### .lsm - Zeiss LSM
|
||||
**Description:** Legacy Zeiss confocal format
|
||||
**Typical Data:** Confocal laser scanning microscopy
|
||||
**Use Cases:** Older Zeiss confocal data
|
||||
**Python Libraries:**
|
||||
- `tifffile`: LSM support (TIFF-based)
|
||||
- `python-bioformats`: LSM reading
|
||||
**EDA Approach:**
|
||||
- Similar to TIFF with LSM-specific metadata
|
||||
- Scan speed and resolution
|
||||
- Laser lines and power
|
||||
- Detector gain and offset
|
||||
- LUT information
|
||||
|
||||
### .stk - MetaMorph Stack
|
||||
**Description:** MetaMorph image stack format
|
||||
**Typical Data:** Time-lapse or z-stack sequences
|
||||
**Use Cases:** MetaMorph software output
|
||||
**Python Libraries:**
|
||||
- `tifffile`: STK is TIFF-based
|
||||
- `python-bioformats`: STK support
|
||||
**EDA Approach:**
|
||||
- Stack dimensionality
|
||||
- Plane metadata
|
||||
- Timing information
|
||||
- Stage positions
|
||||
- UIC tags parsing
|
||||
|
||||
### .dv - DeltaVision
|
||||
**Description:** Applied Precision DeltaVision format
|
||||
**Typical Data:** Deconvolution microscopy
|
||||
**Use Cases:** DeltaVision microscope data
|
||||
**Python Libraries:**
|
||||
- `mrc`: Can read DV (MRC-related)
|
||||
- `AICSImageIO`: DV support
|
||||
**EDA Approach:**
|
||||
- Wave information (channels)
|
||||
- Extended header analysis
|
||||
- Lens and magnification
|
||||
- Deconvolution status
|
||||
- Time stamps per section
|
||||
|
||||
### .mrc - Medical Research Council
|
||||
**Description:** Electron microscopy format
|
||||
**Typical Data:** EM images, cryo-EM, tomography
|
||||
**Use Cases:** Structural biology, electron microscopy
|
||||
**Python Libraries:**
|
||||
- `mrcfile`: `mrcfile.open('file.mrc')`
|
||||
- `EMAN2`: EM-specific tools
|
||||
**EDA Approach:**
|
||||
- Volume dimensions
|
||||
- Voxel size and units
|
||||
- Origin and map statistics
|
||||
- Symmetry information
|
||||
- Extended header analysis
|
||||
- Density statistics
|
||||
- Header consistency validation
|
||||
|
||||
### .dm3 / .dm4 - Gatan Digital Micrograph
|
||||
**Description:** Gatan TEM/STEM format
|
||||
**Typical Data:** Transmission electron microscopy
|
||||
**Use Cases:** TEM imaging and analysis
|
||||
**Python Libraries:**
|
||||
- `hyperspy`: `hs.load('file.dm3')`
|
||||
- `ncempy`: `ncempy.io.dm.dmReader('file.dm3')`
|
||||
**EDA Approach:**
|
||||
- Microscope parameters
|
||||
- Energy dispersive spectroscopy data
|
||||
- Diffraction patterns
|
||||
- Calibration information
|
||||
- Tag structure analysis
|
||||
- Image series handling
|
||||
|
||||
### .eer - Electron Event Representation
|
||||
**Description:** Direct electron detector format
|
||||
**Typical Data:** Electron counting data from detectors
|
||||
**Use Cases:** Cryo-EM data collection
|
||||
**Python Libraries:**
|
||||
- `mrcfile`: Some EER support
|
||||
- Vendor-specific tools (Gatan, TFS)
|
||||
**EDA Approach:**
|
||||
- Event counting statistics
|
||||
- Frame rate and dose
|
||||
- Detector configuration
|
||||
- Motion correction assessment
|
||||
- Gain reference validation
|
||||
|
||||
### .ser - TIA Series
|
||||
**Description:** FEI/TFS TIA format
|
||||
**Typical Data:** EM image series
|
||||
**Use Cases:** FEI/Thermo Fisher EM data
|
||||
**Python Libraries:**
|
||||
- `hyperspy`: SER support
|
||||
- `ncempy`: TIA reader
|
||||
**EDA Approach:**
|
||||
- Series structure
|
||||
- Calibration data
|
||||
- Acquisition metadata
|
||||
- Time stamps
|
||||
- Multi-dimensional data organization
|
||||
|
||||
## Medical and Biological Imaging
|
||||
|
||||
### .dcm - DICOM
|
||||
**Description:** Digital Imaging and Communications in Medicine
|
||||
**Typical Data:** Medical images with patient/study metadata
|
||||
**Use Cases:** Clinical imaging, radiology, CT, MRI, PET
|
||||
**Python Libraries:**
|
||||
- `pydicom`: `pydicom.dcmread('file.dcm')`
|
||||
- `SimpleITK`: `sitk.ReadImage('file.dcm')`
|
||||
- `nibabel`: Limited DICOM support
|
||||
**EDA Approach:**
|
||||
- Patient metadata extraction (anonymization check)
|
||||
- Modality-specific analysis
|
||||
- Series and study organization
|
||||
- Slice thickness and spacing
|
||||
- Window/level settings
|
||||
- Hounsfield units (CT)
|
||||
- Image orientation and position
|
||||
- Multi-frame analysis
|
||||
|
||||
### .nii / .nii.gz - NIfTI
|
||||
**Description:** Neuroimaging Informatics Technology Initiative
|
||||
**Typical Data:** Brain imaging, fMRI, structural MRI
|
||||
**Use Cases:** Neuroimaging research, brain analysis
|
||||
**Python Libraries:**
|
||||
- `nibabel`: `nibabel.load('file.nii')`
|
||||
- `nilearn`: Neuroimaging with ML
|
||||
- `SimpleITK`: NIfTI support
|
||||
**EDA Approach:**
|
||||
- Volume dimensions and voxel size
|
||||
- Affine transformation matrix
|
||||
- Time series analysis (fMRI)
|
||||
- Intensity distribution
|
||||
- Brain extraction quality
|
||||
- Registration assessment
|
||||
- Orientation validation
|
||||
- Header information consistency
|
||||
|
||||
### .mnc - MINC Format
|
||||
**Description:** Medical Image NetCDF
|
||||
**Typical Data:** Medical imaging (predecessor to NIfTI)
|
||||
**Use Cases:** Legacy neuroimaging data
|
||||
**Python Libraries:**
|
||||
- `pyminc`: MINC-specific tools
|
||||
- `nibabel`: MINC support
|
||||
**EDA Approach:**
|
||||
- Similar to NIfTI
|
||||
- NetCDF structure exploration
|
||||
- Dimension ordering
|
||||
- Metadata extraction
|
||||
|
||||
### .nrrd - Nearly Raw Raster Data
|
||||
**Description:** Medical imaging format with detached header
|
||||
**Typical Data:** Medical images, research imaging
|
||||
**Use Cases:** 3D Slicer, ITK-based applications
|
||||
**Python Libraries:**
|
||||
- `pynrrd`: `nrrd.read('file.nrrd')`
|
||||
- `SimpleITK`: NRRD support
|
||||
**EDA Approach:**
|
||||
- Header field analysis
|
||||
- Encoding format
|
||||
- Dimension and spacing
|
||||
- Orientation matrix
|
||||
- Compression assessment
|
||||
- Endianness handling
|
||||
|
||||
### .mha / .mhd - MetaImage
|
||||
**Description:** MetaImage format (ITK)
|
||||
**Typical Data:** Medical/scientific 3D images
|
||||
**Use Cases:** ITK/SimpleITK applications
|
||||
**Python Libraries:**
|
||||
- `SimpleITK`: Native MHA/MHD support
|
||||
- `itk`: Direct ITK integration
|
||||
**EDA Approach:**
|
||||
- Header-data file pairing (MHD)
|
||||
- Transform matrix
|
||||
- Element spacing
|
||||
- Compression format
|
||||
- Data type and dimensions
|
||||
|
||||
### .hdr / .img - Analyze Format
|
||||
**Description:** Legacy medical imaging format
|
||||
**Typical Data:** Brain imaging (pre-NIfTI)
|
||||
**Use Cases:** Old neuroimaging datasets
|
||||
**Python Libraries:**
|
||||
- `nibabel`: Analyze support
|
||||
- Conversion to NIfTI recommended
|
||||
**EDA Approach:**
|
||||
- Header-image pairing validation
|
||||
- Byte order issues
|
||||
- Conversion to modern formats
|
||||
- Metadata limitations
|
||||
|
||||
## Scientific Image Formats
|
||||
|
||||
### .png - Portable Network Graphics
|
||||
**Description:** Lossless compressed image format
|
||||
**Typical Data:** 2D images, screenshots, processed data
|
||||
**Use Cases:** Publication figures, lossless storage
|
||||
**Python Libraries:**
|
||||
- `PIL/Pillow`: `Image.open('file.png')`
|
||||
- `scikit-image`: `io.imread('file.png')`
|
||||
- `imageio`: `imageio.imread('file.png')`
|
||||
**EDA Approach:**
|
||||
- Bit depth analysis (8-bit, 16-bit)
|
||||
- Color mode (grayscale, RGB, palette)
|
||||
- Metadata (PNG chunks)
|
||||
- Transparency handling
|
||||
- Compression efficiency
|
||||
- Histogram analysis
|
||||
|
||||
### .jpg / .jpeg - Joint Photographic Experts Group
|
||||
**Description:** Lossy compressed image format
|
||||
**Typical Data:** Natural images, photos
|
||||
**Use Cases:** Visualization, web graphics (not raw data)
|
||||
**Python Libraries:**
|
||||
- `PIL/Pillow`: Standard JPEG support
|
||||
- `scikit-image`: JPEG reading
|
||||
**EDA Approach:**
|
||||
- Compression artifacts detection
|
||||
- Quality factor estimation
|
||||
- Color space (RGB, grayscale)
|
||||
- EXIF metadata
|
||||
- Quantization table analysis
|
||||
- Note: Not suitable for quantitative analysis
|
||||
|
||||
### .bmp - Bitmap Image
|
||||
**Description:** Uncompressed raster image
|
||||
**Typical Data:** Simple images, screenshots
|
||||
**Use Cases:** Compatibility, simple storage
|
||||
**Python Libraries:**
|
||||
- `PIL/Pillow`: BMP support
|
||||
- `scikit-image`: BMP reading
|
||||
**EDA Approach:**
|
||||
- Color depth
|
||||
- Palette analysis (if indexed)
|
||||
- File size efficiency
|
||||
- Pixel format validation
|
||||
|
||||
### .gif - Graphics Interchange Format
|
||||
**Description:** Image format with animation support
|
||||
**Typical Data:** Animated images, simple graphics
|
||||
**Use Cases:** Animations, time-lapse visualization
|
||||
**Python Libraries:**
|
||||
- `PIL/Pillow`: GIF support
|
||||
- `imageio`: Better GIF animation support
|
||||
**EDA Approach:**
|
||||
- Frame count and timing
|
||||
- Palette limitations (256 colors)
|
||||
- Loop count
|
||||
- Disposal method
|
||||
- Transparency handling
|
||||
|
||||
### .svg - Scalable Vector Graphics
|
||||
**Description:** XML-based vector graphics
|
||||
**Typical Data:** Vector drawings, plots, diagrams
|
||||
**Use Cases:** Publication-quality figures, plots
|
||||
**Python Libraries:**
|
||||
- `svgpathtools`: Path manipulation
|
||||
- `cairosvg`: Rasterization
|
||||
- `lxml`: XML parsing
|
||||
**EDA Approach:**
|
||||
- Element structure analysis
|
||||
- Style information
|
||||
- Viewbox and dimensions
|
||||
- Path complexity
|
||||
- Text element extraction
|
||||
- Layer organization
|
||||
|
||||
### .eps - Encapsulated PostScript
|
||||
**Description:** Vector graphics format
|
||||
**Typical Data:** Publication figures
|
||||
**Use Cases:** Legacy publication graphics
|
||||
**Python Libraries:**
|
||||
- `PIL/Pillow`: Basic EPS rasterization
|
||||
- `ghostscript` via subprocess
|
||||
**EDA Approach:**
|
||||
- Bounding box information
|
||||
- Preview image validation
|
||||
- Font embedding
|
||||
- Conversion to modern formats
|
||||
|
||||
### .pdf (Images)
|
||||
**Description:** Portable Document Format with images
|
||||
**Typical Data:** Publication figures, multi-page documents
|
||||
**Use Cases:** Publication, data presentation
|
||||
**Python Libraries:**
|
||||
- `PyMuPDF/fitz`: `fitz.open('file.pdf')`
|
||||
- `pdf2image`: Rasterization
|
||||
- `pdfplumber`: Text and layout extraction
|
||||
**EDA Approach:**
|
||||
- Page count
|
||||
- Image extraction
|
||||
- Resolution and DPI
|
||||
- Embedded fonts and metadata
|
||||
- Compression methods
|
||||
- Image vs vector content
|
||||
|
||||
### .fig - MATLAB Figure
|
||||
**Description:** MATLAB figure file
|
||||
**Typical Data:** MATLAB plots and figures
|
||||
**Use Cases:** MATLAB data visualization
|
||||
**Python Libraries:**
|
||||
- Custom parsers (MAT file structure)
|
||||
- Conversion to other formats
|
||||
**EDA Approach:**
|
||||
- Figure structure
|
||||
- Data extraction from plots
|
||||
- Axes and label information
|
||||
- Plot type identification
|
||||
|
||||
### .hdf5 (Imaging Specific)
|
||||
**Description:** HDF5 for large imaging datasets
|
||||
**Typical Data:** High-content screening, large microscopy
|
||||
**Use Cases:** BigDataViewer, large-scale imaging
|
||||
**Python Libraries:**
|
||||
- `h5py`: Universal HDF5 access
|
||||
- Imaging-specific readers (BigDataViewer)
|
||||
**EDA Approach:**
|
||||
- Dataset hierarchy
|
||||
- Chunk and compression strategy
|
||||
- Multi-resolution pyramid
|
||||
- Metadata organization
|
||||
- Memory-mapped access
|
||||
- Parallel I/O performance
|
||||
|
||||
### .zarr - Chunked Array Storage
|
||||
**Description:** Cloud-optimized array storage
|
||||
**Typical Data:** Large imaging datasets, OME-ZARR
|
||||
**Use Cases:** Cloud microscopy, large-scale analysis
|
||||
**Python Libraries:**
|
||||
- `zarr`: `zarr.open('file.zarr')`
|
||||
- `ome-zarr-py`: OME-ZARR support
|
||||
**EDA Approach:**
|
||||
- Chunk size optimization
|
||||
- Compression codec analysis
|
||||
- Multi-scale representation
|
||||
- Array dimensions and dtype
|
||||
- Metadata structure (OME)
|
||||
- Cloud access patterns
|
||||
|
||||
### .raw - Raw Image Data
|
||||
**Description:** Unformatted binary pixel data
|
||||
**Typical Data:** Raw detector output
|
||||
**Use Cases:** Custom imaging systems
|
||||
**Python Libraries:**
|
||||
- `numpy`: `np.fromfile()` with dtype
|
||||
- `imageio`: Raw format plugins
|
||||
**EDA Approach:**
|
||||
- Dimensions determination (external info needed)
|
||||
- Byte order and data type
|
||||
- Header presence detection
|
||||
- Pixel value range
|
||||
- Noise characteristics
|
||||
|
||||
### .bin - Binary Image Data
|
||||
**Description:** Generic binary image format
|
||||
**Typical Data:** Raw or custom-formatted images
|
||||
**Use Cases:** Instrument-specific outputs
|
||||
**Python Libraries:**
|
||||
- `numpy`: Custom binary reading
|
||||
- `struct`: For structured binary data
|
||||
**EDA Approach:**
|
||||
- Format specification required
|
||||
- Header parsing (if present)
|
||||
- Data type inference
|
||||
- Dimension extraction
|
||||
- Validation with known parameters
|
||||
|
||||
## Image Analysis Formats
|
||||
|
||||
### .roi - ImageJ ROI
|
||||
**Description:** ImageJ region of interest format
|
||||
**Typical Data:** Geometric ROIs, selections
|
||||
**Use Cases:** ImageJ/Fiji analysis workflows
|
||||
**Python Libraries:**
|
||||
- `read-roi`: `read_roi.read_roi_file('file.roi')`
|
||||
- `roifile`: ROI manipulation
|
||||
**EDA Approach:**
|
||||
- ROI type analysis (rectangle, polygon, etc.)
|
||||
- Coordinate extraction
|
||||
- ROI properties (area, perimeter)
|
||||
- Group analysis (ROI sets)
|
||||
- Z-position and time information
|
||||
|
||||
### .zip (ROI sets)
|
||||
**Description:** ZIP archive of ImageJ ROIs
|
||||
**Typical Data:** Multiple ROI files
|
||||
**Use Cases:** Batch ROI analysis
|
||||
**Python Libraries:**
|
||||
- `read-roi`: `read_roi.read_roi_zip('file.zip')`
|
||||
- Standard `zipfile` module
|
||||
**EDA Approach:**
|
||||
- ROI count in set
|
||||
- ROI type distribution
|
||||
- Spatial distribution
|
||||
- Overlapping ROI detection
|
||||
- Naming conventions
|
||||
|
||||
### .ome.tif / .ome.tiff - OME-TIFF
|
||||
**Description:** TIFF with OME-XML metadata
|
||||
**Typical Data:** Standardized microscopy with rich metadata
|
||||
**Use Cases:** Bio-Formats compatible storage
|
||||
**Python Libraries:**
|
||||
- `tifffile`: OME-TIFF support
|
||||
- `AICSImageIO`: OME reading
|
||||
- `python-bioformats`: Bio-Formats integration
|
||||
**EDA Approach:**
|
||||
- OME-XML validation
|
||||
- Physical dimensions extraction
|
||||
- Channel naming and wavelengths
|
||||
- Plane positions (Z, C, T)
|
||||
- Instrument metadata
|
||||
- Bio-Formats compatibility
|
||||
|
||||
### .ome.zarr - OME-ZARR
|
||||
**Description:** OME-NGFF specification on ZARR
|
||||
**Typical Data:** Next-generation file format for bioimaging
|
||||
**Use Cases:** Cloud-native imaging, large datasets
|
||||
**Python Libraries:**
|
||||
- `ome-zarr-py`: Official implementation
|
||||
- `zarr`: Underlying array storage
|
||||
**EDA Approach:**
|
||||
- Multiscale resolution levels
|
||||
- Metadata compliance with OME-NGFF spec
|
||||
- Coordinate transformations
|
||||
- Label and ROI handling
|
||||
- Cloud storage optimization
|
||||
- Chunk access patterns
|
||||
|
||||
### .klb - Keller Lab Block
|
||||
**Description:** Fast microscopy format for large data
|
||||
**Typical Data:** Lightsheet microscopy, time-lapse
|
||||
**Use Cases:** High-throughput imaging
|
||||
**Python Libraries:**
|
||||
- `pyklb`: KLB reading and writing
|
||||
**EDA Approach:**
|
||||
- Compression efficiency
|
||||
- Block structure
|
||||
- Multi-resolution support
|
||||
- Read performance benchmarking
|
||||
- Metadata extraction
|
||||
|
||||
### .vsi - Whole Slide Imaging
|
||||
**Description:** Virtual slide format (multiple vendors)
|
||||
**Typical Data:** Pathology slides, large mosaics
|
||||
**Use Cases:** Digital pathology
|
||||
**Python Libraries:**
|
||||
- `openslide-python`: Multi-format WSI
|
||||
- `tiffslide`: Pure Python alternative
|
||||
**EDA Approach:**
|
||||
- Pyramid level count
|
||||
- Downsampling factors
|
||||
- Associated images (macro, label)
|
||||
- Tile size and overlap
|
||||
- MPP (microns per pixel)
|
||||
- Background detection
|
||||
- Tissue segmentation
|
||||
|
||||
### .ndpi - Hamamatsu NanoZoomer
|
||||
**Description:** Hamamatsu slide scanner format
|
||||
**Typical Data:** Whole slide pathology images
|
||||
**Use Cases:** Digital pathology workflows
|
||||
**Python Libraries:**
|
||||
- `openslide-python`: NDPI support
|
||||
**EDA Approach:**
|
||||
- Multi-resolution pyramid
|
||||
- Lens and objective information
|
||||
- Scan area and magnification
|
||||
- Focal plane information
|
||||
- Tissue detection
|
||||
|
||||
### .svs - Aperio ScanScope
|
||||
**Description:** Aperio whole slide format
|
||||
**Typical Data:** Digital pathology slides
|
||||
**Use Cases:** Pathology image analysis
|
||||
**Python Libraries:**
|
||||
- `openslide-python`: SVS support
|
||||
**EDA Approach:**
|
||||
- Pyramid structure
|
||||
- MPP calibration
|
||||
- Label and macro images
|
||||
- Compression quality
|
||||
- Thumbnail generation
|
||||
|
||||
### .scn - Leica SCN
|
||||
**Description:** Leica slide scanner format
|
||||
**Typical Data:** Whole slide imaging
|
||||
**Use Cases:** Digital pathology
|
||||
**Python Libraries:**
|
||||
- `openslide-python`: SCN support
|
||||
**EDA Approach:**
|
||||
- Tile structure analysis
|
||||
- Collection organization
|
||||
- Metadata extraction
|
||||
- Magnification levels
|
||||
@@ -0,0 +1,517 @@
|
||||
# Proteomics and Metabolomics File Formats Reference
|
||||
|
||||
This reference covers file formats specific to proteomics, metabolomics, lipidomics, and related omics workflows.
|
||||
|
||||
## Mass Spectrometry-Based Proteomics
|
||||
|
||||
### .mzML - Mass Spectrometry Markup Language
|
||||
**Description:** Standard XML format for MS data
|
||||
**Typical Data:** MS1 and MS2 spectra, retention times, intensities
|
||||
**Use Cases:** Proteomics, metabolomics pipelines
|
||||
**Python Libraries:**
|
||||
- `pymzml`: `pymzml.run.Reader('file.mzML')`
|
||||
- `pyteomics.mzml`: `pyteomics.mzml.read('file.mzML')`
|
||||
- `pyopenms`: OpenMS Python bindings
|
||||
**EDA Approach:**
|
||||
- Scan count and MS level distribution
|
||||
- Total ion chromatogram (TIC) analysis
|
||||
- Base peak chromatogram (BPC)
|
||||
- m/z coverage and resolution
|
||||
- Retention time range
|
||||
- Precursor selection patterns
|
||||
- Data completeness
|
||||
- Quality control metrics (lock mass, standards)
|
||||
|
||||
### .mzXML - Legacy MS XML Format
|
||||
**Description:** Older XML-based MS format
|
||||
**Typical Data:** Mass spectra with metadata
|
||||
**Use Cases:** Legacy proteomics data
|
||||
**Python Libraries:**
|
||||
- `pyteomics.mzxml`
|
||||
- `pymzml`: Can read mzXML
|
||||
**EDA Approach:**
|
||||
- Similar to mzML
|
||||
- Format version compatibility
|
||||
- Conversion quality validation
|
||||
- Metadata preservation check
|
||||
|
||||
### .mzIdentML - Peptide Identification Format
|
||||
**Description:** PSI standard for peptide identifications
|
||||
**Typical Data:** Peptide-spectrum matches, proteins, scores
|
||||
**Use Cases:** Search engine results, proteomics workflows
|
||||
**Python Libraries:**
|
||||
- `pyteomics.mzid`
|
||||
- `pyopenms`: MzIdentML support
|
||||
**EDA Approach:**
|
||||
- PSM count and score distribution
|
||||
- FDR calculation and filtering
|
||||
- Modification analysis
|
||||
- Missed cleavage statistics
|
||||
- Protein inference results
|
||||
- Search parameters validation
|
||||
- Decoy hit analysis
|
||||
- Rank-1 vs lower ranks
|
||||
|
||||
### .pepXML - Trans-Proteomic Pipeline Peptide XML
|
||||
**Description:** TPP format for peptide identifications
|
||||
**Typical Data:** Search results with statistical validation
|
||||
**Use Cases:** Proteomics database search output
|
||||
**Python Libraries:**
|
||||
- `pyteomics.pepxml`
|
||||
**EDA Approach:**
|
||||
- Search engine comparison
|
||||
- Score distributions (XCorr, expect value, etc.)
|
||||
- Charge state analysis
|
||||
- Modification frequencies
|
||||
- PeptideProphet probabilities
|
||||
- Protein coverage
|
||||
- Spectral counting
|
||||
|
||||
### .protXML - Protein Inference Results
|
||||
**Description:** TPP protein-level identifications
|
||||
**Typical Data:** Protein groups, probabilities, peptides
|
||||
**Use Cases:** Protein-level analysis
|
||||
**Python Libraries:**
|
||||
- `pyteomics.protxml`
|
||||
**EDA Approach:**
|
||||
- Protein group statistics
|
||||
- Parsimonious protein sets
|
||||
- ProteinProphet probabilities
|
||||
- Coverage and peptide count per protein
|
||||
- Unique vs shared peptides
|
||||
- Protein molecular weight distribution
|
||||
- GO term enrichment preparation
|
||||
|
||||
### .pride.xml - PRIDE XML Format
|
||||
**Description:** Proteomics Identifications Database format
|
||||
**Typical Data:** Complete proteomics experiment data
|
||||
**Use Cases:** Public data deposition (legacy)
|
||||
**Python Libraries:**
|
||||
- `pyteomics.pride`
|
||||
- Custom XML parsers
|
||||
**EDA Approach:**
|
||||
- Experiment metadata extraction
|
||||
- Identification completeness
|
||||
- Cross-linking to spectra
|
||||
- Protocol information
|
||||
- Instrument details
|
||||
|
||||
### .tsv / .csv (Proteomics)
|
||||
**Description:** Tab or comma-separated proteomics results
|
||||
**Typical Data:** Peptide or protein quantification tables
|
||||
**Use Cases:** MaxQuant, Proteome Discoverer, Skyline output
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_csv()` or `pd.read_table()`
|
||||
**EDA Approach:**
|
||||
- Identification counts
|
||||
- Quantitative value distributions
|
||||
- Missing value patterns
|
||||
- Intensity-based analysis
|
||||
- Label-free quantification assessment
|
||||
- Isobaric tag ratio analysis
|
||||
- Coefficient of variation
|
||||
- Batch effects
|
||||
|
||||
### .msf - Thermo MSF Database
|
||||
**Description:** Proteome Discoverer results database
|
||||
**Typical Data:** SQLite database with search results
|
||||
**Use Cases:** Thermo Proteome Discoverer workflows
|
||||
**Python Libraries:**
|
||||
- `sqlite3`: Database access
|
||||
- Custom MSF parsers
|
||||
**EDA Approach:**
|
||||
- Database schema exploration
|
||||
- Peptide and protein tables
|
||||
- Score thresholds
|
||||
- Quantification data
|
||||
- Processing node information
|
||||
- Confidence levels
|
||||
|
||||
### .pdResult - Proteome Discoverer Result
|
||||
**Description:** Proteome Discoverer study results
|
||||
**Typical Data:** Comprehensive search and quantification
|
||||
**Use Cases:** PD study exports
|
||||
**Python Libraries:**
|
||||
- Vendor tools for conversion
|
||||
- Export to TSV for Python analysis
|
||||
**EDA Approach:**
|
||||
- Study design validation
|
||||
- Result filtering criteria
|
||||
- Quantitative comparison groups
|
||||
- Imputation strategies
|
||||
|
||||
### .pep.xml - Peptide Summary
|
||||
**Description:** Compact peptide identification format
|
||||
**Typical Data:** Peptide sequences, modifications, scores
|
||||
**Use Cases:** Downstream analysis input
|
||||
**Python Libraries:**
|
||||
- `pyteomics`: XML parsing
|
||||
**EDA Approach:**
|
||||
- Unique peptide counting
|
||||
- PTM site localization
|
||||
- Retention time predictability
|
||||
- Charge state preferences
|
||||
|
||||
## Quantitative Proteomics
|
||||
|
||||
### .sky - Skyline Document
|
||||
**Description:** Skyline targeted proteomics document
|
||||
**Typical Data:** Transition lists, chromatograms, results
|
||||
**Use Cases:** Targeted proteomics (SRM/MRM/PRM)
|
||||
**Python Libraries:**
|
||||
- `skyline`: Python API (limited)
|
||||
- Export to CSV for analysis
|
||||
**EDA Approach:**
|
||||
- Transition selection validation
|
||||
- Chromatographic peak quality
|
||||
- Interference detection
|
||||
- Retention time consistency
|
||||
- Calibration curve assessment
|
||||
- Replicate correlation
|
||||
- LOD/LOQ determination
|
||||
|
||||
### .sky.zip - Zipped Skyline Document
|
||||
**Description:** Skyline document with external files
|
||||
**Typical Data:** Complete Skyline analysis
|
||||
**Use Cases:** Sharing Skyline projects
|
||||
**Python Libraries:**
|
||||
- `zipfile`: Extract for processing
|
||||
**EDA Approach:**
|
||||
- Document structure
|
||||
- External file references
|
||||
- Result export and analysis
|
||||
|
||||
### .wiff - SCIEX WIFF Format
|
||||
**Description:** SCIEX instrument data with quantitation
|
||||
**Typical Data:** LC-MS/MS with MRM transitions
|
||||
**Use Cases:** SCIEX QTRAP, TripleTOF data
|
||||
**Python Libraries:**
|
||||
- Vendor tools (limited Python access)
|
||||
- Conversion to mzML
|
||||
**EDA Approach:**
|
||||
- MRM transition performance
|
||||
- Dwell time optimization
|
||||
- Cycle time analysis
|
||||
- Peak integration quality
|
||||
|
||||
### .raw (Thermo)
|
||||
**Description:** Thermo raw instrument file
|
||||
**Typical Data:** Full MS data from Orbitrap, Q Exactive
|
||||
**Use Cases:** Label-free and TMT quantification
|
||||
**Python Libraries:**
|
||||
- `pymsfilereader`: Thermo RawFileReader
|
||||
- `ThermoRawFileParser`: Cross-platform CLI
|
||||
**EDA Approach:**
|
||||
- MS1 and MS2 acquisition rates
|
||||
- AGC target and fill times
|
||||
- Resolution settings
|
||||
- Isolation window validation
|
||||
- SPS ion selection (TMT)
|
||||
- Contamination assessment
|
||||
|
||||
### .d (Agilent)
|
||||
**Description:** Agilent data directory
|
||||
**Typical Data:** LC-MS and GC-MS data
|
||||
**Use Cases:** Agilent instrument workflows
|
||||
**Python Libraries:**
|
||||
- Community parsers
|
||||
- Export to mzML
|
||||
**EDA Approach:**
|
||||
- Method consistency
|
||||
- Calibration status
|
||||
- Sequence run information
|
||||
- Retention time stability
|
||||
|
||||
## Metabolomics and Lipidomics
|
||||
|
||||
### .mzML (Metabolomics)
|
||||
**Description:** Standard MS format for metabolomics
|
||||
**Typical Data:** Full scan MS, targeted MS/MS
|
||||
**Use Cases:** Untargeted and targeted metabolomics
|
||||
**Python Libraries:**
|
||||
- Same as proteomics mzML tools
|
||||
**EDA Approach:**
|
||||
- Feature detection quality
|
||||
- Mass accuracy assessment
|
||||
- Retention time alignment
|
||||
- Blank subtraction
|
||||
- QC sample consistency
|
||||
- Isotope pattern validation
|
||||
- Adduct formation analysis
|
||||
- In-source fragmentation check
|
||||
|
||||
### .cdf / .netCDF - ANDI-MS
|
||||
**Description:** Analytical Data Interchange for MS
|
||||
**Typical Data:** GC-MS, LC-MS chromatography data
|
||||
**Use Cases:** Metabolomics, GC-MS workflows
|
||||
**Python Libraries:**
|
||||
- `netCDF4`: Low-level access
|
||||
- `pyopenms`: CDF support
|
||||
- `xcms` via R integration
|
||||
**EDA Approach:**
|
||||
- TIC and extracted ion chromatograms
|
||||
- Peak detection across samples
|
||||
- Retention index calculation
|
||||
- Mass spectral matching
|
||||
- Library search preparation
|
||||
|
||||
### .msp - Mass Spectral Format (NIST)
|
||||
**Description:** NIST spectral library format
|
||||
**Typical Data:** Reference mass spectra
|
||||
**Use Cases:** Metabolite identification, library matching
|
||||
**Python Libraries:**
|
||||
- `matchms`: Spectral matching
|
||||
- Custom MSP parsers
|
||||
**EDA Approach:**
|
||||
- Library coverage
|
||||
- Metadata completeness (InChI, SMILES)
|
||||
- Spectral quality metrics
|
||||
- Collision energy standardization
|
||||
- Precursor type annotation
|
||||
|
||||
### .mgf (Metabolomics)
|
||||
**Description:** Mascot Generic Format for MS/MS
|
||||
**Typical Data:** MS/MS spectra for metabolite ID
|
||||
**Use Cases:** Spectral library searching
|
||||
**Python Libraries:**
|
||||
- `matchms`: Metabolomics spectral analysis
|
||||
- `pyteomics.mgf`
|
||||
**EDA Approach:**
|
||||
- Spectrum quality filtering
|
||||
- Precursor isolation purity
|
||||
- Fragment m/z accuracy
|
||||
- Neutral loss patterns
|
||||
- MS/MS completeness
|
||||
|
||||
### .nmrML - NMR Markup Language
|
||||
**Description:** Standard XML format for NMR metabolomics
|
||||
**Typical Data:** 1D/2D NMR spectra with metadata
|
||||
**Use Cases:** NMR-based metabolomics
|
||||
**Python Libraries:**
|
||||
- `nmrml2isa`: Format conversion
|
||||
- Custom XML parsers
|
||||
**EDA Approach:**
|
||||
- Spectral quality metrics
|
||||
- Binning consistency
|
||||
- Reference compound validation
|
||||
- pH and temperature effects
|
||||
- Metabolite identification confidence
|
||||
|
||||
### .json (Metabolomics)
|
||||
**Description:** JSON format for metabolomics results
|
||||
**Typical Data:** Feature tables, annotations, metadata
|
||||
**Use Cases:** GNPS, MetaboAnalyst, web tools
|
||||
**Python Libraries:**
|
||||
- `json`: Standard library
|
||||
- `pandas`: JSON normalization
|
||||
**EDA Approach:**
|
||||
- Feature annotation coverage
|
||||
- GNPS clustering results
|
||||
- Molecular networking statistics
|
||||
- Adduct and in-source fragment linkage
|
||||
- Putative identification confidence
|
||||
|
||||
### .txt (Metabolomics Tables)
|
||||
**Description:** Tab-delimited feature tables
|
||||
**Typical Data:** m/z, RT, intensities across samples
|
||||
**Use Cases:** MZmine, XCMS, MS-DIAL output
|
||||
**Python Libraries:**
|
||||
- `pandas`: Text file reading
|
||||
**EDA Approach:**
|
||||
- Feature count and quality
|
||||
- Missing value imputation
|
||||
- Data normalization assessment
|
||||
- Batch correction validation
|
||||
- PCA and clustering for QC
|
||||
- Fold change calculations
|
||||
- Statistical test preparation
|
||||
|
||||
### .featureXML - OpenMS Feature Format
|
||||
**Description:** OpenMS detected features
|
||||
**Typical Data:** LC-MS features with quality scores
|
||||
**Use Cases:** OpenMS workflows
|
||||
**Python Libraries:**
|
||||
- `pyopenms`: FeatureXML support
|
||||
**EDA Approach:**
|
||||
- Feature detection parameters
|
||||
- Quality metrics per feature
|
||||
- Isotope pattern fitting
|
||||
- Charge state assignment
|
||||
- FWHM and asymmetry
|
||||
|
||||
### .consensusXML - OpenMS Consensus Features
|
||||
**Description:** Linked features across samples
|
||||
**Typical Data:** Aligned features with group info
|
||||
**Use Cases:** Multi-sample LC-MS analysis
|
||||
**Python Libraries:**
|
||||
- `pyopenms`: ConsensusXML reading
|
||||
**EDA Approach:**
|
||||
- Feature correspondence quality
|
||||
- Retention time alignment
|
||||
- Missing value patterns
|
||||
- Intensity normalization needs
|
||||
- Batch-wise feature agreement
|
||||
|
||||
### .idXML - OpenMS Identification Format
|
||||
**Description:** Peptide/metabolite identifications
|
||||
**Typical Data:** MS/MS identifications with scores
|
||||
**Use Cases:** OpenMS ID workflows
|
||||
**Python Libraries:**
|
||||
- `pyopenms`: IdXML support
|
||||
**EDA Approach:**
|
||||
- Identification rate
|
||||
- Score distribution
|
||||
- Spectral match quality
|
||||
- False discovery assessment
|
||||
- Annotation transfer validation
|
||||
|
||||
## Lipidomics-Specific Formats
|
||||
|
||||
### .lcb - LipidCreator Batch
|
||||
**Description:** LipidCreator transition list
|
||||
**Typical Data:** Lipid transitions for targeted MS
|
||||
**Use Cases:** Targeted lipidomics
|
||||
**Python Libraries:**
|
||||
- Export to CSV for processing
|
||||
**EDA Approach:**
|
||||
- Transition coverage per lipid class
|
||||
- Retention time prediction
|
||||
- Collision energy optimization
|
||||
- Class-specific fragmentation patterns
|
||||
|
||||
### .mzTab - Proteomics/Metabolomics Tabular Format
|
||||
**Description:** PSI tabular summary format
|
||||
**Typical Data:** Protein/peptide/metabolite quantification
|
||||
**Use Cases:** Publication and data sharing
|
||||
**Python Libraries:**
|
||||
- `pyteomics.mztab`
|
||||
- `pandas` for TSV-like structure
|
||||
**EDA Approach:**
|
||||
- Data completeness
|
||||
- Metadata section validation
|
||||
- Quantification method
|
||||
- Identification confidence
|
||||
- Software and parameters
|
||||
- Quality metrics summary
|
||||
|
||||
### .csv (LipidSearch, LipidMatch)
|
||||
**Description:** Lipid identification results
|
||||
**Typical Data:** Lipid annotations, grades, intensities
|
||||
**Use Cases:** Lipidomics software output
|
||||
**Python Libraries:**
|
||||
- `pandas`: CSV reading
|
||||
**EDA Approach:**
|
||||
- Lipid class distribution
|
||||
- Identification grade/confidence
|
||||
- Fatty acid composition analysis
|
||||
- Double bond and chain length patterns
|
||||
- Intensity correlations
|
||||
- Normalization to internal standards
|
||||
|
||||
### .sdf (Metabolomics)
|
||||
**Description:** Structure data file for metabolites
|
||||
**Typical Data:** Chemical structures with properties
|
||||
**Use Cases:** Metabolite database creation
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
|
||||
**EDA Approach:**
|
||||
- Structure validation
|
||||
- Property calculation (logP, MW, TPSA)
|
||||
- Molecular formula consistency
|
||||
- Tautomer enumeration
|
||||
- Retention time prediction features
|
||||
|
||||
### .mol (Metabolomics)
|
||||
**Description:** Single molecule structure files
|
||||
**Typical Data:** Metabolite chemical structure
|
||||
**Use Cases:** Structure-based searches
|
||||
**Python Libraries:**
|
||||
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
|
||||
**EDA Approach:**
|
||||
- Structure correctness
|
||||
- Stereochemistry validation
|
||||
- Charge state
|
||||
- Implicit hydrogen handling
|
||||
|
||||
## Data Processing and Analysis
|
||||
|
||||
### .h5 / .hdf5 (Omics)
|
||||
**Description:** HDF5 for large omics datasets
|
||||
**Typical Data:** Feature matrices, spectra, metadata
|
||||
**Use Cases:** Large-scale studies, cloud computing
|
||||
**Python Libraries:**
|
||||
- `h5py`: HDF5 access
|
||||
- `anndata`: For single-cell proteomics
|
||||
**EDA Approach:**
|
||||
- Dataset organization
|
||||
- Chunking and compression
|
||||
- Metadata structure
|
||||
- Efficient data access patterns
|
||||
- Sample and feature annotations
|
||||
|
||||
### .Rdata / .rds - R Objects
|
||||
**Description:** Serialized R analysis objects
|
||||
**Typical Data:** Processed omics results from R packages
|
||||
**Use Cases:** xcms, CAMERA, MSnbase workflows
|
||||
**Python Libraries:**
|
||||
- `pyreadr`: `pyreadr.read_r('file.Rdata')`
|
||||
- `rpy2`: R-Python integration
|
||||
**EDA Approach:**
|
||||
- Object structure exploration
|
||||
- Data extraction
|
||||
- Method parameter review
|
||||
- Conversion to Python-native formats
|
||||
|
||||
### .mzTab-M - Metabolomics mzTab
|
||||
**Description:** mzTab specific to metabolomics
|
||||
**Typical Data:** Small molecule quantification
|
||||
**Use Cases:** Metabolomics data sharing
|
||||
**Python Libraries:**
|
||||
- `pyteomics.mztab`: Can parse mzTab-M
|
||||
**EDA Approach:**
|
||||
- Small molecule evidence
|
||||
- Feature quantification
|
||||
- Database references (HMDB, KEGG, etc.)
|
||||
- Adduct and charge annotation
|
||||
- MS level information
|
||||
|
||||
### .parquet (Omics)
|
||||
**Description:** Columnar storage for large tables
|
||||
**Typical Data:** Feature matrices, metadata
|
||||
**Use Cases:** Efficient big data omics
|
||||
**Python Libraries:**
|
||||
- `pandas`: `pd.read_parquet()`
|
||||
- `pyarrow`: Direct parquet access
|
||||
**EDA Approach:**
|
||||
- Compression efficiency
|
||||
- Column-wise statistics
|
||||
- Partition structure
|
||||
- Schema validation
|
||||
- Fast filtering and aggregation
|
||||
|
||||
### .pkl (Omics Models)
|
||||
**Description:** Pickled Python objects
|
||||
**Typical Data:** ML models, processed data
|
||||
**Use Cases:** Workflow intermediate storage
|
||||
**Python Libraries:**
|
||||
- `pickle`: Standard serialization
|
||||
- `joblib`: Enhanced pickling
|
||||
**EDA Approach:**
|
||||
- Object type and structure
|
||||
- Model parameters
|
||||
- Feature importance (if ML model)
|
||||
- Data shapes and types
|
||||
- Deserialization validation
|
||||
|
||||
### .zarr (Omics)
|
||||
**Description:** Chunked, compressed array storage
|
||||
**Typical Data:** Multi-dimensional omics data
|
||||
**Use Cases:** Cloud-optimized analysis
|
||||
**Python Libraries:**
|
||||
- `zarr`: Array storage
|
||||
**EDA Approach:**
|
||||
- Chunk optimization
|
||||
- Compression codecs
|
||||
- Multi-scale data
|
||||
- Parallel access patterns
|
||||
- Metadata annotations
|
||||
@@ -0,0 +1,633 @@
|
||||
# Spectroscopy and Analytical Chemistry File Formats Reference
|
||||
|
||||
This reference covers file formats used in various spectroscopic techniques and analytical chemistry instrumentation.
|
||||
|
||||
## NMR Spectroscopy
|
||||
|
||||
### .fid - NMR Free Induction Decay
|
||||
**Description:** Raw time-domain NMR data from Bruker, Agilent, JEOL
|
||||
**Typical Data:** Complex time-domain signal
|
||||
**Use Cases:** NMR spectroscopy, structure elucidation
|
||||
**Python Libraries:**
|
||||
- `nmrglue`: `nmrglue.bruker.read_fid('fid')` or `nmrglue.varian.read_fid('fid')`
|
||||
- `nmrstarlib`: NMR data handling
|
||||
**EDA Approach:**
|
||||
- Time-domain signal decay
|
||||
- Sampling rate and acquisition time
|
||||
- Number of data points
|
||||
- Signal-to-noise ratio estimation
|
||||
- Baseline drift assessment
|
||||
- Digital filter effects
|
||||
- Acquisition parameter validation
|
||||
- Apodization function selection
|
||||
|
||||
### .ft / .ft1 / .ft2 - NMR Frequency Domain
|
||||
**Description:** Fourier-transformed NMR spectrum
|
||||
**Typical Data:** Processed frequency-domain data
|
||||
**Use Cases:** NMR analysis, peak integration
|
||||
**Python Libraries:**
|
||||
- `nmrglue`: Frequency domain reading
|
||||
- Custom processing pipelines
|
||||
**EDA Approach:**
|
||||
- Peak picking and integration
|
||||
- Chemical shift range
|
||||
- Baseline correction quality
|
||||
- Phase correction assessment
|
||||
- Reference peak identification
|
||||
- Spectral resolution
|
||||
- Artifacts detection
|
||||
- Multiplicity analysis
|
||||
|
||||
### .1r / .2rr - Bruker NMR Processed Data
|
||||
**Description:** Bruker processed spectrum (real part)
|
||||
**Typical Data:** 1D or 2D processed NMR spectra
|
||||
**Use Cases:** NMR data analysis with Bruker software
|
||||
**Python Libraries:**
|
||||
- `nmrglue`: Bruker format support
|
||||
**EDA Approach:**
|
||||
- Processing parameters review
|
||||
- Window function effects
|
||||
- Zero-filling assessment
|
||||
- Linear prediction validation
|
||||
- Spectral artifacts
|
||||
|
||||
### .dx - NMR JCAMP-DX
|
||||
**Description:** JCAMP-DX format for NMR
|
||||
**Typical Data:** Standardized NMR spectrum
|
||||
**Use Cases:** Data exchange between software
|
||||
**Python Libraries:**
|
||||
- `jcamp`: JCAMP reader
|
||||
- `nmrglue`: Can import JCAMP
|
||||
**EDA Approach:**
|
||||
- Format compliance
|
||||
- Metadata completeness
|
||||
- Peak table validation
|
||||
- Integration values
|
||||
- Compound identification info
|
||||
|
||||
### .mnova - Mnova Format
|
||||
**Description:** Mestrelab Research Mnova format
|
||||
**Typical Data:** NMR data with processing info
|
||||
**Use Cases:** Mnova software workflows
|
||||
**Python Libraries:**
|
||||
- `nmrglue`: Limited Mnova support
|
||||
- Conversion tools to standard formats
|
||||
**EDA Approach:**
|
||||
- Multi-spectrum handling
|
||||
- Processing pipeline review
|
||||
- Quantification data
|
||||
- Structure assignment
|
||||
|
||||
## Mass Spectrometry
|
||||
|
||||
### .mzML - Mass Spectrometry Markup Language
|
||||
**Description:** Standard XML-based MS format
|
||||
**Typical Data:** MS spectra, chromatograms, metadata
|
||||
**Use Cases:** Proteomics, metabolomics, lipidomics
|
||||
**Python Libraries:**
|
||||
- `pymzml`: `pymzml.run.Reader('file.mzML')`
|
||||
- `pyteomics.mzml`: `pyteomics.mzml.read('file.mzML')`
|
||||
- `MSFileReader`: Various wrappers
|
||||
**EDA Approach:**
|
||||
- Scan count and MS level distribution
|
||||
- Retention time range and TIC
|
||||
- m/z range and resolution
|
||||
- Precursor ion selection
|
||||
- Fragmentation patterns
|
||||
- Instrument configuration
|
||||
- Quality control metrics
|
||||
- Data completeness
|
||||
|
||||
### .mzXML - Mass Spectrometry XML
|
||||
**Description:** Legacy XML MS format
|
||||
**Typical Data:** Mass spectra and chromatograms
|
||||
**Use Cases:** Proteomics workflows (older)
|
||||
**Python Libraries:**
|
||||
- `pyteomics.mzxml`
|
||||
- `pymzml`: Can read mzXML
|
||||
**EDA Approach:**
|
||||
- Similar to mzML
|
||||
- Version compatibility
|
||||
- Conversion quality assessment
|
||||
|
||||
### .mzData - mzData Format
|
||||
**Description:** Legacy PSI MS format
|
||||
**Typical Data:** Mass spectrometry data
|
||||
**Use Cases:** Legacy data archives
|
||||
**Python Libraries:**
|
||||
- `pyteomics`: Limited support
|
||||
- Conversion to mzML recommended
|
||||
**EDA Approach:**
|
||||
- Format conversion validation
|
||||
- Data completeness
|
||||
- Metadata extraction
|
||||
|
||||
### .raw - Vendor Raw Files (Thermo, Agilent, Bruker)
|
||||
**Description:** Proprietary instrument data
|
||||
**Typical Data:** Raw mass spectra and metadata
|
||||
**Use Cases:** Direct instrument output
|
||||
**Python Libraries:**
|
||||
- `pymsfilereader`: Thermo RAW files
|
||||
- `ThermoRawFileParser`: CLI wrapper
|
||||
- Vendor-specific APIs
|
||||
**EDA Approach:**
|
||||
- Method parameter extraction
|
||||
- Instrument performance metrics
|
||||
- Calibration status
|
||||
- Scan function analysis
|
||||
- MS/MS quality metrics
|
||||
- Dynamic exclusion evaluation
|
||||
|
||||
### .d - Agilent Data Directory
|
||||
**Description:** Agilent MS data folder
|
||||
**Typical Data:** LC-MS, GC-MS with methods
|
||||
**Use Cases:** Agilent MassHunter workflows
|
||||
**Python Libraries:**
|
||||
- Community parsers
|
||||
- Chemstation integration
|
||||
**EDA Approach:**
|
||||
- Directory structure validation
|
||||
- Method parameters
|
||||
- Calibration curves
|
||||
- Sequence metadata
|
||||
- Signal quality metrics
|
||||
|
||||
### .wiff - AB SCIEX Data
|
||||
**Description:** AB SCIEX/SCIEX instrument format
|
||||
**Typical Data:** Mass spectrometry data
|
||||
**Use Cases:** SCIEX instrument workflows
|
||||
**Python Libraries:**
|
||||
- Vendor SDKs (limited Python support)
|
||||
- Conversion tools
|
||||
**EDA Approach:**
|
||||
- Experiment type identification
|
||||
- Scan properties
|
||||
- Quantitation data
|
||||
- Multi-experiment structure
|
||||
|
||||
### .mgf - Mascot Generic Format
|
||||
**Description:** Peak list format for MS/MS
|
||||
**Typical Data:** Precursor and fragment masses
|
||||
**Use Cases:** Peptide identification, database searches
|
||||
**Python Libraries:**
|
||||
- `pyteomics.mgf`: `pyteomics.mgf.read('file.mgf')`
|
||||
- `pyopenms`: MGF support
|
||||
**EDA Approach:**
|
||||
- Spectrum count
|
||||
- Charge state distribution
|
||||
- Precursor m/z and intensity
|
||||
- Fragment peak count
|
||||
- Mass accuracy
|
||||
- Title and metadata parsing
|
||||
|
||||
### .pkl - Peak List (Binary)
|
||||
**Description:** Binary peak list format
|
||||
**Typical Data:** Serialized MS/MS spectra
|
||||
**Use Cases:** Software-specific storage
|
||||
**Python Libraries:**
|
||||
- `pickle`: Standard deserialization
|
||||
- `pyteomics`: PKL support
|
||||
**EDA Approach:**
|
||||
- Data structure inspection
|
||||
- Conversion to standard formats
|
||||
- Metadata preservation
|
||||
|
||||
### .ms1 / .ms2 - MS1/MS2 Formats
|
||||
**Description:** Simple text format for MS data
|
||||
**Typical Data:** MS1 and MS2 scans
|
||||
**Use Cases:** Database searching, proteomics
|
||||
**Python Libraries:**
|
||||
- `pyteomics.ms1` and `ms2`
|
||||
- Simple text parsing
|
||||
**EDA Approach:**
|
||||
- Scan count by level
|
||||
- Retention time series
|
||||
- Charge state analysis
|
||||
- m/z range coverage
|
||||
|
||||
### .pepXML - Peptide XML
|
||||
**Description:** TPP peptide identification format
|
||||
**Typical Data:** Peptide-spectrum matches
|
||||
**Use Cases:** Proteomics search results
|
||||
**Python Libraries:**
|
||||
- `pyteomics.pepxml`
|
||||
**EDA Approach:**
|
||||
- Search result statistics
|
||||
- Score distribution
|
||||
- Modification analysis
|
||||
- FDR assessment
|
||||
- Enzyme specificity
|
||||
|
||||
### .protXML - Protein XML
|
||||
**Description:** TPP protein inference format
|
||||
**Typical Data:** Protein identifications
|
||||
**Use Cases:** Proteomics protein-level results
|
||||
**Python Libraries:**
|
||||
- `pyteomics.protxml`
|
||||
**EDA Approach:**
|
||||
- Protein group analysis
|
||||
- Coverage statistics
|
||||
- Confidence scoring
|
||||
- Parsimony analysis
|
||||
|
||||
### .msp - NIST MS Search Format
|
||||
**Description:** NIST spectral library format
|
||||
**Typical Data:** Reference mass spectra
|
||||
**Use Cases:** Spectral library searching
|
||||
**Python Libraries:**
|
||||
- `matchms`: Spectral library handling
|
||||
- Custom parsers
|
||||
**EDA Approach:**
|
||||
- Library size and coverage
|
||||
- Metadata completeness
|
||||
- Peak count statistics
|
||||
- Compound annotation quality
|
||||
|
||||
## Infrared and Raman Spectroscopy
|
||||
|
||||
### .spc - Galactic SPC
|
||||
**Description:** Thermo Galactic spectroscopy format
|
||||
**Typical Data:** IR, Raman, UV-Vis spectra
|
||||
**Use Cases:** Various spectroscopy instruments
|
||||
**Python Libraries:**
|
||||
- `spc`: `spc.File('file.spc')`
|
||||
- `specio`: Multi-format reader
|
||||
**EDA Approach:**
|
||||
- Wavenumber/wavelength range
|
||||
- Data point density
|
||||
- Multi-spectrum handling
|
||||
- Baseline characteristics
|
||||
- Peak identification
|
||||
- Absorbance/transmittance mode
|
||||
- Instrument information
|
||||
|
||||
### .spa - Thermo Nicolet
|
||||
**Description:** Thermo Fisher FTIR format
|
||||
**Typical Data:** FTIR spectra
|
||||
**Use Cases:** OMNIC software data
|
||||
**Python Libraries:**
|
||||
- Custom binary parsers
|
||||
- Conversion to JCAMP or SPC
|
||||
**EDA Approach:**
|
||||
- Interferogram vs spectrum
|
||||
- Background spectrum validation
|
||||
- Atmospheric compensation
|
||||
- Resolution and scan number
|
||||
- Sample information
|
||||
|
||||
### .0 - Bruker OPUS
|
||||
**Description:** Bruker OPUS FTIR format (numbered files)
|
||||
**Typical Data:** FTIR spectra and metadata
|
||||
**Use Cases:** Bruker FTIR instruments
|
||||
**Python Libraries:**
|
||||
- `brukeropusreader`: OPUS format parser
|
||||
- `specio`: OPUS support
|
||||
**EDA Approach:**
|
||||
- Multiple block types (AB, ScSm, etc.)
|
||||
- Sample and reference spectra
|
||||
- Instrument parameters
|
||||
- Optical path configuration
|
||||
- Beam splitter and detector info
|
||||
|
||||
### .dpt - Data Point Table
|
||||
**Description:** Simple XY data format
|
||||
**Typical Data:** Generic spectroscopic data
|
||||
**Use Cases:** Renishaw Raman, generic exports
|
||||
**Python Libraries:**
|
||||
- `pandas`: CSV-like reading
|
||||
- Text parsing
|
||||
**EDA Approach:**
|
||||
- X-axis type (wavelength, wavenumber, Raman shift)
|
||||
- Y-axis units (intensity, absorbance, etc.)
|
||||
- Data point spacing
|
||||
- Header information
|
||||
- Multi-column data handling
|
||||
|
||||
### .wdf - Renishaw Raman
|
||||
**Description:** Renishaw WiRE data format
|
||||
**Typical Data:** Raman spectra and maps
|
||||
**Use Cases:** Renishaw Raman microscopy
|
||||
**Python Libraries:**
|
||||
- `renishawWiRE`: WDF reader
|
||||
- Custom parsers for WDF format
|
||||
**EDA Approach:**
|
||||
- Spectral vs mapping data
|
||||
- Laser wavelength
|
||||
- Accumulation and exposure time
|
||||
- Spatial coordinates (mapping)
|
||||
- Z-scan data
|
||||
- Baseline and cosmic ray correction
|
||||
|
||||
### .txt (Spectroscopy)
|
||||
**Description:** Generic text export from instruments
|
||||
**Typical Data:** Wavelength/wavenumber and intensity
|
||||
**Use Cases:** Universal data exchange
|
||||
**Python Libraries:**
|
||||
- `pandas`: Text file reading
|
||||
- `numpy`: Simple array loading
|
||||
**EDA Approach:**
|
||||
- Delimiter and format detection
|
||||
- Header parsing
|
||||
- Units identification
|
||||
- Multiple spectrum handling
|
||||
- Metadata extraction from comments
|
||||
|
||||
## UV-Visible Spectroscopy
|
||||
|
||||
### .asd / .asc - ASD Binary/ASCII
|
||||
**Description:** ASD FieldSpec spectroradiometer
|
||||
**Typical Data:** Hyperspectral UV-Vis-NIR data
|
||||
**Use Cases:** Remote sensing, reflectance spectroscopy
|
||||
**Python Libraries:**
|
||||
- `spectral.io.asd`: ASD format support
|
||||
- Custom parsers
|
||||
**EDA Approach:**
|
||||
- Wavelength range (UV to NIR)
|
||||
- Reference spectrum validation
|
||||
- Dark current correction
|
||||
- Integration time
|
||||
- GPS metadata (if present)
|
||||
- Reflectance vs radiance
|
||||
|
||||
### .sp - Perkin Elmer
|
||||
**Description:** Perkin Elmer UV/Vis format
|
||||
**Typical Data:** UV-Vis spectrophotometer data
|
||||
**Use Cases:** PE Lambda instruments
|
||||
**Python Libraries:**
|
||||
- Custom parsers
|
||||
- Conversion to standard formats
|
||||
**EDA Approach:**
|
||||
- Scan parameters
|
||||
- Baseline correction
|
||||
- Multi-wavelength scans
|
||||
- Time-based measurements
|
||||
- Sample/reference handling
|
||||
|
||||
### .csv (Spectroscopy)
|
||||
**Description:** CSV export from UV-Vis instruments
|
||||
**Typical Data:** Wavelength and absorbance/transmittance
|
||||
**Use Cases:** Universal format for UV-Vis data
|
||||
**Python Libraries:**
|
||||
- `pandas`: Native CSV support
|
||||
**EDA Approach:**
|
||||
- Lambda max identification
|
||||
- Beer's law compliance
|
||||
- Baseline offset
|
||||
- Path length correction
|
||||
- Concentration calculations
|
||||
|
||||
## X-ray and Diffraction
|
||||
|
||||
### .cif - Crystallographic Information File
|
||||
**Description:** Crystal structure and diffraction data
|
||||
**Typical Data:** Unit cell, atomic positions, structure factors
|
||||
**Use Cases:** Crystallography, materials science
|
||||
**Python Libraries:**
|
||||
- `gemmi`: `gemmi.cif.read_file('file.cif')`
|
||||
- `PyCifRW`: CIF reading/writing
|
||||
- `pymatgen`: Materials structure analysis
|
||||
**EDA Approach:**
|
||||
- Crystal system and space group
|
||||
- Unit cell parameters
|
||||
- Atomic positions and occupancy
|
||||
- Thermal parameters
|
||||
- R-factors and refinement quality
|
||||
- Completeness and redundancy
|
||||
- Structure validation
|
||||
|
||||
### .hkl - Reflection Data
|
||||
**Description:** Miller indices and intensities
|
||||
**Typical Data:** Integrated diffraction intensities
|
||||
**Use Cases:** Crystallographic refinement
|
||||
**Python Libraries:**
|
||||
- Custom parsers (format dependent)
|
||||
- Crystallography packages (CCP4, etc.)
|
||||
**EDA Approach:**
|
||||
- Resolution range
|
||||
- Completeness by shell
|
||||
- I/sigma distribution
|
||||
- Systematic absences
|
||||
- Twinning detection
|
||||
- Wilson plot
|
||||
|
||||
### .mtz - MTZ Format (CCP4)
|
||||
**Description:** Binary crystallographic data
|
||||
**Typical Data:** Reflections, phases, structure factors
|
||||
**Use Cases:** Macromolecular crystallography
|
||||
**Python Libraries:**
|
||||
- `gemmi`: MTZ support
|
||||
- `cctbx`: Comprehensive crystallography
|
||||
**EDA Approach:**
|
||||
- Column types and data
|
||||
- Resolution limits
|
||||
- R-factors (Rwork, Rfree)
|
||||
- Phase probability distribution
|
||||
- Map coefficients
|
||||
- Batch information
|
||||
|
||||
### .xy / .xye - Powder Diffraction
|
||||
**Description:** 2-theta vs intensity data
|
||||
**Typical Data:** Powder X-ray diffraction patterns
|
||||
**Use Cases:** Phase identification, Rietveld refinement
|
||||
**Python Libraries:**
|
||||
- `pandas`: Simple XY reading
|
||||
- `pymatgen`: XRD pattern analysis
|
||||
**EDA Approach:**
|
||||
- 2-theta range
|
||||
- Peak positions and intensities
|
||||
- Background modeling
|
||||
- Peak width analysis (strain/size)
|
||||
- Phase identification via matching
|
||||
- Preferred orientation effects
|
||||
|
||||
### .raw (XRD)
|
||||
**Description:** Vendor-specific XRD raw data
|
||||
**Typical Data:** XRD patterns with metadata
|
||||
**Use Cases:** Bruker, PANalytical, Rigaku instruments
|
||||
**Python Libraries:**
|
||||
- Vendor-specific parsers
|
||||
- Conversion tools
|
||||
**EDA Approach:**
|
||||
- Scan parameters (step size, time)
|
||||
- Sample alignment
|
||||
- Incident beam setup
|
||||
- Detector configuration
|
||||
- Background scan validation
|
||||
|
||||
### .gsa / .gsas - GSAS Format
|
||||
**Description:** General Structure Analysis System
|
||||
**Typical Data:** Powder diffraction for Rietveld
|
||||
**Use Cases:** Rietveld refinement
|
||||
**Python Libraries:**
|
||||
- GSAS-II Python interface
|
||||
- Custom parsers
|
||||
**EDA Approach:**
|
||||
- Histogram data
|
||||
- Instrument parameters
|
||||
- Phase information
|
||||
- Refinement constraints
|
||||
- Profile function parameters
|
||||
|
||||
## Electron Spectroscopy
|
||||
|
||||
### .vms - VG Scienta
|
||||
**Description:** VG Scienta spectrometer format
|
||||
**Typical Data:** XPS, UPS, ARPES spectra
|
||||
**Use Cases:** Photoelectron spectroscopy
|
||||
**Python Libraries:**
|
||||
- Custom parsers for VMS
|
||||
- `specio`: Multi-format support
|
||||
**EDA Approach:**
|
||||
- Binding energy calibration
|
||||
- Pass energy and resolution
|
||||
- Photoelectron line identification
|
||||
- Satellite peak analysis
|
||||
- Background subtraction quality
|
||||
- Fermi edge position
|
||||
|
||||
### .spe - WinSpec/SPE Format
|
||||
**Description:** Princeton Instruments/Roper Scientific
|
||||
**Typical Data:** CCD spectra, Raman, PL
|
||||
**Use Cases:** Spectroscopy with CCD detectors
|
||||
**Python Libraries:**
|
||||
- `spe2py`: SPE file reader
|
||||
- `spe_loader`: Alternative parser
|
||||
**EDA Approach:**
|
||||
- CCD frame analysis
|
||||
- Wavelength calibration
|
||||
- Dark frame subtraction
|
||||
- Cosmic ray identification
|
||||
- Readout noise
|
||||
- Accumulation statistics
|
||||
|
||||
### .pxt - Princeton PTI
|
||||
**Description:** Photon Technology International
|
||||
**Typical Data:** Fluorescence, phosphorescence spectra
|
||||
**Use Cases:** Fluorescence spectroscopy
|
||||
**Python Libraries:**
|
||||
- Custom parsers
|
||||
- Text-based format variants
|
||||
**EDA Approach:**
|
||||
- Excitation and emission spectra
|
||||
- Quantum yield calculations
|
||||
- Time-resolved measurements
|
||||
- Temperature-dependent data
|
||||
- Correction factors applied
|
||||
|
||||
### .dat (Spectroscopy Generic)
|
||||
**Description:** Generic binary or text spectroscopy data
|
||||
**Typical Data:** Various spectroscopic measurements
|
||||
**Use Cases:** Many instruments use .dat extension
|
||||
**Python Libraries:**
|
||||
- Format-specific identification needed
|
||||
- `numpy`, `pandas` for known formats
|
||||
**EDA Approach:**
|
||||
- Format detection (binary vs text)
|
||||
- Header identification
|
||||
- Data structure inference
|
||||
- Units and axis labels
|
||||
- Instrument signature detection
|
||||
|
||||
## Chromatography
|
||||
|
||||
### .chrom - Chromatogram Data
|
||||
**Description:** Generic chromatography format
|
||||
**Typical Data:** Retention time vs signal
|
||||
**Use Cases:** HPLC, GC, LC-MS
|
||||
**Python Libraries:**
|
||||
- Vendor-specific parsers
|
||||
- `pandas` for text exports
|
||||
**EDA Approach:**
|
||||
- Retention time range
|
||||
- Peak detection and integration
|
||||
- Baseline drift
|
||||
- Resolution between peaks
|
||||
- Signal-to-noise ratio
|
||||
- Tailing factor
|
||||
|
||||
### .ch - ChemStation
|
||||
**Description:** Agilent ChemStation format
|
||||
**Typical Data:** Chromatograms and method parameters
|
||||
**Use Cases:** Agilent HPLC and GC systems
|
||||
**Python Libraries:**
|
||||
- `agilent-chemstation`: Community tools
|
||||
- Binary format parsers
|
||||
**EDA Approach:**
|
||||
- Method validation
|
||||
- Integration parameters
|
||||
- Calibration curve
|
||||
- Sample sequence information
|
||||
- Instrument status
|
||||
|
||||
### .arw - Empower (Waters)
|
||||
**Description:** Waters Empower format
|
||||
**Typical Data:** UPLC/HPLC chromatograms
|
||||
**Use Cases:** Waters instrument data
|
||||
**Python Libraries:**
|
||||
- Vendor tools (limited Python access)
|
||||
- Database extraction tools
|
||||
**EDA Approach:**
|
||||
- Audit trail information
|
||||
- Processing methods
|
||||
- Compound identification
|
||||
- Quantitation results
|
||||
- System suitability tests
|
||||
|
||||
### .lcd - Shimadzu LabSolutions
|
||||
**Description:** Shimadzu chromatography format
|
||||
**Typical Data:** GC/HPLC data
|
||||
**Use Cases:** Shimadzu instruments
|
||||
**Python Libraries:**
|
||||
- Vendor-specific parsers
|
||||
**EDA Approach:**
|
||||
- Method parameters
|
||||
- Peak purity analysis
|
||||
- Spectral data (if PDA)
|
||||
- Quantitative results
|
||||
|
||||
## Other Analytical Techniques
|
||||
|
||||
### .dta - DSC/TGA Data
|
||||
**Description:** Thermal analysis data (TA Instruments)
|
||||
**Typical Data:** Temperature vs heat flow or mass
|
||||
**Use Cases:** Differential scanning calorimetry, thermogravimetry
|
||||
**Python Libraries:**
|
||||
- Custom parsers for TA formats
|
||||
- `pandas` for exported data
|
||||
**EDA Approach:**
|
||||
- Transition temperature identification
|
||||
- Enthalpy calculations
|
||||
- Mass loss steps
|
||||
- Heating rate effects
|
||||
- Baseline determination
|
||||
- Purity assessment
|
||||
|
||||
### .run - ICP-MS/ICP-OES
|
||||
**Description:** Elemental analysis data
|
||||
**Typical Data:** Element concentrations or counts
|
||||
**Use Cases:** Inductively coupled plasma MS/OES
|
||||
**Python Libraries:**
|
||||
- Vendor-specific tools
|
||||
- Custom parsers
|
||||
**EDA Approach:**
|
||||
- Element detection and quantitation
|
||||
- Internal standard performance
|
||||
- Spike recovery
|
||||
- Dilution factor corrections
|
||||
- Isotope ratios
|
||||
- LOD/LOQ calculations
|
||||
|
||||
### .exp - Electrochemistry Data
|
||||
**Description:** Electrochemical experiment data
|
||||
**Typical Data:** Potential vs current or charge
|
||||
**Use Cases:** Cyclic voltammetry, chronoamperometry
|
||||
**Python Libraries:**
|
||||
- Custom parsers per instrument (CHI, Gamry, etc.)
|
||||
- `galvani`: Biologic EC-Lab files
|
||||
**EDA Approach:**
|
||||
- Redox peak identification
|
||||
- Peak potential and current
|
||||
- Scan rate effects
|
||||
- Electron transfer kinetics
|
||||
- Background subtraction
|
||||
- Capacitance calculations
|
||||
Reference in New Issue
Block a user