Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

# General Scientific Data Formats Reference
This reference covers general-purpose scientific data formats used across multiple disciplines.
## Numerical and Array Data
### .npy - NumPy Array
**Description:** Binary NumPy array format
**Typical Data:** N-dimensional arrays of any data type
**Use Cases:** Fast I/O for numerical data, intermediate results
**Python Libraries:**
- `numpy`: `np.load('file.npy')`, `np.save()`
- Memory-mapped access: `np.load('file.npy', mmap_mode='r')`
**EDA Approach:**
- Array shape and dimensionality
- Data type and precision
- Statistical summary (mean, std, min, max, percentiles)
- Missing or invalid values (NaN, inf)
- Memory footprint
- Value distribution and histogram
- Sparsity analysis
- Correlation structure (if 2D)
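The checks above can be sketched as a single summary function. This is a minimal sketch, assuming a numeric array; `summarize_array` and the demo file name are illustrative names, not part of NumPy.

```python
import os
import tempfile

import numpy as np


def summarize_array(arr):
    """Return a small EDA summary for a numeric NumPy array."""
    finite = np.isfinite(arr)
    values = np.asarray(arr)[finite]  # statistics ignore NaN and inf
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "nbytes": arr.nbytes,
        "n_nan": int(np.isnan(arr).sum()),
        "n_inf": int(np.isinf(arr).sum()),
        "mean": float(values.mean()),
        "min": float(values.min()),
        "max": float(values.max()),
        "zero_fraction": float((arr == 0).mean()),  # crude sparsity measure
    }


# Demo on a throwaway file (made-up values, not from a real dataset).
demo = np.array([[0.0, 1.0, 2.0], [np.nan, np.inf, 3.0]])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npy")
    np.save(path, demo)
    loaded = np.load(path, mmap_mode="r")  # memory-mapped, read-only
    summary = summarize_array(loaded)
    del loaded  # release the memory map before the directory is removed
```

For arrays too large for memory, the same function could be applied to slices of the memory-mapped array rather than to the whole array at once.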
### .npz - Compressed NumPy Archive
**Description:** ZIP archive holding multiple NumPy arrays; compressed only when written with `np.savez_compressed()`
**Typical Data:** Collections of related arrays
**Use Cases:** Saving multiple arrays together, compressed storage
**Python Libraries:**
- `numpy`: `np.load('file.npz')` returns dict-like object
- `np.savez()` or `np.savez_compressed()`
**EDA Approach:**
- List of contained arrays
- Individual array analysis
- Relationships between arrays
- Total file size and compression ratio
- Naming conventions
- Data consistency checks
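As a rough illustration of listing the contained arrays and estimating the compression ratio, the sketch below writes a small archive and inspects it. The helper `describe_npz` and the array names `signal` and `labels` are assumptions made for the example.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """List every array in an .npz archive and estimate the compression ratio."""
    with np.load(path) as archive:  # arrays are read lazily, on access
        info = {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
        raw_bytes = sum(archive[name].nbytes for name in archive.files)
    # Ratio of in-memory size to on-disk size (includes ZIP overhead).
    return info, raw_bytes / os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npz")
    np.savez_compressed(path, signal=np.zeros(10_000), labels=np.arange(5))
    npz_info, npz_ratio = describe_npz(path)
```

A ratio close to 1 suggests the archive was written with `np.savez()` or that the data are hard to compress.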
### .csv - Comma-Separated Values
**Description:** Plain text tabular data
**Typical Data:** Experimental measurements, results tables
**Use Cases:** Universal data exchange, spreadsheet export
**Python Libraries:**
- `pandas`: `pd.read_csv('file.csv')`
- `csv`: Built-in module
- `polars`: High-performance CSV reading
- `numpy`: `np.loadtxt()` or `np.genfromtxt()`
**EDA Approach:**
- Row and column counts
- Data type inference
- Missing value patterns and frequency
- Column statistics (numeric: mean, std; categorical: frequencies)
- Outlier detection
- Correlation matrix
- Duplicate row detection
- Header and index validation
- Encoding issues detection
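A first pass over a CSV file can be done with the built-in `csv` module alone. The sketch below is one possible profile, assuming a header row and treating empty strings, `NA`, and `NaN` as missing; `profile_csv` and the sample columns are illustrative.

```python
import csv
import io
from collections import Counter


def profile_csv(text, missing_tokens=("", "NA", "NaN")):
    """Count rows, columns, missing cells per column, and duplicate rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [tuple(row) for row in reader]
    missing = Counter()
    for row in rows:
        for name, cell in zip(header, row):
            if cell.strip() in missing_tokens:
                missing[name] += 1
    # Each repeated row counts once for every extra copy.
    duplicates = sum(n - 1 for n in Counter(rows).values() if n > 1)
    # Rows whose field count differs from the header hint at quoting problems.
    ragged = sum(1 for row in rows if len(row) != len(header))
    return {
        "n_rows": len(rows),
        "n_cols": len(header),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
        "ragged_rows": ragged,
    }


sample = "id,temp,site\n1,20.5,A\n2,,B\n2,,B\n3,NA,C\n"
csv_profile = profile_csv(sample)
```

For column statistics, outliers, and correlations, `pd.read_csv()` followed by `DataFrame.describe()` is usually the more convenient route.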
### .tsv / .tab - Tab-Separated Values
**Description:** Tab-delimited tabular data
**Typical Data:** Similar to CSV but tab-separated
**Use Cases:** Bioinformatics, text processing output
**Python Libraries:**
- `pandas`: `pd.read_csv('file.tsv', sep='\t')`
**EDA Approach:**
- Same as CSV format
- Tab vs space validation
- Quote handling
### .xlsx / .xls - Excel Spreadsheets
**Description:** Microsoft Excel binary/XML formats
**Typical Data:** Tabular data with formatting, formulas
**Use Cases:** Lab notebooks, data entry, reports
**Python Libraries:**
- `pandas`: `pd.read_excel('file.xlsx')`
- `openpyxl`: Full Excel file manipulation
- `xlrd`: Reading .xls (legacy)
**EDA Approach:**
- Sheet enumeration and names
- Per-sheet data analysis
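- Formulas vs cached values (`openpyxl` returns formulas, or their last cached results with `data_only=True`; it does not evaluate them)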
- Merged cells handling
- Hidden rows/columns
- Data validation rules
- Named ranges
- Formatting-only cells detection
### .json - JavaScript Object Notation
**Description:** Hierarchical text data format
**Typical Data:** Nested data structures, metadata
**Use Cases:** API responses, configuration, results
**Python Libraries:**
- `json`: Built-in module
- `pandas`: `pd.read_json()`
- `ujson`: Faster JSON parsing
**EDA Approach:**
- Schema inference
- Nesting depth
- Key-value distribution
- Array lengths
- Data type consistency
- Missing keys
- Duplicate detection
- Size and complexity metrics
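Nesting depth and missing keys can be measured with a short recursive walk. This is a sketch under the assumption that the document is a list of records; `nesting_depth` and `key_coverage` are hypothetical helper names.

```python
import json
from collections import Counter


def nesting_depth(obj):
    """Depth of nested dicts and lists; a scalar has depth 0."""
    if isinstance(obj, dict):
        return 1 + max((nesting_depth(v) for v in obj.values()), default=0)
    if isinstance(obj, list):
        return 1 + max((nesting_depth(v) for v in obj), default=0)
    return 0


def key_coverage(records):
    """For a list of dicts, count how many records contain each key."""
    return Counter(key for rec in records for key in rec)


doc = json.loads(
    '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "meta": {"x": {"y": 1}}}]'
)
json_depth = nesting_depth(doc)
json_keys = key_coverage(doc)
# Keys that are absent from at least one record, with the number of absences.
missing_keys = {k: len(doc) - n for k, n in json_keys.items() if n < len(doc)}
```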
### .xml - Extensible Markup Language
**Description:** Hierarchical markup format
**Typical Data:** Structured data with metadata
**Use Cases:** Standards-based data exchange, APIs
**Python Libraries:**
- `lxml`: `lxml.etree.parse()`
- `xml.etree.ElementTree`: Built-in XML
- `xmltodict`: Convert XML to dict
**EDA Approach:**
- Schema/DTD validation
- Element hierarchy and depth
- Namespace handling
- Attribute vs element content
- CDATA sections
- Text content extraction
- Sibling and child counts
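Element hierarchy, depth, and attribute usage can be collected in one traversal with the built-in `xml.etree.ElementTree`. The sketch below uses a made-up document; `xml_profile` is an illustrative helper, and namespaces and schema validation are out of its scope.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def xml_profile(root):
    """Return (max depth, tag counts, attribute counts) for an element tree."""
    tags, attrs = Counter(), Counter()

    def walk(elem, depth):
        tags[elem.tag] += 1
        attrs.update(elem.attrib.keys())
        # Deepest level reached in this subtree, counting the root as 1.
        return max([depth] + [walk(child, depth + 1) for child in elem])

    return walk(root, 1), tags, attrs


root = ET.fromstring(
    "<run id='r1'><sample id='s1'><value unit='K'>293</value></sample>"
    "<sample id='s2'/></run>"
)
xml_depth, xml_tags, xml_attrs = xml_profile(root)
```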
### .yaml / .yml - YAML
**Description:** Human-readable data serialization
**Typical Data:** Configuration, metadata, parameters
**Use Cases:** Experiment configurations, pipelines
**Python Libraries:**
- `yaml` (PyYAML): `yaml.safe_load()`; `yaml.load()` requires an explicit `Loader` and is unsafe on untrusted input
- `ruamel.yaml`: YAML 1.2 support
**EDA Approach:**
- Configuration structure
- Data type handling
- List and dict depth
- Anchor and alias usage
- Multi-document files
- Comment preservation (PyYAML discards comments; `ruamel.yaml` keeps them in round-trip mode)
- Validation against schema
### .toml - TOML Configuration
**Description:** Configuration file format
**Typical Data:** Settings, parameters
**Use Cases:** Python package configuration, settings
**Python Libraries:**
- `tomli` / `tomllib`: TOML reading (tomllib in Python 3.11+)
- `toml`: Reading and writing
**EDA Approach:**
- Section structure
- Key-value pairs
- Data type inference
- Nested table validation
- Required vs optional fields
### .ini - INI Configuration
**Description:** Simple configuration format
**Typical Data:** Application settings
**Use Cases:** Legacy configurations, simple settings
**Python Libraries:**
- `configparser`: Built-in INI parser
**EDA Approach:**
- Section enumeration
- Key-value extraction
- Type conversion
- Comment handling
- Case sensitivity
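The points on type conversion and case sensitivity are easiest to see in a small example. The sketch below parses a made-up configuration with `configparser`; the section and option names are assumptions.

```python
import configparser

INI_TEXT = """
[Instrument]
SampleRate = 1000
Enabled = yes
; comment lines are skipped by the parser
[instrument.cal]
gain = 1.5
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

sections = parser.sections()                       # section names keep their case
keys = list(parser["Instrument"])                  # option names are lower-cased
rate = parser.getint("Instrument", "SampleRate")   # option lookup ignores case
enabled = parser.getboolean("Instrument", "enabled")
gain = parser.getfloat("instrument.cal", "gain")
```

All values are stored as strings; the `getint`, `getfloat`, and `getboolean` accessors perform the conversion. Setting `parser.optionxform = str` before reading keeps option names case-sensitive.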
## Binary and Compressed Data
### .hdf5 / .h5 - Hierarchical Data Format 5
**Description:** Container for large scientific datasets
**Typical Data:** Multi-dimensional arrays, metadata, groups
**Use Cases:** Large datasets, multi-modal data, parallel I/O
**Python Libraries:**
- `h5py`: `h5py.File('file.h5', 'r')`
- `tables` (PyTables): Advanced HDF5 interface
- `pandas`: HDF5 storage via HDFStore
**EDA Approach:**
- Group and dataset hierarchy
- Dataset shapes and dtypes
- Attributes and metadata
- Compression and chunking strategy
- Memory-efficient sampling
- Dataset relationships
- File size and efficiency
- Access patterns optimization
### .zarr - Chunked Array Storage
**Description:** Cloud-optimized chunked arrays
**Typical Data:** Large N-dimensional arrays
**Use Cases:** Cloud storage, parallel computing, streaming
**Python Libraries:**
- `zarr`: `zarr.open('file.zarr')`
- `xarray`: Zarr backend support
**EDA Approach:**
- Array metadata and dimensions
- Chunk size optimization
- Compression codec and ratio
- Synchronizer and store type
- Multi-scale hierarchies
- Parallel access performance
- Attribute metadata
### .gz / .gzip - Gzip Compressed
**Description:** A single data stream compressed with DEFLATE
**Typical Data:** Any compressed text or binary
**Use Cases:** Compression for storage/transfer
**Python Libraries:**
- `gzip`: Built-in gzip module
- `pandas`: Automatic gzip handling in read functions
**EDA Approach:**
- Compression ratio
- Original file type detection
- Decompression validation
- Header information
- Multi-member archives
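Compression ratio and decompression validation can be checked together, as in the sketch below. It assumes the file fits in memory; `gzip_report` and the demo file name are illustrative.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def gzip_report(path):
    """Check the magic bytes, then decompress fully to validate and measure."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    with gzip.open(path, "rb") as fh:  # raises an error on corrupt data
        raw_size = len(fh.read())
    return {
        "is_gzip": is_gzip,
        "raw_size": raw_size,
        "ratio": raw_size / os.path.getsize(path),
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt.gz")
    with gzip.open(path, "wb") as fh:
        fh.write(b"ACGT" * 5000)  # repetitive demo payload
    gz_report = gzip_report(path)
```

For large files, reading in fixed-size chunks inside the `gzip.open` block avoids holding the whole decompressed payload in memory.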
### .bz2 - Bzip2 Compressed
**Description:** Bzip2 compression
**Typical Data:** Highly compressed files
**Use Cases:** Storage where a higher compression ratio than gzip is worth slower compression and decompression
**Python Libraries:**
- `bz2`: Built-in bz2 module
- `pandas`: Automatic bz2 handling in read functions
**EDA Approach:**
- Compression efficiency
- Decompression time
- Content validation
### .zip - ZIP Archive
**Description:** Archive with multiple files
**Typical Data:** Collections of files
**Use Cases:** File distribution, archiving
**Python Libraries:**
- `zipfile`: Built-in ZIP support
- `pandas`: Can read zipped CSVs
**EDA Approach:**
- Archive member listing
- Compression method per file
- Total vs compressed size
- Directory structure
- File type distribution
- Extraction validation
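The member listing, per-file compression method, and integrity check map directly onto `zipfile`. The sketch below builds a small archive in memory with made-up member names and then inspects it without extracting anything.

```python
import io
import zipfile

# Build a small archive in memory (hypothetical member names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/run1.csv", "x,y\n" + "1,2\n" * 1000)
    zf.writestr("README.txt", "demo archive")

with zipfile.ZipFile(buffer) as zf:
    first_bad = zf.testzip()  # None means every member passed its CRC check
    members = [
        {
            "name": info.filename,
            "size": info.file_size,
            "compressed": info.compress_size,
            "deflated": info.compress_type == zipfile.ZIP_DEFLATED,
        }
        for info in zf.infolist()
    ]

total_size = sum(m["size"] for m in members)
total_compressed = sum(m["compressed"] for m in members)
```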
### .tar / .tar.gz - TAR Archive
**Description:** Unix tape archive
**Typical Data:** Multiple files and directories
**Use Cases:** Software distribution, backups
**Python Libraries:**
- `tarfile`: Built-in TAR support
**EDA Approach:**
- Member file listing
- Compression (if .tar.gz, .tar.bz2)
- Directory structure
- Permissions preservation
- Extraction testing
## Time Series and Waveform Data
### .wav - Waveform Audio
**Description:** Audio waveform data
**Typical Data:** Acoustic signals, audio recordings
**Use Cases:** Acoustic analysis, ultrasound, signal processing
**Python Libraries:**
- `scipy.io.wavfile`: `scipy.io.wavfile.read()`
- `wave`: Built-in module
- `soundfile`: Enhanced audio I/O
**EDA Approach:**
- Sample rate and duration
- Bit depth and channels
- Amplitude distribution
- Spectral analysis (FFT)
- Signal-to-noise ratio
- Clipping detection
- Frequency content
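Several of these checks fit in a few lines with the built-in `wave` module and NumPy. The sketch below synthesizes a 440 Hz tone as stand-in data, assuming 16-bit mono PCM; real recordings may use other sample widths or channel counts.

```python
import io
import wave

import numpy as np

RATE = 8000  # samples per second (assumed for the demo)
t = np.arange(RATE) / RATE  # one second of audio
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype("<i2")

# Write a mono 16-bit WAV into memory, then read it back.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(RATE)
    wf.writeframes(tone.tobytes())

buffer.seek(0)
with wave.open(buffer, "rb") as wf:
    rate, channels, width = wf.getframerate(), wf.getnchannels(), wf.getsampwidth()
    n_frames = wf.getnframes()
    samples = np.frombuffer(wf.readframes(n_frames), dtype="<i2")

duration = n_frames / rate
# Samples at or beyond the int16 limit indicate clipping.
clipped = int(np.sum(np.abs(samples.astype(np.int32)) >= 32767))
# Dominant frequency from the magnitude spectrum of the real FFT.
spectrum = np.abs(np.fft.rfft(samples))
peak_hz = float(np.fft.rfftfreq(len(samples), d=1 / rate)[np.argmax(spectrum)])
```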
### .mat - MATLAB Data
**Description:** MATLAB workspace variables
**Typical Data:** Arrays, structures, cells
**Use Cases:** MATLAB-Python interoperability
**Python Libraries:**
- `scipy.io`: `scipy.io.loadmat()` (MATLAB v4 through v7.2 files)
- `h5py`: For MATLAB v7.3 files (HDF5-based)
- `mat73`: Loads v7.3 files into Python dicts (built on `h5py`)
**EDA Approach:**
- Variable names and types
- Array dimensions
- Structure field exploration
- Cell array handling
- Sparse matrix detection
- MATLAB version compatibility
- Metadata extraction
### .edf - European Data Format
**Description:** Time series data (especially medical)
**Typical Data:** EEG, physiological signals
**Use Cases:** Medical signal storage
**Python Libraries:**
- `pyedflib`: EDF/EDF+ reading and writing
- `mne`: Neurophysiology data (supports EDF)
**EDA Approach:**
- Signal count and names
- Sampling frequencies
- Signal ranges and units
- Recording duration
- Annotation events
- Data quality (saturation, noise)
- Patient/study information
### .csv (Time Series)
**Description:** CSV with timestamp column
**Typical Data:** Time-indexed measurements
**Use Cases:** Sensor data, monitoring, experiments
**Python Libraries:**
- `pandas`: `pd.read_csv()` with `parse_dates`
**EDA Approach:**
- Temporal range and resolution
- Sampling regularity
- Missing time points
- Trend and seasonality
- Stationarity tests
- Autocorrelation
- Anomaly detection
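Sampling regularity and missing time points can be estimated from the timestamp differences. The sketch below uses only the standard library and assumes ISO 8601 timestamps in a column named `timestamp`; both the column name and `sampling_report` are illustrative.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def sampling_report(text, time_column="timestamp"):
    """Infer the dominant sampling step and list the gaps that exceed it."""
    rows = csv.DictReader(io.StringIO(text))
    times = sorted(datetime.fromisoformat(row[time_column]) for row in rows)
    steps = [b - a for a, b in zip(times, times[1:])]
    step = Counter(steps).most_common(1)[0][0]  # most frequent interval
    gaps = [(a, b) for a, b in zip(times, times[1:]) if b - a > step]
    # Number of expected samples that fall inside the gaps.
    missing = sum((b - a) // step - 1 for a, b in gaps)
    return {"start": times[0], "end": times[-1], "step": step,
            "gaps": gaps, "missing_points": missing}


sample = (
    "timestamp,value\n"
    "2024-01-01T00:00:00,1.0\n"
    "2024-01-01T00:10:00,1.1\n"
    "2024-01-01T00:20:00,1.3\n"
    "2024-01-01T00:50:00,1.2\n"
    "2024-01-01T01:00:00,1.4\n"
)
ts_report = sampling_report(sample)
```

With pandas, a comparable check is `pd.read_csv(..., parse_dates=[...])` followed by `Series.diff()` on the timestamp column.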
## Geospatial and Environmental Data
### .shp - Shapefile
**Description:** Geospatial vector data
**Typical Data:** Geographic features (points, lines, polygons)
**Use Cases:** GIS analysis, spatial data
**Python Libraries:**
- `geopandas`: `gpd.read_file('file.shp')`
- `fiona`: Lower-level shapefile access
- `pyshp`: Pure Python shapefile reader
**EDA Approach:**
- Geometry type and count
- Coordinate reference system
- Bounding box
- Attribute table analysis
- Geometry validity
- Spatial distribution
- Multi-part features
- Associated files (.shx, .dbf, .prj)
### .geojson - GeoJSON
**Description:** JSON format for geographic data
**Typical Data:** Features with geometry and properties
**Use Cases:** Web mapping, spatial analysis
**Python Libraries:**
- `geopandas`: Native GeoJSON support
- `json`: Parse as JSON then process
**EDA Approach:**
- Feature count and types
- CRS specification (RFC 7946 requires WGS 84 longitude/latitude; legacy files may carry a `crs` member)
- Bounding box calculation
- Property schema
- Geometry complexity
- Nesting structure
### .tif / .tiff (Geospatial)
**Description:** GeoTIFF with spatial reference
**Typical Data:** Satellite imagery, DEMs, rasters
**Use Cases:** Remote sensing, terrain analysis
**Python Libraries:**
- `rasterio`: `rasterio.open('file.tif')`
- `gdal`: Geospatial Data Abstraction Library (imported as `from osgeo import gdal`)
- `xarray` with `rioxarray`: N-D geospatial arrays
**EDA Approach:**
- Raster dimensions and resolution
- Band count and descriptions
- Coordinate reference system
- Geotransform parameters
- NoData value handling
- Pixel value distribution
- Histogram analysis
- Overviews and pyramids
### .nc / .netcdf - Network Common Data Form
**Description:** Self-describing array-based data
**Typical Data:** Climate, atmospheric, oceanographic data
**Use Cases:** Scientific datasets, model output
**Python Libraries:**
- `netCDF4`: `netCDF4.Dataset('file.nc')`
- `xarray`: `xr.open_dataset('file.nc')`
**EDA Approach:**
- Variable enumeration
- Dimension analysis
- Time series properties
- Spatial coverage
- Attribute metadata (CF conventions)
- Coordinate systems
- Chunking and compression
- Data quality flags
### .grib / .grib2 - Gridded Binary
**Description:** Meteorological data format
**Typical Data:** Weather forecasts, climate data
**Use Cases:** Numerical weather prediction
**Python Libraries:**
- `pygrib`: GRIB file reading
- `xarray` with `cfgrib`: GRIB to xarray
**EDA Approach:**
- Message inventory
- Parameter and level types
- Spatial grid specification
- Temporal coverage
- Ensemble members
- Forecast vs analysis
- Data packing and precision
### .hdf / .hdf4 - HDF4 Format
**Description:** Older HDF format
**Typical Data:** NASA Earth Science data
**Use Cases:** Satellite data (MODIS, etc.)
**Python Libraries:**
- `pyhdf`: HDF4 access
- `gdal`: Can read HDF4
**EDA Approach:**
- Scientific dataset listing
- Vdata and attributes
- Dimension scales
- Metadata extraction
- Quality flags
- Conversion to HDF5 or NetCDF
## Specialized Scientific Formats
### .fits - Flexible Image Transport System
**Description:** Astronomy data format
**Typical Data:** Images, tables, spectra from telescopes
**Use Cases:** Astronomical observations
**Python Libraries:**
- `astropy.io.fits`: `fits.open('file.fits')`
- `fitsio`: Alternative FITS library
**EDA Approach:**
- HDU (Header Data Unit) structure
- Image dimensions and WCS
- Header keyword analysis
- Table column descriptions
- Data type and scaling
- FITS convention compliance
- Checksum validation
### .asdf - Advanced Scientific Data Format
**Description:** Next-gen data format for astronomy
**Typical Data:** Complex hierarchical scientific data
**Use Cases:** James Webb Space Telescope WCS and calibration reference files, Roman Space Telescope data products
**Python Libraries:**
- `asdf`: `asdf.open('file.asdf')`
**EDA Approach:**
- Tree structure exploration
- Schema validation
- Internal vs external arrays
- Compression methods
- YAML metadata
- Version compatibility
### .root - ROOT Data Format
**Description:** CERN ROOT framework format
**Typical Data:** High-energy physics data
**Use Cases:** Particle physics experiments
**Python Libraries:**
- `uproot`: Pure Python ROOT reading
- `ROOT`: Official PyROOT bindings
**EDA Approach:**
- TTree structure
- Branch types and entries
- Histogram inventory
- Event loop statistics
- File compression
- Split level analysis
### .txt - Plain Text Data
**Description:** Generic text-based data
**Typical Data:** Tab/space-delimited, custom formats
**Use Cases:** Simple data exchange, logs
**Python Libraries:**
- `pandas`: `pd.read_csv()` with custom delimiters
- `numpy`: `np.loadtxt()`, `np.genfromtxt()`
- Built-in file reading
**EDA Approach:**
- Format detection (delimiter, header)
- Data type inference
- Comment line handling
- Missing value codes
- Column alignment
- Encoding detection
### .dat - Generic Data File
**Description:** Binary or text data
**Typical Data:** Instrument output, custom formats
**Use Cases:** Various scientific instruments
**Python Libraries:**
- Format-specific: requires knowledge of structure
- `numpy`: `np.fromfile()` for binary
- `struct`: Parse binary structures
**EDA Approach:**
- Binary vs text determination
- Header detection
- Record structure inference
- Endianness
- Data type patterns
- Validation with documentation
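Header parsing and the endianness check can be illustrated on a made-up binary layout. Everything about the layout below (a 4-byte magic string, a `uint16` record count, then `float32` records, all little-endian) is an assumption for the example; a real `.dat` file needs its own documented structure. For files on disk, `np.fromfile()` accepts the same `dtype`, `count`, and `offset` arguments.

```python
import struct

import numpy as np

# Hypothetical layout: 4-byte magic, uint16 record count, then float32 records,
# all little-endian.
HEADER = struct.Struct("<4sH")
payload = (
    HEADER.pack(b"DEMO", 3)
    + np.array([1.0, 2.5, -4.0], dtype="<f4").tobytes()
)

looks_binary = b"\x00" in payload  # text files rarely contain NUL bytes

magic, count = HEADER.unpack_from(payload, 0)
records = np.frombuffer(payload, dtype="<f4", count=count, offset=HEADER.size)

# Reading with the wrong byte order yields implausible values, a useful check.
wrong_order = np.frombuffer(payload, dtype=">f4", count=count, offset=HEADER.size)
```

Comparing the value ranges of `records` and `wrong_order` against what the instrument can physically produce is a quick way to settle the byte order when documentation is missing.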
### .log - Log Files
**Description:** Text logs from software/instruments
**Typical Data:** Timestamped events, messages
**Use Cases:** Troubleshooting, experiment tracking
**Python Libraries:**
- Built-in file reading
- `pandas`: Structured log parsing
- Regular expressions for parsing
**EDA Approach:**
- Log level distribution
- Timestamp parsing
- Error and warning frequency
- Event sequencing
- Pattern recognition
- Anomaly detection
- Session boundaries
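Timestamp parsing and the log level distribution can be sketched with a regular expression. The line layout assumed below (`YYYY-MM-DD HH:MM:SS LEVEL message`) and the sample messages are made up; real logs usually need their own pattern.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: "YYYY-MM-DD HH:MM:SS LEVEL message"; adjust per source.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)

LOG_TEXT = """\
2024-03-01 09:00:00 INFO acquisition started
2024-03-01 09:00:05 WARNING detector temperature high
2024-03-01 09:00:09 ERROR readout timeout
stack trace continuation line
2024-03-01 09:02:00 INFO acquisition stopped
"""

events, unparsed = [], []
for line in LOG_TEXT.splitlines():
    match = LINE_RE.match(line)
    if match:
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        events.append((ts, match["level"], match["msg"]))
    else:
        unparsed.append(line)  # continuation lines or unknown formats

level_counts = Counter(level for _, level, _ in events)
span_seconds = (events[-1][0] - events[0][0]).total_seconds()
```

A high share of `unparsed` lines is itself a finding: it usually means the pattern does not match the log format or that multi-line entries need to be joined first.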