Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/exploratory-data-analysis/references/general_scientific_formats.md
+++ b/skills/exploratory-data-analysis/references/general_scientific_formats.md
@@ -0,0 +1,518 @@
+# General Scientific Data Formats Reference
+
+This reference covers general-purpose scientific data formats used across multiple disciplines.
+
+## Numerical and Array Data
+
+### .npy - NumPy Array
+**Description:** Binary NumPy array format
+**Typical Data:** N-dimensional arrays of any data type
+**Use Cases:** Fast I/O for numerical data, intermediate results
+**Python Libraries:**
+- `numpy`: `np.load('file.npy')`, `np.save()`
+- Memory-mapped access: `np.load('file.npy', mmap_mode='r')`
+**EDA Approach:**
+- Array shape and dimensionality
+- Data type and precision
+- Statistical summary (mean, std, min, max, percentiles)
+- Missing or invalid values (NaN, inf)
+- Memory footprint
+- Value distribution and histogram
+- Sparsity analysis
+- Correlation structure (if 2D)
+
+### .npz - NumPy Archive
+**Description:** Multiple NumPy arrays in one ZIP-based file; compressed only when written with `np.savez_compressed()`
+**Typical Data:** Collections of related arrays
+**Use Cases:** Saving multiple arrays together, compressed storage
+**Python Libraries:**
+- `numpy`: `np.load('file.npz')` returns dict-like object
+- `np.savez()` or `np.savez_compressed()`
+**EDA Approach:**
+- List of contained arrays
+- Individual array analysis
+- Relationships between arrays
+- Total file size and compression ratio
+- Naming conventions
+- Data consistency checks
+
+### .csv - Comma-Separated Values
+**Description:** Plain text tabular data
+**Typical Data:** Experimental measurements, results tables
+**Use Cases:** Universal data exchange, spreadsheet export
+**Python Libraries:**
+- `pandas`: `pd.read_csv('file.csv')`
+- `csv`: Built-in module
+- `polars`: High-performance CSV reading
+- `numpy`: `np.loadtxt()` or `np.genfromtxt()`
+**EDA Approach:**
+- Row and column counts
+- Data type inference
+- Missing value patterns and frequency
+- Column statistics (numeric: mean, std; categorical: frequencies)
+- Outlier detection
+- Correlation matrix
+- Duplicate row detection
+- Header and index validation
+- Encoding issues detection
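+
+A minimal sketch of the first few checks above (row and column counts, missing values, duplicate rows), using only the built-in `csv` module. The inline `SAMPLE` text and the `profile_csv` helper are hypothetical stand-ins for a real file, and empty cells are assumed to mean "missing".
+
+```python
+import csv
+import io
+from collections import Counter
+
+# Hypothetical sample standing in for the contents of a real file.
+SAMPLE = "id,temp,site\n1,20.5,A\n2,,B\n2,,B\n3,21.0,\n"
+
+
+def profile_csv(text):
+    """Return basic EDA counts for CSV text: shape, missing values, duplicates."""
+    rows = list(csv.reader(io.StringIO(text)))
+    header, data = rows[0], rows[1:]
+    # Assumption: an empty cell marks a missing value.
+    missing = {
+        name: sum(1 for row in data if row[i] == "")
+        for i, name in enumerate(header)
+    }
+    # Every repeat of an already-seen row counts as one duplicate.
+    seen = Counter(tuple(row) for row in data)
+    duplicates = sum(n - 1 for n in seen.values())
+    return {
+        "rows": len(data),
+        "columns": len(header),
+        "missing": missing,
+        "duplicates": duplicates,
+    }
+
+
+profile = profile_csv(SAMPLE)
+print(profile)
+```
+
+For large files, `pandas` or `polars` would normally replace this loop; the sketch only shows what each check measures.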
+
+### .tsv / .tab - Tab-Separated Values
+**Description:** Tab-delimited tabular data
+**Typical Data:** Similar to CSV but tab-separated
+**Use Cases:** Bioinformatics, text processing output
+**Python Libraries:**
+- `pandas`: `pd.read_csv('file.tsv', sep='\t')`
+**EDA Approach:**
+- Same as CSV format
+- Tab vs space validation
+- Quote handling
+
+### .xlsx / .xls - Excel Spreadsheets
+**Description:** Microsoft Excel formats: XML-based `.xlsx` and legacy binary `.xls`
+**Typical Data:** Tabular data with formatting, formulas
+**Use Cases:** Lab notebooks, data entry, reports
+**Python Libraries:**
+- `pandas`: `pd.read_excel('file.xlsx')`
+- `openpyxl`: Full Excel file manipulation
+- `xlrd`: Reading .xls (legacy)
+**EDA Approach:**
+- Sheet enumeration and names
+- Per-sheet data analysis
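+- Formulas vs cached values (`openpyxl` with `data_only=True` returns the last saved results; it does not recalculate formulas)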
+- Merged cells handling
+- Hidden rows/columns
+- Data validation rules
+- Named ranges
+- Formatting-only cells detection
+
+### .json - JavaScript Object Notation
+**Description:** Hierarchical text data format
+**Typical Data:** Nested data structures, metadata
+**Use Cases:** API responses, configuration, results
+**Python Libraries:**
+- `json`: Built-in module
+- `pandas`: `pd.read_json()`
+- `ujson`: Faster JSON parsing
+**EDA Approach:**
+- Schema inference
+- Nesting depth
+- Key-value distribution
+- Array lengths
+- Data type consistency
+- Missing keys
+- Duplicate detection
+- Size and complexity metrics
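+
+One way to measure nesting depth and spot missing keys, sketched with the built-in `json` module. The `SAMPLE` document and the helpers `max_depth` and `key_counts` are hypothetical; depth here is defined as 0 for scalars, plus 1 for each enclosing dict or list.
+
+```python
+import json
+from collections import Counter
+
+# Hypothetical document; the second sample deliberately lacks "value".
+SAMPLE = (
+    '{"run": 1, "samples": [{"id": "a", "value": 1.5}, {"id": "b"}],'
+    ' "meta": {"site": {"name": "X"}}}'
+)
+
+
+def max_depth(node):
+    """Nesting depth: scalars are 0, each dict or list level adds 1."""
+    if isinstance(node, dict):
+        return 1 + max((max_depth(v) for v in node.values()), default=0)
+    if isinstance(node, list):
+        return 1 + max((max_depth(v) for v in node), default=0)
+    return 0
+
+
+def key_counts(node, counts=None):
+    """Count how often each key name occurs anywhere in the document."""
+    counts = Counter() if counts is None else counts
+    if isinstance(node, dict):
+        counts.update(node.keys())
+        for value in node.values():
+            key_counts(value, counts)
+    elif isinstance(node, list):
+        for value in node:
+            key_counts(value, counts)
+    return counts
+
+
+doc = json.loads(SAMPLE)
+depth = max_depth(doc)
+keys = key_counts(doc)
+print(depth, dict(keys))
+```
+
+A key that occurs less often than its siblings (`value` versus `id` here) points to records with missing fields.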
+
+### .xml - Extensible Markup Language
+**Description:** Hierarchical markup format
+**Typical Data:** Structured data with metadata
+**Use Cases:** Standards-based data exchange, APIs
+**Python Libraries:**
+- `lxml`: `lxml.etree.parse()`
+- `xml.etree.ElementTree`: Built-in XML
+- `xmltodict`: Convert XML to dict
+**EDA Approach:**
+- Schema/DTD validation
+- Element hierarchy and depth
+- Namespace handling
+- Attribute vs element content
+- CDATA sections
+- Text content extraction
+- Sibling and child counts
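+
+A short sketch of the hierarchy checks above using the built-in `xml.etree.ElementTree`. The `SAMPLE` document and the `depth` helper are hypothetical; schema/DTD validation is not covered, since the standard library parser does not validate.
+
+```python
+import xml.etree.ElementTree as ET
+from collections import Counter
+
+# Hypothetical document standing in for a real file.
+SAMPLE = (
+    "<experiment id='e1'>"
+    "<sample><value unit='K'>300</value></sample>"
+    "<sample/>"
+    "</experiment>"
+)
+
+
+def depth(elem):
+    """Depth of the element tree, counting the element itself as level 1."""
+    return 1 + max((depth(child) for child in elem), default=0)
+
+
+root = ET.fromstring(SAMPLE)
+tag_counts = Counter(elem.tag for elem in root.iter())
+attr_counts = Counter(name for elem in root.iter() for name in elem.attrib)
+tree_depth = depth(root)
+child_count = len(root)  # direct children of the root element
+print(tree_depth, dict(tag_counts), dict(attr_counts), child_count)
+```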
+
+### .yaml / .yml - YAML
+**Description:** Human-readable data serialization
+**Typical Data:** Configuration, metadata, parameters
+**Use Cases:** Experiment configurations, pipelines
+**Python Libraries:**
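+- `pyyaml` (imported as `yaml`): `yaml.safe_load()`; `yaml.load()` needs an explicit `Loader` and should not be used on untrusted input, since non-safe loaders can construct arbitrary Python objects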
+- `ruamel.yaml`: YAML 1.2 support
+**EDA Approach:**
+- Configuration structure
+- Data type handling
+- List and dict depth
+- Anchor and alias usage
+- Multi-document files
+- Comment preservation (`yaml` discards comments; `ruamel.yaml` round-trip mode keeps them)
+- Validation against schema
+
+### .toml - TOML Configuration
+**Description:** Configuration file format
+**Typical Data:** Settings, parameters
+**Use Cases:** Python package configuration, settings
+**Python Libraries:**
+- `tomli` / `tomllib`: TOML reading (tomllib in Python 3.11+)
+- `toml`: Reading and writing
+**EDA Approach:**
+- Section structure
+- Key-value pairs
+- Data type inference
+- Nested table validation
+- Required vs optional fields
+
+### .ini - INI Configuration
+**Description:** Simple configuration format
+**Typical Data:** Application settings
+**Use Cases:** Legacy configurations, simple settings
+**Python Libraries:**
+- `configparser`: Built-in INI parser
+**EDA Approach:**
+- Section enumeration
+- Key-value extraction
+- Type conversion
+- Comment handling
+- Case sensitivity
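+
+The checks above can be sketched with the built-in `configparser`. The section and option names in `SAMPLE` are hypothetical. With default settings, section names keep their case while option names are lowercased, which is the case-sensitivity behaviour worth confirming.
+
+```python
+import configparser
+
+# Hypothetical configuration text standing in for a real file.
+SAMPLE = (
+    "[Instrument]\n"
+    "Gain = 2.5\n"
+    "Enabled = yes\n"
+    "\n"
+    "[output]\n"
+    "; full-line comments are skipped by the parser\n"
+    "path = /tmp/run1\n"
+)
+
+parser = configparser.ConfigParser()
+parser.read_string(SAMPLE)
+
+sections = parser.sections()           # section names keep their case
+options = list(parser["Instrument"])   # option names are lowercased by default
+gain = parser.getfloat("Instrument", "Gain")          # explicit type conversion
+enabled = parser.getboolean("Instrument", "Enabled")  # "yes" maps to True
+print(sections, options, gain, enabled)
+```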
+
+## Binary and Compressed Data
+
+### .hdf5 / .h5 - Hierarchical Data Format 5
+**Description:** Container for large scientific datasets
+**Typical Data:** Multi-dimensional arrays, metadata, groups
+**Use Cases:** Large datasets, multi-modal data, parallel I/O
+**Python Libraries:**
+- `h5py`: `h5py.File('file.h5', 'r')`
+- `tables` (PyTables): Advanced HDF5 interface
+- `pandas`: HDF5 storage via HDFStore
+**EDA Approach:**
+- Group and dataset hierarchy
+- Dataset shapes and dtypes
+- Attributes and metadata
+- Compression and chunking strategy
+- Memory-efficient sampling
+- Dataset relationships
+- File size and efficiency
+- Access patterns optimization
+
+### .zarr - Chunked Array Storage
+**Description:** Cloud-optimized chunked arrays
+**Typical Data:** Large N-dimensional arrays
+**Use Cases:** Cloud storage, parallel computing, streaming
+**Python Libraries:**
+- `zarr`: `zarr.open('file.zarr')`
+- `xarray`: Zarr backend support
+**EDA Approach:**
+- Array metadata and dimensions
+- Chunk size optimization
+- Compression codec and ratio
+- Synchronizer and store type
+- Multi-scale hierarchies
+- Parallel access performance
+- Attribute metadata
+
+### .gz / .gzip - Gzip Compressed
+**Description:** Compressed data files
+**Typical Data:** Any compressed text or binary
+**Use Cases:** Compression for storage/transfer
+**Python Libraries:**
+- `gzip`: Built-in gzip module
+- `pandas`: Automatic gzip handling in read functions
+**EDA Approach:**
+- Compression ratio
+- Original file type detection
+- Decompression validation
+- Header information
+- Multi-member archives
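+
+A minimal sketch of the magic-number check, decompression validation, and compression ratio using the built-in `gzip` module. The payload is synthetic, and two members are concatenated on purpose to show that `gzip.decompress()` reads multi-member data.
+
+```python
+import gzip
+
+# Synthetic, highly repetitive payload standing in for a real file.
+payload = b"time,value\n" + b"0,1.0\n" * 1000
+
+# Two gzip members concatenated into a single stream.
+blob = gzip.compress(payload) + gzip.compress(b"extra\n")
+
+is_gzip = blob[:2] == b"\x1f\x8b"   # gzip magic number
+restored = gzip.decompress(blob)    # raises an error if the data is corrupt
+ratio = len(restored) / len(blob)   # uncompressed size / compressed size
+first_line = restored.split(b"\n", 1)[0]  # crude hint at the original content type
+print(is_gzip, len(restored), round(ratio, 1), first_line)
+```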
+
+### .bz2 - Bzip2 Compressed
+**Description:** Bzip2 compression
+**Typical Data:** Highly compressed files
+**Use Cases:** Smaller files than gzip for most data, at the cost of slower compression and decompression
+**Python Libraries:**
+- `bz2`: Built-in bz2 module
+- `pandas`: Automatic bz2 handling in read functions
+**EDA Approach:**
+- Compression efficiency
+- Decompression time
+- Content validation
+
+### .zip - ZIP Archive
+**Description:** Archive with multiple files
+**Typical Data:** Collections of files
+**Use Cases:** File distribution, archiving
+**Python Libraries:**
+- `zipfile`: Built-in ZIP support
+- `pandas`: Can read zipped CSVs
+**EDA Approach:**
+- Archive member listing
+- Compression method per file
+- Total vs compressed size
+- Directory structure
+- File type distribution
+- Extraction validation
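+
+A sketch of member listing, size comparison, and integrity checking with the built-in `zipfile` module. The archive is built in memory so the example is self-contained; the member names are hypothetical.
+
+```python
+import io
+import zipfile
+
+# Build a small in-memory archive standing in for a real .zip file.
+buf = io.BytesIO()
+with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
+    zf.writestr("data/run1.csv", "x,y\n" + "1,2\n" * 500)
+    zf.writestr("README.txt", "hypothetical archive")
+
+with zipfile.ZipFile(buf) as zf:
+    infos = zf.infolist()
+    # Per-member name, original size, stored size, and compression method.
+    members = [
+        (i.filename, i.file_size, i.compress_size, i.compress_type) for i in infos
+    ]
+    total_size = sum(i.file_size for i in infos)
+    compressed_size = sum(i.compress_size for i in infos)
+    first_bad = zf.testzip()  # None when every member passes its CRC check
+
+print(members, total_size, compressed_size, first_bad)
+```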
+
+### .tar / .tar.gz - TAR Archive
+**Description:** Unix tape archive
+**Typical Data:** Multiple files and directories
+**Use Cases:** Software distribution, backups
+**Python Libraries:**
+- `tarfile`: Built-in TAR support
+**EDA Approach:**
+- Member file listing
+- Compression (if .tar.gz, .tar.bz2)
+- Directory structure
+- Permissions preservation
+- Extraction testing
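+
+Listing members without extracting them can be sketched with the built-in `tarfile` module. The archive is created in memory and the member name is hypothetical; opening with mode `"r:*"` lets `tarfile` detect the compression itself.
+
+```python
+import io
+import tarfile
+
+# Build a small gzip-compressed archive in memory.
+data = b"a,b\n1,2\n"
+buf = io.BytesIO()
+with tarfile.open(fileobj=buf, mode="w:gz") as tar:
+    info = tarfile.TarInfo("results/run1.csv")
+    info.size = len(data)
+    info.mode = 0o644
+    tar.addfile(info, io.BytesIO(data))
+
+# Reopen for inspection; "r:*" selects the decompressor automatically.
+buf.seek(0)
+with tarfile.open(fileobj=buf, mode="r:*") as tar:
+    listing = [(m.name, m.size, oct(m.mode), m.isfile()) for m in tar.getmembers()]
+
+print(listing)
+```
+
+Inspecting `getmembers()` before extracting is also a reasonable place to look for absolute paths or `..` components in archives from untrusted sources.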
+
+## Time Series and Waveform Data
+
+### .wav - Waveform Audio
+**Description:** Audio waveform data
+**Typical Data:** Acoustic signals, audio recordings
+**Use Cases:** Acoustic analysis, ultrasound, signal processing
+**Python Libraries:**
+- `scipy.io.wavfile`: `scipy.io.wavfile.read()`
+- `wave`: Built-in module
+- `soundfile`: Enhanced audio I/O
+**EDA Approach:**
+- Sample rate and duration
+- Bit depth and channels
+- Amplitude distribution
+- Spectral analysis (FFT)
+- Signal-to-noise ratio
+- Clipping detection
+- Frequency content
+
+### .mat - MATLAB Data
+**Description:** MATLAB workspace variables
+**Typical Data:** Arrays, structures, cells
+**Use Cases:** MATLAB-Python interoperability
+**Python Libraries:**
+- `scipy.io`: `scipy.io.loadmat()`
+- `h5py`: For MATLAB v7.3 files (HDF5-based)
+- `mat73`: Loads v7.3 files into Python dicts (built on `h5py`)
+**EDA Approach:**
+- Variable names and types
+- Array dimensions
+- Structure field exploration
+- Cell array handling
+- Sparse matrix detection
+- MATLAB version compatibility
+- Metadata extraction
+
+### .edf - European Data Format
+**Description:** Time series data (especially medical)
+**Typical Data:** EEG, physiological signals
+**Use Cases:** Medical signal storage
+**Python Libraries:**
+- `pyedflib`: EDF/EDF+ reading and writing
+- `mne`: Neurophysiology data (supports EDF)
+**EDA Approach:**
+- Signal count and names
+- Sampling frequencies
+- Signal ranges and units
+- Recording duration
+- Annotation events
+- Data quality (saturation, noise)
+- Patient/study information
+
+### .csv (Time Series)
+**Description:** CSV with timestamp column
+**Typical Data:** Time-indexed measurements
+**Use Cases:** Sensor data, monitoring, experiments
+**Python Libraries:**
+- `pandas`: `pd.read_csv()` with `parse_dates`
+**EDA Approach:**
+- Temporal range and resolution
+- Sampling regularity
+- Missing time points
+- Trend and seasonality
+- Stationarity tests
+- Autocorrelation
+- Anomaly detection
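+
+Temporal range, sampling regularity, and missing time points can be sketched with the standard library alone. The `SAMPLE` data and the `timestamp` column name are hypothetical, and the smallest observed step is assumed to be the nominal sampling interval.
+
+```python
+import csv
+import io
+from datetime import datetime
+
+# Hypothetical series with one sample missing at 00:02.
+SAMPLE = (
+    "timestamp,value\n"
+    "2025-01-01T00:00:00,1.0\n"
+    "2025-01-01T00:01:00,1.1\n"
+    "2025-01-01T00:03:00,1.3\n"
+)
+
+rows = list(csv.DictReader(io.StringIO(SAMPLE)))
+times = [datetime.fromisoformat(row["timestamp"]) for row in rows]
+
+span = times[-1] - times[0]                        # temporal range
+steps = [b - a for a, b in zip(times, times[1:])]  # gaps between samples
+nominal = min(steps)  # assumption: smallest step is the intended interval
+is_regular = all(step == nominal for step in steps)
+# Each oversized gap hides (gap / nominal - 1) missing samples.
+missing = sum(round(step / nominal) - 1 for step in steps if step > nominal)
+print(span, nominal, is_regular, missing)
+```
+
+With `pandas`, the same idea is usually expressed by parsing the column with `parse_dates` and examining the differences of the resulting index.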
+
+## Geospatial and Environmental Data
+
+### .shp - Shapefile
+**Description:** Geospatial vector data
+**Typical Data:** Geographic features (points, lines, polygons)
+**Use Cases:** GIS analysis, spatial data
+**Python Libraries:**
+- `geopandas`: `gpd.read_file('file.shp')`
+- `fiona`: Lower-level shapefile access
+- `pyshp`: Pure Python shapefile reader
+**EDA Approach:**
+- Geometry type and count
+- Coordinate reference system
+- Bounding box
+- Attribute table analysis
+- Geometry validity
+- Spatial distribution
+- Multi-part features
+- Associated files (.shx, .dbf, .prj)
+
+### .geojson - GeoJSON
+**Description:** JSON format for geographic data
+**Typical Data:** Features with geometry and properties
+**Use Cases:** Web mapping, spatial analysis
+**Python Libraries:**
+- `geopandas`: Native GeoJSON support
+- `json`: Parse as JSON then process
+**EDA Approach:**
+- Feature count and types
+- CRS specification
+- Bounding box calculation
+- Property schema
+- Geometry complexity
+- Nesting structure
+
+### .tif / .tiff (Geospatial)
+**Description:** GeoTIFF with spatial reference
+**Typical Data:** Satellite imagery, DEMs, rasters
+**Use Cases:** Remote sensing, terrain analysis
+**Python Libraries:**
+- `rasterio`: `rasterio.open('file.tif')`
+- `gdal` (imported with `from osgeo import gdal`): Geospatial Data Abstraction Library
+- `xarray` with `rioxarray`: N-D geospatial arrays
+**EDA Approach:**
+- Raster dimensions and resolution
+- Band count and descriptions
+- Coordinate reference system
+- Geotransform parameters
+- NoData value handling
+- Pixel value distribution
+- Histogram analysis
+- Overviews and pyramids
+
+### .nc / .netcdf - Network Common Data Form
+**Description:** Self-describing array-based data
+**Typical Data:** Climate, atmospheric, oceanographic data
+**Use Cases:** Scientific datasets, model output
+**Python Libraries:**
+- `netCDF4`: `netCDF4.Dataset('file.nc')`
+- `xarray`: `xr.open_dataset('file.nc')`
+**EDA Approach:**
+- Variable enumeration
+- Dimension analysis
+- Time series properties
+- Spatial coverage
+- Attribute metadata (CF conventions)
+- Coordinate systems
+- Chunking and compression
+- Data quality flags
+
+### .grib / .grib2 - Gridded Binary
+**Description:** Meteorological data format
+**Typical Data:** Weather forecasts, climate data
+**Use Cases:** Numerical weather prediction
+**Python Libraries:**
+- `pygrib`: GRIB file reading
+- `xarray` with `cfgrib`: GRIB to xarray
+**EDA Approach:**
+- Message inventory
+- Parameter and level types
+- Spatial grid specification
+- Temporal coverage
+- Ensemble members
+- Forecast vs analysis
+- Data packing and precision
+
+### .hdf / .hdf4 - HDF4 Format
+**Description:** Older HDF format
+**Typical Data:** NASA Earth Science data
+**Use Cases:** Satellite data (MODIS, etc.)
+**Python Libraries:**
+- `pyhdf`: HDF4 access
+- `gdal`: Can read HDF4
+**EDA Approach:**
+- Scientific dataset listing
+- Vdata and attributes
+- Dimension scales
+- Metadata extraction
+- Quality flags
+- Conversion to HDF5 or NetCDF
+
+## Specialized Scientific Formats
+
+### .fits - Flexible Image Transport System
+**Description:** Astronomy data format
+**Typical Data:** Images, tables, spectra from telescopes
+**Use Cases:** Astronomical observations
+**Python Libraries:**
+- `astropy.io.fits`: `fits.open('file.fits')`
+- `fitsio`: Alternative FITS library
+**EDA Approach:**
+- HDU (Header Data Unit) structure
+- Image dimensions and WCS
+- Header keyword analysis
+- Table column descriptions
+- Data type and scaling
+- FITS convention compliance
+- Checksum validation
+
+### .asdf - Advanced Scientific Data Format
+**Description:** Next-gen data format for astronomy
+**Typical Data:** Complex hierarchical scientific data
+**Use Cases:** James Webb Space Telescope data
+**Python Libraries:**
+- `asdf`: `asdf.open('file.asdf')`
+**EDA Approach:**
+- Tree structure exploration
+- Schema validation
+- Internal vs external arrays
+- Compression methods
+- YAML metadata
+- Version compatibility
+
+### .root - ROOT Data Format
+**Description:** CERN ROOT framework format
+**Typical Data:** High-energy physics data
+**Use Cases:** Particle physics experiments
+**Python Libraries:**
+- `uproot`: Pure Python ROOT reading
+- `ROOT`: Official PyROOT bindings
+**EDA Approach:**
+- TTree structure
+- Branch types and entries
+- Histogram inventory
+- Event loop statistics
+- File compression
+- Split level analysis
+
+### .txt - Plain Text Data
+**Description:** Generic text-based data
+**Typical Data:** Tab/space-delimited, custom formats
+**Use Cases:** Simple data exchange, logs
+**Python Libraries:**
+- `pandas`: `pd.read_csv()` with custom delimiters
+- `numpy`: `np.loadtxt()`, `np.genfromtxt()`
+- Built-in file reading
+**EDA Approach:**
+- Format detection (delimiter, header)
+- Data type inference
+- Comment line handling
+- Missing value codes
+- Column alignment
+- Encoding detection
+
+### .dat - Generic Data File
+**Description:** Binary or text data
+**Typical Data:** Instrument output, custom formats
+**Use Cases:** Various scientific instruments
+**Python Libraries:**
+- Format-specific: requires knowledge of structure
+- `numpy`: `np.fromfile()` for binary
+- `struct`: Parse binary structures
+**EDA Approach:**
+- Binary vs text determination
+- Header detection
+- Record structure inference
+- Endianness
+- Data type patterns
+- Validation with documentation
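+
+A sketch of the binary-versus-text check and an endianness comparison using the built-in `struct` module. The record layout (a `uint16` channel id followed by a `float32` reading) is purely hypothetical; a real `.dat` file needs its layout taken from the instrument documentation. `looks_like_text` is a rough heuristic, not a reliable detector.
+
+```python
+import struct
+
+# Hypothetical record layout: little-endian uint16 id + float32 reading.
+record = struct.Struct("<Hf")
+blob = b"".join(
+    record.pack(channel, reading)
+    for channel, reading in [(1, 0.5), (2, 1.5), (3, 2.5)]
+)
+
+
+def looks_like_text(data, sample_size=512):
+    """Heuristic: text if the leading bytes have no NULs and decode as UTF-8."""
+    sample = data[:sample_size]
+    if b"\x00" in sample:
+        return False
+    try:
+        sample.decode("utf-8")
+    except UnicodeDecodeError:
+        return False
+    return True
+
+
+is_text = looks_like_text(blob)
+offsets = range(0, len(blob), record.size)
+little = [record.unpack_from(blob, i) for i in offsets]
+# Decoding with the wrong byte order yields implausible values.
+big = [struct.unpack_from(">Hf", blob, i) for i in offsets]
+print(is_text, little, big[0][0])
+```
+
+Comparing both byte orders against the expected value range is one practical way to infer endianness when the documentation is silent.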
+
+### .log - Log Files
+**Description:** Text logs from software/instruments
+**Typical Data:** Timestamped events, messages
+**Use Cases:** Troubleshooting, experiment tracking
+**Python Libraries:**
+- Built-in file reading
+- `pandas`: Structured log parsing
+- Regular expressions for parsing
+**EDA Approach:**
+- Log level distribution
+- Timestamp parsing
+- Error and warning frequency
+- Event sequencing
+- Pattern recognition
+- Anomaly detection
+- Session boundaries
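+
+A sketch of level distribution, timestamp parsing, and gap detection with `re` and `datetime`. The line format (`YYYY-MM-DD HH:MM:SS LEVEL message`) and the sample lines are hypothetical; real logs need a pattern matched to their own layout. A long gap between consecutive events is used here as a rough indicator of a session boundary.
+
+```python
+import re
+from collections import Counter
+from datetime import datetime
+
+# Hypothetical log lines standing in for a real file.
+SAMPLE_LINES = [
+    "2025-01-05 10:00:00 INFO acquisition started",
+    "2025-01-05 10:00:02 WARNING detector temperature high",
+    "2025-01-05 10:00:03 ERROR readout timeout",
+    "2025-01-05 10:05:00 INFO acquisition started",
+]
+
+LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")
+
+events = []
+for line in SAMPLE_LINES:
+    match = LINE.match(line)
+    if match:  # lines that do not fit the pattern are skipped
+        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
+        events.append((stamp, match.group(2), match.group(3)))
+
+levels = Counter(level for _, level, _ in events)
+# Seconds between consecutive events.
+gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]
+longest_gap = max(gaps)
+print(dict(levels), gaps, longest_gap)
+```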