# FlowIO API Reference
## Overview
FlowIO is a Python library for reading and writing Flow Cytometry Standard (FCS) files. It supports FCS versions 2.0, 3.0, and 3.1 with minimal dependencies.
## Installation
```bash
pip install flowio
```
Supports Python 3.9 and later.
## Core Classes
### FlowData
The primary class for working with FCS files.
#### Constructor
```python
FlowData(fcs_file,
         ignore_offset_error=False,
         ignore_offset_discrepancy=False,
         use_header_offsets=False,
         only_text=False,
         nextdata_offset=None,
         null_channel_list=None)
```
**Parameters:**
- `fcs_file`: File path (str), Path object, or file handle
- `ignore_offset_error` (bool): Ignore offset errors (default: False)
- `ignore_offset_discrepancy` (bool): Ignore offset discrepancies between HEADER and TEXT sections (default: False)
- `use_header_offsets` (bool): Use HEADER section offsets instead of TEXT section (default: False)
- `only_text` (bool): Only parse the TEXT segment, skip DATA and ANALYSIS (default: False)
- `nextdata_offset` (int): Byte offset for reading multi-dataset files
- `null_channel_list` (list): List of PnN labels for null channels to exclude
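
A few illustrative constructor calls combining these options (a sketch, not run here; `sample.fcs` and the `FL9-A` channel label are placeholders):

```python
def parsing_variants():
    from flowio import FlowData
    # Metadata only: skip the DATA and ANALYSIS segments entirely
    meta_only = FlowData('sample.fcs', only_text=True)
    # Tolerate HEADER/TEXT offset disagreements in malformed files
    tolerant = FlowData('sample.fcs', ignore_offset_discrepancy=True)
    # Exclude an unwanted channel while parsing
    trimmed = FlowData('sample.fcs', null_channel_list=['FL9-A'])
    return meta_only, tolerant, trimmed
```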
#### Attributes
**File Information:**
- `name`: Name of the FCS file
- `file_size`: Size of the file in bytes
- `version`: FCS version (e.g., '3.0', '3.1')
- `header`: Dictionary containing HEADER segment information
- `data_type`: Type of data format ('I', 'F', 'D', 'A')
**Channel Information:**
- `channel_count`: Number of channels in the dataset
- `channels`: Dictionary mapping channel numbers to channel info
- `pnn_labels`: List of PnN (short channel name) labels
- `pns_labels`: List of PnS (descriptive stain name) labels
- `pnr_values`: List of PnR (range) values for each channel
- `fluoro_indices`: List of indices for fluorescence channels
- `scatter_indices`: List of indices for scatter channels
- `time_index`: Index of the time channel (or None)
- `null_channels`: List of null channel indices
**Event Data:**
- `event_count`: Number of events (rows) in the dataset
- `events`: Raw event data as bytes
**Metadata:**
- `text`: Dictionary of TEXT segment key-value pairs
- `analysis`: Dictionary of ANALYSIS segment key-value pairs (if present)
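
The per-channel attributes (`pnn_labels`, `pns_labels`, `pnr_values`) are parallel lists, so a quick channel summary can be built by zipping them. The label values below are invented stand-ins for what a parsed file would provide:

```python
# Stand-in values; on a real file these come from a FlowData instance
pnn_labels = ['FSC-A', 'SSC-A', 'FL1-A']
pns_labels = ['Forward Scatter', 'Side Scatter', 'FITC']
pnr_values = [262144, 262144, 1024]

# Channel numbering in FCS keywords starts at 1
for n, (pnn, pns, pnr) in enumerate(zip(pnn_labels, pns_labels, pnr_values), start=1):
    print(f"P{n}: {pnn:8s} ({pns}), range {pnr}")
```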
#### Methods
##### as_array()
```python
as_array(preprocess=True)
```
Return event data as a 2-D NumPy array.
**Parameters:**
- `preprocess` (bool): Apply gain, logarithmic, and time scaling transformations (default: True)
**Returns:**
- NumPy ndarray with shape (event_count, channel_count)
**Example:**
```python
from flowio import FlowData

flow_data = FlowData('sample.fcs')
events_array = flow_data.as_array() # Preprocessed data
raw_array = flow_data.as_array(preprocess=False) # Raw data
```
##### write_fcs()
```python
write_fcs(filename, metadata=None)
```
Export the FlowData instance as a new FCS file.
**Parameters:**
- `filename` (str): Output file path
- `metadata` (dict): Optional dictionary of TEXT segment keywords to add/update
**Example:**
```python
from flowio import FlowData

flow_data = FlowData('sample.fcs')
flow_data.write_fcs('output.fcs', metadata={'$SRC': 'Modified data'})
```
**Note:** Exports as FCS 3.1 with single-precision floating-point data.
## Utility Functions
### read_multiple_data_sets()
```python
read_multiple_data_sets(fcs_file,
                        ignore_offset_error=False,
                        ignore_offset_discrepancy=False,
                        use_header_offsets=False)
```
Read all datasets from an FCS file containing multiple datasets.
**Parameters:**
- Same meanings as the corresponding `FlowData` constructor parameters
**Returns:**
- List of FlowData instances, one for each dataset
**Example:**
```python
from flowio import read_multiple_data_sets
datasets = read_multiple_data_sets('multi_dataset.fcs')
print(f"Found {len(datasets)} datasets")
for i, dataset in enumerate(datasets):
    print(f"Dataset {i}: {dataset.event_count} events")
```
### create_fcs()
```python
create_fcs(filename,
           event_data,
           channel_names,
           opt_channel_names=None,
           metadata=None)
```
Create a new FCS file from event data.
**Parameters:**
- `filename` (str): Output file path
- `event_data` (ndarray): 2-D NumPy array of event data (rows=events, columns=channels)
- `channel_names` (list): List of PnN (short) channel names
- `opt_channel_names` (list): Optional list of PnS (descriptive) channel names
- `metadata` (dict): Optional dictionary of TEXT segment keywords
**Example:**
```python
import numpy as np
from flowio import create_fcs
# Create synthetic data
events = np.random.rand(10000, 5)
channels = ['FSC-A', 'SSC-A', 'FL1-A', 'FL2-A', 'Time']
opt_channels = ['Forward Scatter', 'Side Scatter', 'FITC', 'PE', 'Time']
create_fcs('synthetic.fcs',
           events,
           channels,
           opt_channel_names=opt_channels,
           metadata={'$SRC': 'Synthetic data'})
```
## Exception Classes
### FlowIOWarning
Generic warning class for non-critical issues.
### PnEWarning
Warning raised when PnE values are invalid during FCS file creation.
### FlowIOException
Base exception class for FlowIO errors.
### FCSParsingError
Raised when there are issues parsing an FCS file.
### DataOffsetDiscrepancyError
Raised when the HEADER and TEXT sections provide different byte offsets for data segments.
**Workaround:** Pass `ignore_offset_discrepancy=True` when constructing the `FlowData` instance.
### MultipleDataSetsError
Raised when attempting to read a file with multiple datasets using the standard FlowData constructor.
**Solution:** Use `read_multiple_data_sets()` function instead.
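
These two recoverable errors suggest a natural fallback pattern. A sketch, assuming the exception classes are importable from `flowio.exceptions` (defined but not executed here, since it needs an actual file):

```python
def read_any_fcs(path):
    """Read a single- or multi-dataset FCS file into a list of FlowData."""
    from flowio import FlowData, read_multiple_data_sets
    from flowio.exceptions import MultipleDataSetsError
    try:
        return [FlowData(path)]
    except MultipleDataSetsError:
        # The file holds several datasets; read them all instead
        return read_multiple_data_sets(path)
```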
## FCS File Structure Reference
FCS files consist of four segments:
1. **HEADER**: Contains FCS version and byte locations of other segments
2. **TEXT**: Key-value metadata pairs (delimited format)
3. **DATA**: Raw event data (binary, floating-point, or ASCII)
4. **ANALYSIS** (optional): Results from data processing
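
The HEADER segment has a fixed-width ASCII layout in FCS 3.0/3.1: a 6-byte version string, 4 spaces, then six 8-byte right-justified offsets (TEXT start/stop, DATA start/stop, ANALYSIS start/stop). `FlowData` parses this internally; the stand-alone sketch below, run on a synthetic header, is for illustration only (real files may report 0 for absent ANALYSIS offsets, or 0 for segments too large to fit the 8-byte fields):

```python
def parse_header(raw: bytes) -> dict:
    """Decode the fixed-width FCS HEADER segment (illustrative sketch)."""
    fields = ['text_start', 'text_stop', 'data_start', 'data_stop',
              'analysis_start', 'analysis_stop']
    header = {'version': raw[3:6].decode('ascii')}  # e.g. b'FCS3.1' -> '3.1'
    for i, name in enumerate(fields):
        start = 10 + 8 * i  # offsets begin at byte 10, 8 bytes each
        header[name] = int(raw[start:start + 8])
    return header

# A synthetic 58-byte header for demonstration
fake = b'FCS3.1    ' + b''.join(b'%8d' % n for n in (58, 1024, 1025, 90000, 0, 0))
print(parse_header(fake))
```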
### Common TEXT Segment Keywords
- `$BEGINDATA`, `$ENDDATA`: Byte offsets for DATA segment
- `$BEGINANALYSIS`, `$ENDANALYSIS`: Byte offsets for ANALYSIS segment
- `$BYTEORD`: Byte order (1,2,3,4 for little-endian; 4,3,2,1 for big-endian)
- `$DATATYPE`: Data type ('I'=integer, 'F'=float, 'D'=double, 'A'=ASCII)
- `$MODE`: Data mode ('L'=list mode, most common)
- `$NEXTDATA`: Offset to next dataset (0 if single dataset)
- `$PAR`: Number of parameters (channels)
- `$TOT`: Total number of events
- `$PnN`: Short name for parameter n
- `$PnS`: Descriptive stain name for parameter n
- `$PnR`: Range (max value) for parameter n
- `$PnE`: Log-amplification description for parameter n (format: "f1,f2", where f1 is the number of log decades and f2 is the linear offset; a raw value x scales to f2 * 10^(f1 * x / PnR))
- `$PnG`: Amplification gain for parameter n
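
The TEXT segment itself is a flat delimited string: its first byte is the delimiter, and keys and values then alternate. A hedged sketch of decoding one (real parsers, including FlowIO's, must also handle escaped delimiters written as a doubled delimiter inside a value, which this sketch omits):

```python
def parse_text_segment(segment: bytes) -> dict:
    """Split a delimited FCS TEXT segment into a keyword dict (sketch)."""
    delim = segment[:1]                # first byte defines the delimiter
    tokens = segment[1:].split(delim)
    if tokens and tokens[-1] == b'':   # drop the trailing empty token
        tokens.pop()
    keys = [t.decode('utf-8') for t in tokens[0::2]]
    values = [t.decode('utf-8') for t in tokens[1::2]]
    return dict(zip(keys, values))

text = b'/$PAR/3/$TOT/10000/$DATATYPE/F/$BYTEORD/1,2,3,4/'
print(parse_text_segment(text))
```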
## Channel Types
FlowIO automatically categorizes channels:
- **Scatter channels**: FSC (forward scatter), SSC (side scatter)
- **Fluorescence channels**: FL1, FL2, FITC, PE, etc.
- **Time channel**: Usually labeled "Time"
Access indices via:
- `flow_data.scatter_indices`
- `flow_data.fluoro_indices`
- `flow_data.time_index`
## Data Preprocessing
When calling `as_array(preprocess=True)`, FlowIO applies:
1. **Gain scaling**: Multiply by PnG value
2. **Logarithmic transformation**: Apply PnE exponential transformation if present
3. **Time scaling**: Scale the time channel by the `$TIMESTEP` keyword value
To access raw, unprocessed data: `as_array(preprocess=False)`
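
The per-channel scaling steps above can be sketched as follows. This follows the document's description (PnE log transform for log-amplified channels, gain scaling otherwise); FlowIO's internal implementation may differ in detail:

```python
import numpy as np

def preprocess_channel(x, pne=(0.0, 0.0), png=1.0, pnr=1024):
    """Sketch of per-channel preprocessing: PnE log transform or PnG gain."""
    x = np.asarray(x, dtype=float)
    f1, f2 = pne
    if f1 > 0:
        # Log-amplified channel: value = f2 * 10^(f1 * x / PnR)
        return f2 * 10.0 ** (f1 * x / pnr)
    # Linear channel: apply gain scaling
    return x * png

raw = np.array([0.0, 512.0, 1024.0])
print(preprocess_channel(raw, pne=(4.0, 1.0), pnr=1024))  # 4-decade log scale
```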
## Best Practices
1. **Memory efficiency**: Use `only_text=True` when only metadata is needed
2. **Error handling**: Wrap file operations in try-except blocks for FCSParsingError
3. **Multi-dataset files**: Always use `read_multiple_data_sets()` if unsure about dataset count
4. **Offset issues**: If encountering offset errors, try `ignore_offset_discrepancy=True`
5. **Channel selection**: Use null_channel_list to exclude unwanted channels during parsing
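
Practices 1 and 2 combine into a cheap metadata probe. A sketch, assuming `FCSParsingError` is importable from `flowio.exceptions` (defined but not run here, since it needs an actual file):

```python
def probe_metadata(path):
    """Return the TEXT-segment dict, or None if the file fails to parse."""
    from flowio import FlowData
    from flowio.exceptions import FCSParsingError
    try:
        # only_text=True skips DATA/ANALYSIS, keeping memory use minimal
        return FlowData(path, only_text=True).text
    except FCSParsingError:
        return None
```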
## Integration with FlowKit
For advanced flow cytometry analysis, including compensation, gating, and GatingML support, consider pairing FlowIO with the FlowKit library. FlowKit provides higher-level abstractions built on top of FlowIO's file-parsing capabilities.
## Example Workflows
### Basic File Reading
```python
from flowio import FlowData
# Read FCS file
flow = FlowData('experiment.fcs')
# Print basic info
print(f"Version: {flow.version}")
print(f"Events: {flow.event_count}")
print(f"Channels: {flow.channel_count}")
print(f"Channel names: {flow.pnn_labels}")
# Get event data
events = flow.as_array()
print(f"Data shape: {events.shape}")
```
### Metadata Extraction
```python
from flowio import FlowData
flow = FlowData('sample.fcs', only_text=True)
# Access metadata
print(f"Acquisition date: {flow.text.get('$DATE', 'N/A')}")
print(f"Instrument: {flow.text.get('$CYT', 'N/A')}")
# Channel information
for i, (pnn, pns) in enumerate(zip(flow.pnn_labels, flow.pns_labels)):
    print(f"Channel {i}: {pnn} ({pns})")
```
### Creating New FCS Files
```python
import numpy as np
from flowio import create_fcs
# Generate or process data
data = np.random.rand(5000, 3) * 1000
# Define channels
channels = ['FSC-A', 'SSC-A', 'FL1-A']
stains = ['Forward Scatter', 'Side Scatter', 'GFP']
# Create FCS file
create_fcs('output.fcs',
           data,
           channels,
           opt_channel_names=stains,
           metadata={
               '$SRC': 'Python script',
               '$DATE': '19-OCT-2025'
           })
```
### Processing Multi-Dataset Files
```python
from flowio import read_multiple_data_sets
# Read all datasets
datasets = read_multiple_data_sets('multi.fcs')
# Process each dataset
for i, dataset in enumerate(datasets):
    print(f"\nDataset {i}:")
    print(f"  Events: {dataset.event_count}")
    print(f"  Channels: {dataset.pnn_labels}")
    # Get data array
    events = dataset.as_array()
    mean_values = events.mean(axis=0)
    print(f"  Mean values: {mean_values}")
```
### Modifying and Re-exporting
```python
from flowio import FlowData, create_fcs
# Read original file
flow = FlowData('original.fcs')
# Get raw (unprocessed) event data
events = flow.as_array(preprocess=False)
# Modify data (example: scale the first channel)
events[:, 0] = events[:, 0] * 1.5
# FlowIO does not support modifying event data in place,
# so write the modified array out with create_fcs():
create_fcs('modified.fcs',
           events,
           flow.pnn_labels,
           opt_channel_names=flow.pns_labels,
           metadata=flow.text)
```