Files
gh-k-dense-ai-claude-scient…/skills/gtars/references/python-api.md
2025-11-30 08:30:10 +08:00

4.1 KiB

Python API Reference

Comprehensive reference for gtars Python bindings.

Installation

# Install gtars Python package
uv pip install gtars

# Or with pip
pip install gtars

Core Classes

RegionSet

Manage collections of genomic intervals:

import gtars

# Create from BED file
regions = gtars.RegionSet.from_bed("regions.bed")

# Create from coordinates
regions = gtars.RegionSet([
    ("chr1", 1000, 2000),
    ("chr1", 3000, 4000),
    ("chr2", 5000, 6000)
])

# Access regions
for region in regions:
    print(f"{region.chromosome}:{region.start}-{region.end}")

# Get region count
num_regions = len(regions)

# Get total coverage
total_coverage = regions.total_coverage()

Region Operations

Perform operations on region sets:

# Sort regions
sorted_regions = regions.sort()

# Merge overlapping regions
merged = regions.merge()

# Filter by size
large_regions = regions.filter_by_size(min_size=1000)

# Filter by chromosome
chr1_regions = regions.filter_by_chromosome("chr1")

Set Operations

Perform set operations on genomic regions:

# Load two region sets
set_a = gtars.RegionSet.from_bed("set_a.bed")
set_b = gtars.RegionSet.from_bed("set_b.bed")

# Union
union = set_a.union(set_b)

# Intersection
intersection = set_a.intersect(set_b)

# Difference
difference = set_a.subtract(set_b)

# Symmetric difference
sym_diff = set_a.symmetric_difference(set_b)

Data Export

Writing BED Files

Export regions to BED format:

# Write to BED file
regions.to_bed("output.bed")

# Write with scores
regions.to_bed("output.bed", scores=score_array)

# Write with names
regions.to_bed("output.bed", names=name_list)

Format Conversion

Convert between formats:

# BED to JSON
regions = gtars.RegionSet.from_bed("input.bed")
regions.to_json("output.json")

# JSON to BED
regions = gtars.RegionSet.from_json("input.json")
regions.to_bed("output.bed")

NumPy Integration

Seamless integration with NumPy arrays:

import numpy as np

# Export to NumPy arrays
starts = regions.starts_array()  # NumPy array of start positions
ends = regions.ends_array()      # NumPy array of end positions
sizes = regions.sizes_array()    # NumPy array of region sizes

# Create from NumPy arrays
chromosomes = ["chr1"] * len(starts)
regions = gtars.RegionSet.from_arrays(chromosomes, starts, ends)

Parallel Processing

Leverage parallel processing for large datasets:

# Enable parallel processing
regions = gtars.RegionSet.from_bed("large_file.bed", parallel=True)

# Parallel operations
result = regions.parallel_apply(custom_function)

Memory Management

Efficient memory usage for large datasets:

# Stream large BED files
for chunk in gtars.RegionSet.stream_bed("large_file.bed", chunk_size=10000):
    process_chunk(chunk)

# Memory-mapped mode
regions = gtars.RegionSet.from_bed("large_file.bed", mmap=True)

Error Handling

Handle common errors:

try:
    regions = gtars.RegionSet.from_bed("file.bed")
except gtars.FileNotFoundError:
    print("File not found")
except gtars.InvalidFormatError as e:
    print(f"Invalid BED format: {e}")
except gtars.ParseError as e:
    print(f"Parse error at line {e.line}: {e.message}")

Configuration

Configure gtars behavior:

# Set global options
gtars.set_option("parallel.threads", 4)
gtars.set_option("memory.limit", "4GB")
gtars.set_option("warnings.strict", True)

# Context manager for temporary options
with gtars.option_context("parallel.threads", 8):
    # Use 8 threads for this block
    regions = gtars.RegionSet.from_bed("large_file.bed", parallel=True)

Logging

Enable logging for debugging:

import logging

# Enable gtars logging
gtars.set_log_level("DEBUG")

# Or use Python logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("gtars")

Performance Tips

  • Use parallel processing for large datasets
  • Enable memory-mapped mode for very large files
  • Stream data when possible to reduce memory usage
  • Pre-sort regions before operations when applicable
  • Use NumPy arrays for numerical computations
  • Cache frequently accessed data