gh-k-dense-ai-claude-scient…/skills/gtars/references/refget.md
2025-11-30 08:30:10 +08:00

Reference Sequence Management

The refget module handles reference sequence retrieval and digest computation, following the refget protocol for sequence identification.

RefgetStore

RefgetStore manages reference sequences and their digests:

import gtars

# Create RefgetStore
store = gtars.RefgetStore()

# Add sequence (placeholder bases for illustration; real data would come from a file)
sequence_data = "ACGTACGT"
store.add_sequence("chr1", sequence_data)

# Retrieve sequence
seq = store.get_sequence("chr1")

# Get sequence digest
digest = store.get_digest("chr1")

Sequence Digests

Compute and verify sequence digests:

# Compute digest for sequence
from gtars.refget import compute_digest

digest = compute_digest(sequence_data)

# Verify digest matches
is_valid = store.verify_digest("chr1", expected_digest)

Integration with Reference Genomes

Work with standard reference genomes:

# Load reference genome
store = gtars.RefgetStore.from_fasta("hg38.fa")

# Get chromosome sequences
chr1 = store.get_sequence("chr1")
chr2 = store.get_sequence("chr2")

# Get subsequence
region_seq = store.get_subsequence("chr1", 1000, 2000)

CLI Usage

Manage reference sequences from the command line:

# Compute digest for FASTA file
gtars refget digest --input genome.fa --output digests.txt

# Verify sequence digest
gtars refget verify --sequence sequence.fa --digest expected_digest
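
As a rough illustration of what a per-record digest run produces, the sketch below parses FASTA text and computes a digest for each record using only the Python standard library. It assumes the sha512t24u scheme from the GA4GH refget specification and uppercases the bases before hashing. `parse_fasta` and `fasta_digests` are hypothetical helper names, not part of gtars, and the CLI's own output may be formatted differently.

```python
import base64
import hashlib
import io


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def parse_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The record name is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_digests(handle):
    """Map each record name to an 'SQ.'-prefixed digest of its uppercased sequence."""
    return {
        name: "SQ." + sha512t24u(seq.upper().encode("ascii"))
        for name, seq in parse_fasta(handle)
    }


# Tiny in-memory FASTA used only for demonstration.
fasta_text = ">chr1 demo\nACGT\nacgt\n>chr2\nGGCC\n"
digests = fasta_digests(io.StringIO(fasta_text))
for name, digest in digests.items():
    print(name, digest)
```

Because line breaks and letter case are normalized before hashing, two FASTA files that wrap lines differently still yield the same digests.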

Refget Protocol Compliance

The refget module follows the GA4GH refget protocol:

Digest Computation

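Digests follow the GA4GH sha512t24u scheme: the SHA-512 hash of the sequence is truncated to its first 24 bytes and base64url-encoded, giving a 32-character string: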

# Compute refget-compliant digest
digest = gtars.refget.compute_digest(sequence)
# Returns: "SQ.abc123..."
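
For reference, the sha512t24u scheme from the GA4GH specification (SHA-512, first 24 bytes, base64url) can be reproduced with the standard library. This is a minimal sketch, not the gtars implementation: `refget_digest` is a hypothetical helper, and it assumes the sequence is uppercased ASCII and that the `SQ.` prefix shown above is prepended to the encoded digest.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Hash with SHA-512, keep the first 24 bytes, encode as base64url."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def refget_digest(sequence: str) -> str:
    """Hypothetical helper: normalize case, hash, and add the 'SQ.' prefix."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


digest = refget_digest("ACGT")
print(digest)
print(len(digest))  # 35: the 3-character prefix plus 32 encoded characters
```

Twenty-four bytes encode to exactly 32 base64 characters, so the result never carries `=` padding.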

Sequence Retrieval

Retrieve sequences by digest:

# Get sequence by refget digest
seq = store.get_sequence_by_digest("SQ.abc123...")
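
Lookup by digest works because the digest is derived from the sequence content itself, so the same bases always map to the same key regardless of the name they were stored under. The toy in-memory store below illustrates the idea. `ToyStore` and its internals are hypothetical and do not reflect how RefgetStore is implemented.

```python
import base64
import hashlib


def refget_digest(sequence: str) -> str:
    """sha512t24u digest of the uppercased sequence, with an 'SQ.' prefix."""
    raw = hashlib.sha512(sequence.upper().encode("ascii")).digest()[:24]
    return "SQ." + base64.urlsafe_b64encode(raw).decode("ascii")


class ToyStore:
    """Hypothetical content-addressed store: sequences are keyed by digest."""

    def __init__(self):
        self._by_digest = {}  # digest -> sequence
        self._names = {}      # name -> digest

    def add_sequence(self, name, sequence):
        digest = refget_digest(sequence)
        self._by_digest[digest] = sequence.upper()
        self._names[name] = digest
        return digest

    def get_sequence_by_digest(self, digest):
        return self._by_digest[digest]


store = ToyStore()
d1 = store.add_sequence("chr1", "ACGTACGT")
d2 = store.add_sequence("1", "acgtacgt")  # same bases under a different name
print(d1 == d2)
print(store.get_sequence_by_digest(d1))
```

Both names resolve to a single stored sequence, which is what makes digests useful as name-independent identifiers.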

Use Cases

Reference Validation

Verify reference genome integrity:

# Compute digests for reference
store = gtars.RefgetStore.from_fasta("reference.fa")
digests = {chrom: store.get_digest(chrom) for chrom in store.chromosomes}

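# Compare with expected digests (expected_digests maps chromosome -> trusted digest)
for chrom, expected in expected_digests.items():
    actual = digests.get(chrom)  # None if the chromosome is absent from the file
    if actual != expected:
        print(f"Mismatch for {chrom}: {actual} != {expected}")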
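
A validation pass usually needs to report missing and unexpected chromosomes as well as mismatched digests. The helper below sketches one way to do that over two plain dictionaries. `compare_digests` is a hypothetical function, and the digest strings are made-up placeholders rather than real values.

```python
def compare_digests(actual, expected):
    """Classify differences between observed and trusted digest maps."""
    mismatched = {
        chrom: (actual[chrom], expected[chrom])
        for chrom in expected
        if chrom in actual and actual[chrom] != expected[chrom]
    }
    missing = sorted(c for c in expected if c not in actual)
    unexpected = sorted(c for c in actual if c not in expected)
    return {"mismatched": mismatched, "missing": missing, "unexpected": unexpected}


# Placeholder digests for demonstration only.
actual = {"chr1": "SQ.aaa", "chr2": "SQ.bbb", "chrUn": "SQ.zzz"}
expected = {"chr1": "SQ.aaa", "chr2": "SQ.ccc", "chr3": "SQ.ddd"}

report = compare_digests(actual, expected)
print(report["mismatched"])
print(report["missing"])
print(report["unexpected"])
```

Separating the three cases makes it easier to tell a corrupted sequence from a reference that simply uses a different set of contigs.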

Sequence Extraction

Extract specific genomic regions:

# Extract regions of interest
store = gtars.RefgetStore.from_fasta("hg38.fa")

regions = [
    ("chr1", 1000, 2000),
    ("chr2", 5000, 6000),
    ("chr3", 10000, 11000)
]

sequences = [store.get_subsequence(c, s, e) for c, s, e in regions]

Cross-Reference Comparison

Compare sequences across different references:

# Load two reference versions
hg19 = gtars.RefgetStore.from_fasta("hg19.fa")
hg38 = gtars.RefgetStore.from_fasta("hg38.fa")

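# Compare digests for chromosomes present in both builds
hg38_chroms = set(hg38.chromosomes)
for chrom in hg19.chromosomes:
    if chrom not in hg38_chroms:
        print(f"{chrom} is missing from hg38")
        continue
    digest_19 = hg19.get_digest(chrom)
    digest_38 = hg38.get_digest(chrom)
    if digest_19 != digest_38:
        print(f"{chrom} differs between hg19 and hg38")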
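
Because digests identify content rather than names, two references can also be matched by digest when their naming conventions differ. The sketch below pairs up sequences that share a digest. `match_by_digest` is a hypothetical helper, and the names and digest strings are invented placeholders.

```python
def match_by_digest(digests_a, digests_b):
    """Return (name_a, name_b) pairs whose sequences share a digest."""
    by_digest_b = {}
    for name, digest in digests_b.items():
        by_digest_b.setdefault(digest, []).append(name)

    pairs = []
    for name_a, digest in digests_a.items():
        for name_b in by_digest_b.get(digest, []):
            pairs.append((name_a, name_b))
    return pairs


# Placeholder digest maps for two differently named references.
build_a = {"chrM": "SQ.placeholder_one", "chr1": "SQ.placeholder_two"}
build_b = {"MT": "SQ.placeholder_one", "1": "SQ.placeholder_three"}

print(match_by_digest(build_a, build_b))  # [('chrM', 'MT')]
```

Sequences left unpaired are the ones whose content actually changed between the two references.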

Performance Notes

  • Sequences are loaded on demand rather than all at once
  • Digests are cached after they are first computed
  • Subsequence extraction is efficient
  • Memory-mapped files are supported for large genomes
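
The first two points describe a common pattern: defer the expensive work until a value is requested, then remember the result. The sketch below shows that pattern in plain Python. `LazyDigests` and its loader callback are hypothetical and say nothing about how gtars implements caching internally.

```python
import base64
import hashlib


class LazyDigests:
    """Hypothetical wrapper: load a sequence only when asked, cache its digest."""

    def __init__(self, loader):
        self._loader = loader  # callable: name -> sequence string
        self._cache = {}       # name -> digest
        self.loads = 0         # counts loader calls, for demonstration

    def get_digest(self, name):
        if name not in self._cache:
            self.loads += 1
            seq = self._loader(name).upper().encode("ascii")
            raw = hashlib.sha512(seq).digest()[:24]
            self._cache[name] = "SQ." + base64.urlsafe_b64encode(raw).decode("ascii")
        return self._cache[name]


# A dict stands in for an on-disk reference in this demonstration.
sequences = {"chr1": "ACGTACGT", "chr2": "GGCCGGCC"}
lazy = LazyDigests(sequences.__getitem__)

first = lazy.get_digest("chr1")
second = lazy.get_digest("chr1")  # served from the cache, no second load
print(first == second, lazy.loads)  # True 1
```

Only the sequences that are actually requested are ever loaded, and repeated requests cost a dictionary lookup.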