Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,305 @@
---
name: ensembl-database
description: "Query Ensembl genome database REST API for 250+ species. Gene lookups, sequence retrieval, variant analysis, comparative genomics, orthologs, VEP predictions, for genomic research."
---
# Ensembl Database
## Overview
Access and query the Ensembl genome database, a comprehensive resource for vertebrate genomic data maintained by EMBL-EBI. The database provides gene annotations, sequences, variants, regulatory information, and comparative genomics data for over 250 species. Current release is 115 (September 2025).
## When to Use This Skill
This skill should be used when:
- Querying gene information by symbol or Ensembl ID
- Retrieving DNA, transcript, or protein sequences
- Analyzing genetic variants using the Variant Effect Predictor (VEP)
- Finding orthologs and paralogs across species
- Accessing regulatory features and genomic annotations
- Converting coordinates between genome assemblies (e.g., GRCh37 to GRCh38)
- Performing comparative genomics analyses
- Integrating Ensembl data into genomic research pipelines
## Core Capabilities
### 1. Gene Information Retrieval
Query gene data by symbol, Ensembl ID, or external database identifiers.
**Common operations:**
- Look up gene information by symbol (e.g., "BRCA2", "TP53")
- Retrieve transcript and protein information
- Get gene coordinates and chromosomal locations
- Access cross-references to external databases (UniProt, RefSeq, etc.)
**Using the ensembl_rest package:**
```python
from ensembl_rest import EnsemblClient
client = EnsemblClient()
# Look up gene by symbol
gene_data = client.symbol_lookup(
species='human',
symbol='BRCA2'
)
# Get detailed gene information
gene_info = client.lookup_id(
id='ENSG00000139618', # BRCA2 Ensembl ID
expand=True
)
```
**Direct REST API (no package):**
```python
import requests
server = "https://rest.ensembl.org"
# Symbol lookup
response = requests.get(
f"{server}/lookup/symbol/homo_sapiens/BRCA2",
headers={"Content-Type": "application/json"}
)
gene_data = response.json()
```
### 2. Sequence Retrieval
Fetch genomic, transcript, or protein sequences in various formats (JSON, FASTA, plain text).
**Operations:**
- Get DNA sequences for genes or genomic regions
- Retrieve transcript sequences (cDNA)
- Access protein sequences
- Extract sequences with flanking regions or modifications
**Example:**
```python
# Using ensembl_rest package
sequence = client.sequence_id(
id='ENSG00000139618', # Gene ID
content_type='application/json'
)
# Get sequence for a genomic region
region_seq = client.sequence_region(
species='human',
region='7:140424943-140624564' # chromosome:start-end
)
```
### 3. Variant Analysis
Query genetic variation data and predict variant consequences using the Variant Effect Predictor (VEP).
**Capabilities:**
- Look up variants by rsID or genomic coordinates
- Predict functional consequences of variants
- Access population frequency data
- Retrieve phenotype associations
**VEP example:**
```python
# Predict variant consequences
vep_result = client.vep_hgvs(
species='human',
hgvs_notation='ENST00000380152.7:c.803C>T'
)
# Query variant by rsID
variant = client.variation_id(
species='human',
id='rs699'
)
```
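A VEP call returns a JSON list with one record per input variant. The sketch below shows one way to reduce those records to headline fields. It assumes the records carry the usual `most_severe_consequence` and `transcript_consequences` keys; the helper name `summarize_vep` and the sample record are hypothetical and hand-written, not real API output.

```python
from typing import Any, Dict, List


def summarize_vep(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduce VEP JSON records to one summary dict per input variant."""
    summaries = []
    for record in records:
        # Collect the distinct gene symbols touched by any transcript consequence.
        genes = sorted({
            tc["gene_symbol"]
            for tc in record.get("transcript_consequences", [])
            if "gene_symbol" in tc
        })
        summaries.append({
            "input": record.get("input"),
            "most_severe": record.get("most_severe_consequence"),
            "genes": genes,
        })
    return summaries


# Hand-written sample shaped like a VEP response (placeholder values only).
sample = [{
    "input": "example_variant",
    "most_severe_consequence": "missense_variant",
    "transcript_consequences": [
        {"gene_symbol": "GENE_A", "consequence_terms": ["missense_variant"]},
        {"gene_symbol": "GENE_A", "consequence_terms": ["intron_variant"]},
    ],
}]
print(summarize_vep(sample))
```

Records without transcript consequences (for example, purely intergenic variants) simply yield an empty `genes` list.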
### 4. Comparative Genomics
Perform cross-species comparisons to identify orthologs, paralogs, and evolutionary relationships.
**Operations:**
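- Find orthologs (genes in different species that descend from the same ancestral gene through speciation)
- Identify paralogs (genes related through a duplication event, typically within the same species)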
- Access gene trees showing evolutionary relationships
- Retrieve gene family information
**Example:**
```python
# Find orthologs for a human gene
orthologs = client.homology_ensemblgene(
id='ENSG00000139618', # Human BRCA2
target_species='mouse'
)
# Get gene tree
gene_tree = client.genetree_member_symbol(
species='human',
symbol='BRCA2'
)
```
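Homology results come back nested, so it usually helps to flatten them before further analysis. A minimal sketch, assuming the full (non-condensed) JSON layout in which each homology entry has a `type` and a `target` object with `id` and `species`; the function name `extract_orthologs` and the sample identifiers are placeholders, not real Ensembl output.

```python
from typing import Any, Dict, List, Tuple


def extract_orthologs(response: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (target_species, target_gene_id, homology_type) tuples."""
    results = []
    for entry in response.get("data", []):
        for homology in entry.get("homologies", []):
            # Keep ortholog_* relationships; skip paralogs and other types.
            if not homology.get("type", "").startswith("ortholog"):
                continue
            target = homology.get("target", {})
            results.append(
                (target.get("species", ""), target.get("id", ""), homology["type"])
            )
    return results


# Hand-written sample with placeholder identifiers.
sample_response = {
    "data": [{
        "id": "SOURCE_GENE",
        "homologies": [
            {"type": "ortholog_one2one",
             "target": {"id": "TARGET_GENE_1", "species": "mus_musculus"}},
            {"type": "within_species_paralog",
             "target": {"id": "TARGET_GENE_2", "species": "homo_sapiens"}},
        ],
    }]
}
print(extract_orthologs(sample_response))
```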
### 5. Genomic Region Analysis
Find all genomic features (genes, transcripts, regulatory elements) in a specific region.
**Use cases:**
- Identify all genes in a chromosomal region
- Find regulatory features (promoters, enhancers)
- Locate variants within a region
- Retrieve structural features
**Example:**
```python
# Find all features in a region
features = client.overlap_region(
species='human',
region='7:140424943-140624564',
feature='gene'
)
```
### 6. Assembly Mapping
Convert coordinates between different genome assemblies (e.g., GRCh37 to GRCh38).
**Important:** Use `https://grch37.rest.ensembl.org` for GRCh37/hg19 queries and `https://rest.ensembl.org` for current assemblies.
**Example:**
```python
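from ensembl_rest import AssemblyMapper
# Map a position from GRCh37 to GRCh38.
# The keyword names below follow the package README; confirm them with
# help(AssemblyMapper) for the version you have installed.
mapper = AssemblyMapper(
    species='human',
    from_assembly='GRCh37',
    to_assembly='GRCh38'
)
mapped = mapper.map(chrom='7', pos=140453136)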
```
## API Best Practices
### Rate Limiting
The Ensembl REST API has rate limits. Follow these practices:
1. **Respect rate limits:** Stay at or below about 15 requests per second (the documented allowance is 55,000 requests per hour)
2. **Handle 429 responses:** When rate-limited, check the `Retry-After` header and wait
3. **Use batch endpoints:** When querying multiple items, use batch endpoints where available
4. **Cache results:** Store frequently accessed data to reduce API calls
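One way to stay under the per-second limit on the client side is a sliding-window limiter. The sketch below is illustrative only: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and the `clock` and `sleep` arguments are injectable purely so the behavior can be checked without real waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls in any window of `period` seconds."""

    def __init__(
        self,
        max_calls: int = 15,
        period: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps: Deque[float] = deque()

    def wait(self) -> float:
        """Block until a call is allowed; return the seconds slept."""
        now = self._clock()
        # Drop timestamps that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        slept = 0.0
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            slept = self.period - (now - self._stamps[0])
            self._sleep(slept)
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)
        return slept


limiter = SlidingWindowLimiter(max_calls=15, period=1.0)
limiter.wait()  # call this immediately before each HTTP request
```

A client-side limiter reduces 429 responses but does not replace handling them, since other processes sharing the same IP address also count toward the limit.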
### Error Handling
Always implement proper error handling:
```python
import time

import requests


def query_ensembl(endpoint, params=None, max_retries=3):
    """GET an Ensembl REST endpoint, retrying when the server rate-limits us."""
    server = "https://rest.ensembl.org"
    headers = {"Content-Type": "application/json"}
    for attempt in range(max_retries):
        response = requests.get(
            f"{server}{endpoint}",
            headers=headers,
            params=params
        )
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            # Rate limited: Retry-After may be fractional, so parse it as a float
            retry_after = float(response.headers.get('Retry-After', 1))
            time.sleep(retry_after)
        else:
            response.raise_for_status()
    raise Exception(f"Failed after {max_retries} attempts")
```
## Installation
### Python Package (Recommended)
```bash
uv pip install ensembl_rest
```
The `ensembl_rest` package provides a Pythonic interface to all Ensembl REST API endpoints.
### Direct REST API
No installation needed - use standard HTTP libraries like `requests`:
```bash
uv pip install requests
```
## Resources
### references/
- `api_endpoints.md`: Comprehensive documentation of all 17 API endpoint categories with examples and parameters
### scripts/
- `ensembl_query.py`: Reusable Python script for common Ensembl queries with built-in rate limiting and error handling
## Common Workflows
### Workflow 1: Gene Annotation Pipeline
1. Look up gene by symbol to get Ensembl ID
2. Retrieve transcript information
3. Get protein sequences for all transcripts
4. Find orthologs in other species
5. Export results
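Steps 1 to 3 of this workflow can be sketched as a single function. This is a rough illustration, not the skill's own implementation: `annotate_gene` and the `fetch` callable are hypothetical names, and the response shapes (`Transcript`, `Translation`, `seq`) are assumed to match what `/lookup/symbol` with `expand=1` and `/sequence/id` return as JSON. The `fake_fetch` stub stands in for real network calls and returns placeholder data.

```python
from typing import Any, Callable, Dict, List, Optional

Fetch = Callable[[str, Optional[Dict[str, Any]]], Any]


def annotate_gene(symbol: str, species: str, fetch: Fetch) -> Dict[str, Any]:
    """Resolve a symbol, list its transcripts, and collect protein sequences."""
    gene = fetch(f"/lookup/symbol/{species}/{symbol}", {"expand": 1})
    transcripts: List[str] = []
    proteins: Dict[str, str] = {}
    for transcript in gene.get("Transcript", []):
        transcripts.append(transcript["id"])
        translation = transcript.get("Translation")
        if translation:  # non-coding transcripts have no translation
            seq = fetch(f"/sequence/id/{translation['id']}", {"type": "protein"})
            proteins[translation["id"]] = seq["seq"]
    return {"gene_id": gene["id"], "transcripts": transcripts, "proteins": proteins}


def fake_fetch(endpoint: str, params: Optional[Dict[str, Any]] = None) -> Any:
    """Offline stand-in for an HTTP helper; returns placeholder data."""
    if endpoint.startswith("/lookup/symbol/"):
        return {"id": "GENE_ID", "Transcript": [
            {"id": "TX_1", "Translation": {"id": "PROT_1"}},
            {"id": "TX_2"},
        ]}
    return {"id": endpoint.rsplit("/", 1)[-1], "seq": "PLACEHOLDERSEQ"}


print(annotate_gene("BRCA2", "human", fake_fetch))
```

For live queries, a helper such as the `query_ensembl` function from the Error Handling section could be passed as `fetch`, since it accepts an endpoint and a params dict in the same order.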
### Workflow 2: Variant Analysis
1. Query variant by rsID or coordinates
2. Use VEP to predict functional consequences
3. Check population frequencies
4. Retrieve phenotype associations
5. Generate report
### Workflow 3: Comparative Analysis
1. Start with gene of interest in reference species
2. Find orthologs in target species
3. Retrieve sequences for all orthologs
4. Compare gene structures and features
5. Analyze evolutionary conservation
## Species and Assembly Information
To query available species and assemblies:
```python
# List all available species
species_list = client.info_species()
# Get assembly information for a species
assembly_info = client.info_assembly(species='human')
```
Common species identifiers:
- Human: `homo_sapiens` or `human`
- Mouse: `mus_musculus` or `mouse`
- Zebrafish: `danio_rerio` or `zebrafish`
- Fruit fly: `drosophila_melanogaster`
## Additional Resources
- **Official Documentation:** https://rest.ensembl.org/documentation
- **Python Package Docs:** https://ensemblrest.readthedocs.io
- **EBI Training:** https://www.ebi.ac.uk/training/online/courses/ensembl-rest-api/
- **Ensembl Browser:** https://useast.ensembl.org
- **GitHub Examples:** https://github.com/Ensembl/ensembl-rest/wiki


@@ -0,0 +1,346 @@
# Ensembl REST API Endpoints Reference
Comprehensive documentation of all 17 API endpoint categories available in the Ensembl REST API (Release 115, September 2025).
**Base URLs:**
- Current assemblies: `https://rest.ensembl.org`
- GRCh37/hg19 (human): `https://grch37.rest.ensembl.org`
**Rate Limits:**
- 55,000 requests per hour per client, which averages roughly 15 requests per second
- Exceeding the limit returns HTTP 429 with a `Retry-After` header
## 1. Archive
Retrieve historical information about retired Ensembl identifiers.
**GET /archive/id/:id**
- Retrieve archived entries for a retired identifier
- Example: `/archive/id/ENSG00000157764` (returns the latest version of the identifier and whether it is still current)
## 2. Comparative Genomics
Access gene trees, genomic alignments, and homology data across species.
**GET /alignment/region/:species/:region**
- Get genomic alignments for a region
- Example: `/alignment/region/human/2:106040000-106040050:1?species_set_group=mammals`
**GET /genetree/id/:id**
- Retrieve gene tree for a gene family
- Example: `/genetree/id/ENSGT00390000003602`
**GET /genetree/member/id/:id**
- Get gene tree by member gene ID
- Example: `/genetree/member/id/ENSG00000139618`
**GET /homology/id/:id**
- Find orthologs and paralogs for a gene
- Parameters: `target_species`, `type` (orthologues, paralogues, all)
- Example: `/homology/id/ENSG00000139618?target_species=mouse`
**GET /homology/symbol/:species/:symbol**
- Find homologs by gene symbol
- Example: `/homology/symbol/human/BRCA2?target_species=mouse`
## 3. Cross References
Link external database identifiers to Ensembl objects.
**GET /xrefs/id/:id**
- Get external references for Ensembl ID
- Example: `/xrefs/id/ENSG00000139618`
**GET /xrefs/symbol/:species/:symbol**
- Get cross-references by gene symbol
- Example: `/xrefs/symbol/human/BRCA2`
**GET /xrefs/name/:species/:name**
- Search for objects by external name
- Example: `/xrefs/name/human/NP_000050`
## 4. Information
Query metadata about species, assemblies, biotypes, and database versions.
**GET /info/species**
- List all available species
- Returns species names, assemblies, taxonomy IDs
**GET /info/assembly/:species**
- Get assembly information for a species
- Example: `/info/assembly/human` (returns GRCh38.p14)
**GET /info/assembly/:species/:region**
- Get detailed information about a chromosomal region
- Example: `/info/assembly/human/X`
**GET /info/biotypes/:species**
- List all available biotypes (gene types)
- Example: `/info/biotypes/human`
**GET /info/analysis/:species**
- List available analysis types
- Example: `/info/analysis/human`
**GET /info/data**
- Get general information about the current Ensembl release
## 5. Linkage Disequilibrium (LD)
Calculate linkage disequilibrium between variants.
**GET /ld/:species/:id/:population_name**
- Calculate LD for a variant
- Example: `/ld/human/rs1042522/1000GENOMES:phase_3:KHV`
**GET /ld/pairwise/:species/:id1/:id2**
- Calculate LD between two variants
- Example: `/ld/pairwise/human/rs1042522/rs11540652`
## 6. Lookup
Identify species and database information for identifiers.
**GET /lookup/id/:id**
- Look up object by Ensembl ID
- Parameter: `expand` (include child objects)
- Example: `/lookup/id/ENSG00000139618?expand=1`
**POST /lookup/id**
- Batch lookup multiple IDs
- Submit JSON array of IDs
- Example: `{"ids": ["ENSG00000139618", "ENSG00000157764"]}`
**GET /lookup/symbol/:species/:symbol**
- Look up gene by symbol
- Parameter: `expand` (include transcripts)
- Example: `/lookup/symbol/human/BRCA2?expand=1`
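For the batch form of `/lookup/id`, a long ID list has to be split into request bodies of bounded size. A minimal sketch of that chunking follows; the helper name `batch_payloads` is hypothetical, and the default of 1000 IDs per request is an assumption, so check the endpoint's documented maximum POST size before relying on it.

```python
from typing import Dict, Iterator, List, Sequence


def batch_payloads(
    ids: Sequence[str], batch_size: int = 1000
) -> Iterator[Dict[str, List[str]]]:
    """Yield JSON bodies for POST /lookup/id, each with at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    unique = list(dict.fromkeys(ids))  # de-duplicate, keeping first-seen order
    for start in range(0, len(unique), batch_size):
        yield {"ids": unique[start:start + batch_size]}


for payload in batch_payloads(["ENSG00000139618", "ENSG00000157764"], batch_size=1):
    print(payload)
```

Each payload can then be sent with `requests.post(url, json=payload, headers=...)`, asking for JSON in both the `Content-Type` and `Accept` headers.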
## 7. Mapping
Convert coordinates between assemblies, cDNA, CDS, and protein positions.
**GET /map/cdna/:id/:region**
- Map cDNA coordinates to genomic
- Example: `/map/cdna/ENST00000288602/100..300`
**GET /map/cds/:id/:region**
- Map CDS coordinates to genomic
- Example: `/map/cds/ENST00000288602/1..300`
**GET /map/translation/:id/:region**
- Map protein coordinates to genomic
- Example: `/map/translation/ENSP00000288602/1..100`
**GET /map/:species/:asm_one/:region/:asm_two**
- Map coordinates between assemblies
- Example: `/map/human/GRCh37/7:140453136..140453136/GRCh38`
**POST /map/:species/:asm_one/:asm_two**
- Batch assembly mapping
- Submit JSON array of regions
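The assembly-mapping response wraps each result in a `mappings` list with `original` and `mapped` objects. The sketch below builds the GET path and flattens the mapped intervals. The function names are hypothetical, and the sample response is hand-written with placeholder coordinates; it is not real mapping output.

```python
from typing import Any, Dict, List, Tuple


def map_endpoint(
    species: str, asm_from: str, chrom: str, start: int, end: int, asm_to: str
) -> str:
    """Build the path for GET /map/:species/:asm_one/:region/:asm_two."""
    return f"/map/{species}/{asm_from}/{chrom}:{start}..{end}/{asm_to}"


def mapped_regions(response: Dict[str, Any]) -> List[Tuple[str, int, int, int]]:
    """Return (seq_region_name, start, end, strand) for each mapped segment."""
    regions = []
    for mapping in response.get("mappings", []):
        mapped = mapping["mapped"]
        regions.append(
            (mapped["seq_region_name"], mapped["start"], mapped["end"], mapped["strand"])
        )
    return regions


print(map_endpoint("human", "GRCh37", "7", 140453136, 140453136, "GRCh38"))

# Hand-written sample with placeholder coordinates.
sample_mapping = {"mappings": [{
    "original": {"seq_region_name": "1", "start": 100, "end": 200,
                 "strand": 1, "assembly": "GRCh37"},
    "mapped": {"seq_region_name": "1", "start": 300, "end": 400,
               "strand": 1, "assembly": "GRCh38"},
}]}
print(mapped_regions(sample_mapping))
```

An empty `mappings` list means the region has no equivalent in the target assembly, and a region that spans an assembly gap may come back split into several segments, so callers should not assume exactly one result.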
## 8. Ontologies and Taxonomy
Search biological ontologies and taxonomic classifications.
**GET /ontology/id/:id**
- Get ontology term information
- Example: `/ontology/id/GO:0005515`
**GET /ontology/name/:name**
- Search ontology by term name
- Example: `/ontology/name/protein%20binding`
**GET /taxonomy/classification/:id**
- Get taxonomic classification
- Example: `/taxonomy/classification/9606` (human)
**GET /taxonomy/id/:id**
- Get taxonomy information by ID
- Example: `/taxonomy/id/9606`
## 9. Overlap
Find genomic features overlapping a region.
**GET /overlap/id/:id**
- Get features overlapping a gene/transcript
- Parameters: `feature` (gene, transcript, cds, exon, repeat, etc.)
- Example: `/overlap/id/ENSG00000139618?feature=transcript`
**GET /overlap/region/:species/:region**
- Get all features in a genomic region
- Parameters: `feature` (gene, transcript, variation, regulatory, etc.)
- Example: `/overlap/region/human/7:140424943..140624564?feature=gene`
**GET /overlap/translation/:id**
- Get protein features
- Example: `/overlap/translation/ENSP00000288602`
## 10. Phenotype Annotations
Retrieve disease and trait associations.
**GET /phenotype/accession/:species/:accession**
- Get phenotypes by ontology accession
- Example: `/phenotype/accession/human/EFO:0003767`
**GET /phenotype/gene/:species/:gene**
- Get phenotype associations for a gene
- Example: `/phenotype/gene/human/ENSG00000139618`
**GET /phenotype/region/:species/:region**
- Get phenotypes in genomic region
- Example: `/phenotype/region/human/7:140424943-140624564`
**GET /phenotype/term/:species/:term**
- Search phenotypes by term
- Example: `/phenotype/term/human/cancer`
## 11. Regulation
Access regulatory feature and binding motif data.
**GET /regulatory/species/:species/microarray/:microarray/:probe**
- Get microarray probe information
- Example: `/regulatory/species/human/microarray/HumanWG_6_V2/ILMN_1773626`
**GET /species/:species/binding_matrix/:binding_matrix_id**
- Get transcription factor binding matrix
- Example: `/species/human/binding_matrix/ENSPFM0001`
## 12. Sequence
Retrieve genomic, transcript, and protein sequences.
**GET /sequence/id/:id**
- Get sequence by ID
- Parameters: `type` (genomic, cds, cdna, protein), `format` (json, fasta, text)
- Example: `/sequence/id/ENSG00000139618?type=genomic`
**POST /sequence/id**
- Batch sequence retrieval
- Example: `{"ids": ["ENSG00000139618", "ENSG00000157764"]}`
**GET /sequence/region/:species/:region**
- Get genomic sequence for region
- Parameters: `coord_system`, `format`
- Example: `/sequence/region/human/7:140424943..140624564?format=fasta`
**POST /sequence/region/:species**
- Batch region sequence retrieval
## 13. Transcript Haplotypes
Compute transcript haplotypes from phased genotypes.
**GET /transcript_haplotypes/:species/:id**
- Get transcript haplotypes
- Example: `/transcript_haplotypes/human/ENST00000288602`
## 14. Variant Effect Predictor (VEP)
Predict functional consequences of variants.
**GET /vep/:species/hgvs/:hgvs_notation**
- Predict variant effects using HGVS notation
- Parameters: numerous VEP options
- Example: `/vep/human/hgvs/ENST00000288602:c.803C>T`
**POST /vep/:species/hgvs**
- Batch VEP analysis with HGVS
- Example: `{"hgvs_notations": ["ENST00000288602:c.803C>T"]}`
**GET /vep/:species/id/:id**
- Predict effects for variant ID
- Example: `/vep/human/id/rs699`
**POST /vep/:species/id**
- Batch VEP by variant IDs
**GET /vep/:species/region/:region/:allele**
- Predict effects for region and allele
- Example: `/vep/human/region/7:140453136-140453136:1/T` (region as `chromosome:start-end:strand`, followed by the alternate allele)
**POST /vep/:species/region**
- Batch VEP by regions
## 15. Variation
Query genetic variation data and associated publications.
**GET /variation/:species/:id**
- Get variant information by ID
- Parameters: `pops` (include population frequencies), `genotypes`
- Example: `/variation/human/rs699?pops=1`
**POST /variation/:species**
- Batch variant queries
- Example: `{"ids": ["rs699", "rs6025"]}`
**GET /variation/:species/pmcid/:pmcid**
- Get variants from PubMed Central article
- Example: `/variation/human/pmcid/PMC5002951`
**GET /variation/:species/pmid/:pmid**
- Get variants from PubMed article
- Example: `/variation/human/pmid/26318936`
## 16. Variation GA4GH
Access genomic variation data using GA4GH standards.
**POST /ga4gh/beacon**
- Query beacon for variant presence
**GET /ga4gh/features/:id**
- Get feature by ID in GA4GH format
**POST /ga4gh/features/search**
- Search features using GA4GH protocol
**POST /ga4gh/variants/search**
- Search variants using GA4GH protocol
## Response Formats
Most endpoints support multiple response formats:
- **JSON** (default): `Content-Type: application/json`
- **FASTA**: For sequence data
- **XML**: Some endpoints support XML
- **Text**: Plain text output
Specify format using:
1. `Content-Type` header
2. URL parameter: `content-type=text/x-fasta`
3. File extension: `/sequence/id/ENSG00000139618.fasta`
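Of the three options, the header approach is the easiest to centralize in code. A small sketch, assuming the standard MIME types for each format; the `CONTENT_TYPES` table and the `build_request` helper are illustrative names, not part of any Ensembl client.

```python
from typing import Dict, Tuple

# Format keyword -> MIME type sent in the Content-Type header.
CONTENT_TYPES = {
    "json": "application/json",
    "fasta": "text/x-fasta",
    "text": "text/plain",
    "xml": "text/xml",
}


def build_request(
    endpoint: str, fmt: str = "json", server: str = "https://rest.ensembl.org"
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for the requested output format."""
    try:
        content_type = CONTENT_TYPES[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return f"{server}{endpoint}", {"Content-Type": content_type}


url, headers = build_request("/sequence/id/ENSG00000139618", "fasta")
print(url, headers)
```

The returned pair can be passed straight to `requests.get(url, headers=headers)`; for non-JSON formats, read `response.text` rather than calling `response.json()`.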
## Common Parameters
Many endpoints share these parameters:
- **expand**: Include child objects (transcripts, proteins)
- **format**: Output format (json, xml, fasta)
- **db_type**: Database type (core, otherfeatures, variation)
- **object_type**: Type of object to return
- **species**: Species name (can be common or scientific)
## Error Codes
- **200**: Success
- **400**: Bad request (invalid parameters)
- **404**: Not found (ID doesn't exist)
- **429**: Rate limit exceeded
- **500**: Internal server error
## Best Practices
1. **Use batch endpoints** for multiple queries (more efficient)
2. **Cache responses** to minimize API calls
3. **Check rate limit headers** in responses
4. **Handle 429 errors** by respecting `Retry-After` header
5. **Use appropriate content types** for sequence data
6. **Specify assembly** when querying older genome versions
7. **Enable expand parameter** when you need full object details


@@ -0,0 +1,427 @@
#!/usr/bin/env python3
"""
Ensembl REST API Query Script
Reusable functions for common Ensembl database queries with built-in rate limiting and error handling.
Usage:
python ensembl_query.py --gene BRCA2 --species human
python ensembl_query.py --variant rs699 --species human
python ensembl_query.py --region "7:140424943-140624564" --species human
"""
import requests
import time
import json
import argparse
from typing import Dict, List, Optional, Any


class EnsemblAPIClient:
    """Client for querying the Ensembl REST API with rate limiting and error handling."""

    def __init__(self, server: str = "https://rest.ensembl.org", rate_limit: int = 15):
        """
        Initialize the Ensembl API client.

        Args:
            server: Base URL for the Ensembl REST API
            rate_limit: Maximum requests per second (default 15 for anonymous users)
        """
        self.server = server
        self.rate_limit = rate_limit
        self.request_count = 0
        self.last_request_time = 0

    def _rate_limit_check(self):
        """Enforce rate limiting before making requests."""
        current_time = time.time()
        time_since_last = current_time - self.last_request_time
        if time_since_last < 1.0:
            if self.request_count >= self.rate_limit:
                sleep_time = 1.0 - time_since_last
                time.sleep(sleep_time)
                self.request_count = 0
                self.last_request_time = time.time()
        else:
            self.request_count = 0
            self.last_request_time = current_time

    def _make_request(
        self,
        endpoint: str,
        params: Optional[Dict] = None,
        max_retries: int = 3,
        method: str = "GET",
        data: Optional[Dict] = None
    ) -> Any:
        """
        Make an API request with error handling and retries.

        Args:
            endpoint: API endpoint path
            params: Query parameters
            max_retries: Maximum number of retry attempts
            method: HTTP method (GET or POST)
            data: JSON data for POST requests

        Returns:
            JSON response data

        Raises:
            Exception: If request fails after max retries
        """
        headers = {"Content-Type": "application/json"}
        url = f"{self.server}{endpoint}"
        for attempt in range(max_retries):
            self._rate_limit_check()
            self.request_count += 1
            try:
                if method == "POST":
                    response = requests.post(url, headers=headers, json=data)
                else:
                    response = requests.get(url, headers=headers, params=params)
                if response.status_code == 200:
                    return response.json()
                elif response.status_code == 429:
                    # Rate limited - wait and retry.
                    # Retry-After may be fractional, so parse it as a float.
                    retry_after = float(response.headers.get('Retry-After', 1))
                    print(f"Rate limited. Waiting {retry_after} seconds...")
                    time.sleep(retry_after)
                elif response.status_code == 404:
                    raise Exception(f"Resource not found: {endpoint}")
                else:
                    response.raise_for_status()
            except requests.exceptions.RequestException as e:
                if attempt == max_retries - 1:
                    raise Exception(f"Request failed after {max_retries} attempts: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff
        raise Exception(f"Failed after {max_retries} attempts")

    def lookup_gene_by_symbol(self, species: str, symbol: str, expand: bool = True) -> Dict:
        """
        Look up gene information by symbol.

        Args:
            species: Species name (e.g., 'human', 'mouse')
            symbol: Gene symbol (e.g., 'BRCA2', 'TP53')
            expand: Include transcript information

        Returns:
            Gene information dictionary
        """
        endpoint = f"/lookup/symbol/{species}/{symbol}"
        params = {"expand": 1} if expand else {}
        return self._make_request(endpoint, params=params)

    def lookup_by_id(self, ensembl_id: str, expand: bool = False) -> Dict:
        """
        Look up object by Ensembl ID.

        Args:
            ensembl_id: Ensembl identifier (e.g., 'ENSG00000139618')
            expand: Include child objects

        Returns:
            Object information dictionary
        """
        endpoint = f"/lookup/id/{ensembl_id}"
        params = {"expand": 1} if expand else {}
        return self._make_request(endpoint, params=params)

    def get_sequence(
        self,
        ensembl_id: str,
        seq_type: str = "genomic",
        format: str = "json"
    ) -> Any:
        """
        Retrieve sequence by Ensembl ID.

        Args:
            ensembl_id: Ensembl identifier
            seq_type: Sequence type ('genomic', 'cds', 'cdna', 'protein')
            format: Output format ('json', 'fasta', 'text')

        Returns:
            Sequence data (parsed JSON, or raw text for 'fasta' and 'text')
        """
        endpoint = f"/sequence/id/{ensembl_id}"
        params = {"type": seq_type}
        if format in ("fasta", "text"):
            # Non-JSON formats return raw text, so skip the JSON parsing in
            # _make_request but still honor the rate limit and surface HTTP errors.
            content_type = "text/x-fasta" if format == "fasta" else "text/plain"
            self._rate_limit_check()
            self.request_count += 1
            url = f"{self.server}{endpoint}"
            response = requests.get(url, headers={"Content-Type": content_type}, params=params)
            response.raise_for_status()
            return response.text
        return self._make_request(endpoint, params=params)

    def get_region_sequence(
        self,
        species: str,
        region: str,
        format: str = "json"
    ) -> Any:
        """
        Get genomic sequence for a region.

        Args:
            species: Species name
            region: Region string (e.g., '7:140424943-140624564')
            format: Output format ('json', 'fasta', 'text')

        Returns:
            Sequence data (parsed JSON, or raw text for 'fasta' and 'text')
        """
        endpoint = f"/sequence/region/{species}/{region}"
        if format in ("fasta", "text"):
            # Same raw-text path as get_sequence.
            content_type = "text/x-fasta" if format == "fasta" else "text/plain"
            self._rate_limit_check()
            self.request_count += 1
            url = f"{self.server}{endpoint}"
            response = requests.get(url, headers={"Content-Type": content_type})
            response.raise_for_status()
            return response.text
        return self._make_request(endpoint)

    def get_variant(self, species: str, variant_id: str, include_pops: bool = True) -> Dict:
        """
        Get variant information by ID.

        Args:
            species: Species name
            variant_id: Variant identifier (e.g., 'rs699')
            include_pops: Include population frequencies

        Returns:
            Variant information dictionary
        """
        endpoint = f"/variation/{species}/{variant_id}"
        params = {"pops": 1} if include_pops else {}
        return self._make_request(endpoint, params=params)

    def predict_variant_effect(
        self,
        species: str,
        hgvs_notation: str
    ) -> List[Dict]:
        """
        Predict variant consequences using VEP.

        Args:
            species: Species name
            hgvs_notation: HGVS notation (e.g., 'ENST00000288602:c.803C>T')

        Returns:
            List of predicted consequences
        """
        endpoint = f"/vep/{species}/hgvs/{hgvs_notation}"
        return self._make_request(endpoint)

    def find_orthologs(
        self,
        ensembl_id: str,
        target_species: Optional[str] = None
    ) -> Dict:
        """
        Find orthologs for a gene.

        Args:
            ensembl_id: Source gene Ensembl ID
            target_species: Target species (optional, returns all if not specified)

        Returns:
            Homology information dictionary
        """
        endpoint = f"/homology/id/{ensembl_id}"
        params = {}
        if target_species:
            params["target_species"] = target_species
        return self._make_request(endpoint, params=params)

    def get_region_features(
        self,
        species: str,
        region: str,
        feature_type: str = "gene"
    ) -> List[Dict]:
        """
        Get genomic features in a region.

        Args:
            species: Species name
            region: Region string (e.g., '7:140424943-140624564')
            feature_type: Feature type ('gene', 'transcript', 'variation', etc.)

        Returns:
            List of features
        """
        endpoint = f"/overlap/region/{species}/{region}"
        params = {"feature": feature_type}
        return self._make_request(endpoint, params=params)

    def get_species_info(self) -> List[Dict]:
        """
        Get information about all available species.

        Returns:
            List of species information dictionaries
        """
        endpoint = "/info/species"
        result = self._make_request(endpoint)
        return result.get("species", [])

    def get_assembly_info(self, species: str) -> Dict:
        """
        Get assembly information for a species.

        Args:
            species: Species name

        Returns:
            Assembly information dictionary
        """
        endpoint = f"/info/assembly/{species}"
        return self._make_request(endpoint)

    def map_coordinates(
        self,
        species: str,
        asm_from: str,
        region: str,
        asm_to: str
    ) -> Dict:
        """
        Map coordinates between genome assemblies.

        Args:
            species: Species name
            asm_from: Source assembly (e.g., 'GRCh37')
            region: Region string (e.g., '7:140453136-140453136')
            asm_to: Target assembly (e.g., 'GRCh38')

        Returns:
            Mapped coordinates
        """
        endpoint = f"/map/{species}/{asm_from}/{region}/{asm_to}"
        return self._make_request(endpoint)


def main():
    """Command-line interface for common Ensembl queries."""
    parser = argparse.ArgumentParser(
        description="Query the Ensembl database via REST API"
    )
    parser.add_argument("--gene", help="Gene symbol to look up")
    parser.add_argument("--ensembl-id", help="Ensembl ID to look up")
    parser.add_argument("--variant", help="Variant ID (e.g., rs699)")
    parser.add_argument("--region", help="Genomic region (chr:start-end)")
    parser.add_argument(
        "--species",
        default="human",
        help="Species name (default: human)"
    )
    parser.add_argument(
        "--orthologs",
        help="Find orthologs for gene (provide Ensembl ID)"
    )
    parser.add_argument(
        "--target-species",
        help="Target species for ortholog search"
    )
    parser.add_argument(
        "--sequence",
        action="store_true",
        help="Retrieve sequence (requires --gene or --ensembl-id or --region)"
    )
    parser.add_argument(
        "--format",
        choices=["json", "fasta"],
        default="json",
        help="Output format (default: json)"
    )
    # Default to the current assembly; the original default of GRCh37 silently
    # routed every query to the GRCh37 archive server.
    parser.add_argument(
        "--assembly",
        default="GRCh38",
        help="Genome assembly (default: GRCh38); GRCh37 uses the grch37.rest.ensembl.org server"
    )
    args = parser.parse_args()

    # Select appropriate server
    server = "https://rest.ensembl.org"
    if args.assembly.lower() == "grch37":
        server = "https://grch37.rest.ensembl.org"
    client = EnsemblAPIClient(server=server)

    try:
        if args.gene:
            print(f"Looking up gene: {args.gene}")
            result = client.lookup_gene_by_symbol(args.species, args.gene)
            if args.sequence:
                print(f"\nRetrieving sequence for {result['id']}...")
                seq_result = client.get_sequence(
                    result['id'],
                    format=args.format
                )
                print(json.dumps(seq_result, indent=2) if args.format == "json" else seq_result)
            else:
                print(json.dumps(result, indent=2))
        elif args.ensembl_id:
            print(f"Looking up ID: {args.ensembl_id}")
            result = client.lookup_by_id(args.ensembl_id, expand=True)
            if args.sequence:
                print("\nRetrieving sequence...")
                seq_result = client.get_sequence(
                    args.ensembl_id,
                    format=args.format
                )
                print(json.dumps(seq_result, indent=2) if args.format == "json" else seq_result)
            else:
                print(json.dumps(result, indent=2))
        elif args.variant:
            print(f"Looking up variant: {args.variant}")
            result = client.get_variant(args.species, args.variant)
            print(json.dumps(result, indent=2))
        elif args.region:
            if args.sequence:
                print(f"Retrieving sequence for region: {args.region}")
                result = client.get_region_sequence(
                    args.species,
                    args.region,
                    format=args.format
                )
                print(json.dumps(result, indent=2) if args.format == "json" else result)
            else:
                print(f"Finding features in region: {args.region}")
                result = client.get_region_features(args.species, args.region)
                print(json.dumps(result, indent=2))
        elif args.orthologs:
            print(f"Finding orthologs for: {args.orthologs}")
            result = client.find_orthologs(
                args.orthologs,
                target_species=args.target_species
            )
            print(json.dumps(result, indent=2))
        else:
            parser.print_help()
    except Exception as e:
        print(f"Error: {e}")
        return 1
    return 0


if __name__ == "__main__":
    # SystemExit carries main()'s return value as the process exit status
    raise SystemExit(main())