Initial commit
198
skills/ena-database/SKILL.md
Normal file
@@ -0,0 +1,198 @@
---
name: ena-database
description: "Access European Nucleotide Archive via API/FTP. Retrieve DNA/RNA sequences, raw reads (FASTQ), genome assemblies by accession, for genomics and bioinformatics pipelines. Supports multiple formats."
---

# ENA Database

## Overview

The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.

## When to Use This Skill

This skill should be used when:

- Retrieving nucleotide sequences or raw sequencing reads by accession
- Searching for samples, studies, or assemblies by metadata criteria
- Downloading FASTQ files or genome assemblies for analysis
- Querying taxonomic information for organisms
- Accessing sequence annotations and functional data
- Integrating ENA data into bioinformatics pipelines
- Performing cross-reference searches to related databases
- Bulk downloading datasets via FTP or Aspera

## Core Capabilities

### 1. Data Types and Structure

ENA organizes data into hierarchical object types:

**Studies/Projects** - Group related data and control release dates. Studies are the primary unit for citing archived data.

**Samples** - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.

**Raw Reads** - Consist of:
- **Experiments**: Metadata about sequencing methods, library preparation, and instrument details
- **Runs**: References to data files containing raw sequencing reads from a single sequencing run

**Assemblies** - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.

**Sequences** - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.

**Analyses** - Results from computational analyses of sequence data.

**Taxonomy Records** - Taxonomic information including lineage and rank.

### 2. Programmatic Access

ENA provides multiple REST APIs for data access. Consult `references/api_reference.md` for detailed endpoint documentation.

**Key APIs:**

**ENA Portal API** - Advanced search functionality across all ENA data types
- Documentation: https://www.ebi.ac.uk/ena/portal/api/doc
- Use for complex queries and metadata searches

**ENA Browser API** - Direct retrieval of records and metadata
- Documentation: https://www.ebi.ac.uk/ena/browser/api/doc
- Use for downloading specific records by accession
- Returns record metadata as XML; separate endpoints return sequences as EMBL flat file or FASTA

**ENA Taxonomy REST API** - Query taxonomic information
- Access lineage, rank, and related taxonomic data

**ENA Cross Reference Service** - Access related records from external databases
- Endpoint: https://www.ebi.ac.uk/ena/xref/rest/

**CRAM Reference Registry** - Retrieve reference sequences
- Endpoint: https://www.ebi.ac.uk/ena/cram/
- Query by MD5 or SHA1 checksums

**Rate Limiting**: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).

### 3. Searching and Retrieving Data

**Browser-Based Search:**
- Free text search across all fields
- Sequence similarity search (BLAST integration)
- Cross-reference search to find related records
- Advanced search with Rulespace query builder

**Programmatic Queries:**
- Use Portal API for advanced searches at scale
- Filter by data type, date range, taxonomy, or metadata fields
- Download results as tabulated metadata summaries or XML records

**Example API Query Pattern:**
```python
import requests

# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    "query": "study_accession=PRJEB1234",
    "format": "json",
    "limit": 100
}

response = requests.get(base_url, params=params)
response.raise_for_status()  # stop on HTTP errors before parsing the body
samples = response.json()
```

### 4. Data Retrieval Formats

**Metadata Formats:**
- XML (native ENA format)
- JSON (via Portal API)
- TSV/CSV (tabulated summaries)

**Sequence Data:**
- FASTQ (raw reads)
- BAM/CRAM (aligned reads)
- FASTA (assembled sequences)
- EMBL flat file format (annotated sequences)

**Download Methods:**
- Direct API download (small files)
- FTP for bulk data transfer
- Aspera for high-speed transfer of large datasets
- enaBrowserTools command-line utility for bulk downloads

### 5. Common Use Cases

**Retrieve raw sequencing reads by accession:**
```python
# Fetch run metadata as XML via the Browser API
# (FASTQ file URLs come from the Portal API file report)
accession = "ERR123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
```

**Search for all samples in a study:**
```python
# Use Portal API to list samples
study_id = "PRJNA123456"
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=study_accession={study_id}&format=tsv"
```

**Find assemblies for a specific organism:**
```python
# Search assemblies by taxonomy; tax_tree() takes an NCBI taxon ID, not a name
taxon_id = "562"  # Escherichia coli
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree({taxon_id})&format=json"
```

**Get taxonomic lineage:**
```python
# Query taxonomy API
taxon_id = "562"  # E. coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```

### 6. Integration with Analysis Pipelines

**Bulk Download Pattern:**
1. Search for accessions matching criteria using Portal API
2. Extract file URLs from search results
3. Download files via FTP or using enaBrowserTools
4. Process downloaded data in pipeline

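Step 2 of the pattern above can be sketched as a small helper that splits the semicolon-separated `fastq_ftp`, `fastq_md5`, and `fastq_bytes` fields of one file report row into per-file records. This is only an illustration: `RemoteFile`, `files_from_report_row`, and the example row are hypothetical names and data, and the `ftp://` prefix is added on the assumption that the API returns paths without a scheme.

```python
from typing import List, NamedTuple


class RemoteFile(NamedTuple):
    url: str
    md5: str
    size_bytes: int


def files_from_report_row(row: dict) -> List[RemoteFile]:
    """Split the semicolon-separated file fields of one run into per-file records."""
    paths = [p for p in row.get("fastq_ftp", "").split(";") if p]
    md5s = [m for m in row.get("fastq_md5", "").split(";") if m]
    sizes = [s for s in row.get("fastq_bytes", "").split(";") if s]
    if not (len(paths) == len(md5s) == len(sizes)):
        raise ValueError("fastq_ftp, fastq_md5 and fastq_bytes have different lengths")
    files = []
    for path, md5, size in zip(paths, md5s, sizes):
        # Assumption: values arrive without a scheme, so add ftp:// when it is missing.
        url = path if "://" in path else f"ftp://{path}"
        files.append(RemoteFile(url, md5, int(size)))
    return files


# Hypothetical row, shaped like one element of a JSON file report response.
example_row = {
    "run_accession": "ERR123456",
    "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_1.fastq.gz;"
                 "ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_2.fastq.gz",
    "fastq_md5": "0123456789abcdef0123456789abcdef;fedcba9876543210fedcba9876543210",
    "fastq_bytes": "1024;2048",
}
for remote in files_from_report_row(example_row):
    print(remote.url, remote.size_bytes)
```

Keeping the checksum and size next to each URL makes it straightforward to verify files after step 3.
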
**BLAST Integration:**
Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.

### 7. Best Practices

**Rate Limiting:**
- Implement exponential backoff when receiving HTTP 429 responses
- Batch requests when possible to stay within 50 req/sec limit
- Use bulk download tools for large datasets instead of iterating API calls

**Data Citation:**
- Always cite using Study/Project accessions when publishing
- Include accession numbers for specific samples, runs, or assemblies used

**API Response Handling:**
- Check HTTP status codes before processing responses
- Parse XML responses using proper XML libraries (not regex)
- Handle pagination for large result sets

**Performance:**
- Use FTP/Aspera for downloading large files (>100MB)
- Prefer TSV/JSON formats over XML when only metadata is needed
- Cache taxonomy lookups locally when processing many records

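One way to cache taxonomy lookups is to wrap the fetch function in `functools.lru_cache`, so each taxon ID is requested at most once per process. The sketch below uses only the standard library; `fetch_taxon` and `make_cached_lookup` are hypothetical helper names, and the demonstration swaps in a stand-in fetcher so it runs without network access.

```python
import json
from functools import lru_cache
from urllib.request import urlopen

TAXONOMY_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"


def fetch_taxon(taxon_id: str) -> dict:
    """Fetch one taxonomy record over HTTP (needs network access)."""
    with urlopen(TAXONOMY_URL.format(taxon_id=taxon_id), timeout=30) as response:
        return json.load(response)


def make_cached_lookup(fetch=fetch_taxon, maxsize=4096):
    """Wrap a fetch function so repeated taxon IDs are served from memory."""
    return lru_cache(maxsize=maxsize)(fetch)


# Demonstration with a stand-in fetcher that records which IDs were requested.
calls = []


def fake_fetch(taxon_id):
    calls.append(taxon_id)
    return {"taxId": taxon_id}


lookup = make_cached_lookup(fake_fetch)
for tax_id in ["562", "9606", "562", "562"]:
    lookup(tax_id)
print(len(calls))  # 2: each distinct ID was fetched once
```

Note that the cached dictionaries are shared between callers, so treat them as read-only.
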
## Resources

This skill includes detailed reference documentation for working with ENA:

### references/

**api_reference.md** - Comprehensive API endpoint documentation including:
- Detailed parameters for Portal API and Browser API
- Response format specifications
- Advanced query syntax and operators
- Field names for filtering and searching
- Common API patterns and examples

Load this reference when constructing complex API queries, debugging API responses, or needing specific parameter details.

490
skills/ena-database/references/api_reference.md
Normal file
@@ -0,0 +1,490 @@
# ENA API Reference

Comprehensive reference for the European Nucleotide Archive REST APIs.

## ENA Portal API

**Base URL:** `https://www.ebi.ac.uk/ena/portal/api`

**Official Documentation:** https://www.ebi.ac.uk/ena/portal/api/doc

### Search Endpoint

**Endpoint:** `/search`

**Method:** GET

**Description:** Perform advanced searches across ENA data types with flexible filtering and formatting options.

**Parameters:**

| Parameter | Required | Description | Example |
|-----------|----------|-------------|---------|
| `result` | Yes | Data type to search | `sample`, `study`, `read_run`, `assembly`, `sequence`, `analysis`, `taxon` |
| `query` | Yes | Search query using ENA query syntax | `tax_eq(9606)`, `study_accession="PRJNA123456"` |
| `format` | No | Output format (default: tsv) | `json`, `tsv`, `xml` |
| `fields` | No | Comma-separated list of fields to return | `accession,sample_title,scientific_name` |
| `limit` | No | Maximum number of results (default: 100000) | `10`, `1000` |
| `offset` | No | Result offset for pagination | `0`, `100` |
| `sortFields` | No | Fields to sort by (comma-separated) | `accession`, `collection_date` |
| `sortOrder` | No | Sort direction | `asc`, `desc` |
| `dataPortal` | No | Restrict to specific data portal | `ena`, `pathogen`, `metagenome` |
| `download` | No | Trigger file download | `true`, `false` |
| `includeAccessions` | No | Comma-separated accessions to include | `SAMN01,SAMN02` |
| `excludeAccessions` | No | Comma-separated accessions to exclude | `SAMN03,SAMN04` |

**Query Syntax:**

ENA uses a specialized query language with operators:

- **Equality:** `field_name="value"` or `field_name=value`
- **Wildcards:** `field_name="*partial*"` (use * for wildcard)
- **Range:** `field_name>=value AND field_name<=value`
- **Logical:** `query1 AND query2`, `query1 OR query2`, `NOT query`
- **Taxonomy:** `tax_eq(taxon_id)` - exact match, `tax_tree(taxon_id)` - includes descendants
- **Date ranges:** `collection_date>=2020-01-01 AND collection_date<=2023-12-31`
- **In operator:** `study_accession IN (PRJNA1,PRJNA2,PRJNA3)`

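Composing query strings from small helpers keeps quoting and `AND` joins consistent. The following is a minimal sketch covering only the equality, taxonomy, range, and `AND` forms listed above; the helper names (`eq`, `tax_tree`, `date_range`, `all_of`) are illustrative and not part of any ENA client library.

```python
def eq(field: str, value: str) -> str:
    """Equality clause with the value in double quotes."""
    return f'{field}="{value}"'


def tax_tree(taxon_id: int) -> str:
    """Taxonomy clause that matches the taxon and its descendants."""
    return f"tax_tree({taxon_id})"


def date_range(field: str, start: str, end: str) -> str:
    """Inclusive range clause built from two comparisons."""
    return f"{field}>={start} AND {field}<={end}"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


query = all_of(
    tax_tree(562),
    eq("library_strategy", "RNA-Seq"),
    date_range("collection_date", "2020-01-01", "2023-12-31"),
)
print(query)
```

The resulting string can be passed as the `query` parameter of the search endpoint; URL encoding is left to the HTTP client.
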
**Common Result Types:**

- `study` - Research projects/studies
- `sample` - Biological samples
- `read_run` - Raw sequencing runs
- `read_experiment` - Sequencing experiment metadata
- `analysis` - Analysis results
- `assembly` - Genome/transcriptome assemblies
- `sequence` - Assembled sequences
- `taxon` - Taxonomic records
- `coding` - Protein coding sequences
- `noncoding` - Non-coding sequences

**Example Requests:**

```python
import requests

# Search for human samples
url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    "query": "tax_eq(9606)",
    "format": "json",
    "fields": "accession,sample_title,collection_date",
    "limit": 100
}
response = requests.get(url, params=params)

# Search for RNA-seq experiments in a study
params = {
    "result": "read_experiment",
    "query": 'study_accession="PRJNA123456" AND library_strategy="RNA-Seq"',
    "format": "tsv"
}
response = requests.get(url, params=params)

# Find assemblies for E. coli with minimum contig N50
params = {
    "result": "assembly",
    "query": "tax_tree(562) AND contig_n50>=50000",
    "format": "json"
}
response = requests.get(url, params=params)
```

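Responses requested with `format=tsv` arrive as tab-separated text with a header row. A minimal sketch of turning that text into dictionaries with the standard `csv` module follows; `parse_tsv` is a hypothetical helper and the sample text is made up to stand in for `response.text`.

```python
import csv
import io


def parse_tsv(text: str) -> list:
    """Parse a TSV response body into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Made-up response body with two hypothetical experiment rows.
sample_tsv = (
    "experiment_accession\tlibrary_strategy\n"
    "ERX000001\tRNA-Seq\n"
    "ERX000002\tRNA-Seq\n"
)
rows = parse_tsv(sample_tsv)
print(rows[0]["experiment_accession"])
```
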
### Fields Endpoint

**Endpoint:** `/returnFields`

**Method:** GET

**Description:** List available fields for a specific result type.

**Parameters:**

| Parameter | Required | Description | Example |
|-----------|----------|-------------|---------|
| `result` | Yes | Data type | `sample`, `study`, `assembly` |
| `dataPortal` | No | Filter by data portal | `ena`, `pathogen` |

**Example:**

```python
# Get all available fields for samples
url = "https://www.ebi.ac.uk/ena/portal/api/returnFields"
# Request JSON explicitly so that response.json() can parse the body
params = {"result": "sample", "format": "json"}
response = requests.get(url, params=params)
fields = response.json()
```

### Results Endpoint

**Endpoint:** `/results`

**Method:** GET

**Description:** List available result types.

**Example:**

```python
url = "https://www.ebi.ac.uk/ena/portal/api/results"
response = requests.get(url)
```

### File Report Endpoint

**Endpoint:** `/filereport`

**Method:** GET

**Description:** Get file information and download URLs for reads and analyses.

**Parameters:**

| Parameter | Required | Description | Example |
|-----------|----------|-------------|---------|
| `accession` | Yes | Run or analysis accession | `ERR123456` |
| `result` | Yes | Must be `read_run` or `analysis` | `read_run` |
| `format` | No | Output format | `json`, `tsv` |
| `fields` | No | Fields to include | `run_accession,fastq_ftp,fastq_md5` |

**Common File Report Fields:**

- `run_accession` - Run accession number
- `fastq_ftp` - FTP URLs for FASTQ files (semicolon-separated)
- `fastq_aspera` - Aspera URLs for FASTQ files
- `fastq_md5` - MD5 checksums (semicolon-separated)
- `fastq_bytes` - File sizes in bytes (semicolon-separated)
- `submitted_ftp` - FTP URLs for originally submitted files
- `sra_ftp` - FTP URL for SRA format file

**Example:**

```python
# Get FASTQ download URLs for a run
url = "https://www.ebi.ac.uk/ena/portal/api/filereport"
params = {
    "accession": "ERR123456",
    "result": "read_run",
    "format": "json",
    "fields": "run_accession,fastq_ftp,fastq_md5,fastq_bytes"
}
response = requests.get(url, params=params)
file_info = response.json()

# Download FASTQ files
for ftp_url in file_info[0]['fastq_ftp'].split(';'):
    # Download from ftp://ftp.sra.ebi.ac.uk/...
    pass
```

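The `fastq_md5` and `fastq_bytes` fields make it possible to check a file after download. The sketch below compares a local file against the expected checksum and size using only the standard library; `md5_of_file` and `verify_download` are hypothetical helper names, and the demonstration writes a small temporary file instead of downloading anything.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5, expected_bytes=None):
    """Return True when the file matches the reported size (if given) and MD5."""
    if expected_bytes is not None and os.path.getsize(path) != int(expected_bytes):
        return False
    return md5_of_file(path) == expected_md5.lower()


# Demonstration on a temporary file standing in for a downloaded FASTQ.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq")
    payload = b"@read1\nACGT\n+\nIIII\n"
    with open(path, "wb") as handle:
        handle.write(payload)
    expected = hashlib.md5(payload).hexdigest()
    print(verify_download(path, expected, len(payload)))  # True
```

Checking the size first is cheap and catches truncated transfers before the full file is hashed.
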
## ENA Browser API

**Base URL:** `https://www.ebi.ac.uk/ena/browser/api`

**Official Documentation:** https://www.ebi.ac.uk/ena/browser/api/doc

### XML Retrieval

**Endpoint:** `/xml/{accession}`

**Method:** GET

**Description:** Retrieve record metadata in XML format.

**Parameters:**

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| `accession` | Path | Record accession number | `PRJNA123456`, `SAMEA123456`, `ERR123456` |
| `download` | Query | Set to `true` to trigger download | `true` |
| `includeLinks` | Query | Include cross-reference links | `true`, `false` |

**Example:**

```python
# Get sample metadata in XML
accession = "SAMEA123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
response = requests.get(url)
xml_data = response.text

# Get study with cross-references
url = "https://www.ebi.ac.uk/ena/browser/api/xml/PRJNA123456"
params = {"includeLinks": "true"}
response = requests.get(url, params=params)
```

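XML returned by this endpoint is best read with an XML parser rather than string matching. The sketch below uses `xml.etree.ElementTree` on an abbreviated, hand-written document; the element names (`SAMPLE_SET`, `SAMPLE`, `SAMPLE_ATTRIBUTE`, and so on) are assumptions about the layout of a sample record and should be checked against a real response, and `summarize_samples` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written document; the real layout may differ.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="SAMEA123456" alias="example-sample">
    <TITLE>Example sample</TITLE>
    <SAMPLE_NAME>
      <TAXON_ID>562</TAXON_ID>
      <SCIENTIFIC_NAME>Escherichia coli</SCIENTIFIC_NAME>
    </SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>collection_date</TAG><VALUE>2020-01-01</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def summarize_samples(xml_text):
    """Extract accession, title, taxon ID and attributes for each SAMPLE element."""
    root = ET.fromstring(xml_text)
    summaries = []
    for sample in root.iter("SAMPLE"):
        attributes = {
            attr.findtext("TAG"): attr.findtext("VALUE")
            for attr in sample.iter("SAMPLE_ATTRIBUTE")
        }
        summaries.append({
            "accession": sample.get("accession"),
            "title": sample.findtext("TITLE"),
            "taxon_id": sample.findtext("SAMPLE_NAME/TAXON_ID"),
            "attributes": attributes,
        })
    return summaries


for summary in summarize_samples(SAMPLE_XML):
    print(summary["accession"], summary["taxon_id"])
```
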
### Text Retrieval

**Endpoint:** `/text/{accession}`

**Method:** GET

**Description:** Retrieve sequences in EMBL flat file format.

**Parameters:**

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| `accession` | Path | Sequence accession | `LN847353` |
| `download` | Query | Trigger download | `true` |
| `expandDataclasses` | Query | Include related data classes | `true` |
| `lineLimit` | Query | Limit output lines | `1000` |

**Example:**

```python
# Get sequence in EMBL format
url = "https://www.ebi.ac.uk/ena/browser/api/text/LN847353"
response = requests.get(url)
embl_format = response.text
```

### FASTA Retrieval

**Endpoint:** `/fasta/{accession}`

**Method:** GET

**Description:** Retrieve sequences in FASTA format.

**Parameters:**

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| `accession` | Path | Sequence accession | `LN847353` |
| `download` | Query | Trigger download | `true` |
| `range` | Query | Subsequence range | `100-500` |
| `lineLimit` | Query | Limit output lines | `1000` |

**Example:**

```python
# Get full sequence
url = "https://www.ebi.ac.uk/ena/browser/api/fasta/LN847353"
response = requests.get(url)
fasta_data = response.text

# Get subsequence
url = "https://www.ebi.ac.uk/ena/browser/api/fasta/LN847353"
params = {"range": "1000-2000"}
response = requests.get(url, params=params)
```

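The FASTA text returned here can be split into records with a few lines of plain Python when a full parser such as Biopython is not needed. This is a minimal sketch; `parse_fasta` is a hypothetical helper and the example header is invented rather than copied from a real response.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented example standing in for response.text.
example = ">ENA|LN847353|LN847353.1 hypothetical description\nACGTACGT\nTTGA\n"
for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```
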
### Links Retrieval

**Endpoint:** `/links/{source}/{accession}`

**Method:** GET

**Description:** Get cross-references to external databases.

**Parameters:**

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| `source` | Path | Source database type | `sample`, `study`, `sequence` |
| `accession` | Path | Accession number | `SAMEA123456` |
| `target` | Query | Target database filter | `sra`, `biosample` |

**Example:**

```python
# Get all links for a sample
url = "https://www.ebi.ac.uk/ena/browser/api/links/sample/SAMEA123456"
response = requests.get(url)
```

## ENA Taxonomy REST API

**Base URL:** `https://www.ebi.ac.uk/ena/taxonomy/rest`

**Description:** Query taxonomic information including lineage and rank.

### Tax ID Lookup

**Endpoint:** `/tax-id/{taxon_id}`

**Method:** GET

**Description:** Get taxonomic information by NCBI taxonomy ID.

**Example:**

```python
# Get E. coli taxonomy
taxon_id = "562"
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
response = requests.get(url)
taxonomy = response.json()
# Returns: taxId, scientificName, commonName, rank, lineage, etc.
```

### Scientific Name Lookup

**Endpoint:** `/scientific-name/{name}`

**Method:** GET

**Description:** Search by scientific name (may return multiple matches).

**Example:**

```python
# Search by scientific name
name = "Escherichia coli"
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/{name}"
response = requests.get(url)
```

### Suggest Names

**Endpoint:** `/suggest-for-submission/{partial_name}`

**Method:** GET

**Description:** Get taxonomy suggestions for submission (autocomplete).

**Example:**

```python
# Get suggestions
partial = "Escheri"
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/{partial}"
response = requests.get(url)
```

## Cross-Reference Service

**Base URL:** `https://www.ebi.ac.uk/ena/xref/rest`

**Description:** Access records related to ENA entries in external databases.

### Get Cross-References

**Endpoint:** `/json/{source}/{accession}`

**Method:** GET

**Description:** Retrieve cross-references in JSON format.

**Parameters:**

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| `source` | Path | Source database | `ena`, `sra` |
| `accession` | Path | Accession number | `SRR000001` |

**Example:**

```python
# Get cross-references for an SRA accession
url = "https://www.ebi.ac.uk/ena/xref/rest/json/sra/SRR000001"
response = requests.get(url)
xrefs = response.json()
```

## CRAM Reference Registry

**Base URL:** `https://www.ebi.ac.uk/ena/cram`

**Description:** Retrieve reference sequences used in CRAM files.

### MD5 Lookup

**Endpoint:** `/md5/{md5_checksum}`

**Method:** GET

**Description:** Retrieve reference sequence by MD5 checksum.

**Example:**

```python
# Get reference by MD5
md5 = "7c3f69f0c5f0f0de6d7c34e7c2e25f5c"
url = f"https://www.ebi.ac.uk/ena/cram/md5/{md5}"
response = requests.get(url)
reference_fasta = response.text
```

## Rate Limiting and Error Handling

**Rate Limits:**
- Maximum: 50 requests per second
- Exceeding limit returns HTTP 429 (Too Many Requests)
- Implement exponential backoff when receiving 429 responses

**Common HTTP Status Codes:**

- `200 OK` - Success
- `204 No Content` - Success but no data returned
- `400 Bad Request` - Invalid parameters
- `404 Not Found` - Accession not found
- `429 Too Many Requests` - Rate limit exceeded
- `500 Internal Server Error` - Server error (retry with backoff)

**Error Handling Pattern:**

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_session_with_retries():
    """Create a requests session that retries rate-limited and failed requests."""
    session = requests.Session()
    retries = Retry(
        total=5,
        backoff_factor=1,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("https://", adapter)
    return session


# Usage (url and params as in the endpoint examples above)
session = create_session_with_retries()
response = session.get(url, params=params)
```

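Retries handle a 429 after the fact; spacing requests on the client side helps avoid it in the first place. The sketch below is one possible throttle that enforces a minimum interval between calls. The `Throttle` class is hypothetical, and the rate of 40 per second is an arbitrary choice that leaves some headroom under the documented limit.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


throttle = Throttle(max_per_second=40)
for accession in ["ERR001", "ERR002", "ERR003"]:
    throttle.wait()  # call this immediately before each HTTP request
    print("would request", accession)
```

The clock and sleep functions are injectable so the timing logic can be exercised in tests without real delays.
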
## Bulk Download Recommendations

For downloading large numbers of files or large datasets:

1. **Use FTP directly** instead of API for file downloads
   - Base FTP: `ftp://ftp.sra.ebi.ac.uk/vol1/fastq/`
   - Aspera for high-speed: `era-fasp@fasp.sra.ebi.ac.uk:`

2. **Use enaBrowserTools** command-line utility
   ```bash
   # Download by accession
   enaDataGet ERR123456

   # Download all runs from a study
   enaGroupGet PRJEB1234
   ```

3. **Batch API requests** with proper delays
   ```python
   import time

   import requests

   base_url = "https://www.ebi.ac.uk/ena/browser/api"
   accessions = ["ERR001", "ERR002", "ERR003"]
   for acc in accessions:
       response = requests.get(f"{base_url}/xml/{acc}")
       # Process response
       time.sleep(0.02)  # 50 req/sec = 0.02s between requests
   ```

## Query Optimization Tips

1. **Use specific result types** instead of broad searches
2. **Limit fields** to only what you need using `fields` parameter
3. **Use pagination** for large result sets (limit + offset)
4. **Cache taxonomy lookups** locally
5. **Prefer JSON/TSV** over XML when possible (smaller, faster)
6. **Use includeAccessions/excludeAccessions** to filter large result sets efficiently
7. **Batch similar queries** together when possible

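Tip 3 can be sketched as a generator that walks a result set page by page using `limit` and `offset`. This is an illustration under stated assumptions: `iter_pages` and `make_portal_fetcher` are hypothetical helpers built on the standard library, an empty body (for example HTTP 204) is treated as the end of the results, and the demonstration uses a stand-in fetcher so it runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def iter_pages(fetch_page, page_size=1000):
    """Yield rows from successive pages until a short or empty page is returned."""
    offset = 0
    while True:
        rows = fetch_page(limit=page_size, offset=offset)
        if not rows:
            break
        yield from rows
        if len(rows) < page_size:
            break
        offset += page_size


def make_portal_fetcher(result, query, fields):
    """Build a fetch_page function for the search endpoint (needs network access)."""
    def fetch_page(limit, offset):
        params = urlencode({
            "result": result, "query": query, "fields": fields,
            "format": "json", "limit": limit, "offset": offset,
        })
        with urlopen(f"{SEARCH_URL}?{params}", timeout=60) as response:
            body = response.read()
        return json.loads(body) if body else []
    return fetch_page


# Offline demonstration with a stand-in fetcher over 25 made-up records.
records = [{"accession": f"SAMEA{i}"} for i in range(25)]


def fake_fetch(limit, offset):
    return records[offset:offset + limit]


print(sum(1 for _ in iter_pages(fake_fetch, page_size=10)))  # 25
```

For real queries, pass `make_portal_fetcher("sample", "tax_eq(9606)", "accession")` as `fetch_page`, ideally combined with a sort field so page boundaries stay stable between requests.
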