Files
2025-11-30 08:30:10 +08:00

199 lines
6.8 KiB
Markdown

---
name: ena-database
description: "Access European Nucleotide Archive via API/FTP. Retrieve DNA/RNA sequences, raw reads (FASTQ), genome assemblies by accession, for genomics and bioinformatics pipelines. Supports multiple formats."
---
# ENA Database
## Overview
The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.
## When to Use This Skill
This skill should be used when:
- Retrieving nucleotide sequences or raw sequencing reads by accession
- Searching for samples, studies, or assemblies by metadata criteria
- Downloading FASTQ files or genome assemblies for analysis
- Querying taxonomic information for organisms
- Accessing sequence annotations and functional data
- Integrating ENA data into bioinformatics pipelines
- Performing cross-reference searches to related databases
- Bulk downloading datasets via FTP or Aspera
## Core Capabilities
### 1. Data Types and Structure
ENA organizes data into hierarchical object types:
**Studies/Projects** - Group related data and control release dates. Studies are the primary unit for citing archived data.
**Samples** - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.
**Raw Reads** - Consist of:
- **Experiments**: Metadata about sequencing methods, library preparation, and instrument details
- **Runs**: References to data files containing raw sequencing reads from a single sequencing run
**Assemblies** - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.
**Sequences** - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.
**Analyses** - Results from computational analyses of sequence data.
**Taxonomy Records** - Taxonomic information including lineage and rank.
### 2. Programmatic Access
ENA provides multiple REST APIs for data access. Consult `references/api_reference.md` for detailed endpoint documentation.
**Key APIs:**
**ENA Portal API** - Advanced search functionality across all ENA data types
- Documentation: https://www.ebi.ac.uk/ena/portal/api/doc
- Use for complex queries and metadata searches
**ENA Browser API** - Direct retrieval of records and metadata
- Documentation: https://www.ebi.ac.uk/ena/browser/api/doc
- Use for downloading specific records by accession
- Returns data in XML format
**ENA Taxonomy REST API** - Query taxonomic information
- Access lineage, rank, and related taxonomic data
**ENA Cross Reference Service** - Access related records from external databases
- Endpoint: https://www.ebi.ac.uk/ena/xref/rest/
**CRAM Reference Registry** - Retrieve reference sequences
- Endpoint: https://www.ebi.ac.uk/ena/cram/
- Query by MD5 or SHA1 checksums
**Rate Limiting**: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).
### 3. Searching and Retrieving Data
**Browser-Based Search:**
- Free text search across all fields
- Sequence similarity search (BLAST integration)
- Cross-reference search to find related records
- Advanced search with Rulespace query builder
**Programmatic Queries:**
- Use Portal API for advanced searches at scale
- Filter by data type, date range, taxonomy, or metadata fields
- Download results as tabulated metadata summaries or XML records
**Example API Query Pattern:**
```python
import requests
# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
"result": "sample",
"query": "study_accession=PRJEB1234",
"format": "json",
"limit": 100
}
response = requests.get(base_url, params=params)
samples = response.json()
```
### 4. Data Retrieval Formats
**Metadata Formats:**
- XML (native ENA format)
- JSON (via Portal API)
- TSV/CSV (tabulated summaries)
**Sequence Data:**
- FASTQ (raw reads)
- BAM/CRAM (aligned reads)
- FASTA (assembled sequences)
- EMBL flat file format (annotated sequences)
**Download Methods:**
- Direct API download (small files)
- FTP for bulk data transfer
- Aspera for high-speed transfer of large datasets
- enaBrowserTools command-line utility for bulk downloads
### 5. Common Use Cases
**Retrieve raw sequencing reads by accession:**
```python
# Download run files using Browser API
accession = "ERR123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
```
**Search for all samples in a study:**
```python
# Use Portal API to list samples
study_id = "PRJNA123456"
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=study_accession={study_id}&format=tsv"
```
**Find assemblies for a specific organism:**
```python
# Search assemblies by taxonomy
organism = "Escherichia coli"
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree({organism})&format=json"
```
**Get taxonomic lineage:**
```python
# Query taxonomy API
taxon_id = "562" # E. coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```
### 6. Integration with Analysis Pipelines
**Bulk Download Pattern:**
1. Search for accessions matching criteria using Portal API
2. Extract file URLs from search results
3. Download files via FTP or using enaBrowserTools
4. Process downloaded data in pipeline
**BLAST Integration:**
Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.
### 7. Best Practices
**Rate Limiting:**
- Implement exponential backoff when receiving HTTP 429 responses
- Batch requests when possible to stay within 50 req/sec limit
- Use bulk download tools for large datasets instead of iterating API calls
**Data Citation:**
- Always cite using Study/Project accessions when publishing
- Include accession numbers for specific samples, runs, or assemblies used
**API Response Handling:**
- Check HTTP status codes before processing responses
- Parse XML responses using proper XML libraries (not regex)
- Handle pagination for large result sets
**Performance:**
- Use FTP/Aspera for downloading large files (>100MB)
- Prefer TSV/JSON formats over XML when only metadata is needed
- Cache taxonomy lookups locally when processing many records
## Resources
This skill includes detailed reference documentation for working with ENA:
### references/
**api_reference.md** - Comprehensive API endpoint documentation including:
- Detailed parameters for Portal API and Browser API
- Response format specifications
- Advanced query syntax and operators
- Field names for filtering and searching
- Common API patterns and examples
Load this reference when constructing complex API queries, debugging API responses, or needing specific parameter details.