Initial commit
198
skills/ena-database/SKILL.md
Normal file
@@ -0,0 +1,198 @@
---
name: ena-database
description: "Access European Nucleotide Archive via API/FTP. Retrieve DNA/RNA sequences, raw reads (FASTQ), genome assemblies by accession, for genomics and bioinformatics pipelines. Supports multiple formats."
---

# ENA Database

## Overview

The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.

## When to Use This Skill

This skill should be used when:

- Retrieving nucleotide sequences or raw sequencing reads by accession
- Searching for samples, studies, or assemblies by metadata criteria
- Downloading FASTQ files or genome assemblies for analysis
- Querying taxonomic information for organisms
- Accessing sequence annotations and functional data
- Integrating ENA data into bioinformatics pipelines
- Performing cross-reference searches to related databases
- Bulk downloading datasets via FTP or Aspera

## Core Capabilities

### 1. Data Types and Structure

ENA organizes data into hierarchical object types:

**Studies/Projects** - Group related data and control release dates. Studies are the primary unit for citing archived data.

**Samples** - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.

**Raw Reads** - Consist of:
- **Experiments**: Metadata about sequencing methods, library preparation, and instrument details
- **Runs**: References to data files containing raw sequencing reads from a single sequencing run

**Assemblies** - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.

**Sequences** - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.

**Analyses** - Results from computational analyses of sequence data.

**Taxonomy Records** - Taxonomic information including lineage and rank.
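
Each object type in the hierarchy above is usually recognizable from its accession prefix. The sketch below is one way to route an accession to the right object type before choosing an API call. The patterns follow the commonly documented INSDC/ENA accession formats but are not exhaustive, and `classify_accession` is a hypothetical helper, not part of any ENA library.

```python
import re

# Assumed accession patterns (commonly documented formats, not exhaustive).
ACCESSION_PATTERNS = {
    "project": re.compile(r"^PRJ[EDN][A-Z]\d+$"),
    "study": re.compile(r"^[EDS]RP\d{6,}$"),
    "biosample": re.compile(r"^SAM[EDN][A-Z]?\d+$"),
    "sample": re.compile(r"^[EDS]RS\d{6,}$"),
    "experiment": re.compile(r"^[EDS]RX\d{6,}$"),
    "run": re.compile(r"^[EDS]RR\d{6,}$"),
    "analysis": re.compile(r"^[EDS]RZ\d{6,}$"),
    "assembly": re.compile(r"^GCA_\d{9}(\.\d+)?$"),
}


def classify_accession(accession: str) -> str:
    """Return the ENA object type suggested by the accession format, or 'unknown'."""
    normalized = accession.strip().upper()
    for object_type, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(normalized):
            return object_type
    return "unknown"


if __name__ == "__main__":
    for acc in ("PRJEB1234", "ERR123456", "GCA_000005845.2"):
        print(acc, classify_accession(acc))
```

Sequence accessions (for example, records in the EMBL Nucleotide Sequence Database) use several other formats and fall through to `"unknown"` in this sketch.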

### 2. Programmatic Access

ENA provides multiple REST APIs for data access. Consult `references/api_reference.md` for detailed endpoint documentation.

**Key APIs:**

**ENA Portal API** - Advanced search functionality across all ENA data types
- Documentation: https://www.ebi.ac.uk/ena/portal/api/doc
- Use for complex queries and metadata searches

**ENA Browser API** - Direct retrieval of records and metadata
- Documentation: https://www.ebi.ac.uk/ena/browser/api/doc
- Use for downloading specific records by accession
- Returns metadata records as XML; sequence records can also be retrieved as FASTA or EMBL flat files

**ENA Taxonomy REST API** - Query taxonomic information
- Access lineage, rank, and related taxonomic data

**ENA Cross Reference Service** - Access related records from external databases
- Endpoint: https://www.ebi.ac.uk/ena/xref/rest/

**CRAM Reference Registry** - Retrieve reference sequences
- Endpoint: https://www.ebi.ac.uk/ena/cram/
- Query by MD5 or SHA1 checksums
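
As an illustration of the checksum lookup used by the CRAM Reference Registry, the sketch below computes the MD5 of a reference sequence and builds a lookup URL. It assumes the `/md5/<checksum>` and `/sha1/<checksum>` path layout under the endpoint above and the CRAM convention of hashing the uppercased sequence with whitespace removed. `reference_md5` and `cram_reference_url` are hypothetical helper names.

```python
import hashlib

CRAM_REGISTRY = "https://www.ebi.ac.uk/ena/cram"


def reference_md5(sequence: str) -> str:
    """MD5 of a sequence after stripping whitespace and uppercasing (CRAM convention)."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def cram_reference_url(checksum: str, algorithm: str = "md5") -> str:
    """Build a registry lookup URL; the path layout is an assumption."""
    if algorithm not in ("md5", "sha1"):
        raise ValueError("algorithm must be 'md5' or 'sha1'")
    return f"{CRAM_REGISTRY}/{algorithm}/{checksum}"


if __name__ == "__main__":
    digest = reference_md5("acgt\nACGT\n")
    print(cram_reference_url(digest))
```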

**Rate Limiting**: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).

### 3. Searching and Retrieving Data

**Browser-Based Search:**
- Free text search across all fields
- Sequence similarity search (BLAST integration)
- Cross-reference search to find related records
- Advanced search with Rulespace query builder

**Programmatic Queries:**
- Use Portal API for advanced searches at scale
- Filter by data type, date range, taxonomy, or metadata fields
- Download results as tabulated metadata summaries or XML records

**Example API Query Pattern:**
```python
import requests

# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    # Text values in Portal API queries are wrapped in double quotes
    "query": 'study_accession="PRJEB1234"',
    "format": "json",
    "limit": 100,
}

response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()  # stop on HTTP errors such as 429 before parsing
samples = response.json()
```
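
Filters on taxonomy and metadata fields can be combined into one query string. The following is a minimal sketch of a query builder, assuming the Portal API's `tax_tree(<taxon ID>)` function, `AND` operator, and double-quoted text values. `build_query` and `search_url` are hypothetical helpers, and the field names in the usage example (`instrument_platform`, `run_accession`, `fastq_ftp`) are assumed to be valid for the `read_run` result type.

```python
from urllib.parse import urlencode

PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_query(tax_id=None, **field_filters):
    """Join a taxonomy filter and exact-match field filters with AND."""
    clauses = []
    if tax_id is not None:
        clauses.append(f"tax_tree({int(tax_id)})")
    for field, value in field_filters.items():
        clauses.append(f'{field}="{value}"')
    return " AND ".join(clauses)


def search_url(result, query, fields=None, fmt="json", limit=100):
    """Build a URL-encoded Portal API search URL."""
    params = {"result": result, "query": query, "format": fmt, "limit": limit}
    if fields:
        params["fields"] = ",".join(fields)
    return f"{PORTAL_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_query(tax_id=562, instrument_platform="ILLUMINA")
    print(query)
    print(search_url("read_run", query, fields=["run_accession", "fastq_ftp"]))
```

Building the URL with `urlencode` keeps quotes, spaces, and parentheses in the query correctly escaped.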

### 4. Data Retrieval Formats

**Metadata Formats:**
- XML (native ENA format)
- JSON (via Portal API)
- TSV/CSV (tabulated summaries)

**Sequence Data:**
- FASTQ (raw reads)
- BAM/CRAM (aligned reads)
- FASTA (assembled sequences)
- EMBL flat file format (annotated sequences)

**Download Methods:**
- Direct API download (small files)
- FTP for bulk data transfer
- Aspera for high-speed transfer of large datasets
- enaBrowserTools command-line utility for bulk downloads
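
For direct API download of a single small record, the format can be selected through the Browser API path. This sketch assumes the `/xml/`, `/fasta/`, and `/embl/` path segments of the Browser API; `record_url` is a hypothetical helper, and the accessions are illustrative placeholders.

```python
BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Formats assumed to be exposed as path segments by the Browser API.
RECORD_FORMATS = ("xml", "fasta", "embl")


def record_url(accession: str, fmt: str = "xml") -> str:
    """Build a Browser API URL for one record in the requested format."""
    if fmt not in RECORD_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {RECORD_FORMATS}")
    return f"{BROWSER_API}/{fmt}/{accession.strip()}"


if __name__ == "__main__":
    print(record_url("ERR123456"))        # run metadata as XML
    print(record_url("M10051", "fasta"))  # sequence record as FASTA
```

The resulting URL can be passed to `requests.get(...)` as in the other examples; large files are better fetched over FTP or Aspera.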

### 5. Common Use Cases

**Retrieve raw sequencing reads by accession:**
```python
# Fetch run metadata (XML) using Browser API
accession = "ERR123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
# FASTQ file locations are listed by the Portal API file report
files_url = f"https://www.ebi.ac.uk/ena/portal/api/filereport?accession={accession}&result=read_run&fields=fastq_ftp"
```

**Search for all samples in a study:**
```python
# Use Portal API to list samples (text values are double-quoted in queries)
study_id = "PRJNA123456"
url = f'https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=study_accession="{study_id}"&format=tsv'
```

**Find assemblies for a specific organism:**
```python
# Search assemblies by taxonomy; tax_tree() takes an NCBI taxon ID and includes descendant taxa
tax_id = 562  # Escherichia coli
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree({tax_id})&format=json"
```

**Get taxonomic lineage:**
```python
# Query taxonomy API
taxon_id = "562"  # E. coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```

### 6. Integration with Analysis Pipelines

**Bulk Download Pattern:**
1. Search for accessions matching criteria using Portal API
2. Extract file URLs from search results
3. Download files via FTP or using enaBrowserTools
4. Process downloaded data in pipeline
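
Step 2 of the pattern above can be sketched as follows. The example assumes the Portal API `filereport` endpoint returns TSV with a `fastq_ftp` column holding semicolon-separated paths without a URL scheme. `filereport_url` and `extract_fastq_urls` are hypothetical helpers, and the sample TSV is made up for illustration.

```python
import csv
import io

FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build a file report URL for a run, experiment, sample, or study accession."""
    return (
        f"{FILEREPORT}?accession={accession}"
        f"&result=read_run&fields={','.join(fields)}&format=tsv"
    )


def extract_fastq_urls(tsv_text, scheme="ftp"):
    """Map each run accession to its list of FASTQ URLs."""
    urls = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        urls[row["run_accession"]] = [f"{scheme}://{p}" for p in paths]
    return urls


if __name__ == "__main__":
    # Illustrative response body, not real data.
    sample_tsv = (
        "run_accession\tfastq_ftp\n"
        "ERR123456\tftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_1.fastq.gz;"
        "ftp.sra.ebi.ac.uk/vol1/fastq/ERR123/ERR123456/ERR123456_2.fastq.gz\n"
    )
    print(filereport_url("PRJEB1234"))
    print(extract_fastq_urls(sample_tsv))
```

The resulting URLs can then be handed to an FTP client or a download tool for step 3.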

**BLAST Integration:**
Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.

### 7. Best Practices

**Rate Limiting:**
- Implement exponential backoff when receiving HTTP 429 responses
- Batch requests when possible to stay within 50 req/sec limit
- Use bulk download tools for large datasets instead of iterating API calls
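
One way to implement the backoff recommendation is sketched below. `get_with_backoff` is a hypothetical helper: it takes any zero-argument callable that returns an object with a `status_code` attribute (such as a `requests` response), and the retry count and delays are illustrative defaults rather than values recommended by ENA.

```python
import random
import time


def get_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on HTTP 429 with exponentially growing delays.

    The delay doubles on each attempt and includes random jitter so that
    parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries")


# Usage with requests (assumed installed):
#   import requests
#   response = get_with_backoff(
#       lambda: requests.get(url, params=params, timeout=30)
#   )
```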

**Data Citation:**
- Always cite using Study/Project accessions when publishing
- Include accession numbers for specific samples, runs, or assemblies used

**API Response Handling:**
- Check HTTP status codes before processing responses
- Parse XML responses using proper XML libraries (not regex)
- Handle pagination for large result sets
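
Pagination can be handled with a small generator. This sketch assumes the Portal API search endpoint accepts `limit` and `offset` parameters; `iter_records` is a hypothetical helper that works with any `fetch_page(limit, offset)` callable returning a list of records.

```python
def iter_records(fetch_page, page_size=1000):
    """Yield records page by page until a short or empty page is returned."""
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return
        offset += page_size


# Usage with requests (assumed installed); base_url and params as in the
# search example, with "format": "json":
#
#   def fetch_page(limit, offset):
#       response = requests.get(
#           base_url,
#           params={**params, "limit": limit, "offset": offset},
#           timeout=30,
#       )
#       response.raise_for_status()
#       return response.json()
#
#   for record in iter_records(fetch_page):
#       ...
```

Stopping on a short page avoids one extra request when the final page is only partly filled.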

**Performance:**
- Use FTP/Aspera for downloading large files (>100MB)
- Prefer TSV/JSON formats over XML when only metadata is needed
- Cache taxonomy lookups locally when processing many records
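
A simple in-memory cache is often enough for taxonomy lookups within one pipeline run. The sketch below wraps a fetch function in `functools.lru_cache`. `fetch_taxon` and `make_cached_lookup` are hypothetical helpers, and the assumption that the taxonomy endpoint returns a JSON object is not verified here.

```python
import json
from functools import lru_cache
from urllib.request import urlopen

TAXONOMY_API = "https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id"


def fetch_taxon(tax_id):
    """Fetch one taxonomy record (assumed to be returned as a JSON object)."""
    with urlopen(f"{TAXONOMY_API}/{tax_id}", timeout=30) as response:
        return json.load(response)


def make_cached_lookup(fetch=fetch_taxon, maxsize=4096):
    """Wrap a fetch function so repeated taxon IDs are served from memory.

    The cache returns the same object on every hit, so callers should treat
    cached records as read-only.
    """
    return lru_cache(maxsize=maxsize)(fetch)


# Usage:
#   lookup = make_cached_lookup()
#   record = lookup(562)   # network request
#   record = lookup(562)   # served from the cache
```

For caching across runs, the same wrapper idea applies with a file or database store in place of `lru_cache`.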

## Resources

This skill includes detailed reference documentation for working with ENA:

### references/

**api_reference.md** - Comprehensive API endpoint documentation including:
- Detailed parameters for Portal API and Browser API
- Response format specifications
- Advanced query syntax and operators
- Field names for filtering and searching
- Common API patterns and examples

Load this reference when constructing complex API queries, debugging API responses, or needing specific parameter details.