---
name: kegg-database
description: "Direct REST API access to KEGG (academic use only). Pathway analysis, gene-pathway mapping, metabolic pathways, drug interactions, ID conversion. For Python workflows with multiple databases, prefer bioservices. Use this for direct HTTP/REST work or KEGG-specific control."
---

# KEGG Database

## Overview

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive bioinformatics resource for biological pathway analysis and molecular interaction networks.

**Important**: KEGG API is made available only for academic use by academic users.

## When to Use This Skill

Use this skill to query pathways, genes, compounds, enzymes, diseases, and drugs across multiple organisms through KEGG's REST API.

## Quick Start

The skill provides:
1. Python helper functions (`scripts/kegg_api.py`) for all KEGG REST API operations
2. Comprehensive reference documentation (`references/kegg_reference.md`) with detailed API specifications

When users request KEGG data, determine which operation is needed and use the appropriate function from `scripts/kegg_api.py`.

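
For direct HTTP work, each KEGG REST operation is addressed by a URL of the form `https://rest.kegg.jp/<operation>/<argument>/...`. The following is a minimal sketch of assembling such URLs. `build_kegg_url` is a hypothetical helper introduced here for illustration and is not necessarily how `scripts/kegg_api.py` builds its requests.

```python
from urllib.parse import quote

KEGG_BASE_URL = "https://rest.kegg.jp"


def build_kegg_url(operation, *arguments):
    """Join an operation and its arguments into a KEGG REST URL.

    Hypothetical helper for illustration; it only builds the string
    and does not send a request.
    """
    # Keep ':' and '+' literal: KEGG uses them inside IDs and to join entries.
    parts = [quote(str(argument), safe=":+") for argument in arguments]
    return "/".join([KEGG_BASE_URL, operation, *parts])


print(build_kegg_url("list", "pathway", "hsa"))
print(build_kegg_url("get", "hsa:10458+hsa:10459", "aaseq"))
```

Keeping URL construction separate from the HTTP call makes it easy to log or inspect the exact request before sending it.
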
## Core Operations

### 1. Database Information (`kegg_info`)

Retrieve metadata and statistics about KEGG databases.

**When to use**: Understanding database structure, checking available data, getting release information.

**Usage**:
```python
from scripts.kegg_api import kegg_info

# Get pathway database info
info = kegg_info('pathway')

# Get organism-specific info
hsa_info = kegg_info('hsa')  # Human genome
```

**Common databases**: `kegg`, `pathway`, `module`, `brite`, `genes`, `genome`, `compound`, `glycan`, `reaction`, `enzyme`, `disease`, `drug`

### 2. Listing Entries (`kegg_list`)

List entry identifiers and names from KEGG databases.

**When to use**: Getting all pathways for an organism, listing genes, retrieving compound catalogs.

**Usage**:
```python
from scripts.kegg_api import kegg_list

# List all reference pathways
pathways = kegg_list('pathway')

# List human-specific pathways
hsa_pathways = kegg_list('pathway', 'hsa')

# List specific genes (max 10)
genes = kegg_list('hsa:10458+hsa:10459')
```

**Common organism codes**: `hsa` (human), `mmu` (mouse), `dme` (fruit fly), `sce` (yeast), `eco` (E. coli)

### 3. Searching (`kegg_find`)

Search KEGG databases by keywords or molecular properties.

**When to use**: Finding genes by name/description, searching compounds by formula or mass, discovering entries by keywords.

**Usage**:
```python
from scripts.kegg_api import kegg_find

# Keyword search
results = kegg_find('genes', 'p53')
shiga_toxin = kegg_find('genes', 'shiga toxin')

# Chemical formula search (exact match)
compounds = kegg_find('compound', 'C7H10N4O2', 'formula')

# Exact mass range search
drugs = kegg_find('drug', '300-310', 'exact_mass')
```

**Search options**: `formula` (exact match), `exact_mass` (range), `mol_weight` (range)

### 4. Retrieving Entries (`kegg_get`)

Get complete database entries or specific data formats.

**When to use**: Retrieving pathway details, getting gene/protein sequences, downloading pathway maps, accessing compound structures.

**Usage**:
```python
from scripts.kegg_api import kegg_get

# Get pathway entry
pathway = kegg_get('hsa00010')  # Glycolysis pathway

# Get multiple entries (max 10)
genes = kegg_get(['hsa:10458', 'hsa:10459'])

# Get protein sequence (FASTA)
sequence = kegg_get('hsa:10458', 'aaseq')

# Get nucleotide sequence
nt_seq = kegg_get('hsa:10458', 'ntseq')

# Get compound structure
mol_file = kegg_get('cpd:C00002', 'mol')  # ATP in MOL format

# Get pathway as JSON (single entry only)
pathway_json = kegg_get('hsa05130', 'json')

# Get pathway image (single entry only)
pathway_img = kegg_get('hsa05130', 'image')
```

**Output formats**: `aaseq` (protein FASTA), `ntseq` (nucleotide FASTA), `mol` (MOL format), `kcf` (KCF format), `image` (PNG), `kgml` (XML), `json` (pathway JSON)

**Important**: Image, KGML, and JSON formats allow only one entry at a time.

### 5. ID Conversion (`kegg_conv`)

Convert identifiers between KEGG and external databases.

**When to use**: Integrating KEGG data with other databases, mapping gene IDs, converting compound identifiers.

**Usage**:
```python
from scripts.kegg_api import kegg_conv

# Convert all human genes to NCBI Gene IDs
conversions = kegg_conv('ncbi-geneid', 'hsa')

# Convert specific gene
gene_id = kegg_conv('ncbi-geneid', 'hsa:10458')

# Convert to UniProt
uniprot_id = kegg_conv('uniprot', 'hsa:10458')

# Convert compounds to PubChem
pubchem_ids = kegg_conv('pubchem', 'compound')

# Reverse conversion (NCBI Gene ID to KEGG)
kegg_id = kegg_conv('hsa', 'ncbi-geneid')
```

**Supported conversions**: `ncbi-geneid`, `ncbi-proteinid`, `uniprot`, `pubchem`, `chebi`

### 6. Cross-Referencing (`kegg_link`)

Find related entries within and between KEGG databases.

**When to use**: Finding pathways containing genes, getting genes in a pathway, mapping genes to KO groups, finding compounds in pathways.

**Usage**:
```python
from scripts.kegg_api import kegg_link

# Find pathways linked to human genes
pathways = kegg_link('pathway', 'hsa')

# Get genes in a specific pathway
genes = kegg_link('genes', 'hsa00010')  # Glycolysis genes

# Find pathways containing a specific gene
gene_pathways = kegg_link('pathway', 'hsa:10458')

# Find compounds in a pathway
compounds = kegg_link('compound', 'hsa00010')

# Map genes to KO (orthology) groups
ko_groups = kegg_link('ko', 'hsa:10458')
```

**Common links**: genes ↔ pathway, pathway ↔ compound, pathway ↔ enzyme, genes ↔ ko (orthology)

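
The workflows in this document treat `kegg_link` and `kegg_conv` results as plain text with one tab-separated pair per line. Under that assumption, a small parser can group the pairs into a dictionary. `parse_kegg_pairs` is a hypothetical helper, and the sample string is illustrative data in the assumed format rather than captured API output.

```python
from collections import defaultdict


def parse_kegg_pairs(text):
    """Group two-column, tab-delimited text as {source_id: [target_ids]}.

    Hypothetical helper; assumes one 'source<TAB>target' pair per line.
    """
    mapping = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        source, separator, target = line.partition("\t")
        if not separator:
            continue  # skip lines that do not contain a tab
        mapping[source].append(target.strip())
    return dict(mapping)


# Illustrative sample in the assumed format, not real API output
sample = "hsa:7157\tpath:hsa04115\nhsa:7157\tpath:hsa05200\n"
print(parse_kegg_pairs(sample))
```

Grouping by the first column keeps one-to-many relationships, such as a gene that appears in several pathways, in a single lookup.
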
### 7. Drug-Drug Interactions (`kegg_ddi`)

Check for drug-drug interactions.

**When to use**: Analyzing drug combinations, checking for contraindications, pharmacological research.

**Usage**:
```python
from scripts.kegg_api import kegg_ddi

# Check single drug
interactions = kegg_ddi('D00001')

# Check multiple drugs (max 10)
interactions = kegg_ddi(['D00001', 'D00002', 'D00003'])
```

## Common Analysis Workflows

### Workflow 1: Gene to Pathway Mapping

**Use case**: Finding pathways associated with genes of interest (e.g., for pathway enrichment analysis).

```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get

# Step 1: Find gene ID by name
gene_results = kegg_find('genes', 'p53')

# Step 2: Link gene to pathways
pathways = kegg_link('pathway', 'hsa:7157')  # TP53 gene

# Step 3: Get detailed pathway information
for pathway_line in pathways.split('\n'):
    if pathway_line:
        # Second column holds the pathway ID, e.g. 'path:hsa04115'
        pathway_id = pathway_line.split('\t')[1].replace('path:', '')
        pathway_info = kegg_get(pathway_id)
        # Process pathway information
```

### Workflow 2: Pathway Enrichment Context

**Use case**: Getting all genes in organism pathways for enrichment analysis.

```python
from scripts.kegg_api import kegg_list, kegg_link

# Step 1: List all human pathways
pathways = kegg_list('pathway', 'hsa')

# Step 2: For each pathway, get associated genes
for pathway_line in pathways.split('\n'):
    if pathway_line:
        # First column holds the pathway ID
        pathway_id = pathway_line.split('\t')[0]
        genes = kegg_link('genes', pathway_id)
        # Process genes for enrichment analysis
```

### Workflow 3: Compound to Pathway Analysis

**Use case**: Finding metabolic pathways containing compounds of interest.

```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get

# Step 1: Search for compound
compound_results = kegg_find('compound', 'glucose')

# Step 2: Link compound to reactions
reactions = kegg_link('reaction', 'cpd:C00031')  # Glucose

# Step 3: Link reactions to pathways
pathways = kegg_link('pathway', 'rn:R00299')  # Specific reaction

# Step 4: Get pathway details
pathway_info = kegg_get('map00010')  # Glycolysis
```

### Workflow 4: Cross-Database Integration

**Use case**: Integrating KEGG data with UniProt, NCBI, or PubChem databases.

```python
from scripts.kegg_api import kegg_conv, kegg_get

# Step 1: Convert KEGG gene IDs to external database IDs
uniprot_map = kegg_conv('uniprot', 'hsa')
ncbi_map = kegg_conv('ncbi-geneid', 'hsa')

# Step 2: Parse conversion results
for line in uniprot_map.split('\n'):
    if line:
        # Column order follows the original workflow
        kegg_id, uniprot_id = line.split('\t')
        # Use external IDs for integration

# Step 3: Get sequences using KEGG
sequence = kegg_get('hsa:10458', 'aaseq')
```

### Workflow 5: Organism-Specific Pathway Analysis

**Use case**: Comparing pathways across different organisms.

```python
from scripts.kegg_api import kegg_list, kegg_get

# Step 1: List pathways for multiple organisms
human_pathways = kegg_list('pathway', 'hsa')
mouse_pathways = kegg_list('pathway', 'mmu')
yeast_pathways = kegg_list('pathway', 'sce')

# Step 2: Get reference pathway for comparison
ref_pathway = kegg_get('map00010')  # Reference glycolysis

# Step 3: Get organism-specific versions
hsa_glycolysis = kegg_get('hsa00010')
mmu_glycolysis = kegg_get('mmu00010')
```

## Pathway Categories

KEGG organizes pathways into seven major categories. Use them when interpreting pathway IDs or recommending pathways to users:

1. **Metabolism** (e.g., `map00010` - Glycolysis, `map00190` - Oxidative phosphorylation)
2. **Genetic Information Processing** (e.g., `map03010` - Ribosome, `map03040` - Spliceosome)
3. **Environmental Information Processing** (e.g., `map04010` - MAPK signaling, `map02010` - ABC transporters)
4. **Cellular Processes** (e.g., `map04140` - Autophagy, `map04210` - Apoptosis)
5. **Organismal Systems** (e.g., `map04610` - Complement cascade, `map04910` - Insulin signaling)
6. **Human Diseases** (e.g., `map05200` - Pathways in cancer, `map05010` - Alzheimer disease)
7. **Drug Development** (chronological and target-based classifications)

See `references/kegg_reference.md` for detailed pathway lists and classifications.

## Important Identifiers and Formats

### Pathway IDs
- `map#####` - Reference pathway (generic, not organism-specific)
- `hsa#####` - Human pathway
- `mmu#####` - Mouse pathway

### Gene IDs
- Format: `organism:gene_number` (e.g., `hsa:10458`)

### Compound IDs
- Format: `cpd:C#####` (e.g., `cpd:C00002` for ATP)

### Drug IDs
- Format: `dr:D#####` (e.g., `dr:D00001`)

### Enzyme IDs
- Format: `ec:EC_number` (e.g., `ec:1.1.1.1`)

### KO (KEGG Orthology) IDs
- Format: `ko:K#####` (e.g., `ko:K00001`)

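
The identifier formats listed in this section can be checked before a request is sent. The sketch below classifies an ID string with regular expressions derived from those formats. `classify_kegg_id` and its patterns are hypothetical and cover only the formats listed here. It also assumes that organism codes are three or four lowercase letters and that each `#` stands for one digit.

```python
import re

# Patterns derived from the formats listed above (illustrative, not exhaustive)
KEGG_ID_PATTERNS = [
    ("compound", re.compile(r"^cpd:C\d{5}$")),
    ("drug", re.compile(r"^dr:D\d{5}$")),
    ("ko", re.compile(r"^ko:K\d{5}$")),
    ("enzyme", re.compile(r"^ec:\d+\.\d+\.\d+\.\d+$")),
    ("pathway", re.compile(r"^[a-z]{3,4}\d{5}$")),  # map00010, hsa00010, ...
]

# Database prefixes that must not be mistaken for organism codes
DATABASE_PREFIXES = {"cpd", "dr", "ko", "ec", "path", "rn"}


def classify_kegg_id(identifier):
    """Return a coarse type name for a KEGG identifier, or 'unknown'."""
    for kind, pattern in KEGG_ID_PATTERNS:
        if pattern.match(identifier):
            return kind
    # Gene IDs look like 'organism:gene_number', e.g. 'hsa:10458'
    prefix, separator, rest = identifier.partition(":")
    is_organism = prefix.isalpha() and prefix.islower() and 3 <= len(prefix) <= 4
    if separator and rest and is_organism and prefix not in DATABASE_PREFIXES:
        return "gene"
    return "unknown"


for example in ["hsa:10458", "cpd:C00002", "map00010", "ec:1.1.1.1"]:
    print(example, "->", classify_kegg_id(example))
```

A check like this can catch malformed IDs locally instead of relying on a 400 or 404 response from the API.
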
## API Limitations

Respect these constraints when using the KEGG API:

1. **Entry limits**: Maximum 10 entries per operation (except image/kgml/json: 1 entry only)
2. **Academic use**: API is for academic use only; commercial use requires licensing
3. **HTTP status codes**: Check for 200 (success), 400 (bad request), 404 (not found)
4. **Rate limiting**: No explicit limit, but avoid rapid-fire requests

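
The entry limit and the advice against rapid-fire requests can be handled together by splitting a long ID list into batches and pausing between calls. The sketch below shows one way to do that. `batch_entries` and `fetch_in_batches` are hypothetical helpers, and the 0.5 second default pause is an arbitrary assumption, since KEGG states no explicit rate limit.

```python
import time

MAX_ENTRIES = 10  # per-request entry limit stated above


def batch_entries(entry_ids, batch_size=MAX_ENTRIES):
    """Yield '+'-joined ID strings with at most batch_size entries each."""
    for start in range(0, len(entry_ids), batch_size):
        yield "+".join(entry_ids[start:start + batch_size])


def fetch_in_batches(entry_ids, fetch, delay_seconds=0.5):
    """Call fetch(batch) once per batch, pausing between requests.

    `fetch` is any callable taking a '+'-joined ID string; the default
    delay is an assumed courtesy pause, not a documented KEGG value.
    """
    results = []
    for index, batch in enumerate(batch_entries(entry_ids)):
        if index:
            time.sleep(delay_seconds)
        results.append(fetch(batch))
    return results


# Stand-in fetch function so the example runs without network access
gene_ids = [f"hsa:{number}" for number in range(10450, 10473)]
sizes = fetch_in_batches(gene_ids, fetch=lambda batch: len(batch.split("+")), delay_seconds=0)
print(sizes)
```

In real use, `fetch` could be a wrapper around one of the helper functions, assuming that helper accepts a `+`-joined string in the way the `kegg_list` example does.
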
## Detailed Reference

For comprehensive API documentation, database specifications, organism codes, and advanced usage, refer to `references/kegg_reference.md`. This includes:

- Complete list of KEGG databases
- Detailed API operation syntax
- All organism codes
- HTTP status codes and error handling
- Integration with Biopython and R/Bioconductor
- Best practices for API usage

## Troubleshooting

- **404 Not Found**: Entry or database doesn't exist; verify IDs and organism codes
- **400 Bad Request**: Syntax error in API call; check parameter formatting
- **Empty results**: Search term may not match entries; try broader keywords
- **Image/KGML errors**: These formats only work with single entries; remove batch processing

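
For direct HTTP calls, the status codes above can be turned into actionable messages. The following is a minimal sketch using only the standard library. `describe_status` and `fetch_text` are hypothetical names, and the helper functions in `scripts/kegg_api.py` may report errors differently.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Hints paraphrased from the troubleshooting notes above
STATUS_HINTS = {
    400: "Bad request: check the operation name and parameter formatting.",
    404: "Not found: verify entry IDs, database names, and organism codes.",
}


def describe_status(status_code):
    """Map an HTTP status code to a short troubleshooting hint."""
    if status_code == 200:
        return "OK"
    return STATUS_HINTS.get(status_code, f"Unexpected HTTP status {status_code}.")


def fetch_text(url, timeout=30):
    """Return a text response body, or raise RuntimeError with a hint.

    Intended for text formats only; binary output such as images
    would need the raw bytes instead of a decoded string.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except HTTPError as error:
        raise RuntimeError(describe_status(error.code)) from error


print(describe_status(404))
```

An empty but successful response still returns status 200, so an empty-result check on the returned text remains necessary after the call.
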
## Additional Tools

For interactive pathway visualization and annotation:
- **KEGG Mapper**: https://www.kegg.jp/kegg/mapper/
- **BlastKOALA**: Automated genome annotation
- **GhostKOALA**: Metagenome/metatranscriptome annotation