gh-k-dense-ai-claude-scient…/skills/kegg-database/SKILL.md
2025-11-30 08:30:10 +08:00

---
name: kegg-database
description: "Direct REST API access to KEGG (academic use only). Pathway analysis, gene-pathway mapping, metabolic pathways, drug interactions, ID conversion. For Python workflows with multiple databases, prefer bioservices. Use this for direct HTTP/REST work or KEGG-specific control."
---
# KEGG Database
## Overview
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive bioinformatics resource for biological pathway analysis and molecular interaction networks.
**Important**: KEGG API is made available only for academic use by academic users.
## When to Use This Skill
This skill should be used when querying pathways, genes, compounds, enzymes, diseases, and drugs across multiple organisms using KEGG's REST API.
## Quick Start
The skill provides:
1. Python helper functions (`scripts/kegg_api.py`) for all KEGG REST API operations
2. Comprehensive reference documentation (`references/kegg_reference.md`) with detailed API specifications
When users request KEGG data, determine which operation is needed and use the appropriate function from `scripts/kegg_api.py`.
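Each operation maps to a plain HTTP GET of the form `https://rest.kegg.jp/<operation>/<arguments>`. The following is a minimal sketch of how such a URL might be assembled. It makes no claim about how `scripts/kegg_api.py` is actually implemented, and `build_kegg_url` is a hypothetical name used only for illustration.

```python
from urllib.parse import quote

KEGG_BASE = "https://rest.kegg.jp"

def build_kegg_url(operation, *args):
    """Join an operation and its arguments into a KEGG REST URL (hypothetical helper)."""
    # Keep ':' and '+' literal: KEGG uses them in entry IDs and to join multiple entries.
    parts = [quote(str(arg), safe=":+") for arg in args if arg]
    return "/".join([KEGG_BASE, operation, *parts])

print(build_kegg_url("info", "pathway"))
print(build_kegg_url("list", "pathway", "hsa"))
print(build_kegg_url("get", "hsa:10458+hsa:10459", "aaseq"))
```

Seeing the URL shape can help when debugging a 400 response, since the argument order in the URL mirrors the argument order of the helper call.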
## Core Operations
### 1. Database Information (`kegg_info`)
Retrieve metadata and statistics about KEGG databases.
**When to use**: Understanding database structure, checking available data, getting release information.
**Usage**:
```python
from scripts.kegg_api import kegg_info
# Get pathway database info
info = kegg_info('pathway')
# Get organism-specific info
hsa_info = kegg_info('hsa') # Human genome
```
**Common databases**: `kegg`, `pathway`, `module`, `brite`, `genes`, `genome`, `compound`, `glycan`, `reaction`, `enzyme`, `disease`, `drug`
### 2. Listing Entries (`kegg_list`)
List entry identifiers and names from KEGG databases.
**When to use**: Getting all pathways for an organism, listing genes, retrieving compound catalogs.
**Usage**:
```python
from scripts.kegg_api import kegg_list
# List all reference pathways
pathways = kegg_list('pathway')
# List human-specific pathways
hsa_pathways = kegg_list('pathway', 'hsa')
# List specific genes (max 10)
genes = kegg_list('hsa:10458+hsa:10459')
```
**Common organism codes**: `hsa` (human), `mmu` (mouse), `dme` (fruit fly), `sce` (yeast), `eco` (E. coli)
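List results are easier to work with once parsed. A minimal sketch, assuming `kegg_list` returns the raw tab-delimited text (one `ID<TAB>name` pair per line), as the workflows in this document do. `parse_kegg_table` is a hypothetical helper and the sample rows are hand-written for illustration.

```python
def parse_kegg_table(text):
    """Parse two-column, tab-delimited KEGG output into {entry_id: description}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, including the trailing one
        entry_id, _, description = line.partition("\t")
        entries[entry_id] = description
    return entries

# Hand-written sample in the assumed layout, not real query output.
sample = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)\n"
)
names = parse_kegg_table(sample)
print(names["hsa00010"])
```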
### 3. Searching (`kegg_find`)
Search KEGG databases by keywords or molecular properties.
**When to use**: Finding genes by name/description, searching compounds by formula or mass, discovering entries by keywords.
**Usage**:
```python
from scripts.kegg_api import kegg_find
# Keyword search
results = kegg_find('genes', 'p53')
shiga_toxin = kegg_find('genes', 'shiga toxin')
# Chemical formula search (partial match, atom order ignored)
compounds = kegg_find('compound', 'C7H10N4O2', 'formula')
# Exact mass range search
drugs = kegg_find('drug', '300-310', 'exact_mass')
```
**Search options**: `formula` (partial match, irrespective of atom order), `exact_mass` (single value or range), `mol_weight` (single value or range)
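A keyword search against `genes` can return hits from many organisms. One way to narrow the result is to filter on the organism prefix of each entry ID. A minimal sketch, assuming `kegg_find` returns tab-delimited text; `filter_by_organism` is a hypothetical helper and the sample rows are hand-written.

```python
def filter_by_organism(find_text, org_code):
    """Return (entry_id, description) pairs whose ID starts with '<org_code>:'."""
    prefix = f"{org_code}:"
    hits = []
    for line in find_text.splitlines():
        if line.startswith(prefix):
            entry_id, _, description = line.partition("\t")
            hits.append((entry_id, description))
    return hits

# Hand-written sample in the assumed layout, not real query output.
sample = (
    "hsa:7157\tTP53; tumor protein p53\n"
    "mmu:22059\tTrp53; transformation related protein 53\n"
)
print(filter_by_organism(sample, "hsa"))
```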
### 4. Retrieving Entries (`kegg_get`)
Get complete database entries or specific data formats.
**When to use**: Retrieving pathway details, getting gene/protein sequences, downloading pathway maps, accessing compound structures.
**Usage**:
```python
from scripts.kegg_api import kegg_get
# Get pathway entry
pathway = kegg_get('hsa00010') # Glycolysis pathway
# Get multiple entries (max 10)
genes = kegg_get(['hsa:10458', 'hsa:10459'])
# Get protein sequence (FASTA)
sequence = kegg_get('hsa:10458', 'aaseq')
# Get nucleotide sequence
nt_seq = kegg_get('hsa:10458', 'ntseq')
# Get compound structure
mol_file = kegg_get('cpd:C00002', 'mol') # ATP in MOL format
# Get BRITE hierarchy as JSON (single entry only; the json option applies to brite entries)
brite_json = kegg_get('br:br08301', 'json')
# Get pathway image (single entry only)
pathway_img = kegg_get('hsa05130', 'image')
```
**Output formats**: `aaseq` (protein FASTA), `ntseq` (nucleotide FASTA), `mol` (MOL format), `kcf` (KCF format), `image` (PNG), `kgml` (XML), `json` (BRITE hierarchy JSON)
**Important**: Image, KGML, and JSON formats allow only one entry at a time.
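Because of the 10-entry limit, longer ID lists have to be split before retrieval. A minimal sketch of that batching step; `chunked` is a hypothetical helper and the gene IDs are placeholders generated for the example.

```python
def chunked(ids, size=10):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# 25 placeholder IDs, used only to show how the list is split.
gene_ids = [f"hsa:{n}" for n in range(10458, 10483)]
batches = list(chunked(gene_ids))
print([len(batch) for batch in batches])  # [10, 10, 5]
```

Each batch could then be passed to `kegg_get(batch)`, ideally with a short pause between requests.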
### 5. ID Conversion (`kegg_conv`)
Convert identifiers between KEGG and external databases.
**When to use**: Integrating KEGG data with other databases, mapping gene IDs, converting compound identifiers.
**Usage**:
```python
from scripts.kegg_api import kegg_conv
# Convert all human genes to NCBI Gene IDs
conversions = kegg_conv('ncbi-geneid', 'hsa')
# Convert specific gene
gene_id = kegg_conv('ncbi-geneid', 'hsa:10458')
# Convert to UniProt
uniprot_id = kegg_conv('uniprot', 'hsa:10458')
# Convert compounds to PubChem
pubchem_ids = kegg_conv('pubchem', 'compound')
# Reverse conversion (NCBI Gene ID to KEGG)
kegg_id = kegg_conv('hsa', 'ncbi-geneid')
```
**Supported conversions**: `ncbi-geneid`, `ncbi-proteinid`, `uniprot`, `pubchem`, `chebi`
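A single KEGG entry can map to more than one external identifier, so a plain `dict` of one-to-one pairs may silently drop mappings. A minimal sketch that keeps every target, assuming the two-column `source<TAB>target` layout parsed in Workflow 4; `parse_conv` is a hypothetical helper and the IDs below are deliberately fake placeholders.

```python
from collections import defaultdict

def parse_conv(conv_text):
    """Parse conversion output into {source_id: [target_id, ...]}."""
    mapping = defaultdict(list)
    for line in conv_text.splitlines():
        if not line.strip():
            continue
        source_id, target_id = line.split("\t")
        # Drop the database prefix (for example 'up:') from the target ID.
        mapping[source_id].append(target_id.split(":", 1)[-1])
    return dict(mapping)

# Placeholder rows in the assumed layout, not real identifiers.
sample = "hsa:1\tup:AAAAAA\nhsa:1\tup:BBBBBB\nhsa:2\tup:CCCCCC\n"
print(parse_conv(sample))
```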
### 6. Cross-Referencing (`kegg_link`)
Find related entries within and between KEGG databases.
**When to use**: Finding pathways containing genes, getting genes in a pathway, mapping genes to KO groups, finding compounds in pathways.
**Usage**:
```python
from scripts.kegg_api import kegg_link
# Find pathways linked to human genes
pathways = kegg_link('pathway', 'hsa')
# Get genes in a specific pathway
genes = kegg_link('genes', 'hsa00010') # Glycolysis genes
# Find pathways containing a specific gene
gene_pathways = kegg_link('pathway', 'hsa:10458')
# Find compounds in a pathway
compounds = kegg_link('compound', 'hsa00010')
# Map genes to KO (orthology) groups
ko_groups = kegg_link('ko', 'hsa:10458')
```
**Common links**: genes ↔ pathway, pathway ↔ compound, pathway ↔ enzyme, genes ↔ ko (orthology)
### 7. Drug-Drug Interactions (`kegg_ddi`)
Check for drug-drug interactions.
**When to use**: Analyzing drug combinations, checking for contraindications, pharmacological research.
**Usage**:
```python
from scripts.kegg_api import kegg_ddi
# Check single drug
interactions = kegg_ddi('D00001')
# Check multiple drugs (max 10)
interactions = kegg_ddi(['D00001', 'D00002', 'D00003'])
```
## Common Analysis Workflows
### Workflow 1: Gene to Pathway Mapping
**Use case**: Finding pathways associated with genes of interest (e.g., for pathway enrichment analysis).
```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get
# Step 1: Find gene ID by name
gene_results = kegg_find('genes', 'p53')
# Step 2: Link gene to pathways
pathways = kegg_link('pathway', 'hsa:7157') # TP53 gene
# Step 3: Get detailed pathway information
for pathway_line in pathways.split('\n'):
    if pathway_line:
        pathway_id = pathway_line.split('\t')[1].replace('path:', '')
        pathway_info = kegg_get(pathway_id)
        # Process pathway information
```
### Workflow 2: Pathway Enrichment Context
**Use case**: Getting all genes in organism pathways for enrichment analysis.
```python
from scripts.kegg_api import kegg_list, kegg_link
# Step 1: List all human pathways
pathways = kegg_list('pathway', 'hsa')
# Step 2: For each pathway, get associated genes
for pathway_line in pathways.split('\n'):
    if pathway_line:
        pathway_id = pathway_line.split('\t')[0]
        genes = kegg_link('genes', pathway_id)
        # Process genes for enrichment analysis
```
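The loop above issues one request per pathway. Since `kegg_link('pathway', 'hsa')` (see Cross-Referencing) returns gene-pathway pairs for the whole organism in one response, the gene sets can also be built from that single result. A minimal sketch, assuming the `gene<TAB>path:ID` layout parsed in Workflow 1; `build_gene_sets` and `overlap_counts` are hypothetical helpers and the sample rows are hand-written.

```python
from collections import defaultdict

def build_gene_sets(link_text):
    """Group gene-pathway pairs into {pathway_id: set_of_gene_ids}."""
    gene_sets = defaultdict(set)
    for line in link_text.splitlines():
        if not line.strip():
            continue
        gene_id, pathway_id = line.split("\t")
        gene_sets[pathway_id.replace("path:", "")].add(gene_id)
    return dict(gene_sets)

def overlap_counts(gene_sets, query_genes):
    """Count how many query genes fall into each pathway's gene set."""
    query = set(query_genes)
    return {pid: len(genes & query) for pid, genes in gene_sets.items()}

# Hand-written sample in the assumed layout, not real query output.
sample = (
    "hsa:7157\tpath:hsa04115\n"
    "hsa:7157\tpath:hsa05200\n"
    "hsa:1026\tpath:hsa04115\n"
)
gene_sets = build_gene_sets(sample)
print(overlap_counts(gene_sets, ["hsa:7157", "hsa:1026"]))
```

The overlap counts are only the raw input to an enrichment test; the statistical test itself is out of scope here.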
### Workflow 3: Compound to Pathway Analysis
**Use case**: Finding metabolic pathways containing compounds of interest.
```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get
# Step 1: Search for compound
compound_results = kegg_find('compound', 'glucose')
# Step 2: Link compound to reactions
reactions = kegg_link('reaction', 'cpd:C00031') # Glucose
# Step 3: Link reactions to pathways
pathways = kegg_link('pathway', 'rn:R00299') # Specific reaction
# Step 4: Get pathway details
pathway_info = kegg_get('map00010') # Glycolysis
```
### Workflow 4: Cross-Database Integration
**Use case**: Integrating KEGG data with UniProt, NCBI, or PubChem databases.
```python
from scripts.kegg_api import kegg_conv, kegg_get
# Step 1: Convert KEGG gene IDs to external database IDs
uniprot_map = kegg_conv('uniprot', 'hsa')
ncbi_map = kegg_conv('ncbi-geneid', 'hsa')
# Step 2: Parse conversion results
for line in uniprot_map.split('\n'):
    if line:
        kegg_id, uniprot_id = line.split('\t')
        # Use external IDs for integration
# Step 3: Get sequences using KEGG
sequence = kegg_get('hsa:10458', 'aaseq')
```
### Workflow 5: Organism-Specific Pathway Analysis
**Use case**: Comparing pathways across different organisms.
```python
from scripts.kegg_api import kegg_list, kegg_get
# Step 1: List pathways for multiple organisms
human_pathways = kegg_list('pathway', 'hsa')
mouse_pathways = kegg_list('pathway', 'mmu')
yeast_pathways = kegg_list('pathway', 'sce')
# Step 2: Get reference pathway for comparison
ref_pathway = kegg_get('map00010') # Reference glycolysis
# Step 3: Get organism-specific versions
hsa_glycolysis = kegg_get('hsa00010')
mmu_glycolysis = kegg_get('mmu00010')
```
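Organism-specific pathway IDs share their five-digit number with the reference pathway (`hsa00010`, `mmu00010`, `map00010`), so pathway lists can be compared by stripping the organism prefix. A minimal sketch, assuming `kegg_list` returns tab-delimited text with the pathway ID in the first column; `pathway_numbers` is a hypothetical helper and the sample rows are hand-written.

```python
import re

# Optional 'path:' prefix, then an organism or 'map' code, then five digits.
PATHWAY_ID = re.compile(r"^(?:path:)?[a-z]+(\d{5})")

def pathway_numbers(list_text):
    """Collect the five-digit pathway numbers from list output."""
    numbers = set()
    for line in list_text.splitlines():
        match = PATHWAY_ID.match(line)
        if match:
            numbers.add(match.group(1))
    return numbers

# Hand-written samples in the assumed layout, not real query output.
human = "hsa00010\tGlycolysis / Gluconeogenesis\nhsa04115\tp53 signaling pathway\n"
yeast = "sce00010\tGlycolysis / Gluconeogenesis\n"

shared = pathway_numbers(human) & pathway_numbers(yeast)
human_only = pathway_numbers(human) - pathway_numbers(yeast)
print(sorted(shared), sorted(human_only))
```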
## Pathway Categories
KEGG organizes pathways into seven major categories. Use them when interpreting pathway IDs or recommending pathways to users:
1. **Metabolism** (e.g., `map00010` - Glycolysis, `map00190` - Oxidative phosphorylation)
2. **Genetic Information Processing** (e.g., `map03010` - Ribosome, `map03040` - Spliceosome)
3. **Environmental Information Processing** (e.g., `map04010` - MAPK signaling, `map02010` - ABC transporters)
4. **Cellular Processes** (e.g., `map04140` - Autophagy, `map04210` - Apoptosis)
5. **Organismal Systems** (e.g., `map04610` - Complement cascade, `map04910` - Insulin signaling)
6. **Human Diseases** (e.g., `map05200` - Pathways in cancer, `map05010` - Alzheimer disease)
7. **Drug Development** (chronological and target-based classifications)
Reference `references/kegg_reference.md` for detailed pathway lists and classifications.
## Important Identifiers and Formats
### Pathway IDs
- `map#####` - Reference pathway (generic, not organism-specific)
- `hsa#####` - Human pathway
- `mmu#####` - Mouse pathway
### Gene IDs
- Format: `organism:gene_number` (e.g., `hsa:10458`)
### Compound IDs
- Format: `cpd:C#####` (e.g., `cpd:C00002` for ATP)
### Drug IDs
- Format: `dr:D#####` (e.g., `dr:D00001`)
### Enzyme IDs
- Format: `ec:EC_number` (e.g., `ec:1.1.1.1`)
### KO (KEGG Orthology) IDs
- Format: `ko:K#####` (e.g., `ko:K00001`)
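The formats above can be checked before sending a request. A minimal sketch of a classifier built only from the patterns listed in this section; `classify_kegg_id` is a hypothetical helper, and the regular expressions are an approximation that does not cover every KEGG database.

```python
import re

# Order matters: specific prefixes are tested before the generic gene pattern.
ID_PATTERNS = [
    ("compound", re.compile(r"^cpd:C\d{5}$")),
    ("drug", re.compile(r"^dr:D\d{5}$")),
    ("orthology", re.compile(r"^ko:K\d{5}$")),
    ("enzyme", re.compile(r"^ec:\d+(\.(\d+|-)){3}$")),
    ("pathway", re.compile(r"^(path:)?[a-z]{2,4}\d{5}$")),
    ("gene", re.compile(r"^[a-z]{3,4}:\S+$")),
]

def classify_kegg_id(entry_id):
    """Return the first matching ID type, or 'unknown'."""
    for id_type, pattern in ID_PATTERNS:
        if pattern.match(entry_id):
            return id_type
    return "unknown"

for example in ["hsa00010", "hsa:10458", "cpd:C00002", "dr:D00001", "ec:1.1.1.1", "ko:K00001"]:
    print(example, classify_kegg_id(example))
```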
## API Limitations
Respect these constraints when using the KEGG API:
1. **Entry limits**: Maximum 10 entries per operation (except image/kgml/json: 1 entry only)
2. **Academic use**: API is for academic use only; commercial use requires licensing
3. **HTTP status codes**: Check for 200 (success), 400 (bad request), 404 (not found)
4. **Rate limiting**: No explicit limit, but avoid rapid-fire requests
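The status-code and rate-limiting points can be handled with a small amount of client-side code. A minimal sketch: `describe_status` and `Throttle` are hypothetical helpers, and the 0.5 second default interval is an arbitrary assumption rather than a documented KEGG limit.

```python
import time

STATUS_HINTS = {
    200: "success",
    400: "bad request: check operation syntax and parameters",
    404: "not found: verify IDs and organism codes",
}

def describe_status(code):
    """Translate an HTTP status code into a short hint."""
    return STATUS_HINTS.get(code, f"unexpected status {code}")

class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval=0.5):  # interval is an assumed value
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.1)
for code in (200, 400, 404):
    throttle.wait()  # call this before each real request
    print(code, describe_status(code))
```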
## Detailed Reference
For comprehensive API documentation, database specifications, organism codes, and advanced usage, refer to `references/kegg_reference.md`. This includes:
- Complete list of KEGG databases
- Detailed API operation syntax
- All organism codes
- HTTP status codes and error handling
- Integration with Biopython and R/Bioconductor
- Best practices for API usage
## Troubleshooting
**404 Not Found**: Entry or database doesn't exist; verify IDs and organism codes
**400 Bad Request**: Syntax error in API call; check parameter formatting
**Empty results**: Search term may not match entries; try broader keywords
**Image/KGML errors**: These formats only work with single entries; remove batch processing
## Additional Tools
For interactive pathway visualization and annotation:
- **KEGG Mapper**: https://www.kegg.jp/kegg/mapper/
- **BlastKOALA**: Automated genome annotation
- **GhostKOALA**: Metagenome/metatranscriptome annotation