Initial commit
This commit is contained in:
190
skills/hmdb-database/SKILL.md
Normal file
190
skills/hmdb-database/SKILL.md
Normal file
@@ -0,0 +1,190 @@
|
||||
---
|
||||
name: hmdb-database
|
||||
description: "Access Human Metabolome Database (220K+ metabolites). Search by name/ID/structure, retrieve chemical properties, biomarker data, NMR/MS spectra, pathways, for metabolomics and identification."
|
||||
---
|
||||
|
||||
# HMDB Database
|
||||
|
||||
## Overview
|
||||
|
||||
The Human Metabolome Database (HMDB) is a comprehensive, freely available resource containing detailed information about small molecule metabolites found in the human body.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
This skill should be used when performing metabolomics research, clinical chemistry, biomarker discovery, or metabolite identification tasks.
|
||||
|
||||
## Database Contents
|
||||
|
||||
HMDB version 5.0 (current as of 2025) contains:
|
||||
|
||||
- **220,945 metabolite entries** covering both water-soluble and lipid-soluble compounds
|
||||
- **8,610 protein sequences** for enzymes and transporters involved in metabolism
|
||||
- **130+ data fields per metabolite** including:
|
||||
- Chemical properties (structure, formula, molecular weight, InChI, SMILES)
|
||||
- Clinical data (biomarker associations, diseases, normal/abnormal concentrations)
|
||||
- Biological information (pathways, reactions, locations)
|
||||
- Spectroscopic data (NMR, MS, MS-MS spectra)
|
||||
- External database links (KEGG, PubChem, MetaCyc, ChEBI, PDB, UniProt, GenBank)
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
### 1. Web-Based Metabolite Searches
|
||||
|
||||
Access HMDB through the web interface at https://www.hmdb.ca/ for:
|
||||
|
||||
**Text Searches:**
|
||||
- Search by metabolite name, synonym, or identifier (HMDB ID)
|
||||
- Example HMDB IDs: HMDB0000001, HMDB0001234
|
||||
- Search by disease associations or pathway involvement
|
||||
- Query by biological specimen type (urine, serum, CSF, saliva, feces, sweat)
|
||||
|
||||
**Structure-Based Searches:**
|
||||
- Use ChemQuery for structure and substructure searches
|
||||
- Search by molecular weight or molecular weight range
|
||||
- Use SMILES or InChI strings to find compounds
|
||||
|
||||
**Spectral Searches:**
|
||||
- LC-MS spectral matching
|
||||
- GC-MS spectral matching
|
||||
- NMR spectral searches for metabolite identification
|
||||
|
||||
**Advanced Searches:**
|
||||
- Combine multiple criteria (name, properties, concentration ranges)
|
||||
- Filter by biological locations or specimen types
|
||||
- Search by protein/enzyme associations
|
||||
|
||||
### 2. Accessing Metabolite Information
|
||||
|
||||
When retrieving metabolite data, HMDB provides:
|
||||
|
||||
**Chemical Information:**
|
||||
- Systematic name, traditional names, and synonyms
|
||||
- Chemical formula and molecular weight
|
||||
- Structure representations (2D/3D, SMILES, InChI, MOL file)
|
||||
- Chemical taxonomy and classification
|
||||
|
||||
**Biological Context:**
|
||||
- Metabolic pathways and reactions
|
||||
- Associated enzymes and transporters
|
||||
- Subcellular locations
|
||||
- Biological roles and functions
|
||||
|
||||
**Clinical Relevance:**
|
||||
- Normal concentration ranges in biological fluids
|
||||
- Biomarker associations with diseases
|
||||
- Clinical significance
|
||||
- Toxicity information when applicable
|
||||
|
||||
**Analytical Data:**
|
||||
- Experimental and predicted NMR spectra
|
||||
- MS and MS-MS spectra
|
||||
- Retention times and chromatographic data
|
||||
- Reference peaks for identification
|
||||
|
||||
### 3. Downloadable Datasets
|
||||
|
||||
HMDB offers bulk data downloads at https://www.hmdb.ca/downloads in multiple formats:
|
||||
|
||||
**Available Formats:**
|
||||
- **XML**: Complete metabolite, protein, and spectra data
|
||||
- **SDF**: Metabolite structure files for cheminformatics
|
||||
- **FASTA**: Protein and gene sequences
|
||||
- **TXT**: Raw spectra peak lists
|
||||
- **CSV/TSV**: Tabular data exports
|
||||
|
||||
**Dataset Categories:**
|
||||
- All metabolites or filtered by specimen type
|
||||
- Protein/enzyme sequences
|
||||
- Experimental and predicted spectra (NMR, GC-MS, MS-MS)
|
||||
- Pathway information
|
||||
|
||||
**Best Practices:**
|
||||
- Download XML format for comprehensive data including all fields
|
||||
- Use SDF format for structure-based analysis and cheminformatics workflows
|
||||
- Parse CSV/TSV formats for integration with data analysis pipelines
|
||||
- Check version dates to ensure up-to-date data (current: v5.0, 2023-07-01)
|
||||
|
||||
**Usage Requirements:**
|
||||
- Free for academic and non-commercial research
|
||||
- Commercial use requires explicit permission (contact samackay@ualberta.ca)
|
||||
- Cite HMDB publication when using data
|
||||
|
||||
### 4. Programmatic API Access
|
||||
|
||||
**API Availability:**
|
||||
HMDB does not provide a public REST API. Programmatic access requires contacting the development team:
|
||||
|
||||
- **Academic/Research groups:** Contact eponine@ualberta.ca (Eponine) or samackay@ualberta.ca (Scott)
|
||||
- **Commercial organizations:** Contact samackay@ualberta.ca (Scott) for customized API access
|
||||
|
||||
**Alternative Programmatic Access:**
|
||||
- **R/Bioconductor**: Use the `hmdbQuery` package for R-based queries
|
||||
- Install: `BiocManager::install("hmdbQuery")`
|
||||
- Provides HTTP-based querying functions
|
||||
- **Downloaded datasets**: Parse XML or CSV files locally for programmatic analysis
|
||||
- **Web scraping**: Not recommended; contact team for proper API access instead
|
||||
|
||||
### 5. Common Research Workflows
|
||||
|
||||
**Metabolite Identification in Untargeted Metabolomics:**
|
||||
1. Obtain experimental MS or NMR spectra from samples
|
||||
2. Use HMDB spectral search tools to match against reference spectra
|
||||
3. Verify candidates by checking molecular weight, retention time, and MS-MS fragmentation
|
||||
4. Review biological plausibility (expected in specimen type, known pathways)
|
||||
|
||||
**Biomarker Discovery:**
|
||||
1. Search HMDB for metabolites associated with disease of interest
|
||||
2. Review concentration ranges in normal vs. disease states
|
||||
3. Identify metabolites with strong differential abundance
|
||||
4. Examine pathway context and biological mechanisms
|
||||
5. Cross-reference with literature via PubMed links
|
||||
|
||||
**Pathway Analysis:**
|
||||
1. Identify metabolites of interest from experimental data
|
||||
2. Look up HMDB entries for each metabolite
|
||||
3. Extract pathway associations and enzymatic reactions
|
||||
4. Use linked SMPDB (Small Molecule Pathway Database) for pathway diagrams
|
||||
5. Identify pathway enrichment for biological interpretation
|
||||
|
||||
**Database Integration:**
|
||||
1. Download HMDB data in XML or CSV format
|
||||
2. Parse and extract relevant fields for local database
|
||||
3. Link with external IDs (KEGG, PubChem, ChEBI) for cross-database queries
|
||||
4. Build local tools or pipelines incorporating HMDB reference data
|
||||
|
||||
## Related HMDB Resources
|
||||
|
||||
The HMDB ecosystem includes related databases:
|
||||
|
||||
- **DrugBank**: ~2,832 drug compounds with pharmaceutical information
|
||||
- **T3DB (Toxin and Toxin Target Database)**: ~3,670 toxic compounds
|
||||
- **SMPDB (Small Molecule Pathway Database)**: Pathway diagrams and maps
|
||||
- **FooDB**: ~70,000 food component compounds
|
||||
|
||||
These databases share similar structure and identifiers, enabling integrated queries across human metabolome, drug, toxin, and food databases.
|
||||
|
||||
## Best Practices
|
||||
|
||||
**Data Quality:**
|
||||
- Verify metabolite identifications with multiple evidence types (spectra, structure, properties)
|
||||
- Check experimental vs. predicted data quality indicators
|
||||
- Review citations and evidence for biomarker associations
|
||||
|
||||
**Version Tracking:**
|
||||
- Note HMDB version used in research (current: v5.0)
|
||||
- Databases are updated periodically with new entries and corrections
|
||||
- Re-query for updates when publishing to ensure current information
|
||||
|
||||
**Citation:**
|
||||
- Always cite HMDB in publications using the database
|
||||
- Reference specific HMDB IDs when discussing metabolites
|
||||
- Acknowledge data sources for downloaded datasets
|
||||
|
||||
**Performance:**
|
||||
- For large-scale analysis, download complete datasets rather than repeated web queries
|
||||
- Use appropriate file formats (XML for comprehensive data, CSV for tabular analysis)
|
||||
- Consider local caching of frequently accessed metabolite information
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
See `references/hmdb_data_fields.md` for detailed information about available data fields and their meanings.
|
||||
267
skills/hmdb-database/references/hmdb_data_fields.md
Normal file
267
skills/hmdb-database/references/hmdb_data_fields.md
Normal file
@@ -0,0 +1,267 @@
|
||||
# HMDB Data Fields Reference
|
||||
|
||||
This document provides detailed information about the data fields available in HMDB metabolite entries.
|
||||
|
||||
## Metabolite Entry Structure
|
||||
|
||||
Each HMDB metabolite entry contains 130+ data fields organized into several categories:
|
||||
|
||||
### Chemical Data Fields
|
||||
|
||||
**Identification:**
|
||||
- `accession`: Primary HMDB ID (e.g., HMDB0000001)
|
||||
- `secondary_accessions`: Previous HMDB IDs for merged entries
|
||||
- `name`: Primary metabolite name
|
||||
- `synonyms`: Alternative names and common names
|
||||
- `chemical_formula`: Molecular formula (e.g., C6H12O6)
|
||||
- `average_molecular_weight`: Average molecular weight in Daltons
|
||||
- `monoisotopic_molecular_weight`: Monoisotopic molecular weight
|
||||
|
||||
**Structure Representations:**
|
||||
- `smiles`: Simplified Molecular Input Line Entry System string
|
||||
- `inchi`: International Chemical Identifier string
|
||||
- `inchikey`: Hashed InChI for fast lookup
|
||||
- `iupac_name`: IUPAC systematic name
|
||||
- `traditional_iupac`: Traditional IUPAC name
|
||||
|
||||
**Chemical Properties:**
|
||||
- `state`: Physical state (solid, liquid, gas)
|
||||
- `charge`: Net molecular charge
|
||||
- `logp`: Octanol-water partition coefficient (experimental/predicted)
|
||||
- `pka_strongest_acidic`: Strongest acidic pKa value
|
||||
- `pka_strongest_basic`: Strongest basic pKa value
|
||||
- `polar_surface_area`: Topological polar surface area (TPSA)
|
||||
- `refractivity`: Molar refractivity
|
||||
- `polarizability`: Molecular polarizability
|
||||
- `rotatable_bond_count`: Number of rotatable bonds
|
||||
- `acceptor_count`: Hydrogen bond acceptor count
|
||||
- `donor_count`: Hydrogen bond donor count
|
||||
|
||||
**Chemical Taxonomy:**
|
||||
- `kingdom`: Chemical kingdom (e.g., Organic compounds)
|
||||
- `super_class`: Chemical superclass
|
||||
- `class`: Chemical class
|
||||
- `sub_class`: Chemical subclass
|
||||
- `direct_parent`: Direct chemical parent
|
||||
- `alternative_parents`: Alternative parent classifications
|
||||
- `substituents`: Chemical substituents present
|
||||
- `description`: Text description of the compound
|
||||
|
||||
### Biological Data Fields
|
||||
|
||||
**Metabolite Origins:**
|
||||
- `origin`: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
|
||||
- `biofluid_locations`: Biological fluids where found (blood, urine, saliva, CSF, etc.)
|
||||
- `tissue_locations`: Tissues where found (liver, kidney, brain, muscle, etc.)
|
||||
- `cellular_locations`: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)
|
||||
|
||||
**Biospecimen Information:**
|
||||
- `biospecimen`: Type of biological specimen
|
||||
- `status`: Detection status (detected, expected, predicted)
|
||||
- `concentration`: Concentration ranges with units
|
||||
- `concentration_references`: Citations for concentration data
|
||||
|
||||
**Normal and Abnormal Concentrations:**
|
||||
For each biofluid (blood, urine, saliva, CSF, feces, sweat):
|
||||
- Normal concentration value and range
|
||||
- Units (μM, mg/L, etc.)
|
||||
- Age and gender considerations
|
||||
- Abnormal concentration indicators
|
||||
- Clinical significance
|
||||
|
||||
### Pathway and Enzyme Information
|
||||
|
||||
**Metabolic Pathways:**
|
||||
- `pathways`: List of associated metabolic pathways
|
||||
- Pathway name
|
||||
- SMPDB ID (Small Molecule Pathway Database ID)
|
||||
- KEGG pathway ID
|
||||
- Pathway category
|
||||
|
||||
**Enzymatic Reactions:**
|
||||
- `protein_associations`: Enzymes and transporters
|
||||
- Protein name
|
||||
- Gene name
|
||||
- Uniprot ID
|
||||
- GenBank ID
|
||||
- Protein type (enzyme, transporter, carrier, etc.)
|
||||
- Enzyme reactions
|
||||
- Enzyme kinetics (Km values)
|
||||
|
||||
**Biochemical Context:**
|
||||
- `reactions`: Biochemical reactions involving the metabolite
|
||||
- `reaction_enzymes`: Enzymes catalyzing reactions
|
||||
- `cofactors`: Required cofactors
|
||||
- `inhibitors`: Known enzyme inhibitors
|
||||
|
||||
### Disease and Biomarker Associations
|
||||
|
||||
**Disease Links:**
|
||||
- `diseases`: Associated diseases and conditions
|
||||
- Disease name
|
||||
- OMIM ID (Online Mendelian Inheritance in Man)
|
||||
- Disease category
|
||||
- References and evidence
|
||||
|
||||
**Biomarker Information:**
|
||||
- `biomarker_status`: Whether compound is a known biomarker
|
||||
- `biomarker_applications`: Clinical applications
|
||||
- `biomarker_for`: Diseases or conditions where used as biomarker
|
||||
|
||||
### Spectroscopic Data
|
||||
|
||||
**NMR Spectra:**
|
||||
- `nmr_spectra`: Nuclear Magnetic Resonance data
|
||||
- Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
|
||||
- Spectrometer frequency (MHz)
|
||||
- Solvent used
|
||||
- Temperature
|
||||
- pH
|
||||
- Peak list with chemical shifts and multiplicities
|
||||
- FID (Free Induction Decay) files
|
||||
|
||||
**Mass Spectrometry:**
|
||||
- `ms_spectra`: Mass spectrometry data
|
||||
- Spectrum type (MS, MS-MS, LC-MS, GC-MS)
|
||||
- Ionization mode (positive, negative, neutral)
|
||||
- Collision energy
|
||||
- Instrument type
|
||||
- Peak list (m/z, intensity, annotation)
|
||||
- Predicted vs. experimental flag
|
||||
|
||||
**Chromatography:**
|
||||
- `chromatography`: Chromatographic properties
|
||||
- Retention time
|
||||
- Column type
|
||||
- Mobile phase
|
||||
- Method details
|
||||
|
||||
### External Database Links
|
||||
|
||||
**Database Cross-References:**
|
||||
- `kegg_id`: KEGG Compound ID
|
||||
- `pubchem_compound_id`: PubChem CID
|
||||
- `pubchem_substance_id`: PubChem SID
|
||||
- `chebi_id`: Chemical Entities of Biological Interest ID
|
||||
- `chemspider_id`: ChemSpider ID
|
||||
- `drugbank_id`: DrugBank accession (if applicable)
|
||||
- `foodb_id`: FooDB ID (if food component)
|
||||
- `knapsack_id`: KNApSAcK ID
|
||||
- `metacyc_id`: MetaCyc ID
|
||||
- `bigg_id`: BiGG Model ID
|
||||
- `wikipedia_id`: Wikipedia page link
|
||||
- `metlin_id`: METLIN ID
|
||||
- `vmh_id`: Virtual Metabolic Human ID
|
||||
- `fbonto_id`: FlyBase ontology ID
|
||||
|
||||
**Protein Database Links:**
|
||||
- `uniprot_id`: UniProt accession for associated proteins
|
||||
- `genbank_id`: GenBank ID for associated genes
|
||||
- `pdb_id`: Protein Data Bank ID for protein structures
|
||||
|
||||
### Literature and Evidence
|
||||
|
||||
**References:**
|
||||
- `general_references`: General references about the metabolite
|
||||
- PubMed ID
|
||||
- Reference text
|
||||
- Citation
|
||||
- `synthesis_reference`: Synthesis methods and references
|
||||
- `protein_references`: References for protein associations
|
||||
- `pathway_references`: References for pathway involvement
|
||||
|
||||
### Ontology and Classification
|
||||
|
||||
**Ontology Terms:**
|
||||
- `ontology_terms`: Related ontology classifications
|
||||
- Term name
|
||||
- Ontology source (ChEBI, MeSH, etc.)
|
||||
- Term ID
|
||||
- Definition
|
||||
|
||||
### Data Quality and Provenance
|
||||
|
||||
**Metadata:**
|
||||
- `creation_date`: Date entry was created
|
||||
- `update_date`: Date entry was last updated
|
||||
- `version`: HMDB version number
|
||||
- `status`: Entry status (detected, expected, predicted)
|
||||
- `evidence`: Evidence level for detection/presence
|
||||
|
||||
## XML Structure Example
|
||||
|
||||
When downloading HMDB data in XML format, the structure follows this pattern:
|
||||
|
||||
```xml
|
||||
<metabolite>
|
||||
<accession>HMDB0000001</accession>
|
||||
<name>1-Methylhistidine</name>
|
||||
<chemical_formula>C7H11N3O2</chemical_formula>
|
||||
<average_molecular_weight>169.1811</average_molecular_weight>
|
||||
<monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
|
||||
<smiles>CN1C=NC(CC(=O)O)=C1</smiles>
|
||||
<inchi>InChI=1S/C7H11N3O2/c1-10-4-8-3-5(10)2-7(11)12/h3-4H,2H2,1H3,(H,11,12)</inchi>
|
||||
<inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>
|
||||
|
||||
<biospecimen_locations>
|
||||
<biospecimen>Blood</biospecimen>
|
||||
<biospecimen>Urine</biospecimen>
|
||||
</biospecimen_locations>
|
||||
|
||||
<pathways>
|
||||
<pathway>
|
||||
<name>Histidine Metabolism</name>
|
||||
<smpdb_id>SMP0000044</smpdb_id>
|
||||
<kegg_map_id>map00340</kegg_map_id>
|
||||
</pathway>
|
||||
</pathways>
|
||||
|
||||
<diseases>
|
||||
<disease>
|
||||
<name>Carnosinemia</name>
|
||||
<omim_id>212200</omim_id>
|
||||
</disease>
|
||||
</diseases>
|
||||
|
||||
<normal_concentrations>
|
||||
<concentration>
|
||||
<biospecimen>Blood</biospecimen>
|
||||
<concentration_value>3.8</concentration_value>
|
||||
<concentration_units>uM</concentration_units>
|
||||
</concentration>
|
||||
</normal_concentrations>
|
||||
</metabolite>
|
||||
```
|
||||
|
||||
## Querying Specific Fields
|
||||
|
||||
When working with HMDB data programmatically:
|
||||
|
||||
**For metabolite identification:**
|
||||
- Query by `accession`, `name`, `synonyms`, `inchi`, `smiles`
|
||||
|
||||
**For chemical similarity:**
|
||||
- Use `smiles`, `inchi`, `inchikey`, `molecular_weight`, `chemical_formula`
|
||||
|
||||
**For biomarker discovery:**
|
||||
- Filter by `diseases`, `biomarker_status`, `normal_concentrations`, `abnormal_concentrations`
|
||||
|
||||
**For pathway analysis:**
|
||||
- Extract `pathways`, `protein_associations`, `reactions`
|
||||
|
||||
**For spectral matching:**
|
||||
- Compare against `nmr_spectra`, `ms_spectra` peak lists
|
||||
|
||||
**For cross-database integration:**
|
||||
- Map using external IDs: `kegg_id`, `pubchem_compound_id`, `chebi_id`, etc.
|
||||
|
||||
## Field Completeness
|
||||
|
||||
Not all fields are populated for every metabolite:
|
||||
|
||||
- **Highly complete fields** (>90% of entries): accession, name, chemical_formula, molecular_weight, smiles, inchi
|
||||
- **Moderately complete** (50-90%): biospecimen_locations, tissue_locations, pathways
|
||||
- **Variably complete** (10-50%): concentration data, disease associations, protein associations
|
||||
- **Sparsely complete** (<10%): experimental NMR/MS spectra, detailed kinetic data
|
||||
|
||||
Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.
|
||||
Reference in New Issue
Block a user