Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,190 @@
---
name: hmdb-database
description: "Access Human Metabolome Database (220K+ metabolites). Search by name/ID/structure, retrieve chemical properties, biomarker data, NMR/MS spectra, pathways, for metabolomics and identification."
---
# HMDB Database
## Overview
The Human Metabolome Database (HMDB) is a comprehensive, freely available resource containing detailed information about small molecule metabolites found in the human body.
## When to Use This Skill
This skill should be used when performing metabolomics research, clinical chemistry, biomarker discovery, or metabolite identification tasks.
## Database Contents
HMDB version 5.0 (current as of 2025) contains:
- **220,945 metabolite entries** covering both water-soluble and lipid-soluble compounds
- **8,610 protein sequences** for enzymes and transporters involved in metabolism
- **130+ data fields per metabolite** including:
- Chemical properties (structure, formula, molecular weight, InChI, SMILES)
- Clinical data (biomarker associations, diseases, normal/abnormal concentrations)
- Biological information (pathways, reactions, locations)
- Spectroscopic data (NMR, MS, MS-MS spectra)
- External database links (KEGG, PubChem, MetaCyc, ChEBI, PDB, UniProt, GenBank)
## Core Capabilities
### 1. Web-Based Metabolite Searches
Access HMDB through the web interface at https://www.hmdb.ca/ for:
**Text Searches:**
- Search by metabolite name, synonym, or identifier (HMDB ID)
- Example HMDB IDs: HMDB0000001, HMDB0001234
- Search by disease associations or pathway involvement
- Query by biological specimen type (urine, serum, CSF, saliva, feces, sweat)
**Structure-Based Searches:**
- Use ChemQuery for structure and substructure searches
- Search by molecular weight or molecular weight range
- Use SMILES or InChI strings to find compounds
**Spectral Searches:**
- LC-MS spectral matching
- GC-MS spectral matching
- NMR spectral searches for metabolite identification
**Advanced Searches:**
- Combine multiple criteria (name, properties, concentration ranges)
- Filter by biological locations or specimen types
- Search by protein/enzyme associations
### 2. Accessing Metabolite Information
When retrieving metabolite data, HMDB provides:
**Chemical Information:**
- Systematic name, traditional names, and synonyms
- Chemical formula and molecular weight
- Structure representations (2D/3D, SMILES, InChI, MOL file)
- Chemical taxonomy and classification
**Biological Context:**
- Metabolic pathways and reactions
- Associated enzymes and transporters
- Subcellular locations
- Biological roles and functions
**Clinical Relevance:**
- Normal concentration ranges in biological fluids
- Biomarker associations with diseases
- Clinical significance
- Toxicity information when applicable
**Analytical Data:**
- Experimental and predicted NMR spectra
- MS and MS-MS spectra
- Retention times and chromatographic data
- Reference peaks for identification
### 3. Downloadable Datasets
HMDB offers bulk data downloads at https://www.hmdb.ca/downloads in multiple formats:
**Available Formats:**
- **XML**: Complete metabolite, protein, and spectra data
- **SDF**: Metabolite structure files for cheminformatics
- **FASTA**: Protein and gene sequences
- **TXT**: Raw spectra peak lists
- **CSV/TSV**: Tabular data exports
**Dataset Categories:**
- All metabolites or filtered by specimen type
- Protein/enzyme sequences
- Experimental and predicted spectra (NMR, GC-MS, MS-MS)
- Pathway information
**Best Practices:**
- Download XML format for comprehensive data including all fields
- Use SDF format for structure-based analysis and cheminformatics workflows
- Parse CSV/TSV formats for integration with data analysis pipelines
- Check version dates to ensure up-to-date data (current: v5.0, 2023-07-01)
**Usage Requirements:**
- Free for academic and non-commercial research
- Commercial use requires explicit permission (contact samackay@ualberta.ca)
- Cite HMDB publication when using data
### 4. Programmatic API Access
**API Availability:**
HMDB does not provide a public REST API. Programmatic access requires contacting the development team:
- **Academic/Research groups:** Contact eponine@ualberta.ca (Eponine) or samackay@ualberta.ca (Scott)
- **Commercial organizations:** Contact samackay@ualberta.ca (Scott) for customized API access
**Alternative Programmatic Access:**
- **R/Bioconductor**: Use the `hmdbQuery` package for R-based queries
- Install: `BiocManager::install("hmdbQuery")`
- Provides HTTP-based querying functions
- **Downloaded datasets**: Parse XML or CSV files locally for programmatic analysis
- **Web scraping**: Not recommended; contact team for proper API access instead
### 5. Common Research Workflows
**Metabolite Identification in Untargeted Metabolomics:**
1. Obtain experimental MS or NMR spectra from samples
2. Use HMDB spectral search tools to match against reference spectra
3. Verify candidates by checking molecular weight, retention time, and MS-MS fragmentation
4. Review biological plausibility (expected in specimen type, known pathways)
**Biomarker Discovery:**
1. Search HMDB for metabolites associated with disease of interest
2. Review concentration ranges in normal vs. disease states
3. Identify metabolites with strong differential abundance
4. Examine pathway context and biological mechanisms
5. Cross-reference with literature via PubMed links
**Pathway Analysis:**
1. Identify metabolites of interest from experimental data
2. Look up HMDB entries for each metabolite
3. Extract pathway associations and enzymatic reactions
4. Use linked SMPDB (Small Molecule Pathway Database) for pathway diagrams
5. Identify pathway enrichment for biological interpretation
**Database Integration:**
1. Download HMDB data in XML or CSV format
2. Parse and extract relevant fields for local database
3. Link with external IDs (KEGG, PubChem, ChEBI) for cross-database queries
4. Build local tools or pipelines incorporating HMDB reference data
## Related HMDB Resources
The HMDB ecosystem includes related databases:
- **DrugBank**: ~2,832 drug compounds with pharmaceutical information
- **T3DB (Toxin and Toxin Target Database)**: ~3,670 toxic compounds
- **SMPDB (Small Molecule Pathway Database)**: Pathway diagrams and maps
- **FooDB**: ~70,000 food component compounds
These databases share similar structure and identifiers, enabling integrated queries across human metabolome, drug, toxin, and food databases.
## Best Practices
**Data Quality:**
- Verify metabolite identifications with multiple evidence types (spectra, structure, properties)
- Check experimental vs. predicted data quality indicators
- Review citations and evidence for biomarker associations
**Version Tracking:**
- Note HMDB version used in research (current: v5.0)
- Databases are updated periodically with new entries and corrections
- Re-query for updates when publishing to ensure current information
**Citation:**
- Always cite HMDB in publications using the database
- Reference specific HMDB IDs when discussing metabolites
- Acknowledge data sources for downloaded datasets
**Performance:**
- For large-scale analysis, download complete datasets rather than repeated web queries
- Use appropriate file formats (XML for comprehensive data, CSV for tabular analysis)
- Consider local caching of frequently accessed metabolite information
## Reference Documentation
See `references/hmdb_data_fields.md` for detailed information about available data fields and their meanings.

View File

@@ -0,0 +1,267 @@
# HMDB Data Fields Reference
This document provides detailed information about the data fields available in HMDB metabolite entries.
## Metabolite Entry Structure
Each HMDB metabolite entry contains 130+ data fields organized into several categories:
### Chemical Data Fields
**Identification:**
- `accession`: Primary HMDB ID (e.g., HMDB0000001)
- `secondary_accessions`: Previous HMDB IDs for merged entries
- `name`: Primary metabolite name
- `synonyms`: Alternative names and common names
- `chemical_formula`: Molecular formula (e.g., C6H12O6)
- `average_molecular_weight`: Average molecular weight in Daltons
- `monoisotopic_molecular_weight`: Monoisotopic molecular weight
**Structure Representations:**
- `smiles`: Simplified Molecular Input Line Entry System string
- `inchi`: International Chemical Identifier string
- `inchikey`: Hashed InChI for fast lookup
- `iupac_name`: IUPAC systematic name
- `traditional_iupac`: Traditional IUPAC name
**Chemical Properties:**
- `state`: Physical state (solid, liquid, gas)
- `charge`: Net molecular charge
- `logp`: Octanol-water partition coefficient (experimental/predicted)
- `pka_strongest_acidic`: Strongest acidic pKa value
- `pka_strongest_basic`: Strongest basic pKa value
- `polar_surface_area`: Topological polar surface area (TPSA)
- `refractivity`: Molar refractivity
- `polarizability`: Molecular polarizability
- `rotatable_bond_count`: Number of rotatable bonds
- `acceptor_count`: Hydrogen bond acceptor count
- `donor_count`: Hydrogen bond donor count
**Chemical Taxonomy:**
- `kingdom`: Chemical kingdom (e.g., Organic compounds)
- `super_class`: Chemical superclass
- `class`: Chemical class
- `sub_class`: Chemical subclass
- `direct_parent`: Direct chemical parent
- `alternative_parents`: Alternative parent classifications
- `substituents`: Chemical substituents present
- `description`: Text description of the compound
### Biological Data Fields
**Metabolite Origins:**
- `origin`: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
- `biofluid_locations`: Biological fluids where found (blood, urine, saliva, CSF, etc.)
- `tissue_locations`: Tissues where found (liver, kidney, brain, muscle, etc.)
- `cellular_locations`: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)
**Biospecimen Information:**
- `biospecimen`: Type of biological specimen
- `status`: Detection status (detected, expected, predicted)
- `concentration`: Concentration ranges with units
- `concentration_references`: Citations for concentration data
**Normal and Abnormal Concentrations:**
For each biofluid (blood, urine, saliva, CSF, feces, sweat):
- Normal concentration value and range
- Units (μM, mg/L, etc.)
- Age and gender considerations
- Abnormal concentration indicators
- Clinical significance
### Pathway and Enzyme Information
**Metabolic Pathways:**
- `pathways`: List of associated metabolic pathways
- Pathway name
- SMPDB ID (Small Molecule Pathway Database ID)
- KEGG pathway ID
- Pathway category
**Enzymatic Reactions:**
- `protein_associations`: Enzymes and transporters
- Protein name
- Gene name
- Uniprot ID
- GenBank ID
- Protein type (enzyme, transporter, carrier, etc.)
- Enzyme reactions
- Enzyme kinetics (Km values)
**Biochemical Context:**
- `reactions`: Biochemical reactions involving the metabolite
- `reaction_enzymes`: Enzymes catalyzing reactions
- `cofactors`: Required cofactors
- `inhibitors`: Known enzyme inhibitors
### Disease and Biomarker Associations
**Disease Links:**
- `diseases`: Associated diseases and conditions
- Disease name
- OMIM ID (Online Mendelian Inheritance in Man)
- Disease category
- References and evidence
**Biomarker Information:**
- `biomarker_status`: Whether compound is a known biomarker
- `biomarker_applications`: Clinical applications
- `biomarker_for`: Diseases or conditions where used as biomarker
### Spectroscopic Data
**NMR Spectra:**
- `nmr_spectra`: Nuclear Magnetic Resonance data
- Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
- Spectrometer frequency (MHz)
- Solvent used
- Temperature
- pH
- Peak list with chemical shifts and multiplicities
- FID (Free Induction Decay) files
**Mass Spectrometry:**
- `ms_spectra`: Mass spectrometry data
- Spectrum type (MS, MS-MS, LC-MS, GC-MS)
- Ionization mode (positive, negative, neutral)
- Collision energy
- Instrument type
- Peak list (m/z, intensity, annotation)
- Predicted vs. experimental flag
**Chromatography:**
- `chromatography`: Chromatographic properties
- Retention time
- Column type
- Mobile phase
- Method details
### External Database Links
**Database Cross-References:**
- `kegg_id`: KEGG Compound ID
- `pubchem_compound_id`: PubChem CID
- `pubchem_substance_id`: PubChem SID
- `chebi_id`: Chemical Entities of Biological Interest ID
- `chemspider_id`: ChemSpider ID
- `drugbank_id`: DrugBank accession (if applicable)
- `foodb_id`: FooDB ID (if food component)
- `knapsack_id`: KNApSAcK ID
- `metacyc_id`: MetaCyc ID
- `bigg_id`: BiGG Model ID
- `wikipedia_id`: Wikipedia page link
- `metlin_id`: METLIN ID
- `vmh_id`: Virtual Metabolic Human ID
- `fbonto_id`: FlyBase ontology ID
**Protein Database Links:**
- `uniprot_id`: UniProt accession for associated proteins
- `genbank_id`: GenBank ID for associated genes
- `pdb_id`: Protein Data Bank ID for protein structures
### Literature and Evidence
**References:**
- `general_references`: General references about the metabolite
- PubMed ID
- Reference text
- Citation
- `synthesis_reference`: Synthesis methods and references
- `protein_references`: References for protein associations
- `pathway_references`: References for pathway involvement
### Ontology and Classification
**Ontology Terms:**
- `ontology_terms`: Related ontology classifications
- Term name
- Ontology source (ChEBI, MeSH, etc.)
- Term ID
- Definition
### Data Quality and Provenance
**Metadata:**
- `creation_date`: Date entry was created
- `update_date`: Date entry was last updated
- `version`: HMDB version number
- `status`: Entry status (detected, expected, predicted)
- `evidence`: Evidence level for detection/presence
## XML Structure Example
When downloading HMDB data in XML format, the structure follows this pattern:
```xml
<metabolite>
<accession>HMDB0000001</accession>
<name>1-Methylhistidine</name>
<chemical_formula>C7H11N3O2</chemical_formula>
<average_molecular_weight>169.1811</average_molecular_weight>
<monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
<smiles>CN1C=NC(CC(=O)O)=C1</smiles>
<inchi>InChI=1S/C7H11N3O2/c1-10-4-8-3-5(10)2-7(11)12/h3-4H,2H2,1H3,(H,11,12)</inchi>
<inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>
<biospecimen_locations>
<biospecimen>Blood</biospecimen>
<biospecimen>Urine</biospecimen>
</biospecimen_locations>
<pathways>
<pathway>
<name>Histidine Metabolism</name>
<smpdb_id>SMP0000044</smpdb_id>
<kegg_map_id>map00340</kegg_map_id>
</pathway>
</pathways>
<diseases>
<disease>
<name>Carnosinemia</name>
<omim_id>212200</omim_id>
</disease>
</diseases>
<normal_concentrations>
<concentration>
<biospecimen>Blood</biospecimen>
<concentration_value>3.8</concentration_value>
<concentration_units>uM</concentration_units>
</concentration>
</normal_concentrations>
</metabolite>
```
## Querying Specific Fields
When working with HMDB data programmatically:
**For metabolite identification:**
- Query by `accession`, `name`, `synonyms`, `inchi`, `smiles`
**For chemical similarity:**
- Use `smiles`, `inchi`, `inchikey`, `molecular_weight`, `chemical_formula`
**For biomarker discovery:**
- Filter by `diseases`, `biomarker_status`, `normal_concentrations`, `abnormal_concentrations`
**For pathway analysis:**
- Extract `pathways`, `protein_associations`, `reactions`
**For spectral matching:**
- Compare against `nmr_spectra`, `ms_spectra` peak lists
**For cross-database integration:**
- Map using external IDs: `kegg_id`, `pubchem_compound_id`, `chebi_id`, etc.
## Field Completeness
Not all fields are populated for every metabolite:
- **Highly complete fields** (>90% of entries): accession, name, chemical_formula, molecular_weight, smiles, inchi
- **Moderately complete** (50-90%): biospecimen_locations, tissue_locations, pathways
- **Variably complete** (10-50%): concentration data, disease associations, protein associations
- **Sparsely complete** (<10%): experimental NMR/MS spectra, detailed kinetic data
Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.