HMDB Data Fields Reference
This document provides detailed information about the data fields available in HMDB metabolite entries.
Metabolite Entry Structure
Each HMDB metabolite entry contains 130+ data fields organized into several categories:
Chemical Data Fields
Identification:
- accession: Primary HMDB ID (e.g., HMDB0000001)
- secondary_accessions: Previous HMDB IDs for merged entries
- name: Primary metabolite name
- synonyms: Alternative names and common names
- chemical_formula: Molecular formula (e.g., C6H12O6)
- average_molecular_weight: Average molecular weight in Daltons
- monoisotopic_molecular_weight: Monoisotopic molecular weight
Structure Representations:
- smiles: Simplified Molecular Input Line Entry System string
- inchi: International Chemical Identifier string
- inchikey: Hashed InChI for fast lookup
- iupac_name: IUPAC systematic name
- traditional_iupac: Traditional IUPAC name
Chemical Properties:
- state: Physical state (solid, liquid, gas)
- charge: Net molecular charge
- logp: Octanol-water partition coefficient (experimental/predicted)
- pka_strongest_acidic: Strongest acidic pKa value
- pka_strongest_basic: Strongest basic pKa value
- polar_surface_area: Topological polar surface area (TPSA)
- refractivity: Molar refractivity
- polarizability: Molecular polarizability
- rotatable_bond_count: Number of rotatable bonds
- acceptor_count: Hydrogen bond acceptor count
- donor_count: Hydrogen bond donor count
Chemical Taxonomy:
- kingdom: Chemical kingdom (e.g., Organic compounds)
- super_class: Chemical superclass
- class: Chemical class
- sub_class: Chemical subclass
- direct_parent: Direct chemical parent
- alternative_parents: Alternative parent classifications
- substituents: Chemical substituents present
- description: Text description of the compound
Biological Data Fields
Metabolite Origins:
- origin: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
- biofluid_locations: Biological fluids where found (blood, urine, saliva, CSF, etc.)
- tissue_locations: Tissues where found (liver, kidney, brain, muscle, etc.)
- cellular_locations: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)
Biospecimen Information:
- biospecimen: Type of biological specimen
- status: Detection status (detected, expected, predicted)
- concentration: Concentration ranges with units
- concentration_references: Citations for concentration data
Normal and Abnormal Concentrations:
For each biofluid (blood, urine, saliva, CSF, feces, sweat):
- Normal concentration value and range
- Units (μM, mg/L, etc.)
- Age and gender considerations
- Abnormal concentration indicators
- Clinical significance
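Because concentrations may be reported in either molar (μM) or mass (mg/L) units, comparing values across entries usually requires a unit conversion based on the molecular weight. The following is a minimal sketch of that arithmetic, assuming the weight comes from average_molecular_weight in g/mol; the function names are illustrative and are not part of HMDB.

```python
def mg_per_l_to_um(mg_per_l: float, molecular_weight: float) -> float:
    """Convert a mass concentration (mg/L) to a molar concentration (uM).

    molecular_weight is in g/mol, e.g. the average_molecular_weight field.
    """
    if molecular_weight <= 0:
        raise ValueError("molecular_weight must be positive")
    # mg/L -> g/L divides by 1000; g/L -> mol/L divides by MW;
    # mol/L -> umol/L multiplies by 1e6. Net factor: 1000 / MW.
    return mg_per_l * 1000.0 / molecular_weight


def um_to_mg_per_l(um: float, molecular_weight: float) -> float:
    """Inverse conversion: uM to mg/L."""
    if molecular_weight <= 0:
        raise ValueError("molecular_weight must be positive")
    return um * molecular_weight / 1000.0


# Example using the average molecular weight from the XML example in this document
mw = 169.1811
print(um_to_mg_per_l(3.8, mw))
```

Other units that appear in concentration records (for example values normalized to creatinine in urine) cannot be converted this way and need separate handling.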
Pathway and Enzyme Information
Metabolic Pathways:
- pathways: List of associated metabolic pathways
  - Pathway name
  - SMPDB ID (Small Molecule Pathway Database ID)
  - KEGG pathway ID
  - Pathway category
Enzymatic Reactions:
- protein_associations: Enzymes and transporters
  - Protein name
  - Gene name
  - UniProt ID
  - GenBank ID
  - Protein type (enzyme, transporter, carrier, etc.)
  - Enzyme reactions
  - Enzyme kinetics (Km values)
Biochemical Context:
- reactions: Biochemical reactions involving the metabolite
- reaction_enzymes: Enzymes catalyzing reactions
- cofactors: Required cofactors
- inhibitors: Known enzyme inhibitors
Disease and Biomarker Associations
Disease Links:
- diseases: Associated diseases and conditions
  - Disease name
  - OMIM ID (Online Mendelian Inheritance in Man)
  - Disease category
  - References and evidence
Biomarker Information:
- biomarker_status: Whether the compound is a known biomarker
- biomarker_applications: Clinical applications
- biomarker_for: Diseases or conditions where it is used as a biomarker
Spectroscopic Data
NMR Spectra:
- nmr_spectra: Nuclear Magnetic Resonance data
  - Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
  - Spectrometer frequency (MHz)
  - Solvent used
  - Temperature
  - pH
  - Peak list with chemical shifts and multiplicities
  - FID (Free Induction Decay) files
Mass Spectrometry:
- ms_spectra: Mass spectrometry data
  - Spectrum type (MS, MS-MS, LC-MS, GC-MS)
  - Ionization mode (positive, negative, neutral)
  - Collision energy
  - Instrument type
  - Peak list (m/z, intensity, annotation)
  - Predicted vs. experimental flag
Chromatography:
- chromatography: Chromatographic properties
  - Retention time
  - Column type
  - Mobile phase
  - Method details
External Database Links
Database Cross-References:
- kegg_id: KEGG Compound ID
- pubchem_compound_id: PubChem CID
- pubchem_substance_id: PubChem SID
- chebi_id: Chemical Entities of Biological Interest ID
- chemspider_id: ChemSpider ID
- drugbank_id: DrugBank accession (if applicable)
- foodb_id: FooDB ID (if food component)
- knapsack_id: KNApSAcK ID
- metacyc_id: MetaCyc ID
- bigg_id: BiGG Model ID
- wikipedia_id: Wikipedia page link
- metlin_id: METLIN ID
- vmh_id: Virtual Metabolic Human ID
- fbonto_id: FlyBase ontology ID
Protein Database Links:
- uniprot_id: UniProt accession for associated proteins
- genbank_id: GenBank ID for associated genes
- pdb_id: Protein Data Bank ID for protein structures
Literature and Evidence
References:
- general_references: General references about the metabolite
  - PubMed ID
  - Reference text
  - Citation
- synthesis_reference: Synthesis methods and references
- protein_references: References for protein associations
- pathway_references: References for pathway involvement
Ontology and Classification
Ontology Terms:
- ontology_terms: Related ontology classifications
  - Term name
  - Ontology source (ChEBI, MeSH, etc.)
  - Term ID
  - Definition
Data Quality and Provenance
Metadata:
- creation_date: Date the entry was created
- update_date: Date the entry was last updated
- version: HMDB version number
- status: Entry status (detected, expected, predicted)
- evidence: Evidence level for detection/presence
XML Structure Example
When downloading HMDB data in XML format, the structure follows this pattern:
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <average_molecular_weight>169.1811</average_molecular_weight>
  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
  <smiles>CN1C=NC(CC(N)C(=O)O)=C1</smiles>
  <inchi>InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)</inchi>
  <inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>
  <biospecimen_locations>
    <biospecimen>Blood</biospecimen>
    <biospecimen>Urine</biospecimen>
  </biospecimen_locations>
  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
      <kegg_map_id>map00340</kegg_map_id>
    </pathway>
  </pathways>
  <diseases>
    <disease>
      <name>Carnosinemia</name>
      <omim_id>212200</omim_id>
    </disease>
  </diseases>
  <normal_concentrations>
    <concentration>
      <biospecimen>Blood</biospecimen>
      <concentration_value>3.8</concentration_value>
      <concentration_units>uM</concentration_units>
    </concentration>
  </normal_concentrations>
</metabolite>
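As a rough illustration of how a record shaped like the example above could be read, the sketch below uses Python's standard xml.etree.ElementTree. It assumes only the element names shown in this document; the helper names strip_namespaces and parse_metabolite and the output dictionary layout are hypothetical. Downloaded files may declare an XML namespace on the root element, so the sketch removes namespace prefixes before looking up tags.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example record from this document
SAMPLE_XML = """<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <biospecimen_locations>
    <biospecimen>Blood</biospecimen>
    <biospecimen>Urine</biospecimen>
  </biospecimen_locations>
  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
      <kegg_map_id>map00340</kegg_map_id>
    </pathway>
  </pathways>
  <diseases>
    <disease>
      <name>Carnosinemia</name>
      <omim_id>212200</omim_id>
    </disease>
  </diseases>
  <normal_concentrations>
    <concentration>
      <biospecimen>Blood</biospecimen>
      <concentration_value>3.8</concentration_value>
      <concentration_units>uM</concentration_units>
    </concentration>
  </normal_concentrations>
</metabolite>"""


def strip_namespaces(root):
    """Remove '{namespace}' prefixes in place so plain tag names can be used."""
    for el in root.iter():
        if isinstance(el.tag, str) and "}" in el.tag:
            el.tag = el.tag.rsplit("}", 1)[1]
    return root


def parse_metabolite(elem):
    """Collect a few fields from one <metabolite> element into a dict."""
    return {
        "accession": elem.findtext("accession"),
        "name": elem.findtext("name"),
        "chemical_formula": elem.findtext("chemical_formula"),
        "biospecimen_locations": [
            b.text for b in elem.findall("biospecimen_locations/biospecimen")
        ],
        "pathways": [
            {
                "name": p.findtext("name"),
                "smpdb_id": p.findtext("smpdb_id"),
                "kegg_map_id": p.findtext("kegg_map_id"),
            }
            for p in elem.findall("pathways/pathway")
        ],
        "diseases": [
            {"name": d.findtext("name"), "omim_id": d.findtext("omim_id")}
            for d in elem.findall("diseases/disease")
        ],
        "normal_concentrations": [
            {
                "biospecimen": c.findtext("biospecimen"),
                "value": c.findtext("concentration_value"),
                "units": c.findtext("concentration_units"),
            }
            for c in elem.findall("normal_concentrations/concentration")
        ],
    }


record = parse_metabolite(strip_namespaces(ET.fromstring(SAMPLE_XML)))
print(record["accession"], record["biospecimen_locations"])
```

Concentration values are kept as strings here because real entries may hold ranges or annotated values rather than a single number. For a full download containing many entries, ET.iterparse is usually a better fit than loading the whole file at once.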
Querying Specific Fields
When working with HMDB data programmatically:
For metabolite identification:
- Query by accession, name, synonyms, inchi, smiles
For chemical similarity:
- Use smiles, inchi, inchikey, molecular_weight, chemical_formula
For biomarker discovery:
- Filter by diseases, biomarker_status, normal_concentrations, abnormal_concentrations
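One way such a filter might look is sketched below. It assumes entries have already been parsed into dictionaries with diseases and normal_concentrations lists; that layout, the function name candidate_biomarkers, and the two sample records are illustrative assumptions, not HMDB data.

```python
# Placeholder records; accessions and names are made up for illustration
records = [
    {
        "accession": "EXAMPLE_A",
        "diseases": [{"name": "Disease X"}],
        "normal_concentrations": [{"biospecimen": "Blood"}],
    },
    {
        "accession": "EXAMPLE_B",
        "diseases": [{"name": "Disease X"}],
        "normal_concentrations": [],  # no baseline concentration recorded
    },
]


def candidate_biomarkers(records, disease_name, biospecimen):
    """Return accessions linked to a disease that also have a normal
    (baseline) concentration recorded in the given biospecimen."""
    disease_key = disease_name.casefold()
    specimen_key = biospecimen.casefold()
    hits = []
    for rec in records:
        has_disease = any(
            (d.get("name") or "").casefold() == disease_key
            for d in rec.get("diseases", [])
        )
        has_baseline = any(
            (c.get("biospecimen") or "").casefold() == specimen_key
            for c in rec.get("normal_concentrations", [])
        )
        if has_disease and has_baseline:
            hits.append(rec["accession"])
    return hits


print(candidate_biomarkers(records, "disease x", "blood"))
```

Requiring a baseline concentration is a design choice: without a normal range there is nothing to compare an abnormal value against.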
For pathway analysis:
- Extract pathways, protein_associations, reactions
For spectral matching:
- Compare against nmr_spectra, ms_spectra peak lists
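Peak-list comparison can be done in many ways; the sketch below shows one simple, commonly used option (cosine similarity over binned m/z values) and is not a description of how HMDB itself matches spectra. The function names, the bin width, and the peak values are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=0.01):
    """Sum intensities of (mz, intensity) peaks that fall into the same m/z bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_similarity(query, reference, bin_width=0.01):
    """Cosine similarity between two peak lists after binning (0.0 to 1.0)."""
    q = bin_peaks(query, bin_width)
    r = bin_peaks(reference, bin_width)
    dot = sum(q[k] * r[k] for k in q.keys() & r.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in r.values())
    )
    return dot / norm if norm else 0.0


# Made-up peak lists: (m/z, intensity)
query_peaks = [(100.0, 1.0), (200.0, 1.0)]
reference_peaks = [(100.0, 1.0)]
print(cosine_similarity(query_peaks, reference_peaks))
```

Fixed-width binning is easy to reason about but can split two nearly identical peaks across adjacent bins; tolerance-based peak pairing avoids that at the cost of more code.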
For cross-database integration:
- Map using external IDs: kegg_id, pubchem_compound_id, chebi_id, etc.
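For cross-database mapping, a lookup table keyed by external identifier is often enough. The sketch below assumes entries are available as dictionaries keyed by the field names listed in this document; build_xref_index is a hypothetical helper and every identifier value shown is a placeholder rather than a real database ID.

```python
EXTERNAL_ID_FIELDS = ("kegg_id", "pubchem_compound_id", "chebi_id")

# Placeholder records; none of these identifiers are real
records = [
    {"accession": "EXAMPLE_A", "kegg_id": "KEGG_1", "pubchem_compound_id": "111", "chebi_id": None},
    {"accession": "EXAMPLE_B", "kegg_id": "KEGG_1", "pubchem_compound_id": "222", "chebi_id": "CHEBI_9"},
]


def build_xref_index(records, fields=EXTERNAL_ID_FIELDS):
    """Map (field, external_id) to the list of accessions that carry that ID.

    Values are lists because an external ID is not guaranteed to be unique
    to a single entry.
    """
    index = {}
    for rec in records:
        for field in fields:
            value = rec.get(field)
            if value:  # skip missing or empty identifiers
                index.setdefault((field, str(value)), []).append(rec["accession"])
    return index


index = build_xref_index(records)
print(index.get(("kegg_id", "KEGG_1"), []))
```

Keying on the (field, value) pair keeps identifiers from different databases apart even when their strings happen to coincide.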
Field Completeness
Not all fields are populated for every metabolite:
- Highly complete fields (>90% of entries): accession, name, chemical_formula, molecular_weight, smiles, inchi
- Moderately complete (50-90%): biospecimen_locations, tissue_locations, pathways
- Variably complete (10-50%): concentration data, disease associations, protein associations
- Sparsely complete (<10%): experimental NMR/MS spectra, detailed kinetic data
Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.
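Since completeness varies by field, it can be worth measuring it on the subset of entries actually in use before relying on a field. The following sketch assumes entries have been loaded as dictionaries; field_completeness and is_populated are illustrative names, and the sample records are placeholders.

```python
def is_populated(value):
    """Treat None, empty strings, and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, (str, list, tuple, dict, set)):
        return len(value) > 0
    return True


def field_completeness(records, fields):
    """Return the fraction of records (0.0 to 1.0) with each field populated."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if is_populated(rec.get(field))) / total
        for field in fields
    }


# Placeholder records for illustration only
sample = [
    {"accession": "EXAMPLE_A", "smiles": "C", "pathways": ["p1"]},
    {"accession": "EXAMPLE_B", "smiles": "", "pathways": []},
    {"accession": "EXAMPLE_C", "smiles": "CC"},
    {"accession": "EXAMPLE_D", "smiles": "CCC", "pathways": ["p2"]},
]
print(field_completeness(sample, ["accession", "smiles", "pathways"]))
```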