HMDB Data Fields Reference

This document provides detailed information about the data fields available in HMDB metabolite entries.

Metabolite Entry Structure

Each HMDB metabolite entry contains 130+ data fields organized into several categories:

Chemical Data Fields

Identification:

  • accession: Primary HMDB ID (e.g., HMDB0000001)
  • secondary_accessions: Previous HMDB IDs for merged entries
  • name: Primary metabolite name
  • synonyms: Alternative names and common names
  • chemical_formula: Molecular formula (e.g., C6H12O6)
  • average_molecular_weight: Average molecular weight in Daltons
  • monoisotopic_molecular_weight: Monoisotopic molecular weight

Structure Representations:

  • smiles: Simplified Molecular Input Line Entry System string
  • inchi: International Chemical Identifier string
  • inchikey: Hashed InChI for fast lookup
  • iupac_name: IUPAC systematic name
  • traditional_iupac: Traditional IUPAC name

Chemical Properties:

  • state: Physical state (solid, liquid, gas)
  • charge: Net molecular charge
  • logp: Octanol-water partition coefficient (experimental/predicted)
  • pka_strongest_acidic: Strongest acidic pKa value
  • pka_strongest_basic: Strongest basic pKa value
  • polar_surface_area: Topological polar surface area (TPSA)
  • refractivity: Molar refractivity
  • polarizability: Molecular polarizability
  • rotatable_bond_count: Number of rotatable bonds
  • acceptor_count: Hydrogen bond acceptor count
  • donor_count: Hydrogen bond donor count

Chemical Taxonomy:

  • kingdom: Chemical kingdom (e.g., Organic compounds)
  • super_class: Chemical superclass
  • class: Chemical class
  • sub_class: Chemical subclass
  • direct_parent: Direct chemical parent
  • alternative_parents: Alternative parent classifications
  • substituents: Chemical substituents present
  • description: Text description of the compound

Biological Data Fields

Metabolite Origins:

  • origin: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
  • biospecimen_locations: Biospecimens where found (blood, urine, saliva, CSF, etc.)
  • tissue_locations: Tissues where found (liver, kidney, brain, muscle, etc.)
  • cellular_locations: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)

Biospecimen Information:

  • biospecimen: Type of biological specimen
  • status: Detection status (detected, expected, predicted)
  • concentration: Concentration ranges with units
  • concentration_references: Citations for concentration data

Normal and Abnormal Concentrations:

For each biofluid (blood, urine, saliva, CSF, feces, sweat):

  • Normal concentration value and range
  • Units (μM, mg/L, etc.)
  • Age and gender considerations
  • Abnormal concentration indicators
  • Clinical significance

Pathway and Enzyme Information

Metabolic Pathways:

  • pathways: List of associated metabolic pathways
    • Pathway name
    • SMPDB ID (Small Molecule Pathway Database ID)
    • KEGG pathway ID
    • Pathway category

Enzymatic Reactions:

  • protein_associations: Enzymes and transporters
    • Protein name
    • Gene name
    • UniProt ID
    • GenBank ID
    • Protein type (enzyme, transporter, carrier, etc.)
    • Enzyme reactions
    • Enzyme kinetics (Km values)

Biochemical Context:

  • reactions: Biochemical reactions involving the metabolite
  • reaction_enzymes: Enzymes catalyzing reactions
  • cofactors: Required cofactors
  • inhibitors: Known enzyme inhibitors

Disease and Biomarker Associations

Disease Links:

  • diseases: Associated diseases and conditions
    • Disease name
    • OMIM ID (Online Mendelian Inheritance in Man)
    • Disease category
    • References and evidence

Biomarker Information:

  • biomarker_status: Whether compound is a known biomarker
  • biomarker_applications: Clinical applications
  • biomarker_for: Diseases or conditions where used as biomarker

Spectroscopic Data

NMR Spectra:

  • nmr_spectra: Nuclear Magnetic Resonance data
    • Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
    • Spectrometer frequency (MHz)
    • Solvent used
    • Temperature
    • pH
    • Peak list with chemical shifts and multiplicities
    • FID (Free Induction Decay) files

Mass Spectrometry:

  • ms_spectra: Mass spectrometry data
    • Spectrum type (MS, MS-MS, LC-MS, GC-MS)
    • Ionization mode (positive, negative, neutral)
    • Collision energy
    • Instrument type
    • Peak list (m/z, intensity, annotation)
    • Predicted vs. experimental flag

Chromatography:

  • chromatography: Chromatographic properties
    • Retention time
    • Column type
    • Mobile phase
    • Method details

Database Cross-References:

  • kegg_id: KEGG Compound ID
  • pubchem_compound_id: PubChem CID
  • pubchem_substance_id: PubChem SID
  • chebi_id: Chemical Entities of Biological Interest ID
  • chemspider_id: ChemSpider ID
  • drugbank_id: DrugBank accession (if applicable)
  • foodb_id: FooDB ID (if food component)
  • knapsack_id: KNApSAcK ID
  • metacyc_id: MetaCyc ID
  • bigg_id: BiGG Model ID
  • wikipedia_id: Wikipedia page link
  • metlin_id: METLIN ID
  • vmh_id: Virtual Metabolic Human ID
  • fbonto_id: FlyBase ontology ID

Protein Database Links:

  • uniprot_id: UniProt accession for associated proteins
  • genbank_id: GenBank ID for associated genes
  • pdb_id: Protein Data Bank ID for protein structures

Literature and Evidence

References:

  • general_references: General references about the metabolite
    • PubMed ID
    • Reference text
    • Citation
  • synthesis_reference: Synthesis methods and references
  • protein_references: References for protein associations
  • pathway_references: References for pathway involvement

Ontology and Classification

Ontology Terms:

  • ontology_terms: Related ontology classifications
    • Term name
    • Ontology source (ChEBI, MeSH, etc.)
    • Term ID
    • Definition

Data Quality and Provenance

Metadata:

  • creation_date: Date entry was created
  • update_date: Date entry was last updated
  • version: HMDB version number
  • status: Entry status (detected, expected, predicted)
  • evidence: Evidence level for detection/presence

XML Structure Example

HMDB data downloaded in XML format follows this pattern (abridged; values are illustrative):

<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <average_molecular_weight>169.1811</average_molecular_weight>
  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
  <smiles>CN1C=NC(CC(=O)O)=C1</smiles>
  <inchi>InChI=1S/C7H11N3O2/c1-10-4-8-3-5(10)2-7(11)12/h3-4H,2H2,1H3,(H,11,12)</inchi>
  <inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>

  <biospecimen_locations>
    <biospecimen>Blood</biospecimen>
    <biospecimen>Urine</biospecimen>
  </biospecimen_locations>

  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
      <kegg_map_id>map00340</kegg_map_id>
    </pathway>
  </pathways>

  <diseases>
    <disease>
      <name>Carnosinemia</name>
      <omim_id>212200</omim_id>
    </disease>
  </diseases>

  <normal_concentrations>
    <concentration>
      <biospecimen>Blood</biospecimen>
      <concentration_value>3.8</concentration_value>
      <concentration_units>uM</concentration_units>
    </concentration>
  </normal_concentrations>
</metabolite>
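
The structure above can be read with Python's standard library. The following is a minimal sketch, assuming an abridged entry shaped like the example; `parse_metabolite` is a hypothetical helper, not part of any HMDB tooling, and it only extracts a handful of the fields.

```python
import xml.etree.ElementTree as ET

# Abridged entry mirroring the example above (not a complete HMDB record).
SAMPLE_XML = """
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
  <biospecimen_locations>
    <biospecimen>Blood</biospecimen>
    <biospecimen>Urine</biospecimen>
  </biospecimen_locations>
  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
      <kegg_map_id>map00340</kegg_map_id>
    </pathway>
  </pathways>
</metabolite>
"""


def parse_metabolite(elem):
    """Flatten one <metabolite> element into a plain dict (hypothetical helper)."""
    mono = elem.findtext("monoisotopic_molecular_weight")
    return {
        "accession": elem.findtext("accession"),
        "name": elem.findtext("name"),
        "chemical_formula": elem.findtext("chemical_formula"),
        # Optional fields may be absent or empty, so guard the conversion.
        "monoisotopic_molecular_weight": float(mono) if mono else None,
        "biospecimens": [
            b.text for b in elem.findall("biospecimen_locations/biospecimen")
        ],
        "pathways": [
            {
                "name": p.findtext("name"),
                "smpdb_id": p.findtext("smpdb_id"),
                "kegg_map_id": p.findtext("kegg_map_id"),
            }
            for p in elem.findall("pathways/pathway")
        ],
    }


entry = parse_metabolite(ET.fromstring(SAMPLE_XML.strip()))
print(entry["accession"], entry["biospecimens"])
```

Full download files may declare a default XML namespace on the root element; if so, the tag names in `findtext` and `findall` need the matching namespace prefix. For very large files, `ET.iterparse` combined with `elem.clear()` avoids holding the whole tree in memory.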

Querying Specific Fields

When working with HMDB data programmatically:

For metabolite identification:

  • Query by accession, name, synonyms, inchi, smiles
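
These identification fields can be folded into a single lookup table. A minimal sketch, assuming entries have already been parsed into dicts keyed by the field names in this document; `build_lookup` and the two toy records are hypothetical and do not reflect real HMDB content.

```python
def build_lookup(entries):
    """Map every identifier-like string (lower-cased) to a primary accession."""
    index = {}
    for entry in entries:
        keys = [entry["accession"], entry["name"]]
        keys += entry.get("secondary_accessions", [])
        keys += entry.get("synonyms", [])
        for key in keys:
            # setdefault keeps the first owner if two entries share a synonym.
            index.setdefault(key.strip().lower(), entry["accession"])
    return index


# Toy records for illustration only (not real HMDB content).
entries = [
    {"accession": "HMDB9000001", "name": "Compound A",
     "secondary_accessions": ["HMDB9000101"], "synonyms": ["cmpd-A", "Alpha"]},
    {"accession": "HMDB9000002", "name": "Compound B", "synonyms": ["Alpha"]},
]

lookup = build_lookup(entries)
print(lookup.get("hmdb9000101"))  # resolves a secondary accession
print(lookup.get("alpha"))        # shared synonym resolves to the first entry
```

Exact string matching is the simplest policy; synonyms shared between entries are resolved here in favor of the first entry seen, which may not be the right choice for every dataset.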

For chemical similarity:

  • Use smiles, inchi, inchikey, average_molecular_weight, monoisotopic_molecular_weight, chemical_formula
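
Two cheap pre-filters can be built from these fields before any real structural comparison: a mass window on monoisotopic_molecular_weight and a comparison of the first InChIKey block, which hashes the connectivity layer. The sketch below is a coarse screen, not a structural similarity measure; the function names are hypothetical, the observed masses are made up, and the second InChIKey is a hypothetical stereo-specific variant used only to exercise the comparison.

```python
def within_ppm(observed_mass, reference_mass, tolerance_ppm=5.0):
    """True if the two masses differ by at most tolerance_ppm parts per million."""
    error_ppm = abs(observed_mass - reference_mass) / reference_mass * 1e6
    return error_ppm <= tolerance_ppm


def same_skeleton(inchikey_a, inchikey_b):
    """Compare only the first InChIKey block (connectivity), ignoring the rest."""
    return inchikey_a.split("-")[0] == inchikey_b.split("-")[0]


# Reference values copied from the XML example in this document.
REFERENCE_MASS = 169.085126436
REFERENCE_KEY = "BRMWTNUJHUMWMS-UHFFFAOYSA-N"

close_match = within_ppm(169.0853, REFERENCE_MASS)   # about 1 ppm away
far_match = within_ppm(169.0900, REFERENCE_MASS)     # about 29 ppm away
skeleton_match = same_skeleton(REFERENCE_KEY, "BRMWTNUJHUMWMS-LURJTMIESA-N")
print(close_match, far_match, skeleton_match)
```

The observed mass is assumed to be a neutral mass already; a measured ion m/z would first need an adduct correction.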

For biomarker discovery:

  • Filter by diseases, biomarker_status, normal_concentrations, abnormal_concentrations
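
A filter over these fields might look like the sketch below. It assumes parsed entries where diseases is a list of names, abnormal_concentrations is a list of dicts with a biospecimen key, and biomarker_status is a boolean; that record shape, the helper name, and the toy records are assumptions made for illustration.

```python
def candidate_biomarkers(entries, disease, biospecimen):
    """Return accessions linked to `disease` with abnormal data in `biospecimen`."""
    wanted = disease.lower()
    hits = []
    for entry in entries:
        linked = any(d.lower() == wanted for d in entry.get("diseases", []))
        measured = any(
            c.get("biospecimen") == biospecimen
            for c in entry.get("abnormal_concentrations", [])
        )
        if linked and measured and entry.get("biomarker_status"):
            hits.append(entry["accession"])
    return hits


# Toy records for illustration only (not real HMDB content).
entries = [
    {"accession": "HMDB9000001", "biomarker_status": True,
     "diseases": ["Condition X"],
     "abnormal_concentrations": [{"biospecimen": "Blood", "value": 12.0, "units": "uM"}]},
    {"accession": "HMDB9000002", "biomarker_status": True,
     "diseases": ["Condition X"], "abnormal_concentrations": []},
    {"accession": "HMDB9000003", "biomarker_status": False,
     "diseases": ["Condition Y"],
     "abnormal_concentrations": [{"biospecimen": "Urine", "value": 3.0, "units": "uM"}]},
]

print(candidate_biomarkers(entries, "condition x", "Blood"))
```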

For pathway analysis:

  • Extract pathways, protein_associations, reactions
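
For pathway-level work it is often convenient to invert the per-metabolite pathways list into a pathway-to-metabolites map. A minimal sketch, assuming each parsed entry carries a list of pathway dicts with an smpdb_id key; the helper name and the toy records are hypothetical.

```python
from collections import defaultdict


def metabolites_by_pathway(entries):
    """Group metabolite accessions under each SMPDB pathway ID."""
    grouped = defaultdict(set)
    for entry in entries:
        for pathway in entry.get("pathways", []):
            grouped[pathway["smpdb_id"]].add(entry["accession"])
    # Sorted lists give a stable, printable result.
    return {smpdb_id: sorted(members) for smpdb_id, members in grouped.items()}


# Toy records for illustration only; the pathway membership is made up.
entries = [
    {"accession": "HMDB9000001", "pathways": [{"smpdb_id": "SMP9000001"}]},
    {"accession": "HMDB9000002",
     "pathways": [{"smpdb_id": "SMP9000001"}, {"smpdb_id": "SMP9000002"}]},
    {"accession": "HMDB9000003", "pathways": []},
]

by_pathway = metabolites_by_pathway(entries)
print(by_pathway)
```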

For spectral matching:

  • Compare against nmr_spectra, ms_spectra peak lists
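
One common way to compare peak lists is a cosine score over binned m/z values. The sketch below is a simple fixed-width binning variant, assuming each spectrum is a list of (m/z, intensity) pairs; it is not HMDB's own search algorithm, the function names are hypothetical, and the peak values are made up.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=0.01):
    """Sum intensities into fixed-width m/z bins keyed by an integer bin index."""
    vector = defaultdict(float)
    for mz, intensity in peaks:
        vector[round(mz / bin_width)] += intensity
    return vector


def cosine_score(peaks_a, peaks_b, bin_width=0.01):
    """Cosine similarity of two binned spectra; 0.0 if either one is empty."""
    a = binned(peaks_a, bin_width)
    b = binned(peaks_b, bin_width)
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up peak lists for illustration only.
query = [(56.05, 20.0), (83.06, 45.0), (124.09, 100.0)]
scaled = [(mz, 2.0 * intensity) for mz, intensity in query]
unrelated = [(60.00, 10.0), (90.00, 5.0)]

print(cosine_score(query, scaled))
print(cosine_score(query, unrelated))
```

Fixed bins can split two nearly identical peaks that fall on either side of a bin edge; tolerance-based peak matching avoids that at the cost of more code.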

For cross-database integration:

  • Map using external IDs: kegg_id, pubchem_compound_id, chebi_id, etc.
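
Because cross-reference fields are often empty, a mapping step should skip missing values and allow one external ID to point at several HMDB entries. A minimal sketch under those assumptions; `build_xref_maps` is a hypothetical helper and the IDs in the toy records are placeholders, not real identifiers.

```python
XREF_FIELDS = ("kegg_id", "pubchem_compound_id", "chebi_id")


def build_xref_maps(entries, fields=XREF_FIELDS):
    """Build {field: {external_id: [accessions]}} from parsed entries."""
    maps = {field: {} for field in fields}
    for entry in entries:
        for field in fields:
            external_id = entry.get(field)
            if external_id:  # skip None and empty strings
                maps[field].setdefault(str(external_id), []).append(entry["accession"])
    return maps


# Toy records with placeholder IDs (not real cross-references).
entries = [
    {"accession": "HMDB9000001", "kegg_id": "C99901", "chebi_id": "900001"},
    {"accession": "HMDB9000002", "kegg_id": "C99901", "pubchem_compound_id": ""},
    {"accession": "HMDB9000003", "pubchem_compound_id": 9900003},
]

xrefs = build_xref_maps(entries)
print(xrefs["kegg_id"])
```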

Field Completeness

Not all fields are populated for every metabolite:

  • Highly complete fields (>90% of entries): accession, name, chemical_formula, average_molecular_weight, smiles, inchi
  • Moderately complete (50-90%): biospecimen_locations, tissue_locations, pathways
  • Variably complete (10-50%): concentration data, disease associations, protein associations
  • Sparsely complete (<10%): experimental NMR/MS spectra, detailed kinetic data

Where available, predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement the experimental data.
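
Completeness for a given download can be measured rather than assumed. A minimal sketch, assuming entries are parsed into dicts and that None, empty strings, and empty lists all count as unpopulated; `field_completeness` is a hypothetical helper and the toy records are not real HMDB content.

```python
def field_completeness(entries, fields):
    """Return the fraction of entries with a non-empty value for each field."""
    total = len(entries)
    result = {}
    for field in fields:
        filled = sum(
            1 for entry in entries if entry.get(field) not in (None, "", [], {})
        )
        result[field] = filled / total if total else 0.0
    return result


# Toy records for illustration only (not real HMDB content).
entries = [
    {"accession": "HMDB9000001", "smiles": "CCO", "pathways": ["SMP9000001"], "nmr_spectra": []},
    {"accession": "HMDB9000002", "smiles": "", "pathways": []},
    {"accession": "HMDB9000003", "smiles": "CC(=O)O", "pathways": ["SMP9000002"], "nmr_spectra": None},
    {"accession": "HMDB9000004", "smiles": "C", "pathways": []},
]

report = field_completeness(entries, ["accession", "smiles", "pathways", "nmr_spectra"])
print(report)
```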