# HMDB Data Fields Reference

This document provides detailed information about the data fields available in HMDB metabolite entries.

## Metabolite Entry Structure

Each HMDB metabolite entry contains 130+ data fields organized into several categories:

### Chemical Data Fields

**Identification:**
- `accession`: Primary HMDB ID (e.g., HMDB0000001)
- `secondary_accessions`: Previous HMDB IDs for merged entries
- `name`: Primary metabolite name
- `synonyms`: Alternative names and common names
- `chemical_formula`: Molecular formula (e.g., C6H12O6)
- `average_molecular_weight`: Average molecular weight in Daltons
- `monoisotopic_molecular_weight`: Monoisotopic molecular weight

**Structure Representations:**
- `smiles`: Simplified Molecular Input Line Entry System string
- `inchi`: International Chemical Identifier string
- `inchikey`: Hashed InChI for fast lookup
- `iupac_name`: IUPAC systematic name
- `traditional_iupac`: Traditional IUPAC name

**Chemical Properties:**
- `state`: Physical state (solid, liquid, gas)
- `charge`: Net molecular charge
- `logp`: Octanol-water partition coefficient (experimental/predicted)
- `pka_strongest_acidic`: Strongest acidic pKa value
- `pka_strongest_basic`: Strongest basic pKa value
- `polar_surface_area`: Topological polar surface area (TPSA)
- `refractivity`: Molar refractivity
- `polarizability`: Molecular polarizability
- `rotatable_bond_count`: Number of rotatable bonds
- `acceptor_count`: Hydrogen bond acceptor count
- `donor_count`: Hydrogen bond donor count

**Chemical Taxonomy:**
- `kingdom`: Chemical kingdom (e.g., Organic compounds)
- `super_class`: Chemical superclass
- `class`: Chemical class
- `sub_class`: Chemical subclass
- `direct_parent`: Direct chemical parent
- `alternative_parents`: Alternative parent classifications
- `substituents`: Chemical substituents present
- `description`: Text description of the compound

### Biological Data Fields

**Metabolite Origins:**
- `origin`: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
- `biofluid_locations`: Biological fluids where found (blood, urine, saliva, CSF, etc.)
- `tissue_locations`: Tissues where found (liver, kidney, brain, muscle, etc.)
- `cellular_locations`: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)

**Biospecimen Information:**
- `biospecimen`: Type of biological specimen
- `status`: Detection status (detected, expected, predicted)
- `concentration`: Concentration ranges with units
- `concentration_references`: Citations for concentration data

**Normal and Abnormal Concentrations:**

For each biofluid (blood, urine, saliva, CSF, feces, sweat):
- Normal concentration value and range
- Units (μM, mg/L, etc.)
- Age and gender considerations
- Abnormal concentration indicators
- Clinical significance

### Pathway and Enzyme Information

**Metabolic Pathways:**
- `pathways`: List of associated metabolic pathways
  - Pathway name
  - SMPDB ID (Small Molecule Pathway Database ID)
  - KEGG pathway ID
  - Pathway category

**Enzymatic Reactions:**
- `protein_associations`: Enzymes and transporters
  - Protein name
  - Gene name
  - UniProt ID
  - GenBank ID
  - Protein type (enzyme, transporter, carrier, etc.)
  - Enzyme reactions
  - Enzyme kinetics (Km values)

**Biochemical Context:**
- `reactions`: Biochemical reactions involving the metabolite
- `reaction_enzymes`: Enzymes catalyzing reactions
- `cofactors`: Required cofactors
- `inhibitors`: Known enzyme inhibitors

### Disease and Biomarker Associations

**Disease Links:**
- `diseases`: Associated diseases and conditions
  - Disease name
  - OMIM ID (Online Mendelian Inheritance in Man)
  - Disease category
  - References and evidence

**Biomarker Information:**
- `biomarker_status`: Whether compound is a known biomarker
- `biomarker_applications`: Clinical applications
- `biomarker_for`: Diseases or conditions where used as biomarker

### Spectroscopic Data

**NMR Spectra:**
- `nmr_spectra`: Nuclear Magnetic Resonance data
  - Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
  - Spectrometer frequency (MHz)
  - Solvent used
  - Temperature
  - pH
  - Peak list with chemical shifts and multiplicities
  - FID (Free Induction Decay) files

**Mass Spectrometry:**
- `ms_spectra`: Mass spectrometry data
  - Spectrum type (MS, MS-MS, LC-MS, GC-MS)
  - Ionization mode (positive, negative, neutral)
  - Collision energy
  - Instrument type
  - Peak list (m/z, intensity, annotation)
  - Predicted vs. experimental flag

**Chromatography:**
- `chromatography`: Chromatographic properties
  - Retention time
  - Column type
  - Mobile phase
  - Method details

### External Database Links

**Database Cross-References:**
- `kegg_id`: KEGG Compound ID
- `pubchem_compound_id`: PubChem CID
- `pubchem_substance_id`: PubChem SID
- `chebi_id`: Chemical Entities of Biological Interest ID
- `chemspider_id`: ChemSpider ID
- `drugbank_id`: DrugBank accession (if applicable)
- `foodb_id`: FooDB ID (if food component)
- `knapsack_id`: KNApSAcK ID
- `metacyc_id`: MetaCyc ID
- `bigg_id`: BiGG Model ID
- `wikipedia_id`: Wikipedia page link
- `metlin_id`: METLIN ID
- `vmh_id`: Virtual Metabolic Human ID
- `fbonto_id`: FlyBase ontology ID

**Protein Database Links:**
- `uniprot_id`: UniProt accession for associated proteins
- `genbank_id`: GenBank ID for associated genes
- `pdb_id`: Protein Data Bank ID for protein structures

### Literature and Evidence

**References:**
- `general_references`: General references about the metabolite
  - PubMed ID
  - Reference text
  - Citation
- `synthesis_reference`: Synthesis methods and references
- `protein_references`: References for protein associations
- `pathway_references`: References for pathway involvement

### Ontology and Classification

**Ontology Terms:**
- `ontology_terms`: Related ontology classifications
  - Term name
  - Ontology source (ChEBI, MeSH, etc.)
  - Term ID
  - Definition

### Data Quality and Provenance

**Metadata:**
- `creation_date`: Date entry was created
- `update_date`: Date entry was last updated
- `version`: HMDB version number
- `status`: Entry status (detected, expected, predicted)
- `evidence`: Evidence level for detection/presence

## XML Structure Example

When downloading HMDB data in XML format, the structure follows this pattern:

```xml
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <average_molecular_weight>169.1811</average_molecular_weight>
  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
  <smiles>CN1C=NC(CC(=O)O)=C1</smiles>
  <inchi>InChI=1S/C7H11N3O2/c1-10-4-8-3-5(10)2-7(11)12/h3-4H,2H2,1H3,(H,11,12)</inchi>
  <inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>

  <biofluid_locations>
    <biofluid>Blood</biofluid>
    <biofluid>Urine</biofluid>
  </biofluid_locations>

  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
      <kegg_map_id>map00340</kegg_map_id>
    </pathway>
  </pathways>

  <diseases>
    <disease>
      <name>Carnosinemia</name>
      <omim_id>212200</omim_id>
    </disease>
  </diseases>

  <normal_concentrations>
    <concentration>
      <biofluid>Blood</biofluid>
      <concentration_value>3.8</concentration_value>
      <concentration_units>uM</concentration_units>
    </concentration>
  </normal_concentrations>
</metabolite>
```

## Querying Specific Fields

When working with HMDB data programmatically:

**For metabolite identification:**
- Query by `accession`, `name`, `synonyms`, `inchi`, `smiles`

**For chemical similarity:**
- Use `smiles`, `inchi`, `inchikey`, `average_molecular_weight`, `chemical_formula`

**For biomarker discovery:**
- Filter by `diseases`, `biomarker_status`, `normal_concentrations`, `abnormal_concentrations`

**For pathway analysis:**
- Extract `pathways`, `protein_associations`, `reactions`

**For spectral matching:**
- Compare against `nmr_spectra`, `ms_spectra` peak lists

**For cross-database integration:**
- Map using external IDs: `kegg_id`, `pubchem_compound_id`, `chebi_id`, etc.

## Field Completeness

Not all fields are populated for every metabolite:

- **Highly complete fields** (>90% of entries): accession, name, chemical_formula, average_molecular_weight, smiles, inchi
- **Moderately complete** (50-90%): biofluid_locations, tissue_locations, pathways
- **Variably complete** (10-50%): concentration data, disease associations, protein associations
- **Sparsely complete** (<10%): experimental NMR/MS spectra, detailed kinetic data

Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.
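
Because many fields are optional, code that reads an entry should treat a missing or empty element as "not populated" rather than as an error. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes that element names match the field names listed in this document; the `SAMPLE` string, the empty `kegg_id` value, and the helpers `local_name`, `extract_fields`, and `pathway_names` are hypothetical and exist only for illustration. Check the tag names against the file you actually download.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed entry. Tag names are ASSUMED to mirror the field
# names in this reference; the empty kegg_id is invented to show a gap.
SAMPLE = """
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <average_molecular_weight>169.1811</average_molecular_weight>
  <kegg_id></kegg_id>
  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
    </pathway>
  </pathways>
</metabolite>
"""

# Scalar fields to pull from the top level of an entry.
SCALAR_FIELDS = (
    "accession",
    "name",
    "chemical_formula",
    "average_molecular_weight",
    "kegg_id",
    "pubchem_compound_id",
)


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, so lookups work either way."""
    return tag.rsplit("}", 1)[-1]


def extract_fields(metabolite, fields=SCALAR_FIELDS):
    """Return {field: text or None} for direct children of one entry.

    Absent elements and empty elements both map to None.
    """
    found = {local_name(c.tag): (c.text or "").strip() for c in metabolite}
    return {f: (found.get(f) or None) for f in fields}


def pathway_names(metabolite):
    """Collect the <name> of every <pathway> element in the entry."""
    names = []
    for elem in metabolite.iter():
        if local_name(elem.tag) != "pathway":
            continue
        for child in elem:
            if local_name(child.tag) == "name" and child.text:
                names.append(child.text.strip())
    return names


entry = ET.fromstring(SAMPLE.strip())
record = extract_fields(entry)
populated = [f for f, v in record.items() if v is not None]

print(record["accession"], record["name"])
print("populated:", populated)
print("pathways:", pathway_names(entry))
```

Returning `None` for both absent and empty elements keeps downstream filters simple: a cross-database join on `kegg_id` or `pubchem_compound_id` can skip entries where the value is `None` without distinguishing why it is missing.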