Initial commit
skills/hmdb-database/references/hmdb_data_fields.md
# HMDB Data Fields Reference

This document provides detailed information about the data fields available in HMDB metabolite entries.

## Metabolite Entry Structure

Each HMDB metabolite entry contains 130+ data fields organized into several categories:

### Chemical Data Fields

**Identification:**
- `accession`: Primary HMDB ID (e.g., HMDB0000001)
- `secondary_accessions`: Previous HMDB IDs for merged entries
- `name`: Primary metabolite name
- `synonyms`: Alternative names and common names
- `chemical_formula`: Molecular formula (e.g., C6H12O6)
- `average_molecular_weight`: Average molecular weight in Daltons
- `monoisotopic_molecular_weight`: Monoisotopic molecular weight

**Structure Representations:**
- `smiles`: Simplified Molecular Input Line Entry System string
- `inchi`: International Chemical Identifier string
- `inchikey`: Hashed InChI for fast lookup
- `iupac_name`: IUPAC systematic name
- `traditional_iupac`: Traditional IUPAC name

**Chemical Properties:**
- `state`: Physical state (solid, liquid, gas)
- `charge`: Net molecular charge
- `logp`: Octanol-water partition coefficient (experimental/predicted)
- `pka_strongest_acidic`: Strongest acidic pKa value
- `pka_strongest_basic`: Strongest basic pKa value
- `polar_surface_area`: Topological polar surface area (TPSA)
- `refractivity`: Molar refractivity
- `polarizability`: Molecular polarizability
- `rotatable_bond_count`: Number of rotatable bonds
- `acceptor_count`: Hydrogen bond acceptor count
- `donor_count`: Hydrogen bond donor count

**Chemical Taxonomy:**
- `kingdom`: Chemical kingdom (e.g., Organic compounds)
- `super_class`: Chemical superclass
- `class`: Chemical class
- `sub_class`: Chemical subclass
- `direct_parent`: Direct chemical parent
- `alternative_parents`: Alternative parent classifications
- `substituents`: Chemical substituents present
- `description`: Text description of the compound

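Because `secondary_accessions` can carry older IDs, it often helps to normalize accessions before comparing them. The sketch below assumes the current format is `HMDB` plus seven zero-padded digits (as in `HMDB0000001`) and that legacy IDs used five digits; the function name `normalize_accession` is hypothetical and not part of any HMDB tooling.

```python
import re

# Assumption: accessions are "HMDB" + 7 digits today, with 5-digit legacy forms.
_ACCESSION = re.compile(r"^HMDB(\d{5}|\d{7})$", re.IGNORECASE)


def normalize_accession(raw):
    """Return the 7-digit form of an HMDB accession, or None if it does not match."""
    match = _ACCESSION.match(raw.strip())
    if match is None:
        return None
    # Left-pad the numeric part with zeros so 5-digit IDs line up with 7-digit ones.
    return "HMDB" + match.group(1).zfill(7)


legacy = normalize_accession("HMDB00001")
current = normalize_accession(" hmdb0000001 ")
invalid = normalize_accession("HMDB123")
```

Returning `None` for anything that does not match keeps malformed IDs from silently colliding with valid ones.
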
### Biological Data Fields

**Metabolite Origins:**
- `origin`: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
- `biofluid_locations`: Biological fluids where found (blood, urine, saliva, CSF, etc.); the XML example below uses the element name `biospecimen_locations` for this list
- `tissue_locations`: Tissues where found (liver, kidney, brain, muscle, etc.)
- `cellular_locations`: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)

**Biospecimen Information:**
- `biospecimen`: Type of biological specimen
- `status`: Detection status (detected, expected, predicted)
- `concentration`: Concentration ranges with units
- `concentration_references`: Citations for concentration data

**Normal and Abnormal Concentrations:**
For each biofluid (blood, urine, saliva, CSF, feces, sweat):
- Normal concentration value and range
- Units (μM, mg/L, etc.)
- Age and gender considerations
- Abnormal concentration indicators
- Clinical significance

### Pathway and Enzyme Information

**Metabolic Pathways:**
- `pathways`: List of associated metabolic pathways
  - Pathway name
  - SMPDB ID (Small Molecule Pathway Database ID)
  - KEGG pathway ID
  - Pathway category

**Enzymatic Reactions:**
- `protein_associations`: Enzymes and transporters
  - Protein name
  - Gene name
  - UniProt ID
  - GenBank ID
  - Protein type (enzyme, transporter, carrier, etc.)
  - Enzyme reactions
  - Enzyme kinetics (Km values)

**Biochemical Context:**
- `reactions`: Biochemical reactions involving the metabolite
- `reaction_enzymes`: Enzymes catalyzing reactions
- `cofactors`: Required cofactors
- `inhibitors`: Known enzyme inhibitors

### Disease and Biomarker Associations

**Disease Links:**
- `diseases`: Associated diseases and conditions
  - Disease name
  - OMIM ID (Online Mendelian Inheritance in Man)
  - Disease category
  - References and evidence

**Biomarker Information:**
- `biomarker_status`: Whether compound is a known biomarker
- `biomarker_applications`: Clinical applications
- `biomarker_for`: Diseases or conditions where used as biomarker

### Spectroscopic Data

**NMR Spectra:**
- `nmr_spectra`: Nuclear Magnetic Resonance data
  - Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
  - Spectrometer frequency (MHz)
  - Solvent used
  - Temperature
  - pH
  - Peak list with chemical shifts and multiplicities
  - FID (Free Induction Decay) files

**Mass Spectrometry:**
- `ms_spectra`: Mass spectrometry data
  - Spectrum type (MS, MS-MS, LC-MS, GC-MS)
  - Ionization mode (positive, negative, neutral)
  - Collision energy
  - Instrument type
  - Peak list (m/z, intensity, annotation)
  - Predicted vs. experimental flag

**Chromatography:**
- `chromatography`: Chromatographic properties
  - Retention time
  - Column type
  - Mobile phase
  - Method details

### External Database Links

**Database Cross-References:**
- `kegg_id`: KEGG Compound ID
- `pubchem_compound_id`: PubChem CID
- `pubchem_substance_id`: PubChem SID
- `chebi_id`: Chemical Entities of Biological Interest ID
- `chemspider_id`: ChemSpider ID
- `drugbank_id`: DrugBank accession (if applicable)
- `foodb_id`: FooDB ID (if food component)
- `knapsack_id`: KNApSAcK ID
- `metacyc_id`: MetaCyc ID
- `bigg_id`: BiGG Model ID
- `wikipedia_id`: Wikipedia page link
- `metlin_id`: METLIN ID
- `vmh_id`: Virtual Metabolic Human ID
- `fbonto_id`: FlyBase ontology ID

**Protein Database Links:**
- `uniprot_id`: UniProt accession for associated proteins
- `genbank_id`: GenBank ID for associated genes
- `pdb_id`: Protein Data Bank ID for protein structures

### Literature and Evidence

**References:**
- `general_references`: General references about the metabolite
  - PubMed ID
  - Reference text
  - Citation
- `synthesis_reference`: Synthesis methods and references
- `protein_references`: References for protein associations
- `pathway_references`: References for pathway involvement

### Ontology and Classification

**Ontology Terms:**
- `ontology_terms`: Related ontology classifications
  - Term name
  - Ontology source (ChEBI, MeSH, etc.)
  - Term ID
  - Definition

### Data Quality and Provenance

**Metadata:**
- `creation_date`: Date entry was created
- `update_date`: Date entry was last updated
- `version`: HMDB version number
- `status`: Entry status (detected, expected, predicted)
- `evidence`: Evidence level for detection/presence

## XML Structure Example

When downloading HMDB data in XML format, the structure follows this pattern:

```xml
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <average_molecular_weight>169.1811</average_molecular_weight>
  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
  <smiles>CN1C=NC(CC(N)C(O)=O)=C1</smiles>
  <inchi>InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)</inchi>
  <inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>

  <biospecimen_locations>
    <biospecimen>Blood</biospecimen>
    <biospecimen>Urine</biospecimen>
  </biospecimen_locations>

  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
      <kegg_map_id>map00340</kegg_map_id>
    </pathway>
  </pathways>

  <diseases>
    <disease>
      <name>Carnosinemia</name>
      <omim_id>212200</omim_id>
    </disease>
  </diseases>

  <normal_concentrations>
    <concentration>
      <biospecimen>Blood</biospecimen>
      <concentration_value>3.8</concentration_value>
      <concentration_units>uM</concentration_units>
    </concentration>
  </normal_concentrations>
</metabolite>
```

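A full HMDB XML download can be large, so streaming one `<metabolite>` at a time is usually safer than loading the whole tree. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse`. The `<hmdb>` wrapper element and the `http://www.hmdb.ca` namespace in the sample are assumptions about the download layout, and the helpers `local`, `children`, `child_text`, and `iter_metabolites` are hypothetical names; tag matching ignores any namespace so the code works with or without one.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of `elem` whose local tag equals `name`."""
    return [child for child in elem if local(child.tag) == name]


def child_text(elem, name):
    """Stripped text of the first matching direct child, or None."""
    for child in children(elem, name):
        return (child.text or "").strip() or None
    return None


def iter_metabolites(source):
    """Yield one small dict per <metabolite>, clearing each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "metabolite":
            continue
        yield {
            "accession": child_text(elem, "accession"),
            "name": child_text(elem, "name"),
            "biospecimens": [
                b.text
                for loc in children(elem, "biospecimen_locations")
                for b in children(loc, "biospecimen")
            ],
        }
        elem.clear()  # release the parsed subtree to keep memory use flat


# Assumed wrapper and namespace; the inner fields come from the example above.
SAMPLE = b"""<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <name>1-Methylhistidine</name>
    <biospecimen_locations>
      <biospecimen>Blood</biospecimen>
      <biospecimen>Urine</biospecimen>
    </biospecimen_locations>
  </metabolite>
</hmdb>"""

records = list(iter_metabolites(io.BytesIO(SAMPLE)))
```

For a real file, pass an open binary file handle or a path instead of the `io.BytesIO` wrapper.
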
## Querying Specific Fields

When working with HMDB data programmatically:

**For metabolite identification:**
- Query by `accession`, `name`, `synonyms`, `inchi`, `smiles`

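As a minimal sketch of name-based identification, the code below builds a case-insensitive index over `name` and `synonyms`. It assumes each entry has already been parsed into a plain dict; that record shape and the function `build_name_index` are illustrative choices, not an HMDB API, and the synonym shown is sample data.

```python
from collections import defaultdict


def build_name_index(records):
    """Map lower-cased names and synonyms to the accessions that use them."""
    index = defaultdict(set)
    for rec in records:
        for label in [rec["name"], *rec.get("synonyms", [])]:
            index[label.strip().lower()].add(rec["accession"])
    return index


# Illustrative record; the dict layout is an assumption of this sketch.
records = [
    {
        "accession": "HMDB0000001",
        "name": "1-Methylhistidine",
        "synonyms": ["1-Methyl-L-histidine"],
    },
]
index = build_name_index(records)
hits = index.get("1-methyl-l-histidine", set())
```

Each key maps to a set because one synonym can legitimately belong to more than one entry.
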
**For chemical similarity:**
- Use `smiles`, `inchi`, `inchikey`, `average_molecular_weight`, `monoisotopic_molecular_weight`, `chemical_formula`

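One cheap structural comparison uses the first block of the `inchikey`, which hashes the connectivity of the molecule, so entries that differ only in stereochemistry or isotopes share it. The sketch below groups records on that block. It is an exact-skeleton match, not a similarity score, and the record layout, placeholder accessions, and function names are assumptions.

```python
from collections import defaultdict


def skeleton(inchikey):
    """First InChIKey block (14 characters), derived from the connectivity."""
    return inchikey.split("-")[0]


def group_by_skeleton(records):
    """Group accessions whose InChIKeys share the same first block."""
    groups = defaultdict(list)
    for rec in records:
        groups[skeleton(rec["inchikey"])].append(rec["accession"])
    return dict(groups)


# Placeholder accessions; the last InChIKey is a dummy value for illustration.
records = [
    {"accession": "EXAMPLE_A", "inchikey": "BRMWTNUJHUMWMS-UHFFFAOYSA-N"},
    {"accession": "EXAMPLE_B", "inchikey": "BRMWTNUJHUMWMS-LURJTMIESA-N"},
    {"accession": "EXAMPLE_C", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
]
groups = group_by_skeleton(records)
```

Graded similarity, such as fingerprint-based scores computed from `smiles`, would need a cheminformatics toolkit and is outside this sketch.
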
**For biomarker discovery:**
- Filter by `diseases`, `biomarker_status`, `normal_concentrations`, `abnormal_concentrations`

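A disease filter can be sketched as a simple scan over parsed entries. The code assumes each record is a dict whose `diseases` value is a list of dicts with a `name` key, mirroring the XML example; that shape, the function `metabolites_for_disease`, and the placeholder second record are assumptions.

```python
def metabolites_for_disease(records, disease_name):
    """Accessions of records whose `diseases` list names the given disease."""
    wanted = disease_name.strip().lower()
    return [
        rec["accession"]
        for rec in records
        if any(d["name"].lower() == wanted for d in rec.get("diseases", []))
    ]


records = [
    {
        "accession": "HMDB0000001",
        "diseases": [{"name": "Carnosinemia", "omim_id": "212200"}],
    },
    # Placeholder entry with no disease associations.
    {"accession": "EXAMPLE_NO_DISEASE", "diseases": []},
]
matches = metabolites_for_disease(records, "carnosinemia")
```

Matching on `omim_id` instead of the name would be less sensitive to spelling variants, where that ID is populated.
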
**For pathway analysis:**
- Extract `pathways`, `protein_associations`, `reactions`

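Pathway analysis usually starts by inverting the per-metabolite `pathways` lists into a pathway-to-members mapping. The sketch below keys on the SMPDB ID; the dict-based record shape, the function `metabolites_by_pathway`, and the second accession are hypothetical.

```python
from collections import defaultdict


def metabolites_by_pathway(records):
    """Invert metabolite -> pathways into SMPDB ID -> sorted accessions."""
    members = defaultdict(set)
    for rec in records:
        for pathway in rec.get("pathways", []):
            members[pathway["smpdb_id"]].add(rec["accession"])
    return {smpdb_id: sorted(accs) for smpdb_id, accs in members.items()}


records = [
    {
        "accession": "HMDB0000001",
        "pathways": [{"name": "Histidine Metabolism", "smpdb_id": "SMP0000044"}],
    },
    # Placeholder accession used only to show two members in one pathway.
    {
        "accession": "HMDB9999999",
        "pathways": [{"name": "Histidine Metabolism", "smpdb_id": "SMP0000044"}],
    },
]
by_pathway = metabolites_by_pathway(records)
```
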
**For spectral matching:**
- Compare against `nmr_spectra`, `ms_spectra` peak lists

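The simplest form of peak-list comparison counts how many query peaks have a reference peak within a fixed m/z tolerance. The sketch below does only that: it ignores intensities, the default tolerance of 0.01 is an arbitrary assumption, and the peak values are made up for illustration rather than taken from any HMDB spectrum.

```python
def count_matched_peaks(query_mz, reference_mz, tol=0.01):
    """Count query peaks that have a reference peak within `tol` m/z units."""
    return sum(
        1
        for mz in query_mz
        if any(abs(mz - ref) <= tol for ref in reference_mz)
    )


def match_fraction(query_mz, reference_mz, tol=0.01):
    """Fraction of query peaks matched; 0.0 for an empty query."""
    if not query_mz:
        return 0.0
    return count_matched_peaks(query_mz, reference_mz, tol) / len(query_mz)


# Made-up m/z values: two of the three query peaks fall within tolerance.
query = [100.0, 150.0, 200.0]
reference = [100.005, 149.5, 200.002]
matched = count_matched_peaks(query, reference)
fraction = match_fraction(query, reference)
```

Intensity-weighted scores such as cosine similarity are more common in practice; this version is only meant to show the tolerance-matching step.
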
**For cross-database integration:**
- Map using external IDs: `kegg_id`, `pubchem_compound_id`, `chebi_id`, etc.

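Cross-database integration typically reduces to a lookup table from one external ID field to HMDB accessions. In the sketch below, the record shape, the function `build_crossref`, and every ID value are placeholders, and entries with a missing or empty ID are skipped.

```python
def build_crossref(records, id_field):
    """Map one external ID field (e.g. `kegg_id`) to HMDB accessions."""
    mapping = {}
    for rec in records:
        external_id = rec.get(id_field)
        if external_id:  # skip None and empty strings
            mapping.setdefault(external_id, []).append(rec["accession"])
    return mapping


# All identifiers below are placeholders, not real database IDs.
records = [
    {"accession": "EXAMPLE_A", "kegg_id": "KEGG_PLACEHOLDER_1"},
    {"accession": "EXAMPLE_B", "kegg_id": ""},
    {"accession": "EXAMPLE_C", "kegg_id": "KEGG_PLACEHOLDER_1"},
]
kegg_to_hmdb = build_crossref(records, "kegg_id")
```

Values are lists because an external ID can map to more than one HMDB entry.
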
## Field Completeness

Not all fields are populated for every metabolite:

- **Highly complete fields** (>90% of entries): `accession`, `name`, `chemical_formula`, `average_molecular_weight`, `smiles`, `inchi`
- **Moderately complete** (50-90%): `biospecimen_locations`, `tissue_locations`, `pathways`
- **Variably complete** (10-50%): concentration data, disease associations, protein associations
- **Sparsely complete** (<10%): experimental NMR/MS spectra, detailed kinetic data

Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.

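Completeness is easy to measure on whatever subset of entries has been parsed, which is a useful check before relying on a field. The sketch below assumes dict-based records and treats `None`, empty strings, and empty containers as missing; the function `field_completeness` and the sample records are illustrative.

```python
def field_completeness(records, fields):
    """Fraction of records with a non-empty value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    empty_values = (None, "", [], {})
    return {
        field: sum(1 for rec in records if rec.get(field) not in empty_values) / total
        for field in fields
    }


# Illustrative records: one has pathway data, the other does not.
records = [
    {"accession": "EXAMPLE_A", "pathways": []},
    {"accession": "EXAMPLE_B", "pathways": [{"name": "Histidine Metabolism"}]},
]
completeness = field_completeness(records, ["accession", "pathways", "diseases"])
```
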