Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/hmdb-database/references/hmdb_data_fields.md
+++ b/skills/hmdb-database/references/hmdb_data_fields.md
@@ -0,0 +1,267 @@
+# HMDB Data Fields Reference
+
+This document provides detailed information about the data fields available in HMDB metabolite entries.
+
+## Metabolite Entry Structure
+
+Each HMDB metabolite entry contains 130+ data fields organized into several categories:
+
+### Chemical Data Fields
+
+**Identification:**
+- `accession`: Primary HMDB ID (e.g., HMDB0000001)
+- `secondary_accessions`: Previous HMDB IDs for merged entries
+- `name`: Primary metabolite name
+- `synonyms`: Alternative names and common names
+- `chemical_formula`: Molecular formula (e.g., C6H12O6)
+- `average_molecular_weight`: Average molecular weight in Daltons
+- `monoisotopic_molecular_weight`: Monoisotopic molecular weight
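HMDB accessions moved from a five-digit to a seven-digit numeric part, so older IDs such as `HMDB00001` can still turn up in legacy datasets and in `secondary_accessions`. A minimal sketch for normalizing either form follows; the helper name `normalize_accession` is hypothetical and not part of any HMDB tooling, and the accepted pattern is an assumption about which ID shapes you expect to see.

```python
import re

# Assumption: an accession is "HMDB" followed by exactly 5 digits (legacy)
# or exactly 7 digits (current). Anything else is rejected.
_ACCESSION_RE = re.compile(r"^HMDB(\d{5}|\d{7})$", re.IGNORECASE)


def normalize_accession(raw: str) -> str:
    """Return the 7-digit form of an HMDB accession (hypothetical helper)."""
    match = _ACCESSION_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not an HMDB accession: {raw!r}")
    # Left-pad the numeric part with zeros so legacy IDs line up with new ones.
    return "HMDB" + match.group(1).zfill(7)
```

For example, `normalize_accession("HMDB00001")` returns `"HMDB0000001"`, which lets legacy IDs be joined against current entries.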
+
+**Structure Representations:**
+- `smiles`: Simplified Molecular Input Line Entry System string
+- `inchi`: International Chemical Identifier string
+- `inchikey`: Hashed InChI for fast lookup
+- `iupac_name`: IUPAC systematic name
+- `traditional_iupac`: Traditional IUPAC name
+
+**Chemical Properties:**
+- `state`: Physical state (solid, liquid, gas)
+- `charge`: Net molecular charge
+- `logp`: Octanol-water partition coefficient (experimental/predicted)
+- `pka_strongest_acidic`: Strongest acidic pKa value
+- `pka_strongest_basic`: Strongest basic pKa value
+- `polar_surface_area`: Topological polar surface area (TPSA)
+- `refractivity`: Molar refractivity
+- `polarizability`: Molecular polarizability
+- `rotatable_bond_count`: Number of rotatable bonds
+- `acceptor_count`: Hydrogen bond acceptor count
+- `donor_count`: Hydrogen bond donor count
+
+**Chemical Taxonomy:**
+- `kingdom`: Chemical kingdom (e.g., Organic compounds)
+- `super_class`: Chemical superclass
+- `class`: Chemical class
+- `sub_class`: Chemical subclass
+- `direct_parent`: Direct chemical parent
+- `alternative_parents`: Alternative parent classifications
+- `substituents`: Chemical substituents present
+- `description`: Text description of the compound
+
+### Biological Data Fields
+
+**Metabolite Origins:**
+- `origin`: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
+- `biospecimen_locations`: Biospecimens where found (blood, urine, saliva, CSF, etc.); older releases name this field `biofluid_locations`
+- `tissue_locations`: Tissues where found (liver, kidney, brain, muscle, etc.)
+- `cellular_locations`: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)
+
+**Biospecimen Information:**
+- `biospecimen`: Type of biological specimen
+- `status`: Detection status (detected, expected, predicted)
+- `concentration`: Concentration ranges with units
+- `concentration_references`: Citations for concentration data
+
+**Normal and Abnormal Concentrations:**
+For each biofluid (blood, urine, saliva, CSF, feces, sweat):
+- Normal concentration value and range
+- Units (μM, mg/L, etc.)
+- Age and gender considerations
+- Abnormal concentration indicators
+- Clinical significance
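Because concentrations may be reported in either molar or mass units, comparing values across sources usually means converting with the entry's `average_molecular_weight`. The relation is mg/L = μM × (g/mol) / 1000. A small sketch, with hypothetical function names:

```python
def micromolar_to_mg_per_l(conc_um: float, mol_weight: float) -> float:
    """Convert a concentration from uM to mg/L.

    uM * (g/mol) yields ug/L; dividing by 1000 yields mg/L.
    """
    return conc_um * mol_weight / 1000.0


def mg_per_l_to_micromolar(conc_mg_l: float, mol_weight: float) -> float:
    """Inverse of micromolar_to_mg_per_l."""
    return conc_mg_l * 1000.0 / mol_weight
```

With a molecular weight of 169.1811 g/mol, a concentration of 3.8 μM corresponds to roughly 0.643 mg/L.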
+
+### Pathway and Enzyme Information
+
+**Metabolic Pathways:**
+- `pathways`: List of associated metabolic pathways
+  - Pathway name
+  - SMPDB ID (Small Molecule Pathway Database ID)
+  - KEGG pathway ID
+  - Pathway category
+
+**Enzymatic Reactions:**
+- `protein_associations`: Enzymes and transporters
+  - Protein name
+  - Gene name
+  - UniProt ID
+  - GenBank ID
+  - Protein type (enzyme, transporter, carrier, etc.)
+  - Enzyme reactions
+  - Enzyme kinetics (Km values)
+
+**Biochemical Context:**
+- `reactions`: Biochemical reactions involving the metabolite
+- `reaction_enzymes`: Enzymes catalyzing reactions
+- `cofactors`: Required cofactors
+- `inhibitors`: Known enzyme inhibitors
+
+### Disease and Biomarker Associations
+
+**Disease Links:**
+- `diseases`: Associated diseases and conditions
+  - Disease name
+  - OMIM ID (Online Mendelian Inheritance in Man)
+  - Disease category
+  - References and evidence
+
+**Biomarker Information:**
+- `biomarker_status`: Whether compound is a known biomarker
+- `biomarker_applications`: Clinical applications
+- `biomarker_for`: Diseases or conditions where used as biomarker
+
+### Spectroscopic Data
+
+**NMR Spectra:**
+- `nmr_spectra`: Nuclear Magnetic Resonance data
+  - Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
+  - Spectrometer frequency (MHz)
+  - Solvent used
+  - Temperature
+  - pH
+  - Peak list with chemical shifts and multiplicities
+  - FID (Free Induction Decay) files
+
+**Mass Spectrometry:**
+- `ms_spectra`: Mass spectrometry data
+  - Spectrum type (MS, MS-MS, LC-MS, GC-MS)
+  - Ionization mode (positive, negative, neutral)
+  - Collision energy
+  - Instrument type
+  - Peak list (m/z, intensity, annotation)
+  - Predicted vs. experimental flag
+
+**Chromatography:**
+- `chromatography`: Chromatographic properties
+  - Retention time
+  - Column type
+  - Mobile phase
+  - Method details
+
+### External Database Links
+
+**Database Cross-References:**
+- `kegg_id`: KEGG Compound ID
+- `pubchem_compound_id`: PubChem CID
+- `pubchem_substance_id`: PubChem SID
+- `chebi_id`: Chemical Entities of Biological Interest ID
+- `chemspider_id`: ChemSpider ID
+- `drugbank_id`: DrugBank accession (if applicable)
+- `foodb_id`: FooDB ID (if food component)
+- `knapsack_id`: KNApSAcK ID
+- `metacyc_id`: MetaCyc ID
+- `bigg_id`: BiGG Model ID
+- `wikipedia_id`: Wikipedia page link
+- `metlin_id`: METLIN ID
+- `vmh_id`: Virtual Metabolic Human ID
+- `fbonto_id`: FlyBase ontology ID
+
+**Protein Database Links:**
+- `uniprot_id`: UniProt accession for associated proteins
+- `genbank_id`: GenBank ID for associated genes
+- `pdb_id`: Protein Data Bank ID for protein structures
+
+### Literature and Evidence
+
+**References:**
+- `general_references`: General references about the metabolite
+  - PubMed ID
+  - Reference text
+  - Citation
+- `synthesis_reference`: Synthesis methods and references
+- `protein_references`: References for protein associations
+- `pathway_references`: References for pathway involvement
+
+### Ontology and Classification
+
+**Ontology Terms:**
+- `ontology_terms`: Related ontology classifications
+  - Term name
+  - Ontology source (ChEBI, MeSH, etc.)
+  - Term ID
+  - Definition
+
+### Data Quality and Provenance
+
+**Metadata:**
+- `creation_date`: Date entry was created
+- `update_date`: Date entry was last updated
+- `version`: HMDB version number
+- `status`: Entry status (detected, expected, predicted); this is the same `status` field listed under Biospecimen Information
+- `evidence`: Evidence level for detection/presence
+
+## XML Structure Example
+
+Each metabolite entry in an HMDB XML download follows this pattern (abbreviated):
+
+```xml
+<metabolite>
+  <accession>HMDB0000001</accession>
+  <name>1-Methylhistidine</name>
+  <chemical_formula>C7H11N3O2</chemical_formula>
+  <average_molecular_weight>169.1811</average_molecular_weight>
+  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
+  <smiles>CN1C=NC(CC(N)C(O)=O)=C1</smiles>
+  <inchi>InChI=1S/C7H11N3O2/c1-10-3-5(9-4-10)2-6(8)7(11)12/h3-4,6H,2,8H2,1H3,(H,11,12)</inchi>
+  <inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>
+
+  <biospecimen_locations>
+    <biospecimen>Blood</biospecimen>
+    <biospecimen>Urine</biospecimen>
+  </biospecimen_locations>
+
+  <pathways>
+    <pathway>
+      <name>Histidine Metabolism</name>
+      <smpdb_id>SMP0000044</smpdb_id>
+      <kegg_map_id>map00340</kegg_map_id>
+    </pathway>
+  </pathways>
+
+  <diseases>
+    <disease>
+      <name>Carnosinemia</name>
+      <omim_id>212200</omim_id>
+    </disease>
+  </diseases>
+
+  <normal_concentrations>
+    <concentration>
+      <biospecimen>Blood</biospecimen>
+      <concentration_value>3.8</concentration_value>
+      <concentration_units>uM</concentration_units>
+    </concentration>
+  </normal_concentrations>
+</metabolite>
+```
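A record shaped like the example above can be read with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the embedded sample is a trimmed, namespace-free copy of the example, and `parse_metabolite` is a hypothetical helper. Real downloads may declare an XML namespace on the root element, in which case the tag paths need the namespace added.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example record (illustrative only, no namespace).
SAMPLE = """
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <biospecimen_locations>
    <biospecimen>Blood</biospecimen>
    <biospecimen>Urine</biospecimen>
  </biospecimen_locations>
  <pathways>
    <pathway>
      <name>Histidine Metabolism</name>
      <smpdb_id>SMP0000044</smpdb_id>
    </pathway>
  </pathways>
</metabolite>
"""


def parse_metabolite(elem):
    """Flatten one <metabolite> element into a plain dict (hypothetical helper)."""
    return {
        "accession": elem.findtext("accession"),
        "name": elem.findtext("name"),
        "chemical_formula": elem.findtext("chemical_formula"),
        # Repeated child elements become lists.
        "biospecimens": [
            b.text for b in elem.findall("biospecimen_locations/biospecimen")
        ],
        "pathways": [p.findtext("name") for p in elem.findall("pathways/pathway")],
    }


record = parse_metabolite(ET.fromstring(SAMPLE.strip()))
```

For full-database files, which are large, a streaming parser such as `ET.iterparse` is likely a better fit than loading the whole tree at once.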
+
+## Querying Specific Fields
+
+When working with HMDB data programmatically:
+
+**For metabolite identification:**
+- Query by `accession`, `name`, `synonyms`, `inchi`, `smiles`
+
+**For chemical similarity:**
+- Use `smiles` or `inchi` for structure comparison, and `inchikey`, `average_molecular_weight`, or `chemical_formula` for exact-match lookups and coarse pre-filtering
+
+**For biomarker discovery:**
+- Filter by `diseases`, `biomarker_status`, `normal_concentrations`, `abnormal_concentrations`
+
+**For pathway analysis:**
+- Extract `pathways`, `protein_associations`, `reactions`
+
+**For spectral matching:**
+- Compare against `nmr_spectra`, `ms_spectra` peak lists
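One simple way to compare a measured peak list against a reference list is to count the query peaks that fall within an m/z tolerance of any reference peak. The sketch below does only that. It is not the scoring used by HMDB's own search tools, the function names and the default tolerance are assumptions, and intensities are ignored.

```python
import bisect


def match_peaks(query_mz, reference_mz, tol=0.01):
    """Return the query m/z values that have a reference peak within +/- tol."""
    ref = sorted(reference_mz)
    matched = []
    for mz in query_mz:
        # First reference peak at or above the lower edge of the window.
        idx = bisect.bisect_left(ref, mz - tol)
        if idx < len(ref) and ref[idx] <= mz + tol:
            matched.append(mz)
    return matched


def match_fraction(query_mz, reference_mz, tol=0.01):
    """Fraction of query peaks matched; 0.0 for an empty query."""
    if not query_mz:
        return 0.0
    return len(match_peaks(query_mz, reference_mz, tol)) / len(query_mz)
```

A production matcher would normally weight peaks by intensity (for example with a cosine score) and use a ppm tolerance rather than a fixed one in Daltons.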
+
+**For cross-database integration:**
+- Map using external IDs: `kegg_id`, `pubchem_compound_id`, `chebi_id`, etc.
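Mapping by external ID amounts to building a lookup from the external identifier back to HMDB accessions. A minimal sketch, assuming each record has already been parsed into a dict keyed by the field names above; `build_xref_index` is a hypothetical helper, and the IDs used with it in practice come from your own parsed data.

```python
from collections import defaultdict


def build_xref_index(records, id_field):
    """Map each non-empty external ID in `id_field` to the accessions carrying it."""
    index = defaultdict(list)
    for rec in records:
        ext_id = rec.get(id_field)
        if ext_id:  # skip entries where the cross-reference is missing or empty
            index[ext_id].append(rec["accession"])
    return dict(index)
```

Values are lists rather than single accessions because more than one HMDB entry may, in principle, point at the same external ID.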
+
+## Field Completeness
+
+Not all fields are populated for every metabolite:
+
+- **Highly complete fields** (>90% of entries): accession, name, chemical_formula, average_molecular_weight, smiles, inchi
+- **Moderately complete** (50-90%): biospecimen_locations, tissue_locations, pathways
+- **Variably complete** (10-50%): concentration data, disease associations, protein associations
+- **Sparsely complete** (<10%): experimental NMR/MS spectra, detailed kinetic data
+
+Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.
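Completeness is easy to measure on whatever subset of HMDB you have loaded, rather than relying on the approximate bands above. A sketch, assuming records are dicts and that `None`, empty strings, and empty containers count as unpopulated; `field_completeness` is a hypothetical helper.

```python
def field_completeness(records, fields):
    """Return, per field, the fraction of records with a non-empty value."""
    total = len(records)
    result = {}
    for field in fields:
        filled = sum(
            1 for rec in records if rec.get(field) not in (None, "", [], {})
        )
        # Avoid division by zero when no records were loaded.
        result[field] = filled / total if total else 0.0
    return result
```

Running this before an analysis shows which fields are dense enough to filter on and which will silently drop most entries.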