Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/hmdb-database/SKILL.md
+++ b/skills/hmdb-database/SKILL.md
@@ -0,0 +1,190 @@
+---
+name: hmdb-database
+description: "Access Human Metabolome Database (220K+ metabolites). Search by name/ID/structure, retrieve chemical properties, biomarker data, NMR/MS spectra, pathways, for metabolomics and identification."
+---
+
+# HMDB Database
+
+## Overview
+
+The Human Metabolome Database (HMDB) is a comprehensive, freely available resource containing detailed information about small molecule metabolites found in the human body.
+
+## When to Use This Skill
+
+This skill should be used when performing metabolomics research, clinical chemistry, biomarker discovery, or metabolite identification tasks.
+
+## Database Contents
+
+HMDB version 5.0 (current as of 2025) contains:
+
+- **220,945 metabolite entries** covering both water-soluble and lipid-soluble compounds
+- **8,610 protein sequences** for enzymes and transporters involved in metabolism
+- **130+ data fields per metabolite** including:
+  - Chemical properties (structure, formula, molecular weight, InChI, SMILES)
+  - Clinical data (biomarker associations, diseases, normal/abnormal concentrations)
+  - Biological information (pathways, reactions, locations)
+  - Spectroscopic data (NMR, MS, MS-MS spectra)
+  - External database links (KEGG, PubChem, MetaCyc, ChEBI, PDB, UniProt, GenBank)
+
+## Core Capabilities
+
+### 1. Web-Based Metabolite Searches
+
+Access HMDB through the web interface at https://www.hmdb.ca/ for:
+
+**Text Searches:**
+- Search by metabolite name, synonym, or identifier (HMDB ID)
+- Example HMDB IDs: HMDB0000001, HMDB0001234
+- Search by disease associations or pathway involvement
+- Query by biological specimen type (urine, serum, CSF, saliva, feces, sweat)
+
+**Structure-Based Searches:**
+- Use ChemQuery for structure and substructure searches
+- Search by molecular weight or molecular weight range
+- Use SMILES or InChI strings to find compounds
+
+**Spectral Searches:**
+- LC-MS spectral matching
+- GC-MS spectral matching
+- NMR spectral searches for metabolite identification
+
+**Advanced Searches:**
+- Combine multiple criteria (name, properties, concentration ranges)
+- Filter by biological locations or specimen types
+- Search by protein/enzyme associations
+
+### 2. Accessing Metabolite Information
+
+When retrieving metabolite data, HMDB provides:
+
+**Chemical Information:**
+- Systematic name, traditional names, and synonyms
+- Chemical formula and molecular weight
+- Structure representations (2D/3D, SMILES, InChI, MOL file)
+- Chemical taxonomy and classification
+
+**Biological Context:**
+- Metabolic pathways and reactions
+- Associated enzymes and transporters
+- Subcellular locations
+- Biological roles and functions
+
+**Clinical Relevance:**
+- Normal concentration ranges in biological fluids
+- Biomarker associations with diseases
+- Clinical significance
+- Toxicity information when applicable
+
+**Analytical Data:**
+- Experimental and predicted NMR spectra
+- MS and MS-MS spectra
+- Retention times and chromatographic data
+- Reference peaks for identification
+
+### 3. Downloadable Datasets
+
+HMDB offers bulk data downloads at https://www.hmdb.ca/downloads in multiple formats:
+
+**Available Formats:**
+- **XML**: Complete metabolite, protein, and spectra data
+- **SDF**: Metabolite structure files for cheminformatics
+- **FASTA**: Protein and gene sequences
+- **TXT**: Raw spectra peak lists
+- **CSV/TSV**: Tabular data exports
+
+**Dataset Categories:**
+- All metabolites or filtered by specimen type
+- Protein/enzyme sequences
+- Experimental and predicted spectra (NMR, GC-MS, MS-MS)
+- Pathway information
+
+**Best Practices:**
+- Download XML format for comprehensive data including all fields
+- Use SDF format for structure-based analysis and cheminformatics workflows
+- Parse CSV/TSV formats for integration with data analysis pipelines
+- Check version dates to ensure up-to-date data (current: v5.0, 2023-07-01)
+
+**Usage Requirements:**
+- Free for academic and non-commercial research
+- Commercial use requires explicit permission (contact samackay@ualberta.ca)
+- Cite HMDB publication when using data
+
+### 4. Programmatic API Access
+
+**API Availability:**
+HMDB does not provide a public REST API. Programmatic access requires contacting the development team:
+
+- **Academic/Research groups:** Contact eponine@ualberta.ca (Eponine) or samackay@ualberta.ca (Scott)
+- **Commercial organizations:** Contact samackay@ualberta.ca (Scott) for customized API access
+
+**Alternative Programmatic Access:**
+- **R/Bioconductor**: Use the `hmdbQuery` package for R-based queries
+  - Install: `BiocManager::install("hmdbQuery")`
+  - Provides HTTP-based querying functions
+- **Downloaded datasets**: Parse XML or CSV files locally for programmatic analysis
+- **Web scraping**: Not recommended; contact team for proper API access instead
+
+### 5. Common Research Workflows
+
+**Metabolite Identification in Untargeted Metabolomics:**
+1. Obtain experimental MS or NMR spectra from samples
+2. Use HMDB spectral search tools to match against reference spectra
+3. Verify candidates by checking molecular weight, retention time, and MS-MS fragmentation
+4. Review biological plausibility (expected in specimen type, known pathways)
+
+**Biomarker Discovery:**
+1. Search HMDB for metabolites associated with disease of interest
+2. Review concentration ranges in normal vs. disease states
+3. Identify metabolites with strong differential abundance
+4. Examine pathway context and biological mechanisms
+5. Cross-reference with literature via PubMed links
+
+**Pathway Analysis:**
+1. Identify metabolites of interest from experimental data
+2. Look up HMDB entries for each metabolite
+3. Extract pathway associations and enzymatic reactions
+4. Use linked SMPDB (Small Molecule Pathway Database) for pathway diagrams
+5. Identify pathway enrichment for biological interpretation
+
+**Database Integration:**
+1. Download HMDB data in XML or CSV format
+2. Parse and extract relevant fields for local database
+3. Link with external IDs (KEGG, PubChem, ChEBI) for cross-database queries
+4. Build local tools or pipelines incorporating HMDB reference data
+
+## Related HMDB Resources
+
+The HMDB ecosystem includes related databases:
+
+- **DrugBank**: ~2,832 drug compounds with pharmaceutical information
+- **T3DB (Toxin and Toxin Target Database)**: ~3,670 toxic compounds
+- **SMPDB (Small Molecule Pathway Database)**: Pathway diagrams and maps
+- **FooDB**: ~70,000 food component compounds
+
+These databases share similar structure and identifiers, enabling integrated queries across human metabolome, drug, toxin, and food databases.
+
+## Best Practices
+
+**Data Quality:**
+- Verify metabolite identifications with multiple evidence types (spectra, structure, properties)
+- Check experimental vs. predicted data quality indicators
+- Review citations and evidence for biomarker associations
+
+**Version Tracking:**
+- Note HMDB version used in research (current: v5.0)
+- Databases are updated periodically with new entries and corrections
+- Re-query for updates when publishing to ensure current information
+
+**Citation:**
+- Always cite HMDB in publications using the database
+- Reference specific HMDB IDs when discussing metabolites
+- Acknowledge data sources for downloaded datasets
+
+**Performance:**
+- For large-scale analysis, download complete datasets rather than repeated web queries
+- Use appropriate file formats (XML for comprehensive data, CSV for tabular analysis)
+- Consider local caching of frequently accessed metabolite information
+
+## Reference Documentation
+
+See `references/hmdb_data_fields.md` for detailed information about available data fields and their meanings.
--- a/skills/hmdb-database/references/hmdb_data_fields.md
+++ b/skills/hmdb-database/references/hmdb_data_fields.md
@@ -0,0 +1,267 @@
+# HMDB Data Fields Reference
+
+This document provides detailed information about the data fields available in HMDB metabolite entries.
+
+## Metabolite Entry Structure
+
+Each HMDB metabolite entry contains 130+ data fields organized into several categories:
+
+### Chemical Data Fields
+
+**Identification:**
+- `accession`: Primary HMDB ID (e.g., HMDB0000001)
+- `secondary_accessions`: Previous HMDB IDs for merged entries
+- `name`: Primary metabolite name
+- `synonyms`: Alternative names and common names
+- `chemical_formula`: Molecular formula (e.g., C6H12O6)
+- `average_molecular_weight`: Average molecular weight in Daltons
+- `monoisotopic_molecular_weight`: Monoisotopic molecular weight
+
+**Structure Representations:**
+- `smiles`: Simplified Molecular Input Line Entry System string
+- `inchi`: International Chemical Identifier string
+- `inchikey`: Hashed InChI for fast lookup
+- `iupac_name`: IUPAC systematic name
+- `traditional_iupac`: Traditional IUPAC name
+
+**Chemical Properties:**
+- `state`: Physical state (solid, liquid, gas)
+- `charge`: Net molecular charge
+- `logp`: Octanol-water partition coefficient (experimental/predicted)
+- `pka_strongest_acidic`: Strongest acidic pKa value
+- `pka_strongest_basic`: Strongest basic pKa value
+- `polar_surface_area`: Topological polar surface area (TPSA)
+- `refractivity`: Molar refractivity
+- `polarizability`: Molecular polarizability
+- `rotatable_bond_count`: Number of rotatable bonds
+- `acceptor_count`: Hydrogen bond acceptor count
+- `donor_count`: Hydrogen bond donor count
+
+**Chemical Taxonomy:**
+- `kingdom`: Chemical kingdom (e.g., Organic compounds)
+- `super_class`: Chemical superclass
+- `class`: Chemical class
+- `sub_class`: Chemical subclass
+- `direct_parent`: Direct chemical parent
+- `alternative_parents`: Alternative parent classifications
+- `substituents`: Chemical substituents present
+- `description`: Text description of the compound
+
+### Biological Data Fields
+
+**Metabolite Origins:**
+- `origin`: Source of metabolite (endogenous, exogenous, drug metabolite, food component)
+- `biofluid_locations`: Biological fluids where found (blood, urine, saliva, CSF, etc.)
+- `tissue_locations`: Tissues where found (liver, kidney, brain, muscle, etc.)
+- `cellular_locations`: Subcellular locations (cytoplasm, mitochondria, membrane, etc.)
+
+**Biospecimen Information:**
+- `biospecimen`: Type of biological specimen
+- `status`: Detection status (detected, expected, predicted)
+- `concentration`: Concentration ranges with units
+- `concentration_references`: Citations for concentration data
+
+**Normal and Abnormal Concentrations:**
+For each biofluid (blood, urine, saliva, CSF, feces, sweat):
+- Normal concentration value and range
+- Units (μM, mg/L, etc.)
+- Age and gender considerations
+- Abnormal concentration indicators
+- Clinical significance
+
+### Pathway and Enzyme Information
+
+**Metabolic Pathways:**
+- `pathways`: List of associated metabolic pathways
+  - Pathway name
+  - SMPDB ID (Small Molecule Pathway Database ID)
+  - KEGG pathway ID
+  - Pathway category
+
+**Enzymatic Reactions:**
+- `protein_associations`: Enzymes and transporters
+  - Protein name
+  - Gene name
+  - Uniprot ID
+  - GenBank ID
+  - Protein type (enzyme, transporter, carrier, etc.)
+  - Enzyme reactions
+  - Enzyme kinetics (Km values)
+
+**Biochemical Context:**
+- `reactions`: Biochemical reactions involving the metabolite
+- `reaction_enzymes`: Enzymes catalyzing reactions
+- `cofactors`: Required cofactors
+- `inhibitors`: Known enzyme inhibitors
+
+### Disease and Biomarker Associations
+
+**Disease Links:**
+- `diseases`: Associated diseases and conditions
+  - Disease name
+  - OMIM ID (Online Mendelian Inheritance in Man)
+  - Disease category
+  - References and evidence
+
+**Biomarker Information:**
+- `biomarker_status`: Whether compound is a known biomarker
+- `biomarker_applications`: Clinical applications
+- `biomarker_for`: Diseases or conditions where used as biomarker
+
+### Spectroscopic Data
+
+**NMR Spectra:**
+- `nmr_spectra`: Nuclear Magnetic Resonance data
+  - Spectrum type (1D ¹H, ¹³C, 2D COSY, HSQC, etc.)
+  - Spectrometer frequency (MHz)
+  - Solvent used
+  - Temperature
+  - pH
+  - Peak list with chemical shifts and multiplicities
+  - FID (Free Induction Decay) files
+
+**Mass Spectrometry:**
+- `ms_spectra`: Mass spectrometry data
+  - Spectrum type (MS, MS-MS, LC-MS, GC-MS)
+  - Ionization mode (positive, negative, neutral)
+  - Collision energy
+  - Instrument type
+  - Peak list (m/z, intensity, annotation)
+  - Predicted vs. experimental flag
+
+**Chromatography:**
+- `chromatography`: Chromatographic properties
+  - Retention time
+  - Column type
+  - Mobile phase
+  - Method details
+
+### External Database Links
+
+**Database Cross-References:**
+- `kegg_id`: KEGG Compound ID
+- `pubchem_compound_id`: PubChem CID
+- `pubchem_substance_id`: PubChem SID
+- `chebi_id`: Chemical Entities of Biological Interest ID
+- `chemspider_id`: ChemSpider ID
+- `drugbank_id`: DrugBank accession (if applicable)
+- `foodb_id`: FooDB ID (if food component)
+- `knapsack_id`: KNApSAcK ID
+- `metacyc_id`: MetaCyc ID
+- `bigg_id`: BiGG Model ID
+- `wikipedia_id`: Wikipedia page link
+- `metlin_id`: METLIN ID
+- `vmh_id`: Virtual Metabolic Human ID
+- `fbonto_id`: FlyBase ontology ID
+
+**Protein Database Links:**
+- `uniprot_id`: UniProt accession for associated proteins
+- `genbank_id`: GenBank ID for associated genes
+- `pdb_id`: Protein Data Bank ID for protein structures
+
+### Literature and Evidence
+
+**References:**
+- `general_references`: General references about the metabolite
+  - PubMed ID
+  - Reference text
+  - Citation
+- `synthesis_reference`: Synthesis methods and references
+- `protein_references`: References for protein associations
+- `pathway_references`: References for pathway involvement
+
+### Ontology and Classification
+
+**Ontology Terms:**
+- `ontology_terms`: Related ontology classifications
+  - Term name
+  - Ontology source (ChEBI, MeSH, etc.)
+  - Term ID
+  - Definition
+
+### Data Quality and Provenance
+
+**Metadata:**
+- `creation_date`: Date entry was created
+- `update_date`: Date entry was last updated
+- `version`: HMDB version number
+- `status`: Entry status (detected, expected, predicted)
+- `evidence`: Evidence level for detection/presence
+
+## XML Structure Example
+
+When downloading HMDB data in XML format, the structure follows this pattern:
+
+```xml
+<metabolite>
+  <accession>HMDB0000001</accession>
+  <name>1-Methylhistidine</name>
+  <chemical_formula>C7H11N3O2</chemical_formula>
+  <average_molecular_weight>169.1811</average_molecular_weight>
+  <monoisotopic_molecular_weight>169.085126436</monoisotopic_molecular_weight>
+  <smiles>CN1C=NC(CC(=O)O)=C1</smiles>
+  <inchi>InChI=1S/C7H11N3O2/c1-10-4-8-3-5(10)2-7(11)12/h3-4H,2H2,1H3,(H,11,12)</inchi>
+  <inchikey>BRMWTNUJHUMWMS-UHFFFAOYSA-N</inchikey>
+
+  <biospecimen_locations>
+    <biospecimen>Blood</biospecimen>
+    <biospecimen>Urine</biospecimen>
+  </biospecimen_locations>
+
+  <pathways>
+    <pathway>
+      <name>Histidine Metabolism</name>
+      <smpdb_id>SMP0000044</smpdb_id>
+      <kegg_map_id>map00340</kegg_map_id>
+    </pathway>
+  </pathways>
+
+  <diseases>
+    <disease>
+      <name>Carnosinemia</name>
+      <omim_id>212200</omim_id>
+    </disease>
+  </diseases>
+
+  <normal_concentrations>
+    <concentration>
+      <biospecimen>Blood</biospecimen>
+      <concentration_value>3.8</concentration_value>
+      <concentration_units>uM</concentration_units>
+    </concentration>
+  </normal_concentrations>
+</metabolite>
+```
+
+## Querying Specific Fields
+
+When working with HMDB data programmatically:
+
+**For metabolite identification:**
+- Query by `accession`, `name`, `synonyms`, `inchi`, `smiles`
+
+**For chemical similarity:**
+- Use `smiles`, `inchi`, `inchikey`, `molecular_weight`, `chemical_formula`
+
+**For biomarker discovery:**
+- Filter by `diseases`, `biomarker_status`, `normal_concentrations`, `abnormal_concentrations`
+
+**For pathway analysis:**
+- Extract `pathways`, `protein_associations`, `reactions`
+
+**For spectral matching:**
+- Compare against `nmr_spectra`, `ms_spectra` peak lists
+
+**For cross-database integration:**
+- Map using external IDs: `kegg_id`, `pubchem_compound_id`, `chebi_id`, etc.
+
+## Field Completeness
+
+Not all fields are populated for every metabolite:
+
+- **Highly complete fields** (>90% of entries): accession, name, chemical_formula, molecular_weight, smiles, inchi
+- **Moderately complete** (50-90%): biospecimen_locations, tissue_locations, pathways
+- **Variably complete** (10-50%): concentration data, disease associations, protein associations
+- **Sparsely complete** (<10%): experimental NMR/MS spectra, detailed kinetic data
+
+Predicted and computational data (e.g., predicted MS spectra, predicted concentrations) supplement experimental data where available.