# gget Database Information

Overview of databases queried by gget modules, including update frequencies and important considerations.

## Important Note

The databases queried by gget are continuously being updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary. Always keep gget updated:

```bash
pip install --upgrade gget
```

## Database Directory

### Genomic Reference Databases

#### Ensembl
- **Used by:** gget ref, gget search, gget info, gget seq
- **Description:** Comprehensive genome database with annotations for vertebrate and invertebrate species
- **Update frequency:** Regular releases (numbered); new releases approximately every 3 months
- **Access:** FTP downloads, REST API
- **Website:** https://www.ensembl.org/
- **Notes:**
  - Supports both vertebrate and invertebrate genomes
  - Can specify release number for reproducibility
  - Shortcuts available for common species ('human', 'mouse')

#### UCSC Genome Browser
- **Used by:** gget blat
- **Description:** Genome browser database with BLAT alignment tool
- **Update frequency:** Regular updates with new assemblies
- **Access:** Web service API
- **Website:** https://genome.ucsc.edu/
- **Notes:**
  - Multiple genome assemblies available (hg38, mm39, etc.)
  - BLAT optimized for vertebrate genomes

### Protein & Structure Databases

#### UniProt
- **Used by:** gget info, gget seq (amino acid sequences), gget elm
- **Description:** Universal Protein Resource, comprehensive protein sequence and functional information
- **Update frequency:** Regular releases, roughly every eight weeks (Swiss-Prot and TrEMBL are released together)
- **Access:** REST API
- **Website:** https://www.uniprot.org/
- **Notes:**
  - Swiss-Prot: manually annotated and reviewed
  - TrEMBL: automatically annotated

#### NCBI (National Center for Biotechnology Information)
- **Used by:** gget info, gget bgee (for non-Ensembl species)
- **Description:** Gene and protein databases with extensive cross-references
- **Update frequency:** Continuous updates
- **Access:** E-utilities API
- **Website:** https://www.ncbi.nlm.nih.gov/
- **Databases:** Gene, Protein, RefSeq

#### RCSB PDB (Protein Data Bank)
- **Used by:** gget pdb
- **Description:** Repository of 3D structural data for proteins and nucleic acids
- **Update frequency:** Weekly updates
- **Access:** REST API
- **Website:** https://www.rcsb.org/
- **Notes:**
  - Experimentally determined structures (X-ray, NMR, cryo-EM)
  - Includes metadata about experiments and publications

#### ELM (Eukaryotic Linear Motif)
- **Used by:** gget elm
- **Description:** Database of functional sites in eukaryotic proteins
- **Update frequency:** Periodic updates
- **Access:** Downloaded database (via gget setup elm)
- **Website:** http://elm.eu.org/
- **Notes:**
  - Requires local download before first use
  - Contains validated motifs and patterns

### Sequence Similarity Databases

#### BLAST Databases (NCBI)
- **Used by:** gget blast
- **Description:** Pre-formatted databases for BLAST searches
- **Update frequency:** Regular updates
- **Access:** NCBI BLAST API
- **Databases:**
  - **Nucleotide:** nt (all GenBank), refseq_rna, pdbnt
  - **Protein:** nr (non-redundant), swissprot, pdbaa, refseq_protein
- **Notes:**
  - nt and nr are very large databases
  - Consider specialized databases for faster, more focused searches
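
As a rough illustration of choosing a focused database from the lists above, the sketch below guesses whether a query is nucleotide or protein and returns one of the database names mentioned in this section. The function names and the alphabet heuristic are assumptions made for this example, not part of gget; a short peptide made only of A, C, G, T, U, or N would be misclassified as nucleotide.

```python
# Hypothetical helper (not part of gget): pick a BLAST database name
# from the options listed above, based on a simple alphabet heuristic.
NUCLEOTIDE_ALPHABET = set("ACGTUN")


def guess_sequence_type(sequence: str) -> str:
    """Return 'nucleotide' if only A/C/G/T/U/N are used, else 'protein'."""
    residues = set(sequence.upper()) - {"\n", " ", "-", "*"}
    if not residues:
        raise ValueError("empty sequence")
    return "nucleotide" if residues <= NUCLEOTIDE_ALPHABET else "protein"


def pick_database(sequence: str, focused: bool = True) -> str:
    """Prefer a smaller curated database unless a broad search is requested."""
    if guess_sequence_type(sequence) == "nucleotide":
        return "refseq_rna" if focused else "nt"
    return "swissprot" if focused else "nr"
```

The returned string could then be passed as the database choice of a BLAST query; starting focused and widening to nt or nr only when nothing is found keeps most searches fast.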

### Expression & Correlation Databases

#### ARCHS4
- **Used by:** gget archs4
- **Description:** Massive mining of publicly available RNA-seq data
- **Update frequency:** Periodic updates with new samples
- **Access:** HTTP API
- **Website:** https://maayanlab.cloud/archs4/
- **Data:**
  - Human and mouse RNA-seq data
  - Correlation matrices
  - Tissue expression atlases
- **Citation:** Lachmann et al., Nature Communications, 2018

#### CZ CELLxGENE Discover
- **Used by:** gget cellxgene
- **Description:** Single-cell RNA-seq data from multiple studies
- **Update frequency:** Continuous additions of new datasets
- **Access:** Census API (via cellxgene-census package)
- **Website:** https://cellxgene.cziscience.com/
- **Data:**
  - Single-cell RNA-seq count matrices
  - Cell type annotations
  - Tissue and disease metadata
- **Notes:**
  - Requires gget setup cellxgene
  - Gene symbols are case-sensitive
  - May not support latest Python versions
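
Because gene symbols are case-sensitive here, it can help to normalize user input before querying. The sketch below applies the common naming conventions (upper case for human, leading capital for mouse). The function name and the species strings are assumptions for illustration, and the rule is only a heuristic: some official symbols, such as human `C1orf112`, do not follow it.

```python
# Hypothetical helper (not part of gget): normalize gene symbol case
# using common human/mouse naming conventions. Heuristic only.
def format_gene_symbol(symbol: str, species: str) -> str:
    symbol = symbol.strip()
    if species == "homo_sapiens":
        # Human symbols are conventionally upper case, e.g. PAX7
        return symbol.upper()
    if species == "mus_musculus":
        # Mouse symbols conventionally start with a capital, e.g. Pax7
        return symbol[:1].upper() + symbol[1:].lower()
    # Unknown species: leave the symbol untouched
    return symbol
```

For example, `format_gene_symbol("pax7", "homo_sapiens")` gives `"PAX7"`, while the same input with `"mus_musculus"` gives `"Pax7"`.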

#### Bgee
- **Used by:** gget bgee
- **Description:** Gene expression and orthology database
- **Update frequency:** Regular releases
- **Access:** REST API
- **Website:** https://www.bgee.org/
- **Data:**
  - Gene expression across tissues and developmental stages
  - Orthology relationships across species
- **Citation:** Bastian et al., 2021

### Functional & Pathway Databases

#### Enrichr / modEnrichr
- **Used by:** gget enrichr
- **Description:** Gene set enrichment analysis web service
- **Update frequency:** Regular updates to underlying databases
- **Access:** REST API
- **Website:** https://maayanlab.cloud/Enrichr/
- **Databases included:**
  - KEGG pathways
  - Gene Ontology (GO)
  - Transcription factor targets (ChEA)
  - Disease associations (GWAS Catalog)
  - Cell type markers (PanglaoDB)
- **Notes:**
  - Supports multiple model organisms
  - Background gene lists can be provided for custom enrichment

### Disease & Drug Databases

#### Open Targets
- **Used by:** gget opentargets
- **Description:** Integrative platform for disease-target associations
- **Update frequency:** Regular releases (quarterly)
- **Access:** GraphQL API
- **Website:** https://www.opentargets.org/
- **Data:**
  - Disease associations
  - Drug information and clinical trials
  - Target tractability
  - Pharmacogenetics
  - Gene expression
  - DepMap gene-disease effects
  - Protein-protein interactions

#### cBioPortal
- **Used by:** gget cbio
- **Description:** Cancer genomics data portal
- **Update frequency:** Continuous addition of new studies
- **Access:** Web API, downloadable datasets
- **Website:** https://www.cbioportal.org/
- **Data:**
  - Mutations, copy number alterations, structural variants
  - Gene expression
  - Clinical data
- **Notes:**
  - Large datasets; caching recommended
  - Multiple cancer types and studies available

#### COSMIC (Catalogue Of Somatic Mutations In Cancer)
- **Used by:** gget cosmic
- **Description:** Comprehensive cancer mutation database
- **Update frequency:** Regular releases
- **Access:** Download (requires account and license for commercial use)
- **Website:** https://cancer.sanger.ac.uk/cosmic
- **Data:**
  - Somatic mutations in cancer
  - Gene census
  - Cell line data
  - Drug resistance mutations
- **Important:**
  - Free for academic use
  - License fees apply for commercial use
  - Requires COSMIC account credentials
  - Must download database before querying

### AI & Prediction Services

#### AlphaFold2 (DeepMind)
- **Used by:** gget alphafold
- **Description:** Deep learning model for protein structure prediction
- **Model version:** Simplified version for local execution
- **Access:** Local computation (requires model download via gget setup)
- **Website:** https://alphafold.ebi.ac.uk/
- **Notes:**
  - Requires ~4GB model parameters download
  - Requires OpenMM installation
  - Computationally intensive
  - Python version-specific requirements

#### OpenAI API
- **Used by:** gget gpt
- **Description:** Large language model API
- **Update frequency:** New models released periodically
- **Access:** REST API (requires API key)
- **Website:** https://openai.com/
- **Notes:**
  - Default model: gpt-3.5-turbo
  - Free tier limited to 3 months after account creation
  - Set billing limits to control costs

## Data Consistency & Reproducibility

### Version Control

To ensure reproducibility in analyses:

1. **Specify database versions/releases:**
   ```python
   import gget

   # Use specific Ensembl release
   gget.ref("homo_sapiens", release=110)

   # Use specific Census version
   gget.cellxgene(gene=["PAX7"], census_version="2023-07-25")
   ```

2. **Document gget version:**
   ```python
   import gget
   print(gget.__version__)
   ```

3. **Save raw data:**
   ```python
   import gget

   # Always save results for reproducibility
   results = gget.search(["ACE2"], species="homo_sapiens")
   results.to_csv("search_results_2025-01-15.csv", index=False)
   ```
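
The three steps above can be combined into a small provenance record stored next to each result file. This is a minimal sketch using only the standard library; the function names and record fields are assumptions made for this example, and the package version is looked up through `importlib.metadata` so the code also runs where the package is not installed.

```python
# Hypothetical helpers (not part of gget): build a provenance record
# and a dated file name to store alongside saved query results.
from datetime import date, datetime, timezone
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a placeholder."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def provenance_record(package: str, query: dict, database_versions: dict) -> dict:
    """Collect tool version, query parameters, database versions, and a UTC timestamp."""
    return {
        "package": package,
        "package_version": installed_version(package),
        "query": query,
        "database_versions": database_versions,
        "retrieved_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


def dated_filename(stem: str, suffix: str = ".csv", day=None) -> str:
    """Append an ISO date to a file stem, e.g. search_results_2025-01-15.csv."""
    day = day or date.today()
    return f"{stem}_{day.isoformat()}{suffix}"


record = provenance_record(
    "gget",
    {"module": "search", "searchwords": ["ACE2"], "species": "homo_sapiens"},
    {"ensembl_release": 110},
)
```

Writing `record` with `json.dump` to a file named by `dated_filename("search_results", ".json")` keeps the query, the tool version, and the database release together with the saved data.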

### Handling Database Updates

1. **Regular gget updates:**
   - Update gget biweekly to match database structure changes
   - Check release notes for breaking changes

2. **Error handling:**
   - Database structure changes may cause temporary failures
   - Check GitHub issues: https://github.com/pachterlab/gget/issues
   - Update gget if errors occur

3. **API rate limiting:**
   - Implement delays for large-scale queries
   - Run queries locally when possible (e.g. gget diamond alignments, a downloaded COSMIC database)
   - Cache results to avoid repeated queries
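
One way to combine the delay, retry, and caching advice above is a small wrapper around any query function. This is a sketch under stated assumptions, not gget functionality: `cached_query`, `fetch`, and `cache_file` are hypothetical names, and the result is assumed to be JSON-serializable (a DataFrame would need converting first, for example with `to_dict(orient="records")`).

```python
# Hypothetical wrapper (not part of gget): cache a query result on disk
# and retry failed calls with exponentially growing delays.
import json
import time
from pathlib import Path
from typing import Any, Callable


def cached_query(
    fetch: Callable[[], Any],
    cache_file: Path,
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Return the cached result if present, else call fetch and store it."""
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    last_error = None
    for attempt in range(retries):
        try:
            result = fetch()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < retries - 1:
                # Wait delay, 2*delay, 4*delay, ... between attempts
                sleep(delay * 2 ** attempt)
            continue
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(result))
        return result

    raise RuntimeError(f"query failed after {retries} attempts") from last_error
```

A call might look like `cached_query(lambda: run_my_query(), Path("cache/ace2.json"))`, where `run_my_query` stands for whatever gget call is being wrapped. Deleting the cache file forces a fresh query after a database release.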

## Database-Specific Best Practices

### Ensembl
- Use species shortcuts ('human', 'mouse') for convenience
- Specify release numbers for reproducibility
- Check available species with `gget ref --list_species`

### UniProt
- UniProt IDs are more stable than gene names
- Swiss-Prot annotations are manually curated and more reliable
- Use PDB flag in gget info only when needed (increases runtime)
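
Since stable identifiers are preferable to gene names, it can be useful to check what kind of identifier a script has been given before querying. The sketch below uses the accession pattern published in the UniProt documentation and a simplified pattern for Ensembl stable IDs with the `ENS` prefix; the function name and return labels are assumptions for this example, and species whose IDs do not start with `ENS` are not covered.

```python
# Hypothetical helper (not part of gget): classify an identifier as an
# Ensembl stable ID, a UniProt accession, or (by default) a gene symbol.
import re

# Accession format as documented by UniProt (6 or 10 characters)
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified: ENS + optional species letters + feature letter + 11 digits,
# with an optional version suffix such as ".5"
ENSEMBL_STABLE_ID = re.compile(r"ENS[A-Z]*[GTPE][0-9]{11}(?:\.[0-9]+)?")


def classify_identifier(identifier: str) -> str:
    token = identifier.strip()
    if ENSEMBL_STABLE_ID.fullmatch(token):
        return "ensembl"
    if UNIPROT_ACCESSION.fullmatch(token):
        return "uniprot"
    # Anything else is treated as a gene symbol or free-text name
    return "symbol"
```

A six-character gene symbol can occasionally look like an accession, so the result is best treated as a hint for logging or validation rather than a guarantee.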

### BLAST/BLAT
- Start with default parameters, then optimize
- Use specialized databases (swissprot, refseq_protein) for focused searches
- Consider E-value cutoffs based on query length
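
The link between E-value and query length follows from the standard BLAST statistics, where the expected number of chance hits is E = m · n · 2^(−S') for query length m, database length n, and bit score S'. The sketch below evaluates that relation; the function names are chosen for this example, and raw lengths are used where BLAST itself uses slightly shorter effective lengths, so the numbers are approximate.

```python
# Illustrative only: relation between bit score, search space, and E-value.
# Uses raw lengths; BLAST applies effective-length corrections.
import math


def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E = m * n * 2**(-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_for_evalue(evalue: float, query_length: int, database_length: int) -> float:
    """Bit score needed to reach a given E-value in a given search space."""
    return math.log2(query_length * database_length / evalue)
```

Because the search space m · n grows with the query, doubling the query length requires roughly one extra bit of score to keep the same E-value. Short queries can therefore never reach very low E-values, and a looser cutoff may be appropriate for them.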

### Expression Databases
- Gene symbols are case-sensitive in CELLxGENE
- ARCHS4 correlation data is based on co-expression patterns
- Consider tissue-specificity when interpreting results

### Cancer Databases
- cBioPortal: cache data locally for repeated analyses
- COSMIC: download appropriate database subset for your needs
- Respect license agreements for commercial use

## Citations

When using gget, cite both the gget publication and the underlying databases:

**gget:**
Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836

**Database-specific citations:** Check references/ directory or database websites for appropriate citations.