Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,300 @@
# gget Database Information
Overview of databases queried by gget modules, including update frequencies and important considerations.
## Important Note
The databases queried by gget are continuously being updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary. Always keep gget updated:
```bash
pip install --upgrade gget
```
## Database Directory
### Genomic Reference Databases
#### Ensembl
- **Used by:** gget ref, gget search, gget info, gget seq
- **Description:** Comprehensive genome database with annotations for vertebrate and invertebrate species
- **Update frequency:** Regular releases (numbered); new releases approximately every 3 months
- **Access:** FTP downloads, REST API
- **Website:** https://www.ensembl.org/
- **Notes:**
- Supports both vertebrate and invertebrate genomes
- Can specify release number for reproducibility
- Shortcuts available for common species ('human', 'mouse')
#### UCSC Genome Browser
- **Used by:** gget blat
- **Description:** Genome browser database with BLAT alignment tool
- **Update frequency:** Regular updates with new assemblies
- **Access:** Web service API
- **Website:** https://genome.ucsc.edu/
- **Notes:**
- Multiple genome assemblies available (hg38, mm39, etc.)
- BLAT optimized for vertebrate genomes
### Protein & Structure Databases
#### UniProt
- **Used by:** gget info, gget seq (amino acid sequences), gget elm
- **Description:** Universal Protein Resource, comprehensive protein sequence and functional information
- **Update frequency:** Regular releases (weekly for Swiss-Prot, monthly for TrEMBL)
- **Access:** REST API
- **Website:** https://www.uniprot.org/
- **Notes:**
- Swiss-Prot: manually annotated and reviewed
- TrEMBL: automatically annotated
#### NCBI (National Center for Biotechnology Information)
- **Used by:** gget info, gget bgee (for non-Ensembl species)
- **Description:** Gene and protein databases with extensive cross-references
- **Update frequency:** Continuous updates
- **Access:** E-utilities API
- **Website:** https://www.ncbi.nlm.nih.gov/
- **Databases:** Gene, Protein, RefSeq
#### RCSB PDB (Protein Data Bank)
- **Used by:** gget pdb
- **Description:** Repository of 3D structural data for proteins and nucleic acids
- **Update frequency:** Weekly updates
- **Access:** REST API
- **Website:** https://www.rcsb.org/
- **Notes:**
- Experimentally determined structures (X-ray, NMR, cryo-EM)
- Includes metadata about experiments and publications
#### ELM (Eukaryotic Linear Motif)
- **Used by:** gget elm
- **Description:** Database of functional sites in eukaryotic proteins
- **Update frequency:** Periodic updates
- **Access:** Downloaded database (via gget setup elm)
- **Website:** http://elm.eu.org/
- **Notes:**
- Requires local download before first use
- Contains validated motifs and patterns
### Sequence Similarity Databases
#### BLAST Databases (NCBI)
- **Used by:** gget blast
- **Description:** Pre-formatted databases for BLAST searches
- **Update frequency:** Regular updates
- **Access:** NCBI BLAST API
- **Databases:**
- **Nucleotide:** nt (all GenBank), refseq_rna, pdbnt
- **Protein:** nr (non-redundant), swissprot, pdbaa, refseq_protein
- **Notes:**
- nt and nr are very large databases
- Consider specialized databases for faster, more focused searches
### Expression & Correlation Databases
#### ARCHS4
- **Used by:** gget archs4
- **Description:** Massive mining of publicly available RNA-seq data
- **Update frequency:** Periodic updates with new samples
- **Access:** HTTP API
- **Website:** https://maayanlab.cloud/archs4/
- **Data:**
- Human and mouse RNA-seq data
- Correlation matrices
- Tissue expression atlases
- **Citation:** Lachmann et al., Nature Communications, 2018
#### CZ CELLxGENE Discover
- **Used by:** gget cellxgene
- **Description:** Single-cell RNA-seq data from multiple studies
- **Update frequency:** Continuous additions of new datasets
- **Access:** Census API (via cellxgene-census package)
- **Website:** https://cellxgene.cziscience.com/
- **Data:**
- Single-cell RNA-seq count matrices
- Cell type annotations
- Tissue and disease metadata
- **Notes:**
- Requires gget setup cellxgene
- Gene symbols are case-sensitive
- May not support latest Python versions
#### Bgee
- **Used by:** gget bgee
- **Description:** Gene expression and orthology database
- **Update frequency:** Regular releases
- **Access:** REST API
- **Website:** https://www.bgee.org/
- **Data:**
- Gene expression across tissues and developmental stages
- Orthology relationships across species
- **Citation:** Bastian et al., 2021
### Functional & Pathway Databases
#### Enrichr / modEnrichr
- **Used by:** gget enrichr
- **Description:** Gene set enrichment analysis web service
- **Update frequency:** Regular updates to underlying databases
- **Access:** REST API
- **Website:** https://maayanlab.cloud/Enrichr/
- **Databases included:**
- KEGG pathways
- Gene Ontology (GO)
- Transcription factor targets (ChEA)
- Disease associations (GWAS Catalog)
- Cell type markers (PanglaoDB)
- **Notes:**
- Supports multiple model organisms
- Background gene lists can be provided for custom enrichment
### Disease & Drug Databases
#### Open Targets
- **Used by:** gget opentargets
- **Description:** Integrative platform for disease-target associations
- **Update frequency:** Regular releases (quarterly)
- **Access:** GraphQL API
- **Website:** https://www.opentargets.org/
- **Data:**
- Disease associations
- Drug information and clinical trials
- Target tractability
- Pharmacogenetics
- Gene expression
- DepMap gene-disease effects
- Protein-protein interactions
#### cBioPortal
- **Used by:** gget cbio
- **Description:** Cancer genomics data portal
- **Update frequency:** Continuous addition of new studies
- **Access:** Web API, downloadable datasets
- **Website:** https://www.cbioportal.org/
- **Data:**
- Mutations, copy number alterations, structural variants
- Gene expression
- Clinical data
- **Notes:**
- Large datasets; caching recommended
- Multiple cancer types and studies available
#### COSMIC (Catalogue Of Somatic Mutations In Cancer)
- **Used by:** gget cosmic
- **Description:** Comprehensive cancer mutation database
- **Update frequency:** Regular releases
- **Access:** Download (requires account and license for commercial use)
- **Website:** https://cancer.sanger.ac.uk/cosmic
- **Data:**
- Somatic mutations in cancer
- Gene census
- Cell line data
- Drug resistance mutations
- **Important:**
- Free for academic use
- License fees apply for commercial use
- Requires COSMIC account credentials
- Must download database before querying
### AI & Prediction Services
#### AlphaFold2 (DeepMind)
- **Used by:** gget alphafold
- **Description:** Deep learning model for protein structure prediction
- **Model version:** Simplified version for local execution
- **Access:** Local computation (requires model download via gget setup)
- **Website:** https://alphafold.ebi.ac.uk/
- **Notes:**
- Requires ~4GB model parameters download
- Requires OpenMM installation
- Computationally intensive
- Python version-specific requirements
#### OpenAI API
- **Used by:** gget gpt
- **Description:** Large language model API
- **Update frequency:** New models released periodically
- **Access:** REST API (requires API key)
- **Website:** https://openai.com/
- **Notes:**
- Default model: gpt-3.5-turbo
- Free tier limited to 3 months after account creation
- Set billing limits to control costs
## Data Consistency & Reproducibility
### Version Control
To ensure reproducibility in analyses:
1. **Specify database versions/releases:**
```python
# Use specific Ensembl release
gget.ref("homo_sapiens", release=110)
# Use specific Census version
gget.cellxgene(gene=["PAX7"], census_version="2023-07-25")
```
2. **Document gget version:**
```python
import gget
print(gget.__version__)
```
3. **Save raw data:**
```python
# Always save results for reproducibility
results = gget.search(["ACE2"], species="homo_sapiens")
results.to_csv("search_results_2025-01-15.csv", index=False)
```
### Handling Database Updates
1. **Regular gget updates:**
- Update gget biweekly to match database structure changes
- Check release notes for breaking changes
2. **Error handling:**
- Database structure changes may cause temporary failures
- Check GitHub issues: https://github.com/pachterlab/gget/issues
- Update gget if errors occur
3. **API rate limiting:**
- Implement delays for large-scale queries
- Use local databases (DIAMOND, COSMIC) when possible
- Cache results to avoid repeated queries
## Database-Specific Best Practices
### Ensembl
- Use species shortcuts ('human', 'mouse') for convenience
- Specify release numbers for reproducibility
- Check available species with `gget ref --list_species`
### UniProt
- UniProt IDs are more stable than gene names
- Swiss-Prot annotations are manually curated and more reliable
- Use PDB flag in gget info only when needed (increases runtime)
### BLAST/BLAT
- Start with default parameters, then optimize
- Use specialized databases (swissprot, refseq_protein) for focused searches
- Consider E-value cutoffs based on query length
### Expression Databases
- Gene symbols are case-sensitive in CELLxGENE
- ARCHS4 correlation data is based on co-expression patterns
- Consider tissue-specificity when interpreting results
### Cancer Databases
- cBioPortal: cache data locally for repeated analyses
- COSMIC: download appropriate database subset for your needs
- Respect license agreements for commercial use
## Citations
When using gget, cite both the gget publication and the underlying databases:
**gget:**
Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836
**Database-specific citations:** Check references/ directory or database websites for appropriate citations.