# gget Database Information Overview of databases queried by gget modules, including update frequencies and important considerations. ## Important Note The databases queried by gget are continuously being updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary. Always keep gget updated: ```bash pip install --upgrade gget ``` ## Database Directory ### Genomic Reference Databases #### Ensembl - **Used by:** gget ref, gget search, gget info, gget seq - **Description:** Comprehensive genome database with annotations for vertebrate and invertebrate species - **Update frequency:** Regular releases (numbered); new releases approximately every 3 months - **Access:** FTP downloads, REST API - **Website:** https://www.ensembl.org/ - **Notes:** - Supports both vertebrate and invertebrate genomes - Can specify release number for reproducibility - Shortcuts available for common species ('human', 'mouse') #### UCSC Genome Browser - **Used by:** gget blat - **Description:** Genome browser database with BLAT alignment tool - **Update frequency:** Regular updates with new assemblies - **Access:** Web service API - **Website:** https://genome.ucsc.edu/ - **Notes:** - Multiple genome assemblies available (hg38, mm39, etc.) - BLAT optimized for vertebrate genomes ### Protein & Structure Databases #### UniProt - **Used by:** gget info, gget seq (amino acid sequences), gget elm - **Description:** Universal Protein Resource, comprehensive protein sequence and functional information - **Update frequency:** Regular releases (weekly for Swiss-Prot, monthly for TrEMBL) - **Access:** REST API - **Website:** https://www.uniprot.org/ - **Notes:** - Swiss-Prot: manually annotated and reviewed - TrEMBL: automatically annotated #### NCBI (National Center for Biotechnology Information) - **Used by:** gget info, gget bgee (for non-Ensembl species) - **Description:** Gene and protein databases with extensive cross-references - **Update frequency:** Continuous updates - **Access:** E-utilities API - **Website:** https://www.ncbi.nlm.nih.gov/ - **Databases:** Gene, Protein, RefSeq #### RCSB PDB (Protein Data Bank) - **Used by:** gget pdb - **Description:** Repository of 3D structural data for proteins and nucleic acids - **Update frequency:** Weekly updates - **Access:** REST API - **Website:** https://www.rcsb.org/ - **Notes:** - Experimentally determined structures (X-ray, NMR, cryo-EM) - Includes metadata about experiments and publications #### ELM (Eukaryotic Linear Motif) - **Used by:** gget elm - **Description:** Database of functional sites in eukaryotic proteins - **Update frequency:** Periodic updates - **Access:** Downloaded database (via gget setup elm) - **Website:** http://elm.eu.org/ - **Notes:** - Requires local download before first use - Contains validated motifs and patterns ### Sequence Similarity Databases #### BLAST Databases (NCBI) - **Used by:** gget blast - **Description:** Pre-formatted databases for BLAST searches - **Update frequency:** Regular updates - **Access:** NCBI BLAST API - **Databases:** - **Nucleotide:** nt (all GenBank), refseq_rna, pdbnt - **Protein:** nr (non-redundant), swissprot, pdbaa, refseq_protein - **Notes:** - nt and nr are very large databases - Consider specialized databases for faster, more focused searches ### Expression & Correlation Databases #### ARCHS4 - **Used by:** gget archs4 - **Description:** Massive mining of publicly available RNA-seq data - **Update frequency:** Periodic updates with new samples - **Access:** HTTP API - **Website:** https://maayanlab.cloud/archs4/ - **Data:** - Human and mouse RNA-seq data - Correlation matrices - Tissue expression atlases - **Citation:** Lachmann et al., Nature Communications, 2018 #### CZ CELLxGENE Discover - **Used by:** gget cellxgene - **Description:** Single-cell RNA-seq data from multiple studies - **Update frequency:** Continuous additions of new datasets - **Access:** Census API (via cellxgene-census package) - **Website:** https://cellxgene.cziscience.com/ - **Data:** - Single-cell RNA-seq count matrices - Cell type annotations - Tissue and disease metadata - **Notes:** - Requires gget setup cellxgene - Gene symbols are case-sensitive - May not support latest Python versions #### Bgee - **Used by:** gget bgee - **Description:** Gene expression and orthology database - **Update frequency:** Regular releases - **Access:** REST API - **Website:** https://www.bgee.org/ - **Data:** - Gene expression across tissues and developmental stages - Orthology relationships across species - **Citation:** Bastian et al., 2021 ### Functional & Pathway Databases #### Enrichr / modEnrichr - **Used by:** gget enrichr - **Description:** Gene set enrichment analysis web service - **Update frequency:** Regular updates to underlying databases - **Access:** REST API - **Website:** https://maayanlab.cloud/Enrichr/ - **Databases included:** - KEGG pathways - Gene Ontology (GO) - Transcription factor targets (ChEA) - Disease associations (GWAS Catalog) - Cell type markers (PanglaoDB) - **Notes:** - Supports multiple model organisms - Background gene lists can be provided for custom enrichment ### Disease & Drug Databases #### Open Targets - **Used by:** gget opentargets - **Description:** Integrative platform for disease-target associations - **Update frequency:** Regular releases (quarterly) - **Access:** GraphQL API - **Website:** https://www.opentargets.org/ - **Data:** - Disease associations - Drug information and clinical trials - Target tractability - Pharmacogenetics - Gene expression - DepMap gene-disease effects - Protein-protein interactions #### cBioPortal - **Used by:** gget cbio - **Description:** Cancer genomics data portal - **Update frequency:** Continuous addition of new studies - **Access:** Web API, downloadable datasets - **Website:** https://www.cbioportal.org/ - **Data:** - Mutations, copy number alterations, structural variants - Gene expression - Clinical data - **Notes:** - Large datasets; caching recommended - Multiple cancer types and studies available #### COSMIC (Catalogue Of Somatic Mutations In Cancer) - **Used by:** gget cosmic - **Description:** Comprehensive cancer mutation database - **Update frequency:** Regular releases - **Access:** Download (requires account and license for commercial use) - **Website:** https://cancer.sanger.ac.uk/cosmic - **Data:** - Somatic mutations in cancer - Gene census - Cell line data - Drug resistance mutations - **Important:** - Free for academic use - License fees apply for commercial use - Requires COSMIC account credentials - Must download database before querying ### AI & Prediction Services #### AlphaFold2 (DeepMind) - **Used by:** gget alphafold - **Description:** Deep learning model for protein structure prediction - **Model version:** Simplified version for local execution - **Access:** Local computation (requires model download via gget setup) - **Website:** https://alphafold.ebi.ac.uk/ - **Notes:** - Requires ~4GB model parameters download - Requires OpenMM installation - Computationally intensive - Python version-specific requirements #### OpenAI API - **Used by:** gget gpt - **Description:** Large language model API - **Update frequency:** New models released periodically - **Access:** REST API (requires API key) - **Website:** https://openai.com/ - **Notes:** - Default model: gpt-3.5-turbo - Free tier limited to 3 months after account creation - Set billing limits to control costs ## Data Consistency & Reproducibility ### Version Control To ensure reproducibility in analyses: 1. **Specify database versions/releases:** ```python # Use specific Ensembl release gget.ref("homo_sapiens", release=110) # Use specific Census version gget.cellxgene(gene=["PAX7"], census_version="2023-07-25") ``` 2. **Document gget version:** ```python import gget print(gget.__version__) ``` 3. **Save raw data:** ```python # Always save results for reproducibility results = gget.search(["ACE2"], species="homo_sapiens") results.to_csv("search_results_2025-01-15.csv", index=False) ``` ### Handling Database Updates 1. **Regular gget updates:** - Update gget biweekly to match database structure changes - Check release notes for breaking changes 2. **Error handling:** - Database structure changes may cause temporary failures - Check GitHub issues: https://github.com/pachterlab/gget/issues - Update gget if errors occur 3. **API rate limiting:** - Implement delays for large-scale queries - Use local databases (DIAMOND, COSMIC) when possible - Cache results to avoid repeated queries ## Database-Specific Best Practices ### Ensembl - Use species shortcuts ('human', 'mouse') for convenience - Specify release numbers for reproducibility - Check available species with `gget ref --list_species` ### UniProt - UniProt IDs are more stable than gene names - Swiss-Prot annotations are manually curated and more reliable - Use PDB flag in gget info only when needed (increases runtime) ### BLAST/BLAT - Start with default parameters, then optimize - Use specialized databases (swissprot, refseq_protein) for focused searches - Consider E-value cutoffs based on query length ### Expression Databases - Gene symbols are case-sensitive in CELLxGENE - ARCHS4 correlation data is based on co-expression patterns - Consider tissue-specificity when interpreting results ### Cancer Databases - cBioPortal: cache data locally for repeated analyses - COSMIC: download appropriate database subset for your needs - Respect license agreements for commercial use ## Citations When using gget, cite both the gget publication and the underlying databases: **gget:** Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836 **Database-specific citations:** Check references/ directory or database websites for appropriate citations.