zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

gget Database Information

Overview of databases queried by gget modules, including update frequencies and important considerations.

Important Note

The databases queried by gget are continuously being updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary. Always keep gget updated:

pip install --upgrade gget

Database Directory

Genomic Reference Databases

Ensembl

Used by: gget ref, gget search, gget info, gget seq
Description: Comprehensive genome database with annotations for vertebrate and invertebrate species
Update frequency: Regular releases (numbered); new releases approximately every 3 months
Access: FTP downloads, REST API
Website: https://www.ensembl.org/
Notes:
- Supports both vertebrate and invertebrate genomes
- Can specify release number for reproducibility
- Shortcuts available for common species ('human', 'mouse')

UCSC Genome Browser

Used by: gget blat
Description: Genome browser database with BLAT alignment tool
Update frequency: Regular updates with new assemblies
Access: Web service API
Website: https://genome.ucsc.edu/
Notes:
- Multiple genome assemblies available (hg38, mm39, etc.)
- BLAT optimized for vertebrate genomes

Protein & Structure Databases

UniProt

Used by: gget info, gget seq (amino acid sequences), gget elm
Description: Universal Protein Resource, comprehensive protein sequence and functional information
Update frequency: Regular numbered releases (roughly every 8 weeks, covering both Swiss-Prot and TrEMBL)
Access: REST API
Website: https://www.uniprot.org/
Notes:
- Swiss-Prot: manually annotated and reviewed
- TrEMBL: automatically annotated

NCBI (National Center for Biotechnology Information)

Used by: gget info, gget bgee (for non-Ensembl species)
Description: Gene and protein databases with extensive cross-references
Update frequency: Continuous updates
Access: E-utilities API
Website: https://www.ncbi.nlm.nih.gov/
Databases: Gene, Protein, RefSeq

RCSB PDB (Protein Data Bank)

Used by: gget pdb
Description: Repository of 3D structural data for proteins and nucleic acids
Update frequency: Weekly updates
Access: REST API
Website: https://www.rcsb.org/
Notes:
- Experimentally determined structures (X-ray, NMR, cryo-EM)
- Includes metadata about experiments and publications

ELM (Eukaryotic Linear Motif)

Used by: gget elm
Description: Database of functional sites in eukaryotic proteins
Update frequency: Periodic updates
Access: Downloaded database (via gget setup elm)
Website: http://elm.eu.org/
Notes:
- Requires local download before first use
- Contains validated motifs and patterns

Sequence Similarity Databases

BLAST Databases (NCBI)

Used by: gget blast
Description: Pre-formatted databases for BLAST searches
Update frequency: Regular updates
Access: NCBI BLAST API
Databases:
- Nucleotide: nt (all GenBank), refseq_rna, pdbnt
- Protein: nr (non-redundant), swissprot, pdbaa, refseq_protein
Notes:
- nt and nr are very large databases
- Consider specialized databases for faster, more focused searches

Expression & Correlation Databases

ARCHS4

Used by: gget archs4
Description: Massive mining of publicly available RNA-seq data
Update frequency: Periodic updates with new samples
Access: HTTP API
Website: https://maayanlab.cloud/archs4/
Data:
- Human and mouse RNA-seq data
- Correlation matrices
- Tissue expression atlases
Citation: Lachmann et al., Nature Communications, 2018

CZ CELLxGENE Discover

Used by: gget cellxgene
Description: Single-cell RNA-seq data from multiple studies
Update frequency: Continuous additions of new datasets
Access: Census API (via cellxgene-census package)
Website: https://cellxgene.cziscience.com/
Data:
- Single-cell RNA-seq count matrices
- Cell type annotations
- Tissue and disease metadata
Notes:
- Requires gget setup cellxgene
- Gene symbols are case-sensitive
- May not support latest Python versions
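Because gene symbols are case-sensitive, a query for `pax7` will not match `PAX7`. One way to guard against this is to resolve user input against a trusted list of symbols, ignoring case, before querying. The following is a minimal sketch under that assumption: `canonicalize_symbols` and the `human_known` list are hypothetical and not part of gget, and in practice the reference list would come from whichever annotation you rely on.

```python
def canonicalize_symbols(symbols, known_symbols):
    """Map user-supplied symbols to their canonical spelling, ignoring case.

    Returns (resolved, unresolved). Hypothetical helper, not part of gget.
    """
    # Build a case-insensitive index of the canonical spellings.
    index = {}
    for name in known_symbols:
        index.setdefault(name.casefold(), name)

    resolved, unresolved = [], []
    for symbol in symbols:
        match = index.get(symbol.strip().casefold())
        if match is None:
            unresolved.append(symbol)
        else:
            resolved.append(match)
    return resolved, unresolved


# Illustrative reference list (an assumption, not real query output).
human_known = ["PAX7", "ACE2", "TP53"]
resolved, unresolved = canonicalize_symbols(["pax7", "Ace2", "notagene"], human_known)
print(resolved)    # ['PAX7', 'ACE2']
print(unresolved)  # ['notagene']
```

Reporting unresolved symbols separately, instead of silently dropping them, makes typos visible before an empty result is mistaken for a biological finding.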

Bgee

Used by: gget bgee
Description: Gene expression and orthology database
Update frequency: Regular releases
Access: REST API
Website: https://www.bgee.org/
Data:
- Gene expression across tissues and developmental stages
- Orthology relationships across species
Citation: Bastian et al., 2021

Functional & Pathway Databases

Enrichr / modEnrichr

Used by: gget enrichr
Description: Gene set enrichment analysis web service
Update frequency: Regular updates to underlying databases
Access: REST API
Website: https://maayanlab.cloud/Enrichr/
Databases included:
- KEGG pathways
- Gene Ontology (GO)
- Transcription factor targets (ChEA)
- Disease associations (GWAS Catalog)
- Cell type markers (PanglaoDB)
Notes:
- Supports multiple model organisms
- Background gene lists can be provided for custom enrichment

Disease & Drug Databases

Open Targets

Used by: gget opentargets
Description: Integrative platform for disease-target associations
Update frequency: Regular releases (quarterly)
Access: GraphQL API
Website: https://www.opentargets.org/
Data:
- Disease associations
- Drug information and clinical trials
- Target tractability
- Pharmacogenetics
- Gene expression
- DepMap gene-disease effects
- Protein-protein interactions

cBioPortal

Used by: gget cbio
Description: Cancer genomics data portal
Update frequency: Continuous addition of new studies
Access: Web API, downloadable datasets
Website: https://www.cbioportal.org/
Data:
- Mutations, copy number alterations, structural variants
- Gene expression
- Clinical data
Notes:
- Large datasets; caching recommended
- Multiple cancer types and studies available

COSMIC (Catalogue Of Somatic Mutations In Cancer)

Used by: gget cosmic
Description: Comprehensive cancer mutation database
Update frequency: Regular releases
Access: Download (requires account and license for commercial use)
Website: https://cancer.sanger.ac.uk/cosmic
Data:
- Somatic mutations in cancer
- Gene census
- Cell line data
- Drug resistance mutations
Important:
- Free for academic use
- License fees apply for commercial use
- Requires COSMIC account credentials
- Must download database before querying

AI & Prediction Services

AlphaFold2 (DeepMind)

Used by: gget alphafold
Description: Deep learning model for protein structure prediction
Model version: Simplified version for local execution
Access: Local computation (requires model download via gget setup)
Website: https://alphafold.ebi.ac.uk/
Notes:
- Requires ~4GB model parameters download
- Requires OpenMM installation
- Computationally intensive
- Python version-specific requirements

OpenAI API

Used by: gget gpt
Description: Large language model API
Update frequency: New models released periodically
Access: REST API (requires API key)
Website: https://openai.com/
Notes:
- Default model: gpt-3.5-turbo
- Free tier limited to 3 months after account creation
- Set billing limits to control costs

Data Consistency & Reproducibility

Version Control

To ensure reproducibility in analyses:

Specify database versions/releases:

import gget

# Use a specific Ensembl release
gget.ref("homo_sapiens", release=110)

# Use a specific Census version
gget.cellxgene(gene=["PAX7"], census_version="2023-07-25")

Document gget version:
```
import gget
print(gget.__version__)
```

Save raw data:

import gget

# Always save results for reproducibility
results = gget.search(["ACE2"], species="homo_sapiens")
results.to_csv("search_results_2025-01-15.csv", index=False)
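A saved result file is easier to reuse when it travels with a record of how it was produced. The sketch below shows one possible approach: writing a small JSON sidecar that stores the query parameters, the retrieval date, and the installed package version. The helper name `write_provenance` and the sidecar naming scheme are assumptions for illustration, not part of gget.

```python
import json
from datetime import date
from importlib import metadata
from pathlib import Path


def write_provenance(result_path, query, package="gget"):
    """Write a JSON sidecar next to a saved result file.

    Hypothetical helper: records the query parameters, the retrieval date,
    and the installed package version (None if the package is not installed).
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None

    result_path = Path(result_path)
    record = {
        "result_file": result_path.name,
        "retrieved_on": date.today().isoformat(),
        "package": package,
        "package_version": version,
        "query": query,
    }
    # e.g. results.csv -> results.csv.provenance.json
    sidecar = result_path.with_name(result_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Calling `write_provenance("search_results_2025-01-15.csv", {"module": "search", "searchwords": ["ACE2"], "species": "homo_sapiens"})` right after saving the CSV would place the record alongside it. Adding the Ensembl release or Census version to the `query` dictionary keeps the database version with the data.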

Handling Database Updates

Regular gget updates:
- Update gget biweekly to match database structure changes
- Check release notes for breaking changes
Error handling:
- Database structure changes may cause temporary failures
- Check GitHub issues: https://github.com/pachterlab/gget/issues
- Update gget if errors occur
API rate limiting:
- Implement delays for large-scale queries
- Use local alternatives when possible (DIAMOND alignment via gget diamond, a downloaded COSMIC database)
- Cache results to avoid repeated queries
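The points above on temporary failures, delays, and caching can be combined in a small wrapper. The following is a minimal sketch, not gget functionality: the decorator name `polite`, its parameters, and the default intervals are all assumptions chosen for illustration. It caches results in memory, enforces a minimum spacing between uncached calls, and retries failed calls with growing waits.

```python
import functools
import time


def polite(min_interval=1.0, retries=3, backoff=2.0,
           sleep=time.sleep, clock=time.monotonic):
    """Decorator: cache results, space out calls, and retry failures.

    Hypothetical helper, not part of gget. Arguments of the wrapped
    function must be hashable because they are used as cache keys.
    """
    def decorator(func):
        cache = {}
        last_call = [None]  # time of the most recent uncached call

        @functools.wraps(func)
        def wrapper(*args):
            if args in cache:
                return cache[args]  # repeated query: no new request

            for attempt in range(retries + 1):
                # Wait until min_interval has passed since the last call.
                if last_call[0] is not None:
                    wait = min_interval - (clock() - last_call[0])
                    if wait > 0:
                        sleep(wait)
                last_call[0] = clock()
                try:
                    result = func(*args)
                except Exception:
                    if attempt == retries:
                        raise  # give up after the final attempt
                    sleep(backoff ** attempt)  # 1 s, 2 s, 4 s, ...
                else:
                    cache[args] = result
                    return result

        return wrapper
    return decorator


@polite(min_interval=1.0)
def fetch_info(ensembl_id):
    # Imported lazily so the wrapper itself does not depend on gget.
    import gget
    return gget.info([ensembl_id])
```

The in-memory cache only lasts for the current session. For repeated analyses across sessions, writing results to disk is the more durable option. If errors persist after the retries, a changed database structure is the more likely cause, and updating gget is the next step.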

Database-Specific Best Practices

Ensembl

Use species shortcuts ('human', 'mouse') for convenience
Specify release numbers for reproducibility
Check available species with gget ref --list_species

UniProt

UniProt IDs are more stable than gene names
Swiss-Prot annotations are manually curated and more reliable
Use PDB flag in gget info only when needed (increases runtime)

BLAST/BLAT

Start with default parameters, then optimize
Use specialized databases (swissprot, refseq_protein) for focused searches
Consider E-value cutoffs based on query length

Expression Databases

Gene symbols are case-sensitive in CELLxGENE
ARCHS4 correlation data is based on co-expression patterns
Consider tissue-specificity when interpreting results

Cancer Databases

cBioPortal: cache data locally for repeated analyses
COSMIC: download appropriate database subset for your needs
Respect license agreements for commercial use

Citations

When using gget, cite both the gget publication and the underlying databases:

gget: Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836

Database-specific citations: Check references/ directory or database websites for appropriate citations.
