Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,867 @@
# Biomni Use Cases and Examples
Comprehensive examples demonstrating biomni across biomedical research domains.
## Table of Contents
1. [CRISPR Screening and Gene Editing](#crispr-screening-and-gene-editing)
2. [Single-Cell RNA-seq Analysis](#single-cell-rna-seq-analysis)
3. [Drug Discovery and ADMET](#drug-discovery-and-admet)
4. [GWAS and Genetic Analysis](#gwas-and-genetic-analysis)
5. [Clinical Genomics and Diagnostics](#clinical-genomics-and-diagnostics)
6. [Protein Structure and Function](#protein-structure-and-function)
7. [Literature and Knowledge Synthesis](#literature-and-knowledge-synthesis)
8. [Multi-Omics Integration](#multi-omics-integration)
---
## CRISPR Screening and Gene Editing
### Example 1: Genome-Wide CRISPR Screen Design
**Task:** Design a CRISPR knockout screen to identify genes regulating autophagy.
```python
from biomni.agent import A1
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Design a genome-wide CRISPR knockout screen to identify genes regulating
autophagy in HEK293 cells.
Requirements:
1. Generate comprehensive sgRNA library targeting all protein-coding genes
2. Design 4 sgRNAs per gene with optimal on-target and minimal off-target scores
3. Include positive controls (known autophagy regulators: ATG5, BECN1, ULK1)
4. Include negative controls (non-targeting sgRNAs)
5. Prioritize genes based on:
- Existing autophagy pathway annotations
- Protein-protein interactions with known autophagy factors
- Expression levels in HEK293 cells
6. Output sgRNA sequences, scores, and gene prioritization rankings
Provide analysis as Python code and interpret results.
""")
agent.save_conversation_history("autophagy_screen_design.pdf")
```
**Expected Output:**
- sgRNA library with ~80,000 guides (4 per gene × ~20,000 genes)
- On-target and off-target scores for each sgRNA
- Prioritized gene list based on pathway enrichment
- Quality control metrics for library design
### Example 2: CRISPR Off-Target Prediction
```python
result = agent.go("""
Analyze potential off-target effects for this sgRNA sequence:
GCTGAAGATCCAGTTCGATG
Tasks:
1. Identify all genomic locations with ≤3 mismatches
2. Score each potential off-target site
3. Assess likelihood of cleavage at off-target sites
4. Recommend whether sgRNA is suitable for use
5. If unsuitable, suggest alternative sgRNAs for the same gene
""")
```
### Example 3: Screen Hit Analysis
```python
result = agent.go("""
Analyze CRISPR screen results from autophagy phenotype screen.
Input file: screen_results.csv
Columns: sgRNA_ID, gene, log2_fold_change, p_value, FDR
Tasks:
1. Identify significant hits (FDR < 0.05, |LFC| > 1.5)
2. Perform gene ontology enrichment on hit genes
3. Map hits to known autophagy pathways
4. Identify novel candidates not previously linked to autophagy
5. Predict functional relationships between hit genes
6. Generate visualization of hit genes in pathway context
""")
```
---
## Single-Cell RNA-seq Analysis
### Example 1: Cell Type Annotation
**Task:** Analyze single-cell RNA-seq data and annotate cell populations.
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Analyze single-cell RNA-seq dataset from human PBMC sample.
File: pbmc_data.h5ad (10X Genomics format)
Workflow:
1. Quality control:
- Filter cells with <200 or >5000 detected genes
- Remove cells with >20% mitochondrial content
- Filter genes detected in <3 cells
2. Normalization and preprocessing:
- Normalize to 10,000 reads per cell
- Log-transform
- Identify highly variable genes
- Scale data
3. Dimensionality reduction:
- PCA (50 components)
- UMAP visualization
4. Clustering:
- Leiden algorithm with resolution=0.8
- Identify cluster markers (Wilcoxon rank-sum test)
5. Cell type annotation:
- Annotate clusters using marker genes:
* T cells (CD3D, CD3E)
* B cells (CD79A, MS4A1)
* NK cells (GNLY, NKG7)
* Monocytes (CD14, LYZ)
* Dendritic cells (FCER1A, CST3)
6. Generate UMAP plots with annotations and export results
""")
agent.save_conversation_history("pbmc_scrna_analysis.pdf")
```
### Example 2: Differential Expression Analysis
```python
result = agent.go("""
Perform differential expression analysis between conditions in scRNA-seq data.
Data: pbmc_treated_vs_control.h5ad
Conditions: treated (drug X) vs control
Tasks:
1. Identify differentially expressed genes for each cell type
2. Use statistical tests appropriate for scRNA-seq (MAST or Wilcoxon)
3. Apply multiple testing correction (Benjamini-Hochberg)
4. Threshold: |log2FC| > 0.5, adjusted p < 0.05
5. Perform pathway enrichment on DE genes per cell type
6. Identify cell-type-specific drug responses
7. Generate volcano plots and heatmaps
""")
```
### Example 3: Trajectory Analysis
```python
result = agent.go("""
Perform pseudotime trajectory analysis on differentiation dataset.
Data: hematopoiesis_scrna.h5ad
Starting population: Hematopoietic stem cells (HSCs)
Analysis:
1. Subset to hematopoietic lineages
2. Compute diffusion map or PAGA for trajectory inference
3. Order cells along pseudotime
4. Identify genes with dynamic expression along trajectory
5. Cluster genes by expression patterns
6. Map trajectories to known differentiation pathways
7. Visualize key transcription factors driving differentiation
""")
```
---
## Drug Discovery and ADMET
### Example 1: ADMET Property Prediction
**Task:** Predict ADMET properties for drug candidates.
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Predict ADMET properties for these drug candidates:
Compounds (SMILES format):
1. CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5
2. CN1CCN(CC1)C2=C(C=C3C(=C2)N=CN=C3NC4=CC=C(C=C4)F)OC
3. CC(C)(C)NC(=O)N(CC1=CC=CC=C1)C2CCN(CC2)C(=O)C3=CC4=C(C=C3)OCO4
For each compound, predict:
**Absorption:**
- Caco-2 permeability (cm/s)
- Human intestinal absorption (HIA %)
- Oral bioavailability
**Distribution:**
- Plasma protein binding (%)
- Blood-brain barrier penetration (BBB+/-)
- Volume of distribution (L/kg)
**Metabolism:**
- CYP450 substrate/inhibitor predictions (2D6, 3A4, 2C9, 2C19)
- Metabolic stability (T1/2)
**Excretion:**
- Clearance (mL/min/kg)
- Half-life (hours)
**Toxicity:**
- hERG IC50 (cardiotoxicity risk)
- Hepatotoxicity prediction
- Ames mutagenicity
- LD50 estimates
Provide predictions with confidence scores and flag any red flags.
""")
agent.save_conversation_history("admet_predictions.pdf")
```
### Example 2: Target Identification
```python
result = agent.go("""
Identify potential protein targets for Alzheimer's disease drug development.
Tasks:
1. Query GWAS data for Alzheimer's-associated genes
2. Identify genes with druggable domains (kinases, GPCRs, ion channels, etc.)
3. Check for brain expression patterns
4. Assess disease relevance via literature mining
5. Evaluate existing chemical probe availability
6. Rank targets by:
- Genetic evidence strength
- Druggability
- Lack of existing therapies
7. Suggest target validation experiments
""")
```
### Example 3: Virtual Screening
```python
result = agent.go("""
Perform virtual screening for EGFR kinase inhibitors.
Database: ZINC15 lead-like subset (~6M compounds)
Target: EGFR kinase domain (PDB: 1M17)
Workflow:
1. Prepare protein structure (remove waters, add hydrogens)
2. Define binding pocket (based on erlotinib binding site)
3. Generate pharmacophore model from known EGFR inhibitors
4. Filter ZINC database by:
- Molecular weight: 200-500 Da
- LogP: 0-5
- Lipinski's rule of five
- Pharmacophore match
5. Dock top 10,000 compounds
6. Score by docking energy and predicted binding affinity
7. Select top 100 for further analysis
8. Predict ADMET properties for top hits
9. Recommend top 10 compounds for experimental validation
""")
```
---
## GWAS and Genetic Analysis
### Example 1: GWAS Summary Statistics Analysis
**Task:** Interpret GWAS results and identify causal genes.
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Analyze GWAS summary statistics for Type 2 Diabetes.
Input file: t2d_gwas_summary.txt
Columns: CHR, BP, SNP, P, OR, BETA, SE, A1, A2
Analysis steps:
1. Identify genome-wide significant variants (P < 5e-8)
2. Perform LD clumping to identify independent signals
3. Map variants to genes using:
- Nearest gene
- eQTL databases (GTEx)
- Hi-C chromatin interactions
4. Prioritize causal genes using multiple evidence:
- Fine-mapping scores
- Coding variant consequences
- Gene expression in relevant tissues (pancreas, liver, adipose)
- Pathway enrichment
5. Identify druggable targets among causal genes
6. Compare with known T2D genes and highlight novel associations
7. Generate Manhattan plot, QQ plot, and gene prioritization table
""")
agent.save_conversation_history("t2d_gwas_analysis.pdf")
```
### Example 2: Polygenic Risk Score
```python
result = agent.go("""
Develop and validate polygenic risk score (PRS) for coronary artery disease (CAD).
Training GWAS: CAD_discovery_summary_stats.txt (N=180,000)
Validation cohort: CAD_validation_genotypes.vcf (N=50,000)
Tasks:
1. Select variants for PRS using p-value thresholding (P < 1e-5)
2. Perform LD clumping (r² < 0.1, 500kb window)
3. Calculate PRS weights from GWAS betas
4. Compute PRS for validation cohort individuals
5. Evaluate PRS performance:
- AUC for CAD case/control discrimination
- Odds ratios across PRS deciles
- Compare to traditional risk factors (age, sex, BMI, smoking)
6. Assess PRS calibration and create risk stratification plot
7. Identify high-risk individuals (top 5% PRS)
""")
```
### Example 3: Variant Pathogenicity Prediction
```python
result = agent.go("""
Predict pathogenicity of rare coding variants in candidate disease genes.
Variants (VCF format):
- chr17:41234451:A>G (BRCA1 p.Arg1347Gly)
- chr2:179428448:C>T (TTN p.Trp13579*)
- chr7:117188679:G>A (CFTR p.Gly542Ser)
For each variant, assess:
1. In silico predictions (SIFT, PolyPhen2, CADD, REVEL)
2. Population frequency (gnomAD)
3. Evolutionary conservation (PhyloP, PhastCons)
4. Protein structure impact (using AlphaFold structures)
5. Functional domain location
6. ClinVar annotations (if available)
7. Literature evidence
8. ACMG/AMP classification criteria
Provide pathogenicity classification (benign, likely benign, VUS, likely pathogenic, pathogenic) with supporting evidence.
""")
```
---
## Clinical Genomics and Diagnostics
### Example 1: Rare Disease Diagnosis
**Task:** Diagnose rare genetic disease from whole exome sequencing.
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Analyze whole exome sequencing (WES) data for rare disease diagnosis.
Patient phenotypes (HPO terms):
- HP:0001250 (Seizures)
- HP:0001249 (Intellectual disability)
- HP:0001263 (Global developmental delay)
- HP:0001252 (Hypotonia)
VCF file: patient_trio.vcf (proband + parents)
Analysis workflow:
1. Variant filtering:
- Quality filters (QUAL > 30, DP > 10, GQ > 20)
- Frequency filters (gnomAD AF < 0.01)
- Functional impact (missense, nonsense, frameshift, splice site)
2. Inheritance pattern analysis:
- De novo variants
- Autosomal recessive (compound het, homozygous)
- X-linked
3. Phenotype-driven prioritization:
- Match candidate genes to HPO terms
- Use HPO-gene associations
- Check gene expression in relevant tissues (brain)
4. Variant pathogenicity assessment:
- In silico predictions
- ACMG classification
- Literature evidence
5. Generate diagnostic report with:
- Top candidate variants
- Supporting evidence
- Functional validation suggestions
- Genetic counseling recommendations
""")
agent.save_conversation_history("rare_disease_diagnosis.pdf")
```
### Example 2: Cancer Genomics Analysis
```python
result = agent.go("""
Analyze tumor-normal paired sequencing for cancer genomics.
Files:
- tumor_sample.vcf (somatic variants)
- tumor_rnaseq.bam (gene expression)
- tumor_cnv.seg (copy number variants)
Analysis:
1. Identify driver mutations:
- Known cancer genes (COSMIC, OncoKB)
- Recurrent hotspot mutations
- Truncating mutations in tumor suppressors
2. Analyze mutational signatures:
- Decompose signatures (COSMIC signatures)
- Identify mutagenic processes
3. Copy number analysis:
- Identify amplifications and deletions
- Focal vs. arm-level events
- Assess oncogene amplifications and TSG deletions
4. Gene expression analysis:
- Identify outlier gene expression
- Fusion transcript detection
- Pathway dysregulation
5. Therapeutic implications:
- Match alterations to FDA-approved therapies
- Identify clinical trial opportunities
- Predict response to targeted therapies
6. Generate precision oncology report
""")
```
### Example 3: Pharmacogenomics
```python
result = agent.go("""
Generate pharmacogenomics report for patient genotype data.
VCF file: patient_pgx.vcf
Analyze variants affecting drug metabolism:
**CYP450 genes:**
- CYP2D6 (affects ~25% of drugs)
- CYP2C19 (clopidogrel, PPIs, antidepressants)
- CYP2C9 (warfarin, NSAIDs)
- CYP3A5 (tacrolimus, immunosuppressants)
**Drug transporter genes:**
- SLCO1B1 (statin myopathy risk)
- ABCB1 (P-glycoprotein)
**Drug targets:**
- VKORC1 (warfarin dosing)
- DPYD (fluoropyrimidine toxicity)
- TPMT (thiopurine toxicity)
For each gene:
1. Determine diplotype (*1/*1, *1/*2, etc.)
2. Assign metabolizer phenotype (PM, IM, NM, RM, UM)
3. Provide dosing recommendations using CPIC/PharmGKB guidelines
4. Flag high-risk drug-gene interactions
5. Suggest alternative medications if needed
Generate patient-friendly report with actionable recommendations.
""")
```
---
## Protein Structure and Function
### Example 1: AlphaFold Structure Analysis
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Analyze AlphaFold structure prediction for novel protein.
Protein: Hypothetical protein ABC123 (UniProt: Q9XYZ1)
Tasks:
1. Retrieve AlphaFold structure from database
2. Assess prediction quality:
- pLDDT scores per residue
- Identify high-confidence regions (pLDDT > 90)
- Flag low-confidence regions (pLDDT < 50)
3. Structural analysis:
- Identify domains using structural alignment
- Predict fold family
- Identify secondary structure elements
4. Functional prediction:
- Search for structural homologs in PDB
- Identify conserved functional sites
- Predict binding pockets
- Suggest possible ligands/substrates
5. Variant impact analysis:
- Map disease-associated variants to structure
- Predict structural consequences
- Identify variants affecting binding sites
6. Generate PyMOL visualization scripts highlighting key features
""")
agent.save_conversation_history("alphafold_analysis.pdf")
```
### Example 2: Protein-Protein Interaction Prediction
```python
result = agent.go("""
Predict and analyze protein-protein interactions for autophagy pathway.
Query proteins: ATG5, ATG12, ATG16L1
Analysis:
1. Retrieve known interactions from:
- STRING database
- BioGRID
- IntAct
- Literature mining
2. Predict novel interactions using:
- Structural modeling (AlphaFold-Multimer)
- Coexpression analysis
- Phylogenetic profiling
3. Analyze interaction interfaces:
- Identify binding residues
- Assess interface properties (area, hydrophobicity)
- Predict binding affinity
4. Functional analysis:
- Map interactions to autophagy pathway steps
- Identify regulatory interactions
- Predict complex stoichiometry
5. Therapeutic implications:
- Identify druggable interfaces
- Suggest peptide inhibitors
- Design disruption strategies
Generate network visualization and interaction details.
""")
```
---
## Literature and Knowledge Synthesis
### Example 1: Systematic Literature Review
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Perform systematic literature review on CRISPR base editing applications.
Search query: "CRISPR base editing" OR "base editor" OR "CBE" OR "ABE"
Date range: 2016-2025
Tasks:
1. Search PubMed and retrieve relevant abstracts
2. Filter for original research articles
3. Extract key information:
- Base editor type (CBE, ABE, dual)
- Target organism/cell type
- Application (disease model, therapy, crop improvement)
- Editing efficiency
- Off-target assessment
4. Categorize applications:
- Therapeutic applications (by disease)
- Agricultural applications
- Basic research
5. Analyze trends:
- Publications over time
- Most studied diseases
- Evolution of base editor technology
6. Synthesize findings:
- Clinical trial status
- Remaining challenges
- Future directions
Generate comprehensive review document with citation statistics.
""")
agent.save_conversation_history("crispr_base_editing_review.pdf")
```
### Example 2: Gene Function Synthesis
```python
result = agent.go("""
Synthesize knowledge about gene function from multiple sources.
Target gene: PARK7 (DJ-1)
Integrate information from:
1. **Genetic databases:**
- NCBI Gene
- UniProt
- OMIM
2. **Expression data:**
- GTEx tissue expression
- Human Protein Atlas
- Single-cell expression atlases
3. **Functional data:**
- GO annotations
- KEGG pathways
- Reactome
- Protein interactions (STRING)
4. **Disease associations:**
- ClinVar variants
- GWAS catalog
- Disease databases (DisGeNET)
5. **Literature:**
- PubMed abstracts
- Key mechanistic studies
- Review articles
Synthesize into comprehensive gene report:
- Molecular function
- Biological processes
- Cellular localization
- Tissue distribution
- Disease associations
- Known drug targets/inhibitors
- Unresolved questions
Generate structured summary suitable for research planning.
""")
```
---
## Multi-Omics Integration
### Example 1: Multi-Omics Disease Analysis
```python
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
result = agent.go("""
Integrate multi-omics data to understand disease mechanism.
Disease: Alzheimer's disease
Data types:
- Genomics: GWAS summary statistics (gwas_ad.txt)
- Transcriptomics: Brain RNA-seq (controls vs AD, rnaseq_data.csv)
- Proteomics: CSF proteomics (proteomics_csf.csv)
- Metabolomics: Plasma metabolomics (metabolomics_plasma.csv)
- Epigenomics: Brain methylation array (methylation_data.csv)
Integration workflow:
1. Analyze each omics layer independently:
- Identify significantly altered features
- Perform pathway enrichment
2. Cross-omics correlation:
- Correlate gene expression with protein levels
- Link genetic variants to expression (eQTL)
- Associate methylation with gene expression
- Connect proteins to metabolites
3. Network analysis:
- Build multi-omics network
- Identify key hub genes/proteins
- Detect disease modules
4. Causal inference:
- Prioritize drivers vs. consequences
- Identify therapeutic targets
- Predict drug mechanisms
5. Generate integrative model of AD pathogenesis
Provide visualization and therapeutic target recommendations.
""")
agent.save_conversation_history("ad_multiomics_analysis.pdf")
```
### Example 2: Systems Biology Modeling
```python
result = agent.go("""
Build systems biology model of metabolic pathway.
Pathway: Glycolysis
Data sources:
- Enzyme kinetics (BRENDA database)
- Metabolite concentrations (literature)
- Gene expression (tissue-specific, GTEx)
- Flux measurements (C13 labeling studies)
Modeling tasks:
1. Construct pathway model:
- Define reactions and stoichiometry
- Parameterize enzyme kinetics (Km, Vmax, Ki)
- Set initial metabolite concentrations
2. Simulate pathway dynamics:
- Steady-state analysis
- Time-course simulations
- Sensitivity analysis
3. Constraint-based modeling:
- Flux balance analysis (FBA)
- Identify bottleneck reactions
- Predict metabolic engineering strategies
4. Integrate with gene expression:
- Tissue-specific model predictions
- Disease vs. normal comparisons
5. Therapeutic predictions:
- Enzyme inhibition effects
- Metabolic rescue strategies
- Drug target identification
Generate model in SBML format and simulation results.
""")
```
---
## Best Practices for Task Formulation
### 1. Be Specific and Detailed
**Poor:**
```python
agent.go("Analyze this RNA-seq data")
```
**Good:**
```python
agent.go("""
Analyze bulk RNA-seq data from cancer vs. normal samples.
Files: cancer_rnaseq.csv (TPM values, 50 cancer, 50 normal)
Tasks:
1. Differential expression (DESeq2, padj < 0.05, |log2FC| > 1)
2. Pathway enrichment (KEGG, Reactome)
3. Generate volcano plot and top DE gene heatmap
""")
```
### 2. Include File Paths and Formats
Always specify:
- Exact file paths
- File formats (VCF, BAM, CSV, H5AD, etc.)
- Data structure (columns, sample IDs)
### 3. Set Clear Success Criteria
Define thresholds and cutoffs:
- Statistical significance (P < 0.05, FDR < 0.1)
- Fold change thresholds
- Quality filters
- Expected outputs
### 4. Request Visualizations
Explicitly ask for plots:
- Volcano plots, MA plots
- Heatmaps, PCA plots
- Network diagrams
- Manhattan plots
### 5. Specify Biological Context
Include:
- Organism (human, mouse, etc.)
- Tissue/cell type
- Disease/condition
- Treatment details
### 6. Request Interpretations
Ask agent to:
- Interpret biological significance
- Suggest follow-up experiments
- Identify limitations
- Provide literature context
---
## Common Patterns
### Data Quality Control
```python
"""
Before analysis, perform quality control:
1. Check for missing values
2. Assess data distributions
3. Identify outliers
4. Generate QC report
Only proceed with analysis if data passes QC.
"""
```
### Iterative Refinement
```python
"""
Perform analysis in stages:
1. Initial exploratory analysis
2. Based on results, refine parameters
3. Focus on interesting findings
4. Generate final report
Show intermediate results for each stage.
"""
```
### Reproducibility
```python
"""
Ensure reproducibility:
1. Set random seeds where applicable
2. Log all parameters used
3. Save intermediate files
4. Export environment info (package versions)
5. Generate methods section for paper
"""
```
These examples demonstrate the breadth of biomedical tasks biomni can handle. Adapt the patterns to your specific research questions, and always include sufficient detail for the agent to execute autonomously.