Initial commit

skills/esm/SKILL.md (new file, 300 lines)
@@ -0,0 +1,300 @@
---
name: esm
description: Comprehensive toolkit for protein language models including ESM3 (generative multimodal protein design across sequence, structure, and function) and ESM C (efficient protein embeddings and representations). Use this skill when working with protein sequences, structures, or function prediction; designing novel proteins; generating protein embeddings; performing inverse folding; or conducting protein engineering tasks. Supports both local model usage and cloud-based Forge API for scalable inference.
---

# ESM: Evolutionary Scale Modeling

## Overview

ESM provides state-of-the-art protein language models for understanding, generating, and designing proteins. This skill covers two model families: ESM3, for generative protein design across sequence, structure, and function, and ESM C, for efficient protein representation learning and embeddings.

## Core Capabilities

### 1. Protein Sequence Generation with ESM3

Generate novel protein sequences with desired properties using multimodal generative modeling.

**When to use:**
- Designing proteins with specific functional properties
- Completing partial protein sequences
- Generating variants of existing proteins
- Creating proteins with desired structural characteristics

**Basic usage:**

```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig

# Load model locally
model: ESM3InferenceClient = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

# Create protein prompt
protein = ESMProtein(sequence="MPRT___KEND")  # '_' represents masked positions

# Generate completion
protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
print(protein.sequence)
```

**For remote/cloud usage via Forge API:**

```python
from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig

# Connect to Forge
model = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", url="https://forge.evolutionaryscale.ai", token="<token>")

# Create a prompt and generate
protein = ESMProtein(sequence="MPRT___KEND")
protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
```

See `references/esm3-api.md` for detailed ESM3 model specifications, advanced generation configurations, and multimodal prompting examples.

### 2. Structure Prediction and Inverse Folding

Use ESM3's structure track to predict structure from sequence, or for inverse folding (sequence design from structure).

**Structure prediction:**

```python
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig

# Predict structure from sequence
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...")
protein_with_structure = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=len(protein.sequence))
)

# Access predicted structure
coordinates = protein_with_structure.coordinates  # 3D coordinates
pdb_string = protein_with_structure.to_pdb()
```

**Inverse folding (sequence from structure):**

```python
# Design a sequence for a target structure
protein_with_structure = ESMProtein.from_pdb("target_structure.pdb")
protein_with_structure.sequence = None  # Remove the sequence, keep the structure

# Generate a sequence that folds to this structure
designed_protein = model.generate(
    protein_with_structure,
    GenerationConfig(track="sequence", num_steps=50, temperature=0.7)
)
```

### 3. Protein Embeddings with ESM C

Generate high-quality embeddings for downstream tasks like function prediction, classification, or similarity analysis.

**When to use:**
- Extracting protein representations for machine learning
- Computing sequence similarities
- Feature extraction for protein classification
- Transfer learning for protein-related tasks

**Basic usage:**

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein

# Load ESM C model
model = ESMC.from_pretrained("esmc-300m").to("cuda")

# Encode the protein to a tensor
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...")
protein_tensor = model.encode(protein)

# Generate embeddings
embeddings = model.forward(protein_tensor)
```

**Batch processing:**

```python
# Embed multiple proteins
proteins = [
    ESMProtein(sequence="MPRTKEIND..."),
    ESMProtein(sequence="AGLIVHSPQ..."),
    ESMProtein(sequence="KTEFLNDGR...")
]

embeddings_list = [model.forward(model.encode(p)) for p in proteins]
```

See `references/esm-c-api.md` for ESM C model details, efficiency comparisons, and advanced embedding strategies.

### 4. Function Conditioning and Annotation

Use ESM3's function track to generate proteins with specific functional annotations or predict function from sequence.

**Function-conditioned generation:**

```python
from esm.sdk.api import ESMProtein, FunctionAnnotation, GenerationConfig

# Create protein with desired function
protein = ESMProtein(
    sequence="_" * 200,  # Generate a 200-residue protein
    function_annotations=[
        FunctionAnnotation(label="fluorescent_protein", start=50, end=150)
    ]
)

# Generate sequence with specified function
functional_protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=200)
)
```

### 5. Chain-of-Thought Generation

Iteratively refine protein designs using ESM3's chain-of-thought generation approach.

```python
from esm.sdk.api import GenerationConfig

# Multi-step refinement
protein = ESMProtein(sequence="MPRT" + "_" * 100 + "KEND")

# Step 1: Generate initial structure
config = GenerationConfig(track="structure", num_steps=50)
protein = model.generate(protein, config)

# Step 2: Refine sequence based on structure
config = GenerationConfig(track="sequence", num_steps=50, temperature=0.5)
protein = model.generate(protein, config)

# Step 3: Predict function
config = GenerationConfig(track="function", num_steps=20)
protein = model.generate(protein, config)
```

### 6. Batch Processing with Forge API

Process multiple proteins efficiently using Forge's async executor.

```python
import asyncio

from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig

client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token="<token>")

# Async batch processing
async def batch_generate(proteins_list):
    tasks = [
        client.async_generate(protein, GenerationConfig(track="sequence"))
        for protein in proteins_list
    ]
    return await asyncio.gather(*tasks)

# Execute
proteins = [ESMProtein(sequence=f"MPRT{'_' * 50}KEND") for _ in range(10)]
results = asyncio.run(batch_generate(proteins))
```

See `references/forge-api.md` for detailed Forge API documentation, authentication, rate limits, and batch processing patterns.

## Model Selection Guide

**ESM3 Models (Generative):**
- `esm3-sm-open-v1` (1.4B) - Open weights, local usage, good for experimentation
- `esm3-medium-2024-08` (7B) - Best balance of quality and speed (Forge only)
- `esm3-large-2024-03` (98B) - Highest quality, slower (Forge only)

**ESM C Models (Embeddings):**
- `esmc-300m` (30 layers) - Lightweight, fast inference
- `esmc-600m` (36 layers) - Balanced performance
- `esmc-6b` (80 layers) - Maximum representation quality

**Selection criteria:**
- **Local development/testing:** Use `esm3-sm-open-v1` or `esmc-300m`
- **Production quality:** Use `esm3-medium-2024-08` via Forge
- **Maximum accuracy:** Use `esm3-large-2024-03` or `esmc-6b`
- **High throughput:** Use Forge API with batch executor
- **Cost optimization:** Use smaller models and implement caching strategies (see the helper sketch after this list)

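These rules can be captured in a small helper. A minimal sketch, assuming nothing beyond the model IDs above (the function and its arguments are illustrative, not part of the ESM SDK):

```python
def pick_model(task: str, local_only: bool = False, max_quality: bool = False) -> str:
    """Map the selection criteria above onto a model ID (illustrative only)."""
    if task == "generation":
        if local_only:
            return "esm3-sm-open-v1"  # open weights, runs locally
        return "esm3-large-2024-03" if max_quality else "esm3-medium-2024-08"
    if task == "embeddings":
        if local_only and not max_quality:
            return "esmc-300m"  # lightweight local default
        return "esmc-6b" if max_quality else "esmc-600m"
    raise ValueError(f"unknown task: {task!r}")
```
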
## Installation

**Basic installation:**

```bash
uv pip install esm
```

**With Flash Attention (recommended for faster inference):**

```bash
uv pip install esm
uv pip install flash-attn --no-build-isolation
```

**For Forge API access:**

```bash
uv pip install esm  # SDK includes the Forge client
```

No additional dependencies are needed. Obtain a Forge API token at https://forge.evolutionaryscale.ai

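To confirm a local install works end to end, a quick smoke test on CPU (a sketch that assumes only the open model and the `generate` call shown earlier):

```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

# CPU is slow but sufficient to verify the install
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cpu")
out = model.generate(ESMProtein(sequence="M___K"), GenerationConfig(track="sequence", num_steps=4))
print(out.sequence)  # masked positions should now be filled in
```
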
## Common Workflows

For detailed examples and complete workflows, see `references/workflows.md`, which includes:
- Novel GFP design with chain-of-thought
- Protein variant generation and screening
- Structure-based sequence optimization
- Function prediction pipelines
- Embedding-based clustering and analysis

## References

This skill includes comprehensive reference documentation:

- `references/esm3-api.md` - ESM3 model architecture, API reference, generation parameters, and multimodal prompting
- `references/esm-c-api.md` - ESM C model details, embedding strategies, and performance optimization
- `references/forge-api.md` - Forge platform documentation, authentication, batch processing, and deployment
- `references/workflows.md` - Complete examples and common workflow patterns

These references contain detailed API specifications, parameter descriptions, and advanced usage patterns. Load them as needed for specific tasks.

## Best Practices

**For generation tasks:**
- Start with smaller models for prototyping (`esm3-sm-open-v1`)
- Use the temperature parameter to control diversity (0.0 = deterministic, 1.0 = diverse)
- Implement iterative refinement with chain-of-thought for complex designs
- Validate generated sequences with structure prediction or wet-lab experiments

**For embedding tasks:**
- Batch process sequences when possible for efficiency
- Cache embeddings for repeated analyses
- Normalize embeddings when computing similarities (see the sketch after this list)
- Use an appropriate model size for the downstream task

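A minimal sketch of the normalization point, assuming mean-pooled per-protein embeddings as in the ESM C examples above (`protein_a` and `protein_b` stand for any two `ESMProtein` inputs):

```python
import torch.nn.functional as F

# L2-normalize mean-pooled embeddings so the dot product equals cosine similarity
emb_a = model.forward(model.encode(protein_a)).mean(dim=1)
emb_b = model.forward(model.encode(protein_b)).mean(dim=1)
emb_a = F.normalize(emb_a, dim=-1)
emb_b = F.normalize(emb_b, dim=-1)
similarity = (emb_a * emb_b).sum(dim=-1)
```
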
**For production deployment:**

- Use the Forge API for scalability and access to the latest models
- Implement error handling and retry logic for API calls
- Monitor token usage and implement rate limiting
- Consider AWS SageMaker deployment for dedicated infrastructure

## Resources and Documentation

- **GitHub Repository:** https://github.com/evolutionaryscale/esm
- **Forge Platform:** https://forge.evolutionaryscale.ai
- **Scientific Paper:** Hayes et al., Science (2025) - https://www.science.org/doi/10.1126/science.ads0018
- **Blog Posts:**
  - ESM3 Release: https://www.evolutionaryscale.ai/blog/esm3-release
  - ESM C Launch: https://www.evolutionaryscale.ai/blog/esm-cambrian
- **Community:** Slack community at https://bit.ly/3FKwcWd
- **Model Weights:** HuggingFace EvolutionaryScale organization

## Responsible Use

ESM is designed for beneficial applications in protein engineering, drug discovery, and scientific research. Follow the Responsible Biodesign Framework (https://responsiblebiodesign.ai/) when designing novel proteins, and consider the biosafety and ethical implications of protein designs before experimental validation.


skills/esm/references/esm-c-api.md (new file, 583 lines)
@@ -0,0 +1,583 @@
# ESM C API Reference

## Overview

ESM C (Cambrian) is a family of protein language models optimized for representation learning and efficient embedding generation. Designed as a drop-in replacement for ESM2, ESM C provides significant improvements in speed and quality across all model sizes.

## Model Architecture

**ESM C Family Models:**

| Model ID | Parameters | Layers | Best For |
|----------|-----------|--------|----------|
| `esmc-300m` | 300M | 30 | Fast inference, lightweight applications |
| `esmc-600m` | 600M | 36 | Balanced performance and quality |
| `esmc-6b` | 6B | 80 | Maximum representation quality |

**Key Features:**
- 3x faster inference than ESM2
- Improved perplexity and embedding quality
- Efficient architecture for production deployment
- Compatible with ESM2 workflows (drop-in replacement)
- Efficient support for long sequences (up to 1024 residues)

**Architecture Improvements over ESM2:**
- Optimized attention mechanisms
- Better token representation
- Enhanced training procedures
- Reduced memory footprint

## Core API Components

### ESMC Class

Main interface for ESM C models.

**Model Loading:**

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein

# Load model with automatic device placement
model = ESMC.from_pretrained("esmc-300m").to("cuda")

# Or specify device explicitly
model = ESMC.from_pretrained("esmc-600m").to("cpu")

# For maximum quality
model = ESMC.from_pretrained("esmc-6b").to("cuda")
```

**Model Selection Criteria:**

- **esmc-300m**: Development, real-time applications, batch processing of many sequences
- **esmc-600m**: Production deployments, good quality/speed balance
- **esmc-6b**: Research, maximum accuracy for downstream tasks

### Basic Embedding Generation

**Single Sequence:**

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein

# Load model
model = ESMC.from_pretrained("esmc-600m").to("cuda")

# Create protein
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")

# Encode to tensor
protein_tensor = model.encode(protein)

# Generate embeddings
embeddings = model.forward(protein_tensor)

# Get logits (per-position predictions)
logits = model.logits(embeddings)

print(f"Embedding shape: {embeddings.shape}")
print(f"Logits shape: {logits.shape}")
```

**Output Shapes:**

For a sequence of length L (checked in the snippet after this list):
- `embeddings.shape`: `(1, L, hidden_dim)`, where hidden_dim depends on the model
  - esmc-300m: hidden_dim = 960
  - esmc-600m: hidden_dim = 1152
  - esmc-6b: hidden_dim = 2560
- `logits.shape`: `(1, L, 64)` - per-position amino acid predictions

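The hidden sizes above can be verified directly against a forward pass; a small sketch (the dictionary simply restates the numbers listed above):

```python
# Hidden dimension per model, as listed above
ESMC_HIDDEN_DIM = {"esmc-300m": 960, "esmc-600m": 1152, "esmc-6b": 2560}

protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")
embeddings = model.forward(model.encode(protein))

# For the esmc-600m model loaded above, the last axis should be 1152
assert embeddings.shape[-1] == ESMC_HIDDEN_DIM["esmc-600m"]
```
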
### Batch Processing

Process multiple sequences efficiently:

```python
import torch

# Multiple proteins
sequences = [
    "MPRTKEINDAGLIVHSP",
    "AGKWFYLTQSNHERVPM",
    "DEIFKRNAVWGSLTPQY"
]

proteins = [ESMProtein(sequence=seq) for seq in sequences]

# Encode all
protein_tensors = [model.encode(p) for p in proteins]

# Process the batch (if sequences share a length)
# For variable lengths, process individually or pad
embeddings_list = []
for tensor in protein_tensors:
    embedding = model.forward(tensor)
    embeddings_list.append(embedding)

print(f"Processed {len(embeddings_list)} proteins")
```

**Efficient Batching for Variable Lengths:**

```python
def batch_encode_variable_length(model, sequences, max_batch_size=32):
    """
    Efficiently encode sequences of variable length.
    Sorting groups sequences of similar length together for efficiency.
    """
    # Sort by length, remembering each sequence's original position
    sorted_seqs = sorted(enumerate(sequences), key=lambda x: len(x[1]))

    results = [None] * len(sequences)
    batch = []
    batch_indices = []

    def flush():
        """Embed the current batch and store results at their original indices."""
        proteins = [ESMProtein(sequence=s) for s in batch]
        embeddings = [model.forward(model.encode(p)) for p in proteins]
        for i, emb in zip(batch_indices, embeddings):
            results[i] = emb
        batch.clear()
        batch_indices.clear()

    for idx, seq in sorted_seqs:
        # Flush before adding if the batch is full or lengths diverge too much
        if batch and (len(batch) >= max_batch_size or
                      len(seq) - len(batch[0]) > 10):
            flush()
        batch.append(seq)
        batch_indices.append(idx)

    # Process the remaining sequences
    if batch:
        flush()

    return results
```

## Common Use Cases

### 1. Sequence Similarity Analysis

Compute similarity between proteins using embeddings:

```python
import torch
import torch.nn.functional as F

def get_sequence_embedding(model, sequence):
    """Get a mean-pooled sequence embedding."""
    protein = ESMProtein(sequence=sequence)
    tensor = model.encode(protein)
    embedding = model.forward(tensor)

    # Mean pooling over sequence length
    return embedding.mean(dim=1)

# Get embeddings
seq1_emb = get_sequence_embedding(model, "MPRTKEINDAGLIVHSP")
seq2_emb = get_sequence_embedding(model, "MPRTKEINDAGLIVHSQ")  # Similar
seq3_emb = get_sequence_embedding(model, "WWWWWWWWWWWWWWWWW")  # Different

# Compute cosine similarity
sim_1_2 = F.cosine_similarity(seq1_emb, seq2_emb)
sim_1_3 = F.cosine_similarity(seq1_emb, seq3_emb)

print(f"Similarity (1,2): {sim_1_2.item():.4f}")
print(f"Similarity (1,3): {sim_1_3.item():.4f}")
```

### 2. Protein Classification

Use embeddings as features for classification:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate embeddings for the training set
def embed_dataset(model, sequences):
    embeddings = []
    for seq in sequences:
        protein = ESMProtein(sequence=seq)
        tensor = model.encode(protein)
        emb = model.forward(tensor).mean(dim=1)  # Mean pooling
        embeddings.append(emb.cpu().detach().numpy().flatten())
    return np.array(embeddings)

# Example: Classify proteins by function
train_sequences = [...]  # Your sequences
train_labels = [...]  # Your labels

embeddings = embed_dataset(model, train_sequences)

# Train classifier
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, train_labels, test_size=0.2
)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

# Evaluate
accuracy = classifier.score(X_test, y_test)
print(f"Classification accuracy: {accuracy:.4f}")
```

### 3. Protein Clustering

Cluster proteins based on sequence similarity:

```python
from sklearn.cluster import KMeans
import numpy as np

# Generate embeddings
sequences = [...]  # Your protein sequences
embeddings = embed_dataset(model, sequences)

# Cluster
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Analyze clusters
for i in range(n_clusters):
    cluster_seqs = [seq for seq, label in zip(sequences, cluster_labels) if label == i]
    print(f"Cluster {i}: {len(cluster_seqs)} sequences")
```

### 4. Sequence Search and Retrieval

Find similar sequences in a database:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_sequence_index(model, database_sequences):
    """Build a searchable index of sequence embeddings."""
    embeddings = []
    for seq in database_sequences:
        emb = get_sequence_embedding(model, seq)
        embeddings.append(emb.cpu().detach().numpy().flatten())
    return np.array(embeddings)

def search_similar_sequences(model, query_seq, database_embeddings,
                             database_sequences, top_k=10):
    """Find the top-k most similar sequences."""
    query_emb = get_sequence_embedding(model, query_seq)
    query_emb_np = query_emb.cpu().detach().numpy().flatten().reshape(1, -1)

    # Compute similarities
    similarities = cosine_similarity(query_emb_np, database_embeddings)[0]

    # Get top-k
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    results = [
        (database_sequences[idx], similarities[idx])
        for idx in top_indices
    ]
    return results

# Example usage
database_seqs = [...]  # Large sequence database
index = build_sequence_index(model, database_seqs)

query = "MPRTKEINDAGLIVHSP"
similar = search_similar_sequences(model, query, index, database_seqs, top_k=5)

for seq, score in similar:
    print(f"Score: {score:.4f} - {seq[:30]}...")
```

### 5. Feature Extraction for Downstream Models

Use ESM C embeddings as input to custom neural networks:

```python
import torch
import torch.nn as nn

class ProteinPropertyPredictor(nn.Module):
    """Example: Predict protein properties from ESM C embeddings."""

    def __init__(self, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embedding_dim)
        # Mean pool over the sequence
        x = embeddings.mean(dim=1)

        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Use ESM C as a frozen feature extractor
esm_model = ESMC.from_pretrained("esmc-600m").to("cuda")
esm_model.eval()  # Freeze

# Create task-specific model
predictor = ProteinPropertyPredictor(
    embedding_dim=1152,  # esmc-600m dimension
    hidden_dim=512,
    output_dim=1  # e.g., stability score
).to("cuda")

# Training loop (dataloader and criterion are task-specific)
for sequence, target in dataloader:
    protein = ESMProtein(sequence=sequence)
    with torch.no_grad():
        embeddings = esm_model.forward(esm_model.encode(protein))

    prediction = predictor(embeddings)
    loss = criterion(prediction, target)
    # ... backprop through the predictor only
```

### 6. Per-Residue Analysis

Extract per-residue representations for detailed analysis:

```python
def get_per_residue_embeddings(model, sequence):
    """Get an embedding for each residue."""
    protein = ESMProtein(sequence=sequence)
    tensor = model.encode(protein)
    embeddings = model.forward(tensor)

    # embeddings shape: (1, seq_len, hidden_dim)
    return embeddings.squeeze(0)  # (seq_len, hidden_dim)

# Analyze specific positions
sequence = "MPRTKEINDAGLIVHSPQWFYK"
residue_embeddings = get_per_residue_embeddings(model, sequence)

# Extract features for position 10
position_10_features = residue_embeddings[10]
print(f"Features for residue {sequence[10]} at position 10:")
print(f"Shape: {position_10_features.shape}")

# Compare residue representations
pos_5 = residue_embeddings[5]
pos_15 = residue_embeddings[15]
similarity = F.cosine_similarity(pos_5, pos_15, dim=0)
print(f"Residue similarity: {similarity.item():.4f}")
```

## Performance Optimization

### Memory Management

```python
import torch

# Use half precision for memory efficiency
model = ESMC.from_pretrained("esmc-600m").to("cuda").half()

# Process with mixed precision
with torch.cuda.amp.autocast():
    embeddings = model.forward(model.encode(protein))

# Clear cache between batches
torch.cuda.empty_cache()
```

### Batch Processing Best Practices

```python
def efficient_batch_processing(model, sequences, batch_size=32):
    """Process sequences in optimized batches."""
    results = []

    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]

        # Process batch
        batch_embeddings = []
        for seq in batch:
            protein = ESMProtein(sequence=seq)
            emb = model.forward(model.encode(protein))
            batch_embeddings.append(emb)

        results.extend(batch_embeddings)

        # Periodically clear the CUDA cache
        if i % (batch_size * 10) == 0:
            torch.cuda.empty_cache()

    return results
```

### Caching Embeddings

```python
import pickle
import hashlib

def get_cache_key(sequence):
    """Generate a cache key for a sequence."""
    return hashlib.md5(sequence.encode()).hexdigest()

class EmbeddingCache:
    """Cache for protein embeddings."""

    def __init__(self, cache_file="embeddings_cache.pkl"):
        self.cache_file = cache_file
        try:
            with open(cache_file, 'rb') as f:
                self.cache = pickle.load(f)
        except FileNotFoundError:
            self.cache = {}

    def get(self, sequence):
        key = get_cache_key(sequence)
        return self.cache.get(key)

    def set(self, sequence, embedding):
        key = get_cache_key(sequence)
        self.cache[key] = embedding

    def save(self):
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.cache, f)

# Usage
cache = EmbeddingCache()

def get_embedding_cached(model, sequence):
    cached = cache.get(sequence)
    if cached is not None:
        return cached

    # Compute
    protein = ESMProtein(sequence=sequence)
    embedding = model.forward(model.encode(protein))
    cache.set(sequence, embedding)

    return embedding

# Don't forget to save the cache
cache.save()
```

## Comparison with ESM2

**Performance Improvements:**

| Metric | ESM2-650M | ESM C-600M | Improvement |
|--------|-----------|------------|-------------|
| Inference Speed | 1.0x | 3.0x | 3x faster |
| Perplexity | Higher | Lower | Better |
| Memory Usage | 1.0x | 0.8x | 20% less |
| Embedding Quality | Baseline | Improved | +5-10% |

**Migration from ESM2:**

ESM C is designed as a drop-in replacement:

```python
# Old ESM2 code
from esm import pretrained
model, alphabet = pretrained.esm2_t33_650M_UR50D()

# New ESM C code (similar API)
from esm.models.esmc import ESMC
model = ESMC.from_pretrained("esmc-600m")
```

Key differences:
- Faster inference with the same or better quality
- Simplified API through ESMProtein
- Better support for long sequences
- More efficient memory usage

## Advanced Topics

### Fine-tuning ESM C

ESM C can be fine-tuned for specific tasks:

```python
import torch.optim as optim

# Load model
model = ESMC.from_pretrained("esmc-300m").to("cuda")

# Unfreeze for fine-tuning
for param in model.parameters():
    param.requires_grad = True

# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-5)

# Training loop (num_epochs, dataloader, and compute_loss are task-specific)
for epoch in range(num_epochs):
    for sequences, labels in dataloader:
        optimizer.zero_grad()

        # Forward pass
        proteins = [ESMProtein(sequence=seq) for seq in sequences]
        embeddings = [model.forward(model.encode(p)) for p in proteins]

        # Your task-specific loss
        loss = compute_loss(embeddings, labels)

        loss.backward()
        optimizer.step()
```

### Attention Visualization

Extract attention weights for interpretability:

```python
def get_attention_weights(model, sequence):
    """Extract attention weights from the model."""
    protein = ESMProtein(sequence=sequence)
    tensor = model.encode(protein)

    # Forward with attention output
    output = model.forward(tensor, output_attentions=True)

    return output.attentions  # List of attention tensors per layer

# Visualize attention
attentions = get_attention_weights(model, "MPRTKEINDAGLIVHSP")
# Process and visualize attention patterns
```

## Citation

If using ESM C in research, cite:

```
ESM Cambrian: https://www.evolutionaryscale.ai/blog/esm-cambrian
EvolutionaryScale (2024)
```

## Additional Resources

- ESM C blog post: https://www.evolutionaryscale.ai/blog/esm-cambrian
- Model weights: HuggingFace EvolutionaryScale organization
- Comparison benchmarks: See the blog post for detailed performance comparisons


skills/esm/references/esm3-api.md (new file, 452 lines)
@@ -0,0 +1,452 @@
# ESM3 API Reference

## Overview

ESM3 is a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. It uses iterative masked language modeling to generate across these three modalities simultaneously.

## Model Architecture

**ESM3 Family Models:**

| Model ID | Parameters | Availability | Best For |
|----------|-----------|--------------|----------|
| `esm3-sm-open-v1` | 1.4B | Open weights (local) | Development, testing, learning |
| `esm3-medium-2024-08` | 7B | Forge API only | Production, balanced quality/speed |
| `esm3-large-2024-03` | 98B | Forge API only | Maximum quality, research |
| `esm3-medium-multimer-2024-09` | 7B | Forge API only | Protein complexes (experimental) |

**Key Features:**
- Simultaneous reasoning across sequence, structure, and function
- Iterative generation with a controllable number of steps
- Support for partial prompting across modalities
- Chain-of-thought generation for complex designs
- Temperature control for generation diversity

## Core API Components

### ESMProtein Class

The central data structure representing a protein with optional sequence, structure, and function information.

**Constructor:**

```python
from esm.sdk.api import ESMProtein

protein = ESMProtein(
    sequence="MPRTKEINDAGLIVHSP",  # Amino acid sequence (optional)
    coordinates=coordinates_array,  # 3D structure (optional)
    function_annotations=[...],  # Function labels (optional)
    secondary_structure="HHHEEEECCC",  # SS annotations (optional)
    sasa=sasa_array  # Solvent accessibility (optional)
)
```

**Key Methods:**

```python
# Load from PDB file
protein = ESMProtein.from_pdb("protein.pdb")

# Export to PDB format
pdb_string = protein.to_pdb()

# Save to file
with open("output.pdb", "w") as f:
    f.write(protein.to_pdb())
```

**Masking Conventions:**

Use `_` (underscore) to represent masked positions for generation:

```python
# Mask positions 5-10 for generation
protein = ESMProtein(sequence="MPRT______AGLIVHSP")

# Fully masked sequence (generate from scratch)
protein = ESMProtein(sequence="_" * 200)

# Partial structure (some coordinates None)
protein = ESMProtein(
    sequence="MPRTKEIND",
    coordinates=partial_coords  # Some positions can be None
)
```

### GenerationConfig Class

Controls generation behavior and parameters.

**Basic Configuration:**

```python
from esm.sdk.api import GenerationConfig

config = GenerationConfig(
    track="sequence",  # Track to generate: "sequence", "structure", or "function"
    num_steps=8,  # Number of demasking steps
    temperature=0.7,  # Sampling temperature (0.0-1.0)
    top_p=None,  # Nucleus sampling threshold
    condition_on_coordinates_only=False  # For structure conditioning
)
```

**Parameter Details:**

- **track**: Which modality to generate
  - `"sequence"`: Generate amino acid sequence
  - `"structure"`: Generate 3D coordinates
  - `"function"`: Generate function annotations

- **num_steps**: Number of iterative demasking steps
  - Higher = slower but potentially better quality
  - Typical range: 8-100 depending on sequence length
  - For full sequence generation: approximately sequence_length / 2

- **temperature**: Controls randomness
  - 0.0: Fully deterministic (greedy decoding)
  - 0.5-0.7: Balanced exploration
  - 1.0: Maximum diversity
  - Higher values increase novelty but may reduce quality

- **top_p**: Nucleus sampling parameter
  - Limits sampling to the top probability mass
  - Values: 0.0-1.0 (e.g., 0.9 = sample from the top 90% of probability mass)
  - Use for controlled diversity without extreme sampling

- **condition_on_coordinates_only**: Structure conditioning mode
  - `True`: Condition only on backbone coordinates (ignore sequence)
  - Useful for inverse folding tasks (see the sketch after this list)

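As a concrete illustration of these parameters, three configurations for common situations (the values are illustrative defaults, not prescriptions):

```python
# Deterministic completion: greedy decoding, no sampling noise
greedy = GenerationConfig(track="sequence", num_steps=16, temperature=0.0)

# Diverse sampling restricted to the top 90% of probability mass
diverse = GenerationConfig(track="sequence", num_steps=16, temperature=1.0, top_p=0.9)

# Inverse-folding style: condition on backbone coordinates only
inverse = GenerationConfig(
    track="sequence", num_steps=50, temperature=0.7,
    condition_on_coordinates_only=True
)
```
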
### ESM3InferenceClient Interface

The unified interface for both local and remote inference.

**Local Model Loading:**

```python
from esm.models.esm3 import ESM3

# Load with automatic device placement
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

# Or explicitly specify device
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cpu")
```

**Generation Method:**

```python
# Basic generation
protein_output = model.generate(protein_input, config)

# With explicit track specification
protein_output = model.generate(
    protein_input,
    GenerationConfig(track="sequence", num_steps=16, temperature=0.6)
)
```

**Forward Pass (Advanced):**

```python
# Get raw model logits for custom sampling
protein_tensor = model.encode(protein)
output = model.forward(protein_tensor)
logits = model.decode(output)
```

## Common Usage Patterns

### 1. Sequence Completion

Fill in masked regions of a protein sequence:

```python
# Define partial sequence
protein = ESMProtein(sequence="MPRTK____LIVHSP____END")

# Generate missing positions
config = GenerationConfig(track="sequence", num_steps=12, temperature=0.5)
completed = model.generate(protein, config)

print(f"Original: {protein.sequence}")
print(f"Completed: {completed.sequence}")
```

### 2. Structure Prediction

Predict 3D structure from sequence:

```python
# Input: sequence only
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")

# Generate structure
config = GenerationConfig(track="structure", num_steps=len(protein.sequence))
protein_with_structure = model.generate(protein, config)

# Save as PDB
with open("predicted_structure.pdb", "w") as f:
    f.write(protein_with_structure.to_pdb())
```

### 3. Inverse Folding

Design a sequence for a target structure:

```python
# Load target structure
target = ESMProtein.from_pdb("target.pdb")

# Remove sequence, keep structure
target.sequence = None

# Generate a sequence that folds to this structure
config = GenerationConfig(
    track="sequence",
    num_steps=50,
    temperature=0.7,
    condition_on_coordinates_only=True
)
designed = model.generate(target, config)

print(f"Designed sequence: {designed.sequence}")
```

### 4. Function-Conditioned Generation

Generate a protein with a specific function:

```python
from esm.sdk.api import FunctionAnnotation

# Specify desired function
protein = ESMProtein(
    sequence="_" * 150,
    function_annotations=[
        FunctionAnnotation(
            label="enzymatic_activity",
            start=30,
            end=90
        )
    ]
)

# Generate a sequence with this function
config = GenerationConfig(track="sequence", num_steps=75, temperature=0.6)
functional_protein = model.generate(protein, config)
```

### 5. Multi-Track Generation (Chain-of-Thought)

Iteratively generate across multiple tracks:

```python
# Start with partial sequence
protein = ESMProtein(sequence="MPRT" + "_" * 100)

# Step 1: Complete sequence
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=50, temperature=0.6)
)

# Step 2: Predict structure for the completed sequence
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=50)
)

# Step 3: Predict function
protein = model.generate(
    protein,
    GenerationConfig(track="function", num_steps=20)
)

print(f"Final sequence: {protein.sequence}")
print(f"Functions: {protein.function_annotations}")
```

### 6. Variant Generation

Generate multiple variants of a protein:

```python
import numpy as np

base_sequence = "MPRTKEINDAGLIVHSPQWFYK"
variants = []

for i in range(10):
    # Mask random positions
    seq_list = list(base_sequence)
    mask_indices = np.random.choice(len(seq_list), size=5, replace=False)
    for idx in mask_indices:
        seq_list[idx] = '_'

    protein = ESMProtein(sequence=''.join(seq_list))

    # Generate variant
    variant = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=8, temperature=0.8)
    )
    variants.append(variant.sequence)

print(f"Generated {len(variants)} variants")
```

## Advanced Topics

### Temperature Scheduling

Vary temperature during generation for better control:

```python
def generate_with_temperature_schedule(model, protein, temperatures):
    """Generate with decreasing temperature for annealing."""
    current = protein
    steps_per_temp = 10

    for temp in temperatures:
        config = GenerationConfig(
            track="sequence",
            num_steps=steps_per_temp,
            temperature=temp
        )
        current = model.generate(current, config)

    return current

# Example: Start diverse, end deterministic
result = generate_with_temperature_schedule(
    model,
    protein,
    temperatures=[1.0, 0.8, 0.6, 0.4, 0.2]
)
```

### Constrained Generation

Preserve specific regions during generation:

```python
# Keep active-site residues fixed
def mask_except_active_site(sequence, active_site_positions):
    """Mask everything except the specified positions."""
    seq_list = ['_'] * len(sequence)
    for pos in active_site_positions:
        seq_list[pos] = sequence[pos]
    return ''.join(seq_list)

# Define active site
active_site = [23, 24, 25, 45, 46, 89]
constrained_seq = mask_except_active_site(original_sequence, active_site)

protein = ESMProtein(sequence=constrained_seq)
result = model.generate(protein, GenerationConfig(track="sequence", num_steps=50))
```

### Secondary Structure Conditioning

Use secondary structure information in generation:

```python
# Define secondary structure (H=helix, E=sheet, C=coil)
protein = ESMProtein(
    sequence="_" * 80,
    secondary_structure="CCHHHHHHHEEEEECCCHHHHHHCC" + "C" * 55
)

# Generate a sequence with this structure
result = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=40, temperature=0.6)
)
```

## Performance Optimization

### Memory Management

For large proteins or batch processing:

```python
import torch

# Clear the CUDA cache between generations
torch.cuda.empty_cache()

# Use half precision for memory efficiency
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda").half()

# Process in chunks for very long sequences
def chunk_generate(model, long_sequence, chunk_size=500):
    chunks = [long_sequence[i:i+chunk_size]
              for i in range(0, len(long_sequence), chunk_size)]
    results = []

    for chunk in chunks:
        protein = ESMProtein(sequence=chunk)
        result = model.generate(protein, GenerationConfig(track="sequence"))
        results.append(result.sequence)

    return ''.join(results)
```

### Batch Processing Tips

When processing multiple proteins (see the sketch after this list):

1. Sort by sequence length for efficient batching
2. Use padding for similar-length sequences
3. Process on GPU when available
4. Implement checkpointing for long-running jobs
5. Use the Forge API for large-scale processing (see `forge-api.md`)

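A minimal sketch of tip 1, assuming a loaded model and the `generate` call shown earlier (the helper name is illustrative):

```python
def generate_length_sorted(model, proteins, config):
    """Run generation in length order so similar-length proteins are adjacent,
    then restore the caller's original ordering."""
    order = sorted(range(len(proteins)), key=lambda i: len(proteins[i].sequence))
    results = [None] * len(proteins)
    for i in order:
        results[i] = model.generate(proteins[i], config)
    return results
```
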
## Error Handling

```python
import torch

try:
    protein = model.generate(protein_input, config)
except ValueError as e:
    print(f"Invalid input: {e}")
    # Handle invalid sequence or structure
except torch.cuda.OutOfMemoryError:
    # Must come before RuntimeError, since OutOfMemoryError is a subclass of it
    print("GPU out of memory - try a smaller model or CPU")
    # Fall back to CPU or a smaller model
except RuntimeError as e:
    print(f"Generation failed: {e}")
    # Handle model errors
```

## Model-Specific Considerations

**esm3-sm-open-v1:**
- Suitable for development and testing
- Lower quality than the larger models
- Fast inference on consumer GPUs
- Open weights allow fine-tuning

**esm3-medium-2024-08:**
- Production quality
- Good balance of speed and accuracy
- Requires Forge API access
- Recommended for most applications

**esm3-large-2024-03:**
- State-of-the-art quality
- Slowest inference
- Use for critical applications
- Best for novel protein design

## Citation

If using ESM3 in research, cite:

```
Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model.
Science. DOI: 10.1126/science.ads0018
```


skills/esm/references/forge-api.md (new file, 657 lines)
@@ -0,0 +1,657 @@
# Forge API Reference

## Overview

Forge is EvolutionaryScale's cloud platform for scalable protein design and inference. It provides API access to the full ESM3 model family, including large models not available for local execution.

**Key Benefits:**
- Access to all ESM3 models, including the 98B-parameter version
- No local GPU requirements
- Scalable batch processing
- Automatic updates to the latest models
- Production-ready infrastructure
- Async/concurrent request support

## Getting Started

### 1. Obtain API Token

Sign up and get your API token at: https://forge.evolutionaryscale.ai

### 2. Install ESM SDK

```bash
pip install esm
```

The Forge client is included in the standard ESM package.

### 3. Basic Connection

```python
from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig

# Initialize client
client = ESM3ForgeInferenceClient(
    model="esm3-medium-2024-08",
    url="https://forge.evolutionaryscale.ai",
    token="<your-token-here>"
)

# Test connection
protein = ESMProtein(sequence="MPRT___KEND")
result = client.generate(protein, GenerationConfig(track="sequence", num_steps=8))
print(result.sequence)
```

## Available Models

| Model ID | Parameters | Speed | Quality | Use Case |
|----------|-----------|-------|---------|----------|
| `esm3-small-2024-08` | 1.4B | Fastest | Good | Rapid prototyping, testing |
| `esm3-medium-2024-08` | 7B | Fast | Excellent | Production, most applications |
| `esm3-large-2024-03` | 98B | Slower | Best | Research, critical designs |
| `esm3-medium-multimer-2024-09` | 7B | Fast | Experimental | Protein complexes |

**Model Selection Guidelines:**

- **Development/Testing**: Use `esm3-small-2024-08` for quick iteration
- **Production**: Use `esm3-medium-2024-08` for the best balance
- **Research/Critical**: Use `esm3-large-2024-03` for the highest quality
- **Complexes**: Use `esm3-medium-multimer-2024-09` (experimental)

## ESM3ForgeInferenceClient API

### Initialization

```python
from esm.sdk.forge import ESM3ForgeInferenceClient

# Basic initialization
client = ESM3ForgeInferenceClient(
    model="esm3-medium-2024-08",
    token="<your-token>"
)

# With a custom URL (for enterprise deployments)
client = ESM3ForgeInferenceClient(
    model="esm3-medium-2024-08",
    url="https://custom.forge.instance.com",
    token="<your-token>"
)

# With timeout configuration
client = ESM3ForgeInferenceClient(
    model="esm3-medium-2024-08",
    token="<your-token>",
    timeout=300  # 5 minutes
)
```

### Synchronous Generation

Standard blocking generation calls:

```python
from esm.sdk.api import ESMProtein, GenerationConfig

# Basic generation
protein = ESMProtein(sequence="MPRT___KEND")
config = GenerationConfig(track="sequence", num_steps=8)

result = client.generate(protein, config)
print(f"Generated: {result.sequence}")
```

### Asynchronous Generation

For concurrent processing of multiple proteins:

```python
import asyncio
from esm.sdk.api import ESMProtein, GenerationConfig

async def generate_many(client, proteins):
    """Generate multiple proteins concurrently."""
    tasks = []

    for protein in proteins:
        task = client.async_generate(
            protein,
            GenerationConfig(track="sequence", num_steps=8)
        )
        tasks.append(task)

    results = await asyncio.gather(*tasks)
    return results

# Usage
proteins = [
    ESMProtein(sequence=f"MPRT{'_' * 10}KEND"),
    ESMProtein(sequence=f"AGLV{'_' * 10}HSPQ"),
    ESMProtein(sequence=f"KEIT{'_' * 10}NDFL")
]

results = asyncio.run(generate_many(client, proteins))
print(f"Generated {len(results)} proteins")
```

### Batch Processing with BatchExecutor

For large-scale processing with automatic concurrency management:

```python
from esm.sdk.forge import BatchExecutor
from esm.sdk.api import ESMProtein, GenerationConfig

# Create batch executor
executor = BatchExecutor(
    client=client,
    max_concurrent=10  # Process 10 requests concurrently
)

# Prepare a batch of proteins
proteins = [ESMProtein(sequence=f"MPRT{'_' * 50}KEND") for _ in range(100)]
config = GenerationConfig(track="sequence", num_steps=25)

# Submit batch
batch_results = executor.submit_batch(
    proteins=proteins,
    config=config,
    progress_callback=lambda i, total: print(f"Processed {i}/{total}")
)

print(f"Completed {len(batch_results)} generations")
```

## Rate Limiting and Quotas

### Understanding Limits

Forge implements rate limiting based on:
- Requests per minute (RPM)
- Tokens per minute (TPM)
- Concurrent requests

**Typical Limits (subject to change):**
- Free tier: 60 RPM, 5 concurrent
- Pro tier: 300 RPM, 20 concurrent
- Enterprise: Custom limits (a concurrency sketch follows this list)

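The concurrent-request limit can also be enforced client-side; a minimal sketch using an `asyncio.Semaphore` (the cap of 5 matches the free tier listed above):

```python
import asyncio

async def bounded_generate(client, proteins, config, max_concurrent=5):
    """Cap in-flight requests so the concurrency limit is never exceeded."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(protein):
        async with sem:
            return await client.async_generate(protein, config)

    return await asyncio.gather(*(one(p) for p in proteins))
```
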
### Handling Rate Limits

```python
import time
from requests.exceptions import HTTPError

def generate_with_retry(client, protein, config, max_retries=3):
    """Generate with automatic retry on rate limit."""
    for attempt in range(max_retries):
        try:
            return client.generate(protein, config)
        except HTTPError as e:
            if e.response.status_code == 429:  # Rate limit
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
    raise Exception("Max retries exceeded")

# Usage
result = generate_with_retry(client, protein, config)
```

### Implementing a Custom Rate Limiter

```python
import time
from collections import deque

class RateLimiter:
    """Simple rate limiter for API calls."""

    def __init__(self, max_per_minute=60):
        self.max_per_minute = max_per_minute
        self.calls = deque()

    def wait_if_needed(self):
        """Wait if the rate limit would be exceeded."""
        now = time.time()

        # Remove calls older than one minute
        while self.calls and self.calls[0] < now - 60:
            self.calls.popleft()

        # Wait if at the limit
        if len(self.calls) >= self.max_per_minute:
            sleep_time = 60 - (now - self.calls[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
            self.calls.popleft()

        self.calls.append(now)

# Usage
limiter = RateLimiter(max_per_minute=60)

for protein in proteins:
    limiter.wait_if_needed()
    result = client.generate(protein, config)
```

## Advanced Patterns

### Streaming Results

Process results as they complete:

```python
import asyncio

async def stream_generate(client, proteins, config):
    """Yield results as they complete."""
    # Map each task back to the index of the protein it came from
    task_to_index = {
        asyncio.create_task(client.async_generate(p, config)): i
        for i, p in enumerate(proteins)
    }
    pending = set(task_to_index)

    while pending:
        done, pending = await asyncio.wait(
            pending,
            return_when=asyncio.FIRST_COMPLETED
        )

        for task in done:
            idx = task_to_index[task]
            result = await task
            yield idx, result

# Usage
async def process_stream():
    async for idx, result in stream_generate(client, proteins, config):
        print(f"Completed protein {idx}: {result.sequence[:20]}...")

asyncio.run(process_stream())
```

### Batch with Progress Tracking
|
||||
|
||||
```python
|
||||
from tqdm import tqdm
|
||||
import asyncio
|
||||
|
||||
async def batch_with_progress(client, proteins, config):
|
||||
"""Process batch with progress bar."""
|
||||
results = []
|
||||
|
||||
with tqdm(total=len(proteins)) as pbar:
|
||||
for protein in proteins:
|
||||
result = await client.async_generate(protein, config)
|
||||
results.append(result)
|
||||
pbar.update(1)
|
||||
|
||||
return results
|
||||
|
||||
# Usage
|
||||
results = asyncio.run(batch_with_progress(client, proteins, config))
|
||||
```
|
||||
|
||||
### Checkpoint and Resume
|
||||
|
||||
For long-running batch jobs:
|
||||
|
||||
```python
|
||||
import pickle
|
||||
import os
|
||||
|
||||
class CheckpointedBatchProcessor:
|
||||
"""Batch processor with checkpoint/resume capability."""
|
||||
|
||||
def __init__(self, client, checkpoint_file="checkpoint.pkl"):
|
||||
self.client = client
|
||||
self.checkpoint_file = checkpoint_file
|
||||
self.completed = self.load_checkpoint()
|
||||
|
||||
def load_checkpoint(self):
|
||||
if os.path.exists(self.checkpoint_file):
|
||||
with open(self.checkpoint_file, 'rb') as f:
|
||||
return pickle.load(f)
|
||||
return {}
|
||||
|
||||
def save_checkpoint(self):
|
||||
with open(self.checkpoint_file, 'wb') as f:
|
||||
pickle.dump(self.completed, f)
|
||||
|
||||
def process_batch(self, proteins, config):
|
||||
"""Process batch with checkpointing."""
|
||||
results = {}
|
||||
|
||||
for i, protein in enumerate(proteins):
|
||||
# Skip if already completed
|
||||
if i in self.completed:
|
||||
results[i] = self.completed[i]
|
||||
continue
|
||||
|
||||
try:
|
||||
result = self.client.generate(protein, config)
|
||||
results[i] = result
|
||||
self.completed[i] = result
|
||||
|
||||
# Save checkpoint every 10 items
|
||||
if i % 10 == 0:
|
||||
self.save_checkpoint()
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error processing {i}: {e}")
|
||||
self.save_checkpoint()
|
||||
raise
|
||||
|
||||
self.save_checkpoint()
|
||||
return results
|
||||
|
||||
# Usage
|
||||
processor = CheckpointedBatchProcessor(client)
|
||||
results = processor.process_batch(proteins, config)
|
||||
```

## Error Handling

### Common Errors and Solutions

```python
import time

from requests.exceptions import HTTPError, ConnectionError, Timeout


def robust_generate(client, protein, config):
    """Generate with comprehensive error handling."""
    try:
        return client.generate(protein, config)

    except HTTPError as e:
        if e.response.status_code == 401:
            raise ValueError("Invalid API token")
        elif e.response.status_code == 429:
            raise ValueError("Rate limit exceeded - slow down requests")
        elif e.response.status_code == 500:
            raise ValueError("Server error - try again later")
        else:
            raise

    except ConnectionError:
        raise ValueError("Network error - check internet connection")

    except Timeout:
        raise ValueError("Request timeout - try a smaller protein or increase the timeout")

    except Exception as e:
        raise ValueError(f"Unexpected error: {str(e)}")


# Usage with retry logic
def generate_with_full_retry(client, protein, config, max_retries=3):
    """Combine error handling with retry logic."""
    for attempt in range(max_retries):
        try:
            return robust_generate(client, protein, config)
        except ValueError as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
                continue
            raise
```
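
If you prefer a library over hand-rolled retries, the `tenacity` package expresses the same backoff pattern declaratively. A minimal sketch, assuming `tenacity` is installed (it is not a dependency of the esm package):

```python
from tenacity import retry, stop_after_attempt, wait_exponential


# Retries up to 3 times with exponential backoff capped at 30 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
def generate_with_tenacity(client, protein, config):
    return client.generate(protein, config)
```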

## Cost Optimization

### Strategies to Reduce Costs

**1. Use Appropriate Model Size:**

```python
# Use smaller model for testing
dev_client = ESM3ForgeInferenceClient(
    model="esm3-small-2024-08",
    token=token
)

# Use larger model only for final generation
prod_client = ESM3ForgeInferenceClient(
    model="esm3-large-2024-03",
    token=token
)
```

**2. Cache Results:**

```python
import hashlib
import json
import os
import pickle


class ForgeCache:
    """Cache Forge API results locally."""

    def __init__(self, cache_dir="forge_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get_cache_key(self, protein, config):
        """Generate cache key from inputs."""
        data = {
            'sequence': protein.sequence,
            'config': str(config)
        }
        return hashlib.md5(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def get(self, protein, config):
        """Get cached result, or None on a cache miss."""
        key = self.get_cache_key(protein, config)
        path = os.path.join(self.cache_dir, f"{key}.pkl")

        if os.path.exists(path):
            with open(path, 'rb') as f:
                return pickle.load(f)
        return None

    def set(self, protein, config, result):
        """Cache result."""
        key = self.get_cache_key(protein, config)
        path = os.path.join(self.cache_dir, f"{key}.pkl")

        with open(path, 'wb') as f:
            pickle.dump(result, f)


# Usage
cache = ForgeCache()


def cached_generate(client, protein, config):
    """Generate with caching."""
    cached = cache.get(protein, config)
    if cached is not None:
        return cached

    result = client.generate(protein, config)
    cache.set(protein, config, result)
    return result
```

**3. Batch Similar Requests:**

Group generation tasks of similar length to reduce overhead (see the usage sketch after this block):

```python
def batch_similar_tasks(proteins, max_batch_size=50):
    """Group proteins into batches of similar length."""
    # Sort by length so each batch contains similarly sized proteins
    sorted_proteins = sorted(proteins, key=lambda p: len(p.sequence))

    batches = []
    current_batch = []

    for protein in sorted_proteins:
        current_batch.append(protein)

        if len(current_batch) >= max_batch_size:
            batches.append(current_batch)
            current_batch = []

    if current_batch:
        batches.append(current_batch)

    return batches
```
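
A minimal usage sketch, assuming the async `batch_with_progress` helper from Advanced Patterns above and existing `client`, `proteins`, and `config` objects:

```python
import asyncio

# Feed length-sorted batches into the async helper defined earlier
for batch in batch_similar_tasks(proteins, max_batch_size=50):
    results = asyncio.run(batch_with_progress(client, batch, config))
```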

## Monitoring and Logging

### Track API Usage

```python
import time
from datetime import datetime


class ForgeMonitor:
    """Monitor Forge API usage."""

    def __init__(self):
        self.calls = []
        self.errors = []

    def log_call(self, model, protein_length, duration, success=True, error=None):
        """Log an API call."""
        entry = {
            'timestamp': datetime.now(),
            'model': model,
            'protein_length': protein_length,
            'duration': duration,
            'success': success,
            'error': str(error) if error else None
        }

        if success:
            self.calls.append(entry)
        else:
            self.errors.append(entry)

    def get_stats(self):
        """Get usage statistics."""
        total_calls = len(self.calls) + len(self.errors)
        success_rate = len(self.calls) / total_calls if total_calls > 0 else 0
        avg_duration = sum(c['duration'] for c in self.calls) / len(self.calls) if self.calls else 0

        return {
            'total_calls': total_calls,
            'successful': len(self.calls),
            'failed': len(self.errors),
            'success_rate': success_rate,
            'avg_duration': avg_duration
        }


# Usage
monitor = ForgeMonitor()


def monitored_generate(client, protein, config):
    """Generate with monitoring."""
    start = time.time()

    try:
        result = client.generate(protein, config)
        monitor.log_call(
            model=client.model,
            protein_length=len(protein.sequence),
            duration=time.time() - start,
            success=True
        )
        return result

    except Exception as e:
        monitor.log_call(
            model=client.model,
            protein_length=len(protein.sequence),
            duration=time.time() - start,
            success=False,
            error=e
        )
        raise


# Check stats
print(monitor.get_stats())
```

## AWS SageMaker Deployment

For dedicated infrastructure and enterprise use:

### Deployment Options

1. **AWS Marketplace Listing**: Deploy ESM3 via the AWS SageMaker Marketplace
2. **Custom Endpoint**: Configure a dedicated inference endpoint (see the sketch after this list)
3. **Batch Transform**: Use SageMaker Batch Transform for large-scale processing
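
A minimal sketch of calling a deployed endpoint with `boto3`. The endpoint name and JSON payload shape below are assumptions for illustration; consult the Marketplace listing for the actual request schema:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="esm3-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    # Assumed payload schema -- check the listing docs for the real one
    Body=json.dumps({"sequence": "MPRT___KEND"}),
)
result = json.loads(response["Body"].read())
```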

### Benefits

- Dedicated compute resources
- No rate limiting beyond your own infrastructure
- Data stays in your AWS environment
- Integration with AWS services
- Custom instance types and scaling

**More Information:**

- AWS Marketplace: https://aws.amazon.com/marketplace/seller-profile?id=seller-iw2nbscescndm
- Contact EvolutionaryScale for enterprise licensing
## Best Practices Summary

1. **Authentication**: Store tokens securely (environment variables or a secrets manager; see the sketch after this list)
2. **Rate Limiting**: Implement exponential backoff and respect limits
3. **Error Handling**: Always handle network errors and retries
4. **Caching**: Cache results for repeated queries
5. **Model Selection**: Use the appropriate model size for the task
6. **Batch Processing**: Use async/batch processing for multiple proteins
7. **Monitoring**: Track usage and costs
8. **Checkpointing**: Save progress for long-running jobs
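
For example, a minimal sketch of reading the token from an environment variable (the variable name `ESM_FORGE_TOKEN` is an arbitrary choice for this sketch):

```python
import os

from esm.sdk.forge import ESM3ForgeInferenceClient

# Read the token from the environment instead of hard-coding it
token = os.environ["ESM_FORGE_TOKEN"]
client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token=token)
```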

## Troubleshooting

### Connection Issues

```python
from esm.sdk.api import ESMProtein, GenerationConfig
from esm.sdk.forge import ESM3ForgeInferenceClient

# Test connection
try:
    client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token=token)
    test_protein = ESMProtein(sequence="MPRTK")
    result = client.generate(test_protein, GenerationConfig(track="sequence", num_steps=1))
    print("Connection successful!")
except Exception as e:
    print(f"Connection failed: {e}")
```

### Token Validation

```python
from requests.exceptions import HTTPError

from esm.sdk.api import ESMProtein, GenerationConfig
from esm.sdk.forge import ESM3ForgeInferenceClient


def validate_token(token):
    """Validate an API token with a minimal test call."""
    try:
        client = ESM3ForgeInferenceClient(
            model="esm3-small-2024-08",
            token=token
        )
        # Make minimal test call
        test = ESMProtein(sequence="MPR")
        client.generate(test, GenerationConfig(track="sequence", num_steps=1))
        return True
    except HTTPError as e:
        if e.response.status_code == 401:
            return False
        raise
```

## Additional Resources

- **Forge Platform**: https://forge.evolutionaryscale.ai
- **API Documentation**: Check the Forge dashboard for the latest API specs
- **Community Support**: Slack community at https://bit.ly/3FKwcWd
- **Enterprise Contact**: Contact EvolutionaryScale for custom deployments

685
skills/esm/references/workflows.md
Normal file
@@ -0,0 +1,685 @@

# ESM Workflows and Examples

## Overview

This document provides complete, end-to-end examples of common workflows using ESM3 and ESM C. Each workflow includes setup, execution, and analysis code.

## Workflow 1: Novel GFP Design with Chain-of-Thought

Design a novel fluorescent protein using ESM3's multimodal generation capabilities.

### Objective

Generate a green fluorescent protein (GFP) with specific properties using chain-of-thought reasoning across sequence, structure, and function.

### Complete Implementation

```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig, FunctionAnnotation

# Setup
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

# Step 1: Define target properties
print("Step 1: Defining target GFP properties...")

# Create protein with desired function
target_length = 238  # Typical GFP length
protein = ESMProtein(
    sequence="_" * target_length,
    function_annotations=[
        FunctionAnnotation(
            label="green_fluorescent_protein",
            start=65,
            end=75  # Chromophore region
        )
    ]
)

# Step 2: Generate initial sequence with function conditioning
print("Step 2: Generating initial sequence...")

config = GenerationConfig(
    track="sequence",
    num_steps=target_length // 3,  # Gradual generation
    temperature=0.7  # Moderate diversity
)
protein = model.generate(protein, config)
print(f"Generated sequence: {protein.sequence[:50]}...")

# Step 3: Predict structure
print("Step 3: Predicting structure...")

config = GenerationConfig(
    track="structure",
    num_steps=target_length // 2
)
protein = model.generate(protein, config)
print(f"Structure predicted, coordinates shape: {protein.coordinates.shape}")

# Step 4: Refine sequence based on structure
print("Step 4: Refining sequence based on structure...")

# Mask regions for refinement (e.g., surface residues):
# keep the chromophore region, re-mask every third position elsewhere
sequence_list = list(protein.sequence)
for i in range(0, 65):
    if i % 3 == 0:
        sequence_list[i] = '_'
for i in range(75, target_length):
    if i % 3 == 0:
        sequence_list[i] = '_'

protein.sequence = ''.join(sequence_list)

config = GenerationConfig(
    track="sequence",
    num_steps=50,
    temperature=0.5  # Lower temperature for refinement
)
protein = model.generate(protein, config)

# Step 5: Final validation
print("Step 5: Final validation...")

# Predict final structure
config = GenerationConfig(track="structure", num_steps=30)
protein = model.generate(protein, config)

# Save results
with open("novel_gfp.pdb", "w") as f:
    f.write(protein.to_pdb())

with open("novel_gfp_sequence.txt", "w") as f:
    f.write(f">Novel_GFP\n{protein.sequence}\n")

print(f"\nFinal GFP sequence:\n{protein.sequence}")
print(f"\nFunction annotations: {protein.function_annotations}")
print("Structure saved to: novel_gfp.pdb")
```

### Validation Steps

```python
# Analyze designed GFP
def analyze_gfp(protein):
    """Analyze generated GFP properties."""

    # Check chromophore region (should be around Ser65-Tyr66-Gly67,
    # i.e., zero-based indices 64:67)
    chromophore_region = protein.sequence[64:67]
    print(f"Chromophore region: {chromophore_region}")

    # Check barrel structure (GFPs have a beta-barrel fold);
    # analyze secondary structure if available
    if protein.secondary_structure:
        beta_content = protein.secondary_structure.count('E') / len(protein.sequence)
        print(f"Beta sheet content: {beta_content:.2%}")

    # Check sequence similarity to known GFPs
    # (would require BLAST or an alignment tool in practice;
    # see the rough sketch after this block)

    return {
        'length': len(protein.sequence),
        'chromophore': chromophore_region,
        'coordinates_available': protein.coordinates is not None
    }


analysis = analyze_gfp(protein)
print(f"\nAnalysis results: {analysis}")
```
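
As a rough stand-in for a proper BLAST search, here is a sketch of a quick similarity check using the standard-library `difflib`; `known_gfp` is a placeholder for a reference sequence you supply (e.g., avGFP, UniProt P42212):

```python
from difflib import SequenceMatcher

# Placeholder -- substitute a real reference GFP sequence here
known_gfp = "..."

# Ratio of matching blocks: a crude proxy, not a real alignment score
similarity = SequenceMatcher(None, protein.sequence, known_gfp).ratio()
print(f"Rough similarity to reference GFP: {similarity:.2%}")
```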

## Workflow 2: Protein Variant Library Generation

Generate and analyze a library of protein variants for directed evolution.

### Objective

Create variants of a parent protein by targeted mutagenesis while maintaining structural integrity.

### Complete Implementation

```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
import numpy as np
from sklearn.cluster import KMeans

# Setup
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

# Parent protein
parent_sequence = "MPRTKEINDAGLIVHSPQWFYKARNDTESLGKIVHEFPM"
parent_protein = ESMProtein(sequence=parent_sequence)

# Define mutation parameters
num_variants = 50
positions_to_mutate = 5  # Number of positions per variant

# Step 1: Generate variant library
print("Generating variant library...")

variants = []
for i in range(num_variants):
    # Create masked sequence with random positions
    seq_list = list(parent_sequence)

    # Select random positions to mutate
    mutation_positions = np.random.choice(
        len(seq_list),
        size=positions_to_mutate,
        replace=False
    )

    for pos in mutation_positions:
        seq_list[pos] = '_'

    # Generate variant
    variant_protein = ESMProtein(sequence=''.join(seq_list))

    config = GenerationConfig(
        track="sequence",
        num_steps=positions_to_mutate * 2,
        temperature=0.8  # Higher diversity
    )

    variant = model.generate(variant_protein, config)
    variants.append(variant.sequence)

    if (i + 1) % 10 == 0:
        print(f"Generated {i + 1}/{num_variants} variants")

print(f"\nGenerated {len(variants)} variants")

# Step 2: Predict structures for variants
print("\nPredicting structures...")

variant_proteins_with_structure = []
for i, seq in enumerate(variants):
    protein = ESMProtein(sequence=seq)

    config = GenerationConfig(
        track="structure",
        num_steps=len(seq) // 2
    )

    protein_with_structure = model.generate(protein, config)
    variant_proteins_with_structure.append(protein_with_structure)

    if (i + 1) % 10 == 0:
        print(f"Predicted structures for {i + 1}/{len(variants)} variants")

# Step 3: Analyze variant diversity
print("\nAnalyzing variant diversity...")

# Calculate Hamming distances from parent
def hamming_distance(seq1, seq2):
    """Calculate Hamming distance between sequences."""
    return sum(c1 != c2 for c1, c2 in zip(seq1, seq2))

distances = [hamming_distance(parent_sequence, var) for var in variants]
print(f"Average mutations per variant: {np.mean(distances):.1f}")
print(f"Mutation range: {min(distances)}-{max(distances)}")

# Step 4: Get embeddings for clustering
print("\nGenerating embeddings for clustering...")

from esm.models.esmc import ESMC

embedding_model = ESMC.from_pretrained("esmc-300m").to("cuda")

def get_embedding(sequence):
    """Get mean-pooled embedding for a sequence."""
    protein = ESMProtein(sequence=sequence)
    tensor = embedding_model.encode(protein)
    emb = embedding_model.forward(tensor)
    return emb.mean(dim=1).cpu().detach().numpy().flatten()

variant_embeddings = np.array([get_embedding(seq) for seq in variants])

# Step 5: Cluster variants
print("Clustering variants...")

n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(variant_embeddings)

# Analyze clusters
print("\nCluster analysis:")
for i in range(n_clusters):
    cluster_variants = [var for var, label in zip(variants, cluster_labels) if label == i]
    cluster_distances = [hamming_distance(parent_sequence, var) for var in cluster_variants]

    print(f"\nCluster {i}:")
    print(f"  Size: {len(cluster_variants)}")
    print(f"  Avg distance from parent: {np.mean(cluster_distances):.1f}")
    print(f"  Representative: {cluster_variants[0][:40]}...")

# Step 6: Select diverse representatives
print("\nSelecting diverse representatives...")

representatives = []
for i in range(n_clusters):
    # Get embeddings for this cluster
    cluster_indices = np.where(cluster_labels == i)[0]
    cluster_embs = variant_embeddings[cluster_indices]

    # Find the member closest to the cluster centroid
    centroid = cluster_embs.mean(axis=0)
    distances_to_centroid = np.linalg.norm(cluster_embs - centroid, axis=1)
    rep_idx = cluster_indices[np.argmin(distances_to_centroid)]

    representatives.append(variants[rep_idx])

# Save results
print("\nSaving results...")

with open("variant_library.fasta", "w") as f:
    f.write(f">Parent\n{parent_sequence}\n\n")
    for i, var in enumerate(variants):
        f.write(f">Variant_{i+1}_Cluster_{cluster_labels[i]}\n{var}\n")

with open("representative_variants.fasta", "w") as f:
    for i, rep in enumerate(representatives):
        f.write(f">Representative_Cluster_{i}\n{rep}\n")

print("Variant library saved to: variant_library.fasta")
print("Representatives saved to: representative_variants.fasta")
```

## Workflow 3: Structure-Based Sequence Optimization

Optimize a protein sequence to improve stability while maintaining function.

### Objective

Given a protein structure, design sequences that maintain the fold but have improved properties.

### Complete Implementation

```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
import numpy as np

# Setup
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

# Load target structure (e.g., from PDB)
target_protein = ESMProtein.from_pdb("target_structure.pdb")
original_sequence = target_protein.sequence

print(f"Original sequence: {original_sequence}")
print(f"Structure loaded: {target_protein.coordinates.shape}")

# Step 1: Generate multiple sequence designs
print("\nGenerating optimized sequences...")

num_designs = 20
optimized_sequences = []

for i in range(num_designs):
    # Start from the structure alone, with no sequence
    # (clone() assumes coordinates are a torch tensor)
    design_protein = ESMProtein(
        coordinates=target_protein.coordinates.clone(),
        secondary_structure=target_protein.secondary_structure
    )

    # Generate a sequence for this structure
    config = GenerationConfig(
        track="sequence",
        num_steps=len(original_sequence),
        temperature=0.7,
        condition_on_coordinates_only=True
    )

    designed = model.generate(design_protein, config)
    optimized_sequences.append(designed.sequence)

    if (i + 1) % 5 == 0:
        print(f"Generated {i + 1}/{num_designs} designs")

# Step 2: Validate structural compatibility
print("\nValidating structural compatibility...")

validated_designs = []

for seq in optimized_sequences:
    # Predict structure for the designed sequence
    test_protein = ESMProtein(sequence=seq)

    config = GenerationConfig(
        track="structure",
        num_steps=len(seq) // 2
    )

    predicted = model.generate(test_protein, config)

    # Simplified check: in practice, align the prediction to the
    # target and compute RMSD; here we only require that structure
    # prediction succeeds
    if predicted.coordinates is not None:
        validated_designs.append(seq)

print(f"Validated {len(validated_designs)}/{num_designs} designs")

# Step 3: Analyze sequence properties
print("\nAnalyzing sequence properties...")

def calculate_properties(sequence):
    """Calculate basic sequence properties."""
    # Hydrophobicity (simplified)
    hydrophobic = "AILMFWYV"
    hydrophobic_fraction = sum(1 for aa in sequence if aa in hydrophobic) / len(sequence)

    # Net charge (simplified)
    positive = "KR"
    negative = "DE"
    net_charge = sum(1 for aa in sequence if aa in positive) - sum(1 for aa in sequence if aa in negative)

    # Aromatic content
    aromatic = "FWY"
    aromatic_fraction = sum(1 for aa in sequence if aa in aromatic) / len(sequence)

    return {
        'hydrophobic_fraction': hydrophobic_fraction,
        'net_charge': net_charge,
        'aromatic_fraction': aromatic_fraction
    }

# Compare to original
original_props = calculate_properties(original_sequence)
print("\nOriginal properties:")
print(f"  Hydrophobic: {original_props['hydrophobic_fraction']:.2%}")
print(f"  Net charge: {original_props['net_charge']:+d}")
print(f"  Aromatic: {original_props['aromatic_fraction']:.2%}")

# Analyze designs
design_properties = [calculate_properties(seq) for seq in validated_designs]

avg_hydrophobic = np.mean([p['hydrophobic_fraction'] for p in design_properties])
avg_charge = np.mean([p['net_charge'] for p in design_properties])
avg_aromatic = np.mean([p['aromatic_fraction'] for p in design_properties])

print("\nDesigned sequences (average):")
print(f"  Hydrophobic: {avg_hydrophobic:.2%}")
print(f"  Net charge: {avg_charge:+.1f}")
print(f"  Aromatic: {avg_aromatic:.2%}")

# Step 4: Rank designs
print("\nRanking designs...")

def score_design(sequence, original_props):
    """Score a design based on desired properties."""
    props = calculate_properties(sequence)

    # Prefer higher hydrophobic content (for stability)
    hydrophobic_score = props['hydrophobic_fraction']

    # Prefer similar charge to the original
    charge_score = 1.0 / (1.0 + abs(props['net_charge'] - original_props['net_charge']))

    # Combined score
    return hydrophobic_score * 0.6 + charge_score * 0.4

scores = [(seq, score_design(seq, original_props)) for seq in validated_designs]
scores.sort(key=lambda x: x[1], reverse=True)

print("\nTop 5 designs:")
for i, (seq, score) in enumerate(scores[:5]):
    print(f"\n{i+1}. Score: {score:.3f}")
    print(f"   Sequence: {seq[:40]}...")

# Step 5: Save results
print("\nSaving results...")

with open("optimized_sequences.fasta", "w") as f:
    f.write(f">Original\n{original_sequence}\n\n")

    for i, (seq, score) in enumerate(scores):
        props = calculate_properties(seq)
        f.write(f">Design_{i+1}_Score_{score:.3f}\n")
        f.write(f"# Hydrophobic: {props['hydrophobic_fraction']:.2%}, ")
        f.write(f"Charge: {props['net_charge']:+d}, ")
        f.write(f"Aromatic: {props['aromatic_fraction']:.2%}\n")
        f.write(f"{seq}\n\n")

print("Results saved to: optimized_sequences.fasta")
```

## Workflow 4: Function Prediction Pipeline

Predict protein function from sequence using ESM3 and ESM C.

### Objective

Build a pipeline that predicts protein function using both generative (ESM3) and embedding (ESM C) approaches.

### Complete Implementation

```python
from esm.models.esm3 import ESM3
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, GenerationConfig

# Setup models
esm3_model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
esmc_model = ESMC.from_pretrained("esmc-600m").to("cuda")

# Example: Predict whether a protein is an enzyme
# (In practice, you'd have a labeled training set)

def predict_function_generative(sequence):
    """Predict function using ESM3's generative approach."""
    protein = ESMProtein(sequence=sequence)

    # Generate function annotations
    config = GenerationConfig(
        track="function",
        num_steps=20,
        temperature=0.3  # Low temperature for confident predictions
    )

    protein_with_function = esm3_model.generate(protein, config)

    return protein_with_function.function_annotations

def predict_function_embedding(sequence, function_classifier):
    """Predict function using ESM C embeddings + a trained classifier."""
    # Get embedding
    protein = ESMProtein(sequence=sequence)
    tensor = esmc_model.encode(protein)
    embedding = esmc_model.forward(tensor)

    # Mean pool
    embedding_pooled = embedding.mean(dim=1).cpu().detach().numpy()

    # Predict with classifier
    prediction = function_classifier.predict(embedding_pooled)
    probability = function_classifier.predict_proba(embedding_pooled)

    return prediction[0], probability[0]

# Example workflow with test sequences
test_sequences = {
    "kinase": "MPRTKEINDAGLIVHSPQWFYKARNDTESLGKIVHEF",
    "protease": "AGLIVHSPQWFYKARNDTESLGKIVHEFPMCDEGH",
    "transporter": "KTEFLNDGRPMLIVHSPQWFYKARNDTESLGKIVH"
}

print("Predicting functions...\n")

for name, sequence in test_sequences.items():
    print(f"{name.upper()}:")
    print(f"Sequence: {sequence[:30]}...")

    # Method 1: Generative
    functions = predict_function_generative(sequence)
    print(f"  Generative predictions: {functions}")

    # Method 2: Embedding-based requires a trained classifier;
    # see the training sketch after this block

    print()
```
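
To make the embedding-based path concrete, here is a minimal training sketch for `function_classifier`, reusing the mean-pooling convention above. The random-forest choice and the `labeled_data` variable are illustrative assumptions, not part of the pipeline itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical labeled data -- replace with real (sequence, label) pairs;
# with a realistic dataset you would also cross-validate the classifier
labeled_data = [
    ("MPRTKEINDAGLIVHSPQWFYK", 1),  # enzyme
    ("AGLIVHSPQWFYKARNDTESL", 0),   # non-enzyme
    # ... more labeled sequences
]

# Embed each labeled sequence with the same mean pooling used above
X = np.vstack([
    esmc_model.forward(esmc_model.encode(ESMProtein(sequence=seq)))
    .mean(dim=1).cpu().detach().numpy()
    for seq, _ in labeled_data
])
y = np.array([label for _, label in labeled_data])

function_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
function_classifier.fit(X, y)

# Now usable with predict_function_embedding defined above
pred, proba = predict_function_embedding(
    "KTEFLNDGRPMLIVHSPQWFYK", function_classifier
)
```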

## Workflow 5: Embedding-Based Clustering and Analysis

Cluster and analyze a large protein dataset using ESM C embeddings.

### Complete Implementation

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Setup
model = ESMC.from_pretrained("esmc-600m").to("cuda")

# Load protein dataset (example)
sequences = [
    # In practice, load from FASTA or a database
    "MPRTKEINDAGLIVHSPQWFYK",
    "AGLIVHSPQWFYKARNDTESL",
    # ... more sequences
]

print(f"Loaded {len(sequences)} sequences")

# Step 1: Generate embeddings
print("Generating embeddings...")

embeddings = []
for i, seq in enumerate(sequences):
    protein = ESMProtein(sequence=seq)
    tensor = model.encode(protein)
    emb = model.forward(tensor)

    # Mean pooling
    emb_pooled = emb.mean(dim=1).cpu().detach().numpy().flatten()
    embeddings.append(emb_pooled)

    if (i + 1) % 100 == 0:
        print(f"Processed {i + 1}/{len(sequences)}")

embeddings = np.array(embeddings)
print(f"Embeddings shape: {embeddings.shape}")

# Step 2: Dimensionality reduction for visualization
print("\nReducing dimensionality...")

# PCA for initial reduction
pca = PCA(n_components=50)
embeddings_pca = pca.fit_transform(embeddings)
print(f"PCA explained variance (first 10 components): {pca.explained_variance_ratio_[:10].sum():.2%}")

# t-SNE for visualization
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings_pca)

# Step 3: Clustering
print("\nClustering...")

# DBSCAN for density-based clustering
clustering = DBSCAN(eps=0.5, min_samples=5)
cluster_labels = clustering.fit_predict(embeddings)

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = list(cluster_labels).count(-1)

print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")

# Step 4: Visualize
print("\nGenerating visualization...")

plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=cluster_labels,
    cmap='viridis',
    alpha=0.6
)
plt.colorbar(scatter)
plt.title("Protein Sequence Clustering (ESM C Embeddings)")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.savefig("protein_clusters.png", dpi=300, bbox_inches='tight')
print("Visualization saved to: protein_clusters.png")

# Step 5: Analyze clusters
print("\nCluster analysis:")

for cluster_id in range(n_clusters):
    cluster_indices = np.where(cluster_labels == cluster_id)[0]
    cluster_seqs = [sequences[i] for i in cluster_indices]

    print(f"\nCluster {cluster_id}:")
    print(f"  Size: {len(cluster_seqs)}")
    print(f"  Avg length: {np.mean([len(s) for s in cluster_seqs]):.1f}")
    print(f"  Example: {cluster_seqs[0][:40]}...")

# Save cluster assignments
with open("cluster_assignments.txt", "w") as f:
    for i, (seq, label) in enumerate(zip(sequences, cluster_labels)):
        f.write(f"Sequence_{i}\tCluster_{label}\t{seq}\n")

print("\nCluster assignments saved to: cluster_assignments.txt")
```

## Additional Workflow Tips

### Memory Management for Large Datasets

```python
def process_large_dataset(sequences, batch_size=32):
    """Process a large dataset with explicit memory management."""
    import gc
    import torch

    results = []

    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]

        # Process batch (process_sequence is a user-supplied function)
        batch_results = [process_sequence(seq) for seq in batch]
        results.extend(batch_results)

        # Release cached GPU memory between batches
        torch.cuda.empty_cache()
        gc.collect()

        if (i + batch_size) % 100 == 0:
            print(f"Processed {min(i + batch_size, len(sequences))}/{len(sequences)}")

    return results
```

### Parallel Processing

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_workflow(sequences, n_workers=4):
    """Process sequences in parallel threads.

    Note: process_sequence is user-supplied and must be thread-safe;
    threads help most when it is I/O-bound (e.g., Forge API calls).
    """
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(process_sequence, sequences))

    return results
```

These workflows provide comprehensive examples for common ESM use cases. Adapt them to your specific needs and always validate results with appropriate biological experiments.