Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/esm/SKILL.md Normal file

@@ -0,0 +1,300 @@
---
name: esm
description: Comprehensive toolkit for protein language models including ESM3 (generative multimodal protein design across sequence, structure, and function) and ESM C (efficient protein embeddings and representations). Use this skill when working with protein sequences, structures, or function prediction; designing novel proteins; generating protein embeddings; performing inverse folding; or conducting protein engineering tasks. Supports both local model usage and cloud-based Forge API for scalable inference.
---
# ESM: Evolutionary Scale Modeling
## Overview
ESM provides state-of-the-art protein language models for understanding, generating, and designing proteins. This skill enables working with two model families: ESM3 for generative protein design across sequence, structure, and function, and ESM C for efficient protein representation learning and embeddings.
## Core Capabilities
### 1. Protein Sequence Generation with ESM3
Generate novel protein sequences with desired properties using multimodal generative modeling.
**When to use:**
- Designing proteins with specific functional properties
- Completing partial protein sequences
- Generating variants of existing proteins
- Creating proteins with desired structural characteristics
**Basic usage:**
```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig
# Load model locally
model: ESM3InferenceClient = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Create protein prompt
protein = ESMProtein(sequence="MPRT___KEND") # '_' represents masked positions
# Generate completion
protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
print(protein.sequence)
```
**For remote/cloud usage via Forge API:**
```python
from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig
# Connect to Forge
model = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", url="https://forge.evolutionaryscale.ai", token="<token>")
# Generate a completion for a masked prompt
protein = ESMProtein(sequence="MPRT___KEND")
protein = model.generate(protein, GenerationConfig(track="sequence", num_steps=8))
```
See `references/esm3-api.md` for detailed ESM3 model specifications, advanced generation configurations, and multimodal prompting examples.
### 2. Structure Prediction and Inverse Folding
Use ESM3's structure track for structure prediction from sequence or inverse folding (sequence design from structure).
**Structure prediction:**
```python
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig
# Predict structure from sequence
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...")
protein_with_structure = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=len(protein.sequence))
)
# Access predicted structure
coordinates = protein_with_structure.coordinates # 3D coordinates
pdb_string = protein_with_structure.to_pdb()
```
**Inverse folding (sequence from structure):**
```python
# Design sequence for a target structure
protein_with_structure = ESMProtein.from_pdb("target_structure.pdb")
protein_with_structure.sequence = None # Remove sequence
# Generate sequence that folds to this structure
designed_protein = model.generate(
protein_with_structure,
GenerationConfig(track="sequence", num_steps=50, temperature=0.7)
)
```
### 3. Protein Embeddings with ESM C
Generate high-quality embeddings for downstream tasks like function prediction, classification, or similarity analysis.
**When to use:**
- Extracting protein representations for machine learning
- Computing sequence similarities
- Feature extraction for protein classification
- Transfer learning for protein-related tasks
**Basic usage:**
```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
# Load ESM C model
model = ESMC.from_pretrained("esmc-300m").to("cuda")
# Get embeddings
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP...")
protein_tensor = model.encode(protein)
# Request embeddings via the logits API
output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
embeddings = output.embeddings
```
**Batch processing:**
```python
# Encode multiple proteins
proteins = [
    ESMProtein(sequence="MPRTKEIND..."),
    ESMProtein(sequence="AGLIVHSPQ..."),
    ESMProtein(sequence="KTEFLNDGR...")
]
embeddings_list = [
    model.logits(model.encode(p), LogitsConfig(sequence=True, return_embeddings=True)).embeddings
    for p in proteins
]
```
See `references/esm-c-api.md` for ESM C model details, efficiency comparisons, and advanced embedding strategies.
### 4. Function Conditioning and Annotation
Use ESM3's function track to generate proteins with specific functional annotations or predict function from sequence.
**Function-conditioned generation:**
```python
from esm.sdk.api import ESMProtein, FunctionAnnotation, GenerationConfig
# Create protein with desired function
protein = ESMProtein(
sequence="_" * 200, # Generate 200 residue protein
function_annotations=[
FunctionAnnotation(label="fluorescent_protein", start=50, end=150)
]
)
# Generate sequence with specified function
functional_protein = model.generate(
protein,
GenerationConfig(track="sequence", num_steps=200)
)
```
### 5. Chain-of-Thought Generation
Iteratively refine protein designs using ESM3's chain-of-thought generation approach.
```python
from esm.sdk.api import GenerationConfig
# Multi-step refinement
protein = ESMProtein(sequence="MPRT" + "_" * 100 + "KEND")
# Step 1: Generate initial structure
config = GenerationConfig(track="structure", num_steps=50)
protein = model.generate(protein, config)
# Step 2: Refine sequence based on structure
config = GenerationConfig(track="sequence", num_steps=50, temperature=0.5)
protein = model.generate(protein, config)
# Step 3: Predict function
config = GenerationConfig(track="function", num_steps=20)
protein = model.generate(protein, config)
```
### 6. Batch Processing with Forge API
Process multiple proteins efficiently using Forge's async executor.
```python
from esm.sdk.forge import ESM3ForgeInferenceClient
import asyncio
client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token="<token>")
# Async batch processing
async def batch_generate(proteins_list):
    tasks = [
        client.async_generate(protein, GenerationConfig(track="sequence"))
        for protein in proteins_list
    ]
    return await asyncio.gather(*tasks)
# Execute
proteins = [ESMProtein(sequence=f"MPRT{'_' * 50}KEND") for _ in range(10)]
results = asyncio.run(batch_generate(proteins))
```
See `references/forge-api.md` for detailed Forge API documentation, authentication, rate limits, and batch processing patterns.
## Model Selection Guide
**ESM3 Models (Generative):**
- `esm3-sm-open-v1` (1.4B) - Open weights, local usage, good for experimentation
- `esm3-medium-2024-08` (7B) - Best balance of quality and speed (Forge only)
- `esm3-large-2024-03` (98B) - Highest quality, slower (Forge only)
**ESM C Models (Embeddings):**
- `esmc-300m` (30 layers) - Lightweight, fast inference
- `esmc-600m` (36 layers) - Balanced performance
- `esmc-6b` (80 layers) - Maximum representation quality (Forge only)
**Selection criteria:**
- **Local development/testing:** Use `esm3-sm-open-v1` or `esmc-300m`
- **Production quality:** Use `esm3-medium-2024-08` via Forge
- **Maximum accuracy:** Use `esm3-large-2024-03` or `esmc-6b`
- **High throughput:** Use Forge API with batch executor
- **Cost optimization:** Use smaller models, implement caching strategies
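These criteria can be folded into a small helper for scripts that switch between backends. A minimal sketch; the `choose_model` helper and its mapping are illustrative, not part of the ESM SDK:

```python
def choose_model(task: str, prefer_local: bool = False) -> str:
    """Illustrative mapping from a coarse task description to a model ID."""
    if task == "embeddings":
        # esmc-6b offers the richest representations but is served via Forge
        return "esmc-300m" if prefer_local else "esmc-6b"
    if prefer_local:
        return "esm3-sm-open-v1"  # the ESM3 model with open weights
    return "esm3-medium-2024-08"  # balanced Forge-hosted default

print(choose_model("generation"))  # esm3-medium-2024-08
```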
## Installation
**Basic installation:**
```bash
uv pip install esm
```
**With Flash Attention (recommended for faster inference):**
```bash
uv pip install esm
uv pip install flash-attn --no-build-isolation
```
**For Forge API access:**
```bash
uv pip install esm # SDK includes Forge client
```
No additional dependencies are needed. Obtain a Forge API token at https://forge.evolutionaryscale.ai
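A quick smoke test confirms the package imports cleanly; this only touches the SDK data classes, so it needs no GPU or model download:

```python
# Verify the installation without downloading any weights
from esm.sdk.api import ESMProtein, GenerationConfig

protein = ESMProtein(sequence="MPRTK")
config = GenerationConfig(track="sequence", num_steps=1)
print(protein.sequence, config.track)
```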
## Common Workflows
For detailed examples and complete workflows, see `references/workflows.md` which includes:
- Novel GFP design with chain-of-thought
- Protein variant generation and screening
- Structure-based sequence optimization
- Function prediction pipelines
- Embedding-based clustering and analysis
## References
This skill includes comprehensive reference documentation:
- `references/esm3-api.md` - ESM3 model architecture, API reference, generation parameters, and multimodal prompting
- `references/esm-c-api.md` - ESM C model details, embedding strategies, and performance optimization
- `references/forge-api.md` - Forge platform documentation, authentication, batch processing, and deployment
- `references/workflows.md` - Complete examples and common workflow patterns
These references contain detailed API specifications, parameter descriptions, and advanced usage patterns. Load them as needed for specific tasks.
## Best Practices
**For generation tasks:**
- Start with smaller models for prototyping (`esm3-sm-open-v1`)
- Use temperature parameter to control diversity (0.0 = deterministic, 1.0 = diverse)
- Implement iterative refinement with chain-of-thought for complex designs
- Validate generated sequences with structure prediction or wet-lab experiments
**For embedding tasks:**
- Batch process sequences when possible for efficiency
- Cache embeddings for repeated analyses
- Normalize embeddings when computing similarities (see the sketch after this list)
- Use appropriate model size based on downstream task requirements
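For the normalization point, a minimal sketch; the random tensor is a stand-in for real mean-pooled ESM C output of shape `(N, D)`:

```python
import torch
import torch.nn.functional as F

embeddings = torch.randn(4, 1152)  # stand-in for mean-pooled ESM C embeddings

# L2-normalize so that dot products equal cosine similarities
normalized = F.normalize(embeddings, dim=-1)
similarity_matrix = normalized @ normalized.T  # (N, N), values in [-1, 1]
print(similarity_matrix.shape)
```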
**For production deployment:**
- Use Forge API for scalability and latest models
- Implement error handling and retry logic for API calls
- Monitor token usage and implement rate limiting
- Consider AWS SageMaker deployment for dedicated infrastructure
## Resources and Documentation
- **GitHub Repository:** https://github.com/evolutionaryscale/esm
- **Forge Platform:** https://forge.evolutionaryscale.ai
- **Scientific Paper:** Hayes et al., Science (2025) - https://www.science.org/doi/10.1126/science.ads0018
- **Blog Posts:**
- ESM3 Release: https://www.evolutionaryscale.ai/blog/esm3-release
- ESM C Launch: https://www.evolutionaryscale.ai/blog/esm-cambrian
- **Community:** Slack community at https://bit.ly/3FKwcWd
- **Model Weights:** HuggingFace EvolutionaryScale organization
## Responsible Use
ESM is designed for beneficial applications in protein engineering, drug discovery, and scientific research. Follow the Responsible Biodesign Framework (https://responsiblebiodesign.ai/) when designing novel proteins. Consider biosafety and ethical implications of protein designs before experimental validation.

skills/esm/references/esm-c-api.md Normal file

@@ -0,0 +1,583 @@
# ESM C API Reference
## Overview
ESM C (Cambrian) is a family of protein language models optimized for representation learning and efficient embedding generation. Designed as a drop-in replacement for ESM2, ESM C provides significant improvements in speed and quality across all model sizes.
## Model Architecture
**ESM C Family Models:**
| Model ID | Parameters | Layers | Best For |
|----------|-----------|--------|----------|
| `esmc-300m` | 300M | 30 | Fast inference, lightweight applications |
| `esmc-600m` | 600M | 36 | Balanced performance and quality |
| `esmc-6b` | 6B | 80 | Maximum representation quality |
**Key Features:**
- 3x faster inference than ESM2
- Improved perplexity and embedding quality
- Efficient architecture for production deployment
- Compatible with ESM2 workflows (drop-in replacement)
- Support for long sequences (up to 1024 residues efficiently)
**Architecture Improvements over ESM2:**
- Optimized attention mechanisms
- Better token representation
- Enhanced training procedures
- Reduced memory footprint
## Core API Components
### ESMC Class
Main interface for ESM C models.
**Model Loading:**
```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein
# Load model with automatic device placement
model = ESMC.from_pretrained("esmc-300m").to("cuda")
# Or specify device explicitly
model = ESMC.from_pretrained("esmc-600m").to("cpu")
# Maximum quality: esmc-6b is served through the Forge API rather than as open weights
```
**Model Selection Criteria:**
- **esmc-300m**: Development, real-time applications, batch processing of many sequences
- **esmc-600m**: Production deployments, good quality/speed balance
- **esmc-6b**: Research, maximum accuracy for downstream tasks
### Basic Embedding Generation
**Single Sequence:**
```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
# Load model
model = ESMC.from_pretrained("esmc-600m").to("cuda")
# Create protein
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")
# Encode to tensor
protein_tensor = model.encode(protein)
# Request per-position logits and embeddings in a single call
output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
embeddings = output.embeddings
logits = output.logits.sequence
print(f"Embedding shape: {embeddings.shape}")
print(f"Logits shape: {logits.shape}")
```
**Output Shapes:**
For a sequence of L residues (encoding adds BOS/EOS special tokens, so the sequence dimension is L + 2):
- `embeddings.shape`: `(1, L + 2, hidden_dim)`, where hidden_dim depends on the model:
  - esmc-300m: hidden_dim = 960
  - esmc-600m: hidden_dim = 1152
  - esmc-6b: hidden_dim = 2560
- `logits.shape`: `(1, L + 2, 64)` - per-position predictions over the 64-token vocabulary
### Batch Processing
Process multiple sequences efficiently:
```python
import torch
# Multiple proteins
sequences = [
"MPRTKEINDAGLIVHSP",
"AGKWFYLTQSNHERVPM",
"DEIFKRNAVWGSLTPQY"
]
proteins = [ESMProtein(sequence=seq) for seq in sequences]
# Encode all
protein_tensors = [model.encode(p) for p in proteins]
# Process tensors one at a time (for true batching, pad to a common length first)
embeddings_list = []
for tensor in protein_tensors:
    output = model.logits(tensor, LogitsConfig(sequence=True, return_embeddings=True))
    embeddings_list.append(output.embeddings)
print(f"Processed {len(embeddings_list)} proteins")
```
**Efficient Batching for Variable Lengths:**
```python
def batch_encode_variable_length(model, sequences, max_batch_size=32):
    """
    Encode sequences of variable length, grouping similar lengths
    together so each processing batch stays homogeneous.
    """
    # Sort by length
    sorted_seqs = sorted(enumerate(sequences), key=lambda x: len(x[1]))
    results = [None] * len(sequences)

    def flush(batch, batch_indices):
        proteins = [ESMProtein(sequence=s) for s in batch]
        embeddings = [
            model.logits(model.encode(p), LogitsConfig(return_embeddings=True)).embeddings
            for p in proteins
        ]
        for i, emb in zip(batch_indices, embeddings):
            results[i] = emb

    batch = []
    batch_indices = []
    for idx, seq in sorted_seqs:
        # Flush before adding a sequence whose length differs significantly
        if batch and (len(batch) >= max_batch_size or abs(len(seq) - len(batch[0])) > 10):
            flush(batch, batch_indices)
            batch = []
            batch_indices = []
        batch.append(seq)
        batch_indices.append(idx)
    # Process the remainder
    if batch:
        flush(batch, batch_indices)
    return results
```
## Common Use Cases
### 1. Sequence Similarity Analysis
Compute similarity between proteins using embeddings:
```python
import torch
import torch.nn.functional as F
from esm.sdk.api import ESMProtein, LogitsConfig

def get_sequence_embedding(model, sequence):
    """Get a mean-pooled sequence embedding."""
    protein = ESMProtein(sequence=sequence)
    tensor = model.encode(protein)
    output = model.logits(tensor, LogitsConfig(return_embeddings=True))
    # Mean pooling over the sequence length
    return output.embeddings.mean(dim=1)
# Get embeddings
seq1_emb = get_sequence_embedding(model, "MPRTKEINDAGLIVHSP")
seq2_emb = get_sequence_embedding(model, "MPRTKEINDAGLIVHSQ") # Similar
seq3_emb = get_sequence_embedding(model, "WWWWWWWWWWWWWWWWW") # Different
# Compute cosine similarity
sim_1_2 = F.cosine_similarity(seq1_emb, seq2_emb)
sim_1_3 = F.cosine_similarity(seq1_emb, seq3_emb)
print(f"Similarity (1,2): {sim_1_2.item():.4f}")
print(f"Similarity (1,3): {sim_1_3.item():.4f}")
```
### 2. Protein Classification
Use embeddings as features for classification:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Generate embeddings for training set
def embed_dataset(model, sequences):
    embeddings = []
    for seq in sequences:
        protein = ESMProtein(sequence=seq)
        tensor = model.encode(protein)
        output = model.logits(tensor, LogitsConfig(return_embeddings=True))
        emb = output.embeddings.mean(dim=1)  # Mean pooling
        embeddings.append(emb.cpu().detach().numpy().flatten())
    return np.array(embeddings)
# Example: Classify proteins by function
train_sequences = [...] # Your sequences
train_labels = [...] # Your labels
embeddings = embed_dataset(model, train_sequences)
# Train classifier
X_train, X_test, y_train, y_test = train_test_split(
embeddings, train_labels, test_size=0.2
)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
# Evaluate
accuracy = classifier.score(X_test, y_test)
print(f"Classification accuracy: {accuracy:.4f}")
```
### 3. Protein Clustering
Cluster proteins based on sequence similarity:
```python
from sklearn.cluster import KMeans
import numpy as np
# Generate embeddings
sequences = [...] # Your protein sequences
embeddings = embed_dataset(model, sequences)
# Cluster
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)
# Analyze clusters
for i in range(n_clusters):
    cluster_seqs = [seq for seq, label in zip(sequences, cluster_labels) if label == i]
    print(f"Cluster {i}: {len(cluster_seqs)} sequences")
```
### 4. Sequence Search and Retrieval
Find similar sequences in a database:
```python
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def build_sequence_index(model, database_sequences):
    """Build a searchable index of sequence embeddings."""
    embeddings = []
    for seq in database_sequences:
        emb = get_sequence_embedding(model, seq)
        embeddings.append(emb.cpu().detach().numpy().flatten())
    return np.array(embeddings)

def search_similar_sequences(model, query_seq, database_embeddings,
                             database_sequences, top_k=10):
    """Find the top-k most similar sequences."""
    query_emb = get_sequence_embedding(model, query_seq)
    query_emb_np = query_emb.cpu().detach().numpy().flatten().reshape(1, -1)
    # Compute similarities
    similarities = cosine_similarity(query_emb_np, database_embeddings)[0]
    # Get top-k
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    results = [
        (database_sequences[idx], similarities[idx])
        for idx in top_indices
    ]
    return results
# Example usage
database_seqs = [...] # Large sequence database
index = build_sequence_index(model, database_seqs)
query = "MPRTKEINDAGLIVHSP"
similar = search_similar_sequences(model, query, index, database_seqs, top_k=5)
for seq, score in similar:
    print(f"Score: {score:.4f} - {seq[:30]}...")
```
### 5. Feature Extraction for Downstream Models
Use ESM C embeddings as input to custom neural networks:
```python
import torch
import torch.nn as nn
from esm.sdk.api import ESMProtein, LogitsConfig

class ProteinPropertyPredictor(nn.Module):
    """Example: Predict protein properties from ESM C embeddings."""
    def __init__(self, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embedding_dim)
        # Mean pool over the sequence
        x = embeddings.mean(dim=1)
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Use ESM C as a frozen feature extractor
esm_model = ESMC.from_pretrained("esmc-600m").to("cuda")
esm_model.eval()  # Freeze

# Create the task-specific model
predictor = ProteinPropertyPredictor(
    embedding_dim=1152,  # esmc-600m dimension
    hidden_dim=512,
    output_dim=1  # e.g., stability score
).to("cuda")

# Training loop
for sequence, target in dataloader:
    protein = ESMProtein(sequence=sequence)
    with torch.no_grad():
        embeddings = esm_model.logits(
            esm_model.encode(protein), LogitsConfig(return_embeddings=True)
        ).embeddings
    prediction = predictor(embeddings)
    loss = criterion(prediction, target)
    # ... backprop through the predictor only
```
### 6. Per-Residue Analysis
Extract per-residue representations for detailed analysis:
```python
def get_per_residue_embeddings(model, sequence):
    """Get an embedding for each residue."""
    protein = ESMProtein(sequence=sequence)
    tensor = model.encode(protein)
    output = model.logits(tensor, LogitsConfig(return_embeddings=True))
    # output.embeddings shape: (1, seq_len, hidden_dim)
    return output.embeddings.squeeze(0)  # (seq_len, hidden_dim)
# Analyze specific positions
sequence = "MPRTKEINDAGLIVHSPQWFYK"
residue_embeddings = get_per_residue_embeddings(model, sequence)
# Extract features for position 10
position_10_features = residue_embeddings[10]
print(f"Features for residue {sequence[10]} at position 10:")
print(f"Shape: {position_10_features.shape}")
# Compare residue representations
pos_5 = residue_embeddings[5]
pos_15 = residue_embeddings[15]
similarity = F.cosine_similarity(pos_5, pos_15, dim=0)
print(f"Residue similarity: {similarity.item():.4f}")
```
## Performance Optimization
### Memory Management
```python
import torch
# Use half precision for memory efficiency
model = ESMC.from_pretrained("esmc-600m").to("cuda").half()
# Process with mixed precision
with torch.autocast("cuda"):
    output = model.logits(model.encode(protein), LogitsConfig(return_embeddings=True))
    embeddings = output.embeddings
# Clear cache between batches
torch.cuda.empty_cache()
```
### Batch Processing Best Practices
```python
def efficient_batch_processing(model, sequences, batch_size=32):
    """Process sequences in fixed-size batches."""
    results = []
    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]
        # Process batch
        for seq in batch:
            protein = ESMProtein(sequence=seq)
            output = model.logits(model.encode(protein), LogitsConfig(return_embeddings=True))
            results.append(output.embeddings)
        # Periodically clear the cache
        if i % (batch_size * 10) == 0:
            torch.cuda.empty_cache()
    return results
```
### Caching Embeddings
```python
import pickle
import hashlib
def get_cache_key(sequence):
    """Generate a cache key for a sequence."""
    return hashlib.md5(sequence.encode()).hexdigest()

class EmbeddingCache:
    """Cache for protein embeddings."""
    def __init__(self, cache_file="embeddings_cache.pkl"):
        self.cache_file = cache_file
        try:
            with open(cache_file, 'rb') as f:
                self.cache = pickle.load(f)
        except FileNotFoundError:
            self.cache = {}

    def get(self, sequence):
        return self.cache.get(get_cache_key(sequence))

    def set(self, sequence, embedding):
        self.cache[get_cache_key(sequence)] = embedding

    def save(self):
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.cache, f)

# Usage
cache = EmbeddingCache()

def get_embedding_cached(model, sequence):
    cached = cache.get(sequence)
    if cached is not None:
        return cached
    # Compute
    protein = ESMProtein(sequence=sequence)
    embedding = model.logits(model.encode(protein), LogitsConfig(return_embeddings=True)).embeddings
    cache.set(sequence, embedding)
    return embedding

# Don't forget to save the cache
cache.save()
```
## Comparison with ESM2
**Performance Improvements:**
| Metric | ESM2-650M | ESM C-600M | Improvement |
|--------|-----------|------------|-------------|
| Inference Speed | 1.0x | 3.0x | 3x faster |
| Perplexity | Higher | Lower | Better |
| Memory Usage | 1.0x | 0.8x | 20% less |
| Embedding Quality | Baseline | Improved | +5-10% |
**Migration from ESM2:**
ESM C is designed as a drop-in replacement:
```python
# Old ESM2 code
from esm import pretrained
model, alphabet = pretrained.esm2_t33_650M_UR50D()
# New ESM C code (similar API)
from esm.models.esmc import ESMC
model = ESMC.from_pretrained("esmc-600m")
```
Key differences:
- Faster inference with same or better quality
- Simplified API through ESMProtein
- Better support for long sequences
- More efficient memory usage
## Advanced Topics
### Fine-tuning ESM C
ESM C can be fine-tuned for specific tasks:
```python
import torch.optim as optim
# Load model
model = ESMC.from_pretrained("esmc-300m").to("cuda")
# Unfreeze for fine-tuning
for param in model.parameters():
    param.requires_grad = True

# Define optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-5)

# Training loop
for epoch in range(num_epochs):
    for sequences, labels in dataloader:
        optimizer.zero_grad()
        # Forward pass
        proteins = [ESMProtein(sequence=seq) for seq in sequences]
        embeddings = [model.forward(model.encode(p)) for p in proteins]
        # Your task-specific loss
        loss = compute_loss(embeddings, labels)
        loss.backward()
        optimizer.step()
```
### Attention Visualization
Extract attention weights for interpretability:
```python
def get_attention_weights(model, sequence):
    """Extract attention weights (assumes the loaded model version exposes them)."""
    protein = ESMProtein(sequence=sequence)
    tensor = model.encode(protein)
    # Forward with attention outputs; support for this flag varies by release
    output = model.forward(tensor, output_attentions=True)
    return output.attentions  # List of attention tensors per layer
# Visualize attention
attentions = get_attention_weights(model, "MPRTKEINDAGLIVHSP")
# Process and visualize attention patterns
```
## Citation
If using ESM C in research, cite:
```
ESM Cambrian: https://www.evolutionaryscale.ai/blog/esm-cambrian
EvolutionaryScale (2024)
```
## Additional Resources
- ESM C blog post: https://www.evolutionaryscale.ai/blog/esm-cambrian
- Model weights: HuggingFace EvolutionaryScale organization
- Comparison benchmarks: See blog post for detailed performance comparisons

skills/esm/references/esm3-api.md Normal file

@@ -0,0 +1,452 @@
# ESM3 API Reference
## Overview
ESM3 is a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. It uses iterative masked language modeling to simultaneously generate across these three modalities.
## Model Architecture
**ESM3 Family Models:**
| Model ID | Parameters | Availability | Best For |
|----------|-----------|--------------|----------|
| `esm3-sm-open-v1` | 1.4B | Open weights (local) | Development, testing, learning |
| `esm3-medium-2024-08` | 7B | Forge API only | Production, balanced quality/speed |
| `esm3-large-2024-03` | 98B | Forge API only | Maximum quality, research |
| `esm3-medium-multimer-2024-09` | 7B | Forge API only | Protein complexes (experimental) |
**Key Features:**
- Simultaneous reasoning across sequence, structure, and function
- Iterative generation with controllable number of steps
- Support for partial prompting across modalities
- Chain-of-thought generation for complex designs
- Temperature control for generation diversity
## Core API Components
### ESMProtein Class
The central data structure representing a protein with optional sequence, structure, and function information.
**Constructor:**
```python
from esm.sdk.api import ESMProtein
protein = ESMProtein(
sequence="MPRTKEINDAGLIVHSP", # Amino acid sequence (optional)
coordinates=coordinates_array, # 3D structure (optional)
function_annotations=[...], # Function labels (optional)
secondary_structure="HHHEEEECCC", # SS annotations (optional)
sasa=sasa_array # Solvent accessibility (optional)
)
```
**Key Methods:**
```python
# Load from PDB file
protein = ESMProtein.from_pdb("protein.pdb")
# Export to PDB format
pdb_string = protein.to_pdb()
# Save to file
with open("output.pdb", "w") as f:
f.write(protein.to_pdb())
```
**Masking Conventions:**
Use `_` (underscore) to represent masked positions for generation:
```python
# Mask positions 5-10 for generation
protein = ESMProtein(sequence="MPRT______AGLIVHSP")
# Fully masked sequence (generate from scratch)
protein = ESMProtein(sequence="_" * 200)
# Partial structure (some coordinates None)
protein = ESMProtein(
sequence="MPRTKEIND",
coordinates=partial_coords # Some positions can be None
)
```
### GenerationConfig Class
Controls generation behavior and parameters.
**Basic Configuration:**
```python
from esm.sdk.api import GenerationConfig
config = GenerationConfig(
track="sequence", # Track to generate: "sequence", "structure", or "function"
num_steps=8, # Number of demasking steps
temperature=0.7, # Sampling temperature (0.0-1.0)
top_p=None, # Nucleus sampling threshold
condition_on_coordinates_only=False # For structure conditioning
)
```
**Parameter Details:**
- **track**: Which modality to generate
  - `"sequence"`: Generate amino acid sequence
  - `"structure"`: Generate 3D coordinates
  - `"function"`: Generate function annotations
- **num_steps**: Number of iterative demasking steps
  - Higher = slower but potentially better quality
  - Typical range: 8-100 depending on sequence length
  - For full sequence generation: approximately sequence_length / 2
- **temperature**: Controls randomness
  - 0.0: Fully deterministic (greedy decoding)
  - 0.5-0.7: Balanced exploration
  - 1.0: Maximum diversity
  - Higher values increase novelty but may reduce quality (see the sketch after this list)
- **top_p**: Nucleus sampling parameter
  - Limits sampling to the top probability mass
  - Values: 0.0-1.0 (e.g., 0.9 = sample from the top 90% probability mass)
  - Use for controlled diversity without extreme sampling
- **condition_on_coordinates_only**: Structure conditioning mode
  - `True`: Condition only on backbone coordinates (ignore sequence)
  - Useful for inverse folding tasks
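The effect of `temperature` is easiest to see by sampling the same prompt at several values. A minimal sketch; it assumes a loaded `model` as in the next section, and the prompt and step count are arbitrary:

```python
prompt = ESMProtein(sequence="MPRT" + "_" * 20 + "KEND")

# Sample the same masked prompt at increasing temperatures
for temp in (0.0, 0.5, 1.0):
    config = GenerationConfig(track="sequence", num_steps=10, temperature=temp)
    result = model.generate(prompt, config)
    print(f"temperature={temp}: {result.sequence}")
```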
### ESM3InferenceClient Interface
The unified interface for both local and remote inference.
**Local Model Loading:**
```python
from esm.models.esm3 import ESM3
# Load with automatic device placement
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Or explicitly specify device
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cpu")
```
**Generation Method:**
```python
# Basic generation
protein_output = model.generate(protein_input, config)
# With explicit track specification
protein_output = model.generate(
protein_input,
GenerationConfig(track="sequence", num_steps=16, temperature=0.6)
)
```
**Forward Pass (Advanced):**
```python
from esm.sdk.api import SamplingConfig

# Encode, run a forward-and-sample pass, and decode back to an ESMProtein
protein_tensor = model.encode(protein)
output = model.forward_and_sample(protein_tensor, SamplingConfig())
protein_out = model.decode(output.protein_tensor)
```
## Common Usage Patterns
### 1. Sequence Completion
Fill in masked regions of a protein sequence:
```python
# Define partial sequence
protein = ESMProtein(sequence="MPRTK____LIVHSP____END")
# Generate missing positions
config = GenerationConfig(track="sequence", num_steps=12, temperature=0.5)
completed = model.generate(protein, config)
print(f"Original: {protein.sequence}")
print(f"Completed: {completed.sequence}")
```
### 2. Structure Prediction
Predict 3D structure from sequence:
```python
# Input: sequence only
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")
# Generate structure
config = GenerationConfig(track="structure", num_steps=len(protein.sequence))
protein_with_structure = model.generate(protein, config)
# Save as PDB
with open("predicted_structure.pdb", "w") as f:
f.write(protein_with_structure.to_pdb())
```
### 3. Inverse Folding
Design sequence for a target structure:
```python
# Load target structure
target = ESMProtein.from_pdb("target.pdb")
# Remove sequence, keep structure
target.sequence = None
# Generate sequence that folds to this structure
config = GenerationConfig(
track="sequence",
num_steps=50,
temperature=0.7,
condition_on_coordinates_only=True
)
designed = model.generate(target, config)
print(f"Designed sequence: {designed.sequence}")
```
### 4. Function-Conditioned Generation
Generate protein with specific function:
```python
from esm.sdk.api import FunctionAnnotation
# Specify desired function
protein = ESMProtein(
sequence="_" * 150,
function_annotations=[
FunctionAnnotation(
label="enzymatic_activity",
start=30,
end=90
)
]
)
# Generate sequence with this function
config = GenerationConfig(track="sequence", num_steps=75, temperature=0.6)
functional_protein = model.generate(protein, config)
```
### 5. Multi-Track Generation (Chain-of-Thought)
Iteratively generate across multiple tracks:
```python
# Start with partial sequence
protein = ESMProtein(sequence="MPRT" + "_" * 100)
# Step 1: Complete sequence
protein = model.generate(
protein,
GenerationConfig(track="sequence", num_steps=50, temperature=0.6)
)
# Step 2: Predict structure for completed sequence
protein = model.generate(
protein,
GenerationConfig(track="structure", num_steps=50)
)
# Step 3: Predict function
protein = model.generate(
protein,
GenerationConfig(track="function", num_steps=20)
)
print(f"Final sequence: {protein.sequence}")
print(f"Functions: {protein.function_annotations}")
```
### 6. Variant Generation
Generate multiple variants of a protein:
```python
import numpy as np
base_sequence = "MPRTKEINDAGLIVHSPQWFYK"
variants = []
for i in range(10):
    # Mask five random positions
    seq_list = list(base_sequence)
    mask_indices = np.random.choice(len(seq_list), size=5, replace=False)
    for idx in mask_indices:
        seq_list[idx] = '_'
    protein = ESMProtein(sequence=''.join(seq_list))
    # Generate variant
    variant = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=8, temperature=0.8)
    )
    variants.append(variant.sequence)

print(f"Generated {len(variants)} variants")
```
## Advanced Topics
### Temperature Scheduling
Vary temperature during generation for better control:
```python
def generate_with_temperature_schedule(model, protein, temperatures):
    """Generate with decreasing temperature for annealing."""
    current = protein
    steps_per_temp = 10
    for temp in temperatures:
        config = GenerationConfig(
            track="sequence",
            num_steps=steps_per_temp,
            temperature=temp
        )
        current = model.generate(current, config)
    return current
# Example: Start diverse, end deterministic
result = generate_with_temperature_schedule(
model,
protein,
temperatures=[1.0, 0.8, 0.6, 0.4, 0.2]
)
```
### Constrained Generation
Preserve specific regions during generation:
```python
# Keep active site residues fixed
def mask_except_active_site(sequence, active_site_positions):
    """Mask everything except the specified positions."""
    seq_list = ['_'] * len(sequence)
    for pos in active_site_positions:
        seq_list[pos] = sequence[pos]
    return ''.join(seq_list)
# Define active site
active_site = [23, 24, 25, 45, 46, 89]
constrained_seq = mask_except_active_site(original_sequence, active_site)
protein = ESMProtein(sequence=constrained_seq)
result = model.generate(protein, GenerationConfig(track="sequence", num_steps=50))
```
### Secondary Structure Conditioning
Use secondary structure information in generation:
```python
# Define secondary structure (H=helix, E=sheet, C=coil)
protein = ESMProtein(
sequence="_" * 80,
secondary_structure="CCHHHHHHHEEEEECCCHHHHHHCC" + "C" * 55
)
# Generate sequence with this structure
result = model.generate(
protein,
GenerationConfig(track="sequence", num_steps=40, temperature=0.6)
)
```
## Performance Optimization
### Memory Management
For large proteins or batch processing:
```python
import torch
# Clear CUDA cache between generations
torch.cuda.empty_cache()
# Use half precision for memory efficiency
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda").half()
# Process in chunks for very long sequences
def chunk_generate(model, long_sequence, chunk_size=500):
    chunks = [long_sequence[i:i+chunk_size]
              for i in range(0, len(long_sequence), chunk_size)]
    results = []
    for chunk in chunks:
        protein = ESMProtein(sequence=chunk)
        result = model.generate(protein, GenerationConfig(track="sequence"))
        results.append(result.sequence)
    return ''.join(results)
```
### Batch Processing Tips
When processing multiple proteins:
1. Sort by sequence length for efficient batching (see the sketch after this list)
2. Use padding for similar-length sequences
3. Process on GPU when available
4. Implement checkpointing for long-running jobs
5. Use Forge API for large-scale processing (see `forge-api.md`)
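A minimal sketch of tip 1, grouping proteins into similar-length buckets before generation; the bucket width is an arbitrary choice:

```python
def bucket_by_length(proteins, bucket_width=25):
    """Group proteins into buckets of similar sequence length."""
    buckets = {}
    for p in sorted(proteins, key=lambda p: len(p.sequence)):
        key = len(p.sequence) // bucket_width
        buckets.setdefault(key, []).append(p)
    return list(buckets.values())

# Each bucket can then be processed together with a comparable num_steps
```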
## Error Handling
```python
import torch

try:
    protein = model.generate(protein_input, config)
except ValueError as e:
    print(f"Invalid input: {e}")
    # Handle invalid sequence or structure
except torch.cuda.OutOfMemoryError:
    # Must be caught before RuntimeError, which it subclasses
    print("GPU out of memory - try a smaller model or CPU")
except RuntimeError as e:
    print(f"Generation failed: {e}")
    # Handle model errors
```
## Model-Specific Considerations
**esm3-sm-open-v1:**
- Suitable for development and testing
- Lower quality than larger models
- Fast inference on consumer GPUs
- Open weights allow fine-tuning
**esm3-medium-2024-08:**
- Production quality
- Good balance of speed and accuracy
- Requires Forge API access
- Recommended for most applications
**esm3-large-2024-03:**
- State-of-the-art quality
- Slowest inference
- Use for critical applications
- Best for novel protein design
## Citation
If using ESM3 in research, cite:
```
Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model.
Science. DOI: 10.1126/science.ads0018
```

skills/esm/references/forge-api.md Normal file

@@ -0,0 +1,657 @@
# Forge API Reference
## Overview
Forge is EvolutionaryScale's cloud platform for scalable protein design and inference. It provides API access to the full ESM3 model family, including large models not available for local execution.
**Key Benefits:**
- Access to all ESM3 models including 98B parameter version
- No local GPU requirements
- Scalable batch processing
- Automatic updates to latest models
- Production-ready infrastructure
- Async/concurrent request support
## Getting Started
### 1. Obtain API Token
Sign up and get your API token at: https://forge.evolutionaryscale.ai
### 2. Install ESM SDK
```bash
pip install esm
```
The Forge client is included in the standard ESM package.
### 3. Basic Connection
```python
from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig
# Initialize client
client = ESM3ForgeInferenceClient(
model="esm3-medium-2024-08",
url="https://forge.evolutionaryscale.ai",
token="<your-token-here>"
)
# Test connection
protein = ESMProtein(sequence="MPRT___KEND")
result = client.generate(protein, GenerationConfig(track="sequence", num_steps=8))
print(result.sequence)
```
## Available Models
| Model ID | Parameters | Speed | Quality | Use Case |
|----------|-----------|-------|---------|----------|
| `esm3-small-2024-08` | 1.4B | Fastest | Good | Rapid prototyping, testing |
| `esm3-medium-2024-08` | 7B | Fast | Excellent | Production, most applications |
| `esm3-large-2024-03` | 98B | Slower | Best | Research, critical designs |
| `esm3-medium-multimer-2024-09` | 7B | Fast | Experimental | Protein complexes |
**Model Selection Guidelines:**
- **Development/Testing**: Use `esm3-small-2024-08` for quick iteration
- **Production**: Use `esm3-medium-2024-08` for best balance
- **Research/Critical**: Use `esm3-large-2024-03` for highest quality
- **Complexes**: Use `esm3-medium-multimer-2024-09` (experimental)
## ESM3ForgeInferenceClient API
### Initialization
```python
from esm.sdk.forge import ESM3ForgeInferenceClient
# Basic initialization
client = ESM3ForgeInferenceClient(
model="esm3-medium-2024-08",
token="<your-token>"
)
# With custom URL (for enterprise deployments)
client = ESM3ForgeInferenceClient(
model="esm3-medium-2024-08",
url="https://custom.forge.instance.com",
token="<your-token>"
)
# With timeout configuration
client = ESM3ForgeInferenceClient(
model="esm3-medium-2024-08",
token="<your-token>",
timeout=300 # 5 minutes
)
```
### Synchronous Generation
Standard blocking generation calls:
```python
from esm.sdk.api import ESMProtein, GenerationConfig
# Basic generation
protein = ESMProtein(sequence="MPRT___KEND")
config = GenerationConfig(track="sequence", num_steps=8)
result = client.generate(protein, config)
print(f"Generated: {result.sequence}")
```
### Asynchronous Generation
For concurrent processing of multiple proteins:
```python
import asyncio
from esm.sdk.api import ESMProtein, GenerationConfig
async def generate_many(client, proteins):
    """Generate multiple proteins concurrently."""
    tasks = [
        client.async_generate(protein, GenerationConfig(track="sequence", num_steps=8))
        for protein in proteins
    ]
    return await asyncio.gather(*tasks)
# Usage
proteins = [
ESMProtein(sequence=f"MPRT{'_' * 10}KEND"),
ESMProtein(sequence=f"AGLV{'_' * 10}HSPQ"),
ESMProtein(sequence=f"KEIT{'_' * 10}NDFL")
]
results = asyncio.run(generate_many(client, proteins))
print(f"Generated {len(results)} proteins")
```
### Batch Processing with BatchExecutor
For large-scale processing with automatic concurrency management:
```python
from esm.sdk.forge import BatchExecutor
from esm.sdk.api import ESMProtein, GenerationConfig
# Create batch executor
executor = BatchExecutor(
client=client,
max_concurrent=10 # Process 10 requests concurrently
)
# Prepare batch of proteins
proteins = [ESMProtein(sequence=f"MPRT{'_' * 50}KEND") for _ in range(100)]
config = GenerationConfig(track="sequence", num_steps=25)
# Submit batch
batch_results = executor.submit_batch(
proteins=proteins,
config=config,
progress_callback=lambda i, total: print(f"Processed {i}/{total}")
)
print(f"Completed {len(batch_results)} generations")
```
## Rate Limiting and Quotas
### Understanding Limits
Forge implements rate limiting based on:
- Requests per minute (RPM)
- Tokens per minute (TPM)
- Concurrent requests
**Typical Limits (subject to change):**
- Free tier: 60 RPM, 5 concurrent
- Pro tier: 300 RPM, 20 concurrent
- Enterprise: Custom limits
### Handling Rate Limits
```python
import time
from requests.exceptions import HTTPError
def generate_with_retry(client, protein, config, max_retries=3):
    """Generate with automatic retry on rate limit."""
    for attempt in range(max_retries):
        try:
            return client.generate(protein, config)
        except HTTPError as e:
            if e.response.status_code == 429:  # Rate limit
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
    raise Exception("Max retries exceeded")
# Usage
result = generate_with_retry(client, protein, config)
```
### Implementing Custom Rate Limiter
```python
import time
from collections import deque
class RateLimiter:
    """Simple rate limiter for API calls."""
    def __init__(self, max_per_minute=60):
        self.max_per_minute = max_per_minute
        self.calls = deque()

    def wait_if_needed(self):
        """Wait if the rate limit would be exceeded."""
        now = time.time()
        # Remove calls older than one minute
        while self.calls and self.calls[0] < now - 60:
            self.calls.popleft()
        # Wait if at the limit
        if len(self.calls) >= self.max_per_minute:
            sleep_time = 60 - (now - self.calls[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
            self.calls.popleft()
        self.calls.append(now)

# Usage
limiter = RateLimiter(max_per_minute=60)
for protein in proteins:
    limiter.wait_if_needed()
    result = client.generate(protein, config)
```
## Advanced Patterns
### Streaming Results
Process results as they complete:
```python
import asyncio
async def stream_generate(client, proteins, config):
    """Yield results as they complete."""
    task_to_idx = {
        asyncio.create_task(client.async_generate(p, config)): i
        for i, p in enumerate(proteins)
    }
    pending = set(task_to_idx)
    while pending:
        done, pending = await asyncio.wait(
            pending,
            return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            idx = task_to_idx[task]
            result = await task
            yield idx, result

# Usage
async def process_stream():
    async for idx, result in stream_generate(client, proteins, config):
        print(f"Completed protein {idx}: {result.sequence[:20]}...")

asyncio.run(process_stream())
```
### Batch with Progress Tracking
```python
from tqdm import tqdm
import asyncio
async def batch_with_progress(client, proteins, config):
    """Process a batch with a progress bar."""
    results = []
    with tqdm(total=len(proteins)) as pbar:
        for protein in proteins:
            result = await client.async_generate(protein, config)
            results.append(result)
            pbar.update(1)
    return results
# Usage
results = asyncio.run(batch_with_progress(client, proteins, config))
```
### Checkpoint and Resume
For long-running batch jobs:
```python
import pickle
import os
class CheckpointedBatchProcessor:
    """Batch processor with checkpoint/resume capability."""
    def __init__(self, client, checkpoint_file="checkpoint.pkl"):
        self.client = client
        self.checkpoint_file = checkpoint_file
        self.completed = self.load_checkpoint()

    def load_checkpoint(self):
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file, 'rb') as f:
                return pickle.load(f)
        return {}

    def save_checkpoint(self):
        with open(self.checkpoint_file, 'wb') as f:
            pickle.dump(self.completed, f)

    def process_batch(self, proteins, config):
        """Process a batch with checkpointing."""
        results = {}
        for i, protein in enumerate(proteins):
            # Skip if already completed
            if i in self.completed:
                results[i] = self.completed[i]
                continue
            try:
                result = self.client.generate(protein, config)
                results[i] = result
                self.completed[i] = result
                # Save a checkpoint every 10 items
                if i % 10 == 0:
                    self.save_checkpoint()
            except Exception as e:
                print(f"Error processing {i}: {e}")
                self.save_checkpoint()
                raise
        self.save_checkpoint()
        return results
# Usage
processor = CheckpointedBatchProcessor(client)
results = processor.process_batch(proteins, config)
```
## Error Handling
### Common Errors and Solutions
```python
from requests.exceptions import HTTPError, ConnectionError, Timeout
import time

def robust_generate(client, protein, config):
    """Generate with comprehensive error handling."""
    try:
        return client.generate(protein, config)
    except HTTPError as e:
        if e.response.status_code == 401:
            raise ValueError("Invalid API token")
        elif e.response.status_code == 429:
            raise ValueError("Rate limit exceeded - slow down requests")
        elif e.response.status_code == 500:
            raise ValueError("Server error - try again later")
        else:
            raise
    except ConnectionError:
        raise ValueError("Network error - check internet connection")
    except Timeout:
        raise ValueError("Request timeout - try a smaller protein or increase the timeout")
    except Exception as e:
        raise ValueError(f"Unexpected error: {e}")

# Usage with retry logic
def generate_with_full_retry(client, protein, config, max_retries=3):
    """Combine error handling with retry logic."""
    for attempt in range(max_retries):
        try:
            return robust_generate(client, protein, config)
        except ValueError as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
```
## Cost Optimization
### Strategies to Reduce Costs
**1. Use Appropriate Model Size:**
```python
# Use smaller model for testing
dev_client = ESM3ForgeInferenceClient(
model="esm3-small-2024-08",
token=token
)
# Use larger model only for final generation
prod_client = ESM3ForgeInferenceClient(
model="esm3-large-2024-03",
token=token
)
```
**2. Cache Results:**
```python
import hashlib
import json
import os
import pickle

class ForgeCache:
    """Cache Forge API results locally."""
    def __init__(self, cache_dir="forge_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get_cache_key(self, protein, config):
        """Generate a cache key from the inputs."""
        data = {
            'sequence': protein.sequence,
            'config': str(config)
        }
        return hashlib.md5(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def get(self, protein, config):
        """Get a cached result."""
        key = self.get_cache_key(protein, config)
        path = os.path.join(self.cache_dir, f"{key}.pkl")
        if os.path.exists(path):
            with open(path, 'rb') as f:
                return pickle.load(f)
        return None

    def set(self, protein, config, result):
        """Cache a result."""
        key = self.get_cache_key(protein, config)
        path = os.path.join(self.cache_dir, f"{key}.pkl")
        with open(path, 'wb') as f:
            pickle.dump(result, f)

# Usage
cache = ForgeCache()

def cached_generate(client, protein, config):
    """Generate with caching."""
    cached = cache.get(protein, config)
    if cached:
        return cached
    result = client.generate(protein, config)
    cache.set(protein, config, result)
    return result
```
**3. Batch Similar Requests:**
Group similar generation tasks to reduce overhead:
```python
def batch_similar_tasks(proteins, max_batch_size=50):
    """Group proteins into batches of similar length."""
    # Sort by length for efficient processing
    sorted_proteins = sorted(proteins, key=lambda p: len(p.sequence))
    batches = []
    current_batch = []
    for protein in sorted_proteins:
        current_batch.append(protein)
        if len(current_batch) >= max_batch_size:
            batches.append(current_batch)
            current_batch = []
    if current_batch:
        batches.append(current_batch)
    return batches
```
## Monitoring and Logging
### Track API Usage
```python
import time
from datetime import datetime

class ForgeMonitor:
    """Monitor Forge API usage."""
    def __init__(self):
        self.calls = []
        self.errors = []

    def log_call(self, model, protein_length, duration, success=True, error=None):
        """Log an API call."""
        entry = {
            'timestamp': datetime.now(),
            'model': model,
            'protein_length': protein_length,
            'duration': duration,
            'success': success,
            'error': str(error) if error else None
        }
        if success:
            self.calls.append(entry)
        else:
            self.errors.append(entry)

    def get_stats(self):
        """Get usage statistics."""
        total_calls = len(self.calls) + len(self.errors)
        success_rate = len(self.calls) / total_calls if total_calls > 0 else 0
        avg_duration = sum(c['duration'] for c in self.calls) / len(self.calls) if self.calls else 0
        return {
            'total_calls': total_calls,
            'successful': len(self.calls),
            'failed': len(self.errors),
            'success_rate': success_rate,
            'avg_duration': avg_duration
        }

# Usage
monitor = ForgeMonitor()

def monitored_generate(client, protein, config):
    """Generate with monitoring."""
    start = time.time()
    try:
        result = client.generate(protein, config)
        monitor.log_call(
            model=client.model,
            protein_length=len(protein.sequence),
            duration=time.time() - start,
            success=True
        )
        return result
    except Exception as e:
        monitor.log_call(
            model=client.model,
            protein_length=len(protein.sequence),
            duration=time.time() - start,
            success=False,
            error=e
        )
        raise

# Check stats
print(monitor.get_stats())
```
## AWS SageMaker Deployment
For dedicated infrastructure and enterprise use:
### Deployment Options
1. **AWS Marketplace Listing**: Deploy ESM3 via AWS SageMaker Marketplace
2. **Custom Endpoint**: Configure a dedicated inference endpoint (see the sketch below)
3. **Batch Transform**: Use SageMaker Batch Transform for large-scale processing
### Benefits
- Dedicated compute resources
- No rate limiting beyond your infrastructure
- Data stays in your AWS environment
- Integration with AWS services
- Custom instance types and scaling
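For a dedicated endpoint (option 2), requests typically go through `boto3`. A minimal sketch; the endpoint name and JSON payload schema are assumptions to adapt to your actual deployment:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name and payload schema - adjust to your deployment
response = runtime.invoke_endpoint(
    EndpointName="esm3-medium-endpoint",
    ContentType="application/json",
    Body=json.dumps({"sequence": "MPRT___KEND", "track": "sequence", "num_steps": 8}),
)
result = json.loads(response["Body"].read())
print(result)
```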
**More Information:**
- AWS Marketplace: https://aws.amazon.com/marketplace/seller-profile?id=seller-iw2nbscescndm
- Contact EvolutionaryScale for enterprise licensing
## Best Practices Summary
1. **Authentication**: Store tokens securely (environment variables, secrets manager; see the sketch after this list)
2. **Rate Limiting**: Implement exponential backoff and respect limits
3. **Error Handling**: Always handle network errors and retries
4. **Caching**: Cache results for repeated queries
5. **Model Selection**: Use appropriate model size for task
6. **Batch Processing**: Use async/batch processing for multiple proteins
7. **Monitoring**: Track usage and costs
8. **Checkpointing**: Save progress for long-running jobs
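For point 1, a minimal sketch of loading the token from the environment; the variable name `ESM_FORGE_TOKEN` is an arbitrary choice:

```python
import os

from esm.sdk.forge import ESM3ForgeInferenceClient

# Read the token from the environment instead of hard-coding it
token = os.environ["ESM_FORGE_TOKEN"]
client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token=token)
```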
## Troubleshooting
### Connection Issues
```python
# Test connection
try:
    client = ESM3ForgeInferenceClient(model="esm3-medium-2024-08", token=token)
    test_protein = ESMProtein(sequence="MPRTK")
    result = client.generate(test_protein, GenerationConfig(track="sequence", num_steps=1))
    print("Connection successful!")
except Exception as e:
    print(f"Connection failed: {e}")
```
### Token Validation
```python
def validate_token(token):
    """Validate an API token."""
    try:
        client = ESM3ForgeInferenceClient(
            model="esm3-small-2024-08",
            token=token
        )
        # Make a minimal test call
        test = ESMProtein(sequence="MPR")
        client.generate(test, GenerationConfig(track="sequence", num_steps=1))
        return True
    except HTTPError as e:
        if e.response.status_code == 401:
            return False
        raise
```
## Additional Resources
- **Forge Platform**: https://forge.evolutionaryscale.ai
- **API Documentation**: Check Forge dashboard for latest API specs
- **Community Support**: Slack community at https://bit.ly/3FKwcWd
- **Enterprise Contact**: Contact EvolutionaryScale for custom deployments

skills/esm/references/workflows.md Normal file

@@ -0,0 +1,685 @@
# ESM Workflows and Examples
## Overview
This document provides complete, end-to-end examples of common workflows using ESM3 and ESM C. Each workflow includes setup, execution, and analysis code.
## Workflow 1: Novel GFP Design with Chain-of-Thought
Design a novel fluorescent protein using ESM3's multimodal generation capabilities.
### Objective
Generate a green fluorescent protein (GFP) with specific properties using chain-of-thought reasoning across sequence, structure, and function.
### Complete Implementation
```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig, FunctionAnnotation
# Setup
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Step 1: Define target properties
print("Step 1: Defining target GFP properties...")
# Create protein with desired function
target_length = 238 # Typical GFP length
protein = ESMProtein(
sequence="_" * target_length,
function_annotations=[
FunctionAnnotation(
label="green_fluorescent_protein",
start=65,
end=75 # Chromophore region
)
]
)
# Step 2: Generate initial sequence with function conditioning
print("Step 2: Generating initial sequence...")
config = GenerationConfig(
track="sequence",
num_steps=target_length // 3, # Gradual generation
temperature=0.7 # Moderate diversity
)
protein = model.generate(protein, config)
print(f"Generated sequence: {protein.sequence[:50]}...")
# Step 3: Predict structure
print("Step 3: Predicting structure...")
config = GenerationConfig(
track="structure",
num_steps=target_length // 2
)
protein = model.generate(protein, config)
print(f"Structure predicted, coordinates shape: {protein.coordinates.shape}")
# Step 4: Refine sequence based on structure
print("Step 4: Refining sequence based on structure...")
# Mask regions for refinement (e.g., surface residues)
sequence_list = list(protein.sequence)
# Keep the chromophore region fixed; refine elsewhere
for i in range(0, 65):
    if i % 3 == 0:  # Refine every third position
        sequence_list[i] = '_'
for i in range(75, target_length):
    if i % 3 == 0:
        sequence_list[i] = '_'
protein.sequence = ''.join(sequence_list)
config = GenerationConfig(
track="sequence",
num_steps=50,
temperature=0.5 # Lower temperature for refinement
)
protein = model.generate(protein, config)
# Step 5: Final validation
print("Step 5: Final validation...")
# Predict final structure
config = GenerationConfig(track="structure", num_steps=30)
protein = model.generate(protein, config)
# Save results
with open("novel_gfp.pdb", "w") as f:
f.write(protein.to_pdb())
with open("novel_gfp_sequence.txt", "w") as f:
f.write(f">Novel_GFP\n{protein.sequence}\n")
print(f"\nFinal GFP sequence:\n{protein.sequence}")
print(f"\nFunction annotations: {protein.function_annotations}")
print(f"Structure saved to: novel_gfp.pdb")
```
### Validation Steps
```python
# Analyze designed GFP
def analyze_gfp(protein):
    """Analyze generated GFP properties."""
    # Check the chromophore region (should be around Ser65-Tyr66-Gly67)
    chromophore_region = protein.sequence[64:68]
    print(f"Chromophore region: {chromophore_region}")
    # GFPs fold into a beta-barrel; check secondary structure if available
    if protein.secondary_structure:
        beta_content = protein.secondary_structure.count('E') / len(protein.sequence)
        print(f"Beta sheet content: {beta_content:.2%}")
    # Checking similarity to known GFPs would require BLAST or an alignment tool in practice
    return {
        'length': len(protein.sequence),
        'chromophore': chromophore_region,
        'coordinates_available': protein.coordinates is not None
    }

analysis = analyze_gfp(protein)
print(f"\nAnalysis results: {analysis}")
```
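For the similarity check noted in the comments above, a rough alignment-free proxy is available in the standard library. A minimal sketch; a real analysis would use BLAST or a proper aligner, and `reference_gfp` must be supplied by you:

```python
from difflib import SequenceMatcher

def approximate_identity(seq_a: str, seq_b: str) -> float:
    """Crude, alignment-free similarity ratio between two sequences."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

# Usage: compare the design against a known GFP reference sequence
# identity = approximate_identity(protein.sequence, reference_gfp)
```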
## Workflow 2: Protein Variant Library Generation
Generate and analyze a library of protein variants for directed evolution.
### Objective
Create variants of a parent protein by targeted mutagenesis while maintaining structural integrity.
### Complete Implementation
```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
import numpy as np
from sklearn.cluster import KMeans
# Setup
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Parent protein
parent_sequence = "MPRTKEINDAGLIVHSPQWFYKARNDTESLGKIVHEFPM"
parent_protein = ESMProtein(sequence=parent_sequence)
# Define mutation parameters
num_variants = 50
positions_to_mutate = 5 # Number of positions per variant
# Step 1: Generate variant library
print("Generating variant library...")
variants = []
for i in range(num_variants):
    # Create a masked sequence with random positions
    seq_list = list(parent_sequence)
    # Select random positions to mutate
    mutation_positions = np.random.choice(
        len(seq_list),
        size=positions_to_mutate,
        replace=False
    )
    for pos in mutation_positions:
        seq_list[pos] = '_'
    # Generate variant
    variant_protein = ESMProtein(sequence=''.join(seq_list))
    config = GenerationConfig(
        track="sequence",
        num_steps=positions_to_mutate * 2,
        temperature=0.8  # Higher diversity
    )
    variant = model.generate(variant_protein, config)
    variants.append(variant.sequence)
    if (i + 1) % 10 == 0:
        print(f"Generated {i + 1}/{num_variants} variants")
print(f"\nGenerated {len(variants)} variants")
# Step 2: Predict structures for variants
print("\nPredicting structures...")
variant_proteins_with_structure = []
for i, seq in enumerate(variants):
    protein = ESMProtein(sequence=seq)
    config = GenerationConfig(
        track="structure",
        num_steps=len(seq) // 2
    )
    protein_with_structure = model.generate(protein, config)
    variant_proteins_with_structure.append(protein_with_structure)
    if (i + 1) % 10 == 0:
        print(f"Predicted structures for {i + 1}/{len(variants)} variants")
# Step 3: Analyze variant diversity
print("\nAnalyzing variant diversity...")
# Calculate Hamming distances from parent
def hamming_distance(seq1, seq2):
"""Calculate Hamming distance between sequences."""
return sum(c1 != c2 for c1, c2 in zip(seq1, seq2))
distances = [hamming_distance(parent_sequence, var) for var in variants]
print(f"Average mutations per variant: {np.mean(distances):.1f}")
print(f"Mutation range: {min(distances)}-{max(distances)}")
# Step 4: Get embeddings for clustering
print("\nGenerating embeddings for clustering...")
from esm.models.esmc import ESMC
embedding_model = ESMC.from_pretrained("esmc-300m").to("cuda")
def get_embedding(sequence):
"""Get mean-pooled embedding for sequence."""
protein = ESMProtein(sequence=sequence)
tensor = embedding_model.encode(protein)
emb = embedding_model.forward(tensor)
return emb.mean(dim=1).cpu().detach().numpy().flatten()
variant_embeddings = np.array([get_embedding(seq) for seq in variants])
# Step 5: Cluster variants
print("Clustering variants...")
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(variant_embeddings)
# Analyze clusters
print("\nCluster analysis:")
for i in range(n_clusters):
cluster_variants = [var for var, label in zip(variants, cluster_labels) if label == i]
cluster_distances = [hamming_distance(parent_sequence, var) for var in cluster_variants]
print(f"\nCluster {i}:")
print(f" Size: {len(cluster_variants)}")
print(f" Avg distance from parent: {np.mean(cluster_distances):.1f}")
print(f" Representative: {cluster_variants[0][:40]}...")
# Step 6: Select diverse representatives
print("\nSelecting diverse representatives...")
representatives = []
for i in range(n_clusters):
# Get centroid
cluster_indices = np.where(cluster_labels == i)[0]
cluster_embs = variant_embeddings[cluster_indices]
# Find closest to centroid
centroid = cluster_embs.mean(axis=0)
distances_to_centroid = np.linalg.norm(cluster_embs - centroid, axis=1)
rep_idx = cluster_indices[np.argmin(distances_to_centroid)]
representatives.append(variants[rep_idx])
# Save results
print("\nSaving results...")
with open("variant_library.fasta", "w") as f:
f.write(f">Parent\n{parent_sequence}\n\n")
for i, var in enumerate(variants):
f.write(f">Variant_{i+1}_Cluster_{cluster_labels[i]}\n{var}\n")
with open("representative_variants.fasta", "w") as f:
for i, rep in enumerate(representatives):
f.write(f">Representative_Cluster_{i}\n{rep}\n")
print("Variant library saved to: variant_library.fasta")
print("Representatives saved to: representative_variants.fasta")
```
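As a quick follow-up check, a sketch reusing the `hamming_distance` helper above: pairwise distances among the representatives should be comfortably large if the clusters captured distinct regions of sequence space.

```python
# Pairwise diversity among cluster representatives (all same length as parent)
for i in range(len(representatives)):
    for j in range(i + 1, len(representatives)):
        d = hamming_distance(representatives[i], representatives[j])
        print(f"Representative {i} vs {j}: {d} differing positions")
```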
## Workflow 3: Structure-Based Sequence Optimization
Optimize a protein sequence to improve stability while maintaining function.
### Objective
Given a protein structure, design sequences that maintain the fold but have improved properties.
### Complete Implementation
```python
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
import numpy as np
# Setup
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Load target structure (e.g., from PDB)
target_protein = ESMProtein.from_pdb("target_structure.pdb")
original_sequence = target_protein.sequence
print(f"Original sequence: {original_sequence}")
print(f"Structure loaded: {target_protein.coordinates.shape}")
# Step 1: Generate multiple sequence designs
print("\nGenerating optimized sequences...")
num_designs = 20
optimized_sequences = []
for i in range(num_designs):
# Start with structure, remove sequence
design_protein = ESMProtein(
        coordinates=target_protein.coordinates.clone(),  # clone(): coordinates are a torch tensor
secondary_structure=target_protein.secondary_structure
)
# Generate sequence for this structure
config = GenerationConfig(
track="sequence",
num_steps=len(original_sequence),
temperature=0.7,
condition_on_coordinates_only=True
)
designed = model.generate(design_protein, config)
optimized_sequences.append(designed.sequence)
if (i + 1) % 5 == 0:
print(f"Generated {i + 1}/{num_designs} designs")
# Step 2: Validate structural compatibility
print("\nValidating structural compatibility...")
validated_designs = []
for seq in optimized_sequences:
# Predict structure for designed sequence
test_protein = ESMProtein(sequence=seq)
config = GenerationConfig(
track="structure",
num_steps=len(seq) // 2
)
predicted = model.generate(test_protein, config)
    # Proper validation would compute RMSD after superposition (see the
    # Kabsch sketch after this block); here we only check that prediction succeeds
if predicted.coordinates is not None:
validated_designs.append(seq)
print(f"Validated {len(validated_designs)}/{num_designs} designs")
# Step 3: Analyze sequence properties
print("\nAnalyzing sequence properties...")
def calculate_properties(sequence):
"""Calculate basic sequence properties."""
# Hydrophobicity (simplified)
hydrophobic = "AILMFWYV"
hydrophobic_fraction = sum(1 for aa in sequence if aa in hydrophobic) / len(sequence)
# Charge
positive = "KR"
negative = "DE"
net_charge = sum(1 for aa in sequence if aa in positive) - sum(1 for aa in sequence if aa in negative)
# Aromatic content
aromatic = "FWY"
aromatic_fraction = sum(1 for aa in sequence if aa in aromatic) / len(sequence)
return {
'hydrophobic_fraction': hydrophobic_fraction,
'net_charge': net_charge,
'aromatic_fraction': aromatic_fraction
}
# Compare to original
original_props = calculate_properties(original_sequence)
print(f"\nOriginal properties:")
print(f" Hydrophobic: {original_props['hydrophobic_fraction']:.2%}")
print(f" Net charge: {original_props['net_charge']:+d}")
print(f" Aromatic: {original_props['aromatic_fraction']:.2%}")
# Analyze designs
design_properties = [calculate_properties(seq) for seq in validated_designs]
avg_hydrophobic = np.mean([p['hydrophobic_fraction'] for p in design_properties])
avg_charge = np.mean([p['net_charge'] for p in design_properties])
avg_aromatic = np.mean([p['aromatic_fraction'] for p in design_properties])
print(f"\nDesigned sequences (average):")
print(f" Hydrophobic: {avg_hydrophobic:.2%}")
print(f" Net charge: {avg_charge:+.1f}")
print(f" Aromatic: {avg_aromatic:.2%}")
# Step 4: Rank designs
print("\nRanking designs...")
def score_design(sequence, original_props):
"""Score design based on desired properties."""
props = calculate_properties(sequence)
# Prefer higher hydrophobic content (for stability)
hydrophobic_score = props['hydrophobic_fraction']
# Prefer similar charge to original
charge_score = 1.0 / (1.0 + abs(props['net_charge'] - original_props['net_charge']))
# Combined score
return hydrophobic_score * 0.6 + charge_score * 0.4
scores = [(seq, score_design(seq, original_props)) for seq in validated_designs]
scores.sort(key=lambda x: x[1], reverse=True)
print("\nTop 5 designs:")
for i, (seq, score) in enumerate(scores[:5]):
print(f"\n{i+1}. Score: {score:.3f}")
print(f" Sequence: {seq[:40]}...")
# Step 5: Save results
print("\nSaving results...")
with open("optimized_sequences.fasta", "w") as f:
f.write(f">Original\n{original_sequence}\n\n")
for i, (seq, score) in enumerate(scores):
props = calculate_properties(seq)
f.write(f">Design_{i+1}_Score_{score:.3f}\n")
f.write(f"# Hydrophobic: {props['hydrophobic_fraction']:.2%}, ")
f.write(f"Charge: {props['net_charge']:+d}, ")
f.write(f"Aromatic: {props['aromatic_fraction']:.2%}\n")
f.write(f"{seq}\n\n")
print("Results saved to: optimized_sequences.fasta")
```
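The structural check in Step 2 only confirms that prediction succeeds. A minimal sketch of the "proper alignment" the comment alludes to — RMSD after Kabsch superposition — assuming you have already extracted matched `(N, 3)` C-alpha coordinate arrays from the target and predicted structures (how you slice these out of `protein.coordinates` depends on the atom layout, so treat that step as an assumption):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between matched (N, 3) point sets after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                         # 3x3 covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))  # correct for reflection
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                      # optimal rotation (row-vector convention)
    P_rot = P @ R
    return float(np.sqrt(((P_rot - Q) ** 2).sum() / len(P)))
```

A design could then be accepted only if `kabsch_rmsd` against the target falls below a chosen threshold (a few angstroms is a common starting point).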
## Workflow 4: Function Prediction Pipeline
Predict protein function from sequence using ESM3 and ESM C.
### Objective
Build a pipeline that predicts protein function using both generative (ESM3) and embedding (ESM C) approaches.
### Complete Implementation
```python
from esm.models.esm3 import ESM3
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, GenerationConfig
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Setup models
esm3_model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
esmc_model = ESMC.from_pretrained("esmc-600m").to("cuda")
# Example: Predict if protein is an enzyme
# (In practice, you'd have a labeled training set)
def predict_function_generative(sequence):
"""Predict function using ESM3 generative approach."""
protein = ESMProtein(sequence=sequence)
# Generate function annotations
config = GenerationConfig(
track="function",
num_steps=20,
temperature=0.3 # Low temperature for confident predictions
)
protein_with_function = esm3_model.generate(protein, config)
return protein_with_function.function_annotations
def predict_function_embedding(sequence, function_classifier):
"""Predict function using ESM C embeddings + classifier."""
# Get embedding
protein = ESMProtein(sequence=sequence)
tensor = esmc_model.encode(protein)
embedding = esmc_model.forward(tensor)
# Mean pool
embedding_pooled = embedding.mean(dim=1).cpu().detach().numpy()
# Predict with classifier
prediction = function_classifier.predict(embedding_pooled)
probability = function_classifier.predict_proba(embedding_pooled)
return prediction[0], probability[0]
# Example workflow with toy test sequences (labels are illustrative only)
test_sequences = {
"kinase": "MPRTKEINDAGLIVHSPQWFYKARNDTESLGKIVHEF",
"protease": "AGLIVHSPQWFYKARNDTESLGKIVHEFPMCDEGH",
"transporter": "KTEFLNDGRPMLIVHSPQWFYKARNDTESLGKIVH"
}
print("Predicting functions...\n")
for name, sequence in test_sequences.items():
print(f"{name.upper()}:")
print(f"Sequence: {sequence[:30]}...")
# Method 1: Generative
functions = predict_function_generative(sequence)
print(f" Generative predictions: {functions}")
    # Method 2: embedding-based prediction needs a trained classifier;
    # a training sketch follows this code block
print()
```
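The embedding-based branch is skipped above because it needs labels. A minimal training sketch under that assumption — `labeled_sequences` and `labels` are hypothetical inputs you would supply — reusing the `RandomForestClassifier` and `cross_val_score` imports already at the top of this workflow:

```python
def train_function_classifier(labeled_sequences, labels):
    """Train a classifier on mean-pooled ESM C embeddings (sketch)."""
    X = []
    for seq in labeled_sequences:
        protein = ESMProtein(sequence=seq)
        tensor = esmc_model.encode(protein)
        embedding = esmc_model.forward(tensor)
        X.append(embedding.mean(dim=1).cpu().detach().numpy().flatten())
    X = np.array(X)
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    scores = cross_val_score(clf, X, labels, cv=5)
    print(f"Cross-validated accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
    clf.fit(X, labels)
    return clf
```

The returned classifier plugs directly into `predict_function_embedding` above.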
## Workflow 5: Embedding-Based Clustering and Analysis
Cluster and analyze a large protein dataset using ESM C embeddings.
### Complete Implementation
```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Setup
model = ESMC.from_pretrained("esmc-600m").to("cuda")
# Load protein dataset (example)
sequences = [
# In practice, load from FASTA or database
"MPRTKEINDAGLIVHSPQWFYK",
"AGLIVHSPQWFYKARNDTESL",
# ... more sequences
]
print(f"Loaded {len(sequences)} sequences")
# Step 1: Generate embeddings
print("Generating embeddings...")
embeddings = []
for i, seq in enumerate(sequences):
protein = ESMProtein(sequence=seq)
tensor = model.encode(protein)
emb = model.forward(tensor)
# Mean pooling
emb_pooled = emb.mean(dim=1).cpu().detach().numpy().flatten()
embeddings.append(emb_pooled)
if (i + 1) % 100 == 0:
print(f"Processed {i + 1}/{len(sequences)}")
embeddings = np.array(embeddings)
print(f"Embeddings shape: {embeddings.shape}")
# Step 2: Dimensionality reduction for visualization
print("\nReducing dimensionality...")
# PCA for initial reduction
pca = PCA(n_components=50)
embeddings_pca = pca.fit_transform(embeddings)
print(f"PCA explained variance: {pca.explained_variance_ratio_[:10].sum():.2%}")
# t-SNE for visualization
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings_pca)
# Step 3: Clustering
print("\nClustering...")
# DBSCAN for density-based clustering (eps is data-dependent; tune it for your embeddings)
clustering = DBSCAN(eps=0.5, min_samples=5)
cluster_labels = clustering.fit_predict(embeddings)
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = list(cluster_labels).count(-1)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
# Step 4: Visualize
print("\nGenerating visualization...")
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
embeddings_2d[:, 0],
embeddings_2d[:, 1],
c=cluster_labels,
cmap='viridis',
alpha=0.6
)
plt.colorbar(scatter)
plt.title("Protein Sequence Clustering (ESM C Embeddings)")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.savefig("protein_clusters.png", dpi=300, bbox_inches='tight')
print("Visualization saved to: protein_clusters.png")
# Step 5: Analyze clusters
print("\nCluster analysis:")
for cluster_id in range(n_clusters):
cluster_indices = np.where(cluster_labels == cluster_id)[0]
cluster_seqs = [sequences[i] for i in cluster_indices]
print(f"\nCluster {cluster_id}:")
print(f" Size: {len(cluster_seqs)}")
print(f" Avg length: {np.mean([len(s) for s in cluster_seqs]):.1f}")
print(f" Example: {cluster_seqs[0][:40]}...")
# Save cluster assignments
with open("cluster_assignments.txt", "w") as f:
for i, (seq, label) in enumerate(zip(sequences, cluster_labels)):
f.write(f"Sequence_{i}\tCluster_{label}\t{seq}\n")
print("\nCluster assignments saved to: cluster_assignments.txt")
```
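Since DBSCAN's `eps` rarely transfers between embedding models, it helps to sanity-check the clustering. A small sketch using scikit-learn's silhouette score over the non-noise points:

```python
from sklearn.metrics import silhouette_score

mask = cluster_labels != -1
# Silhouette needs at least 2 clusters and more points than clusters
if n_clusters > 1 and mask.sum() > n_clusters:
    sil = silhouette_score(embeddings[mask], cluster_labels[mask])
    print(f"Silhouette score (noise excluded): {sil:.3f}")
```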
## Additional Workflow Tips
### Memory Management for Large Datasets
```python
def process_large_dataset(sequences, batch_size=32):
"""Process large dataset with memory management."""
import gc
import torch
results = []
for i in range(0, len(sequences), batch_size):
batch = sequences[i:i + batch_size]
        # Process batch (process_sequence is user-defined; see the sketch below)
        batch_results = [process_sequence(seq) for seq in batch]
results.extend(batch_results)
# Clear memory
torch.cuda.empty_cache()
gc.collect()
        if (i // batch_size + 1) % 10 == 0:
            print(f"Processed {min(i + batch_size, len(sequences))}/{len(sequences)}")
return results
```
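`process_sequence` above is whatever per-sequence work you need. A minimal example, assuming an ESM C model named `model` is in scope, mirroring the embedding pattern used in Workflow 5:

```python
def process_sequence(seq):
    """Mean-pooled ESM C embedding for one sequence (example implementation)."""
    protein = ESMProtein(sequence=seq)
    tensor = model.encode(protein)
    emb = model.forward(tensor)
    return emb.mean(dim=1).cpu().detach().numpy().flatten()
```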
### Parallel Processing
```python
from concurrent.futures import ThreadPoolExecutor
def parallel_workflow(sequences, n_workers=4):
"""Process sequences in parallel."""
with ThreadPoolExecutor(max_workers=n_workers) as executor:
results = list(executor.map(process_sequence, sequences))
return results
```
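Note that threads help mainly when `process_sequence` is I/O-bound, for example remote calls through the Forge client. A single local GPU model serializes its forward passes, so for local inference the batching pattern above is usually the better lever.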
These workflows provide comprehensive examples for common ESM use cases. Adapt them to your specific needs and always validate results with appropriate biological experiments.