Initial commit
This commit is contained in:
452
skills/esm/references/esm3-api.md
Normal file
452
skills/esm/references/esm3-api.md
Normal file
@@ -0,0 +1,452 @@
|
||||
# ESM3 API Reference
|
||||
|
||||
## Overview
|
||||
|
||||
ESM3 is a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. It uses iterative masked language modeling to simultaneously generate across these three modalities.
|
||||
|
||||
## Model Architecture
|
||||
|
||||
**ESM3 Family Models:**
|
||||
|
||||
| Model ID | Parameters | Availability | Best For |
|
||||
|----------|-----------|--------------|----------|
|
||||
| `esm3-sm-open-v1` | 1.4B | Open weights (local) | Development, testing, learning |
|
||||
| `esm3-medium-2024-08` | 7B | Forge API only | Production, balanced quality/speed |
|
||||
| `esm3-large-2024-03` | 98B | Forge API only | Maximum quality, research |
|
||||
| `esm3-medium-multimer-2024-09` | 7B | Forge API only | Protein complexes (experimental) |
|
||||
|
||||
**Key Features:**
|
||||
- Simultaneous reasoning across sequence, structure, and function
|
||||
- Iterative generation with controllable number of steps
|
||||
- Support for partial prompting across modalities
|
||||
- Chain-of-thought generation for complex designs
|
||||
- Temperature control for generation diversity
|
||||
|
||||
## Core API Components
|
||||
|
||||
### ESMProtein Class
|
||||
|
||||
The central data structure representing a protein with optional sequence, structure, and function information.
|
||||
|
||||
**Constructor:**
|
||||
|
||||
```python
|
||||
from esm.sdk.api import ESMProtein
|
||||
|
||||
protein = ESMProtein(
|
||||
sequence="MPRTKEINDAGLIVHSP", # Amino acid sequence (optional)
|
||||
coordinates=coordinates_array, # 3D structure (optional)
|
||||
function_annotations=[...], # Function labels (optional)
|
||||
secondary_structure="HHHEEEECCC", # SS annotations (optional)
|
||||
sasa=sasa_array # Solvent accessibility (optional)
|
||||
)
|
||||
```
|
||||
|
||||
**Key Methods:**
|
||||
|
||||
```python
|
||||
# Load from PDB file
|
||||
protein = ESMProtein.from_pdb("protein.pdb")
|
||||
|
||||
# Export to PDB format
|
||||
pdb_string = protein.to_pdb()
|
||||
|
||||
# Save to file
|
||||
with open("output.pdb", "w") as f:
|
||||
f.write(protein.to_pdb())
|
||||
```
|
||||
|
||||
**Masking Conventions:**
|
||||
|
||||
Use `_` (underscore) to represent masked positions for generation:
|
||||
|
||||
```python
|
||||
# Mask positions 5-10 for generation
|
||||
protein = ESMProtein(sequence="MPRT______AGLIVHSP")
|
||||
|
||||
# Fully masked sequence (generate from scratch)
|
||||
protein = ESMProtein(sequence="_" * 200)
|
||||
|
||||
# Partial structure (some coordinates None)
|
||||
protein = ESMProtein(
|
||||
sequence="MPRTKEIND",
|
||||
coordinates=partial_coords # Some positions can be None
|
||||
)
|
||||
```
|
||||
|
||||
### GenerationConfig Class
|
||||
|
||||
Controls generation behavior and parameters.
|
||||
|
||||
**Basic Configuration:**
|
||||
|
||||
```python
|
||||
from esm.sdk.api import GenerationConfig
|
||||
|
||||
config = GenerationConfig(
|
||||
track="sequence", # Track to generate: "sequence", "structure", or "function"
|
||||
num_steps=8, # Number of demasking steps
|
||||
temperature=0.7, # Sampling temperature (0.0-1.0)
|
||||
top_p=None, # Nucleus sampling threshold
|
||||
condition_on_coordinates_only=False # For structure conditioning
|
||||
)
|
||||
```
|
||||
|
||||
**Parameter Details:**
|
||||
|
||||
- **track**: Which modality to generate
|
||||
- `"sequence"`: Generate amino acid sequence
|
||||
- `"structure"`: Generate 3D coordinates
|
||||
- `"function"`: Generate function annotations
|
||||
|
||||
- **num_steps**: Number of iterative demasking steps
|
||||
- Higher = slower but potentially better quality
|
||||
- Typical range: 8-100 depending on sequence length
|
||||
- For full sequence generation: approximately sequence_length / 2
|
||||
|
||||
- **temperature**: Controls randomness
|
||||
- 0.0: Fully deterministic (greedy decoding)
|
||||
- 0.5-0.7: Balanced exploration
|
||||
- 1.0: Maximum diversity
|
||||
- Higher values increase novelty but may reduce quality
|
||||
|
||||
- **top_p**: Nucleus sampling parameter
|
||||
- Limits sampling to top probability mass
|
||||
- Values: 0.0-1.0 (e.g., 0.9 = sample from top 90% probability mass)
|
||||
- Use for controlled diversity without extreme sampling
|
||||
|
||||
- **condition_on_coordinates_only**: Structure conditioning mode
|
||||
- `True`: Condition only on backbone coordinates (ignore sequence)
|
||||
- Useful for inverse folding tasks
|
||||
|
||||
### ESM3InferenceClient Interface
|
||||
|
||||
The unified interface for both local and remote inference.
|
||||
|
||||
**Local Model Loading:**
|
||||
|
||||
```python
|
||||
from esm.models.esm3 import ESM3
|
||||
|
||||
# Load with automatic device placement
|
||||
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
|
||||
|
||||
# Or explicitly specify device
|
||||
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cpu")
|
||||
```
|
||||
|
||||
**Generation Method:**
|
||||
|
||||
```python
|
||||
# Basic generation
|
||||
protein_output = model.generate(protein_input, config)
|
||||
|
||||
# With explicit track specification
|
||||
protein_output = model.generate(
|
||||
protein_input,
|
||||
GenerationConfig(track="sequence", num_steps=16, temperature=0.6)
|
||||
)
|
||||
```
|
||||
|
||||
**Forward Pass (Advanced):**
|
||||
|
||||
```python
|
||||
# Get raw model logits for custom sampling
|
||||
protein_tensor = model.encode(protein)
|
||||
output = model.forward(protein_tensor)
|
||||
logits = model.decode(output)
|
||||
```
|
||||
|
||||
## Common Usage Patterns
|
||||
|
||||
### 1. Sequence Completion
|
||||
|
||||
Fill in masked regions of a protein sequence:
|
||||
|
||||
```python
|
||||
# Define partial sequence
|
||||
protein = ESMProtein(sequence="MPRTK____LIVHSP____END")
|
||||
|
||||
# Generate missing positions
|
||||
config = GenerationConfig(track="sequence", num_steps=12, temperature=0.5)
|
||||
completed = model.generate(protein, config)
|
||||
|
||||
print(f"Original: {protein.sequence}")
|
||||
print(f"Completed: {completed.sequence}")
|
||||
```
|
||||
|
||||
### 2. Structure Prediction
|
||||
|
||||
Predict 3D structure from sequence:
|
||||
|
||||
```python
|
||||
# Input: sequence only
|
||||
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")
|
||||
|
||||
# Generate structure
|
||||
config = GenerationConfig(track="structure", num_steps=len(protein.sequence))
|
||||
protein_with_structure = model.generate(protein, config)
|
||||
|
||||
# Save as PDB
|
||||
with open("predicted_structure.pdb", "w") as f:
|
||||
f.write(protein_with_structure.to_pdb())
|
||||
```
|
||||
|
||||
### 3. Inverse Folding
|
||||
|
||||
Design sequence for a target structure:
|
||||
|
||||
```python
|
||||
# Load target structure
|
||||
target = ESMProtein.from_pdb("target.pdb")
|
||||
|
||||
# Remove sequence, keep structure
|
||||
target.sequence = None
|
||||
|
||||
# Generate sequence that folds to this structure
|
||||
config = GenerationConfig(
|
||||
track="sequence",
|
||||
num_steps=50,
|
||||
temperature=0.7,
|
||||
condition_on_coordinates_only=True
|
||||
)
|
||||
designed = model.generate(target, config)
|
||||
|
||||
print(f"Designed sequence: {designed.sequence}")
|
||||
```
|
||||
|
||||
### 4. Function-Conditioned Generation
|
||||
|
||||
Generate protein with specific function:
|
||||
|
||||
```python
|
||||
from esm.sdk.api import FunctionAnnotation
|
||||
|
||||
# Specify desired function
|
||||
protein = ESMProtein(
|
||||
sequence="_" * 150,
|
||||
function_annotations=[
|
||||
FunctionAnnotation(
|
||||
label="enzymatic_activity",
|
||||
start=30,
|
||||
end=90
|
||||
)
|
||||
]
|
||||
)
|
||||
|
||||
# Generate sequence with this function
|
||||
config = GenerationConfig(track="sequence", num_steps=75, temperature=0.6)
|
||||
functional_protein = model.generate(protein, config)
|
||||
```
|
||||
|
||||
### 5. Multi-Track Generation (Chain-of-Thought)
|
||||
|
||||
Iteratively generate across multiple tracks:
|
||||
|
||||
```python
|
||||
# Start with partial sequence
|
||||
protein = ESMProtein(sequence="MPRT" + "_" * 100)
|
||||
|
||||
# Step 1: Complete sequence
|
||||
protein = model.generate(
|
||||
protein,
|
||||
GenerationConfig(track="sequence", num_steps=50, temperature=0.6)
|
||||
)
|
||||
|
||||
# Step 2: Predict structure for completed sequence
|
||||
protein = model.generate(
|
||||
protein,
|
||||
GenerationConfig(track="structure", num_steps=50)
|
||||
)
|
||||
|
||||
# Step 3: Predict function
|
||||
protein = model.generate(
|
||||
protein,
|
||||
GenerationConfig(track="function", num_steps=20)
|
||||
)
|
||||
|
||||
print(f"Final sequence: {protein.sequence}")
|
||||
print(f"Functions: {protein.function_annotations}")
|
||||
```
|
||||
|
||||
### 6. Variant Generation
|
||||
|
||||
Generate multiple variants of a protein:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
base_sequence = "MPRTKEINDAGLIVHSPQWFYK"
|
||||
variants = []
|
||||
|
||||
for i in range(10):
|
||||
# Mask random positions
|
||||
seq_list = list(base_sequence)
|
||||
mask_indices = np.random.choice(len(seq_list), size=5, replace=False)
|
||||
for idx in mask_indices:
|
||||
seq_list[idx] = '_'
|
||||
|
||||
protein = ESMProtein(sequence=''.join(seq_list))
|
||||
|
||||
# Generate variant
|
||||
variant = model.generate(
|
||||
protein,
|
||||
GenerationConfig(track="sequence", num_steps=8, temperature=0.8)
|
||||
)
|
||||
variants.append(variant.sequence)
|
||||
|
||||
print(f"Generated {len(variants)} variants")
|
||||
```
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### Temperature Scheduling
|
||||
|
||||
Vary temperature during generation for better control:
|
||||
|
||||
```python
|
||||
def generate_with_temperature_schedule(model, protein, temperatures):
|
||||
"""Generate with decreasing temperature for annealing."""
|
||||
current = protein
|
||||
steps_per_temp = 10
|
||||
|
||||
for temp in temperatures:
|
||||
config = GenerationConfig(
|
||||
track="sequence",
|
||||
num_steps=steps_per_temp,
|
||||
temperature=temp
|
||||
)
|
||||
current = model.generate(current, config)
|
||||
|
||||
return current
|
||||
|
||||
# Example: Start diverse, end deterministic
|
||||
result = generate_with_temperature_schedule(
|
||||
model,
|
||||
protein,
|
||||
temperatures=[1.0, 0.8, 0.6, 0.4, 0.2]
|
||||
)
|
||||
```
|
||||
|
||||
### Constrained Generation
|
||||
|
||||
Preserve specific regions during generation:
|
||||
|
||||
```python
|
||||
# Keep active site residues fixed
|
||||
def mask_except_active_site(sequence, active_site_positions):
|
||||
"""Mask everything except specified positions."""
|
||||
seq_list = ['_'] * len(sequence)
|
||||
for pos in active_site_positions:
|
||||
seq_list[pos] = sequence[pos]
|
||||
return ''.join(seq_list)
|
||||
|
||||
# Define active site
|
||||
active_site = [23, 24, 25, 45, 46, 89]
|
||||
constrained_seq = mask_except_active_site(original_sequence, active_site)
|
||||
|
||||
protein = ESMProtein(sequence=constrained_seq)
|
||||
result = model.generate(protein, GenerationConfig(track="sequence", num_steps=50))
|
||||
```
|
||||
|
||||
### Secondary Structure Conditioning
|
||||
|
||||
Use secondary structure information in generation:
|
||||
|
||||
```python
|
||||
# Define secondary structure (H=helix, E=sheet, C=coil)
|
||||
protein = ESMProtein(
|
||||
sequence="_" * 80,
|
||||
secondary_structure="CCHHHHHHHEEEEECCCHHHHHHCC" + "C" * 55
|
||||
)
|
||||
|
||||
# Generate sequence with this structure
|
||||
result = model.generate(
|
||||
protein,
|
||||
GenerationConfig(track="sequence", num_steps=40, temperature=0.6)
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Memory Management
|
||||
|
||||
For large proteins or batch processing:
|
||||
|
||||
```python
|
||||
import torch
|
||||
|
||||
# Clear CUDA cache between generations
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
# Use half precision for memory efficiency
|
||||
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda").half()
|
||||
|
||||
# Process in chunks for very long sequences
|
||||
def chunk_generate(model, long_sequence, chunk_size=500):
|
||||
chunks = [long_sequence[i:i+chunk_size]
|
||||
for i in range(0, len(long_sequence), chunk_size)]
|
||||
results = []
|
||||
|
||||
for chunk in chunks:
|
||||
protein = ESMProtein(sequence=chunk)
|
||||
result = model.generate(protein, GenerationConfig(track="sequence"))
|
||||
results.append(result.sequence)
|
||||
|
||||
return ''.join(results)
|
||||
```
|
||||
|
||||
### Batch Processing Tips
|
||||
|
||||
When processing multiple proteins:
|
||||
|
||||
1. Sort by sequence length for efficient batching
|
||||
2. Use padding for similar-length sequences
|
||||
3. Process on GPU when available
|
||||
4. Implement checkpointing for long-running jobs
|
||||
5. Use Forge API for large-scale processing (see `forge-api.md`)
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
try:
|
||||
protein = model.generate(protein_input, config)
|
||||
except ValueError as e:
|
||||
print(f"Invalid input: {e}")
|
||||
# Handle invalid sequence or structure
|
||||
except RuntimeError as e:
|
||||
print(f"Generation failed: {e}")
|
||||
# Handle model errors
|
||||
except torch.cuda.OutOfMemoryError:
|
||||
print("GPU out of memory - try smaller model or CPU")
|
||||
# Fallback to CPU or smaller model
|
||||
```
|
||||
|
||||
## Model-Specific Considerations
|
||||
|
||||
**esm3-sm-open-v1:**
|
||||
- Suitable for development and testing
|
||||
- Lower quality than larger models
|
||||
- Fast inference on consumer GPUs
|
||||
- Open weights allow fine-tuning
|
||||
|
||||
**esm3-medium-2024-08:**
|
||||
- Production quality
|
||||
- Good balance of speed and accuracy
|
||||
- Requires Forge API access
|
||||
- Recommended for most applications
|
||||
|
||||
**esm3-large-2024-03:**
|
||||
- State-of-the-art quality
|
||||
- Slowest inference
|
||||
- Use for critical applications
|
||||
- Best for novel protein design
|
||||
|
||||
## Citation
|
||||
|
||||
If using ESM3 in research, cite:
|
||||
|
||||
```
|
||||
Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model.
|
||||
Science. DOI: 10.1126/science.ads0018
|
||||
```
|
||||
Reference in New Issue
Block a user