# ESM3 API Reference
## Overview
ESM3 is a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. It uses iterative masked language modeling to simultaneously generate across these three modalities.
## Model Architecture
**ESM3 Family Models:**
| Model ID | Parameters | Availability | Best For |
|----------|-----------|--------------|----------|
| `esm3-sm-open-v1` | 1.4B | Open weights (local) | Development, testing, learning |
| `esm3-medium-2024-08` | 7B | Forge API only | Production, balanced quality/speed |
| `esm3-large-2024-03` | 98B | Forge API only | Maximum quality, research |
| `esm3-medium-multimer-2024-09` | 7B | Forge API only | Protein complexes (experimental) |
**Key Features:**
- Simultaneous reasoning across sequence, structure, and function
- Iterative generation with controllable number of steps
- Support for partial prompting across modalities
- Chain-of-thought generation for complex designs
- Temperature control for generation diversity
## Core API Components
### ESMProtein Class
The central data structure representing a protein with optional sequence, structure, and function information.
**Constructor:**
```python
from esm.sdk.api import ESMProtein
protein = ESMProtein(
sequence="MPRTKEINDAGLIVHSP", # Amino acid sequence (optional)
coordinates=coordinates_array, # 3D structure (optional)
function_annotations=[...], # Function labels (optional)
secondary_structure="HHHEEEECCC", # SS annotations (optional)
sasa=sasa_array # Solvent accessibility (optional)
)
```
**Key Methods:**
```python
# Load from PDB file
protein = ESMProtein.from_pdb("protein.pdb")
# Export to PDB format
pdb_string = protein.to_pdb()
# Save to file
with open("output.pdb", "w") as f:
    f.write(protein.to_pdb())
```
**Masking Conventions:**
Use `_` (underscore) to represent masked positions for generation:
```python
# Mask positions 5-10 for generation
protein = ESMProtein(sequence="MPRT______AGLIVHSP")
# Fully masked sequence (generate from scratch)
protein = ESMProtein(sequence="_" * 200)
# Partial structure (some coordinates None)
protein = ESMProtein(
sequence="MPRTKEIND",
coordinates=partial_coords # Some positions can be None
)
```
### GenerationConfig Class
Controls generation behavior and parameters.
**Basic Configuration:**
```python
from esm.sdk.api import GenerationConfig
config = GenerationConfig(
track="sequence", # Track to generate: "sequence", "structure", or "function"
num_steps=8, # Number of demasking steps
temperature=0.7, # Sampling temperature (0.0-1.0)
top_p=None, # Nucleus sampling threshold
condition_on_coordinates_only=False # For structure conditioning
)
```
**Parameter Details** (a combined example follows this list):
- **track**: Which modality to generate
- `"sequence"`: Generate amino acid sequence
- `"structure"`: Generate 3D coordinates
- `"function"`: Generate function annotations
- **num_steps**: Number of iterative demasking steps
- Higher = slower but potentially better quality
- Typical range: 8-100 depending on sequence length
- For full sequence generation: approximately sequence_length / 2
- **temperature**: Controls randomness
- 0.0: Fully deterministic (greedy decoding)
- 0.5-0.7: Balanced exploration
- 1.0: Maximum diversity
- Higher values increase novelty but may reduce quality
- **top_p**: Nucleus sampling parameter
- Limits sampling to top probability mass
- Values: 0.0-1.0 (e.g., 0.9 = sample from top 90% probability mass)
- Use for controlled diversity without extreme sampling
- **condition_on_coordinates_only**: Structure conditioning mode
- `True`: Condition only on backbone coordinates (ignore sequence)
- Useful for inverse folding tasks
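Putting these parameters together, here is a minimal sketch of a de novo design configuration; the 120-residue length and the specific values are illustrative, following the heuristics above, and `model` is an `ESM3InferenceClient` as described in the next section:
```python
from esm.sdk.api import ESMProtein, GenerationConfig

# Illustrative settings: num_steps ~ length / 2, moderate temperature,
# nucleus sampling at 0.9 for controlled diversity.
length = 120
protein = ESMProtein(sequence="_" * length)
config = GenerationConfig(
    track="sequence",
    num_steps=length // 2,
    temperature=0.7,
    top_p=0.9,
)
designed = model.generate(protein, config)
```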
### ESM3InferenceClient Interface
The unified interface for both local and remote inference.
**Local Model Loading:**
```python
from esm.models.esm3 import ESM3
# Load the open-weights model and move it to the GPU
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")
# Or run on CPU
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cpu")
```
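**Remote Model Loading (Forge):**
Remote inference uses the same client interface. A hedged sketch follows; the `esm.sdk.client` helper, Forge URL, and token environment variable shown here are assumptions, so see `forge-api.md` for the authoritative setup:
```python
import os

from esm.sdk import client

# Assumed Forge setup: helper name, URL, and ESM_FORGE_TOKEN are illustrative
forge_model = client(
    model="esm3-medium-2024-08",
    url="https://forge.evolutionaryscale.ai",
    token=os.environ["ESM_FORGE_TOKEN"],
)
# forge_model exposes the same generate()/encode()/decode() interface as the local model
```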
**Generation Method:**
```python
# Basic generation
protein_output = model.generate(protein_input, config)
# With explicit track specification
protein_output = model.generate(
    protein_input,
    GenerationConfig(track="sequence", num_steps=16, temperature=0.6)
)
```
**Forward Pass (Advanced):**
```python
from esm.sdk.api import LogitsConfig

# Encode the protein to tensors, then request raw per-track logits for custom sampling
protein_tensor = model.encode(protein)
logits_output = model.logits(protein_tensor, LogitsConfig(sequence=True))
# decode() converts a (possibly modified) protein tensor back into an ESMProtein
protein = model.decode(protein_tensor)
```
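To sample tokens yourself from those logits, a hedged sketch; the `logits_output.logits.sequence` layout is an assumption and may differ across SDK versions:
```python
import torch

# Temperature-scaled sampling over the sequence-track logits (assumed layout:
# vocabulary on the last dimension, one row per position after reshaping).
seq_logits = logits_output.logits.sequence
probs = torch.softmax(seq_logits / 0.7, dim=-1)
flat = probs.reshape(-1, probs.shape[-1])
token_ids = torch.multinomial(flat, num_samples=1).squeeze(-1)
# Mapping token_ids back to residues requires the model's sequence tokenizer.
```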
## Common Usage Patterns
### 1. Sequence Completion
Fill in masked regions of a protein sequence:
```python
# Define partial sequence
protein = ESMProtein(sequence="MPRTK____LIVHSP____END")
# Generate missing positions
config = GenerationConfig(track="sequence", num_steps=12, temperature=0.5)
completed = model.generate(protein, config)
print(f"Original: {protein.sequence}")
print(f"Completed: {completed.sequence}")
```
### 2. Structure Prediction
Predict 3D structure from sequence:
```python
# Input: sequence only
protein = ESMProtein(sequence="MPRTKEINDAGLIVHSPQWFYK")
# Generate structure
config = GenerationConfig(track="structure", num_steps=len(protein.sequence))
protein_with_structure = model.generate(protein, config)
# Save as PDB
with open("predicted_structure.pdb", "w") as f:
    f.write(protein_with_structure.to_pdb())
```
### 3. Inverse Folding
Design sequence for a target structure:
```python
# Load target structure
target = ESMProtein.from_pdb("target.pdb")
# Remove sequence, keep structure
target.sequence = None
# Generate sequence that folds to this structure
config = GenerationConfig(
track="sequence",
num_steps=50,
temperature=0.7,
condition_on_coordinates_only=True
)
designed = model.generate(target, config)
print(f"Designed sequence: {designed.sequence}")
```
### 4. Function-Conditioned Generation
Generate protein with specific function:
```python
from esm.sdk.api import FunctionAnnotation
# Specify desired function
protein = ESMProtein(
sequence="_" * 150,
function_annotations=[
FunctionAnnotation(
label="enzymatic_activity",
start=30,
end=90
)
]
)
# Generate sequence with this function
config = GenerationConfig(track="sequence", num_steps=75, temperature=0.6)
functional_protein = model.generate(protein, config)
```
### 5. Multi-Track Generation (Chain-of-Thought)
Iteratively generate across multiple tracks:
```python
# Start with partial sequence
protein = ESMProtein(sequence="MPRT" + "_" * 100)
# Step 1: Complete sequence
protein = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=50, temperature=0.6)
)
# Step 2: Predict structure for completed sequence
protein = model.generate(
    protein,
    GenerationConfig(track="structure", num_steps=50)
)
# Step 3: Predict function
protein = model.generate(
    protein,
    GenerationConfig(track="function", num_steps=20)
)
print(f"Final sequence: {protein.sequence}")
print(f"Functions: {protein.function_annotations}")
```
### 6. Variant Generation
Generate multiple variants of a protein:
```python
import numpy as np
base_sequence = "MPRTKEINDAGLIVHSPQWFYK"
variants = []
for i in range(10):
    # Mask random positions
    seq_list = list(base_sequence)
    mask_indices = np.random.choice(len(seq_list), size=5, replace=False)
    for idx in mask_indices:
        seq_list[idx] = '_'
    protein = ESMProtein(sequence=''.join(seq_list))
    # Generate variant
    variant = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=8, temperature=0.8)
    )
    variants.append(variant.sequence)
print(f"Generated {len(variants)} variants")
```
## Advanced Topics
### Temperature Scheduling
Vary temperature during generation for better control:
```python
def generate_with_temperature_schedule(model, protein, temperatures):
"""Generate with decreasing temperature for annealing."""
current = protein
steps_per_temp = 10
for temp in temperatures:
config = GenerationConfig(
track="sequence",
num_steps=steps_per_temp,
temperature=temp
)
current = model.generate(current, config)
return current
# Example: Start diverse, end deterministic
result = generate_with_temperature_schedule(
    model,
    protein,
    temperatures=[1.0, 0.8, 0.6, 0.4, 0.2]
)
```
### Constrained Generation
Preserve specific regions during generation:
```python
# Keep active site residues fixed
def mask_except_active_site(sequence, active_site_positions):
"""Mask everything except specified positions."""
seq_list = ['_'] * len(sequence)
for pos in active_site_positions:
seq_list[pos] = sequence[pos]
return ''.join(seq_list)
# Define active site
active_site = [23, 24, 25, 45, 46, 89]
constrained_seq = mask_except_active_site(original_sequence, active_site)
protein = ESMProtein(sequence=constrained_seq)
result = model.generate(protein, GenerationConfig(track="sequence", num_steps=50))
```
### Secondary Structure Conditioning
Use secondary structure information in generation:
```python
# Define secondary structure (H=helix, E=sheet, C=coil)
protein = ESMProtein(
sequence="_" * 80,
secondary_structure="CCHHHHHHHEEEEECCCHHHHHHCC" + "C" * 55
)
# Generate sequence with this structure
result = model.generate(
    protein,
    GenerationConfig(track="sequence", num_steps=40, temperature=0.6)
)
```
## Performance Optimization
### Memory Management
For large proteins or batch processing:
```python
import torch
# Clear CUDA cache between generations
torch.cuda.empty_cache()
# Use half precision for memory efficiency
model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda").half()
# Process in chunks for very long sequences
def chunk_generate(model, long_sequence, chunk_size=500):
    chunks = [long_sequence[i:i+chunk_size]
              for i in range(0, len(long_sequence), chunk_size)]
    results = []
    for chunk in chunks:
        protein = ESMProtein(sequence=chunk)
        result = model.generate(protein, GenerationConfig(track="sequence"))
        results.append(result.sequence)
    return ''.join(results)
```
### Batch Processing Tips
When processing multiple proteins (a sketch combining tips 1 and 4 follows the list):
1. Sort by sequence length for efficient batching
2. Use padding for similar-length sequences
3. Process on GPU when available
4. Implement checkpointing for long-running jobs
5. Use Forge API for large-scale processing (see `forge-api.md`)
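A minimal sketch of tips 1 and 4; the JSONL checkpoint file and the `generate_batch` helper are illustrative assumptions:
```python
import json

from esm.sdk.api import ESMProtein, GenerationConfig

def generate_batch(model, sequences, checkpoint_path="results.jsonl"):
    # `sequences` are partially masked prompts (use '_' for positions to generate)
    # Tip 1: process shortest sequences first for more predictable memory use
    for seq in sorted(sequences, key=len):
        protein = ESMProtein(sequence=seq)
        result = model.generate(
            protein,
            GenerationConfig(track="sequence", num_steps=max(1, len(seq) // 2)),
        )
        # Tip 4: append each result immediately so an interrupted job loses at most one protein
        with open(checkpoint_path, "a") as f:
            f.write(json.dumps({"input": seq, "output": result.sequence}) + "\n")
```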
## Error Handling
```python
import torch

try:
    protein = model.generate(protein_input, config)
except torch.cuda.OutOfMemoryError:
    # Catch OOM first: torch.cuda.OutOfMemoryError subclasses RuntimeError, so order matters
    print("GPU out of memory - try a smaller model or CPU")
    # Fall back to CPU or a smaller model
except ValueError as e:
    print(f"Invalid input: {e}")
    # Handle invalid sequence or structure
except RuntimeError as e:
    print(f"Generation failed: {e}")
    # Handle other model errors
```
## Model-Specific Considerations
**esm3-sm-open-v1:**
- Suitable for development and testing
- Lower quality than larger models
- Fast inference on consumer GPUs
- Open weights allow fine-tuning
**esm3-medium-2024-08:**
- Production quality
- Good balance of speed and accuracy
- Requires Forge API access
- Recommended for most applications
**esm3-large-2024-03:**
- State-of-the-art quality
- Slowest inference
- Use for critical applications
- Best for novel protein design
## Citation
If using ESM3 in research, cite:
```
Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model.
Science. DOI: 10.1126/science.ads0018
```