Initial commit
This commit is contained in:
637
skills/adaptyv/reference/protein_optimization.md
Normal file
637
skills/adaptyv/reference/protein_optimization.md
Normal file
@@ -0,0 +1,637 @@
|
||||
# Protein Sequence Optimization
|
||||
|
||||
## Overview
|
||||
|
||||
Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.
|
||||
|
||||
## Common Protein Expression Problems
|
||||
|
||||
### 1. Unpaired Cysteines
|
||||
|
||||
**Problem:**
|
||||
- Unpaired cysteines form unwanted disulfide bonds
|
||||
- Leads to aggregation and misfolding
|
||||
- Reduces expression yield and stability
|
||||
|
||||
**Solution:**
|
||||
- Remove unpaired cysteines unless functionally necessary
|
||||
- Pair cysteines appropriately for structural disulfides
|
||||
- Replace with serine or alanine in non-critical positions
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
# Check for cysteine pairs
|
||||
from Bio.Seq import Seq
|
||||
|
||||
def check_cysteines(sequence):
|
||||
cys_count = sequence.count('C')
|
||||
if cys_count % 2 != 0:
|
||||
print(f"Warning: Odd number of cysteines ({cys_count})")
|
||||
return cys_count
|
||||
```
|
||||
|
||||
### 2. Excessive Hydrophobicity
|
||||
|
||||
**Problem:**
|
||||
- Long hydrophobic patches promote aggregation
|
||||
- Exposed hydrophobic residues drive protein clumping
|
||||
- Poor solubility in aqueous buffers
|
||||
|
||||
**Solution:**
|
||||
- Maintain balanced hydropathy profiles
|
||||
- Use short, flexible linkers between domains
|
||||
- Reduce surface-exposed hydrophobic residues
|
||||
|
||||
**Metrics:**
|
||||
- Kyte-Doolittle hydropathy plots
|
||||
- GRAVY score (Grand Average of Hydropathy)
|
||||
- pSAE (percent Solvent-Accessible hydrophobic residues)
|
||||
|
||||
### 3. Low Solubility
|
||||
|
||||
**Problem:**
|
||||
- Proteins precipitate during expression or purification
|
||||
- Inclusion body formation
|
||||
- Difficult downstream processing
|
||||
|
||||
**Solution:**
|
||||
- Use solubility prediction tools for pre-screening
|
||||
- Apply sequence optimization algorithms
|
||||
- Add solubilizing tags if needed
|
||||
|
||||
## Computational Tools for Optimization
|
||||
|
||||
### NetSolP - Initial Solubility Screening
|
||||
|
||||
**Purpose:** Fast solubility prediction for filtering sequences.
|
||||
|
||||
**Method:** Machine learning model trained on E. coli expression data.
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Install: uv pip install requests
|
||||
import requests
|
||||
|
||||
def predict_solubility_netsolp(sequence):
|
||||
"""Predict protein solubility using NetSolP web service"""
|
||||
url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"
|
||||
|
||||
data = {
|
||||
"sequence": sequence,
|
||||
"format": "fasta"
|
||||
}
|
||||
|
||||
response = requests.post(url, data=data)
|
||||
return response.json()
|
||||
|
||||
# Example
|
||||
sequence = "MKVLWAALLGLLGAAA..."
|
||||
result = predict_solubility_netsolp(sequence)
|
||||
print(f"Solubility score: {result['score']}")
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- Score > 0.5: Likely soluble
|
||||
- Score < 0.5: Likely insoluble
|
||||
- Use for initial filtering before more expensive predictions
|
||||
|
||||
**When to use:**
|
||||
- First-pass filtering of large libraries
|
||||
- Quick validation of designed sequences
|
||||
- Prioritizing sequences for experimental testing
|
||||
|
||||
### SoluProt - Comprehensive Solubility Prediction
|
||||
|
||||
**Purpose:** Advanced solubility prediction with higher accuracy.
|
||||
|
||||
**Method:** Deep learning model incorporating sequence and structural features.
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Install: uv pip install soluprot
|
||||
from soluprot import predict_solubility
|
||||
|
||||
def screen_variants_soluprot(sequences):
|
||||
"""Screen multiple sequences for solubility"""
|
||||
results = []
|
||||
for name, seq in sequences.items():
|
||||
score = predict_solubility(seq)
|
||||
results.append({
|
||||
'name': name,
|
||||
'sequence': seq,
|
||||
'solubility_score': score,
|
||||
'predicted_soluble': score > 0.6
|
||||
})
|
||||
return results
|
||||
|
||||
# Example
|
||||
sequences = {
|
||||
'variant_1': 'MKVLW...',
|
||||
'variant_2': 'MATGV...'
|
||||
}
|
||||
|
||||
results = screen_variants_soluprot(sequences)
|
||||
soluble_variants = [r for r in results if r['predicted_soluble']]
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- Score > 0.6: High solubility confidence
|
||||
- Score 0.4-0.6: Uncertain, may need optimization
|
||||
- Score < 0.4: Likely problematic
|
||||
|
||||
**When to use:**
|
||||
- After initial NetSolP filtering
|
||||
- When higher prediction accuracy is needed
|
||||
- Before committing to expensive synthesis/testing
|
||||
|
||||
### SolubleMPNN - Sequence Redesign
|
||||
|
||||
**Purpose:** Redesign protein sequences to improve solubility while maintaining function.
|
||||
|
||||
**Method:** Graph neural network that suggests mutations to increase solubility.
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Install: uv pip install soluble-mpnn
|
||||
from soluble_mpnn import optimize_sequence
|
||||
|
||||
def optimize_for_solubility(sequence, structure_pdb=None):
|
||||
"""
|
||||
Redesign sequence for improved solubility
|
||||
|
||||
Args:
|
||||
sequence: Original amino acid sequence
|
||||
structure_pdb: Optional PDB file for structure-aware design
|
||||
|
||||
Returns:
|
||||
Optimized sequence variants ranked by predicted solubility
|
||||
"""
|
||||
|
||||
variants = optimize_sequence(
|
||||
sequence=sequence,
|
||||
structure=structure_pdb,
|
||||
num_variants=10,
|
||||
temperature=0.1 # Lower = more conservative mutations
|
||||
)
|
||||
|
||||
return variants
|
||||
|
||||
# Example
|
||||
original_seq = "MKVLWAALLGLLGAAA..."
|
||||
optimized_variants = optimize_for_solubility(original_seq)
|
||||
|
||||
for i, variant in enumerate(optimized_variants):
|
||||
print(f"Variant {i+1}:")
|
||||
print(f" Sequence: {variant['sequence']}")
|
||||
print(f" Solubility score: {variant['solubility_score']}")
|
||||
print(f" Mutations: {variant['mutations']}")
|
||||
```
|
||||
|
||||
**Design strategy:**
|
||||
- **Conservative** (temperature=0.1): Minimal changes, safer
|
||||
- **Moderate** (temperature=0.3): Balance between change and safety
|
||||
- **Aggressive** (temperature=0.5): More mutations, higher risk
|
||||
|
||||
**When to use:**
|
||||
- Primary tool for sequence optimization
|
||||
- Default starting point for improving problematic sequences
|
||||
- Generating diverse soluble variants
|
||||
|
||||
**Best practices:**
|
||||
- Generate 10-50 variants per sequence
|
||||
- Use structure information when available (improves accuracy)
|
||||
- Validate key functional residues are preserved
|
||||
- Test multiple temperature settings
|
||||
|
||||
### ESM (Evolutionary Scale Modeling) - Sequence Likelihood
|
||||
|
||||
**Purpose:** Assess how "natural" a protein sequence appears based on evolutionary patterns.
|
||||
|
||||
**Method:** Protein language model trained on millions of natural sequences.
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Install: uv pip install fair-esm
|
||||
import torch
|
||||
from esm import pretrained
|
||||
|
||||
def score_sequence_esm(sequence):
|
||||
"""
|
||||
Calculate ESM likelihood score for sequence
|
||||
Higher scores indicate more natural/stable sequences
|
||||
"""
|
||||
|
||||
model, alphabet = pretrained.esm2_t33_650M_UR50D()
|
||||
batch_converter = alphabet.get_batch_converter()
|
||||
|
||||
data = [("protein", sequence)]
|
||||
_, _, batch_tokens = batch_converter(data)
|
||||
|
||||
with torch.no_grad():
|
||||
results = model(batch_tokens, repr_layers=[33])
|
||||
token_logprobs = results["logits"].log_softmax(dim=-1)
|
||||
|
||||
# Calculate perplexity as sequence quality metric
|
||||
sequence_score = token_logprobs.mean().item()
|
||||
|
||||
return sequence_score
|
||||
|
||||
# Example - Compare variants
|
||||
sequences = {
|
||||
'original': 'MKVLW...',
|
||||
'optimized_1': 'MKVLS...',
|
||||
'optimized_2': 'MKVLA...'
|
||||
}
|
||||
|
||||
for name, seq in sequences.items():
|
||||
score = score_sequence_esm(seq)
|
||||
print(f"{name}: ESM score = {score:.3f}")
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- Higher scores → More "natural" sequence
|
||||
- Use to avoid unlikely mutations
|
||||
- Balance with functional requirements
|
||||
|
||||
**When to use:**
|
||||
- Filtering synthetic designs
|
||||
- Comparing SolubleMPNN variants
|
||||
- Ensuring sequences aren't too artificial
|
||||
- Avoiding expression bottlenecks
|
||||
|
||||
**Integration with design:**
|
||||
```python
|
||||
def rank_variants_by_esm(variants):
|
||||
"""Rank protein variants by ESM likelihood"""
|
||||
scored = []
|
||||
for v in variants:
|
||||
esm_score = score_sequence_esm(v['sequence'])
|
||||
v['esm_score'] = esm_score
|
||||
scored.append(v)
|
||||
|
||||
# Sort by combined solubility and ESM score
|
||||
scored.sort(
|
||||
key=lambda x: x['solubility_score'] * x['esm_score'],
|
||||
reverse=True
|
||||
)
|
||||
|
||||
return scored
|
||||
```
|
||||
|
||||
### ipTM - Interface Stability (AlphaFold-Multimer)
|
||||
|
||||
**Purpose:** Assess protein-protein interface stability and binding confidence.
|
||||
|
||||
**Method:** Interface predicted TM-score from AlphaFold-Multimer predictions.
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Requires AlphaFold-Multimer installation
|
||||
# Or use ColabFold for easier access
|
||||
|
||||
def predict_interface_stability(protein_a_seq, protein_b_seq):
|
||||
"""
|
||||
Predict interface stability using AlphaFold-Multimer
|
||||
|
||||
Returns ipTM score: higher = more stable interface
|
||||
"""
|
||||
from colabfold import run_alphafold_multimer
|
||||
|
||||
sequences = {
|
||||
'chainA': protein_a_seq,
|
||||
'chainB': protein_b_seq
|
||||
}
|
||||
|
||||
result = run_alphafold_multimer(sequences)
|
||||
|
||||
return {
|
||||
'ipTM': result['iptm'],
|
||||
'pTM': result['ptm'],
|
||||
'pLDDT': result['plddt']
|
||||
}
|
||||
|
||||
# Example for antibody-antigen binding
|
||||
antibody_seq = "EVQLVESGGGLVQPGG..."
|
||||
antigen_seq = "MKVLWAALLGLLGAAA..."
|
||||
|
||||
stability = predict_interface_stability(antibody_seq, antigen_seq)
|
||||
print(f"Interface pTM: {stability['ipTM']:.3f}")
|
||||
|
||||
# Interpretation
|
||||
if stability['ipTM'] > 0.7:
|
||||
print("High confidence interface")
|
||||
elif stability['ipTM'] > 0.5:
|
||||
print("Moderate confidence interface")
|
||||
else:
|
||||
print("Low confidence interface - may need redesign")
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- ipTM > 0.7: Strong predicted interface
|
||||
- ipTM 0.5-0.7: Moderate interface confidence
|
||||
- ipTM < 0.5: Weak interface, consider redesign
|
||||
|
||||
**When to use:**
|
||||
- Antibody-antigen design
|
||||
- Protein-protein interaction engineering
|
||||
- Validating binding interfaces
|
||||
- Comparing interface variants
|
||||
|
||||
### pSAE - Solvent-Accessible Hydrophobic Residues
|
||||
|
||||
**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.
|
||||
|
||||
**Method:** Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Requires structure (PDB file or AlphaFold prediction)
|
||||
# Install: uv pip install biopython
|
||||
|
||||
from Bio.PDB import PDBParser, DSSP
|
||||
import numpy as np
|
||||
|
||||
def calculate_psae(pdb_file):
|
||||
"""
|
||||
Calculate percent Solvent-Accessible hydrophobic residues (pSAE)
|
||||
|
||||
Lower pSAE = better solubility
|
||||
"""
|
||||
|
||||
parser = PDBParser(QUIET=True)
|
||||
structure = parser.get_structure('protein', pdb_file)
|
||||
|
||||
# Run DSSP to get solvent accessibility
|
||||
model = structure[0]
|
||||
dssp = DSSP(model, pdb_file, acc_array='Wilke')
|
||||
|
||||
hydrophobic = ['ALA', 'VAL', 'ILE', 'LEU', 'MET', 'PHE', 'TRP', 'PRO']
|
||||
|
||||
total_sasa = 0
|
||||
hydrophobic_sasa = 0
|
||||
|
||||
for residue in dssp:
|
||||
res_name = residue[1]
|
||||
rel_accessibility = residue[3]
|
||||
|
||||
total_sasa += rel_accessibility
|
||||
if res_name in hydrophobic:
|
||||
hydrophobic_sasa += rel_accessibility
|
||||
|
||||
psae = (hydrophobic_sasa / total_sasa) * 100
|
||||
|
||||
return psae
|
||||
|
||||
# Example
|
||||
pdb_file = "protein_structure.pdb"
|
||||
psae_score = calculate_psae(pdb_file)
|
||||
print(f"pSAE: {psae_score:.2f}%")
|
||||
|
||||
# Interpretation
|
||||
if psae_score < 25:
|
||||
print("Good solubility expected")
|
||||
elif psae_score < 35:
|
||||
print("Moderate solubility")
|
||||
else:
|
||||
print("High aggregation risk")
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- pSAE < 25%: Low aggregation risk
|
||||
- pSAE 25-35%: Moderate risk
|
||||
- pSAE > 35%: High aggregation risk
|
||||
|
||||
**When to use:**
|
||||
- Analyzing designed structures
|
||||
- Post-AlphaFold validation
|
||||
- Identifying aggregation hotspots
|
||||
- Guiding surface mutations
|
||||
|
||||
## Recommended Optimization Workflow
|
||||
|
||||
### Step 1: Initial Screening (Fast)
|
||||
|
||||
```python
|
||||
def initial_screening(sequences):
|
||||
"""
|
||||
Quick first-pass filtering using NetSolP
|
||||
Filters out obviously problematic sequences
|
||||
"""
|
||||
passed = []
|
||||
for name, seq in sequences.items():
|
||||
netsolp_score = predict_solubility_netsolp(seq)
|
||||
if netsolp_score > 0.5:
|
||||
passed.append((name, seq))
|
||||
|
||||
return passed
|
||||
```
|
||||
|
||||
### Step 2: Detailed Assessment (Moderate)
|
||||
|
||||
```python
|
||||
def detailed_assessment(filtered_sequences):
|
||||
"""
|
||||
More thorough analysis with SoluProt and ESM
|
||||
Ranks sequences by multiple criteria
|
||||
"""
|
||||
results = []
|
||||
for name, seq in filtered_sequences:
|
||||
soluprot_score = predict_solubility(seq)
|
||||
esm_score = score_sequence_esm(seq)
|
||||
|
||||
combined_score = soluprot_score * 0.7 + esm_score * 0.3
|
||||
|
||||
results.append({
|
||||
'name': name,
|
||||
'sequence': seq,
|
||||
'soluprot': soluprot_score,
|
||||
'esm': esm_score,
|
||||
'combined': combined_score
|
||||
})
|
||||
|
||||
results.sort(key=lambda x: x['combined'], reverse=True)
|
||||
return results
|
||||
```
|
||||
|
||||
### Step 3: Sequence Optimization (If needed)
|
||||
|
||||
```python
|
||||
def optimize_problematic_sequences(sequences_needing_optimization):
|
||||
"""
|
||||
Use SolubleMPNN to redesign problematic sequences
|
||||
Returns improved variants
|
||||
"""
|
||||
optimized = []
|
||||
for name, seq in sequences_needing_optimization:
|
||||
# Generate multiple variants
|
||||
variants = optimize_sequence(
|
||||
sequence=seq,
|
||||
num_variants=10,
|
||||
temperature=0.2
|
||||
)
|
||||
|
||||
# Score variants with ESM
|
||||
for variant in variants:
|
||||
variant['esm_score'] = score_sequence_esm(variant['sequence'])
|
||||
|
||||
# Keep best variants
|
||||
variants.sort(
|
||||
key=lambda x: x['solubility_score'] * x['esm_score'],
|
||||
reverse=True
|
||||
)
|
||||
|
||||
optimized.extend(variants[:3]) # Top 3 variants per sequence
|
||||
|
||||
return optimized
|
||||
```
|
||||
|
||||
### Step 4: Structure-Based Validation (For critical sequences)
|
||||
|
||||
```python
|
||||
def structure_validation(top_candidates):
|
||||
"""
|
||||
Predict structures and calculate pSAE for top candidates
|
||||
Final validation before experimental testing
|
||||
"""
|
||||
validated = []
|
||||
for candidate in top_candidates:
|
||||
# Predict structure with AlphaFold
|
||||
structure_pdb = predict_structure_alphafold(candidate['sequence'])
|
||||
|
||||
# Calculate pSAE
|
||||
psae = calculate_psae(structure_pdb)
|
||||
|
||||
candidate['psae'] = psae
|
||||
candidate['pass_structure_check'] = psae < 30
|
||||
|
||||
validated.append(candidate)
|
||||
|
||||
return validated
|
||||
```
|
||||
|
||||
### Complete Workflow Example
|
||||
|
||||
```python
|
||||
def complete_optimization_pipeline(initial_sequences):
|
||||
"""
|
||||
End-to-end optimization pipeline
|
||||
|
||||
Input: Dictionary of {name: sequence}
|
||||
Output: Ranked list of optimized, validated sequences
|
||||
"""
|
||||
|
||||
print("Step 1: Initial screening with NetSolP...")
|
||||
filtered = initial_screening(initial_sequences)
|
||||
print(f" Passed: {len(filtered)}/{len(initial_sequences)}")
|
||||
|
||||
print("Step 2: Detailed assessment with SoluProt and ESM...")
|
||||
assessed = detailed_assessment(filtered)
|
||||
|
||||
# Split into good and needs-optimization
|
||||
good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
|
||||
needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]
|
||||
|
||||
print(f" Good sequences: {len(good_sequences)}")
|
||||
print(f" Need optimization: {len(needs_optimization)}")
|
||||
|
||||
if needs_optimization:
|
||||
print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
|
||||
optimized = optimize_problematic_sequences(needs_optimization)
|
||||
all_sequences = good_sequences + optimized
|
||||
else:
|
||||
all_sequences = good_sequences
|
||||
|
||||
print("Step 4: Structure-based validation for top candidates...")
|
||||
top_20 = all_sequences[:20]
|
||||
final_validated = structure_validation(top_20)
|
||||
|
||||
# Final ranking
|
||||
final_validated.sort(
|
||||
key=lambda x: (
|
||||
x['pass_structure_check'],
|
||||
x['combined'],
|
||||
-x['psae']
|
||||
),
|
||||
reverse=True
|
||||
)
|
||||
|
||||
return final_validated
|
||||
|
||||
# Usage
|
||||
initial_library = {
|
||||
'variant_1': 'MKVLWAALLGLLGAAA...',
|
||||
'variant_2': 'MATGVLWAALLGLLGA...',
|
||||
# ... more sequences
|
||||
}
|
||||
|
||||
optimized_library = complete_optimization_pipeline(initial_library)
|
||||
|
||||
# Submit top sequences to Adaptyv
|
||||
top_sequences_for_testing = optimized_library[:50]
|
||||
```
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
1. **Always pre-screen** before experimental testing
|
||||
2. **Use NetSolP first** for fast filtering of large libraries
|
||||
3. **Apply SolubleMPNN** as default optimization tool
|
||||
4. **Validate with ESM** to avoid unnatural sequences
|
||||
5. **Calculate pSAE** for structure-based validation
|
||||
6. **Test multiple variants** per design to account for prediction uncertainty
|
||||
7. **Keep controls** - include wild-type or known-good sequences
|
||||
8. **Iterate** - use experimental results to refine predictions
|
||||
|
||||
## Integration with Adaptyv
|
||||
|
||||
After computational optimization, submit sequences to Adaptyv:
|
||||
|
||||
```python
|
||||
# After optimization pipeline
|
||||
optimized_sequences = complete_optimization_pipeline(initial_library)
|
||||
|
||||
# Prepare FASTA format
|
||||
fasta_content = ""
|
||||
for seq_data in optimized_sequences[:50]: # Top 50
|
||||
fasta_content += f">{seq_data['name']}\n{seq_data['sequence']}\n"
|
||||
|
||||
# Submit to Adaptyv
|
||||
import requests
|
||||
response = requests.post(
|
||||
"https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
|
||||
headers={"Authorization": f"Bearer {api_key}"},
|
||||
json={
|
||||
"sequences": fasta_content,
|
||||
"experiment_type": "expression",
|
||||
"metadata": {
|
||||
"optimization_method": "SolubleMPNN_ESM_pipeline",
|
||||
"computational_scores": [s['combined'] for s in optimized_sequences[:50]]
|
||||
}
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Issue: All sequences score poorly on solubility predictions**
|
||||
- Check if sequences contain unusual amino acids
|
||||
- Verify FASTA format is correct
|
||||
- Consider if protein family is naturally low-solubility
|
||||
- May need experimental validation despite predictions
|
||||
|
||||
**Issue: SolubleMPNN changes functionally important residues**
|
||||
- Provide structure file to preserve spatial constraints
|
||||
- Mask critical residues from mutation
|
||||
- Lower temperature parameter for conservative changes
|
||||
- Manually revert problematic mutations
|
||||
|
||||
**Issue: ESM scores are low after optimization**
|
||||
- Optimization may be too aggressive
|
||||
- Try lower temperature in SolubleMPNN
|
||||
- Balance between solubility and naturalness
|
||||
- Consider that some optimization may require non-natural mutations
|
||||
|
||||
**Issue: Predictions don't match experimental results**
|
||||
- Predictions are probabilistic, not deterministic
|
||||
- Host system and conditions affect expression
|
||||
- Some proteins may need experimental validation
|
||||
- Use predictions as enrichment, not absolute filters
|
||||
Reference in New Issue
Block a user