# Protein Sequence Optimization

## Overview

Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.

## Common Protein Expression Problems

### 1. Unpaired Cysteines

**Problem:**
- Unpaired cysteines can form unwanted disulfide bonds
- Leads to aggregation and misfolding
- Reduces expression yield and stability

**Solution:**
- Remove unpaired cysteines unless functionally necessary
- Pair cysteines appropriately for structural disulfides
- Replace with serine or alanine in non-critical positions

**Example:**
```python
# Check for cysteine pairs

def check_cysteines(sequence):
    """Count cysteines and warn when the count is odd.

    An odd count guarantees at least one unpaired cysteine. An even count
    does not prove correct pairing; that requires structural information.
    """
    cys_count = sequence.count('C')
    if cys_count % 2 != 0:
        print(f"Warning: Odd number of cysteines ({cys_count})")
    return cys_count
```

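The substitution strategy from the solution list can be sketched as follows. This is a minimal illustration rather than an established tool: the function name `replace_cysteines` and its `keep_positions` parameter are hypothetical, and deciding which cysteines are functionally or structurally required still has to come from structural or experimental knowledge.

```python
def replace_cysteines(sequence, keep_positions=(), replacement="S"):
    """Replace cysteines with `replacement`, except at protected positions.

    keep_positions holds 1-based positions of cysteines to preserve,
    for example residues in known structural disulfides.
    Returns the mutated sequence and a list of mutation labels such as "C8S".
    """
    keep = set(keep_positions)
    mutated = []
    mutations = []
    for pos, aa in enumerate(sequence, start=1):
        if aa == "C" and pos not in keep:
            mutated.append(replacement)
            mutations.append(f"C{pos}{replacement}")
        else:
            mutated.append(aa)
    return "".join(mutated), mutations


# Toy sequence: protect the cysteines at positions 2 and 5, mutate the rest
new_seq, muts = replace_cysteines("MCKACDEC", keep_positions=[2, 5])
print(new_seq, muts)
```

Passing `replacement="A"` gives the alanine variant of the same substitution.
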
### 2. Excessive Hydrophobicity

**Problem:**
- Long hydrophobic patches promote aggregation
- Exposed hydrophobic residues drive protein clumping
- Poor solubility in aqueous buffers

**Solution:**
- Maintain balanced hydropathy profiles
- Use short, flexible linkers between domains
- Reduce surface-exposed hydrophobic residues

**Metrics:**
- Kyte-Doolittle hydropathy plots
- GRAVY score (Grand Average of Hydropathy)
- pSAE (percent Solvent-Accessible hydrophobic residues)

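The GRAVY score and a Kyte-Doolittle hydropathy profile can both be computed directly from sequence. The sketch below is a minimal illustration: the function names `gravy` and `hydropathy_profile` and the default window of 9 residues are illustrative choices, and it assumes the sequence contains only the 20 standard amino acids.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand Average of Hydropathy: mean Kyte-Doolittle value over all residues."""
    if not sequence:
        raise ValueError("Sequence is empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def hydropathy_profile(sequence, window=9):
    """Sliding-window mean hydropathy, one value per full window."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


seq = "MKVLWAALLGLLGAAA"
print(f"GRAVY: {gravy(seq):.2f}")
print(f"Most hydrophobic window: {max(hydropathy_profile(seq)):.2f}")
```

A positive GRAVY value indicates a net hydrophobic sequence, while peaks in the windowed profile point to local hydrophobic patches that a whole-sequence average can hide.
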
### 3. Low Solubility

**Problem:**
- Proteins precipitate during expression or purification
- Inclusion body formation
- Difficult downstream processing

**Solution:**
- Use solubility prediction tools for pre-screening
- Apply sequence optimization algorithms
- Add solubilizing tags if needed

## Computational Tools for Optimization

### NetSolP - Initial Solubility Screening

**Purpose:** Fast solubility prediction for filtering sequences.

**Method:** Machine learning model trained on E. coli expression data.

**Usage:**
```python
# Install: uv pip install requests
import requests

def predict_solubility_netsolp(sequence):
    """Predict protein solubility using NetSolP web service"""
    # The endpoint and response fields are used as given here; confirm them
    # against the current NetSolP service documentation before relying on them
    url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"

    data = {
        "sequence": sequence,
        "format": "fasta"
    }

    response = requests.post(url, data=data, timeout=60)
    response.raise_for_status()  # Fail loudly on HTTP errors
    return response.json()

# Example
sequence = "MKVLWAALLGLLGAAA..."
result = predict_solubility_netsolp(sequence)
print(f"Solubility score: {result['score']}")
```

**Interpretation:**
- Score > 0.5: Likely soluble
- Score ≤ 0.5: Likely insoluble
- Use for initial filtering before more expensive predictions

**When to use:**
- First-pass filtering of large libraries
- Quick validation of designed sequences
- Prioritizing sequences for experimental testing

### SoluProt - Comprehensive Solubility Prediction

**Purpose:** Second-pass solubility prediction for sequences that survive initial filtering.

**Method:** Gradient boosting model trained on sequence-derived features.

**Usage:**
```python
# Install: uv pip install soluprot
# Assumes a `soluprot` module that exposes predict_solubility(sequence) -> float
from soluprot import predict_solubility

def screen_variants_soluprot(sequences):
    """Screen multiple sequences for solubility"""
    results = []
    for name, seq in sequences.items():
        score = predict_solubility(seq)
        results.append({
            'name': name,
            'sequence': seq,
            'solubility_score': score,
            'predicted_soluble': score > 0.6
        })
    return results

# Example
sequences = {
    'variant_1': 'MKVLW...',
    'variant_2': 'MATGV...'
}

results = screen_variants_soluprot(sequences)
soluble_variants = [r for r in results if r['predicted_soluble']]
```

**Interpretation:**
- Score > 0.6: High solubility confidence
- Score 0.4-0.6: Uncertain, may need optimization
- Score < 0.4: Likely problematic

**When to use:**
- After initial NetSolP filtering
- When higher prediction accuracy is needed
- Before committing to expensive synthesis/testing

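The three interpretation bands can be made explicit so that borderline sequences are routed to optimization instead of being silently dropped. This is a small sketch under the thresholds stated above; the function name `classify_soluprot` and the label strings are illustrative.

```python
def classify_soluprot(score, low=0.4, high=0.6):
    """Map a solubility score onto the three interpretation bands."""
    if score > high:
        return "soluble"
    if score < low:
        return "problematic"
    return "uncertain"


# Hypothetical scores, for illustration only
for name, score in {"variant_1": 0.72, "variant_2": 0.55, "variant_3": 0.21}.items():
    print(name, classify_soluprot(score))
```

Sequences labelled "uncertain" are natural candidates for redesign, while "problematic" ones may need redesign or replacement.
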
### SolubleMPNN - Sequence Redesign

**Purpose:** Redesign protein sequences to improve solubility while maintaining function.

**Method:** Graph neural network (ProteinMPNN retrained on soluble proteins) that proposes sequence changes to increase solubility.

**Usage:**
```python
# Install: uv pip install soluble-mpnn
# Assumes a `soluble_mpnn` module that exposes optimize_sequence() as used below
from soluble_mpnn import optimize_sequence

def optimize_for_solubility(sequence, structure_pdb=None):
    """
    Redesign sequence for improved solubility

    Args:
        sequence: Original amino acid sequence
        structure_pdb: Optional PDB file for structure-aware design

    Returns:
        Optimized sequence variants ranked by predicted solubility
    """

    variants = optimize_sequence(
        sequence=sequence,
        structure=structure_pdb,
        num_variants=10,
        temperature=0.1  # Lower = more conservative mutations
    )

    return variants

# Example
original_seq = "MKVLWAALLGLLGAAA..."
optimized_variants = optimize_for_solubility(original_seq)

for i, variant in enumerate(optimized_variants):
    print(f"Variant {i+1}:")
    print(f"  Sequence: {variant['sequence']}")
    print(f"  Solubility score: {variant['solubility_score']}")
    print(f"  Mutations: {variant['mutations']}")
```

**Design strategy:**
- **Conservative** (temperature=0.1): Minimal changes, safer
- **Moderate** (temperature=0.3): Balance between change and safety
- **Aggressive** (temperature=0.5): More mutations, higher risk

**When to use:**
- Primary tool for sequence optimization
- Default starting point for improving problematic sequences
- Generating diverse soluble variants

**Best practices:**
- Generate 10-50 variants per sequence
- Use structure information when available (improves accuracy)
- Validate key functional residues are preserved
- Test multiple temperature settings

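Checking that key functional residues survive redesign needs only a position-by-position comparison. The sketch below is tool-independent and assumes the variant has the same length as the original (substitutions only, no insertions or deletions); the helper names `list_mutations` and `preserves_residues` are illustrative.

```python
def list_mutations(original, variant):
    """Return substitutions as labels such as 'V3A' (1-based positions)."""
    if len(original) != len(variant):
        raise ValueError("Sequences must have equal length")
    return [
        f"{a}{pos}{b}"
        for pos, (a, b) in enumerate(zip(original, variant), start=1)
        if a != b
    ]


def preserves_residues(original, variant, critical_positions):
    """True if every 1-based critical position is unchanged in the variant."""
    if len(original) != len(variant):
        raise ValueError("Sequences must have equal length")
    return all(original[p - 1] == variant[p - 1] for p in critical_positions)


# Toy sequences for illustration
original = "MKVLWAALLG"
variant = "MKELWAALSG"
print(list_mutations(original, variant))
print(preserves_residues(original, variant, critical_positions=[1, 5]))
```

Variants that fail the check can be discarded or have the offending positions reverted before further scoring.
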
### ESM (Evolutionary Scale Modeling) - Sequence Likelihood

**Purpose:** Assess how "natural" a protein sequence appears based on evolutionary patterns.

**Method:** Protein language model trained on millions of natural sequences.

**Usage:**
```python
# Install: uv pip install fair-esm
import torch
from esm import pretrained

def score_sequence_esm(sequence):
    """
    Calculate ESM likelihood score for sequence
    Higher scores indicate more natural/stable sequences
    """

    # Loading the model on every call is slow; cache it when scoring many sequences
    model, alphabet = pretrained.esm2_t33_650M_UR50D()
    model.eval()  # Disable dropout for deterministic scores
    batch_converter = alphabet.get_batch_converter()

    data = [("protein", sequence)]
    _, _, batch_tokens = batch_converter(data)

    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[33])
        token_logprobs = results["logits"].log_softmax(dim=-1)

    # Log-probability the model assigns to the residue actually present at each
    # position; [1:-1] drops the BOS and EOS tokens added by the batch converter
    observed = token_logprobs.gather(-1, batch_tokens.unsqueeze(-1)).squeeze(-1)
    sequence_score = observed[0, 1:-1].mean().item()

    return sequence_score

# Example - Compare variants
sequences = {
    'original': 'MKVLW...',
    'optimized_1': 'MKVLS...',
    'optimized_2': 'MKVLA...'
}

for name, seq in sequences.items():
    score = score_sequence_esm(seq)
    print(f"{name}: ESM score = {score:.3f}")
```

**Interpretation:**
- Higher scores → More "natural" sequence (scores are mean log-probabilities, so they are ≤ 0 and values closer to 0 are better)
- Use to avoid unlikely mutations
- Balance with functional requirements

**When to use:**
- Filtering synthetic designs
- Comparing SolubleMPNN variants
- Ensuring sequences aren't too artificial
- Avoiding expression bottlenecks

**Integration with design:**
```python
import math

def rank_variants_by_esm(variants):
    """Rank protein variants by ESM likelihood"""
    scored = []
    for v in variants:
        esm_score = score_sequence_esm(v['sequence'])
        v['esm_score'] = esm_score
        scored.append(v)

    # Sort by combined solubility and ESM score.
    # The ESM score is a mean log-probability (<= 0), so exp() maps it into
    # (0, 1] before multiplying; a raw product would invert the ranking.
    scored.sort(
        key=lambda x: x['solubility_score'] * math.exp(x['esm_score']),
        reverse=True
    )

    return scored
```

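A mean log-probability is always ≤ 0, which is awkward to read and to combine with scores that live in [0, 1]. Two standard transformations help, sketched below with illustrative function names: `exp(-mean_log_prob)` gives a perplexity-style value (lower is better), and `exp(mean_log_prob)` gives the geometric-mean per-residue probability in (0, 1] (higher is better).

```python
import math


def pseudo_perplexity(mean_log_prob):
    """Perplexity-style score: 1.0 is a perfect fit, larger is less natural."""
    return math.exp(-mean_log_prob)


def mean_residue_probability(mean_log_prob):
    """Geometric-mean probability per residue, in (0, 1]."""
    return math.exp(mean_log_prob)


# Hypothetical mean log-probabilities, for illustration only
for name, score in {"original": -0.9, "optimized_1": -1.4}.items():
    print(f"{name}: perplexity={pseudo_perplexity(score):.2f}, "
          f"mean p={mean_residue_probability(score):.2f}")
```

For reference, a model that spreads its probability uniformly over the 20 standard amino acids has a perplexity of 20, so values well below that indicate the model finds the sequence predictable.
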
### ipTM - Interface Stability (AlphaFold-Multimer)

**Purpose:** Assess protein-protein interface stability and binding confidence.

**Method:** Interface predicted TM-score from AlphaFold-Multimer predictions.

**Usage:**
```python
# Requires AlphaFold-Multimer installation
# Or use ColabFold for easier access

def predict_interface_stability(protein_a_seq, protein_b_seq):
    """
    Predict interface stability using AlphaFold-Multimer

    Returns ipTM score: higher = more stable interface
    """
    # `run_alphafold_multimer` stands in for your own wrapper; ColabFold is
    # normally run through its `colabfold_batch` command-line tool
    from colabfold import run_alphafold_multimer

    sequences = {
        'chainA': protein_a_seq,
        'chainB': protein_b_seq
    }

    result = run_alphafold_multimer(sequences)

    return {
        'ipTM': result['iptm'],
        'pTM': result['ptm'],
        'pLDDT': result['plddt']
    }

# Example for antibody-antigen binding
antibody_seq = "EVQLVESGGGLVQPGG..."
antigen_seq = "MKVLWAALLGLLGAAA..."

stability = predict_interface_stability(antibody_seq, antigen_seq)
print(f"ipTM: {stability['ipTM']:.3f}")

# Interpretation
if stability['ipTM'] > 0.7:
    print("High confidence interface")
elif stability['ipTM'] > 0.5:
    print("Moderate confidence interface")
else:
    print("Low confidence interface - may need redesign")
```

**Interpretation:**
- ipTM > 0.7: Strong predicted interface
- ipTM 0.5-0.7: Moderate interface confidence
- ipTM < 0.5: Weak interface, consider redesign

**When to use:**
- Antibody-antigen design
- Protein-protein interaction engineering
- Validating binding interfaces
- Comparing interface variants

### pSAE - Solvent-Accessible Hydrophobic Residues

**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.

**Method:** Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.

**Usage:**
```python
# Requires structure (PDB file or AlphaFold prediction)
# Install: uv pip install biopython
# Biopython's DSSP wrapper also needs the external DSSP executable (mkdssp) on PATH

from Bio.PDB import PDBParser, DSSP

def calculate_psae(pdb_file):
    """
    Calculate percent Solvent-Accessible hydrophobic residues (pSAE)

    Lower pSAE = better solubility
    """

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure('protein', pdb_file)

    # Run DSSP to get solvent accessibility
    model = structure[0]
    dssp = DSSP(model, pdb_file, acc_array='Wilke')

    # DSSP reports one-letter residue codes (A, V, I, L, M, F, W, P)
    hydrophobic = set('AVILMFWP')

    total_sasa = 0.0
    hydrophobic_sasa = 0.0

    for key in dssp.keys():
        residue = dssp[key]
        res_name = residue[1]
        rel_accessibility = residue[3]

        # Biopython returns 'NA' when no reference area exists for a residue
        if not isinstance(rel_accessibility, (int, float)):
            continue

        # Relative accessibility is summed as a proxy for exposed surface area
        total_sasa += rel_accessibility
        if res_name in hydrophobic:
            hydrophobic_sasa += rel_accessibility

    if total_sasa == 0:
        raise ValueError("No solvent-accessible residues found")

    psae = (hydrophobic_sasa / total_sasa) * 100

    return psae

# Example
pdb_file = "protein_structure.pdb"
psae_score = calculate_psae(pdb_file)
print(f"pSAE: {psae_score:.2f}%")

# Interpretation
if psae_score < 25:
    print("Good solubility expected")
elif psae_score < 35:
    print("Moderate solubility")
else:
    print("High aggregation risk")
```

**Interpretation:**
- pSAE < 25%: Low aggregation risk
- pSAE 25-35%: Moderate risk
- pSAE > 35%: High aggregation risk

**When to use:**
- Analyzing designed structures
- Post-AlphaFold validation
- Identifying aggregation hotspots
- Guiding surface mutations

## Recommended Optimization Workflow

### Step 1: Initial Screening (Fast)

```python
def initial_screening(sequences):
    """
    Quick first-pass filtering using NetSolP
    Filters out obviously problematic sequences
    """
    passed = []
    for name, seq in sequences.items():
        # predict_solubility_netsolp returns the parsed JSON response
        netsolp_score = predict_solubility_netsolp(seq)['score']
        if netsolp_score > 0.5:
            passed.append((name, seq))

    return passed
```

### Step 2: Detailed Assessment (Moderate)

```python
import math

def detailed_assessment(filtered_sequences):
    """
    More thorough analysis with SoluProt and ESM
    Ranks sequences by multiple criteria
    """
    results = []
    for name, seq in filtered_sequences:
        soluprot_score = predict_solubility(seq)
        esm_score = score_sequence_esm(seq)

        # exp() maps the ESM mean log-probability (<= 0) into (0, 1] so both
        # terms of the weighted sum share the same scale
        combined_score = soluprot_score * 0.7 + math.exp(esm_score) * 0.3

        results.append({
            'name': name,
            'sequence': seq,
            'soluprot': soluprot_score,
            'esm': esm_score,
            'combined': combined_score
        })

    results.sort(key=lambda x: x['combined'], reverse=True)
    return results
```

### Step 3: Sequence Optimization (If needed)

```python
import math

def optimize_problematic_sequences(sequences_needing_optimization):
    """
    Use SolubleMPNN to redesign problematic sequences
    Returns improved variants
    """
    optimized = []
    for entry in sequences_needing_optimization:
        # Entries are the dicts produced by detailed_assessment()
        name, seq = entry['name'], entry['sequence']

        # Generate multiple variants
        variants = optimize_sequence(
            sequence=seq,
            num_variants=10,
            temperature=0.2
        )

        # Score variants with ESM
        for i, variant in enumerate(variants, start=1):
            variant['name'] = f"{name}_opt{i}"
            variant['esm_score'] = score_sequence_esm(variant['sequence'])
            # Same 0.7/0.3 weighting as the detailed assessment step; exp() maps
            # the negative mean log-probability into (0, 1]
            variant['combined'] = (
                variant['solubility_score'] * 0.7
                + math.exp(variant['esm_score']) * 0.3
            )

        # Keep best variants
        variants.sort(key=lambda x: x['combined'], reverse=True)

        optimized.extend(variants[:3])  # Top 3 variants per sequence

    return optimized
```

### Step 4: Structure-Based Validation (For critical sequences)

```python
def structure_validation(top_candidates):
    """
    Predict structures and calculate pSAE for top candidates
    Final validation before experimental testing
    """
    validated = []
    for candidate in top_candidates:
        # Predict structure with AlphaFold
        # predict_structure_alphafold is a placeholder for your own wrapper;
        # it must return the path to a PDB file
        structure_pdb = predict_structure_alphafold(candidate['sequence'])

        # Calculate pSAE
        psae = calculate_psae(structure_pdb)

        candidate['psae'] = psae
        # 30% is the midpoint of the moderate-risk band (25-35%)
        candidate['pass_structure_check'] = psae < 30

        validated.append(candidate)

    return validated
```

### Complete Workflow Example

```python
def complete_optimization_pipeline(initial_sequences):
    """
    End-to-end optimization pipeline

    Input: Dictionary of {name: sequence}
    Output: Ranked list of optimized, validated sequences
    """

    print("Step 1: Initial screening with NetSolP...")
    filtered = initial_screening(initial_sequences)
    print(f"  Passed: {len(filtered)}/{len(initial_sequences)}")

    print("Step 2: Detailed assessment with SoluProt and ESM...")
    assessed = detailed_assessment(filtered)

    # Split into good and needs-optimization
    good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
    needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]

    print(f"  Good sequences: {len(good_sequences)}")
    print(f"  Need optimization: {len(needs_optimization)}")

    if needs_optimization:
        print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
        optimized = optimize_problematic_sequences(needs_optimization)
        all_sequences = good_sequences + optimized
    else:
        all_sequences = good_sequences

    print("Step 4: Structure-based validation for top candidates...")
    # Rank the merged pool before truncating so the best candidates are validated
    all_sequences.sort(key=lambda x: x['combined'], reverse=True)
    top_20 = all_sequences[:20]
    final_validated = structure_validation(top_20)

    # Final ranking
    final_validated.sort(
        key=lambda x: (
            x['pass_structure_check'],
            x['combined'],
            -x['psae']
        ),
        reverse=True
    )

    return final_validated

# Usage
initial_library = {
    'variant_1': 'MKVLWAALLGLLGAAA...',
    'variant_2': 'MATGVLWAALLGLLGA...',
    # ... more sequences
}

optimized_library = complete_optimization_pipeline(initial_library)

# Submit top sequences to Adaptyv
top_sequences_for_testing = optimized_library[:50]
```

## Best Practices Summary

1. **Always pre-screen** before experimental testing
2. **Use NetSolP first** for fast filtering of large libraries
3. **Apply SolubleMPNN** as default optimization tool
4. **Validate with ESM** to avoid unnatural sequences
5. **Calculate pSAE** for structure-based validation
6. **Test multiple variants** per design to account for prediction uncertainty
7. **Keep controls** - include wild-type or known-good sequences
8. **Iterate** - use experimental results to refine predictions

## Integration with Adaptyv

After computational optimization, submit sequences to Adaptyv:

```python
import os
import requests

# After optimization pipeline
optimized_sequences = complete_optimization_pipeline(initial_library)

# Prepare FASTA format
fasta_content = ""
for seq_data in optimized_sequences[:50]:  # Top 50
    fasta_content += f">{seq_data['name']}\n{seq_data['sequence']}\n"

# Read the API key from the environment (the variable name is an assumption)
api_key = os.environ["ADAPTYV_API_KEY"]

# Submit to Adaptyv
response = requests.post(
    "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "sequences": fasta_content,
        "experiment_type": "expression",
        "metadata": {
            "optimization_method": "SolubleMPNN_ESM_pipeline",
            "computational_scores": [s['combined'] for s in optimized_sequences[:50]]
        }
    },
    timeout=60
)
response.raise_for_status()  # Fail loudly if the submission is rejected
```

## Troubleshooting

**Issue: All sequences score poorly on solubility predictions**
- Check if sequences contain unusual amino acids
- Verify FASTA format is correct
- Consider if protein family is naturally low-solubility
- May need experimental validation despite predictions

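The first two checks can be automated before any predictor is called. The sketch below is one possible approach: the helper names `find_nonstandard_residues` and `parse_fasta` are illustrative, and "unusual" is taken to mean anything outside the 20 standard one-letter codes.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard_residues(sequence):
    """Return (1-based position, character) for every non-standard residue."""
    return [
        (pos, aa)
        for pos, aa in enumerate(sequence.upper(), start=1)
        if aa not in STANDARD_AA
    ]


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; raise ValueError if malformed."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("FASTA header without a name")
            name = header.split()[0]
            if name in records:
                raise ValueError(f"Duplicate record name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("Sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


# Toy input for illustration
records = parse_fasta(">variant_1\nMKVLW\nAALLG\n>variant_2\nMKXLB\n")
for name, seq in records.items():
    print(name, find_nonstandard_residues(seq))
```

Sequences with reported positions should be corrected or excluded, since predictors may reject or silently mis-score characters outside the standard alphabet.
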
**Issue: SolubleMPNN changes functionally important residues**
- Provide structure file to preserve spatial constraints
- Mask critical residues from mutation
- Lower temperature parameter for conservative changes
- Manually revert problematic mutations

**Issue: ESM scores are low after optimization**
- Optimization may be too aggressive
- Try lower temperature in SolubleMPNN
- Balance between solubility and naturalness
- Consider that some optimization may require non-natural mutations

**Issue: Predictions don't match experimental results**
- Predictions are probabilistic, not deterministic
- Host system and conditions affect expression
- Some proteins may need experimental validation
- Use predictions as enrichment, not absolute filters

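Using predictions as enrichment means ranking candidates and keeping the best fraction, so a library is never emptied by a hard cutoff. A minimal sketch, assuming each candidate is a dict with a numeric score under a chosen key; the function name `select_top_fraction` and its defaults are illustrative.

```python
import math


def select_top_fraction(candidates, key="combined", fraction=0.25, minimum=1):
    """Keep the best-scoring fraction of candidates instead of applying a cutoff."""
    ranked = sorted(candidates, key=lambda c: c[key], reverse=True)
    n_keep = max(minimum, math.ceil(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical scores, for illustration only
library = [
    {"name": "a", "combined": 0.10},
    {"name": "b", "combined": 0.90},
    {"name": "c", "combined": 0.50},
    {"name": "d", "combined": 0.30},
]
print([c["name"] for c in select_top_fraction(library, fraction=0.5)])
```

With this approach a uniformly low-scoring protein family still yields its relatively best members for testing, which a fixed threshold would have rejected outright.
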