Files
gh-k-dense-ai-claude-scient…/skills/adaptyv/reference/protein_optimization.md
2025-11-30 08:30:10 +08:00

638 lines
17 KiB
Markdown

# Protein Sequence Optimization
## Overview
Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.
## Common Protein Expression Problems
### 1. Unpaired Cysteines
**Problem:**
- Unpaired cysteines form unwanted disulfide bonds
- Leads to aggregation and misfolding
- Reduces expression yield and stability
**Solution:**
- Remove unpaired cysteines unless functionally necessary
- Pair cysteines appropriately for structural disulfides
- Replace with serine or alanine in non-critical positions
**Example:**
```python
# Check for cysteine pairs
from Bio.Seq import Seq
def check_cysteines(sequence):
cys_count = sequence.count('C')
if cys_count % 2 != 0:
print(f"Warning: Odd number of cysteines ({cys_count})")
return cys_count
```
### 2. Excessive Hydrophobicity
**Problem:**
- Long hydrophobic patches promote aggregation
- Exposed hydrophobic residues drive protein clumping
- Poor solubility in aqueous buffers
**Solution:**
- Maintain balanced hydropathy profiles
- Use short, flexible linkers between domains
- Reduce surface-exposed hydrophobic residues
**Metrics:**
- Kyte-Doolittle hydropathy plots
- GRAVY score (Grand Average of Hydropathy)
- pSAE (percent Solvent-Accessible hydrophobic residues)
### 3. Low Solubility
**Problem:**
- Proteins precipitate during expression or purification
- Inclusion body formation
- Difficult downstream processing
**Solution:**
- Use solubility prediction tools for pre-screening
- Apply sequence optimization algorithms
- Add solubilizing tags if needed
## Computational Tools for Optimization
### NetSolP - Initial Solubility Screening
**Purpose:** Fast solubility prediction for filtering sequences.
**Method:** Machine learning model trained on E. coli expression data.
**Usage:**
```python
# Install: uv pip install requests
import requests
def predict_solubility_netsolp(sequence):
"""Predict protein solubility using NetSolP web service"""
url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"
data = {
"sequence": sequence,
"format": "fasta"
}
response = requests.post(url, data=data)
return response.json()
# Example
sequence = "MKVLWAALLGLLGAAA..."
result = predict_solubility_netsolp(sequence)
print(f"Solubility score: {result['score']}")
```
**Interpretation:**
- Score > 0.5: Likely soluble
- Score < 0.5: Likely insoluble
- Use for initial filtering before more expensive predictions
**When to use:**
- First-pass filtering of large libraries
- Quick validation of designed sequences
- Prioritizing sequences for experimental testing
### SoluProt - Comprehensive Solubility Prediction
**Purpose:** Advanced solubility prediction with higher accuracy.
**Method:** Deep learning model incorporating sequence and structural features.
**Usage:**
```python
# Install: uv pip install soluprot
from soluprot import predict_solubility
def screen_variants_soluprot(sequences):
"""Screen multiple sequences for solubility"""
results = []
for name, seq in sequences.items():
score = predict_solubility(seq)
results.append({
'name': name,
'sequence': seq,
'solubility_score': score,
'predicted_soluble': score > 0.6
})
return results
# Example
sequences = {
'variant_1': 'MKVLW...',
'variant_2': 'MATGV...'
}
results = screen_variants_soluprot(sequences)
soluble_variants = [r for r in results if r['predicted_soluble']]
```
**Interpretation:**
- Score > 0.6: High solubility confidence
- Score 0.4-0.6: Uncertain, may need optimization
- Score < 0.4: Likely problematic
**When to use:**
- After initial NetSolP filtering
- When higher prediction accuracy is needed
- Before committing to expensive synthesis/testing
### SolubleMPNN - Sequence Redesign
**Purpose:** Redesign protein sequences to improve solubility while maintaining function.
**Method:** Graph neural network that suggests mutations to increase solubility.
**Usage:**
```python
# Install: uv pip install soluble-mpnn
from soluble_mpnn import optimize_sequence
def optimize_for_solubility(sequence, structure_pdb=None):
"""
Redesign sequence for improved solubility
Args:
sequence: Original amino acid sequence
structure_pdb: Optional PDB file for structure-aware design
Returns:
Optimized sequence variants ranked by predicted solubility
"""
variants = optimize_sequence(
sequence=sequence,
structure=structure_pdb,
num_variants=10,
temperature=0.1 # Lower = more conservative mutations
)
return variants
# Example
original_seq = "MKVLWAALLGLLGAAA..."
optimized_variants = optimize_for_solubility(original_seq)
for i, variant in enumerate(optimized_variants):
print(f"Variant {i+1}:")
print(f" Sequence: {variant['sequence']}")
print(f" Solubility score: {variant['solubility_score']}")
print(f" Mutations: {variant['mutations']}")
```
**Design strategy:**
- **Conservative** (temperature=0.1): Minimal changes, safer
- **Moderate** (temperature=0.3): Balance between change and safety
- **Aggressive** (temperature=0.5): More mutations, higher risk
**When to use:**
- Primary tool for sequence optimization
- Default starting point for improving problematic sequences
- Generating diverse soluble variants
**Best practices:**
- Generate 10-50 variants per sequence
- Use structure information when available (improves accuracy)
- Validate key functional residues are preserved
- Test multiple temperature settings
### ESM (Evolutionary Scale Modeling) - Sequence Likelihood
**Purpose:** Assess how "natural" a protein sequence appears based on evolutionary patterns.
**Method:** Protein language model trained on millions of natural sequences.
**Usage:**
```python
# Install: uv pip install fair-esm
import torch
from esm import pretrained
def score_sequence_esm(sequence):
"""
Calculate ESM likelihood score for sequence
Higher scores indicate more natural/stable sequences
"""
model, alphabet = pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
data = [("protein", sequence)]
_, _, batch_tokens = batch_converter(data)
with torch.no_grad():
results = model(batch_tokens, repr_layers=[33])
token_logprobs = results["logits"].log_softmax(dim=-1)
# Calculate perplexity as sequence quality metric
sequence_score = token_logprobs.mean().item()
return sequence_score
# Example - Compare variants
sequences = {
'original': 'MKVLW...',
'optimized_1': 'MKVLS...',
'optimized_2': 'MKVLA...'
}
for name, seq in sequences.items():
score = score_sequence_esm(seq)
print(f"{name}: ESM score = {score:.3f}")
```
**Interpretation:**
- Higher scores → More "natural" sequence
- Use to avoid unlikely mutations
- Balance with functional requirements
**When to use:**
- Filtering synthetic designs
- Comparing SolubleMPNN variants
- Ensuring sequences aren't too artificial
- Avoiding expression bottlenecks
**Integration with design:**
```python
def rank_variants_by_esm(variants):
"""Rank protein variants by ESM likelihood"""
scored = []
for v in variants:
esm_score = score_sequence_esm(v['sequence'])
v['esm_score'] = esm_score
scored.append(v)
# Sort by combined solubility and ESM score
scored.sort(
key=lambda x: x['solubility_score'] * x['esm_score'],
reverse=True
)
return scored
```
### ipTM - Interface Stability (AlphaFold-Multimer)
**Purpose:** Assess protein-protein interface stability and binding confidence.
**Method:** Interface predicted TM-score from AlphaFold-Multimer predictions.
**Usage:**
```python
# Requires AlphaFold-Multimer installation
# Or use ColabFold for easier access
def predict_interface_stability(protein_a_seq, protein_b_seq):
"""
Predict interface stability using AlphaFold-Multimer
Returns ipTM score: higher = more stable interface
"""
from colabfold import run_alphafold_multimer
sequences = {
'chainA': protein_a_seq,
'chainB': protein_b_seq
}
result = run_alphafold_multimer(sequences)
return {
'ipTM': result['iptm'],
'pTM': result['ptm'],
'pLDDT': result['plddt']
}
# Example for antibody-antigen binding
antibody_seq = "EVQLVESGGGLVQPGG..."
antigen_seq = "MKVLWAALLGLLGAAA..."
stability = predict_interface_stability(antibody_seq, antigen_seq)
print(f"Interface pTM: {stability['ipTM']:.3f}")
# Interpretation
if stability['ipTM'] > 0.7:
print("High confidence interface")
elif stability['ipTM'] > 0.5:
print("Moderate confidence interface")
else:
print("Low confidence interface - may need redesign")
```
**Interpretation:**
- ipTM > 0.7: Strong predicted interface
- ipTM 0.5-0.7: Moderate interface confidence
- ipTM < 0.5: Weak interface, consider redesign
**When to use:**
- Antibody-antigen design
- Protein-protein interaction engineering
- Validating binding interfaces
- Comparing interface variants
### pSAE - Solvent-Accessible Hydrophobic Residues
**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.
**Method:** Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.
**Usage:**
```python
# Requires structure (PDB file or AlphaFold prediction)
# Install: uv pip install biopython
from Bio.PDB import PDBParser, DSSP
import numpy as np
def calculate_psae(pdb_file):
"""
Calculate percent Solvent-Accessible hydrophobic residues (pSAE)
Lower pSAE = better solubility
"""
parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', pdb_file)
# Run DSSP to get solvent accessibility
model = structure[0]
dssp = DSSP(model, pdb_file, acc_array='Wilke')
hydrophobic = ['ALA', 'VAL', 'ILE', 'LEU', 'MET', 'PHE', 'TRP', 'PRO']
total_sasa = 0
hydrophobic_sasa = 0
for residue in dssp:
res_name = residue[1]
rel_accessibility = residue[3]
total_sasa += rel_accessibility
if res_name in hydrophobic:
hydrophobic_sasa += rel_accessibility
psae = (hydrophobic_sasa / total_sasa) * 100
return psae
# Example
pdb_file = "protein_structure.pdb"
psae_score = calculate_psae(pdb_file)
print(f"pSAE: {psae_score:.2f}%")
# Interpretation
if psae_score < 25:
print("Good solubility expected")
elif psae_score < 35:
print("Moderate solubility")
else:
print("High aggregation risk")
```
**Interpretation:**
- pSAE < 25%: Low aggregation risk
- pSAE 25-35%: Moderate risk
- pSAE > 35%: High aggregation risk
**When to use:**
- Analyzing designed structures
- Post-AlphaFold validation
- Identifying aggregation hotspots
- Guiding surface mutations
## Recommended Optimization Workflow
### Step 1: Initial Screening (Fast)
```python
def initial_screening(sequences):
"""
Quick first-pass filtering using NetSolP
Filters out obviously problematic sequences
"""
passed = []
for name, seq in sequences.items():
netsolp_score = predict_solubility_netsolp(seq)
if netsolp_score > 0.5:
passed.append((name, seq))
return passed
```
### Step 2: Detailed Assessment (Moderate)
```python
def detailed_assessment(filtered_sequences):
"""
More thorough analysis with SoluProt and ESM
Ranks sequences by multiple criteria
"""
results = []
for name, seq in filtered_sequences:
soluprot_score = predict_solubility(seq)
esm_score = score_sequence_esm(seq)
combined_score = soluprot_score * 0.7 + esm_score * 0.3
results.append({
'name': name,
'sequence': seq,
'soluprot': soluprot_score,
'esm': esm_score,
'combined': combined_score
})
results.sort(key=lambda x: x['combined'], reverse=True)
return results
```
### Step 3: Sequence Optimization (If needed)
```python
def optimize_problematic_sequences(sequences_needing_optimization):
"""
Use SolubleMPNN to redesign problematic sequences
Returns improved variants
"""
optimized = []
for name, seq in sequences_needing_optimization:
# Generate multiple variants
variants = optimize_sequence(
sequence=seq,
num_variants=10,
temperature=0.2
)
# Score variants with ESM
for variant in variants:
variant['esm_score'] = score_sequence_esm(variant['sequence'])
# Keep best variants
variants.sort(
key=lambda x: x['solubility_score'] * x['esm_score'],
reverse=True
)
optimized.extend(variants[:3]) # Top 3 variants per sequence
return optimized
```
### Step 4: Structure-Based Validation (For critical sequences)
```python
def structure_validation(top_candidates):
"""
Predict structures and calculate pSAE for top candidates
Final validation before experimental testing
"""
validated = []
for candidate in top_candidates:
# Predict structure with AlphaFold
structure_pdb = predict_structure_alphafold(candidate['sequence'])
# Calculate pSAE
psae = calculate_psae(structure_pdb)
candidate['psae'] = psae
candidate['pass_structure_check'] = psae < 30
validated.append(candidate)
return validated
```
### Complete Workflow Example
```python
def complete_optimization_pipeline(initial_sequences):
"""
End-to-end optimization pipeline
Input: Dictionary of {name: sequence}
Output: Ranked list of optimized, validated sequences
"""
print("Step 1: Initial screening with NetSolP...")
filtered = initial_screening(initial_sequences)
print(f" Passed: {len(filtered)}/{len(initial_sequences)}")
print("Step 2: Detailed assessment with SoluProt and ESM...")
assessed = detailed_assessment(filtered)
# Split into good and needs-optimization
good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]
print(f" Good sequences: {len(good_sequences)}")
print(f" Need optimization: {len(needs_optimization)}")
if needs_optimization:
print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
optimized = optimize_problematic_sequences(needs_optimization)
all_sequences = good_sequences + optimized
else:
all_sequences = good_sequences
print("Step 4: Structure-based validation for top candidates...")
top_20 = all_sequences[:20]
final_validated = structure_validation(top_20)
# Final ranking
final_validated.sort(
key=lambda x: (
x['pass_structure_check'],
x['combined'],
-x['psae']
),
reverse=True
)
return final_validated
# Usage
initial_library = {
'variant_1': 'MKVLWAALLGLLGAAA...',
'variant_2': 'MATGVLWAALLGLLGA...',
# ... more sequences
}
optimized_library = complete_optimization_pipeline(initial_library)
# Submit top sequences to Adaptyv
top_sequences_for_testing = optimized_library[:50]
```
## Best Practices Summary
1. **Always pre-screen** before experimental testing
2. **Use NetSolP first** for fast filtering of large libraries
3. **Apply SolubleMPNN** as default optimization tool
4. **Validate with ESM** to avoid unnatural sequences
5. **Calculate pSAE** for structure-based validation
6. **Test multiple variants** per design to account for prediction uncertainty
7. **Keep controls** - include wild-type or known-good sequences
8. **Iterate** - use experimental results to refine predictions
## Integration with Adaptyv
After computational optimization, submit sequences to Adaptyv:
```python
# After optimization pipeline
optimized_sequences = complete_optimization_pipeline(initial_library)
# Prepare FASTA format
fasta_content = ""
for seq_data in optimized_sequences[:50]: # Top 50
fasta_content += f">{seq_data['name']}\n{seq_data['sequence']}\n"
# Submit to Adaptyv
import requests
response = requests.post(
"https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
headers={"Authorization": f"Bearer {api_key}"},
json={
"sequences": fasta_content,
"experiment_type": "expression",
"metadata": {
"optimization_method": "SolubleMPNN_ESM_pipeline",
"computational_scores": [s['combined'] for s in optimized_sequences[:50]]
}
}
)
```
## Troubleshooting
**Issue: All sequences score poorly on solubility predictions**
- Check if sequences contain unusual amino acids
- Verify FASTA format is correct
- Consider if protein family is naturally low-solubility
- May need experimental validation despite predictions
**Issue: SolubleMPNN changes functionally important residues**
- Provide structure file to preserve spatial constraints
- Mask critical residues from mutation
- Lower temperature parameter for conservative changes
- Manually revert problematic mutations
**Issue: ESM scores are low after optimization**
- Optimization may be too aggressive
- Try lower temperature in SolubleMPNN
- Balance between solubility and naturalness
- Consider that some optimization may require non-natural mutations
**Issue: Predictions don't match experimental results**
- Predictions are probabilistic, not deterministic
- Host system and conditions affect expression
- Some proteins may need experimental validation
- Use predictions as enrichment, not absolute filters