# Protein Sequence Optimization

## Overview

Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.

## Common Protein Expression Problems

### 1. Unpaired Cysteines

**Problem:**
- Unpaired cysteines can form unwanted intermolecular or non-native disulfide bonds
- Leads to aggregation and misfolding
- Reduces expression yield and stability

**Solution:**
- Remove unpaired cysteines unless functionally necessary
- Pair cysteines appropriately for structural disulfides
- Replace with serine or alanine in non-critical positions

**Example:**
```python
# Count cysteines; an odd count guarantees at least one unpaired cysteine
def check_cysteines(sequence):
    cys_count = sequence.count('C')
    if cys_count % 2 != 0:
        print(f"Warning: Odd number of cysteines ({cys_count})")
    # An even count does not prove correct pairing; confirm against a structure
    return cys_count
```
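
The serine-substitution option from the solution list can be sketched as follows. The helper name `substitute_cysteines`, the 1-based position convention, and the example sequence are assumptions made for illustration; which cysteines are safe to replace still has to come from structural or functional knowledge.

```python
def substitute_cysteines(sequence, positions, replacement="S"):
    """Replace cysteines at the given 1-based positions (hypothetical helper)."""
    residues = list(sequence)
    for pos in positions:
        if residues[pos - 1] != "C":
            raise ValueError(f"Position {pos} is {residues[pos - 1]}, not C")
        residues[pos - 1] = replacement
    return "".join(residues)


# Illustrative sequence with three cysteines, so one must be unpaired
example = "MKCAVLCGSC"
cys_positions = [i + 1 for i, aa in enumerate(example) if aa == "C"]
print(f"Cysteine positions: {cys_positions}")

# Replace the cysteine assumed to be non-critical (position 10 here)
print(substitute_cysteines(example, [10]))
```

Raising an error when a listed position is not a cysteine guards against off-by-one mistakes between 0-based and 1-based numbering.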

### 2. Excessive Hydrophobicity

**Problem:**
- Long hydrophobic patches promote aggregation
- Exposed hydrophobic residues drive protein clumping
- Poor solubility in aqueous buffers

**Solution:**
- Maintain balanced hydropathy profiles
- Use short, flexible linkers between domains
- Reduce surface-exposed hydrophobic residues

**Metrics:**
- Kyte-Doolittle hydropathy plots
- GRAVY score (Grand Average of Hydropathy)
- pSAE (percent Solvent-Accessible hydrophobic residues)
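
The first two metrics need only the sequence. Below is a minimal sketch using the published Kyte-Doolittle scale; the function names, the window size of 9, and the 1.6 threshold are illustrative assumptions, not recommended cutoffs.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """GRAVY score: mean Kyte-Doolittle value over all residues."""
    return sum(KD_SCALE[aa] for aa in sequence) / len(sequence)


def hydropathy_profile(sequence, window=9):
    """Sliding-window mean hydropathy, one value per full window."""
    values = [KD_SCALE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def hydrophobic_windows(sequence, window=9, threshold=1.6):
    """1-based start positions of windows whose mean exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, value in enumerate(profile) if value > threshold]


# Illustrative sequence: hydrophobic stretch followed by charged residues
seq = "LLLLKKKK"
print(f"GRAVY: {gravy(seq):.2f}")
print(f"Hydrophobic windows (window=4): {hydrophobic_windows(seq, window=4)}")
```

A positive GRAVY indicates a net hydrophobic sequence, while the windowed profile locates the patches that a single average would hide.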

### 3. Low Solubility

**Problem:**
- Proteins precipitate during expression or purification
- Inclusion body formation
- Difficult downstream processing

**Solution:**
- Use solubility prediction tools for pre-screening
- Apply sequence optimization algorithms
- Add solubilizing tags if needed

## Computational Tools for Optimization

### NetSolP - Initial Solubility Screening

**Purpose:** Fast solubility prediction for filtering sequences.

**Method:** Machine learning model trained on E. coli expression data.

**Usage:**
```python
# Install: uv pip install requests
import requests

def predict_solubility_netsolp(sequence):
    """Predict protein solubility using NetSolP web service"""
    url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"

    data = {
        "sequence": sequence,
        "format": "fasta"
    }

    response = requests.post(url, data=data, timeout=60)
    response.raise_for_status()  # Surface HTTP errors instead of parsing an error page
    return response.json()

# Example
sequence = "MKVLWAALLGLLGAAA..."
result = predict_solubility_netsolp(sequence)
print(f"Solubility score: {result['score']}")
```

**Interpretation:**
- Score > 0.5: Likely soluble
- Score < 0.5: Likely insoluble
- Use for initial filtering before more expensive predictions

**When to use:**
- First-pass filtering of large libraries
- Quick validation of designed sequences
- Prioritizing sequences for experimental testing

### SoluProt - Comprehensive Solubility Prediction

**Purpose:** Advanced solubility prediction with higher accuracy.

**Method:** Gradient boosting machine trained on sequence-derived features.

**Usage:**
```python
# Install: uv pip install soluprot
from soluprot import predict_solubility

def screen_variants_soluprot(sequences):
    """Screen multiple sequences for solubility"""
    results = []
    for name, seq in sequences.items():
        score = predict_solubility(seq)
        results.append({
            'name': name,
            'sequence': seq,
            'solubility_score': score,
            'predicted_soluble': score > 0.6
        })
    return results

# Example
sequences = {
    'variant_1': 'MKVLW...',
    'variant_2': 'MATGV...'
}

results = screen_variants_soluprot(sequences)
soluble_variants = [r for r in results if r['predicted_soluble']]
```

**Interpretation:**
- Score > 0.6: High solubility confidence
- Score 0.4-0.6: Uncertain, may need optimization
- Score < 0.4: Likely problematic
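
The three score bands above can be applied to a batch of results as a simple triage step. This is a sketch only: the function name `triage_by_solubility` and the example scores are hypothetical, and the cutoffs are the ones listed above.

```python
def triage_by_solubility(scores, low=0.4, high=0.6):
    """Sort sequence names into the three interpretation bands."""
    bins = {"soluble": [], "uncertain": [], "problematic": []}
    for name, score in scores.items():
        if score > high:
            bins["soluble"].append(name)
        elif score >= low:
            bins["uncertain"].append(name)
        else:
            bins["problematic"].append(name)
    return bins


# Hypothetical scores, not real predictor output
example_scores = {"variant_1": 0.72, "variant_2": 0.55, "variant_3": 0.31}
for band, names in triage_by_solubility(example_scores).items():
    print(f"{band}: {names}")
```

Keeping the uncertain band separate makes it easy to route only those sequences to redesign instead of discarding them.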

**When to use:**
- After initial NetSolP filtering
- When higher prediction accuracy is needed
- Before committing to expensive synthesis/testing

### SolubleMPNN - Sequence Redesign

**Purpose:** Redesign protein sequences to improve solubility while maintaining function.

**Method:** Graph neural network that suggests mutations to increase solubility.

**Usage:**
```python
# Install: uv pip install soluble-mpnn
from soluble_mpnn import optimize_sequence

def optimize_for_solubility(sequence, structure_pdb=None):
    """
    Redesign sequence for improved solubility

    Args:
        sequence: Original amino acid sequence
        structure_pdb: Optional PDB file for structure-aware design

    Returns:
        Optimized sequence variants ranked by predicted solubility
    """

    variants = optimize_sequence(
        sequence=sequence,
        structure=structure_pdb,
        num_variants=10,
        temperature=0.1  # Lower = more conservative mutations
    )

    return variants

# Example
original_seq = "MKVLWAALLGLLGAAA..."
optimized_variants = optimize_for_solubility(original_seq)

for i, variant in enumerate(optimized_variants):
    print(f"Variant {i+1}:")
    print(f"  Sequence: {variant['sequence']}")
    print(f"  Solubility score: {variant['solubility_score']}")
    print(f"  Mutations: {variant['mutations']}")
```

**Design strategy:**
- **Conservative** (temperature=0.1): Minimal changes, safer
- **Moderate** (temperature=0.3): Balance between change and safety
- **Aggressive** (temperature=0.5): More mutations, higher risk
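
To see why lower temperatures give more conservative designs, assume the temperature divides the model's per-position logits before a softmax, as is standard in sampling-based sequence design. The logits below are hypothetical and only illustrate the effect.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert per-residue logits to sampling probabilities at a temperature."""
    scaled = {aa: value / temperature for aa, value in logits.items()}
    peak = max(scaled.values())  # Subtract the max for numerical stability
    weights = {aa: math.exp(value - peak) for aa, value in scaled.items()}
    total = sum(weights.values())
    return {aa: weight / total for aa, weight in weights.items()}


# Hypothetical logits for one position (not real model output)
position_logits = {"L": 2.0, "I": 1.5, "K": 0.5, "E": 0.0}

for temperature in (0.1, 0.3, 0.5):
    probs = softmax_with_temperature(position_logits, temperature)
    summary = ", ".join(f"{aa}={p:.3f}" for aa, p in probs.items())
    print(f"T={temperature}: {summary}")
```

At low temperature almost all probability mass sits on the top-scoring residue, so sampled variants rarely deviate from the model's preferred choice; raising the temperature spreads mass onto alternatives and yields more diverse, riskier variants.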

**When to use:**
- Primary tool for sequence optimization
- Default starting point for improving problematic sequences
- Generating diverse soluble variants

**Best practices:**
- Generate 10-50 variants per sequence
- Use structure information when available (improves accuracy)
- Validate key functional residues are preserved
- Test multiple temperature settings
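
Checking that key functional residues are preserved can be automated with a plain sequence comparison. The sketch below assumes the variant has the same length as the original (substitutions only); the function names and example sequences are illustrative.

```python
def list_mutations(original, variant):
    """Return substitutions in the usual notation, e.g. 'W5S' (1-based)."""
    if len(original) != len(variant):
        raise ValueError("Sequences must have equal length")
    return [
        f"{old}{pos}{new}"
        for pos, (old, new) in enumerate(zip(original, variant), start=1)
        if old != new
    ]


def violated_positions(original, variant, protected):
    """Return protected 1-based positions that were mutated in the variant."""
    if len(original) != len(variant):
        raise ValueError("Sequences must have equal length")
    return sorted(
        pos for pos in protected if original[pos - 1] != variant[pos - 1]
    )


# Illustrative sequences; positions 5 and 9 are assumed to be functional
original = "MKVLWAALLG"
variant = "MKVLSAAELG"
print(list_mutations(original, variant))
print(violated_positions(original, variant, protected={5, 9}))
```

Variants with a non-empty violation list can be dropped or repaired before any further scoring.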

### ESM (Evolutionary Scale Modeling) - Sequence Likelihood

**Purpose:** Assess how "natural" a protein sequence appears based on evolutionary patterns.

**Method:** Protein language model trained on millions of natural sequences.

**Usage:**
```python
# Install: uv pip install fair-esm
import torch
from esm import pretrained

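def score_sequence_esm(sequence):
    """
    Calculate an ESM likelihood score for a sequence: the mean
    log-probability of the observed residues from one unmasked forward pass.
    Scores are <= 0; higher (closer to 0) indicates a more natural sequence.
    """

    # The model is reloaded on every call; cache it when scoring many sequences
    model, alphabet = pretrained.esm2_t33_650M_UR50D()
    model.eval()  # Disable dropout for deterministic scores
    batch_converter = alphabet.get_batch_converter()

    data = [("protein", sequence)]
    _, _, batch_tokens = batch_converter(data)

    with torch.no_grad():
        results = model(batch_tokens)
        token_logprobs = results["logits"].log_softmax(dim=-1)

    # Pick the log-probability of each observed token, then drop BOS/EOS
    observed = token_logprobs.gather(-1, batch_tokens.unsqueeze(-1)).squeeze(-1)
    sequence_score = observed[0, 1:-1].mean().item()

    return sequence_score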

# Example - Compare variants
sequences = {
    'original': 'MKVLW...',
    'optimized_1': 'MKVLS...',
    'optimized_2': 'MKVLA...'
}

for name, seq in sequences.items():
    score = score_sequence_esm(seq)
    print(f"{name}: ESM score = {score:.3f}")
```

**Interpretation:**
- Higher scores (closer to 0, since the score is a log-probability) → More "natural" sequence
- Use to avoid unlikely mutations
- Balance with functional requirements
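
A log-probability score is easier to read once converted to perplexity. The sketch below assumes the score is a mean per-residue natural-log probability; the per-residue probabilities are hypothetical values chosen for illustration.

```python
import math


def mean_logprob(per_residue_probs):
    """Mean natural-log probability over residues (always <= 0)."""
    return sum(math.log(p) for p in per_residue_probs) / len(per_residue_probs)


def perplexity(mean_lp):
    """Perplexity = exp(-mean log-probability); 1.0 is a perfect prediction."""
    return math.exp(-mean_lp)


# Hypothetical per-residue probabilities, not real model output
confident = [0.9, 0.8, 0.85, 0.9]
uncertain = [0.05, 0.05, 0.05, 0.05]

for label, probs in (("confident", confident), ("uncertain", uncertain)):
    score = mean_logprob(probs)
    print(f"{label}: score = {score:.3f}, perplexity = {perplexity(score):.2f}")
```

A perplexity near 20 means the model is about as uncertain as a uniform guess over the 20 standard amino acids, which is a useful reference point when comparing variants.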

**When to use:**
- Filtering synthetic designs
- Comparing SolubleMPNN variants
- Ensuring sequences aren't too artificial
- Avoiding expression bottlenecks

**Integration with design:**
```python
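import math

def rank_variants_by_esm(variants):
    """Rank protein variants by ESM likelihood"""
    scored = []
    for v in variants:
        esm_score = score_sequence_esm(v['sequence'])
        v['esm_score'] = esm_score
        scored.append(v)

    # Sort by combined solubility and ESM score.
    # ESM scores are mean log-probabilities (<= 0), so exp() maps them to
    # (0, 1] first; a raw product would rank the most soluble variants last.
    scored.sort(
        key=lambda x: x['solubility_score'] * math.exp(x['esm_score']),
        reverse=True
    )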

    return scored
```

### ipTM - Interface Stability (AlphaFold-Multimer)

**Purpose:** Assess confidence in a predicted protein-protein interface. ipTM reflects prediction confidence; it is not a measured binding affinity or thermodynamic stability.

**Method:** Interface predicted TM-score from AlphaFold-Multimer predictions.

**Usage:**
```python
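# Requires AlphaFold-Multimer installation
# Or use ColabFold for easier access (its usual entry point is the
# colabfold_batch command-line tool)

def predict_interface_stability(protein_a_seq, protein_b_seq):
    """
    Predict interface confidence using AlphaFold-Multimer

    Returns ipTM score: higher = more confident predicted interface
    """
    # NOTE: treat run_alphafold_multimer as a placeholder for your own wrapper
    # around the prediction run; verify it against your ColabFold installation
    from colabfold import run_alphafold_multimer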

    sequences = {
        'chainA': protein_a_seq,
        'chainB': protein_b_seq
    }

    result = run_alphafold_multimer(sequences)

    return {
        'ipTM': result['iptm'],
        'pTM': result['ptm'],
        'pLDDT': result['plddt']
    }

# Example for antibody-antigen binding
antibody_seq = "EVQLVESGGGLVQPGG..."
antigen_seq = "MKVLWAALLGLLGAAA..."

stability = predict_interface_stability(antibody_seq, antigen_seq)
print(f"Interface ipTM: {stability['ipTM']:.3f}")

# Interpretation
if stability['ipTM'] > 0.7:
    print("High confidence interface")
elif stability['ipTM'] > 0.5:
    print("Moderate confidence interface")
else:
    print("Low confidence interface - may need redesign")
```

**Interpretation:**
- ipTM > 0.7: Strong predicted interface
- ipTM 0.5-0.7: Moderate interface confidence
- ipTM < 0.5: Weak interface, consider redesign

**When to use:**
- Antibody-antigen design
- Protein-protein interaction engineering
- Validating binding interfaces
- Comparing interface variants

### pSAE - Solvent-Accessible Hydrophobic Residues

**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.

**Method:** Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.

**Usage:**
```python
# Requires structure (PDB file or AlphaFold prediction)
# Install: uv pip install biopython

from Bio.PDB import PDBParser, DSSP

def calculate_psae(pdb_file):
    """
    Calculate percent Solvent-Accessible hydrophobic residues (pSAE)

    Lower pSAE = better solubility
    """

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure('protein', pdb_file)

    # Run DSSP to get solvent accessibility (requires the DSSP executable)
    model = structure[0]
    dssp = DSSP(model, pdb_file, acc_array='Wilke')

    # DSSP reports one-letter residue codes
    hydrophobic = set('AVILMFWP')

    total_sasa = 0.0
    hydrophobic_sasa = 0.0

    for key in dssp.keys():
        residue = dssp[key]
        res_name = residue[1]
        rel_accessibility = residue[3]

        # Relative accessibility can be 'NA' when DSSP cannot normalize it
        if not isinstance(rel_accessibility, (int, float)):
            continue

        # Relative accessibility is summed as a proxy for exposed surface area
        total_sasa += rel_accessibility
        if res_name in hydrophobic:
            hydrophobic_sasa += rel_accessibility

    if total_sasa == 0:
        raise ValueError("No solvent-accessible residues found")

    psae = (hydrophobic_sasa / total_sasa) * 100

    return psae

# Example
pdb_file = "protein_structure.pdb"
psae_score = calculate_psae(pdb_file)
print(f"pSAE: {psae_score:.2f}%")

# Interpretation
if psae_score < 25:
    print("Good solubility expected")
elif psae_score < 35:
    print("Moderate solubility")
else:
    print("High aggregation risk")
```

**Interpretation:**
- pSAE < 25%: Low aggregation risk
- pSAE 25-35%: Moderate risk
- pSAE > 35%: High aggregation risk

**When to use:**
- Analyzing designed structures
- Post-AlphaFold validation
- Identifying aggregation hotspots
- Guiding surface mutations

## Recommended Optimization Workflow

### Step 1: Initial Screening (Fast)

```python
def initial_screening(sequences):
    """
    Quick first-pass filtering using NetSolP
    Filters out obviously problematic sequences
    """
    passed = []
    for name, seq in sequences.items():
        # predict_solubility_netsolp returns the parsed JSON response
        netsolp_score = predict_solubility_netsolp(seq)['score']
        if netsolp_score > 0.5:
            passed.append((name, seq))

    return passed
```

### Step 2: Detailed Assessment (Moderate)

```python
import math

def detailed_assessment(filtered_sequences):
    """
    More thorough analysis with SoluProt and ESM
    Ranks sequences by multiple criteria
    """
    results = []
    for name, seq in filtered_sequences:
        soluprot_score = predict_solubility(seq)
        esm_score = score_sequence_esm(seq)

        # exp() maps the ESM mean log-probability (<= 0) to (0, 1] so both
        # terms of the weighted sum share a comparable scale
        combined_score = soluprot_score * 0.7 + math.exp(esm_score) * 0.3

        results.append({
            'name': name,
            'sequence': seq,
            'soluprot': soluprot_score,
            'esm': esm_score,
            'combined': combined_score
        })

    results.sort(key=lambda x: x['combined'], reverse=True)
    return results
```
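
A weighted sum is only meaningful when both scores live on comparable scales. One option, sketched here as an assumption rather than the method used above, is to min-max normalize each score across the batch before weighting. The function names and example values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant list maps to 0.5 throughout."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.5 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def combine_scores(solubility_scores, esm_scores, w_sol=0.7, w_esm=0.3):
    """Weighted sum of batch-normalized solubility and ESM scores."""
    sol = min_max_normalize(solubility_scores)
    esm = min_max_normalize(esm_scores)
    return [w_sol * s + w_esm * e for s, e in zip(sol, esm)]


# Hypothetical scores for three sequences (ESM values are log-probabilities)
solubility = [0.45, 0.80, 0.62]
esm = [-2.9, -1.1, -2.0]
for index, score in enumerate(combine_scores(solubility, esm), start=1):
    print(f"sequence_{index}: combined = {score:.3f}")
```

Batch normalization makes the ranking relative to the current library, so combined scores from different runs should not be compared directly.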

### Step 3: Sequence Optimization (If needed)

```python
import math

def optimize_problematic_sequences(sequences_needing_optimization):
    """
    Use SolubleMPNN to redesign problematic sequences
    Returns improved variants
    """
    optimized = []
    for name, seq in sequences_needing_optimization:
        # Generate multiple variants
        variants = optimize_sequence(
            sequence=seq,
            num_variants=10,
            temperature=0.2
        )

        # Score variants with ESM
        for variant in variants:
            variant['esm_score'] = score_sequence_esm(variant['sequence'])

        # Keep best variants; exp() maps the ESM mean log-probability (<= 0)
        # to (0, 1] so the product rewards both solubility and naturalness
        variants.sort(
            key=lambda x: x['solubility_score'] * math.exp(x['esm_score']),
            reverse=True
        )

        optimized.extend(variants[:3])  # Top 3 variants per sequence

    return optimized
```

### Step 4: Structure-Based Validation (For critical sequences)

```python
def structure_validation(top_candidates):
    """
    Predict structures and calculate pSAE for top candidates
    Final validation before experimental testing
    """
    validated = []
    for candidate in top_candidates:
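        # Predict structure with AlphaFold
        # (predict_structure_alphafold is a placeholder for your own wrapper
        # that returns the path to a predicted PDB file)
        structure_pdb = predict_structure_alphafold(candidate['sequence'])

        # Calculate pSAE
        psae = calculate_psae(structure_pdb)

        candidate['psae'] = psae
        # 30% is a pass/fail cutoff inside the 25-35% moderate-risk band
        candidate['pass_structure_check'] = psae < 30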

        validated.append(candidate)

    return validated
```

### Complete Workflow Example

```python
def complete_optimization_pipeline(initial_sequences):
    """
    End-to-end optimization pipeline

    Input: Dictionary of {name: sequence}
    Output: Ranked list of optimized, validated sequences
    """

    print("Step 1: Initial screening with NetSolP...")
    filtered = initial_screening(initial_sequences)
    print(f"  Passed: {len(filtered)}/{len(initial_sequences)}")

    print("Step 2: Detailed assessment with SoluProt and ESM...")
    assessed = detailed_assessment(filtered)

    # Split into good and needs-optimization
    good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
    needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]

    print(f"  Good sequences: {len(good_sequences)}")
    print(f"  Need optimization: {len(needs_optimization)}")

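    if needs_optimization:
        print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
        renamed = []
        for s in needs_optimization:
            # optimize_problematic_sequences expects (name, sequence) tuples
            variants = optimize_problematic_sequences([(s['name'], s['sequence'])])
            renamed += [(f"{s['name']}_opt{i}", v['sequence'])
                        for i, v in enumerate(variants, start=1)]
        # Re-score the redesigns so they carry the same keys as good_sequences
        optimized = detailed_assessment(renamed)
        all_sequences = good_sequences + optimized
        all_sequences.sort(key=lambda x: x['combined'], reverse=True)
    else:
        all_sequences = good_sequences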

    print("Step 4: Structure-based validation for top candidates...")
    top_20 = all_sequences[:20]
    final_validated = structure_validation(top_20)

    # Final ranking
    final_validated.sort(
        key=lambda x: (
            x['pass_structure_check'],
            x['combined'],
            -x['psae']
        ),
        reverse=True
    )

    return final_validated

# Usage
initial_library = {
    'variant_1': 'MKVLWAALLGLLGAAA...',
    'variant_2': 'MATGVLWAALLGLLGA...',
    # ... more sequences
}

optimized_library = complete_optimization_pipeline(initial_library)

# Submit top sequences to Adaptyv (at most 20 remain after Step 4)
top_sequences_for_testing = optimized_library[:50]
```

## Best Practices Summary

1. **Always pre-screen** before experimental testing
2. **Use NetSolP first** for fast filtering of large libraries
3. **Apply SolubleMPNN** as default optimization tool
4. **Validate with ESM** to avoid unnatural sequences
5. **Calculate pSAE** for structure-based validation
6. **Test multiple variants** per design to account for prediction uncertainty
7. **Keep controls** - include wild-type or known-good sequences
8. **Iterate** - use experimental results to refine predictions

## Integration with Adaptyv

After computational optimization, submit sequences to Adaptyv:

```python
# After optimization pipeline
optimized_sequences = complete_optimization_pipeline(initial_library)

# Prepare FASTA format
fasta_content = ""
for seq_data in optimized_sequences[:50]:  # Top 50
    fasta_content += f">{seq_data['name']}\n{seq_data['sequence']}\n"

# Submit to Adaptyv
# api_key must already hold your Adaptyv API token
import requests
response = requests.post(
    "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "sequences": fasta_content,
        "experiment_type": "expression",
        "metadata": {
            "optimization_method": "SolubleMPNN_ESM_pipeline",
            "computational_scores": [s['combined'] for s in optimized_sequences[:50]]
        }
    }
)
response.raise_for_status()  # Fail loudly on HTTP errors
```

## Troubleshooting

**Issue: All sequences score poorly on solubility predictions**
- Check if sequences contain unusual amino acids
- Verify FASTA format is correct
- Consider if protein family is naturally low-solubility
- May need experimental validation despite predictions
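
The first two checks can be scripted before any predictor is called. This is a minimal sketch: the function names are hypothetical, and only the 20 standard amino acids are treated as valid.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, raising on malformed input."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("Empty FASTA header")
            name = header.split()[0]
            if name in records:
                raise ValueError(f"Duplicate record name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("Sequence data found before the first header")
        else:
            records[name] += line
    return records


def find_nonstandard_residues(sequence):
    """Return (1-based position, character) for every non-standard residue."""
    return [
        (pos, aa)
        for pos, aa in enumerate(sequence.upper(), start=1)
        if aa not in STANDARD_AMINO_ACIDS
    ]


# Illustrative input containing one ambiguous residue (X)
fasta_text = ">v1 test construct\nMKV\nLWX\n>v2\nMATG\n"
for name, seq in parse_fasta(fasta_text).items():
    print(name, seq, find_nonstandard_residues(seq))
```

Sequences that report non-standard residues are worth fixing or excluding first, since predictors may score them poorly or reject them outright.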

**Issue: SolubleMPNN changes functionally important residues**
- Provide structure file to preserve spatial constraints
- Mask critical residues from mutation
- Lower temperature parameter for conservative changes
- Manually revert problematic mutations
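
Manual reversion is a small string operation once the critical positions are known. The sketch below assumes substitutions only (equal-length sequences) and 1-based positions; the helper name and example sequences are illustrative.

```python
def revert_protected_positions(original, variant, protected):
    """Restore the original residue at each protected 1-based position."""
    if len(original) != len(variant):
        raise ValueError("Sequences must have equal length")
    residues = list(variant)
    for pos in protected:
        residues[pos - 1] = original[pos - 1]
    return "".join(residues)


# Illustrative sequences; position 5 is assumed to be functionally critical
original_seq = "MKVLWAALLG"
redesigned_seq = "MKVLSAAELG"
repaired_seq = revert_protected_positions(original_seq, redesigned_seq, [5])
print(repaired_seq)
```

Reverted variants should be re-scored, because restoring a residue can change the predicted solubility of the redesign.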

**Issue: ESM scores are low after optimization**
- Optimization may be too aggressive
- Try lower temperature in SolubleMPNN
- Balance between solubility and naturalness
- Consider that some optimization may require non-natural mutations

**Issue: Predictions don't match experimental results**
- Predictions are probabilistic, not deterministic
- Host system and conditions affect expression
- Some proteins may need experimental validation
- Use predictions as enrichment, not absolute filters