Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/adaptyv/reference/protein_optimization.md
+++ b/skills/adaptyv/reference/protein_optimization.md
@@ -0,0 +1,637 @@
+# Protein Sequence Optimization
+
+## Overview
+
+Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.
+
+## Common Protein Expression Problems
+
+### 1. Unpaired Cysteines
+
+**Problem:**
+- Unpaired cysteines can form unwanted intermolecular or non-native disulfide bonds
+- Leads to aggregation and misfolding
+- Reduces expression yield and stability
+
+**Solution:**
+- Remove unpaired cysteines unless functionally necessary
+- Pair cysteines appropriately for structural disulfides
+- Replace with serine or alanine in non-critical positions
+
+**Example:**
+```python
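+# Flag sequences whose cysteines cannot all be paired
+def check_cysteines(sequence):
+    """Count cysteines and warn when the count is odd."""
+    cys_count = sequence.count('C')
+    if cys_count % 2 != 0:
+        # An odd count guarantees at least one free cysteine; an even count
+        # does not prove correct pairing, which depends on the 3D structure
+        print(f"Warning: Odd number of cysteines ({cys_count})")
+    return cys_count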
+```
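+
+The replacement strategy above can be sketched as follows. This is a minimal illustration rather than part of any library: `replace_unpaired_cysteines` and its `keep_positions` argument are hypothetical names, and the positions of structurally required cysteines are assumed to be known in advance.
+
+```python
+def replace_unpaired_cysteines(sequence, keep_positions=(), replacement="S"):
+    """Replace every cysteine except those at keep_positions (0-based)."""
+    keep = set(keep_positions)
+    residues = [
+        replacement if aa == "C" and i not in keep else aa
+        for i, aa in enumerate(sequence)
+    ]
+    return "".join(residues)
+
+# Hypothetical example: cysteines at positions 2 and 8 form a disulfide,
+# so only the cysteine at position 6 is replaced with serine
+print(replace_unpaired_cysteines("MKCAAGCLCK", keep_positions=(2, 8)))
+```
+
+Passing `replacement="A"` gives the alanine substitution instead; which of the two is safer depends on the local environment of the residue.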
+
+### 2. Excessive Hydrophobicity
+
+**Problem:**
+- Long hydrophobic patches promote aggregation
+- Exposed hydrophobic residues drive protein clumping
+- Poor solubility in aqueous buffers
+
+**Solution:**
+- Maintain balanced hydropathy profiles
+- Use short, flexible linkers between domains
+- Reduce surface-exposed hydrophobic residues
+
+**Metrics:**
+- Kyte-Doolittle hydropathy plots
+- GRAVY score (Grand Average of Hydropathy)
+- pSAE (percent Solvent-Accessible hydrophobic residues)
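+
+The first two metrics need only the sequence. The sketch below computes a GRAVY score and a sliding-window hydropathy scan from the published Kyte-Doolittle scale. The function names, the window length of 9, and the threshold of 1.6 are illustrative assumptions, not recommended settings.
+
+```python
+# Kyte-Doolittle hydropathy values for the 20 standard amino acids
+KD = {
+    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
+    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
+    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
+    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
+}
+
+def gravy(sequence):
+    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
+    return sum(KD[aa] for aa in sequence) / len(sequence)
+
+def hydrophobic_windows(sequence, window=9, threshold=1.6):
+    """Return (start, mean_hydropathy) for windows whose mean exceeds threshold."""
+    hits = []
+    for start in range(len(sequence) - window + 1):
+        segment = sequence[start:start + window]
+        mean = sum(KD[aa] for aa in segment) / window
+        if mean > threshold:
+            hits.append((start, round(mean, 2)))
+    return hits
+
+# Hypothetical sequence with a leucine-rich stretch at the N-terminus
+print(gravy("LLLLLLLLLKKKK"))
+print(hydrophobic_windows("LLLLLLLLLKKKK"))
+```
+
+A positive GRAVY value indicates an overall hydrophobic sequence, while the window scan points to where a hydrophobic patch sits.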
+
+### 3. Low Solubility
+
+**Problem:**
+- Proteins precipitate during expression or purification
+- Inclusion body formation
+- Difficult downstream processing
+
+**Solution:**
+- Use solubility prediction tools for pre-screening
+- Apply sequence optimization algorithms
+- Add solubilizing tags if needed
+
+## Computational Tools for Optimization
+
+### NetSolP - Initial Solubility Screening
+
+**Purpose:** Fast solubility prediction for filtering sequences.
+
+**Method:** Machine learning model trained on E. coli expression data.
+
+**Usage:**
+```python
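+# Install: uv pip install requests
+import requests
+
+def predict_solubility_netsolp(sequence):
+    """Predict protein solubility using NetSolP web service"""
+    # Endpoint path and response fields are as given in this guide;
+    # confirm both against the current NetSolP service documentation
+    url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"
+
+    data = {
+        "sequence": sequence,
+        "format": "fasta"
+    }
+
+    response = requests.post(url, data=data, timeout=60)
+    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
+    return response.json()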
+
+# Example
+sequence = "MKVLWAALLGLLGAAA..."
+result = predict_solubility_netsolp(sequence)
+print(f"Solubility score: {result['score']}")
+```
+
+**Interpretation:**
+- Score > 0.5: Likely soluble
+- Score < 0.5: Likely insoluble
+- Use for initial filtering before more expensive predictions
+
+**When to use:**
+- First-pass filtering of large libraries
+- Quick validation of designed sequences
+- Prioritizing sequences for experimental testing
+
+### SoluProt - Comprehensive Solubility Prediction
+
+**Purpose:** Advanced solubility prediction with higher accuracy.
+
+**Method:** Gradient boosting model trained on sequence-derived features for soluble expression in E. coli.
+
+**Usage:**
+```python
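+# Install: uv pip install soluprot
+# Assumes a local `soluprot` module exposing predict_solubility(seq) -> float;
+# SoluProt is primarily distributed as a web server and standalone tool
+from soluprot import predict_solubility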
+
+def screen_variants_soluprot(sequences):
+    """Screen multiple sequences for solubility"""
+    results = []
+    for name, seq in sequences.items():
+        score = predict_solubility(seq)
+        results.append({
+            'name': name,
+            'sequence': seq,
+            'solubility_score': score,
+            'predicted_soluble': score > 0.6
+        })
+    return results
+
+# Example
+sequences = {
+    'variant_1': 'MKVLW...',
+    'variant_2': 'MATGV...'
+}
+
+results = screen_variants_soluprot(sequences)
+soluble_variants = [r for r in results if r['predicted_soluble']]
+```
+
+**Interpretation:**
+- Score > 0.6: High solubility confidence
+- Score 0.4-0.6: Uncertain, may need optimization
+- Score < 0.4: Likely problematic
+
+**When to use:**
+- After initial NetSolP filtering
+- When higher prediction accuracy is needed
+- Before committing to expensive synthesis/testing
+
+### SolubleMPNN - Sequence Redesign
+
+**Purpose:** Redesign protein sequences to improve solubility while maintaining function.
+
+**Method:** Graph neural network (ProteinMPNN trained on soluble proteins) that proposes sequences predicted to be more soluble for a given backbone.
+
+**Usage:**
+```python
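+# Install: uv pip install soluble-mpnn
+# Assumes a local wrapper module `soluble_mpnn` exposing optimize_sequence();
+# SolubleMPNN itself is ProteinMPNN run with its soluble-protein weights
+from soluble_mpnn import optimize_sequence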
+
+def optimize_for_solubility(sequence, structure_pdb=None):
+    """
+    Redesign sequence for improved solubility
+
+    Args:
+        sequence: Original amino acid sequence
+        structure_pdb: Optional PDB file for structure-aware design
+
+    Returns:
+        Optimized sequence variants ranked by predicted solubility
+    """
+
+    variants = optimize_sequence(
+        sequence=sequence,
+        structure=structure_pdb,
+        num_variants=10,
+        temperature=0.1  # Lower = more conservative mutations
+    )
+
+    return variants
+
+# Example
+original_seq = "MKVLWAALLGLLGAAA..."
+optimized_variants = optimize_for_solubility(original_seq)
+
+for i, variant in enumerate(optimized_variants):
+    print(f"Variant {i+1}:")
+    print(f"  Sequence: {variant['sequence']}")
+    print(f"  Solubility score: {variant['solubility_score']}")
+    print(f"  Mutations: {variant['mutations']}")
+```
+
+**Design strategy:**
+- **Conservative** (temperature=0.1): Minimal changes, safer
+- **Moderate** (temperature=0.3): Balance between change and safety
+- **Aggressive** (temperature=0.5): More mutations, higher risk
+
+**When to use:**
+- Primary tool for sequence optimization
+- Default starting point for improving problematic sequences
+- Generating diverse soluble variants
+
+**Best practices:**
+- Generate 10-50 variants per sequence
+- Use structure information when available (improves accuracy)
+- Validate key functional residues are preserved
+- Test multiple temperature settings
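+
+Checking that key functional residues are preserved can be automated with a position-by-position comparison. The sketch below assumes each variant has the same length as the original (substitutions only); `list_mutations` and `violates_protected` are hypothetical helper names, and positions are 1-based.
+
+```python
+def list_mutations(original, variant):
+    """Return substitutions as (position, original_aa, variant_aa), 1-based."""
+    if len(original) != len(variant):
+        raise ValueError("Sequences must have equal length to compare by position")
+    return [
+        (i + 1, a, b)
+        for i, (a, b) in enumerate(zip(original, variant))
+        if a != b
+    ]
+
+def violates_protected(original, variant, protected_positions):
+    """Return the mutations that fall on protected (1-based) positions."""
+    protected = set(protected_positions)
+    return [m for m in list_mutations(original, variant) if m[0] in protected]
+
+# Hypothetical example: position 3 is a functional residue
+print(violates_protected("MKVLW", "MKALW", protected_positions=[3]))
+```
+
+Variants for which `violates_protected` returns a non-empty list can be dropped or edited by hand before further scoring.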
+
+### ESM (Evolutionary Scale Modeling) - Sequence Likelihood
+
+**Purpose:** Assess how "natural" a protein sequence appears based on evolutionary patterns.
+
+**Method:** Protein language model trained on millions of natural sequences.
+
+**Usage:**
+```python
+# Install: uv pip install fair-esm
+import torch
+from esm import pretrained
+
+# Load once; reloading the 650M-parameter model on every call is slow
+_model, _alphabet = pretrained.esm2_t33_650M_UR50D()
+_model.eval()
+_batch_converter = _alphabet.get_batch_converter()
+
+def score_sequence_esm(sequence):
+    """
+    Mean per-residue log-probability of the sequence under ESM-2
+    (single unmasked forward pass). Scores are <= 0; values closer
+    to 0 indicate a more natural-looking sequence
+    """
+    _, _, batch_tokens = _batch_converter([("protein", sequence)])
+
+    with torch.no_grad():
+        logits = _model(batch_tokens)["logits"]
+    log_probs = logits.log_softmax(dim=-1)
+
+    # Log-probability the model assigns to the observed token at each position
+    token_logprobs = log_probs.gather(-1, batch_tokens.unsqueeze(-1)).squeeze(-1)
+
+    # Drop the BOS and EOS tokens added by the batch converter
+    return token_logprobs[0, 1:-1].mean().item()
+
+# Example - Compare variants
+sequences = {
+    'original': 'MKVLW...',
+    'optimized_1': 'MKVLS...',
+    'optimized_2': 'MKVLA...'
+}
+
+for name, seq in sequences.items():
+    score = score_sequence_esm(seq)
+    print(f"{name}: ESM score = {score:.3f}")
+```
+
+**Interpretation:**
+- Scores are mean log-probabilities (≤ 0); values closer to 0 → more "natural" sequence
+- Use to avoid unlikely mutations
+- Balance with functional requirements
+
+**When to use:**
+- Filtering synthetic designs
+- Comparing SolubleMPNN variants
+- Ensuring sequences aren't too artificial
+- Avoiding expression bottlenecks
+
+**Integration with design:**
+```python
+import math
+
+def rank_variants_by_esm(variants):
+    """Rank protein variants by ESM likelihood"""
+    scored = []
+    for v in variants:
+        esm_score = score_sequence_esm(v['sequence'])
+        v['esm_score'] = esm_score
+        scored.append(v)
+
+    # Sort by combined solubility and ESM score. ESM scores are
+    # log-probabilities (negative values), so exp() is applied first;
+    # multiplying by the raw value would rank the most soluble variants last
+    scored.sort(
+        key=lambda x: x['solubility_score'] * math.exp(x['esm_score']),
+        reverse=True
+    )
+
+    return scored
+```
+
+### ipTM - Interface Stability (AlphaFold-Multimer)
+
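+**Purpose:** Assess how confidently a protein-protein interface is predicted. ipTM reflects confidence in the predicted relative placement of the chains; it is not a direct measure of binding affinity.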
+
+**Method:** Interface predicted TM-score from AlphaFold-Multimer predictions.
+
+**Usage:**
+```python
+# Requires AlphaFold-Multimer installation
+# Or use ColabFold for easier access
+
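+def predict_interface_stability(protein_a_seq, protein_b_seq):
+    """
+    Predict interface confidence using AlphaFold-Multimer
+
+    Returns ipTM score: higher = more confident interface prediction
+    """
+    # Assumes a wrapper named run_alphafold_multimer is available; ColabFold is
+    # normally driven through its colabfold_batch command-line tool
+    from colabfold import run_alphafold_multimer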
+
+    sequences = {
+        'chainA': protein_a_seq,
+        'chainB': protein_b_seq
+    }
+
+    result = run_alphafold_multimer(sequences)
+
+    return {
+        'ipTM': result['iptm'],
+        'pTM': result['ptm'],
+        'pLDDT': result['plddt']
+    }
+
+# Example for antibody-antigen binding
+antibody_seq = "EVQLVESGGGLVQPGG..."
+antigen_seq = "MKVLWAALLGLLGAAA..."
+
+stability = predict_interface_stability(antibody_seq, antigen_seq)
+print(f"Interface ipTM: {stability['ipTM']:.3f}")
+
+# Interpretation
+if stability['ipTM'] > 0.7:
+    print("High confidence interface")
+elif stability['ipTM'] > 0.5:
+    print("Moderate confidence interface")
+else:
+    print("Low confidence interface - may need redesign")
+```
+
+**Interpretation:**
+- ipTM > 0.7: Strong predicted interface
+- ipTM 0.5-0.7: Moderate interface confidence
+- ipTM < 0.5: Weak interface, consider redesign
+
+**When to use:**
+- Antibody-antigen design
+- Protein-protein interaction engineering
+- Validating binding interfaces
+- Comparing interface variants
+
+### pSAE - Solvent-Accessible Hydrophobic Residues
+
+**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.
+
+**Method:** Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.
+
+**Usage:**
+```python
+# Requires a structure (PDB file or AlphaFold prediction) and the DSSP
+# executable (mkdssp) on PATH
+# Install: uv pip install biopython
+
+from Bio.PDB import PDBParser, DSSP
+from Bio.PDB.DSSP import residue_max_acc
+from Bio.SeqUtils import seq3
+
+def calculate_psae(pdb_file):
+    """
+    Calculate percent Solvent-Accessible hydrophobic residues (pSAE)
+
+    Lower pSAE = better solubility
+    """
+
+    parser = PDBParser(QUIET=True)
+    structure = parser.get_structure('protein', pdb_file)
+
+    # Run DSSP to get solvent accessibility
+    model = structure[0]
+    dssp = DSSP(model, pdb_file, acc_array='Wilke')
+
+    # DSSP reports one-letter residue codes
+    hydrophobic = set('AVILMFWP')
+    max_acc = residue_max_acc['Wilke']
+
+    total_sasa = 0.0
+    hydrophobic_sasa = 0.0
+
+    for key in dssp.keys():
+        aa = dssp[key][1]
+        rel_accessibility = dssp[key][3]
+        three_letter = seq3(aa).upper()
+        if rel_accessibility == 'NA' or three_letter not in max_acc:
+            continue  # skip residues DSSP could not score
+
+        # Convert relative accessibility back to absolute SASA (Å^2)
+        sasa = rel_accessibility * max_acc[three_letter]
+        total_sasa += sasa
+        if aa in hydrophobic:
+            hydrophobic_sasa += sasa
+
+    if total_sasa == 0:
+        return 0.0
+    return (hydrophobic_sasa / total_sasa) * 100
+
+# Example
+pdb_file = "protein_structure.pdb"
+psae_score = calculate_psae(pdb_file)
+print(f"pSAE: {psae_score:.2f}%")
+
+# Interpretation
+if psae_score < 25:
+    print("Good solubility expected")
+elif psae_score < 35:
+    print("Moderate solubility")
+else:
+    print("High aggregation risk")
+```
+
+**Interpretation:**
+- pSAE < 25%: Low aggregation risk
+- pSAE 25-35%: Moderate risk
+- pSAE > 35%: High aggregation risk
+
+**When to use:**
+- Analyzing designed structures
+- Post-AlphaFold validation
+- Identifying aggregation hotspots
+- Guiding surface mutations
+
+## Recommended Optimization Workflow
+
+### Step 1: Initial Screening (Fast)
+
+```python
+def initial_screening(sequences):
+    """
+    Quick first-pass filtering using NetSolP
+    Filters out obviously problematic sequences
+    """
+    passed = []
+    for name, seq in sequences.items():
+        result = predict_solubility_netsolp(seq)
+        # The helper returns the parsed JSON response, not a bare number
+        if result['score'] > 0.5:
+            passed.append((name, seq))
+
+    return passed
+```
+
+### Step 2: Detailed Assessment (Moderate)
+
+```python
+import math
+
+def detailed_assessment(filtered_sequences):
+    """
+    More thorough analysis with SoluProt and ESM
+    Ranks sequences by multiple criteria
+    """
+    results = []
+    for name, seq in filtered_sequences:
+        soluprot_score = predict_solubility(seq)
+        esm_score = score_sequence_esm(seq)
+
+        # exp() maps the ESM mean log-probability onto (0, 1] so that both
+        # terms of the weighted sum share a scale
+        combined_score = soluprot_score * 0.7 + math.exp(esm_score) * 0.3
+
+        results.append({
+            'name': name,
+            'sequence': seq,
+            'soluprot': soluprot_score,
+            'esm': esm_score,
+            'combined': combined_score
+        })
+
+    results.sort(key=lambda x: x['combined'], reverse=True)
+    return results
+```
+
+### Step 3: Sequence Optimization (If needed)
+
+```python
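+import math
+
+def optimize_problematic_sequences(sequences_needing_optimization):
+    """
+    Use SolubleMPNN to redesign problematic sequences
+    Takes the dicts produced by detailed_assessment and returns improved variants
+    """
+    optimized = []
+    for entry in sequences_needing_optimization:
+        # Generate multiple variants
+        variants = optimize_sequence(
+            sequence=entry['sequence'],
+            num_variants=10,
+            temperature=0.2
+        )
+
+        # Score variants with ESM and add the keys that later steps rely on
+        for i, variant in enumerate(variants, start=1):
+            variant['name'] = f"{entry['name']}_opt{i}"
+            variant['esm_score'] = score_sequence_esm(variant['sequence'])
+            # Same 0.7/0.3 weighting as detailed_assessment; exp() maps the
+            # mean log-probability onto (0, 1]
+            variant['combined'] = (
+                variant['solubility_score'] * 0.7
+                + math.exp(variant['esm_score']) * 0.3
+            )
+
+        # Keep best variants
+        variants.sort(key=lambda x: x['combined'], reverse=True)
+
+        optimized.extend(variants[:3])  # Top 3 variants per sequence
+
+    return optimized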
+```
+
+### Step 4: Structure-Based Validation (For critical sequences)
+
+```python
+def structure_validation(top_candidates):
+    """
+    Predict structures and calculate pSAE for top candidates
+    Final validation before experimental testing
+    """
+    validated = []
+    for candidate in top_candidates:
+        # Predict structure with AlphaFold
+        structure_pdb = predict_structure_alphafold(candidate['sequence'])
+
+        # Calculate pSAE
+        psae = calculate_psae(structure_pdb)
+
+        candidate['psae'] = psae
+        candidate['pass_structure_check'] = psae < 30
+
+        validated.append(candidate)
+
+    return validated
+```
+
+### Complete Workflow Example
+
+```python
+def complete_optimization_pipeline(initial_sequences):
+    """
+    End-to-end optimization pipeline
+
+    Input: Dictionary of {name: sequence}
+    Output: Ranked list of optimized, validated sequences
+    """
+
+    print("Step 1: Initial screening with NetSolP...")
+    filtered = initial_screening(initial_sequences)
+    print(f"  Passed: {len(filtered)}/{len(initial_sequences)}")
+
+    print("Step 2: Detailed assessment with SoluProt and ESM...")
+    assessed = detailed_assessment(filtered)
+
+    # Split into good and needs-optimization
+    good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
+    needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]
+
+    print(f"  Good sequences: {len(good_sequences)}")
+    print(f"  Need optimization: {len(needs_optimization)}")
+
+    if needs_optimization:
+        print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
+        optimized = optimize_problematic_sequences(needs_optimization)
+        all_sequences = good_sequences + optimized
+    else:
+        all_sequences = good_sequences
+
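+    print("Step 4: Structure-based validation for top candidates...")
+    # Rank the merged list first so the slice really takes the best candidates
+    all_sequences.sort(key=lambda x: x.get('combined', float('-inf')), reverse=True)
+    top_20 = all_sequences[:20]
+    final_validated = structure_validation(top_20)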
+
+    # Final ranking
+    final_validated.sort(
+        key=lambda x: (
+            x['pass_structure_check'],
+            x['combined'],
+            -x['psae']
+        ),
+        reverse=True
+    )
+
+    return final_validated
+
+# Usage
+initial_library = {
+    'variant_1': 'MKVLWAALLGLLGAAA...',
+    'variant_2': 'MATGVLWAALLGLLGA...',
+    # ... more sequences
+}
+
+optimized_library = complete_optimization_pipeline(initial_library)
+
+# Submit top sequences to Adaptyv (the pipeline validates at most 20 candidates)
+top_sequences_for_testing = optimized_library[:50]
+```
+
+## Best Practices Summary
+
+1. **Always pre-screen** before experimental testing
+2. **Use NetSolP first** for fast filtering of large libraries
+3. **Apply SolubleMPNN** as default optimization tool
+4. **Validate with ESM** to avoid unnatural sequences
+5. **Calculate pSAE** for structure-based validation
+6. **Test multiple variants** per design to account for prediction uncertainty
+7. **Keep controls** - include wild-type or known-good sequences
+8. **Iterate** - use experimental results to refine predictions
+
+## Integration with Adaptyv
+
+After computational optimization, submit sequences to Adaptyv:
+
+```python
+import os
+import requests
+
+# After optimization pipeline
+optimized_sequences = complete_optimization_pipeline(initial_library)
+top_sequences = optimized_sequences[:50]  # Top 50
+
+# Prepare FASTA format
+fasta_content = "".join(
+    f">{s['name']}\n{s['sequence']}\n" for s in top_sequences
+)
+
+# API key read from the environment; the variable name is an assumption
+api_key = os.environ["ADAPTYV_API_KEY"]
+
+# Submit to Adaptyv
+response = requests.post(
+    "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
+    headers={"Authorization": f"Bearer {api_key}"},
+    json={
+        "sequences": fasta_content,
+        "experiment_type": "expression",
+        "metadata": {
+            "optimization_method": "SolubleMPNN_ESM_pipeline",
+            "computational_scores": [s.get('combined') for s in top_sequences]
+        }
+    },
+    timeout=60,
+)
+response.raise_for_status()  # fail loudly on a rejected submission
+```
+
+## Troubleshooting
+
+**Issue: All sequences score poorly on solubility predictions**
+- Check if sequences contain unusual amino acids
+- Verify FASTA format is correct
+- Consider if protein family is naturally low-solubility
+- May need experimental validation despite predictions
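+
+The first two checks can be scripted before any predictor is called. The sketch below is a minimal pure-Python FASTA reader plus a scan for characters outside the 20 standard amino acids; `parse_fasta` and `find_nonstandard` are hypothetical helper names, and the parser assumes each record name is the first word of its header line.
+
+```python
+STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
+
+def parse_fasta(text):
+    """Parse FASTA text into {name: sequence}; raise on malformed input."""
+    records = {}
+    name = None
+    for line in text.splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        if line.startswith(">"):
+            header = line[1:].split()
+            if not header:
+                raise ValueError("Empty FASTA header")
+            name = header[0]
+            records[name] = ""
+        elif name is None:
+            raise ValueError("FASTA must start with a '>' header line")
+        else:
+            records[name] += line.upper()
+    return records
+
+def find_nonstandard(records):
+    """Return {name: sorted non-standard characters} for offending records."""
+    problems = {}
+    for name, seq in records.items():
+        unusual = sorted(set(seq) - STANDARD_AA)
+        if unusual:
+            problems[name] = unusual
+    return problems
+
+# Hypothetical input: the second record contains X and U
+records = parse_fasta(">a\nMKV\nLW\n>b example\nMKXU\n")
+print(find_nonstandard(records))
+```
+
+Records flagged here (for example ones containing `X`, `U`, `*`, or gap characters) are worth fixing before interpreting a poor solubility score.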
+
+**Issue: SolubleMPNN changes functionally important residues**
+- Provide structure file to preserve spatial constraints
+- Mask critical residues from mutation
+- Lower temperature parameter for conservative changes
+- Manually revert problematic mutations
+
+**Issue: ESM scores are low after optimization**
+- Optimization may be too aggressive
+- Try lower temperature in SolubleMPNN
+- Balance between solubility and naturalness
+- Consider that some optimization may require non-natural mutations
+
+**Issue: Predictions don't match experimental results**
+- Predictions are probabilistic, not deterministic
+- Host system and conditions affect expression
+- Some proteins may need experimental validation
+- Use predictions as enrichment, not absolute filters