zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Files

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

17 KiB

Raw Permalink Blame History

Protein Sequence Optimization

Overview

Before submitting protein sequences for experimental testing, use computational tools to optimize sequences for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.

Common Protein Expression Problems

1. Unpaired Cysteines

Problem:

Unpaired cysteines form unwanted disulfide bonds
Leads to aggregation and misfolding
Reduces expression yield and stability

Solution:

Remove unpaired cysteines unless functionally necessary
Pair cysteines appropriately for structural disulfides
Replace with serine or alanine in non-critical positions

Example:

# Check for cysteine pairs
from Bio.Seq import Seq

def check_cysteines(sequence):
    cys_count = sequence.count('C')
    if cys_count % 2 != 0:
        print(f"Warning: Odd number of cysteines ({cys_count})")
    return cys_count

2. Excessive Hydrophobicity

Problem:

Long hydrophobic patches promote aggregation
Exposed hydrophobic residues drive protein clumping
Poor solubility in aqueous buffers

Solution:

Maintain balanced hydropathy profiles
Use short, flexible linkers between domains
Reduce surface-exposed hydrophobic residues

Metrics:

Kyte-Doolittle hydropathy plots
GRAVY score (Grand Average of Hydropathy)
pSAE (percent Solvent-Accessible hydrophobic residues)

3. Low Solubility

Problem:

Proteins precipitate during expression or purification
Inclusion body formation
Difficult downstream processing

Solution:

Use solubility prediction tools for pre-screening
Apply sequence optimization algorithms
Add solubilizing tags if needed

Computational Tools for Optimization

NetSolP - Initial Solubility Screening

Purpose: Fast solubility prediction for filtering sequences.

Method: Machine learning model trained on E. coli expression data.

Usage:

# Install: uv pip install requests
import requests

def predict_solubility_netsolp(sequence):
    """Predict protein solubility using NetSolP web service"""
    url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"

    data = {
        "sequence": sequence,
        "format": "fasta"
    }

    response = requests.post(url, data=data)
    return response.json()

# Example
sequence = "MKVLWAALLGLLGAAA..."
result = predict_solubility_netsolp(sequence)
print(f"Solubility score: {result['score']}")

Interpretation:

Score > 0.5: Likely soluble
Score < 0.5: Likely insoluble
Use for initial filtering before more expensive predictions

When to use:

First-pass filtering of large libraries
Quick validation of designed sequences
Prioritizing sequences for experimental testing

SoluProt - Comprehensive Solubility Prediction

Purpose: Advanced solubility prediction with higher accuracy.

Method: Deep learning model incorporating sequence and structural features.

Usage:

# Install: uv pip install soluprot
from soluprot import predict_solubility

def screen_variants_soluprot(sequences):
    """Screen multiple sequences for solubility"""
    results = []
    for name, seq in sequences.items():
        score = predict_solubility(seq)
        results.append({
            'name': name,
            'sequence': seq,
            'solubility_score': score,
            'predicted_soluble': score > 0.6
        })
    return results

# Example
sequences = {
    'variant_1': 'MKVLW...',
    'variant_2': 'MATGV...'
}

results = screen_variants_soluprot(sequences)
soluble_variants = [r for r in results if r['predicted_soluble']]

Interpretation:

Score > 0.6: High solubility confidence
Score 0.4-0.6: Uncertain, may need optimization
Score < 0.4: Likely problematic

When to use:

After initial NetSolP filtering
When higher prediction accuracy is needed
Before committing to expensive synthesis/testing

SolubleMPNN - Sequence Redesign

Purpose: Redesign protein sequences to improve solubility while maintaining function.

Method: Graph neural network that suggests mutations to increase solubility.

Usage:

# Install: uv pip install soluble-mpnn
from soluble_mpnn import optimize_sequence

def optimize_for_solubility(sequence, structure_pdb=None):
    """
    Redesign sequence for improved solubility

    Args:
        sequence: Original amino acid sequence
        structure_pdb: Optional PDB file for structure-aware design

    Returns:
        Optimized sequence variants ranked by predicted solubility
    """

    variants = optimize_sequence(
        sequence=sequence,
        structure=structure_pdb,
        num_variants=10,
        temperature=0.1  # Lower = more conservative mutations
    )

    return variants

# Example
original_seq = "MKVLWAALLGLLGAAA..."
optimized_variants = optimize_for_solubility(original_seq)

for i, variant in enumerate(optimized_variants):
    print(f"Variant {i+1}:")
    print(f"  Sequence: {variant['sequence']}")
    print(f"  Solubility score: {variant['solubility_score']}")
    print(f"  Mutations: {variant['mutations']}")

Design strategy:

Conservative (temperature=0.1): Minimal changes, safer
Moderate (temperature=0.3): Balance between change and safety
Aggressive (temperature=0.5): More mutations, higher risk

When to use:

Primary tool for sequence optimization
Default starting point for improving problematic sequences
Generating diverse soluble variants

Best practices:

Generate 10-50 variants per sequence
Use structure information when available (improves accuracy)
Validate key functional residues are preserved
Test multiple temperature settings

ESM (Evolutionary Scale Modeling) - Sequence Likelihood

Purpose: Assess how "natural" a protein sequence appears based on evolutionary patterns.

Method: Protein language model trained on millions of natural sequences.

Usage:

# Install: uv pip install fair-esm
import torch
from esm import pretrained

def score_sequence_esm(sequence):
    """
    Calculate ESM likelihood score for sequence
    Higher scores indicate more natural/stable sequences
    """

    model, alphabet = pretrained.esm2_t33_650M_UR50D()
    batch_converter = alphabet.get_batch_converter()

    data = [("protein", sequence)]
    _, _, batch_tokens = batch_converter(data)

    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[33])
        token_logprobs = results["logits"].log_softmax(dim=-1)

    # Calculate perplexity as sequence quality metric
    sequence_score = token_logprobs.mean().item()

    return sequence_score

# Example - Compare variants
sequences = {
    'original': 'MKVLW...',
    'optimized_1': 'MKVLS...',
    'optimized_2': 'MKVLA...'
}

for name, seq in sequences.items():
    score = score_sequence_esm(seq)
    print(f"{name}: ESM score = {score:.3f}")

Interpretation:

Higher scores → More "natural" sequence
Use to avoid unlikely mutations
Balance with functional requirements

When to use:

Filtering synthetic designs
Comparing SolubleMPNN variants
Ensuring sequences aren't too artificial
Avoiding expression bottlenecks

Integration with design:

def rank_variants_by_esm(variants):
    """Rank protein variants by ESM likelihood"""
    scored = []
    for v in variants:
        esm_score = score_sequence_esm(v['sequence'])
        v['esm_score'] = esm_score
        scored.append(v)

    # Sort by combined solubility and ESM score
    scored.sort(
        key=lambda x: x['solubility_score'] * x['esm_score'],
        reverse=True
    )

    return scored

ipTM - Interface Stability (AlphaFold-Multimer)

Purpose: Assess protein-protein interface stability and binding confidence.

Method: Interface predicted TM-score from AlphaFold-Multimer predictions.

Usage:

# Requires AlphaFold-Multimer installation
# Or use ColabFold for easier access

def predict_interface_stability(protein_a_seq, protein_b_seq):
    """
    Predict interface stability using AlphaFold-Multimer

    Returns ipTM score: higher = more stable interface
    """
    from colabfold import run_alphafold_multimer

    sequences = {
        'chainA': protein_a_seq,
        'chainB': protein_b_seq
    }

    result = run_alphafold_multimer(sequences)

    return {
        'ipTM': result['iptm'],
        'pTM': result['ptm'],
        'pLDDT': result['plddt']
    }

# Example for antibody-antigen binding
antibody_seq = "EVQLVESGGGLVQPGG..."
antigen_seq = "MKVLWAALLGLLGAAA..."

stability = predict_interface_stability(antibody_seq, antigen_seq)
print(f"Interface pTM: {stability['ipTM']:.3f}")

# Interpretation
if stability['ipTM'] > 0.7:
    print("High confidence interface")
elif stability['ipTM'] > 0.5:
    print("Moderate confidence interface")
else:
    print("Low confidence interface - may need redesign")

Interpretation:

ipTM > 0.7: Strong predicted interface
ipTM 0.5-0.7: Moderate interface confidence
ipTM < 0.5: Weak interface, consider redesign

When to use:

Antibody-antigen design
Protein-protein interaction engineering
Validating binding interfaces
Comparing interface variants

pSAE - Solvent-Accessible Hydrophobic Residues

Purpose: Quantify exposed hydrophobic residues that promote aggregation.

Method: Calculates percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.

Usage:

# Requires structure (PDB file or AlphaFold prediction)
# Install: uv pip install biopython

from Bio.PDB import PDBParser, DSSP
import numpy as np

def calculate_psae(pdb_file):
    """
    Calculate percent Solvent-Accessible hydrophobic residues (pSAE)

    Lower pSAE = better solubility
    """

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure('protein', pdb_file)

    # Run DSSP to get solvent accessibility
    model = structure[0]
    dssp = DSSP(model, pdb_file, acc_array='Wilke')

    hydrophobic = ['ALA', 'VAL', 'ILE', 'LEU', 'MET', 'PHE', 'TRP', 'PRO']

    total_sasa = 0
    hydrophobic_sasa = 0

    for residue in dssp:
        res_name = residue[1]
        rel_accessibility = residue[3]

        total_sasa += rel_accessibility
        if res_name in hydrophobic:
            hydrophobic_sasa += rel_accessibility

    psae = (hydrophobic_sasa / total_sasa) * 100

    return psae

# Example
pdb_file = "protein_structure.pdb"
psae_score = calculate_psae(pdb_file)
print(f"pSAE: {psae_score:.2f}%")

# Interpretation
if psae_score < 25:
    print("Good solubility expected")
elif psae_score < 35:
    print("Moderate solubility")
else:
    print("High aggregation risk")

Interpretation:

pSAE < 25%: Low aggregation risk
pSAE 25-35%: Moderate risk
pSAE > 35%: High aggregation risk

When to use:

Analyzing designed structures
Post-AlphaFold validation
Identifying aggregation hotspots
Guiding surface mutations

Recommended Optimization Workflow

Step 1: Initial Screening (Fast)

def initial_screening(sequences):
    """
    Quick first-pass filtering using NetSolP
    Filters out obviously problematic sequences
    """
    passed = []
    for name, seq in sequences.items():
        netsolp_score = predict_solubility_netsolp(seq)
        if netsolp_score > 0.5:
            passed.append((name, seq))

    return passed

Step 2: Detailed Assessment (Moderate)

def detailed_assessment(filtered_sequences):
    """
    More thorough analysis with SoluProt and ESM
    Ranks sequences by multiple criteria
    """
    results = []
    for name, seq in filtered_sequences:
        soluprot_score = predict_solubility(seq)
        esm_score = score_sequence_esm(seq)

        combined_score = soluprot_score * 0.7 + esm_score * 0.3

        results.append({
            'name': name,
            'sequence': seq,
            'soluprot': soluprot_score,
            'esm': esm_score,
            'combined': combined_score
        })

    results.sort(key=lambda x: x['combined'], reverse=True)
    return results

Step 3: Sequence Optimization (If needed)

def optimize_problematic_sequences(sequences_needing_optimization):
    """
    Use SolubleMPNN to redesign problematic sequences
    Returns improved variants
    """
    optimized = []
    for name, seq in sequences_needing_optimization:
        # Generate multiple variants
        variants = optimize_sequence(
            sequence=seq,
            num_variants=10,
            temperature=0.2
        )

        # Score variants with ESM
        for variant in variants:
            variant['esm_score'] = score_sequence_esm(variant['sequence'])

        # Keep best variants
        variants.sort(
            key=lambda x: x['solubility_score'] * x['esm_score'],
            reverse=True
        )

        optimized.extend(variants[:3])  # Top 3 variants per sequence

    return optimized

Step 4: Structure-Based Validation (For critical sequences)

def structure_validation(top_candidates):
    """
    Predict structures and calculate pSAE for top candidates
    Final validation before experimental testing
    """
    validated = []
    for candidate in top_candidates:
        # Predict structure with AlphaFold
        structure_pdb = predict_structure_alphafold(candidate['sequence'])

        # Calculate pSAE
        psae = calculate_psae(structure_pdb)

        candidate['psae'] = psae
        candidate['pass_structure_check'] = psae < 30

        validated.append(candidate)

    return validated

Complete Workflow Example

def complete_optimization_pipeline(initial_sequences):
    """
    End-to-end optimization pipeline

    Input: Dictionary of {name: sequence}
    Output: Ranked list of optimized, validated sequences
    """

    print("Step 1: Initial screening with NetSolP...")
    filtered = initial_screening(initial_sequences)
    print(f"  Passed: {len(filtered)}/{len(initial_sequences)}")

    print("Step 2: Detailed assessment with SoluProt and ESM...")
    assessed = detailed_assessment(filtered)

    # Split into good and needs-optimization
    good_sequences = [s for s in assessed if s['soluprot'] > 0.6]
    needs_optimization = [s for s in assessed if s['soluprot'] <= 0.6]

    print(f"  Good sequences: {len(good_sequences)}")
    print(f"  Need optimization: {len(needs_optimization)}")

    if needs_optimization:
        print("Step 3: Optimizing problematic sequences with SolubleMPNN...")
        optimized = optimize_problematic_sequences(needs_optimization)
        all_sequences = good_sequences + optimized
    else:
        all_sequences = good_sequences

    print("Step 4: Structure-based validation for top candidates...")
    top_20 = all_sequences[:20]
    final_validated = structure_validation(top_20)

    # Final ranking
    final_validated.sort(
        key=lambda x: (
            x['pass_structure_check'],
            x['combined'],
            -x['psae']
        ),
        reverse=True
    )

    return final_validated

# Usage
initial_library = {
    'variant_1': 'MKVLWAALLGLLGAAA...',
    'variant_2': 'MATGVLWAALLGLLGA...',
    # ... more sequences
}

optimized_library = complete_optimization_pipeline(initial_library)

# Submit top sequences to Adaptyv
top_sequences_for_testing = optimized_library[:50]

Best Practices Summary

Always pre-screen before experimental testing
Use NetSolP first for fast filtering of large libraries
Apply SolubleMPNN as default optimization tool
Validate with ESM to avoid unnatural sequences
Calculate pSAE for structure-based validation
Test multiple variants per design to account for prediction uncertainty
Keep controls - include wild-type or known-good sequences
Iterate - use experimental results to refine predictions

Integration with Adaptyv

After computational optimization, submit sequences to Adaptyv:

# After optimization pipeline
optimized_sequences = complete_optimization_pipeline(initial_library)

# Prepare FASTA format
fasta_content = ""
for seq_data in optimized_sequences[:50]:  # Top 50
    fasta_content += f">{seq_data['name']}\n{seq_data['sequence']}\n"

# Submit to Adaptyv
import requests
response = requests.post(
    "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "sequences": fasta_content,
        "experiment_type": "expression",
        "metadata": {
            "optimization_method": "SolubleMPNN_ESM_pipeline",
            "computational_scores": [s['combined'] for s in optimized_sequences[:50]]
        }
    }
)

Troubleshooting

Issue: All sequences score poorly on solubility predictions

Check if sequences contain unusual amino acids
Verify FASTA format is correct
Consider if protein family is naturally low-solubility
May need experimental validation despite predictions

Issue: SolubleMPNN changes functionally important residues

Provide structure file to preserve spatial constraints
Mask critical residues from mutation
Lower temperature parameter for conservative changes
Manually revert problematic mutations

Issue: ESM scores are low after optimization

Optimization may be too aggressive
Try lower temperature in SolubleMPNN
Balance between solubility and naturalness
Consider that some optimization may require non-natural mutations

Issue: Predictions don't match experimental results

Predictions are probabilistic, not deterministic
Host system and conditions affect expression
Some proteins may need experimental validation
Use predictions as enrichment, not absolute filters

17 KiB Raw Permalink Blame History

Protein Sequence Optimization

Overview

Common Protein Expression Problems

1. Unpaired Cysteines

2. Excessive Hydrophobicity

3. Low Solubility

Computational Tools for Optimization

NetSolP - Initial Solubility Screening

SoluProt - Comprehensive Solubility Prediction

SolubleMPNN - Sequence Redesign

ESM (Evolutionary Scale Modeling) - Sequence Likelihood

ipTM - Interface Stability (AlphaFold-Multimer)

pSAE - Solvent-Accessible Hydrophobic Residues

Recommended Optimization Workflow

Step 1: Initial Screening (Fast)

Step 2: Detailed Assessment (Moderate)

Step 3: Sequence Optimization (If needed)

Step 4: Structure-Based Validation (For critical sequences)

Complete Workflow Example

Best Practices Summary

Integration with Adaptyv

Troubleshooting

17 KiB

Raw Permalink Blame History