Datamol Fragments and Scaffolds Reference

Scaffolds Module (datamol.scaffold)

Scaffolds represent the core structure of molecules, useful for identifying structural families and analyzing structure-activity relationships (SAR).

Murcko Scaffolds

dm.to_scaffold_murcko(mol)

Extract Bemis-Murcko scaffold (molecular framework).

  • Method: Removes side chains, retaining ring systems and the linkers that connect them
  • Returns: Molecule object representing the scaffold
  • Use case: Identify core structures across compound series
  • Example:
    mol = dm.to_mol("c1ccc(cc1)CCN")  # Phenethylamine
    scaffold = dm.to_scaffold_murcko(mol)
    scaffold_smiles = dm.to_smiles(scaffold)
    # Returns: 'c1ccccc1' (the aminoethyl side chain is removed; only the benzene ring remains)
    

Workflow for scaffold analysis:

from collections import Counter

# Extract scaffolds from compound library
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

# Count scaffold frequency
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
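
Once scaffolds are counted, a few summary numbers help judge how diverse a library is. The sketch below is plain Python over a list of scaffold SMILES; the helper name `summarize_scaffolds` and the sample list are illustrative assumptions, not part of datamol.

```python
from collections import Counter


def summarize_scaffolds(scaffold_smiles):
    """Summarize scaffold diversity for a list of scaffold SMILES strings."""
    counts = Counter(scaffold_smiles)
    n_mols = len(scaffold_smiles)
    n_scaffolds = len(counts)
    # Singletons are scaffolds represented by exactly one compound
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "n_molecules": n_mols,
        "n_scaffolds": n_scaffolds,
        "n_singletons": n_singletons,
        # A ratio near 1.0 means almost every compound has its own scaffold
        "scaffolds_per_molecule": n_scaffolds / n_mols if n_mols else 0.0,
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical scaffold SMILES, standing in for the scaffold_smiles list above
example = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccccc1", "C1CCNCC1"]
summary = summarize_scaffolds(example)
```

A library dominated by one scaffold shows a low `scaffolds_per_molecule` ratio and few singletons, which is worth knowing before drawing SAR conclusions or splitting data.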

Fuzzy Scaffolds

dm.scaffold.fuzzy_scaffolding(mols, ...)

Generate fuzzy scaffolds for a list of molecules, optionally enforcing groups that must appear in the core.

  • Purpose: More flexible scaffold definition allowing specified functional groups
  • Use case: Custom scaffold definitions beyond Murcko rules

Applications

Scaffold-based splitting (for ML model validation):

# Group compounds by scaffold
scaffold_to_mols = {}
for mol, scaffold in zip(mols, scaffolds):
    smi = dm.to_smiles(scaffold)
    if smi not in scaffold_to_mols:
        scaffold_to_mols[smi] = []
    scaffold_to_mols[smi].append(mol)

# Ensure train/test sets have different scaffolds
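
The final comment can be made concrete by assigning whole scaffold groups to one side of the split. This is a minimal sketch in plain Python, assuming a dict shaped like `scaffold_to_mols` above; the function name `scaffold_split`, its parameters, and the string IDs in the example are hypothetical. Largest groups are placed first, a common heuristic that is not specific to datamol.

```python
import random


def scaffold_split(scaffold_to_items, test_fraction=0.2, seed=0):
    """Assign whole scaffold groups to train or test so no scaffold is shared."""
    groups = list(scaffold_to_items.items())
    # Shuffle deterministically so ties between equal-sized groups are random,
    # then place the largest groups first (stable sort keeps the shuffled order)
    random.Random(seed).shuffle(groups)
    groups.sort(key=lambda kv: len(kv[1]), reverse=True)

    n_total = sum(len(items) for _, items in groups)
    n_train_target = n_total - int(round(test_fraction * n_total))

    train, test = [], []
    for _, items in groups:
        # A group goes to train only if it fits entirely within the train budget
        if len(train) + len(items) <= n_train_target:
            train.extend(items)
        else:
            test.extend(items)
    return train, test


# Hypothetical groups; string IDs stand in for molecule objects
groups = {
    "c1ccccc1": ["m1", "m2", "m3", "m4"],
    "c1ccncc1": ["m5", "m6", "m7"],
    "C1CCNCC1": ["m8", "m9"],
    "c1ccoc1": ["m10"],
}
train, test = scaffold_split(groups, test_fraction=0.2)
```

Because groups are never divided, the realized test fraction only approximates `test_fraction` when scaffold groups are large relative to the dataset.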

SAR analysis:

import numpy as np

# Group by scaffold and analyze activity
# (get_activity is a placeholder for your own activity lookup)
for scaffold_smi, molecules in scaffold_to_mols.items():
    activities = [get_activity(mol) for mol in molecules]
    print(f"Scaffold: {scaffold_smi}, Mean activity: {np.mean(activities):.2f}")
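
A mean alone can hide series with very few members or a wide spread. The sketch below, using only the standard library, reports count, mean, and standard deviation per scaffold and skips under-populated groups. The function name `summarize_activity`, the `min_members` threshold, and the activity values are assumptions made for illustration.

```python
from statistics import mean, stdev


def summarize_activity(scaffold_to_activities, min_members=2):
    """Return (scaffold, n, mean, stdev) rows sorted by mean activity, highest first."""
    rows = []
    for scaffold_smi, activities in scaffold_to_activities.items():
        if len(activities) < min_members:
            continue  # Too few compounds to characterize the series
        # stdev needs at least two values; report 0.0 for a single value
        spread = stdev(activities) if len(activities) > 1 else 0.0
        rows.append((scaffold_smi, len(activities), mean(activities), spread))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical activity values (e.g. pIC50) grouped by scaffold SMILES
data = {
    "c1ccccc1": [6.0, 7.0, 8.0],
    "c1ccncc1": [5.0, 5.5],
    "C1CCNCC1": [9.1],  # singleton, skipped because min_members=2
}
rows = summarize_activity(data)
```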

Fragments Module (datamol.fragment)

Molecular fragmentation breaks molecules into smaller pieces based on chemical rules, useful for fragment-based drug design and substructure analysis.

BRICS Fragmentation

dm.fragment.brics(mol, ...)

Fragment molecule using BRICS (Breaking Retrosynthetically Interesting Chemical Substructures).

  • Method: Cleaves bonds between 16 chemically meaningful environments
  • Context: Takes the chemical environment and surrounding substructures of each bond into account
  • Returns: List of fragment molecules (RDKit Mol objects); convert with dm.to_smiles to get SMILES strings
  • Use case: Retrosynthetic analysis, fragment-based design
  • Example:
    mol = dm.to_mol("c1ccccc1CCN")
    fragments = dm.fragment.brics(mol)
    fragment_smiles = [dm.to_smiles(frag) for frag in fragments]
    # Fragment SMILES carry numbered dummy atoms of the form [n*]
    # that mark attachment points
    

RECAP Fragmentation

dm.fragment.recap(mol, ...)

Fragment molecule using RECAP (Retrosynthetic Combinatorial Analysis Procedure).

  • Method: Dissects based on 11 predefined bond types
  • Rules:
    • Leaves alkyl groups smaller than 5 carbons intact
    • Preserves cyclic bonds
  • Returns: List of fragment molecules (RDKit Mol objects); convert with dm.to_smiles to get SMILES strings
  • Use case: Combinatorial library design
  • Example:
    mol = dm.to_mol("CCCCCc1ccccc1")
    fragments = dm.fragment.recap(mol)
    

MMPA Fragmentation

dm.fragment.mmpa_frag(mol, ...)

Fragment for Matched Molecular Pair Analysis.

  • Purpose: Generate fragments suitable for identifying molecular pairs
  • Use case: Analyzing how small structural changes affect properties
  • Example:
    fragments = dm.fragment.mmpa_frag(mol)
    # Used to find pairs of molecules differing by single transformation
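
To illustrate how such fragments lead to matched pairs, the sketch below groups single-cut fragmentations by their shared core and reports molecules that differ only in the substituent. It runs on plain strings: the input layout (molecule ID mapped to `(core, substituent)` SMILES tuples), the function name `find_matched_pairs`, and the sample data are assumptions for illustration, not the output format of `dm.fragment.mmpa_frag`.

```python
from collections import defaultdict
from itertools import combinations


def find_matched_pairs(fragmentations):
    """Find molecule pairs that share a core but differ in one substituent.

    fragmentations maps a molecule ID to a list of (core, substituent)
    SMILES tuples, one tuple per single cut.
    """
    by_core = defaultdict(list)
    for mol_id, cuts in fragmentations.items():
        for core, substituent in cuts:
            by_core[core].append((mol_id, substituent))

    pairs = []
    for core, members in by_core.items():
        for (id_a, sub_a), (id_b, sub_b) in combinations(members, 2):
            if id_a != id_b and sub_a != sub_b:
                # Same core, different substituent: one transformation
                pairs.append((id_a, id_b, core, f"{sub_a}>>{sub_b}"))
    return pairs


# Hypothetical single-cut fragmentations, written by hand for illustration
fragmentations = {
    "toluene": [("[1*]c1ccccc1", "[1*]C")],
    "chlorobenzene": [("[1*]c1ccccc1", "[1*]Cl")],
    "methylpyridine": [("[1*]c1ccncc1", "[1*]C")],
}
pairs = find_matched_pairs(fragmentations)
```

Each reported pair records the shared core and the substituent change, which can then be joined with property data to estimate the effect of that transformation.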
    

Comparison of Methods

| Method | Bond Types | Preserves Cycles | Best For |
| --- | --- | --- | --- |
| BRICS | 16 | Yes | Retrosynthetic analysis, fragment recombination |
| RECAP | 11 | Yes | Combinatorial library design |
| MMPA | Variable | Depends | Structure-activity relationship analysis |

Fragmentation Workflow

import re
from collections import Counter

import datamol as dm

# 1. Fragment a molecule (both calls return RDKit Mol objects)
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
brics_frags = dm.fragment.brics(mol)
recap_frags = dm.fragment.recap(mol)

# 2. Collect fragment SMILES across a library
# (molecule_library is assumed to be an iterable of Mol objects)
all_fragments = []
for lib_mol in molecule_library:
    frags = dm.fragment.brics(lib_mol)
    all_fragments.extend(dm.to_smiles(frag) for frag in frags)

# 3. Identify common fragments
fragment_counts = Counter(all_fragments)
common_fragments = fragment_counts.most_common(20)

# 4. Convert fragment SMILES back to molecules (cap attachment points)
def clean_fragment(frag_smiles):
    # Replace every attachment point marker ([1*], [2*], [16*], ...) with hydrogen
    clean = re.sub(r"\[\d+\*\]", "[H]", frag_smiles)
    return dm.to_mol(clean)

Advanced: Fragment-Based Virtual Screening

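# Build a fragment SMILES library from known actives
# (active_compounds and screening_lib are assumed to be lists of Mol objects)
active_fragments = set()
for active_mol in active_compounds:
    frags = dm.fragment.brics(active_mol)
    active_fragments.update(dm.to_smiles(frag) for frag in frags)

# Score a compound by the fraction of its fragments seen among the actives
def score_by_fragments(mol, fragment_set):
    mol_frags = {dm.to_smiles(frag) for frag in dm.fragment.brics(mol)}
    if not mol_frags:
        return 0.0  # No fragments produced; avoid division by zero
    overlap = mol_frags.intersection(fragment_set)
    return len(overlap) / len(mol_frags)

# Score screening library
scores = [score_by_fragments(mol, active_fragments) for mol in screening_lib]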
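
To see how the overlap score behaves and how scores turn into a ranked hit list, here is a small worked sketch on hand-written fragment sets. It uses only plain Python; the names `fragment_overlap_score` and `rank_by_fragments`, the compound IDs, and the fragment strings are hypothetical and stand in for fragment SMILES computed as above.

```python
def fragment_overlap_score(mol_fragments, active_fragments):
    """Fraction of a compound's fragments that also occur among the actives."""
    mol_fragments = set(mol_fragments)
    if not mol_fragments:
        return 0.0
    return len(mol_fragments & set(active_fragments)) / len(mol_fragments)


def rank_by_fragments(library_fragments, active_fragments, top_k=10):
    """Return the top_k (compound_id, score) pairs, best score first."""
    scored = [
        (compound_id, fragment_overlap_score(frags, active_fragments))
        for compound_id, frags in library_fragments.items()
    ]
    # Sort by descending score; ties are broken by compound ID for reproducibility
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical fragment SMILES sets, written by hand for illustration
active_fragments = {"[16*]c1ccccc1", "[8*]CCN", "[5*]N1CCOCC1"}
library_fragments = {
    "cmpd_A": {"[16*]c1ccccc1", "[8*]CCN"},
    "cmpd_B": {"[16*]c1ccccc1", "[3*]OC"},
    "cmpd_C": {"[4*]CC"},
}
ranking = rank_by_fragments(library_fragments, active_fragments)
```

Because the score is normalized by the compound's own fragment count, small molecules with a single matching fragment can score as high as larger ones; a minimum fragment count may be worth adding in practice.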

Key Concepts

  • Attachment Points: Marked with [1*], [2*], etc. in fragment SMILES
  • Retrosynthetic: Fragmentation mimics synthetic disconnections
  • Chemically Meaningful: Breaks occur at typical synthetic bonds
  • Recombination: Fragments can theoretically be recombined into valid molecules
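
Attachment point markers are ordinary text in a fragment SMILES, so they can be inspected with a regular expression before any chemistry toolkit is involved. The following sketch assumes markers of the form `[n*]` as described above; the helper names `attachment_labels` and `cap_with_hydrogen` are hypothetical.

```python
import re

# Matches numbered dummy atoms such as [1*] or [16*]
ATTACHMENT_RE = re.compile(r"\[(\d+)\*\]")


def attachment_labels(frag_smiles):
    """Return the numeric labels of all attachment points in a fragment SMILES."""
    return [int(label) for label in ATTACHMENT_RE.findall(frag_smiles)]


def cap_with_hydrogen(frag_smiles):
    """Replace every attachment point with an explicit hydrogen."""
    return ATTACHMENT_RE.sub("[H]", frag_smiles)


labels = attachment_labels("[16*]c1ccc([16*])cc1")
capped = cap_with_hydrogen("[1*]CCN")
```

The number of labels tells you how many connection points a fragment offers, which is useful when filtering for terminal fragments (one label) versus linkers (two or more).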