Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/rdkit/SKILL.md
+++ b/skills/rdkit/SKILL.md
@@ -0,0 +1,763 @@
+---
+name: rdkit
+description: "Cheminformatics toolkit for fine-grained molecular control. SMILES/SDF parsing, descriptors (MW, LogP, TPSA), fingerprints, substructure search, 2D/3D generation, similarity, reactions. For standard workflows with simpler interface, use datamol (wrapper around RDKit). Use rdkit for advanced control, custom sanitization, specialized algorithms."
+---
+
+# RDKit Cheminformatics Toolkit
+
+## Overview
+
+RDKit is a comprehensive cheminformatics library providing Python APIs for molecular analysis and manipulation. This skill provides guidance for reading/writing molecular structures, calculating descriptors, fingerprinting, substructure searching, chemical reactions, 2D/3D coordinate generation, and molecular visualization. Use this skill for drug discovery, computational chemistry, and cheminformatics research tasks.
+
+## Core Capabilities
+
+### 1. Molecular I/O and Creation
+
+**Reading Molecules:**
+
+Read molecular structures from various formats:
+
+```python
+from rdkit import Chem
+
+# From SMILES strings
+mol = Chem.MolFromSmiles('Cc1ccccc1')  # Returns Mol object or None
+
+# From MOL files
+mol = Chem.MolFromMolFile('path/to/file.mol')
+
+# From MOL blocks (string data)
+mol = Chem.MolFromMolBlock(mol_block_string)
+
+# From InChI
+mol = Chem.MolFromInchi('InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H')
+```
+
+**Writing Molecules:**
+
+Convert molecules to text representations:
+
+```python
+# To canonical SMILES
+smiles = Chem.MolToSmiles(mol)
+
+# To MOL block
+mol_block = Chem.MolToMolBlock(mol)
+
+# To InChI
+inchi = Chem.MolToInchi(mol)
+```
+
+**Batch Processing:**
+
+For processing multiple molecules, use Supplier/Writer objects:
+
+```python
+# Read SDF files
+suppl = Chem.SDMolSupplier('molecules.sdf')
+for mol in suppl:
+    if mol is not None:  # Check for parsing errors
+        # Process molecule
+        pass
+
+# Read SMILES files
+suppl = Chem.SmilesMolSupplier('molecules.smi', titleLine=False)
+
+# For large files or compressed data
+import gzip
+with gzip.open('molecules.sdf.gz') as f:
+    suppl = Chem.ForwardSDMolSupplier(f)
+    for mol in suppl:
+        if mol is None:  # Skip records that failed to parse
+            continue
+        # Process molecule
+        pass
+
+# Multithreaded processing for large datasets
+suppl = Chem.MultithreadedSDMolSupplier('molecules.sdf')
+
+# Write molecules to SDF
+writer = Chem.SDWriter('output.sdf')
+for mol in molecules:
+    writer.write(mol)
+writer.close()
+```
+
+**Important Notes:**
+- All `MolFrom*` functions return `None` on failure and log an error message instead of raising an exception
+- Always check for `None` before processing molecules
+- Molecules are automatically sanitized on import (validates valence, perceives aromaticity)
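+
+The notes above can be sketched as a small parsing helper. This is a minimal illustration, not part of RDKit itself: `parse_smiles` is a hypothetical name, and the third input is deliberately malformed (an unclosed ring) so that the `None` path is exercised.
+
+```python
+from rdkit import Chem
+
+def parse_smiles(smiles_list):
+    """Split SMILES strings into parsed molecules and inputs that failed."""
+    parsed, failed = [], []
+    for smi in smiles_list:
+        mol = Chem.MolFromSmiles(smi)  # None on failure, no exception raised
+        if mol is None:
+            failed.append(smi)
+        else:
+            parsed.append(mol)
+    return parsed, failed
+
+# 'C1CC' opens ring 1 but never closes it, so parsing fails
+parsed, failed = parse_smiles(['CCO', 'c1ccccc1', 'C1CC'])
+print(len(parsed), failed)
+```
+
+Collecting the failures separately keeps bad input visible instead of silently dropping it.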
+
+### 2. Molecular Sanitization and Validation
+
+RDKit automatically sanitizes molecules during parsing, running a sequence of steps that includes valence checking, kekulization, aromaticity perception, and chirality cleanup.
+
+**Sanitization Control:**
+
+```python
+# Disable automatic sanitization
+mol = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
+
+# Manual sanitization
+Chem.SanitizeMol(mol)
+
+# Detect problems before sanitization
+problems = Chem.DetectChemistryProblems(mol)
+for problem in problems:
+    print(problem.GetType(), problem.Message())
+
+# Partial sanitization (skip specific steps)
+Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_PROPERTIES)
+```
+
+**Common Sanitization Issues:**
+- Atoms with explicit valence exceeding maximum allowed will raise exceptions
+- Invalid aromatic rings will cause kekulization errors
+- Radical electrons may not be properly assigned without explicit specification
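+
+As a hedged sketch of the first issue, the snippet below builds a carbon with five bonds, inspects it with `DetectChemistryProblems()`, and catches the resulting exception. The input SMILES and the variable names are illustrative assumptions; `Chem.MolSanitizeException` is assumed to be the base class of the valence and kekulization exceptions.
+
+```python
+from rdkit import Chem
+
+# Pentavalent carbon: parses only because sanitization is disabled
+mol = Chem.MolFromSmiles('C(C)(C)(C)(C)C', sanitize=False)
+
+# Inspect problems without raising
+problems = Chem.DetectChemistryProblems(mol)
+for problem in problems:
+    print(problem.GetType(), problem.Message())
+
+# Sanitizing the same molecule raises an exception
+try:
+    Chem.SanitizeMol(mol)
+    sanitized = True
+except Chem.MolSanitizeException as exc:
+    sanitized = False
+    print(f"Sanitization failed: {exc}")
+```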
+
+### 3. Molecular Analysis and Properties
+
+**Accessing Molecular Structure:**
+
+```python
+# Iterate atoms and bonds
+for atom in mol.GetAtoms():
+    print(atom.GetSymbol(), atom.GetIdx(), atom.GetDegree())
+
+for bond in mol.GetBonds():
+    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
+
+# Ring information
+ring_info = mol.GetRingInfo()
+ring_info.NumRings()
+ring_info.AtomRings()  # Returns tuples of atom indices
+
+# Check if atom is in ring
+atom = mol.GetAtomWithIdx(0)
+atom.IsInRing()
+atom.IsInRingSize(6)  # Check for 6-membered rings
+
+# Find the symmetrized smallest set of smallest rings (SSSR)
+from rdkit.Chem import GetSymmSSSR
+rings = GetSymmSSSR(mol)
+```
+
+**Stereochemistry:**
+
+```python
+# Find chiral centers
+from rdkit.Chem import FindMolChiralCenters
+chiral_centers = FindMolChiralCenters(mol, includeUnassigned=True)
+# Returns list of (atom_idx, chirality) tuples
+
+# Assign stereochemistry from 3D coordinates
+from rdkit.Chem import AssignStereochemistryFrom3D
+AssignStereochemistryFrom3D(mol)
+
+# Check bond stereochemistry
+bond = mol.GetBondWithIdx(0)
+stereo = bond.GetStereo()  # STEREONONE, STEREOZ, STEREOE, etc.
+```
+
+**Fragment Analysis:**
+
+```python
+# Get disconnected fragments
+frags = Chem.GetMolFrags(mol, asMols=True)
+
+# Fragment on specific bonds
+from rdkit.Chem import FragmentOnBonds
+frag_mol = FragmentOnBonds(mol, [bond_idx1, bond_idx2])
+
+# Extract the Murcko scaffold (ring systems plus the linkers between them)
+from rdkit.Chem.Scaffolds import MurckoScaffold
+scaffold = MurckoScaffold.GetScaffoldForMol(mol)
+```
+
+### 4. Molecular Descriptors and Properties
+
+**Basic Descriptors:**
+
+```python
+from rdkit.Chem import Descriptors
+
+# Molecular weight
+mw = Descriptors.MolWt(mol)
+exact_mw = Descriptors.ExactMolWt(mol)
+
+# LogP (lipophilicity)
+logp = Descriptors.MolLogP(mol)
+
+# Topological polar surface area
+tpsa = Descriptors.TPSA(mol)
+
+# Number of hydrogen bond donors/acceptors
+hbd = Descriptors.NumHDonors(mol)
+hba = Descriptors.NumHAcceptors(mol)
+
+# Number of rotatable bonds
+rot_bonds = Descriptors.NumRotatableBonds(mol)
+
+# Number of aromatic rings
+aromatic_rings = Descriptors.NumAromaticRings(mol)
+```
+
+**Batch Descriptor Calculation:**
+
+```python
+# Calculate all descriptors at once
+all_descriptors = Descriptors.CalcMolDescriptors(mol)
+# Returns dictionary: {'MolWt': 180.16, 'MolLogP': 1.23, ...}
+
+# Get list of available descriptor names
+descriptor_names = [desc[0] for desc in Descriptors._descList]
+```
+
+**Lipinski's Rule of Five:**
+
+```python
+# Check drug-likeness (each variable holds a pass/fail boolean, not the descriptor value)
+mw_ok = Descriptors.MolWt(mol) <= 500
+logp_ok = Descriptors.MolLogP(mol) <= 5
+hbd_ok = Descriptors.NumHDonors(mol) <= 5
+hba_ok = Descriptors.NumHAcceptors(mol) <= 10
+
+is_drug_like = mw_ok and logp_ok and hbd_ok and hba_ok
+```
+
+### 5. Fingerprints and Molecular Similarity
+
+**Fingerprint Types:**
+
+```python
+from rdkit.Chem import AllChem, RDKFingerprint
+from rdkit.Chem.AtomPairs import Pairs, Torsions
+from rdkit.Chem import MACCSkeys
+
+# RDKit topological fingerprint
+fp = Chem.RDKFingerprint(mol)
+
+# Morgan fingerprints (circular fingerprints, similar to ECFP)
+fp = AllChem.GetMorganFingerprint(mol, radius=2)
+fp_bits = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
+
+# MACCS keys (166-bit structural key)
+fp = MACCSkeys.GenMACCSKeys(mol)
+
+# Atom pair fingerprints
+fp = Pairs.GetAtomPairFingerprint(mol)
+
+# Topological torsion fingerprints
+fp = Torsions.GetTopologicalTorsionFingerprint(mol)
+
+# Avalon fingerprints (if available)
+from rdkit.Avalon import pyAvalonTools
+fp = pyAvalonTools.GetAvalonFP(mol)
+```
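+
+Recent RDKit releases also provide a generator-based fingerprint API in `rdFingerprintGenerator`, and the older `AllChem.GetMorgan*` functions may emit deprecation warnings depending on the installed version. A minimal sketch of the generator form, assuming a release that ships this module:
+
+```python
+from rdkit import Chem
+from rdkit.Chem import rdFingerprintGenerator
+
+mol = Chem.MolFromSmiles('Cc1ccccc1')
+
+# One generator can be reused for many molecules
+gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
+
+fp_bits = gen.GetFingerprint(mol)         # Fixed-length bit vector
+fp_counts = gen.GetCountFingerprint(mol)  # Fixed-length count vector
+
+print(fp_bits.GetNumBits(), fp_bits.GetNumOnBits())
+```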
+
+**Similarity Calculation:**
+
+```python
+from rdkit import DataStructs
+
+# Calculate Tanimoto similarity
+fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2)
+fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2)
+similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
+
+# Calculate similarity for multiple molecules
+similarities = DataStructs.BulkTanimotoSimilarity(fp1, [fp2, fp3, fp4])
+
+# Other similarity metrics
+dice = DataStructs.DiceSimilarity(fp1, fp2)
+cosine = DataStructs.CosineSimilarity(fp1, fp2)
+```
+
+**Clustering and Diversity:**
+
+```python
+# Butina clustering based on fingerprint similarity
+from rdkit.ML.Cluster import Butina
+
+# Calculate distance matrix
+dists = []
+fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
+for i in range(len(fps)):
+    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
+    dists.extend([1-sim for sim in sims])
+
+# Cluster with distance cutoff
+clusters = Butina.ClusterData(dists, len(fps), distThresh=0.3, isDistData=True)
+```
+
+### 6. Substructure Searching and SMARTS
+
+**Basic Substructure Matching:**
+
+```python
+# Define query using SMARTS
+query = Chem.MolFromSmarts('[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1')  # Benzene ring
+
+# Check if molecule contains substructure
+has_match = mol.HasSubstructMatch(query)
+
+# Get all matches (returns tuple of tuples with atom indices)
+matches = mol.GetSubstructMatches(query)
+
+# Get only first match
+match = mol.GetSubstructMatch(query)
+```
+
+**Common SMARTS Patterns:**
+
+```python
+# Primary alcohols
+primary_alcohol = Chem.MolFromSmarts('[CH2][OH1]')
+
+# Carboxylic acids
+carboxylic_acid = Chem.MolFromSmarts('C(=O)[OH]')
+
+# Amides
+amide = Chem.MolFromSmarts('C(=O)N')
+
+# Aromatic heterocycles
+aromatic_n = Chem.MolFromSmarts('[nR]')  # Aromatic nitrogen in ring
+
+# Macrocycle atoms (smallest ring containing the atom has 12 or more members)
+macrocycle = Chem.MolFromSmarts('[r{12-}]')
+```
+
+**Matching Rules:**
+- Unspecified properties in query match any value in target
+- Hydrogens are ignored unless explicitly specified
+- Charged query atom won't match uncharged target atom
+- Aromatic query atom won't match aliphatic target atom (unless query is generic)
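+
+A short sketch of these rules in action. The molecules and the `results` dictionary are illustrative choices, not taken from the sections above.
+
+```python
+from rdkit import Chem
+
+phenol = Chem.MolFromSmiles('Oc1ccccc1')
+cyclohexanol = Chem.MolFromSmiles('OC1CCCCC1')
+
+aromatic_c = Chem.MolFromSmarts('c')     # Aromatic carbon only
+any_carbon = Chem.MolFromSmarts('[#6]')  # Any carbon, aromatic or aliphatic
+charged_o = Chem.MolFromSmarts('[O-]')   # Oxygen with formal charge -1
+
+results = {
+    'aromatic_c_in_phenol': phenol.HasSubstructMatch(aromatic_c),
+    'aromatic_c_in_cyclohexanol': cyclohexanol.HasSubstructMatch(aromatic_c),
+    'any_carbon_in_cyclohexanol': cyclohexanol.HasSubstructMatch(any_carbon),
+    'charged_o_in_phenol': phenol.HasSubstructMatch(charged_o),
+}
+print(results)
+```
+
+The aromatic query `c` misses the aliphatic ring, the generic `[#6]` query matches it, and the charged query misses the neutral hydroxyl oxygen.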
+
+### 7. Chemical Reactions
+
+**Reaction SMARTS:**
+
+```python
+from rdkit.Chem import AllChem
+
+# Define reaction using SMARTS: reactants >> products
+rxn = AllChem.ReactionFromSmarts('[C:1]=[O:2]>>[C:1][O:2]')  # Ketone reduction
+
+# Apply reaction to molecules
+reactants = (mol1,)
+products = rxn.RunReactants(reactants)
+
+# Products is tuple of tuples (one tuple per product set)
+for product_set in products:
+    for product in product_set:
+        # Sanitize product
+        Chem.SanitizeMol(product)
+```
+
+**Reaction Features:**
+- Atom mapping preserves specific atoms between reactants and products
+- Dummy atoms in products are replaced by corresponding reactant atoms
+- "Any" bonds inherit bond order from reactants
+- Chirality preserved unless explicitly changed
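+
+To illustrate atom mapping, here is a minimal sketch of an amide coupling with two reactant templates. The reactants (acetic acid and methylamine) are illustrative choices; the mapped atoms `:1`, `:2`, and `:3` are carried into the product, while the unmapped hydroxyl oxygen is dropped.
+
+```python
+from rdkit import Chem
+from rdkit.Chem import AllChem
+
+# Acid + amine >> amide; mapped atoms persist, the unmapped O is lost
+rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])O.[N:3]>>[C:1](=[O:2])[N:3]')
+
+acid = Chem.MolFromSmiles('CC(=O)O')
+amine = Chem.MolFromSmiles('NC')
+
+# Reactant order must match the template order in the reaction SMARTS
+products = rxn.RunReactants((acid, amine))
+
+product_smiles = []
+for product_set in products:
+    for product in product_set:
+        Chem.SanitizeMol(product)  # Products are returned unsanitized
+        product_smiles.append(Chem.MolToSmiles(product))
+
+print(product_smiles)
+```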
+
+**Reaction Similarity:**
+
+```python
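+from rdkit import DataStructs
+
+# Generate difference fingerprints for two reactions (rxn1 and rxn2 are ChemicalReaction objects)
+fp1 = AllChem.CreateDifferenceFingerprintForReaction(rxn1)
+fp2 = AllChem.CreateDifferenceFingerprintForReaction(rxn2)
+
+# Compare reactions
+similarity = DataStructs.TanimotoSimilarity(fp1, fp2)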
+```
+
+### 8. 2D and 3D Coordinate Generation
+
+**2D Coordinate Generation:**
+
+```python
+from rdkit.Chem import AllChem
+
+# Generate 2D coordinates for depiction
+AllChem.Compute2DCoords(mol)
+
+# Align molecule to template structure
+template = Chem.MolFromSmiles('c1ccccc1')
+AllChem.Compute2DCoords(template)
+AllChem.GenerateDepictionMatching2DStructure(mol, template)
+```
+
+**3D Coordinate Generation and Conformers:**
+
+```python
+# Generate single 3D conformer using ETKDG (add explicit hydrogens first for sensible geometry)
+mol = Chem.AddHs(mol)
+AllChem.EmbedMolecule(mol, randomSeed=42)
+
+# Generate multiple conformers
+conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)
+
+# Optimize geometry with force field
+AllChem.UFFOptimizeMolecule(mol)  # UFF force field
+AllChem.MMFFOptimizeMolecule(mol)  # MMFF94 force field
+
+# Optimize all conformers
+for conf_id in conf_ids:
+    AllChem.MMFFOptimizeMolecule(mol, confId=conf_id)
+
+# Calculate RMSD between conformers
+rms = AllChem.GetConformerRMS(mol, conf_id1, conf_id2)
+
+# Align molecules
+AllChem.AlignMol(probe_mol, ref_mol)
+```
+
+**Constrained Embedding:**
+
+```python
+# Embed with part of molecule constrained to specific coordinates
+AllChem.ConstrainedEmbed(mol, core_mol)
+```
+
+### 9. Molecular Visualization
+
+**Basic Drawing:**
+
+```python
+from rdkit.Chem import Draw
+
+# Draw single molecule to PIL image
+img = Draw.MolToImage(mol, size=(300, 300))
+img.save('molecule.png')
+
+# Draw to file directly
+Draw.MolToFile(mol, 'molecule.png')
+
+# Draw multiple molecules in grid
+mols = [mol1, mol2, mol3, mol4]
+img = Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200))
+```
+
+**Highlighting Substructures:**
+
+```python
+# Highlight substructure match
+query = Chem.MolFromSmarts('c1ccccc1')
+match = mol.GetSubstructMatch(query)
+
+img = Draw.MolToImage(mol, highlightAtoms=match)
+
+# Custom highlight color (one RGB tuple applied to all highlighted atoms)
+img = Draw.MolToImage(mol, highlightAtoms=match,
+                      highlightColor=(1, 0, 0))  # Red
+```
+
+**Customizing Visualization:**
+
+```python
+from rdkit.Chem.Draw import rdMolDraw2D
+
+# Create drawer with custom options
+drawer = rdMolDraw2D.MolDraw2DCairo(300, 300)
+opts = drawer.drawOptions()
+
+# Customize options
+opts.addAtomIndices = True
+opts.addStereoAnnotation = True
+opts.bondLineWidth = 2
+
+# Draw molecule
+drawer.DrawMolecule(mol)
+drawer.FinishDrawing()
+
+# Save to file
+with open('molecule.png', 'wb') as f:
+    f.write(drawer.GetDrawingText())
+```
+
+**Jupyter Notebook Integration:**
+
+```python
+# Enable inline display in Jupyter
+from rdkit.Chem.Draw import IPythonConsole
+
+# Customize default display
+IPythonConsole.ipython_useSVG = True  # Use SVG instead of PNG
+IPythonConsole.molSize = (300, 300)   # Default size
+
+# Molecules now display automatically
+mol  # Shows molecule image
+```
+
+**Visualizing Fingerprint Bits:**
+
+```python
+# Show what molecular features a fingerprint bit represents
+from rdkit.Chem import Draw
+
+# For Morgan fingerprints
+bit_info = {}
+fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, bitInfo=bit_info)
+
+# Draw environment for specific bit
+img = Draw.DrawMorganBit(mol, bit_id, bit_info)
+```
+
+### 10. Molecular Modification
+
+**Adding/Removing Hydrogens:**
+
+```python
+# Add explicit hydrogens
+mol_h = Chem.AddHs(mol)
+
+# Remove explicit hydrogens
+mol = Chem.RemoveHs(mol_h)
+```
+
+**Kekulization and Aromaticity:**
+
+```python
+# Convert aromatic bonds to alternating single/double
+Chem.Kekulize(mol)
+
+# Set aromaticity
+Chem.SetAromaticity(mol)
+```
+
+**Replacing Substructures:**
+
+```python
+# Replace substructure with another structure
+query = Chem.MolFromSmarts('c1ccccc1')  # Benzene
+replacement = Chem.MolFromSmiles('C1CCCCC1')  # Cyclohexane
+
+new_mol = Chem.ReplaceSubstructs(mol, query, replacement)[0]
+```
+
+**Neutralizing Charges:**
+
+```python
+# Remove formal charges by adding/removing hydrogens
+from rdkit.Chem.MolStandardize import rdMolStandardize
+
+# Using Uncharger
+uncharger = rdMolStandardize.Uncharger()
+mol_neutral = uncharger.uncharge(mol)
+```
+
+### 11. Working with Molecular Hashes and Standardization
+
+**Molecular Hashing:**
+
+```python
+from rdkit.Chem import rdMolHash
+
+# Generate Murcko scaffold hash
+scaffold_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.MurckoScaffold)
+
+# Canonical SMILES hash
+canonical_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.CanonicalSmiles)
+
+# Regioisomer hash (ignores stereochemistry)
+regio_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.Regioisomer)
+```
+
+**Randomized SMILES:**
+
+```python
+# Generate random SMILES representations (for data augmentation)
+from rdkit.Chem import MolToRandomSmilesVect
+
+random_smiles = MolToRandomSmilesVect(mol, numSmiles=10, randomSeed=42)
+```
+
+### 12. Pharmacophore and 3D Features
+
+**Pharmacophore Features:**
+
+```python
+from rdkit.Chem import ChemicalFeatures
+from rdkit import RDConfig
+import os
+
+# Load feature factory
+fdef_path = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef')
+factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)
+
+# Get pharmacophore features
+features = factory.GetFeaturesForMol(mol)
+
+for feat in features:
+    print(feat.GetFamily(), feat.GetType(), feat.GetAtomIds())
+```
+
+## Common Workflows
+
+### Drug-likeness Analysis
+
+```python
+from rdkit import Chem
+from rdkit.Chem import Descriptors
+
+def analyze_druglikeness(smiles):
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        return None
+
+    # Calculate Lipinski descriptors
+    results = {
+        'MW': Descriptors.MolWt(mol),
+        'LogP': Descriptors.MolLogP(mol),
+        'HBD': Descriptors.NumHDonors(mol),
+        'HBA': Descriptors.NumHAcceptors(mol),
+        'TPSA': Descriptors.TPSA(mol),
+        'RotBonds': Descriptors.NumRotatableBonds(mol)
+    }
+
+    # Check Lipinski's Rule of Five
+    results['Lipinski'] = (
+        results['MW'] <= 500 and
+        results['LogP'] <= 5 and
+        results['HBD'] <= 5 and
+        results['HBA'] <= 10
+    )
+
+    return results
+```
+
+### Similarity Screening
+
+```python
+from rdkit import Chem
+from rdkit.Chem import AllChem
+from rdkit import DataStructs
+
+def similarity_screen(query_smiles, database_smiles, threshold=0.7):
+    query_mol = Chem.MolFromSmiles(query_smiles)
+    if query_mol is None:
+        raise ValueError(f"Invalid query SMILES: {query_smiles}")
+    query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2)
+
+    hits = []
+    for idx, smiles in enumerate(database_smiles):
+        mol = Chem.MolFromSmiles(smiles)
+        if mol:
+            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
+            sim = DataStructs.TanimotoSimilarity(query_fp, fp)
+            if sim >= threshold:
+                hits.append((idx, smiles, sim))
+
+    return sorted(hits, key=lambda x: x[2], reverse=True)
+```
+
+### Substructure Filtering
+
+```python
+from rdkit import Chem
+
+def filter_by_substructure(smiles_list, pattern_smarts):
+    query = Chem.MolFromSmarts(pattern_smarts)
+    if query is None:
+        raise ValueError(f"Invalid SMARTS pattern: {pattern_smarts}")
+
+    hits = []
+    for smiles in smiles_list:
+        mol = Chem.MolFromSmiles(smiles)
+        if mol and mol.HasSubstructMatch(query):
+            hits.append(smiles)
+
+    return hits
+```
+
+## Best Practices
+
+### Error Handling
+
+Always check for `None` when parsing molecules:
+
+```python
+for smiles in smiles_list:
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        print(f"Failed to parse: {smiles}")
+        continue
+    # Process molecule
+```
+
+### Performance Optimization
+
+**Use binary formats for storage:**
+
+```python
+import pickle
+
+# Pickle molecules for fast loading
+with open('molecules.pkl', 'wb') as f:
+    pickle.dump(mols, f)
+
+# Load pickled molecules (much faster than reparsing)
+with open('molecules.pkl', 'rb') as f:
+    mols = pickle.load(f)
+```
+
+**Use bulk operations:**
+
+```python
+# Calculate fingerprints for all molecules at once
+fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
+
+# Use bulk similarity calculations
+similarities = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
+```
+
+### Thread Safety
+
+RDKit operations are generally thread-safe for:
+- Molecule I/O (SMILES, mol blocks)
+- Coordinate generation
+- Fingerprinting and descriptors
+- Substructure searching
+- Reactions
+- Drawing
+
+**Not thread-safe:** MolSuppliers when accessed concurrently.
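+
+One way to respect this split is to pass plain SMILES strings to worker threads and let each worker parse its own molecule, so no supplier is shared. A minimal sketch, assuming an RDKit build with thread safety enabled; `heavy_atoms_and_weight` is a hypothetical helper name.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+from rdkit import Chem
+from rdkit.Chem import Descriptors
+
+def heavy_atoms_and_weight(smiles):
+    """Parse one SMILES inside the worker and return (heavy atoms, MW)."""
+    mol = Chem.MolFromSmiles(smiles)
+    if mol is None:
+        return None
+    return mol.GetNumHeavyAtoms(), Descriptors.MolWt(mol)
+
+smiles_list = ['CCO', 'c1ccccc1', 'CC(=O)O']
+
+# pool.map preserves input order in its results
+with ThreadPoolExecutor(max_workers=2) as pool:
+    results = list(pool.map(heavy_atoms_and_weight, smiles_list))
+
+print(results)
+```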
+
+### Memory Management
+
+For large datasets:
+
+```python
+# Use ForwardSDMolSupplier to avoid loading entire file (file must be opened in binary mode)
+with open('large.sdf', 'rb') as f:
+    suppl = Chem.ForwardSDMolSupplier(f)
+    for mol in suppl:
+        # Process one molecule at a time
+        pass
+
+# Use MultithreadedSDMolSupplier for parallel parsing (requires an RDKit build with multithreading support)
+suppl = Chem.MultithreadedSDMolSupplier('large.sdf', numWriterThreads=4)
+```
+
+## Common Pitfalls
+
+1. **Forgetting to check for None:** Always validate molecules after parsing
+2. **Sanitization failures:** Use `DetectChemistryProblems()` to debug
+3. **Missing hydrogens:** Use `AddHs()` before generating 3D coordinates or calculating properties that depend on explicit hydrogens
+4. **2D vs 3D:** Generate appropriate coordinates before visualization or 3D analysis
+5. **SMARTS matching rules:** Remember that unspecified properties match anything
+6. **Thread safety with MolSuppliers:** Don't share supplier objects across threads
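+
+Pitfalls 3 and 4 often appear together when generating 3D structures. A minimal sketch using ethanol as an illustrative input:
+
+```python
+from rdkit import Chem
+from rdkit.Chem import AllChem
+
+mol = Chem.MolFromSmiles('CCO')  # Hydrogens are implicit after parsing
+mol_h = Chem.AddHs(mol)          # Make hydrogens explicit before embedding
+
+# EmbedMolecule returns the conformer ID (0 here) or -1 on failure
+status = AllChem.EmbedMolecule(mol_h, randomSeed=42)
+if status == 0:
+    AllChem.MMFFOptimizeMolecule(mol_h)
+
+print(mol.GetNumAtoms(), mol_h.GetNumAtoms(), mol_h.GetNumConformers())
+```
+
+Checking the return value of `EmbedMolecule` matters because a failed embedding leaves the molecule without a conformer.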
+
+## Resources
+
+### references/
+
+This skill includes detailed API reference documentation:
+
+- `api_reference.md` - Comprehensive listing of RDKit modules, functions, and classes organized by functionality
+- `descriptors_reference.md` - Complete list of available molecular descriptors with descriptions
+- `smarts_patterns.md` - Common SMARTS patterns for functional groups and structural features
+
+Load these references when needing specific API details, parameter information, or pattern examples.
+
+### scripts/
+
+Example scripts for common RDKit workflows:
+
+- `molecular_properties.py` - Calculate comprehensive molecular properties and descriptors
+- `similarity_search.py` - Perform fingerprint-based similarity screening
+- `substructure_filter.py` - Filter molecules by substructure patterns
+
+These scripts can be executed directly or used as templates for custom workflows.