Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/pubchem-database/SKILL.md
+++ b/skills/pubchem-database/SKILL.md
@@ -0,0 +1,568 @@
+---
+name: pubchem-database
+description: "Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, run similarity/substructure searches, and access bioactivity data for cheminformatics."
+---
+
+# PubChem Database
+
+## Overview
+
+PubChem is the world's largest freely available chemical database, with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES; retrieve molecular properties; perform similarity and substructure searches; and access bioactivity data using the PUG-REST API and PubChemPy.
+
+## When to Use This Skill
+
+This skill should be used when:
+- Searching for chemical compounds by name, structure (SMILES/InChI), or molecular formula
+- Retrieving molecular properties (MW, LogP, TPSA, hydrogen bonding descriptors)
+- Performing similarity searches to find structurally related compounds
+- Conducting substructure searches for specific chemical motifs
+- Accessing bioactivity data from screening assays
+- Converting between chemical identifier formats (CID, SMILES, InChI)
+- Batch processing multiple compounds for drug-likeness screening or property analysis
+
+## Core Capabilities
+
+### 1. Chemical Structure Search
+
+Search for compounds using multiple identifier types:
+
+**By Chemical Name**:
+```python
+import pubchempy as pcp
+compounds = pcp.get_compounds('aspirin', 'name')
+compound = compounds[0]
+```
+
+**By CID (Compound ID)**:
+```python
+compound = pcp.Compound.from_cid(2244)  # Aspirin
+```
+
+**By SMILES**:
+```python
+compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
+```
+
+**By InChI**:
+```python
+compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
+```
+
+**By Molecular Formula**:
+```python
+compounds = pcp.get_compounds('C9H8O4', 'formula')
+# Returns all compounds matching this formula
+```
+
+### 2. Property Retrieval
+
+Retrieve molecular properties for compounds using either high-level or low-level approaches:
+
+**Using PubChemPy (Recommended)**:
+```python
+import pubchempy as pcp
+
+# Get compound object with all properties
+compound = pcp.get_compounds('caffeine', 'name')[0]
+
+# Access individual properties
+molecular_formula = compound.molecular_formula
+molecular_weight = compound.molecular_weight
+iupac_name = compound.iupac_name
+smiles = compound.canonical_smiles
+inchi = compound.inchi
+xlogp = compound.xlogp  # Partition coefficient
+tpsa = compound.tpsa    # Topological polar surface area
+```
+
+**Get Specific Properties**:
+```python
+# Request only specific properties
+properties = pcp.get_properties(
+    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
+    'aspirin',
+    'name'
+)
+# Returns list of dictionaries
+```
+
+**Batch Property Retrieval**:
+```python
+import pandas as pd
+
+compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
+all_properties = []
+
+for name in compound_names:
+    props = pcp.get_properties(
+        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
+        name,
+        'name'
+    )
+    all_properties.extend(props)
+
+df = pd.DataFrame(all_properties)
+```
+
+**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for complete list).
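+
+When CIDs are already known, the per-name loop in the batch example can often be collapsed into far fewer requests. The sketch below assumes that `pcp.get_properties` accepts a list of CIDs in the `cid` namespace and sends them together; the helper names and the chunk size of 100 are illustrative choices, not PubChem requirements.
+
+```python
+def chunked(items, size):
+    """Yield successive lists of at most `size` items."""
+    for start in range(0, len(items), size):
+        yield list(items[start:start + size])
+
+def fetch_properties_by_cid(cids, properties, chunk_size=100):
+    """Fetch properties for many CIDs, one request per chunk (requires network)."""
+    import pubchempy as pcp  # Imported lazily so chunked() works without PubChemPy
+
+    results = []
+    for chunk in chunked(cids, chunk_size):
+        # Assumption: a list of CIDs is sent as a single multi-compound query
+        results.extend(pcp.get_properties(properties, chunk, 'cid'))
+    return results
+
+# Example call (commented out because it needs network access):
+# rows = fetch_properties_by_cid([2244, 3672, 1983], ['MolecularFormula', 'XLogP'])
+```
+
+Chunking keeps each request small enough to avoid timeouts while still reducing the number of round trips.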
+
+### 3. Similarity Search
+
+Find structurally similar compounds using Tanimoto similarity:
+
+```python
+import pubchempy as pcp
+
+# Start with a query compound
+query_compound = pcp.get_compounds('gefitinib', 'name')[0]
+query_smiles = query_compound.canonical_smiles
+
+# Perform similarity search
+similar_compounds = pcp.get_compounds(
+    query_smiles,
+    'smiles',
+    searchtype='similarity',
+    Threshold=85,  # Similarity threshold (0-100)
+    MaxRecords=50
+)
+
+# Process results
+for compound in similar_compounds[:10]:
+    print(f"CID {compound.cid}: {compound.iupac_name}")
+    print(f"  MW: {compound.molecular_weight}")
+```
+
+**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
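+
+For intuition about what the `Threshold` value measures, here is a minimal sketch of the Tanimoto coefficient on two sets of fingerprint bits. The bit positions are made up for illustration; PubChem computes the score server-side on its own fingerprints, so this is not how PubChemPy obtains its results.
+
+```python
+def tanimoto(bits_a, bits_b):
+    """Tanimoto coefficient: shared on-bits divided by all distinct on-bits."""
+    a, b = set(bits_a), set(bits_b)
+    union = a | b
+    if not union:
+        return 0.0  # Convention chosen here for two empty fingerprints
+    return len(a & b) / len(union)
+
+# Hypothetical on-bit positions for two fingerprints
+fp_query = {1, 4, 7, 9, 12}
+fp_hit = {1, 4, 7, 9, 15}
+
+score = tanimoto(fp_query, fp_hit)
+print(f"Tanimoto: {score:.2f} ({score * 100:.0f} on the 0-100 Threshold scale)")
+```
+
+These two fingerprints share four of six distinct bits, so the pair would fall below a `Threshold` of 85.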
+
+### 4. Substructure Search
+
+Find compounds containing a specific structural motif:
+
+```python
+import pubchempy as pcp
+
+# Search for compounds containing pyridine ring
+pyridine_smiles = 'c1ccncc1'
+
+matches = pcp.get_compounds(
+    pyridine_smiles,
+    'smiles',
+    searchtype='substructure',
+    MaxRecords=100
+)
+
+print(f"Found {len(matches)} compounds containing pyridine")
+```
+
+**Common Substructures**:
+- Benzene ring: `c1ccccc1`
+- Pyridine: `c1ccncc1`
+- Phenol: `c1ccc(O)cc1`
+- Carboxylic acid: `C(=O)O`
+
+### 5. Format Conversion
+
+Convert between different chemical structure formats:
+
+```python
+import pubchempy as pcp
+
+compound = pcp.get_compounds('aspirin', 'name')[0]
+
+# Convert to different formats
+smiles = compound.canonical_smiles
+inchi = compound.inchi
+inchikey = compound.inchikey
+cid = compound.cid
+
+# Download structure files: arguments are (format, output path, identifier, namespace)
+pcp.download('SDF', 'aspirin.sdf', 'aspirin', 'name', overwrite=True)
+pcp.download('JSON', 'aspirin.json', 2244, 'cid', overwrite=True)
+```
+
+### 6. Structure Visualization
+
+Generate 2D structure images:
+
+```python
+import pubchempy as pcp
+
+# Download compound structure as PNG (output path comes before the identifier)
+pcp.download('PNG', 'caffeine.png', 'caffeine', 'name', overwrite=True)
+
+# Using direct URL (via requests)
+import requests
+
+cid = 2244  # Aspirin
+url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
+response = requests.get(url)
+
+with open('structure.png', 'wb') as f:
+    f.write(response.content)
+```
+
+### 7. Synonym Retrieval
+
+Get all known names and synonyms for a compound:
+
+```python
+import pubchempy as pcp
+
+synonyms_data = pcp.get_synonyms('aspirin', 'name')
+
+if synonyms_data:
+    cid = synonyms_data[0]['CID']
+    synonyms = synonyms_data[0]['Synonym']
+
+    print(f"CID {cid} has {len(synonyms)} synonyms:")
+    for syn in synonyms[:10]:  # First 10
+        print(f"  - {syn}")
+```
+
+### 8. Bioactivity Data Access
+
+Retrieve biological activity data from assays:
+
+```python
+import requests
+
+# Get bioassay summary for a compound
+cid = 2244  # Aspirin
+url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
+
+response = requests.get(url)
+if response.status_code == 200:
+    data = response.json()
+    # Process bioassay information
+    table = data.get('Table', {})
+    rows = table.get('Row', [])
+    print(f"Found {len(rows)} bioassay records")
+```
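+
+The response is a flat table, so it usually helps to turn it into dictionaries before filtering. The sketch below assumes the layout `Table -> Columns -> Column` for header names and `Table -> Row -> Cell` for values, and that one column is named `Activity Outcome`; check a real response before relying on either. The sample payload is a placeholder, not real PubChem data.
+
+```python
+from collections import Counter
+
+def table_to_records(data):
+    """Pair each row's cells with the column names (assumed layout, see above)."""
+    table = data.get('Table', {})
+    columns = table.get('Columns', {}).get('Column', [])
+    return [dict(zip(columns, row.get('Cell', []))) for row in table.get('Row', [])]
+
+def count_outcomes(records, column='Activity Outcome'):
+    """Tally records by activity outcome; the column name is an assumption."""
+    return Counter(rec.get(column, 'Unknown') for rec in records)
+
+# Placeholder payload shaped like the assumed response
+sample = {
+    'Table': {
+        'Columns': {'Column': ['AID', 'CID', 'Activity Outcome']},
+        'Row': [
+            {'Cell': ['1', '2244', 'Active']},
+            {'Cell': ['2', '2244', 'Inactive']},
+            {'Cell': ['3', '2244', 'Inactive']},
+        ],
+    }
+}
+records = table_to_records(sample)
+outcomes = count_outcomes(records)
+print(dict(outcomes))
+```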
+
+**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script which provides:
+- Bioassay summaries with activity outcome filtering
+- Assay target identification
+- Search for compounds by biological target
+- Active compound lists for specific assays
+
+### 9. Comprehensive Compound Annotations
+
+Access detailed compound information through PUG-View:
+
+```python
+import requests
+
+cid = 2244
+url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
+
+response = requests.get(url)
+if response.status_code == 200:
+    annotations = response.json()
+    # Contains extensive data including:
+    # - Chemical and Physical Properties
+    # - Drug and Medication Information
+    # - Pharmacology and Biochemistry
+    # - Safety and Hazards
+    # - Toxicity
+    # - Literature references
+    # - Patents
+```
+
+**Get Specific Section**:
+```python
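+# Get only drug information; pass the heading via params so the spaces are URL-encoded
+url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
+response = requests.get(url, params={'heading': 'Drug and Medication Information'})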
+```
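+
+PUG-View responses are deeply nested, so locating one heading by hand is tedious. The helper below is a sketch that assumes sections live under `Record -> Section`, that each section carries a `TOCHeading`, and that subsections nest under a further `Section` key. `find_section` is a hypothetical name, and the sample dictionary is a placeholder rather than real PubChem output.
+
+```python
+def find_section(sections, heading):
+    """Depth-first search for the first section whose TOCHeading matches."""
+    for section in sections:
+        if section.get('TOCHeading') == heading:
+            return section
+        found = find_section(section.get('Section', []), heading)
+        if found is not None:
+            return found
+    return None
+
+# Placeholder structure shaped like the assumed response
+sample = {
+    'Record': {
+        'Section': [
+            {'TOCHeading': 'Names and Identifiers', 'Section': [
+                {'TOCHeading': 'Molecular Formula', 'Information': []},
+            ]},
+            {'TOCHeading': 'Safety and Hazards', 'Section': []},
+        ]
+    }
+}
+top_level = sample['Record']['Section']
+hit = find_section(top_level, 'Molecular Formula')
+print(hit['TOCHeading'] if hit else 'not found')
+```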
+
+## Installation Requirements
+
+Install PubChemPy for Python-based access:
+
+```bash
+uv pip install pubchempy
+```
+
+For direct API access and bioactivity queries:
+
+```bash
+uv pip install requests
+```
+
+Optional for data analysis:
+
+```bash
+uv pip install pandas
+```
+
+## Helper Scripts
+
+This skill includes Python scripts for common PubChem tasks:
+
+### scripts/compound_search.py
+
+Provides utility functions for searching and retrieving compound information:
+
+**Key Functions**:
+- `search_by_name(name, max_results=10)`: Search compounds by name
+- `search_by_smiles(smiles)`: Search by SMILES string
+- `get_compound_by_cid(cid)`: Retrieve compound by CID
+- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
+- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
+- `substructure_search(smiles, max_records)`: Perform substructure search
+- `get_synonyms(identifier, namespace)`: Get all synonyms
+- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
+- `download_structure(identifier, namespace, format, filename)`: Download structures
+- `print_compound_info(compound)`: Print formatted compound information
+
+**Usage**:
+```python
+from scripts.compound_search import search_by_name, get_compound_properties
+
+# Search for a compound
+compounds = search_by_name('ibuprofen')
+
+# Get specific properties
+props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
+```
+
+### scripts/bioactivity_query.py
+
+Provides functions for retrieving biological activity data:
+
+**Key Functions**:
+- `get_bioassay_summary(cid)`: Get bioassay summary for compound
+- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
+- `get_assay_description(aid)`: Get detailed assay information
+- `get_assay_targets(aid)`: Get biological targets for assay
+- `search_assays_by_target(target_name, max_results)`: Find assays by target
+- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
+- `get_compound_annotations(cid, section)`: Get PUG-View annotations
+- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
+- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target
+
+**Usage**:
+```python
+from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
+
+# Get bioactivity summary
+summary = summarize_bioactivities(2244)  # Aspirin
+print(f"Total assays: {summary['total_assays']}")
+print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
+```
+
+## API Rate Limits and Best Practices
+
+**Rate Limits**:
+- Maximum 5 requests per second
+- Maximum 400 requests per minute
+- Maximum 300 seconds running time per minute
+
+**Best Practices**:
+1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
+2. **Cache results locally**: Store frequently accessed data
+3. **Batch requests**: Combine multiple queries when possible
+4. **Implement delays**: Add 0.2-0.3 second delays between requests
+5. **Handle errors gracefully**: Check for HTTP errors and missing data
+6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
+7. **Leverage asynchronous pattern**: For large similarity/substructure searches
+8. **Specify MaxRecords**: Limit results to avoid timeouts
+
+**Error Handling**:
+```python
+import pubchempy as pcp
+from pubchempy import BadRequestError, NotFoundError, TimeoutError
+
+try:
+    compound = pcp.get_compounds('query', 'name')[0]
+except NotFoundError:
+    print("Compound not found")
+except BadRequestError:
+    print("Invalid request format")
+except TimeoutError:
+    print("Request timed out - try reducing scope")
+except IndexError:
+    print("No results returned")
+```
+
+## Common Workflows
+
+### Workflow 1: Chemical Identifier Conversion Pipeline
+
+Convert between different chemical identifiers:
+
+```python
+import pubchempy as pcp
+
+# Start with any identifier type
+compound = pcp.get_compounds('caffeine', 'name')[0]
+
+# Extract all identifier formats
+identifiers = {
+    'CID': compound.cid,
+    'Name': compound.iupac_name,
+    'SMILES': compound.canonical_smiles,
+    'InChI': compound.inchi,
+    'InChIKey': compound.inchikey,
+    'Formula': compound.molecular_formula
+}
+```
+
+### Workflow 2: Drug-Like Property Screening
+
+Screen compounds using Lipinski's Rule of Five:
+
+```python
+import pubchempy as pcp
+
+def check_drug_likeness(compound_name):
+    compound = pcp.get_compounds(compound_name, 'name')[0]
+
+    # Lipinski's Rule of Five
+    rules = {
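+        # float() guards against the weight being returned as a string
+        'MW <= 500': float(compound.molecular_weight) <= 500,
+        # Compare against None so that a legitimate XLogP of 0 is still evaluated
+        'LogP <= 5': compound.xlogp <= 5 if compound.xlogp is not None else None,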
+        'HBD <= 5': compound.h_bond_donor_count <= 5,
+        'HBA <= 10': compound.h_bond_acceptor_count <= 10
+    }
+
+    violations = sum(1 for v in rules.values() if v is False)
+    return rules, violations
+
+rules, violations = check_drug_likeness('aspirin')
+print(f"Lipinski violations: {violations}")
+```
+
+### Workflow 3: Finding Similar Drug Candidates
+
+Identify structurally similar compounds to a known drug:
+
+```python
+import pubchempy as pcp
+
+# Start with known drug
+reference_drug = pcp.get_compounds('imatinib', 'name')[0]
+reference_smiles = reference_drug.canonical_smiles
+
+# Find similar compounds
+similar = pcp.get_compounds(
+    reference_smiles,
+    'smiles',
+    searchtype='similarity',
+    Threshold=85,
+    MaxRecords=20
+)
+
+# Filter by drug-like properties
+candidates = []
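+for comp in similar:
+    if comp.molecular_weight is None or comp.xlogp is None:
+        continue  # Skip compounds with missing properties
+    # float() guards against the weight being returned as a string
+    if 200 <= float(comp.molecular_weight) <= 600 and -1 <= comp.xlogp <= 5:
+        candidates.append(comp)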
+
+print(f"Found {len(candidates)} drug-like candidates")
+```
+
+### Workflow 4: Batch Compound Property Comparison
+
+Compare properties across multiple compounds:
+
+```python
+import pubchempy as pcp
+import pandas as pd
+
+compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']
+
+properties_list = []
+for name in compound_list:
+    try:
+        compound = pcp.get_compounds(name, 'name')[0]
+        properties_list.append({
+            'Name': name,
+            'CID': compound.cid,
+            'Formula': compound.molecular_formula,
+            'MW': compound.molecular_weight,
+            'LogP': compound.xlogp,
+            'TPSA': compound.tpsa,
+            'HBD': compound.h_bond_donor_count,
+            'HBA': compound.h_bond_acceptor_count
+        })
+    except Exception as e:
+        print(f"Error processing {name}: {e}")
+
+df = pd.DataFrame(properties_list)
+print(df.to_string(index=False))
+```
+
+### Workflow 5: Substructure-Based Virtual Screening
+
+Screen for compounds containing specific pharmacophores:
+
+```python
+import pubchempy as pcp
+
+# Define pharmacophore (e.g., sulfonamide group)
+pharmacophore_smiles = 'S(=O)(=O)N'
+
+# Search for compounds containing this substructure
+hits = pcp.get_compounds(
+    pharmacophore_smiles,
+    'smiles',
+    searchtype='substructure',
+    MaxRecords=100
+)
+
+# Further filter by properties
+filtered_hits = [
+    comp for comp in hits
+    if comp.molecular_weight is not None and float(comp.molecular_weight) < 500
+]
+
+print(f"Found {len(filtered_hits)} compounds with desired substructure")
+```
+
+## Reference Documentation
+
+For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:
+
+- Complete PUG-REST API endpoint documentation
+- Full list of available molecular properties
+- Asynchronous request handling patterns
+- PubChemPy API reference
+- PUG-View API for annotations
+- Common workflows and use cases
+- Links to official PubChem documentation
+
+## Troubleshooting
+
+**Compound Not Found**:
+- Try alternative names or synonyms
+- Use CID if known
+- Check spelling and chemical name format
+
+**Timeout Errors**:
+- Reduce MaxRecords parameter
+- Add delays between requests
+- Use CIDs instead of names for faster queries
+
+**Empty Property Values**:
+- Not all properties are available for all compounds
+- Check whether a property exists before using it: `if compound.xlogp is not None:` (a plain truthiness test would wrongly skip a value of 0)
+- Some properties only available for certain compound types
+
+**Rate Limit Exceeded**:
+- Implement delays (0.2-0.3 seconds) between requests
+- Use batch operations where possible
+- Consider caching results locally
+
+**Similarity/Substructure Search Hangs**:
+- These are asynchronous operations that may take 15-30 seconds
+- PubChemPy handles polling automatically
+- Reduce MaxRecords if timing out
+
+## Additional Resources
+
+- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
+- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
+- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
+- PubChemPy Documentation: https://pubchempy.readthedocs.io/
+- PubChemPy GitHub: https://github.com/mcs07/PubChemPy