Initial commit
skills/pubchem-database/SKILL.md
---
name: pubchem-database
description: "Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics."
---

# PubChem Database

## Overview

PubChem is the world's largest freely available chemical database, with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES; retrieve molecular properties; perform similarity and substructure searches; and access bioactivity data through the PUG-REST API and PubChemPy.

## When to Use This Skill

This skill should be used when:
- Searching for chemical compounds by name, structure (SMILES/InChI), or molecular formula
- Retrieving molecular properties (MW, LogP, TPSA, hydrogen bonding descriptors)
- Performing similarity searches to find structurally related compounds
- Conducting substructure searches for specific chemical motifs
- Accessing bioactivity data from screening assays
- Converting between chemical identifier formats (CID, SMILES, InChI)
- Batch processing multiple compounds for drug-likeness screening or property analysis

## Core Capabilities

### 1. Chemical Structure Search

Search for compounds using multiple identifier types:

**By Chemical Name**:
```python
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
```

**By CID (Compound ID)**:
```python
compound = pcp.Compound.from_cid(2244)  # Aspirin
```

**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```

**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
```

**By Molecular Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
```

### 2. Property Retrieval

Retrieve molecular properties for compounds using either high-level or low-level approaches:

**Using PubChemPy (Recommended)**:
```python
import pubchempy as pcp

# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]

# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp  # Partition coefficient
tpsa = compound.tpsa  # Topological polar surface area
```

**Get Specific Properties**:
```python
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```

**Batch Property Retrieval**:
```python
import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []

# One request per name; each call returns a list of property dictionaries
for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)

df = pd.DataFrame(all_properties)
```

**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for the complete list).
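
When the CIDs are already known, PUG-REST accepts a comma-separated CID list in a single property request, which is one way to reduce the number of calls compared with a per-compound loop. The sketch below only builds such a URL and does not touch the network. The helper name `build_property_url` is hypothetical, and CID 3672 is assumed to be ibuprofen.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(cids, properties, output="JSON"):
    """Build one PUG-REST property URL covering several CIDs (hypothetical helper)."""
    if not cids or not properties:
        raise ValueError("need at least one CID and one property name")
    # CIDs are integers; int() rejects anything that is not a plain number
    cid_part = ",".join(str(int(cid)) for cid in cids)
    # Property names are plain identifiers, but quote them defensively
    prop_part = ",".join(quote(name, safe="") for name in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid_part}/property/{prop_part}/{output}"


# 2244 is aspirin (as in the examples above); 3672 is assumed to be ibuprofen
url = build_property_url([2244, 3672], ["MolecularWeight", "XLogP"])
print(url)
```

Fetching the URL with `requests.get(url, timeout=30).json()` should yield the rows under `PropertyTable` → `Properties`, but verify the shape against a live response before relying on it.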

### 3. Similarity Search

Find structurally similar compounds using Tanimoto similarity:

```python
import pubchempy as pcp

# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)

# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f"  MW: {compound.molecular_weight}")
```

**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
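
For intuition about the `Threshold` value: the Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below uses toy fingerprints, written as Python sets of bit positions. They are made up for illustration and are not PubChem's actual fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Toy fingerprints (illustrative only)
fp_query = {1, 4, 7, 9, 12}
fp_hit = {1, 4, 7, 9, 15}

score = tanimoto(fp_query, fp_hit)  # 4 shared bits out of 6 distinct bits
passes = score * 100 >= 85          # compare on the same 0-100 scale as Threshold
print(f"Tanimoto: {score:.2f}, passes 85% threshold: {passes}")
```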

### 4. Substructure Search

Find compounds containing a specific structural motif:

```python
import pubchempy as pcp

# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'

matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

print(f"Found {len(matches)} compounds containing pyridine")
```

**Common Substructures**:
- Benzene ring: `c1ccccc1`
- Pyridine: `c1ccncc1`
- Phenol: `c1ccc(O)cc1`
- Carboxylic acid: `C(=O)O`

### 5. Format Conversion

Convert between different chemical structure formats:

```python
import pubchempy as pcp

compound = pcp.get_compounds('aspirin', 'name')[0]

# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid

# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
```

### 6. Structure Visualization

Generate 2D structure images:

```python
import pubchempy as pcp

# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)

# Using direct URL (via requests)
import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Avoid writing an error body to disk as a PNG

with open('structure.png', 'wb') as f:
    f.write(response.content)
```

### 7. Synonym Retrieval

Get all known names and synonyms for a compound:

```python
import pubchempy as pcp

synonyms_data = pcp.get_synonyms('aspirin', 'name')

if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']

    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f"  - {syn}")
```

### 8. Bioactivity Data Access

Retrieve biological activity data from assays:

```python
import requests

# Get bioassay summary for a compound
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"

response = requests.get(url, timeout=30)
if response.status_code == 200:
    data = response.json()
    # Process bioassay information
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
```
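
To work with the summary beyond counting rows, the table can be turned into one dictionary per record. The sketch below assumes the response carries column names under `Table` → `Columns` → `Column` and each row's values under `Cell`. The helper names and the sample payload are made up for illustration, so check the column names against a real response.

```python
def table_to_records(data):
    """Pair each row's cells with the column names (assumed assaysummary layout)."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    records = []
    for row in table.get("Row", []):
        cells = row.get("Cell", [])
        records.append(dict(zip(columns, cells)))
    return records


def count_outcomes(records, column="Activity Outcome"):
    """Count records per activity outcome; the column name is an assumption."""
    counts = {}
    for record in records:
        outcome = record.get(column, "Unknown")
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts


# Tiny made-up payload in the assumed shape (not real PubChem data)
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "CID", "Activity Outcome"]},
        "Row": [
            {"Cell": ["1", "2244", "Active"]},
            {"Cell": ["2", "2244", "Inactive"]},
            {"Cell": ["3", "2244", "Inactive"]},
        ],
    }
}

records = table_to_records(sample)
outcome_counts = count_outcomes(records)
print(outcome_counts)
```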

**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script, which provides:
- Bioassay summaries with activity outcome filtering
- Assay target identification
- Search for compounds by biological target
- Active compound lists for specific assays

### 9. Comprehensive Compound Annotations

Access detailed compound information through PUG-View:

```python
import requests

cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"

response = requests.get(url, timeout=30)
if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
```

**Get Specific Section**:
```python
# Get only drug information (spaces in the heading are URL-encoded as '+')
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug+and+Medication+Information"
```
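
A full PUG-View record is deeply nested, so pulling out one section usually means walking the tree. The sketch below assumes each section is a dictionary with a `TOCHeading` key and optional child sections under `Section`, all rooted at `Record`. The helper name `find_section` and the sample record are illustrative only.

```python
def find_section(sections, heading):
    """Depth-first search for the first section whose TOCHeading matches."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


# Made-up record in the assumed shape (not real PubChem data)
sample = {
    "Record": {
        "RecordTitle": "Example",
        "Section": [
            {
                "TOCHeading": "Chemical and Physical Properties",
                "Section": [
                    {"TOCHeading": "Computed Properties", "Information": []},
                ],
            },
            {"TOCHeading": "Safety and Hazards", "Section": []},
        ],
    }
}

hit = find_section(sample["Record"]["Section"], "Computed Properties")
print(hit["TOCHeading"] if hit else "not found")
```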

## Installation Requirements

Install PubChemPy for Python-based access:

```bash
uv pip install pubchempy
```

For direct API access and bioactivity queries:

```bash
uv pip install requests
```

Optional for data analysis:

```bash
uv pip install pandas
```

## Helper Scripts

This skill includes Python scripts for common PubChem tasks:

### scripts/compound_search.py

Provides utility functions for searching and retrieving compound information:

**Key Functions**:
- `search_by_name(name, max_results=10)`: Search compounds by name
- `search_by_smiles(smiles)`: Search by SMILES string
- `get_compound_by_cid(cid)`: Retrieve compound by CID
- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
- `substructure_search(smiles, max_records)`: Perform substructure search
- `get_synonyms(identifier, namespace)`: Get all synonyms
- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
- `download_structure(identifier, namespace, format, filename)`: Download structures
- `print_compound_info(compound)`: Print formatted compound information

**Usage**:
```python
from scripts.compound_search import search_by_name, get_compound_properties

# Search for a compound
compounds = search_by_name('ibuprofen')

# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
```

### scripts/bioactivity_query.py

Provides functions for retrieving biological activity data:

**Key Functions**:
- `get_bioassay_summary(cid)`: Get bioassay summary for compound
- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
- `get_assay_description(aid)`: Get detailed assay information
- `get_assay_targets(aid)`: Get biological targets for assay
- `search_assays_by_target(target_name, max_results)`: Find assays by target
- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
- `get_compound_annotations(cid, section)`: Get PUG-View annotations
- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target

**Usage**:
```python
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities

# Get bioactivity summary
summary = summarize_bioactivities(2244)  # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
```

## API Rate Limits and Best Practices

**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute

**Best Practices**:
1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results locally**: Store frequently accessed data
3. **Batch requests**: Combine multiple queries when possible
4. **Implement delays**: Add 0.2-0.3 second delays between requests
5. **Handle errors gracefully**: Check for HTTP errors and missing data
6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
7. **Leverage asynchronous pattern**: For large similarity/substructure searches
8. **Specify MaxRecords**: Limit results to avoid timeouts
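
The delay and caching advice above can be combined in a small wrapper. This is a minimal sketch and not part of PubChemPy or the helper scripts: the `Throttle` and `CachedFetcher` names are hypothetical, the cache is a plain in-memory dictionary, and the fetch function is injected so the example runs without network access.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (0.25 s is about 4 requests/second)."""

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


class CachedFetcher:
    """Call fetch(key) at most once per key, throttling only real fetches."""

    def __init__(self, fetch, throttle=None):
        self._fetch = fetch
        self._throttle = throttle or Throttle()
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._throttle.wait()
            self._cache[key] = self._fetch(key)
        return self._cache[key]


calls = []


def fake_fetch(cid):
    """Stand-in for a real PubChem request; records which CIDs were fetched."""
    calls.append(cid)
    return {"CID": cid}


fetcher = CachedFetcher(fake_fetch, Throttle(min_interval=0.01))
results = [fetcher.get(cid) for cid in (2244, 2244, 3672)]
print(f"{len(results)} lookups, {len(calls)} actual fetches")
```

In real use, `fake_fetch` would be replaced by a function that performs the request, for example one that wraps `pcp.Compound.from_cid`.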

**Error Handling**:
```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    # Raised by the [0] lookup when the result list is empty
    print("No results returned")
```

## Common Workflows

### Workflow 1: Chemical Identifier Conversion Pipeline

Convert between different chemical identifiers:

```python
import pubchempy as pcp

# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]

# Extract all identifier formats
identifiers = {
    'CID': compound.cid,
    'Name': compound.iupac_name,
    'SMILES': compound.canonical_smiles,
    'InChI': compound.inchi,
    'InChIKey': compound.inchikey,
    'Formula': compound.molecular_formula
}
```

### Workflow 2: Drug-Like Property Screening

Screen compounds using Lipinski's Rule of Five:

```python
import pubchempy as pcp

def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]

    # Some PubChemPy versions return molecular_weight as a string, so cast first
    mw = float(compound.molecular_weight)
    xlogp = compound.xlogp

    # Lipinski's Rule of Five
    rules = {
        'MW <= 500': mw <= 500,
        # XLogP may be missing (None); 0 is a valid value, so test against None
        'LogP <= 5': xlogp <= 5 if xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }

    # A rule whose value is None (unknown) is not counted as a violation
    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations

rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```
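
The same check can be applied to property dictionaries of the kind returned by `pcp.get_properties`, which avoids building a full compound object per molecule. The sketch below is one possible variant, not the skill's own implementation: `lipinski_violations` is a hypothetical helper, the rows are made-up values, and missing properties are skipped instead of counted as violations.

```python
def lipinski_violations(props):
    """Count Rule-of-Five violations for one property dict; missing values are skipped."""
    limits = {
        "MolecularWeight": 500,
        "XLogP": 5,
        "HBondDonorCount": 5,
        "HBondAcceptorCount": 10,
    }
    violations = 0
    for key, limit in limits.items():
        value = props.get(key)
        if value is None:
            continue  # property not available for this compound
        if float(value) > limit:  # float() also accepts numeric strings
            violations += 1
    return violations


# Made-up rows shaped like property dictionaries (not real PubChem data)
rows = [
    {"CID": 1, "MolecularWeight": "180.16", "XLogP": 1.2,
     "HBondDonorCount": 1, "HBondAcceptorCount": 4},
    {"CID": 2, "MolecularWeight": "612.7", "XLogP": 0,
     "HBondDonorCount": 7, "HBondAcceptorCount": 9},
    {"CID": 3, "MolecularWeight": "250.0",
     "HBondDonorCount": 2, "HBondAcceptorCount": 3},
]

for row in rows:
    print(row["CID"], lipinski_violations(row))
```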

### Workflow 3: Finding Similar Drug Candidates

Identify structurally similar compounds to a known drug:

```python
import pubchempy as pcp

# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles

# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)

# Filter by drug-like properties
candidates = []
for comp in similar:
    # Skip compounds with missing values; an XLogP of 0 is valid and must not be skipped
    if comp.molecular_weight is None or comp.xlogp is None:
        continue
    if 200 <= float(comp.molecular_weight) <= 600 and -1 <= comp.xlogp <= 5:
        candidates.append(comp)

print(f"Found {len(candidates)} drug-like candidates")
```

### Workflow 4: Batch Compound Property Comparison

Compare properties across multiple compounds:

```python
import pubchempy as pcp
import pandas as pd

compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']

properties_list = []
for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")

df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```

### Workflow 5: Substructure-Based Virtual Screening

Screen for compounds containing specific pharmacophores:

```python
import pubchempy as pcp

# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'

# Search for compounds containing this substructure
hits = pcp.get_compounds(
    pharmacophore_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

# Further filter by properties (cast in case molecular_weight is a string)
filtered_hits = [
    comp for comp in hits
    if comp.molecular_weight and float(comp.molecular_weight) < 500
]

print(f"Found {len(filtered_hits)} compounds with desired substructure")
```

## Reference Documentation

For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:

- Complete PUG-REST API endpoint documentation
- Full list of available molecular properties
- Asynchronous request handling patterns
- PubChemPy API reference
- PUG-View API for annotations
- Common workflows and use cases
- Links to official PubChem documentation

## Troubleshooting

**Compound Not Found**:
- Try alternative names or synonyms
- Use CID if known
- Check spelling and chemical name format

**Timeout Errors**:
- Reduce MaxRecords parameter
- Add delays between requests
- Use CIDs instead of names for faster queries

**Empty Property Values**:
- Not all properties are available for all compounds
- Check for a missing value before use: `if compound.xlogp is not None:` (a bare `if compound.xlogp:` also skips a valid value of 0)
- Some properties are only available for certain compound types

**Rate Limit Exceeded**:
- Implement delays (0.2-0.3 seconds) between requests
- Use batch operations where possible
- Consider caching results locally

**Similarity/Substructure Search Hangs**:
- These are asynchronous operations that may take 15-30 seconds
- PubChemPy handles polling automatically
- Reduce MaxRecords if timing out
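
When calling PUG-REST directly instead of through PubChemPy, the polling has to be done by hand. The sketch below assumes an asynchronous search answers with a `Waiting` object containing a `ListKey`, and that the finished result is read from the `compound/listkey/<key>/cids/JSON` endpoint as an `IdentifierList`. The function name `poll_listkey`, its parameters, and the canned responses are illustrative, and the fetch function is injected so the example runs offline.

```python
import time

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def poll_listkey(list_key, fetch_json, interval=2.0, max_attempts=15, sleep=time.sleep):
    """Poll a ListKey until CIDs are available (assumed response shapes)."""
    url = f"{PUG_REST_BASE}/compound/listkey/{list_key}/cids/JSON"
    for _ in range(max_attempts):
        payload = fetch_json(url)
        if "Waiting" not in payload:
            # Finished: return the CID list, or an empty list if none is present
            return payload.get("IdentifierList", {}).get("CID", [])
        sleep(interval)  # still running; wait before asking again
    raise TimeoutError(f"ListKey {list_key} not ready after {max_attempts} attempts")


# Canned responses standing in for the server (made up for illustration)
responses = iter([
    {"Waiting": {"ListKey": "12345"}},
    {"Waiting": {"ListKey": "12345"}},
    {"IdentifierList": {"CID": [2244, 3672]}},
])
requested = []


def fake_fetch_json(url):
    requested.append(url)
    return next(responses)


cids = poll_listkey("12345", fake_fetch_json, sleep=lambda seconds: None)
print(cids)
```

With `requests`, the injected function could be `lambda url: requests.get(url, timeout=30).json()`. Keep the polling interval well within the rate limits listed earlier.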

## Additional Resources

- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy