Files
2025-11-30 08:30:10 +08:00

569 lines
16 KiB
Markdown

---
name: pubchem-database
description: "Query PubChem via PUG-REST API/PubChemPy (110M+ compounds). Search by name/CID/SMILES, retrieve properties, similarity/substructure searches, bioactivity, for cheminformatics."
---
# PubChem Database
## Overview
PubChem is the world's largest freely available chemical database with 110M+ compounds and 270M+ bioactivities. Query chemical structures by name, CID, or SMILES, retrieve molecular properties, perform similarity and substructure searches, access bioactivity data using PUG-REST API and PubChemPy.
## When to Use This Skill
This skill should be used when:
- Searching for chemical compounds by name, structure (SMILES/InChI), or molecular formula
- Retrieving molecular properties (MW, LogP, TPSA, hydrogen bonding descriptors)
- Performing similarity searches to find structurally related compounds
- Conducting substructure searches for specific chemical motifs
- Accessing bioactivity data from screening assays
- Converting between chemical identifier formats (CID, SMILES, InChI)
- Batch processing multiple compounds for drug-likeness screening or property analysis
## Core Capabilities
### 1. Chemical Structure Search
Search for compounds using multiple identifier types:
**By Chemical Name**:
```python
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
```
**By CID (Compound ID)**:
```python
compound = pcp.Compound.from_cid(2244) # Aspirin
```
**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```
**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
```
**By Molecular Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
```
### 2. Property Retrieval
Retrieve molecular properties for compounds using either high-level or low-level approaches:
**Using PubChemPy (Recommended)**:
```python
import pubchempy as pcp
# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]
# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp # Partition coefficient
tpsa = compound.tpsa # Topological polar surface area
```
**Get Specific Properties**:
```python
# Request only specific properties
properties = pcp.get_properties(
['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
'aspirin',
'name'
)
# Returns list of dictionaries
```
**Batch Property Retrieval**:
```python
import pandas as pd
compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []
for name in compound_names:
props = pcp.get_properties(
['MolecularFormula', 'MolecularWeight', 'XLogP'],
name,
'name'
)
all_properties.extend(props)
df = pd.DataFrame(all_properties)
```
**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for complete list).
### 3. Similarity Search
Find structurally similar compounds using Tanimoto similarity:
```python
import pubchempy as pcp
# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles
# Perform similarity search
similar_compounds = pcp.get_compounds(
query_smiles,
'smiles',
searchtype='similarity',
Threshold=85, # Similarity threshold (0-100)
MaxRecords=50
)
# Process results
for compound in similar_compounds[:10]:
print(f"CID {compound.cid}: {compound.iupac_name}")
print(f" MW: {compound.molecular_weight}")
```
**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
### 4. Substructure Search
Find compounds containing a specific structural motif:
```python
import pubchempy as pcp
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
pyridine_smiles,
'smiles',
searchtype='substructure',
MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
```
**Common Substructures**:
- Benzene ring: `c1ccccc1`
- Pyridine: `c1ccncc1`
- Phenol: `c1ccc(O)cc1`
- Carboxylic acid: `C(=O)O`
### 5. Format Conversion
Convert between different chemical structure formats:
```python
import pubchempy as pcp
compound = pcp.get_compounds('aspirin', 'name')[0]
# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
```
### 6. Structure Visualization
Generate 2D structure images:
```python
import pubchempy as pcp
# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)
# Using direct URL (via requests)
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
with open('structure.png', 'wb') as f:
f.write(response.content)
```
### 7. Synonym Retrieval
Get all known names and synonyms for a compound:
```python
import pubchempy as pcp
synonyms_data = pcp.get_synonyms('aspirin', 'name')
if synonyms_data:
cid = synonyms_data[0]['CID']
synonyms = synonyms_data[0]['Synonym']
print(f"CID {cid} has {len(synonyms)} synonyms:")
for syn in synonyms[:10]: # First 10
print(f" - {syn}")
```
### 8. Bioactivity Data Access
Retrieve biological activity data from assays:
```python
import requests
import json
# Get bioassay summary for a compound
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
data = response.json()
# Process bioassay information
table = data.get('Table', {})
rows = table.get('Row', [])
print(f"Found {len(rows)} bioassay records")
```
**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script which provides:
- Bioassay summaries with activity outcome filtering
- Assay target identification
- Search for compounds by biological target
- Active compound lists for specific assays
### 9. Comprehensive Compound Annotations
Access detailed compound information through PUG-View:
```python
import requests
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)
if response.status_code == 200:
annotations = response.json()
# Contains extensive data including:
# - Chemical and Physical Properties
# - Drug and Medication Information
# - Pharmacology and Biochemistry
# - Safety and Hazards
# - Toxicity
# - Literature references
# - Patents
```
**Get Specific Section**:
```python
# Get only drug information
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug and Medication Information"
```
## Installation Requirements
Install PubChemPy for Python-based access:
```bash
uv pip install pubchempy
```
For direct API access and bioactivity queries:
```bash
uv pip install requests
```
Optional for data analysis:
```bash
uv pip install pandas
```
## Helper Scripts
This skill includes Python scripts for common PubChem tasks:
### scripts/compound_search.py
Provides utility functions for searching and retrieving compound information:
**Key Functions**:
- `search_by_name(name, max_results=10)`: Search compounds by name
- `search_by_smiles(smiles)`: Search by SMILES string
- `get_compound_by_cid(cid)`: Retrieve compound by CID
- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
- `substructure_search(smiles, max_records)`: Perform substructure search
- `get_synonyms(identifier, namespace)`: Get all synonyms
- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
- `download_structure(identifier, namespace, format, filename)`: Download structures
- `print_compound_info(compound)`: Print formatted compound information
**Usage**:
```python
from scripts.compound_search import search_by_name, get_compound_properties
# Search for a compound
compounds = search_by_name('ibuprofen')
# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
```
### scripts/bioactivity_query.py
Provides functions for retrieving biological activity data:
**Key Functions**:
- `get_bioassay_summary(cid)`: Get bioassay summary for compound
- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
- `get_assay_description(aid)`: Get detailed assay information
- `get_assay_targets(aid)`: Get biological targets for assay
- `search_assays_by_target(target_name, max_results)`: Find assays by target
- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
- `get_compound_annotations(cid, section)`: Get PUG-View annotations
- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target
**Usage**:
```python
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
# Get bioactivity summary
summary = summarize_bioactivities(2244) # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
```
## API Rate Limits and Best Practices
**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute
**Best Practices**:
1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results locally**: Store frequently accessed data
3. **Batch requests**: Combine multiple queries when possible
4. **Implement delays**: Add 0.2-0.3 second delays between requests
5. **Handle errors gracefully**: Check for HTTP errors and missing data
6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
7. **Leverage asynchronous pattern**: For large similarity/substructure searches
8. **Specify MaxRecords**: Limit results to avoid timeouts
**Error Handling**:
```python
from pubchempy import BadRequestError, NotFoundError, TimeoutError
try:
compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
print("Compound not found")
except BadRequestError:
print("Invalid request format")
except TimeoutError:
print("Request timed out - try reducing scope")
except IndexError:
print("No results returned")
```
## Common Workflows
### Workflow 1: Chemical Identifier Conversion Pipeline
Convert between different chemical identifiers:
```python
import pubchempy as pcp
# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]
# Extract all identifier formats
identifiers = {
'CID': compound.cid,
'Name': compound.iupac_name,
'SMILES': compound.canonical_smiles,
'InChI': compound.inchi,
'InChIKey': compound.inchikey,
'Formula': compound.molecular_formula
}
```
### Workflow 2: Drug-Like Property Screening
Screen compounds using Lipinski's Rule of Five:
```python
import pubchempy as pcp
def check_drug_likeness(compound_name):
compound = pcp.get_compounds(compound_name, 'name')[0]
# Lipinski's Rule of Five
rules = {
'MW <= 500': compound.molecular_weight <= 500,
'LogP <= 5': compound.xlogp <= 5 if compound.xlogp else None,
'HBD <= 5': compound.h_bond_donor_count <= 5,
'HBA <= 10': compound.h_bond_acceptor_count <= 10
}
violations = sum(1 for v in rules.values() if v is False)
return rules, violations
rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```
### Workflow 3: Finding Similar Drug Candidates
Identify structurally similar compounds to a known drug:
```python
import pubchempy as pcp
# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles
# Find similar compounds
similar = pcp.get_compounds(
reference_smiles,
'smiles',
searchtype='similarity',
Threshold=85,
MaxRecords=20
)
# Filter by drug-like properties
candidates = []
for comp in similar:
if comp.molecular_weight and 200 <= comp.molecular_weight <= 600:
if comp.xlogp and -1 <= comp.xlogp <= 5:
candidates.append(comp)
print(f"Found {len(candidates)} drug-like candidates")
```
### Workflow 4: Batch Compound Property Comparison
Compare properties across multiple compounds:
```python
import pubchempy as pcp
import pandas as pd
compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']
properties_list = []
for name in compound_list:
try:
compound = pcp.get_compounds(name, 'name')[0]
properties_list.append({
'Name': name,
'CID': compound.cid,
'Formula': compound.molecular_formula,
'MW': compound.molecular_weight,
'LogP': compound.xlogp,
'TPSA': compound.tpsa,
'HBD': compound.h_bond_donor_count,
'HBA': compound.h_bond_acceptor_count
})
except Exception as e:
print(f"Error processing {name}: {e}")
df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```
### Workflow 5: Substructure-Based Virtual Screening
Screen for compounds containing specific pharmacophores:
```python
import pubchempy as pcp
# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'
# Search for compounds containing this substructure
hits = pcp.get_compounds(
pharmacophore_smiles,
'smiles',
searchtype='substructure',
MaxRecords=100
)
# Further filter by properties
filtered_hits = [
comp for comp in hits
if comp.molecular_weight and comp.molecular_weight < 500
]
print(f"Found {len(filtered_hits)} compounds with desired substructure")
```
## Reference Documentation
For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:
- Complete PUG-REST API endpoint documentation
- Full list of available molecular properties
- Asynchronous request handling patterns
- PubChemPy API reference
- PUG-View API for annotations
- Common workflows and use cases
- Links to official PubChem documentation
## Troubleshooting
**Compound Not Found**:
- Try alternative names or synonyms
- Use CID if known
- Check spelling and chemical name format
**Timeout Errors**:
- Reduce MaxRecords parameter
- Add delays between requests
- Use CIDs instead of names for faster queries
**Empty Property Values**:
- Not all properties are available for all compounds
- Check if property exists before accessing: `if compound.xlogp:`
- Some properties only available for certain compound types
**Rate Limit Exceeded**:
- Implement delays (0.2-0.3 seconds) between requests
- Use batch operations where possible
- Consider caching results locally
**Similarity/Substructure Search Hangs**:
- These are asynchronous operations that may take 15-30 seconds
- PubChemPy handles polling automatically
- Reduce MaxRecords if timing out
## Additional Resources
- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy