# PubChem API Reference

## Overview

PubChem is the world's largest freely available chemical database maintained by the National Center for Biotechnology Information (NCBI). It contains over 110 million unique chemical structures and over 270 million bioactivities from more than 770 data sources.

## Database Structure

PubChem consists of three primary subdatabases:

1. **Compound Database**: Unique validated chemical structures with computed properties
2. **Substance Database**: Deposited chemical substance records from data sources
3. **BioAssay Database**: Biological activity test results for chemical compounds

## PubChem PUG-REST API

### Base URL Structure

```
https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input>/<operation>/<output>
```

Components:
- `<input>`: compound/cid, substance/sid, assay/aid, or search specifications
- `<operation>`: Optional operations like property, synonyms, classification, etc.
- `<output>`: Format such as JSON, XML, CSV, PNG, SDF, etc.

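As a minimal sketch of how the three URL components fit together, the helper below joins them into a request URL. The function name `build_pug_url` and its parameters are hypothetical, introduced only for illustration; nothing is sent over the network.

```python
from urllib.parse import quote

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(input_spec, identifier, operation=None, output="JSON"):
    """Join the <input>/<operation>/<output> parts into one request URL.

    `input_spec` is a domain/namespace pair such as "compound/cid".
    The identifier is percent-encoded so that names with spaces stay valid;
    commas are kept so that comma-separated ID lists pass through unchanged.
    """
    parts = [BASE_URL, input_spec, quote(str(identifier), safe=",")]
    if operation:  # the operation segment is optional
        parts.append(operation)
    parts.append(output)
    return "/".join(parts)


# Hypothetical usage: only builds strings, no request is made.
print(build_pug_url("compound/cid", 2244, "property/MolecularFormula"))
print(build_pug_url("compound/name", "acetylsalicylic acid", "synonyms"))
```

Keeping URL construction in one place makes it easier to switch the output format (for example from JSON to SDF) without touching the calling code.
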
### Common Request Patterns

#### 1. Retrieve by Identifier

Get compound by CID (Compound ID):
```
GET /rest/pug/compound/cid/{cid}/property/{properties}/JSON
```

Get compound by name:
```
GET /rest/pug/compound/name/{name}/property/{properties}/JSON
```

Get compound by SMILES:
```
GET /rest/pug/compound/smiles/{smiles}/property/{properties}/JSON
```

Get compound by InChI:
```
GET /rest/pug/compound/inchi/{inchi}/property/{properties}/JSON
```

#### 2. Available Properties

Common molecular properties that can be retrieved:
- `MolecularFormula`
- `MolecularWeight`
- `CanonicalSMILES`
- `IsomericSMILES`
- `InChI`
- `InChIKey`
- `IUPACName`
- `XLogP`
- `ExactMass`
- `MonoisotopicMass`
- `TPSA` (Topological Polar Surface Area)
- `Complexity`
- `Charge`
- `HBondDonorCount`
- `HBondAcceptorCount`
- `RotatableBondCount`
- `HeavyAtomCount`
- `IsotopeAtomCount`
- `AtomStereoCount`
- `BondStereoCount`
- `CovalentUnitCount`
- `Volume3D`
- `XStericQuadrupole3D`
- `YStericQuadrupole3D`
- `ZStericQuadrupole3D`
- `FeatureCount3D`

To retrieve multiple properties, separate them with commas:
```
/property/MolecularFormula,MolecularWeight,CanonicalSMILES/JSON
```

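A property request returns one row per compound. The sketch below indexes such a response by CID, assuming the JSON is wrapped as `PropertyTable` → `Properties`; the helper name `properties_by_cid` and the hand-written `sample` payload are illustrative only and not real API output.

```python
def properties_by_cid(payload):
    """Index a property response by CID.

    Assumes the shape {"PropertyTable": {"Properties": [{"CID": ..., ...}]}}.
    """
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    return {
        row["CID"]: {key: value for key, value in row.items() if key != "CID"}
        for row in rows
    }


# Hand-written sample mirroring the assumed response shape (not fetched).
sample = {
    "PropertyTable": {
        "Properties": [
            {
                "CID": 2244,
                "MolecularFormula": "C9H8O4",
                "CanonicalSMILES": "CC(=O)OC1=CC=CC=C1C(=O)O",
            }
        ]
    }
}

table = properties_by_cid(sample)
print(table[2244]["MolecularFormula"])
# Properties missing from a row come back as None rather than raising.
print(table[2244].get("XLogP"))
```

Using `.get()` for individual properties is a deliberate choice here, since not every property is available for every compound.
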
#### 3. Structure Search Operations

**Similarity Search**:
```
POST /rest/pug/compound/similarity/smiles/{smiles}/JSON
Parameters: Threshold (default 90%)
```

**Substructure Search**:
```
POST /rest/pug/compound/substructure/smiles/{smiles}/cids/JSON
```

**Superstructure Search**:
```
POST /rest/pug/compound/superstructure/smiles/{smiles}/cids/JSON
```

#### 4. Image Generation

Get 2D structure image:
```
GET /rest/pug/compound/cid/{cid}/PNG
Optional parameters: image_size=small|large
```

#### 5. Format Conversion

Get compound as SDF (Structure-Data File):
```
GET /rest/pug/compound/cid/{cid}/SDF
```

Get the same record with the `record` operation written out. PUG-REST has no separate MOL output format; the MOL block is the first part of each SDF record:
```
GET /rest/pug/compound/cid/{cid}/record/SDF
```

#### 6. Synonym Retrieval

Get all synonyms for a compound:
```
GET /rest/pug/compound/cid/{cid}/synonyms/JSON
```

#### 7. Bioassay Data

Get bioassay data for a compound:
```
GET /rest/pug/compound/cid/{cid}/assaysummary/JSON
```

Get specific assay information:
```
GET /rest/pug/assay/aid/{aid}/description/JSON
```

### Asynchronous Requests

For large queries (similarity/substructure searches), PUG-REST uses an asynchronous pattern:

1. Submit the query (returns ListKey)
2. Check status using the ListKey
3. Retrieve results when ready

Example workflow:
```python
import time
import requests

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, used as the query structure

# Step 1: Submit similarity search (structure in the POST body, options in the query string)
response = requests.post(
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/similarity/smiles/cids/JSON",
    params={"Threshold": 90},
    data={"smiles": smiles},
)
listkey = response.json()["Waiting"]["ListKey"]

# Step 2: Check status
status_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/{listkey}/cids/JSON"

# Step 3: Poll until ready (with a 60-second timeout)
deadline = time.monotonic() + 60
while True:
    result = requests.get(status_url).json()
    if "Waiting" not in result or time.monotonic() > deadline:
        break
    time.sleep(2)

# Step 4: Retrieve results from the same URL (empty list if the search timed out)
cids = result.get("IdentifierList", {}).get("CID", [])
```

### Usage Limits

**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute

**Best Practices**:
- Use batch requests when possible
- Implement exponential backoff for retries
- Cache results when appropriate
- Use asynchronous pattern for large queries

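The limits and the backoff advice above can be sketched as two small helpers. This is one possible approach, not an official client: `Throttle` and `retry_with_backoff` are hypothetical names, and the clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` calls start per second."""

    def __init__(self, rate=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def retry_with_backoff(func, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice; sleep is replaced by a
# recorder so the example finishes instantly.
attempts = {"count": 0}
recorded_delays = []


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("server busy")
    return "ok"


print(retry_with_backoff(flaky_request, sleep=recorded_delays.append))
print(recorded_delays)
```

In real use, `Throttle.wait()` would be called before each HTTP request and the request itself wrapped in `retry_with_backoff`. Catching every `Exception` is a simplification; a production client would retry only on throttling or server errors.
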
## PubChemPy Python Library

PubChemPy is a Python wrapper that simplifies PUG-REST API access.

### Installation

```bash
pip install pubchempy
```

### Key Classes

#### Compound Class

Main class for representing chemical compounds:

```python
import pubchempy as pcp

# Get by CID
compound = pcp.Compound.from_cid(2244)

# Access properties
compound.molecular_formula  # 'C9H8O4'
compound.molecular_weight   # 180.16
compound.iupac_name         # '2-acetyloxybenzoic acid'
compound.canonical_smiles   # 'CC(=O)OC1=CC=CC=C1C(=O)O'
compound.isomeric_smiles    # Same as canonical for non-stereoisomers
compound.inchi              # InChI string
compound.inchikey           # InChI Key
compound.xlogp              # Partition coefficient
compound.tpsa               # Topological polar surface area
```

#### Search Methods

**By Name**:
```python
compounds = pcp.get_compounds('aspirin', 'name')
# Returns list of Compound objects
```

**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```

**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)', 'inchi')[0]
```

**By Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds with this formula
```

**Similarity Search**:
```python
results = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles',
                            searchtype='similarity',
                            Threshold=90)
```

**Substructure Search**:
```python
results = pcp.get_compounds('c1ccccc1', 'smiles',
                            searchtype='substructure')
# Returns compounds containing a benzene ring (a very large result set)
```

#### Property Retrieval

Get specific properties for multiple compounds:
```python
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```

Get properties as pandas DataFrame:
```python
import pandas as pd
df = pd.DataFrame(properties)
```

#### Synonyms

Get all synonyms for a compound:
```python
synonyms = pcp.get_synonyms('aspirin', 'name')
# Returns list of dictionaries with CID and synonym lists
```

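A common follow-up is to pick particular identifiers out of the synonym lists. The sketch below filters for strings that look like CAS Registry Numbers, assuming each returned dictionary has the keys `CID` and `Synonym`. The helper `cas_numbers` and the hand-written `sample` are illustrative, and the pattern checks only the digit layout, not the CAS check digit.

```python
import re

# Two to seven digits, two digits, one digit: the CAS Registry Number layout.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def cas_numbers(synonym_records):
    """Map each CID to the synonyms that look like CAS Registry Numbers."""
    found = {}
    for record in synonym_records:
        names = record.get("Synonym", [])
        found[record["CID"]] = [name for name in names if CAS_PATTERN.match(name)]
    return found


# Hand-written sample in the assumed shape of the get_synonyms() result.
sample = [
    {"CID": 2244, "Synonym": ["aspirin", "ACETYLSALICYLIC ACID", "50-78-2"]}
]

print(cas_numbers(sample))
```

With live data the same function could be applied directly to the list returned by `pcp.get_synonyms(...)`.
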
#### Download Formats

Download a compound in various formats. `pcp.download()` takes the output format, then the destination path, then the identifier and namespace; it writes the file and returns nothing:
```python
# Save as SDF
pcp.download('SDF', 'aspirin.sdf', 'aspirin', 'name', overwrite=True)

# Save as JSON
pcp.download('JSON', 'aspirin.json', 2244, 'cid', overwrite=True)

# Save as PNG image
pcp.download('PNG', 'aspirin.png', 2244, 'cid', overwrite=True)
```

### Error Handling

```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compounds = pcp.get_compounds('nonexistent', 'name')
    if not compounds:
        # get_compounds typically returns an empty list when nothing matches
        print("Compound not found")
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request")
except TimeoutError:
    print("Request timed out")
```

## PUG-View API

PUG-View provides access to full textual annotations and specialized reports.

### Key Endpoints

Get compound annotations:
```
GET /rest/pug_view/data/compound/{cid}/JSON
```

Get specific annotation sections:
```
GET /rest/pug_view/data/compound/{cid}/JSON?heading={section_name}
```

Available sections include:
- Chemical and Physical Properties
- Drug and Medication Information
- Pharmacology and Biochemistry
- Safety and Hazards
- Toxicity
- Literature
- Patents
- Biomolecular Interactions and Pathways

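Section names contain spaces, so the `heading` value needs URL encoding. A minimal sketch, with a hypothetical helper `pug_view_url` that only builds the URL and makes no request:

```python
from urllib.parse import quote, urlencode

PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data"


def pug_view_url(cid, heading=None):
    """Build a PUG-View compound URL, optionally limited to one section."""
    url = f"{PUG_VIEW_BASE}/compound/{int(cid)}/JSON"
    if heading:
        # quote_via=quote encodes spaces as %20 instead of "+".
        url += "?" + urlencode({"heading": heading}, quote_via=quote)
    return url


print(pug_view_url(2244))
print(pug_view_url(2244, "Safety and Hazards"))
```

Requesting a single heading keeps the response much smaller than the full annotation record, which helps when only one section is needed.
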
## Common Workflows

### 1. Chemical Identifier Conversion

Convert from name to SMILES to InChI:
```python
import pubchempy as pcp

compound = pcp.get_compounds('caffeine', 'name')[0]
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
```

### 2. Batch Property Retrieval

Get properties for multiple compounds:
```python
import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
properties = []

# One request per name; each call returns a list of property dictionaries
for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    properties.extend(props)

df = pd.DataFrame(properties)
```

### 3. Finding Similar Compounds

Find structurally similar compounds to a query:
```python
# Start with a known compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85
)

# Get properties for similar compounds
for compound in similar[:10]:  # First 10 results
    print(f"{compound.cid}: {compound.iupac_name}, MW: {compound.molecular_weight}")
```

### 4. Substructure Screening

Find all compounds containing a specific substructure:
```python
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'

matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

print(f"Found {len(matches)} compounds containing pyridine")
```

### 5. Bioactivity Data Retrieval

```python
import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"

response = requests.get(url)
if response.status_code == 200:
    bioassay_data = response.json()
    # Process bioassay information
```

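The "process bioassay information" step might look like the sketch below. It assumes the assay summary arrives as a table with a `Columns` → `Column` header list and `Row` entries holding `Cell` lists; that layout is an assumption to verify against a live response. The helper `table_to_records` is hypothetical, and the sample values are placeholders rather than real assay results.

```python
def table_to_records(payload):
    """Turn a column/row table payload into a list of dictionaries.

    Assumes {"Table": {"Columns": {"Column": [...]}, "Row": [{"Cell": [...]}]}}.
    """
    table = payload.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


# Placeholder payload in the assumed shape; the values are made up.
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "CID", "Activity Outcome"]},
        "Row": [
            {"Cell": ["1", "2244", "Inactive"]},
            {"Cell": ["2", "2244", "Active"]},
        ],
    }
}

records = table_to_records(sample)
active = [record for record in records if record["Activity Outcome"] == "Active"]
print(len(records), len(active))
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(...)` for filtering and grouping.
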
## Tips and Best Practices

1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results**: Store frequently accessed data locally
3. **Batch requests**: Combine multiple queries when possible
4. **Handle rate limits**: Implement delays between requests
5. **Use appropriate search types**: Similarity for related compounds, substructure for motif finding
6. **Leverage PubChemPy**: Higher-level abstraction simplifies common tasks
7. **Handle missing data**: Not all properties are available for all compounds
8. **Use asynchronous pattern**: For large similarity/substructure searches
9. **Specify output format**: Choose JSON for programmatic access, SDF for cheminformatics tools
10. **Read documentation**: Full PUG-REST documentation available at https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest

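Tip 3 can be illustrated with comma-separated CID lists, which PUG-REST accepts in place of a single CID. The sketch below splits a CID list into batches and builds one property URL per batch. The helper names and the default batch size of 100 are illustrative assumptions, not documented limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_property_urls(cids, properties, batch_size=100):
    """Build one property request URL per batch of CIDs."""
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"
    props = ",".join(properties)
    return [
        f"{base}/{','.join(str(cid) for cid in batch)}/property/{props}/JSON"
        for batch in chunked(cids, batch_size)
    ]


# 2244 is aspirin; the other CIDs are ibuprofen (3672) and paracetamol (1983).
urls = batch_property_urls(
    [2244, 3672, 1983],
    ["MolecularFormula", "MolecularWeight"],
    batch_size=2,
)
for url in urls:
    print(url)
```

Three CIDs with a batch size of two yield two requests instead of three; with realistic batch sizes the saving against the per-minute request limit is much larger.
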
## Additional Resources

- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy
- IUPAC Tutorial: https://iupac.github.io/WFChemCookbook/datasources/pubchem_pugrest.html