2025-11-30 08:30:10 +08:00

PubChem API Reference

Overview

PubChem is the world's largest freely available chemical database, maintained by the National Center for Biotechnology Information (NCBI). It contains over 110 million unique chemical structures and over 270 million bioactivity records from more than 770 data sources.

Database Structure

PubChem consists of three primary subdatabases:

  1. Compound Database: Unique validated chemical structures with computed properties
  2. Substance Database: Deposited chemical substance records from data sources
  3. BioAssay Database: Biological activity test results for chemical compounds

PubChem PUG-REST API

Base URL Structure

https://pubchem.ncbi.nlm.nih.gov/rest/pug/<input>/<operation>/<output>

Components:

  • <input>: compound/cid, substance/sid, assay/aid, or search specifications
  • <operation>: Optional operations like property, synonyms, classification, etc.
  • <output>: Format such as JSON, XML, CSV, PNG, SDF, etc.
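
As a minimal sketch of how the three components fit together, the helper below joins them into a full URL. The name build_pug_url and its argument layout are hypothetical conveniences for illustration, not part of any PubChem client.

```python
from urllib.parse import quote

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(input_spec, operation=None, output="JSON"):
    """Join <input>/<operation>/<output> into a PUG-REST URL.

    input_spec is a sequence of path segments, e.g. ("compound", "cid", 2244).
    operation is passed through unchanged, so it may contain its own slash
    (for example "property/MolecularFormula,MolecularWeight").
    """
    # Percent-encode each input segment; commas stay literal for ID lists.
    segments = [quote(str(part), safe=",") for part in input_spec]
    if operation:
        segments.append(operation)
    segments.append(output)
    return f"{BASE_URL}/" + "/".join(segments)


url = build_pug_url(("compound", "cid", 2244),
                    "property/MolecularFormula,MolecularWeight")
```

Leaving operation as None yields the bare <input>/<output> form used by the image and SDF endpoints.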

Common Request Patterns

1. Retrieve by Identifier

Get compound by CID (Compound ID):

GET /rest/pug/compound/cid/{cid}/property/{properties}/JSON

Get compound by name:

GET /rest/pug/compound/name/{name}/property/{properties}/JSON

Get compound by SMILES:

GET /rest/pug/compound/smiles/{smiles}/property/{properties}/JSON

Get compound by InChI:

GET /rest/pug/compound/inchi/{inchi}/property/{properties}/JSON

2. Available Properties

Common molecular properties that can be retrieved:

  • MolecularFormula
  • MolecularWeight
  • CanonicalSMILES
  • IsomericSMILES
  • InChI
  • InChIKey
  • IUPACName
  • XLogP
  • ExactMass
  • MonoisotopicMass
  • TPSA (Topological Polar Surface Area)
  • Complexity
  • Charge
  • HBondDonorCount
  • HBondAcceptorCount
  • RotatableBondCount
  • HeavyAtomCount
  • IsotopeAtomCount
  • AtomStereoCount
  • BondStereoCount
  • CovalentUnitCount
  • Volume3D
  • XStericQuadrupole3D
  • YStericQuadrupole3D
  • ZStericQuadrupole3D
  • FeatureCount3D

To retrieve multiple properties, separate them with commas:

/property/MolecularFormula,MolecularWeight,CanonicalSMILES/JSON
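
A property request returns a JSON document with a PropertyTable holding one entry per compound. The sketch below shows one way to index that table by CID. The sample payload is illustrative and trimmed by hand, and properties_by_cid is a hypothetical helper name.

```python
def properties_by_cid(response_json, wanted):
    """Map each CID to the requested properties, using None for absent ones."""
    rows = response_json["PropertyTable"]["Properties"]
    return {row["CID"]: {name: row.get(name) for name in wanted} for row in rows}


# Illustrative payload in the shape returned for /property/.../JSON
sample = {
    "PropertyTable": {
        "Properties": [
            {"CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16"},
        ]
    }
}

wanted = ["MolecularFormula", "MolecularWeight", "XLogP"]
table = properties_by_cid(sample, wanted)
```

Using row.get() rather than row[...] keeps the lookup from failing when a compound has no value for one of the requested properties.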

3. Structure Search Operations

Similarity Search:

POST /rest/pug/compound/similarity/smiles/{smiles}/cids/JSON
Parameters: Threshold (minimum 2D similarity score, in percent; default 90)
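
To make the Threshold parameter concrete: PubChem's 2D similarity is a Tanimoto score computed over substructure fingerprints. The sketch below uses toy fingerprints (small sets of bit indices chosen for illustration, not real PubChem fingerprints) to show how a percentage threshold is applied.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(union)


def passes_threshold(bits_a, bits_b, threshold_percent=90):
    """True when the similarity meets the percentage threshold."""
    return tanimoto(bits_a, bits_b) >= threshold_percent / 100


# Toy fingerprints: the candidate shares 9 of the 10 bits in the union
query = {1, 4, 7, 9, 12, 15, 18, 21, 30, 42}
candidate = {1, 4, 7, 9, 12, 15, 18, 21, 30}
score = tanimoto(query, candidate)
```

Lowering the threshold widens the result set, which is why large similarity searches are usually run asynchronously.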

Substructure Search:

POST /rest/pug/compound/substructure/smiles/{smiles}/cids/JSON

Superstructure Search:

POST /rest/pug/compound/superstructure/smiles/{smiles}/cids/JSON

4. Image Generation

Get 2D structure image:

GET /rest/pug/compound/cid/{cid}/PNG
Optional parameters: image_size=small|large

5. Format Conversion

Get compound as SDF (Structure-Data File):

GET /rest/pug/compound/cid/{cid}/SDF

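Get the full compound record explicitly (record is the default operation, so this returns the same SDF as the request above; PUG-REST does not list a separate MOL output format):

GET /rest/pug/compound/cid/{cid}/record/SDF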

6. Synonym Retrieval

Get all synonyms for a compound:

GET /rest/pug/compound/cid/{cid}/synonyms/JSON

7. Bioassay Data

Get bioassay data for a compound:

GET /rest/pug/compound/cid/{cid}/assaysummary/JSON

Get specific assay information:

GET /rest/pug/assay/aid/{aid}/description/JSON

Asynchronous Requests

For large queries (similarity/substructure searches), PUG-REST uses an asynchronous pattern:

  1. Submit the query (returns ListKey)
  2. Check status using the ListKey
  3. Retrieve results when ready

Example workflow:

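import time
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # example query (aspirin)

# Step 1: Submit similarity search (SMILES in the POST body, options in the query string)
response = requests.post(
    f"{BASE}/compound/similarity/smiles/cids/JSON",
    params={"Threshold": 90}, data={"smiles": smiles},
)
listkey = response.json()["Waiting"]["ListKey"]

# Step 2: Check status using the ListKey
status_url = f"{BASE}/compound/listkey/{listkey}/cids/JSON"

# Step 3: Poll until ready (with a 60-second timeout)
deadline = time.monotonic() + 60
result = requests.get(status_url).json()
while "Waiting" in result and time.monotonic() < deadline:
    time.sleep(2)
    result = requests.get(status_url).json()
# Step 4: Retrieve results from the same URL once it stops reporting "Waiting"
cids = result.get("IdentifierList", {}).get("CID", [])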

Usage Limits

Rate Limits:

  • Maximum 5 requests per second
  • Maximum 400 requests per minute
  • Maximum 300 seconds running time per minute
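
One way to stay under the per-second limit is to space requests on the client side. The sketch below is a minimal limiter, not an official PubChem utility. The RateLimiter class and its injectable clock and sleep arguments are hypothetical names chosen so the logic can be exercised without real waiting.

```python
import time


class RateLimiter:
    """Space calls so that at most max_per_second start in any one second."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request may be sent, then reserve that slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


limiter = RateLimiter(max_per_second=5)
limiter.wait()  # call before each request; the first call returns immediately
```

Spacing requests 0.2 seconds apart caps throughput at 300 per minute, which also stays under the 400 requests per minute limit.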

Best Practices:

  • Use batch requests when possible
  • Implement exponential backoff for retries
  • Cache results when appropriate
  • Use asynchronous pattern for large queries
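
The first two practices can be sketched as follows. PUG-REST accepts comma-separated CID lists in the URL, so identifiers can be grouped into batches, and a small wrapper can retry a failing call with exponentially growing delays. The helper names chunked and with_backoff are hypothetical, and the CIDs are arbitrary examples.

```python
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_backoff(func, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


cids = [2244, 3672, 1983, 2519, 5090]
batch_paths = [
    "compound/cid/" + ",".join(map(str, batch)) + "/property/MolecularWeight/JSON"
    for batch in chunked(cids, 2)
]
```

In real use the retried function would wrap a single HTTP request, and catching a narrower exception type than Exception would avoid retrying on programming errors.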

PubChemPy Python Library

PubChemPy is a Python wrapper that simplifies PUG-REST API access.

Installation

pip install pubchempy

Key Classes

Compound Class

Main class for representing chemical compounds:

import pubchempy as pcp

# Get by CID
compound = pcp.Compound.from_cid(2244)

# Access properties
compound.molecular_formula  # 'C9H8O4'
compound.molecular_weight   # 180.16
compound.iupac_name        # '2-acetyloxybenzoic acid'
compound.canonical_smiles   # 'CC(=O)OC1=CC=CC=C1C(=O)O'
compound.isomeric_smiles    # Same as canonical when there is no stereo or isotope information
compound.inchi             # InChI string
compound.inchikey          # InChI Key
compound.xlogp             # Partition coefficient
compound.tpsa              # Topological polar surface area

Search Methods

By Name:

compounds = pcp.get_compounds('aspirin', 'name')
# Returns list of Compound objects

By SMILES:

compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]

By InChI:

compound = pcp.get_compounds('InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)', 'inchi')[0]

By Formula:

compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds with this formula

Similarity Search:

results = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles',
                           searchtype='similarity',
                           Threshold=90)

Substructure Search:

results = pcp.get_compounds('c1ccccc1', 'smiles',
                           searchtype='substructure')
# Returns all compounds containing benzene ring

Property Retrieval

Get specific properties for multiple compounds:

properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES'],
    'aspirin',
    'name'
)
# Returns list of dictionaries

Get properties as pandas DataFrame:

import pandas as pd
df = pd.DataFrame(properties)

Synonyms

Get all synonyms for a compound:

synonyms = pcp.get_synonyms('aspirin', 'name')
# Returns list of dictionaries with CID and synonym lists

Download Formats

Save a compound record to a file in various formats (the second argument is the output path; download writes the file and returns None):

# Save as SDF
pcp.download('SDF', 'aspirin.sdf', 'aspirin', 'name', overwrite=True)

# Save as JSON
pcp.download('JSON', 'aspirin.json', 2244, 'cid', overwrite=True)

# Save as PNG image
pcp.download('PNG', 'aspirin.png', 2244, 'cid', overwrite=True)

Error Handling

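import pubchempy as pcp

try:
    compounds = pcp.get_compounds('nonexistent', 'name')
    if not compounds:
        # A name with no match typically yields an empty list, not an exception
        print("Compound not found")
except pcp.NotFoundError:
    print("Compound not found")
except pcp.BadRequestError:
    print("Invalid request")
except pcp.TimeoutError:  # PubChemPy's own class, not the Python builtin
    print("Request timed out")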

PUG-View API

PUG-View provides access to full textual annotations and specialized reports.

Key Endpoints

Get compound annotations:

GET /rest/pug_view/data/compound/{cid}/JSON

Get specific annotation sections:

GET /rest/pug_view/data/compound/{cid}/JSON?heading={section_name}

Available sections include:

  • Chemical and Physical Properties
  • Drug and Medication Information
  • Pharmacology and Biochemistry
  • Safety and Hazards
  • Toxicity
  • Literature
  • Patents
  • Biomolecular Interactions and Pathways
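
PUG-View responses are nested: a Record holds a list of Section objects, each with a TOCHeading and optionally further Section lists. The sketch below builds the request URL and walks that tree. The sample record is a hand-trimmed illustration, and pug_view_url and find_section are hypothetical helper names.

```python
from urllib.parse import quote, urlencode

PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"


def pug_view_url(cid, heading=None):
    """Build a PUG-View URL, optionally restricted to one section heading."""
    url = f"{PUG_VIEW}/{cid}/JSON"
    if heading:
        # quote_via=quote encodes spaces as %20 rather than +
        url += "?" + urlencode({"heading": heading}, quote_via=quote)
    return url


def find_section(sections, heading):
    """Depth-first search of nested Section lists for a TOCHeading."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


# Hand-trimmed illustration of the response shape
sample = {"Record": {"RecordNumber": 2244, "Section": [
    {"TOCHeading": "Safety and Hazards", "Section": [
        {"TOCHeading": "Hazards Identification", "Information": []},
    ]},
]}}
hazards = find_section(sample["Record"]["Section"], "Hazards Identification")
```

The same traversal works on a full response, because requesting a specific heading only narrows which sections are returned.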

Common Workflows

1. Chemical Identifier Conversion

Convert a compound name to SMILES, InChI, InChIKey, and CID:

import pubchempy as pcp

compound = pcp.get_compounds('caffeine', 'name')[0]
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid

2. Batch Property Retrieval

Get properties for multiple compounds:

import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
properties = []

# One request per name; each call returns a list of property dictionaries
for name in compound_names:
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    properties.extend(props)

df = pd.DataFrame(properties)

3. Finding Similar Compounds

Find structurally similar compounds to a query:

# Start with a known compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles

# Perform similarity search
similar = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85
)

# Get properties for similar compounds
for compound in similar[:10]:  # First 10 results
    print(f"{compound.cid}: {compound.iupac_name}, MW: {compound.molecular_weight}")

4. Substructure Screening

Find all compounds containing a specific substructure:

# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'

matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)

print(f"Found {len(matches)} compounds containing pyridine")

5. Bioactivity Data Retrieval

import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"

response = requests.get(url)
if response.status_code == 200:
    bioassay_data = response.json()
    # Process bioassay information
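
One possible way to process the response is sketched below. It assumes the assaysummary JSON is a table with a Columns.Column list of names and Row entries that each hold a Cell list. The sample payload uses made-up values and only three columns, and table_to_records is a hypothetical helper name.

```python
def table_to_records(payload):
    """Turn a Columns/Row table payload into a list of dicts keyed by column name."""
    table = payload["Table"]
    columns = table["Columns"]["Column"]
    return [dict(zip(columns, row["Cell"])) for row in table.get("Row", [])]


# Made-up sample in the assumed response shape
sample = {"Table": {
    "Columns": {"Column": ["AID", "CID", "Activity Outcome"]},
    "Row": [
        {"Cell": [1001, 2244, "Active"]},
        {"Cell": [1002, 2244, "Inactive"]},
    ],
}}

records = table_to_records(sample)
active_aids = [r["AID"] for r in records if r["Activity Outcome"] == "Active"]
```

Checking the column names in a real response before relying on keys such as "Activity Outcome" is advisable, since the assumed layout is not confirmed by this reference.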

Tips and Best Practices

  1. Use CIDs for repeated queries: CIDs are more efficient than names or structures
  2. Cache results: Store frequently accessed data locally
  3. Batch requests: Combine multiple queries when possible
  4. Handle rate limits: Implement delays between requests
  5. Use appropriate search types: Similarity for related compounds, substructure for motif finding
  6. Leverage PubChemPy: Higher-level abstraction simplifies common tasks
  7. Handle missing data: Not all properties are available for all compounds
  8. Use asynchronous pattern: For large similarity/substructure searches
  9. Specify output format: Choose JSON for programmatic access, SDF for cheminformatics tools
  10. Read documentation: Full PUG-REST documentation available at https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
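
Tips 2 and 7 can be combined in a small sketch: cache each compound's properties after the first lookup, and return a default when a property is absent. Here fetch_properties is a hypothetical stand-in for a real network call, and its data is made up for the example.

```python
from functools import lru_cache


def fetch_properties(cid):
    """Hypothetical stand-in for a network call; returns a property dict."""
    fetch_properties.calls += 1
    fake_db = {2244: {"MolecularFormula": "C9H8O4"}}  # no XLogP entry on purpose
    return fake_db.get(cid, {})


fetch_properties.calls = 0  # counts how often the "network" is hit


@lru_cache(maxsize=1024)
def cached_properties(cid):
    # Store an immutable tuple so callers cannot mutate the cached value.
    return tuple(sorted(fetch_properties(cid).items()))


def get_property(cid, name, default=None):
    """Look up one property, falling back to default when it is missing."""
    return dict(cached_properties(cid)).get(name, default)


formula = get_property(2244, "MolecularFormula")
xlogp = get_property(2244, "XLogP")  # missing, so the default is returned
```

Both lookups above trigger only one call to fetch_properties, because the second is served from the cache.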

Additional Resources