2025-11-30 08:30:10 +08:00

RCSB PDB API Reference

This document provides detailed information about the RCSB Protein Data Bank APIs, including advanced usage patterns, data schemas, and best practices.

API Overview

RCSB PDB provides multiple programmatic interfaces:

  1. Data API - Retrieve PDB data when you have an identifier
  2. Search API - Find identifiers matching specific search criteria
  3. ModelServer API - Access macromolecular model subsets
  4. VolumeServer API - Retrieve volumetric data subsets
  5. Sequence Coordinates API - Obtain alignments between structural and sequence databases
  6. Alignment API - Perform structure alignment computations

Data API

Core Data Objects

The Data API organizes information hierarchically:

  • core_entry: PDB entries or Computed Structure Models (CSM IDs start with AF_ or MA_)
  • core_polymer_entity: Protein, DNA, and RNA entities
  • core_nonpolymer_entity: Ligands, cofactors, ions
  • core_branched_entity: Oligosaccharides
  • core_assembly: Biological assemblies
  • core_polymer_entity_instance: Individual chains
  • core_chem_comp: Chemical components
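
The ID convention noted for core_entry can be checked in code. The following is a minimal sketch that relies only on the AF_ and MA_ prefixes mentioned above; the function name `is_computed_model` and the example CSM-style ID are illustrative, not part of any RCSB library.

```python
def is_computed_model(entry_id: str) -> bool:
    """Return True if the ID carries a Computed Structure Model prefix."""
    # Assumption: the prefix test is enough to tell CSMs from experimental entries.
    return entry_id.upper().startswith(("AF_", "MA_"))


print(is_computed_model("4HHB"))           # False
print(is_computed_model("AF_AFP68871F1"))  # True (illustrative CSM-style ID)
```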

REST API Endpoints

Base URL: https://data.rcsb.org/rest/v1/

Entry Data:

GET https://data.rcsb.org/rest/v1/core/entry/{entry_id}

Polymer Entity:

GET https://data.rcsb.org/rest/v1/core/polymer_entity/{entry_id}/{entity_id}

Assembly:

GET https://data.rcsb.org/rest/v1/core/assembly/{entry_id}/{assembly_id}

Examples:

# Get entry data for hemoglobin
curl https://data.rcsb.org/rest/v1/core/entry/4HHB

# Get first polymer entity
curl https://data.rcsb.org/rest/v1/core/polymer_entity/4HHB/1

# Get biological assembly 1
curl https://data.rcsb.org/rest/v1/core/assembly/4HHB/1
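
The same requests can be issued from Python. The sketch below builds the three URLs with slash-separated path segments and fetches JSON using only the standard library. The helper names (`entry_url`, `polymer_entity_url`, `assembly_url`, `fetch_json`) are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "https://data.rcsb.org/rest/v1/core"


def entry_url(entry_id: str) -> str:
    return f"{BASE_URL}/entry/{entry_id}"


def polymer_entity_url(entry_id: str, entity_id) -> str:
    return f"{BASE_URL}/polymer_entity/{entry_id}/{entity_id}"


def assembly_url(entry_id: str, assembly_id) -> str:
    return f"{BASE_URL}/assembly/{entry_id}/{assembly_id}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Fetch a URL and decode the JSON body (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(entry_url("4HHB"))
print(polymer_entity_url("4HHB", 1))
print(assembly_url("4HHB", 1))
```

Calling `fetch_json(entry_url("4HHB"))` would then return the entry record as a Python dictionary.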

GraphQL API

Endpoint: https://data.rcsb.org/graphql

The GraphQL API supports flexible retrieval: a single query can request fields from any level of the data hierarchy.

Example Query:

{
  entry(entry_id: "4HHB") {
    struct {
      title
    }
    exptl {
      method
    }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
      polymer_entity_count
    }
    rcsb_accession_info {
      deposit_date
      initial_release_date
    }
  }
}

Python Example:

import requests

query = """
{
  polymer_entity(entity_id: "4HHB_1") {
    rcsb_polymer_entity {
      pdbx_description
      formula_weight
    }
    entity_poly {
      pdbx_seq_one_letter_code
      pdbx_strand_id
    }
    rcsb_entity_source_organism {
      ncbi_taxonomy_id
      scientific_name
    }
  }
}
"""

response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query}
)
data = response.json()
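
A GraphQL response can carry an `errors` member in place of, or alongside, `data`, so it is worth checking before reading fields. The following is a minimal sketch of such a check; `unwrap_graphql` is a hypothetical helper, and the sample payload is hand-written for illustration rather than captured from the server.

```python
def unwrap_graphql(payload: dict) -> dict:
    """Return the `data` member, raising if the server reported errors."""
    errors = payload.get("errors")
    if errors:
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError(f"GraphQL errors: {messages}")
    data = payload.get("data")
    if data is None:
        raise RuntimeError("GraphQL response has no 'data' member")
    return data


# Hand-written payload in the shape the query above would produce
sample = {"data": {"polymer_entity": {"entity_poly": {"pdbx_strand_id": "A,C"}}}}
entity = unwrap_graphql(sample)["polymer_entity"]
print(entity["entity_poly"]["pdbx_strand_id"])
```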

Common Data Fields

Entry Level:

  • struct.title - Structure title/description
  • exptl[].method - Experimental method (X-RAY DIFFRACTION, NMR, ELECTRON MICROSCOPY, etc.)
  • rcsb_entry_info.resolution_combined - Resolution in Ångströms
  • rcsb_entry_info.deposited_atom_count - Total number of atoms
  • rcsb_accession_info.deposit_date - Deposition date
  • rcsb_accession_info.initial_release_date - Release date

Polymer Entity Level:

  • entity_poly.pdbx_seq_one_letter_code - Primary sequence
  • rcsb_polymer_entity.formula_weight - Molecular weight
  • rcsb_entity_source_organism.scientific_name - Source organism
  • rcsb_entity_source_organism.ncbi_taxonomy_id - NCBI taxonomy ID

Assembly Level:

  • rcsb_assembly_info.polymer_entity_count - Number of polymer entities
  • rcsb_assembly_info.assembly_id - Assembly identifier
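
The dotted field names above map onto nested JSON, with `[]` marking a list whose elements each carry the remaining path. The sketch below resolves such a path against a decoded record. `get_path` is a hypothetical helper, and the sample record is hand-written, not real API output.

```python
def get_path(record, path):
    """Resolve a dotted path such as 'exptl[].method' against nested JSON.

    Returns a list if any segment used '[]', otherwise a single value
    (or None when the path is missing).
    """
    current = [record]
    saw_list = False
    for part in path.split("."):
        many = part.endswith("[]")
        key = part[:-2] if many else part
        found = []
        for item in current:
            if not isinstance(item, dict) or item.get(key) is None:
                continue
            if many:
                found.extend(item[key])
            else:
                found.append(item[key])
        current = found
        saw_list = saw_list or many
    if saw_list:
        return current
    return current[0] if current else None


record = {
    "struct": {"title": "Example title"},
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
}
print(get_path(record, "struct.title"))
print(get_path(record, "exptl[].method"))
```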

Search API

Query Types

The Search API supports seven primary query types:

  1. TextQuery - Full-text search
  2. AttributeQuery - Property-based search
  3. SequenceQuery - Sequence similarity search
  4. SequenceMotifQuery - Motif pattern search
  5. StructSimilarityQuery - 3D structure similarity
  6. StructMotifQuery - Structural motif search
  7. ChemSimilarityQuery - Chemical similarity search

AttributeQuery Operators

Available operators for AttributeQuery:

  • exact_match - Exact string match
  • contains_words - Contains all words
  • contains_phrase - Contains exact phrase
  • equals - Numerical equality
  • greater - Greater than (numerical)
  • greater_or_equal - Greater than or equal
  • less - Less than (numerical)
  • less_or_equal - Less than or equal
  • range - Numerical range (closed interval)
  • exists - Field has a value
  • in - Value in list

Common Searchable Attributes

Resolution and Quality:

from rcsbapi.search import AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info

# High-resolution structures
query = AttributeQuery(
    attribute=rcsb_entry_info.resolution_combined,
    operator="less",
    value=2.0
)

Experimental Method:

from rcsbapi.search.attrs import exptl

query = AttributeQuery(
    attribute=exptl.method,
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)

Organism:

from rcsbapi.search.attrs import rcsb_entity_source_organism

query = AttributeQuery(
    attribute=rcsb_entity_source_organism.scientific_name,
    operator="exact_match",
    value="Homo sapiens"
)

Molecular Weight:

from rcsbapi.search.attrs import rcsb_polymer_entity

query = AttributeQuery(
    attribute=rcsb_polymer_entity.formula_weight,
    operator="range",
    value=(10000, 50000)  # 10-50 kDa
)

Release Date:

from rcsbapi.search.attrs import rcsb_accession_info

# Structures released in 2024
query = AttributeQuery(
    attribute=rcsb_accession_info.initial_release_date,
    operator="range",
    value=("2024-01-01", "2024-12-31")
)

Sequence Similarity Search

Search for structures with similar sequences using MMseqs2:

from rcsbapi.search import SequenceQuery

# Basic sequence search
query = SequenceQuery(
    value="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM",
    evalue_cutoff=0.1,
    identity_cutoff=0.9
)

# With sequence type specified
query = SequenceQuery(
    value="ACGTACGTACGT",
    evalue_cutoff=1e-5,
    identity_cutoff=0.8,
    sequence_type="dna"  # or "rna" or "protein"
)
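
Before submitting a sequence, it can help to confirm that it matches the declared `sequence_type`. The sketch below is a local pre-check only. The strict alphabets are an assumption (the service may accept ambiguity codes such as X or N), and `invalid_characters` is a hypothetical helper.

```python
# Assumed strict alphabets; real services often accept extra ambiguity codes.
ALPHABETS = {
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
    "dna": set("ACGT"),
    "rna": set("ACGU"),
}


def invalid_characters(sequence, sequence_type="protein"):
    """Return the sorted characters that fall outside the expected alphabet."""
    allowed = ALPHABETS[sequence_type]
    return sorted(set(sequence.upper()) - allowed)


print(invalid_characters("ACGTACGTACGT", "dna"))  # []
print(invalid_characters("ACGU", "dna"))          # ['U']
```

An empty list means the sequence passes the check; anything else points at characters to fix or at a wrong `sequence_type`.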

Structure Similarity Search

Find structures with similar 3D geometry using BioZernike:

from rcsbapi.search import StructSimilarityQuery

# Search by entry
query = StructSimilarityQuery(
    structure_search_type="entry",
    entry_id="4HHB"
)

# Search by chain
query = StructSimilarityQuery(
    structure_search_type="chain",
    entry_id="4HHB",
    chain_id="A"
)

# Search by assembly
query = StructSimilarityQuery(
    structure_search_type="assembly",
    entry_id="4HHB",
    assembly_id="1"
)

Combining Queries

Use Python bitwise operators to combine queries:

from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info, rcsb_entity_source_organism

# AND operation (&)
query1 = TextQuery("kinase")
query2 = AttributeQuery(
    attribute=rcsb_entity_source_organism.scientific_name,
    operator="exact_match",
    value="Homo sapiens"
)
combined = query1 & query2

# OR operation (|)
organism1 = AttributeQuery(
    attribute=rcsb_entity_source_organism.scientific_name,
    operator="exact_match",
    value="Homo sapiens"
)
organism2 = AttributeQuery(
    attribute=rcsb_entity_source_organism.scientific_name,
    operator="exact_match",
    value="Mus musculus"
)
combined = organism1 | organism2

# NOT operation (~)
all_structures = TextQuery("protein")
low_res = AttributeQuery(
    attribute=rcsb_entry_info.resolution_combined,
    operator="greater",
    value=3.0
)
high_res_only = all_structures & (~low_res)

# Complex combinations
high_res_human_kinases = (
    TextQuery("kinase") &
    AttributeQuery(
        attribute=rcsb_entity_source_organism.scientific_name,
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute=rcsb_entry_info.resolution_combined,
        operator="less",
        value=2.5
    )
)
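
To make the operator syntax less magical, here is a toy sketch of how `&`, `|`, and `~` can be overloaded to build a nested query tree. It illustrates the Python mechanism only and is not rcsbapi's actual implementation. The `Node` class and the JSON shape it produces (`group` nodes with a `logical_operator`) are assumptions for illustration.

```python
class Node:
    """Toy query node wrapping a JSON-like payload."""

    def __init__(self, payload):
        self.payload = payload

    def _group(self, operator, other):
        return Node({
            "type": "group",
            "logical_operator": operator,
            "nodes": [self.payload, other.payload],
        })

    def __and__(self, other):   # a & b
        return self._group("and", other)

    def __or__(self, other):    # a | b
        return self._group("or", other)

    def __invert__(self):       # ~a (this toy handles terminal nodes only)
        params = dict(self.payload["parameters"])
        params["negation"] = not params.get("negation", False)
        return Node({**self.payload, "parameters": params})


human = Node({"type": "terminal", "parameters": {"attribute": "organism"}})
low_res = Node({"type": "terminal", "parameters": {"attribute": "resolution"}})
combined = human & ~low_res
print(combined.payload["logical_operator"])
print(combined.payload["nodes"][1]["parameters"]["negation"])
```

Because `~` binds more tightly than `&`, and `&` more tightly than `|`, parentheses are worth adding whenever the two binary operators are mixed.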

Return Types

Control what information is returned:

from rcsbapi.search import TextQuery

query = TextQuery("hemoglobin")

# Return PDB entry IDs (default)
results = list(query())  # ['4HHB', '1A3N', ...]

# Return entry IDs with scores
results = list(query(return_type="entry", results_verbosity="verbose"))
# [{'identifier': '4HHB', 'score': 0.95}, ...]

# Return polymer entities
results = list(query(return_type="polymer_entity"))
# ['4HHB_1', '4HHB_2', ...]
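
Polymer entity identifiers join the entry ID and entity ID with an underscore, so they usually need splitting before further lookups. The following is a small sketch. `split_entity_id` is a hypothetical helper, and it splits on the last underscore so that CSM-style entry IDs, which contain underscores themselves, stay intact.

```python
def split_entity_id(identifier: str):
    """Split an entity-level ID such as '4HHB_1' into (entry_id, entity_id)."""
    entry_id, separator, entity_id = identifier.rpartition("_")
    if not separator or not entry_id or not entity_id:
        raise ValueError(f"Not an entity identifier: {identifier!r}")
    return entry_id, entity_id


print(split_entity_id("4HHB_1"))           # ('4HHB', '1')
print(split_entity_id("AF_AFP68871F1_1"))  # ('AF_AFP68871F1', '1')
```

Pass only entity-level identifiers to this helper: an entry-level CSM ID would be split at its own underscore.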

File Download URLs

Structure Files

PDB Format (legacy):

https://files.rcsb.org/download/{PDB_ID}.pdb

mmCIF Format (modern standard):

https://files.rcsb.org/download/{PDB_ID}.cif

Structure Factors:

https://files.rcsb.org/download/{PDB_ID}-sf.cif

Biological Assembly:

https://files.rcsb.org/download/{PDB_ID}.pdb1  # Assembly 1
https://files.rcsb.org/download/{PDB_ID}.pdb2  # Assembly 2

FASTA Sequence:

https://www.rcsb.org/fasta/entry/{PDB_ID}
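
The URL patterns above are regular enough to generate. The sketch below covers the files.rcsb.org patterns listed in this section; the function name `structure_file_url` and its `kind` keywords are hypothetical.

```python
FILES_BASE = "https://files.rcsb.org/download"

# Suffix for each non-assembly file kind listed above
SUFFIXES = {"pdb": ".pdb", "cif": ".cif", "sf": "-sf.cif"}


def structure_file_url(pdb_id: str, kind: str = "cif", assembly=None) -> str:
    """Build a download URL; kind is 'pdb', 'cif', 'sf', or 'assembly'."""
    if kind == "assembly":
        if assembly is None:
            raise ValueError("assembly number is required for kind='assembly'")
        return f"{FILES_BASE}/{pdb_id}.pdb{assembly}"
    if kind not in SUFFIXES:
        raise ValueError(f"Unknown file kind: {kind!r}")
    return f"{FILES_BASE}/{pdb_id}{SUFFIXES[kind]}"


print(structure_file_url("4HHB"))
print(structure_file_url("4HHB", "sf"))
print(structure_file_url("4HHB", "assembly", assembly=1))
```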

Python Download Helper

import os
import requests

def download_pdb_file(pdb_id, format="pdb", output_dir="."):
    """
    Download a structure file from files.rcsb.org.

    Args:
        pdb_id: 4-character PDB ID
        format: 'pdb' or 'cif'
        output_dir: Existing directory to save the file in

    Returns:
        Path to the saved file, or None if the download failed.
    """
    base_url = "https://files.rcsb.org/download"
    url = f"{base_url}/{pdb_id}.{format}"

    # A timeout keeps the call from hanging on a stalled connection
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        output_path = os.path.join(output_dir, f"{pdb_id}.{format}")
        # Write raw bytes so the file is saved exactly as served
        with open(output_path, "wb") as f:
            f.write(response.content)
        print(f"Downloaded {pdb_id}.{format}")
        return output_path
    else:
        print(f"Error downloading {pdb_id}: {response.status_code}")
        return None

# Usage
download_pdb_file("4HHB", format="pdb")
download_pdb_file("4HHB", format="cif")

Rate Limiting and Best Practices

Rate Limits

  • The API implements rate limiting to ensure fair usage
  • If you exceed the limit, you'll receive a 429 HTTP error code
  • Recommended starting point: a few requests per second
  • Use exponential backoff to find acceptable request rates

Exponential Backoff Implementation

import time
import requests

def fetch_with_retry(url, max_retries=5, initial_delay=1):
    """
    Fetch URL with exponential backoff on rate limit errors.

    Args:
        url: URL to fetch
        max_retries: Maximum number of attempts
        initial_delay: Initial delay in seconds
    """
    delay = initial_delay

    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)

        if response.status_code == 429:
            print(f"Rate limited. Waiting {delay}s before retry...")
            time.sleep(delay)
            delay *= 2  # Exponential backoff
            continue

        # Raise for any other 4xx/5xx status; otherwise return the response
        response.raise_for_status()
        return response

    raise RuntimeError(f"Still rate limited after {max_retries} attempts")
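
To see what the doubling schedule costs in waiting time, the delays can be computed up front. The sketch below is illustrative only; `backoff_delays` and its optional `cap` parameter are hypothetical and not part of the function above.

```python
def backoff_delays(initial_delay=1.0, max_retries=5, cap=None):
    """Return the delay before each retry, doubling each time."""
    delays = []
    delay = initial_delay
    for _ in range(max_retries):
        # An optional cap keeps late retries from waiting too long
        delays.append(delay if cap is None else min(delay, cap))
        delay *= 2
    return delays


print(backoff_delays())       # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(backoff_delays()))  # 31.0
```

With the defaults, 31 seconds is an upper bound on the total time spent sleeping when every attempt is rate limited.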

Batch Processing Best Practices

  1. Use the Search API first to get a list of IDs, then fetch the data
  2. Cache results to avoid redundant queries
  3. Process IDs in chunks rather than all at once
  4. Add delays between requests to respect rate limits
  5. Use GraphQL for complex queries to minimize the number of requests

import time
import requests
from rcsbapi.search import TextQuery

ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry"

def batch_fetch_structures(query, delay=0.5):
    """
    Fetch entry data for structures matching a query, with rate limiting.

    Args:
        query: Search query object
        delay: Delay between requests in seconds
    """
    # Get list of IDs
    pdb_ids = list(query())
    print(f"Found {len(pdb_ids)} structures")

    # Fetch entry data for each ID from the REST endpoint
    results = {}
    for i, pdb_id in enumerate(pdb_ids):
        try:
            response = requests.get(f"{ENTRY_URL}/{pdb_id}", timeout=30)
            response.raise_for_status()
            results[pdb_id] = response.json()
            print(f"Fetched {i+1}/{len(pdb_ids)}: {pdb_id}")
        except requests.RequestException as e:
            print(f"Error fetching {pdb_id}: {e}")
        time.sleep(delay)  # Rate limiting, also after a failed request

    return results
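
Practices 3 and 5 can be combined: split the ID list into chunks and request each chunk with one GraphQL query. The sketch below assumes the GraphQL schema's plural `entries(entry_ids: [...])` root field and the `rcsb_id` field. The helper names and the default field selection are hypothetical, and the block only builds query strings without sending anything.

```python
import json


def chunked(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def entries_query(entry_ids, fields="struct { title }"):
    """Build one GraphQL query covering several entries."""
    # json.dumps yields a double-quoted list, which is also valid GraphQL
    id_list = json.dumps(list(entry_ids))
    return f"{{ entries(entry_ids: {id_list}) {{ rcsb_id {fields} }} }}"


ids = ["4HHB", "1A3N", "2HHB"]
for chunk in chunked(ids, 2):
    print(entries_query(chunk))
```

Each query string can then be POSTed to the GraphQL endpoint in the same way as the single-entity example earlier in this document.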

Advanced Use Cases

Finding Drug-Target Complexes

from rcsbapi.search import AttributeQuery
from rcsbapi.search.attrs import rcsb_nonpolymer_entity_instance_container_identifiers

# Find structures containing a specific ligand, identified by its chemical component ID
query = AttributeQuery(
    attribute=rcsb_nonpolymer_entity_instance_container_identifiers.comp_id,
    operator="exact_match",
    value="ATP"  # or other ligand code
)

results = list(query())
print(f"Found {len(results)} structures with ATP")

Filtering by Resolution and R-factor

from rcsbapi.search import AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info, refine

# High-quality X-ray structures
resolution_query = AttributeQuery(
    attribute=rcsb_entry_info.resolution_combined,
    operator="less",
    value=2.0
)

rfactor_query = AttributeQuery(
    attribute=refine.ls_R_factor_R_free,
    operator="less",
    value=0.25
)

high_quality = resolution_query & rfactor_query
results = list(high_quality())

Finding Recent Structures

import datetime

from rcsbapi.search import AttributeQuery
from rcsbapi.search.attrs import rcsb_accession_info

# Structures released in the last 30 days
one_month_ago = (datetime.date.today() - datetime.timedelta(days=30)).isoformat()
today = datetime.date.today().isoformat()

query = AttributeQuery(
    attribute=rcsb_accession_info.initial_release_date,
    operator="range",
    value=(one_month_ago, today)
)

recent_structures = list(query())

Troubleshooting

Common Errors

404 Not Found:

  • PDB ID doesn't exist or is obsolete
  • Check if ID is correct (case-sensitive)
  • Verify entry hasn't been superseded

429 Too Many Requests:

  • Rate limit exceeded
  • Implement exponential backoff
  • Reduce request frequency

500 Internal Server Error:

  • Temporary server issue
  • Retry after short delay
  • Check RCSB PDB status page

Empty Results:

  • Query too restrictive
  • Check attribute names and operators
  • Verify data exists for searched field
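
The status-code guidance above can be folded into a single dispatch function so that calling code reacts consistently. This is a minimal sketch; `suggested_action` and its return labels are hypothetical names chosen for illustration.

```python
def suggested_action(status_code: int) -> str:
    """Map an HTTP status code to the troubleshooting step listed above."""
    if status_code == 200:
        return "ok"
    if status_code == 404:
        return "check_id"      # wrong, obsolete, or superseded ID
    if status_code == 429:
        return "back_off"      # rate limited: wait, then retry more slowly
    if 500 <= status_code < 600:
        return "retry_later"   # temporary server-side problem
    return "inspect"           # anything else needs a closer look


for code in (200, 404, 429, 500):
    print(code, suggested_action(code))
```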

Debugging Tips

# Inspect the query that will be sent to the Search API
import json
from rcsbapi.search import TextQuery

query = TextQuery("hemoglobin")
print(query.to_dict())  # Query structure as a Python dict

# Pretty-print the same structure as JSON
print(json.dumps(query.to_dict(), indent=2))

# Test with curl
import subprocess
result = subprocess.run(
    ["curl", "https://data.rcsb.org/rest/v1/core/entry/4HHB"],
    capture_output=True,
    text=True
)
print(result.stdout)

Additional Resources