Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

RCSB PDB API Reference

This document provides detailed information about the RCSB Protein Data Bank APIs, including advanced usage patterns, data schemas, and best practices.

API Overview

RCSB PDB provides multiple programmatic interfaces:

Data API - Retrieve PDB data when you have an identifier
Search API - Find identifiers matching specific search criteria
ModelServer API - Access macromolecular model subsets
VolumeServer API - Retrieve volumetric data subsets
Sequence Coordinates API - Obtain alignments between structural and sequence databases
Alignment API - Perform structure alignment computations

Data API

Core Data Objects

The Data API organizes information hierarchically:

core_entry: PDB entries or Computed Structure Models (CSM IDs start with AF_ or MA_)
core_polymer_entity: Protein, DNA, and RNA entities
core_nonpolymer_entity: Ligands, cofactors, ions
core_branched_entity: Oligosaccharides
core_assembly: Biological assemblies
core_polymer_entity_instance: Individual chains
core_chem_comp: Chemical components
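The identifier conventions above can be checked locally before any request is sent. The following is a minimal sketch: `classify_entry_id` is a hypothetical helper (not part of any RCSB package), and the pattern for experimental entries assumes the classic four-character form, a digit followed by three letters or digits.

```python
import re

# Classic PDB IDs: one digit followed by three letters or digits.
# Assumption: extended-length PDB IDs are out of scope for this sketch.
_PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def classify_entry_id(entry_id: str) -> str:
    """Return 'csm', 'pdb', or 'unknown' for an entry identifier."""
    # Computed Structure Models use the AF_ or MA_ prefixes.
    if entry_id.startswith(("AF_", "MA_")):
        return "csm"
    if _PDB_ID.match(entry_id):
        return "pdb"
    return "unknown"
```

A check like this is cheap and can catch a free-text term passed where an identifier was expected.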

REST API Endpoints

Base URL: https://data.rcsb.org/rest/v1/

Entry Data:

GET https://data.rcsb.org/rest/v1/core/entry/{entry_id}

Polymer Entity:

GET https://data.rcsb.org/rest/v1/core/polymer_entity/{entry_id}/{entity_id}

Assembly:

GET https://data.rcsb.org/rest/v1/core/assembly/{entry_id}/{assembly_id}

Examples:

# Get entry data for hemoglobin
curl https://data.rcsb.org/rest/v1/core/entry/4HHB

# Get first polymer entity (entry ID and entity ID are separate path segments)
curl https://data.rcsb.org/rest/v1/core/polymer_entity/4HHB/1

# Get biological assembly 1
curl https://data.rcsb.org/rest/v1/core/assembly/4HHB/1
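Search results usually identify entities and assemblies with combined IDs such as `4HHB_1` or `4HHB-1`, while the REST paths take the entry ID and the sub-ID as separate segments. The sketch below converts one form to the other; `rest_url` is a hypothetical helper, and the separator conventions (`_` for entities, `-` for assemblies) are stated here as assumptions about the combined ID format.

```python
BASE_URL = "https://data.rcsb.org/rest/v1/core"

# Separator used in combined identifiers for each object type
# (None means the identifier is used as-is).
_SEPARATORS = {"entry": None, "polymer_entity": "_", "assembly": "-"}


def rest_url(kind: str, identifier: str) -> str:
    """Build a Data API REST URL from a plain or combined identifier."""
    if kind not in _SEPARATORS:
        raise ValueError(f"unsupported object type: {kind}")
    separator = _SEPARATORS[kind]
    if separator is None:
        return f"{BASE_URL}/{kind}/{identifier}"
    # Split on the last separator so CSM IDs containing '_' stay intact.
    entry_id, _, sub_id = identifier.rpartition(separator)
    if not entry_id or not sub_id:
        raise ValueError(f"expected <entry>{separator}<id>, got {identifier!r}")
    return f"{BASE_URL}/{kind}/{entry_id}/{sub_id}"
```

For example, `rest_url("polymer_entity", "4HHB_1")` yields a URL ending in `polymer_entity/4HHB/1`.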

GraphQL API

Endpoint: https://data.rcsb.org/graphql

The GraphQL API lets you request fields from any level of the hierarchy in a single query, so one round trip can replace several REST calls.

Example Query:

{
  entry(entry_id: "4HHB") {
    struct {
      title
    }
    exptl {
      method
    }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
      polymer_entity_count
    }
    rcsb_accession_info {
      deposit_date
      initial_release_date
    }
  }
}

Python Example:

import requests

query = """
{
  polymer_entity(entry_id: "4HHB", entity_id: "1") {
    rcsb_polymer_entity {
      pdbx_description
      formula_weight
    }
    entity_poly {
      pdbx_seq_one_letter_code
      pdbx_strand_id
    }
    rcsb_entity_source_organism {
      ncbi_taxonomy_id
      scientific_name
    }
  }
}
"""

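response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query},
    timeout=30
)
response.raise_for_status()  # Fail early on HTTP errors
data = response.json()       # Query results are under data["data"]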
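Once the response is parsed, the nested fields still need to be unpacked. The sketch below shows one way to do that for the polymer entity query; `summarize_polymer_entity` is a hypothetical helper, and it assumes the usual GraphQL envelope (`data` and, on failure, `errors`), with `rcsb_entity_source_organism` returned as a list and `pdbx_strand_id` as a comma-separated string.

```python
def summarize_polymer_entity(payload: dict) -> dict:
    """Flatten a polymer_entity GraphQL response into a small summary."""
    # GraphQL reports query problems in an "errors" list, often with HTTP 200.
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")

    entity = (payload.get("data") or {}).get("polymer_entity")
    if entity is None:
        raise LookupError("response contains no polymer_entity")

    poly = entity.get("entity_poly") or {}
    organisms = entity.get("rcsb_entity_source_organism") or []
    strand_ids = poly.get("pdbx_strand_id") or ""

    return {
        "description": (entity.get("rcsb_polymer_entity") or {}).get("pdbx_description"),
        "sequence_length": len(poly.get("pdbx_seq_one_letter_code") or ""),
        "chains": [c for c in strand_ids.split(",") if c],
        "organisms": [o.get("scientific_name") for o in organisms],
    }
```

Calling `summarize_polymer_entity(data)` on the parsed response returns a flat dictionary that is easier to log or tabulate than the nested original.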

Common Data Fields

Entry Level:

struct.title - Structure title/description
exptl[].method - Experimental method (X-RAY DIFFRACTION, NMR, ELECTRON MICROSCOPY, etc.)
rcsb_entry_info.resolution_combined - Resolution in Ångströms (returned as a list)
rcsb_entry_info.deposited_atom_count - Total number of atoms
rcsb_accession_info.deposit_date - Deposition date
rcsb_accession_info.initial_release_date - Release date

Polymer Entity Level:

entity_poly.pdbx_seq_one_letter_code - Primary sequence
rcsb_polymer_entity.formula_weight - Formula weight of the entity in kDa
rcsb_entity_source_organism.scientific_name - Source organism
rcsb_entity_source_organism.ncbi_taxonomy_id - NCBI taxonomy ID

Assembly Level:

rcsb_assembly_info.polymer_entity_count - Number of polymer entities
rcsb_assembly_info.assembly_id - Assembly identifier

Search API

Query Types

The Search API supports seven primary query types:

TextQuery - Full-text search
AttributeQuery - Property-based search
SeqSimilarityQuery - Sequence similarity search
SeqMotifQuery - Motif pattern search
StructSimilarityQuery - 3D structure similarity
StructMotifQuery - Structural motif search
ChemSimilarityQuery - Chemical similarity search

AttributeQuery Operators

Available operators for AttributeQuery:

exact_match - Exact string match
contains_words - Contains all words
contains_phrase - Contains exact phrase
equals - Numerical equality
greater - Greater than (numerical)
greater_or_equal - Greater than or equal
less - Less than (numerical)
less_or_equal - Less than or equal
range - Numerical or date range (each bound can be inclusive or exclusive)
exists - Field has a value
in - Value in list
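It can help to see how an operator ends up in the JSON that the Search API receives. The sketch below builds a single attribute condition as a plain dictionary; `attribute_node` and `search_request` are hypothetical helpers, and the layout (a `terminal` node handled by the `text` service, posted to `https://search.rcsb.org/rcsbsearch/v2/query`) reflects the v2 request format as I understand it, so treat it as an assumption to verify against the current schema.

```python
import json


def attribute_node(attribute: str, operator: str, value=None, negation: bool = False) -> dict:
    """Build one attribute condition as a Search API terminal node."""
    parameters = {"attribute": attribute, "operator": operator, "negation": negation}
    # The "exists" operator takes no value, so only add one when given.
    if value is not None:
        parameters["value"] = value
    return {"type": "terminal", "service": "text", "parameters": parameters}


def search_request(node: dict, return_type: str = "entry") -> dict:
    """Wrap a query node in the top-level request body."""
    return {"query": node, "return_type": return_type}


# Example: X-ray structures, serialized as the request body.
body = search_request(attribute_node("exptl.method", "exact_match", "X-RAY DIFFRACTION"))
body_json = json.dumps(body, indent=2)
```

Printing `body_json` next to the output of a Python query object is a quick way to confirm that both describe the same search.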

Common Searchable Attributes

Resolution and Quality:

from rcsbapi.search import AttributeQuery

# High-resolution structures (better than 2.0 Å)
query = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)

Experimental Method:

query = AttributeQuery(
    attribute="exptl.method",
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)

Organism:

query = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)

Molecular Weight:

# formula_weight is stored in kDa, so this selects 10-50 kDa entities.
# The range value mirrors the Search API JSON; check that your rcsbapi
# version accepts this form.
query = AttributeQuery(
    attribute="rcsb_polymer_entity.formula_weight",
    operator="range",
    value={"from": 10, "to": 50, "include_lower": True, "include_upper": True}
)

Release Date:

# Structures released in 2024
query = AttributeQuery(
    attribute="rcsb_accession_info.initial_release_date",
    operator="range",
    value={
        "from": "2024-01-01",
        "to": "2024-12-31",
        "include_lower": True,
        "include_upper": True
    }
)

Sequence Similarity Search

Search for structures with similar sequences using MMseqs2:

from rcsbapi.search import SeqSimilarityQuery

# Basic sequence search (protein is the default sequence type)
query = SeqSimilarityQuery(
    value="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM",
    evalue_cutoff=0.1,
    identity_cutoff=0.9
)

# With sequence type specified
# Illustrative sequence; queries below the service's minimum length are rejected
query = SeqSimilarityQuery(
    value="ACGTACGTACGTACGTACGTACGTACGTACGT",
    evalue_cutoff=1e-5,
    identity_cutoff=0.8,
    sequence_type="dna"  # or "rna" or "protein"
)

Structure Similarity Search

Find structures with similar 3D geometry using BioZernike:

from rcsbapi.search import StructSimilarityQuery

# Search by entry
query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB"
)

# Search by chain
query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB",
    structure_input_type="chain_id",
    chain_id="A",
    target_search_space="polymer_entity_instance"
)

# Search by assembly
query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB",
    structure_input_type="assembly_id",
    assembly_id="1",
    target_search_space="assembly"
)

Combining Queries

Use Python bitwise operators to combine queries:

from rcsbapi.search import TextQuery, AttributeQuery

# AND operation (&)
query1 = TextQuery("kinase")
query2 = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)
combined = query1 & query2

# OR operation (|)
organism1 = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)
organism2 = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Mus musculus"
)
combined = organism1 | organism2

# NOT operation (~)
all_structures = TextQuery("protein")
low_res = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="greater",
    value=3.0
)
high_res_only = all_structures & (~low_res)

# Complex combinations
high_res_human_kinases = (
    TextQuery("kinase") &
    AttributeQuery(
        attribute="rcsb_entity_source_organism.scientific_name",
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.5
    )
)
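In the JSON sent to the Search API, `&` and `|` correspond to group nodes that wrap other nodes. The sketch below builds such a group by hand; `group_node` is a hypothetical helper, and the node layout (`type`, `logical_operator`, `nodes`) is my understanding of the v2 request format rather than something taken from this document.

```python
def group_node(logical_operator: str, *nodes: dict) -> dict:
    """Combine query nodes with 'and' or 'or' into a Search API group node."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    if not nodes:
        raise ValueError("a group needs at least one node")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


# Hand-written terminal nodes for illustration.
kinase = {"type": "terminal", "service": "full_text", "parameters": {"value": "kinase"}}
human = {
    "type": "terminal",
    "service": "text",
    "parameters": {
        "attribute": "rcsb_entity_source_organism.scientific_name",
        "operator": "exact_match",
        "value": "Homo sapiens",
    },
}

# Groups can nest, which is how (a & b) | c style expressions are represented.
human_kinases = group_node("and", kinase, human)
```

Negation is not a group: it is expressed per condition, which is why `~` applies to an individual query rather than to the operator joining two of them.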

Return Types

Control what information is returned:

from rcsbapi.search import TextQuery

query = TextQuery("hemoglobin")

# Return PDB IDs (default)
results = list(query())  # ['4HHB', '1A3N', ...]

# Return entry IDs with scores
results = list(query(return_type="entry", results_verbosity="minimal"))
# [{'identifier': '4HHB', 'score': 0.95}, ...]

# Return polymer entities
results = list(query(return_type="polymer_entity"))
# ['4HHB_1', '4HHB_2', ...]

File Download URLs

Structure Files

PDB Format (legacy):

https://files.rcsb.org/download/{PDB_ID}.pdb

mmCIF Format (modern standard):

https://files.rcsb.org/download/{PDB_ID}.cif

Structure Factors:

https://files.rcsb.org/download/{PDB_ID}-sf.cif

Biological Assembly:

https://files.rcsb.org/download/{PDB_ID}.pdb1  # Assembly 1
https://files.rcsb.org/download/{PDB_ID}.pdb2  # Assembly 2

FASTA Sequence:

https://www.rcsb.org/fasta/entry/{PDB_ID}
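The URL patterns above can be collected in one place so that callers do not assemble strings by hand. This is a minimal sketch that only reproduces the patterns listed in this section; `structure_file_url` and its `kind` names are hypothetical.

```python
FILES_BASE = "https://files.rcsb.org/download"
FASTA_BASE = "https://www.rcsb.org/fasta/entry"


def structure_file_url(pdb_id: str, kind: str = "cif", assembly: int = 1) -> str:
    """Return the download URL for one file type of a PDB entry.

    kind: 'pdb', 'cif', 'sf' (structure factors), 'assembly', or 'fasta'.
    assembly: assembly number, used only when kind is 'assembly'.
    """
    if kind in ("pdb", "cif"):
        return f"{FILES_BASE}/{pdb_id}.{kind}"
    if kind == "sf":
        return f"{FILES_BASE}/{pdb_id}-sf.cif"
    if kind == "assembly":
        if assembly < 1:
            raise ValueError("assembly numbers start at 1")
        return f"{FILES_BASE}/{pdb_id}.pdb{assembly}"
    if kind == "fasta":
        return f"{FASTA_BASE}/{pdb_id}"
    raise ValueError(f"unknown file kind: {kind}")
```

Not every entry has every file type (structure factors exist only for some experimental methods), so a 404 for a valid ID is possible.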

Python Download Helper

import os
import requests

def download_pdb_file(pdb_id, file_format="pdb", output_dir="."):
    """
    Download PDB structure file.

    Args:
        pdb_id: 4-character PDB ID
        file_format: 'pdb' or 'cif'
        output_dir: Directory to save file

    Returns:
        Path to the saved file, or None if the download failed
    """
    base_url = "https://files.rcsb.org/download"
    url = f"{base_url}/{pdb_id}.{file_format}"

    # A timeout keeps a stalled connection from blocking forever
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        output_path = os.path.join(output_dir, f"{pdb_id}.{file_format}")
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(response.text)
        print(f"Downloaded {pdb_id}.{file_format}")
        return output_path
    else:
        print(f"Error downloading {pdb_id}: {response.status_code}")
        return None

# Usage
download_pdb_file("4HHB", file_format="pdb")
download_pdb_file("4HHB", file_format="cif")

Rate Limiting and Best Practices

Rate Limits

The API implements rate limiting to ensure fair usage
If you exceed the limit, you'll receive a 429 HTTP error code
Recommended starting point: a few requests per second
Use exponential backoff to find acceptable request rates

Exponential Backoff Implementation

import time
import requests

def fetch_with_retry(url, max_retries=5, initial_delay=1):
    """
    Fetch URL with exponential backoff on rate limit errors.

    Args:
        url: URL to fetch
        max_retries: Maximum number of retry attempts
        initial_delay: Initial delay in seconds
    """
    delay = initial_delay

    for attempt in range(max_retries):
        response = requests.get(url)

        if response.status_code == 200:
            return response
        elif response.status_code == 429:
            print(f"Rate limited. Waiting {delay}s before retry...")
            time.sleep(delay)
            delay *= 2  # Exponential backoff
        else:
            response.raise_for_status()

    raise Exception(f"Failed after {max_retries} retries")

Batch Processing Best Practices

Use Search API first to get list of IDs, then fetch data
Cache results to avoid redundant queries
Process in chunks rather than all at once
Add delays between requests to respect rate limits
Use GraphQL for complex queries to minimize requests

import time
from rcsbapi.search import TextQuery
from rcsbapi.data import DataQuery

def batch_fetch_structures(query, fields, delay=0.5):
    """
    Fetch structures matching a query with rate limiting.

    Args:
        query: Search query object
        fields: Data API fields to return, e.g. ["struct.title", "exptl.method"]
        delay: Delay between requests in seconds
    """
    # Get list of IDs
    pdb_ids = list(query())
    print(f"Found {len(pdb_ids)} structures")

    # Fetch data for each
    results = {}
    for i, pdb_id in enumerate(pdb_ids):
        try:
            data_query = DataQuery(
                input_type="entries",
                input_ids=[pdb_id],
                return_data_list=fields
            )
            results[pdb_id] = data_query.exec()  # Parsed JSON response
            print(f"Fetched {i+1}/{len(pdb_ids)}: {pdb_id}")
            time.sleep(delay)  # Rate limiting
        except Exception as e:
            print(f"Error fetching {pdb_id}: {e}")

    return results
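The advice to cache results and process in chunks can be sketched independently of any particular client. In the example below, `chunked` and `fetch_in_chunks` are hypothetical helpers, and `fetch_chunk` stands for whatever function you use to retrieve a list of IDs in one request (assumed to return a dictionary keyed by ID).

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_chunks(
    ids: Sequence[str],
    fetch_chunk: Callable[[List[str]], Dict[str, dict]],
    cache: Dict[str, dict],
    size: int = 50,
) -> Dict[str, dict]:
    """Fetch only IDs missing from the cache, one chunk per request."""
    # dict.fromkeys removes duplicates while keeping the original order.
    missing = [i for i in dict.fromkeys(ids) if i not in cache]
    for chunk in chunked(missing, size):
        cache.update(fetch_chunk(chunk))
    return {i: cache[i] for i in ids if i in cache}
```

Passing the same `cache` dictionary across calls means repeated IDs cost nothing, and the chunk size gives one knob for trading request count against response size.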

Advanced Use Cases

Finding Drug-Target Complexes

from rcsbapi.search import AttributeQuery

# Find structures containing a specific ligand, identified by its chemical component ID
query = AttributeQuery(
    attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
    operator="exact_match",
    value="ATP"  # or other ligand code
)

results = list(query())
print(f"Found {len(results)} structures with ATP")

Filtering by Resolution and R-factor

from rcsbapi.search import AttributeQuery

# High-quality X-ray structures: resolution better than 2.0 Å and R-free below 0.25
resolution_query = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)

rfactor_query = AttributeQuery(
    attribute="refine.ls_R_factor_R_free",
    operator="less",
    value=0.25
)

high_quality = resolution_query & rfactor_query
results = list(high_quality())

Finding Recent Structures

import datetime

from rcsbapi.search import AttributeQuery

# Structures released in the last 30 days
one_month_ago = (datetime.date.today() - datetime.timedelta(days=30)).isoformat()
today = datetime.date.today().isoformat()

# The range value mirrors the Search API JSON; check that your rcsbapi
# version accepts this form.
query = AttributeQuery(
    attribute="rcsb_accession_info.initial_release_date",
    operator="range",
    value={
        "from": one_month_ago,
        "to": today,
        "include_lower": True,
        "include_upper": True
    }
)

recent_structures = list(query())

Troubleshooting

Common Errors

404 Not Found:

PDB ID doesn't exist or is obsolete
Check if ID is correct (case-sensitive)
Verify entry hasn't been superseded

429 Too Many Requests:

Rate limit exceeded
Implement exponential backoff
Reduce request frequency

500 Internal Server Error:

Temporary server issue
Retry after short delay
Check RCSB PDB status page

Empty Results:

Query too restrictive
Check attribute names and operators
Verify data exists for searched field

Debugging Tips

# Enable verbose output for searches
from rcsbapi.search import TextQuery

query = TextQuery("hemoglobin")
print(query.to_dict())  # See query structure

# Check query JSON
import json
print(json.dumps(query.to_dict(), indent=2))

# Test with curl
import subprocess
result = subprocess.run(
    ["curl", "https://data.rcsb.org/rest/v1/core/entry/4HHB"],
    capture_output=True,
    text=True
)
print(result.stdout)

Additional Resources

API Documentation: https://www.rcsb.org/docs/programmatic-access/web-apis-overview
Data API Redoc: https://data.rcsb.org/redoc/index.html
GraphQL Schema: https://data.rcsb.org/graphql
Python Package Docs: https://rcsbapi.readthedocs.io/
GitHub Issues: https://github.com/rcsb/py-rcsb-api/issues
Community Forum: https://www.rcsb.org/help
