# RCSB PDB API Reference

This document provides detailed information about the RCSB Protein Data Bank APIs, including advanced usage patterns, data schemas, and best practices.

## API Overview

RCSB PDB provides multiple programmatic interfaces:

1. **Data API** - Retrieve PDB data when you have an identifier
2. **Search API** - Find identifiers matching specific search criteria
3. **ModelServer API** - Access macromolecular model subsets
4. **VolumeServer API** - Retrieve volumetric data subsets
5. **Sequence Coordinates API** - Obtain alignments between structural and sequence databases
6. **Alignment API** - Perform structure alignment computations

## Data API

### Core Data Objects

The Data API organizes information hierarchically:

- **core_entry**: PDB entries or Computed Structure Models (CSM IDs start with AF_ or MA_)
- **core_polymer_entity**: Protein, DNA, and RNA entities
- **core_nonpolymer_entity**: Ligands, cofactors, ions
- **core_branched_entity**: Oligosaccharides
- **core_assembly**: Biological assemblies
- **core_polymer_entity_instance**: Individual chains
- **core_chem_comp**: Chemical components

### REST API Endpoints

Base URL: `https://data.rcsb.org/rest/v1/`

**Entry Data:**
```
GET https://data.rcsb.org/rest/v1/core/entry/{entry_id}
```

**Polymer Entity:**
```
GET https://data.rcsb.org/rest/v1/core/polymer_entity/{entry_id}/{entity_id}
```

**Assembly:**
```
GET https://data.rcsb.org/rest/v1/core/assembly/{entry_id}/{assembly_id}
```

**Examples:**
```bash
# Get entry data for hemoglobin
curl https://data.rcsb.org/rest/v1/core/entry/4HHB

# Get first polymer entity
curl https://data.rcsb.org/rest/v1/core/polymer_entity/4HHB/1

# Get biological assembly 1
curl https://data.rcsb.org/rest/v1/core/assembly/4HHB/1
```
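The same URL patterns can be assembled in Python. The following is a minimal sketch: `data_api_url` is a hypothetical helper (not part of any RCSB package) that only builds the URL and leaves the HTTP request to the caller.

```python
BASE_URL = "https://data.rcsb.org/rest/v1/core"

def data_api_url(object_type, entry_id, sub_id=None):
    """Build a Data API REST URL (hypothetical helper, for illustration only)."""
    parts = [BASE_URL, object_type, entry_id]
    if sub_id is not None:
        # Entity and assembly IDs are appended as an extra path segment
        parts.append(str(sub_id))
    return "/".join(parts)

print(data_api_url("entry", "4HHB"))
print(data_api_url("polymer_entity", "4HHB", 1))
print(data_api_url("assembly", "4HHB", 1))
```

Passing any of these URLs to `requests.get(url).json()` should return the same JSON document as the corresponding `curl` call.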

### GraphQL API

Endpoint: `https://data.rcsb.org/graphql`

The GraphQL API lets you request fields from any level of the hierarchy (entry, entity, instance, assembly) in a single query, so the response contains only the data you ask for.

**Example Query:**
```graphql
{
  entry(entry_id: "4HHB") {
    struct {
      title
    }
    exptl {
      method
    }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
      polymer_entity_count
    }
    rcsb_accession_info {
      deposit_date
      initial_release_date
    }
  }
}
```
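When several entries are needed, a plural root field avoids sending one request per ID. The sketch below assumes the `entries(entry_ids: [...])` root field and the `rcsb_id` field of the public schema; `build_entries_query` is a hypothetical helper that only builds the request body.

```python
import json

def build_entries_query(entry_ids, fields="rcsb_id struct { title }"):
    """Return a GraphQL request body covering several entries at once."""
    # A JSON list of strings is also a valid GraphQL list literal
    id_list = json.dumps(list(entry_ids))
    query = f"{{ entries(entry_ids: {id_list}) {{ {fields} }} }}"
    return {"query": query}

body = build_entries_query(["4HHB", "1A3N"])
print(body["query"])
# { entries(entry_ids: ["4HHB", "1A3N"]) { rcsb_id struct { title } } }
```

The returned dictionary can be sent as the `json=` argument of a POST request to the GraphQL endpoint.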

**Python Example:**
```python
import requests

query = """
{
  polymer_entity(entry_id: "4HHB", entity_id: "1") {
    rcsb_polymer_entity {
      pdbx_description
      formula_weight
    }
    entity_poly {
      pdbx_seq_one_letter_code
      pdbx_strand_id
    }
    rcsb_entity_source_organism {
      ncbi_taxonomy_id
      scientific_name
    }
  }
}
"""

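response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query},
    timeout=30,  # avoid hanging indefinitely on network problems
)
response.raise_for_status()  # surface HTTP-level failures early
data = response.json()  # requested fields are under data["data"]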
```
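A GraphQL server typically answers with HTTP 200 even when the query itself is invalid and reports the problem under an `errors` key, so checking the status code alone is not enough. The sketch below shows one way to unwrap a response; `unwrap_graphql` is a hypothetical helper and the sample payload is illustrative, not real API output.

```python
def unwrap_graphql(payload):
    """Return payload['data'], or raise if the response reports errors."""
    errors = payload.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise RuntimeError(f"GraphQL query failed: {messages}")
    return payload.get("data") or {}

# Illustrative payload shaped like a polymer_entity response (values are made up)
sample = {"data": {"polymer_entity": {"entity_poly": {"pdbx_strand_id": "A,C"}}}}

entity = unwrap_graphql(sample)["polymer_entity"]
# pdbx_strand_id is a comma-separated string of chain IDs
chains = entity["entity_poly"]["pdbx_strand_id"].split(",")
print(chains)  # ['A', 'C']
```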

### Common Data Fields

**Entry Level:**
- `struct.title` - Structure title/description
- `exptl[].method` - Experimental method (X-RAY DIFFRACTION, SOLUTION NMR, ELECTRON MICROSCOPY, etc.)
- `rcsb_entry_info.resolution_combined` - Resolution in Ångströms, returned as a list
- `rcsb_entry_info.deposited_atom_count` - Total number of atoms
- `rcsb_accession_info.deposit_date` - Deposition date
- `rcsb_accession_info.initial_release_date` - Release date

**Polymer Entity Level:**
- `entity_poly.pdbx_seq_one_letter_code` - Primary sequence
- `rcsb_polymer_entity.formula_weight` - Formula weight of the entity, in kilodaltons (kDa)
- `rcsb_entity_source_organism.scientific_name` - Source organism
- `rcsb_entity_source_organism.ncbi_taxonomy_id` - NCBI taxonomy ID

**Assembly Level:**
- `rcsb_assembly_info.polymer_entity_count` - Number of polymer entities
- `rcsb_assembly_info.assembly_id` - Assembly identifier
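
The dotted names above mirror the nesting of the returned JSON, with `[]` marking a list of objects. As a rough sketch, a small path walker can pull such fields out of a response; `get_path` is a hypothetical helper and the record below is a made-up stand-in for a real response.

```python
def get_path(record, path):
    """Follow a dotted path; a '[]' suffix maps the rest of the path over a list."""
    current = [record]
    saw_list = False
    for key in path.split("."):
        is_list = key.endswith("[]")
        name = key[:-2] if is_list else key
        next_values = []
        for item in current:
            if not isinstance(item, dict) or item.get(name) is None:
                continue  # missing or null fields are skipped
            if is_list:
                next_values.extend(item[name])
                saw_list = True
            else:
                next_values.append(item[name])
        current = next_values
    if saw_list:
        return current
    return current[0] if current else None

# Made-up record with the same shape as an entry-level response
record = {
    "struct": {"title": "Example title"},
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "rcsb_entry_info": {"resolution_combined": [1.8]},
}

print(get_path(record, "struct.title"))                     # Example title
print(get_path(record, "exptl[].method"))                   # ['X-RAY DIFFRACTION']
print(get_path(record, "rcsb_accession_info.deposit_date")) # None
```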

## Search API

### Query Types

The Search API supports seven primary query types, each exposed as a class in `rcsbapi.search`:

1. **TextQuery** - Full-text search
2. **AttributeQuery** - Property-based search
3. **SeqSimilarityQuery** - Sequence similarity search
4. **SeqMotifQuery** - Motif pattern search
5. **StructSimilarityQuery** - 3D structure similarity
6. **StructMotifQuery** - Structural motif search
7. **ChemSimilarityQuery** - Chemical similarity search

### AttributeQuery Operators

Available operators for AttributeQuery:

- `exact_match` - Exact string match
- `contains_words` - Contains any of the given words
- `contains_phrase` - Contains exact phrase
- `equals` - Numerical equality
- `greater` - Greater than (numerical)
- `greater_or_equal` - Greater than or equal
- `less` - Less than (numerical)
- `less_or_equal` - Less than or equal
- `range` - Numerical or date range; bounds are given as `from`/`to`, with inclusivity set by `include_lower`/`include_upper`
- `exists` - Field has a value
- `in` - Value in list
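
Each operator ends up in the `parameters` of a "terminal" node in the JSON sent to the Search API. The sketch below builds such a node by hand, assuming the v2 query format (`https://search.rcsb.org/rcsbsearch/v2/query`); `attribute_node` is a hypothetical helper, not a library function.

```python
import json

OPERATORS = {
    "exact_match", "contains_words", "contains_phrase", "equals",
    "greater", "greater_or_equal", "less", "less_or_equal",
    "range", "exists", "in",
}

def attribute_node(attribute, operator, value=None):
    """Build a terminal node for an attribute search (illustrative helper)."""
    if operator not in OPERATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")
    parameters = {"attribute": attribute, "operator": operator}
    if operator != "exists":
        parameters["value"] = value  # 'exists' takes no value
    return {"type": "terminal", "service": "text", "parameters": parameters}

request_body = {
    "query": attribute_node("rcsb_entry_info.resolution_combined", "less", 2.0),
    "return_type": "entry",
}
print(json.dumps(request_body, indent=2))
```

Validating the operator locally catches typos before a request is sent, which is cheaper than interpreting an error response.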

### Common Searchable Attributes

**Resolution and Quality:**
```python
from rcsbapi.search import AttributeQuery

# High-resolution structures (better than 2.0 Å).
# Attribute names are passed as dotted strings.
query = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)
```

**Experimental Method:**
```python
from rcsbapi.search import AttributeQuery

query = AttributeQuery(
    attribute="exptl.method",
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)
```

**Organism:**
```python
from rcsbapi.search import AttributeQuery

query = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)
```

**Molecular Weight:**
```python
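from rcsbapi.search import AttributeQuery

# formula_weight is stored in kDa, so 10-50 kDa is the range 10 to 50
query = AttributeQuery(
    attribute="rcsb_polymer_entity.formula_weight",
    operator="range",
    value={"from": 10, "to": 50, "include_lower": True, "include_upper": True}
)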
```

**Release Date:**
```python
from rcsbapi.search import AttributeQuery

# Structures released in 2024
query = AttributeQuery(
    attribute="rcsb_accession_info.initial_release_date",
    operator="range",
    value={
        "from": "2024-01-01",
        "to": "2024-12-31",
        "include_lower": True,
        "include_upper": True
    }
)
```

### Sequence Similarity Search

Search for structures with similar sequences using MMseqs2:

```python
from rcsbapi.search import SeqSimilarityQuery

# Basic protein sequence search
query = SeqSimilarityQuery(
    value="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM",
    evalue_cutoff=0.1,
    identity_cutoff=0.9
)

# With sequence type specified
query = SeqSimilarityQuery(
    value="ACGTACGTACGTACGTACGTACGTACGT",  # very short query sequences may be rejected
    evalue_cutoff=1e-5,
    identity_cutoff=0.8,
    sequence_type="dna"  # or "rna" or "protein"
)
```

### Structure Similarity Search

Find structures with similar 3D geometry using BioZernike:

```python
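from rcsbapi.search import StructSimilarityQuery

# Use assembly 1 of an entry as the query structure
query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB",
    structure_input_type="assembly_id",
    assembly_id="1"
)

# Use a single chain as the query structure
query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB",
    structure_input_type="chain_id",
    chain_id="A"
)

# Compare a chain against individual chains instead of whole assemblies
query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB",
    structure_input_type="chain_id",
    chain_id="A",
    target_search_space="polymer_entity_instance"
)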
```

### Combining Queries

Use Python bitwise operators to combine queries:

```python
from rcsbapi.search import TextQuery, AttributeQuery

# AND operation (&)
query1 = TextQuery("kinase")
query2 = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)
combined = query1 & query2

# OR operation (|)
organism1 = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)
organism2 = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Mus musculus"
)
combined = organism1 | organism2

# NOT operation (~)
all_structures = TextQuery("protein")
low_res = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="greater",
    value=3.0
)
high_res_only = all_structures & (~low_res)

# Complex combinations
high_res_human_kinases = (
    TextQuery("kinase") &
    AttributeQuery(
        attribute="rcsb_entity_source_organism.scientific_name",
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.5
    )
)
```
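In the Search API's JSON, combinations are expressed as nested "group" nodes with a `logical_operator`, and negation is a flag on the terminal node. The sketch below shows that shape, assuming the v2 query format; the helper names are hypothetical, and the way the Python package serializes a given expression may differ in detail.

```python
import json

def full_text(value):
    """Terminal node for a full-text search."""
    return {"type": "terminal", "service": "full_text", "parameters": {"value": value}}

def attribute(name, operator, value, negation=False):
    """Terminal node for an attribute search; negation=True expresses NOT."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": name, "operator": operator,
                       "negation": negation, "value": value},
    }

def group(logical_operator, *nodes):
    """Group node combining child nodes with 'and' or 'or'."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}

organism = "rcsb_entity_source_organism.scientific_name"
query = group(
    "and",
    full_text("kinase"),
    group("or",
          attribute(organism, "exact_match", "Homo sapiens"),
          attribute(organism, "exact_match", "Mus musculus")),
    # NOT (resolution > 3.0)
    attribute("rcsb_entry_info.resolution_combined", "greater", 3.0, negation=True),
)
print(json.dumps({"query": query, "return_type": "entry"}, indent=2))
```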

### Return Types

Control what information is returned:

```python
from rcsbapi.search import TextQuery

query = TextQuery("hemoglobin")

# Return entry IDs (default)
results = list(query())  # ['4HHB', '1A3N', ...]

# Return entry IDs with scores
results = list(query(return_type="entry", results_verbosity="minimal"))
# [{'identifier': '4HHB', 'score': ...}, ...]

# Return polymer entities
results = list(query(return_type="polymer_entity"))
# ['4HHB_1', '4HHB_2', ...]
```
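Identifiers returned for non-entry return types embed the parent entry ID. The sketch below recovers it, assuming the usual separators (`4HHB_1` for entities, `4HHB-1` for assemblies, `4HHB.A` for chain instances); `parent_entry_id` is a hypothetical helper. It splits on the last separator because CSM entry IDs such as those starting with `AF_` contain underscores themselves.

```python
SEPARATORS = {
    "assembly": "-",
    "polymer_entity": "_",
    "non_polymer_entity": "_",
    "polymer_instance": ".",
}

def parent_entry_id(identifier, return_type):
    """Return the entry ID embedded in a search result identifier."""
    if return_type == "entry":
        return identifier
    if return_type not in SEPARATORS:
        raise ValueError(f"Unsupported return type: {return_type!r}")
    # Split on the LAST separator so IDs like AF_..._1 keep their prefix
    entry_id, separator, _ = identifier.rpartition(SEPARATORS[return_type])
    if not separator:
        raise ValueError(f"{identifier!r} does not look like a {return_type} ID")
    return entry_id

entities = ["4HHB_1", "4HHB_2", "1A3N_1"]
# dict.fromkeys removes duplicates while keeping the original order
entries = list(dict.fromkeys(parent_entry_id(e, "polymer_entity") for e in entities))
print(entries)  # ['4HHB', '1A3N']
```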

## File Download URLs

### Structure Files

**PDB Format (legacy):**
```
https://files.rcsb.org/download/{PDB_ID}.pdb
```

**mmCIF Format (modern standard):**
```
https://files.rcsb.org/download/{PDB_ID}.cif
```

**Structure Factors:**
```
https://files.rcsb.org/download/{PDB_ID}-sf.cif
```

**Biological Assembly:**
```
https://files.rcsb.org/download/{PDB_ID}.pdb1  # Assembly 1
https://files.rcsb.org/download/{PDB_ID}.pdb2  # Assembly 2
```

**FASTA Sequence:**
```
https://www.rcsb.org/fasta/entry/{PDB_ID}
```
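The URL templates above can be collected in one place. This is a minimal sketch; `structure_file_url` and its `kind` labels are hypothetical names chosen for illustration.

```python
FILES_BASE = "https://files.rcsb.org/download"

def structure_file_url(pdb_id, kind="cif", assembly_id=1):
    """Return the download URL for one file kind (illustrative helper)."""
    if kind in ("pdb", "cif"):
        return f"{FILES_BASE}/{pdb_id}.{kind}"
    if kind == "sf":  # structure factors
        return f"{FILES_BASE}/{pdb_id}-sf.cif"
    if kind == "assembly":  # biological assembly in legacy PDB format
        return f"{FILES_BASE}/{pdb_id}.pdb{assembly_id}"
    if kind == "fasta":
        return f"https://www.rcsb.org/fasta/entry/{pdb_id}"
    raise ValueError(f"Unknown file kind: {kind!r}")

for kind in ("pdb", "cif", "sf", "assembly", "fasta"):
    print(structure_file_url("4HHB", kind))
```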

### Python Download Helper

```python
from pathlib import Path

import requests

def download_pdb_file(pdb_id, file_format="pdb", output_dir="."):
    """
    Download a structure file from RCSB PDB.

    Args:
        pdb_id: 4-character PDB ID
        file_format: 'pdb' or 'cif'
        output_dir: Directory to save the file in (must already exist)

    Returns:
        Path to the saved file, or None if the download failed.
    """
    url = f"https://files.rcsb.org/download/{pdb_id}.{file_format}"

    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        print(f"Error downloading {pdb_id}: {response.status_code}")
        return None

    output_path = Path(output_dir) / f"{pdb_id}.{file_format}"
    output_path.write_bytes(response.content)  # raw bytes, no re-encoding
    print(f"Downloaded {output_path.name}")
    return output_path

# Usage
download_pdb_file("4HHB", file_format="pdb")
download_pdb_file("4HHB", file_format="cif")
```

## Rate Limiting and Best Practices

### Rate Limits

- The API implements rate limiting to ensure fair usage
- If you exceed the limit, you'll receive a 429 HTTP error code
- Recommended starting point: a few requests per second
- On a 429 response, back off exponentially instead of retrying immediately, then settle on a request rate that no longer triggers the limit

### Exponential Backoff Implementation

```python
import time
import requests

def fetch_with_retry(url, max_retries=5, initial_delay=1):
    """
    Fetch URL with exponential backoff on rate limit errors.

    Args:
        url: URL to fetch
        max_retries: Maximum number of retries after the first attempt
        initial_delay: Initial delay in seconds
    """
    delay = initial_delay

    for attempt in range(max_retries + 1):
        response = requests.get(url, timeout=30)

        if response.status_code == 200:
            return response
        if response.status_code != 429:
            response.raise_for_status()  # raises for other 4xx/5xx codes
            return response  # non-error, non-200 codes go back to the caller
        if attempt < max_retries:
            print(f"Rate limited. Waiting {delay}s before retry...")
            time.sleep(delay)
            delay *= 2  # Exponential backoff

    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```
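Two common refinements are capping the delay and adding random jitter so that parallel clients do not retry in lockstep. The sketch below only computes the delay schedule; `backoff_delays` and its parameters are hypothetical, and the default cap of 30 seconds is an arbitrary choice rather than a documented RCSB value.

```python
import random

def backoff_delays(max_retries=5, initial_delay=1.0, max_delay=30.0, jitter=0.0, rng=random):
    """Return the wait (in seconds) before each retry.

    jitter is a fraction of the base delay added at random (0 disables it).
    """
    delays = []
    for attempt in range(max_retries):
        base = min(initial_delay * 2 ** attempt, max_delay)  # doubled, then capped
        delays.append(base + rng.uniform(0, jitter * base))
    return delays

print(backoff_delays())               # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(max_retries=7))  # last two delays are capped at 30.0
```

With `jitter=0.5`, each delay falls somewhere between the base value and 1.5 times the base value.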

### Batch Processing Best Practices

1. **Use the Search API first** to get a list of IDs, then fetch data
2. **Cache results** to avoid redundant queries
3. **Process in chunks** rather than all at once
4. **Add delays** between requests to respect rate limits
5. **Use GraphQL** for complex queries to minimize requests

```python
import time

import requests

def batch_fetch_structures(query, delay=0.5):
    """
    Fetch entry data for structures matching a query, with rate limiting.

    Args:
        query: Search query object, e.g. TextQuery("hemoglobin")
        delay: Delay between requests in seconds
    """
    # Get list of IDs
    pdb_ids = list(query())
    print(f"Found {len(pdb_ids)} structures")

    # Fetch entry-level data for each ID from the REST endpoint
    results = {}
    for i, pdb_id in enumerate(pdb_ids):
        url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            results[pdb_id] = response.json()
            print(f"Fetched {i+1}/{len(pdb_ids)}: {pdb_id}")
        except requests.RequestException as e:
            print(f"Error fetching {pdb_id}: {e}")
        time.sleep(delay)  # pause after every request, including failed ones

    return results
```
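Chunking and caching from the list above can be combined in a small wrapper. In this sketch, `fetch_in_chunks` is a hypothetical helper and `fetch_chunk` is any caller-supplied function that takes a list of IDs and returns a dictionary keyed by ID (for example, one batched GraphQL request); a stand-in fetcher is used here so the example runs offline.

```python
import time

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch_in_chunks(ids, fetch_chunk, chunk_size=50, delay=0.5, cache=None):
    """Fetch records chunk by chunk, skipping IDs already in the cache."""
    cache = {} if cache is None else cache
    # Drop duplicates (keeping order) and anything already cached
    missing = [i for i in dict.fromkeys(ids) if i not in cache]
    for n, chunk in enumerate(chunked(missing, chunk_size)):
        if n > 0:
            time.sleep(delay)  # pause between requests, not before the first
        cache.update(fetch_chunk(chunk))
    return {i: cache[i] for i in ids if i in cache}

# Stand-in fetcher; a real one would call the Data API
def fake_fetch(chunk):
    return {pdb_id: {"rcsb_id": pdb_id} for pdb_id in chunk}

cache = {}
records = fetch_in_chunks(["4HHB", "1A3N", "4HHB"], fake_fetch,
                          chunk_size=2, delay=0, cache=cache)
print(sorted(records))  # ['1A3N', '4HHB']
```

Passing the same `cache` dictionary to later calls means repeated IDs cost no additional requests.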

## Advanced Use Cases

### Finding Drug-Target Complexes

```python
from rcsbapi.search import AttributeQuery

# Find structures containing a specific ligand, identified by its chemical component ID
query = AttributeQuery(
    attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
    operator="exact_match",
    value="ATP"  # or another ligand code
)

results = list(query())
print(f"Found {len(results)} structures with ATP")
```

### Filtering by Resolution and R-factor

```python
from rcsbapi.search import AttributeQuery

# High-quality X-ray structures
resolution_query = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)

rfactor_query = AttributeQuery(
    attribute="refine.ls_R_factor_R_free",
    operator="less",
    value=0.25
)

high_quality = resolution_query & rfactor_query
results = list(high_quality())
```

### Finding Recent Structures

```python
import datetime

from rcsbapi.search import AttributeQuery

# Structures released in the last 30 days
today = datetime.date.today()
one_month_ago = today - datetime.timedelta(days=30)

query = AttributeQuery(
    attribute="rcsb_accession_info.initial_release_date",
    operator="range",
    value={
        "from": one_month_ago.isoformat(),
        "to": today.isoformat(),
        "include_lower": True,
        "include_upper": True
    }
)

recent_structures = list(query())
```

## Troubleshooting

### Common Errors

**404 Not Found:**
- PDB ID doesn't exist or is obsolete
- Check that the ID is spelled correctly (uppercase IDs such as `4HHB` are the safe choice)
- Verify entry hasn't been superseded

**429 Too Many Requests:**
- Rate limit exceeded
- Implement exponential backoff
- Reduce request frequency

**500 Internal Server Error:**
- Temporary server issue
- Retry after short delay
- Check RCSB PDB status page

**Empty Results:**
- Query too restrictive
- Check attribute names and operators
- Verify data exists for searched field
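
Many 404 responses come from malformed identifiers, which can be caught before a request is sent. The sketch below checks only the shape of an ID, assuming classic 4-character PDB IDs (a digit followed by three alphanumerics) and the `AF_`/`MA_` prefixes of Computed Structure Models; `classify_entry_id` is a hypothetical helper and cannot tell whether an entry actually exists.

```python
import re

PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")
CSM_PREFIXES = ("AF_", "MA_")

def classify_entry_id(identifier):
    """Return 'pdb', 'csm', or 'invalid' based on the shape of the ID."""
    identifier = identifier.strip()
    if identifier.upper().startswith(CSM_PREFIXES):
        return "csm"
    if PDB_ID.fullmatch(identifier):
        return "pdb"
    return "invalid"

for candidate in ["4HHB", "AF_AFP68871F1", "HHB4", "4HHB_1"]:
    print(candidate, classify_entry_id(candidate))
```

Here `4HHB_1` is reported as invalid because it is an entity ID, not an entry ID.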

### Debugging Tips

```python
import json

import requests
from rcsbapi.search import TextQuery

# Inspect the JSON that will be sent to the Search API
query = TextQuery("hemoglobin")
print(json.dumps(query.to_dict(), indent=2))

# Check that the Data API is reachable and the ID resolves
response = requests.get(
    "https://data.rcsb.org/rest/v1/core/entry/4HHB",
    timeout=30
)
print(response.status_code)
print(response.text[:500])  # first part of the JSON body
```

## Additional Resources

- **API Documentation:** https://www.rcsb.org/docs/programmatic-access/web-apis-overview
- **Data API Redoc:** https://data.rcsb.org/redoc/index.html
- **GraphQL Schema:** https://data.rcsb.org/graphql
- **Python Package Docs:** https://rcsbapi.readthedocs.io/
- **GitHub Issues:** https://github.com/rcsb/py-rcsb-api/issues
- **Community Forum:** https://www.rcsb.org/help