gh-k-dense-ai-claude-scient…/skills/torch_geometric/references/datasets_reference.md

# PyTorch Geometric Datasets Reference

This document provides a comprehensive catalog of all datasets available in `torch_geometric.datasets`.

## Citation Networks

### Planetoid
**Usage**: Node classification, semi-supervised learning
**Networks**: Cora, CiteSeer, PubMed
**Description**: Citation networks where nodes are papers and edges are citations
- **Cora**: 2,708 nodes, 5,429 edges, 7 classes, 1,433 features
- **CiteSeer**: 3,327 nodes, 4,732 edges, 6 classes, 3,703 features
- **PubMed**: 19,717 nodes, 44,338 edges, 3 classes, 500 features

```python
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora')
```

### Coauthor
**Usage**: Node classification on collaboration networks
**Networks**: CS, Physics
**Description**: Co-authorship networks from Microsoft Academic Graph
- **CS**: 18,333 nodes, 81,894 edges, 15 classes (computer science)
- **Physics**: 34,493 nodes, 247,962 edges, 5 classes (physics)

```python
from torch_geometric.datasets import Coauthor
dataset = Coauthor(root='/tmp/CS', name='CS')
```

### Amazon
**Usage**: Node classification on product networks
**Networks**: Computers, Photo
**Description**: Amazon co-purchase networks where nodes are products
- **Computers**: 13,752 nodes, 245,861 edges, 10 classes
- **Photo**: 7,650 nodes, 119,081 edges, 8 classes

```python
from torch_geometric.datasets import Amazon
dataset = Amazon(root='/tmp/Computers', name='Computers')
```

### CitationFull
**Usage**: Citation network analysis
**Networks**: Cora, Cora_ML, DBLP, PubMed
**Description**: Full citation networks without sampling

```python
from torch_geometric.datasets import CitationFull
dataset = CitationFull(root='/tmp/Cora', name='Cora')
```

## Graph Classification

### TUDataset
**Usage**: Graph classification, graph kernel benchmarks
**Description**: Collection of 120+ graph classification datasets
- **MUTAG**: 188 graphs, 2 classes (molecular compounds)
- **PROTEINS**: 1,113 graphs, 2 classes (protein structures)
- **ENZYMES**: 600 graphs, 6 classes (protein enzymes)
- **IMDB-BINARY**: 1,000 graphs, 2 classes (social networks)
- **REDDIT-BINARY**: 2,000 graphs, 2 classes (discussion threads)
- **COLLAB**: 5,000 graphs, 3 classes (scientific collaborations)
- **NCI1**: 4,110 graphs, 2 classes (chemical compounds)
- **DD**: 1,178 graphs, 2 classes (protein structures)

```python
from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
```

### MoleculeNet
**Usage**: Molecular property prediction
**Datasets**: Over 10 molecular benchmark datasets
**Description**: Comprehensive molecular machine learning benchmarks
- **ESOL**: Aqueous solubility (regression)
- **FreeSolv**: Hydration free energy (regression)
- **Lipophilicity**: Octanol/water distribution (regression)
- **BACE**: Binding results (classification)
- **BBBP**: Blood-brain barrier penetration (classification)
- **HIV**: HIV inhibition (classification)
- **Tox21**: Toxicity prediction (multi-task classification)
- **ToxCast**: Toxicology forecasting (multi-task classification)
- **SIDER**: Side effects (multi-task classification)
- **ClinTox**: Clinical trial toxicity (multi-task classification)

```python
from torch_geometric.datasets import MoleculeNet
dataset = MoleculeNet(root='/tmp/ESOL', name='ESOL')
```

## Molecular and Chemical Datasets

### QM7b
**Usage**: Molecular property prediction (quantum mechanics)
**Description**: 7,211 molecules with up to 7 heavy atoms
- Properties: Atomization energies, electronic properties

```python
from torch_geometric.datasets import QM7b
dataset = QM7b(root='/tmp/QM7b')
```

### QM9
**Usage**: Molecular property prediction (quantum mechanics)
**Description**: ~130,000 molecules with up to 9 heavy atoms (C, O, N, F)
- Properties: 19 quantum chemical properties including HOMO, LUMO, gap, energy

```python
from torch_geometric.datasets import QM9
dataset = QM9(root='/tmp/QM9')
```

### ZINC
**Usage**: Molecular generation, property prediction
**Description**: ~250,000 drug-like molecular graphs
- Properties: Constrained solubility, molecular weight

```python
from torch_geometric.datasets import ZINC
dataset = ZINC(root='/tmp/ZINC', subset=True)
```

### AQSOL
**Usage**: Aqueous solubility prediction
**Description**: ~10,000 molecules with solubility measurements

```python
from torch_geometric.datasets import AQSOL
dataset = AQSOL(root='/tmp/AQSOL')
```

### MD17
**Usage**: Molecular dynamics, force field learning
**Description**: Molecular dynamics trajectories for small molecules
- Molecules: Benzene, Uracil, Naphthalene, Aspirin, Salicylic acid, etc.

```python
from torch_geometric.datasets import MD17
dataset = MD17(root='/tmp/MD17', name='benzene')
```

### PCQM4Mv2
**Usage**: Large-scale molecular property prediction
**Description**: 3.8M molecules from PubChem for quantum chemistry
- Part of OGB Large-Scale Challenge

```python
from torch_geometric.datasets import PCQM4Mv2
dataset = PCQM4Mv2(root='/tmp/PCQM4Mv2')
```

## Social Networks

### Reddit
**Usage**: Large-scale node classification
**Description**: Reddit posts from September 2014
- 232,965 nodes, 11,606,919 edges, 41 classes
- Features: TF-IDF of post content

```python
from torch_geometric.datasets import Reddit
dataset = Reddit(root='/tmp/Reddit')
```

### Reddit2
**Usage**: Large-scale node classification
**Description**: Updated Reddit dataset with more posts

```python
from torch_geometric.datasets import Reddit2
dataset = Reddit2(root='/tmp/Reddit2')
```

### Twitch
**Usage**: Node classification, social network analysis
**Networks**: DE, EN, ES, FR, PT, RU
**Description**: Twitch user networks by language

```python
from torch_geometric.datasets import Twitch
dataset = Twitch(root='/tmp/Twitch', name='DE')
```

### Facebook
**Usage**: Social network analysis, node classification
**Description**: Facebook page-page networks

```python
from torch_geometric.datasets import FacebookPagePage
dataset = FacebookPagePage(root='/tmp/Facebook')
```

### GitHub
**Usage**: Social network analysis
**Description**: GitHub developer networks

```python
from torch_geometric.datasets import GitHub
dataset = GitHub(root='/tmp/GitHub')
```

## Knowledge Graphs

### Entities
**Usage**: Link prediction, knowledge graph embeddings
**Datasets**: AIFB, MUTAG, BGS, AM
**Description**: RDF knowledge graphs with typed relations

```python
from torch_geometric.datasets import Entities
dataset = Entities(root='/tmp/AIFB', name='AIFB')
```

### WordNet18
**Usage**: Link prediction on semantic networks
**Description**: Subset of WordNet with 18 relations
- 40,943 entities, 151,442 triplets

```python
from torch_geometric.datasets import WordNet18
dataset = WordNet18(root='/tmp/WordNet18')
```

### WordNet18RR
**Usage**: Link prediction (no inverse relations)
**Description**: Refined version without inverse relations

```python
from torch_geometric.datasets import WordNet18RR
dataset = WordNet18RR(root='/tmp/WordNet18RR')
```

### FB15k-237
**Usage**: Link prediction on Freebase
**Description**: Subset of Freebase with 237 relations
- 14,541 entities, 310,116 triplets

```python
from torch_geometric.datasets import FB15k_237
dataset = FB15k_237(root='/tmp/FB15k')
```

## Heterogeneous Graphs

### OGB_MAG
**Usage**: Heterogeneous graph learning, node classification
**Description**: Microsoft Academic Graph with multiple node/edge types
- Node types: paper, author, institution, field of study
- 1M+ nodes, 21M+ edges

```python
from torch_geometric.datasets import OGB_MAG
dataset = OGB_MAG(root='/tmp/OGB_MAG')
```

### MovieLens
**Usage**: Recommendation systems, link prediction
**Versions**: 100K, 1M, 10M, 20M
**Description**: User-movie rating networks
- Node types: user, movie
- Edge types: rates

```python
from torch_geometric.datasets import MovieLens
dataset = MovieLens(root='/tmp/MovieLens', model_name='100k')
```

### IMDB
**Usage**: Heterogeneous graph learning
**Description**: IMDB movie network
- Node types: movie, actor, director

```python
from torch_geometric.datasets import IMDB
dataset = IMDB(root='/tmp/IMDB')
```

### DBLP
**Usage**: Heterogeneous graph learning, node classification
**Description**: DBLP bibliography network
- Node types: author, paper, term, conference

```python
from torch_geometric.datasets import DBLP
dataset = DBLP(root='/tmp/DBLP')
```

### LastFM
**Usage**: Heterogeneous recommendation
**Description**: LastFM music network
- Node types: user, artist, tag

```python
from torch_geometric.datasets import LastFM
dataset = LastFM(root='/tmp/LastFM')
```

## Temporal Graphs

### BitcoinOTC
**Usage**: Temporal link prediction, trust networks
**Description**: Bitcoin OTC trust network over time

```python
from torch_geometric.datasets import BitcoinOTC
dataset = BitcoinOTC(root='/tmp/BitcoinOTC')
```

### ICEWS18
**Usage**: Temporal knowledge graph completion
**Description**: Integrated Crisis Early Warning System events

```python
from torch_geometric.datasets import ICEWS18
dataset = ICEWS18(root='/tmp/ICEWS18')
```

### GDELT
**Usage**: Temporal event forecasting
**Description**: Global Database of Events, Language, and Tone

```python
from torch_geometric.datasets import GDELT
dataset = GDELT(root='/tmp/GDELT')
```

### JODIEDataset
**Usage**: Dynamic graph learning
**Datasets**: Reddit, Wikipedia, MOOC, LastFM
**Description**: Temporal interaction networks

```python
from torch_geometric.datasets import JODIEDataset
dataset = JODIEDataset(root='/tmp/JODIE', name='Reddit')
```

## 3D Meshes and Point Clouds

### ShapeNet
**Usage**: 3D shape classification and segmentation
**Description**: Large-scale 3D CAD model dataset
- 16,881 models across 16 categories
- Part-level segmentation labels

```python
from torch_geometric.datasets import ShapeNet
dataset = ShapeNet(root='/tmp/ShapeNet', categories=['Airplane'])
```

### ModelNet
**Usage**: 3D shape classification
**Versions**: ModelNet10, ModelNet40
**Description**: CAD models for 3D object classification
- ModelNet10: 4,899 models, 10 categories
- ModelNet40: 12,311 models, 40 categories

```python
from torch_geometric.datasets import ModelNet
dataset = ModelNet(root='/tmp/ModelNet', name='10')
```

### FAUST
**Usage**: 3D shape matching, correspondence
**Description**: Human body scans for shape analysis
- 100 meshes of 10 people in 10 poses

```python
from torch_geometric.datasets import FAUST
dataset = FAUST(root='/tmp/FAUST')
```

### CoMA
**Usage**: 3D mesh deformation
**Description**: Facial expression meshes
- 20,466 3D face scans with expressions

```python
from torch_geometric.datasets import CoMA
dataset = CoMA(root='/tmp/CoMA')
```

### S3DIS
**Usage**: 3D semantic segmentation
**Description**: Stanford Large-Scale 3D Indoor Spaces
- 6 areas, 271 rooms, point cloud data

```python
from torch_geometric.datasets import S3DIS
dataset = S3DIS(root='/tmp/S3DIS', test_area=6)
```

## Image and Vision Datasets

### MNISTSuperpixels
**Usage**: Graph-based image classification
**Description**: MNIST images as superpixel graphs
- 70,000 graphs (60k train, 10k test)

```python
from torch_geometric.datasets import MNISTSuperpixels
dataset = MNISTSuperpixels(root='/tmp/MNIST')
```

### Flickr
**Usage**: Image description, node classification
**Description**: Flickr image network
- 89,250 nodes, 899,756 edges

```python
from torch_geometric.datasets import Flickr
dataset = Flickr(root='/tmp/Flickr')
```

### PPI
**Usage**: Protein-protein interaction prediction
**Description**: Multi-graph protein interaction networks
- 24 graphs, 2,373 nodes total

```python
from torch_geometric.datasets import PPI
dataset = PPI(root='/tmp/PPI', split='train')
```

## Small Classic Graphs

### KarateClub
**Usage**: Community detection, visualization
**Description**: Zachary's karate club network
- 34 nodes, 78 edges, 2 communities

```python
from torch_geometric.datasets import KarateClub
dataset = KarateClub()
```

## Open Graph Benchmark (OGB)

PyG integrates seamlessly with OGB datasets:

### Node Property Prediction
- **ogbn-products**: Amazon product network (2.4M nodes)
- **ogbn-proteins**: Protein association network (132K nodes)
- **ogbn-arxiv**: Citation network (169K nodes)
- **ogbn-papers100M**: Large citation network (111M nodes)
- **ogbn-mag**: Heterogeneous academic graph

### Link Property Prediction
- **ogbl-ppa**: Protein association networks
- **ogbl-collab**: Collaboration networks
- **ogbl-ddi**: Drug-drug interaction network
- **ogbl-citation2**: Citation network
- **ogbl-wikikg2**: Wikidata knowledge graph

### Graph Property Prediction
- **ogbg-molhiv**: Molecular HIV activity prediction
- **ogbg-molpcba**: Molecular bioassays (multi-task)
- **ogbg-ppa**: Protein function prediction
- **ogbg-code2**: Code abstract syntax trees

```python
from torch_geometric.datasets import OGB_MAG, OGB_PPA
# or
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
```

## Synthetic Datasets

### FakeDataset
**Usage**: Testing, debugging
**Description**: Generates random graph data

```python
from torch_geometric.datasets import FakeDataset
dataset = FakeDataset(num_graphs=100, avg_num_nodes=50)
```

### StochasticBlockModelDataset
**Usage**: Community detection benchmarks
**Description**: Graphs generated from stochastic block models

```python
from torch_geometric.datasets import StochasticBlockModelDataset
dataset = StochasticBlockModelDataset(root='/tmp/SBM', num_graphs=1000)
```

### ExplainerDataset
**Usage**: Testing explainability methods
**Description**: Synthetic graphs with known explanation ground truth

```python
from torch_geometric.datasets import ExplainerDataset
dataset = ExplainerDataset(num_graphs=1000)
```

## Materials Science

### QM8
**Usage**: Molecular property prediction
**Description**: Electronic properties of small molecules

```python
from torch_geometric.datasets import QM8
dataset = QM8(root='/tmp/QM8')
```

## Biological Networks

### PPI (Protein-Protein Interaction)
Already listed above under Image and Vision Datasets

### STRING
**Usage**: Protein interaction networks
**Description**: Known and predicted protein-protein interactions

```python
# Available through external sources or custom loading
```

## Usage Tips

1. **Start with small datasets**: Use Cora, KarateClub, or ENZYMES for prototyping
2. **Citation networks**: Planetoid datasets are perfect for node classification
3. **Graph classification**: TUDataset provides diverse benchmarks
4. **Molecular**: QM9, ZINC, MoleculeNet for chemistry applications
5. **Large-scale**: Use Reddit, OGB datasets with NeighborLoader
6. **Heterogeneous**: OGB_MAG, MovieLens, IMDB for multi-type graphs
7. **Temporal**: JODIE, ICEWS for dynamic graph learning
8. **3D**: ShapeNet, ModelNet, S3DIS for geometric learning

## Common Patterns

### Loading with Transforms
```python
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='/tmp/Cora', name='Cora',
                    transform=NormalizeFeatures())
```

### Train/Val/Test Splits
```python
# For datasets with pre-defined splits
data = dataset[0]
train_data = data[data.train_mask]
val_data = data[data.val_mask]
test_data = data[data.test_mask]

# For graph classification
from torch_geometric.loader import DataLoader
train_dataset = dataset[:int(len(dataset) * 0.8)]
test_dataset = dataset[int(len(dataset) * 0.8):]
train_loader = DataLoader(train_dataset, batch_size=32)
```

### Custom Data Loading
```python
from torch_geometric.data import Data, Dataset

class MyCustomDataset(Dataset):
    def __init__(self, root, transform=None):
        super().__init__(root, transform)
        # Your initialization

    def len(self):
        return len(self.data_list)

    def get(self, idx):
        # Load and return data object
        return self.data_list[idx]
```