PyTorch Geometric Datasets Reference
This document catalogs commonly used datasets available in torch_geometric.datasets, grouped by domain.
Citation Networks
Planetoid
Usage: Node classification, semi-supervised learning Networks: Cora, CiteSeer, PubMed Description: Citation networks where nodes are papers and edges are citations
- Cora: 2,708 nodes, 5,429 edges, 7 classes, 1,433 features
- CiteSeer: 3,327 nodes, 4,732 edges, 6 classes, 3,703 features
- PubMed: 19,717 nodes, 44,338 edges, 3 classes, 500 features
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora')
Coauthor
Usage: Node classification on collaboration networks Networks: CS, Physics Description: Co-authorship networks from Microsoft Academic Graph
- CS: 18,333 nodes, 81,894 edges, 15 classes (computer science)
- Physics: 34,493 nodes, 247,962 edges, 5 classes (physics)
from torch_geometric.datasets import Coauthor
dataset = Coauthor(root='/tmp/CS', name='CS')
Amazon
Usage: Node classification on product networks Networks: Computers, Photo Description: Amazon co-purchase networks where nodes are products
- Computers: 13,752 nodes, 245,861 edges, 10 classes
- Photo: 7,650 nodes, 119,081 edges, 8 classes
from torch_geometric.datasets import Amazon
dataset = Amazon(root='/tmp/Computers', name='Computers')
CitationFull
Usage: Citation network analysis Networks: Cora, Cora_ML, CiteSeer, DBLP, PubMed Description: Full citation networks without sampling
from torch_geometric.datasets import CitationFull
dataset = CitationFull(root='/tmp/Cora', name='Cora')
Graph Classification
TUDataset
Usage: Graph classification, graph kernel benchmarks Description: Collection of 120+ graph classification datasets
- MUTAG: 188 graphs, 2 classes (molecular compounds)
- PROTEINS: 1,113 graphs, 2 classes (protein structures)
- ENZYMES: 600 graphs, 6 classes (protein enzymes)
- IMDB-BINARY: 1,000 graphs, 2 classes (social networks)
- REDDIT-BINARY: 2,000 graphs, 2 classes (discussion threads)
- COLLAB: 5,000 graphs, 3 classes (scientific collaborations)
- NCI1: 4,110 graphs, 2 classes (chemical compounds)
- DD: 1,178 graphs, 2 classes (protein structures)
from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
MoleculeNet
Usage: Molecular property prediction Datasets: Over 10 molecular benchmark datasets Description: Comprehensive molecular machine learning benchmarks
- ESOL: Aqueous solubility (regression)
- FreeSolv: Hydration free energy (regression)
- Lipophilicity: Octanol/water distribution (regression)
- BACE: Binding results (classification)
- BBBP: Blood-brain barrier penetration (classification)
- HIV: HIV inhibition (classification)
- Tox21: Toxicity prediction (multi-task classification)
- ToxCast: Toxicology forecasting (multi-task classification)
- SIDER: Side effects (multi-task classification)
- ClinTox: Clinical trial toxicity (multi-task classification)
from torch_geometric.datasets import MoleculeNet
dataset = MoleculeNet(root='/tmp/ESOL', name='ESOL')
Molecular and Chemical Datasets
QM7b
Usage: Molecular property prediction (quantum mechanics) Description: 7,211 molecules with up to 7 heavy atoms
- Properties: Atomization energies, electronic properties
from torch_geometric.datasets import QM7b
dataset = QM7b(root='/tmp/QM7b')
QM9
Usage: Molecular property prediction (quantum mechanics) Description: ~130,000 molecules with up to 9 heavy atoms (C, O, N, F)
- Properties: 19 quantum chemical properties including HOMO, LUMO, gap, energy
from torch_geometric.datasets import QM9
dataset = QM9(root='/tmp/QM9')
ZINC
Usage: Molecular generation, property prediction Description: ~250,000 drug-like molecular graphs
- Properties: Constrained solubility, molecular weight
from torch_geometric.datasets import ZINC
dataset = ZINC(root='/tmp/ZINC', subset=True)  # subset=True loads the 12,000-graph benchmark subset
AQSOL
Usage: Aqueous solubility prediction Description: ~10,000 molecules with solubility measurements
from torch_geometric.datasets import AQSOL
dataset = AQSOL(root='/tmp/AQSOL')
MD17
Usage: Molecular dynamics, force field learning Description: Molecular dynamics trajectories for small molecules
- Molecules: Benzene, Uracil, Naphthalene, Aspirin, Salicylic acid, etc.
from torch_geometric.datasets import MD17
dataset = MD17(root='/tmp/MD17', name='benzene')
PCQM4Mv2
Usage: Large-scale molecular property prediction Description: 3.8M molecules from PubChem for quantum chemistry
- Part of OGB Large-Scale Challenge
from torch_geometric.datasets import PCQM4Mv2
dataset = PCQM4Mv2(root='/tmp/PCQM4Mv2')
Social Networks
Reddit
Usage: Large-scale node classification Description: Reddit posts from September 2014
- 232,965 nodes, 11,606,919 edges, 41 classes
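- Features: 602-dimensional vectors built from averaged GloVe embeddings of the post title and comments, plus the post's score and comment count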
from torch_geometric.datasets import Reddit
dataset = Reddit(root='/tmp/Reddit')
Reddit2
Usage: Large-scale node classification Description: Sparser variant of the Reddit dataset (same posts, fewer edges), as used in the GraphSAINT paper
from torch_geometric.datasets import Reddit2
dataset = Reddit2(root='/tmp/Reddit2')
Twitch
Usage: Node classification, social network analysis Networks: DE, EN, ES, FR, PT, RU Description: Twitch user networks by language
from torch_geometric.datasets import Twitch
dataset = Twitch(root='/tmp/Twitch', name='DE')
FacebookPagePage
Usage: Social network analysis, node classification Description: Facebook page-page networks
from torch_geometric.datasets import FacebookPagePage
dataset = FacebookPagePage(root='/tmp/Facebook')
GitHub
Usage: Social network analysis Description: GitHub developer networks
from torch_geometric.datasets import GitHub
dataset = GitHub(root='/tmp/GitHub')
Knowledge Graphs
Entities
Usage: Entity (node) classification on relational graphs Datasets: AIFB, MUTAG, BGS, AM Description: RDF knowledge graphs with typed relations
from torch_geometric.datasets import Entities
dataset = Entities(root='/tmp/AIFB', name='AIFB')
WordNet18
Usage: Link prediction on semantic networks Description: Subset of WordNet with 18 relations
- 40,943 entities, 151,442 triplets
from torch_geometric.datasets import WordNet18
dataset = WordNet18(root='/tmp/WordNet18')
WordNet18RR
Usage: Link prediction (no inverse relations) Description: Refined version without inverse relations
from torch_geometric.datasets import WordNet18RR
dataset = WordNet18RR(root='/tmp/WordNet18RR')
FB15k-237
Usage: Link prediction on Freebase Description: Subset of Freebase with 237 relations
- 14,541 entities, 310,116 triplets
from torch_geometric.datasets import FB15k_237
dataset = FB15k_237(root='/tmp/FB15k')
Heterogeneous Graphs
OGB_MAG
Usage: Heterogeneous graph learning, node classification Description: Microsoft Academic Graph with multiple node/edge types
- Node types: paper, author, institution, field of study
- 1M+ nodes, 21M+ edges
from torch_geometric.datasets import OGB_MAG
dataset = OGB_MAG(root='/tmp/OGB_MAG')
MovieLens
Usage: Recommendation systems, link prediction Versions: 100K, 1M, 10M, 20M Description: User-movie rating networks
- Node types: user, movie
- Edge types: rates
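from torch_geometric.datasets import MovieLens, MovieLens100K
dataset = MovieLens(root='/tmp/MovieLens')  # model_name selects the sentence-transformers model used to encode movie titles
dataset_100k = MovieLens100K(root='/tmp/MovieLens100K')  # the 100K version has its own class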
IMDB
Usage: Heterogeneous graph learning Description: IMDB movie network
- Node types: movie, actor, director
from torch_geometric.datasets import IMDB
dataset = IMDB(root='/tmp/IMDB')
DBLP
Usage: Heterogeneous graph learning, node classification Description: DBLP bibliography network
- Node types: author, paper, term, conference
from torch_geometric.datasets import DBLP
dataset = DBLP(root='/tmp/DBLP')
LastFM
Usage: Heterogeneous recommendation Description: LastFM music network
- Node types: user, artist, tag
from torch_geometric.datasets import LastFM
dataset = LastFM(root='/tmp/LastFM')
Temporal Graphs
BitcoinOTC
Usage: Temporal link prediction, trust networks Description: Bitcoin OTC trust network over time
from torch_geometric.datasets import BitcoinOTC
dataset = BitcoinOTC(root='/tmp/BitcoinOTC')
ICEWS18
Usage: Temporal knowledge graph completion Description: Integrated Crisis Early Warning System events
from torch_geometric.datasets import ICEWS18
dataset = ICEWS18(root='/tmp/ICEWS18')
GDELT
Usage: Temporal event forecasting Description: Global Database of Events, Language, and Tone
from torch_geometric.datasets import GDELT
dataset = GDELT(root='/tmp/GDELT')
JODIEDataset
Usage: Dynamic graph learning Datasets: Reddit, Wikipedia, MOOC, LastFM Description: Temporal interaction networks
from torch_geometric.datasets import JODIEDataset
dataset = JODIEDataset(root='/tmp/JODIE', name='Reddit')
3D Meshes and Point Clouds
ShapeNet
Usage: 3D shape classification and segmentation Description: Large-scale 3D CAD model dataset
- 16,881 models across 16 categories
- Part-level segmentation labels
from torch_geometric.datasets import ShapeNet
dataset = ShapeNet(root='/tmp/ShapeNet', categories=['Airplane'])
ModelNet
Usage: 3D shape classification Versions: ModelNet10, ModelNet40 Description: CAD models for 3D object classification
- ModelNet10: 4,899 models, 10 categories
- ModelNet40: 12,311 models, 40 categories
from torch_geometric.datasets import ModelNet
dataset = ModelNet(root='/tmp/ModelNet', name='10')
FAUST
Usage: 3D shape matching, correspondence Description: Human body scans for shape analysis
- 100 meshes of 10 people in 10 poses
from torch_geometric.datasets import FAUST
dataset = FAUST(root='/tmp/FAUST')
CoMA
Usage: 3D mesh deformation Description: Facial expression meshes
- 20,466 3D face scans with expressions
from torch_geometric.datasets import CoMA
dataset = CoMA(root='/tmp/CoMA')
S3DIS
Usage: 3D semantic segmentation Description: Stanford Large-Scale 3D Indoor Spaces
- 6 areas, 271 rooms, point cloud data
from torch_geometric.datasets import S3DIS
dataset = S3DIS(root='/tmp/S3DIS', test_area=6)
Image and Vision Datasets
MNISTSuperpixels
Usage: Graph-based image classification Description: MNIST images as superpixel graphs
- 70,000 graphs (60k train, 10k test)
from torch_geometric.datasets import MNISTSuperpixels
dataset = MNISTSuperpixels(root='/tmp/MNIST')
Flickr
Usage: Image description, node classification Description: Flickr image network
- 89,250 nodes, 899,756 edges
from torch_geometric.datasets import Flickr
dataset = Flickr(root='/tmp/Flickr')
PPI
Usage: Protein-protein interaction prediction Description: Multi-graph protein interaction networks
- 24 graphs (20 train, 2 validation, 2 test), about 2,373 nodes per graph on average
from torch_geometric.datasets import PPI
dataset = PPI(root='/tmp/PPI', split='train')
Small Classic Graphs
KarateClub
Usage: Community detection, visualization Description: Zachary's karate club network
- 34 nodes, 78 undirected edges; the PyG version labels 4 communities found by modularity-based clustering
from torch_geometric.datasets import KarateClub
dataset = KarateClub()
Open Graph Benchmark (OGB)
OGB datasets are loaded through the separate ogb package, which returns PyG-compatible dataset objects:
Node Property Prediction
- ogbn-products: Amazon product network (2.4M nodes)
- ogbn-proteins: Protein association network (132K nodes)
- ogbn-arxiv: Citation network (169K nodes)
- ogbn-papers100M: Large citation network (111M nodes)
- ogbn-mag: Heterogeneous academic graph
Link Property Prediction
- ogbl-ppa: Protein association networks
- ogbl-collab: Collaboration networks
- ogbl-ddi: Drug-drug interaction network
- ogbl-citation2: Citation network
- ogbl-wikikg2: Wikidata knowledge graph
Graph Property Prediction
- ogbg-molhiv: Molecular HIV activity prediction
- ogbg-molpcba: Molecular bioassays (multi-task)
- ogbg-ppa: Protein association neighborhoods (taxonomic group classification)
- ogbg-code2: Code abstract syntax trees
from torch_geometric.datasets import OGB_MAG
# or, for the ogbn-*, ogbl-*, and ogbg-* datasets, via the separate ogb package
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
Synthetic Datasets
FakeDataset
Usage: Testing, debugging Description: Generates random graph data
from torch_geometric.datasets import FakeDataset
dataset = FakeDataset(num_graphs=100, avg_num_nodes=50)
StochasticBlockModelDataset
Usage: Community detection benchmarks Description: Graphs generated from stochastic block models
from torch_geometric.datasets import StochasticBlockModelDataset
dataset = StochasticBlockModelDataset(root='/tmp/SBM', block_sizes=[50, 50],
                                      edge_probs=[[0.5, 0.05], [0.05, 0.5]], num_graphs=1000)
ExplainerDataset
Usage: Testing explainability methods Description: Synthetic graphs with known explanation ground truth
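from torch_geometric.datasets import ExplainerDataset
from torch_geometric.datasets.graph_generator import BAGraph
dataset = ExplainerDataset(graph_generator=BAGraph(num_nodes=300, num_edges=5),
                           motif_generator='house', num_motifs=80, num_graphs=1000)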
Materials Science
QM8
Usage: Molecular property prediction Description: Electronic properties of small molecules
# torch_geometric.datasets does not appear to export a QM8 class; verify against your installed version, or load QM8 through a custom dataset
Biological Networks
PPI (Protein-Protein Interaction)
Already listed above under Image and Vision Datasets
STRING
Usage: Protein interaction networks Description: Known and predicted protein-protein interactions
# Available through external sources or custom loading
Usage Tips
- Start with small datasets: Use Cora, KarateClub, or ENZYMES for prototyping
- Citation networks: Planetoid datasets are the standard small benchmarks for semi-supervised node classification
- Graph classification: TUDataset provides diverse benchmarks
- Molecular: QM9, ZINC, MoleculeNet for chemistry applications
- Large-scale: Use Reddit, OGB datasets with NeighborLoader
- Heterogeneous: OGB_MAG, MovieLens, IMDB for multi-type graphs
- Temporal: JODIE, ICEWS for dynamic graph learning
- 3D: ShapeNet, ModelNet, S3DIS for geometric learning
Common Patterns
Loading with Transforms
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures
dataset = Planetoid(root='/tmp/Cora', name='Cora',
                    transform=NormalizeFeatures())
Train/Val/Test Splits
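# For node-level datasets with pre-defined splits, the boolean masks index node tensors
data = dataset[0]
train_x, train_y = data.x[data.train_mask], data.y[data.train_mask]
val_x, val_y = data.x[data.val_mask], data.y[data.val_mask]
test_x, test_y = data.x[data.test_mask], data.y[data.test_mask]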
# For graph classification
from torch_geometric.loader import DataLoader
dataset = dataset.shuffle()  # randomize graph order before splitting
train_dataset = dataset[:int(len(dataset) * 0.8)]
test_dataset = dataset[int(len(dataset) * 0.8):]
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
Custom Data Loading
from torch_geometric.data import Data, Dataset

class MyCustomDataset(Dataset):
    def __init__(self, root, transform=None):
        super().__init__(root, transform)
        # Your initialization: populate this list with Data objects
        self.data_list = []

    def len(self):
        return len(self.data_list)

    def get(self, idx):
        # Load and return data object
        return self.data_list[idx]