PyTorch Geometric Datasets Reference

This document catalogs commonly used datasets available in torch_geometric.datasets, grouped by domain.

Citation Networks

Planetoid

Usage: Node classification, semi-supervised learning
Networks: Cora, CiteSeer, PubMed
Description: Citation networks where nodes are papers and edges are citations

  • Cora: 2,708 nodes, 5,429 edges, 7 classes, 1,433 features
  • CiteSeer: 3,327 nodes, 4,732 edges, 6 classes, 3,703 features
  • PubMed: 19,717 nodes, 44,338 edges, 3 classes, 500 features
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora')
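
Edge counts depend on the counting convention. PyG stores an undirected graph as a directed `edge_index` that contains both directions of every edge, so `data.num_edges` is roughly twice an undirected count (for example, PyG reports 10,556 edges for Cora). The plain-Python sketch below illustrates the convention on a toy graph; the helper `to_directed_pairs` is hypothetical and not part of PyG.

```python
def to_directed_pairs(undirected_edges):
    """Expand undirected edges into both directions, dropping duplicates."""
    directed = set()
    for u, v in undirected_edges:
        directed.add((u, v))
        directed.add((v, u))
    return sorted(directed)

# A toy triangle graph with 3 undirected edges
triangle = [(0, 1), (1, 2), (0, 2)]
pairs = to_directed_pairs(triangle)
print(len(triangle), len(pairs))  # 3 undirected edges become 6 directed pairs
```

Duplicate or reversed entries in the input collapse into a single pair, which is one reason a directed count is not always exactly double the published undirected figure.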

Coauthor

Usage: Node classification on collaboration networks
Networks: CS, Physics
Description: Co-authorship networks from Microsoft Academic Graph

  • CS: 18,333 nodes, 81,894 edges, 15 classes (computer science)
  • Physics: 34,493 nodes, 247,962 edges, 5 classes (physics)
from torch_geometric.datasets import Coauthor
dataset = Coauthor(root='/tmp/CS', name='CS')

Amazon

Usage: Node classification on product networks
Networks: Computers, Photo
Description: Amazon co-purchase networks where nodes are products

  • Computers: 13,752 nodes, 245,861 edges, 10 classes
  • Photo: 7,650 nodes, 119,081 edges, 8 classes
from torch_geometric.datasets import Amazon
dataset = Amazon(root='/tmp/Computers', name='Computers')

CitationFull

Usage: Citation network analysis
Networks: Cora, Cora_ML, CiteSeer, DBLP, PubMed
Description: Full citation networks without sampling

from torch_geometric.datasets import CitationFull
dataset = CitationFull(root='/tmp/Cora', name='Cora')

Graph Classification

TUDataset

Usage: Graph classification, graph kernel benchmarks
Description: Collection of 120+ graph classification datasets

  • MUTAG: 188 graphs, 2 classes (molecular compounds)
  • PROTEINS: 1,113 graphs, 2 classes (protein structures)
  • ENZYMES: 600 graphs, 6 classes (protein enzymes)
  • IMDB-BINARY: 1,000 graphs, 2 classes (social networks)
  • REDDIT-BINARY: 2,000 graphs, 2 classes (discussion threads)
  • COLLAB: 5,000 graphs, 3 classes (scientific collaborations)
  • NCI1: 4,110 graphs, 2 classes (chemical compounds)
  • DD: 1,178 graphs, 2 classes (protein structures)
from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

MoleculeNet

Usage: Molecular property prediction
Datasets: Over 10 molecular benchmark datasets
Description: Comprehensive molecular machine learning benchmarks

  • ESOL: Aqueous solubility (regression)
  • FreeSolv: Hydration free energy (regression)
  • Lipophilicity: Octanol/water distribution (regression)
  • BACE: Binding results (classification)
  • BBBP: Blood-brain barrier penetration (classification)
  • HIV: HIV inhibition (classification)
  • Tox21: Toxicity prediction (multi-task classification)
  • ToxCast: Toxicology forecasting (multi-task classification)
  • SIDER: Side effects (multi-task classification)
  • ClinTox: Clinical trial toxicity (multi-task classification)
from torch_geometric.datasets import MoleculeNet
dataset = MoleculeNet(root='/tmp/ESOL', name='ESOL')

Molecular and Chemical Datasets

QM7b

Usage: Molecular property prediction (quantum mechanics)
Description: 7,211 molecules with up to 7 heavy atoms

  • Properties: Atomization energies, electronic properties
from torch_geometric.datasets import QM7b
dataset = QM7b(root='/tmp/QM7b')

QM9

Usage: Molecular property prediction (quantum mechanics)
Description: ~130,000 molecules with up to 9 heavy atoms (C, O, N, F)

  • Properties: 19 quantum chemical properties including HOMO, LUMO, gap, energy
from torch_geometric.datasets import QM9
dataset = QM9(root='/tmp/QM9')

ZINC

Usage: Molecular generation, property prediction
Description: ~250,000 drug-like molecular graphs

  • Properties: Constrained solubility, molecular weight
from torch_geometric.datasets import ZINC
dataset = ZINC(root='/tmp/ZINC', subset=True)

AQSOL

Usage: Aqueous solubility prediction
Description: ~10,000 molecules with solubility measurements

from torch_geometric.datasets import AQSOL
dataset = AQSOL(root='/tmp/AQSOL')

MD17

Usage: Molecular dynamics, force field learning
Description: Molecular dynamics trajectories for small molecules

  • Molecules: Benzene, Uracil, Naphthalene, Aspirin, Salicylic acid, etc.
from torch_geometric.datasets import MD17
dataset = MD17(root='/tmp/MD17', name='benzene')

PCQM4Mv2

Usage: Large-scale molecular property prediction
Description: About 3.7M molecules from PubChem for quantum chemistry

  • Part of OGB Large-Scale Challenge
from torch_geometric.datasets import PCQM4Mv2
dataset = PCQM4Mv2(root='/tmp/PCQM4Mv2')

Social Networks

Reddit

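Usage: Large-scale node classification
Description: Reddit posts from September 2014, labeled by community (subreddit)

  • 232,965 nodes, 11,606,919 edges, 41 classes
  • Features: 602-dimensional vectors, mostly averaged GloVe word embeddings of post titles and comments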
from torch_geometric.datasets import Reddit
dataset = Reddit(root='/tmp/Reddit')

Reddit2

Usage: Large-scale node classification
Description: Sparser variant of the Reddit graph, as used in the GraphSAINT paper

from torch_geometric.datasets import Reddit2
dataset = Reddit2(root='/tmp/Reddit2')

Twitch

Usage: Node classification, social network analysis
Networks: DE, EN, ES, FR, PT, RU
Description: Twitch user networks by language

from torch_geometric.datasets import Twitch
dataset = Twitch(root='/tmp/Twitch', name='DE')

Facebook

Usage: Social network analysis, node classification
Description: Facebook page-page networks

from torch_geometric.datasets import FacebookPagePage
dataset = FacebookPagePage(root='/tmp/Facebook')

GitHub

Usage: Social network analysis
Description: GitHub developer networks

from torch_geometric.datasets import GitHub
dataset = GitHub(root='/tmp/GitHub')

Knowledge Graphs

Entities

Usage: Entity (node) classification on relational graphs
Datasets: AIFB, MUTAG, BGS, AM
Description: RDF knowledge graphs with typed relations

from torch_geometric.datasets import Entities
dataset = Entities(root='/tmp/AIFB', name='AIFB')

WordNet18

Usage: Link prediction on semantic networks
Description: Subset of WordNet with 18 relations

  • 40,943 entities, 151,442 triplets
from torch_geometric.datasets import WordNet18
dataset = WordNet18(root='/tmp/WordNet18')

WordNet18RR

Usage: Link prediction (no inverse relations)
Description: Refined version without inverse relations

from torch_geometric.datasets import WordNet18RR
dataset = WordNet18RR(root='/tmp/WordNet18RR')

FB15k-237

Usage: Link prediction on Freebase
Description: Subset of Freebase with 237 relations

  • 14,541 entities, 310,116 triplets
from torch_geometric.datasets import FB15k_237
dataset = FB15k_237(root='/tmp/FB15k')

Heterogeneous Graphs

OGB_MAG

Usage: Heterogeneous graph learning, node classification
Description: Microsoft Academic Graph with multiple node/edge types

  • Node types: paper, author, institution, field of study
  • About 1.9M nodes and 21M edges
from torch_geometric.datasets import OGB_MAG
dataset = OGB_MAG(root='/tmp/OGB_MAG')

MovieLens

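Usage: Recommendation systems, link prediction
Versions: MovieLens (ml-latest-small), plus the separate classes MovieLens100K and MovieLens1M
Description: User-movie rating networks

  • Node types: user, movie
  • Edge types: rates
from torch_geometric.datasets import MovieLens
# model_name picks the sentence-transformer used to embed movie titles;
# it does not select the dataset size, so the default is kept here
dataset = MovieLens(root='/tmp/MovieLens')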
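
Heterogeneous datasets organize edges by typed relation, where each edge type is identified by a (source type, relation, destination type) triple such as ('user', 'rates', 'movie'). The sketch below mimics that layout with plain dictionaries and toy values; `add_reverse_edges` is a hypothetical helper, and the `rev_` prefix follows the naming PyG's `ToUndirected` transform uses for reverse edge types.

```python
# Toy heterogeneous graph: node counts per type and edges per typed relation.
num_nodes = {"user": 3, "movie": 2}
edges = {
    ("user", "rates", "movie"): [(0, 0), (1, 0), (2, 1)],
}

def add_reverse_edges(edges):
    """Add a reversed copy of every edge type so information can flow both ways."""
    result = dict(edges)
    for (src, rel, dst), pairs in edges.items():
        result[(dst, f"rev_{rel}", src)] = [(v, u) for u, v in pairs]
    return result

hetero_edges = add_reverse_edges(edges)
for edge_type, pairs in hetero_edges.items():
    print(edge_type, pairs)
```

Without the reverse relation, a message-passing model would update movie nodes from users but never the other way around.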

IMDB

Usage: Heterogeneous graph learning
Description: IMDB movie network

  • Node types: movie, actor, director
from torch_geometric.datasets import IMDB
dataset = IMDB(root='/tmp/IMDB')

DBLP

Usage: Heterogeneous graph learning, node classification
Description: DBLP bibliography network

  • Node types: author, paper, term, conference
from torch_geometric.datasets import DBLP
dataset = DBLP(root='/tmp/DBLP')

LastFM

Usage: Heterogeneous recommendation
Description: LastFM music network

  • Node types: user, artist, tag
from torch_geometric.datasets import LastFM
dataset = LastFM(root='/tmp/LastFM')

Temporal Graphs

BitcoinOTC

Usage: Temporal link prediction, trust networks
Description: Bitcoin OTC trust network over time

from torch_geometric.datasets import BitcoinOTC
dataset = BitcoinOTC(root='/tmp/BitcoinOTC')

ICEWS18

Usage: Temporal knowledge graph completion
Description: Integrated Crisis Early Warning System events

from torch_geometric.datasets import ICEWS18
dataset = ICEWS18(root='/tmp/ICEWS18')

GDELT

Usage: Temporal event forecasting
Description: Global Database of Events, Language, and Tone

from torch_geometric.datasets import GDELT
dataset = GDELT(root='/tmp/GDELT')

JODIEDataset

Usage: Dynamic graph learning
Datasets: Reddit, Wikipedia, MOOC, LastFM
Description: Temporal interaction networks

from torch_geometric.datasets import JODIEDataset
dataset = JODIEDataset(root='/tmp/JODIE', name='Reddit')
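
Temporal interaction networks are sequences of timestamped events, and splits are usually made along the time axis so that a model is never evaluated on events older than its training data. A minimal plain-Python sketch of that idea, assuming events are (source, destination, timestamp) tuples; the function name, the toy events, and the thresholds are all illustrative.

```python
def split_by_time(events, val_start, test_start):
    """Split (src, dst, timestamp) events into train/val/test by timestamp."""
    train = [e for e in events if e[2] < val_start]
    val = [e for e in events if val_start <= e[2] < test_start]
    test = [e for e in events if e[2] >= test_start]
    return train, val, test

events = [(0, 5, 1.0), (1, 5, 2.5), (0, 6, 4.0), (2, 5, 7.0), (1, 6, 9.0)]
train, val, test = split_by_time(events, val_start=4.0, test_start=8.0)
print(len(train), len(val), len(test))
```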

3D Meshes and Point Clouds

ShapeNet

Usage: 3D shape classification and segmentation
Description: Large-scale 3D CAD model dataset

  • 16,881 models across 16 categories
  • Part-level segmentation labels
from torch_geometric.datasets import ShapeNet
dataset = ShapeNet(root='/tmp/ShapeNet', categories=['Airplane'])

ModelNet

Usage: 3D shape classification
Versions: ModelNet10, ModelNet40
Description: CAD models for 3D object classification

  • ModelNet10: 4,899 models, 10 categories
  • ModelNet40: 12,311 models, 40 categories
from torch_geometric.datasets import ModelNet
dataset = ModelNet(root='/tmp/ModelNet', name='10')

FAUST

Usage: 3D shape matching, correspondence
Description: Human body scans for shape analysis

  • 100 meshes of 10 people in 10 poses
from torch_geometric.datasets import FAUST
dataset = FAUST(root='/tmp/FAUST')

CoMA

Usage: 3D mesh deformation
Description: Facial expression meshes

  • 20,466 3D face scans with expressions
from torch_geometric.datasets import CoMA
dataset = CoMA(root='/tmp/CoMA')

S3DIS

Usage: 3D semantic segmentation
Description: Stanford Large-Scale 3D Indoor Spaces

  • 6 areas, 271 rooms, point cloud data
from torch_geometric.datasets import S3DIS
dataset = S3DIS(root='/tmp/S3DIS', test_area=6)

Image and Vision Datasets

MNISTSuperpixels

Usage: Graph-based image classification
Description: MNIST images as superpixel graphs

  • 70,000 graphs (60k train, 10k test)
from torch_geometric.datasets import MNISTSuperpixels
dataset = MNISTSuperpixels(root='/tmp/MNIST')

Flickr

Usage: Image description, node classification
Description: Flickr image network

  • 89,250 nodes, 899,756 edges
from torch_geometric.datasets import Flickr
dataset = Flickr(root='/tmp/Flickr')

PPI

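Usage: Multi-label node classification (protein function prediction)
Description: Protein-protein interaction networks, one graph per human tissue

  • 24 graphs (20 train, 2 validation, 2 test), about 2,373 nodes per graph on average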
from torch_geometric.datasets import PPI
dataset = PPI(root='/tmp/PPI', split='train')

Small Classic Graphs

KarateClub

Usage: Community detection, visualization
Description: Zachary's karate club network

  • 34 nodes, 78 edges, 4 community labels in the PyG version
from torch_geometric.datasets import KarateClub
dataset = KarateClub()

Open Graph Benchmark (OGB)

OGB datasets can be loaded as PyG datasets through the ogb package:

Node Property Prediction

  • ogbn-products: Amazon product network (2.4M nodes)
  • ogbn-proteins: Protein association network (132K nodes)
  • ogbn-arxiv: Citation network (169K nodes)
  • ogbn-papers100M: Large citation network (111M nodes)
  • ogbn-mag: Heterogeneous academic graph

Link Property Prediction

  • ogbl-ppa: Protein association networks
  • ogbl-collab: Collaboration networks
  • ogbl-ddi: Drug-drug interaction network
  • ogbl-citation2: Citation network
  • ogbl-wikikg2: Wikidata knowledge graph

Graph Property Prediction

  • ogbg-molhiv: Molecular HIV activity prediction
  • ogbg-molpcba: Molecular bioassays (multi-task)
  • ogbg-ppa: Taxonomic group prediction from protein association subgraphs
  • ogbg-code2: Code abstract syntax trees
from torch_geometric.datasets import OGB_MAG  # ogbn-mag as a heterogeneous PyG dataset
# Other OGB datasets are loaded through the separate ogb package
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
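
OGB dataset names encode the task in their prefix: `ogbn-` for node, `ogbl-` for link, and `ogbg-` for graph property prediction, and the `ogb` package ships one PyG loader class per task. The sketch below only maps names to loader paths as strings, so it runs without `ogb` installed; `describe_ogb_name` is a hypothetical helper.

```python
# Mapping from OGB name prefix to task family and the PyG loader class
# provided by the ogb package (stored as strings, nothing is imported).
OGB_LOADERS = {
    "ogbn": ("node property prediction", "ogb.nodeproppred.PygNodePropPredDataset"),
    "ogbl": ("link property prediction", "ogb.linkproppred.PygLinkPropPredDataset"),
    "ogbg": ("graph property prediction", "ogb.graphproppred.PygGraphPropPredDataset"),
}

def describe_ogb_name(name):
    """Return (task family, loader class path) for an OGB dataset name."""
    prefix = name.split("-", 1)[0]
    if prefix not in OGB_LOADERS:
        raise ValueError(f"Unrecognized OGB dataset name: {name!r}")
    return OGB_LOADERS[prefix]

for name in ["ogbn-arxiv", "ogbl-collab", "ogbg-molhiv"]:
    task, loader = describe_ogb_name(name)
    print(f"{name}: {task} via {loader}")
```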

Synthetic Datasets

FakeDataset

Usage: Testing, debugging
Description: Generates random graph data

from torch_geometric.datasets import FakeDataset
dataset = FakeDataset(num_graphs=100, avg_num_nodes=50)

StochasticBlockModelDataset

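Usage: Community detection benchmarks
Description: Graphs generated from stochastic block models

from torch_geometric.datasets import StochasticBlockModelDataset
# block_sizes and edge_probs are required; the values here are illustrative
dataset = StochasticBlockModelDataset(root='/tmp/SBM', block_sizes=[50, 50],
    edge_probs=[[0.5, 0.05], [0.05, 0.5]], num_graphs=1000)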
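
In a stochastic block model, every node belongs to a block and an edge between two nodes is drawn independently with a probability that depends only on their blocks. The plain-Python sketch below samples such a graph; `sample_sbm_edges` is a hypothetical helper written for illustration and is not how PyG generates its data.

```python
import random

def sample_sbm_edges(block_sizes, edge_probs, seed=0):
    """Sample undirected edges of a stochastic block model.

    block_sizes[i] is the number of nodes in block i, and edge_probs[i][j]
    is the probability of an edge between a node in block i and one in block j.
    """
    rng = random.Random(seed)
    # Assign each node its block label
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for u in range(len(labels)):
        for v in range(u + 1, len(labels)):
            if rng.random() < edge_probs[labels[u]][labels[v]]:
                edges.append((u, v))
    return labels, edges

# Extreme probabilities make the result deterministic: two disconnected cliques
labels, edges = sample_sbm_edges([3, 2], [[1.0, 0.0], [0.0, 1.0]])
print(labels)
print(edges)
```

With less extreme values, such as 0.5 within blocks and 0.05 across them, the blocks become noisy communities that a detection method has to recover.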

ExplainerDataset

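Usage: Testing explainability methods
Description: Synthetic graphs with known explanation ground truth

from torch_geometric.datasets import ExplainerDataset
from torch_geometric.datasets.graph_generator import BAGraph
dataset = ExplainerDataset(graph_generator=BAGraph(num_nodes=300, num_edges=5),
                           motif_generator='house', num_motifs=80, num_graphs=1000)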

Materials Science

QM8

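Usage: Molecular property prediction
Description: Electronic properties of small molecules

# Check availability first: QM8 is not among the datasets documented for
# torch_geometric.datasets, so this import may fail depending on the version
from torch_geometric.datasets import QM8
dataset = QM8(root='/tmp/QM8')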

Biological Networks

PPI (Protein-Protein Interaction)

Already listed above under Image and Vision Datasets

STRING

Usage: Protein interaction networks
Description: Known and predicted protein-protein interactions

# Available through external sources or custom loading

Usage Tips

  1. Start with small datasets: Use Cora, KarateClub, or ENZYMES for prototyping
  2. Citation networks: Planetoid datasets are standard benchmarks for semi-supervised node classification
  3. Graph classification: TUDataset provides diverse benchmarks
  4. Molecular: QM9, ZINC, MoleculeNet for chemistry applications
  5. Large-scale: Use Reddit, OGB datasets with NeighborLoader
  6. Heterogeneous: OGB_MAG, MovieLens, IMDB for multi-type graphs
  7. Temporal: JODIEDataset, ICEWS18 for dynamic graph learning
  8. 3D: ShapeNet, ModelNet, S3DIS for geometric learning

Common Patterns

Loading with Transforms

from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='/tmp/Cora', name='Cora',
                    transform=NormalizeFeatures())
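
`NormalizeFeatures` rescales each node's feature row so that it sums to one. The plain-Python sketch below shows only that core idea on nested lists, assuming non-negative features; the helper name `row_normalize` is hypothetical, and the actual PyG transform operates on tensors and may differ in details such as how it treats rows with small sums.

```python
def row_normalize(features):
    """Divide each row by its sum; rows that sum to zero are left unchanged."""
    normalized = []
    for row in features:
        total = sum(row)
        if total == 0:
            normalized.append(list(row))
        else:
            normalized.append([value / total for value in row])
    return normalized

x = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
print(row_normalize(x))  # [[0.25, 0.75], [0.0, 0.0], [0.5, 0.5]]
```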

Train/Val/Test Splits

# For datasets with pre-defined splits: the boolean masks index node-level
# tensors; a Data object cannot be indexed with a mask directly
data = dataset[0]
train_y = data.y[data.train_mask]
val_y = data.y[data.val_mask]
test_y = data.y[data.test_mask]

# For graph classification
from torch_geometric.loader import DataLoader
dataset = dataset.shuffle()  # avoid splits that follow the stored order
train_dataset = dataset[:int(len(dataset) * 0.8)]
test_dataset = dataset[int(len(dataset) * 0.8):]
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
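
Slicing a dataset in its stored order can give skewed splits when graphs are grouped by class, so indices are often shuffled with a fixed seed before splitting. A sketch in plain Python, assuming an 80/10/10 split and seed 42 (both illustrative choices); `split_indices` is a hypothetical helper, and the resulting index lists could then be used to index a PyG dataset.

```python
import random

def split_indices(num_items, train_frac=0.8, val_frac=0.1, seed=42):
    """Return shuffled, non-overlapping train/val/test index lists."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    n_train = int(num_items * train_frac)
    n_val = int(num_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(600)  # e.g. ENZYMES has 600 graphs
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible across runs, which matters when results are compared between experiments.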

Custom Data Loading

from torch_geometric.data import Data, Dataset

class MyCustomDataset(Dataset):
    def __init__(self, root, transform=None):
        super().__init__(root, transform)
        # Populate with Data objects, e.g. loaded from files under root
        self.data_list = []

    def len(self):
        # Number of graphs in the dataset
        return len(self.data_list)

    def get(self, idx):
        # Return the Data object stored at position idx
        return self.data_list[idx]