Model Loading and Management
Overview
The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.
Loading Models
AutoModel Classes
Use AutoModel classes for automatic architecture selection:
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")
# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
Common AutoModel Classes
NLP Tasks:
- AutoModelForSequenceClassification: Text classification, sentiment analysis
- AutoModelForTokenClassification: NER, POS tagging
- AutoModelForQuestionAnswering: Extractive QA
- AutoModelForCausalLM: Text generation (GPT, Llama)
- AutoModelForMaskedLM: Masked language modeling (BERT)
- AutoModelForSeq2SeqLM: Translation, summarization (T5, BART)
Vision Tasks:
- AutoModelForImageClassification: Image classification
- AutoModelForObjectDetection: Object detection
- AutoModelForImageSegmentation: Image segmentation
Audio Tasks:
- AutoModelForAudioClassification: Audio classification
- AutoModelForSpeechSeq2Seq: Speech recognition
Multimodal:
- AutoModelForVision2Seq: Image captioning, VQA
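The task-specific classes listed above all load the same way; the checkpoints and label count below are common public examples used purely for illustration:
from transformers import AutoModelForTokenClassification, AutoModelForQuestionAnswering
# Token classification head (e.g., NER) on top of a BERT encoder; 9 labels as in CoNLL-style NER
ner_model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
# Extractive question answering head on a DistilBERT encoder
qa_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")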
Loading Parameters
Basic Parameters
pretrained_model_name_or_path: Model identifier or local path
model = AutoModel.from_pretrained("bert-base-uncased") # From Hub
model = AutoModel.from_pretrained("./local/model/path") # From disk
num_labels: Number of output labels for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)
cache_dir: Custom cache location
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
Device Management
device_map: Automatic device allocation for large models
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto"
)
# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    device_map="sequential"
)
# Custom device map
device_map = {
    "transformer.layers.0": 0,      # GPU 0
    "transformer.layers.1": 1,      # GPU 1
    "transformer.layers.2": "cpu",  # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
Manual device placement:
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0") # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
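To confirm where the weights actually ended up, the loaded model can be inspected directly (a quick check against the model above; hf_device_map is only populated when the model was loaded with device_map):
print(model.device)                           # device of the first parameter
print(next(model.parameters()).device)        # same check, parameter by parameter
print(getattr(model, "hf_device_map", None))  # per-module placement when device_map was used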
Precision Control
torch_dtype: Set model precision
import torch
# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)
# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
Attention Implementation
attn_implementation: Choose attention mechanism
# Scaled Dot Product Attention (PyTorch 2.0+; used by default when available)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")
# Flash Attention 2 (requires the flash-attn package; typically fastest on supported GPUs)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")
# Eager (plain PyTorch implementation, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
Memory Optimization
low_cpu_mem_usage: Reduce CPU memory during loading
model = AutoModelForCausalLM.from_pretrained(
    "large-model-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
load_in_8bit: 8-bit quantization (requires bitsandbytes; newer releases prefer passing BitsAndBytesConfig(load_in_8bit=True) via quantization_config)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    load_in_8bit=True,
    device_map="auto"
)
load_in_4bit: 4-bit quantization
import torch
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "model-id",
    quantization_config=quantization_config,
    device_map="auto"
)
Model Configuration
Loading with Custom Config
from transformers import AutoConfig, AutoModel
# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
Initializing from Config Only
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config) # Random weights
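Initializing from a config is also a convenient way to build a scaled-down, randomly initialized model for quick tests (a sketch; n_layer and n_head are GPT-2 config attributes):
config = AutoConfig.from_pretrained("gpt2")
config.n_layer = 2   # fewer transformer blocks
config.n_head = 4    # fewer attention heads
small_model = AutoModelForCausalLM.from_config(config)  # random weights, small footprint
print(small_model.num_parameters())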
Model Modes
Training vs Evaluation Mode
Models load in evaluation mode by default:
model = AutoModel.from_pretrained("model-id")
print(model.training) # False
# Switch to training mode
model.train()
# Switch back to evaluation mode
model.eval()
Evaluation mode disables dropout and makes batch normalization layers use their stored running statistics instead of per-batch statistics.
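A typical pattern is model.train() around the optimization loop and model.eval() plus torch.no_grad() for inference (a minimal, self-contained sketch):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
inputs = tokenizer("Sample text", return_tensors="pt")
model.train()            # dropout active: use inside the training loop
# ... optimizer steps would go here ...
model.eval()             # dropout off, running statistics used
with torch.no_grad():    # also skip gradient tracking during inference
    outputs = model(**inputs)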
Saving Models
Save Locally
model.save_pretrained("./my_model")
This creates:
- config.json: Model configuration
- pytorch_model.bin or model.safetensors: Model weights
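In practice the tokenizer is usually saved alongside the model so the directory can be reloaded as a unit (a sketch; the local path is arbitrary):
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Save both into the same directory
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")
# Reload later from disk
model = AutoModel.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")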
Save to Hugging Face Hub
model.push_to_hub("username/model-name")
# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")
# Private repository
model.push_to_hub("username/model-name", private=True)
Model Inspection
Parameter Count
# Total parameters
total_params = model.num_parameters()
# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)
print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
Memory Footprint
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
Model Architecture
print(model) # Print full architecture
# Access specific components
print(model.config)
print(model.base_model)
Forward Pass
Basic inference:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
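For classification checkpoints that ship label names in their config, the predicted index can be mapped back to a string via id2label (assuming the checkpoint defines it; the sentiment model below is one public example):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example sentiment checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
pred_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # e.g. "POSITIVE" or "NEGATIVE"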
Model Formats
SafeTensors vs PyTorch
SafeTensors loads faster and, unlike pickle-based .bin checkpoints, cannot execute arbitrary code when deserialized; recent transformers releases save in this format by default:
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)
# Load either format automatically
model = AutoModel.from_pretrained("./model")
ONNX Export
Export for optimized inference. The example below uses the legacy transformers.onnx API; newer releases recommend the optimum package instead (for example: optimum-cli export onnx --model <model-id> <output-dir>):
from pathlib import Path
from transformers.onnx import FeaturesManager, export
# Look up the ONNX export config for this architecture
# ("default" exports the base model; use e.g. "sequence-classification" for a classification head)
model_kind, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(model, feature="default")
onnx_config = onnx_config_cls(model.config)
# Trace the model and write the ONNX graph
onnx_inputs, onnx_outputs = export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
)
Best Practices
- Use AutoModel classes: Automatic architecture detection
- Specify dtype explicitly: Control precision and memory
- Use device_map="auto": For large models
- Enable low_cpu_mem_usage: When loading large models
- Use safetensors format: Faster and safer serialization
- Check model.training: Ensure correct mode for task
- Consider quantization: For deployment on resource-constrained devices
- Cache models locally: Set the HF_HOME environment variable (the older TRANSFORMERS_CACHE is deprecated) or pass cache_dir, as sketched below
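A minimal sketch of redirecting the cache, either globally through the environment or per call through cache_dir (the paths are placeholders):
import os
# Set before importing transformers (or export it in the shell)
os.environ["HF_HOME"] = "/data/hf_cache"
from transformers import AutoModel
# Or override the cache location for a single call
model = AutoModel.from_pretrained("bert-base-uncased", cache_dir="/data/hf_cache/models")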
Common Issues
CUDA out of memory:
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)
# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
Slow loading:
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
Model not found:
# Verify the model ID exists on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()