Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,157 @@
---
name: transformers
description: This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.
---
# Transformers
## Overview
The Hugging Face Transformers library provides access to thousands of pre-trained models for tasks across NLP, computer vision, audio, and multimodal domains. Use this skill to load models, perform inference, and fine-tune on custom data.
## Installation
Install transformers and core dependencies:
```bash
uv pip install torch transformers datasets evaluate accelerate
```
For vision tasks, add:
```bash
uv pip install timm pillow
```
For audio tasks, add:
```bash
uv pip install librosa soundfile
```
## Authentication
Many models on the Hugging Face Hub require authentication. Set up access:
```python
from huggingface_hub import login
login() # Follow prompts to enter token
```
Or set environment variable:
```bash
export HF_TOKEN="your_token_here"
```
Get tokens at: https://huggingface.co/settings/tokens
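The token can also be passed explicitly when loading gated or private models. A minimal sketch, assuming the `HF_TOKEN` variable from above is set (the model ID is the gated checkpoint used later in this skill):
```python
import os

from transformers import AutoModelForCausalLM

# Read the token from the environment and pass it explicitly to from_pretrained
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    token=os.environ["HF_TOKEN"],
)
```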
## Quick Start
Use the Pipeline API for fast inference without manual configuration:
```python
from transformers import pipeline
# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)
# Text classification
classifier = pipeline("text-classification")
result = classifier("This movie was excellent!")
# Question answering
qa = pipeline("question-answering")
result = qa(question="What is AI?", context="AI is artificial intelligence...")
```
## Core Capabilities
### 1. Pipelines for Quick Inference
Use for simple, optimized inference across many tasks. Supports text generation, classification, NER, question answering, summarization, translation, image classification, object detection, audio classification, and more.
**When to use**: Quick prototyping, simple inference tasks, no custom preprocessing needed.
See `references/pipelines.md` for comprehensive task coverage and optimization.
### 2. Model Loading and Management
Load pre-trained models with fine-grained control over configuration, device placement, and precision.
**When to use**: Custom model initialization, advanced device management, model inspection.
See `references/models.md` for loading patterns and best practices.
### 3. Text Generation
Generate text with LLMs using various decoding strategies (greedy, beam search, sampling) and control parameters (temperature, top-k, top-p).
**When to use**: Creative text generation, code generation, conversational AI, text completion.
See `references/generation.md` for generation strategies and parameters.
### 4. Training and Fine-Tuning
Fine-tune pre-trained models on custom datasets using the Trainer API with automatic mixed precision, distributed training, and logging.
**When to use**: Task-specific model adaptation, domain adaptation, improving model performance.
See `references/training.md` for training workflows and best practices.
### 5. Tokenization
Convert text to tokens and token IDs for model input, with padding, truncation, and special token handling.
**When to use**: Custom preprocessing pipelines, understanding model inputs, batch processing.
See `references/tokenizers.md` for tokenization details.
## Common Patterns
### Pattern 1: Simple Inference
For straightforward tasks, use pipelines:
```python
pipe = pipeline("task-name", model="model-id")
output = pipe(input_data)
```
### Pattern 2: Custom Model Usage
For advanced control, load model and tokenizer separately:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForCausalLM.from_pretrained("model-id", device_map="auto")
inputs = tokenizer("text", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
result = tokenizer.decode(outputs[0])
```
### Pattern 3: Fine-Tuning
For task adaptation, use Trainer:
```python
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=8,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
trainer.train()
```
## Reference Documentation
For detailed information on specific components:
- **Pipelines**: `references/pipelines.md` - All supported tasks and optimization
- **Models**: `references/models.md` - Loading, saving, and configuration
- **Generation**: `references/generation.md` - Text generation strategies and parameters
- **Training**: `references/training.md` - Fine-tuning with Trainer API
- **Tokenizers**: `references/tokenizers.md` - Tokenization and preprocessing


@@ -0,0 +1,467 @@
# Text Generation
## Overview
Generate text with language models using the `generate()` method. Control output quality and style through generation strategies and parameters.
## Basic Generation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize input
inputs = tokenizer("Once upon a time", return_tensors="pt")
# Generate
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
## Generation Strategies
### Greedy Decoding
Select highest probability token at each step (deterministic):
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=False # Greedy decoding (default)
)
```
**Use for**: Factual text, translations, where determinism is needed.
### Sampling
Randomly sample from probability distribution:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
do_sample=True,
temperature=0.7,
top_k=50,
top_p=0.95
)
```
**Use for**: Creative writing, diverse outputs, open-ended generation.
### Beam Search
Explore multiple hypotheses in parallel:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
early_stopping=True
)
```
**Use for**: Translations, summarization, where quality is critical.
### Contrastive Search
Balance quality and diversity:
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
penalty_alpha=0.6,
top_k=4
)
```
**Use for**: Long-form generation, reducing repetition.
## Key Parameters
### Length Control
**max_new_tokens**: Maximum tokens to generate
```python
max_new_tokens=100 # Generate up to 100 new tokens
```
**max_length**: Maximum total length (input + output)
```python
max_length=512 # Total sequence length
```
**min_new_tokens**: Minimum tokens to generate
```python
min_new_tokens=50 # Force at least 50 tokens
```
**min_length**: Minimum total length
```python
min_length=100
```
### Temperature
Controls randomness (only with sampling):
```python
temperature=1.0 # Default, balanced
temperature=0.7 # More focused, less random
temperature=1.5 # More creative, more random
```
Lower temperature → more deterministic
Higher temperature → more random
### Top-K Sampling
Consider only top K most likely tokens:
```python
do_sample=True
top_k=50 # Sample from top 50 tokens
```
**Common values**: 40-100 for balanced output, 10-20 for focused output.
### Top-P (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability reaches P:
```python
do_sample=True
top_p=0.95 # Sample from smallest set with 95% cumulative probability
```
**Common values**: 0.9-0.95 for balanced, 0.7-0.85 for focused.
### Repetition Penalty
Discourage repetition:
```python
repetition_penalty=1.2 # Penalize repeated tokens
```
**Values**: 1.0 = no penalty, 1.2-1.5 = moderate, 2.0+ = strong penalty.
### Beam Search Parameters
**num_beams**: Number of beams
```python
num_beams=5 # Keep 5 hypotheses
```
**early_stopping**: Stop when num_beams sentences are finished
```python
early_stopping=True
```
**no_repeat_ngram_size**: Prevent n-gram repetition
```python
no_repeat_ngram_size=3 # Don't repeat any 3-gram
```
### Output Control
**num_return_sequences**: Generate multiple outputs
```python
outputs = model.generate(
**inputs,
max_new_tokens=50,
num_beams=5,
num_return_sequences=3 # Return 3 different sequences
)
```
**pad_token_id**: Specify padding token
```python
pad_token_id=tokenizer.eos_token_id
```
**eos_token_id**: Stop generation at specific token
```python
eos_token_id=tokenizer.eos_token_id
```
## Advanced Features
### Batch Generation
Generate for multiple prompts:
```python
prompts = ["Hello, my name is", "Once upon a time"]
# GPT-2 has no pad token: reuse EOS, and pad on the left for decoder-only models
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Prompt {i}: {text}\n")
```
### Streaming Generation
Stream tokens as generated:
```python
from transformers import TextIteratorStreamer
from threading import Thread
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=100)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()
```
### Constrained Generation
Force specific token sequences:
```python
# Force specific words to appear somewhere in the output (requires beam search)
force_words = ["Paris", "France"]
force_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in force_words]
outputs = model.generate(
**inputs,
force_words_ids=force_words_ids,
num_beams=5
)
```
### Guidance and Control
**Prevent bad words:**
```python
bad_words = ["offensive", "inappropriate"]
bad_words_ids = [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]
outputs = model.generate(
**inputs,
bad_words_ids=bad_words_ids
)
```
### Generation Config
Save and reuse generation parameters:
```python
from transformers import GenerationConfig
# Create config
generation_config = GenerationConfig(
max_new_tokens=100,
temperature=0.7,
top_k=50,
top_p=0.95,
do_sample=True
)
# Save
generation_config.save_pretrained("./my_generation_config")
# Load and use
generation_config = GenerationConfig.from_pretrained("./my_generation_config")
outputs = model.generate(**inputs, generation_config=generation_config)
```
## Model-Specific Generation
### Chat Models
Use chat templates:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
# add_generation_prompt appends the assistant turn header so the model starts a reply
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Encoder-Decoder Models
For T5, BART, etc.:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
# T5 uses task prefixes
input_text = "translate English to French: Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
## Optimization
### Caching
Enable KV cache for faster generation:
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
use_cache=True # Default, faster generation
)
```
### Static Cache
For fixed sequence lengths:
```python
from transformers import StaticCache
cache = StaticCache(model.config, max_batch_size=1, max_cache_len=1024, device="cuda")
outputs = model.generate(
**inputs,
max_new_tokens=100,
past_key_values=cache
)
```
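Recent transformers releases also expose the static cache directly through `generate()`. A minimal sketch, assuming a version that supports the `cache_implementation` argument:
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    cache_implementation="static"  # let generate() build the static KV cache internally
)
```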
### Attention Implementation
Use Flash Attention for speed:
```python
model = AutoModelForCausalLM.from_pretrained(
"model-id",
attn_implementation="flash_attention_2"
)
```
## Generation Recipes
### Creative Writing
```python
outputs = model.generate(
**inputs,
max_new_tokens=200,
do_sample=True,
temperature=0.8,
top_k=50,
top_p=0.95,
repetition_penalty=1.2
)
```
### Factual Generation
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
do_sample=False, # Greedy
repetition_penalty=1.1
)
```
### Diverse Outputs
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
num_return_sequences=5,
temperature=1.5,
do_sample=True
)
```
### Long-Form Generation
```python
outputs = model.generate(
**inputs,
max_new_tokens=1000,
penalty_alpha=0.6, # Contrastive search
top_k=4,
repetition_penalty=1.2
)
```
### Translation/Summarization
```python
outputs = model.generate(
**inputs,
max_new_tokens=100,
num_beams=5,
early_stopping=True,
no_repeat_ngram_size=3
)
```
## Common Issues
**Repetitive output:**
- Increase repetition_penalty (1.2-1.5)
- Use no_repeat_ngram_size (2-3)
- Try contrastive search
- Lower temperature
**Poor quality:**
- Use beam search (num_beams=5)
- Lower temperature
- Adjust top_k/top_p
**Too deterministic:**
- Enable sampling (do_sample=True)
- Increase temperature (0.7-1.0)
- Adjust top_k/top_p
**Slow generation:**
- Reduce batch size
- Enable use_cache=True
- Use Flash Attention
- Reduce max_new_tokens
## Best Practices
1. **Start with defaults**: Then tune based on output
2. **Use appropriate strategy**: Greedy for factual, sampling for creative
3. **Set max_new_tokens**: Avoid unnecessarily long generation
4. **Enable caching**: For faster sequential generation
5. **Tune temperature**: Most impactful parameter for sampling
6. **Use beam search carefully**: Slower but higher quality
7. **Test different seeds**: For reproducibility with sampling (see the sketch after this list)
8. **Monitor memory**: Large beams use significant memory
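For reproducible sampling runs (practice 7), fix the random seed before generating. A minimal sketch using `transformers.set_seed`:
```python
from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and PyTorch RNGs
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
```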


@@ -0,0 +1,361 @@
# Model Loading and Management
## Overview
The transformers library provides flexible model loading with automatic architecture detection, device management, and configuration control.
## Loading Models
### AutoModel Classes
Use AutoModel classes for automatic architecture selection:
```python
from transformers import AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
# Base model (no task head)
model = AutoModel.from_pretrained("bert-base-uncased")
# Sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
# Causal language modeling (GPT-style)
model = AutoModelForCausalLM.from_pretrained("gpt2")
# Masked language modeling (BERT-style)
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Sequence-to-sequence (T5-style)
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
### Common AutoModel Classes
**NLP Tasks:**
- `AutoModelForSequenceClassification`: Text classification, sentiment analysis
- `AutoModelForTokenClassification`: NER, POS tagging
- `AutoModelForQuestionAnswering`: Extractive QA
- `AutoModelForCausalLM`: Text generation (GPT, Llama)
- `AutoModelForMaskedLM`: Masked language modeling (BERT)
- `AutoModelForSeq2SeqLM`: Translation, summarization (T5, BART)
**Vision Tasks:**
- `AutoModelForImageClassification`: Image classification (loading sketch after this list)
- `AutoModelForObjectDetection`: Object detection
- `AutoModelForImageSegmentation`: Image segmentation
**Audio Tasks:**
- `AutoModelForAudioClassification`: Audio classification
- `AutoModelForSpeechSeq2Seq`: Speech recognition
**Multimodal:**
- `AutoModelForVision2Seq`: Image captioning, VQA
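As a concrete instance of the vision classes listed above, a minimal image-classification loading sketch (checkpoint chosen for illustration):
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```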
## Loading Parameters
### Basic Parameters
**pretrained_model_name_or_path**: Model identifier or local path
```python
model = AutoModel.from_pretrained("bert-base-uncased") # From Hub
model = AutoModel.from_pretrained("./local/model/path") # From disk
```
**num_labels**: Number of output labels for classification
```python
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3
)
```
**cache_dir**: Custom cache location
```python
model = AutoModel.from_pretrained("model-id", cache_dir="./my_cache")
```
### Device Management
**device_map**: Automatic device allocation for large models
```python
# Automatically distribute across GPUs and CPU
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto"
)
# Sequential placement
model = AutoModelForCausalLM.from_pretrained(
"model-id",
device_map="sequential"
)
# Custom device map
device_map = {
"transformer.layers.0": 0, # GPU 0
"transformer.layers.1": 1, # GPU 1
"transformer.layers.2": "cpu", # CPU
}
model = AutoModel.from_pretrained("model-id", device_map=device_map)
```
Manual device placement:
```python
import torch
model = AutoModel.from_pretrained("model-id")
model.to("cuda:0") # Move to GPU 0
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
```
### Precision Control
**torch_dtype**: Set model precision
```python
import torch
# Float16 (half precision)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# BFloat16 (better range than float16)
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.bfloat16)
# Auto (use original dtype)
model = AutoModel.from_pretrained("model-id", torch_dtype="auto")
```
### Attention Implementation
**attn_implementation**: Choose attention mechanism
```python
# Scaled Dot Product Attention (PyTorch 2.0+, fastest)
model = AutoModel.from_pretrained("model-id", attn_implementation="sdpa")
# Flash Attention 2 (requires flash-attn package)
model = AutoModel.from_pretrained("model-id", attn_implementation="flash_attention_2")
# Eager (default, most compatible)
model = AutoModel.from_pretrained("model-id", attn_implementation="eager")
```
### Memory Optimization
**low_cpu_mem_usage**: Reduce CPU memory during loading
```python
model = AutoModelForCausalLM.from_pretrained(
"large-model-id",
low_cpu_mem_usage=True,
device_map="auto"
)
```
**load_in_8bit**: 8-bit quantization (requires bitsandbytes)
```python
model = AutoModelForCausalLM.from_pretrained(
"model-id",
load_in_8bit=True,
device_map="auto"
)
```
**load_in_4bit**: 4-bit quantization
```python
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
"model-id",
quantization_config=quantization_config,
device_map="auto"
)
```
## Model Configuration
### Loading with Custom Config
```python
from transformers import AutoConfig, AutoModel
# Load and modify config
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
# Initialize model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
```
### Initializing from Config Only
```python
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config) # Random weights
```
## Model Modes
### Training vs Evaluation Mode
Models load in evaluation mode by default:
```python
model = AutoModel.from_pretrained("model-id")
print(model.training) # False
# Switch to training mode
model.train()
# Switch back to evaluation mode
model.eval()
```
Evaluation mode disables dropout and makes batch normalization layers use their stored running statistics.
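For inference, combine evaluation mode with disabled gradient tracking. A minimal sketch:
```python
import torch

model.eval()  # inference mode: dropout off, batch norm uses running stats
with torch.no_grad():  # skip gradient bookkeeping during inference
    outputs = model(**inputs)  # inputs: a tokenized batch from the matching tokenizer
```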
## Saving Models
### Save Locally
```python
model.save_pretrained("./my_model")
```
This creates:
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
### Save to Hugging Face Hub
```python
model.push_to_hub("username/model-name")
# With custom commit message
model.push_to_hub("username/model-name", commit_message="Update model")
# Private repository
model.push_to_hub("username/model-name", private=True)
```
## Model Inspection
### Parameter Count
```python
# Total parameters
total_params = model.num_parameters()
# Trainable parameters only
trainable_params = model.num_parameters(only_trainable=True)
print(f"Total: {total_params:,}")
print(f"Trainable: {trainable_params:,}")
```
### Memory Footprint
```python
memory_bytes = model.get_memory_footprint()
memory_mb = memory_bytes / 1024**2
print(f"Memory: {memory_mb:.2f} MB")
```
### Model Architecture
```python
print(model) # Print full architecture
# Access specific components
print(model.config)
print(model.base_model)
```
## Forward Pass
Basic inference:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
inputs = tokenizer("Sample text", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
```
## Model Formats
### SafeTensors vs PyTorch
SafeTensors is faster and safer:
```python
# Save as safetensors (recommended)
model.save_pretrained("./model", safe_serialization=True)
# Load either format automatically
model = AutoModel.from_pretrained("./model")
```
### ONNX Export
Export for optimized inference:
```python
from pathlib import Path

from transformers.onnx import FeaturesManager, export

# Legacy transformers.onnx API; the optimum library is the currently recommended export route
model_kind, onnx_config_factory = FeaturesManager.check_supported_model_or_raise(model)
onnx_config = onnx_config_factory(model.config)

# export(preprocessor, model, config, opset, output)
onnx_inputs, onnx_outputs = export(
    tokenizer, model, onnx_config, onnx_config.default_onnx_opset, Path("model.onnx")
)
```
## Best Practices
1. **Use AutoModel classes**: Automatic architecture detection
2. **Specify dtype explicitly**: Control precision and memory
3. **Use device_map="auto"**: For large models
4. **Enable low_cpu_mem_usage**: When loading large models
5. **Use safetensors format**: Faster and safer serialization
6. **Check model.training**: Ensure correct mode for task
7. **Consider quantization**: For deployment on resource-constrained devices
8. **Cache models locally**: Set the `TRANSFORMERS_CACHE` environment variable (see the example after this list)
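For practice 8, the cache location is controlled through environment variables; a short sketch (to my understanding, `HF_HOME` is the current umbrella variable and `TRANSFORMERS_CACHE` the older transformers-specific one):
```bash
# Store all Hugging Face caches (hub downloads, datasets, etc.) under one directory
export HF_HOME="/data/hf_cache"
# Legacy variable still honored by older transformers versions
export TRANSFORMERS_CACHE="/data/hf_cache/transformers"
```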
## Common Issues
**CUDA out of memory:**
```python
# Use smaller precision
model = AutoModel.from_pretrained("model-id", torch_dtype=torch.float16)
# Or use quantization
model = AutoModel.from_pretrained("model-id", load_in_8bit=True)
# Or use CPU
model = AutoModel.from_pretrained("model-id", device_map="cpu")
```
**Slow loading:**
```python
# Enable low CPU memory mode
model = AutoModel.from_pretrained("model-id", low_cpu_mem_usage=True)
```
**Model not found:**
```python
# Verify the model ID exists on huggingface.co
# Check authentication for private models
from huggingface_hub import login
login()
```


@@ -0,0 +1,335 @@
# Pipeline API Reference
## Overview
Pipelines provide the simplest way to use pre-trained models for inference. They abstract away tokenization, model loading, and post-processing, offering a unified interface for dozens of tasks.
## Basic Usage
Create a pipeline by specifying a task:
```python
from transformers import pipeline
# Auto-select default model for task
pipe = pipeline("text-classification")
result = pipe("This is great!")
```
Or specify a model:
```python
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
```
## Supported Tasks
### Natural Language Processing
**text-generation**: Generate text continuations
```python
generator = pipeline("text-generation", model="gpt2")
output = generator("Once upon a time", max_length=50, num_return_sequences=2)
```
**text-classification**: Classify text into categories
```python
classifier = pipeline("text-classification")
result = classifier("I love this product!") # Returns label and score
```
**token-classification**: Label individual tokens (NER, POS tagging)
```python
ner = pipeline("token-classification", model="dslim/bert-base-NER")
entities = ner("Hugging Face is based in New York City")
```
**question-answering**: Extract answers from context
```python
qa = pipeline("question-answering")
result = qa(question="What is the capital?", context="Paris is the capital of France.")
```
**fill-mask**: Predict masked tokens
```python
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("Paris is the [MASK] of France")
```
**summarization**: Summarize long texts
```python
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer("Long article text...", max_length=130, min_length=30)
```
**translation**: Translate between languages
```python
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")
```
**zero-shot-classification**: Classify without training data
```python
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
"This is a course about Python programming",
candidate_labels=["education", "politics", "business"]
)
```
**sentiment-analysis**: Alias for text-classification focused on sentiment
```python
sentiment = pipeline("sentiment-analysis")
result = sentiment("This product exceeded my expectations!")
```
### Computer Vision
**image-classification**: Classify images
```python
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")
# Or use PIL Image or URL
from PIL import Image
result = classifier(Image.open("image.jpg"))
```
**object-detection**: Detect objects in images
```python
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
results = detector("image.jpg") # Returns bounding boxes and labels
```
**image-segmentation**: Segment images
```python
segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")
segments = segmenter("image.jpg")
```
**depth-estimation**: Estimate depth from images
```python
depth = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth("image.jpg")
```
**zero-shot-image-classification**: Classify images without training
```python
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
result = classifier("image.jpg", candidate_labels=["cat", "dog", "bird"])
```
### Audio
**automatic-speech-recognition**: Transcribe speech
```python
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
text = asr("audio.mp3")
```
**audio-classification**: Classify audio
```python
classifier = pipeline("audio-classification", model="MIT/ast-finetuned-audioset-10-10-0.4593")
result = classifier("audio.wav")
```
**text-to-speech**: Generate speech from text (with specific models)
```python
tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")
audio = tts("Hello, this is a test")
```
### Multimodal
**visual-question-answering**: Answer questions about images
```python
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="image.jpg", question="What color is the car?")
```
**document-question-answering**: Answer questions about documents
```python
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
result = doc_qa(image="document.png", question="What is the invoice number?")
```
**image-to-text**: Generate captions for images
```python
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")
```
## Pipeline Parameters
### Common Parameters
**model**: Model identifier or path
```python
pipe = pipeline("task", model="model-id")
```
**device**: GPU device index (-1 for CPU, 0+ for GPU)
```python
pipe = pipeline("task", device=0) # Use first GPU
```
**device_map**: Automatic device allocation for large models
```python
pipe = pipeline("task", model="large-model", device_map="auto")
```
**torch_dtype**: Model precision (reduces memory)
```python
import torch
pipe = pipeline("task", torch_dtype=torch.float16)
```
**batch_size**: Process multiple inputs at once
```python
pipe = pipeline("task", batch_size=8)
results = pipe(["text1", "text2", "text3"])
```
**framework**: Choose PyTorch or TensorFlow
```python
pipe = pipeline("task", framework="pt") # or "tf"
```
## Batch Processing
Process multiple inputs efficiently:
```python
classifier = pipeline("text-classification")
texts = ["Great product!", "Terrible experience", "Just okay"]
results = classifier(texts)
```
For large datasets, use generators or KeyDataset:
```python
from transformers.pipelines.pt_utils import KeyDataset
import datasets
dataset = datasets.load_dataset("dataset-name", split="test")
pipe = pipeline("task", device=0)
for output in pipe(KeyDataset(dataset, "text")):
    print(output)
```
## Performance Optimization
### GPU Acceleration
Always specify device for GPU usage:
```python
pipe = pipeline("task", device=0)
```
### Mixed Precision
Use float16 for 2x speedup on supported GPUs:
```python
import torch
pipe = pipeline("task", torch_dtype=torch.float16, device=0)
```
### Batching Guidelines
- **CPU**: Usually skip batching
- **GPU with variable lengths**: May reduce efficiency
- **GPU with similar lengths**: Significant speedup
- **Real-time applications**: Skip batching (increases latency)
```python
# Good for throughput
pipe = pipeline("task", batch_size=32, device=0)
results = pipe(list_of_texts)
```
### Streaming Output
For text generation, stream tokens as they're generated:
```python
from transformers import AutoTokenizer, TextStreamer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
streamer = TextStreamer(tokenizer, skip_prompt=True)  # TextStreamer needs the tokenizer to decode tokens
generator = pipeline("text-generation", model="gpt2", tokenizer=tokenizer)
generator("The future of AI", max_length=100, streamer=streamer)
```
## Custom Pipeline Configuration
Specify tokenizer and model separately:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModelForSequenceClassification.from_pretrained("model-id")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
```
Use custom pipeline classes:
```python
from transformers import TextClassificationPipeline
class CustomPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs, **kwargs):
        # Custom post-processing
        return super().postprocess(model_outputs, **kwargs)
pipe = pipeline("text-classification", model="model-id", pipeline_class=CustomPipeline)
```
## Input Formats
Pipelines accept various input types:
**Text tasks**: Strings or lists of strings
```python
pipe("single text")
pipe(["text1", "text2"])
```
**Image tasks**: URLs, file paths, PIL Images, or numpy arrays
```python
pipe("https://example.com/image.jpg")
pipe("local/path/image.png")
pipe(PIL.Image.open("image.jpg"))
pipe(numpy_array)
```
**Audio tasks**: File paths, numpy arrays, or raw waveforms
```python
pipe("audio.mp3")
pipe(audio_array)
```
## Error Handling
Handle common issues:
```python
try:
    result = pipe(input_data)
except Exception as e:
    if "CUDA out of memory" in str(e):
        # Reduce batch size or use CPU
        pipe = pipeline("task", device=-1)
    elif "does not appear to have a file named" in str(e):
        # Model not found
        print("Check model identifier")
    else:
        raise
```
## Best Practices
1. **Use pipelines for prototyping**: Fast iteration without boilerplate
2. **Specify models explicitly**: Default models may change
3. **Enable GPU when available**: Significant speedup
4. **Use batching for throughput**: When processing many inputs (combined sketch after this list)
5. **Consider memory usage**: Use float16 or smaller models for large batches
6. **Cache models locally**: Avoid repeated downloads
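A short sketch combining these practices (explicit model, GPU, half precision, batching); the model ID is the sentiment checkpoint used earlier:
```python
import torch
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # pin the model explicitly
    device=0,                    # first GPU
    torch_dtype=torch.float16,   # half precision to cut memory
    batch_size=32,               # batch for throughput
)
results = classifier(list_of_texts)  # list_of_texts: your list of input strings
```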


@@ -0,0 +1,447 @@
# Tokenizers
## Overview
Tokenizers convert text into numerical representations (tokens) that models can process. They handle special tokens, padding, truncation, and attention masks.
## Loading Tokenizers
### AutoTokenizer
Automatically load the correct tokenizer for a model:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
Load from local path:
```python
tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")
```
## Basic Tokenization
### Encode Text
```python
# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# With text tokenization
tokens = tokenizer.tokenize(text)
print(tokens) # ['hello', ',', 'how', 'are', 'you', '?']
```
### Decode Tokens
```python
token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text) # "hello, how are you?"
# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text) # "hello, how are you?"
```
## The `__call__` Method
Primary tokenization interface:
```python
# Single text
inputs = tokenizer("Hello, how are you?")
# Returns dictionary with input_ids, attention_mask
print(inputs)
# {
# 'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }
```
Multiple texts:
```python
texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)
```
## Key Parameters
### Return Tensors
**return_tensors**: Output format ("pt", "tf", "np")
```python
# PyTorch tensors
inputs = tokenizer("text", return_tensors="pt")
# TensorFlow tensors
inputs = tokenizer("text", return_tensors="tf")
# NumPy arrays
inputs = tokenizer("text", return_tensors="np")
```
### Padding
**padding**: Pad sequences to same length
```python
# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)
# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)
# No padding
inputs = tokenizer(texts, padding=False)
```
**pad_to_multiple_of**: Pad to multiple of specified value
```python
inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)
```
### Truncation
**truncation**: Limit sequence length
```python
# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)
# Truncate first sequence in pairs
inputs = tokenizer(text1, text2, truncation="only_first")
# Truncate second sequence
inputs = tokenizer(text1, text2, truncation="only_second")
# Truncate longest first (default for pairs)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)
```
### Max Length
**max_length**: Maximum sequence length
```python
inputs = tokenizer(text, max_length=512, truncation=True)
```
### Additional Outputs
**return_attention_mask**: Include attention mask (default True)
```python
inputs = tokenizer(text, return_attention_mask=True)
```
**return_token_type_ids**: Segment IDs for sentence pairs
```python
inputs = tokenizer(text1, text2, return_token_type_ids=True)
```
**return_offsets_mapping**: Character position mapping (Fast tokenizers only)
```python
inputs = tokenizer(text, return_offsets_mapping=True)
```
**return_length**: Include sequence lengths
```python
inputs = tokenizer(texts, padding=True, return_length=True)
```
## Special Tokens
### Predefined Special Tokens
Access special tokens:
```python
print(tokenizer.cls_token) # [CLS] or <s>
print(tokenizer.sep_token) # [SEP] or </s>
print(tokenizer.pad_token) # [PAD]
print(tokenizer.unk_token) # [UNK]
print(tokenizer.mask_token) # [MASK]
print(tokenizer.eos_token) # End of sequence
print(tokenizer.bos_token) # Beginning of sequence
# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)
```
### Add Special Tokens
Manual control:
```python
# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)
# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)
```
### Custom Special Tokens
```python
special_tokens_dict = {
"additional_special_tokens": ["<CUSTOM>", "<SPECIAL>"]
}
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")
# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))
```
## Sentence Pairs
Tokenize text pairs:
```python
text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."
# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)
# Results in: [CLS] text1 [SEP] text2 [SEP]
```
## Batch Encoding
Process multiple texts:
```python
texts = ["First text", "Second text", "Third text"]
# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Access individual encodings
for i in range(len(texts)):
    input_ids = batch["input_ids"][i]
    attention_mask = batch["attention_mask"][i]
```
## Fast Tokenizers
Use Rust-based tokenizers for speed:
```python
from transformers import AutoTokenizer
# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if Fast
print(tokenizer.is_fast) # True
# Force Fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
```
### Fast Tokenizer Features
**Offset mapping** (character positions):
```python
inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)] # [CLS], "Hello", "world", [SEP]
```
**Token to word mapping**:
```python
encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids) # [None, 0, 1, None] # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None
```
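A common use of `word_ids()` is aligning word-level labels (e.g., NER tags) to subword tokens. A minimal sketch, assuming `word_labels` holds one hypothetical label ID per word:
```python
text = "Hugging Face is based in New York City"
word_labels = [0, 0, 0, 0, 0, 1, 1, 1]  # hypothetical per-word label IDs

encoding = tokenizer(text.split(), is_split_into_words=True, return_tensors="pt")
aligned_labels = []
for word_id in encoding.word_ids():
    # Special tokens get -100 so the loss function ignores them
    aligned_labels.append(-100 if word_id is None else word_labels[word_id])
```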
## Saving Tokenizers
Save locally:
```python
tokenizer.save_pretrained("./my_tokenizer")
```
Push to Hub:
```python
tokenizer.push_to_hub("username/my-tokenizer")
```
## Advanced Usage
### Vocabulary
Access vocabulary:
```python
vocab = tokenizer.get_vocab()
vocab_size = len(vocab)
# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)
# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")
```
### Encoding Details
Get detailed encoding information:
```python
encoding = tokenizer("Hello world", return_tensors="pt")
# Original methods still available
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()
```
### Custom Preprocessing
`AutoTokenizer` is a factory and cannot be subclassed directly; subclass a concrete tokenizer class instead:
```python
from transformers import BertTokenizerFast

class CustomTokenizer(BertTokenizerFast):
    def __call__(self, text, **kwargs):
        # Custom preprocessing before the standard tokenization
        text = text.lower().strip()
        return super().__call__(text, **kwargs)
```
## Chat Templates
For conversational models:
```python
messages = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"},
{"role": "user", "content": "How are you?"}
]
# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
# Tokenize directly
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
```
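For inference you usually also want the template to end with the assistant turn header so the model starts a reply; a minimal sketch using the `add_generation_prompt` flag:
```python
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant prompt marker
    return_tensors="pt"
)
```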
## Common Patterns
### Pattern 1: Simple Text Classification
```python
texts = ["I love this!", "I hate this!"]
labels = [1, 0]
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors="pt"
)
# Use with model
outputs = model(**inputs, labels=torch.tensor(labels))
```
### Pattern 2: Question Answering
```python
question = "What is the capital?"
context = "Paris is the capital of France."
inputs = tokenizer(
question,
context,
padding=True,
truncation=True,
max_length=384,
return_tensors="pt"
)
```
### Pattern 3: Text Generation
```python
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate
outputs = model.generate(
inputs["input_ids"],
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id
)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Pattern 4: Dataset Tokenization
```python
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
## Best Practices
1. **Always specify return_tensors**: For model input
2. **Use padding and truncation**: For batch processing
3. **Set max_length explicitly**: Prevent memory issues
4. **Use Fast tokenizers**: When available for speed
5. **Handle pad_token**: Set to eos_token if None for generation
6. **Add special tokens**: Leave enabled (default) unless specific reason
7. **Resize embeddings**: After adding custom tokens
8. **Decode with skip_special_tokens**: For cleaner output
9. **Use batched processing**: For efficiency with datasets
10. **Save tokenizer with model**: Ensure compatibility
## Common Issues
**Padding token not set:**
```python
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```
**Sequence too long:**
```python
# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)
```
**Mismatched vocabulary:**
```python
# Always load tokenizer and model from same checkpoint
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")
```
**Attention mask issues:**
```python
# Ensure attention_mask is passed
outputs = model(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"]
)
```


@@ -0,0 +1,500 @@
# Training and Fine-Tuning
## Overview
Fine-tune pre-trained models on custom datasets using the Trainer API. The Trainer handles training loops, gradient accumulation, mixed precision, logging, and checkpointing.
## Basic Fine-Tuning Workflow
### Step 1: Load and Preprocess Data
```python
from datasets import load_dataset
# Load dataset
dataset = load_dataset("yelp_review_full")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
# Tokenize
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
```
### Step 2: Load Model
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=5 # Number of classes
)
```
### Step 3: Define Metrics
```python
import evaluate
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
### Step 4: Configure Training
```python
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=10,
load_best_model_at_end=True,
metric_for_best_model="accuracy",
)
```
### Step 5: Create Trainer and Train
```python
from transformers import Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# Start training
trainer.train()
# Evaluate
results = trainer.evaluate()
print(results)
```
### Step 6: Save Model
```python
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
# Or push to Hub
trainer.push_to_hub("username/my-finetuned-model")
```
## TrainingArguments Parameters
### Essential Parameters
**output_dir**: Directory for checkpoints and logs
```python
output_dir="./results"
```
**num_train_epochs**: Number of training epochs
```python
num_train_epochs=3
```
**per_device_train_batch_size**: Batch size per GPU/CPU
```python
per_device_train_batch_size=8
```
**learning_rate**: Optimizer learning rate
```python
learning_rate=2e-5 # Common for BERT-style models
learning_rate=5e-5 # Common for smaller models
```
**weight_decay**: L2 regularization
```python
weight_decay=0.01
```
### Evaluation and Saving
**eval_strategy**: When to evaluate ("no", "steps", "epoch")
```python
eval_strategy="epoch" # Evaluate after each epoch
eval_strategy="steps" # Evaluate every eval_steps
```
**save_strategy**: When to save checkpoints
```python
save_strategy="epoch"
save_strategy="steps"
save_steps=500
```
**load_best_model_at_end**: Load best checkpoint after training
```python
load_best_model_at_end=True
metric_for_best_model="accuracy" # Metric to compare
```
### Optimization
**gradient_accumulation_steps**: Accumulate gradients over multiple steps
```python
gradient_accumulation_steps=4 # Effective batch size = batch_size * 4
```
**fp16**: Enable mixed precision (NVIDIA GPUs)
```python
fp16=True
```
**bf16**: Enable bfloat16 (newer GPUs)
```python
bf16=True
```
**gradient_checkpointing**: Trade compute for memory
```python
gradient_checkpointing=True # Slower but uses less memory
```
**optim**: Optimizer choice
```python
optim="adamw_torch" # Default
optim="adamw_8bit" # 8-bit Adam (requires bitsandbytes)
optim="adafactor" # Memory-efficient alternative
```
### Learning Rate Scheduling
**lr_scheduler_type**: Learning rate schedule
```python
lr_scheduler_type="linear" # Linear decay
lr_scheduler_type="cosine" # Cosine annealing
lr_scheduler_type="constant" # No decay
lr_scheduler_type="constant_with_warmup"
```
**warmup_steps** or **warmup_ratio**: Warmup period
```python
warmup_steps=500
# Or
warmup_ratio=0.1 # 10% of total steps
```
### Logging
**logging_dir**: TensorBoard logs directory
```python
logging_dir="./logs"
```
**logging_steps**: Log every N steps
```python
logging_steps=10
```
**report_to**: Logging integrations
```python
report_to=["tensorboard"]
report_to=["wandb"]
report_to=["tensorboard", "wandb"]
```
### Distributed Training
**ddp_backend**: Distributed backend
```python
ddp_backend="nccl" # For multi-GPU
```
**deepspeed**: DeepSpeed config file
```python
deepspeed="ds_config.json"
```
## Data Collators
Handle dynamic padding and special preprocessing:
### DataCollatorWithPadding
Pad sequences to longest in batch:
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator,
)
```
### DataCollatorForLanguageModeling
For masked language modeling:
```python
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=True,
mlm_probability=0.15
)
```
### DataCollatorForSeq2Seq
For sequence-to-sequence tasks:
```python
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
tokenizer=tokenizer,
model=model,
padding=True
)
```
## Custom Training
### Custom Trainer
Override methods for custom behavior:
```python
import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Custom loss computation (class_weights is a torch.Tensor you define for your dataset)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```
### Custom Callbacks
Monitor and control training:
```python
from transformers import TrainerCallback
class CustomCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Epoch {state.epoch} completed")
        # Custom logic here
        return control
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
callbacks=[CustomCallback],
)
```
## Advanced Training Techniques
### Parameter-Efficient Fine-Tuning (PEFT)
Use LoRA for efficient fine-tuning:
```python
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["query", "value"],
lora_dropout=0.05,
bias="none",
task_type="SEQ_CLS"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # Shows reduced parameter count
# Train normally with Trainer
trainer = Trainer(model=model, args=training_args, ...)
trainer.train()
```
### Gradient Checkpointing
Reduce memory at cost of speed:
```python
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
gradient_checkpointing=True,
...
)
```
### Mixed Precision Training
```python
training_args = TrainingArguments(
fp16=True, # For NVIDIA GPUs with Tensor Cores
# or
bf16=True, # For newer GPUs (A100, H100)
...
)
```
### DeepSpeed Integration
For very large models, create a DeepSpeed config file (`ds_config.json`):
```json
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}
```
Then reference the file from TrainingArguments:
```python
training_args = TrainingArguments(
deepspeed="ds_config.json",
...
)
```
## Training Tips
### Hyperparameter Tuning
Common starting points:
- **Learning rate**: 2e-5 to 5e-5 for BERT-like models, 1e-4 to 1e-3 for smaller models
- **Batch size**: 8-32 depending on GPU memory
- **Epochs**: 2-4 for fine-tuning, more for domain adaptation
- **Warmup**: 10% of total steps
Use Optuna for hyperparameter search:
```python
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=5
    )

def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
    }
trainer = Trainer(model_init=model_init, args=training_args, ...)
best_trial = trainer.hyperparameter_search(
direction="maximize",
backend="optuna",
hp_space=optuna_hp_space,
n_trials=10,
)
```
### Monitoring Training
Use TensorBoard:
```bash
tensorboard --logdir ./logs
```
Or Weights & Biases:
```python
import wandb
wandb.init(project="my-project")
training_args = TrainingArguments(
report_to=["wandb"],
...
)
```
### Resume Training
Resume from checkpoint:
```python
trainer.train(resume_from_checkpoint="./results/checkpoint-1000")
```
## Common Issues
**CUDA out of memory:**
- Reduce batch size
- Enable gradient checkpointing
- Use gradient accumulation
- Use 8-bit optimizers
**Overfitting:**
- Increase weight_decay
- Add dropout
- Use early stopping (see the combined sketch after this section)
- Reduce model size or training epochs
**Slow training:**
- Increase batch size
- Enable mixed precision (fp16/bf16)
- Use multiple GPUs
- Optimize data loading
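A combined sketch of the memory-saving and early-stopping fixes above (illustrative starting values, not tuned settings):
```python
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,   # smaller batches to fit in memory
    gradient_accumulation_steps=4,   # keep the effective batch size at 16
    gradient_checkpointing=True,     # trade compute for memory
    fp16=True,                       # mixed precision
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="accuracy",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 evals without improvement
)
```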
## Best Practices
1. **Start small**: Test on small dataset subset first
2. **Use evaluation**: Monitor validation metrics
3. **Save checkpoints**: Enable save_strategy
4. **Log extensively**: Use TensorBoard or W&B
5. **Try different learning rates**: Start with 2e-5
6. **Use warmup**: Helps training stability
7. **Enable mixed precision**: Faster training
8. **Consider PEFT**: For large models with limited resources