Tokenizers
Overview
Tokenizers convert text into numerical representations (tokens) that models can process. They handle special tokens, padding, truncation, and attention masks.
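A minimal round trip with bert-base-uncased shows all four at once (the sample texts are arbitrary):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["short", "a slightly longer input"], padding=True, truncation=True)
print(batch["input_ids"])       # special tokens added, shorter sequence padded
print(batch["attention_mask"])  # 1 = real token, 0 = padding
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True))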
Loading Tokenizers
AutoTokenizer
Automatically load the correct tokenizer for a model:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Load from local path:
tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")
Basic Tokenization
Encode Text
# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# With text tokenization
tokens = tokenizer.tokenize(text)
print(tokens) # ['hello', ',', 'how', 'are', 'you', '?']
Decode Tokens
token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text) # "hello, how are you?"
# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text) # "hello, how are you?"
The __call__ Method
Primary tokenization interface:
# Single text
inputs = tokenizer("Hello, how are you?")
# Returns a dictionary with input_ids, attention_mask
# (and token_type_ids for models like BERT)
print(inputs)
# {
#   'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#   'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }
Multiple texts:
texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)
Key Parameters
Return Tensors
return_tensors: Output format ("pt", "tf", "np")
# PyTorch tensors
inputs = tokenizer("text", return_tensors="pt")
# TensorFlow tensors
inputs = tokenizer("text", return_tensors="tf")
# NumPy arrays
inputs = tokenizer("text", return_tensors="np")
Padding
padding: Pad sequences in a batch to the same length
# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)
# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)
# No padding
inputs = tokenizer(texts, padding=False)
pad_to_multiple_of: Pad the sequence length up to a multiple of the given value
inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)
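This helps hardware that prefers aligned shapes (e.g., Tensor Cores); the padded lengths are rounded up:
print([len(ids) for ids in inputs["input_ids"]])  # every length is a multiple of 8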
Truncation
truncation: Limit sequence length
# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)
# Truncate first sequence in pairs
inputs = tokenizer(text1, text2, truncation="only_first")
# Truncate second sequence
inputs = tokenizer(text1, text2, truncation="only_second")
# Remove tokens from the longer sequence first (the default when truncation=True)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)
Max Length
max_length: Maximum sequence length
inputs = tokenizer(text, max_length=512, truncation=True)
Additional Outputs
return_attention_mask: Include attention mask (default True)
inputs = tokenizer(text, return_attention_mask=True)
return_token_type_ids: Segment IDs for sentence pairs
inputs = tokenizer(text1, text2, return_token_type_ids=True)
return_offsets_mapping: Character position mapping (Fast tokenizers only)
inputs = tokenizer(text, return_offsets_mapping=True)
return_length: Include sequence lengths
inputs = tokenizer(texts, padding=True, return_length=True)
Special Tokens
Predefined Special Tokens
Access special tokens:
print(tokenizer.cls_token) # [CLS] or <s>
print(tokenizer.sep_token) # [SEP] or </s>
print(tokenizer.pad_token) # [PAD]
print(tokenizer.unk_token) # [UNK]
print(tokenizer.mask_token) # [MASK]
print(tokenizer.eos_token) # End of sequence
print(tokenizer.bos_token) # Beginning of sequence
# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)
Add Special Tokens
Manual control:
# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)
# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)
Custom Special Tokens
special_tokens_dict = {
"additional_special_tokens": ["<CUSTOM>", "<SPECIAL>"]
}
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")
# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))
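Added special tokens are kept whole rather than split by the subword algorithm:
print(tokenizer.tokenize("use <CUSTOM> here"))
# ['use', '<CUSTOM>', 'here'] - '<CUSTOM>' survives as a single token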
Sentence Pairs
Tokenize text pairs:
text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."
# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)
# Results in: [CLS] text1 [SEP] text2 [SEP]
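For models with segment embeddings (e.g., BERT), token_type_ids mark which sentence each token belongs to:
print(inputs["token_type_ids"])
# [0, 0, ..., 0, 1, 1, ..., 1] - 0 for text1 and its [SEP], 1 for text2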
Batch Encoding
Process multiple texts:
texts = ["First text", "Second text", "Third text"]
# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Access individual encodings
for i in range(len(texts)):
    input_ids = batch["input_ids"][i]
    attention_mask = batch["attention_mask"][i]
Fast Tokenizers
Use Rust-based tokenizers for speed:
from transformers import AutoTokenizer
# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if Fast
print(tokenizer.is_fast) # True
# Force Fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
Fast Tokenizer Features
Offset mapping (character positions):
inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)] # [CLS], "Hello", "world", [SEP]
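The offsets index directly into the original string, so each token maps back to its source span:
text = "Hello world"
for token, (start, end) in zip(inputs.tokens(), inputs["offset_mapping"]):
    print(token, repr(text[start:end]))
# [CLS] ''   hello 'Hello'   world 'world'   [SEP] ''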
Token to word mapping:
encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids) # [None, 0, 1, None] # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None
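word_ids() is the usual tool for aligning word-level labels (e.g., NER tags) with subword tokens. A minimal sketch, assuming hypothetical per-word labels and -100 so the loss ignores special tokens:
word_labels = [3, 7]  # hypothetical labels for "Hello" and "world"
token_labels = [-100 if wid is None else word_labels[wid] for wid in word_ids]
print(token_labels)  # [-100, 3, 7, -100]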
Saving Tokenizers
Save locally:
tokenizer.save_pretrained("./my_tokenizer")
Push to Hub:
tokenizer.push_to_hub("username/my-tokenizer")
Advanced Usage
Vocabulary
Access vocabulary:
vocab = tokenizer.get_vocab()
vocab_size = len(vocab)
# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)
# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")
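tokenize() plus the convert_* methods compose into a manual encode() without special tokens:
tokens = tokenizer.tokenize("hello world")
ids = tokenizer.convert_tokens_to_ids(tokens)
assert tokenizer.convert_ids_to_tokens(ids) == tokens  # round trip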
Encoding Details
Get detailed encoding information:
encoding = tokenizer("Hello world", return_tensors="pt")
# Fast-tokenizer helper methods remain available on the returned BatchEncoding
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()
Custom Preprocessing
Subclass for custom behavior. Note that AutoTokenizer is a factory and cannot be subclassed directly; subclass a concrete tokenizer class instead (here BertTokenizerFast, as an example):
from transformers import BertTokenizerFast

class CustomTokenizer(BertTokenizerFast):
    def __call__(self, text, **kwargs):
        # Custom preprocessing before tokenization
        text = text.lower().strip()
        return super().__call__(text, **kwargs)
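Usage is then the same as any other tokenizer:
tokenizer = CustomTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("  HELLO World  ")  # lowercased and stripped first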
Chat Templates
For conversational models:
messages = [
{"role": "system", "content": "You are helpful."},
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"},
{"role": "user", "content": "How are you?"}
]
# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
# Tokenize directly (returns a tensor of token IDs, not a dict)
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
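For inference, add_generation_prompt=True appends the assistant header so the model continues from there; return_dict=True (supported in recent transformers versions) also returns the attention mask:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
)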
Common Patterns
Pattern 1: Simple Text Classification
texts = ["I love this!", "I hate this!"]
labels = [1, 0]
inputs = tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors="pt"
)
# Use with model
outputs = model(**inputs, labels=torch.tensor(labels))
Pattern 2: Question Answering
question = "What is the capital?"
context = "Paris is the capital of France."
inputs = tokenizer(
question,
context,
padding=True,
truncation=True,
max_length=384,
return_tensors="pt"
)
Pattern 3: Text Generation
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate
outputs = model.generate(
    **inputs,  # passes input_ids and attention_mask together
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id
)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
Pattern 4: Dataset Tokenization
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
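padding="max_length" pads every example to 512 tokens up front. A usually more efficient variant skips padding during map and pads each batch dynamically with DataCollatorWithPadding:
from transformers import DataCollatorWithPadding

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads per batch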
Best Practices
- Always specify return_tensors: For model input
- Use padding and truncation: For batch processing
- Set max_length explicitly: Prevent memory issues
- Use Fast tokenizers: When available for speed
- Handle pad_token: Set to eos_token if None for generation
- Add special tokens: Leave enabled (default) unless specific reason
- Resize embeddings: After adding custom tokens
- Decode with skip_special_tokens: For cleaner output
- Use batched processing: For efficiency with datasets
- Save tokenizer with model: Ensure compatibility
Common Issues
Padding token not set:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
Sequence too long:
# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)
Mismatched vocabulary:
# Always load tokenizer and model from same checkpoint
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")
Attention mask issues:
# Ensure attention_mask is passed
outputs = model(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"]
)