Tokenizers

Overview

Tokenizers convert text into numerical representations (tokens) that models can process. They handle special tokens, padding, truncation, and attention masks.
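
A minimal end-to-end sketch (assuming the bert-base-uncased checkpoint, used throughout this page):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Text in, model-ready tensors out
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
print(inputs["input_ids"].shape)       # torch.Size([1, 8])
print(inputs["attention_mask"].shape)  # torch.Size([1, 8])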

Loading Tokenizers

AutoTokenizer

Automatically load the correct tokenizer for a model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Load from local path:

tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")

Basic Tokenization

Encode Text

# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens)  # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# Split text into subword tokens (strings, not IDs)
tokens = tokenizer.tokenize(text)
print(tokens)  # ['hello', ',', 'how', 'are', 'you', '?']

Decode Tokens

token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text)  # "[CLS] hello, how are you? [SEP]"

# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text)  # "hello, how are you?"

The __call__ Method

Primary tokenization interface:

# Single text
inputs = tokenizer("Hello, how are you?")

# Returns a dictionary with input_ids and attention_mask
# (plus token_type_ids for models like BERT)
print(inputs)
# {
#   'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#   'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }

Multiple texts:

texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)

Key Parameters

Return Tensors

return_tensors: Output format ("pt", "tf", "np")

# PyTorch tensors
inputs = tokenizer("text", return_tensors="pt")

# TensorFlow tensors
inputs = tokenizer("text", return_tensors="tf")

# NumPy arrays
inputs = tokenizer("text", return_tensors="np")

Padding

padding: Pad sequences to the same length

# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)

# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)

# No padding
inputs = tokenizer(texts, padding=False)

pad_to_multiple_of: Pad to multiple of specified value

inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)
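
The batch is first padded to its longest sequence, then rounded up to the requested multiple; a quick check (the example texts are illustrative):

inputs = tokenizer(["Hello", "How are you today?"], padding=True, pad_to_multiple_of=8)
print([len(ids) for ids in inputs["input_ids"]])  # [8, 8]: 7 tokens rounded up to 8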

Truncation

truncation: Limit sequence length

# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)

# Truncate only the first sequence of a pair
inputs = tokenizer(text1, text2, truncation="only_first")

# Truncate only the second sequence
inputs = tokenizer(text1, text2, truncation="only_second")

# Truncate longest first (default for pairs)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)

Max Length

max_length: Maximum sequence length

inputs = tokenizer(text, max_length=512, truncation=True)

Additional Outputs

return_attention_mask: Include attention mask (default True)

inputs = tokenizer(text, return_attention_mask=True)

return_token_type_ids: Segment IDs for sentence pairs

inputs = tokenizer(text1, text2, return_token_type_ids=True)

return_offsets_mapping: Character position mapping (Fast tokenizers only)

inputs = tokenizer(text, return_offsets_mapping=True)

return_length: Include sequence lengths

inputs = tokenizer(texts, padding=True, return_length=True)

Special Tokens

Predefined Special Tokens

Access special tokens:

print(tokenizer.cls_token)      # [CLS] or <s>
print(tokenizer.sep_token)      # [SEP] or </s>
print(tokenizer.pad_token)      # [PAD]
print(tokenizer.unk_token)      # [UNK]
print(tokenizer.mask_token)     # [MASK]
print(tokenizer.eos_token)      # End of sequence (None for BERT)
print(tokenizer.bos_token)      # Beginning of sequence (None for BERT)

# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)
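
To inspect all special tokens at once:

print(tokenizer.all_special_tokens)  # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)     # their matching IDs
print(tokenizer.special_tokens_map)  # {'unk_token': '[UNK]', 'sep_token': '[SEP]', ...}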

Add Special Tokens

Manual control:

# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)

# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)

Custom Special Tokens

special_tokens_dict = {
    "additional_special_tokens": ["<CUSTOM>", "<SPECIAL>"]
}

num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")

# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))

Sentence Pairs

Tokenize text pairs:

text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."

# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)

# Results in: [CLS] text1 [SEP] text2 [SEP]
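
For models like BERT, the returned token_type_ids mark which segment each token belongs to:

print(inputs["token_type_ids"])
# [0, 0, ..., 0, 1, 1, ..., 1]  # 0 = text1 tokens, 1 = text2 tokens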

Batch Encoding

Process multiple texts:

texts = ["First text", "Second text", "Third text"]

# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Access individual encodings
for i in range(len(texts)):
    input_ids = batch["input_ids"][i]
    attention_mask = batch["attention_mask"][i]
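
When return_tensors is set, the returned BatchEncoding also supports .to() for moving every tensor to a device in one call:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = batch.to(device)  # moves input_ids and attention_mask together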

Fast Tokenizers

Use Rust-based tokenizers for speed:

from transformers import AutoTokenizer

# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if Fast
print(tokenizer.is_fast)  # True

# Force Fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

Fast Tokenizer Features

Offset mapping (character positions):

inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)]  # [CLS], "Hello", "world", [SEP]
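
The offsets index directly into the original string, so each token's source span can be recovered:

text = "Hello world"
inputs = tokenizer(text, return_offsets_mapping=True)
for start, end in inputs["offset_mapping"]:
    print(repr(text[start:end]))  # '', 'Hello', 'world', '' (special tokens span nothing)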

Token to word mapping:

encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids)  # [None, 0, 1, None]  # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None
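
word_ids() is the usual tool for aligning word-level labels with subword tokens; a minimal sketch (the labels are illustrative):

word_labels = [0, 1]  # one label per word: "Hello" -> 0, "world" -> 1
aligned = [
    -100 if wid is None else word_labels[wid]  # -100 is ignored by PyTorch losses
    for wid in word_ids
]
print(aligned)  # [-100, 0, 1, -100]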

Saving Tokenizers

Save locally:

tokenizer.save_pretrained("./my_tokenizer")

Push to Hub:

tokenizer.push_to_hub("username/my-tokenizer")

Advanced Usage

Vocabulary

Access vocabulary:

vocab = tokenizer.get_vocab()
vocab_size = len(vocab)

# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)

# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")

Encoding Details

Get detailed encoding information:

encoding = tokenizer("Hello world", return_tensors="pt")

# BatchEncoding convenience methods (Fast tokenizers only)
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()

Custom Preprocessing

AutoTokenizer is a factory class and cannot be subclassed directly; subclass a concrete tokenizer class instead:

from transformers import BertTokenizerFast

class CustomTokenizer(BertTokenizerFast):
    def __call__(self, text, **kwargs):
        # Custom preprocessing before tokenization
        text = text.lower().strip()
        return super().__call__(text, **kwargs)

tokenizer = CustomTokenizer.from_pretrained("bert-base-uncased")

Chat Templates

For conversational models:

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)

# Tokenize directly
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt")
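
When prompting a model to respond, add_generation_prompt=True appends the template's assistant header so generation starts in the right place:

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
)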

Common Patterns

Pattern 1: Simple Text Classification

texts = ["I love this!", "I hate this!"]
labels = [1, 0]

inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Use with model
outputs = model(**inputs, labels=torch.tensor(labels))

Pattern 2: Question Answering

question = "What is the capital?"
context = "Paris is the capital of France."

inputs = tokenizer(
    question,
    context,
    padding=True,
    truncation=True,
    max_length=384,
    return_tensors="pt"
)

Pattern 3: Text Generation

prompt = "Once upon a time"

inputs = tokenizer(prompt, return_tensors="pt")

# Generate
outputs = model.generate(
    **inputs,  # pass attention_mask along with input_ids
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id
)

# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

Pattern 4: Dataset Tokenization

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
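
Padding every example to max_length wastes compute on short batches; a common alternative is dynamic padding at batch time with DataCollatorWithPadding:

from transformers import DataCollatorWithPadding

def tokenize_function(examples):
    # No padding here; the collator pads each batch to its own longest sequence
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pass to Trainer or DataLoader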

Best Practices

  1. Always specify return_tensors: For model input
  2. Use padding and truncation: For batch processing
  3. Set max_length explicitly: Prevent memory issues
  4. Use Fast tokenizers: When available for speed
  5. Handle pad_token: Set to eos_token if None for generation
  6. Add special tokens: Leave enabled (default) unless specific reason
  7. Resize embeddings: After adding custom tokens
  8. Decode with skip_special_tokens: For cleaner output
  9. Use batched processing: For efficiency with datasets
  10. Save tokenizer with model: Ensure compatibility

Common Issues

Padding token not set:

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
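
If generation still warns about padding, mirroring the choice on the model config is a common companion fix:

model.config.pad_token_id = tokenizer.pad_token_id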

Sequence too long:

# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)

Mismatched vocabulary:

# Always load tokenizer and model from same checkpoint
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")

Attention mask issues:

# Ensure attention_mask is passed
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"]
)