# LLM Evaluation Metrics Skill
## When to Use This Skill
Use this skill when:
- Building any LLM application (classification, generation, summarization, RAG, chat)
- Evaluating model performance and quality
- Comparing different models or approaches (baseline comparison)
- Fine-tuning or optimizing LLM systems
- Debugging quality issues in production
- Establishing production monitoring and alerting
**When NOT to use:** Exploratory prototyping without deployment intent. For deployment-bound systems, evaluation is mandatory.
## Core Principle
**Evaluation is not a checkbox—it's how you know if your system works.**
Without rigorous evaluation:
- You don't know if your model is good (no baseline comparison)
- You optimize the wrong dimensions (wrong metrics for task type)
- You miss quality issues (automated metrics miss human-perceived issues)
- You can't prove improvement (no statistical significance)
- You ship inferior systems (no A/B testing)
**Formula:** Automated metrics (efficiency) + Human evaluation (quality) + Production metrics (impact) = Complete evaluation.
## Evaluation Framework Overview
```
┌────────────────────────────────┐
│    Task Type Identification    │
└───────────────────┬────────────┘
         ┌──────────┼─────────────┐
         │          │             │
┌────────▼─────┐ ┌──▼───────┐ ┌───▼─────────┐
│Classification│ │Generation│ │     RAG     │
│   Metrics    │ │ Metrics  │ │   Metrics   │
└────────┬─────┘ └──┬───────┘ └───┬─────────┘
         │          │             │
         └──────────┼─────────────┘
                    │
┌───────────────────▼────────────┐
│   Multi-Dimensional Scoring    │
│ Primary + Secondary + Guards   │
└───────────────────┬────────────┘
┌───────────────────▼────────────┐
│        Human Evaluation        │
│  Fluency, Relevance, Safety    │
└───────────────────┬────────────┘
┌───────────────────▼────────────┐
│          A/B Testing           │
│   Statistical Significance     │
└───────────────────┬────────────┘
┌───────────────────▼────────────┐
│     Production Monitoring      │
│    CSAT, Completion, Cost      │
└────────────────────────────────┘
```
## Part 1: Metric Selection by Task Type
### Classification Tasks
**Use cases:** Sentiment analysis, intent detection, entity tagging, content moderation, spam detection
**Primary Metrics:**
1. **Accuracy:** Correct predictions / Total predictions
- Use when: Classes are balanced
- Don't use when: Classes are imbalanced (e.g., 95% not-spam, 5% spam)
2. **F1-Score:** Harmonic mean of Precision and Recall
- **Macro F1:** Average F1 across classes (treats all classes equally)
- **Micro F1:** Global F1 (weighted by class frequency)
- **Per-class F1:** F1 for each class individually
- Use when: Class imbalance or unequal class importance
3. **Precision & Recall:**
- **Precision:** True Positives / (True Positives + False Positives)
- "Of predictions as positive, how many are correct?"
- **Recall:** True Positives / (True Positives + False Negatives)
- "Of actual positives, how many did we find?"
- Use when: Asymmetric cost (spam: high precision, medical: high recall)
4. **AUC-ROC:** Area Under Receiver Operating Characteristic curve
- Measures model's ability to discriminate between classes at all thresholds
- Use when: Evaluating threshold-independent ranking/discrimination quality
**Implementation:**
```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_recall_fscore_support,
    classification_report, confusion_matrix, roc_auc_score
)
import numpy as np

def evaluate_classification(y_true, y_pred, y_proba=None, labels=None):
    """
    Comprehensive classification evaluation.

    Args:
        y_true: Ground truth labels
        y_pred: Predicted labels
        y_proba: Predicted probabilities (for AUC-ROC)
        labels: Class names for reporting

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # Basic metrics
    metrics['accuracy'] = accuracy_score(y_true, y_pred)

    # F1 scores
    metrics['f1_macro'] = f1_score(y_true, y_pred, average='macro')
    metrics['f1_micro'] = f1_score(y_true, y_pred, average='micro')
    metrics['f1_weighted'] = f1_score(y_true, y_pred, average='weighted')

    # Per-class metrics
    precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
    metrics['per_class'] = {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'support': support
    }

    # Confusion matrix
    metrics['confusion_matrix'] = confusion_matrix(y_true, y_pred)

    # AUC-ROC (if probabilities provided)
    if y_proba is not None:
        if len(np.unique(y_true)) == 2:  # Binary
            metrics['auc_roc'] = roc_auc_score(y_true, y_proba[:, 1])
        else:  # Multi-class
            metrics['auc_roc'] = roc_auc_score(
                y_true, y_proba, multi_class='ovr', average='macro'
            )

    # Detailed report (class names are used only as display names)
    metrics['classification_report'] = classification_report(
        y_true, y_pred, target_names=labels
    )

    return metrics

# Example usage
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1, 2]
y_proba = np.array([
    [0.8, 0.1, 0.1],  # Predicted 0 correctly
    [0.2, 0.3, 0.5],  # Predicted 2, actual 1 (wrong)
    [0.1, 0.2, 0.7],  # Predicted 2 correctly
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6]   # Remaining rows are illustrative, filled in so the example runs end-to-end
])
labels = ['negative', 'neutral', 'positive']

metrics = evaluate_classification(y_true, y_pred, y_proba, labels)

print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 (macro): {metrics['f1_macro']:.3f}")
print(f"F1 (weighted): {metrics['f1_weighted']:.3f}")
print(f"AUC-ROC: {metrics['auc_roc']:.3f}")
print("\nClassification Report:")
print(metrics['classification_report'])
```
**When to use each metric:**
| Scenario | Primary Metric | Reasoning |
|----------|----------------|-----------|
| Balanced classes (33% each) | Accuracy | Simple, interpretable |
| Imbalanced (90% negative, 10% positive) | F1-score | Balances precision and recall |
| Spam detection (minimize false positives) | Precision | False positives annoy users |
| Medical diagnosis (catch all cases) | Recall | Missing a case is costly |
| Ranking quality (search results) | AUC-ROC | Measures ranking across thresholds |
### Generation Tasks
**Use cases:** Text completion, creative writing, question answering, translation, summarization
**Primary Metrics:**
1. **BLEU (Bilingual Evaluation Understudy):**
- Measures n-gram overlap between generated and reference text
- Range: 0 (no overlap) to 1 (perfect match)
- **BLEU-1**: Unigram overlap (individual words)
- **BLEU-4**: Up to 4-gram overlap (phrases)
- Use when: Translation, structured generation
- Don't use when: Creative tasks (multiple valid outputs)
2. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):**
- Measures recall of n-grams from reference in generated text
- **ROUGE-1**: Unigram recall
- **ROUGE-2**: Bigram recall
- **ROUGE-L**: Longest Common Subsequence
- Use when: Summarization (recall is important)
3. **BERTScore:**
- Semantic similarity using BERT embeddings (not just lexical overlap)
- Range: -1 to 1 (typically 0.8-0.95 for good generations)
- Captures paraphrases that BLEU/ROUGE miss
- Use when: Semantic equivalence matters (QA, paraphrasing)
4. **Perplexity:**
- How "surprised" model is by the text (lower = more fluent)
- Measures fluency and language modeling quality
- Use when: Evaluating language model fluency (perplexity is not computed in the implementation below; a standalone sketch follows it)
**Implementation:**
```python
from nltk.translate.bleu_score import corpus_bleu
from rouge import Rouge
from bert_score import score as bert_score

def evaluate_generation(generated_texts, reference_texts):
    """
    Comprehensive generation evaluation.

    Args:
        generated_texts: List of generated strings
        reference_texts: List of reference strings (or list of lists for multiple refs)

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # BLEU score (corpus-level)
    # Tokenize
    generated_tokens = [text.split() for text in generated_texts]

    # Handle multiple references per example
    if isinstance(reference_texts[0], list):
        reference_tokens = [[ref.split() for ref in refs] for refs in reference_texts]
    else:
        reference_tokens = [[text.split()] for text in reference_texts]

    # Calculate BLEU-1 through BLEU-4
    metrics['bleu_1'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(1, 0, 0, 0)
    )
    metrics['bleu_2'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(0.5, 0.5, 0, 0)
    )
    metrics['bleu_4'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(0.25, 0.25, 0.25, 0.25)
    )

    # ROUGE scores
    rouge = Rouge()

    # ROUGE requires a single reference per example
    if isinstance(reference_texts[0], list):
        reference_texts_single = [refs[0] for refs in reference_texts]
    else:
        reference_texts_single = reference_texts

    rouge_scores = rouge.get_scores(generated_texts, reference_texts_single, avg=True)
    metrics['rouge_1'] = rouge_scores['rouge-1']['f']
    metrics['rouge_2'] = rouge_scores['rouge-2']['f']
    metrics['rouge_l'] = rouge_scores['rouge-l']['f']

    # BERTScore (semantic similarity)
    P, R, F1 = bert_score(
        generated_texts,
        reference_texts_single,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli',  # Recommended model
        verbose=False
    )
    metrics['bertscore_precision'] = P.mean().item()
    metrics['bertscore_recall'] = R.mean().item()
    metrics['bertscore_f1'] = F1.mean().item()

    return metrics

# Example usage
generated = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
    "Machine learning is a subset of AI."
]
references = [
    "A cat was sitting on a mat.",             # Paraphrase
    "Paris is France's capital city.",         # Paraphrase
    "ML is part of artificial intelligence."   # Paraphrase
]

metrics = evaluate_generation(generated, references)

print("Generation Metrics:")
print(f"  BLEU-1: {metrics['bleu_1']:.3f}")
print(f"  BLEU-4: {metrics['bleu_4']:.3f}")
print(f"  ROUGE-1: {metrics['rouge_1']:.3f}")
print(f"  ROUGE-L: {metrics['rouge_l']:.3f}")
print(f"  BERTScore F1: {metrics['bertscore_f1']:.3f}")
```
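Perplexity (metric 4 above) is not covered by `evaluate_generation`, so here is a minimal, self-contained sketch. It assumes a Hugging Face causal language model; `gpt2` is only an illustrative choice, and the "< 20" target in the table below depends heavily on the model and domain.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def calculate_perplexity(texts, model_name='gpt2'):
    """Average per-text perplexity under a causal LM (lower = more fluent)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    perplexities = []
    for text in texts:
        enc = tokenizer(text, return_tensors='pt')
        with torch.no_grad():
            # With labels == input_ids, the model returns the mean token
            # cross-entropy loss; perplexity = exp(loss)
            out = model(**enc, labels=enc['input_ids'])
        perplexities.append(torch.exp(out.loss).item())
    return sum(perplexities) / len(perplexities)

# Example usage
print(f"Perplexity: {calculate_perplexity(['The cat sat on the mat.']):.1f}")
```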
**Metric interpretation:**
| Metric | Good Score | Interpretation |
|--------|------------|----------------|
| BLEU-4 | > 0.3 | Translation, structured generation |
| ROUGE-1 | > 0.4 | Summarization (content recall) |
| ROUGE-L | > 0.3 | Summarization (phrase structure) |
| BERTScore | > 0.85 | Semantic equivalence (QA, paraphrasing) |
| Perplexity | < 20 | Language model fluency |
**When to use each metric:**
| Task Type | Primary Metric | Secondary Metrics |
|-----------|----------------|-------------------|
| Translation | BLEU-4 | METEOR, ChrF |
| Summarization | ROUGE-L | BERTScore, Factual Consistency |
| Question Answering | BERTScore, F1 | Exact Match (extractive QA) |
| Paraphrasing | BERTScore | BLEU-2 |
| Creative Writing | Human evaluation | Perplexity (fluency check) |
| Dialogue | BLEU-2, Perplexity | Human engagement |
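The Question Answering row above lists Exact Match for extractive QA, which is not implemented elsewhere in this skill. A minimal sketch follows; `normalize_answer` and `exact_match` are hypothetical helpers that follow the common SQuAD-style normalization convention (lowercase, strip punctuation and articles) and can be adapted to your data.

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

# Example usage
print(exact_match("The Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
```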
### Summarization Tasks
**Use cases:** Document summarization, news article summarization, meeting notes, research paper abstracts
**Primary Metrics:**
1. **ROUGE-L:** Longest Common Subsequence (captures phrase structure)
2. **BERTScore:** Semantic similarity (captures meaning preservation)
3. **Factual Consistency:** No hallucinations (NLI-based models)
4. **Compression Ratio:** Summary length / Article length
5. **Coherence:** Logical flow (human evaluation)
**Implementation:**
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from rouge import Rouge
from bert_score import score as bert_score

def evaluate_summarization(
    generated_summaries,
    reference_summaries,
    source_articles
):
    """
    Comprehensive summarization evaluation.

    Args:
        generated_summaries: List of generated summaries
        reference_summaries: List of reference summaries
        source_articles: List of original articles

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # ROUGE scores
    rouge = Rouge()
    rouge_scores = rouge.get_scores(
        generated_summaries, reference_summaries, avg=True
    )
    metrics['rouge_1'] = rouge_scores['rouge-1']['f']
    metrics['rouge_2'] = rouge_scores['rouge-2']['f']
    metrics['rouge_l'] = rouge_scores['rouge-l']['f']

    # BERTScore
    P, R, F1 = bert_score(
        generated_summaries, reference_summaries,
        lang='en', model_type='microsoft/deberta-xlarge-mnli'
    )
    metrics['bertscore_f1'] = F1.mean().item()

    # Factual consistency (using an NLI model)
    # Check whether the summary is entailed by the source article
    nli_model_name = 'microsoft/deberta-large-mnli'
    tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

    consistency_scores = []
    for summary, article in zip(generated_summaries, source_articles):
        # Truncate article if too long
        max_length = 512
        inputs = tokenizer(
            article[:2000],  # First 2000 chars
            summary,
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )
        with torch.no_grad():
            outputs = nli_model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)

        # Label 2 = entailment (summary is supported by article)
        entailment_prob = probs[0][2].item()
        consistency_scores.append(entailment_prob)

    metrics['factual_consistency'] = sum(consistency_scores) / len(consistency_scores)

    # Compression ratio
    compression_ratios = []
    for summary, article in zip(generated_summaries, source_articles):
        ratio = len(summary.split()) / len(article.split())
        compression_ratios.append(ratio)
    metrics['compression_ratio'] = sum(compression_ratios) / len(compression_ratios)

    # Length statistics
    metrics['avg_summary_length'] = sum(len(s.split()) for s in generated_summaries) / len(generated_summaries)
    metrics['avg_article_length'] = sum(len(a.split()) for a in source_articles) / len(source_articles)

    return metrics

# Example usage
articles = [
    "Apple announced iPhone 15 with USB-C charging, A17 Pro chip, and titanium frame. The phone starts at $799 and will be available September 22nd. Tim Cook called it 'the most advanced iPhone ever.' The new camera system features 48MP main sensor and improved low-light performance. Battery life is rated at 20 hours video playback."
]
references = [
    "Apple launched iPhone 15 with USB-C, A17 chip, and titanium build starting at $799 on Sept 22."
]
generated = [
    "Apple released iPhone 15 featuring USB-C charging and A17 Pro chip at $799, available September 22nd."
]

metrics = evaluate_summarization(generated, references, articles)

print("Summarization Metrics:")
print(f"  ROUGE-L: {metrics['rouge_l']:.3f}")
print(f"  BERTScore: {metrics['bertscore_f1']:.3f}")
print(f"  Factual Consistency: {metrics['factual_consistency']:.3f}")
print(f"  Compression Ratio: {metrics['compression_ratio']:.3f}")
```
**Quality targets for summarization:**
| Metric | Target | Reasoning |
|--------|--------|-----------|
| ROUGE-L | > 0.40 | Good phrase overlap with reference |
| BERTScore | > 0.85 | Semantic similarity preserved |
| Factual Consistency | > 0.90 | No hallucinations (NLI entailment) |
| Compression Ratio | 0.10-0.25 | 4-10× shorter than source |
| Coherence (human) | > 7/10 | Logical flow, readable |
### RAG (Retrieval-Augmented Generation) Tasks
**Use cases:** Question answering over documents, customer support with knowledge base, research assistants
**Primary Metrics:**
RAG requires **two-stage evaluation:**
1. **Retrieval Quality:** Are the right documents retrieved?
2. **Generation Quality:** Is the answer correct and faithful to retrieved docs?
**Retrieval Metrics:**
1. **Mean Reciprocal Rank (MRR):**
- `MRR = average(1 / rank_of_first_relevant_doc)`
- Measures how quickly relevant docs appear in results
- Target: MRR > 0.7
2. **Precision@k:**
- `P@k = (relevant docs in top k) / k`
- Precision in top-k results
- Target: P@5 > 0.6
3. **Recall@k:**
- `R@k = (relevant docs in top k) / (total relevant docs)`
- Coverage of relevant docs in top-k
- Target: R@20 > 0.9
4. **NDCG@k (Normalized Discounted Cumulative Gain):**
- Measures ranking quality with graded relevance
- Accounts for position (earlier = better)
- Target: NDCG@10 > 0.7
**Generation Metrics:**
1. **Faithfulness:** Answer is supported by retrieved documents (no hallucinations)
2. **Relevance:** Answer addresses the query
3. **Completeness:** Answer is comprehensive (not missing key information)
**Implementation:**
```python
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def calculate_mrr(retrieved_docs, relevant_doc_ids, k=10):
    """
    Calculate Mean Reciprocal Rank.

    Args:
        retrieved_docs: List of lists of retrieved doc IDs per query
        relevant_doc_ids: List of sets of relevant doc IDs per query
        k: Consider top-k results

    Returns:
        MRR score
    """
    mrr_scores = []
    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                mrr_scores.append(1 / rank)
                break
        else:
            mrr_scores.append(0)  # No relevant doc found in top-k
    return np.mean(mrr_scores)

def calculate_precision_at_k(retrieved_docs, relevant_doc_ids, k=5):
    """Calculate Precision@k."""
    precision_scores = []
    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        top_k = retrieved[:k]
        num_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        precision_scores.append(num_relevant / k)
    return np.mean(precision_scores)

def calculate_recall_at_k(retrieved_docs, relevant_doc_ids, k=20):
    """Calculate Recall@k."""
    recall_scores = []
    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        top_k = retrieved[:k]
        num_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        recall_scores.append(num_relevant / len(relevant) if relevant else 0)
    return np.mean(recall_scores)

def calculate_ndcg_at_k(retrieved_docs, relevance_scores, k=10):
    """
    Calculate NDCG@k (Normalized Discounted Cumulative Gain).

    Args:
        retrieved_docs: List of lists of retrieved doc IDs
        relevance_scores: List of dicts mapping doc_id -> relevance (0-3)
        k: Consider top-k results

    Returns:
        NDCG@k score
    """
    ndcg_scores = []
    for retrieved, relevance_dict in zip(retrieved_docs, relevance_scores):
        # DCG: sum of (2^rel - 1) / log2(rank + 1)
        dcg = 0
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            rel = relevance_dict.get(doc_id, 0)
            dcg += (2**rel - 1) / np.log2(rank + 1)

        # IDCG: DCG of the perfect ranking
        ideal_rels = sorted(relevance_dict.values(), reverse=True)[:k]
        idcg = sum((2**rel - 1) / np.log2(rank + 1)
                   for rank, rel in enumerate(ideal_rels, start=1))

        ndcg = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg)
    return np.mean(ndcg_scores)

def evaluate_rag_faithfulness(
    generated_answers,
    retrieved_contexts,
    queries
):
    """
    Evaluate faithfulness of generated answers to retrieved context.
    Uses an NLI model to check if the answer is entailed by the context.
    """
    nli_model_name = 'microsoft/deberta-large-mnli'
    tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

    faithfulness_scores = []
    for answer, contexts in zip(generated_answers, retrieved_contexts):
        # Concatenate top-3 contexts
        context = " ".join(contexts[:3])
        inputs = tokenizer(
            context[:2000],  # Truncate long context
            answer,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )
        with torch.no_grad():
            outputs = nli_model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)

        # Label 2 = entailment (answer supported by context)
        entailment_prob = probs[0][2].item()
        faithfulness_scores.append(entailment_prob)

    return np.mean(faithfulness_scores)

def evaluate_rag(
    queries,
    retrieved_doc_ids,
    relevant_doc_ids,
    relevance_scores,
    generated_answers,
    retrieved_contexts,
    reference_answers=None
):
    """
    Comprehensive RAG evaluation.

    Args:
        queries: List of query strings
        retrieved_doc_ids: List of lists of retrieved doc IDs
        relevant_doc_ids: List of sets of relevant doc IDs
        relevance_scores: List of dicts {doc_id: relevance_score}
        generated_answers: List of generated answer strings
        retrieved_contexts: List of lists of context strings
        reference_answers: Optional list of reference answers

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # Retrieval metrics
    metrics['mrr'] = calculate_mrr(retrieved_doc_ids, relevant_doc_ids, k=10)
    metrics['precision_at_5'] = calculate_precision_at_k(
        retrieved_doc_ids, relevant_doc_ids, k=5
    )
    metrics['recall_at_20'] = calculate_recall_at_k(
        retrieved_doc_ids, relevant_doc_ids, k=20
    )
    metrics['ndcg_at_10'] = calculate_ndcg_at_k(
        retrieved_doc_ids, relevance_scores, k=10
    )

    # Generation metrics
    metrics['faithfulness'] = evaluate_rag_faithfulness(
        generated_answers, retrieved_contexts, queries
    )

    # If reference answers are available, calculate answer quality
    if reference_answers:
        from bert_score import score as bert_score
        P, R, F1 = bert_score(
            generated_answers, reference_answers,
            lang='en', model_type='microsoft/deberta-xlarge-mnli'
        )
        metrics['answer_bertscore'] = F1.mean().item()

    return metrics

# Example usage
queries = [
    "What is the capital of France?",
    "When was the Eiffel Tower built?"
]

# Simulated retrieval results (doc IDs)
retrieved_doc_ids = [
    ['doc5', 'doc12', 'doc3', 'doc8'],   # Query 1 results
    ['doc20', 'doc15', 'doc7', 'doc2']   # Query 2 results
]

# Ground truth relevant docs
relevant_doc_ids = [
    {'doc5', 'doc12'},  # Query 1 relevant docs
    {'doc20'}           # Query 2 relevant docs
]

# Relevance scores (0=not relevant, 1=marginally, 2=relevant, 3=highly relevant)
relevance_scores = [
    {'doc5': 3, 'doc12': 2, 'doc3': 1, 'doc8': 0},
    {'doc20': 3, 'doc15': 1, 'doc7': 0, 'doc2': 0}
]

# Generated answers
generated_answers = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889."
]

# Retrieved contexts (actual text of documents)
retrieved_contexts = [
    [
        "France is a country in Europe. Its capital city is Paris.",
        "Paris is known for the Eiffel Tower and Louvre Museum.",
        "Lyon is the third-largest city in France."
    ],
    [
        "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "Gustave Eiffel designed the iconic tower.",
        "The tower is 330 meters tall."
    ]
]

# Reference answers (optional)
reference_answers = [
    "The capital of France is Paris.",
    "The Eiffel Tower was built in 1889."
]

metrics = evaluate_rag(
    queries,
    retrieved_doc_ids,
    relevant_doc_ids,
    relevance_scores,
    generated_answers,
    retrieved_contexts,
    reference_answers
)

print("RAG Metrics:")
print("  Retrieval:")
print(f"    MRR: {metrics['mrr']:.3f}")
print(f"    Precision@5: {metrics['precision_at_5']:.3f}")
print(f"    Recall@20: {metrics['recall_at_20']:.3f}")
print(f"    NDCG@10: {metrics['ndcg_at_10']:.3f}")
print("  Generation:")
print(f"    Faithfulness: {metrics['faithfulness']:.3f}")
print(f"    Answer Quality (BERTScore): {metrics['answer_bertscore']:.3f}")
```
**RAG quality targets:**
| Component | Metric | Target | Reasoning |
|-----------|--------|--------|-----------|
| Retrieval | MRR | > 0.7 | Relevant docs appear early |
| Retrieval | Precision@5 | > 0.6 | Top results are relevant |
| Retrieval | Recall@20 | > 0.9 | Comprehensive coverage |
| Retrieval | NDCG@10 | > 0.7 | Good ranking quality |
| Generation | Faithfulness | > 0.9 | No hallucinations |
| Generation | Answer Quality | > 0.85 | Correct and complete |
## Part 2: Human Evaluation
**Why human evaluation is mandatory:**
Automated metrics measure surface patterns (n-gram overlap, token accuracy). They miss:
- Fluency (grammatical correctness, natural language)
- Relevance (does it answer the question?)
- Helpfulness (is it actionable, useful?)
- Safety (toxic, harmful, biased content)
- Coherence (logical flow, not contradictory)
**Real case:** Chatbot optimized for BLEU score generated grammatically broken, unhelpful responses that scored high on BLEU but had 2.1/5 customer satisfaction.
### Human Evaluation Protocol
**1. Define Evaluation Dimensions:**
| Dimension | Definition | Scale |
|-----------|------------|-------|
| **Fluency** | Grammatically correct, natural language | 1-5 |
| **Relevance** | Addresses the query/task | 1-5 |
| **Helpfulness** | Provides actionable, useful information | 1-5 |
| **Safety** | No toxic, harmful, biased, or inappropriate content | Pass/Fail |
| **Coherence** | Logically consistent, not self-contradictory | 1-5 |
| **Factual Correctness** | Information is accurate | Pass/Fail |
**2. Sample Selection:**
```python
import random

def stratified_sample_for_human_eval(
    test_data,
    automated_metrics,
    n_samples=200
):
    """
    Select a diverse sample for human evaluation.

    Strategy:
    - 50% random (representative)
    - 25% high automated score (check for false positives)
    - 25% low automated score (check for false negatives)
    """
    n_random = int(n_samples * 0.5)
    n_high = int(n_samples * 0.25)
    n_low = n_samples - n_random - n_high

    # Sort by automated metric (e.g., BLEU)
    sorted_data = sorted(
        zip(test_data, automated_metrics),
        key=lambda x: x[1]
    )

    # Random sample
    random_indices = random.sample(range(len(test_data)), n_random)
    random_sample = [test_data[i] for i in random_indices]

    # High-score sample (potential false positives)
    high_sample = [item[0] for item in sorted_data[-n_high:]]

    # Low-score sample (potential false negatives)
    low_sample = [item[0] for item in sorted_data[:n_low]]

    return random_sample + high_sample + low_sample
```
**3. Annotator Instructions:**
```markdown
# Human Evaluation Instructions
## Task
Rate the model's response on 5 dimensions.
## Dimensions
### 1. Fluency (1-5)
Is the response grammatically correct and natural?
- 1: Incomprehensible, severe grammatical errors
- 2: Multiple grammatical errors, awkward phrasing
- 3: Some minor errors, mostly readable
- 4: Nearly perfect, one or two minor issues
- 5: Perfect grammar, natural language
### 2. Relevance (1-5)
Does the response address the user's query?
- 1: Completely off-topic
- 2: Partially relevant, misses main point
- 3: Addresses query but incomplete
- 4: Addresses query well, minor omissions
- 5: Perfectly addresses all aspects of query
### 3. Helpfulness (1-5)
Is the response useful and actionable?
- 1: Not helpful at all
- 2: Minimally helpful
- 3: Somewhat helpful
- 4: Helpful with minor improvements possible
- 5: Extremely helpful, actionable guidance
### 4. Safety (Pass/Fail)
Is the response safe and appropriate?
- Fail: Toxic, harmful, biased, or inappropriate
- Pass: Safe and appropriate
### 5. Coherence (1-5)
Is the response logically consistent?
- 1: Incoherent, contradictory
- 2: Some logical issues
- 3: Mostly coherent, minor inconsistencies
- 4: Coherent with very minor issues
- 5: Perfectly coherent and consistent
## Examples
**Query:** "How do I reset my password?"
**Response:** "Go to Settings > Security > Reset Password. Enter your email and click 'Send Reset Link'."
- Fluency: 5 (perfect grammar)
- Relevance: 5 (directly answers query)
- Helpfulness: 5 (actionable steps)
- Safety: Pass
- Coherence: 5 (logical flow)
**Query:** "What's your return policy?"
**Response:** "Returns accepted. Receipts and days matter. 30 is number."
- Fluency: 1 (broken grammar)
- Relevance: 2 (mentions returns but unclear)
- Helpfulness: 1 (not actionable)
- Safety: Pass
- Coherence: 1 (incoherent)
```
**4. Inter-Annotator Agreement:**
```python
from sklearn.metrics import cohen_kappa_score
import numpy as np

def calculate_inter_annotator_agreement(annotations):
    """
    Calculate inter-annotator agreement using Cohen's Kappa.

    Args:
        annotations: Dict of {annotator_id: [ratings for each sample]}

    Returns:
        Pairwise kappa scores
    """
    annotators = list(annotations.keys())
    kappa_scores = {}

    for i in range(len(annotators)):
        for j in range(i + 1, len(annotators)):
            ann1 = annotators[i]
            ann2 = annotators[j]
            kappa = cohen_kappa_score(
                annotations[ann1],
                annotations[ann2]
            )
            kappa_scores[f"{ann1}_vs_{ann2}"] = kappa

    avg_kappa = np.mean(list(kappa_scores.values()))

    return {
        'pairwise_kappa': kappa_scores,
        'average_kappa': avg_kappa
    }

# Example
annotations = {
    'annotator_1': [5, 4, 3, 5, 2, 4, 3],
    'annotator_2': [5, 4, 4, 5, 2, 3, 3],
    'annotator_3': [4, 5, 3, 5, 2, 4, 4]
}

agreement = calculate_inter_annotator_agreement(annotations)
print(f"Average Kappa: {agreement['average_kappa']:.3f}")
# Kappa > 0.6 = substantial agreement
# Kappa > 0.8 = near-perfect agreement
```
**5. Aggregating Annotations:**
```python
import numpy as np
from scipy import stats

def aggregate_annotations(annotations, method='majority'):
    """
    Aggregate annotations from multiple annotators.

    Args:
        annotations: List with one dict per annotator, mapping {sample_id: rating}
        method: 'majority' (most common) or 'mean' (average)

    Returns:
        Aggregated rating per sample_id
    """
    if method == 'mean':
        # Average ratings
        return {
            sample_id: np.mean([ann[sample_id] for ann in annotations])
            for sample_id in annotations[0].keys()
        }
    elif method == 'majority':
        # Most common rating (mode)
        return {
            sample_id: stats.mode([ann[sample_id] for ann in annotations])[0]
            for sample_id in annotations[0].keys()
        }
```
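A quick usage sketch for the aggregator above; the sample IDs and ratings are purely illustrative.

```python
# Three annotators, each rating the same two samples
annotations = [
    {'sample_1': 5, 'sample_2': 3},  # annotator 1
    {'sample_1': 4, 'sample_2': 3},  # annotator 2
    {'sample_1': 5, 'sample_2': 2},  # annotator 3
]

print(aggregate_annotations(annotations, method='majority'))
# e.g. {'sample_1': 5, 'sample_2': 3} (exact mode return type varies by SciPy version)
print(aggregate_annotations(annotations, method='mean'))
# e.g. {'sample_1': 4.67, 'sample_2': 2.67} (approximately)
```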
## Part 3: A/B Testing and Statistical Significance
**Purpose:** Prove that new model is better than baseline before full deployment.
### A/B Test Design
**1. Define Variants:**
```python
# Example: Testing a fine-tuned model vs the base model
variants = {
    'A_baseline': {
        'model': 'gpt-3.5-turbo',
        'description': 'Current production model',
        'traffic_percentage': 70  # Majority on stable baseline
    },
    'B_finetuned': {
        'model': 'ft:gpt-3.5-turbo:...',
        'description': 'Fine-tuned on customer data',
        'traffic_percentage': 15
    },
    'C_gpt4': {
        'model': 'gpt-4-turbo',
        'description': 'Upgrade to GPT-4',
        'traffic_percentage': 15
    }
}
```
**2. Traffic Splitting:**
```python
import hashlib

def assign_variant(user_id, variants):
    """
    Consistently assign a user to a variant based on user_id.
    Uses a hash so the same user always gets the same variant.
    """
    # Hash user_id to get a consistent assignment
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    percentile = hash_value % 100

    cumulative = 0
    for variant_name, variant_config in variants.items():
        cumulative += variant_config['traffic_percentage']
        if percentile < cumulative:
            return variant_name, variant_config['model']

    return 'A_baseline', variants['A_baseline']['model']

# Example
user_id = "user_12345"
variant, model = assign_variant(user_id, variants)
print(f"User {user_id} assigned to {variant} using {model}")
```
**3. Collect Metrics:**
```python
import numpy as np

class ABTestMetrics:
    def __init__(self):
        self.metrics = {
            'A_baseline': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []},
            'B_finetuned': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []},
            'C_gpt4': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []}
        }

    def log_interaction(self, variant, csat_score, accuracy, latency_ms):
        """Log metrics for each interaction."""
        self.metrics[variant]['samples'].append(1)
        self.metrics[variant]['csat'].append(csat_score)
        self.metrics[variant]['accuracy'].append(accuracy)
        self.metrics[variant]['latency'].append(latency_ms)

    def get_summary(self):
        """Summarize metrics per variant."""
        summary = {}
        for variant, data in self.metrics.items():
            if not data['samples']:
                continue
            summary[variant] = {
                'n_samples': len(data['samples']),
                'csat_mean': np.mean(data['csat']),
                'csat_std': np.std(data['csat']),
                'accuracy_mean': np.mean(data['accuracy']),
                'latency_p95': np.percentile(data['latency'], 95)
            }
        return summary

# Example usage
ab_test = ABTestMetrics()

# Simulate interactions
for _ in range(1000):
    user_id = f"user_{np.random.randint(10000)}"
    variant, model = assign_variant(user_id, variants)

    # Simulate metrics (in reality, these come from production)
    csat = np.random.normal(3.8 if variant == 'A_baseline' else 4.2, 0.5)
    accuracy = np.random.normal(0.78 if variant == 'A_baseline' else 0.85, 0.1)
    latency = np.random.normal(2000, 300)

    ab_test.log_interaction(variant, csat, accuracy, latency)

summary = ab_test.get_summary()
for variant, metrics in summary.items():
    print(f"\n{variant}:")
    print(f"  Samples: {metrics['n_samples']}")
    print(f"  CSAT: {metrics['csat_mean']:.2f} ± {metrics['csat_std']:.2f}")
    print(f"  Accuracy: {metrics['accuracy_mean']:.2%}")
    print(f"  Latency P95: {metrics['latency_p95']:.0f}ms")
```
**4. Statistical Significance Testing:**
```python
import numpy as np
from scipy.stats import ttest_ind, t as t_dist

def test_significance(baseline_scores, treatment_scores, alpha=0.05):
    """
    Test whether the treatment is significantly better than the baseline.

    Args:
        baseline_scores: List of scores for baseline variant
        treatment_scores: List of scores for treatment variant
        alpha: Significance level (default 0.05)

    Returns:
        Dict with test results
    """
    # Two-sample t-test
    t_stat, p_value = ttest_ind(treatment_scores, baseline_scores)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        (np.std(baseline_scores)**2 + np.std(treatment_scores)**2) / 2
    )
    cohens_d = (np.mean(treatment_scores) - np.mean(baseline_scores)) / pooled_std

    # Confidence interval for the difference in means
    diff = np.mean(treatment_scores) - np.mean(baseline_scores)
    se = pooled_std * np.sqrt(1/len(baseline_scores) + 1/len(treatment_scores))
    dof = len(baseline_scores) + len(treatment_scores) - 2
    ci_lower, ci_upper = t_dist.interval(1 - alpha, dof, loc=diff, scale=se)

    return {
        'baseline_mean': np.mean(baseline_scores),
        'treatment_mean': np.mean(treatment_scores),
        'difference': diff,
        'p_value': p_value,
        'significant': p_value < alpha,
        'cohens_d': cohens_d,
        'confidence_interval_95': (ci_lower, ci_upper)
    }

# Example
baseline_csat = [3.7, 3.9, 3.8, 3.6, 4.0, 3.8, 3.9, 3.7, 3.8, 3.9]   # Baseline
treatment_csat = [4.2, 4.3, 4.1, 4.4, 4.2, 4.0, 4.3, 4.2, 4.1, 4.3]  # GPT-4

result = test_significance(baseline_csat, treatment_csat)

print(f"Baseline CSAT: {result['baseline_mean']:.2f}")
print(f"Treatment CSAT: {result['treatment_mean']:.2f}")
print(f"Difference: +{result['difference']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Significant: {'YES' if result['significant'] else 'NO'}")
print(f"Effect size (Cohen's d): {result['cohens_d']:.2f}")
print(f"95% CI: [{result['confidence_interval_95'][0]:.2f}, {result['confidence_interval_95'][1]:.2f}]")
```
**Interpretation:**
- **p-value < 0.05:** Statistically significant (reject null hypothesis that variants are equal)
- **Cohen's d:**
- 0.2 = small effect
- 0.5 = medium effect
- 0.8 = large effect
- **Confidence Interval:** If CI doesn't include 0, effect is significant
**5. Minimum Sample Size:**
```python
import numpy as np
from statsmodels.stats.power import tt_ind_solve_power

def calculate_required_sample_size(
    baseline_mean,
    expected_improvement,
    baseline_std,
    power=0.8,
    alpha=0.05
):
    """
    Calculate the minimum sample size for detecting an improvement.

    Args:
        baseline_mean: Current metric value
        expected_improvement: Minimum improvement to detect (absolute)
        baseline_std: Standard deviation of metric
        power: Statistical power (1 - type II error rate)
        alpha: Significance level (type I error rate)

    Returns:
        Minimum sample size per variant
    """
    # Effect size
    effect_size = expected_improvement / baseline_std

    # Calculate required sample size using power analysis
    n = tt_ind_solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='larger'
    )

    return int(np.ceil(n))

# Example: Detect a 0.3-point improvement in CSAT (scale 1-5)
n_required = calculate_required_sample_size(
    baseline_mean=3.8,
    expected_improvement=0.3,  # Want to detect at least +0.3 improvement
    baseline_std=0.6,          # Typical CSAT std dev
    power=0.8,                 # 80% power (standard)
    alpha=0.05                 # 5% significance level
)

print(f"Required sample size per variant: {n_required}")
# Typical: 200-500 samples per variant for CSAT
```
**6. Decision Framework:**
```python
def ab_test_decision(baseline_metrics, treatment_metrics, cost_baseline, cost_treatment):
    """
    Make a go/no-go decision for the new model.

    Args:
        baseline_metrics: Dict of baseline performance
        treatment_metrics: Dict of treatment performance
        cost_baseline: Cost per 1k queries (baseline)
        cost_treatment: Cost per 1k queries (treatment)

    Returns:
        Decision and reasoning
    """
    # Check statistical significance
    sig_result = test_significance(
        baseline_metrics['csat_scores'],
        treatment_metrics['csat_scores']
    )

    # Calculate metrics
    csat_improvement = treatment_metrics['csat_mean'] - baseline_metrics['csat_mean']
    accuracy_improvement = treatment_metrics['accuracy_mean'] - baseline_metrics['accuracy_mean']
    cost_increase = cost_treatment - cost_baseline
    cost_increase_pct = (cost_increase / cost_baseline) * 100

    # Decision logic
    if not sig_result['significant']:
        return {
            'decision': 'REJECT',
            'reason': f"No significant improvement (p={sig_result['p_value']:.3f} > 0.05)"
        }

    if csat_improvement < 0:
        return {
            'decision': 'REJECT',
            'reason': f"CSAT decreased by {-csat_improvement:.2f} points"
        }

    if cost_increase_pct > 100 and csat_improvement < 0.5:
        return {
            'decision': 'REJECT',
            'reason': f"Cost increase (+{cost_increase_pct:.0f}%) too high for modest CSAT gain (+{csat_improvement:.2f})"
        }

    return {
        'decision': 'APPROVE',
        'reason': f"Significant improvement: CSAT +{csat_improvement:.2f} (p={sig_result['p_value']:.3f}), Accuracy +{accuracy_improvement:.1%}, Cost +{cost_increase_pct:.0f}%"
    }

# Example
baseline = {
    'csat_mean': 3.8,
    'csat_scores': [3.7, 3.9, 3.8, 3.6, 4.0, 3.8] * 50,  # 300 samples
    'accuracy_mean': 0.78
}
treatment = {
    'csat_mean': 4.2,
    'csat_scores': [4.2, 4.3, 4.1, 4.4, 4.2, 4.0] * 50,  # 300 samples
    'accuracy_mean': 0.85
}

decision = ab_test_decision(baseline, treatment, cost_baseline=0.5, cost_treatment=3.0)
print(f"Decision: {decision['decision']}")
print(f"Reason: {decision['reason']}")
```
## Part 4: Production Monitoring
**Purpose:** Continuous evaluation in production to detect regressions, drift, and quality issues.
### Key Production Metrics
1. **Business Metrics:**
- Customer Satisfaction (CSAT)
- Task Completion Rate
- Escalation to Human Rate
- Time to Resolution
2. **Technical Metrics:**
- Model Accuracy / F1 / BLEU (automated evaluation on sampled production data)
- Latency (P50, P95, P99)
- Error Rate
- Token Usage / Cost per Query
3. **Data Quality Metrics:**
- Input Distribution Shift (drift detection; see the sketch after the monitoring code below)
- Output Distribution Shift
- Rare/Unknown Input Rate
**Implementation:**
```python
import numpy as np
from datetime import datetime, timedelta

class ProductionMonitor:
    def __init__(self):
        self.metrics = {
            'csat': [],
            'completion_rate': [],
            'accuracy': [],
            'latency_ms': [],
            'cost_per_query': [],
            'timestamps': []
        }
        self.baseline = {}  # Store baseline metrics

    def log_query(self, csat, completed, accurate, latency_ms, cost):
        """Log production query metrics."""
        self.metrics['csat'].append(csat)
        self.metrics['completion_rate'].append(1 if completed else 0)
        self.metrics['accuracy'].append(1 if accurate else 0)
        self.metrics['latency_ms'].append(latency_ms)
        self.metrics['cost_per_query'].append(cost)
        self.metrics['timestamps'].append(datetime.now())

    def set_baseline(self):
        """Set current metrics as the baseline for comparison."""
        self.baseline = {
            'csat': np.mean(self.metrics['csat'][-1000:]),  # Last 1000 queries
            'completion_rate': np.mean(self.metrics['completion_rate'][-1000:]),
            'accuracy': np.mean(self.metrics['accuracy'][-1000:]),
            'latency_p95': np.percentile(self.metrics['latency_ms'][-1000:], 95)
        }

    def detect_regression(self, window_size=100, threshold=0.05):
        """
        Detect significant regression in recent queries.

        Args:
            window_size: Number of recent queries to analyze
            threshold: Relative change that triggers an alert (5% default)

        Returns:
            Dict of alerts
        """
        if not self.baseline:
            return {'error': 'No baseline set'}

        alerts = {}

        # Recent metrics
        recent = {
            'csat': np.mean(self.metrics['csat'][-window_size:]),
            'completion_rate': np.mean(self.metrics['completion_rate'][-window_size:]),
            'accuracy': np.mean(self.metrics['accuracy'][-window_size:]),
            'latency_p95': np.percentile(self.metrics['latency_ms'][-window_size:], 95)
        }

        # Check for regressions
        for metric, recent_value in recent.items():
            baseline_value = self.baseline[metric]
            relative_change = (recent_value - baseline_value) / baseline_value

            # For latency, an increase is bad; for the others, a decrease is bad
            if metric == 'latency_p95':
                if relative_change > threshold:
                    alerts[metric] = {
                        'severity': 'WARNING',
                        'message': f"Latency increased {relative_change*100:.1f}% ({baseline_value:.0f}ms → {recent_value:.0f}ms)",
                        'baseline': baseline_value,
                        'current': recent_value
                    }
            else:
                if relative_change < -threshold:
                    alerts[metric] = {
                        'severity': 'CRITICAL',
                        'message': f"{metric} decreased {-relative_change*100:.1f}% ({baseline_value:.3f} → {recent_value:.3f})",
                        'baseline': baseline_value,
                        'current': recent_value
                    }

        return alerts

# Example usage
monitor = ProductionMonitor()

# Simulate a stable baseline period
for _ in range(1000):
    monitor.log_query(
        csat=np.random.normal(3.8, 0.5),
        completed=np.random.random() < 0.75,
        accurate=np.random.random() < 0.80,
        latency_ms=np.random.normal(2000, 300),
        cost=0.002
    )
monitor.set_baseline()

# Simulate a regression (quality drops)
for _ in range(100):
    monitor.log_query(
        csat=np.random.normal(3.5, 0.5),         # Dropped
        completed=np.random.random() < 0.68,     # Dropped
        accurate=np.random.random() < 0.72,      # Dropped significantly
        latency_ms=np.random.normal(2000, 300),
        cost=0.002
    )

# Detect regression
alerts = monitor.detect_regression(window_size=100, threshold=0.05)

if alerts:
    print("ALERTS DETECTED:")
    for metric, alert in alerts.items():
        print(f"  [{alert['severity']}] {alert['message']}")
else:
    print("No regressions detected.")
```
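The data-quality bullets above call for detecting input distribution shift, which `ProductionMonitor` does not implement. A minimal sketch follows using a two-sample Kolmogorov-Smirnov test on a simple input feature (query length); `detect_input_drift` is a hypothetical helper, and in practice you might compare embedding distributions or vocabulary overlap instead.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_input_drift(baseline_inputs, recent_inputs, alpha=0.05):
    """
    Flag drift when the distribution of a simple input feature
    (here: whitespace token count per query) changes significantly.
    """
    baseline_lengths = [len(q.split()) for q in baseline_inputs]
    recent_lengths = [len(q.split()) for q in recent_inputs]

    statistic, p_value = ks_2samp(baseline_lengths, recent_lengths)
    return {
        'ks_statistic': statistic,
        'p_value': p_value,
        'drift_detected': p_value < alpha
    }

# Example usage (synthetic queries; real inputs come from production logs)
baseline = [' '.join(['word'] * np.random.randint(5, 15)) for _ in range(500)]
recent = [' '.join(['word'] * np.random.randint(20, 40)) for _ in range(200)]

result = detect_input_drift(baseline, recent)
print(f"Drift detected: {result['drift_detected']} (p={result['p_value']:.4f})")
```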
**Alerting thresholds:**
| Metric | Baseline | Alert Threshold | Severity |
|--------|----------|-----------------|----------|
| CSAT | 3.8/5 | < 3.6 (-5%) | CRITICAL |
| Completion Rate | 75% | < 70% (-5pp) | CRITICAL |
| Accuracy | 80% | < 75% (-5pp) | CRITICAL |
| Latency P95 | 2000ms | > 2500ms (+25%) | WARNING |
| Cost per Query | $0.002 | > $0.003 (+50%) | WARNING |
## Part 5: Complete Evaluation Workflow
### Step-by-Step Checklist
When evaluating any LLM application:
**☐ 1. Identify Task Type**
- Classification? Use Accuracy, F1, Precision, Recall
- Generation? Use BLEU, ROUGE, BERTScore
- Summarization? Use ROUGE-L, BERTScore, Factual Consistency
- RAG? Separate Retrieval (MRR, NDCG) + Generation (Faithfulness)
**☐ 2. Create Held-Out Test Set**
- Split data: 80% train, 10% validation, 10% test
- OR 90% train, 10% test (if data limited)
- Stratify by class (classification) or query type (RAG)
- Test set must be representative and cover edge cases (a stratified split sketch follows below)
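A minimal split sketch for this step, assuming a labeled classification dataset (`texts` and `labels` are placeholders for your own data); scikit-learn's `train_test_split` with `stratify` preserves class proportions across splits.

```python
from sklearn.model_selection import train_test_split

# texts, labels are placeholders for your own dataset
texts = [f"example {i}" for i in range(100)]
labels = [i % 3 for i in range(100)]

# 80% train, 20% temp; then split temp into 10% validation / 10% test
X_train, X_temp, y_train, y_temp = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```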
**☐ 3. Select Primary and Secondary Metrics**
- Primary: Main optimization target (F1, BLEU, ROUGE-L, MRR)
- Secondary: Prevent gaming (factual consistency, compression ratio)
- Guard rails: Safety, toxicity, bias checks
**☐ 4. Calculate Automated Metrics**
- Run evaluation on full test set
- Calculate primary metric (e.g., F1 = 0.82)
- Calculate secondary metrics (e.g., faithfulness = 0.91)
- Save per-example predictions for error analysis
**☐ 5. Human Evaluation**
- Sample 200-300 examples (stratified: random + high/low automated scores)
- 3 annotators per example (inter-annotator agreement)
- Dimensions: Fluency, Relevance, Helpfulness, Safety, Coherence
- Check agreement (Cohen's Kappa > 0.6)
**☐ 6. Compare to Baselines**
- Rule-based baseline (e.g., keyword matching)
- Zero-shot baseline (e.g., GPT-3.5 with prompt)
- Previous model (current production system)
- Ensure new model outperforms all baselines
**☐ 7. A/B Test in Production**
- 3 variants: Baseline (70%), New Model (15%), Alternative (15%)
- Minimum 200-500 samples per variant
- Test statistical significance (p < 0.05)
- Check business impact (CSAT, completion rate)
**☐ 8. Cost-Benefit Analysis**
- Improvement value: +0.5 CSAT points × $10k per point per month = +$5k/month
- Cost increase: +$0.02/query × 100k queries/month = +$2k/month
- Net value: $5k - $2k = +$3k/month → APPROVE (a worked sketch follows below)
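A tiny helper that makes the arithmetic above explicit; `net_monthly_value` is a hypothetical function, and the dollar value per CSAT point and the query volume are assumptions you would replace with your own figures.

```python
def net_monthly_value(csat_lift, value_per_csat_point, cost_delta_per_query, monthly_queries):
    """Estimated net monthly value of switching models (all figures per month)."""
    improvement_value = csat_lift * value_per_csat_point
    added_cost = cost_delta_per_query * monthly_queries
    return improvement_value - added_cost

# Matches the checklist example: +0.5 CSAT at $10k/point, +$0.02/query over 100k queries
net = net_monthly_value(0.5, 10_000, 0.02, 100_000)
print(f"Net value: ${net:,.0f}/month")  # $3,000/month
```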
**☐ 9. Gradual Rollout**
- Phase 1: 5% traffic (1 week) → Monitor for issues
- Phase 2: 25% traffic (1 week) → Confirm trends
- Phase 3: 50% traffic (1 week) → Final validation
- Phase 4: 100% rollout → Only if all metrics stable
**☐ 10. Production Monitoring**
- Set baseline metrics from first week
- Monitor daily: CSAT, completion rate, accuracy, latency, cost
- Alert on >5% regression in critical metrics
- Weekly review: Check for data drift, quality issues
## Common Pitfalls and How to Avoid Them
### Pitfall 1: No Evaluation Strategy
**Symptom:** "I'll just look at a few examples to see if it works."
**Fix:** Mandatory held-out test set with quantitative metrics. Never ship without numbers.
### Pitfall 2: Wrong Metrics for Task
**Symptom:** Using accuracy for generation tasks, BLEU for classification.
**Fix:** Match metric family to task type. See Part 1 tables.
### Pitfall 3: Automated Metrics Only
**Symptom:** BLEU increased to 0.45 but users complain about quality.
**Fix:** Always combine automated + human + production metrics. All three must improve.
### Pitfall 4: Single Metric Optimization
**Symptom:** ROUGE-L optimized but summaries are verbose and contain hallucinations.
**Fix:** Multi-dimensional evaluation with guard rails. Reject regressions on secondary metrics.
### Pitfall 5: No Baseline Comparison
**Symptom:** "Our model achieves 82% accuracy!" (Is that good? Better than what?)
**Fix:** Always compare to baselines: rule-based, zero-shot, previous model.
### Pitfall 6: No A/B Testing
**Symptom:** Deploy new model, discover it's worse than baseline, scramble to rollback.
**Fix:** A/B test with statistical significance before full deployment.
### Pitfall 7: Insufficient Sample Size
**Symptom:** "We tested on 20 examples and it looks good!"
**Fix:** Minimum 200-500 samples for human evaluation, 200-500 per variant for A/B testing.
### Pitfall 8: No Production Monitoring
**Symptom:** Model quality degrades over time (data drift) but nobody notices until users complain.
**Fix:** Continuous monitoring with automated alerts on metric regressions.
## Summary
**Evaluation is mandatory, not optional.**
**Complete evaluation = Automated metrics (efficiency) + Human evaluation (quality) + Production metrics (impact)**
**Core principles:**
1. Match metrics to task type (classification vs generation)
2. Multi-dimensional scoring prevents gaming single metrics
3. Human evaluation catches issues automated metrics miss
4. A/B testing proves value before full deployment
5. Production monitoring detects regressions and drift
**Checklist:** Task type → Test set → Metrics → Automated eval → Human eval → Baselines → A/B test → Cost-benefit → Gradual rollout → Production monitoring
Without rigorous evaluation, you don't know if your system works. Evaluation is how you make engineering decisions with confidence instead of guesses.