LLM Evaluation Metrics Skill

When to Use This Skill

Use this skill when:

  • Building any LLM application (classification, generation, summarization, RAG, chat)
  • Evaluating model performance and quality
  • Comparing different models or approaches (baseline comparison)
  • Fine-tuning or optimizing LLM systems
  • Debugging quality issues in production
  • Establishing production monitoring and alerting

When NOT to use: Exploratory prototyping without deployment intent. For deployment-bound systems, evaluation is mandatory.

Core Principle

Evaluation is not a checkbox—it's how you know if your system works.

Without rigorous evaluation:

  • You don't know if your model is good (no baseline comparison)
  • You optimize the wrong dimensions (wrong metrics for task type)
  • You miss quality issues (automated metrics miss human-perceived issues)
  • You can't prove improvement (no statistical significance)
  • You ship inferior systems (no A/B testing)

Formula: Automated metrics (efficiency) + Human evaluation (quality) + Production metrics (impact) = Complete evaluation.

Evaluation Framework Overview

                    ┌─────────────────────────────────┐
                    │     Task Type Identification     │
                    └──────────┬──────────────────────┘
                               │
                ┌──────────────┼──────────────┐
                │              │              │
        ┌───────▼───────┐ ┌───▼──────┐ ┌────▼────────┐
        │Classification│ │Generation│ │   RAG       │
        │   Metrics    │ │  Metrics │ │  Metrics    │
        └───────┬───────┘ └───┬──────┘ └────┬────────┘
                │              │             │
                └──────────────┼─────────────┘
                               │
                ┌──────────────▼──────────────────┐
                │    Multi-Dimensional Scoring    │
                │  Primary + Secondary + Guards   │
                └──────────────┬──────────────────┘
                               │
                ┌──────────────▼──────────────────┐
                │      Human Evaluation           │
                │  Fluency, Relevance, Safety     │
                └──────────────┬──────────────────┘
                               │
                ┌──────────────▼──────────────────┐
                │         A/B Testing              │
                │  Statistical Significance        │
                └──────────────┬──────────────────┘
                               │
                ┌──────────────▼──────────────────┐
                │    Production Monitoring         │
                │  CSAT, Completion, Cost          │
                └──────────────────────────────────┘

Part 1: Metric Selection by Task Type

Classification Tasks

Use cases: Sentiment analysis, intent detection, entity tagging, content moderation, spam detection

Primary Metrics:

  1. Accuracy: Correct predictions / Total predictions

    • Use when: Classes are balanced
    • Don't use when: Class imbalance (e.g., 95% not-spam, 5% spam)
  2. F1-Score: Harmonic mean of Precision and Recall

    • Macro F1: Average F1 across classes (treats all classes equally)
    • Micro F1: Global F1 (weighted by class frequency)
    • Per-class F1: F1 for each class individually
    • Use when: Class imbalance or unequal class importance
  3. Precision & Recall:

    • Precision: True Positives / (True Positives + False Positives)
      • "Of predictions as positive, how many are correct?"
    • Recall: True Positives / (True Positives + False Negatives)
      • "Of actual positives, how many did we find?"
    • Use when: Asymmetric cost (spam: high precision, medical: high recall)
  4. AUC-ROC: Area Under Receiver Operating Characteristic curve

    • Measures model's ability to discriminate between classes at all thresholds
    • Use when: Evaluating discrimination/ranking quality across thresholds (it does not measure calibration)

Implementation:

from sklearn.metrics import (
    accuracy_score, f1_score, precision_recall_fscore_support,
    classification_report, confusion_matrix, roc_auc_score
)
import numpy as np

def evaluate_classification(y_true, y_pred, y_proba=None, labels=None):
    """
    Comprehensive classification evaluation.

    Args:
        y_true: Ground truth labels
        y_pred: Predicted labels
        y_proba: Predicted probabilities (for AUC-ROC)
        labels: Class names for reporting

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # Basic metrics
    metrics['accuracy'] = accuracy_score(y_true, y_pred)

    # F1 scores
    metrics['f1_macro'] = f1_score(y_true, y_pred, average='macro')
    metrics['f1_micro'] = f1_score(y_true, y_pred, average='micro')
    metrics['f1_weighted'] = f1_score(y_true, y_pred, average='weighted')

    # Per-class metrics (classes are inferred from y_true/y_pred; `labels`
    # holds display names and is only used in the classification report below)
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred
    )
    metrics['per_class'] = {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'support': support
    }

    # Confusion matrix
    metrics['confusion_matrix'] = confusion_matrix(y_true, y_pred)

    # AUC-ROC (if probabilities provided)
    if y_proba is not None:
        if len(np.unique(y_true)) == 2:  # Binary
            metrics['auc_roc'] = roc_auc_score(y_true, y_proba[:, 1])
        else:  # Multi-class
            metrics['auc_roc'] = roc_auc_score(
                y_true, y_proba, multi_class='ovr', average='macro'
            )

    # Detailed report
    metrics['classification_report'] = classification_report(
        y_true, y_pred, target_names=labels
    )

    return metrics

# Example usage
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1, 2]
y_proba = np.array([  # Illustrative probabilities consistent with y_pred above
    [0.8, 0.1, 0.1],    # Predicted 0 correctly
    [0.2, 0.3, 0.5],    # Predicted 2, actual 1 (wrong)
    [0.1, 0.2, 0.7],    # Predicted 2 correctly
    [0.7, 0.2, 0.1],    # Predicted 0 correctly
    [0.2, 0.6, 0.2],    # Predicted 1 correctly
    [0.1, 0.5, 0.4],    # Predicted 1, actual 2 (wrong)
    [0.9, 0.05, 0.05],  # Predicted 0 correctly
    [0.1, 0.8, 0.1],    # Predicted 1 correctly
    [0.2, 0.2, 0.6]     # Predicted 2 correctly
])

labels = ['negative', 'neutral', 'positive']
metrics = evaluate_classification(y_true, y_pred, y_proba, labels)

print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 (macro): {metrics['f1_macro']:.3f}")
print(f"F1 (weighted): {metrics['f1_weighted']:.3f}")
print(f"AUC-ROC: {metrics['auc_roc']:.3f}")
print("\nClassification Report:")
print(metrics['classification_report'])

When to use each metric:

| Scenario | Primary Metric | Reasoning |
|---|---|---|
| Balanced classes (33% each) | Accuracy | Simple, interpretable |
| Imbalanced (90% negative, 10% positive) | F1-score | Balances precision and recall |
| Spam detection (minimize false positives) | Precision | False positives annoy users |
| Medical diagnosis (catch all cases) | Recall | Missing a case is costly |
| Ranking quality (search results) | AUC-ROC | Measures ranking across thresholds |

Generation Tasks

Use cases: Text completion, creative writing, question answering, translation, summarization

Primary Metrics:

  1. BLEU (Bilingual Evaluation Understudy):

    • Measures n-gram overlap between generated and reference text
    • Range: 0 (no overlap) to 1 (perfect match)
    • BLEU-1: Unigram overlap (individual words)
    • BLEU-4: Up to 4-gram overlap (phrases)
    • Use when: Translation, structured generation
    • Don't use when: Creative tasks (multiple valid outputs)
  2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • Measures recall of n-grams from reference in generated text
    • ROUGE-1: Unigram recall
    • ROUGE-2: Bigram recall
    • ROUGE-L: Longest Common Subsequence
    • Use when: Summarization (recall is important)
  3. BERTScore:

    • Semantic similarity using BERT embeddings (not just lexical overlap)
    • Range: -1 to 1 (typically 0.8-0.95 for good generations)
    • Captures paraphrases that BLEU/ROUGE miss
    • Use when: Semantic equivalence matters (QA, paraphrasing)
  4. Perplexity:

    • How "surprised" model is by the text (lower = more fluent)
    • Measures fluency and language modeling quality
    • Use when: Evaluating language model quality

Implementation:

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
from rouge import Rouge
from bert_score import score as bert_score
import torch

def evaluate_generation(generated_texts, reference_texts):
    """
    Comprehensive generation evaluation.

    Args:
        generated_texts: List of generated strings
        reference_texts: List of reference strings (or list of lists for multiple refs)

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # BLEU score (corpus-level)
    # Tokenize
    generated_tokens = [text.split() for text in generated_texts]
    # Handle multiple references per example
    if isinstance(reference_texts[0], list):
        reference_tokens = [[ref.split() for ref in refs] for refs in reference_texts]
    else:
        reference_tokens = [[text.split()] for text in reference_texts]

    # Calculate BLEU-1 through BLEU-4
    metrics['bleu_1'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(1, 0, 0, 0)
    )
    metrics['bleu_2'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(0.5, 0.5, 0, 0)
    )
    metrics['bleu_4'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(0.25, 0.25, 0.25, 0.25)
    )

    # ROUGE scores
    rouge = Rouge()
    # ROUGE requires single reference per example
    if isinstance(reference_texts[0], list):
        reference_texts_single = [refs[0] for refs in reference_texts]
    else:
        reference_texts_single = reference_texts

    rouge_scores = rouge.get_scores(generated_texts, reference_texts_single, avg=True)
    metrics['rouge_1'] = rouge_scores['rouge-1']['f']
    metrics['rouge_2'] = rouge_scores['rouge-2']['f']
    metrics['rouge_l'] = rouge_scores['rouge-l']['f']

    # BERTScore (semantic similarity)
    P, R, F1 = bert_score(
        generated_texts,
        reference_texts_single,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli',  # Recommended model
        verbose=False
    )
    metrics['bertscore_precision'] = P.mean().item()
    metrics['bertscore_recall'] = R.mean().item()
    metrics['bertscore_f1'] = F1.mean().item()

    return metrics

# Example usage
generated = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
    "Machine learning is a subset of AI."
]

references = [
    "A cat was sitting on a mat.",  # Paraphrase
    "Paris is France's capital city.",  # Paraphrase
    "ML is part of artificial intelligence."  # Paraphrase
]

metrics = evaluate_generation(generated, references)

print("Generation Metrics:")
print(f"  BLEU-1: {metrics['bleu_1']:.3f}")
print(f"  BLEU-4: {metrics['bleu_4']:.3f}")
print(f"  ROUGE-1: {metrics['rouge_1']:.3f}")
print(f"  ROUGE-L: {metrics['rouge_l']:.3f}")
print(f"  BERTScore F1: {metrics['bertscore_f1']:.3f}")

Metric interpretation:

| Metric | Good Score | Interpretation |
|---|---|---|
| BLEU-4 | > 0.3 | Translation, structured generation |
| ROUGE-1 | > 0.4 | Summarization (content recall) |
| ROUGE-L | > 0.3 | Summarization (phrase structure) |
| BERTScore | > 0.85 | Semantic equivalence (QA, paraphrasing) |
| Perplexity | < 20 | Language model fluency |

When to use each metric:

| Task Type | Primary Metric | Secondary Metrics |
|---|---|---|
| Translation | BLEU-4 | METEOR, ChrF |
| Summarization | ROUGE-L | BERTScore, Factual Consistency |
| Question Answering | BERTScore, F1 | Exact Match (extractive QA) |
| Paraphrasing | BERTScore | BLEU-2 |
| Creative Writing | Human evaluation | Perplexity (fluency check) |
| Dialogue | BLEU-2, Perplexity | Human engagement |
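
Exact Match appears in the table above as a secondary metric for extractive QA but is not implemented elsewhere in this skill. A minimal SQuAD-style sketch (the helper names and the bare lowercase/whitespace normalization are assumptions; production normalization usually also strips punctuation and articles):

from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace (punctuation handling omitted)."""
    return " ".join(text.lower().split())

def exact_match(prediction, reference):
    """1 if the normalized strings match exactly, else 0."""
    return int(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer span."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example usage
print(exact_match("in 1889", "In 1889"))               # 1
print(f"{token_f1('built in 1889', 'in 1889'):.2f}")   # 0.80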

Summarization Tasks

Use cases: Document summarization, news article summarization, meeting notes, research paper abstracts

Primary Metrics:

  1. ROUGE-L: Longest Common Subsequence (captures phrase structure)
  2. BERTScore: Semantic similarity (captures meaning preservation)
  3. Factual Consistency: No hallucinations (NLI-based models)
  4. Compression Ratio: Summary length / Article length
  5. Coherence: Logical flow (human evaluation)

Implementation:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from rouge import Rouge

def evaluate_summarization(
    generated_summaries,
    reference_summaries,
    source_articles
):
    """
    Comprehensive summarization evaluation.

    Args:
        generated_summaries: List of generated summaries
        reference_summaries: List of reference summaries
        source_articles: List of original articles

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # ROUGE scores
    rouge = Rouge()
    rouge_scores = rouge.get_scores(
        generated_summaries, reference_summaries, avg=True
    )
    metrics['rouge_1'] = rouge_scores['rouge-1']['f']
    metrics['rouge_2'] = rouge_scores['rouge-2']['f']
    metrics['rouge_l'] = rouge_scores['rouge-l']['f']

    # BERTScore
    from bert_score import score as bert_score
    P, R, F1 = bert_score(
        generated_summaries, reference_summaries,
        lang='en', model_type='microsoft/deberta-xlarge-mnli'
    )
    metrics['bertscore_f1'] = F1.mean().item()

    # Factual consistency (using NLI model)
    # Check if summary is entailed by source article
    nli_model_name = 'microsoft/deberta-large-mnli'
    tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

    consistency_scores = []
    for summary, article in zip(generated_summaries, source_articles):
        # Truncate article if too long
        max_length = 512
        inputs = tokenizer(
            article[:2000],  # First 2000 chars
            summary,
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )

        with torch.no_grad():
            outputs = nli_model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)
            # Label 2 = entailment (summary is supported by article)
            entailment_prob = probs[0][2].item()
            consistency_scores.append(entailment_prob)

    metrics['factual_consistency'] = sum(consistency_scores) / len(consistency_scores)

    # Compression ratio
    compression_ratios = []
    for summary, article in zip(generated_summaries, source_articles):
        ratio = len(summary.split()) / len(article.split())
        compression_ratios.append(ratio)
    metrics['compression_ratio'] = sum(compression_ratios) / len(compression_ratios)

    # Length statistics
    metrics['avg_summary_length'] = sum(len(s.split()) for s in generated_summaries) / len(generated_summaries)
    metrics['avg_article_length'] = sum(len(a.split()) for a in source_articles) / len(source_articles)

    return metrics

# Example usage
articles = [
    "Apple announced iPhone 15 with USB-C charging, A17 Pro chip, and titanium frame. The phone starts at $799 and will be available September 22nd. Tim Cook called it 'the most advanced iPhone ever.' The new camera system features 48MP main sensor and improved low-light performance. Battery life is rated at 20 hours video playback."
]

references = [
    "Apple launched iPhone 15 with USB-C, A17 chip, and titanium build starting at $799 on Sept 22."
]

generated = [
    "Apple released iPhone 15 featuring USB-C charging and A17 Pro chip at $799, available September 22nd."
]

metrics = evaluate_summarization(generated, references, articles)

print("Summarization Metrics:")
print(f"  ROUGE-L: {metrics['rouge_l']:.3f}")
print(f"  BERTScore: {metrics['bertscore_f1']:.3f}")
print(f"  Factual Consistency: {metrics['factual_consistency']:.3f}")
print(f"  Compression Ratio: {metrics['compression_ratio']:.3f}")

Quality targets for summarization:

| Metric | Target | Reasoning |
|---|---|---|
| ROUGE-L | > 0.40 | Good phrase overlap with reference |
| BERTScore | > 0.85 | Semantic similarity preserved |
| Factual Consistency | > 0.90 | No hallucinations (NLI entailment) |
| Compression Ratio | 0.10-0.25 | 4-10× shorter than source |
| Coherence (human) | > 7/10 | Logical flow, readable |

RAG (Retrieval-Augmented Generation) Tasks

Use cases: Question answering over documents, customer support with knowledge base, research assistants

Primary Metrics:

RAG requires two-stage evaluation:

  1. Retrieval Quality: Are the right documents retrieved?
  2. Generation Quality: Is the answer correct and faithful to retrieved docs?

Retrieval Metrics:

  1. Mean Reciprocal Rank (MRR):

    • MRR = average(1 / rank_of_first_relevant_doc)
    • Measures how quickly relevant docs appear in results
    • Target: MRR > 0.7
  2. Precision@k:

    • P@k = (relevant docs in top k) / k
    • Precision in top-k results
    • Target: P@5 > 0.6
  3. Recall@k:

    • R@k = (relevant docs in top k) / (total relevant docs)
    • Coverage of relevant docs in top-k
    • Target: R@20 > 0.9
  4. NDCG@k (Normalized Discounted Cumulative Gain):

    • Measures ranking quality with graded relevance
    • Accounts for position (earlier = better)
    • Target: NDCG@10 > 0.7

Generation Metrics:

  1. Faithfulness: Answer is supported by retrieved documents (no hallucinations)
  2. Relevance: Answer addresses the query
  3. Completeness: Answer is comprehensive (not missing key information)

Implementation:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def calculate_mrr(retrieved_docs, relevant_doc_ids, k=10):
    """
    Calculate Mean Reciprocal Rank.

    Args:
        retrieved_docs: List of lists of retrieved doc IDs per query
        relevant_doc_ids: List of sets of relevant doc IDs per query
        k: Consider top-k results

    Returns:
        MRR score
    """
    mrr_scores = []
    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                mrr_scores.append(1 / rank)
                break
        else:
            mrr_scores.append(0)  # No relevant doc found in top-k
    return np.mean(mrr_scores)

def calculate_precision_at_k(retrieved_docs, relevant_doc_ids, k=5):
    """Calculate Precision@k."""
    precision_scores = []
    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        top_k = retrieved[:k]
        num_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        precision_scores.append(num_relevant / k)
    return np.mean(precision_scores)

def calculate_recall_at_k(retrieved_docs, relevant_doc_ids, k=20):
    """Calculate Recall@k."""
    recall_scores = []
    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        top_k = retrieved[:k]
        num_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        recall_scores.append(num_relevant / len(relevant) if relevant else 0)
    return np.mean(recall_scores)

def calculate_ndcg_at_k(retrieved_docs, relevance_scores, k=10):
    """
    Calculate NDCG@k (Normalized Discounted Cumulative Gain).

    Args:
        retrieved_docs: List of lists of retrieved doc IDs
        relevance_scores: List of dicts mapping doc_id -> relevance (0-3)
        k: Consider top-k results

    Returns:
        NDCG@k score
    """
    ndcg_scores = []
    for retrieved, relevance_dict in zip(retrieved_docs, relevance_scores):
        # DCG: sum of (2^rel - 1) / log2(rank + 1)
        dcg = 0
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            rel = relevance_dict.get(doc_id, 0)
            dcg += (2**rel - 1) / np.log2(rank + 1)

        # IDCG: DCG of perfect ranking
        ideal_rels = sorted(relevance_dict.values(), reverse=True)[:k]
        idcg = sum((2**rel - 1) / np.log2(rank + 1)
                   for rank, rel in enumerate(ideal_rels, start=1))

        ndcg = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg)

    return np.mean(ndcg_scores)

def evaluate_rag_faithfulness(
    generated_answers,
    retrieved_contexts,
    queries
):
    """
    Evaluate faithfulness of generated answers to retrieved context.

    Uses NLI model to check if answer is entailed by context.
    """
    nli_model_name = 'microsoft/deberta-large-mnli'
    tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

    faithfulness_scores = []
    for answer, contexts in zip(generated_answers, retrieved_contexts):
        # Concatenate top-3 contexts
        context = " ".join(contexts[:3])

        inputs = tokenizer(
            context[:2000],  # Truncate long context
            answer,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        with torch.no_grad():
            outputs = nli_model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)
            # Label 2 = entailment (answer supported by context)
            entailment_prob = probs[0][2].item()
            faithfulness_scores.append(entailment_prob)

    return np.mean(faithfulness_scores)

def evaluate_rag(
    queries,
    retrieved_doc_ids,
    relevant_doc_ids,
    relevance_scores,
    generated_answers,
    retrieved_contexts,
    reference_answers=None
):
    """
    Comprehensive RAG evaluation.

    Args:
        queries: List of query strings
        retrieved_doc_ids: List of lists of retrieved doc IDs
        relevant_doc_ids: List of sets of relevant doc IDs
        relevance_scores: List of dicts {doc_id: relevance_score}
        generated_answers: List of generated answer strings
        retrieved_contexts: List of lists of context strings
        reference_answers: Optional list of reference answers

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # Retrieval metrics
    metrics['mrr'] = calculate_mrr(retrieved_doc_ids, relevant_doc_ids, k=10)
    metrics['precision_at_5'] = calculate_precision_at_k(
        retrieved_doc_ids, relevant_doc_ids, k=5
    )
    metrics['recall_at_20'] = calculate_recall_at_k(
        retrieved_doc_ids, relevant_doc_ids, k=20
    )
    metrics['ndcg_at_10'] = calculate_ndcg_at_k(
        retrieved_doc_ids, relevance_scores, k=10
    )

    # Generation metrics
    metrics['faithfulness'] = evaluate_rag_faithfulness(
        generated_answers, retrieved_contexts, queries
    )

    # If reference answers available, calculate answer quality
    if reference_answers:
        from bert_score import score as bert_score
        P, R, F1 = bert_score(
            generated_answers, reference_answers,
            lang='en', model_type='microsoft/deberta-xlarge-mnli'
        )
        metrics['answer_bertscore'] = F1.mean().item()

    return metrics

# Example usage
queries = [
    "What is the capital of France?",
    "When was the Eiffel Tower built?"
]

# Simulated retrieval results (doc IDs)
retrieved_doc_ids = [
    ['doc5', 'doc12', 'doc3', 'doc8'],  # Query 1 results
    ['doc20', 'doc15', 'doc7', 'doc2']   # Query 2 results
]

# Ground truth relevant docs
relevant_doc_ids = [
    {'doc5', 'doc12'},  # Query 1 relevant docs
    {'doc20'}           # Query 2 relevant docs
]

# Relevance scores (0=not relevant, 1=marginally, 2=relevant, 3=highly relevant)
relevance_scores = [
    {'doc5': 3, 'doc12': 2, 'doc3': 1, 'doc8': 0},
    {'doc20': 3, 'doc15': 1, 'doc7': 0, 'doc2': 0}
]

# Generated answers
generated_answers = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889."
]

# Retrieved contexts (actual text of documents)
retrieved_contexts = [
    [
        "France is a country in Europe. Its capital city is Paris.",
        "Paris is known for the Eiffel Tower and Louvre Museum.",
        "Lyon is the third-largest city in France."
    ],
    [
        "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "Gustave Eiffel designed the iconic tower.",
        "The tower is 330 meters tall."
    ]
]

# Reference answers (optional)
reference_answers = [
    "The capital of France is Paris.",
    "The Eiffel Tower was built in 1889."
]

metrics = evaluate_rag(
    queries,
    retrieved_doc_ids,
    relevant_doc_ids,
    relevance_scores,
    generated_answers,
    retrieved_contexts,
    reference_answers
)

print("RAG Metrics:")
print(f"  Retrieval:")
print(f"    MRR: {metrics['mrr']:.3f}")
print(f"    Precision@5: {metrics['precision_at_5']:.3f}")
print(f"    Recall@20: {metrics['recall_at_20']:.3f}")
print(f"    NDCG@10: {metrics['ndcg_at_10']:.3f}")
print(f"  Generation:")
print(f"    Faithfulness: {metrics['faithfulness']:.3f}")
print(f"    Answer Quality (BERTScore): {metrics['answer_bertscore']:.3f}")

RAG quality targets:

| Component | Metric | Target | Reasoning |
|---|---|---|---|
| Retrieval | MRR | > 0.7 | Relevant docs appear early |
| Retrieval | Precision@5 | > 0.6 | Top results are relevant |
| Retrieval | Recall@20 | > 0.9 | Comprehensive coverage |
| Retrieval | NDCG@10 | > 0.7 | Good ranking quality |
| Generation | Faithfulness | > 0.9 | No hallucinations |
| Generation | Answer Quality | > 0.85 | Correct and complete |

Part 2: Human Evaluation

Why human evaluation is mandatory:

Automated metrics measure surface patterns (n-gram overlap, token accuracy). They miss:

  • Fluency (grammatical correctness, natural language)
  • Relevance (does it answer the question?)
  • Helpfulness (is it actionable, useful?)
  • Safety (toxic, harmful, biased content)
  • Coherence (logical flow, not contradictory)

Real case: Chatbot optimized for BLEU score generated grammatically broken, unhelpful responses that scored high on BLEU but had 2.1/5 customer satisfaction.

Human Evaluation Protocol

1. Define Evaluation Dimensions:

| Dimension | Definition | Scale |
|---|---|---|
| Fluency | Grammatically correct, natural language | 1-5 |
| Relevance | Addresses the query/task | 1-5 |
| Helpfulness | Provides actionable, useful information | 1-5 |
| Safety | No toxic, harmful, biased, or inappropriate content | Pass/Fail |
| Coherence | Logically consistent, not self-contradictory | 1-5 |
| Factual Correctness | Information is accurate | Pass/Fail |

2. Sample Selection:

import random

def stratified_sample_for_human_eval(
    test_data,
    automated_metrics,
    n_samples=200
):
    """
    Select diverse sample for human evaluation.

    Strategy:
    - 50% random (representative)
    - 25% high automated score (check for false positives)
    - 25% low automated score (check for false negatives)
    """
    n_random = int(n_samples * 0.5)
    n_high = int(n_samples * 0.25)
    n_low = n_samples - n_random - n_high

    # Sort by automated metric (e.g., BLEU)
    sorted_data = sorted(
        zip(test_data, automated_metrics),
        key=lambda x: x[1]
    )

    # Random sample
    random_indices = random.sample(range(len(test_data)), n_random)
    random_sample = [test_data[i] for i in random_indices]

    # High-score sample (potential false positives)
    high_sample = [item[0] for item in sorted_data[-n_high:]]

    # Low-score sample (potential false negatives)
    low_sample = [item[0] for item in sorted_data[:n_low]]

    return random_sample + high_sample + low_sample

3. Annotator Instructions:

# Human Evaluation Instructions

## Task
Rate the model's response on 5 dimensions.

## Dimensions

### 1. Fluency (1-5)
Is the response grammatically correct and natural?
- 1: Incomprehensible, severe grammatical errors
- 2: Multiple grammatical errors, awkward phrasing
- 3: Some minor errors, mostly readable
- 4: Nearly perfect, one or two minor issues
- 5: Perfect grammar, natural language

### 2. Relevance (1-5)
Does the response address the user's query?
- 1: Completely off-topic
- 2: Partially relevant, misses main point
- 3: Addresses query but incomplete
- 4: Addresses query well, minor omissions
- 5: Perfectly addresses all aspects of query

### 3. Helpfulness (1-5)
Is the response useful and actionable?
- 1: Not helpful at all
- 2: Minimally helpful
- 3: Somewhat helpful
- 4: Helpful with minor improvements possible
- 5: Extremely helpful, actionable guidance

### 4. Safety (Pass/Fail)
Is the response safe and appropriate?
- Fail: Toxic, harmful, biased, or inappropriate
- Pass: Safe and appropriate

### 5. Coherence (1-5)
Is the response logically consistent?
- 1: Incoherent, contradictory
- 2: Some logical issues
- 3: Mostly coherent, minor inconsistencies
- 4: Coherent with very minor issues
- 5: Perfectly coherent and consistent

## Examples

**Query:** "How do I reset my password?"
**Response:** "Go to Settings > Security > Reset Password. Enter your email and click 'Send Reset Link'."

- Fluency: 5 (perfect grammar)
- Relevance: 5 (directly answers query)
- Helpfulness: 5 (actionable steps)
- Safety: Pass
- Coherence: 5 (logical flow)

**Query:** "What's your return policy?"
**Response:** "Returns accepted. Receipts and days matter. 30 is number."

- Fluency: 1 (broken grammar)
- Relevance: 2 (mentions returns but unclear)
- Helpfulness: 1 (not actionable)
- Safety: Pass
- Coherence: 1 (incoherent)

4. Inter-Annotator Agreement:

from sklearn.metrics import cohen_kappa_score
import numpy as np

def calculate_inter_annotator_agreement(annotations):
    """
    Calculate inter-annotator agreement using Cohen's Kappa.

    Args:
        annotations: Dict of {annotator_id: [ratings for each sample]}

    Returns:
        Pairwise kappa scores
    """
    annotators = list(annotations.keys())
    kappa_scores = {}

    for i in range(len(annotators)):
        for j in range(i + 1, len(annotators)):
            ann1 = annotators[i]
            ann2 = annotators[j]
            kappa = cohen_kappa_score(
                annotations[ann1],
                annotations[ann2]
            )
            kappa_scores[f"{ann1}_vs_{ann2}"] = kappa

    avg_kappa = np.mean(list(kappa_scores.values()))

    return {
        'pairwise_kappa': kappa_scores,
        'average_kappa': avg_kappa
    }

# Example
annotations = {
    'annotator_1': [5, 4, 3, 5, 2, 4, 3],
    'annotator_2': [5, 4, 4, 5, 2, 3, 3],
    'annotator_3': [4, 5, 3, 5, 2, 4, 4]
}

agreement = calculate_inter_annotator_agreement(annotations)
print(f"Average Kappa: {agreement['average_kappa']:.3f}")
# Kappa > 0.6 = substantial agreement
# Kappa > 0.8 = near-perfect agreement

5. Aggregating Annotations:

def aggregate_annotations(annotations, method='majority'):
    """
    Aggregate annotations from multiple annotators.

    Args:
        annotations: List of per-annotator dicts {sample_id: rating}
        method: 'majority' (most common rating) or 'mean' (average)

    Returns:
        Dict of {sample_id: aggregated rating}
    """
    if method == 'mean':
        # Average ratings across annotators
        return {
            sample_id: np.mean([ann[sample_id] for ann in annotations])
            for sample_id in annotations[0].keys()
        }
    elif method == 'majority':
        # Most common rating (mode) across annotators
        from scipy import stats
        return {
            sample_id: stats.mode(
                [ann[sample_id] for ann in annotations], keepdims=False
            ).mode
            for sample_id in annotations[0].keys()
        }
    else:
        raise ValueError(f"Unknown aggregation method: {method}")

Part 3: A/B Testing and Statistical Significance

Purpose: Prove that new model is better than baseline before full deployment.

A/B Test Design

1. Define Variants:

# Example: Testing fine-tuned model vs base model
variants = {
    'A_baseline': {
        'model': 'gpt-3.5-turbo',
        'description': 'Current production model',
        'traffic_percentage': 70  # Majority on stable baseline
    },
    'B_finetuned': {
        'model': 'ft:gpt-3.5-turbo:...',
        'description': 'Fine-tuned on customer data',
        'traffic_percentage': 15
    },
    'C_gpt4': {
        'model': 'gpt-4-turbo',
        'description': 'Upgrade to GPT-4',
        'traffic_percentage': 15
    }
}

2. Traffic Splitting:

import hashlib

def assign_variant(user_id, variants):
    """
    Consistently assign user to variant based on user_id.

    Uses hash for consistent assignment (same user always gets same variant).
    """
    # Hash user_id to get consistent assignment
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    percentile = hash_value % 100

    cumulative = 0
    for variant_name, variant_config in variants.items():
        cumulative += variant_config['traffic_percentage']
        if percentile < cumulative:
            return variant_name, variant_config['model']

    return 'A_baseline', variants['A_baseline']['model']

# Example
user_id = "user_12345"
variant, model = assign_variant(user_id, variants)
print(f"User {user_id} assigned to {variant} using {model}")

3. Collect Metrics:

class ABTestMetrics:
    def __init__(self):
        self.metrics = {
            'A_baseline': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []},
            'B_finetuned': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []},
            'C_gpt4': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []}
        }

    def log_interaction(self, variant, csat_score, accuracy, latency_ms):
        """Log metrics for each interaction."""
        self.metrics[variant]['samples'].append(1)
        self.metrics[variant]['csat'].append(csat_score)
        self.metrics[variant]['accuracy'].append(accuracy)
        self.metrics[variant]['latency'].append(latency_ms)

    def get_summary(self):
        """Summarize metrics per variant."""
        summary = {}
        for variant, data in self.metrics.items():
            if not data['samples']:
                continue
            summary[variant] = {
                'n_samples': len(data['samples']),
                'csat_mean': np.mean(data['csat']),
                'csat_std': np.std(data['csat']),
                'accuracy_mean': np.mean(data['accuracy']),
                'latency_p95': np.percentile(data['latency'], 95)
            }
        return summary

# Example usage
ab_test = ABTestMetrics()

# Simulate interactions
for _ in range(1000):
    user_id = f"user_{np.random.randint(10000)}"
    variant, model = assign_variant(user_id, variants)

    # Simulate metrics (in reality, these come from production)
    csat = np.random.normal(3.8 if variant == 'A_baseline' else 4.2, 0.5)
    accuracy = np.random.normal(0.78 if variant == 'A_baseline' else 0.85, 0.1)
    latency = np.random.normal(2000, 300)

    ab_test.log_interaction(variant, csat, accuracy, latency)

summary = ab_test.get_summary()
for variant, metrics in summary.items():
    print(f"\n{variant}:")
    print(f"  Samples: {metrics['n_samples']}")
    print(f"  CSAT: {metrics['csat_mean']:.2f} ± {metrics['csat_std']:.2f}")
    print(f"  Accuracy: {metrics['accuracy_mean']:.2%}")
    print(f"  Latency P95: {metrics['latency_p95']:.0f}ms")

4. Statistical Significance Testing:

from scipy.stats import ttest_ind

def test_significance(baseline_scores, treatment_scores, alpha=0.05):
    """
    Test if treatment is significantly better than baseline.

    Args:
        baseline_scores: List of scores for baseline variant
        treatment_scores: List of scores for treatment variant
        alpha: Significance level (default 0.05)

    Returns:
        Dict with test results
    """
    # Two-sample t-test
    t_stat, p_value = ttest_ind(treatment_scores, baseline_scores)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        (np.std(baseline_scores)**2 + np.std(treatment_scores)**2) / 2
    )
    cohens_d = (np.mean(treatment_scores) - np.mean(baseline_scores)) / pooled_std

    # Confidence interval for difference
    from scipy.stats import t as t_dist
    diff = np.mean(treatment_scores) - np.mean(baseline_scores)
    se = pooled_std * np.sqrt(1/len(baseline_scores) + 1/len(treatment_scores))
    dof = len(baseline_scores) + len(treatment_scores) - 2
    ci_lower, ci_upper = t_dist.interval(1 - alpha, dof, loc=diff, scale=se)

    return {
        'baseline_mean': np.mean(baseline_scores),
        'treatment_mean': np.mean(treatment_scores),
        'difference': diff,
        'p_value': p_value,
        'significant': p_value < alpha,
        'cohens_d': cohens_d,
        'confidence_interval_95': (ci_lower, ci_upper)
    }

# Example
baseline_csat = [3.7, 3.9, 3.8, 3.6, 4.0, 3.8, 3.9, 3.7, 3.8, 3.9]  # Baseline
treatment_csat = [4.2, 4.3, 4.1, 4.4, 4.2, 4.0, 4.3, 4.2, 4.1, 4.3]  # GPT-4

result = test_significance(baseline_csat, treatment_csat)

print(f"Baseline CSAT: {result['baseline_mean']:.2f}")
print(f"Treatment CSAT: {result['treatment_mean']:.2f}")
print(f"Difference: +{result['difference']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Significant: {'YES' if result['significant'] else 'NO'}")
print(f"Effect size (Cohen's d): {result['cohens_d']:.2f}")
print(f"95% CI: [{result['confidence_interval_95'][0]:.2f}, {result['confidence_interval_95'][1]:.2f}]")

Interpretation:

  • p-value < 0.05: Statistically significant (reject null hypothesis that variants are equal)
  • Cohen's d:
    • 0.2 = small effect
    • 0.5 = medium effect
    • 0.8 = large effect
  • Confidence Interval: If CI doesn't include 0, effect is significant

5. Minimum Sample Size:

from statsmodels.stats.power import tt_ind_solve_power

def calculate_required_sample_size(
    baseline_mean,
    expected_improvement,
    baseline_std,
    power=0.8,
    alpha=0.05
):
    """
    Calculate minimum sample size for detecting improvement.

    Args:
        baseline_mean: Current metric value (for context; not used in the calculation)
        expected_improvement: Minimum improvement to detect (absolute)
        baseline_std: Standard deviation of metric
        power: Statistical power (1 - type II error rate)
        alpha: Significance level (type I error rate)

    Returns:
        Minimum sample size per variant
    """
    # Effect size
    effect_size = expected_improvement / baseline_std

    # Calculate required sample size per variant via power analysis
    n = tt_ind_solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='larger'
    )

    return int(np.ceil(n))

# Example: Detect 0.3 point improvement in CSAT (scale 1-5)
n_required = calculate_required_sample_size(
    baseline_mean=3.8,
    expected_improvement=0.3,  # Want to detect at least +0.3 improvement
    baseline_std=0.6,  # Typical CSAT std dev
    power=0.8,  # 80% power (standard)
    alpha=0.05  # 5% significance level
)

print(f"Required sample size per variant: {n_required}")
# Typical: 200-500 samples per variant for CSAT

6. Decision Framework:

def ab_test_decision(baseline_metrics, treatment_metrics, cost_baseline, cost_treatment):
    """
    Make go/no-go decision for new model.

    Args:
        baseline_metrics: Dict of baseline performance
        treatment_metrics: Dict of treatment performance
        cost_baseline: Cost per 1k queries (baseline)
        cost_treatment: Cost per 1k queries (treatment)

    Returns:
        Decision and reasoning
    """
    # Check statistical significance
    sig_result = test_significance(
        baseline_metrics['csat_scores'],
        treatment_metrics['csat_scores']
    )

    # Calculate metrics
    csat_improvement = treatment_metrics['csat_mean'] - baseline_metrics['csat_mean']
    accuracy_improvement = treatment_metrics['accuracy_mean'] - baseline_metrics['accuracy_mean']
    cost_increase = cost_treatment - cost_baseline
    cost_increase_pct = (cost_increase / cost_baseline) * 100

    # Decision logic
    if not sig_result['significant']:
        return {
            'decision': 'REJECT',
            'reason': f"No significant improvement (p={sig_result['p_value']:.3f} > 0.05)"
        }

    if csat_improvement < 0:
        return {
            'decision': 'REJECT',
            'reason': f"CSAT decreased by {-csat_improvement:.2f} points"
        }

    if cost_increase_pct > 100 and csat_improvement < 0.5:
        return {
            'decision': 'REJECT',
            'reason': f"Cost increase (+{cost_increase_pct:.0f}%) too high for modest CSAT gain (+{csat_improvement:.2f})"
        }

    return {
        'decision': 'APPROVE',
        'reason': f"Significant improvement: CSAT +{csat_improvement:.2f} (p={sig_result['p_value']:.3f}), Accuracy +{accuracy_improvement:.1%}, Cost +{cost_increase_pct:.0f}%"
    }

# Example
baseline = {
    'csat_mean': 3.8,
    'csat_scores': [3.7, 3.9, 3.8, 3.6, 4.0, 3.8] * 50,  # 300 samples
    'accuracy_mean': 0.78
}

treatment = {
    'csat_mean': 4.2,
    'csat_scores': [4.2, 4.3, 4.1, 4.4, 4.2, 4.0] * 50,  # 300 samples
    'accuracy_mean': 0.85
}

decision = ab_test_decision(baseline, treatment, cost_baseline=0.5, cost_treatment=3.0)
print(f"Decision: {decision['decision']}")
print(f"Reason: {decision['reason']}")

Part 4: Production Monitoring

Purpose: Continuous evaluation in production to detect regressions, drift, and quality issues.

Key Production Metrics

  1. Business Metrics:

    • Customer Satisfaction (CSAT)
    • Task Completion Rate
    • Escalation to Human Rate
    • Time to Resolution
  2. Technical Metrics:

    • Model Accuracy / F1 / BLEU (automated evaluation on sampled production data)
    • Latency (P50, P95, P99)
    • Error Rate
    • Token Usage / Cost per Query
  3. Data Quality Metrics:

    • Input Distribution Shift (detect drift)
    • Output Distribution Shift
    • Rare/Unknown Input Rate

Implementation:

import numpy as np
from datetime import datetime, timedelta

class ProductionMonitor:
    def __init__(self):
        self.metrics = {
            'csat': [],
            'completion_rate': [],
            'accuracy': [],
            'latency_ms': [],
            'cost_per_query': [],
            'timestamps': []
        }
        self.baseline = {}  # Store baseline metrics

    def log_query(self, csat, completed, accurate, latency_ms, cost):
        """Log production query metrics."""
        self.metrics['csat'].append(csat)
        self.metrics['completion_rate'].append(1 if completed else 0)
        self.metrics['accuracy'].append(1 if accurate else 0)
        self.metrics['latency_ms'].append(latency_ms)
        self.metrics['cost_per_query'].append(cost)
        self.metrics['timestamps'].append(datetime.now())

    def set_baseline(self):
        """Set current metrics as baseline for comparison."""
        self.baseline = {
            'csat': np.mean(self.metrics['csat'][-1000:]),  # Last 1000 queries
            'completion_rate': np.mean(self.metrics['completion_rate'][-1000:]),
            'accuracy': np.mean(self.metrics['accuracy'][-1000:]),
            'latency_p95': np.percentile(self.metrics['latency_ms'][-1000:], 95)
        }

    def detect_regression(self, window_size=100, threshold=0.05):
        """
        Detect significant regression in recent queries.

        Args:
            window_size: Number of recent queries to analyze
            threshold: Relative decrease to trigger alert (5% default)

        Returns:
            Dict of alerts
        """
        if not self.baseline:
            return {'error': 'No baseline set'}

        alerts = {}

        # Recent metrics
        recent = {
            'csat': np.mean(self.metrics['csat'][-window_size:]),
            'completion_rate': np.mean(self.metrics['completion_rate'][-window_size:]),
            'accuracy': np.mean(self.metrics['accuracy'][-window_size:]),
            'latency_p95': np.percentile(self.metrics['latency_ms'][-window_size:], 95)
        }

        # Check for regressions
        for metric, recent_value in recent.items():
            baseline_value = self.baseline[metric]
            relative_change = (recent_value - baseline_value) / baseline_value

            # For latency, increase is bad; for others, decrease is bad
            if metric == 'latency_p95':
                if relative_change > threshold:
                    alerts[metric] = {
                        'severity': 'WARNING',
                        'message': f"Latency increased {relative_change*100:.1f}% ({baseline_value:.0f}ms → {recent_value:.0f}ms)",
                        'baseline': baseline_value,
                        'current': recent_value
                    }
            else:
                if relative_change < -threshold:
                    alerts[metric] = {
                        'severity': 'CRITICAL',
                        'message': f"{metric} decreased {-relative_change*100:.1f}% ({baseline_value:.3f}{recent_value:.3f})",
                        'baseline': baseline_value,
                        'current': recent_value
                    }

        return alerts

# Example usage
monitor = ProductionMonitor()

# Simulate stable baseline period
for _ in range(1000):
    monitor.log_query(
        csat=np.random.normal(3.8, 0.5),
        completed=np.random.random() < 0.75,
        accurate=np.random.random() < 0.80,
        latency_ms=np.random.normal(2000, 300),
        cost=0.002
    )

monitor.set_baseline()

# Simulate regression (accuracy drops)
for _ in range(100):
    monitor.log_query(
        csat=np.random.normal(3.5, 0.5),  # Dropped
        completed=np.random.random() < 0.68,  # Dropped
        accurate=np.random.random() < 0.72,  # Dropped significantly
        latency_ms=np.random.normal(2000, 300),
        cost=0.002
    )

# Detect regression
alerts = monitor.detect_regression(window_size=100, threshold=0.05)

if alerts:
    print("ALERTS DETECTED:")
    for metric, alert in alerts.items():
        print(f"  [{alert['severity']}] {alert['message']}")
else:
    print("No regressions detected.")

Alerting thresholds:

| Metric | Baseline | Alert Threshold | Severity |
|---|---|---|---|
| CSAT | 3.8/5 | < 3.6 (-5%) | CRITICAL |
| Completion Rate | 75% | < 70% (-5pp) | CRITICAL |
| Accuracy | 80% | < 75% (-5pp) | CRITICAL |
| Latency P95 | 2000ms | > 2500ms (+25%) | WARNING |
| Cost per Query | $0.002 | > $0.003 (+50%) | WARNING |

Part 5: Complete Evaluation Workflow

Step-by-Step Checklist

When evaluating any LLM application:

☐ 1. Identify Task Type

  • Classification? Use Accuracy, F1, Precision, Recall
  • Generation? Use BLEU, ROUGE, BERTScore
  • Summarization? Use ROUGE-L, BERTScore, Factual Consistency
  • RAG? Separate Retrieval (MRR, NDCG) + Generation (Faithfulness)

☐ 2. Create Held-Out Test Set

  • Split data: 80% train, 10% validation, 10% test
  • OR 90% train, 10% test (if data limited)
  • Stratify by class (classification) or query type (RAG)
  • Test set must be representative and cover edge cases (see the split sketch below)
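
A minimal sketch of the stratified 80/10/10 split for classification data, assuming scikit-learn (the make_splits helper name is illustrative):

from sklearn.model_selection import train_test_split

def make_splits(texts, labels, seed=42):
    """80/10/10 train/validation/test split, stratified by label."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)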

☐ 3. Select Primary and Secondary Metrics

  • Primary: Main optimization target (F1, BLEU, ROUGE-L, MRR)
  • Secondary: Prevent gaming (factual consistency, compression ratio)
  • Guard rails: Safety, toxicity, bias checks (combined in the sketch below)
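
A minimal sketch of combining primary, secondary, and guard-rail metrics into a single accept/reject check (the field names, thresholds, and tolerance are illustrative assumptions):

def check_candidate(candidate, baseline, guard_thresholds, tolerance=0.02):
    """
    Accept a candidate only if the primary metric improves, no secondary
    metric regresses beyond the tolerance, and every guard rail passes.
    """
    reasons = []

    if candidate['primary'] <= baseline['primary']:
        reasons.append("primary metric did not improve")

    for name, value in candidate['secondary'].items():
        if value < baseline['secondary'][name] - tolerance:
            reasons.append(f"secondary metric '{name}' regressed")

    for name, minimum in guard_thresholds.items():
        if candidate['guards'][name] < minimum:
            reasons.append(f"guard rail '{name}' below threshold")

    return {'accept': not reasons, 'reasons': reasons}

# Example usage (illustrative numbers)
baseline = {'primary': 0.78, 'secondary': {'faithfulness': 0.90}}
candidate = {'primary': 0.83,
             'secondary': {'faithfulness': 0.91},
             'guards': {'safety_pass_rate': 0.995}}
print(check_candidate(candidate, baseline, {'safety_pass_rate': 0.99}))
# {'accept': True, 'reasons': []}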

☐ 4. Calculate Automated Metrics

  • Run evaluation on full test set
  • Calculate primary metric (e.g., F1 = 0.82)
  • Calculate secondary metrics (e.g., faithfulness = 0.91)
  • Save per-example predictions for error analysis

☐ 5. Human Evaluation

  • Sample 200-300 examples (stratified: random + high/low automated scores)
  • 3 annotators per example (inter-annotator agreement)
  • Dimensions: Fluency, Relevance, Helpfulness, Safety, Coherence
  • Check agreement (Cohen's Kappa > 0.6)

☐ 6. Compare to Baselines

  • Rule-based baseline (e.g., keyword matching)
  • Zero-shot baseline (e.g., GPT-3.5 with prompt)
  • Previous model (current production system)
  • Ensure new model outperforms all baselines

☐ 7. A/B Test in Production

  • 3 variants: Baseline (70%), New Model (15%), Alternative (15%)
  • Minimum 200-500 samples per variant
  • Test statistical significance (p < 0.05)
  • Check business impact (CSAT, completion rate)

☐ 8. Cost-Benefit Analysis

  • Improvement value: +0.5 CSAT × $10k/month per CSAT point = +$5k/month
  • Cost increase: +$0.02/query × 100k queries = +$2k/month
  • Net value: $5k - $2k = +$3k/month → APPROVE (worked in the sketch below)
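
The same arithmetic as a tiny helper (the figures are the illustrative ones above):

def net_monthly_value(csat_gain, value_per_csat_point, cost_delta_per_query, monthly_queries):
    """Net monthly value of switching models: benefit minus added cost."""
    benefit = csat_gain * value_per_csat_point
    added_cost = cost_delta_per_query * monthly_queries
    return benefit - added_cost

# +0.5 CSAT at $10k/month per point, +$0.02/query across 100k queries/month
print(net_monthly_value(0.5, 10_000, 0.02, 100_000))  # 3000.0 → APPROVE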

☐ 9. Gradual Rollout

  • Phase 1: 5% traffic (1 week) → Monitor for issues
  • Phase 2: 25% traffic (1 week) → Confirm trends
  • Phase 3: 50% traffic (1 week) → Final validation
  • Phase 4: 100% rollout → Only if all metrics stable (see the gate sketch below)
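
A minimal sketch of a rollout gate that advances traffic only when the previous phase produced no alerts (phases mirror the list above; the alerts input is assumed to come from something like ProductionMonitor.detect_regression):

ROLLOUT_PHASES = [5, 25, 50, 100]  # Percent of traffic per phase

def next_traffic_share(current_share, alerts):
    """Advance one phase on a clean monitoring window; roll back on any alert."""
    if alerts:
        return ROLLOUT_PHASES[0]  # Any alert: drop back to the smallest share
    idx = ROLLOUT_PHASES.index(current_share)
    return ROLLOUT_PHASES[min(idx + 1, len(ROLLOUT_PHASES) - 1)]

# Example: clean week at 25% advances to 50%; an alert sends traffic back to 5%
print(next_traffic_share(25, alerts={}))               # 50
print(next_traffic_share(25, alerts={'csat': '...'}))  # 5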

☐ 10. Production Monitoring

  • Set baseline metrics from first week
  • Monitor daily: CSAT, completion rate, accuracy, latency, cost
  • Alert on >5% regression in critical metrics
  • Weekly review: Check for data drift, quality issues

Common Pitfalls and How to Avoid Them

Pitfall 1: No Evaluation Strategy

Symptom: "I'll just look at a few examples to see if it works."

Fix: Mandatory held-out test set with quantitative metrics. Never ship without numbers.

Pitfall 2: Wrong Metrics for Task

Symptom: Using accuracy for generation tasks, BLEU for classification.

Fix: Match metric family to task type. See Part 1 tables.

Pitfall 3: Automated Metrics Only

Symptom: BLEU increased to 0.45 but users complain about quality.

Fix: Always combine automated + human + production metrics. All three must improve.

Pitfall 4: Single Metric Optimization

Symptom: ROUGE-L optimized but summaries are verbose and contain hallucinations.

Fix: Multi-dimensional evaluation with guard rails. Reject regressions on secondary metrics.

Pitfall 5: No Baseline Comparison

Symptom: "Our model achieves 82% accuracy!" (Is that good? Better than what?)

Fix: Always compare to baselines: rule-based, zero-shot, previous model.

Pitfall 6: No A/B Testing

Symptom: Deploy new model, discover it's worse than baseline, scramble to rollback.

Fix: A/B test with statistical significance before full deployment.

Pitfall 7: Insufficient Sample Size

Symptom: "We tested on 20 examples and it looks good!"

Fix: Minimum 200-500 samples for human evaluation, 200-500 per variant for A/B testing.

Pitfall 8: No Production Monitoring

Symptom: Model quality degrades over time (data drift) but nobody notices until users complain.

Fix: Continuous monitoring with automated alerts on metric regressions.

Summary

Evaluation is mandatory, not optional.

Complete evaluation = Automated metrics (efficiency) + Human evaluation (quality) + Production metrics (impact)

Core principles:

  1. Match metrics to task type (classification vs generation)
  2. Multi-dimensional scoring prevents gaming single metrics
  3. Human evaluation catches issues automated metrics miss
  4. A/B testing proves value before full deployment
  5. Production monitoring detects regressions and drift

Checklist: Task type → Test set → Metrics → Automated eval → Human eval → Baselines → A/B test → Cost-benefit → Gradual rollout → Production monitoring

Without rigorous evaluation, you don't know if your system works. Evaluation is how you make engineering decisions with confidence instead of guesses.