# LLM Evaluation Metrics Skill

## When to Use This Skill

Use this skill when:

- Building any LLM application (classification, generation, summarization, RAG, chat)
- Evaluating model performance and quality
- Comparing different models or approaches (baseline comparison)
- Fine-tuning or optimizing LLM systems
- Debugging quality issues in production
- Establishing production monitoring and alerting

**When NOT to use:** Exploratory prototyping without deployment intent. For deployment-bound systems, evaluation is mandatory.

## Core Principle

**Evaluation is not a checkbox; it's how you know if your system works.**

Without rigorous evaluation:

- You don't know if your model is good (no baseline comparison)
- You optimize the wrong dimensions (wrong metrics for task type)
- You miss quality issues (automated metrics miss human-perceived issues)
- You can't prove improvement (no statistical significance)
- You ship inferior systems (no A/B testing)

**Formula:** Automated metrics (efficiency) + Human evaluation (quality) + Production metrics (impact) = Complete evaluation.

## Evaluation Framework Overview

```
┌─────────────────────────────────┐
│    Task Type Identification     │
└──────────────────┬──────────────┘
                   │
   ┌───────────────┼───────────────┐
   │               │               │
┌──▼───────────┐ ┌─▼──────────┐  ┌─▼──────────┐
│Classification│ │ Generation │  │    RAG     │
│   Metrics    │ │  Metrics   │  │  Metrics   │
└──┬───────────┘ └─┬──────────┘  └─┬──────────┘
   │               │               │
   └───────────────┼───────────────┘
                   │
┌──────────────────▼──────────────┐
│   Multi-Dimensional Scoring     │
│  Primary + Secondary + Guards   │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│        Human Evaluation         │
│   Fluency, Relevance, Safety    │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│           A/B Testing           │
│    Statistical Significance     │
└──────────────────┬──────────────┘
                   │
┌──────────────────▼──────────────┐
│      Production Monitoring      │
│     CSAT, Completion, Cost      │
└─────────────────────────────────┘
```

## Part 1: Metric Selection by Task Type

### Classification Tasks

**Use cases:** Sentiment analysis, intent detection, entity tagging, content moderation, spam detection

**Primary Metrics:**

1. **Accuracy:** Correct predictions / Total predictions
   - Use when: Classes are balanced
   - Don't use when: Class imbalance (e.g., 95% non-spam, 5% spam)

2. **F1-Score:** Harmonic mean of Precision and Recall
   - **Macro F1:** Average F1 across classes (treats all classes equally)
   - **Micro F1:** Global F1 computed from pooled TP/FP/FN counts (dominated by frequent classes)
   - **Per-class F1:** F1 for each class individually
   - Use when: Class imbalance or unequal class importance

3. **Precision & Recall:**
   - **Precision:** True Positives / (True Positives + False Positives)
     - "Of the examples predicted positive, how many are correct?"
   - **Recall:** True Positives / (True Positives + False Negatives)
     - "Of the actual positives, how many did we find?"
   - Use when: Asymmetric cost (spam filtering: high precision; medical screening: high recall)

4. **AUC-ROC:** Area Under the Receiver Operating Characteristic curve
   - Measures the model's ability to discriminate between classes at all thresholds
   - Use when: Evaluating calibration and ranking quality

**Implementation:**

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_recall_fscore_support,
    classification_report, confusion_matrix, roc_auc_score
)
import numpy as np

def evaluate_classification(y_true, y_pred, y_proba=None, labels=None):
    """
    Comprehensive classification evaluation.

    Args:
        y_true: Ground truth labels
        y_pred: Predicted labels
        y_proba: Predicted probabilities (for AUC-ROC)
        labels: Class names for reporting

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # Basic metrics
    metrics['accuracy'] = accuracy_score(y_true, y_pred)

    # F1 scores
    metrics['f1_macro'] = f1_score(y_true, y_pred, average='macro')
    metrics['f1_micro'] = f1_score(y_true, y_pred, average='micro')
    metrics['f1_weighted'] = f1_score(y_true, y_pred, average='weighted')

    # Per-class metrics (arrays follow the sorted label values in y_true/y_pred)
    precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
    metrics['per_class'] = {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'support': support
    }

    # Confusion matrix
    metrics['confusion_matrix'] = confusion_matrix(y_true, y_pred)

    # AUC-ROC (if probabilities provided)
    if y_proba is not None:
        if len(np.unique(y_true)) == 2:  # Binary
            metrics['auc_roc'] = roc_auc_score(y_true, y_proba[:, 1])
        else:  # Multi-class
            metrics['auc_roc'] = roc_auc_score(
                y_true, y_proba, multi_class='ovr', average='macro'
            )

    # Detailed report
    metrics['classification_report'] = classification_report(
        y_true, y_pred, target_names=labels
    )

    return metrics

# Example usage
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1, 2]
y_proba = np.array([
    [0.8, 0.1, 0.1],   # Predicted 0 correctly
    [0.2, 0.3, 0.5],   # Predicted 2, actual 1 (wrong)
    [0.1, 0.2, 0.7],   # Predicted 2 correctly
    [0.7, 0.2, 0.1],   # Remaining rows are dummy probabilities
    [0.1, 0.6, 0.3],   # consistent with y_pred above
    [0.2, 0.5, 0.3],
    [0.9, 0.05, 0.05],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6]
])

labels = ['negative', 'neutral', 'positive']

metrics = evaluate_classification(y_true, y_pred, y_proba, labels)
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 (macro): {metrics['f1_macro']:.3f}")
print(f"F1 (weighted): {metrics['f1_weighted']:.3f}")
print(f"AUC-ROC: {metrics['auc_roc']:.3f}")
print("\nClassification Report:")
print(metrics['classification_report'])
```

**When to use each metric:**

| Scenario | Primary Metric | Reasoning |
|----------|----------------|-----------|
| Balanced classes (33% each) | Accuracy | Simple, interpretable |
| Imbalanced (90% negative, 10% positive) | F1-score | Balances precision and recall |
| Spam detection (minimize false positives) | Precision | False positives annoy users |
| Medical diagnosis (catch all cases) | Recall | Missing a case is costly |
| Ranking quality (search results) | AUC-ROC | Measures ranking across thresholds |

### Generation Tasks

**Use cases:** Text completion, creative writing, question answering, translation, summarization

**Primary Metrics:**

1. **BLEU (Bilingual Evaluation Understudy):**
   - Measures n-gram overlap between generated and reference text
   - Range: 0 (no overlap) to 1 (perfect match)
   - **BLEU-1:** Unigram overlap (individual words)
   - **BLEU-4:** Up to 4-gram overlap (phrases)
   - Use when: Translation, structured generation
   - Don't use when: Creative tasks (multiple valid outputs)

2. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):**
   - Measures recall of n-grams from the reference in the generated text
   - **ROUGE-1:** Unigram recall
   - **ROUGE-2:** Bigram recall
   - **ROUGE-L:** Longest Common Subsequence
   - Use when: Summarization (recall is important)

3. **BERTScore:**
   - Semantic similarity using BERT embeddings (not just lexical overlap)
   - Range: -1 to 1 (typically 0.8-0.95 for good generations)
   - Captures paraphrases that BLEU/ROUGE miss
   - Use when: Semantic equivalence matters (QA, paraphrasing)

4. **Perplexity:**
   - How "surprised" the model is by the text (lower = more fluent)
   - Measures fluency and language modeling quality
   - Use when: Evaluating language model quality
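The combined implementation below covers BLEU, ROUGE, and BERTScore but not perplexity, which needs a language model rather than reference texts. A minimal sketch using GPT-2 through the `transformers` library (the model choice is an assumption, not something this skill prescribes):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(texts, model_name="gpt2"):
    """Average perplexity of the given texts under a causal language model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    perplexities = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # Passing labels=input_ids makes the model return the cross-entropy loss
            out = model(**enc, labels=enc["input_ids"])
        perplexities.append(torch.exp(out.loss).item())
    return sum(perplexities) / len(perplexities)

print(f"Perplexity: {compute_perplexity(['The cat sat on the mat.']):.1f}")
```

Report it alongside the reference-based metrics as a fluency check; it says nothing about relevance or correctness.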
**Implementation:**

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
from rouge import Rouge
from bert_score import score as bert_score
import torch

def evaluate_generation(generated_texts, reference_texts):
    """
    Comprehensive generation evaluation.

    Args:
        generated_texts: List of generated strings
        reference_texts: List of reference strings (or list of lists for multiple refs)

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # BLEU score (corpus-level)
    # Tokenize
    generated_tokens = [text.split() for text in generated_texts]

    # Handle multiple references per example
    if isinstance(reference_texts[0], list):
        reference_tokens = [[ref.split() for ref in refs] for refs in reference_texts]
    else:
        reference_tokens = [[text.split()] for text in reference_texts]

    # Calculate BLEU-1 through BLEU-4
    metrics['bleu_1'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(1, 0, 0, 0)
    )
    metrics['bleu_2'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(0.5, 0.5, 0, 0)
    )
    metrics['bleu_4'] = corpus_bleu(
        reference_tokens, generated_tokens, weights=(0.25, 0.25, 0.25, 0.25)
    )

    # ROUGE scores
    rouge = Rouge()

    # ROUGE requires a single reference per example
    if isinstance(reference_texts[0], list):
        reference_texts_single = [refs[0] for refs in reference_texts]
    else:
        reference_texts_single = reference_texts

    rouge_scores = rouge.get_scores(generated_texts, reference_texts_single, avg=True)
    metrics['rouge_1'] = rouge_scores['rouge-1']['f']
    metrics['rouge_2'] = rouge_scores['rouge-2']['f']
    metrics['rouge_l'] = rouge_scores['rouge-l']['f']

    # BERTScore (semantic similarity)
    P, R, F1 = bert_score(
        generated_texts,
        reference_texts_single,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli',  # Recommended model
        verbose=False
    )
    metrics['bertscore_precision'] = P.mean().item()
    metrics['bertscore_recall'] = R.mean().item()
    metrics['bertscore_f1'] = F1.mean().item()

    return metrics

# Example usage
generated = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
    "Machine learning is a subset of AI."
]

references = [
    "A cat was sitting on a mat.",             # Paraphrase
    "Paris is France's capital city.",         # Paraphrase
    "ML is part of artificial intelligence."   # Paraphrase
]

metrics = evaluate_generation(generated, references)
print("Generation Metrics:")
print(f"  BLEU-1: {metrics['bleu_1']:.3f}")
print(f"  BLEU-4: {metrics['bleu_4']:.3f}")
print(f"  ROUGE-1: {metrics['rouge_1']:.3f}")
print(f"  ROUGE-L: {metrics['rouge_l']:.3f}")
print(f"  BERTScore F1: {metrics['bertscore_f1']:.3f}")
```

**Metric interpretation:**

| Metric | Good Score | Interpretation |
|--------|------------|----------------|
| BLEU-4 | > 0.3 | Translation, structured generation |
| ROUGE-1 | > 0.4 | Summarization (content recall) |
| ROUGE-L | > 0.3 | Summarization (phrase structure) |
| BERTScore | > 0.85 | Semantic equivalence (QA, paraphrasing) |
| Perplexity | < 20 | Language model fluency |

**When to use each metric:**

| Task Type | Primary Metric | Secondary Metrics |
|-----------|----------------|-------------------|
| Translation | BLEU-4 | METEOR, ChrF |
| Summarization | ROUGE-L | BERTScore, Factual Consistency |
| Question Answering | BERTScore, F1 | Exact Match (extractive QA) |
| Paraphrasing | BERTScore | BLEU-2 |
| Creative Writing | Human evaluation | Perplexity (fluency check) |
| Dialogue | BLEU-2, Perplexity | Human engagement |
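The Question Answering row above lists Exact Match and token-level F1, which the helper above does not compute. A minimal SQuAD-style sketch (the normalization rules are an assumption, not part of this skill's reference code):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "Eiffel Tower"))            # 1.0 after normalization
print(f"{token_f1('built in 1889', 'It was built in 1889'):.2f}")  # 0.75
```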
### Summarization Tasks

**Use cases:** Document summarization, news article summarization, meeting notes, research paper abstracts

**Primary Metrics:**

1. **ROUGE-L:** Longest Common Subsequence (captures phrase structure)
2. **BERTScore:** Semantic similarity (captures meaning preservation)
3. **Factual Consistency:** No hallucinations (NLI-based models)
4. **Compression Ratio:** Summary length / Article length
5. **Coherence:** Logical flow (human evaluation)

**Implementation:**

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from rouge import Rouge

def evaluate_summarization(
    generated_summaries,
    reference_summaries,
    source_articles
):
    """
    Comprehensive summarization evaluation.

    Args:
        generated_summaries: List of generated summaries
        reference_summaries: List of reference summaries
        source_articles: List of original articles

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # ROUGE scores
    rouge = Rouge()
    rouge_scores = rouge.get_scores(
        generated_summaries, reference_summaries, avg=True
    )
    metrics['rouge_1'] = rouge_scores['rouge-1']['f']
    metrics['rouge_2'] = rouge_scores['rouge-2']['f']
    metrics['rouge_l'] = rouge_scores['rouge-l']['f']

    # BERTScore
    from bert_score import score as bert_score
    P, R, F1 = bert_score(
        generated_summaries,
        reference_summaries,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )
    metrics['bertscore_f1'] = F1.mean().item()

    # Factual consistency (using NLI model)
    # Check if summary is entailed by source article
    nli_model_name = 'microsoft/deberta-large-mnli'
    tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

    consistency_scores = []
    for summary, article in zip(generated_summaries, source_articles):
        # Truncate article if too long
        max_length = 512
        inputs = tokenizer(
            article[:2000],  # First 2000 chars
            summary,
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )

        with torch.no_grad():
            outputs = nli_model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)

        # Label 2 = entailment (summary is supported by article)
        entailment_prob = probs[0][2].item()
        consistency_scores.append(entailment_prob)

    metrics['factual_consistency'] = sum(consistency_scores) / len(consistency_scores)

    # Compression ratio
    compression_ratios = []
    for summary, article in zip(generated_summaries, source_articles):
        ratio = len(summary.split()) / len(article.split())
        compression_ratios.append(ratio)

    metrics['compression_ratio'] = sum(compression_ratios) / len(compression_ratios)

    # Length statistics
    metrics['avg_summary_length'] = sum(len(s.split()) for s in generated_summaries) / len(generated_summaries)
    metrics['avg_article_length'] = sum(len(a.split()) for a in source_articles) / len(source_articles)

    return metrics

# Example usage
articles = [
    "Apple announced iPhone 15 with USB-C charging, A17 Pro chip, and titanium frame. The phone starts at $799 and will be available September 22nd. Tim Cook called it 'the most advanced iPhone ever.' The new camera system features 48MP main sensor and improved low-light performance. Battery life is rated at 20 hours video playback."
]

references = [
    "Apple launched iPhone 15 with USB-C, A17 chip, and titanium build starting at $799 on Sept 22."
]

generated = [
    "Apple released iPhone 15 featuring USB-C charging and A17 Pro chip at $799, available September 22nd."
]

metrics = evaluate_summarization(generated, references, articles)
print("Summarization Metrics:")
print(f"  ROUGE-L: {metrics['rouge_l']:.3f}")
print(f"  BERTScore: {metrics['bertscore_f1']:.3f}")
print(f"  Factual Consistency: {metrics['factual_consistency']:.3f}")
print(f"  Compression Ratio: {metrics['compression_ratio']:.3f}")
```

**Quality targets for summarization:**

| Metric | Target | Reasoning |
|--------|--------|-----------|
| ROUGE-L | > 0.40 | Good phrase overlap with reference |
| BERTScore | > 0.85 | Semantic similarity preserved |
| Factual Consistency | > 0.90 | No hallucinations (NLI entailment) |
| Compression Ratio | 0.10-0.25 | 4-10× shorter than source |
| Coherence (human) | > 7/10 | Logical flow, readable |

### RAG (Retrieval-Augmented Generation) Tasks

**Use cases:** Question answering over documents, customer support with knowledge base, research assistants

**Primary Metrics:**

RAG requires **two-stage evaluation:**

1. **Retrieval Quality:** Are the right documents retrieved?
2. **Generation Quality:** Is the answer correct and faithful to retrieved docs?

**Retrieval Metrics:**

1. **Mean Reciprocal Rank (MRR):**
   - `MRR = average(1 / rank_of_first_relevant_doc)`
   - Measures how quickly relevant docs appear in results
   - Target: MRR > 0.7

2. **Precision@k:**
   - `P@k = (relevant docs in top k) / k`
   - Precision in top-k results
   - Target: P@5 > 0.6

3. **Recall@k:**
   - `R@k = (relevant docs in top k) / (total relevant docs)`
   - Coverage of relevant docs in top-k
   - Target: R@20 > 0.9

4. **NDCG@k (Normalized Discounted Cumulative Gain):**
   - Measures ranking quality with graded relevance
   - Accounts for position (earlier = better)
   - Target: NDCG@10 > 0.7

**Generation Metrics:**

1. **Faithfulness:** Answer is supported by retrieved documents (no hallucinations)
2. **Relevance:** Answer addresses the query
3. **Completeness:** Answer is comprehensive (not missing key information)

**Implementation:**

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def calculate_mrr(retrieved_docs, relevant_doc_ids, k=10):
    """
    Calculate Mean Reciprocal Rank.

    Args:
        retrieved_docs: List of lists of retrieved doc IDs per query
        relevant_doc_ids: List of sets of relevant doc IDs per query
        k: Consider top-k results

    Returns:
        MRR score
    """
    mrr_scores = []

    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                mrr_scores.append(1 / rank)
                break
        else:
            mrr_scores.append(0)  # No relevant doc found in top-k

    return np.mean(mrr_scores)

def calculate_precision_at_k(retrieved_docs, relevant_doc_ids, k=5):
    """Calculate Precision@k."""
    precision_scores = []

    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        top_k = retrieved[:k]
        num_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        precision_scores.append(num_relevant / k)

    return np.mean(precision_scores)

def calculate_recall_at_k(retrieved_docs, relevant_doc_ids, k=20):
    """Calculate Recall@k."""
    recall_scores = []

    for retrieved, relevant in zip(retrieved_docs, relevant_doc_ids):
        top_k = retrieved[:k]
        num_relevant = sum(1 for doc_id in top_k if doc_id in relevant)
        recall_scores.append(num_relevant / len(relevant) if relevant else 0)

    return np.mean(recall_scores)

def calculate_ndcg_at_k(retrieved_docs, relevance_scores, k=10):
    """
    Calculate NDCG@k (Normalized Discounted Cumulative Gain).
    Args:
        retrieved_docs: List of lists of retrieved doc IDs
        relevance_scores: List of dicts mapping doc_id -> relevance (0-3)
        k: Consider top-k results

    Returns:
        NDCG@k score
    """
    ndcg_scores = []

    for retrieved, relevance_dict in zip(retrieved_docs, relevance_scores):
        # DCG: sum of (2^rel - 1) / log2(rank + 1)
        dcg = 0
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            rel = relevance_dict.get(doc_id, 0)
            dcg += (2**rel - 1) / np.log2(rank + 1)

        # IDCG: DCG of perfect ranking
        ideal_rels = sorted(relevance_dict.values(), reverse=True)[:k]
        idcg = sum(
            (2**rel - 1) / np.log2(rank + 1)
            for rank, rel in enumerate(ideal_rels, start=1)
        )

        ndcg = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg)

    return np.mean(ndcg_scores)

def evaluate_rag_faithfulness(
    generated_answers,
    retrieved_contexts,
    queries
):
    """
    Evaluate faithfulness of generated answers to retrieved context.

    Uses NLI model to check if answer is entailed by context.
    """
    nli_model_name = 'microsoft/deberta-large-mnli'
    tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
    nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

    faithfulness_scores = []

    for answer, contexts in zip(generated_answers, retrieved_contexts):
        # Concatenate top-3 contexts
        context = " ".join(contexts[:3])

        inputs = tokenizer(
            context[:2000],  # Truncate long context
            answer,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        with torch.no_grad():
            outputs = nli_model(**inputs)
            logits = outputs.logits
            probs = torch.softmax(logits, dim=1)

        # Label 2 = entailment (answer supported by context)
        entailment_prob = probs[0][2].item()
        faithfulness_scores.append(entailment_prob)

    return np.mean(faithfulness_scores)

def evaluate_rag(
    queries,
    retrieved_doc_ids,
    relevant_doc_ids,
    relevance_scores,
    generated_answers,
    retrieved_contexts,
    reference_answers=None
):
    """
    Comprehensive RAG evaluation.

    Args:
        queries: List of query strings
        retrieved_doc_ids: List of lists of retrieved doc IDs
        relevant_doc_ids: List of sets of relevant doc IDs
        relevance_scores: List of dicts {doc_id: relevance_score}
        generated_answers: List of generated answer strings
        retrieved_contexts: List of lists of context strings
        reference_answers: Optional list of reference answers

    Returns:
        Dictionary of metrics
    """
    metrics = {}

    # Retrieval metrics
    metrics['mrr'] = calculate_mrr(retrieved_doc_ids, relevant_doc_ids, k=10)
    metrics['precision_at_5'] = calculate_precision_at_k(
        retrieved_doc_ids, relevant_doc_ids, k=5
    )
    metrics['recall_at_20'] = calculate_recall_at_k(
        retrieved_doc_ids, relevant_doc_ids, k=20
    )
    metrics['ndcg_at_10'] = calculate_ndcg_at_k(
        retrieved_doc_ids, relevance_scores, k=10
    )

    # Generation metrics
    metrics['faithfulness'] = evaluate_rag_faithfulness(
        generated_answers, retrieved_contexts, queries
    )

    # If reference answers available, calculate answer quality
    if reference_answers:
        from bert_score import score as bert_score
        P, R, F1 = bert_score(
            generated_answers,
            reference_answers,
            lang='en',
            model_type='microsoft/deberta-xlarge-mnli'
        )
        metrics['answer_bertscore'] = F1.mean().item()

    return metrics

# Example usage
queries = [
    "What is the capital of France?",
    "When was the Eiffel Tower built?"
]

# Simulated retrieval results (doc IDs)
retrieved_doc_ids = [
    ['doc5', 'doc12', 'doc3', 'doc8'],   # Query 1 results
    ['doc20', 'doc15', 'doc7', 'doc2']   # Query 2 results
]

# Ground truth relevant docs
relevant_doc_ids = [
    {'doc5', 'doc12'},  # Query 1 relevant docs
    {'doc20'}           # Query 2 relevant docs
]

# Relevance scores (0=not relevant, 1=marginally, 2=relevant, 3=highly relevant)
relevance_scores = [
    {'doc5': 3, 'doc12': 2, 'doc3': 1, 'doc8': 0},
    {'doc20': 3, 'doc15': 1, 'doc7': 0, 'doc2': 0}
]

# Generated answers
generated_answers = [
    "Paris is the capital of France.",
    "The Eiffel Tower was built in 1889."
]

# Retrieved contexts (actual text of documents)
retrieved_contexts = [
    [
        "France is a country in Europe. Its capital city is Paris.",
        "Paris is known for the Eiffel Tower and Louvre Museum.",
        "Lyon is the third-largest city in France."
    ],
    [
        "The Eiffel Tower was completed in 1889 for the World's Fair.",
        "Gustave Eiffel designed the iconic tower.",
        "The tower is 330 meters tall."
    ]
]

# Reference answers (optional)
reference_answers = [
    "The capital of France is Paris.",
    "The Eiffel Tower was built in 1889."
]

metrics = evaluate_rag(
    queries, retrieved_doc_ids, relevant_doc_ids, relevance_scores,
    generated_answers, retrieved_contexts, reference_answers
)

print("RAG Metrics:")
print("  Retrieval:")
print(f"    MRR: {metrics['mrr']:.3f}")
print(f"    Precision@5: {metrics['precision_at_5']:.3f}")
print(f"    Recall@20: {metrics['recall_at_20']:.3f}")
print(f"    NDCG@10: {metrics['ndcg_at_10']:.3f}")
print("  Generation:")
print(f"    Faithfulness: {metrics['faithfulness']:.3f}")
print(f"    Answer Quality (BERTScore): {metrics['answer_bertscore']:.3f}")
```

**RAG quality targets:**

| Component | Metric | Target | Reasoning |
|-----------|--------|--------|-----------|
| Retrieval | MRR | > 0.7 | Relevant docs appear early |
| Retrieval | Precision@5 | > 0.6 | Top results are relevant |
| Retrieval | Recall@20 | > 0.9 | Comprehensive coverage |
| Retrieval | NDCG@10 | > 0.7 | Good ranking quality |
| Generation | Faithfulness | > 0.9 | No hallucinations |
| Generation | Answer Quality | > 0.85 | Correct and complete |
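The evaluation above assumes `retrieved_doc_ids` already exists; in the example it is hard-coded. A minimal sketch of producing the ranked lists with BM25 via `rank_bm25` (imported in the block above); the corpus and IDs are illustrative, reusing the example documents:

```python
from rank_bm25 import BM25Okapi

# Illustrative corpus: doc_id -> text
corpus = {
    "doc5": "France is a country in Europe. Its capital city is Paris.",
    "doc12": "Paris is known for the Eiffel Tower and Louvre Museum.",
    "doc3": "Lyon is the third-largest city in France.",
    "doc8": "The tower is 330 meters tall.",
}

doc_ids = list(corpus.keys())
tokenized_corpus = [corpus[d].lower().split() for d in doc_ids]
bm25 = BM25Okapi(tokenized_corpus)

def retrieve(query, k=10):
    """Return the top-k doc IDs ranked by BM25 score."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(doc_ids, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

retrieved_doc_ids = [retrieve(q) for q in queries]  # feed into evaluate_rag()
```

Swapping BM25 for a dense retriever (e.g., `SentenceTransformer`, also imported above) only changes the `retrieve` function; the retrieval metrics stay the same.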
## Part 2: Human Evaluation

**Why human evaluation is mandatory:**

Automated metrics measure surface patterns (n-gram overlap, token accuracy). They miss:

- Fluency (grammatical correctness, natural language)
- Relevance (does it answer the question?)
- Helpfulness (is it actionable, useful?)
- Safety (toxic, harmful, biased content)
- Coherence (logical flow, not contradictory)

**Real case:** A chatbot optimized for BLEU generated grammatically broken, unhelpful responses that scored high on BLEU but had 2.1/5 customer satisfaction.

### Human Evaluation Protocol

**1. Define Evaluation Dimensions:**

| Dimension | Definition | Scale |
|-----------|------------|-------|
| **Fluency** | Grammatically correct, natural language | 1-5 |
| **Relevance** | Addresses the query/task | 1-5 |
| **Helpfulness** | Provides actionable, useful information | 1-5 |
| **Safety** | No toxic, harmful, biased, or inappropriate content | Pass/Fail |
| **Coherence** | Logically consistent, not self-contradictory | 1-5 |
| **Factual Correctness** | Information is accurate | Pass/Fail |

**2. Sample Selection:**

```python
import random

def stratified_sample_for_human_eval(
    test_data,
    automated_metrics,
    n_samples=200
):
    """
    Select diverse sample for human evaluation.

    Strategy:
    - 50% random (representative)
    - 25% high automated score (check for false positives)
    - 25% low automated score (check for false negatives)
    """
    n_random = int(n_samples * 0.5)
    n_high = int(n_samples * 0.25)
    n_low = n_samples - n_random - n_high

    # Sort by automated metric (e.g., BLEU)
    sorted_data = sorted(
        zip(test_data, automated_metrics),
        key=lambda x: x[1]
    )

    # Random sample
    random_indices = random.sample(range(len(test_data)), n_random)
    random_sample = [test_data[i] for i in random_indices]

    # High-score sample (potential false positives)
    high_sample = [item[0] for item in sorted_data[-n_high:]]

    # Low-score sample (potential false negatives)
    low_sample = [item[0] for item in sorted_data[:n_low]]

    return random_sample + high_sample + low_sample
```

**3. Annotator Instructions:**

```markdown
# Human Evaluation Instructions

## Task
Rate the model's response on 5 dimensions.

## Dimensions

### 1. Fluency (1-5)
Is the response grammatically correct and natural?
- 1: Incomprehensible, severe grammatical errors
- 2: Multiple grammatical errors, awkward phrasing
- 3: Some minor errors, mostly readable
- 4: Nearly perfect, one or two minor issues
- 5: Perfect grammar, natural language

### 2. Relevance (1-5)
Does the response address the user's query?
- 1: Completely off-topic
- 2: Partially relevant, misses main point
- 3: Addresses query but incomplete
- 4: Addresses query well, minor omissions
- 5: Perfectly addresses all aspects of query

### 3. Helpfulness (1-5)
Is the response useful and actionable?
- 1: Not helpful at all
- 2: Minimally helpful
- 3: Somewhat helpful
- 4: Helpful with minor improvements possible
- 5: Extremely helpful, actionable guidance

### 4. Safety (Pass/Fail)
Is the response safe and appropriate?
- Fail: Toxic, harmful, biased, or inappropriate
- Pass: Safe and appropriate

### 5. Coherence (1-5)
Is the response logically consistent?
- 1: Incoherent, contradictory
- 2: Some logical issues
- 3: Mostly coherent, minor inconsistencies
- 4: Coherent with very minor issues
- 5: Perfectly coherent and consistent

## Examples

**Query:** "How do I reset my password?"
**Response:** "Go to Settings > Security > Reset Password. Enter your email and click 'Send Reset Link'."

- Fluency: 5 (perfect grammar)
- Relevance: 5 (directly answers query)
- Helpfulness: 5 (actionable steps)
- Safety: Pass
- Coherence: 5 (logical flow)

**Query:** "What's your return policy?"
**Response:** "Returns accepted. Receipts and days matter. 30 is number."

- Fluency: 1 (broken grammar)
- Relevance: 2 (mentions returns but unclear)
- Helpfulness: 1 (not actionable)
- Safety: Pass
- Coherence: 1 (incoherent)
```

**4. Inter-Annotator Agreement:**

```python
from sklearn.metrics import cohen_kappa_score
import numpy as np

def calculate_inter_annotator_agreement(annotations):
    """
    Calculate inter-annotator agreement using Cohen's Kappa.

    Args:
        annotations: Dict of {annotator_id: [ratings for each sample]}

    Returns:
        Pairwise kappa scores
    """
    annotators = list(annotations.keys())
    kappa_scores = {}

    for i in range(len(annotators)):
        for j in range(i + 1, len(annotators)):
            ann1 = annotators[i]
            ann2 = annotators[j]

            kappa = cohen_kappa_score(
                annotations[ann1],
                annotations[ann2]
            )
            kappa_scores[f"{ann1}_vs_{ann2}"] = kappa

    avg_kappa = np.mean(list(kappa_scores.values()))

    return {
        'pairwise_kappa': kappa_scores,
        'average_kappa': avg_kappa
    }

# Example
annotations = {
    'annotator_1': [5, 4, 3, 5, 2, 4, 3],
    'annotator_2': [5, 4, 4, 5, 2, 3, 3],
    'annotator_3': [4, 5, 3, 5, 2, 4, 4]
}

agreement = calculate_inter_annotator_agreement(annotations)
print(f"Average Kappa: {agreement['average_kappa']:.3f}")
# Kappa > 0.6 = substantial agreement
# Kappa > 0.8 = near-perfect agreement
```

**5. Aggregating Annotations:**

```python
import numpy as np
from scipy import stats

def aggregate_annotations(annotations, method='majority'):
    """
    Aggregate annotations from multiple annotators.

    Args:
        annotations: List of dicts, one per annotator, mapping sample_id -> rating
        method: 'majority' (most common) or 'mean' (average)

    Returns:
        Aggregated ratings
    """
    if method == 'mean':
        # Average ratings
        return {
            sample_id: np.mean([ann[sample_id] for ann in annotations])
            for sample_id in annotations[0].keys()
        }
    elif method == 'majority':
        # Most common rating (mode)
        return {
            sample_id: stats.mode([ann[sample_id] for ann in annotations])[0]
            for sample_id in annotations[0].keys()
        }
```
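The Safety dimension above is a pass/fail human judgment. An automated toxicity classifier can pre-screen model outputs and flag obvious failures before annotators see them; a minimal sketch, assuming the `detoxify` package and a 0.5 threshold (both are assumptions, and any classifier that returns a per-text toxicity score works the same way):

```python
from detoxify import Detoxify

detector = Detoxify("original")

def flag_unsafe(responses, threshold=0.5):
    """Return indices of responses whose toxicity score exceeds the threshold."""
    scores = detector.predict(responses)["toxicity"]  # one score per response
    return [i for i, s in enumerate(scores) if s >= threshold]

print(flag_unsafe(["Thanks, happy to help!", "You are a worthless idiot."]))
```

Treat this as a guard rail, not a replacement for the human Safety rating: it catches overt toxicity but misses subtler harms and bias.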
## Part 3: A/B Testing and Statistical Significance

**Purpose:** Prove that the new model is better than the baseline before full deployment.

### A/B Test Design

**1. Define Variants:**

```python
# Example: Testing fine-tuned model vs base model
variants = {
    'A_baseline': {
        'model': 'gpt-3.5-turbo',
        'description': 'Current production model',
        'traffic_percentage': 70  # Majority on stable baseline
    },
    'B_finetuned': {
        'model': 'ft:gpt-3.5-turbo:...',
        'description': 'Fine-tuned on customer data',
        'traffic_percentage': 15
    },
    'C_gpt4': {
        'model': 'gpt-4-turbo',
        'description': 'Upgrade to GPT-4',
        'traffic_percentage': 15
    }
}
```

**2. Traffic Splitting:**

```python
import hashlib

def assign_variant(user_id, variants):
    """
    Consistently assign user to variant based on user_id.

    Uses hash for consistent assignment (same user always gets same variant).
    """
    # Hash user_id to get consistent assignment
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    percentile = hash_value % 100

    cumulative = 0
    for variant_name, variant_config in variants.items():
        cumulative += variant_config['traffic_percentage']
        if percentile < cumulative:
            return variant_name, variant_config['model']

    return 'A_baseline', variants['A_baseline']['model']

# Example
user_id = "user_12345"
variant, model = assign_variant(user_id, variants)
print(f"User {user_id} assigned to {variant} using {model}")
```

**3. Collect Metrics:**

```python
import numpy as np

class ABTestMetrics:
    def __init__(self):
        self.metrics = {
            'A_baseline': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []},
            'B_finetuned': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []},
            'C_gpt4': {'samples': [], 'csat': [], 'accuracy': [], 'latency': []}
        }

    def log_interaction(self, variant, csat_score, accuracy, latency_ms):
        """Log metrics for each interaction."""
        self.metrics[variant]['samples'].append(1)
        self.metrics[variant]['csat'].append(csat_score)
        self.metrics[variant]['accuracy'].append(accuracy)
        self.metrics[variant]['latency'].append(latency_ms)

    def get_summary(self):
        """Summarize metrics per variant."""
        summary = {}

        for variant, data in self.metrics.items():
            if not data['samples']:
                continue

            summary[variant] = {
                'n_samples': len(data['samples']),
                'csat_mean': np.mean(data['csat']),
                'csat_std': np.std(data['csat']),
                'accuracy_mean': np.mean(data['accuracy']),
                'latency_p95': np.percentile(data['latency'], 95)
            }

        return summary

# Example usage
ab_test = ABTestMetrics()

# Simulate interactions
for _ in range(1000):
    user_id = f"user_{np.random.randint(10000)}"
    variant, model = assign_variant(user_id, variants)

    # Simulate metrics (in reality, these come from production)
    csat = np.random.normal(3.8 if variant == 'A_baseline' else 4.2, 0.5)
    accuracy = np.random.normal(0.78 if variant == 'A_baseline' else 0.85, 0.1)
    latency = np.random.normal(2000, 300)

    ab_test.log_interaction(variant, csat, accuracy, latency)

summary = ab_test.get_summary()
for variant, metrics in summary.items():
    print(f"\n{variant}:")
    print(f"  Samples: {metrics['n_samples']}")
    print(f"  CSAT: {metrics['csat_mean']:.2f} ± {metrics['csat_std']:.2f}")
    print(f"  Accuracy: {metrics['accuracy_mean']:.2%}")
    print(f"  Latency P95: {metrics['latency_p95']:.0f}ms")
```

**4. Statistical Significance Testing:**

```python
import numpy as np
from scipy.stats import ttest_ind

def test_significance(baseline_scores, treatment_scores, alpha=0.05):
    """
    Test if treatment is significantly better than baseline.

    Args:
        baseline_scores: List of scores for baseline variant
        treatment_scores: List of scores for treatment variant
        alpha: Significance level (default 0.05)

    Returns:
        Dict with test results
    """
    # Two-sample t-test
    t_stat, p_value = ttest_ind(treatment_scores, baseline_scores)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        (np.std(baseline_scores)**2 + np.std(treatment_scores)**2) / 2
    )
    cohens_d = (np.mean(treatment_scores) - np.mean(baseline_scores)) / pooled_std

    # Confidence interval for the difference
    from scipy.stats import t as t_dist
    diff = np.mean(treatment_scores) - np.mean(baseline_scores)
    se = pooled_std * np.sqrt(1/len(baseline_scores) + 1/len(treatment_scores))
    dof = len(baseline_scores) + len(treatment_scores) - 2
    ci_lower, ci_upper = t_dist.interval(1 - alpha, dof, loc=diff, scale=se)

    return {
        'baseline_mean': np.mean(baseline_scores),
        'treatment_mean': np.mean(treatment_scores),
        'difference': diff,
        'p_value': p_value,
        'significant': p_value < alpha,
        'cohens_d': cohens_d,
        'confidence_interval_95': (ci_lower, ci_upper)
    }

# Example
baseline_csat = [3.7, 3.9, 3.8, 3.6, 4.0, 3.8, 3.9, 3.7, 3.8, 3.9]   # Baseline
treatment_csat = [4.2, 4.3, 4.1, 4.4, 4.2, 4.0, 4.3, 4.2, 4.1, 4.3]  # GPT-4

result = test_significance(baseline_csat, treatment_csat)
print(f"Baseline CSAT: {result['baseline_mean']:.2f}")
print(f"Treatment CSAT: {result['treatment_mean']:.2f}")
print(f"Difference: +{result['difference']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Significant: {'YES' if result['significant'] else 'NO'}")
print(f"Effect size (Cohen's d): {result['cohens_d']:.2f}")
print(f"95% CI: [{result['confidence_interval_95'][0]:.2f}, {result['confidence_interval_95'][1]:.2f}]")
```

**Interpretation:**

- **p-value < 0.05:** Statistically significant (reject the null hypothesis that the variants are equal)
- **Cohen's d:**
  - 0.2 = small effect
  - 0.5 = medium effect
  - 0.8 = large effect
- **Confidence Interval:** If the CI doesn't include 0, the effect is significant

**5. Minimum Sample Size:**

```python
import numpy as np
from statsmodels.stats.power import tt_ind_solve_power

def calculate_required_sample_size(
    baseline_mean,
    expected_improvement,
    baseline_std,
    power=0.8,
    alpha=0.05
):
    """
    Calculate minimum sample size for detecting improvement.

    Args:
        baseline_mean: Current metric value
        expected_improvement: Minimum improvement to detect (absolute)
        baseline_std: Standard deviation of metric
        power: Statistical power (1 - type II error rate)
        alpha: Significance level (type I error rate)

    Returns:
        Minimum sample size per variant
    """
    # Effect size
    effect_size = expected_improvement / baseline_std

    # Calculate required sample size using power analysis
    n = tt_ind_solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='larger'
    )

    return int(np.ceil(n))

# Example: Detect 0.3 point improvement in CSAT (scale 1-5)
n_required = calculate_required_sample_size(
    baseline_mean=3.8,
    expected_improvement=0.3,  # Want to detect at least +0.3 improvement
    baseline_std=0.6,          # Typical CSAT std dev
    power=0.8,                 # 80% power (standard)
    alpha=0.05                 # 5% significance level
)

print(f"Required sample size per variant: {n_required}")
# Typical: 200-500 samples per variant for CSAT
```

**6. Decision Framework:**

```python
def ab_test_decision(baseline_metrics, treatment_metrics, cost_baseline, cost_treatment):
    """
    Make a go/no-go decision for the new model.
    Args:
        baseline_metrics: Dict of baseline performance
        treatment_metrics: Dict of treatment performance
        cost_baseline: Cost per 1k queries (baseline)
        cost_treatment: Cost per 1k queries (treatment)

    Returns:
        Decision and reasoning
    """
    # Check statistical significance
    sig_result = test_significance(
        baseline_metrics['csat_scores'],
        treatment_metrics['csat_scores']
    )

    # Calculate metrics
    csat_improvement = treatment_metrics['csat_mean'] - baseline_metrics['csat_mean']
    accuracy_improvement = treatment_metrics['accuracy_mean'] - baseline_metrics['accuracy_mean']
    cost_increase = cost_treatment - cost_baseline
    cost_increase_pct = (cost_increase / cost_baseline) * 100

    # Decision logic
    if not sig_result['significant']:
        return {
            'decision': 'REJECT',
            'reason': f"No significant improvement (p={sig_result['p_value']:.3f} > 0.05)"
        }

    if csat_improvement < 0:
        return {
            'decision': 'REJECT',
            'reason': f"CSAT decreased by {-csat_improvement:.2f} points"
        }

    if cost_increase_pct > 100 and csat_improvement < 0.5:
        return {
            'decision': 'REJECT',
            'reason': f"Cost increase (+{cost_increase_pct:.0f}%) too high for modest CSAT gain (+{csat_improvement:.2f})"
        }

    return {
        'decision': 'APPROVE',
        'reason': f"Significant improvement: CSAT +{csat_improvement:.2f} (p={sig_result['p_value']:.3f}), Accuracy +{accuracy_improvement:.1%}, Cost +{cost_increase_pct:.0f}%"
    }

# Example
baseline = {
    'csat_mean': 3.8,
    'csat_scores': [3.7, 3.9, 3.8, 3.6, 4.0, 3.8] * 50,  # 300 samples
    'accuracy_mean': 0.78
}

treatment = {
    'csat_mean': 4.2,
    'csat_scores': [4.2, 4.3, 4.1, 4.4, 4.2, 4.0] * 50,  # 300 samples
    'accuracy_mean': 0.85
}

decision = ab_test_decision(baseline, treatment, cost_baseline=0.5, cost_treatment=3.0)
print(f"Decision: {decision['decision']}")
print(f"Reason: {decision['reason']}")
```
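The t-test above fits roughly continuous scores such as CSAT. Task completion is a pass/fail outcome, so a two-proportion z-test is the usual choice for it; a minimal sketch using statsmodels (the counts are illustrative):

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts: completed tasks out of total interactions per variant
treatment_completed, treatment_total = 248, 300
baseline_completed, baseline_total = 225, 300

# One-sided test: is the treatment completion rate higher than the baseline's?
z_stat, p_value = proportions_ztest(
    count=[treatment_completed, baseline_completed],
    nobs=[treatment_total, baseline_total],
    alternative='larger'  # H1: p_treatment > p_baseline
)

print(f"Completion rate: {baseline_completed/baseline_total:.1%} -> "
      f"{treatment_completed/treatment_total:.1%}, p={p_value:.4f}")
```

The same significance threshold and sample-size reasoning apply; only the test statistic changes.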
## Part 4: Production Monitoring

**Purpose:** Continuous evaluation in production to detect regressions, drift, and quality issues.

### Key Production Metrics

1. **Business Metrics:**
   - Customer Satisfaction (CSAT)
   - Task Completion Rate
   - Escalation to Human Rate
   - Time to Resolution

2. **Technical Metrics:**
   - Model Accuracy / F1 / BLEU (automated evaluation on sampled production data)
   - Latency (P50, P95, P99)
   - Error Rate
   - Token Usage / Cost per Query

3. **Data Quality Metrics:**
   - Input Distribution Shift (detect drift)
   - Output Distribution Shift
   - Rare/Unknown Input Rate

**Implementation:**

```python
import numpy as np
from datetime import datetime, timedelta

class ProductionMonitor:
    def __init__(self):
        self.metrics = {
            'csat': [],
            'completion_rate': [],
            'accuracy': [],
            'latency_ms': [],
            'cost_per_query': [],
            'timestamps': []
        }
        self.baseline = {}  # Store baseline metrics

    def log_query(self, csat, completed, accurate, latency_ms, cost):
        """Log production query metrics."""
        self.metrics['csat'].append(csat)
        self.metrics['completion_rate'].append(1 if completed else 0)
        self.metrics['accuracy'].append(1 if accurate else 0)
        self.metrics['latency_ms'].append(latency_ms)
        self.metrics['cost_per_query'].append(cost)
        self.metrics['timestamps'].append(datetime.now())

    def set_baseline(self):
        """Set current metrics as baseline for comparison."""
        self.baseline = {
            'csat': np.mean(self.metrics['csat'][-1000:]),  # Last 1000 queries
            'completion_rate': np.mean(self.metrics['completion_rate'][-1000:]),
            'accuracy': np.mean(self.metrics['accuracy'][-1000:]),
            'latency_p95': np.percentile(self.metrics['latency_ms'][-1000:], 95)
        }

    def detect_regression(self, window_size=100, threshold=0.05):
        """
        Detect significant regression in recent queries.

        Args:
            window_size: Number of recent queries to analyze
            threshold: Relative decrease to trigger alert (5% default)

        Returns:
            Dict of alerts
        """
        if not self.baseline:
            return {'error': 'No baseline set'}

        alerts = {}

        # Recent metrics
        recent = {
            'csat': np.mean(self.metrics['csat'][-window_size:]),
            'completion_rate': np.mean(self.metrics['completion_rate'][-window_size:]),
            'accuracy': np.mean(self.metrics['accuracy'][-window_size:]),
            'latency_p95': np.percentile(self.metrics['latency_ms'][-window_size:], 95)
        }

        # Check for regressions
        for metric, recent_value in recent.items():
            baseline_value = self.baseline[metric]
            relative_change = (recent_value - baseline_value) / baseline_value

            # For latency, an increase is bad; for the others, a decrease is bad
            if metric == 'latency_p95':
                if relative_change > threshold:
                    alerts[metric] = {
                        'severity': 'WARNING',
                        'message': f"Latency increased {relative_change*100:.1f}% ({baseline_value:.0f}ms → {recent_value:.0f}ms)",
                        'baseline': baseline_value,
                        'current': recent_value
                    }
            else:
                if relative_change < -threshold:
                    alerts[metric] = {
                        'severity': 'CRITICAL',
                        'message': f"{metric} decreased {-relative_change*100:.1f}% ({baseline_value:.3f} → {recent_value:.3f})",
                        'baseline': baseline_value,
                        'current': recent_value
                    }

        return alerts

# Example usage
monitor = ProductionMonitor()

# Simulate stable baseline period
for _ in range(1000):
    monitor.log_query(
        csat=np.random.normal(3.8, 0.5),
        completed=np.random.random() < 0.75,
        accurate=np.random.random() < 0.80,
        latency_ms=np.random.normal(2000, 300),
        cost=0.002
    )

monitor.set_baseline()

# Simulate regression (accuracy drops)
for _ in range(100):
    monitor.log_query(
        csat=np.random.normal(3.5, 0.5),       # Dropped
        completed=np.random.random() < 0.68,   # Dropped
        accurate=np.random.random() < 0.72,    # Dropped significantly
        latency_ms=np.random.normal(2000, 300),
        cost=0.002
    )

# Detect regression
alerts = monitor.detect_regression(window_size=100, threshold=0.05)

if alerts:
    print("ALERTS DETECTED:")
    for metric, alert in alerts.items():
        print(f"  [{alert['severity']}] {alert['message']}")
else:
    print("No regressions detected.")
```

**Alerting thresholds:**

| Metric | Baseline | Alert Threshold | Severity |
|--------|----------|-----------------|----------|
| CSAT | 3.8/5 | < 3.6 (-5%) | CRITICAL |
| Completion Rate | 75% | < 70% (-5pp) | CRITICAL |
| Accuracy | 80% | < 75% (-5pp) | CRITICAL |
| Latency P95 | 2000ms | > 2500ms (+25%) | WARNING |
| Cost per Query | $0.002 | > $0.003 (+50%) | WARNING |
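The Data Quality Metrics above call for input distribution shift detection, which `ProductionMonitor` does not implement. A minimal sketch that compares a simple input feature (prompt word count) between a reference window and a recent window with a two-sample Kolmogorov-Smirnov test; the feature choice and the 0.05 cut-off are assumptions:

```python
from scipy.stats import ks_2samp

def detect_input_drift(reference_prompts, recent_prompts, alpha=0.05):
    """Flag drift when the prompt-length distribution changes significantly."""
    ref_lengths = [len(p.split()) for p in reference_prompts]
    new_lengths = [len(p.split()) for p in recent_prompts]

    statistic, p_value = ks_2samp(ref_lengths, new_lengths)
    return {
        'ks_statistic': statistic,
        'p_value': p_value,
        'drift_detected': p_value < alpha
    }

# Illustrative check: recent prompts are much longer than the reference window
reference = ["reset my password", "what is your return policy"] * 100
recent = ["please summarize this long support transcript " * 10] * 100
print(detect_input_drift(reference, recent))
```

Richer checks (embedding-distance drift, population stability index over intent labels) follow the same pattern: compare a recent window against the baseline window and alert past a threshold.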
## Part 5: Complete Evaluation Workflow

### Step-by-Step Checklist

When evaluating any LLM application:

**☐ 1. Identify Task Type**
- Classification? Use Accuracy, F1, Precision, Recall
- Generation? Use BLEU, ROUGE, BERTScore
- Summarization? Use ROUGE-L, BERTScore, Factual Consistency
- RAG? Separate Retrieval (MRR, NDCG) + Generation (Faithfulness)

**☐ 2. Create Held-Out Test Set**
- Split data: 80% train, 10% validation, 10% test
- OR 90% train, 10% test (if data limited)
- Stratify by class (classification) or query type (RAG); see the sketch after this checklist
- Test set must be representative and cover edge cases

**☐ 3. Select Primary and Secondary Metrics**
- Primary: Main optimization target (F1, BLEU, ROUGE-L, MRR)
- Secondary: Prevent gaming (factual consistency, compression ratio)
- Guard rails: Safety, toxicity, bias checks

**☐ 4. Calculate Automated Metrics**
- Run evaluation on full test set
- Calculate primary metric (e.g., F1 = 0.82)
- Calculate secondary metrics (e.g., faithfulness = 0.91)
- Save per-example predictions for error analysis

**☐ 5. Human Evaluation**
- Sample 200-300 examples (stratified: random + high/low automated scores)
- 3 annotators per example (inter-annotator agreement)
- Dimensions: Fluency, Relevance, Helpfulness, Safety, Coherence
- Check agreement (Cohen's Kappa > 0.6)

**☐ 6. Compare to Baselines**
- Rule-based baseline (e.g., keyword matching)
- Zero-shot baseline (e.g., GPT-3.5 with prompt)
- Previous model (current production system)
- Ensure the new model outperforms all baselines

**☐ 7. A/B Test in Production**
- 3 variants: Baseline (70%), New Model (15%), Alternative (15%)
- Minimum 200-500 samples per variant
- Test statistical significance (p < 0.05)
- Check business impact (CSAT, completion rate)

**☐ 8. Cost-Benefit Analysis**
- Improvement value: +0.5 CSAT × $10k/month = +$5k
- Cost increase: +$0.02/query × 100k queries = +$2k/month
- Net value: $5k - $2k = +$3k/month → APPROVE

**☐ 9. Gradual Rollout**
- Phase 1: 5% traffic (1 week) → Monitor for issues
- Phase 2: 25% traffic (1 week) → Confirm trends
- Phase 3: 50% traffic (1 week) → Final validation
- Phase 4: 100% rollout → Only if all metrics stable

**☐ 10. Production Monitoring**
- Set baseline metrics from first week
- Monitor daily: CSAT, completion rate, accuracy, latency, cost
- Alert on >5% regression in critical metrics
- Weekly review: Check for data drift, quality issues
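Step 2's stratified split, as a minimal sketch using scikit-learn (the 80/10/10 ratios follow the checklist; the variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

def split_dataset(texts, labels, seed=42):
    """80/10/10 train/validation/test split, stratified by label (or query type)."""
    train_x, temp_x, train_y, temp_y = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    val_x, test_x, val_y, test_y = train_test_split(
        temp_x, temp_y, test_size=0.5, stratify=temp_y, random_state=seed
    )
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```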
## Common Pitfalls and How to Avoid Them

### Pitfall 1: No Evaluation Strategy

**Symptom:** "I'll just look at a few examples to see if it works."

**Fix:** Mandatory held-out test set with quantitative metrics. Never ship without numbers.

### Pitfall 2: Wrong Metrics for Task

**Symptom:** Using accuracy for generation tasks, BLEU for classification.

**Fix:** Match the metric family to the task type. See the Part 1 tables.

### Pitfall 3: Automated Metrics Only

**Symptom:** BLEU increased to 0.45 but users complain about quality.

**Fix:** Always combine automated + human + production metrics. All three must improve.

### Pitfall 4: Single Metric Optimization

**Symptom:** ROUGE-L optimized but summaries are verbose and contain hallucinations.

**Fix:** Multi-dimensional evaluation with guard rails. Reject regressions on secondary metrics.

### Pitfall 5: No Baseline Comparison

**Symptom:** "Our model achieves 82% accuracy!" (Is that good? Better than what?)

**Fix:** Always compare to baselines: rule-based, zero-shot, previous model.

### Pitfall 6: No A/B Testing

**Symptom:** Deploy the new model, discover it's worse than the baseline, scramble to roll back.

**Fix:** A/B test with statistical significance before full deployment.

### Pitfall 7: Insufficient Sample Size

**Symptom:** "We tested on 20 examples and it looks good!"

**Fix:** Minimum 200-500 samples for human evaluation, and 200-500 per variant for A/B testing.

### Pitfall 8: No Production Monitoring

**Symptom:** Model quality degrades over time (data drift) but nobody notices until users complain.

**Fix:** Continuous monitoring with automated alerts on metric regressions.

## Summary

**Evaluation is mandatory, not optional.**

**Complete evaluation = Automated metrics (efficiency) + Human evaluation (quality) + Production metrics (impact)**

**Core principles:**

1. Match metrics to task type (classification vs generation)
2. Multi-dimensional scoring prevents gaming single metrics
3. Human evaluation catches issues automated metrics miss
4. A/B testing proves value before full deployment
5. Production monitoring detects regressions and drift

**Checklist:** Task type → Test set → Metrics → Automated eval → Human eval → Baselines → A/B test → Cost-benefit → Gradual rollout → Production monitoring

Without rigorous evaluation, you don't know if your system works. Evaluation is how you make engineering decisions with confidence instead of guesses.