---
name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
---

# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## When to Use This Skill

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**
- **BLEU**: N-gram precision overlap (machine translation)
- **ROUGE**: Recall-oriented n-gram overlap (summarization)
- **METEOR**: Unigram matching with stemming and synonym support
- **BERTScore**: Embedding-based semantic similarity
- **Perplexity**: How well a language model predicts the text (lower is better)

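BLEU, ROUGE, and BERTScore have reference implementations later in this skill; perplexity does not, so here is a minimal sketch using Hugging Face `transformers` (the GPT-2 model name is only an illustrative default):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(text, model_name="gpt2"):
    """Compute perplexity of `text` under a causal language model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # When labels == input_ids, the model returns the mean
        # cross-entropy loss over the sequence.
        outputs = model(**inputs, labels=inputs["input_ids"])

    return torch.exp(outputs.loss).item()

# Usage
ppl = calculate_perplexity("The cat sat on the mat.")
```

Averaging per-example perplexities is fine for quick comparisons; for a corpus-level figure, aggregate token-level losses across the whole set instead.
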
**Classification:**
- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality

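For classification-style evaluations (e.g., labels parsed from model output), scikit-learn provides these metrics directly; a minimal sketch, assuming predictions have already been mapped to discrete labels:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate_classification(y_true, y_pred):
    """Compute standard classification metrics for labeled predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Per-class precision/recall/F1 as a nested dict
        "report": classification_report(y_true, y_pred, output_dict=True),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }

# Usage: gold labels vs. labels extracted from model responses
metrics = evaluate_classification(
    y_true=["positive", "negative", "positive"],
    y_pred=["positive", "positive", "positive"],
)
```
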
**Retrieval (RAG):**
- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Fraction of the top K results that are relevant
- **Recall@K**: Fraction of relevant documents found in the top K

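The ranking metrics above are straightforward to compute per query; a small sketch of MRR, Precision@K, and Recall@K (averaging across a query set is left to the caller):

```python
def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document for one query."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / max(len(relevant_ids), 1)

# Usage: average these across the queries in your retrieval test set
score = mrr(ranked_ids=["d3", "d7", "d1"], relevant_ids={"d1"})  # 1/3
```
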
### 2. Human Evaluation
Manual assessment for quality aspects difficult to automate.

**Dimensions:**
- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpfulness**: Useful to the user

### 3. LLM-as-Judge
Use a stronger LLM to evaluate another model's outputs.

**Approaches:**
- **Pointwise**: Score individual responses
- **Pairwise**: Compare two responses
- **Reference-based**: Compare to a gold standard
- **Reference-free**: Judge without ground truth

## Quick Start

```python
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
```

## Automated Metrics Implementation

### BLEU Score
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
```

### ROUGE Score
```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```

### BERTScore
```python
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using contextual embeddings."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
```

### Custom Metrics
```python
def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Use NLI model to check entailment
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    result = nli(f"{context} [SEP] {response}")[0]

    # Return confidence that response is entailed by context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or fact-checking API
    pass
```

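`calculate_factuality` is intentionally left as a stub; one possible shape for it, following the retrieval-plus-NLI idea in the comment, is sketched below. The `knowledge_base.search(query, top_k)` retriever interface is an assumption, not a real API:

```python
def calculate_factuality(claim, knowledge_base, top_k=3):
    """Verify a claim against retrieved evidence using NLI."""
    from transformers import pipeline

    # NOTE: `knowledge_base.search(query, top_k)` is an assumed retriever
    # interface returning text passages -- adapt to your own store.
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    best = 0.0
    for passage in knowledge_base.search(claim, top_k=top_k):
        # Same premise/hypothesis pattern as calculate_groundedness above
        result = nli(f"{passage} [SEP] {claim}")[0]
        if result['label'] == 'ENTAILMENT':
            best = max(best, result['score'])
    return best  # highest entailment confidence across retrieved evidence
```
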
## LLM-as-Judge Patterns

### Single Output Evaluation
```python
import json

import openai

def llm_judge_quality(response, question):
    """Use GPT-4 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}
"""

    result = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```

### Pairwise Comparison
```python
import json

import openai

def compare_responses(question, response_a, response_b):
    """Compare two responses using an LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}
"""

    result = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```

## Human Evaluation Frameworks

### Annotation Guidelines
```python
class AnnotationTask:
    """Structure for human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
```

### Inter-Rater Agreement
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement (Cohen's kappa)."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # Landis & Koch style interpretation bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```

## A/B Testing

### Statistical Testing Framework
```python
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```

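A brief usage sketch for the class above; the variant names and scores are made-up illustration values:

```python
# Scores can be any per-example quality metric (e.g., judge ratings)
test = ABTest(variant_a_name="prompt_v1", variant_b_name="prompt_v2")
for score in [0.71, 0.65, 0.80, 0.74]:
    test.add_result("A", score)
for score in [0.78, 0.82, 0.75, 0.88]:
    test.add_result("B", score)

report = test.analyze(alpha=0.05)
print(report["winner"], report["p_value"], report["effect_size"])
```
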
## Regression Testing

### Regression Detection
```python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression.

        Assumes higher scores are better for every tracked metric.
        """
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
```

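One way to wire the detector into a CI gate, assuming baseline metrics are stored as JSON (the file name and numbers are illustrative):

```python
import json

# Load the stored baseline, e.g. {"accuracy": 0.91, "rougeL": 0.42}
with open("eval_baseline.json") as f:
    baseline = json.load(f)

detector = RegressionDetector(baseline, threshold=0.05)
report = detector.check_for_regression({"accuracy": 0.84, "rougeL": 0.43})

# Fail the pipeline step if any metric dropped beyond the threshold
if report["has_regression"]:
    raise SystemExit(f"Regression detected: {report['regressions']}")
```
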
## Benchmarking

### Running Benchmarks
```python
import numpy as np

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
```

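The runner assumes a model object with a `predict` method and metric objects exposing `name` and `calculate`; a minimal sketch of that assumed interface (all names here are placeholders):

```python
from dataclasses import dataclass

@dataclass
class ExactMatchMetric:
    name: str = "exact_match"

    def calculate(self, prediction, reference, context=None):
        # 1.0 if the normalized strings match, else 0.0
        return float(prediction.strip().lower() == reference.strip().lower())

class ModelAdapter:
    def __init__(self, generate_fn):
        self.generate_fn = generate_fn  # wraps your LLM call

    def predict(self, text):
        return self.generate_fn(text)

# runner = BenchmarkRunner(dataset)
# summary = runner.run_benchmark(ModelAdapter(my_llm), [ExactMatchMetric()])
```
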
## Resources

- **references/metrics.md**: Comprehensive metric guide
- **references/human-evaluation.md**: Annotation best practices
- **references/benchmarking.md**: Standard benchmarks
- **references/a-b-testing.md**: Statistical testing guide
- **references/regression-testing.md**: CI/CD integration
- **assets/evaluation-framework.py**: Complete evaluation harness
- **assets/benchmark-dataset.jsonl**: Example datasets
- **scripts/evaluate-model.py**: Automated evaluation runner

## Best Practices

1. **Multiple Metrics**: Use diverse metrics for a comprehensive view
2. **Representative Data**: Test on real-world, diverse examples
3. **Baselines**: Always compare against baseline performance
4. **Statistical Rigor**: Use proper statistical tests for comparisons
5. **Continuous Evaluation**: Integrate evaluation into your CI/CD pipeline
6. **Human Validation**: Combine automated metrics with human judgment
7. **Error Analysis**: Investigate failures to understand weaknesses
8. **Version Control**: Track evaluation results over time

## Common Pitfalls

- **Single Metric Obsession**: Optimizing for one metric at the expense of others
- **Small Sample Size**: Drawing conclusions from too few examples
- **Data Contamination**: Testing on training data
- **Ignoring Variance**: Not accounting for statistical uncertainty
- **Metric Mismatch**: Using metrics not aligned with business goals