---
name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
---

# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## When to Use This Skill

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**
- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Semantic similarity
- **BERTScore**: Embedding-based similarity
- **Perplexity**: Language model confidence (sketch below)
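
BLEU, ROUGE, and BERTScore are implemented under Automated Metrics Implementation below; perplexity is not, so here is a minimal sketch using Hugging Face `transformers`. The choice of `gpt2` as the scoring model is an assumption; any causal language model works.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(text, model_name="gpt2"):
    """Perplexity of `text` under a causal LM (lower = more expected/fluent)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids the model returns the mean token-level
        # cross-entropy loss; exp(loss) is the perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    return torch.exp(loss).item()
```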

**Classification:**
- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality (sketch below)
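
For classification-style tasks (for example, an LLM labelling intents or routing queries), scikit-learn already covers all four metrics above. A minimal sketch, assuming string labels and an optional per-example score for the positive class (the `positive_label` default is illustrative):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

def evaluate_classification(y_true, y_pred, y_score=None, positive_label="yes"):
    """Standard classification metrics for LLM labelling tasks."""
    report = {
        "accuracy": accuracy_score(y_true, y_pred),
        # Per-class precision/recall/F1 as a nested dict
        "per_class": classification_report(y_true, y_pred, output_dict=True),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }
    if y_score is not None:
        # AUC-ROC needs a score/probability for the positive class
        y_binary = [1 if y == positive_label else 0 for y in y_true]
        report["auc_roc"] = roc_auc_score(y_binary, y_score)
    return report
```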

**Retrieval (RAG):**
- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Relevant in top K
- **Recall@K**: Coverage in top K (sketch below)
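
NDCG is easiest to take from an existing library (for example `sklearn.metrics.ndcg_score`); the rank-based metrics are short enough to compute directly. A minimal sketch, assuming each query provides the ranked list of retrieved document IDs and the set of relevant IDs:

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant document, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents found within the top k."""
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / len(relevant_ids)

# MRR is the mean of reciprocal_rank over all queries, e.g.:
# mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```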

### 2. Human Evaluation
Manual assessment for quality aspects that are difficult to automate.

**Dimensions:**
- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpfulness**: Useful to the user

### 3. LLM-as-Judge
Use a stronger LLM to evaluate a weaker model's outputs.

**Approaches:**
- **Pointwise**: Score individual responses
- **Pairwise**: Compare two responses
- **Reference-based**: Compare to a gold standard
- **Reference-free**: Judge without ground truth

## Quick Start

```python
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
```

## Automated Metrics Implementation

### BLEU Score
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
```

### ROUGE Score
```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```

### BERTScore
```python
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using pretrained contextual embeddings."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
```

### Custom Metrics
```python
def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Use an NLI model to check entailment
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    result = nli(f"{context} [SEP] {response}")[0]

    # Return confidence that the response is entailed by the context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or a fact-checking API
    pass
```
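
The `calculate_factuality` stub above is left open because the right approach depends on the knowledge base. One possible sketch, assuming a hypothetical `knowledge_base.search(claim, k)` method that returns passage strings, reuses the same NLI entailment idea as `calculate_groundedness`:

```python
def calculate_factuality_nli(claim, knowledge_base, k=5):
    """Rough factuality score: max entailment of the claim by any retrieved passage.

    Assumes `knowledge_base.search(claim, k)` returns a list of passage strings
    (hypothetical interface -- adapt to your retriever).
    """
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    best = 0.0
    for passage in knowledge_base.search(claim, k):
        result = nli(f"{passage} [SEP] {claim}")[0]
        if result['label'] == 'ENTAILMENT':
            best = max(best, result['score'])
    return best
```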

## LLM-as-Judge Patterns

### Single Output Evaluation
```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge_quality(response, question):
    """Use GPT-5 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}
"""

    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```
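
Judge models do not always return bare JSON; a common failure mode is wrapping the object in a Markdown code fence or adding commentary around it. A small defensive parser (a sketch, not part of any library) keeps the evaluation loop from crashing on such outputs:

```python
import json
import re

def parse_judge_json(raw_text):
    """Extract the first JSON object from a judge response, tolerating code fences."""
    # Strip ```json ... ``` style fences if present
    cleaned = re.sub(r"```(?:json)?", "", raw_text).strip("` \n")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the first {...} block in the text
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```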

### Pairwise Comparison
```python
# Reuses `client` and `json` from the previous example
def compare_responses(question, response_a, response_b):
    """Compare two responses using an LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}
"""

    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```
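
Pairwise judges tend to favour whichever response appears first (position bias). A cheap mitigation, sketched here on top of `compare_responses`, is to run the comparison in both orders and only declare a winner when the two verdicts agree:

```python
def compare_debiased(question, response_a, response_b):
    """Run the pairwise judge in both orders to reduce position bias."""
    first = compare_responses(question, response_a, response_b)
    second = compare_responses(question, response_b, response_a)

    # Map the second run's verdict back to the original labels
    flipped = {"A": "B", "B": "A", "tie": "tie"}
    second_winner = flipped.get(second["winner"], "tie")

    if first["winner"] == second_winner:
        return {"winner": first["winner"], "consistent": True}
    return {"winner": "tie", "consistent": False}
```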

## Human Evaluation Frameworks

### Annotation Guidelines
```python
class AnnotationTask:
    """Structure for a human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
```

### Inter-Rater Agreement
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement (Cohen's kappa)."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # Landis & Koch style interpretation bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```

## A/B Testing

### Statistical Testing Framework
```python
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```
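
End to end, the class is used by scoring the same test set under both variants (with any metric or judge from above) and reading off the analysis. A short usage sketch with illustrative, made-up scores:

```python
ab = ABTest(variant_a_name="prompt_v1", variant_b_name="prompt_v2")

# In practice these scores would come from a metric or an LLM judge
for score in [0.71, 0.68, 0.74, 0.70, 0.69]:
    ab.add_result("A", score)
for score in [0.78, 0.75, 0.80, 0.77, 0.79]:
    ab.add_result("B", score)

report = ab.analyze(alpha=0.05)
print(report["winner"], report["p_value"], report["effect_size"])
```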

## Regression Testing

### Regression Detection
```python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
```
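
In a CI pipeline this usually runs as a gate: load the stored baseline, evaluate the current build, and fail the job on any regression. A minimal sketch, assuming the baseline metrics live in a JSON file (the path is illustrative):

```python
import json
import sys

def regression_gate(new_results, baseline_path="eval_baseline.json", threshold=0.05):
    """Exit non-zero if any metric dropped more than `threshold` vs. the baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)

    detector = RegressionDetector(baseline, threshold=threshold)
    report = detector.check_for_regression(new_results)

    if report["has_regression"]:
        for r in report["regressions"]:
            print(f"REGRESSION {r['metric']}: {r['baseline']:.3f} -> {r['current']:.3f}")
        sys.exit(1)
```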

## Benchmarking

### Running Benchmarks
```python
import numpy as np

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
```
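
The runner expects the dataset as an iterable of dicts with `input`, `reference`, and optionally `context` keys, which maps directly onto a JSONL file with one example per line. A hedged loading sketch (the path mirrors the example asset listed under Resources; adjust to your layout):

```python
import json

def load_benchmark(path="assets/benchmark-dataset.jsonl"):
    """Load a JSONL benchmark file into a list of example dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# runner = BenchmarkRunner(load_benchmark())
# summary = runner.run_benchmark(your_model, metrics=[...])
```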

## Resources

- **references/metrics.md**: Comprehensive metric guide
- **references/human-evaluation.md**: Annotation best practices
- **references/benchmarking.md**: Standard benchmarks
- **references/a-b-testing.md**: Statistical testing guide
- **references/regression-testing.md**: CI/CD integration
- **assets/evaluation-framework.py**: Complete evaluation harness
- **assets/benchmark-dataset.jsonl**: Example datasets
- **scripts/evaluate-model.py**: Automated evaluation runner

## Best Practices

1. **Multiple Metrics**: Use diverse metrics for a comprehensive view
2. **Representative Data**: Test on real-world, diverse examples
3. **Baselines**: Always compare against baseline performance
4. **Statistical Rigor**: Use proper statistical tests for comparisons
5. **Continuous Evaluation**: Integrate evaluation into the CI/CD pipeline
6. **Human Validation**: Combine automated metrics with human judgment
7. **Error Analysis**: Investigate failures to understand weaknesses
8. **Version Control**: Track evaluation results over time

## Common Pitfalls

- **Single Metric Obsession**: Optimizing for one metric at the expense of others
- **Small Sample Size**: Drawing conclusions from too few examples
- **Data Contamination**: Testing on training data
- **Ignoring Variance**: Not accounting for statistical uncertainty
- **Metric Mismatch**: Using metrics not aligned with business goals