---
name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
---

# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## When to Use This Skill

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**
- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Semantic similarity
- **BERTScore**: Embedding-based similarity
- **Perplexity**: Language model confidence (sketch below)
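
BLEU, ROUGE, and BERTScore are implemented under Automated Metrics Implementation below; perplexity is not, so here is a minimal sketch using Hugging Face `transformers`. The choice of `gpt2` as the scoring model is an assumption; any causal language model works.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(text, model_name="gpt2"):
    """Perplexity of `text` under a causal LM (lower = more expected/fluent)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids the model returns the mean token-level
        # cross-entropy loss; exp(loss) is the perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    return torch.exp(loss).item()
```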

**Classification:**
- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality (sketch below)
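
For classification-style tasks (for example, an LLM labelling intents or routing queries), scikit-learn already covers all four metrics above. A minimal sketch, assuming string labels and an optional per-example score for the positive class (the `positive_label` default is illustrative):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

def evaluate_classification(y_true, y_pred, y_score=None, positive_label="yes"):
    """Standard classification metrics for LLM labelling tasks."""
    report = {
        "accuracy": accuracy_score(y_true, y_pred),
        # Per-class precision/recall/F1 as a nested dict
        "per_class": classification_report(y_true, y_pred, output_dict=True),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }
    if y_score is not None:
        # AUC-ROC needs a score/probability for the positive class
        y_binary = [1 if y == positive_label else 0 for y in y_true]
        report["auc_roc"] = roc_auc_score(y_binary, y_score)
    return report
```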

**Retrieval (RAG):**
- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Relevant in top K
- **Recall@K**: Coverage in top K (sketch below)
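
NDCG is easiest to take from an existing library (for example `sklearn.metrics.ndcg_score`); the rank-based metrics are short enough to compute directly. A minimal sketch, assuming each query provides the ranked list of retrieved document IDs and the set of relevant IDs:

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant document, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents found within the top k."""
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / len(relevant_ids)

# MRR is the mean of reciprocal_rank over all queries, e.g.:
# mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```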

### 2. Human Evaluation
Manual assessment for quality aspects that are difficult to automate.

**Dimensions:**
- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpfulness**: Useful to the user

### 3. LLM-as-Judge
Use a stronger LLM to evaluate a weaker model's outputs.

**Approaches:**
- **Pointwise**: Score individual responses
- **Pairwise**: Compare two responses
- **Reference-based**: Compare to a gold standard
- **Reference-free**: Judge without ground truth

## Quick Start

```python
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)

print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
```

## Automated Metrics Implementation

### BLEU Score
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
```

### ROUGE Score
```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```

### BERTScore
```python
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using pretrained contextual embeddings."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
```

### Custom Metrics
```python
def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Use an NLI model to check entailment
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    result = nli(f"{context} [SEP] {response}")[0]

    # Return confidence that the response is entailed by the context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or a fact-checking API
    pass
```
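
The `calculate_factuality` stub above is left open because the right approach depends on the knowledge base. One possible sketch, assuming a hypothetical `knowledge_base.search(claim, k)` method that returns passage strings, reuses the same NLI entailment idea as `calculate_groundedness`:

```python
def calculate_factuality_nli(claim, knowledge_base, k=5):
    """Rough factuality score: max entailment of the claim by any retrieved passage.

    Assumes `knowledge_base.search(claim, k)` returns a list of passage strings
    (hypothetical interface -- adapt to your retriever).
    """
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    best = 0.0
    for passage in knowledge_base.search(claim, k):
        result = nli(f"{passage} [SEP] {claim}")[0]
        if result['label'] == 'ENTAILMENT':
            best = max(best, result['score'])
    return best
```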

## LLM-as-Judge Patterns

### Single Output Evaluation
```python
import json
from openai import OpenAI

client = OpenAI()

def llm_judge_quality(response, question):
    """Use GPT-5 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}
"""

    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```
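
Judge models do not always return bare JSON; a common failure mode is wrapping the object in a Markdown code fence or adding commentary around it. A small defensive parser (a sketch, not part of any library) keeps the evaluation loop from crashing on such outputs:

```python
import json
import re

def parse_judge_json(raw_text):
    """Extract the first JSON object from a judge response, tolerating code fences."""
    # Strip ```json ... ``` style fences if present
    cleaned = re.sub(r"```(?:json)?", "", raw_text).strip("` \n")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the first {...} block in the text
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```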

### Pairwise Comparison
```python
# Reuses `client` and `json` from the previous example
def compare_responses(question, response_a, response_b):
    """Compare two responses using an LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}
"""

    result = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
```
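
Pairwise judges tend to favour whichever response appears first (position bias). A cheap mitigation, sketched here on top of `compare_responses`, is to run the comparison in both orders and only declare a winner when the two verdicts agree:

```python
def compare_debiased(question, response_a, response_b):
    """Run the pairwise judge in both orders to reduce position bias."""
    first = compare_responses(question, response_a, response_b)
    second = compare_responses(question, response_b, response_a)

    # Map the second run's verdict back to the original labels
    flipped = {"A": "B", "B": "A", "tie": "tie"}
    second_winner = flipped.get(second["winner"], "tie")

    if first["winner"] == second_winner:
        return {"winner": first["winner"], "consistent": True}
    return {"winner": "tie", "consistent": False}
```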

## Human Evaluation Frameworks

### Annotation Guidelines
```python
class AnnotationTask:
    """Structure for a human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
```

### Inter-Rater Agreement
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement (Cohen's kappa)."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    # Landis & Koch style interpretation bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```

## A/B Testing

### Statistical Testing Framework
```python
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```
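
End to end, the class is used by scoring the same test set under both variants (with any metric or judge from above) and reading off the analysis. A short usage sketch with illustrative, made-up scores:

```python
ab = ABTest(variant_a_name="prompt_v1", variant_b_name="prompt_v2")

# In practice these scores would come from a metric or an LLM judge
for score in [0.71, 0.68, 0.74, 0.70, 0.69]:
    ab.add_result("A", score)
for score in [0.78, 0.75, 0.80, 0.77, 0.79]:
    ab.add_result("B", score)

report = ab.analyze(alpha=0.05)
print(report["winner"], report["p_value"], report["effect_size"])
```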

## Regression Testing

### Regression Detection
```python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
```
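
In a CI pipeline this usually runs as a gate: load the stored baseline, evaluate the current build, and fail the job on any regression. A minimal sketch, assuming the baseline metrics live in a JSON file (the path is illustrative):

```python
import json
import sys

def regression_gate(new_results, baseline_path="eval_baseline.json", threshold=0.05):
    """Exit non-zero if any metric dropped more than `threshold` vs. the baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)

    detector = RegressionDetector(baseline, threshold=threshold)
    report = detector.check_for_regression(new_results)

    if report["has_regression"]:
        for r in report["regressions"]:
            print(f"REGRESSION {r['metric']}: {r['baseline']:.3f} -> {r['current']:.3f}")
        sys.exit(1)
```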

## Benchmarking

### Running Benchmarks
```python
import numpy as np

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
```
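
The runner expects the dataset as an iterable of dicts with `input`, `reference`, and optionally `context` keys, which maps directly onto a JSONL file with one example per line. A hedged loading sketch (the path mirrors the example asset listed under Resources; adjust to your layout):

```python
import json

def load_benchmark(path="assets/benchmark-dataset.jsonl"):
    """Load a JSONL benchmark file into a list of example dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# runner = BenchmarkRunner(load_benchmark())
# summary = runner.run_benchmark(your_model, metrics=[...])
```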

## Resources

- **references/metrics.md**: Comprehensive metric guide
- **references/human-evaluation.md**: Annotation best practices
- **references/benchmarking.md**: Standard benchmarks
- **references/a-b-testing.md**: Statistical testing guide
- **references/regression-testing.md**: CI/CD integration
- **assets/evaluation-framework.py**: Complete evaluation harness
- **assets/benchmark-dataset.jsonl**: Example datasets
- **scripts/evaluate-model.py**: Automated evaluation runner

## Best Practices

1. **Multiple Metrics**: Use diverse metrics for a comprehensive view
2. **Representative Data**: Test on real-world, diverse examples
3. **Baselines**: Always compare against baseline performance
4. **Statistical Rigor**: Use proper statistical tests for comparisons
5. **Continuous Evaluation**: Integrate evaluation into the CI/CD pipeline
6. **Human Validation**: Combine automated metrics with human judgment
7. **Error Analysis**: Investigate failures to understand weaknesses
8. **Version Control**: Track evaluation results over time

## Common Pitfalls

- **Single Metric Obsession**: Optimizing for one metric at the expense of others
- **Small Sample Size**: Drawing conclusions from too few examples
- **Data Contamination**: Testing on training data
- **Ignoring Variance**: Not accounting for statistical uncertainty
- **Metric Mismatch**: Using metrics not aligned with business goals