From d601a38b284b8e46d08705ff416791f49c62b61f Mon Sep 17 00:00:00 2001 From: Zhongwei Li Date: Sat, 29 Nov 2025 17:50:49 +0800 Subject: [PATCH] Initial commit --- .claude-plugin/plugin.json | 11 + README.md | 3 + SKILL.md | 197 ++++++++ plugin.json | 14 + plugin.lock.json | 77 ++++ references/anti-patterns.md | 687 ++++++++++++++++++++++++++++ references/best-practices.md | 761 +++++++++++++++++++++++++++++++ references/ci-cd-optimization.md | 735 +++++++++++++++++++++++++++++ references/coverage-strategy.md | 622 +++++++++++++++++++++++++ references/legacy-code.md | 703 ++++++++++++++++++++++++++++ references/methodologies.md | 663 +++++++++++++++++++++++++++ references/testing-layers.md | 595 ++++++++++++++++++++++++ 12 files changed, 5068 insertions(+) create mode 100644 .claude-plugin/plugin.json create mode 100644 README.md create mode 100644 SKILL.md create mode 100644 plugin.json create mode 100644 plugin.lock.json create mode 100644 references/anti-patterns.md create mode 100644 references/best-practices.md create mode 100644 references/ci-cd-optimization.md create mode 100644 references/coverage-strategy.md create mode 100644 references/legacy-code.md create mode 100644 references/methodologies.md create mode 100644 references/testing-layers.md diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json new file mode 100644 index 0000000..843b019 --- /dev/null +++ b/.claude-plugin/plugin.json @@ -0,0 +1,11 @@ +{ + "name": "testing", + "description": "Expert guidance on software testing strategies, methodologies, and best practices. Provides language-agnostic advice for designing test strategies, choosing appropriate test levels (unit, integration, E2E), applying methodologies (TDD, BDD), optimizing CI/CD pipelines, and testing legacy code.", + "version": "1.0.0", + "author": { + "name": "lxdb" + }, + "skills": [ + "./" + ] +} \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000..140b258 --- /dev/null +++ b/README.md @@ -0,0 +1,3 @@ +# testing + +Expert guidance on software testing strategies, methodologies, and best practices. Provides language-agnostic advice for designing test strategies, choosing appropriate test levels (unit, integration, E2E), applying methodologies (TDD, BDD), optimizing CI/CD pipelines, and testing legacy code. diff --git a/SKILL.md b/SKILL.md new file mode 100644 index 0000000..e2cb149 --- /dev/null +++ b/SKILL.md @@ -0,0 +1,197 @@ +--- +name: testing +description: Expert guidance on software testing strategies, methodologies, and best practices. Provides language-agnostic advice for designing test strategies, choosing appropriate test levels (unit, integration, E2E), applying methodologies (TDD, BDD), optimizing CI/CD pipelines, and testing legacy code. Use when designing test strategies, deciding between testing approaches, refactoring tests, or optimizing test suites. +--- + +# Testing Strategy Expert + +## Overview + +Transform Claude into an expert testing consultant providing comprehensive, language-agnostic guidance on software testing strategies, methodologies, and best practices. This skill enables informed decisions about test design, coverage strategy, CI/CD integration, and legacy code testing. + +**Note on Examples**: This skill uses Python syntax in code examples for clarity and consistency. However, all principles, strategies, and methodologies apply universally across programming languages including JavaScript, PHP, Java, C#, Ruby, Go, Rust, and others. 
The testing concepts transcend specific languages and frameworks—adapt the patterns to your language's idioms and testing tools. + +## Core Testing Principles + +Apply these foundational principles across all testing decisions: + +### 1. Test Behavior, Not Implementation +Verify observable outcomes and public contracts rather than internal implementation details. Tests coupled to implementation break during safe refactoring. Focus on what the system does from a user's or caller's perspective. + +**Example**: Test that a login function succeeds with valid credentials and fails appropriately with invalid ones, not that it calls specific internal methods in a particular order. + +### 2. Optimize for Signal Quality Over Coverage +A test suite with 80% coverage that catches real bugs beats one with 100% coverage that never fails meaningfully. Focus on tests that provide clear, actionable feedback when they break, not coverage percentages. + +### 3. Design for Fast Feedback Loops +Speed of feedback directly impacts developer productivity. Structure test suites for rapid feedback: fast unit tests run pre-commit, comprehensive suites run post-merge, exhaustive tests run nightly. + +### 4. Ensure Test Independence and Determinism +Every test should be independent, produce consistent results, and avoid shared state. Independent tests can run in parallel, fail clearly, and be debugged easily without cascading failures. + +### 5. Apply Tests at the Appropriate Level +Use unit tests for logic and edge cases, integration tests for component boundaries, E2E tests for critical paths. Testing everything through the UI wastes time; testing everything with mocks provides false confidence. + +## Quick Decision Framework + +Use this framework to rapidly determine the appropriate testing approach: + +### When to Use Unit Tests +- Business logic, algorithms, calculations +- Data transformations and pure functions +- Edge cases and boundary conditions +- Validation rules and complex conditionals +- Any scenario expressible as "given input X, expect output Y" + +**Target**: 70-80% of test suite, focusing on business logic coverage + +### When to Use Integration Tests +- Database queries and ORM behavior +- API endpoints with framework integration +- Service-to-service communication +- Cross-cutting concerns (auth, caching, transactions) +- Framework-specific behavior validation + +**Target**: 15-25% of test suite, focusing on component boundaries + +### When to Use E2E Tests +- Critical user journeys only (login, checkout, payment) +- Complete user workflows through the entire system +- Cross-system integration validation +- Production-like environment verification + +**Target**: 5-10% of test suite, focusing on critical paths only + +### When to Use Contract Tests +- Microservices architectures with independent service deployment +- Third-party API integrations +- Multi-team service development +- Preventing breaking changes at service boundaries + +## Testing Methodology Selection + +### Apply Test-Driven Development (TDD) When: +- Building complex business logic with well-understood requirements +- Exploring API design and interface contracts +- Implementing algorithms or data structures +- Building pure functions with clear input/output specifications + +**Skip TDD for**: UI implementation, proof-of-concept work, framework exploration, simple glue code + +### Apply Behavior-Driven Development (BDD) When: +- Building user-facing features requiring stakeholder collaboration +- Complex workflows where examples 
clarify requirements better than abstract descriptions +- Needing living documentation that business users can read +- Aligning product, engineering, and QA teams + +**Skip BDD for**: Technical components, utility functions, projects without business stakeholder involvement + +### Test-Later Approach When: +- Prototyping or spike solutions +- Working with legacy code (characterization tests first) +- Simple scripts or one-off tools +- Time-boxed experiments + +## Architecture-Specific Strategies + +### Monolithic Applications +Use the traditional testing pyramid: 70-80% unit tests, 15-20% integration tests, 5-10% E2E tests. Optimize for fast unit test execution while ensuring integration tests verify database and framework integration. + +### Microservices Architectures +Shift toward the testing honeycomb with greater emphasis on integration and contract tests. In distributed systems, the highest risk lies in service interactions. Prioritize contract testing for service compatibility. + +### Legacy Code +Start with characterization tests to document current behavior, identify seams for testing, break dependencies using safe refactorings, and prioritize testing by risk (critical + frequently changed code first). + +## Test Design Best Practices + +### Structure Tests with Arrange-Act-Assert (AAA) +- **Arrange**: Set up test environment, create objects, prepare test data +- **Act**: Execute the code being tested (single, clear action) +- **Assert**: Verify expected outcome with specific, meaningful assertions + +Keep tests focused on one logical assertion per test for clarity. + +### Ensure Deterministic Tests +- Eliminate timing dependencies (use conditional waits, not fixed sleeps) +- Avoid shared state between tests +- Mock external dependencies (APIs, databases, file systems) +- Use fixed seeds for random data or freeze time in tests + +### Apply Boundary Analysis +Test edge cases systematically: empty inputs, single element, boundary values (0, -1, max), just below/above limits, invalid inputs. This technique finds off-by-one errors and edge case handling issues. + +### Write Expressive Tests +- Use descriptive test names: `test_[function]_[scenario]_[expected_result]` +- Include meaningful assertion messages for complex conditions +- Keep tests short and readable +- Make test failures immediately communicate what broke and why + +### Mock Strategically +- **Stub** query dependencies that provide data +- **Mock** command dependencies where verifying interaction matters +- **Fake** complex dependencies needing realistic behavior (in-memory databases) +- **Avoid over-mocking** internal collaborators—mock only at system boundaries + +## Common Anti-Patterns to Avoid + +1. **Testing Implementation Details**: Tests break during refactoring despite unchanged behavior +2. **Over-Mocking**: Mocking every dependency couples tests to implementation +3. **Brittle Tests**: Tests coupled to UI selectors, private methods, internal state +4. **Inverted Pyramid**: More integration/E2E tests than unit tests leads to slow, flaky suites +5. **Coverage Chasing**: Optimizing for coverage percentage over test quality +6. **Flaky Tests**: Non-deterministic tests that erode trust in the test suite +7. 
**Global State**: Shared mutable state creates order dependencies + +## Coverage Strategy + +Focus on behavior coverage (scenarios validated) over code coverage (lines executed): + +- **Critical paths and business logic**: 90-100% coverage at unit level +- **Integration-heavy code**: 70-90% coverage at integration level +- **Edge cases and error handling**: Comprehensive coverage at unit level +- **Cross-cutting concerns**: Integration-level coverage + +Prioritize testing based on risk: business-critical code, frequently changed code, historically bug-prone areas deserve more thorough coverage. + +## CI/CD Integration + +Structure test execution for optimal feedback: + +1. **Pre-commit/Pre-push**: Fast unit tests (<30s) with linting +2. **Pull Request Pipeline**: All unit tests + critical integration tests (10-15 min) +3. **Post-merge**: Full regression suite including slower integration tests (20-45 min) +4. **Nightly/Scheduled**: Exhaustive E2E, performance, security tests (hours) + +Parallelize test execution, detect and quarantine flaky tests immediately, and track test health metrics (defect detection efficiency, execution time, flakiness rate). + +## Bundled Resources + +This skill includes comprehensive reference documentation for deep dives into specific testing topics: + +### references/testing-layers.md +Detailed guidance on unit, integration, E2E, and contract testing including specific scenarios, characteristics, and examples for each level. + +### references/methodologies.md +Comprehensive coverage of TDD, BDD, test-later approaches, and public behavior testing with workflow examples. + +### references/best-practices.md +In-depth test design patterns including deterministic testing, boundary analysis, test data management (fixtures, factories, builders), and mocking strategies. + +### references/anti-patterns.md +Common testing mistakes with explanations of why they're problematic and how to fix them. + +### references/ci-cd-optimization.md +Strategies for structuring test suites for fast feedback, parallel execution, flaky test management, and test health metrics. + +### references/legacy-code.md +Techniques for testing untestable code including characterization testing, identifying seams, breaking dependencies, and prioritization strategies. + +### references/coverage-strategy.md +Guidance on code vs behavior coverage, appropriate coverage targets by level, when coverage becomes harmful, and risk-based prioritization. + +## Usage + +When users ask about testing topics, consult the appropriate reference file for detailed guidance. Apply the core principles and quick decision framework first, then dive into specific methodologies or patterns as needed. + +Always provide context-aware recommendations that balance pragmatism with best practices, explaining trade-offs and helping teams make informed decisions rather than prescriptive rules. diff --git a/plugin.json b/plugin.json new file mode 100644 index 0000000..7624001 --- /dev/null +++ b/plugin.json @@ -0,0 +1,14 @@ +{ + "name": "testing", + "description": "Expert guidance on software testing strategies, methodologies, and best practices. Provides language-agnostic advice for designing test strategies, choosing appropriate test levels (unit, integration, E2E), applying methodologies (TDD, BDD), optimizing CI/CD pipelines, and testing legacy code. 
Use when designing test strategies, deciding between testing approaches, refactoring tests, or optimizing test suites.", + "version": "1.0.0", + "author": { + "name": "lxdb" + }, + "homepage": "https://github.com/aazbeltran/claude-code-plugins", + "repository": "https://github.com/aazbeltran/claude-code-plugins", + "license": "MIT", + "keywords": ["testing", "tdd", "bdd", "quality-assurance", "ci-cd", "best-practices", "unit-testing", "integration-testing", "e2e-testing"], + "category": "development", + "skills": ["./" ] +} diff --git a/plugin.lock.json b/plugin.lock.json new file mode 100644 index 0000000..34d7b31 --- /dev/null +++ b/plugin.lock.json @@ -0,0 +1,77 @@ +{ + "$schema": "internal://schemas/plugin.lock.v1.json", + "pluginId": "gh:aazbeltran/claude-code-plugins:plugins/testing", + "normalized": { + "repo": null, + "ref": "refs/tags/v20251128.0", + "commit": "264eec62c9fabaf5faabe461f200ac1e15931008", + "treeHash": "4c203038e520d4eef2a5109571286d3e9cec6aa07ab778a6dd80a36046471d4c", + "generatedAt": "2025-11-28T10:13:00.420229Z", + "toolVersion": "publish_plugins.py@0.2.0" + }, + "origin": { + "remote": "git@github.com:zhongweili/42plugin-data.git", + "branch": "master", + "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390", + "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data" + }, + "manifest": { + "name": "testing", + "description": "Expert guidance on software testing strategies, methodologies, and best practices. Provides language-agnostic advice for designing test strategies, choosing appropriate test levels (unit, integration, E2E), applying methodologies (TDD, BDD), optimizing CI/CD pipelines, and testing legacy code.", + "version": "1.0.0" + }, + "content": { + "files": [ + { + "path": "plugin.json", + "sha256": "5a001665bfe4b6a9003933d65b60eb80d392a361f28d1398ead24ecabbf5bc39" + }, + { + "path": "README.md", + "sha256": "22b66dd2c10913aefa85e9bf421c7d9d27fac93a057db890c6251e5e280bb6a2" + }, + { + "path": "SKILL.md", + "sha256": "ac4254b1496d10762b7d84ddbb559e4707a4b230eeed81acd5e87128d3a694c7" + }, + { + "path": "references/best-practices.md", + "sha256": "93b5740d5f0e3ab011bbc934c843b67afef0de276457477f55426bd75f942fb7" + }, + { + "path": "references/legacy-code.md", + "sha256": "614768fb22d5aa9fe283d58d7803593394d41eb7f25c3f4136f71eb5943798a6" + }, + { + "path": "references/testing-layers.md", + "sha256": "ddd023f8bf0f37cac5d9d702f2307a2b085720e01c2dd820650586d2c4a91dad" + }, + { + "path": "references/methodologies.md", + "sha256": "cfedefbe37c276615151ed6d2e983fec6e2e980441a97fcbd8f0979d9e73d2b5" + }, + { + "path": "references/ci-cd-optimization.md", + "sha256": "a1a7dfc7f2a81e2e73b36ce4dce27e8cddbf3cac25d48c3d6acb09ee78dd6747" + }, + { + "path": "references/anti-patterns.md", + "sha256": "98b134298b8c3d9993c14a59e63de87d5ead375b6b59ce407ffe31c2f9cd5377" + }, + { + "path": "references/coverage-strategy.md", + "sha256": "a53c852f20817617da7e9cf2f5aa52709432e5cd6181f787ad7731d483be3fdd" + }, + { + "path": ".claude-plugin/plugin.json", + "sha256": "580babcae2c21d7f657bed1095de82ef9aefc7c3fd573194d1c91ebcb6b667ac" + } + ], + "dirSha256": "4c203038e520d4eef2a5109571286d3e9cec6aa07ab778a6dd80a36046471d4c" + }, + "security": { + "scannedAt": null, + "scannerVersion": null, + "flags": [] + } +} \ No newline at end of file diff --git a/references/anti-patterns.md b/references/anti-patterns.md new file mode 100644 index 0000000..d559ae9 --- /dev/null +++ b/references/anti-patterns.md @@ -0,0 +1,687 @@ +# Testing Anti-Patterns: Common Mistakes to Avoid + +## 
Testing Implementation Details + +**The Anti-Pattern**: Tests verify internal mechanics rather than observable behavior, causing tests to break during safe refactoring. + +### Examples of Testing Implementation Details + +**Testing Private Methods Directly** +```python +# ❌ BAD: Testing private method +class OrderService: + def create_order(self, items): + self._validate_items(items) + return self._build_order(items) + + def _validate_items(self, items): + # Internal implementation detail + pass + + def _build_order(self, items): + # Internal implementation detail + pass + +def test_validate_items_rejects_empty(): + service = OrderService() + # Accessing private method + with pytest.raises(ValidationError): + service._validate_items([]) +``` + +**Testing Call Sequences** +```python +# ❌ BAD: Verifying internal method calls +def test_order_creation_process(): + mock_validator = Mock() + mock_builder = Mock() + service = OrderService(mock_validator, mock_builder) + + service.create_order(items) + + # Testing HOW it works internally + mock_validator.validate.assert_called_once_with(items) + mock_builder.build.assert_called_once() + assert mock_validator.validate.call_count == 1 +``` + +**Testing Internal State** +```python +# ❌ BAD: Accessing private fields +def test_user_activation(): + user = User() + + user.activate() + + assert user._activated_at is not None # Private field + assert user._activation_status == "active" # Private field +``` + +### How to Fix It + +**Test Through Public Interfaces** +```python +# ✅ GOOD: Testing observable behavior +def test_create_order_with_valid_items(): + service = OrderService() + items = [Item("Widget", 10)] + + order = service.create_order(items) + + assert order.id is not None + assert order.total == 10 + assert len(order.items) == 1 + +def test_create_order_with_empty_items_raises_error(): + service = OrderService() + + with pytest.raises(ValidationError): + service.create_order([]) +``` + +**Test Outcomes, Not Process** +```python +# ✅ GOOD: Verifying results +def test_order_completion(): + order_service = OrderService(email_service) + + completed_order = order_service.complete_order(order_id=123) + + assert completed_order.status == "completed" + assert completed_order.completed_at is not None + # Email sending is internal detail - verify user-visible outcome +``` + +### Why It Matters + +- **Refactoring breaks tests**: Changing internal structure breaks tests even though behavior is unchanged +- **False negatives**: Tests fail when nothing is actually broken +- **Maintenance burden**: Every internal change requires test updates +- **Missed real bugs**: Focus on implementation means less focus on behavior +- **Discourages improvement**: Fear of breaking tests prevents refactoring + +--- + +## Over-Mocking and Excessive Test Doubles + +**The Anti-Pattern**: Mocking every dependency regardless of whether it's necessary, creating tests that pass while real code fails. 
+ +### Examples of Over-Mocking + +**Mocking Everything** +```python +# ❌ BAD: Mocking internal collaborators +def test_user_service(): + mock_validator = Mock() + mock_repository = Mock() + mock_email_service = Mock() + mock_logger = Mock() + mock_cache = Mock() + mock_event_bus = Mock() + + service = UserService( + mock_validator, + mock_repository, + mock_email_service, + mock_logger, + mock_cache, + mock_event_bus + ) + + # Test provides no confidence real code works + service.create_user(data) + + # Verifying every interaction + mock_validator.validate.assert_called() + mock_repository.save.assert_called() + # ... and so on +``` + +**Mocking What You Own** +```python +# ❌ BAD: Mocking internal business logic +def test_order_total_calculation(): + mock_calculator = Mock() + mock_calculator.calculate.return_value = 100 + + order_service = OrderService(mock_calculator) + + total = order_service.get_order_total(order) + + assert total == 100 + # This test verifies nothing - just that mock returns what we told it to +``` + +### How to Fix It + +**Mock Only at System Boundaries** +```python +# ✅ GOOD: Mock external dependencies, use real internal objects +def test_user_service(): + # Mock external system + mock_email_service = Mock() + + # Real internal collaborators + validator = UserValidator() + repository = InMemoryUserRepository() + + service = UserService(validator, repository, mock_email_service) + + user = service.create_user(valid_user_data) + + assert user.id is not None + assert repository.find(user.id) == user +``` + +**Use Real Implementations Where Possible** +```python +# ✅ GOOD: Real implementations for testable components +def test_order_total_calculation(): + calculator = OrderCalculator() # Real calculator + order = Order(items=[Item("Widget", 10), Item("Gadget", 20)]) + + total = calculator.calculate_total(order) + + assert total == 30 +``` + +### When to Mock + +**DO mock**: +- External HTTP APIs +- Databases (sometimes - consider in-memory alternatives) +- File systems +- System time/clocks +- Random number generators +- Third-party services + +**DON'T mock**: +- Internal business logic +- Value objects and data structures +- Utilities and helpers you own +- Simple collaborators + +### Why It Matters + +- **False confidence**: Tests pass but real integrations fail +- **Implementation coupling**: Tests break when changing internal code structure +- **Maintenance nightmare**: Every refactoring breaks mocks +- **Missing integration bugs**: Real interactions between components never tested +- **Complex test setup**: More time setting up mocks than writing actual test + +--- + +## Brittle Tests That Block Refactoring + +**The Anti-Pattern**: Tests so tightly coupled to current implementation that refactoring becomes impossible without rewriting tests. 
+ +### Examples of Brittle Tests + +**UI Tests Coupled to Selectors** +```python +# ❌ BAD: Coupled to specific HTML structure +def test_login(): + browser.find_element_by_xpath( + '/html/body/div[1]/div[2]/form/div[1]/input' + ).send_keys('user@example.com') + + browser.find_element_by_xpath( + '/html/body/div[1]/div[2]/form/div[2]/input' + ).send_keys('password') + + browser.find_element_by_xpath( + '/html/body/div[1]/div[2]/form/button[1]' + ).click() +``` + +**Tests Dependent on Execution Order** +```python +# ❌ BAD: Tests must run in specific order +def test_1_create_user(): + global user_id + user_id = create_user("test@example.com") + +def test_2_activate_user(): + activate_user(user_id) # Depends on test_1 + +def test_3_delete_user(): + delete_user(user_id) # Depends on test_2 +``` + +**Tests Coupled to Data Format** +```python +# ❌ BAD: Expects exact JSON structure +def test_api_response(): + response = api.get_user(123) + + assert response == { + 'id': 123, + 'name': 'Test User', + 'email': 'test@example.com', + 'created_at': '2024-01-01T00:00:00Z', + 'updated_at': '2024-01-01T00:00:00Z' + } + # Breaks if any field added or order changes +``` + +### How to Fix It + +**Use Stable Selectors** +```python +# ✅ GOOD: Use semantic selectors +def test_login(): + browser.find_element_by_label_text('Email').send_keys('user@example.com') + browser.find_element_by_label_text('Password').send_keys('password') + browser.find_element_by_role('button', name='Login').click() + +# Or use data-testid attributes +def test_login_with_test_ids(): + browser.find_element_by_test_id('email-input').send_keys('user@example.com') + browser.find_element_by_test_id('password-input').send_keys('password') + browser.find_element_by_test_id('login-button').click() +``` + +**Make Tests Independent** +```python +# ✅ GOOD: Each test is self-contained +def test_create_user(): + user_id = create_user("test@example.com") + assert user_id is not None + +def test_activate_user(): + user_id = create_user("test2@example.com") + result = activate_user(user_id) + assert result.success + +def test_delete_user(): + user_id = create_user("test3@example.com") + delete_user(user_id) + assert find_user(user_id) is None +``` + +**Test Required Fields, Ignore Optional Ones** +```python +# ✅ GOOD: Test essential fields only +def test_api_response(): + response = api.get_user(123) + + assert response['id'] == 123 + assert response['email'] == 'test@example.com' + assert 'created_at' in response + # Don't care about other fields or order +``` + +### Why It Matters + +- **Prevents refactoring**: Fear of breaking tests stops code improvement +- **High maintenance cost**: Every small change requires test updates +- **False failures**: Tests fail when nothing is actually broken +- **Slows development**: More time fixing tests than improving code +- **Tests become liability**: Team considers tests obstacle rather than asset + +--- + +## The Inverted Test Pyramid (Ice Cream Cone) + +**The Anti-Pattern**: More E2E and integration tests than unit tests, leading to slow, flaky test suites. 
+ +### What It Looks Like + +``` + /\ + / \ ← Large E2E test suite (slow, flaky) + / \ + /------\ ← Moderate integration tests + / \ + / unit \ ← Small unit test suite + /____________\ +``` + +**Symptoms**: +- Test suite takes 30+ minutes to run +- Frequent flaky test failures +- Developers skip running tests locally +- CI pipeline is bottleneck +- Hard to diagnose failures + +### How It Happens + +**Over-Reliance on E2E Tests** +```python +# ❌ BAD: Testing edge cases through UI +def test_registration_empty_email(): + browser.get('/register') + # ... fill form with empty email ... + # 10 seconds to test something unit test could verify in milliseconds + +def test_registration_invalid_email(): + browser.get('/register') + # ... fill form with invalid email ... + # Another 10 seconds + +def test_registration_duplicate_email(): + browser.get('/register') + # ... fill form with existing email ... + # Another 10 seconds + +# 30 seconds to test what should be 3 fast unit tests +``` + +**Missing Unit Tests** +```python +# ❌ BAD: No unit tests for business logic +# All testing happens through API or UI +# Business logic never tested in isolation +``` + +### How to Fix It + +**Push Tests Down the Pyramid** +```python +# ✅ GOOD: Unit test edge cases +def test_email_validation_empty(): + with pytest.raises(ValidationError): + validate_email("") + +def test_email_validation_invalid_format(): + with pytest.raises(ValidationError): + validate_email("notanemail") + +def test_email_validation_duplicate(): + with pytest.raises(ValidationError): + validate_email_uniqueness("existing@example.com") + +# Each test: milliseconds + +# ✅ GOOD: One E2E test for happy path +def test_successful_registration(): + browser.get('/register') + fill_registration_form(valid_data) + submit() + assert_redirected_to_dashboard() + +# One E2E test: 10 seconds +``` + +**Proper Test Distribution** +``` + /\ + /E2E\ ← Few critical path tests (5-10%) + /____\ + / \ + / Int \ ← Component boundary tests (15-25%) + / Tests \ + /____________\ + / \ +/ Unit Tests \ ← Most tests here (70-80%) +/__________________\ +``` + +### Why It Matters + +- **Slow feedback**: 30-minute test suites kill productivity +- **Flaky tests**: E2E tests are inherently less stable +- **Unclear failures**: Hard to pinpoint bugs in complex E2E tests +- **Resource intensive**: E2E tests require more infrastructure +- **Maintenance burden**: UI changes break many E2E tests + +--- + +## Assertion Roulette and Unclear Failures + +**The Anti-Pattern**: Tests with multiple unrelated assertions that make failures ambiguous. + +### Examples + +**Too Many Assertions** +```python +# ❌ BAD: Which assertion failed? +def test_user_operations(): + user = create_user("test@example.com") + assert user.id is not None + assert user.email == "test@example.com" + assert user.is_active + user.update(name="Test User") + assert user.name == "Test User" + user.deactivate() + assert not user.is_active + user.delete() + assert find_user(user.id) is None + # If test fails on line 7, was it deactivate() or the assertion that broke? +``` + +**No Assertion Messages** +```python +# ❌ BAD: Failure gives no context +def test_discount_calculation(): + result = calculate_discount(customer, order) + assert result == 10 + # Failure: AssertionError: assert 0 == 10 + # (Why did we expect 10? What were the inputs?) 
+``` + +### How to Fix It + +**One Logical Assertion Per Test** +```python +# ✅ GOOD: Focused tests +def test_user_creation_assigns_id(): + user = create_user("test@example.com") + assert user.id is not None + +def test_user_creation_sets_email(): + user = create_user("test@example.com") + assert user.email == "test@example.com" + +def test_new_user_is_active_by_default(): + user = create_user("test@example.com") + assert user.is_active +``` + +**Add Descriptive Assertion Messages** +```python +# ✅ GOOD: Clear failure messages +def test_vip_discount_calculation(): + customer = Customer(type="VIP") + order = Order(total=100) + + discount = calculate_discount(customer, order) + + assert discount == 10, \ + f"VIP customer with ${order.total} order should get $10 discount, got ${discount}" +``` + +### Why It Matters + +- **Unclear failures**: Can't tell what broke without debugging +- **Wasted time**: Must debug to understand test failure +- **Cascading failures**: One bug causes multiple assertion failures +- **Hard to maintain**: Complex tests are hard to understand and update + +--- + +## Global State and Hidden Dependencies + +**The Anti-Pattern**: Tests that depend on global state, singletons, or hidden dependencies, creating order dependencies and mysterious failures. + +### Examples + +**Global State** +```python +# ❌ BAD: Tests share global state +current_user = None + +def test_login(): + global current_user + current_user = login("user@example.com", "password") + assert current_user is not None + +def test_access_dashboard(): + # Depends on global current_user from previous test + response = get_dashboard(current_user) + assert response.status == 200 +``` + +**Singleton Dependencies** +```python +# ❌ BAD: Singleton holds state between tests +class Database: + _instance = None + _data = {} + + @classmethod + def instance(cls): + if cls._instance is None: + cls._instance = cls() + return cls._instance + +def test_save_user(): + db = Database.instance() + db.save(user) + # Pollutes singleton state + +def test_user_count(): + db = Database.instance() + # Still has data from previous test + assert db.count() == 1 # Breaks if other tests ran first +``` + +### How to Fix It + +**Explicit Dependencies** +```python +# ✅ GOOD: Explicit fixtures +@pytest.fixture +def authenticated_user(): + user = User(email="test@example.com") + session = login(user) + return session + +def test_access_dashboard(authenticated_user): + response = get_dashboard(authenticated_user) + assert response.status == 200 +``` + +**Dependency Injection** +```python +# ✅ GOOD: Inject dependencies +@pytest.fixture +def database(): + db = Database() + yield db + db.clear() + +def test_save_user(database): + user = User(email="test@example.com") + database.save(user) + assert database.count() == 1 + +def test_find_user(database): + user = User(email="test@example.com") + database.save(user) + found = database.find_by_email("test@example.com") + assert found == user +``` + +### Why It Matters + +- **Order dependencies**: Tests must run in specific order +- **Flaky failures**: Tests pass individually but fail in suite +- **Parallel execution impossible**: Can't run tests concurrently +- **Hard to debug**: Failures depend on what ran before +- **Misleading results**: Test results inconsistent between runs + +--- + +## High Coverage, Low Value Tests + +**The Anti-Pattern**: Achieving high code coverage with tests that don't catch bugs, providing false sense of security. 
+ +### Examples + +**Tests That Don't Assert** +```python +# ❌ BAD: Executes code but verifies nothing +def test_process_order(): + service = OrderService() + service.process(order) + # No assertions - just executes code for coverage +``` + +**Testing Trivial Code** +```python +# ❌ BAD: Testing getters and setters +def test_user_get_email(): + user = User(email="test@example.com") + assert user.get_email() == "test@example.com" + +def test_user_set_email(): + user = User() + user.set_email("test@example.com") + assert user.email == "test@example.com" +``` + +**Testing Framework Behavior** +```python +# ❌ BAD: Testing that framework works +def test_django_model_save(): + user = User(email="test@example.com") + user.save() + # Just testing Django's save() works - not your code +``` + +### How to Fix It + +**Test Behavior, Not Code Execution** +```python +# ✅ GOOD: Verifies actual behavior +def test_order_processing_updates_inventory(): + inventory = Inventory(product_id=1, quantity=10) + order = Order(product_id=1, quantity=3) + + process_order(order, inventory) + + assert inventory.quantity == 7 + +def test_order_processing_sends_confirmation(): + mock_email = Mock() + order = Order(customer_email="test@example.com") + + process_order(order, email_service=mock_email) + + mock_email.send_confirmation.assert_called_with("test@example.com") +``` + +**Focus on Complex Logic** +```python +# ✅ GOOD: Test non-trivial behavior +def test_discount_calculation_for_bulk_orders(): + items = [Item("Widget", 10, quantity=100)] + + discount = calculate_bulk_discount(items) + + assert discount == 500 # 50% off for orders > 50 units +``` + +### Why It Matters + +- **False confidence**: High coverage numbers hide lack of actual testing +- **Bugs slip through**: Tests don't catch real problems +- **Wasted effort**: Time spent on tests that provide no value +- **Misleading metrics**: Coverage metric becomes meaningless + +--- + +## Summary: Avoiding Anti-Patterns + +| Anti-Pattern | Impact | Solution | +|-------------|---------|----------| +| Testing implementation details | Brittle tests, blocks refactoring | Test through public interfaces, verify behavior | +| Over-mocking | False confidence, missed integration bugs | Mock only external boundaries | +| Brittle tests | High maintenance, prevents improvement | Use stable selectors, test essentials | +| Inverted pyramid | Slow, flaky suite | Push tests down: more unit, fewer E2E | +| Assertion roulette | Unclear failures | One logical assertion per test | +| Global state | Order dependencies, flaky tests | Use fixtures, dependency injection | +| Coverage without value | False security | Test behavior, not code execution | + +**Core Principle**: Good tests enable confident refactoring and catch real bugs. If tests block improvement or fail without catching bugs, they've become an anti-pattern. diff --git a/references/best-practices.md b/references/best-practices.md new file mode 100644 index 0000000..0c4425e --- /dev/null +++ b/references/best-practices.md @@ -0,0 +1,761 @@ +# Test Design Best Practices + +## Building Stable, Deterministic Tests + +The foundation of a reliable test suite is deterministic tests that produce consistent results. Non-deterministic or "flaky" tests erode trust and waste time debugging phantom failures. + +### Eliminating Timing Dependencies + +**The Problem**: Fixed sleep statements create unreliable tests that either waste time or fail intermittently. 
+ +```python +# ❌ BAD: Fixed sleep +def test_async_operation(): + start_async_task() + time.sleep(5) # Hope 5 seconds is enough + assert task_completed() + +# ✅ GOOD: Conditional wait +def test_async_operation(): + start_async_task() + wait_until(lambda: task_completed(), timeout=10, interval=0.1) + assert task_completed() +``` + +**Strategies**: +- Use explicit waits with conditions, not arbitrary sleeps +- Implement timeout-based polling for async operations +- Use testing framework wait utilities (e.g., Selenium WebDriverWait) +- Mock time-dependent code to make tests deterministic +- Use fast fake timers in tests instead of actual delays + +**Example: Polling with Timeout** +```python +def wait_until(condition, timeout=10, interval=0.1): + """Wait until condition is true or timeout expires.""" + end_time = time.time() + timeout + while time.time() < end_time: + if condition(): + return True + time.sleep(interval) + raise TimeoutError(f"Condition not met within {timeout} seconds") + +def test_message_processing(): + queue.publish(message) + + wait_until(lambda: queue.is_empty(), timeout=5) + + assert message_handler.processed_count == 1 +``` + +### Avoiding Shared State + +**The Problem**: Tests that share state create order dependencies and cascading failures. + +```python +# ❌ BAD: Shared state +shared_user = None + +def test_create_user(): + global shared_user + shared_user = User.create(email="test@example.com") + assert shared_user.id is not None + +def test_user_can_login(): + # Depends on previous test + result = login(shared_user.email, "password") + assert result.success + +# ✅ GOOD: Independent tests +@pytest.fixture +def user(): + """Each test gets its own user.""" + user = User.create(email=f"test-{uuid4()}@example.com") + yield user + user.delete() + +def test_create_user(user): + assert user.id is not None + +def test_user_can_login(user): + result = login(user.email, "password") + assert result.success +``` + +**Strategies**: +- Use test fixtures that create fresh instances +- Employ database transactions that rollback after each test +- Use in-memory databases that reset between tests +- Avoid global variables and singletons in tests +- Make each test completely self-contained + +**Database Isolation Pattern** +```python +@pytest.fixture +def db_transaction(): + """Each test runs in a transaction that rolls back.""" + connection = engine.connect() + transaction = connection.begin() + session = Session(bind=connection) + + yield session + + session.close() + transaction.rollback() + connection.close() + +def test_user_creation(db_transaction): + user = User(email="test@example.com") + db_transaction.add(user) + db_transaction.commit() + + assert db_transaction.query(User).count() == 1 + # Transaction rolls back after test - no cleanup needed +``` + +### Handling External Dependencies + +**The Problem**: Tests depending on external systems (APIs, databases, file systems) become non-deterministic and slow. 
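+The strategies listed next cover the main options; as one concrete illustration, the in-memory database approach often amounts to a small pytest fixture. A minimal sketch, assuming SQLAlchemy is available (the `User` model is a stand-in):
+
+```python
+import pytest
+from sqlalchemy import create_engine, Column, Integer, String
+from sqlalchemy.orm import declarative_base, sessionmaker
+
+Base = declarative_base()
+
+class User(Base):
+    __tablename__ = "users"
+    id = Column(Integer, primary_key=True)
+    email = Column(String, unique=True, nullable=False)
+
+@pytest.fixture
+def session():
+    """Fresh in-memory database per test: fast, isolated, deterministic."""
+    engine = create_engine("sqlite:///:memory:")
+    Base.metadata.create_all(engine)
+    session = sessionmaker(bind=engine)()
+    yield session
+    session.close()
+    engine.dispose()
+
+def test_saving_and_finding_a_user(session):
+    session.add(User(email="test@example.com"))
+    session.commit()
+
+    found = session.query(User).filter_by(email="test@example.com").one()
+
+    assert found.email == "test@example.com"
+```
+
+Unlike the transaction-rollback pattern shown earlier, each test here gets a brand-new throwaway engine, trading a little setup time for complete isolation.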
+ +**Strategies**: +- **Mock external HTTP APIs**: Use libraries like `responses`, `httpretty`, or `requests-mock` +- **Use in-memory databases**: SQLite in-memory for SQL databases +- **Fake file systems**: Use `pyfakefs` or similar for file operations +- **Container-based dependencies**: Use Testcontainers for isolated, real dependencies +- **Test doubles**: Stubs, mocks, and fakes for external services + +**Mocking HTTP Calls** +```python +import responses + +@responses.activate +def test_fetch_user_data(): + responses.add( + responses.GET, + 'https://api.example.com/users/123', + json={'id': 123, 'name': 'Test User'}, + status=200 + ) + + user_data = api_client.get_user(123) + + assert user_data['name'] == 'Test User' +``` + +### Controlling Randomness and Time + +**Random Data**: Use fixed seeds for reproducible randomness +```python +# ❌ BAD: Non-deterministic +def test_shuffle(): + items = [1, 2, 3, 4, 5] + random.shuffle(items) # Different result each run + assert items[0] < items[-1] + +# ✅ GOOD: Deterministic +def test_shuffle(): + random.seed(42) + items = [1, 2, 3, 4, 5] + random.shuffle(items) + assert items == [2, 4, 5, 3, 1] # Always same order +``` + +**Time-Dependent Code**: Mock current time +```python +from datetime import datetime +from unittest.mock import patch + +# ❌ BAD: Depends on actual current time +def test_is_expired(): + expiry = datetime(2024, 1, 1) + assert is_expired(expiry) # Fails after 2024-01-01 + +# ✅ GOOD: Freezes time +@patch('mymodule.datetime') +def test_is_expired(mock_datetime): + mock_datetime.now.return_value = datetime(2024, 6, 1) + + expiry = datetime(2024, 1, 1) + + assert is_expired(expiry) +``` + +--- + +## Boundary Analysis and Equivalence Partitioning + +Systematic approaches to test input selection ensure comprehensive coverage without exhaustive testing. + +### Boundary Value Analysis + +**Principle**: Bugs often occur at the edges of input ranges. Test boundaries explicitly. + +**For a function accepting ages 0-120:** +- Test: -1 (below minimum) +- Test: 0 (minimum boundary) +- Test: 1 (just above minimum) +- Test: 119 (just below maximum) +- Test: 120 (maximum boundary) +- Test: 121 (above maximum) + +**Example Implementation** +```python +def validate_age(age): + if age < 0 or age > 120: + raise ValueError("Age must be between 0 and 120") + return True + +@pytest.mark.parametrize("age,should_pass", [ + (-1, False), # Below minimum + (0, True), # Minimum boundary + (1, True), # Just above minimum + (60, True), # Typical value + (119, True), # Just below maximum + (120, True), # Maximum boundary + (121, False), # Above maximum +]) +def test_age_validation(age, should_pass): + if should_pass: + assert validate_age(age) == True + else: + with pytest.raises(ValueError): + validate_age(age) +``` + +### Equivalence Partitioning + +**Principle**: Divide input space into classes that should behave identically. Test one representative from each class. 
+ +**For a discount function based on purchase amount:** +- Partition 1: $0-100 (no discount) +- Partition 2: $101-500 (10% discount) +- Partition 3: $501+ (20% discount) + +Test representatives: $50, $300, $600 + +**Example Implementation** +```python +def calculate_discount(amount): + if amount <= 100: + return 0 + elif amount <= 500: + return amount * 0.10 + else: + return amount * 0.20 + +@pytest.mark.parametrize("amount,expected_discount", [ + (50, 0), # Partition 1: 0-100 + (100, 0), # Boundary + (101, 10.10), # Just into partition 2 + (300, 30), # Partition 2: 101-500 + (500, 50), # Boundary + (501, 100.20), # Just into partition 3 + (1000, 200), # Partition 3: 501+ +]) +def test_discount_calculation(amount, expected_discount): + assert calculate_discount(amount) == pytest.approx(expected_discount) +``` + +### Property-Based Testing + +**Principle**: Define invariants that should always hold, then generate hundreds of test cases automatically. + +**Example: Reversing a List** +```python +from hypothesis import given +from hypothesis.strategies import lists, integers + +@given(lists(integers())) +def test_reverse_twice_returns_original(items): + """Reversing a list twice should return the original.""" + reversed_once = list(reversed(items)) + reversed_twice = list(reversed(reversed_once)) + assert reversed_twice == items + +@given(lists(integers())) +def test_sorted_list_has_same_elements(items): + """Sorted list should have same elements as original.""" + sorted_items = sorted(items) + assert sorted(sorted_items) == sorted(items) + assert len(sorted_items) == len(items) +``` + +**Benefits**: +- Finds edge cases you didn't think to test +- Tests many more scenarios than manual tests +- Shrinks failing examples to minimal reproducible cases +- Documents properties/invariants of the code + +--- + +## Writing Expressive, Maintainable Tests + +### Arrange-Act-Assert (AAA) Pattern + +**Structure**: Organize tests into three clear sections for maximum readability. + +```python +def test_user_registration(): + # Arrange: Set up test data and dependencies + email = "newuser@example.com" + password = "SecurePass123!" + user_service = UserService(database, email_service) + + # Act: Execute the behavior being tested + result = user_service.register(email, password) + + # Assert: Verify expected outcomes + assert result.success == True + assert result.user.email == email + assert result.user.is_verified == False +``` + +**Benefits**: +- Immediately clear what the test does +- Easy to identify what's being tested +- Simple to understand failure location +- Consistent structure across test suite + +**Variations**: +- Use blank lines to separate sections +- Use comments to label sections +- BDD Given-When-Then is equivalent structure + +### Descriptive Test Names + +**Principle**: Test names should describe scenario and expected outcome without reading test code. 
+ +**Naming Patterns**: +```python +# Pattern 1: test_[function]_[scenario]_[expected_result] +def test_calculate_discount_for_vip_customer_returns_ten_percent(): + pass + +# Pattern 2: test_[scenario]_[expected_result] +def test_invalid_email_raises_validation_error(): + pass + +# Pattern 3: Natural language with underscores +def test_user_cannot_login_with_expired_password(): + pass + +# Pattern 4: BDD-style +def test_given_vip_customer_when_ordering_then_receives_discount(): + pass +``` + +**Good Test Names**: +- ✅ `test_empty_cart_checkout_raises_error` +- ✅ `test_duplicate_email_registration_rejected` +- ✅ `test_expired_token_authentication_fails` +- ✅ `test_concurrent_order_updates_handled_correctly` + +**Bad Test Names**: +- ❌ `test_1`, `test_2`, `test_3` (meaningless) +- ❌ `test_order` (what about orders?) +- ❌ `test_the_thing_works` (what thing? how?) +- ❌ `test_bug_fix` (what bug?) + +### Clear, Actionable Assertions + +**Principle**: Assertion failures should immediately communicate what went wrong. + +```python +# ❌ BAD: Unclear failure message +def test_discount_calculation(): + assert calculate_discount(100) == 10 + +# Output: AssertionError: assert 0 == 10 +# (Why did we expect 10? What was the input context?) + +# ✅ GOOD: Clear failure message +def test_discount_calculation(): + customer = Customer(type="VIP") + order_total = 100 + + discount = calculate_discount(customer, order_total) + + assert discount == 10, \ + f"VIP customer with ${order_total} order should get $10 discount, got ${discount}" + +# Output: AssertionError: VIP customer with $100 order should get $10 discount, got $0 +``` + +**Custom Assertion Messages**: +```python +# Complex conditions benefit from explanatory messages +assert user.is_active, \ + f"User {user.email} should be active after verification, but is_active={user.is_active}" + +assert len(results) > 0, \ + f"Search for '{search_term}' should return results, got empty list" + +assert response.status_code == 200, \ + f"GET /api/users/{user_id} should return 200, got {response.status_code}: {response.text}" +``` + +### Keep Tests Focused and Small + +**Principle**: Each test should verify one behavior or logical assertion. 
+ +```python +# ❌ BAD: Testing too much +def test_user_lifecycle(): + # Create user + user = User.create(email="test@example.com") + assert user.id is not None + + # Activate user + user.activate() + assert user.is_active + + # Update profile + user.update_profile(name="Test User") + assert user.name == "Test User" + + # Deactivate user + user.deactivate() + assert not user.is_active + + # Delete user + user.delete() + assert User.find(user.id) is None + +# ✅ GOOD: Focused tests +def test_create_user_assigns_id(): + user = User.create(email="test@example.com") + assert user.id is not None + +def test_activate_user_sets_active_flag(): + user = User.create(email="test@example.com") + + user.activate() + + assert user.is_active == True + +def test_update_profile_changes_name(): + user = User.create(email="test@example.com") + + user.update_profile(name="Test User") + + assert user.name == "Test User" +``` + +**When Multiple Assertions Are OK**: +- Verifying different properties of the same object +- Checking related side effects of single action +- Asserting preconditions and postconditions + +```python +# ✅ GOOD: Multiple related assertions +def test_order_creation(): + items = [Item("Widget", 10), Item("Gadget", 20)] + + order = Order.create(items) + + assert order.id is not None + assert order.total == 30 + assert len(order.items) == 2 + assert order.status == "pending" +``` + +--- + +## Test Data Management + +### Fixtures: Providing Test Dependencies + +**Fixtures** provide reusable test dependencies and setup/teardown logic. + +**Pytest Fixtures** +```python +@pytest.fixture +def database(): + """Provide a test database.""" + db = create_test_database() + yield db + db.close() + +@pytest.fixture +def user(database): + """Provide a test user.""" + user = User.create(email="test@example.com") + yield user + user.delete() + +def test_user_can_place_order(user, database): + order = Order.create(user=user, items=[Item("Widget", 10)]) + assert order.user_id == user.id +``` + +**Fixture Scopes**: +- `function`: New instance per test (default) +- `class`: Shared across test class +- `module`: Shared across test file +- `session`: Shared across entire test run + +```python +@pytest.fixture(scope="session") +def app(): + """Start application once for entire test session.""" + app = create_app() + app.start() + yield app + app.stop() +``` + +### Factories: Flexible Test Data Creation + +**Factories** create test objects with sensible defaults that can be customized. 
+ +```python +# Factory pattern +class UserFactory: + _counter = 0 + + @classmethod + def create(cls, **kwargs): + cls._counter += 1 + defaults = { + 'email': f'user{cls._counter}@example.com', + 'name': f'Test User {cls._counter}', + 'role': 'user', + 'is_active': True + } + defaults.update(kwargs) + return User(**defaults) + +# Usage +def test_admin_user(): + admin = UserFactory.create(role='admin', name='Admin User') + assert admin.role == 'admin' + +def test_inactive_user(): + inactive = UserFactory.create(is_active=False) + assert not inactive.is_active +``` + +**FactoryBoy Library** +```python +import factory + +class UserFactory(factory.Factory): + class Meta: + model = User + + email = factory.Sequence(lambda n: f'user{n}@example.com') + name = factory.Faker('name') + role = 'user' + is_active = True + +# Usage +def test_user_creation(): + user = UserFactory() + assert user.email.endswith('@example.com') + +def test_admin_user(): + admin = UserFactory(role='admin') + assert admin.role == 'admin' +``` + +### Builders: Constructing Complex Objects + +**Builders** provide fluent APIs for creating complex test objects. + +```python +class OrderBuilder: + def __init__(self): + self._user = None + self._items = [] + self._status = 'pending' + self._shipping_address = None + + def with_user(self, user): + self._user = user + return self + + def with_item(self, name, price, quantity=1): + self._items.append(Item(name, price, quantity)) + return self + + def with_status(self, status): + self._status = status + return self + + def with_shipping(self, address): + self._shipping_address = address + return self + + def build(self): + return Order( + user=self._user, + items=self._items, + status=self._status, + shipping_address=self._shipping_address + ) + +# Usage +def test_order_total_calculation(): + order = (OrderBuilder() + .with_item("Widget", 10, quantity=2) + .with_item("Gadget", 20) + .build()) + + assert order.total == 40 +``` + +### Snapshot/Golden Master Testing + +**Principle**: Capture complex output as baseline, verify future runs match. 
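+Stripped of tooling, a golden-master check is just a comparison against a file committed to the repository. A minimal hand-rolled sketch (the `render_report` function is hypothetical; the fixture-based example further below achieves the same with less ceremony):
+
+```python
+from pathlib import Path
+
+GOLDEN_DIR = Path(__file__).parent / "golden"
+
+def assert_matches_golden(actual: str, filename: str) -> None:
+    """First run writes the baseline; later runs fail on any difference."""
+    golden = GOLDEN_DIR / filename
+    if not golden.exists():
+        GOLDEN_DIR.mkdir(parents=True, exist_ok=True)
+        golden.write_text(actual)
+        return  # Baseline captured; review and commit the file
+    assert actual == golden.read_text(), (
+        f"Output differs from golden file {filename}; "
+        "update the baseline only after reviewing the change"
+    )
+
+def test_monthly_report_rendering():
+    report = render_report(month="2024-01")  # hypothetical report generator
+    assert_matches_golden(report, "monthly_report.txt")
+```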
+ +**Use Cases**: +- Testing rendered HTML or UI components +- Verifying generated reports or documents +- Validating complex JSON/XML output +- Testing data transformations with many fields + +```python +def test_user_profile_rendering(snapshot): + user = User(name="John Doe", email="john@example.com", age=30) + + html = render_user_profile(user) + + snapshot.assert_match(html, 'user_profile.html') + +# First run: Captures output as baseline +# Subsequent runs: Compares against baseline +# To update baseline: pytest --snapshot-update +``` + +**Trade-offs**: +- **Pro**: Easy to test complex outputs +- **Pro**: Catches unintended changes +- **Con**: Changes to output format require updating all snapshots +- **Con**: Can hide real regressions if snapshots updated without review + +--- + +## Mocking, Stubbing, and Fakes + +### Understanding the Distinctions + +**Stub**: Provides canned responses to method calls +```python +class StubEmailService: + def send_email(self, to, subject, body): + return True # Always succeeds, doesn't actually send +``` + +**Mock**: Records interactions and verifies they occurred as expected +```python +from unittest.mock import Mock + +email_service = Mock() +user_service.register(user) + +email_service.send_welcome_email.assert_called_once_with(user.email) +``` + +**Fake**: Simplified working implementation +```python +class FakeDatabase: + def __init__(self): + self._data = {} + + def save(self, id, value): + self._data[id] = value + + def find(self, id): + return self._data.get(id) +``` + +### When to Use Each Type + +**Use Stubs** for query dependencies that provide data +```python +def test_calculate_user_discount(): + stub_repository = StubUserRepository() + stub_repository.set_return_value(User(type="VIP")) + + discount_service = DiscountService(stub_repository) + + discount = discount_service.calculate_for_user(user_id=123) + + assert discount == 10 +``` + +**Use Mocks** for command dependencies where interaction matters +```python +def test_order_completion_sends_confirmation(): + mock_email_service = Mock() + order_service = OrderService(mock_email_service) + + order_service.complete_order(order_id=123) + + mock_email_service.send_confirmation.assert_called_once() +``` + +**Use Fakes** for complex dependencies needing realistic behavior +```python +def test_user_repository_operations(): + fake_db = FakeDatabase() + repository = UserRepository(fake_db) + + user = User(email="test@example.com") + repository.save(user) + + found = repository.find_by_email("test@example.com") + assert found.email == user.email +``` + +### Avoiding Over-Mocking + +**The Problem**: Mocking every dependency couples tests to implementation. 
+ +```python +# ❌ BAD: Over-mocking +def test_order_creation(): + mock_validator = Mock() + mock_repository = Mock() + mock_calculator = Mock() + mock_logger = Mock() + + service = OrderService(mock_validator, mock_repository, mock_calculator, mock_logger) + + service.create_order(data) + + mock_validator.validate.assert_called_once() + mock_calculator.calculate_total.assert_called_once() + mock_repository.save.assert_called_once() + mock_logger.log.assert_called() + +# ✅ GOOD: Mock only external boundaries +def test_order_creation(): + repository = InMemoryOrderRepository() + service = OrderService(repository) + + order = service.create_order(data) + + assert order.total == 100 + assert repository.find(order.id) is not None +``` + +**Guidelines**: +- Mock external systems (databases, APIs, file systems) +- Use real implementations for internal collaborators +- Don't verify every method call +- Focus on observable behavior, not interactions + +--- + +## Summary: Test Design Principles + +1. **Deterministic tests**: Eliminate timing issues, shared state, and external dependencies +2. **Boundary testing**: Test edges systematically where bugs hide +3. **Expressive tests**: Clear names, AAA structure, actionable failures +4. **Focused tests**: One behavior per test, small and understandable +5. **Flexible test data**: Use factories and builders for maintainable test data +6. **Strategic mocking**: Mock at system boundaries, use real implementations internally +7. **Fast execution**: Keep unit tests fast for rapid feedback loops + +These practices work together to create test suites that catch bugs, enable refactoring, and remain maintainable over time. diff --git a/references/ci-cd-optimization.md b/references/ci-cd-optimization.md new file mode 100644 index 0000000..893ff39 --- /dev/null +++ b/references/ci-cd-optimization.md @@ -0,0 +1,735 @@ +# CI/CD Test Optimization and Feedback Loop Strategies + +## Structuring Test Suites for Fast Feedback + +Modern development velocity depends on rapid feedback from automated tests. The key is structuring test execution to provide the right feedback at the right time in the development workflow. + +### The Feedback Loop Hierarchy + +**Goal**: Provide fastest feedback for most common operations, comprehensive testing for critical gates. + +``` +Pre-Commit Hook → Ultra-fast (< 30s) → Run constantly +Pre-Push Hook → Fast (2-5 min) → Before pushing code +PR Pipeline → Quick (10-15 min) → For every PR +Post-Merge Pipeline → Moderate (20-45 min) → After merging +Nightly Build → Comprehensive (hrs) → Daily deep validation +``` + +### Pre-Commit Testing (< 30 seconds) + +**Purpose**: Catch obvious issues before code is committed, maintaining flow state. + +**What to Run**: +- Linting and code formatting +- Type checking (if applicable) +- Fast unit tests for changed files only +- Quick static analysis + +**Example Git Hook** +```bash +#!/bin/bash +# .git/hooks/pre-commit + +# Run linter +echo "Running linter..." +pylint $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$') +if [ $? -ne 0 ]; then + echo "Linting failed. Commit aborted." + exit 1 +fi + +# Run type checker +echo "Running type checker..." +mypy $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$') +if [ $? -ne 0 ]; then + echo "Type checking failed. Commit aborted." + exit 1 +fi + +# Run fast tests for changed files +echo "Running tests..." +pytest --quick tests/unit/ -k "$(git diff --cached --name-only | sed 's/\.py$//' | tr '\n' ' ')" +if [ $? 
-ne 0 ]; then + echo "Tests failed. Commit aborted." + exit 1 +fi + +echo "Pre-commit checks passed!" +``` + +**Configuration** +```ini +# pytest.ini +[pytest] +markers = + quick: marks tests as quick (< 100ms each) + +# In test files +@pytest.mark.quick +def test_calculate_total(): + assert calculate_total([1, 2, 3]) == 6 +``` + +**Trade-offs**: +- ✅ Immediate feedback prevents committing broken code +- ✅ Maintains developer flow state +- ❌ Can slow down commits if not truly fast +- ❌ May be bypassed with --no-verify if too restrictive + +### Pre-Push Testing (2-5 minutes) + +**Purpose**: Run comprehensive unit tests before code reaches CI, catching bugs locally. + +**What to Run**: +- All unit tests (should complete in 2-5 minutes) +- Security scanning of dependencies +- More thorough static analysis +- License compliance checks + +**Example Git Hook** +```bash +#!/bin/bash +# .git/hooks/pre-push + +echo "Running full unit test suite..." +pytest tests/unit/ --maxfail=5 --tb=short + +if [ $? -ne 0 ]; then + echo "Unit tests failed. Push aborted." + echo "Run 'pytest tests/unit/' to see details." + exit 1 +fi + +echo "Checking for security vulnerabilities..." +safety check + +if [ $? -ne 0 ]; then + echo "Security vulnerabilities detected. Push aborted." + exit 1 +fi + +echo "Pre-push checks passed!" +``` + +**Optimization Techniques**: +```python +# conftest.py - Optimize test setup +import pytest + +@pytest.fixture(scope="session") +def expensive_setup(): + """Run once per test session, not per test.""" + resource = create_expensive_resource() + yield resource + resource.cleanup() + +# Run tests in parallel +# pytest -n auto tests/unit/ +``` + +### Pull Request Pipeline (10-15 minutes) + +**Purpose**: Verify PR quality before merge, providing comprehensive feedback without excessive wait. + +**What to Run**: +- All unit tests (parallelized) +- Critical integration tests +- Code coverage analysis +- Static security analysis (SAST) +- Dependency vulnerability scanning +- Documentation generation +- Container image build (if applicable) + +**Example GitHub Actions Workflow** +```yaml +name: Pull Request Tests + +on: + pull_request: + branches: [main, develop] + +jobs: + unit-tests: + runs-on: ubuntu-latest + strategy: + matrix: + python-version: [3.9, 3.10, 3.11] + + steps: + - uses: actions/checkout@v3 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: ${{ matrix.python-version }} + + - name: Cache dependencies + uses: actions/cache@v3 + with: + path: ~/.cache/pip + key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }} + + - name: Install dependencies + run: | + pip install -r requirements.txt + pip install pytest pytest-cov pytest-xdist + + - name: Run unit tests + run: pytest tests/unit/ -n auto --cov=src --cov-report=xml + + - name: Upload coverage + uses: codecov/codecov-action@v3 + + integration-tests: + runs-on: ubuntu-latest + services: + postgres: + image: postgres:15 + env: + POSTGRES_PASSWORD: postgres + options: >- + --health-cmd pg_isready + --health-interval 10s + --health-timeout 5s + --health-retries 5 + + steps: + - uses: actions/checkout@v3 + + - name: Run critical integration tests + run: pytest tests/integration/ -m "critical" + + security-scan: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Run security scan + uses: snyk/actions/python@master + env: + SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }} +``` + +**Optimization Strategies**: + +**1. 
Parallel Execution** +```yaml +# Run test suites in parallel +jobs: + unit-tests: + # ... as above + + integration-tests: + # ... runs simultaneously + + lint: + # ... runs simultaneously +``` + +**2. Fail Fast** +```bash +# Stop after first few failures +pytest --maxfail=3 + +# Or stop on first failure for quick feedback +pytest -x +``` + +**3. Smart Test Selection** +```bash +# Only run tests affected by changes +pytest --testmon + +# Or use pytest-picked +pytest --picked +``` + +**4. Caching** +```yaml +- name: Cache dependencies + uses: actions/cache@v3 + with: + path: | + ~/.cache/pip + ~/.npm + node_modules + key: ${{ runner.os }}-deps-${{ hashFiles('**/requirements.txt', '**/package-lock.json') }} +``` + +### Post-Merge Pipeline (20-45 minutes) + +**Purpose**: Run comprehensive test suite on main branch, catching integration issues. + +**What to Run**: +- All unit tests +- All integration tests +- Critical E2E tests +- Performance regression tests +- Database migration tests +- Cross-browser testing (subset) +- Container security scanning + +**Example Configuration** +```yaml +name: Post-Merge Comprehensive Tests + +on: + push: + branches: [main] + +jobs: + comprehensive-tests: + runs-on: ubuntu-latest + timeout-minutes: 45 + + steps: + - uses: actions/checkout@v3 + + - name: Run all tests + run: | + pytest tests/ \ + --cov=src \ + --cov-report=html \ + --cov-report=term \ + --junitxml=test-results.xml + + - name: Run E2E tests + run: pytest tests/e2e/ -m "critical" + + - name: Performance tests + run: pytest tests/performance/ --benchmark-only + + - name: Publish test results + uses: EnricoMi/publish-unit-test-result-action@v2 + if: always() + with: + files: test-results.xml + + deploy-staging: + needs: comprehensive-tests + runs-on: ubuntu-latest + steps: + - name: Deploy to staging + run: ./scripts/deploy-staging.sh +``` + +### Nightly/Scheduled Builds (Hours) + +**Purpose**: Exhaustive testing including expensive, slow, or less critical tests. + +**What to Run**: +- Full E2E test suite +- Cross-browser testing (all browsers) +- Performance and load tests +- Security penetration testing +- Accessibility testing +- Visual regression testing +- Long-running integration tests +- Compatibility matrix testing + +**Example Schedule** +```yaml +name: Nightly Comprehensive Tests + +on: + schedule: + - cron: '0 2 * * *' # 2 AM daily + workflow_dispatch: # Manual trigger + +jobs: + full-e2e-suite: + runs-on: ubuntu-latest + timeout-minutes: 180 + + steps: + - uses: actions/checkout@v3 + + - name: Run full E2E suite + run: pytest tests/e2e/ --headed --slowmo=50 + + cross-browser-tests: + runs-on: ubuntu-latest + strategy: + matrix: + browser: [chrome, firefox, safari, edge] + + steps: + - name: Run tests on ${{ matrix.browser }} + run: pytest tests/e2e/ --browser=${{ matrix.browser }} + + performance-tests: + runs-on: ubuntu-latest + steps: + - name: Load testing + run: locust -f tests/performance/locustfile.py --headless + + security-scan: + runs-on: ubuntu-latest + steps: + - name: DAST scanning + run: zap-baseline.py -t https://staging.example.com +``` + +--- + +## Parallel Execution and Test Sharding + +### Test Parallelization Strategies + +**1. Process-Level Parallelization** +```bash +# pytest with xdist +pytest -n auto # Auto-detect CPU count +pytest -n 4 # Use 4 processes + +# With load balancing +pytest -n auto --dist loadscope +``` + +**2. 
CI Matrix Parallelization** +```yaml +jobs: + test: + runs-on: ubuntu-latest + strategy: + matrix: + shard: [1, 2, 3, 4] + + steps: + - name: Run shard ${{ matrix.shard }} + run: | + pytest tests/ \ + --splits 4 \ + --group ${{ matrix.shard }} +``` + +**3. Test File Sharding** +```python +# scripts/shard_tests.py +import sys +import os + +def shard_tests(test_files, shard_index, total_shards): + """Distribute test files across shards.""" + return [f for i, f in enumerate(test_files) + if i % total_shards == shard_index] + +if __name__ == "__main__": + shard_index = int(os.environ.get('SHARD_INDEX', 0)) + total_shards = int(os.environ.get('TOTAL_SHARDS', 1)) + + all_tests = collect_all_test_files() + shard_tests = shard_tests(all_tests, shard_index, total_shards) + + run_pytest(shard_tests) +``` + +### Ensuring Test Independence + +**Requirements for Parallel Execution**: +- Tests must not share state +- Tests must not depend on execution order +- Tests must use unique data (e.g., unique emails, IDs) +- Tests must isolate database operations + +**Database Isolation Strategies** +```python +# Strategy 1: Transaction rollback per test +@pytest.fixture +def db_session(): + connection = engine.connect() + transaction = connection.begin() + session = Session(bind=connection) + yield session + session.close() + transaction.rollback() + connection.close() + +# Strategy 2: Unique database per worker +@pytest.fixture(scope="session") +def db_url(worker_id): + if worker_id == "master": + return "postgresql://localhost/test_db" + return f"postgresql://localhost/test_db_{worker_id}" + +# Strategy 3: Unique test data per test +@pytest.fixture +def user(): + unique_email = f"test-{uuid4()}@example.com" + user = User.create(email=unique_email) + yield user + user.delete() +``` + +--- + +## Flaky Test Management + +Flaky tests—those that pass or fail non-deterministically—are worse than no tests. They train developers to ignore failures and erode trust. + +### Detecting Flaky Tests + +**1. Automatic Detection** +```bash +# Run tests multiple times to detect flakiness +pytest tests/ --count=10 --randomly-seed=12345 + +# Or use pytest-flakefinder +pytest --flake-finder --flake-runs=50 +``` + +**2. CI-Based Detection** +```yaml +jobs: + flaky-test-detection: + runs-on: ubuntu-latest + if: github.event_name == 'schedule' + + steps: + - name: Run tests 20 times + run: | + for i in {1..20}; do + pytest tests/ --junitxml=results/run-$i.xml || true + done + + - name: Analyze results + run: python scripts/detect_flaky_tests.py results/ +``` + +**3. Tracking Test History** +```python +# scripts/detect_flaky_tests.py +import xml.etree.ElementTree as ET +from collections import defaultdict + +def analyze_test_runs(result_files): + test_results = defaultdict(list) + + for file in result_files: + tree = ET.parse(file) + for testcase in tree.findall('.//testcase'): + test_name = f"{testcase.get('classname')}.{testcase.get('name')}" + passed = testcase.find('failure') is None + test_results[test_name].append(passed) + + flaky_tests = [] + for test, results in test_results.items(): + if not all(results) and any(results): + pass_rate = sum(results) / len(results) + flaky_tests.append((test, pass_rate)) + + return sorted(flaky_tests, key=lambda x: x[1]) +``` + +### Quarantine Process + +**Purpose**: Isolate flaky tests without blocking development. 
+ +```python +# Mark flaky tests +@pytest.mark.flaky +def test_sometimes_fails(): + # Test with known flakiness + pass + +# pytest configuration +# pytest.ini +[pytest] +markers = + flaky: marks tests as flaky (quarantined) + +# Skip flaky tests in PR pipeline +pytest tests/ -m "not flaky" + +# Run only flaky tests in diagnostic mode +pytest tests/ -m "flaky" --reruns 5 +``` + +**Quarantine Workflow**: +1. **Detect**: Identify flaky test through automated detection or manual observation +2. **Mark**: Add `@pytest.mark.flaky` marker +3. **Track**: Create ticket to fix or remove flaky test +4. **Fix**: Investigate root cause and stabilize +5. **Verify**: Run fixed test 50+ times to confirm stability +6. **Release**: Remove flaky marker and return to normal suite + +### Common Flakiness Root Causes and Fixes + +**1. Timing Issues** +```python +# ❌ BAD: Fixed sleep +def test_async_operation(): + start_operation() + time.sleep(2) # Hope 2 seconds is enough + assert operation_complete() + +# ✅ GOOD: Conditional wait +def test_async_operation(): + start_operation() + wait_for(lambda: operation_complete(), timeout=5) + assert operation_complete() +``` + +**2. Shared State** +```python +# ❌ BAD: Global state +cache = {} + +def test_cache_operation(): + cache['key'] = 'value' + assert get_from_cache('key') == 'value' + +# ✅ GOOD: Isolated state +@pytest.fixture +def cache(): + cache = {} + yield cache + cache.clear() + +def test_cache_operation(cache): + cache['key'] = 'value' + assert cache['key'] == 'value' +``` + +**3. External Service Dependencies** +```python +# ❌ BAD: Calling real API +def test_fetch_data(): + data = api_client.get('https://external-api.com/data') + assert data is not None + +# ✅ GOOD: Mock external service +@responses.activate +def test_fetch_data(): + responses.add( + responses.GET, + 'https://external-api.com/data', + json={'result': 'data'}, + status=200 + ) + data = api_client.get('https://external-api.com/data') + assert data['result'] == 'data' +``` + +**4. Test Order Dependencies** +```python +# ❌ BAD: Order-dependent tests +def test_create_user(): + global user + user = create_user() + +def test_update_user(): + update_user(user) # Depends on test_create_user + +# ✅ GOOD: Independent tests +@pytest.fixture +def user(): + user = create_user() + yield user + delete_user(user) + +def test_create_user(user): + assert user.id is not None + +def test_update_user(user): + update_user(user) + assert user.updated_at is not None +``` + +--- + +## Test Health Metrics + +Track these metrics to maintain test suite health and identify areas for improvement. + +### Key Metrics + +**1. Defect Detection Efficiency (DDE)** +``` +DDE = Bugs Found by Tests / (Bugs Found by Tests + Bugs Found in Production) + +Target: > 90% +``` + +Track monthly to identify test coverage gaps. + +**2. Test Execution Time** +``` +- Unit tests: < 5 minutes +- Integration tests: < 15 minutes +- Full suite: < 45 minutes +- Nightly comprehensive: < 3 hours +``` + +**3. Flakiness Rate** +``` +Flakiness Rate = Flaky Test Runs / Total Test Runs + +Target: < 5% +``` + +**4. Test Maintenance Burden** +``` +Hours spent fixing broken tests / Total development hours + +Target: < 10% +``` + +**5. 
Code Coverage** +``` +- Unit test coverage: 70-90% (focusing on business logic) +- Integration coverage: Component boundaries covered +- E2E coverage: Critical paths covered +``` + +### Dashboarding and Tracking + +**Example Metrics Dashboard** +```python +# scripts/test_metrics.py +import json +from datetime import datetime, timedelta + +def calculate_test_metrics(): + metrics = { + 'timestamp': datetime.now().isoformat(), + 'execution_time': { + 'unit_tests': get_unit_test_time(), + 'integration_tests': get_integration_test_time(), + 'e2e_tests': get_e2e_test_time() + }, + 'flakiness': { + 'flaky_tests': count_flaky_tests(), + 'total_tests': count_total_tests(), + 'flakiness_rate': calculate_flakiness_rate() + }, + 'coverage': { + 'line_coverage': get_line_coverage(), + 'branch_coverage': get_branch_coverage() + }, + 'failures': { + 'failed_builds_last_week': count_failed_builds(days=7), + 'flaky_failures_last_week': count_flaky_failures(days=7) + } + } + + with open('test_metrics.json', 'w') as f: + json.dump(metrics, f, indent=2) + + return metrics +``` + +--- + +## Summary: CI/CD Test Optimization Principles + +1. **Layer feedback**: Fast local tests, comprehensive CI tests, exhaustive nightly tests +2. **Parallelize aggressively**: Use all available CPU and runner capacity +3. **Fail fast**: Stop on first failures for rapid feedback +4. **Cache dependencies**: Don't rebuild the world every time +5. **Quarantine flaky tests**: Don't let them poison the suite +6. **Monitor metrics**: Track execution time, flakiness, and defect detection +7. **Optimize continuously**: Test suites slow down over time—invest in speed + +**Goal**: Developers get feedback in seconds locally, minutes in PR, comprehensive validation post-merge, and exhaustive testing nightly—all without flaky failures eroding trust. diff --git a/references/coverage-strategy.md b/references/coverage-strategy.md new file mode 100644 index 0000000..e1f12bc --- /dev/null +++ b/references/coverage-strategy.md @@ -0,0 +1,622 @@ +# Test Coverage Strategy: Beyond the Numbers + +Code coverage metrics have been both praised and criticized. The truth lies in understanding what coverage means and doesn't mean, and using it appropriately to guide testing strategy. + +## Code Coverage vs. Behavior Coverage + +### Code Coverage Metrics + +**Line Coverage**: Percentage of code lines executed by tests +```python +def calculate_discount(customer_type, order_total): + if customer_type == "VIP": # Line 1 + return order_total * 0.10 # Line 2 + return 0 # Line 3 + +# Test that achieves 66% line coverage +def test_vip_discount(): + result = calculate_discount("VIP", 100) + assert result == 10 +# Lines executed: 1, 2 (66%) +# Line not executed: 3 (33%) +``` + +**Branch Coverage**: Percentage of decision branches executed +```python +def calculate_discount(customer_type, order_total): + if customer_type == "VIP": + return order_total * 0.10 + return 0 + +# 50% branch coverage (only true branch tested) +# Need test for false branch to reach 100% +``` + +**Path Coverage**: Percentage of execution paths through code (most comprehensive) +```python +def calculate_price(quantity, is_vip, has_coupon): + price = quantity * 10 + + if is_vip: # 2 branches + price *= 0.9 + + if has_coupon: # 2 branches + price *= 0.95 + + return price + +# 2 × 2 = 4 possible paths: +# 1. Not VIP, no coupon +# 2. Not VIP, has coupon +# 3. VIP, no coupon +# 4. 
VIP, has coupon
+```
+
+### Behavior Coverage
+
+**The Critical Distinction**: Code coverage measures execution; behavior coverage measures validation of scenarios.
+
+```python
+# ❌ HIGH code coverage, LOW behavior coverage
+def test_calculate_discount():
+    # Executes all lines
+    result1 = calculate_discount("VIP", 100)
+    result2 = calculate_discount("Regular", 100)
+    # But makes NO assertions - just executes code
+    # 100% line coverage, 0% behavior validation
+
+# ✅ GOOD behavior coverage
+def test_vip_customer_receives_discount():
+    result = calculate_discount("VIP", 100)
+    assert result == 10, "VIP customers should get 10% discount"
+
+def test_regular_customer_receives_no_discount():
+    result = calculate_discount("Regular", 100)
+    assert result == 0, "Regular customers should get no discount"
+
+def test_discount_scales_with_order_total():
+    assert calculate_discount("VIP", 100) == 10
+    assert calculate_discount("VIP", 200) == 20
+
+def test_empty_customer_type_returns_zero():
+    result = calculate_discount("", 100)
+    assert result == 0
+```
+
+### What Coverage Metrics Tell You
+
+**Coverage CAN tell you**:
+- Which code has been executed by tests
+- Which branches/paths have not been tested
+- Where obvious gaps in testing exist
+- Whether new code has any tests
+
+**Coverage CANNOT tell you**:
+- Whether tests make meaningful assertions
+- Whether edge cases are properly validated
+- Whether business logic is correct
+- Whether tests would catch real bugs
+- Test quality or effectiveness
+
+---
+
+## Recommended Coverage Targets by Component Type
+
+Coverage targets should vary based on code type, recognizing that different components have different testing ROI.
+
+### Business Logic and Domain Code: 90-100%
+
+**What**: Core business rules, calculations, domain models, workflows
+
+**Why High Coverage**:
+- High complexity requires thorough testing
+- Bugs here directly impact business outcomes
+- Logic changes frequently
+- Unit tests are fast and provide clear value
+
+**Example**:
+```python
+# Core business logic deserves comprehensive coverage
+class OrderPricing:
+    def calculate_total(self, items, customer):
+        """Calculate order total with discounts."""
+        subtotal = sum(item.price * item.quantity for item in items)
+
+        if customer.is_vip:
+            subtotal *= 0.90
+
+        if subtotal > 1000:
+            subtotal *= 0.95  # Bulk discount
+
+        return subtotal + self.calculate_tax(subtotal)
+
+# Comprehensive test coverage
+def test_regular_customer_basic_order():
+    assert pricing.calculate_total(items_100, regular_customer) == ...
+
+def test_vip_customer_receives_discount():
+    assert pricing.calculate_total(items_100, vip_customer) == ...
+
+def test_bulk_discount_applied_over_threshold():
+    assert pricing.calculate_total(items_1500, regular_customer) == ...
+
+def test_vip_and_bulk_discounts_stack():
+    assert pricing.calculate_total(items_1500, vip_customer) == ...
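+
+# Boundary case at the bulk-discount threshold (illustrative sketch; "items_1000" is a
+# hypothetical fixture for an order whose subtotal is exactly 1000). As calculate_total
+# is written above with a strict "> 1000" check, this order gets no bulk discount.
+def test_no_bulk_discount_at_exact_threshold():
+    assert pricing.calculate_total(items_1000, regular_customer) == ...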
+ +def test_tax_calculation_included(): + assert pricing.calculate_total(items_100, regular_customer) > 100 +``` + +### Integration Points: 70-90% + +**What**: Repositories, API clients, message queue handlers, external service integrations + +**Why Moderate Coverage**: +- Integration behavior is critical but some scenarios are impractical to test +- Integration tests are slower than unit tests +- Some paths only make sense in production environment + +**Example**: +```python +# Repository - 80% coverage is reasonable +class UserRepository: + def save(self, user): + # Test this + pass + + def find_by_email(self, email): + # Test this + pass + + def find_all_active(self): + # Test this + pass + + def cleanup_old_sessions(self): + # Maintenance query - maybe skip or light testing + pass + +# Test critical paths thoroughly +def test_save_user_persists_to_database(): + # Essential to test + pass + +def test_find_by_email_returns_correct_user(): + # Essential to test + pass + +# Less critical edge cases can be sampled +def test_find_by_email_handles_sql_injection_attempt(): + # Nice to have but framework may handle this + pass +``` + +### Controllers and API Endpoints: 70-90% + +**What**: HTTP controllers, API route handlers, request/response handling + +**Why Moderate Coverage**: +- Framework handles much of the complexity +- Some error scenarios difficult to trigger +- Integration tests cover many paths automatically + +**Example**: +```python +# API endpoint - focus on business logic integration +@app.route('/api/orders', methods=['POST']) +def create_order(): + data = request.json + + try: + order = order_service.create(data) + return jsonify(order.to_dict()), 201 + except ValidationError as e: + return jsonify({'error': str(e)}), 400 + except Exception as e: + logger.error(f"Order creation failed: {e}") + return jsonify({'error': 'Internal error'}), 500 + +# Test main scenarios +def test_create_order_success(): + response = client.post('/api/orders', json=valid_order_data) + assert response.status_code == 201 + +def test_create_order_validation_error(): + response = client.post('/api/orders', json=invalid_order_data) + assert response.status_code == 400 + +# Generic 500 error handling may not need explicit test +# if framework testing covers it +``` + +### Configuration and Glue Code: 30-50% + +**What**: Configuration loading, simple adapters, basic initialization code + +**Why Low Coverage**: +- Little logic to test +- Mostly framework or library interaction +- Better covered by integration tests +- Testing provides minimal value + +**Example**: +```python +# Configuration loader - low unit test value +class Config: + @staticmethod + def load(): + return { + 'database_url': os.getenv('DATABASE_URL'), + 'api_key': os.getenv('API_KEY'), + 'debug': os.getenv('DEBUG', 'false').lower() == 'true' + } + +# Minimal unit testing needed +# Better to test that app starts with config in integration test +``` + +### Utilities and Helpers: 80-90% + +**What**: Shared utility functions, formatters, parsers, helpers + +**Why High Coverage**: +- Used in multiple places - bugs have wide impact +- Usually pure functions - easy to test thoroughly +- Fast unit tests with high value + +**Example**: +```python +# Utility function - deserves thorough testing +def format_currency(amount, currency='USD'): + """Format amount as currency string.""" + symbols = {'USD': '$', 'EUR': '€', 'GBP': '£'} + symbol = symbols.get(currency, currency) + + if amount < 0: + return f"-{symbol}{abs(amount):.2f}" + return 
f"{symbol}{amount:.2f}" + +# Comprehensive tests for shared utility +def test_format_usd(): + assert format_currency(100) == "$100.00" + +def test_format_eur(): + assert format_currency(100, 'EUR') == "€100.00" + +def test_format_negative(): + assert format_currency(-50) == "-$50.00" + +def test_format_unknown_currency(): + assert format_currency(100, 'XYZ') == "XYZ100.00" +``` + +--- + +## When Coverage Becomes Harmful + +Chasing coverage percentages as a goal leads to counterproductive practices. + +### Anti-Pattern: Testing for Coverage, Not Value + +**The Problem**: Writing tests that execute code but don't verify behavior, just to increase coverage numbers. + +```python +# ❌ BAD: Test that only boosts coverage +def test_user_creation(): + user = User(email="test@example.com") + # No assertions - just creates object for coverage + # Provides zero value + +# ✅ GOOD: Test that validates behavior +def test_user_creation_sets_default_values(): + user = User(email="test@example.com") + + assert user.email == "test@example.com" + assert user.is_active == True + assert user.created_at is not None +``` + +### Anti-Pattern: Testing Implementation Details to Hit Coverage + +**The Problem**: Creating brittle tests of private methods just to reach coverage targets. + +```python +class OrderService: + def create_order(self, items): + self._validate(items) + return self._build_order(items) + + def _validate(self, items): + # Private method + if not items: + raise ValueError("No items") + + def _build_order(self, items): + # Private method + return Order(items) + +# ❌ BAD: Testing private methods for coverage +def test_validate_private_method(): + service = OrderService() + with pytest.raises(ValueError): + service._validate([]) # Testing implementation detail + +# ✅ GOOD: Testing through public interface +def test_create_order_with_empty_items_raises_error(): + service = OrderService() + with pytest.raises(ValueError): + service.create_order([]) # Testing behavior +``` + +### Anti-Pattern: 100% Coverage Without Edge Case Testing + +**The Problem**: Achieving high line coverage while missing important edge cases. + +```python +def divide(a, b): + return a / b + +# Achieves 100% line coverage +def test_divide(): + result = divide(10, 2) + assert result == 5 + +# But misses critical edge cases: +# - Division by zero +# - Negative numbers +# - Very large numbers +# - Floating point precision issues + +# ✅ GOOD: Comprehensive edge case testing +def test_divide_normal_case(): + assert divide(10, 2) == 5 + +def test_divide_by_zero_raises_error(): + with pytest.raises(ZeroDivisionError): + divide(10, 0) + +def test_divide_negative_numbers(): + assert divide(-10, 2) == -5 + assert divide(10, -2) == -5 + +def test_divide_floating_point(): + assert divide(1, 3) == pytest.approx(0.333, rel=0.01) +``` + +--- + +## Practical Coverage Guidance + +### Setting Organizational Coverage Targets + +**Recommended Approach**: Set minimum thresholds that prevent coverage from decreasing, not aspirational 100% targets. 
+ +```yaml +# .coveragerc +[report] +fail_under = 80 # Fail if overall coverage drops below 80% + +[coverage:run] +branch = true +source = src/ + +[coverage:report] +exclude_lines = + # Don't require coverage for: + pragma: no cover + def __repr__ + raise AssertionError + raise NotImplementedError + if __name__ == .__main__.: + if TYPE_CHECKING: + @abstract +``` + +**Progressive Enforcement**: +```bash +# Start where you are +Current coverage: 65% +Set target: 65% (prevent regression) + +# Gradually improve +After 3 months: Raise to 70% +After 6 months: Raise to 75% +After 12 months: Raise to 80% +``` + +### Coverage in Code Review + +**Use Coverage Reports to Guide Review**: +```bash +# Generate coverage diff for PR +pytest --cov=src --cov-report=html +diff-cover coverage.xml --compare-branch=main + +# Shows: +# - New code added without tests +# - Changed code with decreased coverage +# - Critical paths missing coverage +``` + +**Review Checklist**: +- [ ] New business logic has >90% coverage +- [ ] Critical paths are covered +- [ ] Edge cases have tests +- [ ] Tests make meaningful assertions +- [ ] No coverage-chasing empty tests + +### Coverage for Different Test Levels + +**Unit Test Coverage**: Track separately from integration/E2E +```bash +# Unit tests only +pytest tests/unit/ --cov=src --cov-report=term + +# Unit test coverage should be higher (70-90%) +``` + +**Integration Test Coverage**: Track what integration tests add +```bash +# Run all tests, see what integration adds +pytest tests/ --cov=src --cov-report=term + +# Many integration tests don't add coverage +# but add confidence in integration behavior +``` + +**Don't Count E2E Against Coverage**: E2E tests validate workflows, not code coverage +```bash +# Exclude E2E from coverage metrics +pytest tests/unit/ tests/integration/ --cov=src +# Don't include tests/e2e/ in coverage calculation +``` + +--- + +## Risk-Based Coverage Strategy + +Focus coverage efforts where bugs have highest impact. 
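+
+One way to make this concrete is to turn bug history and change frequency into a rough per-module risk score and derive a coverage target from it. The sketch below only illustrates that idea; the weights, thresholds, and example module names are assumptions, not data or tooling from this repository.
+
+```python
+# Illustrative only: rank modules by a naive risk score to decide coverage targets.
+# All weights, thresholds, and example modules are assumptions.
+from dataclasses import dataclass
+
+@dataclass
+class ModuleStats:
+    name: str
+    bugs_last_year: int       # e.g. pulled from the bug tracker
+    commits_last_year: int    # e.g. counted from git history
+    business_critical: bool   # team judgment call
+
+def risk_score(m: ModuleStats) -> float:
+    """Weight bug count and churn; criticality doubles the score."""
+    score = m.bugs_last_year * 2 + m.commits_last_year * 0.1
+    return score * 2 if m.business_critical else score
+
+def coverage_target(m: ModuleStats) -> int:
+    """Map risk onto the lower bound of the coverage bands used in this reference."""
+    if m.business_critical or risk_score(m) >= 50:
+        return 90
+    if risk_score(m) >= 20:
+        return 70
+    return 30
+
+modules = [
+    ModuleStats("payments", bugs_last_year=30, commits_last_year=120, business_critical=True),
+    ModuleStats("reporting", bugs_last_year=8, commits_last_year=40, business_critical=False),
+    ModuleStats("logging_utils", bugs_last_year=1, commits_last_year=5, business_critical=False),
+]
+
+for m in sorted(modules, key=risk_score, reverse=True):
+    print(f"{m.name:15s} risk={risk_score(m):6.1f} target>={coverage_target(m)}%")
+```
+
+The qualitative matrix below expresses the same prioritization without any scripting.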
+ +### Risk Assessment Matrix + +``` +High Risk = High Business Impact + High Complexity + High Change Frequency +Medium Risk = Mixed factors +Low Risk = Low on most factors + +High Risk (Target: 90-100% coverage) +├─ Payment processing +├─ Authentication and authorization +├─ Order calculation and fulfillment +├─ Data migrations +└─ Security-critical operations + +Medium Risk (Target: 70-90% coverage) +├─ User management +├─ Reporting and analytics +├─ Email notifications +├─ Search functionality +└─ Content management + +Low Risk (Target: 30-70% coverage) +├─ Configuration loading +├─ Static content rendering +├─ Logging utilities +├─ Admin tools (low usage) +└─ Deprecated features +``` + +### Historical Bug Analysis + +**Use Past Data to Guide Coverage**: +```sql +-- Find modules with highest bug density +SELECT + module, + COUNT(*) as bug_count, + COUNT(*) / lines_of_code as bug_density +FROM bugs +JOIN code_metrics ON bugs.module = code_metrics.module +WHERE created_at > DATE_SUB(NOW(), INTERVAL 12 MONTH) +GROUP BY module +ORDER BY bug_density DESC; + +-- Prioritize coverage for high bug density modules +``` + +### Coverage Heat Maps + +**Visualize Coverage by Risk**: +``` +Component | Coverage | Risk | Status +-----------------------|----------|-------|-------- +Payment Processing | 95% | High | ✓ Good +Authentication | 88% | High | ⚠ Needs improvement +Order Calculation | 92% | High | ✓ Good +User Management | 75% | Med | ✓ Acceptable +Email Service | 65% | Med | ⚠ Consider adding tests +Logging Utils | 45% | Low | ✓ Acceptable +``` + +--- + +## Coverage Tools and Techniques + +### Python Coverage Tools + +```bash +# pytest-cov (most common) +pytest --cov=src --cov-report=html --cov-report=term + +# View HTML report +open htmlcov/index.html + +# Show missing lines +pytest --cov=src --cov-report=term-missing + +# Branch coverage +pytest --cov=src --cov-branch + +# Fail if coverage below threshold +pytest --cov=src --cov-fail-under=80 +``` + +### Coverage Badges for Documentation + +```markdown +# README.md +![Coverage](https://img.shields.io/badge/coverage-85%25-green) + +# Updates automatically in CI +# codecov.io, coveralls.io provide this +``` + +### Continuous Coverage Tracking + +```yaml +# .github/workflows/coverage.yml +name: Coverage + +on: [push, pull_request] + +jobs: + coverage: + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v3 + + - name: Run tests with coverage + run: pytest --cov=src --cov-report=xml + + - name: Upload to Codecov + uses: codecov/codecov-action@v3 + with: + file: ./coverage.xml + fail_ci_if_error: true + + - name: Comment coverage on PR + uses: py-cov-action/python-coverage-comment-action@v3 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} +``` + +--- + +## Summary: Coverage Strategy Principles + +1. **Focus on behavior coverage, not just code coverage**: Tests should validate scenarios, not just execute lines + +2. **Set differentiated targets**: Business logic deserves higher coverage than configuration code + +3. **Use coverage as a guide, not a goal**: Low coverage indicates gaps; high coverage doesn't guarantee quality + +4. **Prioritize by risk**: Critical systems deserve comprehensive testing, peripheral features need less + +5. **Track trends, not just numbers**: Coverage should generally increase over time, especially for new code + +6. **Don't game the metrics**: Tests without assertions or testing trivial code provides false confidence + +7. 
**Combine with other metrics**: Coverage + defect detection rate + test execution time = complete picture + +**The Ultimate Goal**: High-quality tests that catch bugs, enable refactoring, and provide confidence—not just high coverage numbers. Coverage is a useful tool, but test effectiveness is what truly matters. diff --git a/references/legacy-code.md b/references/legacy-code.md new file mode 100644 index 0000000..deaacb7 --- /dev/null +++ b/references/legacy-code.md @@ -0,0 +1,703 @@ +# Testing Legacy Code: Strategies for the Untestable + +Michael Feathers defines legacy code as "code without tests." Working with such code requires different strategies than greenfield development because you cannot safely change untested code, yet you cannot write tests without changing the code—a classic catch-22. + +## The Legacy Code Dilemma + +**The Problem**: +1. Want to refactor code to improve it +2. Can't refactor safely without tests +3. Can't write tests because code isn't testable +4. Need to change code to make it testable (back to step 1) + +**The Solution**: Break the cycle using characterization tests, seam identification, and careful dependency breaking. + +--- + +## Characterization Testing + +Characterization tests document what code actually does right now, not what it should do. They serve as a safety net during refactoring. + +### The Characterization Testing Process + +**Step 1: Identify the Code to Characterize** +Start with the area you need to change, not the entire system. + +```python +# Legacy code you need to modify +def process_order(order_data, customer_id): + # 200 lines of complex, untested logic + # involving database calls, external APIs, + # and business rules you don't understand + pass +``` + +**Step 2: Write a Test with Unknown Expected Output** +```python +def test_process_order_behavior(): + order_data = { + 'items': [{'id': 1, 'quantity': 2}], + 'total': 100 + } + customer_id = 123 + + # Don't know what this returns yet + result = process_order(order_data, customer_id) + + # Write assertion with placeholder + assert result == ??? +``` + +**Step 3: Run Test and Observe Actual Output** +``` +AssertionError: assert {'status': 'processed', 'order_id': 456, 'timestamp': '2024-01-15'} == ??? 
+``` + +**Step 4: Update Test to Expect Actual Output** +```python +def test_process_order_behavior(): + order_data = { + 'items': [{'id': 1, 'quantity': 2}], + 'total': 100 + } + customer_id = 123 + + result = process_order(order_data, customer_id) + + # Now documents actual behavior + assert result['status'] == 'processed' + assert result['order_id'] == 456 + assert 'timestamp' in result +``` + +**Step 5: Add More Test Cases** +Characterize different input scenarios: + +```python +def test_process_order_with_discount(): + order_data = {'items': [{'id': 1, 'quantity': 2}], 'total': 100, 'discount': 10} + result = process_order(order_data, 123) + # Document this path + +def test_process_order_with_invalid_customer(): + order_data = {'items': [{'id': 1, 'quantity': 2}], 'total': 100} + result = process_order(order_data, -1) + # Document error behavior + +def test_process_order_with_empty_items(): + order_data = {'items': [], 'total': 0} + result = process_order(order_data, 123) + # Document edge case +``` + +### Value of Characterization Tests + +**What They Do**: +- Document current behavior as executable specification +- Catch regressions during refactoring +- Build confidence before making changes +- Reveal unexpected behaviors and edge cases + +**What They Don't Do**: +- Find existing bugs (they codify them) +- Verify correctness (only that behavior hasn't changed) +- Replace proper behavioral tests (those come later) + +### Golden Master Testing + +For complex outputs (HTML, reports, data transformations), capture entire output as baseline. + +```python +def test_generate_invoice_report(golden_master): + """Characterize complex report generation.""" + order = create_sample_order() + + html_report = generate_invoice_report(order) + + # First run: Captures HTML as baseline + # Subsequent runs: Compares against baseline + golden_master.assert_match(html_report, 'invoice_report.html') +``` + +**When to Use Golden Masters**: +- Complex text/HTML output +- Generated reports or documents +- Data transformations with many fields +- Visual output (images, PDFs) + +**Trade-offs**: +- Fast to create comprehensive characterization +- Catches any change to output +- But: Changes to format require updating all golden files +- May hide intentional improvements in approval + +--- + +## Identifying and Exploiting Seams + +A **seam** is a place where you can alter behavior without editing code in that place. Seams enable testing by providing points to inject test doubles. + +### Types of Seams + +**1. Object Seams (Dependency Injection)** +```python +# ❌ BAD: Hard-coded dependency +class OrderService: + def __init__(self): + self.db = PostgresDatabase() # Can't replace for testing + + def create_order(self, data): + return self.db.save(data) + +# ✅ GOOD: Injected dependency (seam) +class OrderService: + def __init__(self, database): + self.db = database # Can inject test double + + def create_order(self, data): + return self.db.save(data) + +# Now testable +def test_order_service(): + fake_db = FakeDatabase() + service = OrderService(fake_db) + + order = service.create_order(data) + + assert fake_db.saved_items[0] == data +``` + +**2. 
Method Seams (Inheritance/Overriding)** +```python +# Legacy code with untestable external call +class PaymentProcessor: + def process_payment(self, amount): + result = self.call_payment_gateway(amount) # External API + self.send_receipt(result) + return result + + def call_payment_gateway(self, amount): + # Actual API call + return external_api.charge(amount) + +# Create testable subclass +class TestablePaymentProcessor(PaymentProcessor): + def call_payment_gateway(self, amount): + # Override with fake implementation + return {'status': 'success', 'transaction_id': '12345'} + +def test_payment_processing(): + processor = TestablePaymentProcessor() + + result = processor.process_payment(100) + + assert result['status'] == 'success' +``` + +**3. Interface Seams (Extract Interface)** +```python +# Legacy code tightly coupled to implementation +class UserService: + def __init__(self): + self.repo = UserRepositoryPostgres() # Concrete class + +# Extract interface +class UserRepository(ABC): + @abstractmethod + def save(self, user): + pass + + @abstractmethod + def find(self, user_id): + pass + +class UserRepositoryPostgres(UserRepository): + def save(self, user): + # PostgreSQL implementation + pass + + def find(self, user_id): + # PostgreSQL implementation + pass + +class UserRepositoryInMemory(UserRepository): + def __init__(self): + self.users = {} + + def save(self, user): + self.users[user.id] = user + + def find(self, user_id): + return self.users.get(user_id) + +# Now testable +def test_user_service(): + repo = UserRepositoryInMemory() + service = UserService(repo) + + service.register(user) + + assert repo.find(user.id) == user +``` + +**4. Parameter Seams (Introduce Parameter)** +```python +# ❌ BAD: Creates dependency internally +def send_notification(user, message): + email_service = EmailService() # Hard-coded + email_service.send(user.email, message) + +# ✅ GOOD: Dependency as parameter +def send_notification(user, message, email_service=None): + if email_service is None: + email_service = EmailService() # Default for production + + email_service.send(user.email, message) + +# Now testable +def test_send_notification(): + mock_email = Mock() + + send_notification(user, "Test message", mock_email) + + mock_email.send.assert_called_with(user.email, "Test message") +``` + +--- + +## Breaking Dependencies Safely + +### The Sprout Technique + +When adding new functionality to untested code, write new testable code separately, then call it from legacy code. + +```python +# Legacy untested code +def process_invoice(invoice): + # 300 lines of complex logic + # Need to add discount calculation + pass + +# Sprout: Write new code with tests +def calculate_discount(invoice): + """New, tested discount logic.""" + if invoice.customer.is_vip: + return invoice.total * 0.10 + return 0 + +def test_calculate_discount_for_vip(): + customer = Customer(is_vip=True) + invoice = Invoice(total=100, customer=customer) + + discount = calculate_discount(invoice) + + assert discount == 10 + +# Now call from legacy code +def process_invoice(invoice): + # ... existing logic ... + discount = calculate_discount(invoice) # Tested code + # ... rest of logic ... +``` + +### The Wrap Technique + +When changing behavior, wrap old code in new testable interface. 
+ +```python +# Legacy untested code +def legacy_report_generator(data): + # 500 lines of report generation + # Need to add caching + pass + +# Wrap with tested code +class CachedReportGenerator: + def __init__(self, cache): + self.cache = cache + + def generate(self, data): + cache_key = self._compute_cache_key(data) + + if cache_key in self.cache: + return self.cache[cache_key] + + report = legacy_report_generator(data) # Use legacy code + self.cache[cache_key] = report + return report + + def _compute_cache_key(self, data): + return f"report_{data['id']}" + +def test_cached_report_generator(): + cache = {} + generator = CachedReportGenerator(cache) + + # First call generates report + report1 = generator.generate({'id': 1, 'data': 'test'}) + + # Second call uses cache + report2 = generator.generate({'id': 1, 'data': 'test'}) + + assert report1 == report2 + assert len(cache) == 1 +``` + +### Extract Method Refactoring + +Break large untested methods into smaller, testable pieces. + +```python +# ❌ BAD: Monolithic untested method +def process_order(order_data): + # Validate + if not order_data.get('customer_id'): + raise ValueError("Missing customer") + if not order_data.get('items'): + raise ValueError("No items") + + # Calculate total + total = 0 + for item in order_data['items']: + total += item['price'] * item['quantity'] + + # Apply discount + if order_data.get('promo_code') == 'VIP': + total *= 0.9 + + # Save to database + db.execute("INSERT INTO orders ...", total, ...) + + # Send email + email.send(order_data['customer_email'], "Order confirmed") + + return {'total': total} + +# ✅ GOOD: Extracted testable methods +def validate_order(order_data): + """Extracted, testable validation.""" + if not order_data.get('customer_id'): + raise ValueError("Missing customer") + if not order_data.get('items'): + raise ValueError("No items") + +def calculate_total(items): + """Extracted, testable calculation.""" + return sum(item['price'] * item['quantity'] for item in items) + +def apply_discount(total, promo_code): + """Extracted, testable discount logic.""" + if promo_code == 'VIP': + return total * 0.9 + return total + +def process_order(order_data): + validate_order(order_data) + total = calculate_total(order_data['items']) + total = apply_discount(total, order_data.get('promo_code')) + + db.execute("INSERT INTO orders ...", total, ...) + email.send(order_data['customer_email'], "Order confirmed") + + return {'total': total} + +# Now individual pieces can be tested +def test_calculate_total(): + items = [{'price': 10, 'quantity': 2}, {'price': 5, 'quantity': 3}] + assert calculate_total(items) == 35 + +def test_apply_vip_discount(): + assert apply_discount(100, 'VIP') == 90 + +def test_apply_no_discount(): + assert apply_discount(100, None) == 100 +``` + +--- + +## Prioritization Strategies + +You cannot test everything in legacy code. Prioritize strategically. + +### Test What You Touch + +**Principle**: Add tests only for code you're actively changing. + +```python +# Working in this area? Add tests +def update_inventory(product_id, quantity): + # Adding tests here because you're modifying this + +# Not working here? Don't add tests yet +def generate_monthly_report(): + # Leave alone for now +``` + +**Workflow**: +1. Need to modify function X +2. Write characterization tests for X +3. Make changes with test safety net +4. Refactor and add proper behavioral tests +5. 
Move on (don't test unrelated code) + +### Risk-Based Prioritization + +**High Priority** (test first): +- Business-critical code (payment, auth, core features) +- Frequently changing code +- Historically bug-prone areas +- Security-sensitive operations + +**Medium Priority** (test when touched): +- Important features used regularly +- Moderately complex business logic +- Customer-facing functionality + +**Low Priority** (may never need tests): +- Stable code that rarely changes +- Simple pass-through or configuration code +- Deprecated features being phased out +- One-time scripts or utilities + +### Change Frequency Analysis + +```python +# Analyze git history to find frequently changed files +git log --format=format: --name-only | \ + grep '\.py$' | \ + sort | \ + uniq -c | \ + sort -rn | \ + head -20 + +# Output: +# 47 src/payment/processor.py ← High priority: changes often +# 42 src/user/service.py +# 38 src/order/calculator.py +# 3 src/legacy/old_report.py ← Low priority: stable +# 1 src/config/settings.py +``` + +### Bug Density Analysis + +Track where bugs occur to prioritize testing effort: + +```python +# Analyze bug tracker data +SELECT file_path, COUNT(*) as bug_count +FROM bugs +WHERE created_at > DATE_SUB(NOW(), INTERVAL 6 MONTH) +GROUP BY file_path +ORDER BY bug_count DESC +LIMIT 20; + +# Focus testing on files with high bug counts +``` + +--- + +## Gradual Architectural Improvement + +Don't attempt big-bang rewrites. Incrementally improve architecture while maintaining functionality. + +### The Strangler Fig Pattern + +Gradually replace legacy code by routing new functionality to new implementation while old code continues working. + +```python +# Phase 1: Legacy code handles everything +def process_order(order): + return legacy_order_processor(order) + +# Phase 2: Router directs to new or old code +def process_order(order): + if order.use_new_system: + return new_order_processor(order) # New, tested code + return legacy_order_processor(order) # Legacy code + +# Phase 3: Gradually expand new system coverage +def process_order(order): + if order.use_new_system or order.created_after('2024-01-01'): + return new_order_processor(order) + return legacy_order_processor(order) + +# Phase 4: Eventually, all orders use new system +def process_order(order): + return new_order_processor(order) + # Legacy code can be removed +``` + +### Extract Core Domain + +Identify and extract business logic from infrastructure concerns. 
+ +```python +# ❌ BAD: Business logic mixed with infrastructure +def process_payment(amount, card_number): + # Business logic + if amount < 0: + raise ValueError("Amount must be positive") + + # Infrastructure + db_connection = psycopg2.connect("postgresql://...") + cursor = db_connection.cursor() + + # More business logic + fee = amount * 0.03 + + # More infrastructure + cursor.execute("INSERT INTO payments ...", amount, fee) + + # Business logic + final_amount = amount + fee + + return final_amount + +# ✅ GOOD: Extracted business logic +class PaymentCalculator: + """Pure business logic - easily testable.""" + + FEE_RATE = 0.03 + + def validate_amount(self, amount): + if amount < 0: + raise ValueError("Amount must be positive") + + def calculate_fee(self, amount): + return amount * self.FEE_RATE + + def calculate_total(self, amount): + self.validate_amount(amount) + fee = self.calculate_fee(amount) + return amount + fee + +# Infrastructure layer +class PaymentService: + def __init__(self, calculator, repository): + self.calculator = calculator + self.repository = repository + + def process_payment(self, amount, card_number): + total = self.calculator.calculate_total(amount) + self.repository.save_payment(total, card_number) + return total + +# Now business logic is testable without database +def test_payment_calculation(): + calculator = PaymentCalculator() + + total = calculator.calculate_total(100) + + assert total == 103 # 100 + 3% fee +``` + +### Dependency Injection at System Boundaries + +Introduce seams at architectural boundaries first, then work inward. + +``` +Old Architecture: +├─ Controllers (hard-coded services) +├─ Services (hard-coded repositories) +└─ Repositories (hard-coded database) + +Step 1: Inject at Repository level +├─ Controllers (hard-coded services) +├─ Services (hard-coded repositories) +└─ Repositories (injected database) ← Start here + +Step 2: Inject at Service level +├─ Controllers (hard-coded services) +├─ Services (injected repositories) +└─ Repositories (injected database) + +Step 3: Inject at Controller level +├─ Controllers (injected services) +├─ Services (injected repositories) +└─ Repositories (injected database) ← Fully testable +``` + +--- + +## Documentation and Knowledge Capture + +Legacy code often lacks documentation. As you learn, document for the team. + +### Architecture Decision Records (ADRs) + +Document why the system is structured as it is: + +```markdown +# ADR 001: Legacy Order Processing System + +## Status +Accepted + +## Context +The legacy order processing system uses a monolithic stored procedure +that handles validation, calculation, and persistence in one transaction. + +## Decision +We will gradually extract business logic into application layer while +maintaining the stored procedure for backwards compatibility. + +## Consequences +- Positive: Can test business logic independently +- Positive: Can gradually migrate to new system +- Negative: Temporary duplication during migration +- Negative: Must maintain both paths during transition +``` + +### Code Comments for Non-Obvious Behavior + +```python +def calculate_discount(order): + # NOTE: VIP discount logic has historical quirk: + # Orders placed on weekends get additional 5% even for VIPs + # due to legacy promotion that can't be removed yet. + # See ticket #1234 for context. 
+ + base_discount = 0.10 if order.customer.is_vip else 0 + + if order.placed_on_weekend(): + # Don't refactor this away - it's required + base_discount += 0.05 + + return order.total * base_discount +``` + +### Test Documentation + +```python +def test_weekend_vip_discount_edge_case(): + """ + Legacy behavior: VIP customers get extra 5% on weekends. + + This is a historical promotion that some customers depend on. + Do NOT remove without business approval. + + Context: Implemented in 2018 for holiday promotion, never removed. + See: docs/promotions/weekend-vip-discount.md + """ + customer = Customer(is_vip=True) + order = Order(total=100, customer=customer, date=saturday()) + + discount = calculate_discount(order) + + assert discount == 15 # 10% VIP + 5% weekend +``` + +--- + +## Summary: Legacy Code Testing Strategy + +1. **Start with characterization tests** to document current behavior +2. **Identify seams** where you can introduce test doubles +3. **Break dependencies** using safe, mechanical refactorings +4. **Prioritize by risk and touch**: Test critical code and code you're changing +5. **Use sprout and wrap** to add new functionality testably +6. **Extract methods** to create testable units +7. **Apply strangler fig** for gradual architectural improvement +8. **Document discoveries** for the team + +**Remember**: You cannot test all legacy code at once. Be strategic, be incremental, and focus on creating safety nets around the areas you're actively working in. Each small improvement makes the next one easier. diff --git a/references/methodologies.md b/references/methodologies.md new file mode 100644 index 0000000..0603d0c --- /dev/null +++ b/references/methodologies.md @@ -0,0 +1,663 @@ +# Testing Methodologies: Comprehensive Guide + +## Test-Driven Development (TDD) + +Test-Driven Development follows the red-green-refactor cycle: write a failing test (red), write minimal code to make it pass (green), then refactor while keeping tests green. TDD advocates argue this approach leads to better design, higher quality code, and comprehensive test coverage. + +### The Red-Green-Refactor Cycle + +**Red**: Write a failing test that describes the desired behavior +```python +def test_calculate_discount_for_vip_customer(): + customer = Customer(type="VIP") + order = Order(total=100) + + discount = calculate_discount(customer, order) + + assert discount == 10 # VIP customers get 10% discount +``` + +**Green**: Write minimal code to make the test pass +```python +def calculate_discount(customer, order): + if customer.type == "VIP": + return order.total * 0.10 + return 0 +``` + +**Refactor**: Improve code structure while keeping tests green +```python +class DiscountCalculator: + VIP_DISCOUNT_RATE = 0.10 + + def calculate(self, customer, order): + if customer.is_vip(): + return order.total * self.VIP_DISCOUNT_RATE + return 0 +``` + +### When TDD Provides Value + +**Building Complex Business Logic** +When the problem is well understood but the solution isn't yet clear, TDD helps explore the design space through tests. Writing tests first forces consideration of the interface before implementation details. 
+ +**Examples**: +- Implementing pricing rules with multiple conditions +- Building discount calculation engine with various customer tiers +- Creating validation logic with complex business rules +- Developing workflow engines with state transitions + +**Benefits**: +- Forces clear thinking about requirements +- Produces testable, decoupled design +- Creates comprehensive test coverage automatically +- Documents behavior through tests + +**Exploring API Design** +TDD excels when designing public APIs or interfaces. Writing tests from the consumer perspective reveals awkward APIs, missing functionality, or unclear contracts. + +**Examples**: +- Designing a data access layer API +- Creating a service layer interface +- Building a library's public API +- Developing a plugin system + +**Benefits**: +- Tests written from user perspective reveal API usability issues +- Interface emerges organically from usage +- Contract clearly defined before implementation +- Refactoring doesn't break consumers + +**Algorithmic Solutions and Data Structures** +For well-defined algorithmic problems, TDD provides confidence that edge cases are handled and performance requirements met. + +**Examples**: +- Implementing sorting or searching algorithms +- Building custom data structures (trees, graphs, caches) +- Creating parsers or compilers +- Developing encryption or hashing functions + +**Benefits**: +- Edge cases discovered early +- Regression prevention during optimization +- Performance can be verified through tests +- Correctness validated incrementally + +**Pure Functions and Transformations** +When building functions with clear inputs and outputs and no side effects, TDD is straightforward and highly effective. + +**Examples**: +- Data transformation pipelines +- Calculation engines +- Formatting and parsing utilities +- Functional programming constructs + +**Benefits**: +- Tests are simple: input → output +- No mocking or stubbing complexity +- Fast test execution +- Complete behavior specification + +### When TDD is Overkill or Counterproductive + +**UI Implementation and Visual Design** +TDD struggles with highly visual work where you need to see output to know if it's correct. Testing pixel-perfect layouts, animations, or visual aesthetics through TDD is inefficient. + +**Why TDD Doesn't Fit**: +- Visual appearance requires human judgment +- Tests written first don't know what "looks good" means +- Constant refinement of visual elements requires test rewrites +- Visual regression tools are better than TDD for this + +**Alternative Approach**: Build UI, then add tests for behavior (interactions, state management, data flow). Use visual regression testing for appearance. + +**Exploratory Spike Solutions** +When exploring a new technology, third-party library, or proof-of-concept, TDD slows learning. Experimentation requires freedom to try things quickly without the overhead of tests. + +**Why TDD Doesn't Fit**: +- Don't know what the API should look like yet +- Exploring what's possible, not implementing requirements +- Rapid iteration needed without test maintenance burden +- May throw away code after learning + +**Alternative Approach**: Spike without tests to learn, then implement properly with TDD (or test-later) once direction is clear. + +**Framework Learning and Tutorials** +When learning a new framework or following a tutorial, TDD adds complexity that distracts from learning the framework itself. 
+ +**Why TDD Doesn't Fit**: +- Focus should be on understanding framework concepts +- Tutorial code is exploratory and throwaway +- Framework patterns may dictate testing approach +- Learning two things at once (framework + TDD) is harder + +**Alternative Approach**: Learn framework through examples and tutorials, then apply TDD to real projects once comfortable. + +**Short-Deadline Tactical Solutions** +When facing time pressure on well-understood problems with established patterns, TDD's upfront investment may not be worth it. + +**Why TDD Doesn't Fit**: +- Team already knows the solution pattern +- Risk is low (well-traveled path) +- Time to market is critical +- Tests can be added after if needed + +**Alternative Approach**: Implement directly with test-later approach, focusing on critical path coverage. + +**Simple CRUD and Glue Code** +Trivial pass-through code, simple getters/setters, or basic CRUD operations provide little benefit from TDD. + +**Why TDD Doesn't Fit**: +- No interesting logic to test-drive +- Framework already handles the complexity +- Tests provide little signal beyond "code compiles" +- Integration tests are more valuable + +**Alternative Approach**: Cover with integration tests that verify framework integration, skip unit-level TDD. + +### TDD Best Practices + +**Write the Smallest Failing Test** +Don't write all tests upfront. Write one small test, make it pass, refactor, repeat. This tight loop provides constant feedback. + +**Follow the Three Laws of TDD** +1. Don't write production code until you have a failing test +2. Don't write more of a test than is sufficient to fail +3. Don't write more production code than is sufficient to pass the test + +**Refactor Both Test and Production Code** +Tests are first-class code. Refactor for clarity, remove duplication, and improve structure in both test and production code. + +**Test One Thing at a Time** +Each test should verify one behavior or scenario. Multiple assertions are fine if they verify different aspects of the same behavior. + +**Use Descriptive Test Names** +Test names should describe the scenario and expected outcome: `test_vip_customers_receive_ten_percent_discount()` + +**Keep Tests Fast** +TDD relies on running tests constantly. If tests are slow, developers skip them, breaking the cycle. Use fakes and stubs to keep unit tests fast. + +--- + +## Behavior-Driven Development (BDD) + +Behavior-Driven Development extends TDD by emphasizing collaboration between developers, testers, and business stakeholders through shared specifications written in natural language. BDD uses Given-When-Then format to describe scenarios in business terms. + +### The Three Practices of BDD + +**Discovery**: Collaborative exploration of requirements +Bring together product, QA, and engineering to explore requirements through concrete examples. Example mapping sessions identify rules, examples, questions, and assumptions. + +**Example Discovery Session**: +``` +Feature: VIP Customer Discounts + +Rule: VIP customers receive 10% discount on all orders + Example: VIP customer ordering $100 item + Given a VIP customer + When they order an item for $100 + Then they receive a $10 discount + + Example: Regular customer ordering $100 item + Given a regular customer + When they order an item for $100 + Then they receive no discount + +Question: Do VIP customers get discounts on already-discounted items? 
+Assumption: VIP status is determined by total spend > $10,000 +``` + +**Formulation**: Document examples in structured format +Convert examples into Gherkin scenarios that serve as both specification and test: + +```gherkin +Feature: VIP Customer Discounts + As a VIP customer + I want to receive discounts on my orders + So that I'm rewarded for my loyalty + + Background: + Given the following customers exist: + | Name | Type | + | Alice | VIP | + | Bob | Regular | + + Scenario: VIP customer receives discount + Given Alice is logged in + When she adds a $100 item to cart + And she proceeds to checkout + Then she should see a $10 discount + And her total should be $90 + + Scenario: Regular customer receives no discount + Given Bob is logged in + When he adds a $100 item to cart + And he proceeds to checkout + Then he should see no discount + And his total should be $100 +``` + +**Automation**: Implement step definitions that execute scenarios +```python +from pytest_bdd import scenario, given, when, then, parsers + +@scenario('vip_discounts.feature', 'VIP customer receives discount') +def test_vip_discount(): + pass + +@given(parsers.parse('{customer} is logged in')) +def login_customer(context, customer): + context.customer = customers[customer] + context.session = login(context.customer) + +@when(parsers.parse('she adds a ${price:d} item to cart')) +def add_item_to_cart(context, price): + context.cart = Cart(context.session) + context.cart.add_item(Item(price=price)) + +@when('she proceeds to checkout') +def proceed_to_checkout(context): + context.order = context.cart.checkout() + +@then(parsers.parse('she should see a ${discount:d} discount')) +def verify_discount(context, discount): + assert context.order.discount == discount +``` + +### When BDD Provides Value + +**User-Facing Features with Stakeholder Collaboration** +BDD shines when building features where business stakeholders can meaningfully contribute to defining behavior. + +**Examples**: +- E-commerce checkout flow +- Account registration and onboarding +- Reporting and analytics features +- Approval workflows +- Customer-facing integrations + +**Benefits**: +- Shared understanding between business and technical teams +- Requirements emerge through concrete examples +- Scenarios serve as acceptance criteria +- Living documentation stays up to date + +**Complex Workflows with Multiple Paths** +When workflows have many variations and edge cases, BDD examples help enumerate and clarify all scenarios. + +**Examples**: +- Multi-step application processes +- Order fulfillment with various shipping options +- Insurance quote calculation with many factors +- Loan approval with different criteria + +**Benefits**: +- Examples reveal edge cases and variations +- Business rules become explicit +- Missing scenarios identified through discussion +- Comprehensive coverage of workflow paths + +**Aligning Product, Engineering, and QA** +BDD's collaborative approach creates shared understanding and reduces misunderstandings about requirements. + +**Examples**: +- Features with ambiguous requirements +- Projects with multiple stakeholders +- Cross-functional team collaboration +- Customer-driven feature development + +**Benefits**: +- Reduces rework from misunderstood requirements +- Creates shared language between roles +- Enables earlier feedback on requirements +- Builds trust through transparency + +**Living Documentation Needs** +When documentation that stays current with code is valuable, BDD scenarios serve as executable specifications. 
+ +**Examples**: +- Complex business rules that change frequently +- Regulatory compliance requirements +- Third-party API integrations +- Customer-facing feature documentation + +**Benefits**: +- Documentation is verified by tests +- Can't become outdated (tests fail if it does) +- Business users can read and understand +- Single source of truth for behavior + +### BDD Limitations and When to Skip It + +**Technical Components Without Business Stakeholders** +For purely technical components, BDD's business-readable format adds overhead without providing value. + +**Examples**: +- Algorithms and data structures +- Utility functions and helpers +- Framework integration code +- Database access layers + +**Why BDD Doesn't Fit**: +- No business stakeholders to collaborate with +- Technical language is clearer than business language +- Gherkin adds verbosity without benefit +- Traditional unit tests are more appropriate + +**Projects Without Collaborative Discovery** +If business stakeholders aren't involved in refining requirements, BDD loses its primary benefit. + +**Why BDD Doesn't Fit**: +- No collaboration means no shared understanding benefit +- Scenarios written by developers alone don't add value over tests +- Overhead of Gherkin without collaborative payoff +- Better to write traditional tests + +**Simple Features with Clear Requirements** +When requirements are straightforward and unambiguous, BDD's collaborative approach is overkill. + +**Why BDD Doesn't Fit**: +- No ambiguity to resolve through examples +- Overhead not justified by complexity +- Traditional tests are sufficient +- Faster to implement without BDD ceremony + +### BDD Best Practices + +**Keep Scenarios Declarative, Not Imperative** +Focus on what the user wants to achieve, not how the system implements it. + +```gherkin +# ❌ BAD: Imperative, implementation-focused +Scenario: Login + Given I navigate to "https://example.com/login" + When I enter "user@example.com" in the "email" field + And I enter "password123" in the "password" field + And I click the "Login" button with id "login-btn" + Then I should see the URL change to "https://example.com/dashboard" + +# ✅ GOOD: Declarative, behavior-focused +Scenario: Login + Given I am a registered user + When I log in with valid credentials + Then I should see my dashboard +``` + +**Use Background for Common Setup** +Avoid repeating setup steps across scenarios with Background sections. + +**Don't Over-Specify UI Details** +Scenarios should be resilient to UI changes. Avoid tying scenarios to specific HTML elements or page structure. + +**One Scenario Per Business Rule** +Keep scenarios focused. If a scenario grows too complex, split it into multiple scenarios. + +**Involve Business Stakeholders in Discovery** +BDD's value comes from collaboration. Don't write scenarios in isolation—involve product, QA, and business users. + +--- + +## Test-Later and Pragmatic Approaches + +Not all development begins with tests, and mature teams recognize when to write tests during or after implementation. + +### Test-During Development + +Writing tests alongside implementation (but not necessarily first) represents a pragmatic middle ground. This approach works well when the solution is roughly known but details need exploration. + +**When to Use Test-During**: +- Building features with some unknowns +- Refactoring code that needs tests added +- Implementing based on examples or prototypes +- Balancing speed with quality + +**Workflow**: +1. Sketch out the interface or design +2. 
Implement a piece of functionality +3. Write tests for what was just built +4. Refine both code and tests +5. Repeat for next piece + +**Benefits**: +- Flexibility to explore during implementation +- Still achieves good test coverage +- Less upfront overthinking than TDD +- Tests written with full knowledge of implementation + +**Trade-offs**: +- May miss API design issues TDD would catch +- Temptation to skip tests if timeline pressure +- Tests may be biased by implementation +- Less comprehensive coverage than TDD + +### Test-Later for Legacy Code + +When working with untested legacy code, test-first isn't an option because the code already exists and may not be testable. + +**The Legacy Code Dilemma**: +- Want to refactor to improve code +- Can't refactor safely without tests +- Can't write tests because code isn't testable +- Need to change code to make it testable (catch-22) + +**Breaking the Cycle**: +1. Write characterization tests to lock in current behavior +2. Make small, safe refactorings to introduce seams +3. Break dependencies to make code testable +4. Write proper behavioral tests +5. Refactor with confidence from test coverage + +**Characterization Testing Workflow**: +```python +# Step 1: Write test with unknown expected output +def test_process_order_behavior(): + result = legacy_process_order(order_data) + assert result == ??? # Don't know what this should be yet + +# Step 2: Run test, let it fail, observe actual output +# Test fails: AssertionError: Expected ??? but got {'status': 'processed', 'id': 123} + +# Step 3: Update test to expect actual output +def test_process_order_behavior(): + result = legacy_process_order(order_data) + assert result == {'status': 'processed', 'id': 123} # Now documents actual behavior + +# Step 4: Now safe to refactor - test will catch if behavior changes +``` + +**Prioritization Strategy**: +- Test what you touch: Add tests for areas being modified +- Test by risk: Cover critical paths and high-bug areas first +- Test by change frequency: Cover frequently modified code +- Don't attempt comprehensive coverage upfront + +### Test-After for Prototypes and Spikes + +When building proof-of-concept code or exploring solutions, tests are often skipped intentionally. + +**When Test-After is Appropriate**: +- True prototypes meant to be thrown away +- Exploratory coding to learn a technology +- Rapid validation of a concept +- Time-boxed experiments + +**The Key Question**: Will this code go to production? +- **Yes**: Tests should be added before production +- **No**: Tests may be unnecessary + +**Workflow for Production-Bound Prototypes**: +1. Build prototype quickly without tests +2. Validate concept with stakeholders +3. If approved, rewrite properly with tests (don't just add tests to prototype) +4. Or add tests to prototype if code quality is good + +### Pragmatic Test Writing + +Real-world development requires balancing principles with pragmatism. 
+ +**Focus on Value**: +- Test critical paths thoroughly +- Test bug-prone areas comprehensively +- Test frequently changed code religiously +- Test stable, simple code minimally + +**Risk-Based Testing**: +- High risk (payment, auth, data loss): Comprehensive testing +- Medium risk (features, workflows): Solid coverage of happy path + key errors +- Low risk (utilities, config): Basic coverage or integration-level only + +**Time-Boxed Testing**: +- When deadlines loom, prioritize test coverage +- Cover critical paths first +- Add comprehensive tests later if time allows +- Track testing debt explicitly + +--- + +## Testing Around Public Behavior and Contracts + +Modern testing philosophy emphasizes testing through public interfaces and stable contracts rather than implementation details, regardless of methodology. + +### Public Contract Testing Principles + +**Test What, Not How** +Tests should verify observable behavior visible to callers, not internal implementation. + +```python +# ❌ BAD: Testing implementation details +def test_order_service_calls_repository(): + order_service = OrderService(mock_repository) + + order_service.create_order(data) + + mock_repository.save.assert_called_once() # Testing HOW + +# ✅ GOOD: Testing observable behavior +def test_order_service_creates_order(): + order_service = OrderService(repository) + + order = order_service.create_order(data) + + assert order.id is not None # Testing WHAT + assert order.status == 'pending' +``` + +**Test Through Public Interfaces** +Interact with code the same way real callers would. Don't bypass encapsulation for testing convenience. + +```python +# ❌ BAD: Accessing private state +def test_user_activation(): + user = User(email="test@example.com") + + user.activate() + + assert user._is_activated == True # Accessing private field + +# ✅ GOOD: Testing through public interface +def test_user_activation(): + user = User(email="test@example.com") + + user.activate() + + assert user.is_active() == True # Using public method +``` + +**Make Implementation Changes Safely** +Tests coupled to implementation break during refactoring. Tests coupled to behavior remain stable. + +```python +# Implementation V1 +class OrderService: + def create_order(self, items): + order = Order() + for item in items: + order.add_item(item) + self.repository.save(order) + return order + +# Tests check observable behavior +def test_create_order_returns_saved_order(): + service = OrderService(repository) + items = [Item("Widget", 10)] + + order = service.create_order(items) + + assert order.total == 10 + assert len(order.items) == 1 + +# Implementation V2 (refactored) +class OrderService: + def create_order(self, items): + order = Order.from_items(items) # Refactored: different internal implementation + self.repository.save(order) + return order + +# ✅ Tests still pass - behavior unchanged +``` + +### Contract Testing for Microservices + +At system scale, public behavior testing becomes contract testing between services. 
+ +**Consumer-Driven Contracts** +Consumers define expectations, providers verify they meet them: + +```python +# Consumer Service: Defines what it needs +def test_user_service_contract(): + # Define expectation + expected_schema = { + 'id': int, + 'email': str, + 'name': str + } + + # Test against contract + response = user_service_client.get_user(user_id) + + assert matches_schema(response, expected_schema) + assert response['email'].endswith('@example.com') + +# Provider Service: Verifies it meets consumer expectations +def test_user_service_meets_contract(): + # Load consumer's contract expectations + contract = load_contract('order-service-user-contract') + + # Verify provider satisfies contract + verify_provider_honors_contract(UserServiceAPI, contract) +``` + +**Benefits of Contract Testing**: +- Services can be tested independently +- Breaking changes detected immediately +- No need for full integration environment +- Fast feedback on compatibility +- Enables independent deployment + +**Microservices Testing Philosophy**: +- Test implementation details minimally (unit tests for critical logic) +- Test service contracts thoroughly (contract tests) +- Test integration selectively (key scenarios) +- Test end-to-end sparingly (critical user journeys) + +--- + +## Methodology Selection Guide + +Choose testing methodology based on context: + +| Context | Recommended Approach | Rationale | +|---------|---------------------|-----------| +| New business logic feature | TDD | Forces clean design, comprehensive coverage | +| User-facing workflow | BDD | Stakeholder collaboration, living documentation | +| Exploring new technology | Test-later | Learning takes precedence over testing | +| Legacy code modification | Characterization tests → TDD | Can't test first if code isn't testable | +| Microservices API | Contract testing | Service independence, fast feedback | +| Simple CRUD endpoint | Test-during | Straightforward implementation, tests verify integration | +| Algorithm implementation | TDD | Edge cases, correctness critical | +| UI prototype | Test-after (or never) | Visual feedback more important than tests | +| Critical security code | TDD + BDD | Comprehensive coverage, stakeholder validation | +| Configuration/glue code | Integration tests | Little logic to unit test | + +**The Pragmatic Rule**: Use the methodology that provides the best return on investment for your specific context. No methodology is universally superior—effectiveness depends on the problem, team, and constraints. diff --git a/references/testing-layers.md b/references/testing-layers.md new file mode 100644 index 0000000..85590c0 --- /dev/null +++ b/references/testing-layers.md @@ -0,0 +1,595 @@ +# Testing Layers: Detailed Guidance + +## Unit Tests: The Foundation + +Unit tests verify individual units of code in isolation—typically a single function, method, or class. They represent the foundation of most testing strategies because they are fast, focused, and provide precise failure diagnostics. 
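+
+As a minimal sketch, a unit test for a hypothetical pure function such as `calculate_discount` touches no database, network, or framework, so it runs in milliseconds and a failure points directly at the discount logic (the function and its rules below are assumed purely for illustration):
+
+```
+def calculate_discount(order_total, customer_tier):
+    """Assumed example implementation: VIP customers get 10% off."""
+    if customer_tier == "vip" and order_total > 0:
+        return round(order_total * 0.10, 2)
+    return 0.0
+
+def test_vip_customer_receives_ten_percent_discount():
+    assert calculate_discount(100.0, "vip") == 10.0
+
+def test_regular_customer_receives_no_discount():
+    assert calculate_discount(100.0, "regular") == 0.0
+
+def test_zero_total_yields_no_discount():
+    assert calculate_discount(0.0, "vip") == 0.0
+```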
+ +### Characteristics + +- **Speed**: Milliseconds per test +- **Isolation**: No external dependencies (databases, file systems, networks) +- **Focus**: Single unit of functionality +- **Precision**: Failures point directly to the problematic code +- **Coverage**: Can exhaustively test edge cases and error conditions + +### When to Use Unit Tests + +Unit tests excel in these scenarios: + +**Business Logic and Algorithms** +- Order calculations, discount logic, pricing rules +- Validation logic and data transformation +- Complex conditional logic with multiple branches +- Mathematical calculations and algorithms +- Pure functions with deterministic outputs + +**Examples**: +- Testing that a discount calculation applies correct percentages based on order total +- Verifying that password validation rejects weak passwords +- Confirming that a sorting algorithm handles edge cases (empty arrays, single elements) +- Checking that currency conversion produces accurate results + +**Edge Cases and Boundary Conditions** +- Null or undefined inputs +- Empty collections (arrays, strings, objects) +- Boundary values (0, -1, maximum integers) +- Invalid inputs and error conditions +- Concurrent access scenarios + +**Examples**: +- Testing that a function handles null inputs gracefully +- Verifying behavior with empty strings or arrays +- Confirming off-by-one errors don't exist at boundaries +- Ensuring proper error messages for invalid inputs + +**Data Transformations and Parsing** +- JSON/XML parsing and serialization +- Data mapping between models +- String manipulation and formatting +- Data validation and sanitization + +### When NOT to Use Unit Tests + +Unit tests become problematic or low-value in these scenarios: + +**Database Operations** +Testing database queries, ORM behavior, or data access layers with mocked repositories provides limited value. The real risk lies in SQL correctness, schema compatibility, and query performance—aspects that mocks cannot verify. Use integration tests with real databases instead. + +**Framework-Specific Behavior** +Testing that a web framework correctly handles routing, middleware, request parsing, or dependency injection through pure unit tests creates more problems than solutions. These behaviors depend on framework implementation details and are better verified through integration tests. + +**UI Rendering and Interaction** +Testing that a React component correctly calls a callback function provides less value than testing that it renders the right UI for given props. UI testing belongs at the integration or E2E level where actual rendering can be verified. + +**Cross-Service Communication** +Testing API clients, message queue handlers, or service-to-service communication with mocks cannot verify actual network behavior, serialization, error handling, or compatibility. Integration or contract tests are more appropriate. + +**Simple Pass-Through Code** +Trivial getters, setters, constructors, or delegation code that simply forwards calls without logic don't benefit from unit tests. Testing these creates maintenance burden without meaningful quality improvement. 
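+
+By contrast, the transformation and parsing code listed under "When to Use Unit Tests" is exactly where unit tests pay off. A minimal sketch, with the helper `parse_money` and its behavior assumed only for illustration:
+
+```
+import pytest
+
+def parse_money(raw):
+    """Assumed example: parse '$1,234.56' into integer cents."""
+    cleaned = raw.strip().replace("$", "").replace(",", "")
+    if not cleaned:
+        raise ValueError("empty amount")
+    return round(float(cleaned) * 100)
+
+def test_parses_formatted_amount_to_cents():
+    assert parse_money("$1,234.56") == 123456
+
+def test_rejects_empty_input():
+    with pytest.raises(ValueError):
+        parse_money("   ")
+```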
+ +### Unit Test Design Patterns + +**Arrange-Act-Assert (AAA) Structure** +``` +# Arrange +user = User(email="test@example.com", age=25) +validator = UserValidator() + +# Act +result = validator.validate(user) + +# Assert +assert result.is_valid == True +assert result.errors == [] +``` + +**Parameterized Tests for Multiple Cases** +``` +@pytest.mark.parametrize("age,expected", [ + (-1, False), # Below minimum + (0, False), # Boundary + (17, False), # Just below valid + (18, True), # Minimum valid + (65, True), # Typical valid + (120, True), # Maximum valid + (121, False), # Above maximum +]) +def test_age_validation(age, expected): + assert validate_age(age) == expected +``` + +**Test Doubles for Dependencies** +``` +# Stub: Provides canned responses +class StubUserRepository: + def find_by_id(self, user_id): + return User(id=user_id, name="Test User") + +# Mock: Verifies interactions +mock_email_service = Mock() +user_service.register(user) +mock_email_service.send_welcome_email.assert_called_once_with(user.email) +``` + +### Recommended Coverage Targets + +**High Coverage (90-100%)** +- Business logic and domain models +- Algorithmic code and calculations +- Validation and transformation logic +- Complex conditional logic + +**Moderate Coverage (70-90%)** +- Utility functions and helpers +- Data access objects (excluding actual DB calls) +- API response mappers +- Configuration and initialization code + +**Low Coverage (<70%) Acceptable** +- Simple property accessors +- Framework-generated code +- Third-party library wrappers +- Code better tested at integration level + +--- + +## Integration Tests: Verifying Component Interactions + +Integration tests verify how multiple components work together, focusing on boundaries and interactions between modules, services, or layers. They occupy the middle ground between isolated unit tests and full-system E2E tests. 
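+
+As a rough illustration of that middle ground, the sketch below wires a hypothetical `OrderCreated` handler to a real (in-memory) inventory repository and checks the boundary between them rather than either piece alone. All names are assumed for illustration; in a real suite the repository would typically be backed by an actual database:
+
+```
+class InventoryRepository:
+    """Assumed stand-in for a database-backed repository."""
+    def __init__(self, stock):
+        self._stock = dict(stock)
+
+    def reserve(self, sku, quantity):
+        if self._stock.get(sku, 0) < quantity:
+            raise ValueError(f"insufficient stock for {sku}")
+        self._stock[sku] -= quantity
+
+    def available(self, sku):
+        return self._stock.get(sku, 0)
+
+def handle_order_created(event, inventory):
+    """Assumed event handler: reserve stock for each order line."""
+    for line in event["lines"]:
+        inventory.reserve(line["sku"], line["quantity"])
+
+def test_order_created_event_reserves_inventory():
+    inventory = InventoryRepository({"WIDGET": 5})
+    event = {"lines": [{"sku": "WIDGET", "quantity": 2}]}
+
+    handle_order_created(event, inventory)
+
+    assert inventory.available("WIDGET") == 3
+```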
+ +### Characteristics + +- **Speed**: Seconds per test +- **Real Dependencies**: Uses actual implementations of some components +- **Boundary Focus**: Tests interfaces between modules/services +- **Moderate Scope**: Multiple components but not entire system +- **Balance**: Realistic behavior with manageable complexity + +### When to Use Integration Tests + +**Database Integration** +- Testing repository methods with real database (or in-memory equivalent) +- Verifying ORM mappings and query correctness +- Confirming transaction handling and rollback behavior +- Testing database constraints and cascading operations +- Validating query performance with realistic data volumes + +**Examples**: +- Verify that saving a user with a duplicate email triggers a constraint violation +- Confirm that deleting a parent record cascades to child records +- Test that a complex JOIN query returns expected results +- Validate that pagination works correctly with large datasets + +**API Endpoint Testing** +- Testing HTTP endpoints with the full framework +- Verifying request parsing and validation +- Confirming response serialization and status codes +- Testing middleware behavior (auth, logging, rate limiting) +- Validating error handling and exception mapping + +**Examples**: +- POST to `/api/users` with valid data returns 201 with created user +- POST to `/api/users` with invalid email returns 400 with error details +- GET `/api/users/:id` for non-existent user returns 404 +- Requests without authentication token return 401 + +**Message Queue and Event-Driven Integration** +- Testing message producers and consumers +- Verifying message serialization and deserialization +- Confirming event handler registration and invocation +- Testing retry logic and dead letter handling +- Validating message ordering and idempotency + +**Examples**: +- Publishing an "OrderCreated" event triggers inventory reservation +- Failed message processing moves message to dead letter queue +- Duplicate messages are processed idempotently +- Message batching works correctly under load + +**Cross-Cutting Concerns** +- Authentication and authorization flow +- Caching behavior and invalidation +- Transaction management across layers +- Logging and observability instrumentation +- Rate limiting and throttling + +**Examples**: +- Verify that cached data is returned on subsequent requests +- Confirm that unauthorized users cannot access protected resources +- Test that failed transactions roll back completely +- Validate that rate limits block excessive requests + +### When NOT to Use Integration Tests + +**Exhaustive Edge Case Testing** +Integration tests are slower than unit tests and shouldn't be used to test every edge case, boundary condition, or error scenario. Test algorithmic correctness and edge cases at the unit level; use integration tests to verify components integrate properly. + +**Complete User Workflows** +Testing entire user workflows through multiple API calls, UI interactions, and system states belongs at the E2E level. Integration tests should focus on specific component interactions, not complete business processes. + +**Implementation Detail Verification** +Integration tests should verify observable behavior at component boundaries, not internal implementation details. If an integration test breaks when refactoring internal code without changing external behavior, it's testing the wrong thing. 
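+
+A short sketch of that distinction at the integration level, assuming a Flask-style `test_client` fixture and a mocked repository fixture (both hypothetical): the good test asserts on the observable HTTP response, so internal refactoring does not break it.
+
+```
+# ❌ BAD: breaks whenever internal wiring changes
+def test_create_order_calls_repository(test_client, mock_order_repository):
+    test_client.post('/api/orders', json={'sku': 'WIDGET', 'quantity': 1})
+
+    mock_order_repository.save.assert_called_once()
+
+# ✅ GOOD: survives refactoring as long as observable behavior is unchanged
+def test_create_order_returns_created_order(test_client):
+    response = test_client.post('/api/orders', json={'sku': 'WIDGET', 'quantity': 1})
+
+    assert response.status_code == 201
+    assert response.json['status'] == 'pending'
+```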
+ +**Simple Pass-Through Logic** +Code that simply forwards calls between components without transformation or logic doesn't need integration testing. If a service method just delegates to a repository, testing it adds little value. + +### Integration Test Design Patterns + +**Test Database Management** +``` +# Use transactions that rollback +@pytest.fixture +def db_session(): + connection = engine.connect() + transaction = connection.begin() + session = Session(bind=connection) + + yield session + + session.close() + transaction.rollback() + connection.close() + +def test_create_user(db_session): + user = User(email="test@example.com") + db_session.add(user) + db_session.commit() + + found = db_session.query(User).filter_by(email="test@example.com").first() + assert found is not None +``` + +**API Testing with Test Client** +``` +def test_create_user_api(test_client): + response = test_client.post('/api/users', json={ + 'email': 'test@example.com', + 'name': 'Test User' + }) + + assert response.status_code == 201 + assert response.json['email'] == 'test@example.com' + assert 'id' in response.json +``` + +**Container-Based Dependencies** +``` +@pytest.fixture(scope="session") +def postgres_container(): + # Start PostgreSQL container + container = PostgresContainer("postgres:15") + container.start() + + yield container + + container.stop() + +def test_with_real_postgres(postgres_container): + connection_string = postgres_container.get_connection_url() + # Test with real PostgreSQL instance +``` + +### Recommended Coverage Targets + +**High Coverage (70-90%)** +- Repository and data access layers +- API controllers and route handlers +- Service layer with framework integration +- Message queue handlers +- Cross-cutting concern implementations + +**Selective Coverage** +- Third-party API integrations (sample scenarios) +- File system operations (key workflows) +- External service communication (happy path + critical errors) + +--- + +## End-to-End Tests: Critical Path Validation + +End-to-end (E2E) tests verify the entire system from a user's perspective, simulating real user interactions through the UI or API. They provide the highest confidence that the system works as a complete whole but come with significant costs. 
+ +### Characteristics + +- **Speed**: Seconds to minutes per test +- **Full Stack**: Tests entire system including UI, backend, database +- **User Perspective**: Simulates real user interactions +- **Production-Like**: Runs in environment resembling production +- **Fragility**: More prone to flakiness due to complexity + +### When to Use E2E Tests + +**Critical User Journeys** +Focus exclusively on workflows that, if broken, would severely impact users or business: + +- User registration and login flow +- Checkout and payment processing +- Core feature workflows (search, book, confirm) +- Account management (password reset, profile update) +- Critical integrations (payment gateways, shipping APIs) + +**Examples**: +- Complete e-commerce checkout: browse → add to cart → checkout → payment → confirmation +- User onboarding: sign up → verify email → complete profile → access application +- Booking flow: search → select → customize → pay → receive confirmation +- Content publication: create → review → approve → publish → verify live + +**Cross-System Integration Validation** +Verify that deployed systems integrate correctly: + +- Multiple microservices working together +- Frontend and backend communication +- Third-party service integration +- Email delivery and notifications +- File uploads and processing + +**Examples**: +- Order placement triggers inventory service, payment service, and notification service +- File upload processes correctly through CDN to storage to thumbnail generation +- OAuth integration with external provider completes successfully + +**Smoke Tests for Deployment Validation** +Minimal E2E tests that verify deployment succeeded: + +- Application starts and responds +- Critical paths are accessible +- Database connections work +- External service connectivity verified + +### When NOT to Use E2E Tests + +**Edge Cases and Error Handling** +Testing every validation error message, boundary condition, or error scenario through the UI is slow and brittle. Test edge cases at the unit level where they can be verified quickly and precisely. + +**Examples of What NOT to E2E Test**: +- ❌ Every possible validation error on a registration form +- ❌ All password strength combinations +- ❌ Boundary conditions on numeric inputs +- ✅ One representative validation failure (tested at E2E) +- ✅ Comprehensive validation (tested at unit/integration level) + +**Business Logic Permutations** +Testing all combinations of discount rules, pricing scenarios, or business logic variations through E2E tests creates a massive, slow suite. Test business logic thoroughly at the unit level; verify basic integration works E2E. + +**Intermediate System States** +E2E tests should verify complete workflows from user perspective, not intermediate states or backend processes not visible to users. Internal system behavior belongs at unit or integration level. + +**Implementation Details** +E2E tests should never verify internal implementation details like specific API calls, database queries, or service interactions. Test observable user-facing behavior only. 
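+
+As a sketch of the deployment smoke tests mentioned above (base URL and endpoints are placeholders), a handful of plain HTTP checks is often enough to confirm a deployment is alive before the full E2E suite runs:
+
+```
+import os
+import requests
+
+BASE_URL = os.environ.get("SMOKE_BASE_URL", "https://staging.example.com")
+
+def test_application_responds():
+    response = requests.get(f"{BASE_URL}/health", timeout=5)
+    assert response.status_code == 200
+
+def test_login_page_is_reachable():
+    response = requests.get(f"{BASE_URL}/login", timeout=5)
+    assert response.status_code == 200
+
+def test_health_endpoint_reports_database_connectivity():
+    response = requests.get(f"{BASE_URL}/health", timeout=5)
+    assert response.json().get("database") == "ok"
+```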
+ +### E2E Test Design Patterns + +**Page Object Pattern** +``` +class LoginPage: + def __init__(self, browser): + self.browser = browser + + def navigate(self): + self.browser.get("https://app.example.com/login") + + def enter_credentials(self, email, password): + self.browser.find_element_by_id("email").send_keys(email) + self.browser.find_element_by_id("password").send_keys(password) + + def submit(self): + self.browser.find_element_by_id("login-button").click() + + def is_error_displayed(self): + return self.browser.find_element_by_class("error").is_displayed() + +def test_login_invalid_credentials(): + login_page = LoginPage(browser) + login_page.navigate() + login_page.enter_credentials("invalid@example.com", "wrongpassword") + login_page.submit() + + assert login_page.is_error_displayed() +``` + +**Explicit Waits (Not Sleeps)** +``` +# ❌ BAD: Fixed sleep +click_button() +time.sleep(3) # Hope page loads +assert_element_visible() + +# ✅ GOOD: Explicit wait for condition +click_button() +wait_until(lambda: element_is_visible("success-message"), timeout=10) +assert_element_visible("success-message") +``` + +**Test Data Management** +``` +@pytest.fixture +def test_user(): + # Create unique test user for this test + user = create_user( + email=f"test-{uuid4()}@example.com", + password="TestPassword123!" + ) + + yield user + + # Cleanup + delete_user(user.id) +``` + +### Recommended Coverage Targets + +**Must Cover (5-10 E2E tests)** +- User registration and login +- Core value proposition workflow (the main thing users do) +- Payment/checkout flow if applicable +- Critical administrative functions +- Account recovery/password reset + +**Should Cover (10-20 E2E tests)** +- Secondary feature workflows +- Important edge cases visible to users +- Cross-browser compatibility for critical paths +- Mobile responsiveness for key workflows + +**Don't Cover with E2E** +- Every variation and edge case +- Internal backend processes +- API-only functionality (use integration tests) +- Features with low business impact + +--- + +## Contract Tests: Service Boundary Validation + +Contract testing, particularly relevant in microservices architectures, verifies the agreements between service consumers and providers. Rather than testing complete integration, contract tests validate that APIs meet expectations defined in consumer-driven contracts. 
+ +### Characteristics + +- **Speed**: Fast (similar to unit tests) +- **Isolation**: Tests consumer and provider independently +- **Contract Focus**: Verifies API structure and behavior expectations +- **Consumer-Driven**: Contracts defined by consumer needs +- **Independent Deployment**: Enables service deployment without full integration + +### When to Use Contract Tests + +**Microservices Architectures** +Contract tests excel when services are developed and deployed independently: + +- Multiple teams own different services +- Services evolve at different rates +- Coordination overhead for integration testing is high +- Fast feedback about breaking changes is crucial + +**Examples**: +- User service publishes contract defining how it provides user data +- Order service consumes this contract to verify it can fetch user information +- When user service changes its API, contract tests immediately fail if breaking +- Teams can verify compatibility without coordinating deployment timing + +**Third-Party API Integration** +Contract tests verify your code correctly consumes external APIs: + +- Define expectations about external API structure and behavior +- Verify your code handles expected responses +- Detect when external API changes break your integration +- Test against contract without calling actual external service + +**Examples**: +- Define contract for payment gateway API (charge endpoint, response format) +- Test that your payment service correctly calls payment gateway +- Verify handling of success, failure, and error responses +- Run tests without calling real payment gateway + +**Multi-Team Service Development** +Contract tests enable teams to work independently: + +- Provider team defines what they offer +- Consumer team defines what they need +- Contract tests verify compatibility +- Teams deploy independently with confidence + +### When NOT to Use Contract Tests + +**Monolithic Applications** +Contract tests are unnecessary when all components deploy together. Internal modules within a monolith can be tested with integration tests that verify actual runtime behavior. + +**Single Team Codebases** +When one team owns both consumer and provider, traditional integration tests may be simpler and more valuable than contract testing. The coordination benefits of contract testing don't apply. + +**Business Logic Verification** +Contract tests validate structure and format, not business logic or data correctness. They ensure the API shape matches expectations but don't verify that calculations are correct or workflows complete properly. + +**UI Behavior Testing** +Contract tests focus on API contracts, not user interface behavior. Use integration or E2E tests for UI validation. 
+ +### Contract Test Design Patterns + +**Consumer-Driven Contract Testing (Pact)** +``` +# Consumer defines expectations +def test_get_user_contract(pact): + expected = { + 'id': pact.like(1), + 'email': pact.like('user@example.com'), + 'name': pact.like('Test User'), + 'created_at': pact.like('2024-01-01T00:00:00Z') + } + + (pact + .given('User 123 exists') + .upon_receiving('A request for user 123') + .with_request('GET', '/api/users/123') + .will_respond_with(200, body=expected)) + + with pact: + user_service = UserService('http://localhost') + user = user_service.get_user(123) + + assert user.email == 'user@example.com' + assert user.name == 'Test User' + +# Provider verifies contract +def test_provider_honors_contract(): + verifier = Verifier(provider='UserService', pact_url='./pacts') + verifier.verify_pacts() +``` + +**Schema-Based Contract Testing** +``` +# Define API schema +user_schema = { + "type": "object", + "properties": { + "id": {"type": "integer"}, + "email": {"type": "string", "format": "email"}, + "name": {"type": "string"}, + "created_at": {"type": "string", "format": "date-time"} + }, + "required": ["id", "email", "name"] +} + +# Consumer validates response against schema +def test_get_user_response_format(): + response = requests.get('http://localhost/api/users/123') + validate(response.json(), user_schema) + +# Provider validates their implementation produces valid schema +def test_user_endpoint_matches_schema(): + response = client.get('/api/users/123') + validate(response.json, user_schema) +``` + +### Recommended Coverage + +**Essential Contracts** +- All public APIs consumed by other services +- Critical third-party integrations +- Service boundaries between teams + +**Contract Scope** +- Request/response structure and types +- Required vs. optional fields +- Status codes and error responses +- Key business rules at the boundary + +**Don't Contract Test** +- Internal implementation details +- Database schemas (use integration tests) +- UI components (use integration/E2E tests) +- Complete end-to-end workflows (use E2E tests) + +--- + +## Summary: Choosing the Right Test Level + +Use this decision tree to quickly determine appropriate test level: + +1. **Is it business logic, calculations, or algorithms?** → Unit test +2. **Does it involve databases, APIs, or framework integration?** → Integration test +3. **Is it a critical user-facing workflow?** → E2E test (minimal coverage) +4. **Is it a service boundary in microservices architecture?** → Contract test +5. **Are edge cases or error conditions being tested?** → Unit test +6. **Does it require the full system running?** → E2E test (only if critical) + +**Remember**: The testing pyramid (70-80% unit, 15-25% integration, 5-10% E2E) guides proportional investment, not rigid rules. Adapt based on your architecture, risks, and context.