gh-tachyon-beep-skillpacks-…/skills/using-training-optimization/experiment-tracking.md


# Experiment Tracking Skill

## When to Use This Skill

Use this skill when:
- User starts training a model and asks "should I track this experiment?"
- User wants to reproduce a previous result but doesn't remember settings
- Training runs overnight and user needs persistent logs
- User asks "which tool should I use: TensorBoard, W&B, or MLflow?"
- Multiple experiments running and user can't compare results
- User wants to share results with teammates or collaborators
- Model checkpoints accumulating with no organization or versioning
- User asks "what should I track?" or "how do I make experiments reproducible?"
- Debugging training issues and needs historical data (metrics, gradients)
- User wants to visualize training curves or compare hyperparameters
- Working on a research project that requires tracking many experiments
- User lost their best result and can't reproduce it

Do NOT use when:
- User is doing quick prototyping with throwaway code (<5 minutes)
- Only running inference on pre-trained models (no training)
- Single experiment that's already tracked and working
- User is asking about hyperparameter tuning strategy (not tracking)
- Discussing model architecture design (not experiment management)


## Core Principles

### 1. Track Before You Need It (Can't Add Retroactively)

The BIGGEST mistake: waiting to track until results are worth saving.

**The Reality**:
- The best result is ALWAYS the one you didn't track
- Can't add tracking after the experiment completes
- Human memory fails within hours (let alone days/weeks)
- Print statements disappear when terminal closes
- Code changes between experiments (git state matters)

**When Tracking Matters**:
```
Experiment value curve:
    ^
    |                    ╱─  Peak result (untracked = lost forever)
    |                  ╱
    |                ╱
    |              ╱
    |            ╱
    |          ╱
    |        ╱
    |      ╱
    |____╱________________________________>
         Start                        Time

If you wait to track "important" experiments, you've already lost them.
```

**Track From Day 1**:
- First experiment (even if "just testing")
- Every hyperparameter change
- Every model architecture variation
- Every data preprocessing change

**Decision Rule**: If you're running `python train.py`, you should be tracking. No exceptions.


### 2. Complete Tracking = Hyperparameters + Metrics + Artifacts + Environment

Reproducibility requires tracking EVERYTHING that affects the result.

**The Five Categories**:

```
┌─────────────────────────────────────────────────────────┐
│ 1. HYPERPARAMETERS (what you're tuning)                │
├─────────────────────────────────────────────────────────┤
│ • Learning rate, batch size, optimizer type             │
│ • Model architecture (width, depth, activation)         │
│ • Regularization (weight decay, dropout)                │
│ • Training length (epochs, steps)                       │
│ • Data augmentation settings                            │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ 2. METRICS (how you're doing)                           │
├─────────────────────────────────────────────────────────┤
│ • Training loss (every step or epoch)                   │
│ • Validation loss (every epoch)                         │
│ • Evaluation metrics (accuracy, F1, mAP, etc.)          │
│ • Learning rate schedule (actual LR each step)          │
│ • Gradient norms (for debugging)                        │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ 3. ARTIFACTS (what you're saving)                       │
├─────────────────────────────────────────────────────────┤
│ • Model checkpoints (with epoch/step metadata)          │
│ • Training plots (loss curves, confusion matrices)      │
│ • Predictions on validation set                         │
│ • Logs (stdout, stderr)                                 │
│ • Config files (for reproducibility)                    │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ 4. CODE VERSION (what you're running)                   │
├─────────────────────────────────────────────────────────┤
│ • Git commit hash                                        │
│ • Git branch name                                        │
│ • Dirty status (uncommitted changes)                    │
│ • Code diff (if uncommitted)                            │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ 5. ENVIRONMENT (where you're running)                   │
├─────────────────────────────────────────────────────────┤
│ • Python version, PyTorch version                       │
│ • CUDA version, GPU type                                │
│ • Random seeds (Python, NumPy, PyTorch, CUDA)           │
│ • Data version (if dataset changes)                     │
│ • Hardware (CPU, RAM, GPU count)                        │
└─────────────────────────────────────────────────────────┘
```

**Reproducibility Test**:
> Can someone else (or future you) reproduce the result with ONLY the tracked information?

If NO, you're not tracking enough.


### 3. Tool Selection: Local vs Team vs Production

Different tools for different use cases. Choose based on your needs.

**Tool Comparison**:

| Feature | TensorBoard | Weights & Biases | MLflow | Custom |
|---------|-------------|------------------|--------|--------|
| **Setup Complexity** | Low | Low | Medium | High |
| **Local Only** | Yes | No (cloud) | Yes | Yes |
| **Team Collaboration** | Limited | Excellent | Good | Custom |
| **Cost** | Free | Free tier + paid | Free | Free |
| **Scalability** | Medium | High | High | Low |
| **Visualization** | Good | Excellent | Good | Custom |
| **Integration** | PyTorch, TF | Everything | Everything | Manual |
| **Best For** | Solo projects | Team research | Production | Specific needs |

**Decision Tree**:
```
Do you need team collaboration?
├─ YES → Need to share results with teammates?
│   ├─ YES → Weights & Biases (best team features)
│   └─ NO → MLflow (self-hosted, more control)
│
└─ NO → Solo project?
    ├─ YES → TensorBoard (simplest, local)
    └─ NO → MLflow (scales to production)

Budget constraints?
├─ FREE only → TensorBoard or MLflow
└─ Can pay → W&B (worth it for teams)

Production deployment?
├─ YES → MLflow (production-ready)
└─ NO → TensorBoard or W&B (research)
```

**Recommendation**:
- **Starting out / learning**: TensorBoard (easiest, free, local)
- **Research team / collaboration**: Weights & Biases (best UX, sharing)
- **Production ML / enterprise**: MLflow (self-hosted, model registry)
- **Specific needs / customization**: Custom logging (CSV + Git)


### 4. Minimal Overhead, Maximum Value

Tracking should cost 1-5% overhead, not 50%.

**What to Track at Different Frequencies**:

```python
# Every step (high frequency, small data):
log_every_step = {
    "train_loss": loss.item(),
    "learning_rate": optimizer.param_groups[0]['lr'],
    "step": global_step,
}

# Every epoch (medium frequency, medium data):
log_every_epoch = {
    "train_loss_avg": train_losses.mean(),
    "val_loss": val_loss,
    "val_accuracy": val_acc,
    "epoch": epoch,
}

# Once per experiment (low frequency, large data):
log_once = {
    "hyperparameters": config,
    "git_commit": get_git_hash(),
    "environment": {
        "python_version": sys.version,
        "torch_version": torch.__version__,
        "cuda_version": torch.version.cuda,
    },
}

# Only on improvement (conditional):
if val_loss < best_val_loss:
    save_checkpoint(model, optimizer, epoch, val_loss)
    log_artifact("best_model.pt")
```

**Overhead Guidelines**:
- Logging scalars (loss, accuracy): <0.1% overhead (always do)
- Logging images/plots: 1-2% overhead (do every epoch)
- Logging checkpoints: 5-10% overhead (do only on improvement)
- Logging gradients: 10-20% overhead (do only for debugging)

**Don't Track**:
- Raw training data (too large, use data versioning instead)
- Every intermediate activation (use profiling tools instead)
- Full model weights every step (only on improvement)


### 5. Experiment Organization: Naming, Tagging, Grouping

With 100+ experiments, organization is survival.

**Naming Convention**:
```python
# GOOD: Descriptive, sortable, parseable
experiment_name = f"{model}_{dataset}_{timestamp}_{hyperparams}"
# Examples:
# "resnet18_cifar10_20241030_lr0.01_bs128"
# "bert_squad_20241030_lr3e-5_warmup1000"
# "gpt2_wikitext_20241030_ctx512_layers12"

# BAD: Uninformative
experiment_name = "test"
experiment_name = "final"
experiment_name = "model_v2"
experiment_name = "test_again_actually_final"
```

**Tagging Strategy**:
```python
# Tags for filtering and grouping
tags = {
    "model": "resnet18",
    "dataset": "cifar10",
    "experiment_type": "hyperparameter_search",
    "status": "completed",
    "goal": "beat_baseline",
    "author": "john",
}

# Can filter later:
# - Show me all "hyperparameter_search" experiments
# - Show me all "resnet18" on "cifar10"
# - Show me experiments by "john"
```

**Grouping Related Experiments**:
```python
# Group by goal/project
project = "cifar10_sota"
group = "learning_rate_search"
experiment_name = f"{project}/{group}/lr_{lr}"

# Hierarchy:
# cifar10_sota/
#   ├─ learning_rate_search/
#   │   ├─ lr_0.001
#   │   ├─ lr_0.01
#   │   └─ lr_0.1
#   ├─ architecture_search/
#   │   ├─ resnet18
#   │   ├─ resnet34
#   │   └─ resnet50
#   └─ regularization_search/
#       ├─ dropout_0.1
#       ├─ dropout_0.3
#       └─ dropout_0.5
```


## Tool-Specific Integration

### TensorBoard (Local, Simple)

**Setup**:
```python
from torch.utils.tensorboard import SummaryWriter

# Create writer
writer = SummaryWriter(f"runs/{experiment_name}")

# Log hyperparameters
hparams = {
    "learning_rate": 0.01,
    "batch_size": 128,
    "optimizer": "adam",
}
metrics = {
    "best_val_acc": 0.0,
}
writer.add_hparams(hparams, metrics)
```

**During Training**:
```python
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Training step
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Log every N steps
        global_step = epoch * len(train_loader) + batch_idx
        if global_step % log_interval == 0:
            writer.add_scalar("train/loss", loss.item(), global_step)
            writer.add_scalar("train/lr", optimizer.param_groups[0]['lr'], global_step)

    # Validation
    val_loss, val_acc = evaluate(model, val_loader)
    writer.add_scalar("val/loss", val_loss, epoch)
    writer.add_scalar("val/accuracy", val_acc, epoch)

    # Log images (confusion matrix, etc.)
    if epoch % 10 == 0:
        fig = plot_confusion_matrix(model, val_loader)
        writer.add_figure("val/confusion_matrix", fig, epoch)

writer.close()
```

**View Results**:
```bash
tensorboard --logdir=runs
# Opens web UI at http://localhost:6006
```

**Pros**:
- Simple setup (2 lines of code)
- Local (no cloud dependency)
- Good visualizations (scalars, images, graphs)
- Integrated with PyTorch

**Cons**:
- No hyperparameter comparison table
- Limited team collaboration
- No artifact storage (checkpoints)
- Manual experiment management


### Weights & Biases (Team, Cloud)

**Setup**:
```python
import wandb

# Initialize experiment
wandb.init(
    project="cifar10-sota",
    name=experiment_name,
    config={
        "learning_rate": 0.01,
        "batch_size": 128,
        "optimizer": "adam",
        "model": "resnet18",
        "dataset": "cifar10",
    },
    tags=["hyperparameter_search", "resnet"],
)

# Config is automatically tracked
config = wandb.config
```

**During Training**:
```python
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Log metrics
        wandb.log({
            "train/loss": loss.item(),
            "train/lr": optimizer.param_groups[0]['lr'],
            "epoch": epoch,
        })

    # Validation
    val_loss, val_acc = evaluate(model, val_loader)
    wandb.log({
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "epoch": epoch,
    })

    # Save checkpoint
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_model.pt")
        wandb.save("best_model.pt")  # Upload to cloud

# Log final results
wandb.log({"best_val_accuracy": best_val_acc})
wandb.finish()
```

**Advanced Features**:
```python
# Log images
wandb.log({"examples": [wandb.Image(img, caption=f"Pred: {pred}") for img, pred in samples]})

# Log plots
fig = plot_confusion_matrix(model, val_loader)
wandb.log({"confusion_matrix": wandb.Image(fig)})

# Log tables (for result analysis)
table = wandb.Table(columns=["epoch", "train_loss", "val_loss", "val_acc"])
for epoch, tl, vl, va in zip(epochs, train_losses, val_losses, val_accs):
    table.add_data(epoch, tl, vl, va)
wandb.log({"results": table})

# Log model architecture
wandb.watch(model, log="all", log_freq=100)  # Logs gradients + weights
```

**View Results**:
- Web UI: https://wandb.ai/your-username/cifar10-sota
- Compare experiments side-by-side
- Share links with teammates
- Filter by tags, hyperparameters

**Pros**:
- Excellent team collaboration (share links)
- Beautiful visualizations
- Hyperparameter comparison (parallel coordinates)
- Artifact versioning (models, data)
- Integration with everything (PyTorch, TF, JAX)

**Cons**:
- Cloud-based (requires internet)
- Free tier limits (100GB storage)
- Data leaves your machine (privacy concern)


### MLflow (Production, Self-Hosted)

**Setup**:
```python
import mlflow
import mlflow.pytorch

# Start experiment
mlflow.set_experiment("cifar10-sota")

# Start run
with mlflow.start_run(run_name=experiment_name):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("batch_size", 128)
    mlflow.log_param("optimizer", "adam")
    mlflow.log_param("model", "resnet18")

    # Training loop
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, optimizer)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log metrics
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)

    # Log final metrics
    mlflow.log_metric("best_val_accuracy", best_val_acc)

    # Log model
    mlflow.pytorch.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("config.yaml")
    mlflow.log_artifact("best_model.pt")
```

**View Results**:
```bash
mlflow ui
# Opens web UI at http://localhost:5000
```

**Model Registry** (for production):
```python
# Register model
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "cifar10-resnet18")

# Load registered model
model = mlflow.pytorch.load_model("models:/cifar10-resnet18/production")
```

**Pros**:
- Self-hosted (full control, privacy)
- Model registry (production deployment)
- Scales to large teams
- Integration with deployment tools

**Cons**:
- More complex setup (need server)
- Visualization not as good as W&B
- Less intuitive UI


## Reproducibility Patterns

### 1. Seed Everything

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Deterministic operations (slower but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# At start of training
set_seed(42)

# Log seed
config = {"seed": 42}
```

**Warning**: Deterministic mode can be 10-20% slower. Trade-off between speed and reproducibility.


### 2. Capture Git State

```python
import subprocess

def get_git_info():
    """Capture current git state."""
    try:
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
        branch = subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode('ascii').strip()

        # Check for uncommitted changes
        status = subprocess.check_output(['git', 'status', '--porcelain']).decode('ascii').strip()
        is_dirty = len(status) > 0

        # Get diff if dirty
        diff = None
        if is_dirty:
            diff = subprocess.check_output(['git', 'diff']).decode('ascii')

        return {
            "commit": commit,
            "branch": branch,
            "is_dirty": is_dirty,
            "diff": diff,
        }
    except Exception as e:
        return {"error": str(e)}

# Log git info
git_info = get_git_info()
if git_info.get("is_dirty"):
    print("WARNING: Uncommitted changes detected!")
    print("Experiment may not be reproducible without the diff.")
```


### 3. Environment Capture

```python
import sys
import torch

def get_environment_info():
    """Capture environment details."""
    return {
        "python_version": sys.version,
        "torch_version": torch.__version__,
        "cuda_version": torch.version.cuda,
        "cudnn_version": torch.backends.cudnn.version(),
        "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        "gpu_count": torch.cuda.device_count(),
    }

# Save requirements.txt
# pip freeze > requirements.txt

# Or use pip-tools
# pip-compile requirements.in
```


### 4. Config Files for Reproducibility

```python
# config.yaml
model:
  name: resnet18
  num_classes: 10

training:
  learning_rate: 0.01
  batch_size: 128
  num_epochs: 100
  optimizer: adam
  weight_decay: 0.0001

data:
  dataset: cifar10
  augmentation: true
  normalize: true

# Load config
import yaml
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Save config alongside results
import shutil
shutil.copy("config.yaml", f"results/{experiment_name}/config.yaml")
```


## Experiment Comparison

### 1. Comparing Metrics

```python
# TensorBoard: Compare multiple runs
# tensorboard --logdir=runs --port=6006
# Select multiple runs in UI

# W&B: Filter and compare
# Go to project page, select runs, click "Compare"

# MLflow: Query experiments
import mlflow

# Get all runs from an experiment
experiment = mlflow.get_experiment_by_name("cifar10-sota")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Filter by metric
best_runs = runs[runs["metrics.val_accuracy"] > 0.85]

# Sort by metric
best_runs = runs.sort_values("metrics.val_accuracy", ascending=False)

# Analyze hyperparameter impact
import pandas as pd
import seaborn as sns

# Plot learning rate vs accuracy
sns.scatterplot(data=runs, x="params.learning_rate", y="metrics.val_accuracy")
```


### 2. Hyperparameter Analysis

```python
# W&B: Parallel coordinates plot
# Shows which hyperparameter combinations lead to best results
# UI: Click "Parallel Coordinates" in project view

# MLflow: Custom analysis
import matplotlib.pyplot as plt

# Group by hyperparameter
for lr in [0.001, 0.01, 0.1]:
    lr_runs = runs[runs["params.learning_rate"] == str(lr)]
    accuracies = lr_runs["metrics.val_accuracy"]
    plt.scatter([lr] * len(accuracies), accuracies, alpha=0.5, label=f"LR={lr}")

plt.xlabel("Learning Rate")
plt.ylabel("Validation Accuracy")
plt.xscale("log")
plt.legend()
plt.title("Learning Rate vs Accuracy")
plt.show()
```


### 3. Comparing Artifacts

```python
# Compare model checkpoints
from torchvision.models import resnet18

# Load two models
model_a = resnet18()
model_a.load_state_dict(torch.load("experiments/exp_a/best_model.pt"))

model_b = resnet18()
model_b.load_state_dict(torch.load("experiments/exp_b/best_model.pt"))

# Compare on validation set
acc_a = evaluate(model_a, val_loader)
acc_b = evaluate(model_b, val_loader)

print(f"Model A: {acc_a:.2%}")
print(f"Model B: {acc_b:.2%}")

# Compare predictions
preds_a = model_a(val_data)
preds_b = model_b(val_data)
agreement = (preds_a.argmax(1) == preds_b.argmax(1)).float().mean()
print(f"Prediction agreement: {agreement:.2%}")
```


## Collaboration Workflows

### 1. Sharing Results (W&B)

```python
# Share experiment link
# https://wandb.ai/your-username/cifar10-sota/runs/run-id

# Create report
# W&B UI: Click "Create Report" → Add charts, text, code

# Export results
# W&B UI: Click "Export" → CSV, JSON, or API

# API access for programmatic sharing
import wandb
api = wandb.Api()
runs = api.runs("your-username/cifar10-sota")

for run in runs:
    print(f"{run.name}: {run.summary['val_accuracy']}")
```


### 2. Team Experiment Dashboard

```python
# MLflow: Shared tracking server
# Server machine:
mlflow server --host 0.0.0.0 --port 5000

# Team members:
import mlflow
mlflow.set_tracking_uri("http://shared-server:5000")

# Everyone logs to same server
with mlflow.start_run():
    mlflow.log_metric("val_accuracy", 0.87)
```


### 3. Experiment Handoff

```python
# Package experiment for reproducibility
experiment_package = {
    "code": "git_commit_hash",
    "config": "config.yaml",
    "model": "best_model.pt",
    "results": "results.csv",
    "logs": "training.log",
    "environment": "requirements.txt",
}

# Create reproducibility script
# reproduce.sh
"""
#!/bin/bash
git checkout <commit-hash>
pip install -r requirements.txt
python train.py --config config.yaml
"""
```


## Complete Tracking Example

Here's a production-ready tracking setup:

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
import wandb
import yaml
import subprocess
from pathlib import Path
from datetime import datetime

class ExperimentTracker:
    """Complete experiment tracking wrapper."""

    def __init__(self, config, experiment_name=None, use_wandb=True, use_tensorboard=True):
        self.config = config
        self.use_wandb = use_wandb
        self.use_tensorboard = use_tensorboard

        # Generate experiment name
        if experiment_name is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            experiment_name = f"{config['model']}_{config['dataset']}_{timestamp}"
        self.experiment_name = experiment_name

        # Create experiment directory
        self.exp_dir = Path(f"experiments/{experiment_name}")
        self.exp_dir.mkdir(parents=True, exist_ok=True)

        # Initialize tracking tools
        if self.use_tensorboard:
            self.tb_writer = SummaryWriter(self.exp_dir / "tensorboard")

        if self.use_wandb:
            wandb.init(
                project=config.get("project", "default"),
                name=experiment_name,
                config=config,
                dir=self.exp_dir,
            )

        # Save config
        with open(self.exp_dir / "config.yaml", "w") as f:
            yaml.dump(config, f)

        # Capture environment
        self._log_environment()

        # Capture git state
        self._log_git_state()

        # Setup logging
        self._setup_logging()

        self.global_step = 0
        self.best_metric = float('-inf')

    def _log_environment(self):
        """Log environment information."""
        import sys
        env_info = {
            "python_version": sys.version,
            "torch_version": torch.__version__,
            "cuda_version": torch.version.cuda if torch.cuda.is_available() else None,
            "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
            "gpu_count": torch.cuda.device_count(),
        }

        # Save to file
        with open(self.exp_dir / "environment.yaml", "w") as f:
            yaml.dump(env_info, f)

        # Log to W&B
        if self.use_wandb:
            wandb.config.update({"environment": env_info})

    def _log_git_state(self):
        """Log git commit and status."""
        try:
            commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
            branch = subprocess.check_output(['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode('ascii').strip()
            status = subprocess.check_output(['git', 'status', '--porcelain']).decode('ascii').strip()
            is_dirty = len(status) > 0

            git_info = {
                "commit": commit,
                "branch": branch,
                "is_dirty": is_dirty,
            }

            # Save to file
            with open(self.exp_dir / "git_info.yaml", "w") as f:
                yaml.dump(git_info, f)

            # Save diff if dirty
            if is_dirty:
                diff = subprocess.check_output(['git', 'diff']).decode('ascii')
                with open(self.exp_dir / "git_diff.patch", "w") as f:
                    f.write(diff)
                print("WARNING: Uncommitted changes detected! Saved to git_diff.patch")

            # Log to W&B
            if self.use_wandb:
                wandb.config.update({"git": git_info})

        except Exception as e:
            print(f"Failed to capture git state: {e}")

    def _setup_logging(self):
        """Setup file logging."""
        import logging
        self.logger = logging.getLogger(self.experiment_name)
        self.logger.setLevel(logging.INFO)

        # File handler
        fh = logging.FileHandler(self.exp_dir / "training.log")
        fh.setLevel(logging.INFO)

        # Console handler
        ch = logging.StreamHandler()
        ch.setLevel(logging.INFO)

        # Formatter
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        fh.setFormatter(formatter)
        ch.setFormatter(formatter)

        self.logger.addHandler(fh)
        self.logger.addHandler(ch)

    def log_metrics(self, metrics, step=None):
        """Log metrics to all tracking backends."""
        if step is None:
            step = self.global_step

        # TensorBoard
        if self.use_tensorboard:
            for key, value in metrics.items():
                if isinstance(value, (int, float)):
                    self.tb_writer.add_scalar(key, value, step)

        # W&B
        if self.use_wandb:
            wandb.log(metrics, step=step)

        # File
        self.logger.info(f"Step {step}: {metrics}")

        self.global_step = step + 1

    def save_checkpoint(self, model, optimizer, epoch, metric_value, metric_name="val_accuracy"):
        """Save model checkpoint with metadata."""
        checkpoint = {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            metric_name: metric_value,
            "config": self.config,
        }

        # Save latest checkpoint
        checkpoint_path = self.exp_dir / "checkpoints" / f"checkpoint_epoch_{epoch}.pt"
        checkpoint_path.parent.mkdir(exist_ok=True)
        torch.save(checkpoint, checkpoint_path)

        # Save best checkpoint
        if metric_value > self.best_metric:
            self.best_metric = metric_value
            best_path = self.exp_dir / "checkpoints" / "best_model.pt"
            torch.save(checkpoint, best_path)

            self.logger.info(f"New best model saved: {metric_name}={metric_value:.4f}")

            # Log to W&B
            if self.use_wandb:
                wandb.log({f"best_{metric_name}": metric_value})
                wandb.save(str(best_path))

        return checkpoint_path

    def log_figure(self, name, figure, step=None):
        """Log matplotlib figure."""
        if step is None:
            step = self.global_step

        # TensorBoard
        if self.use_tensorboard:
            self.tb_writer.add_figure(name, figure, step)

        # W&B
        if self.use_wandb:
            wandb.log({name: wandb.Image(figure)}, step=step)

        # Save to disk
        fig_path = self.exp_dir / "figures" / f"{name}_step_{step}.png"
        fig_path.parent.mkdir(exist_ok=True)
        figure.savefig(fig_path)

    def finish(self):
        """Clean up and close tracking backends."""
        if self.use_tensorboard:
            self.tb_writer.close()

        if self.use_wandb:
            wandb.finish()

        self.logger.info("Experiment tracking finished.")


# Usage example
if __name__ == "__main__":
    config = {
        "project": "cifar10-sota",
        "model": "resnet18",
        "dataset": "cifar10",
        "learning_rate": 0.01,
        "batch_size": 128,
        "num_epochs": 100,
        "optimizer": "adam",
        "seed": 42,
    }

    # Initialize tracker
    tracker = ExperimentTracker(config)

    # Training loop
    model = create_model(config)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])

    for epoch in range(config["num_epochs"]):
        train_loss = train_epoch(model, train_loader, optimizer)
        val_loss, val_acc = evaluate(model, val_loader)

        # Log metrics
        tracker.log_metrics({
            "train/loss": train_loss,
            "val/loss": val_loss,
            "val/accuracy": val_acc,
            "epoch": epoch,
        })

        # Save checkpoint
        tracker.save_checkpoint(model, optimizer, epoch, val_acc)

        # Log figure (every 10 epochs)
        if epoch % 10 == 0:
            fig = plot_confusion_matrix(model, val_loader)
            tracker.log_figure("confusion_matrix", fig)

    # Finish
    tracker.finish()
```


## Pitfalls and Anti-Patterns

### Pitfall 1: Tracking Metrics But Not Config

**Symptom**: Have CSV with 50 experiments' metrics, but no idea what hyperparameters produced them.

**Why It Happens**:
- User focuses on "what matters" (the metric)
- Assumes they'll remember settings
- Doesn't realize metrics without context are useless

**Fix**:
```python
# WRONG: Only metrics
with open("results.csv", "a") as f:
    f.write(f"{epoch},{train_loss},{val_loss}\n")

# RIGHT: Metrics + config
experiment_id = f"exp_{timestamp}"
with open(f"{experiment_id}_config.yaml", "w") as f:
    yaml.dump(config, f)
with open(f"{experiment_id}_results.csv", "w") as f:
    f.write(f"{epoch},{train_loss},{val_loss}\n")
```


### Pitfall 2: Overwriting Checkpoints Without Versioning

**Symptom**: Always saving to "best_model.pt", can't recover earlier checkpoints.

**Why It Happens**:
- Disk space concerns (misguided)
- Only care about "best" model
- Don't anticipate evaluation bugs

**Fix**:
```python
# WRONG: Overwriting
torch.save(model.state_dict(), "best_model.pt")

# RIGHT: Versioned checkpoints
torch.save(model.state_dict(), f"checkpoints/model_epoch_{epoch}.pt")
torch.save(model.state_dict(), f"checkpoints/best_model_val_acc_{val_acc:.4f}.pt")
```


### Pitfall 3: Using Print Instead of Logging

**Symptom**: Training crashes, all print output lost, can't debug.

**Why It Happens**:
- Print is simpler than logging
- Works for short scripts
- Doesn't anticipate crashes

**Fix**:
```python
# WRONG: Print statements
print(f"Epoch {epoch}: loss={loss}")

# RIGHT: Proper logging
import logging
logging.basicConfig(filename="training.log", level=logging.INFO)
logging.info(f"Epoch {epoch}: loss={loss}")
```


### Pitfall 4: No Git Tracking for Code Changes

**Symptom**: Can't reproduce result because code changed between experiments.

**Why It Happens**:
- Rapid iteration (uncommitted changes)
- "I'll commit later"
- Don't realize code version matters

**Fix**:
```python
# Log git commit at start of training
git_hash = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
config["git_commit"] = git_hash

# Better: Require clean git state
status = subprocess.check_output(['git', 'status', '--porcelain']).decode('ascii').strip()
if status:
    print("ERROR: Uncommitted changes detected!")
    print("Commit your changes before running experiments.")
    sys.exit(1)
```


### Pitfall 5: Not Tracking Random Seeds

**Symptom**: Same code, same hyperparameters, different results every time.

**Why It Happens**:
- Forget to set seed
- Set seed in one place but not others (PyTorch, NumPy, CUDA)
- Don't log seed value

**Fix**:
```python
# Set all seeds
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Use seed from config
set_seed(config["seed"])

# Log seed
tracker.log_metrics({"seed": config["seed"]})
```


### Pitfall 6: Tracking Too Much Data (Storage Bloat)

**Symptom**: 100GB of logs for 50 experiments, can't store more.

**Why It Happens**:
- Logging every step (not just epoch)
- Saving all checkpoints (not just best)
- Logging high-resolution images

**Fix**:
```python
# Log at appropriate frequency
if global_step % 100 == 0:  # Every 100 steps, not every step
    tracker.log_metrics({"train/loss": loss})

# Save only best checkpoints
if val_acc > best_val_acc:  # Only when improving
    tracker.save_checkpoint(model, optimizer, epoch, val_acc)

# Downsample images
img_low_res = F.interpolate(img, size=(64, 64))  # Don't log 224x224
```


### Pitfall 7: No Experiment Naming Convention

**Symptom**: experiments/test, experiments/test2, experiments/final, experiments/final_final

**Why It Happens**:
- No planning for multiple experiments
- Naming feels unimportant
- "I'll organize later"

**Fix**:
```python
# Good naming convention
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
experiment_name = f"{config['model']}_{config['dataset']}_{timestamp}_lr{config['lr']}"
# Example: "resnet18_cifar10_20241030_120000_lr0.01"
```


### Pitfall 8: Not Tracking Evaluation Metrics

**Symptom**: Saved best model by training loss, but validation loss was actually increasing (overfitting).

**Why It Happens**:
- Only tracking training metrics
- Assuming training loss = model quality
- Not validating frequently enough

**Fix**:
```python
# Track both training and validation
tracker.log_metrics({
    "train/loss": train_loss,
    "val/loss": val_loss,  # Don't forget validation!
    "val/accuracy": val_acc,
})

# Save best model by validation metric, not training
if val_acc > best_val_acc:
    tracker.save_checkpoint(model, optimizer, epoch, val_acc)
```


### Pitfall 9: Local-Only Tracking for Team Projects

**Symptom**: Team members can't see each other's experiments, duplicate work.

**Why It Happens**:
- TensorBoard is local by default
- Don't realize collaboration tools exist
- Privacy concerns (unfounded)

**Fix**:
```python
# Use team-friendly tool
wandb.init(project="team-project")  # Everyone can see

# Or: Share TensorBoard logs
# scp -r runs/ shared-server:/path/
# tensorboard --logdir=/path/runs --host=0.0.0.0
```


### Pitfall 10: No Tracking Until "Important" Experiment

**Symptom**: First 20 experiments untracked, realize they had valuable insights.

**Why It Happens**:
- "Just testing" mentality
- Tracking feels like overhead
- Don't realize importance until later

**Fix**:
```python
# Track from experiment 1
# Even if "just testing", it takes 30 seconds to set up tracking
tracker = ExperimentTracker(config)

# Future you will thank past you
```


## Rationalization vs Reality Table

| User Rationalization | Reality | Recommendation |
|----------------------|---------|----------------|
| "I'll remember what I tried" | You won't (memory fails in hours) | Track from day 1, always |
| "Print statements are enough" | Lost on crash or terminal close | Use proper logging to file |
| "Only track final metrics" | Can't debug without intermediate data | Track every epoch minimum |
| "Just save best model" | Need checkpoints for analysis | Version all important checkpoints |
| "Tracking adds too much overhead" | <1% overhead for scalars | Log metrics, not raw data |
| "I only need the model file" | Need hyperparameters to understand it | Save config + model + metrics |
| "TensorBoard is too complex" | 2 lines of code to set up | Start simple, expand later |
| "I'll organize experiments later" | Never happens, chaos ensues | Use naming convention from start |
| "Git commits slow me down" | Uncommitted code = irreproducible | Commit before experiments |
| "Cloud tracking costs money" | Free tiers are generous | W&B free: 100GB, unlimited experiments |
| "I don't need reproducibility" | Your future self will | Track environment + seed + git |
| "Tracking is for production, not research" | Research needs it more (exploration) | Research = more experiments = more tracking |


## Red Flags (Likely to Fail)

1. **"I'll track it later"**
   - Reality: Later = never; best results are always untracked
   - Action: Track from experiment 1

2. **"Just using print statements"**
   - Reality: Lost on crash/close; can't analyze later
   - Action: Use logging framework or tracking tool

3. **"Only tracking the final metric"**
   - Reality: Can't debug convergence issues; no training curves
   - Action: Track every epoch at minimum

4. **"Saving to best_model.pt (overwriting)"**
   - Reality: Can't recover earlier checkpoints; evaluation bugs = disaster
   - Action: Version checkpoints with epoch/metric

5. **"Don't need to track hyperparameters"**
   - Reality: Metrics without config are meaningless
   - Action: Log config alongside metrics

6. **"Not tracking git commit"**
   - Reality: Code changes = irreproducible
   - Action: Log git hash, check for uncommitted changes

7. **"Random seed doesn't matter"**
   - Reality: Can cause 5%+ variance in results
   - Action: Set and log all seeds

8. **"TensorBoard/W&B is overkill for me"**
   - Reality: Setup takes 2 minutes, saves hours later
   - Action: Use simplest tool (TensorBoard), expand if needed

9. **"I'm just testing, don't need tracking"**
   - Reality: Best results come from "tests"
   - Action: Track everything, including tests

10. **"Team doesn't need to see my experiments"**
    - Reality: Collaboration requires transparency
    - Action: Use shared tracking (W&B, MLflow server)


## When This Skill Applies

**Strong Signals** (definitely use):
- Starting a new ML project (even "quick prototype")
- User asks "should I track this?"
- User lost their best result and can't reproduce
- Multiple experiments running (need comparison)
- Team collaboration (need to share results)
- User asks about TensorBoard, W&B, or MLflow
- Training crashes and user needs debugging data

**Weak Signals** (maybe use):
- User has tracking but it's incomplete
- Asking about reproducibility
- Discussing hyperparameter tuning (needs tracking)
- Long-running training (overnight, multi-day)

**Not Applicable**:
- Pure inference (no training)
- Single experiment already tracked
- Discussing model architecture only
- Data preprocessing questions (pre-training)


## Success Criteria

You've successfully applied this skill when:

1. **Complete Tracking**: Hyperparameters + metrics + artifacts + git + environment all logged
2. **Reproducibility**: Someone else (or future you) can reproduce the result from tracked info
3. **Tool Choice**: Selected appropriate tool (TensorBoard, W&B, MLflow) for use case
4. **Organization**: Experiments have clear naming, tagging, grouping
5. **Comparison**: Can compare experiments side-by-side, analyze hyperparameter impact
6. **Collaboration**: Team can see and discuss results (if team project)
7. **Minimal Overhead**: Tracking adds <5% runtime overhead
8. **Persistence**: Logs survive crashes, terminal closes, reboots
9. **Historical Analysis**: Can go back to any experiment and understand what was done
10. **Best Practices**: Git commits before experiments, seeds set, evaluation bugs impossible

**Final Test**: Can you reproduce the best result from 6 months ago using only the tracked information?

If YES: Excellent tracking! If NO: Gaps remain.


## Advanced Tracking Patterns

### 1. Multi-Run Experiments (Hyperparameter Sweeps)

When running many experiments systematically:

```python
# W&B Sweeps
sweep_config = {
    "method": "random",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [0.001, 0.01, 0.1]},
        "batch_size": {"values": [32, 64, 128]},
        "optimizer": {"values": ["adam", "sgd"]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="cifar10-sweep")

def train():
    run = wandb.init()
    config = wandb.config

    model = create_model(config)
    # ... training code ...
    wandb.log({"val_accuracy": val_acc})

wandb.agent(sweep_id, train, count=10)

# MLflow with Optuna
import optuna
import mlflow

def objective(trial):
    with mlflow.start_run(nested=True):
        lr = trial.suggest_loguniform("learning_rate", 1e-5, 1e-1)
        batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])

        mlflow.log_params({"learning_rate": lr, "batch_size": batch_size})

        val_acc = train_and_evaluate(lr, batch_size)
        mlflow.log_metric("val_accuracy", val_acc)

        return val_acc

with mlflow.start_run():
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)

    mlflow.log_param("best_params", study.best_params)
    mlflow.log_metric("best_accuracy", study.best_value)
```


### 2. Distributed Training Tracking

When training on multiple GPUs or machines:

```python
import torch.distributed as dist

def setup_distributed_tracking(rank, world_size):
    """Setup tracking for distributed training."""

    # Only rank 0 logs to avoid duplicates
    if rank == 0:
        tracker = ExperimentTracker(config)
    else:
        tracker = None

    return tracker

def train_distributed(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    tracker = setup_distributed_tracking(rank, world_size)

    model = DistributedDataParallel(model, device_ids=[rank])

    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, optimizer)

        # Gather metrics from all ranks
        train_loss_tensor = torch.tensor(train_loss).cuda()
        dist.all_reduce(train_loss_tensor, op=dist.ReduceOp.SUM)
        avg_train_loss = train_loss_tensor.item() / world_size

        # Only rank 0 logs
        if rank == 0 and tracker:
            tracker.log_metrics({
                "train/loss": avg_train_loss,
                "epoch": epoch,
            })

    if rank == 0 and tracker:
        tracker.finish()

    dist.destroy_process_group()
```


### 3. Experiment Resumption

Tracking setup for resumable experiments:

```python
class ResumableExperimentTracker(ExperimentTracker):
    """Experiment tracker with resume support."""

    def __init__(self, config, checkpoint_path=None):
        super().__init__(config)

        self.checkpoint_path = checkpoint_path

        if checkpoint_path and os.path.exists(checkpoint_path):
            self.resume_from_checkpoint()

    def resume_from_checkpoint(self):
        """Resume tracking from saved checkpoint."""
        checkpoint = torch.load(self.checkpoint_path)

        self.global_step = checkpoint.get("global_step", 0)
        self.best_metric = checkpoint.get("best_metric", float('-inf'))

        self.logger.info(f"Resumed from checkpoint: step={self.global_step}")

    def save_checkpoint(self, model, optimizer, epoch, metric_value, metric_name="val_accuracy"):
        """Save checkpoint with tracker state."""
        checkpoint = {
            "epoch": epoch,
            "global_step": self.global_step,
            "best_metric": self.best_metric,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            metric_name: metric_value,
            "config": self.config,
        }

        checkpoint_path = self.exp_dir / "checkpoints" / "latest.pt"
        checkpoint_path.parent.mkdir(exist_ok=True)
        torch.save(checkpoint, checkpoint_path)

        # Also save best
        if metric_value > self.best_metric:
            self.best_metric = metric_value
            best_path = self.exp_dir / "checkpoints" / "best.pt"
            torch.save(checkpoint, best_path)

        return checkpoint_path

# Usage
tracker = ResumableExperimentTracker(config, checkpoint_path="checkpoints/latest.pt")

# Training continues from where it left off
for epoch in range(start_epoch, num_epochs):
    # ... training ...
    tracker.save_checkpoint(model, optimizer, epoch, val_acc)
```


### 4. Experiment Comparison and Analysis

Programmatic experiment analysis:

```python
def analyze_experiments(project_name):
    """Analyze all experiments in a project."""

    # W&B
    import wandb
    api = wandb.Api()
    runs = api.runs(project_name)

    # Extract data
    data = []
    for run in runs:
        data.append({
            "name": run.name,
            "learning_rate": run.config.get("learning_rate"),
            "batch_size": run.config.get("batch_size"),
            "val_accuracy": run.summary.get("val_accuracy"),
            "train_time": run.summary.get("_runtime"),
        })

    df = pd.DataFrame(data)

    # Analysis
    print("Top 5 experiments by accuracy:")
    print(df.nlargest(5, "val_accuracy"))

    # Hyperparameter impact
    print("\nAverage accuracy by learning rate:")
    print(df.groupby("learning_rate")["val_accuracy"].mean())

    # Visualization
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Learning rate vs accuracy
    axes[0].scatter(df["learning_rate"], df["val_accuracy"])
    axes[0].set_xlabel("Learning Rate")
    axes[0].set_ylabel("Validation Accuracy")
    axes[0].set_xscale("log")

    # Batch size vs accuracy
    axes[1].scatter(df["batch_size"], df["val_accuracy"])
    axes[1].set_xlabel("Batch Size")
    axes[1].set_ylabel("Validation Accuracy")

    plt.tight_layout()
    plt.savefig("experiment_analysis.png")

    return df

# Run analysis
df = analyze_experiments("team/cifar10-sota")
```


### 5. Data Versioning Integration

Tracking data versions alongside experiments:

```python
import hashlib

def hash_dataset(dataset_path):
    """Compute hash of dataset for versioning."""
    hasher = hashlib.sha256()

    # Hash dataset files
    for file in sorted(Path(dataset_path).rglob("*")):
        if file.is_file():
            with open(file, "rb") as f:
                hasher.update(f.read())

    return hasher.hexdigest()

# Track data version
data_version = hash_dataset("data/cifar10")
config["data_version"] = data_version

tracker = ExperimentTracker(config)

# Or use DVC
"""
# Initialize DVC
dvc init

# Track data
dvc add data/cifar10
git add data/cifar10.dvc

# Log DVC hash in experiment
with open("data/cifar10.dvc") as f:
    dvc_config = yaml.safe_load(f)
    data_hash = dvc_config["outs"][0]["md5"]
    config["data_hash"] = data_hash
"""
```


### 6. Artifact Management Best Practices

Organizing and managing experiment artifacts:

```python
class ArtifactManager:
    """Manages experiment artifacts (models, plots, logs)."""

    def __init__(self, experiment_dir):
        self.exp_dir = Path(experiment_dir)

        # Create subdirectories
        self.checkpoints_dir = self.exp_dir / "checkpoints"
        self.figures_dir = self.exp_dir / "figures"
        self.logs_dir = self.exp_dir / "logs"

        for d in [self.checkpoints_dir, self.figures_dir, self.logs_dir]:
            d.mkdir(parents=True, exist_ok=True)

    def save_checkpoint(self, checkpoint, name):
        """Save checkpoint with automatic cleanup."""
        path = self.checkpoints_dir / f"{name}.pt"
        torch.save(checkpoint, path)

        # Keep only last N checkpoints (except best)
        self._cleanup_checkpoints(keep_n=5)

        return path

    def _cleanup_checkpoints(self, keep_n=5):
        """Keep only recent checkpoints to save space."""
        checkpoints = sorted(
            self.checkpoints_dir.glob("checkpoint_epoch_*.pt"),
            key=lambda p: p.stat().st_mtime,
            reverse=True,
        )

        # Delete old checkpoints (keep best + last N)
        for ckpt in checkpoints[keep_n:]:
            if "best" not in ckpt.name:
                ckpt.unlink()

    def save_figure(self, fig, name, step=None):
        """Save matplotlib figure with metadata."""
        if step is not None:
            filename = f"{name}_step_{step}.png"
        else:
            filename = f"{name}.png"

        path = self.figures_dir / filename
        fig.savefig(path, dpi=150, bbox_inches="tight")

        return path

    def get_artifact_summary(self):
        """Get summary of stored artifacts."""
        summary = {
            "num_checkpoints": len(list(self.checkpoints_dir.glob("*.pt"))),
            "num_figures": len(list(self.figures_dir.glob("*.png"))),
            "total_size_mb": sum(
                f.stat().st_size for f in self.exp_dir.rglob("*") if f.is_file()
            ) / (1024 * 1024),
        }
        return summary

# Usage
artifacts = ArtifactManager(experiment_dir)
artifacts.save_checkpoint(checkpoint, f"checkpoint_epoch_{epoch}")
artifacts.save_figure(fig, "training_curve")
print(artifacts.get_artifact_summary())
```


### 7. Real-Time Monitoring and Alerts

Setup alerts for experiment issues:

```python
# W&B Alerts
import wandb

wandb.init(project="cifar10")

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, optimizer)

    wandb.log({"train/loss": train_loss, "epoch": epoch})

    # Alert on divergence
    if math.isnan(train_loss) or train_loss > 10:
        wandb.alert(
            title="Training Diverged",
            text=f"Loss is {train_loss} at epoch {epoch}",
            level=wandb.AlertLevel.ERROR,
        )
        break

    # Alert on milestone
    if val_acc > 0.90:
        wandb.alert(
            title="90% Accuracy Reached!",
            text=f"Validation accuracy: {val_acc:.2%}",
            level=wandb.AlertLevel.INFO,
        )

# Slack integration
def send_slack_alert(message, webhook_url):
    """Send alert to Slack."""
    import requests
    requests.post(webhook_url, json={"text": message})

# Email alerts
def send_email_alert(subject, body, to_email):
    """Send email alert."""
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = subject
    msg["To"] = to_email
    msg.set_content(body)

    # Send via SMTP
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
```


## Common Integration Patterns

### Pattern 1: Training Script with Complete Tracking

```python
#!/usr/bin/env python3
"""
Complete training script with experiment tracking.
"""

import argparse
import yaml
from pathlib import Path
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from experiment_tracker import ExperimentTracker
from models import create_model
from data import load_dataset

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True, help="Config file")
    parser.add_argument("--resume", type=str, help="Resume from checkpoint")
    parser.add_argument("--name", type=str, help="Experiment name")
    return parser.parse_args()

def main():
    args = parse_args()

    # Load config
    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Initialize tracking
    tracker = ExperimentTracker(
        config=config,
        experiment_name=args.name,
        use_wandb=True,
        use_tensorboard=True,
    )

    # Setup training
    model = create_model(config)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
    criterion = nn.CrossEntropyLoss()

    train_loader, val_loader = load_dataset(config)

    # Resume if checkpoint provided
    start_epoch = 0
    if args.resume:
        checkpoint = torch.load(args.resume)
        model.load_state_dict(checkpoint["model_state_dict"])
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
        start_epoch = checkpoint["epoch"] + 1
        tracker.logger.info(f"Resumed from epoch {start_epoch}")

    # Training loop
    best_val_acc = 0.0
    for epoch in range(start_epoch, config["num_epochs"]):
        # Train
        model.train()
        train_losses = []
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            train_losses.append(loss.item())

            # Log every N batches
            if batch_idx % config.get("log_interval", 100) == 0:
                tracker.log_metrics({
                    "train/loss": loss.item(),
                    "train/lr": optimizer.param_groups[0]['lr'],
                }, step=epoch * len(train_loader) + batch_idx)

        # Validate
        model.eval()
        val_losses = []
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in val_loader:
                output = model(data)
                loss = criterion(output, target)
                val_losses.append(loss.item())

                pred = output.argmax(dim=1)
                correct += (pred == target).sum().item()
                total += target.size(0)

        train_loss = sum(train_losses) / len(train_losses)
        val_loss = sum(val_losses) / len(val_losses)
        val_acc = correct / total

        # Log epoch metrics
        tracker.log_metrics({
            "train/loss_epoch": train_loss,
            "val/loss": val_loss,
            "val/accuracy": val_acc,
            "epoch": epoch,
        })

        # Save checkpoint
        tracker.save_checkpoint(model, optimizer, epoch, val_acc)

        # Update best
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            tracker.logger.info(f"New best accuracy: {val_acc:.4f}")

        # Early stopping
        if epoch > 50 and val_acc < 0.5:
            tracker.logger.warning("Model not improving, stopping early")
            break

    # Log final results
    tracker.log_metrics({"best_val_accuracy": best_val_acc})
    tracker.logger.info(f"Training completed. Best accuracy: {best_val_acc:.4f}")

    # Cleanup
    tracker.finish()

if __name__ == "__main__":
    main()
```

**Usage**:
```bash
# Train new model
python train.py --config configs/resnet18.yaml --name resnet18_baseline

# Resume training
python train.py --config configs/resnet18.yaml --resume experiments/resnet18_baseline/checkpoints/latest.pt
```


## Further Reading

- **Papers**:
  - "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., 2015)
  - "Reproducibility in Machine Learning" (Pineau et al., 2020)
  - "A Step Toward Quantifying Independently Reproducible ML Research" (Dodge et al., 2019)

- **Tool Documentation**:
  - TensorBoard: https://www.tensorflow.org/tensorboard
  - Weights & Biases: https://docs.wandb.ai/
  - MLflow: https://mlflow.org/docs/latest/index.html
  - DVC (Data Version Control): https://dvc.org/doc
  - Hydra (Config Management): https://hydra.cc/docs/intro/

- **Best Practices**:
  - Papers With Code (Reproducibility): https://paperswithcode.com/
  - ML Code Completeness Checklist: https://github.com/paperswithcode/releasing-research-code
  - Experiment Management Guide: https://neptune.ai/blog/experiment-management

- **Books**:
  - "Designing Machine Learning Systems" by Chip Huyen (Chapter on Experiment Tracking)
  - "Machine Learning Engineering" by Andriy Burkov (Chapter on MLOps)


**Remember**: Experiment tracking is insurance. It costs 1% overhead but saves 100% when disaster strikes. Track from day 1, track everything, and your future self will thank you.