gh-human-frontier-labs-inc-…/DECISIONS.md

# Architecture Decisions

Documentation of all technical decisions made for Tailscale SSH Sync Agent.

## Tool Selection

### Selected Tool: sshsync

**Justification:**

✅ **Advantages:**
- **Ready-to-use**: Available via `pip install sshsync`
- **Group management**: Built-in support for organizing hosts into groups
- **Integration**: Works with existing SSH config (`~/.ssh/config`)
- **Simple API**: Easy-to-wrap CLI interface
- **Parallel execution**: Commands run concurrently across hosts
- **File operations**: Push/pull with recursive support
- **Timeout handling**: Per-command timeouts for reliability
- **Active maintenance**: Regular updates and bug fixes
- **Python-based**: Easy to extend and integrate

✅ **Coverage:**
- All SSH-accessible hosts
- Works with any SSH server (Linux, macOS, BSD, etc.)
- Platform-agnostic (runs on any OS with Python)

✅ **Cost:**
- Free and open-source
- No API keys or subscriptions required
- No rate limits

✅ **Documentation:**
- Clear command-line interface
- PyPI documentation available
- GitHub repository with examples

**Alternatives Considered:**

❌ **Fabric (Python library)**
- Pros: Pure Python, very flexible
- Cons: Requires writing more code, no built-in group management
- **Rejected because**: sshsync provides ready-made functionality

❌ **Ansible**
- Pros: Industry standard, very powerful
- Cons: Requires learning YAML playbooks, overkill for simple operations
- **Rejected because**: Too heavyweight for ad-hoc commands and file transfers

❌ **pssh (parallel-ssh)**
- Pros: Simple parallel SSH
- Cons: No group management, no file transfer built-in, less actively maintained
- **Rejected because**: sshsync has better group management and file operations

❌ **Custom SSH wrapper**
- Pros: Full control
- Cons: Reinventing the wheel, maintaining parallel execution logic
- **Rejected because**: sshsync already provides what we need

**Conclusion:**

sshsync is the best tool for this use case because it:
1. Provides group-based host management out of the box
2. Handles parallel execution automatically
3. Integrates with existing SSH configuration
4. Supports both command execution and file transfers
5. Requires minimal wrapper code

## Integration: Tailscale

**Decision**: Integrate with Tailscale for network connectivity

**Justification:**

✅ **Why Tailscale:**
- **Zero-config VPN**: No manual firewall/NAT configuration
- **Secure by default**: WireGuard encryption
- **Works everywhere**: Coffee shop, home, office, cloud
- **MagicDNS**: Easy addressing (machine-name.tailnet.ts.net)
- **Standard SSH**: Works with all SSH tools including sshsync
- **No overhead**: Uses regular SSH protocol over Tailscale network

✅ **Integration approach:**
- Tailscale provides the network layer
- Standard SSH works over Tailscale
- sshsync operates normally using Tailscale hostnames/IPs
- No Tailscale-specific code needed in core operations
- Tailscale status checking for diagnostics

**Alternatives:**

❌ **Direct public internet + port forwarding**
- Cons: Complex firewall setup, security risks, doesn't work on mobile/restricted networks
- **Rejected because**: Requires too much configuration and has security concerns

❌ **Other VPNs (WireGuard, OpenVPN, ZeroTier)**
- Cons: More manual configuration, less zero-config
- **Rejected because**: Tailscale is easier to set up and use

**Conclusion:**

Tailscale + standard SSH is the optimal combination:
- Secure connectivity without configuration
- Works with existing SSH tools
- No vendor lock-in (can use other VPNs if needed)

## Architecture

### Structure: Modular Scripts + Utilities

**Decision**: Separate concerns into focused modules

```
scripts/
├── sshsync_wrapper.py         # sshsync CLI interface
├── tailscale_manager.py       # Tailscale operations
├── load_balancer.py           # Task distribution logic
├── workflow_executor.py       # Common workflows
└── utils/
    ├── helpers.py             # Formatting, parsing
    └── validators/            # Input validation
```

**Justification:**

✅ **Modularity:**
- Each script has single responsibility
- Easy to test independently
- Easy to extend without breaking others

✅ **Reusability:**
- Helpers used across all scripts
- Validators prevent duplicate validation logic
- Workflows compose lower-level operations

✅ **Maintainability:**
- Clear file organization
- Easy to locate specific functionality
- Separation of concerns

**Alternatives:**

❌ **Monolithic single script**
- Cons: Hard to test, hard to maintain, becomes too large
- **Rejected because**: Doesn't scale well

❌ **Over-engineered class hierarchy**
- Cons: Unnecessary complexity for this use case
- **Rejected because**: Simple functions are sufficient

**Conclusion:**

Modular functional approach provides good balance of simplicity and maintainability.

### Validation Strategy: Multi-Layer

**Decision**: Validate at multiple layers

**Layers:**

1. **Parameter validation** (`parameter_validator.py`)
   - Validates user inputs before any operations
   - Prevents invalid hosts, groups, paths, etc.

2. **Host validation** (`host_validator.py`)
   - Validates SSH configuration exists
   - Checks host reachability
   - Validates group membership

3. **Connection validation** (`connection_validator.py`)
   - Tests actual SSH connectivity
   - Verifies Tailscale status
   - Checks SSH key authentication

**Justification:**

✅ **Early failure:**
- Catch errors before expensive operations
- Clear error messages at each layer

✅ **Comprehensive:**
- Multiple validation points catch different issues
- Reduces runtime failures

✅ **User-friendly:**
- Helpful error messages with suggestions
- Clear indication of what went wrong

**Conclusion:**

Multi-layer validation provides robust error handling and great user experience.

## Load Balancing Strategy

### Decision: Simple Composite Score

**Formula:**
```python
score = (cpu_pct * 0.4) + (mem_pct * 0.3) + (disk_pct * 0.3)
```

**Weights:**
- CPU: 40% (most important for compute tasks)
- Memory: 30% (important for data processing)
- Disk: 30% (important for I/O operations)

**Justification:**

✅ **Simple and effective:**
- Easy to understand
- Fast to calculate
- Works well for most workloads

✅ **Balanced:**
- Considers multiple resource types
- No single metric dominates

**Alternatives:**

❌ **CPU only**
- Cons: Ignores memory-bound and I/O-bound tasks
- **Rejected because**: Too narrow

❌ **Complex ML-based prediction**
- Cons: Overkill, slow, requires training data
- **Rejected because**: Unnecessary complexity

❌ **Fixed round-robin**
- Cons: Doesn't consider actual load
- **Rejected because**: Can overload already-busy hosts

**Conclusion:**

Simple weighted score provides good balance without complexity.

## Error Handling Philosophy

### Decision: Graceful Degradation + Clear Messages

**Principles:**

1. **Fail early with validation**: Catch errors before operations
2. **Isolate failures**: One host failure doesn't stop others
3. **Clear messages**: Tell user exactly what went wrong and how to fix
4. **Automatic retry**: Retry transient errors (network, timeout)
5. **Dry-run support**: Preview operations before execution

**Implementation:**

```python
# Example error handling pattern
try:
    validate_host(host)
    validate_ssh_connection(host)
    result = execute_command(host, command)
except ValidationError as e:
    return {'error': str(e), 'suggestion': 'Fix: ...'}
except ConnectionError as e:
    return {'error': str(e), 'diagnostics': get_diagnostics(host)}
```

**Justification:**

✅ **Better UX:**
- Users know exactly what's wrong
- Suggestions help fix issues quickly

✅ **Reliability:**
- Automatic retry handles transient issues
- Dry-run prevents mistakes

✅ **Debugging:**
- Clear error messages speed up troubleshooting
- Diagnostics provide actionable information

**Conclusion:**

Graceful degradation with helpful messages creates better user experience.

## Caching Strategy

**Decision**: Minimal caching for real-time accuracy

**What we cache:**
- Nothing (v1.0.0)

**Why no caching:**
- Host status changes frequently
- Load metrics change constantly
- Operations need real-time data
- Cache invalidation is complex

**Future consideration (v2.0):**
- Cache Tailscale status (60s TTL)
- Cache group configuration (5min TTL)
- Cache SSH config parsing (5min TTL)

**Justification:**

✅ **Simplicity:**
- No cache invalidation logic needed
- No stale data issues

✅ **Accuracy:**
- Always get current state
- No surprises from cached data

**Trade-off:**
- Slightly slower repeated operations
- More network calls

**Conclusion:**

For v1.0.0, simplicity and accuracy outweigh performance concerns. Real-time data is more valuable than speed.

## Testing Strategy

### Decision: Comprehensive Unit + Integration Tests

**Coverage:**

- **29 tests total:**
  - 11 integration tests (end-to-end workflows)
  - 11 helper tests (formatting, parsing, calculations)
  - 7 validation tests (input validation, safety checks)

**Test Philosophy:**

1. **Test real functionality**: Integration tests use actual functions
2. **Test edge cases**: Validation tests cover error conditions
3. **Test helpers**: Ensure formatting/parsing works correctly
4. **Fast execution**: All tests run in < 10 seconds
5. **No external dependencies**: Tests don't require Tailscale or sshsync to be running

**Justification:**

✅ **Confidence:**
- Tests verify code works as expected
- Catches regressions when modifying code

✅ **Documentation:**
- Tests show how to use functions
- Examples of expected behavior

✅ **Reliability:**
- Production-ready code from v1.0.0

**Conclusion:**

Comprehensive testing ensures reliable code from the start.

## Performance Considerations

### Parallel Execution

**Decision**: Leverage sshsync's built-in parallelization

- sshsync runs commands concurrently across hosts automatically
- No need to implement custom threading/multiprocessing
- Timeout applies per-host independently

**Trade-offs:**

✅ **Pros:**
- Simple to use
- Fast for large host groups
- No concurrency bugs

⚠️ **Cons:**
- Less control over parallelism level
- Can overwhelm network with too many concurrent connections

**Conclusion:**

Built-in parallelization is sufficient for most use cases. Custom control can be added in v2.0 if needed.

## Security Considerations

### SSH Key Authentication

**Decision**: Require SSH keys (no password auth)

**Justification:**

✅ **Security:**
- Keys are more secure than passwords
- Can't be brute-forced
- Can be revoked per-host

✅ **Automation:**
- Non-interactive (no password prompts)
- Works in scripts and CI/CD

**Implementation:**
- Validators check SSH key auth works
- Clear error messages guide users to set up keys
- Documentation explains SSH key setup

### Command Safety

**Decision**: Validate dangerous commands

**Dangerous patterns blocked:**
- `rm -rf /` (root deletion)
- `mkfs.*` (filesystem formatting)
- `dd.*of=/dev/` (direct disk writes)
- Fork bombs
- Direct disk writes

**Override**: Use `allow_dangerous=True` to bypass

**Justification:**

✅ **Safety:**
- Prevents accidental destructive operations
- Dry-run provides preview

✅ **Flexibility:**
- Can still run dangerous commands if explicitly allowed

**Conclusion:**

Safety by default with escape hatch for advanced users.

## Decisions Summary

| Decision | Choice | Rationale |
|----------|--------|-----------|
| **CLI Tool** | sshsync | Best balance of features, ease of use, and maintenance |
| **Network** | Tailscale | Zero-config secure VPN, works everywhere |
| **Architecture** | Modular scripts | Clear separation of concerns, maintainable |
| **Validation** | Multi-layer | Catch errors early with helpful messages |
| **Load Balancing** | Composite score | Simple, effective, considers multiple resources |
| **Caching** | None (v1.0) | Simplicity and real-time accuracy |
| **Testing** | 29 tests | Comprehensive coverage for reliability |
| **Security** | SSH keys + validation | Secure and automation-friendly |

## Trade-offs Accepted

1. **No caching** → Slightly slower, but always accurate
2. **sshsync dependency** → External tool, but saves development time
3. **SSH key requirement** → Setup needed, but more secure
4. **Simple load balancing** → Less sophisticated, but fast and easy to understand
5. **Terminal UI only** → No web dashboard, but simpler to develop and maintain

## Future Improvements

### v2.0 Considerations

1. **Add caching** for frequently-accessed data (Tailscale status, groups)
2. **Web dashboard** for visualization and monitoring
3. **Operation history** database for audit trail
4. **Advanced load balancing** with custom metrics
5. **Automated SSH key distribution** across hosts
6. **Integration with config management** tools (Ansible, Terraform)
7. **Container support** via SSH to Docker containers
8. **Custom validation plugins** for domain-specific checks

All decisions prioritize **simplicity**, **security**, and **maintainability** for v1.0.0.