459 lines
13 KiB
Markdown
459 lines
13 KiB
Markdown
# Architecture Decisions
|
|
|
|
Documentation of all technical decisions made for Tailscale SSH Sync Agent.
|
|
|
|
## Tool Selection
|
|
|
|
### Selected Tool: sshsync
|
|
|
|
**Justification:**
|
|
|
|
✅ **Advantages:**
|
|
- **Ready-to-use**: Available via `pip install sshsync`
|
|
- **Group management**: Built-in support for organizing hosts into groups
|
|
- **Integration**: Works with existing SSH config (`~/.ssh/config`)
|
|
- **Simple API**: Easy-to-wrap CLI interface
|
|
- **Parallel execution**: Commands run concurrently across hosts
|
|
- **File operations**: Push/pull with recursive support
|
|
- **Timeout handling**: Per-command timeouts for reliability
|
|
- **Active maintenance**: Regular updates and bug fixes
|
|
- **Python-based**: Easy to extend and integrate
|
|
|
|
✅ **Coverage:**
|
|
- All SSH-accessible hosts
|
|
- Works with any SSH server (Linux, macOS, BSD, etc.)
|
|
- Platform-agnostic (runs on any OS with Python)
|
|
|
|
✅ **Cost:**
|
|
- Free and open-source
|
|
- No API keys or subscriptions required
|
|
- No rate limits
|
|
|
|
✅ **Documentation:**
|
|
- Clear command-line interface
|
|
- PyPI documentation available
|
|
- GitHub repository with examples
|
|
|
|
**Alternatives Considered:**
|
|
|
|
❌ **Fabric (Python library)**
|
|
- Pros: Pure Python, very flexible
|
|
- Cons: Requires writing more code, no built-in group management
|
|
- **Rejected because**: sshsync provides ready-made functionality
|
|
|
|
❌ **Ansible**
|
|
- Pros: Industry standard, very powerful
|
|
- Cons: Requires learning YAML playbooks, overkill for simple operations
|
|
- **Rejected because**: Too heavyweight for ad-hoc commands and file transfers
|
|
|
|
❌ **pssh (parallel-ssh)**
|
|
- Pros: Simple parallel SSH
|
|
- Cons: No group management, no file transfer built-in, less actively maintained
|
|
- **Rejected because**: sshsync has better group management and file operations
|
|
|
|
❌ **Custom SSH wrapper**
|
|
- Pros: Full control
|
|
- Cons: Reinventing the wheel, maintaining parallel execution logic
|
|
- **Rejected because**: sshsync already provides what we need
|
|
|
|
**Conclusion:**
|
|
|
|
sshsync is the best tool for this use case because it:
|
|
1. Provides group-based host management out of the box
|
|
2. Handles parallel execution automatically
|
|
3. Integrates with existing SSH configuration
|
|
4. Supports both command execution and file transfers
|
|
5. Requires minimal wrapper code
|
|
|
|
## Integration: Tailscale
|
|
|
|
**Decision**: Integrate with Tailscale for network connectivity
|
|
|
|
**Justification:**
|
|
|
|
✅ **Why Tailscale:**
|
|
- **Zero-config VPN**: No manual firewall/NAT configuration
|
|
- **Secure by default**: WireGuard encryption
|
|
- **Works everywhere**: Coffee shop, home, office, cloud
|
|
- **MagicDNS**: Easy addressing (machine-name.tailnet.ts.net)
|
|
- **Standard SSH**: Works with all SSH tools including sshsync
|
|
- **No overhead**: Uses regular SSH protocol over Tailscale network
|
|
|
|
✅ **Integration approach:**
|
|
- Tailscale provides the network layer
|
|
- Standard SSH works over Tailscale
|
|
- sshsync operates normally using Tailscale hostnames/IPs
|
|
- No Tailscale-specific code needed in core operations
|
|
- Tailscale status checking for diagnostics
|
|
|
|
**Alternatives:**
|
|
|
|
❌ **Direct public internet + port forwarding**
|
|
- Cons: Complex firewall setup, security risks, doesn't work on mobile/restricted networks
|
|
- **Rejected because**: Requires too much configuration and has security concerns
|
|
|
|
❌ **Other VPNs (WireGuard, OpenVPN, ZeroTier)**
|
|
- Cons: More manual configuration, less zero-config
|
|
- **Rejected because**: Tailscale is easier to set up and use
|
|
|
|
**Conclusion:**
|
|
|
|
Tailscale + standard SSH is the optimal combination:
|
|
- Secure connectivity without configuration
|
|
- Works with existing SSH tools
|
|
- No vendor lock-in (can use other VPNs if needed)
|
|
|
|
## Architecture
|
|
|
|
### Structure: Modular Scripts + Utilities
|
|
|
|
**Decision**: Separate concerns into focused modules
|
|
|
|
```
|
|
scripts/
|
|
├── sshsync_wrapper.py # sshsync CLI interface
|
|
├── tailscale_manager.py # Tailscale operations
|
|
├── load_balancer.py # Task distribution logic
|
|
├── workflow_executor.py # Common workflows
|
|
└── utils/
|
|
├── helpers.py # Formatting, parsing
|
|
└── validators/ # Input validation
|
|
```
|
|
|
|
**Justification:**
|
|
|
|
✅ **Modularity:**
|
|
- Each script has single responsibility
|
|
- Easy to test independently
|
|
- Easy to extend without breaking others
|
|
|
|
✅ **Reusability:**
|
|
- Helpers used across all scripts
|
|
- Validators prevent duplicate validation logic
|
|
- Workflows compose lower-level operations
|
|
|
|
✅ **Maintainability:**
|
|
- Clear file organization
|
|
- Easy to locate specific functionality
|
|
- Separation of concerns
|
|
|
|
**Alternatives:**
|
|
|
|
❌ **Monolithic single script**
|
|
- Cons: Hard to test, hard to maintain, becomes too large
|
|
- **Rejected because**: Doesn't scale well
|
|
|
|
❌ **Over-engineered class hierarchy**
|
|
- Cons: Unnecessary complexity for this use case
|
|
- **Rejected because**: Simple functions are sufficient
|
|
|
|
**Conclusion:**
|
|
|
|
Modular functional approach provides good balance of simplicity and maintainability.
|
|
|
|
### Validation Strategy: Multi-Layer
|
|
|
|
**Decision**: Validate at multiple layers
|
|
|
|
**Layers:**
|
|
|
|
1. **Parameter validation** (`parameter_validator.py`)
|
|
- Validates user inputs before any operations
|
|
- Prevents invalid hosts, groups, paths, etc.
|
|
|
|
2. **Host validation** (`host_validator.py`)
|
|
- Validates SSH configuration exists
|
|
- Checks host reachability
|
|
- Validates group membership
|
|
|
|
3. **Connection validation** (`connection_validator.py`)
|
|
- Tests actual SSH connectivity
|
|
- Verifies Tailscale status
|
|
- Checks SSH key authentication
|
|
|
|
**Justification:**
|
|
|
|
✅ **Early failure:**
|
|
- Catch errors before expensive operations
|
|
- Clear error messages at each layer
|
|
|
|
✅ **Comprehensive:**
|
|
- Multiple validation points catch different issues
|
|
- Reduces runtime failures
|
|
|
|
✅ **User-friendly:**
|
|
- Helpful error messages with suggestions
|
|
- Clear indication of what went wrong
|
|
|
|
**Conclusion:**
|
|
|
|
Multi-layer validation provides robust error handling and great user experience.
|
|
|
|
## Load Balancing Strategy
|
|
|
|
### Decision: Simple Composite Score
|
|
|
|
**Formula:**
|
|
```python
|
|
score = (cpu_pct * 0.4) + (mem_pct * 0.3) + (disk_pct * 0.3)
|
|
```
|
|
|
|
**Weights:**
|
|
- CPU: 40% (most important for compute tasks)
|
|
- Memory: 30% (important for data processing)
|
|
- Disk: 30% (important for I/O operations)
|
|
|
|
**Justification:**
|
|
|
|
✅ **Simple and effective:**
|
|
- Easy to understand
|
|
- Fast to calculate
|
|
- Works well for most workloads
|
|
|
|
✅ **Balanced:**
|
|
- Considers multiple resource types
|
|
- No single metric dominates
|
|
|
|
**Alternatives:**
|
|
|
|
❌ **CPU only**
|
|
- Cons: Ignores memory-bound and I/O-bound tasks
|
|
- **Rejected because**: Too narrow
|
|
|
|
❌ **Complex ML-based prediction**
|
|
- Cons: Overkill, slow, requires training data
|
|
- **Rejected because**: Unnecessary complexity
|
|
|
|
❌ **Fixed round-robin**
|
|
- Cons: Doesn't consider actual load
|
|
- **Rejected because**: Can overload already-busy hosts
|
|
|
|
**Conclusion:**
|
|
|
|
Simple weighted score provides good balance without complexity.
|
|
|
|
## Error Handling Philosophy
|
|
|
|
### Decision: Graceful Degradation + Clear Messages
|
|
|
|
**Principles:**
|
|
|
|
1. **Fail early with validation**: Catch errors before operations
|
|
2. **Isolate failures**: One host failure doesn't stop others
|
|
3. **Clear messages**: Tell user exactly what went wrong and how to fix
|
|
4. **Automatic retry**: Retry transient errors (network, timeout)
|
|
5. **Dry-run support**: Preview operations before execution
|
|
|
|
**Implementation:**
|
|
|
|
```python
|
|
# Example error handling pattern
|
|
try:
|
|
validate_host(host)
|
|
validate_ssh_connection(host)
|
|
result = execute_command(host, command)
|
|
except ValidationError as e:
|
|
return {'error': str(e), 'suggestion': 'Fix: ...'}
|
|
except ConnectionError as e:
|
|
return {'error': str(e), 'diagnostics': get_diagnostics(host)}
|
|
```
|
|
|
|
**Justification:**
|
|
|
|
✅ **Better UX:**
|
|
- Users know exactly what's wrong
|
|
- Suggestions help fix issues quickly
|
|
|
|
✅ **Reliability:**
|
|
- Automatic retry handles transient issues
|
|
- Dry-run prevents mistakes
|
|
|
|
✅ **Debugging:**
|
|
- Clear error messages speed up troubleshooting
|
|
- Diagnostics provide actionable information
|
|
|
|
**Conclusion:**
|
|
|
|
Graceful degradation with helpful messages creates better user experience.
|
|
|
|
## Caching Strategy
|
|
|
|
**Decision**: Minimal caching for real-time accuracy
|
|
|
|
**What we cache:**
|
|
- Nothing (v1.0.0)
|
|
|
|
**Why no caching:**
|
|
- Host status changes frequently
|
|
- Load metrics change constantly
|
|
- Operations need real-time data
|
|
- Cache invalidation is complex
|
|
|
|
**Future consideration (v2.0):**
|
|
- Cache Tailscale status (60s TTL)
|
|
- Cache group configuration (5min TTL)
|
|
- Cache SSH config parsing (5min TTL)
|
|
|
|
**Justification:**
|
|
|
|
✅ **Simplicity:**
|
|
- No cache invalidation logic needed
|
|
- No stale data issues
|
|
|
|
✅ **Accuracy:**
|
|
- Always get current state
|
|
- No surprises from cached data
|
|
|
|
**Trade-off:**
|
|
- Slightly slower repeated operations
|
|
- More network calls
|
|
|
|
**Conclusion:**
|
|
|
|
For v1.0.0, simplicity and accuracy outweigh performance concerns. Real-time data is more valuable than speed.
|
|
|
|
## Testing Strategy
|
|
|
|
### Decision: Comprehensive Unit + Integration Tests
|
|
|
|
**Coverage:**
|
|
|
|
- **29 tests total:**
|
|
- 11 integration tests (end-to-end workflows)
|
|
- 11 helper tests (formatting, parsing, calculations)
|
|
- 7 validation tests (input validation, safety checks)
|
|
|
|
**Test Philosophy:**
|
|
|
|
1. **Test real functionality**: Integration tests use actual functions
|
|
2. **Test edge cases**: Validation tests cover error conditions
|
|
3. **Test helpers**: Ensure formatting/parsing works correctly
|
|
4. **Fast execution**: All tests run in < 10 seconds
|
|
5. **No external dependencies**: Tests don't require Tailscale or sshsync to be running
|
|
|
|
**Justification:**
|
|
|
|
✅ **Confidence:**
|
|
- Tests verify code works as expected
|
|
- Catches regressions when modifying code
|
|
|
|
✅ **Documentation:**
|
|
- Tests show how to use functions
|
|
- Examples of expected behavior
|
|
|
|
✅ **Reliability:**
|
|
- Production-ready code from v1.0.0
|
|
|
|
**Conclusion:**
|
|
|
|
Comprehensive testing ensures reliable code from the start.
|
|
|
|
## Performance Considerations
|
|
|
|
### Parallel Execution
|
|
|
|
**Decision**: Leverage sshsync's built-in parallelization
|
|
|
|
- sshsync runs commands concurrently across hosts automatically
|
|
- No need to implement custom threading/multiprocessing
|
|
- Timeout applies per-host independently
|
|
|
|
**Trade-offs:**
|
|
|
|
✅ **Pros:**
|
|
- Simple to use
|
|
- Fast for large host groups
|
|
- No concurrency bugs
|
|
|
|
⚠️ **Cons:**
|
|
- Less control over parallelism level
|
|
- Can overwhelm network with too many concurrent connections
|
|
|
|
**Conclusion:**
|
|
|
|
Built-in parallelization is sufficient for most use cases. Custom control can be added in v2.0 if needed.
|
|
|
|
## Security Considerations
|
|
|
|
### SSH Key Authentication
|
|
|
|
**Decision**: Require SSH keys (no password auth)
|
|
|
|
**Justification:**
|
|
|
|
✅ **Security:**
|
|
- Keys are more secure than passwords
|
|
- Can't be brute-forced
|
|
- Can be revoked per-host
|
|
|
|
✅ **Automation:**
|
|
- Non-interactive (no password prompts)
|
|
- Works in scripts and CI/CD
|
|
|
|
**Implementation:**
|
|
- Validators check SSH key auth works
|
|
- Clear error messages guide users to set up keys
|
|
- Documentation explains SSH key setup
|
|
|
|
### Command Safety
|
|
|
|
**Decision**: Validate dangerous commands
|
|
|
|
**Dangerous patterns blocked:**
|
|
- `rm -rf /` (root deletion)
|
|
- `mkfs.*` (filesystem formatting)
|
|
- `dd.*of=/dev/` (direct disk writes)
|
|
- Fork bombs
|
|
- Direct disk writes
|
|
|
|
**Override**: Use `allow_dangerous=True` to bypass
|
|
|
|
**Justification:**
|
|
|
|
✅ **Safety:**
|
|
- Prevents accidental destructive operations
|
|
- Dry-run provides preview
|
|
|
|
✅ **Flexibility:**
|
|
- Can still run dangerous commands if explicitly allowed
|
|
|
|
**Conclusion:**
|
|
|
|
Safety by default with escape hatch for advanced users.
|
|
|
|
## Decisions Summary
|
|
|
|
| Decision | Choice | Rationale |
|
|
|----------|--------|-----------|
|
|
| **CLI Tool** | sshsync | Best balance of features, ease of use, and maintenance |
|
|
| **Network** | Tailscale | Zero-config secure VPN, works everywhere |
|
|
| **Architecture** | Modular scripts | Clear separation of concerns, maintainable |
|
|
| **Validation** | Multi-layer | Catch errors early with helpful messages |
|
|
| **Load Balancing** | Composite score | Simple, effective, considers multiple resources |
|
|
| **Caching** | None (v1.0) | Simplicity and real-time accuracy |
|
|
| **Testing** | 29 tests | Comprehensive coverage for reliability |
|
|
| **Security** | SSH keys + validation | Secure and automation-friendly |
|
|
|
|
## Trade-offs Accepted
|
|
|
|
1. **No caching** → Slightly slower, but always accurate
|
|
2. **sshsync dependency** → External tool, but saves development time
|
|
3. **SSH key requirement** → Setup needed, but more secure
|
|
4. **Simple load balancing** → Less sophisticated, but fast and easy to understand
|
|
5. **Terminal UI only** → No web dashboard, but simpler to develop and maintain
|
|
|
|
## Future Improvements
|
|
|
|
### v2.0 Considerations
|
|
|
|
1. **Add caching** for frequently-accessed data (Tailscale status, groups)
|
|
2. **Web dashboard** for visualization and monitoring
|
|
3. **Operation history** database for audit trail
|
|
4. **Advanced load balancing** with custom metrics
|
|
5. **Automated SSH key distribution** across hosts
|
|
6. **Integration with config management** tools (Ansible, Terraform)
|
|
7. **Container support** via SSH to Docker containers
|
|
8. **Custom validation plugins** for domain-specific checks
|
|
|
|
All decisions prioritize **simplicity**, **security**, and **maintainability** for v1.0.0.
|