Zhongwei Li
2025-11-29 18:00:21 +08:00
# NetBox Best Practices for Virgo-Core
**NetBox Version:** 4.3.0
**Audience:** Infrastructure automation engineers
Comprehensive best practices for using NetBox as the source of truth for the Matrix cluster infrastructure, including data organization, security, performance, and integration patterns.
---
## Table of Contents
- [Data Organization](#data-organization)
- [Naming Conventions](#naming-conventions)
- [IP Address Management](#ip-address-management)
- [Device Management](#device-management)
- [Virtualization](#virtualization)
- [Tagging Strategy](#tagging-strategy)
- [Security](#security)
- [Performance](#performance)
- [API Integration](#api-integration)
- [Automation Patterns](#automation-patterns)
- [Troubleshooting](#troubleshooting)
---
## Data Organization
### Hierarchical Structure
Follow this order when setting up infrastructure in NetBox:
```text
1. Sites → Create physical locations first
2. Prefixes → Define IP networks (IPAM)
3. VLANs → Network segmentation
4. Device Types → Hardware models
5. Device Roles → Purpose categories
6. Clusters → Virtualization clusters
7. Devices → Physical hardware
8. Interfaces → Network interfaces
9. IP Addresses → Assign IPs to interfaces
10. VMs → Virtual machines
```
**Why this order?**
- Parent objects must exist before children
- Avoids circular dependencies
- Enables atomic operations
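The dependency rules above can also be captured in code. A hypothetical sketch (object-type names and the `DEPENDS_ON` table are illustrative, not NetBox API objects) that derives a safe creation order from parent-child dependencies:

```python
# Parent objects each NetBox object type depends on (illustrative, not exhaustive)
DEPENDS_ON = {
    'sites': [],
    'prefixes': ['sites'],
    'vlans': ['sites'],
    'device-types': [],
    'device-roles': [],
    'clusters': ['sites'],
    'devices': ['sites', 'device-types', 'device-roles'],
    'interfaces': ['devices'],
    'ip-addresses': ['prefixes', 'interfaces'],
    'virtual-machines': ['clusters'],
}

def creation_order(deps: dict) -> list:
    """Topologically sort object types so parents are always created first."""
    order, seen = [], set()

    def visit(obj_type):
        if obj_type in seen:
            return
        seen.add(obj_type)
        for parent in deps[obj_type]:
            visit(parent)
        order.append(obj_type)

    for obj_type in deps:
        visit(obj_type)
    return order
```

Feeding a bootstrap script from `creation_order(DEPENDS_ON)` guarantees the numbered sequence above is never violated as the model grows.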
### Site Organization
**✅ Good:**
```python
# One site per physical location
site = nb.dcim.sites.create(
    name="Matrix Cluster",
    slug="matrix",
    description="3-node Proxmox VE cluster at home lab",
    tags=[{"name": "production"}, {"name": "homelab"}]
)
```
**❌ Bad:**
```python
# Don't create separate sites for logical groupings
site_proxmox = nb.dcim.sites.create(name="Proxmox Nodes", ...)
site_vms = nb.dcim.sites.create(name="Virtual Machines", ...)
```
Use **device roles** and **tags** for logical grouping, not separate sites.
### Consistent Data Entry
**Required fields:** Always populate these
```python
device = nb.dcim.devices.create(
    name="foxtrot",                    # ✅ Required
    device_type=device_type.id,        # ✅ Required
    role=role.id,                      # ✅ Required (named `device_role` before NetBox 4.x)
    site=site.id,                      # ✅ Required
    status="active",                   # ✅ Required
    description="AMD Ryzen 9 9955HX",  # ✅ Recommended
    tags=[{"name": "proxmox-node"}]    # ✅ Recommended
)
```
**Optional but recommended:**
- `description` - Hardware specs, purpose
- `tags` - For filtering and automation
- `comments` - Additional notes
- `custom_fields` - Serial numbers, purchase dates
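Required-field checks can also be enforced client-side before any API call is made. A hypothetical `device_payload` helper (not part of pynetbox) that rejects incomplete device definitions:

```python
# Required device fields per the example above (`role` was `device_role` pre-4.x)
REQUIRED_DEVICE_FIELDS = {'name', 'device_type', 'role', 'site', 'status'}

def device_payload(**fields) -> dict:
    """Build a device creation payload, failing fast on missing required fields."""
    missing = REQUIRED_DEVICE_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    return fields
```

Pass the result through as `nb.dcim.devices.create(**device_payload(...))` so a typo'd or forgotten field fails locally instead of as an HTTP 400.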
---
## Naming Conventions
### Device Names
**✅ Use hostname only (no domain):**
```python
device = nb.dcim.devices.create(name="foxtrot", ...) # ✅ Good
device = nb.dcim.devices.create(name="foxtrot.spaceships.work", ...) # ❌ Bad
```
**Rationale:** The domain belongs in the associated IP address's `dns_name` field, not in the device name.
### Interface Names
**✅ Match actual OS interface names:**
```python
# Linux
interface = nb.dcim.interfaces.create(name="enp1s0f0", ...) # ✅ Good
# Not generic names
interface = nb.dcim.interfaces.create(name="eth0", ...) # ❌ Bad (unless actually eth0)
```
**Why?** Enables automation that references interfaces by name.
### DNS Naming Convention
Follow the Matrix cluster pattern: **`<service>-<number>-<purpose>.<domain>`**
```python
# ✅ Good examples
dns_name="docker-01-nexus.spaceships.work"
dns_name="k8s-01-master.spaceships.work"
dns_name="proxmox-foxtrot-mgmt.spaceships.work"
# ❌ Bad examples
dns_name="server1.spaceships.work" # Not descriptive
dns_name="nexus.spaceships.work" # Missing number
dns_name="DOCKER-01.spaceships.work" # Uppercase not allowed
```
See [../workflows/naming-conventions.md](../workflows/naming-conventions.md) for complete rules.
### Slugs
**✅ Lowercase with hyphens:**
```python
site = nb.dcim.sites.create(slug="matrix", ...) # ✅ Good
site = nb.dcim.sites.create(slug="Matrix_Cluster", ...) # ❌ Bad
```
**Pattern:** `^[a-z0-9-]+$`
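A small helper can normalize arbitrary names into valid slugs before creation. A sketch (the function name is illustrative):

```python
import re

def slugify(name: str) -> str:
    """Normalize a name to match NetBox's slug pattern ^[a-z0-9-]+$."""
    slug = name.strip().lower()
    slug = re.sub(r'[_\s]+', '-', slug)     # underscores/whitespace -> hyphens
    slug = re.sub(r'[^a-z0-9-]', '', slug)  # drop anything else
    return re.sub(r'-{2,}', '-', slug).strip('-')
```

For example, `slugify('Matrix_Cluster')` yields `'matrix-cluster'`.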
---
## IP Address Management
### Plan IP Hierarchy
**Matrix cluster example:**
```text
192.168.0.0/16 (Home network supernet)
├── 192.168.3.0/24 (Management)
│ ├── 192.168.3.1 (Gateway)
│ ├── 192.168.3.5-7 (Proxmox nodes)
│ ├── 192.168.3.10+ (VMs)
│ └── 192.168.3.200+ (Reserved for future)
├── 192.168.5.0/24 (CEPH Public, MTU 9000)
├── 192.168.7.0/24 (CEPH Private, MTU 9000)
└── 192.168.8.0/24 (Corosync, VLAN 9)
```
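Before entering these prefixes, it can be worth sanity-checking the plan with the standard library. A sketch using `ipaddress` (the `check_hierarchy` helper is illustrative; the prefix values come from the tree above):

```python
import ipaddress

SUPERNET = '192.168.0.0/16'
CHILDREN = [
    '192.168.3.0/24',  # Management
    '192.168.5.0/24',  # CEPH Public
    '192.168.7.0/24',  # CEPH Private
    '192.168.8.0/24',  # Corosync
]

def check_hierarchy(supernet: str, children: list) -> bool:
    """Verify every child prefix nests inside the supernet and none overlap."""
    parent = ipaddress.ip_network(supernet)
    nets = [ipaddress.ip_network(p) for p in children]
    for net in nets:
        if not net.subnet_of(parent):
            raise ValueError(f"{net} is not inside {parent}")
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                raise ValueError(f"{a} overlaps {b}")
    return True
```

Running this against the planned tree before touching the API catches fat-fingered masks and accidental overlaps early.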
### Use Prefix Roles
Create roles for clarity:
```python
# Create roles
role_mgmt = nb.ipam.roles.create(name='Management', slug='management')
role_storage = nb.ipam.roles.create(name='Storage', slug='storage')
role_cluster = nb.ipam.roles.create(name='Cluster', slug='cluster')
# Apply to prefixes
prefix = nb.ipam.prefixes.create(
    prefix='192.168.3.0/24',
    role=role_mgmt.id,
    description='Management network for Matrix cluster'
)
```
### Reserve Important IPs
**✅ Explicitly reserve gateway, broadcast, network addresses:**
```python
# Gateway
gateway = nb.ipam.ip_addresses.create(
    address='192.168.3.1/24',
    status='active',
    role='anycast',
    description='Management network gateway'
)
# DNS servers
dns1 = nb.ipam.ip_addresses.create(
    address='192.168.3.2/24',
    status='reserved',
    description='Primary DNS server'
)
```
### Use Prefixes as IP Pools
**✅ Enable automatic IP assignment:**
```python
prefix = nb.ipam.prefixes.create(
    prefix='192.168.3.0/24',
    is_pool=True,  # ✅ Allow automatic IP assignment
    ...
)
# Get next available IP
ip = prefix.available_ips.create(dns_name='docker-02.spaceships.work')
```
**❌ Don't manually track available IPs** - let NetBox do it.
### IP Status Values
Use appropriate status:
| Status | Use Case |
|--------|----------|
| `active` | Currently in use |
| `reserved` | Reserved for specific purpose |
| `deprecated` | Planned for decommission |
| `dhcp` | Managed by DHCP server |
```python
# Production VM
ip = nb.ipam.ip_addresses.create(address='192.168.3.10/24', status='active')
# Future expansion
ip = nb.ipam.ip_addresses.create(address='192.168.3.50/24', status='reserved')
```
### VRF for Isolation
**Use VRFs for true isolation:**
```python
# Management VRF (enforce unique IPs)
vrf_mgmt = nb.ipam.vrfs.create(
    name='management',
    enforce_unique=True,
    description='Management network VRF'
)
# Lab VRF (allow overlapping IPs)
vrf_lab = nb.ipam.vrfs.create(
    name='lab',
    enforce_unique=False,
    description='Lab/testing VRF'
)
```
**When to use VRFs:**
- Multiple environments (prod, dev, lab)
- Overlapping IP ranges
- Security isolation
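The `enforce_unique` semantics can be illustrated offline. A hypothetical checker that flags duplicate addresses only inside VRFs that enforce uniqueness:

```python
from collections import Counter

def duplicate_ips(assignments, enforce_unique) -> list:
    """Return (vrf, ip) pairs assigned more than once in a unique-enforcing VRF.

    assignments: list of (vrf_name, ip_address) tuples
    enforce_unique: dict mapping vrf_name -> bool
    """
    dupes = []
    for (vrf, ip), count in Counter(assignments).items():
        if count > 1 and enforce_unique.get(vrf, True):
            dupes.append((vrf, ip))
    return dupes
```

A duplicate `192.168.3.10` in the `management` VRF would be flagged, while the same overlap inside `lab` (where `enforce_unique=False`) passes silently — which is exactly the behavior NetBox enforces server-side.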
---
## Device Management
### Create Device Types First
**✅ Always create device type before devices:**
```python
# 1. Create manufacturer
manufacturer = nb.dcim.manufacturers.get(slug='minisforum')
if not manufacturer:
    manufacturer = nb.dcim.manufacturers.create(
        name='MINISFORUM',
        slug='minisforum'
    )
# 2. Create device type
device_type = nb.dcim.device_types.create(
    manufacturer=manufacturer.id,
    model='MS-A2',
    slug='ms-a2',
    u_height=0,  # Not rack mounted
    is_full_depth=False
)
# 3. Create device
device = nb.dcim.devices.create(
    name='foxtrot',
    device_type=device_type.id,
    ...
)
```
### Use Device Roles Consistently
**Create specific roles:**
```python
roles = [
    ('Proxmox Node', 'proxmox-node', '2196f3'),  # Blue
    ('Docker Host', 'docker-host', '4caf50'),    # Green
    ('K8s Master', 'k8s-master', 'ff9800'),      # Orange
    ('K8s Worker', 'k8s-worker', 'ffc107'),      # Amber
    ('Storage', 'storage', '9c27b0'),            # Purple
]
for name, slug, color in roles:
    nb.dcim.device_roles.create(
        name=name,
        slug=slug,
        color=color,
        vm_role=True  # If role applies to VMs too
    )
```
**✅ Consistent naming helps automation:**
```python
# Get all Proxmox nodes
proxmox_nodes = nb.dcim.devices.filter(role='proxmox-node')
# Get all Kubernetes workers
k8s_workers = nb.virtualization.virtual_machines.filter(role='k8s-worker')
```
### Always Set Primary IP
**✅ Set primary IP after creating device and IPs:**
```python
# Create device
device = nb.dcim.devices.create(name='foxtrot', ...)
# Create interface
iface = nb.dcim.interfaces.create(device=device.id, name='enp2s0', ...)
# Create IP
ip = nb.ipam.ip_addresses.create(
    address='192.168.3.5/24',
    assigned_object_type='dcim.interface',
    assigned_object_id=iface.id
)
# ✅ Set as primary (critical for automation!)
device.primary_ip4 = ip.id
device.save()
```
**Why?** Primary IP is used by:
- Ansible dynamic inventory
- Monitoring tools
- DNS automation
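The dependence on primary IPs can be sketched as a toy inventory builder (the input dicts mirror the shape of NetBox API results; the helper itself is illustrative):

```python
def build_inventory(devices) -> dict:
    """Map device name -> ansible_host from each device's primary IPv4.

    Devices without a primary IP are skipped -- which is exactly why
    forgetting to set primary_ip4 makes a host invisible to automation.
    """
    hosts = {}
    for dev in devices:
        primary = dev.get('primary_ip4')
        if not primary:
            continue
        hosts[dev['name']] = {'ansible_host': primary['address'].split('/')[0]}
    return hosts
```

A device with `primary_ip4` unset simply never appears in the generated inventory, with no error raised — the same silent failure mode you see with the real Ansible NetBox plugin.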
### Document Interfaces
**✅ Include descriptions:**
```python
# Management
mgmt = nb.dcim.interfaces.create(
    device=device.id,
    name='enp2s0',
    type='2.5gbase-t',
    mtu=1500,
    description='Management interface (vmbr0)',
    tags=[{'name': 'management'}]
)
# CEPH public
ceph_pub = nb.dcim.interfaces.create(
    device=device.id,
    name='enp1s0f0',
    type='10gbase-x-sfpp',
    mtu=9000,
    description='CEPH public network (vmbr1)',
    tags=[{'name': 'ceph-public'}, {'name': 'jumbo-frames'}]
)
```
---
## Virtualization
### Create Cluster First
**✅ Create cluster before VMs:**
```python
# 1. Get/create cluster type
cluster_type = nb.virtualization.cluster_types.get(slug='proxmox')
if not cluster_type:
    cluster_type = nb.virtualization.cluster_types.create(
        name='Proxmox VE',
        slug='proxmox'
    )
# 2. Create cluster
cluster = nb.virtualization.clusters.create(
    name='Matrix',
    type=cluster_type.id,
    site=site.id,
    description='3-node Proxmox VE 9.x cluster'
)
# 3. Create VMs in cluster
vm = nb.virtualization.virtual_machines.create(
    name='docker-01',
    cluster=cluster.id,
    ...
)
```
### Standardize VM Sizing
**✅ Use consistent resource allocations:**
| Role | vCPUs | Memory (MB) | Disk (GB) |
|------|-------|-------------|-----------|
| Small (dev) | 2 | 2048 | 20 |
| Medium (app) | 4 | 8192 | 100 |
| Large (database) | 8 | 16384 | 200 |
| XL (compute) | 16 | 32768 | 500 |
```python
VM_SIZES = {
    'small':  {'vcpus': 2,  'memory': 2048,  'disk': 20},
    'medium': {'vcpus': 4,  'memory': 8192,  'disk': 100},
    'large':  {'vcpus': 8,  'memory': 16384, 'disk': 200},
    'xl':     {'vcpus': 16, 'memory': 32768, 'disk': 500},
}
# Create VM with standard size
vm = nb.virtualization.virtual_machines.create(
    name='docker-01',
    cluster=cluster.id,
    **VM_SIZES['medium']
)
```
### VM Network Configuration
**✅ Complete network setup:**
```python
# 1. Create VM
vm = nb.virtualization.virtual_machines.create(...)
# 2. Create interface
vm_iface = nb.virtualization.interfaces.create(
    virtual_machine=vm.id,
    name='eth0',
    type='virtual',
    enabled=True,
    mtu=1500
)
# 3. Assign IP from pool
prefix = nb.ipam.prefixes.get(prefix='192.168.3.0/24')
vm_ip = prefix.available_ips.create(
    dns_name='docker-01-nexus.spaceships.work',
    assigned_object_type='virtualization.vminterface',
    assigned_object_id=vm_iface.id,
    tags=[{'name': 'production-dns'}]  # ✅ Triggers PowerDNS sync
)
# 4. Set as primary IP
vm.primary_ip4 = vm_ip.id
vm.save()
```
---
## Tagging Strategy
### Tag Categories
Organize tags by purpose:
**Infrastructure Type:**
- `proxmox-node`, `ceph-node`, `docker-host`, `k8s-master`, `k8s-worker`
**Environment:**
- `production`, `staging`, `development`, `lab`
**DNS Automation:**
- `production-dns`, `lab-dns` (triggers PowerDNS sync)
**Management:**
- `terraform`, `ansible`, `manual`
**Networking:**
- `management`, `ceph-public`, `ceph-private`, `jumbo-frames`
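Tag creation is easy to make both idempotent and convention-enforcing. A sketch (the helper takes the pynetbox client as a parameter; the default color is an arbitrary choice, not a project standard):

```python
import re

TAG_PATTERN = re.compile(r'^[a-z0-9-]+$')

def ensure_tag(nb, name: str, color: str = '9e9e9e'):
    """Return the existing tag or create it, rejecting non-conforming names."""
    if not TAG_PATTERN.match(name):
        raise ValueError(f"Tag {name!r} violates naming convention (lowercase, hyphens)")
    tag = nb.extras.tags.get(slug=name)
    return tag or nb.extras.tags.create(name=name, slug=name, color=color)
```

Routing every tag through `ensure_tag` means a `Proxmox Node` or `production_dns` typo fails loudly instead of quietly polluting the tag namespace.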
### Tag Naming Convention
**✅ Lowercase with hyphens:**
```python
tags = [
    {'name': 'proxmox-node'},    # ✅ Good
    {'name': 'production-dns'},  # ✅ Good
    {'name': 'Proxmox Node'},    # ❌ Bad (spaces, capitals)
    {'name': 'production_dns'},  # ❌ Bad (underscores)
]
```
### Apply Tags Consistently
**✅ Tag at multiple levels:**
```python
# Tag device
device = nb.dcim.devices.create(
    name='foxtrot',
    tags=[{'name': 'proxmox-node'}, {'name': 'ceph-node'}, {'name': 'production'}]
)
# Tag interface
iface = nb.dcim.interfaces.create(
    device=device.id,
    name='enp1s0f0',
    tags=[{'name': 'ceph-public'}, {'name': 'jumbo-frames'}]
)
# Tag IP
ip = nb.ipam.ip_addresses.create(
    address='192.168.3.5/24',
    tags=[{'name': 'production-dns'}, {'name': 'terraform'}]
)
```
**Why?** Enables granular filtering:
```bash
# Get all CEPH nodes
ansible-playbook -i netbox-inventory.yml setup-ceph.yml --limit tag_ceph_node
```
```python
# Get all production DNS-enabled IPs
ips = nb.ipam.ip_addresses.filter(tag='production-dns')
```
---
## Security
### API Token Management
**✅ Store tokens in Infisical (Virgo-Core standard):**
```python
from infisical import InfisicalClient

def get_netbox_token() -> str:
    """Get NetBox API token from Infisical."""
    client = InfisicalClient()
    secret = client.get_secret(
        secret_name="NETBOX_API_TOKEN",
        project_id="7b832220-24c0-45bc-a5f1-ce9794a31259",
        environment="prod",
        path="/matrix"
    )
    return secret.secret_value

# Use token
nb = pynetbox.api('https://netbox.spaceships.work', token=get_netbox_token())
```
**❌ Never hardcode tokens:**
```python
# ❌ NEVER DO THIS
token = "a1b2c3d4e5f6..."
nb = pynetbox.api(url, token=token)
```
### Use Minimal Permissions
Create tokens with appropriate scopes:
| Use Case | Permissions |
|----------|-------------|
| Read-only queries | Read only |
| Terraform automation | Read + Write (DCIM, IPAM, Virtualization) |
| Full automation | Read + Write (all) |
| Emergency admin | Full access |
**✅ Create separate tokens for different purposes:**
```text
NETBOX_API_TOKEN_READONLY → Read-only queries
NETBOX_API_TOKEN_TERRAFORM → Terraform automation
NETBOX_API_TOKEN_ANSIBLE → Ansible dynamic inventory
```
### HTTPS Only
**✅ Always use HTTPS in production:**
```python
# ✅ Production
nb = pynetbox.api('https://netbox.spaceships.work', token=token)
# ❌ Never HTTP in production
nb = pynetbox.api('http://netbox.spaceships.work', token=token)
```
**For self-signed certs (dev/lab only):**
```python
# ⚠️ Dev/testing only -- pynetbox.api() has no ssl_verify argument;
# disable verification via a custom requests session instead
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
session = requests.Session()
session.verify = False  # Only for self-signed certs in lab
nb = pynetbox.api('https://netbox.local', token=token)
nb.http_session = session
```
### Rotate Tokens Regularly
**Best practice:** Rotate every 90 days
```bash
# 1. Create new token in NetBox UI
# 2. Update Infisical secret
infisical secrets set NETBOX_API_TOKEN="new-token-here"
# 3. Test new token
./tools/netbox_api_client.py sites list
# 4. Delete old token in NetBox UI
```
### Audit API Usage
**✅ Log API calls in production:**
```python
import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    filename='/var/log/netbox-api.log'
)
logger = logging.getLogger(__name__)

def audit_api_call(action: str, resource: str, details: dict):
    """Log API calls for security audit."""
    logger.info(f"API Call: {action} {resource} - User: {os.getenv('USER')} - {details}")

# Usage
ip = nb.ipam.ip_addresses.create(address='192.168.1.1/24')
audit_api_call('CREATE', 'ip-address', {'address': '192.168.1.1/24'})
```
---
## Performance
### Use Filtering Server-Side
**✅ Filter on server:**
```python
# ✅ Efficient: Server filters results
devices = nb.dcim.devices.filter(site='matrix', status='active')
```
**❌ Don't filter client-side:**
```python
# ❌ Inefficient: Downloads all devices then filters
all_devices = nb.dcim.devices.all()
matrix_devices = [d for d in all_devices if d.site.slug == 'matrix']
```
### Request Only Needed Fields
**✅ Request brief representations when you only need identifiers:**
```python
# Brief mode (?brief=true) returns a minimal representation of each object
devices = nb.dcim.devices.filter(site='matrix', brief=True)
```
For precise per-field selection, use NetBox's GraphQL API rather than REST.
### Use Pagination for Large Datasets
**✅ Process in batches:**
```python
# Paginate automatically
for device in nb.dcim.devices.filter(site='matrix'):
    process_device(device)  # pynetbox handles pagination

# Manual pagination for control
page_size = 100
offset = 0
while True:
    devices = nb.dcim.devices.filter(limit=page_size, offset=offset)
    if not devices:
        break
    for device in devices:
        process_device(device)
    offset += page_size
```
### Cache Lookups
**✅ Cache static data:**
```python
from functools import lru_cache

@lru_cache(maxsize=128)
def get_site(site_slug: str):
    """Cached site lookup."""
    return nb.dcim.sites.get(slug=site_slug)

@lru_cache(maxsize=256)
def get_device_type(slug: str):
    """Cached device type lookup."""
    return nb.dcim.device_types.get(slug=slug)
```
### Use Bulk Operations
**✅ Bulk create is faster:**
```python
# ✅ Fast: Bulk create
ips = [
    {'address': f'192.168.3.{i}/24', 'status': 'active'}
    for i in range(10, 20)
]
nb.ipam.ip_addresses.create(ips)

# ❌ Slow: Loop with individual creates
for i in range(10, 20):
    nb.ipam.ip_addresses.create(address=f'192.168.3.{i}/24', status='active')
```
---
## API Integration
### Error Handling
**✅ Always handle errors:**
```python
import sys

import pynetbox
from requests.exceptions import HTTPError
from rich.console import Console

console = Console()

def get_device(name: str):
    try:
        device = nb.dcim.devices.get(name=name)
        if not device:
            console.print("[yellow]Device not found[/yellow]")
            return None
        return device
    except HTTPError as e:
        if e.response.status_code == 404:
            console.print("[red]Resource not found[/red]")
        elif e.response.status_code == 403:
            console.print("[red]Permission denied[/red]")
        else:
            console.print(f"[red]HTTP Error: {e}[/red]")
        sys.exit(1)
    except pynetbox.RequestError as e:
        console.print(f"[red]NetBox API Error: {e.error}[/red]")
        sys.exit(1)
    except Exception as e:
        console.print(f"[red]Unexpected error: {e}[/red]")
        sys.exit(1)
```
### Validate Before Creating
**✅ Validate input before API calls:**
```python
import ipaddress
import re

def validate_ip(ip_str: str) -> bool:
    """Validate IP address format."""
    try:
        ipaddress.ip_interface(ip_str)
        return True
    except ValueError:
        return False

def validate_dns_name(name: str) -> bool:
    """Validate DNS naming convention."""
    pattern = r'^[a-z0-9-]+-\d{2}-[a-z0-9-]+\.[a-z0-9.-]+$'
    return bool(re.match(pattern, name))

# Use before API calls
if not validate_ip(ip_address):
    raise ValueError(f"Invalid IP address: {ip_address}")
if not validate_dns_name(dns_name):
    raise ValueError(f"Invalid DNS name: {dns_name}")

ip = nb.ipam.ip_addresses.create(address=ip_address, dns_name=dns_name)
```
### Check Before Create
**✅ Check existence before creating:**
```python
# Check if device exists
device = nb.dcim.devices.get(name='foxtrot')
if device:
    console.print("[yellow]Device already exists, updating...[/yellow]")
    device.status = 'active'
    device.save()
else:
    console.print("[green]Creating new device...[/green]")
    device = nb.dcim.devices.create(name='foxtrot', ...)
```
---
## Automation Patterns
### Idempotent Operations
**✅ Design operations to be safely re-run:**
```python
def ensure_vm_exists(name: str, cluster: str, **kwargs) -> pynetbox.core.response.Record:
    """Ensure VM exists (idempotent)."""
    # Check if exists
    vm = nb.virtualization.virtual_machines.get(name=name)
    if vm:
        # Update if needed
        updated = False
        for key, value in kwargs.items():
            if getattr(vm, key) != value:
                setattr(vm, key, value)
                updated = True
        if updated:
            vm.save()
            console.print(f"[yellow]Updated VM: {name}[/yellow]")
        else:
            console.print(f"[green]VM unchanged: {name}[/green]")
        return vm
    else:
        # Create new
        vm = nb.virtualization.virtual_machines.create(
            name=name,
            cluster=nb.virtualization.clusters.get(name=cluster).id,
            **kwargs
        )
        console.print(f"[green]Created VM: {name}[/green]")
        return vm
```
### Terraform Integration
See [terraform-provider-guide.md](terraform-provider-guide.md) for complete examples.
**Key pattern:**
```hcl
# Use NetBox as data source
data "netbox_prefix" "management" {
  prefix = "192.168.3.0/24"
}

# Create IP in NetBox via Terraform
resource "netbox_ip_address" "vm_ip" {
  ip_address = cidrhost(data.netbox_prefix.management.prefix, 10)
  dns_name   = "docker-01-nexus.spaceships.work"
  status     = "active"
  tags       = ["terraform", "production-dns"]
}
```
### Ansible Dynamic Inventory
See [../workflows/ansible-dynamic-inventory.md](../workflows/ansible-dynamic-inventory.md).
**Key pattern:**
```yaml
# netbox-dynamic-inventory.yml
plugin: netbox.netbox.nb_inventory
api_endpoint: https://netbox.spaceships.work
token: !vault |
  $ANSIBLE_VAULT;...
group_by:
  - device_roles
  - tags
  - site
```
---
## Troubleshooting
### Common Issues
**Problem:** "Permission denied" errors
**Solution:** Check API token permissions
```bash
# Test token
curl -H "Authorization: Token YOUR_TOKEN" \
  https://netbox.spaceships.work/api/
```
**Problem:** IP not syncing to PowerDNS
**Solution:** Check tags
```python
# IP must have tag matching zone rules
ip = nb.ipam.ip_addresses.get(address='192.168.3.10/24')
print(f"Tags: {[tag.name for tag in ip.tags]}")
# Must include 'production-dns' or matching tag
```
**Problem:** Slow API queries
**Solution:** Use filtering and pagination
```python
# ❌ Slow
all_devices = nb.dcim.devices.all()
# ✅ Fast
devices = nb.dcim.devices.filter(site='matrix', limit=50)
```
### Debug Mode
**Enable verbose logging:**
```python
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
# Now pynetbox will log all API calls
nb = pynetbox.api('https://netbox.spaceships.work', token=token)
devices = nb.dcim.devices.all()
```
---
## Related Documentation
- [NetBox API Guide](netbox-api-guide.md) - Complete API reference
- [NetBox Data Models](netbox-data-models.md) - Data model relationships
- [DNS Naming Conventions](../workflows/naming-conventions.md) - Naming rules
- [Terraform Provider Guide](terraform-provider-guide.md) - Terraform integration
- [Tools: netbox_api_client.py](../tools/netbox_api_client.py) - Working examples
---
**Next:** Review [API Integration Patterns](netbox-api-guide.md#api-integration)