Add Kubernetes/Helm deployment section to DEPLOYMENT.md, telemetry environment variables to CONFIGURATION.md, telemetry privacy details to SECURITY.md, telemetry bullet to README quick start, and fix Go version from 1.24 to 1.25 in docs/README.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
13 KiB
TOD - The Other Dude — Deployment Guide
Overview
TOD (The Other Dude) is a containerized fleet management platform for RouterOS devices. This guide covers Docker Compose deployment for production environments.
Architecture
- Backend API (Python/FastAPI) -- REST API with JWT authentication and PostgreSQL RLS
- Go Poller -- Polls RouterOS devices via binary API, publishes events to NATS
- Frontend (React/nginx) -- Single-page application served by nginx (dynamic DNS resolver prevents 502 errors after API container restarts)
- PostgreSQL + TimescaleDB -- Primary database with time-series extensions
- Redis -- Distributed locking and rate limiting
- NATS JetStream -- Message bus for device events
Prerequisites
- Docker Engine 24+ with Docker Compose v2
- At least 4GB RAM (2GB absolute minimum -- builds are memory-intensive)
- External SSD or fast storage recommended for Docker volumes
- Network access to RouterOS devices on ports 8728 (API) and 8729 (API-SSL)
Quick Start
1. Clone and Configure
git clone https://github.com/staack/the-other-dude.git tod
cd tod
# Copy environment template
cp .env.example .env.prod
2. Generate Secrets
# Generate JWT secret
python3 -c "import secrets; print(secrets.token_urlsafe(64))"
# Generate credential encryption key (32 bytes, base64-encoded)
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"
Edit .env.prod with the generated values:
ENVIRONMENT=production
JWT_SECRET_KEY=<generated-jwt-secret>
CREDENTIAL_ENCRYPTION_KEY=<generated-encryption-key>
POSTGRES_PASSWORD=<strong-password>
# First admin user (created on first startup)
FIRST_ADMIN_EMAIL=admin@example.com
FIRST_ADMIN_PASSWORD=<strong-password>
3. Build Images
Build images one at a time to avoid out-of-memory crashes on constrained hosts:
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
4. Start the Stack
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
5. Verify
# Check all services are running
docker compose ps
# Check API health (liveness)
curl http://localhost:8000/health
# Check readiness (PostgreSQL, Redis, NATS connected)
curl http://localhost:8000/health/ready
# Access the portal
# Open http://localhost in a web browser
Log in with the FIRST_ADMIN_EMAIL and FIRST_ADMIN_PASSWORD credentials set in step 2.
Environment Configuration
Required Variables
| Variable | Description | Example |
|---|---|---|
ENVIRONMENT |
Deployment environment | production |
JWT_SECRET_KEY |
JWT signing secret (min 32 chars) | <generated> |
CREDENTIAL_ENCRYPTION_KEY |
AES-256 key for device credentials (base64) | <generated> |
POSTGRES_PASSWORD |
PostgreSQL superuser password | <strong-password> |
FIRST_ADMIN_EMAIL |
Initial admin account email | admin@example.com |
FIRST_ADMIN_PASSWORD |
Initial admin account password | <strong-password> |
Optional Variables
| Variable | Default | Description |
|---|---|---|
GUNICORN_WORKERS |
2 |
API worker process count |
DB_POOL_SIZE |
20 |
App database connection pool size |
DB_MAX_OVERFLOW |
40 |
Max overflow connections above pool |
DB_ADMIN_POOL_SIZE |
10 |
Admin database connection pool size |
DB_ADMIN_MAX_OVERFLOW |
20 |
Admin max overflow connections |
POLL_INTERVAL_SECONDS |
60 |
Device polling interval |
CONNECTION_TIMEOUT_SECONDS |
10 |
RouterOS connection timeout |
COMMAND_TIMEOUT_SECONDS |
30 |
RouterOS per-command timeout |
CIRCUIT_BREAKER_MAX_FAILURES |
5 |
Consecutive failures before backoff |
CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS |
30 |
Initial backoff duration |
CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS |
900 |
Maximum backoff (15 min) |
LOG_LEVEL |
info |
Logging verbosity (debug/info/warn/error) |
CORS_ORIGINS |
http://localhost:3000 |
Comma-separated CORS origins |
TUNNEL_PORT_MIN |
49000 |
Start of WinBox tunnel port range |
TUNNEL_PORT_MAX |
49100 |
End of WinBox tunnel port range |
TUNNEL_IDLE_TIMEOUT |
300 |
WinBox tunnel idle timeout (seconds) |
SSH_RELAY_PORT |
8080 |
SSH relay HTTP server port |
SSH_IDLE_TIMEOUT |
900 |
SSH session idle timeout (seconds) |
SSH_MAX_SESSIONS |
200 |
Maximum concurrent SSH sessions |
SSH_MAX_PER_USER |
10 |
Maximum SSH sessions per user |
SSH_MAX_PER_DEVICE |
20 |
Maximum SSH sessions per device |
Security Notes
- Never use default secrets in production. The application refuses to start if it detects known insecure defaults (like the dev JWT secret) in non-dev environments.
- Credential encryption key is used to encrypt RouterOS device passwords at rest. Losing this key means re-entering all device credentials.
- CORS_ORIGINS should be set to your actual domain in production.
- RLS enforcement: The app_user database role enforces row-level security. Tenants cannot access each other's data even with a compromised JWT.
Storage Configuration
Docker volumes mount to the host filesystem. Default locations are configured in docker-compose.yml:
- PostgreSQL data:
./docker-data/postgres - Redis data:
./docker-data/redis - NATS data:
./docker-data/nats - Git store (config backups):
./docker-data/git-store - Firmware cache:
./docker-data/firmware-cache(downloaded RouterOS firmware packages)
To change storage locations, edit the volume mounts in docker-compose.yml.
Resource Limits
Container memory limits are enforced in docker-compose.prod.yml to prevent OOM crashes:
| Service | Memory Limit |
|---|---|
| PostgreSQL | 512MB |
| Redis | 128MB |
| NATS | 128MB |
| API | 512MB |
| Poller | 512MB |
| Frontend | 64MB |
Adjust under deploy.resources.limits.memory in docker-compose.prod.yml.
Note: The WinBox tunnel port range (
TUNNEL_PORT_MIN–TUNNEL_PORT_MAX, default 49000–49100) must be mapped in the poller container's port bindings. Add"49000-49100:49000-49100"to the poller service'sportslist in your compose file. The SSH relay port (SSH_RELAY_PORT, default 8080) similarly requires a port mapping if accessed directly.
API Documentation
The backend serves interactive API documentation at:
- Swagger UI:
http://localhost:8000/docs - ReDoc:
http://localhost:8000/redoc
All endpoints include descriptions, request/response schemas, and authentication requirements.
Kubernetes (Helm)
TOD includes a Helm chart for Kubernetes deployment at infrastructure/helm/.
Prerequisites
- Kubernetes 1.28+
- Helm 3
- A StorageClass that supports ReadWriteOnce PersistentVolumeClaims
Install
-
Create a values override file with your configuration:
cp infrastructure/helm/values.yaml my-values.yaml # Edit my-values.yaml — at minimum set: # secrets.jwtSecretKey, secrets.credentialEncryptionKey, # secrets.dbPassword, secrets.dbAppPassword, secrets.dbPollerPassword, # secrets.firstAdminPassword, ingress.host -
Install the chart:
helm install tod infrastructure/helm -f my-values.yaml -n tod --create-namespace -
Initialize OpenBao (first time only):
# Wait for the pod to start kubectl get pods -n tod -l app.kubernetes.io/component=openbao # Initialize kubectl exec -it -n tod tod-openbao-0 -- bao operator init -key-shares=1 -key-threshold=1 # Save the unseal key and root token, then unseal kubectl exec -it -n tod tod-openbao-0 -- bao operator unseal <UNSEAL_KEY> # Update release with the token helm upgrade tod infrastructure/helm -f my-values.yaml \ --set secrets.openbaoToken=<ROOT_TOKEN> \ --set secrets.baoUnsealKey=<UNSEAL_KEY> \ -n tod -
Verify:
kubectl get pods -n tod kubectl port-forward -n tod svc/tod-api 8000:8000 curl http://localhost:8000/health
Services
The Helm chart deploys:
| Service | Type | Purpose |
|---|---|---|
| PostgreSQL (TimescaleDB) | StatefulSet | Primary database |
| Redis | Deployment | Cache |
| NATS JetStream | StatefulSet | Message queue |
| OpenBao | StatefulSet | Secrets management |
| API | Deployment | FastAPI backend |
| Frontend | Deployment | React SPA (nginx) |
| Poller | Deployment | Go device poller |
| WireGuard | Deployment | VPN gateway |
Configuration
All configuration is in values.yaml. See infrastructure/helm/values.yaml for the full reference with comments. Key sections:
secrets.*-- All secrets (must be overridden in production)api.env.*-- API environment settingspoller.env.*-- Poller settingsingress.*-- Ingress routing and TLSwireguard.*-- VPN configuration (can be disabled withwireguard.enabled: false)
Note on OpenBao
OpenBao must be manually unsealed after every pod restart. Auto-unseal is a planned future enhancement.
Monitoring (Optional)
Enable Prometheus and Grafana monitoring with the observability compose overlay:
docker compose \
-f docker-compose.yml \
-f docker-compose.prod.yml \
-f docker-compose.observability.yml \
--env-file .env.prod up -d
- Prometheus:
http://localhost:9090 - Grafana:
http://localhost:3001(default: admin/admin — change the default password immediately on any networked host)
Exported Metrics
The API and poller export Prometheus metrics:
| Metric | Source | Description |
|---|---|---|
http_requests_total |
API | HTTP request count by method, path, status |
http_request_duration_seconds |
API | Request latency histogram |
mikrotik_poll_total |
Poller | Poll cycles by status (success/error/skipped) |
mikrotik_poll_duration_seconds |
Poller | Poll cycle duration histogram |
mikrotik_devices_active |
Poller | Number of devices being polled |
mikrotik_circuit_breaker_skips_total |
Poller | Polls skipped due to backoff |
mikrotik_nats_publish_total |
Poller | NATS publishes by subject and status |
Maintenance
Backup Strategy
- Database: Use
pg_dumpor configure PostgreSQL streaming replication - Config backups: Git repositories in the git-store volume (automatic nightly backups)
- Encryption key: Store
CREDENTIAL_ENCRYPTION_KEYsecurely -- required to decrypt device credentials
Updating
# Back up the database before upgrading
docker compose exec postgres pg_dump -U postgres mikrotik > backup-$(date +%Y%m%d).sql
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
Database migrations run automatically on API startup via Alembic.
Logs
# All services
docker compose logs -f
# Specific service
docker compose logs -f api
# Filter structured JSON logs with jq
docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event != null)'
# View audit logs (config editor operations)
docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event | startswith("routeros_"))'
Graceful Shutdown
All services handle SIGTERM for graceful shutdown:
- API (gunicorn): Finishes in-flight requests within
GUNICORN_GRACEFUL_TIMEOUT(default 30s), then disposes database connection pools - Poller (Go): Cancels all device polling goroutines via context propagation, waits for in-flight polls to complete
- Frontend (nginx): Stops accepting new connections and finishes serving active requests
# Graceful stop (sends SIGTERM, waits 30s)
docker compose stop
# Restart a single service
docker compose restart api
Troubleshooting
| Issue | Solution |
|---|---|
| API won't start with secret error | Generate production secrets (see step 2 above) |
| Build crashes with OOM | Build images one at a time (see step 3 above) |
| Device shows offline | Check network access to device API port (8728/8729) |
| Health check fails | Check docker compose logs api for startup errors |
| Rate limited (429) | Wait 60 seconds or check Redis connectivity |
| Migration fails | Check docker compose logs api for Alembic errors |
| NATS subscriber won't start | Non-fatal -- API runs without NATS; check NATS container health |
| Poller circuit breaker active | Device unreachable; check CIRCUIT_BREAKER_* env vars to tune backoff |
| Frontend returns 502 after API restart | nginx caches upstream DNS at startup; the dynamic resolver (resolver 127.0.0.11) in nginx-spa.conf handles this automatically — if you see 502s, ensure the nginx config has not been overridden |