# TOD - The Other Dude — Deployment Guide ## Overview TOD (The Other Dude) is a containerized fleet management platform for RouterOS devices. This guide covers Docker Compose deployment for production environments. ### Architecture - **Backend API** (Python/FastAPI) -- REST API with JWT authentication and PostgreSQL RLS - **Go Poller** -- Polls RouterOS devices via binary API, publishes events to NATS - **Frontend** (React/nginx) -- Single-page application served by nginx (dynamic DNS resolver prevents 502 errors after API container restarts) - **PostgreSQL + TimescaleDB** -- Primary database with time-series extensions - **Redis** -- Distributed locking and rate limiting - **NATS JetStream** -- Message bus for device events - **OpenBao** -- Secrets management (Transit encryption for credentials, config backups, audit logs) - **WireGuard** -- VPN gateway for isolated device networks - **WinBox Worker** -- Xpra-based container for browser WinBox sessions (runs on linux/amd64, 1GB memory limit) ## Prerequisites - Docker Engine 24+ with Docker Compose v2 - At least 4GB RAM (2GB absolute minimum -- builds are memory-intensive) - External SSD or fast storage recommended for Docker volumes - Network access to RouterOS devices on ports 8728 (API) and 8729 (API-SSL) ## Quick Start ### Automated Setup (Recommended) The setup wizard handles all configuration interactively: ```bash git clone https://github.com/staack/the-other-dude.git tod cd tod python3 setup.py ``` For CI/CD pipelines or headless servers, the wizard supports non-interactive mode: ```bash python3 setup.py --non-interactive \ --postgres-password 'MyP@ss!' \ --domain tod.example.com \ --admin-email admin@example.com \ --no-telemetry --yes ``` Available flags: `--non-interactive`, `--postgres-password`, `--admin-email`, `--admin-password`, `--domain`, `--smtp-host`, `--smtp-port`, `--smtp-user`, `--smtp-password`, `--smtp-from`, `--smtp-tls`, `--no-smtp-tls`, `--proxy caddy|nginx|apache|haproxy|traefik|skip`, `--telemetry`, `--no-telemetry`, `--yes`/`-y`. The wizard also handles EOFError gracefully when stdin is not a TTY. ### Manual Setup If you prefer manual configuration, follow the steps below. ### 1. Clone and Configure ```bash git clone https://github.com/staack/the-other-dude.git tod cd tod # Copy environment template cp .env.example .env.prod ``` ### 2. Generate Secrets ```bash # Generate JWT secret python3 -c "import secrets; print(secrets.token_urlsafe(64))" # Generate credential encryption key (32 bytes, base64-encoded) python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())" ``` Edit `.env.prod` with the generated values: ```env ENVIRONMENT=production JWT_SECRET_KEY= CREDENTIAL_ENCRYPTION_KEY= POSTGRES_PASSWORD= # First admin user (created on first startup) FIRST_ADMIN_EMAIL=admin@example.com FIRST_ADMIN_PASSWORD= ``` ### 3. Build Images Build images **one at a time** to avoid out-of-memory crashes on constrained hosts: ```bash docker compose -f docker-compose.yml -f docker-compose.prod.yml build api docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend ``` ### 4. Start the Stack ```bash docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d ``` ### 5. Verify ```bash # Check all services are running docker compose ps # Check API health (liveness) curl http://localhost:8000/health # Check readiness (PostgreSQL, Redis, NATS connected) curl http://localhost:8000/health/ready # Access the portal # Open http://localhost in a web browser ``` Log in with the `FIRST_ADMIN_EMAIL` and `FIRST_ADMIN_PASSWORD` credentials set in step 2. ## Environment Configuration ### Required Variables | Variable | Description | Example | |----------|-------------|---------| | `ENVIRONMENT` | Deployment environment | `production` | | `JWT_SECRET_KEY` | JWT signing secret (min 32 chars) | `` | | `CREDENTIAL_ENCRYPTION_KEY` | AES-256 key for device credentials (base64) | `` | | `POSTGRES_PASSWORD` | PostgreSQL superuser password | `` | | `FIRST_ADMIN_EMAIL` | Initial admin account email | `admin@example.com` | | `FIRST_ADMIN_PASSWORD` | Initial admin account password | `` | ### Optional Variables | Variable | Default | Description | |----------|---------|-------------| | `GUNICORN_WORKERS` | `2` | API worker process count | | `DB_POOL_SIZE` | `20` | App database connection pool size | | `DB_MAX_OVERFLOW` | `40` | Max overflow connections above pool | | `DB_ADMIN_POOL_SIZE` | `10` | Admin database connection pool size | | `DB_ADMIN_MAX_OVERFLOW` | `20` | Admin max overflow connections | | `POLL_INTERVAL_SECONDS` | `60` | Device polling interval | | `CONNECTION_TIMEOUT_SECONDS` | `10` | RouterOS connection timeout | | `COMMAND_TIMEOUT_SECONDS` | `30` | RouterOS per-command timeout | | `CIRCUIT_BREAKER_MAX_FAILURES` | `5` | Consecutive failures before backoff | | `CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS` | `30` | Initial backoff duration | | `CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS` | `900` | Maximum backoff (15 min) | | `LOG_LEVEL` | `info` | Logging verbosity (`debug`/`info`/`warn`/`error`) | | `CORS_ORIGINS` | `http://localhost:3000` | Comma-separated CORS origins | | `TUNNEL_PORT_MIN` | `49000` | Start of WinBox tunnel port range | | `TUNNEL_PORT_MAX` | `49100` | End of WinBox tunnel port range | | `TUNNEL_IDLE_TIMEOUT` | `300` | WinBox tunnel idle timeout (seconds) | | `SSH_RELAY_PORT` | `8080` | SSH relay HTTP server port | | `SSH_IDLE_TIMEOUT` | `900` | SSH session idle timeout (seconds) | | `SSH_MAX_SESSIONS` | `200` | Maximum concurrent SSH sessions | | `SSH_MAX_PER_USER` | `10` | Maximum SSH sessions per user | | `SSH_MAX_PER_DEVICE` | `20` | Maximum SSH sessions per device | ### Security Notes - **Never use default secrets in production.** The application refuses to start if it detects known insecure defaults (like the dev JWT secret) in non-dev environments. - **Credential encryption key** is used to encrypt RouterOS device passwords at rest. Losing this key means re-entering all device credentials. - **CORS_ORIGINS** should be set to your actual domain in production. - **RLS enforcement**: The app_user database role enforces row-level security. Tenants cannot access each other's data even with a compromised JWT. ## Storage Configuration Docker volumes mount to the host filesystem. Default locations are configured in `docker-compose.yml`: - **PostgreSQL data**: `./docker-data/postgres` - **Redis data**: `./docker-data/redis` - **NATS data**: `./docker-data/nats` - **Git store (config backups)**: `./docker-data/git-store` - **Firmware cache**: `./docker-data/firmware-cache` (downloaded RouterOS firmware packages) To change storage locations, edit the volume mounts in `docker-compose.yml`. ## Resource Limits Container memory limits are enforced in `docker-compose.prod.yml` to prevent OOM crashes: | Service | Memory Limit | |---------|-------------| | PostgreSQL | 512MB | | Redis | 128MB | | NATS | 128MB | | API | 512MB | | Poller | 512MB | | Frontend | 64MB | | OpenBao | 256MB | | WireGuard | 128MB | | WinBox Worker | 1GB | Adjust under `deploy.resources.limits.memory` in `docker-compose.prod.yml`. > **Note:** The WinBox tunnel port range (`TUNNEL_PORT_MIN`–`TUNNEL_PORT_MAX`, default 49000–49100) must be mapped in the poller container's port bindings. Add `"49000-49100:49000-49100"` to the poller service's `ports` list in your compose file. The SSH relay port (`SSH_RELAY_PORT`, default 8080) similarly requires a port mapping if accessed directly. ## API Documentation The backend serves interactive API documentation at: - **Swagger UI**: `http://localhost:8000/docs` - **ReDoc**: `http://localhost:8000/redoc` All endpoints include descriptions, request/response schemas, and authentication requirements. ## Kubernetes (Helm) TOD includes a Helm chart for Kubernetes deployment at `infrastructure/helm/`. ### Prerequisites - Kubernetes 1.28+ - Helm 3 - A StorageClass that supports ReadWriteOnce PersistentVolumeClaims ### Install 1. Create a values override file with your configuration: ```bash cp infrastructure/helm/values.yaml my-values.yaml # Edit my-values.yaml — at minimum set: # secrets.jwtSecretKey, secrets.credentialEncryptionKey, # secrets.dbPassword, secrets.dbAppPassword, secrets.dbPollerPassword, # secrets.firstAdminPassword, ingress.host ``` 2. Install the chart: ```bash helm install tod infrastructure/helm -f my-values.yaml -n tod --create-namespace ``` 3. Initialize OpenBao (first time only): ```bash # Wait for the pod to start kubectl get pods -n tod -l app.kubernetes.io/component=openbao # Initialize kubectl exec -it -n tod tod-openbao-0 -- bao operator init -key-shares=1 -key-threshold=1 # Save the unseal key and root token, then unseal kubectl exec -it -n tod tod-openbao-0 -- bao operator unseal # Update release with the token helm upgrade tod infrastructure/helm -f my-values.yaml \ --set secrets.openbaoToken= \ --set secrets.baoUnsealKey= \ -n tod ``` 4. Verify: ```bash kubectl get pods -n tod kubectl port-forward -n tod svc/tod-api 8000:8000 curl http://localhost:8000/health ``` ### Services The Helm chart deploys: | Service | Type | Purpose | |---------|------|---------| | PostgreSQL (TimescaleDB) | StatefulSet | Primary database | | Redis | Deployment | Cache | | NATS JetStream | StatefulSet | Message queue | | OpenBao | StatefulSet | Secrets management | | API | Deployment | FastAPI backend | | Frontend | Deployment | React SPA (nginx) | | Poller | Deployment | Go device poller | | WireGuard | Deployment | VPN gateway | | WinBox Worker | Deployment | Browser-based WinBox sessions (Xpra) | ### Configuration All configuration is in `values.yaml`. See `infrastructure/helm/values.yaml` for the full reference with comments. Key sections: - `secrets.*` -- All secrets (must be overridden in production) - `api.env.*` -- API environment settings - `poller.env.*` -- Poller settings - `ingress.*` -- Ingress routing and TLS - `wireguard.*` -- VPN configuration (can be disabled with `wireguard.enabled: false`) ### Note on OpenBao OpenBao must be manually unsealed after every pod restart. Auto-unseal is a planned future enhancement. ## Monitoring (Optional) Enable Prometheus and Grafana monitoring with the observability compose overlay: ```bash docker compose \ -f docker-compose.yml \ -f docker-compose.prod.yml \ -f docker-compose.observability.yml \ --env-file .env.prod up -d ``` - **Prometheus**: `http://localhost:9090` - **Grafana**: `http://localhost:3001` (default: admin/admin — change the default password immediately on any networked host) ### Exported Metrics The API and poller export Prometheus metrics: | Metric | Source | Description | |--------|--------|-------------| | `http_requests_total` | API | HTTP request count by method, path, status | | `http_request_duration_seconds` | API | Request latency histogram | | `mikrotik_poll_total` | Poller | Poll cycles by status (success/error/skipped) | | `mikrotik_poll_duration_seconds` | Poller | Poll cycle duration histogram | | `mikrotik_devices_active` | Poller | Number of devices being polled | | `mikrotik_circuit_breaker_skips_total` | Poller | Polls skipped due to backoff | | `mikrotik_nats_publish_total` | Poller | NATS publishes by subject and status | ## Maintenance ### Backup Strategy - **Database**: Use `pg_dump` or configure PostgreSQL streaming replication - **Config backups**: Git repositories in the git-store volume (automatic nightly backups) - **Encryption key**: Store `CREDENTIAL_ENCRYPTION_KEY` securely -- required to decrypt device credentials ### Updating ```bash # Back up the database before upgrading docker compose exec postgres pg_dump -U postgres mikrotik > backup-$(date +%Y%m%d).sql git pull docker compose -f docker-compose.yml -f docker-compose.prod.yml build api docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d ``` Database migrations run automatically on API startup via Alembic. ### Logs ```bash # All services docker compose logs -f # Specific service docker compose logs -f api # Filter structured JSON logs with jq docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event != null)' # View audit logs (config editor operations) docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event | startswith("routeros_"))' ``` ### Graceful Shutdown All services handle SIGTERM for graceful shutdown: - **API (gunicorn)**: Finishes in-flight requests within `GUNICORN_GRACEFUL_TIMEOUT` (default 30s), then disposes database connection pools - **Poller (Go)**: Cancels all device polling goroutines via context propagation, waits for in-flight polls to complete - **Frontend (nginx)**: Stops accepting new connections and finishes serving active requests ```bash # Graceful stop (sends SIGTERM, waits 30s) docker compose stop # Restart a single service docker compose restart api ``` ## Troubleshooting | Issue | Solution | |-------|----------| | API won't start with secret error | Generate production secrets (see step 2 above) | | Build crashes with OOM | Build images one at a time (see step 3 above) | | Device shows offline | Check network access to device API port (8728/8729) | | Health check fails | Check `docker compose logs api` for startup errors | | Rate limited (429) | Wait 60 seconds or check Redis connectivity | | Migration fails | Check `docker compose logs api` for Alembic errors | | NATS subscriber won't start | Non-fatal -- API runs without NATS; check NATS container health | | Poller circuit breaker active | Device unreachable; check `CIRCUIT_BREAKER_*` env vars to tune backoff | | Frontend returns 502 after API restart | nginx caches upstream DNS at startup; the dynamic resolver (`resolver 127.0.0.11`) in `nginx-spa.conf` handles this automatically — if you see 502s, ensure the nginx config has not been overridden |