Files
the-other-dude/docs/DEPLOYMENT.md
2026-03-21 16:19:21 -05:00

14 KiB
Raw Blame History

TOD - The Other Dude — Deployment Guide

Overview

TOD (The Other Dude) is a containerized fleet management platform for RouterOS devices. This guide covers Docker Compose deployment for production environments.

Architecture

  • Backend API (Python/FastAPI) -- REST API with JWT authentication and PostgreSQL RLS
  • Go Poller -- Polls RouterOS devices via binary API, publishes events to NATS
  • Frontend (React/nginx) -- Single-page application served by nginx (dynamic DNS resolver prevents 502 errors after API container restarts)
  • PostgreSQL + TimescaleDB -- Primary database with time-series extensions
  • Redis -- Distributed locking and rate limiting
  • NATS JetStream -- Message bus for device events
  • OpenBao -- Secrets management (Transit encryption for credentials, config backups, audit logs)
  • WireGuard -- VPN gateway for isolated device networks
  • WinBox Worker -- Xpra-based container for browser WinBox sessions (runs on linux/amd64, 1GB memory limit)

Prerequisites

  • Docker Engine 24+ with Docker Compose v2
  • At least 4GB RAM (2GB absolute minimum -- builds are memory-intensive)
  • External SSD or fast storage recommended for Docker volumes
  • Network access to RouterOS devices on ports 8728 (API) and 8729 (API-SSL)

Quick Start

The setup wizard handles all configuration interactively:

git clone https://github.com/staack/the-other-dude.git tod
cd tod
python3 setup.py

For CI/CD pipelines or headless servers, the wizard supports non-interactive mode:

python3 setup.py --non-interactive \
    --postgres-password 'MyP@ss!' \
    --domain tod.example.com \
    --admin-email admin@example.com \
    --no-telemetry --yes

Available flags: --non-interactive, --postgres-password, --admin-email, --admin-password, --domain, --smtp-host, --smtp-port, --smtp-user, --smtp-password, --smtp-from, --smtp-tls, --no-smtp-tls, --proxy caddy|nginx|apache|haproxy|traefik|skip, --telemetry, --no-telemetry, --yes/-y. The wizard also handles EOFError gracefully when stdin is not a TTY.

Manual Setup

If you prefer manual configuration, follow the steps below.

1. Clone and Configure

git clone https://github.com/staack/the-other-dude.git tod
cd tod

# Copy environment template
cp .env.example .env.prod

2. Generate Secrets

# Generate JWT secret
python3 -c "import secrets; print(secrets.token_urlsafe(64))"

# Generate credential encryption key (32 bytes, base64-encoded)
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"

Edit .env.prod with the generated values:

ENVIRONMENT=production
JWT_SECRET_KEY=<generated-jwt-secret>
CREDENTIAL_ENCRYPTION_KEY=<generated-encryption-key>
POSTGRES_PASSWORD=<strong-password>

# First admin user (created on first startup)
FIRST_ADMIN_EMAIL=admin@example.com
FIRST_ADMIN_PASSWORD=<strong-password>

3. Build Images

Build images one at a time to avoid out-of-memory crashes on constrained hosts:

docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend

4. Start the Stack

docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d

5. Verify

# Check all services are running
docker compose ps

# Check API health (liveness)
curl http://localhost:8000/health

# Check readiness (PostgreSQL, Redis, NATS connected)
curl http://localhost:8000/health/ready

# Access the portal
# Open http://localhost in a web browser

Log in with the FIRST_ADMIN_EMAIL and FIRST_ADMIN_PASSWORD credentials set in step 2.

Environment Configuration

Required Variables

Variable Description Example
ENVIRONMENT Deployment environment production
JWT_SECRET_KEY JWT signing secret (min 32 chars) <generated>
CREDENTIAL_ENCRYPTION_KEY AES-256 key for device credentials (base64) <generated>
POSTGRES_PASSWORD PostgreSQL superuser password <strong-password>
FIRST_ADMIN_EMAIL Initial admin account email admin@example.com
FIRST_ADMIN_PASSWORD Initial admin account password <strong-password>

Optional Variables

Variable Default Description
GUNICORN_WORKERS 2 API worker process count
DB_POOL_SIZE 20 App database connection pool size
DB_MAX_OVERFLOW 40 Max overflow connections above pool
DB_ADMIN_POOL_SIZE 10 Admin database connection pool size
DB_ADMIN_MAX_OVERFLOW 20 Admin max overflow connections
POLL_INTERVAL_SECONDS 60 Device polling interval
CONNECTION_TIMEOUT_SECONDS 10 RouterOS connection timeout
COMMAND_TIMEOUT_SECONDS 30 RouterOS per-command timeout
CIRCUIT_BREAKER_MAX_FAILURES 5 Consecutive failures before backoff
CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS 30 Initial backoff duration
CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS 900 Maximum backoff (15 min)
LOG_LEVEL info Logging verbosity (debug/info/warn/error)
CORS_ORIGINS http://localhost:3000 Comma-separated CORS origins
TUNNEL_PORT_MIN 49000 Start of WinBox tunnel port range
TUNNEL_PORT_MAX 49100 End of WinBox tunnel port range
TUNNEL_IDLE_TIMEOUT 300 WinBox tunnel idle timeout (seconds)
SSH_RELAY_PORT 8080 SSH relay HTTP server port
SSH_IDLE_TIMEOUT 900 SSH session idle timeout (seconds)
SSH_MAX_SESSIONS 200 Maximum concurrent SSH sessions
SSH_MAX_PER_USER 10 Maximum SSH sessions per user
SSH_MAX_PER_DEVICE 20 Maximum SSH sessions per device

Security Notes

  • Never use default secrets in production. The application refuses to start if it detects known insecure defaults (like the dev JWT secret) in non-dev environments.
  • Credential encryption key is used to encrypt RouterOS device passwords at rest. Losing this key means re-entering all device credentials.
  • CORS_ORIGINS should be set to your actual domain in production.
  • RLS enforcement: The app_user database role enforces row-level security. Tenants cannot access each other's data even with a compromised JWT.

Storage Configuration

Docker volumes mount to the host filesystem. Default locations are configured in docker-compose.yml:

  • PostgreSQL data: ./docker-data/postgres
  • Redis data: ./docker-data/redis
  • NATS data: ./docker-data/nats
  • Git store (config backups): ./docker-data/git-store
  • Firmware cache: ./docker-data/firmware-cache (downloaded RouterOS firmware packages)

To change storage locations, edit the volume mounts in docker-compose.yml.

Resource Limits

Container memory limits are enforced in docker-compose.prod.yml to prevent OOM crashes:

Service Memory Limit
PostgreSQL 512MB
Redis 128MB
NATS 128MB
API 512MB
Poller 512MB
Frontend 64MB
OpenBao 256MB
WireGuard 128MB
WinBox Worker 1GB

Adjust under deploy.resources.limits.memory in docker-compose.prod.yml.

Note: The WinBox tunnel port range (TUNNEL_PORT_MINTUNNEL_PORT_MAX, default 4900049100) must be mapped in the poller container's port bindings. Add "49000-49100:49000-49100" to the poller service's ports list in your compose file. The SSH relay port (SSH_RELAY_PORT, default 8080) similarly requires a port mapping if accessed directly.

API Documentation

The backend serves interactive API documentation at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

All endpoints include descriptions, request/response schemas, and authentication requirements.

Kubernetes (Helm)

TOD includes a Helm chart for Kubernetes deployment at infrastructure/helm/.

Prerequisites

  • Kubernetes 1.28+
  • Helm 3
  • A StorageClass that supports ReadWriteOnce PersistentVolumeClaims

Install

  1. Create a values override file with your configuration:

    cp infrastructure/helm/values.yaml my-values.yaml
    # Edit my-values.yaml — at minimum set:
    #   secrets.jwtSecretKey, secrets.credentialEncryptionKey,
    #   secrets.dbPassword, secrets.dbAppPassword, secrets.dbPollerPassword,
    #   secrets.firstAdminPassword, ingress.host
    
  2. Install the chart:

    helm install tod infrastructure/helm -f my-values.yaml -n tod --create-namespace
    
  3. Initialize OpenBao (first time only):

    # Wait for the pod to start
    kubectl get pods -n tod -l app.kubernetes.io/component=openbao
    
    # Initialize
    kubectl exec -it -n tod tod-openbao-0 -- bao operator init -key-shares=1 -key-threshold=1
    
    # Save the unseal key and root token, then unseal
    kubectl exec -it -n tod tod-openbao-0 -- bao operator unseal <UNSEAL_KEY>
    
    # Update release with the token
    helm upgrade tod infrastructure/helm -f my-values.yaml \
      --set secrets.openbaoToken=<ROOT_TOKEN> \
      --set secrets.baoUnsealKey=<UNSEAL_KEY> \
      -n tod
    
  4. Verify:

    kubectl get pods -n tod
    kubectl port-forward -n tod svc/tod-api 8000:8000
    curl http://localhost:8000/health
    

Services

The Helm chart deploys:

Service Type Purpose
PostgreSQL (TimescaleDB) StatefulSet Primary database
Redis Deployment Cache
NATS JetStream StatefulSet Message queue
OpenBao StatefulSet Secrets management
API Deployment FastAPI backend
Frontend Deployment React SPA (nginx)
Poller Deployment Go device poller
WireGuard Deployment VPN gateway
WinBox Worker Deployment Browser-based WinBox sessions (Xpra)

Configuration

All configuration is in values.yaml. See infrastructure/helm/values.yaml for the full reference with comments. Key sections:

  • secrets.* -- All secrets (must be overridden in production)
  • api.env.* -- API environment settings
  • poller.env.* -- Poller settings
  • ingress.* -- Ingress routing and TLS
  • wireguard.* -- VPN configuration (can be disabled with wireguard.enabled: false)

Note on OpenBao

OpenBao must be manually unsealed after every pod restart. Auto-unseal is a planned future enhancement.

Monitoring (Optional)

Enable Prometheus and Grafana monitoring with the observability compose overlay:

docker compose \
  -f docker-compose.yml \
  -f docker-compose.prod.yml \
  -f docker-compose.observability.yml \
  --env-file .env.prod up -d
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3001 (default: admin/admin — change the default password immediately on any networked host)

Exported Metrics

The API and poller export Prometheus metrics:

Metric Source Description
http_requests_total API HTTP request count by method, path, status
http_request_duration_seconds API Request latency histogram
mikrotik_poll_total Poller Poll cycles by status (success/error/skipped)
mikrotik_poll_duration_seconds Poller Poll cycle duration histogram
mikrotik_devices_active Poller Number of devices being polled
mikrotik_circuit_breaker_skips_total Poller Polls skipped due to backoff
mikrotik_nats_publish_total Poller NATS publishes by subject and status

Maintenance

Backup Strategy

  • Database: Use pg_dump or configure PostgreSQL streaming replication
  • Config backups: Git repositories in the git-store volume (automatic nightly backups)
  • Encryption key: Store CREDENTIAL_ENCRYPTION_KEY securely -- required to decrypt device credentials

Updating

# Back up the database before upgrading
docker compose exec postgres pg_dump -U postgres mikrotik > backup-$(date +%Y%m%d).sql

git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d

Database migrations run automatically on API startup via Alembic.

Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f api

# Filter structured JSON logs with jq
docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event != null)'

# View audit logs (config editor operations)
docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event | startswith("routeros_"))'

Graceful Shutdown

All services handle SIGTERM for graceful shutdown:

  • API (gunicorn): Finishes in-flight requests within GUNICORN_GRACEFUL_TIMEOUT (default 30s), then disposes database connection pools
  • Poller (Go): Cancels all device polling goroutines via context propagation, waits for in-flight polls to complete
  • Frontend (nginx): Stops accepting new connections and finishes serving active requests
# Graceful stop (sends SIGTERM, waits 30s)
docker compose stop

# Restart a single service
docker compose restart api

Troubleshooting

Issue Solution
API won't start with secret error Generate production secrets (see step 2 above)
Build crashes with OOM Build images one at a time (see step 3 above)
Device shows offline Check network access to device API port (8728/8729)
Health check fails Check docker compose logs api for startup errors
Rate limited (429) Wait 60 seconds or check Redis connectivity
Migration fails Check docker compose logs api for Alembic errors
NATS subscriber won't start Non-fatal -- API runs without NATS; check NATS container health
Poller circuit breaker active Device unreachable; check CIRCUIT_BREAKER_* env vars to tune backoff
Frontend returns 502 after API restart nginx caches upstream DNS at startup; the dynamic resolver (resolver 127.0.0.11) in nginx-spa.conf handles this automatically — if you see 502s, ensure the nginx config has not been overridden