# TOD - The Other Dude — Deployment Guide

## Overview

TOD (The Other Dude) is a containerized fleet management platform for RouterOS devices. This guide covers Docker Compose deployment for production environments.

### Architecture

- **Backend API** (Python/FastAPI) -- REST API with JWT authentication and PostgreSQL RLS
- **Go Poller** -- Polls RouterOS devices via binary API, publishes events to NATS
- **Frontend** (React/nginx) -- Single-page application served by nginx
- **PostgreSQL + TimescaleDB** -- Primary database with time-series extensions
- **Redis** -- Distributed locking and rate limiting
- **NATS JetStream** -- Message bus for device events

## Prerequisites

- Docker Engine 24+ with Docker Compose v2
- At least 4GB RAM (2GB absolute minimum -- builds are memory-intensive)
- External SSD or fast storage recommended for Docker volumes
- Network access to RouterOS devices on ports 8728 (API) and 8729 (API-SSL)

## Quick Start

### 1. Clone and Configure

```bash
git clone tod
cd tod

# Copy environment template
cp .env.example .env.prod
```

### 2. Generate Secrets

```bash
# Generate JWT secret
python3 -c "import secrets; print(secrets.token_urlsafe(64))"

# Generate credential encryption key (32 bytes, base64-encoded)
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"
```

Edit `.env.prod` with the generated values:

```env
ENVIRONMENT=production
JWT_SECRET_KEY=
CREDENTIAL_ENCRYPTION_KEY=
POSTGRES_PASSWORD=

# First admin user (created on first startup)
FIRST_ADMIN_EMAIL=admin@example.com
FIRST_ADMIN_PASSWORD=
```

### 3. Build Images

Build images **one at a time** to avoid out-of-memory crashes on constrained hosts:

```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
```

### 4. Start the Stack

```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
```

### 5. Verify

```bash
# Check all services are running
docker compose ps

# Check API health (liveness)
curl http://localhost:8000/health

# Check readiness (PostgreSQL, Redis, NATS connected)
curl http://localhost:8000/health/ready

# Access the portal
open http://localhost
```

Log in with the `FIRST_ADMIN_EMAIL` and `FIRST_ADMIN_PASSWORD` credentials set in step 2.

## Environment Configuration

### Required Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `ENVIRONMENT` | Deployment environment | `production` |
| `JWT_SECRET_KEY` | JWT signing secret (min 32 chars) | |
| `CREDENTIAL_ENCRYPTION_KEY` | AES-256 key for device credentials (base64) | |
| `POSTGRES_PASSWORD` | PostgreSQL superuser password | |
| `FIRST_ADMIN_EMAIL` | Initial admin account email | `admin@example.com` |
| `FIRST_ADMIN_PASSWORD` | Initial admin account password | |

### Optional Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GUNICORN_WORKERS` | `2` | API worker process count |
| `DB_POOL_SIZE` | `20` | App database connection pool size |
| `DB_MAX_OVERFLOW` | `40` | Max overflow connections above pool |
| `DB_ADMIN_POOL_SIZE` | `10` | Admin database connection pool size |
| `DB_ADMIN_MAX_OVERFLOW` | `20` | Admin max overflow connections |
| `POLL_INTERVAL_SECONDS` | `60` | Device polling interval |
| `CONNECTION_TIMEOUT_SECONDS` | `10` | RouterOS connection timeout |
| `COMMAND_TIMEOUT_SECONDS` | `30` | RouterOS per-command timeout |
| `CIRCUIT_BREAKER_MAX_FAILURES` | `5` | Consecutive failures before backoff |
| `CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS` | `30` | Initial backoff duration |
| `CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS` | `900` | Maximum backoff (15 min) |
| `LOG_LEVEL` | `info` | Logging verbosity (`debug`/`info`/`warn`/`error`) |
| `CORS_ORIGINS` | `http://localhost:3000` | Comma-separated CORS origins |

### Security Notes

- **Never use default secrets in production.** The application refuses to start if it detects known insecure defaults (like the dev JWT secret) in non-dev environments.
- **Credential encryption key** is used to encrypt RouterOS device passwords at rest. Losing this key means re-entering all device credentials.
- **CORS_ORIGINS** should be set to your actual domain in production.
- **RLS enforcement**: The app_user database role enforces row-level security. Tenants cannot access each other's data even with a compromised JWT.

## Storage Configuration

Docker volumes mount to the host filesystem. Default locations are configured in `docker-compose.yml`:

- **PostgreSQL data**: `./docker-data/postgres`
- **Redis data**: `./docker-data/redis`
- **NATS data**: `./docker-data/nats`
- **Git store (config backups)**: `./docker-data/git-store`

To change storage locations, edit the volume mounts in `docker-compose.yml`.

## Resource Limits

Container memory limits are enforced in `docker-compose.prod.yml` to prevent OOM crashes:

| Service | Memory Limit |
|---------|-------------|
| PostgreSQL | 512MB |
| Redis | 128MB |
| NATS | 128MB |
| API | 512MB |
| Poller | 256MB |
| Frontend | 64MB |

Adjust under `deploy.resources.limits.memory` in `docker-compose.prod.yml`.

## API Documentation

The backend serves interactive API documentation at:

- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`

All endpoints include descriptions, request/response schemas, and authentication requirements.
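
The statement that every endpoint carries a description can be spot-checked against the OpenAPI schema that backs the Swagger UI and ReDoc pages. The following is a minimal sketch, not part of TOD itself: it assumes the backend keeps FastAPI's default schema path `/openapi.json` (this guide does not confirm that), and the function names `fetch_schema` and `undocumented_operations` are hypothetical.

```python
import json
from urllib.request import urlopen

# HTTP methods that OpenAPI allows as operation keys under a path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI schema.

    Assumption: the schema is served at FastAPI's default /openapi.json path.
    """
    with urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)


def undocumented_operations(schema: dict) -> list[str]:
    """Return 'METHOD /path' for operations with neither summary nor description."""
    missing = []
    for path, item in schema.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            if not (operation.get("summary") or operation.get("description")):
                missing.append(f"{method.upper()} {path}")
    return sorted(missing)
```

With the stack running, `undocumented_operations(fetch_schema())` should return an empty list if the schema matches the claim above; any entries it returns point at endpoints worth documenting.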
## Monitoring (Optional)

Enable Prometheus and Grafana monitoring with the observability compose overlay:

```bash
docker compose \
  -f docker-compose.yml \
  -f docker-compose.prod.yml \
  -f docker-compose.observability.yml \
  --env-file .env.prod up -d
```

- **Prometheus**: `http://localhost:9090`
- **Grafana**: `http://localhost:3001` (default: admin/admin)

### Exported Metrics

The API and poller export Prometheus metrics:

| Metric | Source | Description |
|--------|--------|-------------|
| `http_requests_total` | API | HTTP request count by method, path, status |
| `http_request_duration_seconds` | API | Request latency histogram |
| `mikrotik_poll_total` | Poller | Poll cycles by status (success/error/skipped) |
| `mikrotik_poll_duration_seconds` | Poller | Poll cycle duration histogram |
| `mikrotik_devices_active` | Poller | Number of devices being polled |
| `mikrotik_circuit_breaker_skips_total` | Poller | Polls skipped due to backoff |
| `mikrotik_nats_publish_total` | Poller | NATS publishes by subject and status |

## Maintenance

### Backup Strategy

- **Database**: Use `pg_dump` or configure PostgreSQL streaming replication
- **Config backups**: Git repositories in the git-store volume (automatic nightly backups)
- **Encryption key**: Store `CREDENTIAL_ENCRYPTION_KEY` securely -- required to decrypt device credentials

### Updating

```bash
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
```

Database migrations run automatically on API startup via Alembic.
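
Because migrations run during API startup, the API may need some time after an update before it serves traffic. One way to script the wait is to poll the readiness endpoint from the Verify step until it answers. This is a rough sketch under stated assumptions: it assumes `/health/ready` returns HTTP 200 only once its dependencies are connected, and the function names, the 120-second timeout, and the 2-second interval are illustrative choices rather than values from TOD.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str = "http://localhost:8000/health/ready") -> bool:
    """Return True if the readiness endpoint answers with HTTP 200.

    Assumption: a non-200 status or a refused connection means "not ready yet".
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, connection refused, and timeouts.
        return False


def wait_until_ready(check: Callable[[], bool] = http_ready,
                     timeout: float = 120.0,
                     interval: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Call check() every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_ready()` after the final `up -d` returns `True` once the API reports ready and `False` if the timeout expires, at which point `docker compose logs api` is the next place to look. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.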
### Logs

```bash
# All services
docker compose logs -f

# Specific service
docker compose logs -f api

# Filter structured JSON logs with jq (non-JSON lines are skipped)
docker compose logs api --no-log-prefix 2>&1 | jq -R 'fromjson? | select(.event? != null)'

# View audit logs (config editor operations)
docker compose logs api --no-log-prefix 2>&1 | jq -R 'fromjson? | select((.event? // "") | startswith("routeros_"))'
```

### Graceful Shutdown

All services handle SIGTERM for graceful shutdown:

- **API (gunicorn)**: Finishes in-flight requests within `GUNICORN_GRACEFUL_TIMEOUT` (default 30s), then disposes database connection pools
- **Poller (Go)**: Cancels all device polling goroutines via context propagation, waits for in-flight polls to complete
- **Frontend (nginx)**: Stops accepting new connections and finishes serving active requests

```bash
# Graceful stop (sends SIGTERM, waits 30s)
docker compose stop

# Restart a single service
docker compose restart api
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| API won't start with secret error | Generate production secrets (see step 2 above) |
| Build crashes with OOM | Build images one at a time (see step 3 above) |
| Device shows offline | Check network access to device API port (8728/8729) |
| Health check fails | Check `docker compose logs api` for startup errors |
| Rate limited (429) | Wait 60 seconds or check Redis connectivity |
| Migration fails | Check `docker compose logs api` for Alembic errors |
| NATS subscriber won't start | Non-fatal -- API runs without NATS; check NATS container health |
| Poller circuit breaker active | Device unreachable; check `CIRCUIT_BREAKER_*` env vars to tune backoff |
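
When tuning the `CIRCUIT_BREAKER_*` variables, it can help to see how long an unreachable device would be skipped. This guide does not state the poller's actual backoff formula, so the sketch below assumes a common scheme: no backoff until `CIRCUIT_BREAKER_MAX_FAILURES` consecutive failures, then a delay that doubles from the base value up to the maximum. The function name `backoff_seconds` is hypothetical; the default arguments mirror the defaults in the Optional Variables table.

```python
def backoff_seconds(consecutive_failures: int,
                    max_failures: int = 5,   # CIRCUIT_BREAKER_MAX_FAILURES
                    base: int = 30,          # CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS
                    cap: int = 900) -> int:  # CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS
    """Estimate the backoff after a run of consecutive poll failures.

    Assumption: exponential doubling capped at `cap`. The real poller may
    use a different formula; treat this as a tuning aid only.
    """
    if consecutive_failures < max_failures:
        return 0  # circuit breaker has not tripped yet
    doublings = consecutive_failures - max_failures
    return min(base * 2 ** doublings, cap)


def backoff_schedule(first: int, last: int, **settings: int) -> list[int]:
    """Backoff for each failure count from `first` to `last`, inclusive."""
    return [backoff_seconds(n, **settings) for n in range(first, last + 1)]
```

Under these assumptions and the default settings, `backoff_schedule(4, 10)` gives 0, 30, 60, 120, 240, 480, and 900 seconds: the 15-minute ceiling is reached after ten consecutive failures. Lowering the base or the cap shortens how long a recovered device stays unpolled, at the cost of more connection attempts against devices that are still down.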