# TOD - The Other Dude — Deployment Guide

## Overview

TOD (The Other Dude) is a containerized fleet management platform for RouterOS devices. This guide covers Docker Compose deployment for production environments and, as an alternative, a Helm chart for Kubernetes.

### Architecture

- **Backend API** (Python/FastAPI) -- REST API with JWT authentication and PostgreSQL RLS
- **Go Poller** -- Polls RouterOS devices via binary API, publishes events to NATS
- **Frontend** (React/nginx) -- Single-page application served by nginx (dynamic DNS resolver prevents 502 errors after API container restarts)
- **PostgreSQL + TimescaleDB** -- Primary database with time-series extensions
- **Redis** -- Distributed locking and rate limiting
- **NATS JetStream** -- Message bus for device events
- **OpenBao** -- Secrets management (Transit encryption for credentials, config backups, audit logs)
- **WireGuard** -- VPN gateway for isolated device networks
- **WinBox Worker** -- Xpra-based container for browser WinBox sessions (runs on linux/amd64, 1GB memory limit)

## Prerequisites

- Docker Engine 24+ with Docker Compose v2
- At least 4GB RAM (2GB absolute minimum -- builds are memory-intensive)
- External SSD or fast storage recommended for Docker volumes
- Network access to RouterOS devices on ports 8728 (API) and 8729 (API-SSL)

## Quick Start

### 1. Clone and Configure

```bash
git clone https://github.com/staack/the-other-dude.git tod
cd tod

# Copy environment template
cp .env.example .env.prod
```

### 2. Generate Secrets

```bash
# Generate JWT secret
python3 -c "import secrets; print(secrets.token_urlsafe(64))"

# Generate credential encryption key (32 bytes, base64-encoded)
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"
```

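On a host without Python, comparable secrets can be produced with standard command-line tools. The following is a minimal sketch, not something shipped with TOD: the helper names `gen_jwt_secret`, `gen_encryption_key`, and `decoded_len` are hypothetical, and it assumes `/dev/urandom` and a `base64` that accepts `-d`. The JWT secret here is 64 URL-safe characters (shorter than the `token_urlsafe(64)` output above, but well over the 32-character minimum).

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of TOD) standing in for the python3 one-liners.

# 48 random bytes -> 64 URL-safe base64 characters, no padding.
gen_jwt_secret() {
  head -c 48 /dev/urandom | base64 | tr '+/' '-_' | tr -d '=\n'
}

# 32 random bytes as standard base64 (44 characters including padding).
gen_encryption_key() {
  head -c 32 /dev/urandom | base64 | tr -d '\n'
}

# Print how many raw bytes a base64 string decodes to.
decoded_len() {
  printf '%s' "$1" | base64 -d | wc -c | tr -d ' '
}

jwt_secret=$(gen_jwt_secret)
encryption_key=$(gen_encryption_key)

# Lines ready to paste into .env.prod.
printf 'JWT_SECRET_KEY=%s\n' "$jwt_secret"
printf 'CREDENTIAL_ENCRYPTION_KEY=%s\n' "$encryption_key"

# Sanity check: the encryption key must decode to exactly 32 bytes.
echo "encryption key decodes to $(decoded_len "$encryption_key") bytes" >&2
```

The length check is there because a key that was truncated while copying would no longer decode to the 32 bytes an AES-256 key needs.
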
Edit `.env.prod` with the generated values:

```env
ENVIRONMENT=production
JWT_SECRET_KEY=<generated-jwt-secret>
CREDENTIAL_ENCRYPTION_KEY=<generated-encryption-key>
POSTGRES_PASSWORD=<strong-password>

# First admin user (created on first startup)
FIRST_ADMIN_EMAIL=admin@example.com
FIRST_ADMIN_PASSWORD=<strong-password>
```

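Before starting the stack, it can help to confirm that no required variable was left empty or still holds a `<placeholder>` from the template. The check below is a hypothetical pre-flight sketch (the function name `check_env_file` is not part of TOD); it only inspects the file and does not validate secret strength.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: report required variables that are missing,
# empty, or still set to a <placeholder> value in an env file.

REQUIRED_VARS="ENVIRONMENT JWT_SECRET_KEY CREDENTIAL_ENCRYPTION_KEY POSTGRES_PASSWORD FIRST_ADMIN_EMAIL FIRST_ADMIN_PASSWORD"

check_env_file() {
  local file="$1"
  local var value status=0
  for var in $REQUIRED_VARS; do
    # Use the last assignment of the variable and strip the "NAME=" prefix.
    value=$(grep -E "^${var}=" "$file" | tail -n 1 | cut -d= -f2-)
    case "$value" in
      '' | '<'*'>')
        echo "unset or placeholder: $var"
        status=1
        ;;
    esac
  done
  return "$status"
}

# Usage: check_env_file .env.prod && echo "env file looks complete"
```
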
### 3. Build Images

Build images **one at a time** to avoid out-of-memory crashes on constrained hosts:

```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
```

### 4. Start the Stack

```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
```

### 5. Verify

```bash
# Check all services are running
docker compose ps

# Check API health (liveness)
curl http://localhost:8000/health

# Check readiness (PostgreSQL, Redis, NATS connected)
curl http://localhost:8000/health/ready

# Access the portal
# Open http://localhost in a web browser
```

Log in with the `FIRST_ADMIN_EMAIL` and `FIRST_ADMIN_PASSWORD` credentials set in step 2.

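Right after `up -d`, the readiness endpoint may fail for a short while as dependencies come up. A scripted deployment can poll it instead of checking once. The helper below is a hypothetical sketch (`retry_until_ok` is not part of TOD), and the commented usage assumes the readiness endpoint answers with a non-2xx status until its dependencies are connected.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  local attempts="$1" delay="$2"
  local i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "gave up after $attempts attempt(s)" >&2
  return 1
}

# Example (requires the running stack): poll readiness for up to about a minute.
# retry_until_ok 30 2 curl -fsS -o /dev/null http://localhost:8000/health/ready
```
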
## Environment Configuration

### Required Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `ENVIRONMENT` | Deployment environment | `production` |
| `JWT_SECRET_KEY` | JWT signing secret (min 32 chars) | `<generated>` |
| `CREDENTIAL_ENCRYPTION_KEY` | AES-256 key for device credentials (base64) | `<generated>` |
| `POSTGRES_PASSWORD` | PostgreSQL superuser password | `<strong-password>` |
| `FIRST_ADMIN_EMAIL` | Initial admin account email | `admin@example.com` |
| `FIRST_ADMIN_PASSWORD` | Initial admin account password | `<strong-password>` |

### Optional Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GUNICORN_WORKERS` | `2` | API worker process count |
| `DB_POOL_SIZE` | `20` | App database connection pool size |
| `DB_MAX_OVERFLOW` | `40` | Max overflow connections above pool |
| `DB_ADMIN_POOL_SIZE` | `10` | Admin database connection pool size |
| `DB_ADMIN_MAX_OVERFLOW` | `20` | Admin max overflow connections |
| `POLL_INTERVAL_SECONDS` | `60` | Device polling interval |
| `CONNECTION_TIMEOUT_SECONDS` | `10` | RouterOS connection timeout |
| `COMMAND_TIMEOUT_SECONDS` | `30` | RouterOS per-command timeout |
| `CIRCUIT_BREAKER_MAX_FAILURES` | `5` | Consecutive failures before backoff |
| `CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS` | `30` | Initial backoff duration |
| `CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS` | `900` | Maximum backoff (15 min) |
| `LOG_LEVEL` | `info` | Logging verbosity (`debug`/`info`/`warn`/`error`) |
| `CORS_ORIGINS` | `http://localhost:3000` | Comma-separated CORS origins |
| `TUNNEL_PORT_MIN` | `49000` | Start of WinBox tunnel port range |
| `TUNNEL_PORT_MAX` | `49100` | End of WinBox tunnel port range |
| `TUNNEL_IDLE_TIMEOUT` | `300` | WinBox tunnel idle timeout (seconds) |
| `SSH_RELAY_PORT` | `8080` | SSH relay HTTP server port |
| `SSH_IDLE_TIMEOUT` | `900` | SSH session idle timeout (seconds) |
| `SSH_MAX_SESSIONS` | `200` | Maximum concurrent SSH sessions |
| `SSH_MAX_PER_USER` | `10` | Maximum SSH sessions per user |
| `SSH_MAX_PER_DEVICE` | `20` | Maximum SSH sessions per device |

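The three `CIRCUIT_BREAKER_*` settings are easier to tune with a concrete schedule in hand. The poller's exact backoff formula is not stated in this guide, so the sketch below assumes a common scheme: no backoff below the failure threshold, then a delay that starts at the base value and doubles with each further failure until it reaches the maximum. The function name `backoff_seconds` is hypothetical.

```bash
#!/usr/bin/env bash
# ASSUMPTION: exponential doubling past the threshold. The real poller may differ.
backoff_seconds() {
  local failures="$1"
  local max_failures="${CIRCUIT_BREAKER_MAX_FAILURES:-5}"
  local base="${CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS:-30}"
  local cap="${CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS:-900}"
  local delay steps

  # Below the threshold the device is polled normally.
  if [ "$failures" -lt "$max_failures" ]; then
    echo 0
    return 0
  fi

  # Double the base delay once per failure past the threshold, up to the cap.
  delay=$base
  steps=$((failures - max_failures))
  while [ "$steps" -gt 0 ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    steps=$((steps - 1))
  done
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

for n in 4 5 6 7 8 9 10 11; do
  echo "failures=$n backoff=$(backoff_seconds "$n")s"
done
```

Under these assumptions and the default values, four failures give no backoff, failures five through nine give 30, 60, 120, 240, and 480 seconds, and the delay stays at 900 seconds from the tenth failure on.
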
### Security Notes

- **Never use default secrets in production.** The application refuses to start if it detects known insecure defaults (like the dev JWT secret) in non-dev environments.
- **Credential encryption key** is used to encrypt RouterOS device passwords at rest. Losing this key means re-entering all device credentials.
- **CORS_ORIGINS** should be set to your actual domain in production.
- **RLS enforcement**: The `app_user` database role enforces row-level security. Tenants cannot access each other's data even with a compromised JWT.

## Storage Configuration

Docker volumes are mounted from the host filesystem. Default locations are configured in `docker-compose.yml`:

- **PostgreSQL data**: `./docker-data/postgres`
- **Redis data**: `./docker-data/redis`
- **NATS data**: `./docker-data/nats`
- **Git store (config backups)**: `./docker-data/git-store`
- **Firmware cache**: `./docker-data/firmware-cache` (downloaded RouterOS firmware packages)

To change storage locations, edit the volume mounts in `docker-compose.yml`.

## Resource Limits

Container memory limits are enforced in `docker-compose.prod.yml` to prevent OOM crashes:

| Service | Memory Limit |
|---------|-------------|
| PostgreSQL | 512MB |
| Redis | 128MB |
| NATS | 128MB |
| API | 512MB |
| Poller | 512MB |
| Frontend | 64MB |
| OpenBao | 256MB |
| WireGuard | 128MB |
| WinBox Worker | 1GB |

Adjust under `deploy.resources.limits.memory` in `docker-compose.prod.yml`.

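When changing these limits, it is worth checking the total against the host's RAM. The short calculation below sums the table values; the labels on the left of each `=` are illustrative only and may not match the service names in the compose files, and 1GB is taken as 1024MB.

```bash
#!/usr/bin/env bash
# Memory limits from the table above, in MB (labels are illustrative).
limits="postgres=512 redis=128 nats=128 api=512 poller=512 frontend=64 openbao=256 wireguard=128 winbox-worker=1024"

total=0
for entry in $limits; do
  # Strip everything up to and including "=" to get the number.
  total=$((total + ${entry#*=}))
done
echo "Sum of container memory limits: ${total}MB"
```

The total is 3264MB, which is consistent with the 4GB recommendation in Prerequisites: a 4GB host keeps roughly 800MB for the operating system and Docker itself if every container reaches its limit at once.
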
> **Note:** The WinBox tunnel port range (`TUNNEL_PORT_MIN`–`TUNNEL_PORT_MAX`, default 49000–49100) must be mapped in the poller container's port bindings. Add `"49000-49100:49000-49100"` to the poller service's `ports` list in your compose file. The SSH relay port (`SSH_RELAY_PORT`, default 8080) similarly requires a port mapping if accessed directly.

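If the tunnel range is customized, the `ports` entry has to change with it. The hypothetical helper below (`tunnel_port_mapping` is not part of TOD) derives the mapping string from the two variables so the two stay in sync; it assumes the same range is used on the host and in the container, as in the default mapping.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the compose "ports" entry for the tunnel range.
tunnel_port_mapping() {
  local min="${TUNNEL_PORT_MIN:-49000}"
  local max="${TUNNEL_PORT_MAX:-49100}"
  if [ "$min" -gt "$max" ]; then
    echo "TUNNEL_PORT_MIN ($min) exceeds TUNNEL_PORT_MAX ($max)" >&2
    return 1
  fi
  # Same range on host and container, quoted as it would appear in YAML.
  printf '"%s-%s:%s-%s"\n' "$min" "$max" "$min" "$max"
}

tunnel_port_mapping
```
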
## API Documentation

The backend serves interactive API documentation at:

- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`

All endpoints include descriptions, request/response schemas, and authentication requirements.

## Kubernetes (Helm)

TOD includes a Helm chart for Kubernetes deployment at `infrastructure/helm/`.

### Prerequisites

- Kubernetes 1.28+
- Helm 3
- A StorageClass that supports ReadWriteOnce PersistentVolumeClaims

### Install

1. Create a values override file with your configuration:

   ```bash
   cp infrastructure/helm/values.yaml my-values.yaml
   # Edit my-values.yaml; at minimum set:
   # secrets.jwtSecretKey, secrets.credentialEncryptionKey,
   # secrets.dbPassword, secrets.dbAppPassword, secrets.dbPollerPassword,
   # secrets.firstAdminPassword, ingress.host
   ```

2. Install the chart:

   ```bash
   helm install tod infrastructure/helm -f my-values.yaml -n tod --create-namespace
   ```

3. Initialize OpenBao (first time only):

   ```bash
   # Wait for the pod to start
   kubectl get pods -n tod -l app.kubernetes.io/component=openbao

   # Initialize
   kubectl exec -it -n tod tod-openbao-0 -- bao operator init -key-shares=1 -key-threshold=1

   # Save the unseal key and root token, then unseal
   kubectl exec -it -n tod tod-openbao-0 -- bao operator unseal <UNSEAL_KEY>

   # Update release with the token
   helm upgrade tod infrastructure/helm -f my-values.yaml \
     --set secrets.openbaoToken=<ROOT_TOKEN> \
     --set secrets.baoUnsealKey=<UNSEAL_KEY> \
     -n tod
   ```

4. Verify:

   ```bash
   kubectl get pods -n tod
   kubectl port-forward -n tod svc/tod-api 8000:8000
   curl http://localhost:8000/health
   ```

### Services

The Helm chart deploys:

| Service | Type | Purpose |
|---------|------|---------|
| PostgreSQL (TimescaleDB) | StatefulSet | Primary database |
| Redis | Deployment | Distributed locking and rate limiting |
| NATS JetStream | StatefulSet | Message bus for device events |
| OpenBao | StatefulSet | Secrets management |
| API | Deployment | FastAPI backend |
| Frontend | Deployment | React SPA (nginx) |
| Poller | Deployment | Go device poller |
| WireGuard | Deployment | VPN gateway |
| WinBox Worker | Deployment | Browser-based WinBox sessions (Xpra) |

### Configuration

All configuration is in `values.yaml`. See `infrastructure/helm/values.yaml` for the full reference with comments. Key sections:

- `secrets.*` -- All secrets (must be overridden in production)
- `api.env.*` -- API environment settings
- `poller.env.*` -- Poller settings
- `ingress.*` -- Ingress routing and TLS
- `wireguard.*` -- VPN configuration (can be disabled with `wireguard.enabled: false`)

### Note on OpenBao

OpenBao must be manually unsealed after every pod restart. Auto-unseal is a planned future enhancement.

## Monitoring (Optional)

Enable Prometheus and Grafana monitoring with the observability compose overlay:

```bash
docker compose \
  -f docker-compose.yml \
  -f docker-compose.prod.yml \
  -f docker-compose.observability.yml \
  --env-file .env.prod up -d
```

- **Prometheus**: `http://localhost:9090`
- **Grafana**: `http://localhost:3001` (default: admin/admin; change the default password immediately on any networked host)

### Exported Metrics

The API and poller export Prometheus metrics:

| Metric | Source | Description |
|--------|--------|-------------|
| `http_requests_total` | API | HTTP request count by method, path, status |
| `http_request_duration_seconds` | API | Request latency histogram |
| `mikrotik_poll_total` | Poller | Poll cycles by status (success/error/skipped) |
| `mikrotik_poll_duration_seconds` | Poller | Poll cycle duration histogram |
| `mikrotik_devices_active` | Poller | Number of devices being polled |
| `mikrotik_circuit_breaker_skips_total` | Poller | Polls skipped due to backoff |
| `mikrotik_nats_publish_total` | Poller | NATS publishes by subject and status |

## Maintenance

### Backup Strategy

- **Database**: Use `pg_dump` or configure PostgreSQL streaming replication
- **Config backups**: Git repositories in the git-store volume (automatic nightly backups)
- **Encryption key**: Store `CREDENTIAL_ENCRYPTION_KEY` securely -- required to decrypt device credentials

### Updating

```bash
# Back up the database before upgrading
# (-T disables TTY allocation so the redirected dump is written unmodified)
docker compose exec -T postgres pg_dump -U postgres tod > backup-$(date +%Y%m%d).sql

git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
```

Database migrations run automatically on API startup via Alembic.

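Dated dump files accumulate with every upgrade. A retention step can keep only the most recent ones. The function below is a hypothetical sketch (`prune_backups` is not part of TOD); it assumes the `backup-YYYYMMDD.sql` naming used above, so that sorting by name is the same as sorting by date.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep the newest <keep> dumps in <dir>, delete the rest.
prune_backups() {
  local dir="$1" keep="$2"
  # Date-stamped names sort chronologically, so reverse order is newest first;
  # everything after the first $keep entries is removed.
  find "$dir" -maxdepth 1 -type f -name 'backup-*.sql' | sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -- "$old"
      echo "removed $old"
    done
}

# Usage: keep the seven most recent dumps in the current directory.
# prune_backups . 7
```
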
### Logs

```bash
# All services
docker compose logs -f

# Specific service
docker compose logs -f api

# Filter structured JSON logs with jq
docker compose logs api --no-log-prefix 2>&1 | jq 'select(.event != null)'

# View audit logs (config editor operations)
# (.event // "") keeps jq from erroring on entries that have no event field
docker compose logs api --no-log-prefix 2>&1 | jq 'select((.event // "") | startswith("routeros_"))'
```

### Graceful Shutdown

All services handle SIGTERM for graceful shutdown:

- **API (gunicorn)**: Finishes in-flight requests within `GUNICORN_GRACEFUL_TIMEOUT` (default 30s), then disposes database connection pools
- **Poller (Go)**: Cancels all device polling goroutines via context propagation, waits for in-flight polls to complete
- **Frontend (nginx)**: Stops accepting new connections and finishes serving active requests

```bash
# Graceful stop (sends SIGTERM, then SIGKILL to anything still running after 30s)
docker compose stop -t 30

# Restart a single service
docker compose restart api
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| API won't start with secret error | Generate production secrets (see step 2 above) |
| Build crashes with OOM | Build images one at a time (see step 3 above) |
| Device shows offline | Check network access to device API port (8728/8729) |
| Health check fails | Check `docker compose logs api` for startup errors |
| Rate limited (429) | Wait 60 seconds or check Redis connectivity |
| Migration fails | Check `docker compose logs api` for Alembic errors |
| NATS subscriber won't start | Non-fatal -- API runs without NATS; check NATS container health |
| Poller circuit breaker active | Device unreachable; check `CIRCUIT_BREAKER_*` env vars to tune backoff |
| Frontend returns 502 after API restart | By default nginx resolves upstream hostnames once at startup; the dynamic resolver (`resolver 127.0.0.11`) in `nginx-spa.conf` avoids this. If you still see 502s, ensure the nginx config has not been overridden |

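For the 502 case in the last row, a quick check is whether the nginx config in use still contains the resolver directive. The helper below is a hypothetical sketch (`has_dynamic_resolver` is not part of TOD), and the path in the usage comment is an assumption; point it at wherever your copy of `nginx-spa.conf` lives.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file has an uncommented
# "resolver 127.0.0.11" directive.
has_dynamic_resolver() {
  grep -Eq '^[[:space:]]*resolver[[:space:]]+127\.0\.0\.11([[:space:];]|$)' "$1"
}

# Usage (path is an assumption):
# has_dynamic_resolver ./nginx-spa.conf && echo "dynamic resolver present"
```
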