docs: update all documentation for v9.7.0

- CONFIGURATION.md: fix database name (mikrotik → tod), add 5 missing
  env vars, update NATS memory to 256MB
- API.md: add 8 missing endpoint groups (sites, sectors, wireless links,
  signal history, site alerts, config backups, remote access, winbox)
- ARCHITECTURE.md: update subscriber count from 3 to 10, add v9.7
  components (sites, sectors, link discovery, signal trending, site
  alerts), add background service loops, update router count to 33
- USER-GUIDE.md: add tower/site management, wireless links, signal
  history, site alerts, and fleet map documentation
- README.md: add v9.7 features to feature list
- DEPLOYMENT.md: add winbox-worker, openbao, wireguard to service list
- SECURITY.md: add WinBox session security details

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jason Staack
2026-03-19 22:03:25 -05:00
parent 11781a822f
commit 0142107e68
7 changed files with 384 additions and 14 deletions


@@ -45,7 +45,12 @@ All API routes are mounted under the `/api` prefix.
| Device Tags | `/api/device-tags/*` | Tag-based device labeling |
| Metrics | `/api/metrics/*` | TimescaleDB device metrics (CPU, memory, traffic, wireless) |
| Wireless Issues | `/api/fleet/wireless-issues`, `/api/tenants/{id}/fleet/wireless-issues` | APs with degraded signal, CCQ, or dropped clients |
-| Config Backups | `/api/config-backups/*` | Automated RouterOS config backup history |
| Sites | `/api/tenants/{id}/sites/*` | Site CRUD, device-to-site assignment |
| Sectors | `/api/tenants/{id}/sites/{sid}/sectors/*` | Sector CRUD, device sector assignment |
| Wireless Links | `/api/tenants/{id}/links`, `/api/tenants/{id}/devices/{did}/links` | Link listing, RF stats, registrations |
| Signal History | `/api/tenants/{id}/devices/{did}/signal-history` | Per-client signal strength trending |
| Site Alerts | `/api/tenants/{id}/sites/{sid}/alert-rules/*`, `/api/tenants/{id}/alert-events/*` | Site-scoped alert rules and events |
| Config Backups | `/api/tenants/{id}/devices/{did}/config/*` | Config backup timeline, restore, schedules |
| Config Editor | `/api/config-editor/*` | Live RouterOS config browsing and editing |
| Firmware | `/api/firmware/*` | RouterOS firmware version management and upgrades |
| Alerts | `/api/alerts/*` | Alert rule CRUD, alert history |
@@ -59,6 +64,8 @@ All API routes are mounted under the `/api` prefix.
| Reports | `/api/reports/*` | PDF report generation (Jinja2 + WeasyPrint) |
| API Keys | `/api/api-keys/*` | API key CRUD |
| Maintenance Windows | `/api/maintenance-windows/*` | Scheduled maintenance window management |
| Remote Access | `/api/tenants/{id}/devices/{did}/*-session` | SSH terminal and WinBox tunnel sessions |
| WinBox Remote | `/api/tenants/{id}/devices/{did}/winbox-remote-sessions/*` | Browser-based WinBox sessions (Xpra) |
| VPN | `/api/vpn/*` | WireGuard VPN tunnel management |
| Certificates | `/api/certificates/*` | Internal CA and device certificate management |
| Transparency | `/api/transparency/*` | KMS access event dashboard |
@@ -113,6 +120,144 @@ Endpoints enforce role-based access control. The four roles in descending privil
| `operator` | Tenant | Device operations, config changes |
| `viewer` | Tenant | Read-only access |
## Sites
Manage tower/site locations and assign devices to them.
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `/api/tenants/{tenant_id}/sites` | viewer | List all sites with health rollup |
| `GET` | `/api/tenants/{tenant_id}/sites/{site_id}` | viewer | Get a single site with health rollup |
| `POST` | `/api/tenants/{tenant_id}/sites` | operator | Create a site |
| `PUT` | `/api/tenants/{tenant_id}/sites/{site_id}` | operator | Update a site |
| `DELETE` | `/api/tenants/{tenant_id}/sites/{site_id}` | admin | Delete a site |
| `POST` | `/api/tenants/{tenant_id}/sites/{site_id}/devices/{device_id}` | operator | Assign a device to a site |
| `DELETE` | `/api/tenants/{tenant_id}/sites/{site_id}/devices/{device_id}` | operator | Remove a device from a site |
| `POST` | `/api/tenants/{tenant_id}/sites/{site_id}/devices/bulk-assign` | operator | Bulk-assign devices to a site |
## Sectors
Manage radio sectors within a site and assign devices to them.
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `/api/tenants/{tenant_id}/sites/{site_id}/sectors` | viewer | List sectors for a site with device counts |
| `POST` | `/api/tenants/{tenant_id}/sites/{site_id}/sectors` | operator | Create a sector |
| `PUT` | `/api/tenants/{tenant_id}/sites/{site_id}/sectors/{sector_id}` | operator | Update a sector |
| `DELETE` | `/api/tenants/{tenant_id}/sites/{site_id}/sectors/{sector_id}` | admin | Delete a sector |
| `PUT` | `/api/tenants/{tenant_id}/devices/{device_id}/sector` | operator | Set or clear a device's sector assignment |
## Wireless Links
Read-only endpoints for wireless link topology, RF stats, and registrations.
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `/api/tenants/{tenant_id}/links` | viewer | List all wireless links (optional `state` and `device_id` query filters) |
| `GET` | `/api/tenants/{tenant_id}/devices/{device_id}/links` | viewer | List links where the device is AP or CPE |
| `GET` | `/api/tenants/{tenant_id}/sites/{site_id}/links` | viewer | List links where either side belongs to the site |
| `GET` | `/api/tenants/{tenant_id}/devices/{device_id}/registrations` | viewer | Latest wireless registration data per MAC |
| `GET` | `/api/tenants/{tenant_id}/devices/{device_id}/rf-stats` | viewer | Latest RF monitor stats per interface |
| `GET` | `/api/tenants/{tenant_id}/devices/{device_id}/unknown-clients` | viewer | Wireless clients whose MAC doesn't match any known device |
## Signal History
Time-bucketed signal strength trending for wireless clients.
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `/api/tenants/{tenant_id}/devices/{device_id}/signal-history` | viewer | Get signal history for a client MAC |
Query parameters:
- `mac_address` (required) -- client MAC address
- `range` -- time range: `24h`, `7d`, or `30d` (default `7d`)
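As a rough illustration, a client could assemble the request URL as sketched here. The base URL, the IDs, and the helper name `signal_history_url` are hypothetical; only the path, the `mac_address` and `range` parameters, and the allowed range values come from the documentation above.

```python
from urllib.parse import urlencode

ALLOWED_RANGES = {"24h", "7d", "30d"}  # values listed in the query parameter docs


def signal_history_url(base_url, tenant_id, device_id, mac_address, range_="7d"):
    """Build the signal-history URL for one client MAC (hypothetical helper)."""
    if range_ not in ALLOWED_RANGES:
        raise ValueError(f"range must be one of {sorted(ALLOWED_RANGES)}")
    # urlencode percent-escapes the colons in the MAC address.
    query = urlencode({"mac_address": mac_address, "range": range_})
    return (
        f"{base_url}/api/tenants/{tenant_id}/devices/{device_id}"
        f"/signal-history?{query}"
    )
```

Validating `range` on the client side is a convenience only; the server remains the authority on which values it accepts.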
## Site Alerts
Site-scoped alert rules and alert events.
### Alert Rules
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `/api/tenants/{tenant_id}/sites/{site_id}/alert-rules` | viewer | List alert rules (optional `sector_id` filter) |
| `GET` | `/api/tenants/{tenant_id}/sites/{site_id}/alert-rules/{rule_id}` | viewer | Get a single alert rule |
| `POST` | `/api/tenants/{tenant_id}/sites/{site_id}/alert-rules` | operator | Create an alert rule |
| `PUT` | `/api/tenants/{tenant_id}/sites/{site_id}/alert-rules/{rule_id}` | operator | Update an alert rule |
| `DELETE` | `/api/tenants/{tenant_id}/sites/{site_id}/alert-rules/{rule_id}` | operator | Delete an alert rule |
### Alert Events
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `/api/tenants/{tenant_id}/sites/{site_id}/alert-events` | viewer | List alert events (optional `state` filter, `limit` up to 200) |
| `POST` | `/api/tenants/{tenant_id}/alert-events/{event_id}/resolve` | operator | Resolve an active alert event |
| `GET` | `/api/tenants/{tenant_id}/alert-events/count` | viewer | Active alert event count (notification badge) |
## Config Backups
Device config backup timeline, restore, and schedule management. Routes are scoped under `/api/tenants/{tenant_id}/devices/{device_id}/`, abbreviated as `...` in the tables below; all but the snapshot trigger sit under its `config/` path.
### Backup Timeline
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `.../config/backups` | viewer | List backup timeline for a device (newest first) |
| `POST` | `.../config/backups` | operator | Trigger a manual config backup |
| `POST` | `.../config/checkpoint` | operator | Create a checkpoint (named restore point) |
| `GET` | `.../config/backups/{commit_sha}/export` | viewer | Download export.rsc text for a backup version |
| `GET` | `.../config/backups/{commit_sha}/binary` | viewer | Download backup.bin for a backup version |
### Restore
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `POST` | `.../config/preview-restore` | operator | Preview impact analysis before restoring a config version |
| `POST` | `.../config/restore` | operator | Restore a config version (two-phase push with panic-revert) |
| `POST` | `.../config/emergency-rollback` | operator | Roll back to the most recent pre-push backup |
### Schedules
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `GET` | `.../config/schedules` | viewer | Get effective backup schedule (device override or tenant default) |
| `PUT` | `.../config/schedules` | operator | Create or update device-specific schedule override |
### Config Snapshot
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `POST` | `.../config-snapshot/trigger` | operator | Trigger immediate config snapshot via the Go poller (NATS) |
## Remote Access
SSH terminal and WinBox tunnel sessions. All routes are scoped under `/api/tenants/{tenant_id}/devices/{device_id}/`. Requires operator role or above.
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `POST` | `.../winbox-session` | operator | Open a WinBox tunnel (returns tunnel_id, host, port, winbox:// URI) |
| `DELETE` | `.../winbox-session/{tunnel_id}` | operator | Close a WinBox tunnel (idempotent) |
| `POST` | `.../ssh-session` | operator | Create a single-use SSH WebSocket session token (120s TTL) |
| `GET` | `.../sessions` | operator | List active WinBox tunnels and remote sessions for a device |
The SSH session token authorises a subsequent WebSocket connection at `/ws/ssh?token=<token>`.
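A minimal sketch of how a client might turn the returned token into the WebSocket URL. It assumes the WebSocket endpoint is served from the same host as the API, which the documentation does not state; both helper names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit

SSH_TOKEN_TTL_SECONDS = 120  # TTL stated for the ssh-session token


def ssh_websocket_url(api_base_url, token):
    """Build the /ws/ssh URL, assuming it shares the API's host (assumption)."""
    parts = urlsplit(api_base_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    return f"{scheme}://{parts.netloc}/ws/ssh?{urlencode({'token': token})}"


def token_still_valid(issued_at, now, ttl=SSH_TOKEN_TTL_SECONDS):
    """Client-side estimate of whether the token is still inside its TTL."""
    return (now - issued_at) < ttl
```

Because the token is single-use, a client would request a fresh one for every connection attempt rather than caching it.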
## WinBox Remote (Browser)
Xpra-based in-browser WinBox sessions. All routes are scoped under `/api/tenants/{tenant_id}/devices/{device_id}/winbox-remote-sessions/`. Requires operator role or above.
| Method | Endpoint | RBAC | Description |
|--------|----------|------|-------------|
| `POST` | `.../winbox-remote-sessions` | operator | Create a browser WinBox session |
| `GET` | `.../winbox-remote-sessions` | operator | List active sessions for a device |
| `GET` | `.../winbox-remote-sessions/{session_id}` | operator | Get session status |
| `DELETE` | `.../winbox-remote-sessions/{session_id}` | operator | Terminate a session (idempotent) |
| `GET` | `.../winbox-remote-sessions/{session_id}/xpra/{path}` | operator | Proxy Xpra HTML5 client files |
| `WS` | `.../winbox-remote-sessions/{session_id}/ws` | operator | WebSocket proxy (browser to Xpra worker) |
Session creation returns a `websocket_path` for the Xpra WebSocket connection. Sessions enforce idle timeout (default 600s) and max lifetime (default 7200s).
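The two limits can be pictured as a pair of independent checks. This is a sketch only: the `SessionTimes` type and `termination_reason` function are hypothetical, and only the default values (600s idle, 7200s lifetime) come from the text above.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_SECONDS = 600    # documented default idle timeout
MAX_LIFETIME_SECONDS = 7200   # documented default max lifetime


@dataclass
class SessionTimes:
    created_at: float      # seconds since epoch
    last_activity: float   # seconds since epoch


def termination_reason(session, now,
                       idle_timeout=IDLE_TIMEOUT_SECONDS,
                       max_lifetime=MAX_LIFETIME_SECONDS):
    """Return why a session should end, or None if it may continue."""
    if now - session.created_at >= max_lifetime:
        return "max_lifetime"
    if now - session.last_activity >= idle_timeout:
        return "idle"
    return None
```

The lifetime check runs first so that an active but long-running session is still ended once it reaches the cap.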
## Multi-Tenancy
Tenant isolation is enforced at the database level via PostgreSQL Row-Level Security (RLS). The `app_user` database role automatically filters all queries by the authenticated user's `tenant_id`. Super admins operate outside tenant scope.


@@ -44,10 +44,24 @@ TOD (The Other Dude) is a containerized MSP fleet management platform for MikroT
- `admin_engine` (superuser) -- used only for auth/bootstrap and NATS subscribers that need cross-tenant access
- `app_engine` (non-superuser `app_user` role) -- used for all device/data routes, enforces RLS
- **Authentication**: JWT tokens (15min access, 7d refresh), SRP-6a zero-knowledge proof, RBAC (super_admin, admin, operator, viewer)
-- **NATS subscribers**: Three independent subscribers for device status, metrics, and firmware events. Non-fatal startup -- API serves requests even if NATS is unavailable
-- **Background services**: APScheduler for nightly config backups and daily firmware version checks
- **NATS subscribers**: Ten independent subscribers, each on its own NATS connection. Non-fatal startup -- API serves requests even if NATS is unavailable:
- `nats_subscriber` -- device status events
- `metrics_subscriber` -- device metrics (CPU, memory, interface counters)
- `firmware_subscriber` -- firmware version events
- `session_audit_subscriber` -- SSH session auditing
- `config_change_subscriber` -- event-driven config backups
- `push_rollback_subscriber` -- config push rollback and alerting
- `config_snapshot_subscriber` -- config snapshot ingestion (Go poller -> PostgreSQL via Transit encryption)
- `wireless_registration_subscriber` -- per-client wireless registration data
- `interface_subscriber` -- device interface MAC resolution for link discovery
- `link_discovery_subscriber` -- wireless link state machine (MAC-based AP/CPE pairing)
- **Background services**:
- APScheduler: nightly config backups, daily firmware version checks, retention cleanup (24h cycle)
- WinBox session reconciliation loop (60s cycle) -- detects orphaned sessions and cleans up Redis + tunnels
- Signal trend detection loop (hourly) -- identifies sustained signal degradation across wireless clients
- Site alert evaluation loop (5-minute cycle) -- evaluates geographic-scoped alert rules with hysteresis
- **OpenBao integration**: Provisions per-tenant Transit encryption keys on startup, dual-read fallback if OpenBao is unavailable
-- **Startup sequence**: Configure logging -> Run Alembic migrations -> Bootstrap first admin -> Start NATS subscribers -> Ensure SSE streams -> Start schedulers -> Provision OpenBao keys
- **Startup sequence**: Configure logging -> Run Alembic migrations -> Bootstrap first admin -> Start NATS subscribers (10) -> Ensure SSE streams -> Start schedulers -> Provision OpenBao keys -> Recover stale push operations -> Start background loops (reconciliation, trend detection, site alerts)
- **API documentation**: OpenAPI docs at `/docs` and `/redoc` (dev environment only)
- **Health endpoints**: `/health` (liveness), `/health/ready` (readiness -- checks PostgreSQL, Redis, NATS)
- **Middleware stack** (LIFO order): RequestID -> SecurityHeaders -> RateLimiting -> CORS -> Route handler
@@ -55,7 +69,7 @@ TOD (The Other Dude) is a containerized MSP fleet management platform for MikroT
#### API Routers
-The backend exposes 25 route groups under the `/api` prefix:
The backend exposes 33 route groups under the `/api` prefix:
| Router | Purpose |
|--------|---------|
@@ -84,6 +98,14 @@ The backend exposes 25 route groups under the `/api` prefix:
| `certificates` | Internal CA and device TLS certificates |
| `settings` | System settings (SMTP configuration, super_admin only) |
| `transparency` | KMS access event dashboard |
| `remote_access` | SSH remote access sessions |
| `winbox_remote` | WinBox browser-based remote sessions |
| `sites` | Site management (hierarchical device organization) |
| `sectors` | Sector definitions within sites (antenna/coverage zones) |
| `links` | Wireless link discovery and state tracking |
| `signal_history` | Per-client signal strength history and trends |
| `site_alerts` | Geographic-scoped alert rules and events |
| `config` | Config push operations (two-phase with panic revert) |
### Go Poller
@@ -135,7 +157,7 @@ The backend exposes 25 route groups under the `/api` prefix:
- **Durable consumers**: Ensure no message loss during API restarts
- **Monitoring port**: 8222
- **Data volume**: `./docker-data/nats`
-- **Memory limit**: 128MB
- **Memory limit**: 256MB
### OpenBao (HashiCorp Vault fork)
@@ -245,6 +267,48 @@ Browser API PostgreSQL
- `poller_user` bypasses RLS intentionally (needs cross-tenant device access for polling)
- Tenant isolation is enforced at the database level, not the application level -- even a compromised API cannot leak cross-tenant data through `app_user` connections
## Sites & Sectors
The site management subsystem provides hierarchical device organization for tower-based wireless deployments.
- **Sites**: Named geographic locations (towers, POPs, huts) with optional latitude/longitude coordinates
- **Sectors**: Coverage zones within a site, representing individual antenna faces or radio segments. Each sector belongs to exactly one site and can have one or more devices assigned
- **Device assignment**: Devices are assigned to sectors, inheriting site membership. A device belongs to at most one sector at a time
- **Site health**: Aggregate health status is derived from the devices within a site's sectors -- if any device is down, the site status reflects it
## Wireless Link Discovery
MAC-based automatic detection of AP-to-CPE wireless links.
- **Interface subscriber**: Ingests device interface data from NATS, building a MAC-to-device lookup table
- **Wireless registration subscriber**: Processes per-client wireless registration events, capturing connected MACs and signal data
- **Link discovery subscriber**: Correlates AP registration tables with CPE interface MACs to identify links between managed devices
- **State machine**: Each discovered link transitions through states based on signal quality and reachability:
- `discovered` -- initial detection, not yet confirmed
- `active` -- confirmed bidirectional link with acceptable signal
- `degraded` -- signal below threshold or intermittent connectivity
- `down` -- link lost (device unreachable or deregistered)
- `stale` -- no update received within the retention window
- **Automatic pairing**: When an AP's registration table contains a MAC belonging to a managed CPE, a link record is created without manual configuration
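The MAC correlation described above can be sketched roughly as follows. The data shapes (a MAC-to-device dict and a per-AP list of registered client MACs) and the function name `discover_links` are assumptions made for illustration, not the subscribers' actual structures.

```python
def discover_links(mac_to_device, registrations):
    """Pair APs with managed CPEs by matching registered client MACs.

    mac_to_device: dict mapping interface MAC -> managed device ID
    registrations: dict mapping AP device ID -> iterable of client MACs
    Returns a sorted list of (ap_id, cpe_id) tuples.
    """
    # Normalise case so "aa:bb:..." and "AA:BB:..." compare equal.
    lookup = {mac.upper(): dev for mac, dev in mac_to_device.items()}
    links = set()
    for ap_id, client_macs in registrations.items():
        for mac in client_macs:
            cpe_id = lookup.get(mac.upper())
            # MACs with no managed owner are skipped; they are the
            # "unknown clients" case rather than a link.
            if cpe_id is not None and cpe_id != ap_id:
                links.add((ap_id, cpe_id))
    return sorted(links)
```

State transitions (`discovered`, `active`, and so on) are left out here because the thresholds that drive them are not specified in this section.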
## Signal History & Trend Detection
Per-client signal strength tracking with automatic degradation alerting.
- **Signal history**: Records signal strength samples for each wireless client over time, stored in TimescaleDB for efficient time-range queries
- **Trend detection loop** (hourly): Analyzes recent signal history to identify sustained degradation. When a client's signal drops below threshold for a configurable window, the system creates a site alert event with rule type `signal_degradation`. Auto-resolves when signal recovers
- **Retention**: Signal history samples are subject to the same retention cleanup as other time-series data
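One way to picture the degradation check is a comparison between a baseline window and a recent window. This is only a sketch: interpreting the threshold as a drop in average signal (in dB) between two windows is an assumption, and `is_degraded` is a hypothetical name.

```python
from statistics import mean


def is_degraded(baseline_dbm, recent_dbm, threshold_db=5.0):
    """Flag a client whose average signal fell by at least threshold_db.

    baseline_dbm / recent_dbm: lists of signal samples in dBm
    (more negative means weaker). Empty windows are never flagged.
    """
    if not baseline_dbm or not recent_dbm:
        return False
    drop = mean(baseline_dbm) - mean(recent_dbm)
    return drop >= threshold_db
```

For example, a baseline averaging -60 dBm and a recent window averaging -66 dBm is a 6 dB drop, which exceeds a 5 dB threshold.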
## Site Alert Rules
Geographic-scoped alerting distinct from per-device alerts.
- **Rule types**: Configurable rules scoped to a site (e.g., "alert when more than N devices are down at site X", signal degradation thresholds)
- **Evaluation loop** (5-minute cycle): Evaluates all enabled site alert rules against current data
- **Hysteresis**: Rules require consecutive hits (default 2) before confirming an alert, preventing flapping from transient conditions
- **Event lifecycle**: Alert events are created when rules trigger and auto-resolved when conditions clear. Manual resolution is also supported
- **Separation from device alerts**: Site alerts operate independently from the per-device alert system, allowing operators to set geographic thresholds without duplicating device-level rules
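The hysteresis behaviour can be illustrated with a small counter. The class below is a hypothetical sketch, not the evaluation loop itself; only the default of 2 consecutive hits and the auto-resolve behaviour come from the description above.

```python
class HysteresisCounter:
    """Confirm an alert only after N consecutive evaluations hit."""

    def __init__(self, required_hits=2):
        self.required_hits = required_hits
        self.hits = 0
        self.active = False

    def observe(self, condition_met):
        """Record one evaluation cycle and return whether the alert is active."""
        if condition_met:
            self.hits += 1
            if self.hits >= self.required_hits:
                self.active = True
        else:
            # A single clear evaluation resets the streak and auto-resolves.
            self.hits = 0
            self.active = False
        return self.active
```

With the default setting, a condition that is true for one cycle and then clears never raises an alert, which is the flapping case the hysteresis is meant to suppress.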
## Security Layers
| Layer | Mechanism | Purpose |
@@ -285,7 +349,7 @@ backend/ FastAPI Python backend
config.py Pydantic Settings configuration
database.py SQLAlchemy engines (admin + app_user)
models/ SQLAlchemy ORM models
-routers/ FastAPI route handlers (25 modules)
routers/ FastAPI route handlers (33 modules)
services/ Business logic, NATS subscribers, schedulers
middleware/ Rate limiting, request ID, security headers
frontend/ React TypeScript frontend
@@ -332,6 +396,6 @@ docker compose build frontend
| Go Poller | 512MB |
| OpenBao | 256MB |
| Redis | 128MB |
-| NATS | 128MB |
| NATS | 256MB |
| WireGuard | 128MB |
| Frontend (nginx) | 64MB |


@@ -29,11 +29,12 @@ TOD uses Pydantic Settings for configuration. All values can be set via environm
| Variable | Default | Description |
|----------|---------|-------------|
-| `DATABASE_URL` | `postgresql+asyncpg://postgres:postgres@localhost:5432/mikrotik` | Admin (superuser) async database URL. Used for migrations and bootstrap operations. |
-| `SYNC_DATABASE_URL` | `postgresql+psycopg2://postgres:postgres@localhost:5432/mikrotik` | Synchronous database URL used by Alembic migrations only. |
-| `APP_USER_DATABASE_URL` | `postgresql+asyncpg://app_user:app_password@localhost:5432/mikrotik` | Non-superuser async database URL. Enforces PostgreSQL RLS for tenant isolation. |
| `DATABASE_URL` | `postgresql+asyncpg://postgres:postgres@localhost:5432/tod` | Admin (superuser) async database URL. Used for migrations and bootstrap operations. |
| `SYNC_DATABASE_URL` | `postgresql+psycopg2://postgres:postgres@localhost:5432/tod` | Synchronous database URL used by Alembic migrations only. |
| `APP_USER_DATABASE_URL` | `postgresql+asyncpg://app_user:app_password@localhost:5432/tod` | Non-superuser async database URL. Enforces PostgreSQL RLS for tenant isolation. |
| `DB_POOL_SIZE` | `20` | App user connection pool size |
| `DB_MAX_OVERFLOW` | `40` | App user pool max overflow connections |
| `DB_POOL_RECYCLE` | `1847` | Connection pool recycle time in seconds |
| `DB_ADMIN_POOL_SIZE` | `10` | Admin connection pool size |
| `DB_ADMIN_MAX_OVERFLOW` | `20` | Admin pool max overflow connections |
@@ -82,6 +83,20 @@ OpenBao is the key management service used to encrypt device credentials on a pe
| `FIRMWARE_CACHE_DIR` | `/data/firmware-cache` | Path to firmware download cache (PVC mount in production) |
| `FIRMWARE_CHECK_INTERVAL_HOURS` | `24` | Hours between automatic RouterOS version checks |
### Signal Trending & Site Alerting
| Variable | Default | Description |
|----------|---------|-------------|
| `SIGNAL_DEGRADATION_THRESHOLD_DB` | `5` | Signal degradation threshold in dB for trend detection |
| `ALERT_EVALUATION_INTERVAL_SECONDS` | `300` | How often site alert rules are evaluated |
| `TREND_DETECTION_INTERVAL_SECONDS` | `3600` | How often signal trending analysis runs |
### Retention
| Variable | Default | Description |
|----------|---------|-------------|
| `CONFIG_RETENTION_DAYS` | `90` | How long config snapshots are retained |
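To show how the trending, alerting, and retention variables resolve against their defaults, here is a standard-library sketch. TOD itself uses Pydantic Settings; the `TrendingSettings` class is hypothetical, and parsing every value as an integer is an assumption.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrendingSettings:
    # Defaults mirror the tables above; field names are the lowercased env vars.
    signal_degradation_threshold_db: int = 5
    alert_evaluation_interval_seconds: int = 300
    trend_detection_interval_seconds: int = 3600
    config_retention_days: int = 90

    @classmethod
    def from_env(cls, env=None):
        """Read overrides from env (defaults to os.environ); unset vars keep defaults."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(f.name.upper())
            if raw is not None:
                overrides[f.name] = int(raw)
        return cls(**overrides)
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the process environment.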
### Storage Paths
| Variable | Default | Description |
@@ -141,7 +156,7 @@ All containers have enforced memory limits to prevent OOM on the host:
|---------|-------------|
| PostgreSQL | 512 MB |
| Redis | 128 MB |
-| NATS | 128 MB |
| NATS | 256 MB |
| API | 512 MB |
| Poller | 256 MB |
| Frontend | 64 MB |


@@ -12,6 +12,9 @@ TOD (The Other Dude) is a containerized fleet management platform for RouterOS d
- **PostgreSQL + TimescaleDB** -- Primary database with time-series extensions
- **Redis** -- Distributed locking and rate limiting
- **NATS JetStream** -- Message bus for device events
- **OpenBao** -- Secrets management (Transit encryption for credentials, config backups, audit logs)
- **WireGuard** -- VPN gateway for isolated device networks
- **WinBox Worker** -- Xpra-based container for browser WinBox sessions (runs on linux/amd64, 1GB memory limit)
## Prerequisites
@@ -159,6 +162,9 @@ Container memory limits are enforced in `docker-compose.prod.yml` to prevent OOM
| API | 512MB |
| Poller | 512MB |
| Frontend | 64MB |
| OpenBao | 256MB |
| WireGuard | 128MB |
| WinBox Worker | 1GB |
Adjust under `deploy.resources.limits.memory` in `docker-compose.prod.yml`.
@@ -238,6 +244,7 @@ The Helm chart deploys:
| Frontend | Deployment | React SPA (nginx) |
| Poller | Deployment | Go device poller |
| WireGuard | Deployment | VPN gateway |
| WinBox Worker | Deployment | Browser-based WinBox sessions (Xpra) |
### Configuration


@@ -25,7 +25,11 @@ The Other Dude is a self-hosted, multi-tenant platform (one installation serves
- **Dashboard** -- At-a-glance fleet health with device counts, uptime sparklines, status breakdowns per organization, and an "APs Needing Attention" card highlighting wireless issues.
- **Device Management** -- Detailed device pages with system info, interfaces, routes, firewall rules, DHCP leases, and real-time resource metrics.
- **Fleet Table** -- Virtual-scrolled table that handles hundreds of devices without breaking a sweat.
-- **Device Map** -- Geographic view of device locations.
- **Tower & Site Management** -- Organize devices by physical location. Sites represent towers or equipment rooms; sectors subdivide them by antenna direction with azimuth bearings. Health grid shows per-device CPU, memory, and uptime at a glance.
- **Wireless Link Discovery** -- Automatic AP-to-CPE link detection with real-time signal strength, CCQ, TX/RX rates, and a five-state health model (discovered, active, degraded, down, stale).
- **Signal History & Trend Detection** -- Per-client signal history charts with min/avg/max trends over 24-hour, 7-day, and 30-day windows. Color-banded thresholds highlight degradation at a glance.
- **Site-Level Alert Rules** -- Threshold-based alerts scoped to sites and sectors: device offline percentage, sector signal average, client drop detection, and signal degradation.
- **Fleet Map** -- Geographic map with status-colored markers and automatic clustering. Cluster colors reflect aggregate device health across a region.
- **Subnet Scanner** -- Discover new RouterOS devices on your network and onboard them in clicks.
### Configuration


@@ -96,6 +96,7 @@ TOD includes on-demand WinBox tunnels and browser-based SSH terminals for device
- **Audit trail:** Tunnel open/close events and SSH session start/end events are recorded in the immutable audit log with device ID, user ID, source IP, and timestamp.
- **WinBox tunnel binding:** TCP proxies for WinBox connections are bound to `127.0.0.1` only. Tunnels are never exposed on `0.0.0.0` and cannot be reached from outside the host without explicit port forwarding.
- **Idle-timeout cleanup:** Inactive tunnels are closed automatically after `TUNNEL_IDLE_TIMEOUT` seconds (default 300). SSH sessions time out after `SSH_IDLE_TIMEOUT` seconds (default 900). Resources are reclaimed immediately on disconnect.
- **WinBox Browser sessions:** WinBox sessions use single-use session IDs stored in Redis with a short TTL. The browser connects via a WebSocket proxy -- never directly to the device. Sessions follow a strict lifecycle (`creating` -> `active` -> `grace` -> `terminated`) with automatic cleanup at each stage. Device credentials are decrypted server-side via the OpenBao Transit engine and are never sent to the browser. Session creation is rate-limited to 3 requests per 5 minutes per user.
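The per-user limit on session creation (3 requests per 5 minutes) behaves like a sliding-window limiter. The in-memory class below is a sketch under that assumption; the real limiter presumably lives in shared storage such as Redis, and `SlidingWindowLimiter` is a hypothetical name.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per user within window_seconds."""

    def __init__(self, max_requests=3, window_seconds=300):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> accepted request times

    def allow(self, user_id, now):
        """Return True and record the request if the user is under the limit."""
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_requests:
            return False
        events.append(now)
        return True
```

Rejected requests are not recorded, so a user who keeps retrying does not extend their own lockout.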
## Network Security


@@ -36,7 +36,9 @@ TOD uses a collapsible sidebar with four sections. Press `[` to toggle the sideb
|------|-------------|
| **Dashboard** | Overview of your fleet with device status cards, active alerts, metrics sparklines, and "APs Needing Attention" wireless health card. The landing page after login. |
| **Devices** | Fleet table with search, sort, and filter. Click any device row to open its detail page. |
| **Sites** | Tower and site management -- organize devices by physical location with sectors, health monitoring, wireless links, and site-scoped alerts. |
| **Wireless Links** | Fleet-wide view of all discovered AP-to-CPE wireless connections with signal, CCQ, TX/RX rates, and link state. |
| **Map** | Geographic fleet map with status-colored markers and automatic clustering. Devices with coordinates appear on the map; clusters reflect aggregate health (green = all online, red = all offline, amber = mixed). |
### Manage
@@ -236,6 +238,138 @@ TOD supports dark and light modes:
---
## Tower & Site Management
Sites represent physical locations in your network -- towers, rooftops, equipment rooms, or any place where you deploy devices. Sectors let you subdivide a site by antenna direction. Together they give you a structured view of your wireless infrastructure.
### Creating a Site
1. Navigate to **Fleet > Sites** in the sidebar.
2. Click **New Site**.
3. Fill in the site details:
   - **Name** (required) -- a descriptive label for the location (e.g., "North Ridge Tower").
   - **Address** -- street address or landmark description.
   - **Latitude / Longitude** -- GPS coordinates. Devices at this site inherit these coordinates on the fleet map.
   - **Elevation** -- tower or rooftop height in meters.
   - **Notes** -- free-text field for internal reference.
4. Click **Create Site**.
The Sites list shows all sites with search filtering. Click any site to open its detail page.
### Site Detail Page
The site detail page shows a summary header with device count, online count, online percentage, and active alert count. Four tabs provide deeper views:
| Tab | Description |
|-----|-------------|
| **Health Grid** | Card grid of every device assigned to the site showing live CPU, memory, and uptime. Cards are color-coded by status (green = online, red = offline). Click any card to open the device detail page. |
| **Sectors** | Sector-based view of devices and their connected CPE clients. Shows per-sector aggregate stats (client count, average signal, link count). |
| **Links** | Table of all wireless links at the site, grouped by AP, with signal strength, CCQ, TX/RX rates, link state, and expandable signal history charts. |
| **Alerts** | Site-scoped alert rules and alert event history. Create and manage rules that apply to this specific site or sector. |
### Creating Sectors
Sectors organize access points within a site by antenna direction (e.g., "North 0-120" or "South Sector"). To create a sector:
1. Open a site detail page and switch to the **Sectors** tab.
2. Click **Add Sector**.
3. Enter:
   - **Name** (required) -- a label for the sector direction (e.g., "North Sector").
   - **Azimuth** -- compass bearing in degrees (0-360) representing the antenna direction. 0 is north, 90 is east, 180 is south, 270 is west.
   - **Description** -- optional notes about the sector.
4. Click **Create Sector**.
Each sector section is collapsible and shows a header with device count, connected client count, average signal strength, and link count. Devices within a sector are listed with their connected CPEs and link states inline.
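
To make the azimuth convention concrete (0 is north, 90 is east, 180 is south, 270 is west), here is a minimal sketch that maps a bearing to the nearest of eight compass points. The function name and the eight-point resolution are illustrative choices, not part of TOD.

```python
COMPASS_POINTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def azimuth_to_compass(azimuth_deg: float) -> str:
    """Map a bearing in degrees to the nearest of eight compass points."""
    normalized = azimuth_deg % 360  # 360 wraps around to 0 (north)
    # Each compass point covers a 45-degree slice centered on its bearing.
    index = int((normalized + 22.5) // 45) % 8
    return COMPASS_POINTS[index]
```

For example, a sector with azimuth 350 still reads as north, because it falls inside the 45-degree slice centered on 0.
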
### Assigning Devices to Sites and Sectors
Devices are assigned to a site from the device detail page or from the Sites section. Once assigned, you can further assign a device to a specific sector:
1. Open the site detail page and switch to the **Sectors** tab.
2. Each device row has a sector assignment dropdown on the right.
3. Select a sector from the dropdown to assign the device, or select **Unassigned** to remove the sector assignment.
Devices that belong to a site but have no sector assignment appear in the **Unassigned** section at the bottom of the Sectors tab.
---
## Wireless Links
TOD automatically discovers wireless connections between access points (APs) and customer premises equipment (CPEs) in your fleet. When the poller detects a registration table entry on an AP that matches a CPE device in your fleet, it creates a wireless link record.
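
The matching step can be sketched as follows. This assumes, purely for illustration, that registration entries are matched to fleet devices by MAC address; the `Device` type and `discover_links` function are hypothetical names, and the real poller may match on other fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_id: str
    mac_address: str


def normalize_mac(mac: str) -> str:
    """Normalize separators and case so equivalent MACs compare equal."""
    return mac.replace("-", ":").upper()


def discover_links(ap_id, registration_macs, fleet):
    """Return (ap_id, cpe_id) pairs for registration entries that match fleet devices.

    Assumption: a registration entry matches a device when their MACs are equal.
    """
    known = {normalize_mac(d.mac_address): d.device_id for d in fleet}
    links = []
    for mac in registration_macs:
        cpe_id = known.get(normalize_mac(mac))
        if cpe_id is not None and cpe_id != ap_id:
            links.append((ap_id, cpe_id))
    return links
```

Registration entries for clients that are not managed devices in the fleet simply produce no link record.
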
### Link States
Each wireless link has a state that reflects its current health:
| State | Meaning |
|-------|---------|
| **Discovered** | A new AP-CPE connection has been detected for the first time. |
| **Active** | The link is up with recent poll data confirming connectivity. |
| **Degraded** | The link is connected but signal or quality metrics have dropped below healthy thresholds. |
| **Down** | The link has not been seen in recent polls -- the CPE is likely disconnected. |
| **Stale** | The link has not been seen for an extended period. The connection may no longer exist. |
Link states transition automatically based on poll results and missed-poll counters.
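
One way to picture these transitions is a small state function driven by a missed-poll counter. All thresholds below (3 missed polls for down, 20 for stale, -80 dBm for degraded) are hypothetical placeholders; the guide does not state the values TOD uses.

```python
from enum import Enum


class LinkState(Enum):
    DISCOVERED = "discovered"
    ACTIVE = "active"
    DEGRADED = "degraded"
    DOWN = "down"
    STALE = "stale"


# Hypothetical thresholds, chosen only for illustration.
DOWN_AFTER_MISSED = 3
STALE_AFTER_MISSED = 20
DEGRADED_SIGNAL_DBM = -80


def next_state(current, seen, missed_polls, signal_dbm=None):
    """Return (new_state, new_missed_polls) after one poll cycle."""
    if not seen:
        missed = missed_polls + 1
        if missed >= STALE_AFTER_MISSED:
            return LinkState.STALE, missed
        if missed >= DOWN_AFTER_MISSED:
            return LinkState.DOWN, missed
        return current, missed  # tolerate a few missed polls
    # The link was seen, so the missed-poll counter resets.
    if signal_dbm is not None and signal_dbm < DEGRADED_SIGNAL_DBM:
        return LinkState.DEGRADED, 0
    return LinkState.ACTIVE, 0
```

In this sketch a link recovers to active as soon as a poll sees it again with healthy signal, regardless of how long it was down or stale.
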
### Viewing Wireless Links
There are two ways to view wireless links:
- **Fleet-wide**: Navigate to **Fleet > Wireless Links** in the sidebar. This shows all discovered links across your organization, filterable by state (active, degraded, down, stale).
- **Per-site**: Open a site detail page and switch to the **Links** tab. This shows only the links associated with devices assigned to that site.
Both views group links by AP device. Each CPE row shows signal strength (dBm), CCQ percentage, TX/RX data rates, link state, and time since last seen.
### Signal History
Click any CPE row in the wireless links table to expand an inline signal history chart. The chart shows signal strength over time with three lines:
- **Average signal** (solid blue) -- the primary trend line.
- **Min / Max signal** (dashed) -- the range boundaries.
The background is color-banded: green for strong signal (above -65 dBm), yellow for moderate (-65 to -80 dBm), and red for weak (below -80 dBm).
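
The banding above reduces to two comparisons. In this minimal sketch the function name is illustrative, and the boundary values of exactly -65 and -80 dBm are treated as moderate, following the "-65 to -80 dBm" wording.

```python
def signal_band(signal_dbm: float) -> str:
    """Classify a signal reading into the chart's background bands."""
    if signal_dbm > -65:
        return "green"   # strong: above -65 dBm
    if signal_dbm >= -80:
        return "yellow"  # moderate: -65 to -80 dBm inclusive
    return "red"         # weak: below -80 dBm
```
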
Use the time range selector in the chart header to switch between **24h**, **7d**, and **30d** views. This helps you spot intermittent degradation, seasonal patterns, or gradual signal drift that might not be obvious from a single snapshot.
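
The average, minimum, and maximum lines imply that raw readings are aggregated into time buckets. A rough sketch of that aggregation follows; the hourly bucket size and the function name are assumptions, and TOD may well compute these in the database instead.

```python
from collections import defaultdict
from statistics import mean


def bucket_signal(samples, bucket_seconds=3600):
    """Aggregate (unix_timestamp, signal_dbm) samples into time buckets.

    Returns a list of (bucket_start, avg, min, max) tuples sorted by time.
    """
    buckets = defaultdict(list)
    for ts, dbm in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(dbm)
    return [
        (start, mean(values), min(values), max(values))
        for start, values in sorted(buckets.items())
    ]
```

A wide gap between the minimum and maximum within a bucket is the kind of intermittent degradation that a single snapshot would hide.
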
---
## Site Alerts
Site alert rules let you define thresholds scoped to an entire site or a specific sector, rather than individual devices. This is useful for detecting systemic issues across a tower location.
### Creating a Site Alert Rule
1. Open the site detail page and switch to the **Alerts** tab.
2. Click **Add Alert Rule**.
3. Configure the rule:
   - **Rule type** -- choose from:
     - *Device Offline Percent* -- fires when the percentage of offline devices at the site exceeds the threshold.
     - *Device Offline Count* -- fires when a specific number of devices go offline.
     - *Sector Signal Average* -- fires when the average signal strength across a sector drops below the threshold.
     - *Sector Client Drop* -- fires when the number of connected clients in a sector drops by more than the threshold.
     - *Signal Degradation* -- fires when individual link signal degrades past a threshold.
   - **Scope** -- apply the rule to the entire site or narrow it to a specific sector.
   - **Threshold** -- the numeric value and unit that triggers the alert.
   - **Severity** -- warning or critical.
4. Click **Create Rule**.
Alert events appear in the site's Alerts tab with timestamps, severity, the triggering message, and consecutive hit count. Active alerts can be resolved manually by operators.
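
As an illustration of how a *Device Offline Percent* rule might be evaluated, the sketch below computes the offline percentage for a site and tracks the consecutive hit count. The function names, the status strings, and the reset-on-recovery behavior are assumptions, not TOD's actual evaluator.

```python
def offline_percent(statuses):
    """Percentage of devices whose status is anything other than 'online'."""
    if not statuses:
        return 0.0
    offline = sum(1 for status in statuses if status != "online")
    return 100.0 * offline / len(statuses)


def evaluate_offline_percent_rule(statuses, threshold_percent, consecutive_hits=0):
    """Return (firing, new_consecutive_hits) for one evaluation cycle.

    The rule fires only when the offline percentage exceeds the threshold;
    the hit counter resets once the condition clears (an assumption).
    """
    firing = offline_percent(statuses) > threshold_percent
    return firing, (consecutive_hits + 1 if firing else 0)
```

With two of four devices offline, a 40 percent threshold fires, while a 50 percent threshold does not, because the rule requires the value to exceed the threshold.
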
---
## Fleet Map
The fleet map provides a geographic view of all devices that have coordinates assigned (either directly on the device or inherited from their site).
- Navigate to **Fleet > Map** in the sidebar.
- Devices appear as color-coded markers: **green** for online, **red** for offline.
- When devices are geographically close, they automatically cluster into numbered circles. Cluster color reflects aggregate health: green if all devices in the cluster are online, red if all are offline, and amber if mixed.
- Click a cluster to zoom in and see individual markers. Click a device marker to see its status summary and link to its detail page.
- Super admins can filter the map by organization using the dropdown in the toolbar.
- The map auto-fits to show all mapped devices when loaded. The toolbar shows how many of your devices have coordinates assigned.
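
The cluster coloring rule described above can be expressed in a few lines. The function name and status strings are illustrative only.

```python
def cluster_color(statuses):
    """Return the marker color for a cluster of device statuses."""
    if not statuses:
        raise ValueError("a cluster must contain at least one device")
    online = sum(1 for status in statuses if status == "online")
    if online == len(statuses):
        return "green"  # all devices online
    if online == 0:
        return "red"    # all devices offline
    return "amber"      # mixed
```
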
---
## Tips
- Use the **command palette** (`Cmd+K`) for the fastest way to navigate. It searches pages, devices, and actions.