docs: update all documentation for v9.7.0

- CONFIGURATION.md: fix database name (mikrotik → tod), add 5 missing
  env vars, update NATS memory to 256MB
- API.md: add 8 missing endpoint groups (sites, sectors, wireless links,
  signal history, site alerts, config backups, remote access, winbox)
- ARCHITECTURE.md: update subscriber count from 3 to 10, add v9.7
  components (sites, sectors, link discovery, signal trending, site
  alerts), add background service loops, update router count to 33
- USER-GUIDE.md: add tower/site management, wireless links, signal
  history, site alerts, and fleet map documentation
- README.md: add v9.7 features to feature list
- DEPLOYMENT.md: add winbox-worker, openbao, wireguard to service list
- SECURITY.md: add WinBox session security details

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jason Staack
2026-03-19 22:03:25 -05:00
parent 11781a822f
commit 0142107e68
7 changed files with 384 additions and 14 deletions


@@ -44,10 +44,24 @@ TOD (The Other Dude) is a containerized MSP fleet management platform for MikroT
- `admin_engine` (superuser) -- used only for auth/bootstrap and NATS subscribers that need cross-tenant access
- `app_engine` (non-superuser `app_user` role) -- used for all device/data routes, enforces RLS
- **Authentication**: JWT tokens (15min access, 7d refresh), SRP-6a zero-knowledge proof, RBAC (super_admin, admin, operator, viewer)
- **NATS subscribers**: Three independent subscribers for device status, metrics, and firmware events. Non-fatal startup -- API serves requests even if NATS is unavailable
- **Background services**: APScheduler for nightly config backups and daily firmware version checks
- **NATS subscribers**: Ten independent subscribers, each on its own NATS connection. Non-fatal startup -- API serves requests even if NATS is unavailable:
- `nats_subscriber` -- device status events
- `metrics_subscriber` -- device metrics (CPU, memory, interface counters)
- `firmware_subscriber` -- firmware version events
- `session_audit_subscriber` -- SSH session auditing
- `config_change_subscriber` -- event-driven config backups
- `push_rollback_subscriber` -- config push rollback and alerting
- `config_snapshot_subscriber` -- config snapshot ingestion (Go poller -> PostgreSQL via Transit encryption)
- `wireless_registration_subscriber` -- per-client wireless registration data
- `interface_subscriber` -- device interface MAC resolution for link discovery
- `link_discovery_subscriber` -- wireless link state machine (MAC-based AP/CPE pairing)
- **Background services**:
- APScheduler: nightly config backups, daily firmware version checks, retention cleanup (24h cycle)
- WinBox session reconciliation loop (60s cycle) -- detects orphaned sessions and cleans up Redis + tunnels
- Signal trend detection loop (hourly) -- identifies sustained signal degradation across wireless clients
- Site alert evaluation loop (5-minute cycle) -- evaluates geographic-scoped alert rules with hysteresis
- **OpenBao integration**: Provisions per-tenant Transit encryption keys on startup, dual-read fallback if OpenBao is unavailable
- **Startup sequence**: Configure logging -> Run Alembic migrations -> Bootstrap first admin -> Start NATS subscribers -> Ensure SSE streams -> Start schedulers -> Provision OpenBao keys
- **Startup sequence**: Configure logging -> Run Alembic migrations -> Bootstrap first admin -> Start NATS subscribers (10) -> Ensure SSE streams -> Start schedulers -> Provision OpenBao keys -> Recover stale push operations -> Start background loops (reconciliation, trend detection, site alerts)
- **API documentation**: OpenAPI docs at `/docs` and `/redoc` (dev environment only)
- **Health endpoints**: `/health` (liveness), `/health/ready` (readiness -- checks PostgreSQL, Redis, NATS)
- **Middleware stack** (LIFO order): RequestID -> SecurityHeaders -> RateLimiting -> CORS -> Route handler
@@ -55,7 +69,7 @@ TOD (The Other Dude) is a containerized MSP fleet management platform for MikroT
#### API Routers
The backend exposes 25 route groups under the `/api` prefix:
The backend exposes 33 route groups under the `/api` prefix:
| Router | Purpose |
|--------|---------|
@@ -84,6 +98,14 @@ The backend exposes 25 route groups under the `/api` prefix:
| `certificates` | Internal CA and device TLS certificates |
| `settings` | System settings (SMTP configuration, super_admin only) |
| `transparency` | KMS access event dashboard |
| `remote_access` | SSH remote access sessions |
| `winbox_remote` | WinBox browser-based remote sessions |
| `sites` | Site management (hierarchical device organization) |
| `sectors` | Sector definitions within sites (antenna/coverage zones) |
| `links` | Wireless link discovery and state tracking |
| `signal_history` | Per-client signal strength history and trends |
| `site_alerts` | Geographic-scoped alert rules and events |
| `config` | Config push operations (two-phase with panic revert) |
### Go Poller
@@ -135,7 +157,7 @@ The backend exposes 25 route groups under the `/api` prefix:
- **Durable consumers**: Ensure no message loss during API restarts
- **Monitoring port**: 8222
- **Data volume**: `./docker-data/nats`
- **Memory limit**: 128MB
- **Memory limit**: 256MB
### OpenBao (HashiCorp Vault fork)
@@ -245,6 +267,48 @@ Browser API PostgreSQL
- `poller_user` bypasses RLS intentionally (needs cross-tenant device access for polling)
- Tenant isolation is enforced at the database level, not the application level -- even a compromised API cannot leak cross-tenant data through `app_user` connections
## Sites & Sectors
The site management subsystem provides hierarchical device organization for tower-based wireless deployments.
- **Sites**: Named geographic locations (towers, POPs, huts) with optional latitude/longitude coordinates
- **Sectors**: Coverage zones within a site, representing individual antenna faces or radio segments. Each sector belongs to exactly one site and can have one or more devices assigned
- **Device assignment**: Devices are assigned to sectors, inheriting site membership. A device belongs to at most one sector at a time
- **Site health**: Aggregate health status is derived from the devices within a site's sectors -- if any device is down, the site status reflects it
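The site health roll-up could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Device` type, the `site_status` function, and the status names (`unknown`, `healthy`, `degraded`, `down`) are all hypothetical and chosen only to show how one down device can surface at the site level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical view of a device assigned to a sector of one site."""
    name: str
    sector: str
    is_up: bool


def site_status(devices: list[Device]) -> str:
    """Derive an aggregate status from every device in a site's sectors.

    The status names are assumptions made for this sketch.
    """
    if not devices:
        return "unknown"  # no devices assigned to any sector yet
    down = [d for d in devices if not d.is_up]
    if not down:
        return "healthy"
    # Any down device is reflected; all devices down marks the whole site down.
    return "down" if len(down) == len(devices) else "degraded"
```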
## Wireless Link Discovery
MAC-based automatic detection of AP-to-CPE wireless links.
- **Interface subscriber**: Ingests device interface data from NATS, building a MAC-to-device lookup table
- **Wireless registration subscriber**: Processes per-client wireless registration events, capturing connected MACs and signal data
- **Link discovery subscriber**: Correlates AP registration tables with CPE interface MACs to identify links between managed devices
- **State machine**: Each discovered link transitions through states based on signal quality and reachability:
- `discovered` -- initial detection, not yet confirmed
- `active` -- confirmed bidirectional link with acceptable signal
- `degraded` -- signal below threshold or intermittent connectivity
- `down` -- link lost (device unreachable or deregistered)
- `stale` -- no update received within the retention window
- **Automatic pairing**: When an AP's registration table contains a MAC belonging to a managed CPE, a link record is created without manual configuration
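The MAC correlation step can be illustrated with a small sketch. It assumes two in-memory inputs that stand in for what the interface and wireless registration subscribers collect; the function name `discover_links` and both data shapes are hypothetical, and the real subscribers may persist and correlate this data differently.

```python
def discover_links(
    mac_to_device: dict[str, str],
    registrations: dict[str, list[str]],
) -> set[tuple[str, str]]:
    """Return (ap_id, cpe_id) pairs for MACs that belong to managed devices.

    mac_to_device: interface MAC -> managed device ID (hypothetical shape).
    registrations: AP device ID -> client MACs in its registration table.
    """
    # Normalize case so "AA:BB:.." and "aa:bb:.." match.
    lookup = {mac.lower(): dev for mac, dev in mac_to_device.items()}
    links: set[tuple[str, str]] = set()
    for ap_id, client_macs in registrations.items():
        for mac in client_macs:
            cpe_id = lookup.get(mac.lower())
            # Unmanaged clients have no entry; an AP never links to itself.
            if cpe_id is not None and cpe_id != ap_id:
                links.add((ap_id, cpe_id))
    return links
```

Clients whose MAC does not resolve to a managed device are ignored, which matches the description of links being created only between managed devices.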
## Signal History & Trend Detection
Per-client signal strength tracking with automatic degradation alerting.
- **Signal history**: Records signal strength samples for each wireless client over time, stored in TimescaleDB for efficient time-range queries
- **Trend detection loop** (hourly): Analyzes recent signal history to identify sustained degradation. When a client's signal stays below the threshold for a configurable window, the system creates a site alert event with rule type `signal_degradation`. The event auto-resolves when the signal recovers
- **Retention**: Signal history samples are subject to the same retention cleanup as other time-series data
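One way to read "sustained degradation" is that every sample in the most recent window falls below the threshold. The sketch below assumes exactly that interpretation; the function name, the dBm unit, and the sample-count window are illustrative assumptions, and the actual loop may use a time-based window or a different statistic.

```python
def is_sustained_degradation(
    samples_dbm: list[float],
    threshold_dbm: float,
    window: int,
) -> bool:
    """True if the last `window` samples are all below the threshold.

    samples_dbm is ordered oldest to newest (assumed for this sketch).
    """
    if window <= 0 or len(samples_dbm) < window:
        return False  # not enough history to call it sustained
    return all(s < threshold_dbm for s in samples_dbm[-window:])
```

A single recovered sample inside the window is enough to return `False`, which is what keeps a brief dip from raising an alert.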
## Site Alert Rules
Geographic-scoped alerting distinct from per-device alerts.
- **Rule types**: Configurable rules scoped to a site (e.g., "alert when more than N devices are down at site X", signal degradation thresholds)
- **Evaluation loop** (5-minute cycle): Evaluates all enabled site alert rules against current data
- **Hysteresis**: Rules require consecutive hits (default 2) before confirming an alert, preventing flapping from transient conditions
- **Event lifecycle**: Alert events are created when rules trigger and auto-resolved when conditions clear. Manual resolution is also supported
- **Separation from device alerts**: Site alerts operate independently from the per-device alert system, allowing operators to set geographic thresholds without duplicating device-level rules
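The hysteresis and auto-resolve behavior described above can be sketched as a small per-rule counter. `RuleState`, `evaluate`, and the returned event names are hypothetical; only the default of 2 consecutive hits comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    """Hypothetical per-rule state kept between evaluation cycles."""
    consecutive_hits: int = 0
    firing: bool = False


def evaluate(state: RuleState, condition_met: bool, required_hits: int = 2) -> Optional[str]:
    """Advance one evaluation cycle and return an event name, if any.

    Returns "triggered" once the condition has held for `required_hits`
    consecutive cycles, "resolved" when a firing rule's condition clears,
    and None otherwise.
    """
    if condition_met:
        state.consecutive_hits += 1
        if not state.firing and state.consecutive_hits >= required_hits:
            state.firing = True
            return "triggered"
        return None
    # Condition cleared: reset the streak and resolve any open alert.
    state.consecutive_hits = 0
    if state.firing:
        state.firing = False
        return "resolved"
    return None
```

With the default setting, a condition that is true for only one 5-minute cycle never produces an event, since the streak resets before reaching two hits.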
## Security Layers
| Layer | Mechanism | Purpose |
@@ -285,7 +349,7 @@ backend/ FastAPI Python backend
config.py Pydantic Settings configuration
database.py SQLAlchemy engines (admin + app_user)
models/ SQLAlchemy ORM models
routers/ FastAPI route handlers (25 modules)
routers/ FastAPI route handlers (33 modules)
services/ Business logic, NATS subscribers, schedulers
middleware/ Rate limiting, request ID, security headers
frontend/ React TypeScript frontend
@@ -332,6 +396,6 @@ docker compose build frontend
| Go Poller | 512MB |
| OpenBao | 256MB |
| Redis | 128MB |
| NATS | 128MB |
| NATS | 256MB |
| WireGuard | 128MB |
| Frontend (nginx) | 64MB |