the-other-dude/docs/ARCHITECTURE.md
Jason Staack 398dc99466 docs: correct technical details and version references
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 08:57:04 -05:00

# Architecture
## System Overview
TOD (The Other Dude) is a containerized MSP fleet management platform for MikroTik RouterOS devices. It uses a three-service architecture: a React frontend, a Python FastAPI backend, and a Go poller. All services communicate through PostgreSQL, Redis, and NATS JetStream. Multi-tenancy is enforced at the database level via PostgreSQL Row-Level Security (RLS).
```
┌─────────────┐     ┌─────────────────┐     ┌──────────────┐
│  Frontend   │────▶│   Backend API   │◀───▶│  Go Poller   │
│ React/nginx │     │     FastAPI     │     │ go-routeros  │
└─────────────┘     └────────┬────────┘     └──────┬───────┘
                             │                     │
              ┌──────────────┼─────────────────────┤
              │              │                     │
     ┌────────▼──┐   ┌───────▼───────┐  ┌──────────▼──┐
     │   Redis   │   │ PostgreSQL 17 │  │    NATS     │
     │  locks,   │   │ + TimescaleDB │  │  JetStream  │
     │  cache    │   │     + RLS     │  │  pub/sub    │
     └───────────┘   └───────────────┘  └─────────────┘
                      ┌──────▼──────┐
                      │   OpenBao   │
                      │ Transit KMS │
                      └─────────────┘
```
## Services
### Frontend (React / nginx)
- **Stack**: React 19, TypeScript, TanStack Router (file-based routing), TanStack Query (data fetching), Tailwind CSS 3.4, Vite
- **Production**: Static build served by nginx on port 80 (exposed as port 3000)
- **Development**: Vite dev server with hot module replacement
- **Design system**: Geist Sans + Geist Mono fonts, HSL color tokens via CSS custom properties, class-based dark/light mode
- **Real-time**: Server-Sent Events (SSE) for live device status updates, alerts, and operation progress
- **Client-side encryption**: SRP-6a authentication flow with 2SKD key derivation; Emergency Kit PDF generation
- **UX features**: Command palette (Cmd+K), Framer Motion page transitions, collapsible sidebar, skeleton loaders
- **Memory limit**: 64MB
### Backend API (FastAPI)
- **Stack**: Python 3.12+, FastAPI 0.115+, SQLAlchemy 2.0 async, asyncpg, Gunicorn
- **Two database engines**:
  - `admin_engine` (superuser) -- used only for auth/bootstrap and NATS subscribers that need cross-tenant access
  - `app_engine` (non-superuser `app_user` role) -- used for all device/data routes, enforces RLS
- **Authentication**: JWT tokens (15min access, 7d refresh), SRP-6a zero-knowledge proof, RBAC (super_admin, admin, operator, viewer)
- **NATS subscribers**: Three independent subscribers for device status, metrics, and firmware events. Non-fatal startup -- API serves requests even if NATS is unavailable
- **Background services**: APScheduler for nightly config backups and daily firmware version checks
- **OpenBao integration**: Provisions per-tenant Transit encryption keys on startup, dual-read fallback if OpenBao is unavailable
- **Startup sequence**: Configure logging -> Run Alembic migrations -> Bootstrap first admin -> Start NATS subscribers -> Ensure SSE streams -> Start schedulers -> Provision OpenBao keys
- **API documentation**: OpenAPI docs at `/docs` and `/redoc` (dev environment only)
- **Health endpoints**: `/health` (liveness), `/health/ready` (readiness -- checks PostgreSQL, Redis, NATS)
- **Middleware stack** (LIFO order): RequestID -> SecurityHeaders -> RateLimiting -> CORS -> Route handler
- **Memory limit**: 512MB
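
The LIFO note on the middleware stack is easy to misread, so here is a minimal, framework-free sketch of the idea. It assumes the listed order (RequestID first, route handler last) is the order in which a request passes through the layers, and that each newly registered middleware wraps everything registered before it, as Starlette-style stacks do. The helper names `make_middleware` and `build_app` are hypothetical and are not taken from the backend code.

```python
from typing import Callable, List

# A handler takes a request string and returns the list of layers it passed through.
Handler = Callable[[str], List[str]]


def make_middleware(name: str) -> Callable[[Handler], Handler]:
    """Return a wrapper that records its own name before calling the inner handler."""
    def wrap(inner: Handler) -> Handler:
        def handler(request: str) -> List[str]:
            return [name] + inner(request)
        return handler
    return wrap


def route_handler(request: str) -> List[str]:
    """Innermost layer: the actual route."""
    return ["Route handler"]


def build_app(registration_order: List[str]) -> Handler:
    """Wrap the route handler so the LAST registered middleware ends up outermost."""
    app: Handler = route_handler
    for name in registration_order:
        app = make_middleware(name)(app)
    return app


# Registered in reverse of the desired execution order (hypothetical registration list).
app = build_app(["CORS", "RateLimiting", "SecurityHeaders", "RequestID"])
execution_order = app("GET /api/devices")
```

Under these assumptions `execution_order` lists RequestID first and the route handler last, even though RequestID was registered last.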
#### API Routers
The backend exposes 25 route groups under the `/api` prefix:
| Router | Purpose |
|--------|---------|
| `auth` | Login (SRP-6a + legacy), token refresh, registration |
| `tenants` | Tenant CRUD (super_admin only) |
| `users` | User management, RBAC |
| `devices` | Device CRUD, status, commands |
| `device_groups` | Logical device grouping |
| `device_tags` | Tagging and filtering |
| `metrics` | Time-series metrics (TimescaleDB) |
| `config_backups` | Configuration backup history |
| `config_editor` | Live RouterOS config editing |
| `firmware` | Firmware version tracking and upgrades |
| `alerts` | Alert rules and active alerts |
| `events` | Device event log |
| `device_logs` | RouterOS system logs |
| `templates` | Configuration templates |
| `clients` | Connected client devices |
| `topology` | Network topology (ReactFlow data) |
| `sse` | Server-Sent Events streams |
| `audit_logs` | Immutable audit trail |
| `reports` | PDF report generation (Jinja2 + weasyprint) |
| `api_keys` | API key management (mktp_ prefix) |
| `maintenance_windows` | Scheduled maintenance with alert suppression |
| `vpn` | WireGuard VPN management |
| `certificates` | Internal CA and device TLS certificates |
| `settings` | System settings (SMTP configuration, super_admin only) |
| `transparency` | KMS access event dashboard |
### Go Poller
- **Stack**: Go 1.25, go-routeros/v3, pgx/v5, nats.go
- **Polling model**: Synchronous per-device polling on a configurable interval (default 60s)
- **Device communication**: RouterOS binary API over TLS (port 8729), InsecureSkipVerify for self-signed certs
- **TLS fallback**: Three-tier strategy -- CA-verified -> InsecureSkipVerify -> plain API
- **Distributed locking**: Redis locks prevent concurrent polling of the same device (safe for multi-instance deployment)
- **Circuit breaker**: Backs off from unreachable devices to avoid wasting poll cycles
- **Credential decryption**: OpenBao Transit with LRU cache (1024 entries, 5min TTL) to minimize KMS calls
- **Output**: Publishes poll results to NATS JetStream; the API's NATS subscribers process and persist them
- **Database access**: Uses `poller_user` role which bypasses RLS (needs cross-tenant device access)
- **VPN routing**: Adds static route to WireGuard gateway for reaching remote devices
- **Tunnel manager**: On-demand TCP proxy for WinBox connections; allocates ports from a configurable range (default 49000-49100), bound to localhost only, with idle-timeout cleanup
- **SSH relay**: WebSocket-to-SSH bridge serving browser-based terminal sessions; listens on port 8080, enforces per-user and per-device session limits
- **Memory limit**: 512MB
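
The credential cache described above combines two eviction rules: a size bound (1024 entries) and a time-to-live (5 minutes). The poller itself is written in Go; the following is only a Python sketch of that combination, with a hypothetical class name `TTLLRUCache` and an injectable clock so the behaviour can be checked without waiting.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class TTLLRUCache:
    """LRU cache whose entries also expire after a fixed time-to-live."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_entries = max_entries
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        # Maps key -> (value, expiry time); insertion order tracks recency.
        self._entries: "OrderedDict[str, Tuple[str, float]]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller falls back to the KMS.
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (value, self._clock() + self._ttl_seconds)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._entries)
```

A cache miss (a `None` return) is the point at which a real poller would call OpenBao Transit and then `put` the decrypted value.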
## Infrastructure Services
### PostgreSQL 17 + TimescaleDB
- **Image**: `timescale/timescaledb:2.17.2-pg17`
- **Row-Level Security (RLS)**: Enforces tenant isolation at the database level. All data tables have a `tenant_id` column; RLS policies filter by `current_setting('app.current_tenant')`
- **Database roles**:
  - `postgres` (superuser) -- admin engine, auth/bootstrap, migrations
  - `app_user` (non-superuser) -- RLS-enforced, used by API for data routes
  - `poller_user` -- bypasses RLS, used by Go poller for cross-tenant device access
- **TimescaleDB hypertables**: Time-series storage for device metrics (CPU, memory, interface traffic, etc.)
- **Migrations**: Alembic, run automatically on API startup
- **Initialization**: `scripts/init-postgres.sql` creates roles and enables extensions
- **Data volume**: `./docker-data/postgres`
- **Memory limit**: 512MB
### Redis
- **Image**: `redis:7-alpine`
- **Uses**:
  - Distributed locking for the Go poller (prevents concurrent polling of the same device)
  - Rate limiting on auth endpoints (5 requests/min)
  - Credential cache for OpenBao Transit responses
- **Data volume**: `./docker-data/redis`
- **Memory limit**: 128MB
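
The document does not say how the poller's Redis locks are implemented. A common pattern, assumed here, is a key set only if absent with an expiry (Redis `SET key token NX PX <ms>`) and released only by the holder of the matching token. The sketch below imitates those semantics with an in-memory stand-in; `InMemoryLockStore` and the `poll:<device>` key format are hypothetical.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class InMemoryLockStore:
    """Stand-in for Redis that mimics set-if-absent-with-expiry locking."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._locks: Dict[str, Tuple[str, float]] = {}  # key -> (token, expiry)

    def acquire(self, key: str, ttl_seconds: float) -> Optional[str]:
        """Return a fresh token if the lock was free or expired, else None."""
        now = self._clock()
        holder = self._locks.get(key)
        if holder is not None and holder[1] > now:
            return None  # someone else holds a live lock
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl_seconds)
        return token

    def release(self, key: str, token: str) -> bool:
        """Release only if the caller still owns the lock."""
        holder = self._locks.get(key)
        if holder is None or holder[0] != token:
            return False
        del self._locks[key]
        return True
```

The expiry matters for multi-instance deployments: if a poller crashes mid-poll, the lock frees itself once the TTL passes, and the token check stops a slow instance from releasing a lock that has since been taken over.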
### NATS JetStream
- **Image**: `nats:2-alpine`
- **Role**: Message bus between the Go poller and the Python API
- **Streams**: DEVICE_EVENTS (poll results, status changes), ALERT_EVENTS (SSE delivery), OPERATION_EVENTS (SSE delivery)
- **Durable consumers**: Ensure no message loss during API restarts
- **Monitoring port**: 8222
- **Data volume**: `./docker-data/nats`
- **Memory limit**: 128MB
### OpenBao (HashiCorp Vault fork)
- **Image**: `openbao/openbao:2.1`
- **Mode**: Persistent server with file storage backend (`/openbao/data`), mounted to the `openbao_data` Docker volume. Data survives container restarts.
- **Transit secrets engine**: Provides envelope encryption for device credentials at rest
- **Per-tenant keys**: Each tenant gets a dedicated Transit encryption key
- **Init script**: `infrastructure/openbao/init.sh` enables Transit engine and creates initial keys
- **Token**: Set `OPENBAO_TOKEN` in `.env.prod`. The application rejects known-insecure defaults in production.
- **Memory limit**: 256MB
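
As a rough illustration of what a Transit encryption call looks like, the sketch below builds (but does not send) an HTTP request in the shape of the Vault-compatible Transit API: a `POST` to `/v1/transit/encrypt/<key name>` with base64-encoded plaintext and the token in the `X-Vault-Token` header. The mount path `transit`, the key name `tenant_1234`, and the helper name are assumptions for illustration; the project's actual key naming is not documented here.

```python
import base64
import json
import urllib.request


def build_transit_encrypt_request(base_url: str, token: str, key_name: str,
                                  plaintext: bytes) -> urllib.request.Request:
    """Build a Transit encrypt request; the caller decides whether to send it."""
    # Transit expects the plaintext as a base64 string inside a JSON body.
    payload = {"plaintext": base64.b64encode(plaintext).decode("ascii")}
    return urllib.request.Request(
        url=f"{base_url}/v1/transit/encrypt/{key_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: one Transit key per tenant, named after the tenant.
request = build_transit_encrypt_request(
    "http://openbao:8200", "example-token", "tenant_1234", b"router-password"
)
```

Passing `request` to `urllib.request.urlopen` would perform the call against a running OpenBao instance; the response carries the ciphertext that is stored in place of the device credential.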
### WireGuard
- **Image**: `lscr.io/linuxserver/wireguard`
- **Role**: VPN gateway for reaching RouterOS devices on remote networks
- **Port**: 51820/UDP
- **Routing**: API and Poller containers add static routes through the WireGuard container to reach device subnets (e.g., `10.10.0.0/16`)
- **Data volume**: `./docker-data/wireguard`
- **Memory limit**: 128MB
## Data Flow
### Device Polling Cycle
```
Go Poller           Redis              OpenBao      RouterOS      NATS           API         PostgreSQL
    │                 │                   │             │           │             │               │
    ├──query device list─────────────────────────────────────────────────────────────────────────▶│
    │◀────────────────────────────────────────────────────────────────────────────────────────────┤
    ├──acquire lock──▶│                   │             │           │             │               │
    │◀──lock granted──┤                   │             │           │             │               │
    ├──decrypt credentials (cache miss)──▶│             │           │             │               │
    │◀──plaintext credentials─────────────┤             │           │             │               │
    ├──binary API (8729 TLS)───────────────────────────▶│           │             │               │
    │◀──system info, interfaces, metrics────────────────┤           │             │               │
    ├──publish poll result─────────────────────────────────────────▶│             │               │
    │                 │                   │             │           ├──subscribe─▶│               │
    │                 │                   │             │           │             ├──upsert data─▶│
    ├──release lock──▶│                   │             │           │             │               │
```
1. Poller queries PostgreSQL for the list of active devices
2. Acquires a Redis distributed lock per device (prevents duplicate polling)
3. Decrypts device credentials via OpenBao Transit (LRU cache avoids repeated KMS calls)
4. Connects to the RouterOS binary API on port 8729 over TLS
5. Collects system info, interface stats, routing tables, and metrics
6. Publishes results to NATS JetStream
7. API NATS subscriber processes results and upserts into PostgreSQL
8. Releases Redis lock
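
The cycle above can be sketched as one function with its dependencies injected. This is only an outline under assumptions: the real poller is a Go service, and the callables `acquire_lock`, `release_lock`, `decrypt_credentials`, `collect`, and `publish` are hypothetical stand-ins for Redis, OpenBao, the RouterOS API, and NATS. The point it illustrates is that the lock is released even when a device is unreachable.

```python
from typing import Any, Callable, Dict, Optional


def poll_device(
    device_id: str,
    acquire_lock: Callable[[str], Optional[str]],
    release_lock: Callable[[str, str], Any],
    decrypt_credentials: Callable[[str], Dict[str, str]],
    collect: Callable[[str, Dict[str, str]], Dict[str, Any]],
    publish: Callable[[str, Dict[str, Any]], Any],
) -> bool:
    """Poll one device; return False if another instance already holds its lock."""
    token = acquire_lock(device_id)
    if token is None:
        return False  # skip: the device is being polled elsewhere
    try:
        credentials = decrypt_credentials(device_id)  # cache or KMS
        result = collect(device_id, credentials)      # talk to the device
        publish(device_id, result)                    # hand off to the API side
        return True
    finally:
        # Always release, including when collection raises.
        release_lock(device_id, token)
```

Persisting the result is deliberately absent: in the flow described above, that is the job of the API's NATS subscriber, not the poller.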
### Config Push (Two-Phase with Panic Revert)
```
Frontend            API                RouterOS
    │                │                     │
    ├──push config──▶│                     │
    │                ├──apply config──────▶│
    │                ├──set revert timer──▶│
    │                │◀──ack───────────────┤
    │◀──pending──────┤                     │
    │                │                     │ (timer counting down)
    ├──confirm──────▶│                     │
    │                ├──cancel timer──────▶│
    │                │◀──ack───────────────┤
    │◀──confirmed────┤                     │
```
1. Frontend sends config commands to the API
2. API connects to the device and applies the configuration
3. Sets a revert timer on the device (RouterOS safe mode / scheduler)
4. Returns pending status to the frontend
5. User confirms the change works (e.g., connectivity still up)
6. If confirmed: API cancels the revert timer, config is permanent
7. If timeout or rejected: device automatically reverts to the previous configuration

This pattern prevents lockouts from misconfigured firewall rules or IP changes.
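
The confirm-or-revert logic can be modelled as a small state machine. The sketch below is an assumption-laden model, not the API's implementation: on a real device the revert timer runs on RouterOS itself, whereas here a hypothetical `PendingConfigPush` class tracks the deadline with an injectable clock.

```python
import time
from typing import Callable


class PendingConfigPush:
    """Tracks one pushed config that reverts unless confirmed before a deadline."""

    def __init__(self, previous_config: str, new_config: str,
                 revert_after_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._previous = previous_config
        self._new = new_config
        self._clock = clock
        self._deadline = clock() + revert_after_seconds
        self.state = "pending"

    def _expire_if_due(self) -> None:
        # Once the deadline passes without confirmation, the change is lost.
        if self.state == "pending" and self._clock() >= self._deadline:
            self.state = "reverted"

    def confirm(self) -> bool:
        """Make the change permanent; fails if the timer already fired."""
        self._expire_if_due()
        if self.state != "pending":
            return False
        self.state = "confirmed"
        return True

    def active_config(self) -> str:
        """Return the config that is effectively running right now."""
        self._expire_if_due()
        return self._previous if self.state == "reverted" else self._new
```

A confirmation that arrives after the deadline returns `False`, which mirrors the case where the operator lost connectivity and the device has already rolled back.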
### Authentication (SRP-6a Zero-Knowledge Proof)
```
Browser                    API                 PostgreSQL
│                           │                       │
│──register────────────────▶│                       │
│  (email, salt, verifier)  │──store verifier──────▶│
│                           │                       │
│──login step 1────────────▶│                       │
│  (email, client_public)   │──lookup verifier─────▶│
│◀──(salt, server_public)───┤◀──────────────────────┤
│                           │                       │
│──login step 2────────────▶│                       │
│  (client_proof)           │──verify proof─────────│
│◀──(server_proof, JWT)─────┤                       │
```
1. **Registration**: Client derives a verifier from `password + secret_key` using PBKDF2 (650K iterations) + HKDF + XOR (2SKD). Only the salt and verifier are sent to the server -- never the password
2. **Login step 1**: Client sends email and ephemeral public value; server responds with stored salt and its own ephemeral public value
3. **Login step 2**: Client computes a proof from the shared session key; server validates the proof without ever seeing the password
4. **Token issuance**: On successful proof, server issues JWT (15min access + 7d refresh)
5. **Emergency Kit**: A downloadable PDF containing the user's secret key for account recovery
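
The registration step names three primitives (PBKDF2, HKDF, XOR) without spelling out how they combine. One plausible arrangement, assumed here and not confirmed by the source, is to stretch the password with PBKDF2, expand the secret key with HKDF, and XOR the two 32-byte results. The function names, the HKDF `info` label, and the use of the email as HKDF salt are all illustrative choices.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) extract-then-expand using HMAC-SHA256."""
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()  # extract
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_2skd(password: str, secret_key: bytes, salt: bytes, email: str,
                iterations: int = 650_000) -> bytes:
    """Combine a password and a device-held secret key into one 32-byte key."""
    # Slow, salted stretch of the low-entropy secret (the password).
    password_part = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )
    # Fast expansion of the high-entropy secret (the Emergency Kit key).
    secret_part = hkdf_sha256(
        secret_key, salt=email.encode("utf-8"), info=b"2skd-sketch", length=32
    )
    # XOR: an attacker needs both inputs to reproduce the result.
    return bytes(a ^ b for a, b in zip(password_part, secret_part))
```

In this arrangement the output would then feed the SRP-6a verifier computation, so a stolen verifier cannot be brute-forced from the password alone without the secret key.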
## Multi-Tenancy Model
- Every data table includes a `tenant_id` column
- PostgreSQL RLS policies filter rows by `current_setting('app.current_tenant')`
- The API sets tenant context (`SET app.current_tenant = ...`) on each database session
- `super_admin` role has NULL `tenant_id` and can access all tenants
- `poller_user` bypasses RLS intentionally (needs cross-tenant device access for polling)
- Tenant isolation is enforced at the database level, not only in application code -- a query that omits a tenant filter still cannot return another tenant's rows through `app_user` connections, provided the session's tenant context is set correctly
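
Setting the tenant context is the one step RLS depends on, so it is worth seeing in isolation. The sketch below only builds the statement and parameters; it assumes tenant IDs are UUIDs and that the driver uses `$1`-style placeholders (as asyncpg does). The helper name and the example policy on a `devices` table are hypothetical.

```python
import uuid
from typing import Tuple

# Example of the kind of policy the tenant context feeds (table name is hypothetical).
EXAMPLE_POLICY_SQL = (
    "CREATE POLICY tenant_isolation ON devices "
    "USING (tenant_id = current_setting('app.current_tenant')::uuid)"
)


def tenant_context_statement(tenant_id: str) -> Tuple[str, Tuple[str, ...]]:
    """Return a parameterized statement that sets the tenant for the transaction."""
    # Parsing as a UUID rejects anything that is not a well-formed tenant ID.
    canonical = str(uuid.UUID(tenant_id))
    # set_config(name, value, is_local=true) scopes the value to the current transaction.
    return "SELECT set_config('app.current_tenant', $1, true)", (canonical,)
```

PostgreSQL's `set_config` function is used instead of a literal `SET` because `SET` does not accept bind parameters, so the tenant ID would otherwise have to be interpolated into the SQL text.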
## Security Layers
| Layer | Mechanism | Purpose |
|-------|-----------|---------|
| **Authentication** | SRP-6a | Zero-knowledge proof -- password never transmitted or stored |
| **Key Derivation** | 2SKD (PBKDF2 650K + HKDF + XOR) | Two-secret key derivation from password + secret key |
| **Encryption at Rest** | OpenBao Transit | Envelope encryption for device credentials |
| **Tenant Isolation** | PostgreSQL RLS | Database-level row filtering by tenant_id |
| **Access Control** | JWT + RBAC | Role-based permissions (super_admin, admin, operator, viewer) |
| **Rate Limiting** | Redis-backed | Auth endpoints limited to 5 requests/min |
| **TLS Certificates** | Internal CA | Certificate management and deployment to RouterOS devices |
| **Security Headers** | Middleware | CSP, SRI hashes on JS bundles, X-Frame-Options, etc. |
| **Secret Validation** | Startup check | Rejects known-insecure defaults in non-dev environments |
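
The rate-limiting row gives a limit (5 requests/min) but not the algorithm. A fixed-window counter is one simple option and is assumed here; with Redis it is typically an increment plus an expiry on a per-client key. The sketch keeps the counters in memory, and `FixedWindowLimiter` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int = 5, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._counters: Dict[str, Tuple[int, float]] = {}  # key -> (count, window start)

    def allow(self, client_key: str) -> bool:
        now = self._clock()
        count, window_start = self._counters.get(client_key, (0, now))
        if now - window_start >= self._window:
            count, window_start = 0, now  # start a new window
        if count >= self._limit:
            self._counters[client_key] = (count, window_start)
            return False
        self._counters[client_key] = (count + 1, window_start)
        return True
```

Keying by client address or by account identifier is a policy choice the source does not specify; the limiter treats the key as opaque.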
## Network Topology
All services communicate over a single Docker bridge network (`tod`). External ports:
| Service | Internal Port | External Port | Protocol |
|---------|--------------|---------------|----------|
| Frontend | 80 | 3000 | HTTP |
| API | 8000 | 8001 | HTTP |
| PostgreSQL | 5432 | 5432 | TCP |
| Redis | 6379 | 6379 | TCP |
| NATS | 4222 | 4222 | TCP |
| NATS Monitor | 8222 | 8222 | HTTP |
| OpenBao | 8200 | 8200 | HTTP |
| WireGuard | 51820 | 51820 | UDP |
| Poller SSH Relay | 8080 | 8080 | HTTP/WebSocket |
| Poller WinBox Tunnels | 49000-49100 | 49000-49100 | TCP (localhost only) |
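
The WinBox tunnel range in the last row implies a small port allocator. The sketch below shows one way such a pool could behave, assuming the range is inclusive on both ends and that exhausted pools refuse new tunnels; the real tunnel manager lives in the Go poller, and `PortPool` is a hypothetical name.

```python
from typing import List, Optional, Set, Tuple


class PortPool:
    """Hands out ports from an inclusive range and takes them back on release."""

    def __init__(self, first_port: int = 49000, last_port: int = 49100) -> None:
        if first_port > last_port:
            raise ValueError("first_port must not exceed last_port")
        # Stored in descending order so pop() yields the lowest free port first.
        self._free: List[int] = list(range(last_port, first_port - 1, -1))
        self._leased: Set[int] = set()

    def allocate(self) -> Optional[int]:
        """Return a free port, or None when the range is exhausted."""
        if not self._free:
            return None
        port = self._free.pop()
        self._leased.add(port)
        return port

    def release(self, port: int) -> None:
        """Return a leased port to the pool; unknown ports are ignored."""
        if port in self._leased:
            self._leased.remove(port)
            self._free.append(port)

    def available(self) -> int:
        return len(self._free)


def bind_address(port: int) -> Tuple[str, int]:
    """Tunnels listen on loopback only, matching the 'localhost only' note."""
    return ("127.0.0.1", port)
```

Idle-timeout cleanup would call `release` for tunnels that have seen no traffic, which is what keeps a 101-port range sufficient for on-demand sessions.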
## File Structure
```
backend/                      FastAPI Python backend
  app/
    main.py                   Application entry point, lifespan, router registration
    config.py                 Pydantic Settings configuration
    database.py               SQLAlchemy engines (admin + app_user)
    models/                   SQLAlchemy ORM models
    routers/                  FastAPI route handlers (25 modules)
    services/                 Business logic, NATS subscribers, schedulers
    middleware/               Rate limiting, request ID, security headers
frontend/                     React TypeScript frontend
  src/
    routes/                   TanStack Router file-based routes
    components/               Reusable UI components
    lib/                      API client, crypto, utilities
poller/                       Go microservice for device polling
  main.go                     Entry point
  Dockerfile                  Multi-stage build
  internal/
    tunnel/                   WinBox TCP proxy and port pool manager
    sshrelay/                 WebSocket-to-SSH bridge for browser terminals
infrastructure/               Deployment configuration
  docker/                     Dockerfiles for api, frontend
  helm/                       Kubernetes Helm charts
  openbao/                    OpenBao init scripts
scripts/                      Database init scripts
docker-compose.yml            Infrastructure services (postgres, redis, nats, openbao, wireguard)
docker-compose.override.yml   Application services for dev (api, poller, frontend)
```
## Running the Stack
```bash
# Infrastructure only (postgres, redis, nats, openbao, wireguard)
docker compose up -d
# Full stack including application services (api, poller, frontend)
docker compose --profile full up -d
# Build images sequentially to avoid OOM on low-RAM machines
docker compose build api
docker compose build poller
docker compose build frontend
```
## Container Memory Limits
| Service | Limit |
|---------|-------|
| PostgreSQL | 512MB |
| API | 512MB |
| Go Poller | 512MB |
| OpenBao | 256MB |
| Redis | 128MB |
| NATS | 128MB |
| WireGuard | 128MB |
| Frontend (nginx) | 64MB |