Architecture
Analysis Date: 2026-03-12
Pattern Overview
Overall: Event-driven microservice architecture with asynchronous pub/sub messaging
Key Characteristics:
- Three independent microservices: Go Poller, Python FastAPI Backend, React/TypeScript Frontend
- NATS JetStream as central event bus for all inter-service communication
- PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation at database layer
- Real-time Server-Sent Events (SSE) for frontend event streaming
- Distributed task coordination using Redis distributed locks
- Per-tenant encryption via OpenBao Transit KMS engine
Layers
Device Polling Layer (Go Poller):
- Purpose: Connects to RouterOS devices via binary API (port 8729), detects status/version, collects metrics, pushes configs, manages WinBox/SSH tunnels
- Location: `poller/`
- Contains: Device client, scheduler, SSH relay, WinBox tunnel manager, NATS publisher, Redis credential cache, OpenBao vault client
- Depends on: NATS JetStream, Redis, PostgreSQL (read-only for device list), OpenBao
- Used by: Publishes events to backend via NATS
Event Bus Layer (NATS JetStream):
- Purpose: Central publish/subscribe message broker for all service-to-service communication
- Streams: DEVICE_EVENTS, OPERATION_EVENTS, ALERT_EVENTS
- Contains: Device status changes, metrics, config change notifications, push rollback triggers, alert events, session audit events
- All events include device_id and tenant_id for multi-tenant routing
Backend API Layer (Python FastAPI):
- Purpose: RESTful API, business logic, database persistence, event subscription and processing
- Location: `backend/app/`
- Contains: FastAPI routers, SQLAlchemy ORM models, async services, NATS subscribers, middleware (RBAC, tenant context, rate limiting)
- Depends on: PostgreSQL (via RLS-enforced app_user connection), NATS JetStream, Redis, OpenBao, email/webhook services
- Used by: Frontend (REST API), poller (reads device list, writes operation results)
Data Persistence Layer (PostgreSQL + TimescaleDB):
- Purpose: Multi-tenant relational data store with RLS-enforced isolation
- Connection: Two engines in `backend/app/database.py`
  - Admin engine (superuser): Migrations, bootstrap, admin operations
  - App engine (app_user role): All tenant-scoped API requests, RLS enforced
- Row-Level Security: `SET LOCAL app.current_tenant` set per-request by the `get_current_user` dependency
- Contains: Devices, users, tenants, alerts, config backups, templates, VPN peers, certificates, audit logs, metrics aggregates
Caching/Locking Layer (Redis):
- Purpose: Distributed locks (poller prevents duplicate device polls), session management, temporary data
- Usage: `redislock` package in poller for per-device poll coordination across replicas
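The coordination above can be sketched with redis-py's atomic `SET NX PX` (the real `redislock` package is Go; the key scheme, TTL, and function names here are illustrative):

```python
import uuid

def lock_key(device_id: str) -> str:
    # Hypothetical key scheme; the real package may name keys differently.
    return f"poll:lock:{device_id}"

def try_acquire(redis_client, device_id: str, ttl_ms: int = 30_000):
    """Attempt to take the per-device poll lock.

    SET NX PX is atomic: only one poller replica wins, and the TTL
    guarantees the lock expires if the holder crashes mid-poll.
    Returns the holder token on success, None if another replica holds it.
    """
    token = uuid.uuid4().hex
    ok = redis_client.set(lock_key(device_id), token, nx=True, px=ttl_ms)
    return token if ok else None

def release(redis_client, device_id: str, token: str) -> bool:
    """Release only if we still hold the lock (compare-and-delete via Lua,
    so an expired-and-reacquired lock is never deleted by the old holder)."""
    script = (
        "if redis.call('get', KEYS[1]) == ARGV[1] then "
        "return redis.call('del', KEYS[1]) else return 0 end"
    )
    return bool(redis_client.eval(script, 1, lock_key(device_id), token))
```

The compare-and-delete on release matters: a plain `DEL` could remove a lock that expired and was already taken by another replica.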
Secret Management Layer (OpenBao):
- Purpose: Transit KMS for per-tenant envelope encryption, credential storage access control
- Mode: Transit secret engine wrapping credentials for envelope encryption
- Accessed by: Poller (fetch decrypted credentials), backend (re-encrypt on password change)
Frontend Layer (React 19 + TanStack):
- Purpose: Web UI for fleet management, device control, configuration, monitoring
- Location: `frontend/src/`
- Contains: TanStack Router, TanStack Query, Tailwind CSS, SSE event stream integration, WebSocket tunnels
- Depends on: Backend REST API, Server-Sent Events for real-time updates, WebSocket for terminal/remote access
- Entry point: `frontend/src/routes/__root.tsx` (QueryClientProvider, root layout)
Data Flow
Device Status Polling (Poller → NATS → Backend):
- Poller scheduler periodically fetches device list from PostgreSQL
- For each device, the poller's `Worker` connects to the RouterOS binary API (port 8729, TLS)
- Worker collects device status (online/offline), version, system metrics
- Worker publishes `DeviceStatusEvent` to NATS stream `DEVICE_EVENTS`, topic `device.status.{device_id}`
- Backend subscribes to `device.status.>` via `nats_subscriber.py`
- Subscriber updates the device record in PostgreSQL via an admin session (bypasses RLS)
- Frontend receives the update via SSE subscription to `/api/sse?topics=device_status`
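The publish side of this flow can be sketched in Python, assuming a nats-py-style JetStream context (the real poller is Go; event fields and helper names are illustrative):

```python
import json
from dataclasses import dataclass, asdict

def status_subject(device_id: str) -> str:
    # Matches the documented topic scheme device.status.{device_id}
    return f"device.status.{device_id}"

@dataclass
class DeviceStatusEvent:
    device_id: str
    tenant_id: str   # every event carries tenant_id for multi-tenant routing
    status: str      # e.g. "online" or "offline"
    version: str

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode()

async def publish_status(js, event: DeviceStatusEvent) -> None:
    """Publish to the DEVICE_EVENTS stream via a JetStream context `js`
    (with nats-py: js = nc.jetstream())."""
    await js.publish(status_subject(event.device_id), event.to_bytes())
```

Because the subject ends in the device ID, the backend's `device.status.>` subscription receives every device's status on one consumer while retaining per-device routing information.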
Configuration Push (Frontend → Backend → Poller → Router):
- Frontend calls `POST /api/tenants/{tenant_id}/devices/{device_id}/config` with the new configuration
- Backend stores the config in PostgreSQL, publishes `ConfigPushEvent` to `OPERATION_EVENTS`
- Poller subscribes to push operation events, receives the config delta
- Poller connects to device via binary API, executes RouterOS commands (two-phase: backup, apply, verify)
- On completion, poller publishes `ConfigPushCompletedEvent` to NATS
- Backend subscriber updates the operation record with success/failure
- Frontend notifies user via SSE
Metrics Collection (Poller → NATS → Backend → Frontend):
- Poller collects health metrics (CPU, memory, disk), interface stats, wireless stats per poll cycle
- Publishes `DeviceMetricsEvent` to `DEVICE_EVENTS`, topic `device.metrics.{type}.{device_id}`
- Backend `metrics_subscriber.py` aggregates into TimescaleDB hypertables
- Frontend queries `/api/tenants/{tenant_id}/devices/{device_id}/metrics` for graphs
- Alternatively, the frontend SSE stream pushes metric updates for real-time graphs
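The aggregation query behind the metrics endpoint might look like the following sketch (table and column names are hypothetical; `time_bucket` is the TimescaleDB-specific piece):

```python
# Hypothetical hypertable schema; the real one may differ.
METRICS_QUERY = """
SELECT time_bucket(%(bucket)s::interval, ts) AS bucket,
       avg(cpu_load) AS cpu_avg,
       avg(mem_used) AS mem_avg
FROM device_metrics
WHERE tenant_id = %(tenant_id)s
  AND device_id = %(device_id)s
  AND ts >= now() - %(window)s::interval
GROUP BY bucket
ORDER BY bucket;
"""

def metrics_params(tenant_id: str, device_id: str,
                   bucket: str = "5 minutes", window: str = "24 hours") -> dict:
    """Bind parameters for METRICS_QUERY; defaults are illustrative."""
    return {"tenant_id": tenant_id, "device_id": device_id,
            "bucket": bucket, "window": window}
```

`time_bucket` groups raw samples into fixed intervals, which keeps graph queries cheap even over large retention windows.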
Real-Time Event Streaming (Backend → Frontend via SSE):
- Frontend calls `POST /api/auth/sse-token` to exchange the session cookie for a short-lived SSE bearer token
- Token is valid for 25 seconds and is refreshed shortly before expiry
- Frontend opens an EventSource to `/api/sse?topics=device_status,alert_fired,config_push,firmware_progress,metric_update`
- Backend maintains SSE connections, pushes events from NATS subscribers
- Reconnection on disconnect with exponential backoff (1s → 30s max)
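The wire format the backend emits on this stream can be sketched as follows (the queue-draining generator is illustrative; in FastAPI it would back a `StreamingResponse` with `media_type="text/event-stream"`):

```python
import json
from typing import AsyncIterator

def sse_frame(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event frame in text/event-stream format:
    an `event:` line naming the topic, a `data:` line with the JSON
    payload, and a blank line terminating the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

async def sse_stream(queue) -> AsyncIterator[str]:
    """Drain (event, data) pairs from an asyncio.Queue that the NATS
    subscribers feed, yielding one SSE frame per event."""
    while True:
        event, data = await queue.get()
        yield sse_frame(event, data)
```

On the browser side, the `event:` field is what lets `EventSource.addEventListener("device_status", ...)` dispatch each topic to its own handler.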
Multi-Tenant Isolation (Request → Middleware → RLS):
- Frontend sends JWT token in Authorization header or httpOnly cookie
- Backend `tenant_context.py` middleware extracts the user from the JWT and determines tenant_id
- Middleware executes `SET LOCAL app.current_tenant = '{tenant_id}'` on the database session
- All subsequent queries are automatically filtered by the RLS policy `(tenant_id = current_setting('app.current_tenant'))`
- Superadmin can re-set tenant context to access any tenant
- Admin sessions (migrations, NATS subscribers) use superuser connection, handle tenant routing explicitly
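A minimal sketch of the per-request tenant scoping, assuming the `set_config()` form (which, unlike a literal `SET LOCAL`, accepts bind parameters; the helper name is hypothetical):

```python
# set_config(name, value, is_local=true) is transaction-local, the same
# scope SET LOCAL gives, but it is parameterizable, avoiding string
# interpolation of the tenant ID into SQL.
TENANT_SQL = "SELECT set_config('app.current_tenant', :tenant_id, true)"

def tenant_context_statement(tenant_id: str):
    """Return (sql, params) to scope the current transaction to one tenant.

    With an async SQLAlchemy session this would run as:
        await session.execute(text(TENANT_SQL), {"tenant_id": tenant_id})
    after which every query in the transaction is filtered by the RLS
    policy tenant_id = current_setting('app.current_tenant').
    """
    return TENANT_SQL, {"tenant_id": tenant_id}
```

Because the setting is transaction-local, a pooled connection returned after commit carries no stale tenant context into the next request.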
State Management:
- Frontend: TanStack Query for server state (device list, metrics, config), React Context for session/auth state
- Backend: Async SQLAlchemy ORM with automatic transaction management per request
- Poller: In-memory device state map with per-device circuit breaker tracking failures and backoff
- Shared: Redis for distributed locks, NATS for event persistence (JetStream replays)
Key Abstractions
Device Client (poller/internal/device/):
- Purpose: Binary API communication with RouterOS devices
- Files: `client.go`, `version.go`, `health.go`, `interfaces.go`, `wireless.go`, `firmware.go`, `cert_deploy.go`, `sftp.go`
- Pattern: RouterOS binary API command execution, metric parsing and extraction
- Usage: Worker polls device state and metrics in parallel goroutines
Scheduler & Worker (poller/internal/poller/scheduler.go, worker.go):
- Purpose: Orchestrate per-device polling goroutines with circuit breaker resilience
- Pattern: Per-device goroutine with Redis distributed locking to prevent duplicate polls across replicas
- Lifecycle: Discover new devices from DB, create goroutine; remove devices, cancel goroutine
- Circuit Breaker: Exponential backoff after N consecutive failures, resets on success
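The circuit-breaker behavior described above can be sketched as follows (Python for illustration; the real implementation is Go, and the threshold/delay values are assumptions):

```python
import time
from typing import Optional

class CircuitBreaker:
    """Per-device breaker: after `threshold` consecutive failures, skip
    polls for an exponentially growing window (capped); reset on success."""

    def __init__(self, threshold: int = 3, base_delay: float = 5.0,
                 max_delay: float = 300.0) -> None:
        self.threshold = threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = 0
        self.open_until = 0.0

    def allow(self, now: Optional[float] = None) -> bool:
        """May the worker attempt a poll right now?"""
        now = time.monotonic() if now is None else now
        return now >= self.open_until

    def record_success(self) -> None:
        self.failures = 0
        self.open_until = 0.0

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.threshold:
            # Delay doubles with each failure past the threshold, capped.
            delay = min(self.base_delay * 2 ** (self.failures - self.threshold),
                        self.max_delay)
            self.open_until = now + delay
```

The scheduler would call `allow()` before each poll cycle and feed results back via `record_success()`/`record_failure()`, so an unreachable device stops consuming connection attempts without being forgotten.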
NATS Publisher (poller/internal/bus/publisher.go):
- Purpose: Publish typed device events to JetStream streams
- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent
- Each event includes device_id and tenant_id for multi-tenant routing
- Consumers: Backend subscribers, audit logging, alert evaluation
Tunnel Manager (poller/internal/tunnel/manager.go):
- Purpose: Manage WinBox TCP tunnels to devices (port-forwarded SOCKS proxies)
- Port pool: Allocate ephemeral local ports for tunnel endpoints
- Pattern: Accept local connections on port, tunnel to device's WinBox port via binary API
SSH Relay (poller/internal/sshrelay/server.go, session.go, bridge.go):
- Purpose: SSH terminal access to RouterOS devices for remote management
- Pattern: SSH server on poller, bridges SSH sessions to RouterOS via binary API terminal protocol
- Authentication: SSH key or password relay from frontend
FastAPI Router Pattern (backend/app/routers/):
- Files: `devices.py`, `auth.py`, `alerts.py`, `config_editor.py`, `templates.py`, `metrics.py`, etc.
- Pattern: APIRouter with Depends() for RBAC, tenant context, rate limiting
- All routes tenant-scoped under `/api/tenants/{tenant_id}/...`
- RLS enforcement: Automatic via `SET LOCAL app.current_tenant` in the `get_current_user` dependency
Async Service Layer (backend/app/services/):
- Purpose: Business logic, database operations, integration with external systems
- Files: `device.py`, `auth.py`, `backup_service.py`, `ca_service.py`, `alert_evaluator.py`, etc.
- Pattern: Async functions using AsyncSession, composable for multiple operations in a single transaction
- NATS Integration: Subscribers consume events, services update database accordingly
NATS Subscribers (backend/app/services/*_subscriber.py):
- Purpose: Consume events from NATS JetStream, update application state
- Lifecycle: Started/stopped in FastAPI lifespan context manager
- Examples: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware update tracking)
- Pattern: JetStream consumer with durable name, explicit message acking for reliability
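A sketch of this subscriber pattern, assuming nats-py's JetStream API (`durable`, `manual_ack`) and hypothetical payload/column names:

```python
import json

def apply_status_event(payload: bytes) -> dict:
    """Parse a DeviceStatusEvent payload into the column updates the
    subscriber would write to the device row (field names hypothetical)."""
    evt = json.loads(payload)
    return {
        "id": evt["device_id"],
        "tenant_id": evt["tenant_id"],  # routed explicitly; admin session bypasses RLS
        "status": evt["status"],
        "ros_version": evt.get("version"),
    }

async def run_subscriber(nc):
    """Durable JetStream consumer with explicit acks (nats-py sketch)."""
    js = nc.jetstream()

    async def on_msg(msg):
        updates = apply_status_event(msg.data)
        # ... persist `updates` via the admin DB session ...
        await msg.ack()  # explicit ack: unacked messages are redelivered

    await js.subscribe("device.status.>", durable="backend-status",
                       cb=on_msg, manual_ack=True)
```

The durable name means the consumer resumes from its last acked message after a backend restart, and acking only after the database write succeeds gives at-least-once processing.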
Frontend Router (frontend/src/routes/):
- Pattern: TanStack Router file-based routing
- Structure: `_authenticated.tsx` (layout for logged-in users), `_authenticated/tenants/$tenantId/devices/...` (device management)
- Entry: `__root.tsx` (QueryClientProvider setup), `_authenticated.tsx` (auth check + layout)
Frontend Event Stream Hook (frontend/src/hooks/useEventStream.ts):
- Purpose: Manage SSE connection lifecycle, handle reconnection, parse event payloads
- Pattern: useRef for connection state, setInterval for token refresh, EventSource API
- Callbacks: Per-event-type handlers registered by components
- State: Managed in EventStreamContext for app-wide access
Entry Points
Poller Binary (poller/cmd/poller/main.go):
- Location: `poller/cmd/poller/main.go`
- Triggers: Docker container start, Kubernetes pod initialization
- Responsibilities: Load config, initialize NATS/Redis/PostgreSQL connections, start scheduler, setup observability (Prometheus metrics, structured logging)
- Config source: Environment variables (see `poller/internal/config/config.go`)
Backend API (backend/app/main.py):
- Location: `backend/app/main.py`
- Triggers: Docker container start, uvicorn ASGI server
- Responsibilities: Configure logging, run migrations, bootstrap first admin, start NATS subscribers, setup middleware, register routers
- Lifespan: Async context manager handles startup/shutdown of services
- Health check: `/api/health` endpoint, `/api/readiness` for Kubernetes
Frontend Entry (frontend/src/routes/__root.tsx):
- Location: `frontend/src/routes/__root.tsx`
- Triggers: Browser loads the app at `/`
- Responsibilities: Wrap app in QueryClientProvider (TanStack Query), set up root error boundary
- Auth flow: Routes under `_authenticated` check the JWT token, redirect to login if missing
- Real-time setup: Establish SSE connection via the `useEventStream` hook in the layout
Error Handling
Strategy: Three-tier error handling across services
Patterns:
- Poller: Circuit breaker with exponential backoff for device connection failures. Logs all errors as structured JSON with context (device_id, tenant_id, attempt number). Publishes failure events to NATS for alerting.
- Backend: FastAPI exception handlers convert service errors to HTTP responses. RLS violations return 403 Forbidden; invalid tenant access returns 404. Database errors are logged via structlog, with request-ID middleware for correlation.
- Frontend: TanStack Query retry logic (1 retry by default), error boundaries catch component crashes, toast notifications display user-friendly error messages, request-ID middleware propagates correlation IDs.
Cross-Cutting Concerns
Logging:
- Poller: `log/slog` with JSON handler, structured fields (service, device_id, tenant_id, operation)
- Backend: `structlog` with async logger, JSON output in production
- Frontend: Browser console + error tracking (if configured)
Validation:
- Backend: Pydantic models (`app/schemas/`) enforce request shape and types; custom validators for business logic (e.g., SRP challenge validation)
- Frontend: TanStack Form for client-side validation before submission
- Database: PostgreSQL CHECK constraints and unique indexes
Authentication:
- Zero-knowledge SRP-6a for initial password enrollment (client never sends plaintext)
- JWT tokens issued after SRP enrollment, stored as httpOnly cookies
- Optional API keys with scoped access for programmatic use
- SSE token exchange for event stream access (short-lived, single-use)
Authorization (RBAC):
- Four roles: super_admin (all access), tenant_admin (full tenant access), operator (read+config), viewer (read-only)
- Role hierarchy enforced by the `require_role()` dependency in routers
- API key scopes: subset of operator permissions (read, write_device, write_config, etc.)
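A minimal sketch of the hierarchy check behind `require_role()` (the dependency shape is assumed; the real version integrates with FastAPI's `Depends()` and raises `HTTPException`):

```python
# Hierarchy from the roles above: viewer < operator < tenant_admin < super_admin.
ROLE_RANK = {"viewer": 0, "operator": 1, "tenant_admin": 2, "super_admin": 3}

def has_role(user_role: str, required: str) -> bool:
    """A higher-ranked role satisfies any lower-ranked requirement."""
    return ROLE_RANK.get(user_role, -1) >= ROLE_RANK[required]

def require_role(required: str):
    """Dependency factory: returns a checker that rejects insufficient roles.

    In FastAPI the checker would receive the current user via Depends()
    and raise HTTPException(status_code=403) instead of PermissionError.
    """
    def checker(user: dict) -> dict:
        if not has_role(user.get("role", ""), required):
            raise PermissionError(f"requires role {required!r} or higher")
        return user
    return checker
```

Encoding the hierarchy as ranks means a route guarded with `require_role("operator")` automatically admits tenant admins and superadmins without listing them.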
Rate Limiting:
- Backend: Token bucket limiter on sensitive endpoints (login, token generation, device operations)
- Configuration: `app/middleware/rate_limit.py` defines limits per endpoint
- Redis-backed for distributed rate-limit state
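The token-bucket mechanics can be sketched in-memory as follows (the production limiter keeps this state in Redis so all replicas share it, typically with a Lua script to make refill-and-take atomic):

```python
class TokenBucket:
    """Minimal token bucket: `capacity` bounds bursts, `refill_per_sec`
    sets the sustained rate; each allowed request consumes one token."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A sensitive endpoint such as login might use a small capacity (burst) with a slow refill, while device operations get a larger bucket; in the Redis-backed version the bucket key would include the client or tenant identifier.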
Multi-Tenancy:
- Database RLS: All tables have `tenant_id`; policy enforces the current_tenant filter
- Tenant context: Middleware extracts it from the JWT, sets the `app.current_tenant` local variable
- Superadmin bypass: Can re-set tenant context to access any tenant
- Admin operations: Use superuser connection, explicit tenant routing