# Architecture

**Analysis Date:** 2026-03-12

## Pattern Overview

**Overall:** Event-driven microservice architecture with asynchronous pub/sub messaging

**Key Characteristics:**
- Three independent microservices: Go Poller, Python FastAPI Backend, React/TypeScript Frontend
- NATS JetStream as central event bus for all inter-service communication
- PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation at database layer
- Real-time Server-Sent Events (SSE) for frontend event streaming
- Distributed task coordination using Redis distributed locks
- Per-tenant encryption via OpenBao Transit KMS engine

## Layers

**Device Polling Layer (Go Poller):**
- Purpose: Connects to RouterOS devices via binary API (port 8729), detects status/version, collects metrics, pushes configs, manages WinBox/SSH tunnels
- Location: `poller/`
- Contains: Device client, scheduler, SSH relay, WinBox tunnel manager, NATS publisher, Redis credential cache, OpenBao vault client
- Depends on: NATS JetStream, Redis, PostgreSQL (read-only for device list), OpenBao
- Used by: Backend, which consumes the events the poller publishes via NATS

**Event Bus Layer (NATS JetStream):**
- Purpose: Central publish/subscribe message broker for all service-to-service communication
- Streams: DEVICE_EVENTS, OPERATION_EVENTS, ALERT_EVENTS
- Contains: Device status changes, metrics, config change notifications, push rollback triggers, alert events, session audit events
- All events include device_id and tenant_id for multi-tenant routing

**Backend API Layer (Python FastAPI):**
- Purpose: RESTful API, business logic, database persistence, event subscription and processing
- Location: `backend/app/`
- Contains: FastAPI routers, SQLAlchemy ORM models, async services, NATS subscribers, middleware (RBAC, tenant context, rate limiting)
- Depends on: PostgreSQL (via RLS-enforced app_user connection), NATS JetStream, Redis, OpenBao, email/webhook services
- Used by: Frontend (REST API), poller (reads device list, writes operation results)

**Data Persistence Layer (PostgreSQL + TimescaleDB):**
- Purpose: Multi-tenant relational data store with RLS-enforced isolation
- Connection: Two engines in `backend/app/database.py`
  - Admin engine (superuser): Migrations, bootstrap, admin operations
  - App engine (app_user role): All tenant-scoped API requests, RLS enforced
- Row-Level Security: `SET LOCAL app.current_tenant` set per-request by `get_current_user` dependency
- Contains: Devices, users, tenants, alerts, config backups, templates, VPN peers, certificates, audit logs, metrics aggregates

**Caching/Locking Layer (Redis):**
- Purpose: Distributed locks (poller prevents duplicate device polls), session management, temporary data
- Usage: `redislock` package in poller for per-device poll coordination across replicas
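The per-device coordination can be pictured with a small sketch. It is illustrative only: the poller itself is written in Go and uses `redislock`, whereas this Python stand-in keeps locks in memory and merely imitates the set-if-absent-with-expiry behaviour of Redis `SET ... NX PX`. The `InMemoryLockStore` class, the `poll:lock:` key prefix, and the 30-second TTL are assumptions, not names from the codebase.

```python
import time
import uuid


class InMemoryLockStore:
    """Stand-in for Redis: set-if-absent with a TTL (like SET key val NX PX ttl)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (owner_token, expires_at)

    def acquire(self, key, ttl_seconds):
        """Return an owner token if the lock was free or expired, else None."""
        now = self._clock()
        current = self._entries.get(key)
        if current is not None and current[1] > now:
            return None  # another replica holds an unexpired lock
        token = uuid.uuid4().hex
        self._entries[key] = (token, now + ttl_seconds)
        return token

    def release(self, key, token):
        """Release only if the caller still owns the lock."""
        current = self._entries.get(key)
        if current is not None and current[0] == token:
            del self._entries[key]
            return True
        return False


def poll_once(store, device_id, do_poll, ttl_seconds=30.0):
    """Poll a device only if this replica wins the per-device lock."""
    key = f"poll:lock:{device_id}"  # hypothetical key format
    token = store.acquire(key, ttl_seconds)
    if token is None:
        return False  # some other replica is already polling this device
    try:
        do_poll(device_id)
        return True
    finally:
        store.release(key, token)
```

The TTL matters as much as the lock itself: if a replica crashes mid-poll, the entry expires and another replica can pick the device up.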

**Secret Management Layer (OpenBao):**
- Purpose: Transit KMS for per-tenant envelope encryption, credential storage access control
- Mode: Transit secret engine wrapping credentials for envelope encryption
- Accessed by: Poller (fetch decrypted credentials), backend (re-encrypt on password change)

**Frontend Layer (React 19 + TanStack):**
- Purpose: Web UI for fleet management, device control, configuration, monitoring
- Location: `frontend/src/`
- Contains: TanStack Router, TanStack Query, Tailwind CSS, SSE event stream integration, WebSocket tunnels
- Depends on: Backend REST API, Server-Sent Events for real-time updates, WebSocket for terminal/remote access
- Entry point: `frontend/src/routes/__root.tsx` (QueryClientProvider, root layout)

## Data Flow

**Device Status Polling (Poller → NATS → Backend):**

1. Poller scheduler periodically fetches device list from PostgreSQL
2. For each device, poller's `Worker` connects to RouterOS binary API (port 8729, TLS)
3. Worker collects device status (online/offline), version, system metrics
4. Worker publishes `DeviceStatusEvent` to NATS stream `DEVICE_EVENTS` on subject `device.status.{device_id}`
5. Backend subscribes to `device.status.>` via `nats_subscriber.py`
6. Subscriber updates device record in PostgreSQL via admin session (bypasses RLS)
7. Frontend receives update via SSE subscription to `/api/sse?topics=device_status`
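Steps 4 and 5 rely on NATS subject wildcards: the poller publishes to one subject per device, and the backend's `device.status.>` subscription covers all of them. The sketch below re-implements the matching rule locally for illustration (`*` matches exactly one token, `>` matches one or more trailing tokens); `status_subject` and `subject_matches` are hypothetical helpers, not part of the NATS client or this codebase.

```python
def status_subject(device_id: str) -> str:
    """Build the per-device status subject described in step 4."""
    return f"device.status.{device_id}"


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' = exactly one token, '>' = one or more trailing tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


subject = status_subject("3f2a")  # made-up device id
print(subject_matches("device.status.>", subject))                      # True
print(subject_matches("device.status.>", "device.metrics.health.3f2a"))  # False
```

This only works cleanly if device IDs contain no `.` characters, since the dot is the token separator.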

**Configuration Push (Frontend → Backend → Poller → Router):**

1. Frontend calls `POST /api/tenants/{tenant_id}/devices/{device_id}/config` with new configuration
2. Backend stores config in PostgreSQL, publishes `ConfigPushEvent` to `OPERATION_EVENTS`
3. Poller subscribes to push operation events, receives config delta
4. Poller connects to device via binary API, executes RouterOS commands (two-phase: backup, apply, verify)
5. On completion, poller publishes `ConfigPushCompletedEvent` to NATS
6. Backend subscriber updates operation record with success/failure
7. Frontend notifies user via SSE

**Metrics Collection (Poller → NATS → Backend → Frontend):**

1. Poller collects health metrics (CPU, memory, disk), interface stats, wireless stats per poll cycle
2. Publishes `DeviceMetricsEvent` to `DEVICE_EVENTS` on subject `device.metrics.{type}.{device_id}`
3. Backend `metrics_subscriber.py` aggregates into TimescaleDB hypertables
4. Frontend queries `/api/tenants/{tenant_id}/devices/{device_id}/metrics` for graphs
5. Alternatively, frontend SSE stream pushes metric updates for real-time graphs

**Real-Time Event Streaming (Backend → Frontend via SSE):**

1. Frontend calls `POST /api/auth/sse-token` to exchange session cookie for short-lived SSE bearer token
2. Token valid for 25 seconds (refreshed every 25 seconds before expiry)
3. Frontend opens EventSource to `/api/sse?topics=device_status,alert_fired,config_push,firmware_progress,metric_update`
4. Backend maintains SSE connections, pushes events from NATS subscribers
5. Reconnection on disconnect with exponential backoff (1s → 30s max)
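The reconnection schedule in step 5 can be sketched as a delay generator. Only the 1s starting point and the 30s cap come from the description above; the doubling factor is an assumption, and the real hook is TypeScript rather than Python.

```python
from itertools import islice


def reconnect_delays(initial=1.0, maximum=30.0, factor=2.0):
    """Yield reconnect delays in seconds: start at `initial`, multiply by `factor`, cap at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


first_seven = list(islice(reconnect_delays(), 7))
print(first_seven)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A client would typically restart the generator after a successful connection so the next outage begins again at the initial delay.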

**Multi-Tenant Isolation (Request → Middleware → RLS):**

1. Frontend sends JWT token in Authorization header or httpOnly cookie
2. Backend `tenant_context.py` middleware extracts user from JWT, determines tenant_id
3. Middleware calls `SET LOCAL app.current_tenant = '{tenant_id}'` on the database session
4. All subsequent queries automatically filtered by RLS policy `(tenant_id = current_setting('app.current_tenant'))`
5. Superadmin can re-set tenant context to access any tenant
6. Admin sessions (migrations, NATS subscribers) use superuser connection, handle tenant routing explicitly
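Step 3 interpolates the tenant ID into a `SET LOCAL` statement, which is only safe if the value is validated first. A minimal sketch of one way to do that is shown below. It assumes tenant IDs are UUIDs (not stated above) and uses PostgreSQL's `set_config(name, value, is_local)` function, which is the parameterizable equivalent of `SET LOCAL` when `is_local` is true. The `set_tenant_context` helper and its `execute` callable are hypothetical; `:tenant_id` follows SQLAlchemy's named bind-parameter style.

```python
import uuid

# set_config(..., true) scopes the setting to the current transaction, like
# SET LOCAL, but accepts the value as a bind parameter instead of inline text.
SET_TENANT_SQL = "SELECT set_config('app.current_tenant', :tenant_id, true)"


def set_tenant_context(execute, tenant_id):
    """Validate tenant_id as a UUID, then set it for the current transaction.

    `execute` is any callable taking (sql, params); it stands in for a
    database session's execute method (an assumption for this sketch).
    """
    canonical = str(uuid.UUID(str(tenant_id)))  # raises ValueError on bad input
    execute(SET_TENANT_SQL, {"tenant_id": canonical})
    return canonical


# Usage with a recording stub in place of a real session:
calls = []
set_tenant_context(lambda sql, params: calls.append((sql, params)),
                   "123e4567-e89b-12d3-a456-426614174000")
print(calls[0][1])  # {'tenant_id': '123e4567-e89b-12d3-a456-426614174000'}
```

Because the setting is transaction-local, it disappears at commit or rollback, so a pooled connection does not carry one tenant's context into the next request.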

**State Management:**

- Frontend: TanStack Query for server state (device list, metrics, config), React Context for session/auth state
- Backend: Async SQLAlchemy ORM with automatic transaction management per request
- Poller: In-memory device state map with per-device circuit breaker tracking failures and backoff
- Shared: Redis for distributed locks, NATS for event persistence (JetStream replays)

## Key Abstractions

**Device Client (`poller/internal/device/`):**
- Purpose: Binary API communication with RouterOS devices
- Files: `client.go`, `version.go`, `health.go`, `interfaces.go`, `wireless.go`, `firmware.go`, `cert_deploy.go`, `sftp.go`
- Pattern: RouterOS binary API command execution, metric parsing and extraction
- Usage: Worker polls device state and metrics in parallel goroutines

**Scheduler & Worker (`poller/internal/poller/scheduler.go`, `worker.go`):**
- Purpose: Orchestrate per-device polling goroutines with circuit breaker resilience
- Pattern: Per-device goroutine with Redis distributed locking to prevent duplicate polls across replicas
- Lifecycle: Discover new devices from DB, create goroutine; remove devices, cancel goroutine
- Circuit Breaker: Exponential backoff after N consecutive failures, resets on success
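The circuit breaker rule (back off exponentially after N consecutive failures, reset on success) can be sketched as follows. This is a Python illustration of the rule only; the poller is Go, and the `CircuitBreaker` class, the threshold of 3, the 5-second base delay, and the 300-second cap are all assumed values.

```python
class CircuitBreaker:
    """Per-device failure tracker: backs off exponentially after repeated failures."""

    def __init__(self, threshold=3, base_delay=5.0, max_delay=300.0):
        self.threshold = threshold    # N consecutive failures before backing off
        self.base_delay = base_delay  # first backoff, in seconds
        self.max_delay = max_delay    # upper bound on the backoff
        self.failures = 0

    def record_success(self):
        """Any success closes the breaker and clears the failure count."""
        self.failures = 0

    def record_failure(self):
        """Count one more consecutive failure."""
        self.failures += 1

    def backoff_seconds(self):
        """Extra delay before the next poll; 0 while below the threshold."""
        if self.failures < self.threshold:
            return 0.0
        exponent = self.failures - self.threshold
        return min(self.base_delay * (2 ** exponent), self.max_delay)
```

Keeping one breaker per device means a single unreachable router slows only its own polling loop, not the rest of the fleet.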

**NATS Publisher (`poller/internal/bus/publisher.go`):**
- Purpose: Publish typed device events to JetStream streams
- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent
- Each event includes device_id and tenant_id for multi-tenant routing
- Consumers: Backend subscribers, audit logging, alert evaluation

**Tunnel Manager (`poller/internal/tunnel/manager.go`):**
- Purpose: Manage WinBox TCP tunnels to devices (port-forwarded SOCKS proxies)
- Port pool: Allocate ephemeral local ports for tunnel endpoints
- Pattern: Accept local connections on port, tunnel to device's WinBox port via binary API

**SSH Relay (`poller/internal/sshrelay/server.go`, `session.go`, `bridge.go`):**
- Purpose: SSH terminal access to RouterOS devices for remote management
- Pattern: SSH server on poller, bridges SSH sessions to RouterOS via binary API terminal protocol
- Authentication: SSH key or password relay from frontend

**FastAPI Router Pattern (`backend/app/routers/`):**
- Files: `devices.py`, `auth.py`, `alerts.py`, `config_editor.py`, `templates.py`, `metrics.py`, etc.
- Pattern: APIRouter with Depends() for RBAC, tenant context, rate limiting
- All routes tenant-scoped under `/api/tenants/{tenant_id}/...`
- RLS enforcement: Automatic via `SET LOCAL app.current_tenant` in `get_current_user` dependency

**Async Service Layer (`backend/app/services/`):**
- Purpose: Business logic, database operations, integration with external systems
- Files: `device.py`, `auth.py`, `backup_service.py`, `ca_service.py`, `alert_evaluator.py`, etc.
- Pattern: Async functions using AsyncSession, composable for multiple operations in single transaction
- NATS Integration: Subscribers consume events, services update database accordingly

**NATS Subscribers (`backend/app/services/*_subscriber.py`):**
- Purpose: Consume events from NATS JetStream, update application state
- Lifecycle: Started/stopped in FastAPI lifespan context manager
- Examples: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware update tracking)
- Pattern: JetStream consumer with durable name, explicit message acking for reliability

**Frontend Router (`frontend/src/routes/`):**
- Pattern: TanStack Router file-based routing
- Structure: `_authenticated.tsx` (layout for logged-in users), `_authenticated/tenants/$tenantId/devices/...` (device management)
- Entry: `__root.tsx` (QueryClientProvider setup), `_authenticated.tsx` (auth check + layout)

**Frontend Event Stream Hook (`frontend/src/hooks/useEventStream.ts`):**
- Purpose: Manage SSE connection lifecycle, handle reconnection, parse event payloads
- Pattern: useRef for connection state, setInterval for token refresh, EventSource API
- Callbacks: Per-event-type handlers registered by components
- State: Managed in EventStreamContext for app-wide access

## Entry Points

**Poller Binary (`poller/cmd/poller/main.go`):**
- Location: `poller/cmd/poller/main.go`
- Triggers: Docker container start, Kubernetes pod initialization
- Responsibilities: Load config, initialize NATS/Redis/PostgreSQL connections, start scheduler, setup observability (Prometheus metrics, structured logging)
- Config source: Environment variables (see `poller/internal/config/config.go`)

**Backend API (`backend/app/main.py`):**
- Location: `backend/app/main.py`
- Triggers: Docker container start, uvicorn ASGI server
- Responsibilities: Configure logging, run migrations, bootstrap first admin, start NATS subscribers, setup middleware, register routers
- Lifespan: Async context manager handles startup/shutdown of services
- Health check: `/api/health` endpoint, `/api/readiness` for k8s

**Frontend Entry (`frontend/src/routes/__root.tsx`):**
- Location: `frontend/src/routes/__root.tsx`
- Triggers: Browser loads app at `/`
- Responsibilities: Wrap app in QueryClientProvider (TanStack Query), setup root error boundary
- Auth flow: Routes under `_authenticated` check JWT token, redirect to login if missing
- Real-time setup: Establish SSE connection via `useEventStream` hook in layout

## Error Handling

**Strategy:** Three-tier error handling across services

**Patterns:**

- **Poller**: Circuit breaker exponential backoff for device connection failures. Logs all errors to structured JSON with context (device_id, tenant_id, attempt number). Publishes failure events to NATS for alerting.

- **Backend**: FastAPI exception handlers convert service errors to HTTP responses. RLS violations return 403 Forbidden. Invalid tenant access returns 404. Database errors logged via structlog with request_id middleware for correlation.

- **Frontend**: TanStack Query retry logic (1 retry by default), error boundaries catch component crashes, toast notifications display user-friendly error messages, RequestID middleware propagates correlation IDs.

## Cross-Cutting Concerns

**Logging:**
- Poller: `log/slog` with JSON handler, structured fields (service, device_id, tenant_id, operation)
- Backend: `structlog` with async logger, JSON output in production
- Frontend: Browser console + error tracking (if configured)

**Validation:**
- Backend: Pydantic models (`app/schemas/`) enforce request shape and types, custom validators for business logic (e.g., SRP challenge validation)
- Frontend: TanStack Form for client-side validation before submission
- Database: PostgreSQL CHECK constraints and unique indexes

**Authentication:**
- Zero-knowledge SRP-6a for initial password enrollment (client never sends plaintext)
- JWT tokens issued after SRP enrollment, stored as httpOnly cookies
- Optional API keys with scoped access for programmatic use
- SSE token exchange for event stream access (short-lived, single-use)
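The "short-lived, single-use" property of the SSE token can be sketched with a small token store. This is an in-memory illustration under assumptions: the `SseTokenStore` class is hypothetical, and a backend running several replicas would presumably need shared storage instead of a local dictionary.

```python
import secrets
import time


class SseTokenStore:
    """Issue short-lived tokens that can be redeemed at most once."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens = {}  # token -> (user_id, expires_at)

    def issue(self, user_id):
        """Create a random token bound to user_id with a fixed lifetime."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user_id, self._clock() + self._ttl)
        return token

    def redeem(self, token):
        """Return the user_id for a valid token, or None; the token is consumed either way."""
        entry = self._tokens.pop(token, None)  # pop makes the token single-use
        if entry is None:
            return None
        user_id, expires_at = entry
        if self._clock() >= expires_at:
            return None  # expired
        return user_id
```

Removing the token on first use means a value leaked through a URL or log line cannot be replayed once the EventSource connection has been opened.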

**Authorization (RBAC):**
- Four roles: super_admin (all access), tenant_admin (full tenant access), operator (read+config), viewer (read-only)
- Role hierarchy enforced by `require_role()` dependency in routers
- API key scopes: subset of operator permissions (read, write_device, write_config, etc.)
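A hierarchy check of this kind can be sketched as a rank comparison. The role names come from the list above; the numeric ranks, the `Forbidden` exception, and this particular `require_role` signature are assumptions, and the real dependency may look different.

```python
ROLE_RANK = {"viewer": 0, "operator": 1, "tenant_admin": 2, "super_admin": 3}


class Forbidden(Exception):
    """Raised when the caller's role is below the required minimum."""


def require_role(minimum):
    """Return a checker that accepts `minimum` and every role ranked above it."""
    required = ROLE_RANK[minimum]  # KeyError here means a typo in the route definition

    def check(user_role):
        # Unknown roles rank below viewer and are always rejected.
        if ROLE_RANK.get(user_role, -1) < required:
            raise Forbidden(f"{user_role!r} lacks {minimum!r} privileges")
        return user_role

    return check


check_operator = require_role("operator")
print(check_operator("tenant_admin"))  # tenant_admin
```

In a FastAPI router the returned checker would typically be wrapped in `Depends()` so the role test runs before the handler body.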

**Rate Limiting:**
- Backend: Token bucket limiter on sensitive endpoints (login, token generation, device operations)
- Configuration: `app/middleware/rate_limit.py` defines limits per endpoint
- Redis-backed for distributed rate limit state
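For reference, the token bucket algorithm itself is small. The sketch below keeps its state in process memory purely for illustration; the `TokenBucket` class and its parameters are assumptions, and the limiter described above stores its state in Redis so that all backend replicas share one bucket per key.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens if available; return False when the caller should be throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `capacity=5` and `refill_rate=1`, a client can make five requests back to back and is then held to roughly one request per second.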

**Multi-Tenancy:**
- Database RLS: All tables have `tenant_id`, policy enforces current_tenant filter
- Tenant context: Middleware extracts from JWT, sets `app.current_tenant` local variable
- Superadmin bypass: Can re-set tenant context to access any tenant
- Admin operations: Use superuser connection, explicit tenant routing

---

*Architecture analysis: 2026-03-12*