docs: map existing codebase

Jason Staack
2026-03-12 19:33:26 -05:00
parent 5beede9502
commit 70126980a4
7 changed files with 2252 additions and 0 deletions


# Architecture
**Analysis Date:** 2026-03-12
## Pattern Overview
**Overall:** Event-driven microservice architecture with asynchronous pub/sub messaging
**Key Characteristics:**
- Three independent microservices: Go Poller, Python FastAPI Backend, React/TypeScript Frontend
- NATS JetStream as central event bus for all inter-service communication
- PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation at database layer
- Real-time Server-Sent Events (SSE) for frontend event streaming
- Distributed task coordination using Redis distributed locks
- Per-tenant encryption via OpenBao Transit KMS engine
## Layers
**Device Polling Layer (Go Poller):**
- Purpose: Connects to RouterOS devices via binary API (port 8729), detects status/version, collects metrics, pushes configs, manages WinBox/SSH tunnels
- Location: `poller/`
- Contains: Device client, scheduler, SSH relay, WinBox tunnel manager, NATS publisher, Redis credential cache, OpenBao vault client
- Depends on: NATS JetStream, Redis, PostgreSQL (read-only for device list), OpenBao
- Used by: Publishes events to backend via NATS
**Event Bus Layer (NATS JetStream):**
- Purpose: Central publish/subscribe message broker for all service-to-service communication
- Streams: DEVICE_EVENTS, OPERATION_EVENTS, ALERT_EVENTS
- Contains: Device status changes, metrics, config change notifications, push rollback triggers, alert events, session audit events
- All events include device_id and tenant_id for multi-tenant routing
**Backend API Layer (Python FastAPI):**
- Purpose: RESTful API, business logic, database persistence, event subscription and processing
- Location: `backend/app/`
- Contains: FastAPI routers, SQLAlchemy ORM models, async services, NATS subscribers, middleware (RBAC, tenant context, rate limiting)
- Depends on: PostgreSQL (via RLS-enforced app_user connection), NATS JetStream, Redis, OpenBao, email/webhook services
- Used by: Frontend (REST API), poller (reads device list, writes operation results)
**Data Persistence Layer (PostgreSQL + TimescaleDB):**
- Purpose: Multi-tenant relational data store with RLS-enforced isolation
- Connection: Two engines in `backend/app/database.py`
- Admin engine (superuser): Migrations, bootstrap, admin operations
- App engine (app_user role): All tenant-scoped API requests, RLS enforced
- Row-Level Security: `SET LOCAL app.current_tenant` set per-request by `get_current_user` dependency
- Contains: Devices, users, tenants, alerts, config backups, templates, VPN peers, certificates, audit logs, metrics aggregates
**Caching/Locking Layer (Redis):**
- Purpose: Distributed locks (poller prevents duplicate device polls), session management, temporary data
- Usage: `redislock` package in poller for per-device poll coordination across replicas
**Secret Management Layer (OpenBao):**
- Purpose: Transit KMS for per-tenant envelope encryption and access control over stored credentials
- Mode: Transit secret engine wrapping credentials for envelope encryption
- Accessed by: Poller (fetch decrypted credentials), backend (re-encrypt on password change)
**Frontend Layer (React 19 + TanStack):**
- Purpose: Web UI for fleet management, device control, configuration, monitoring
- Location: `frontend/src/`
- Contains: TanStack Router, TanStack Query, Tailwind CSS, SSE event stream integration, WebSocket tunnels
- Depends on: Backend REST API, Server-Sent Events for real-time updates, WebSocket for terminal/remote access
- Entry point: `frontend/src/routes/__root.tsx` (QueryClientProvider, root layout)
## Data Flow
**Device Status Polling (Poller → NATS → Backend):**
1. Poller scheduler periodically fetches device list from PostgreSQL
2. For each device, poller's `Worker` connects to RouterOS binary API (port 8729 TLS)
3. Worker collects device status (online/offline), version, system metrics
4. Worker publishes `DeviceStatusEvent` to NATS stream `DEVICE_EVENTS` topic `device.status.{device_id}`
5. Backend subscribes to `device.status.>` via `nats_subscriber.py`
6. Subscriber updates device record in PostgreSQL via admin session (bypasses RLS)
7. Frontend receives update via SSE subscription to `/api/sse?topics=device_status`
**Configuration Push (Frontend → Backend → Poller → Router):**
1. Frontend calls `POST /api/tenants/{tenant_id}/devices/{device_id}/config` with new configuration
2. Backend stores config in PostgreSQL, publishes `ConfigPushEvent` to `OPERATION_EVENTS`
3. Poller subscribes to push operation events, receives config delta
4. Poller connects to device via binary API, executes RouterOS commands (two-phase: backup, apply, verify)
5. On completion, poller publishes `ConfigPushCompletedEvent` to NATS
6. Backend subscriber updates operation record with success/failure
7. Frontend notifies user via SSE
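The two-phase apply in step 4 reduces to a small control-flow skeleton. A minimal sketch, with the backup/apply/verify/rollback steps passed in as callables (the real implementation runs RouterOS commands for each):

```python
from typing import Callable

def push_config(
    backup: Callable[[], str],
    apply_config: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Backup -> apply -> verify; on a failed verify or an apply error,
    restore the saved backup. Returns True only on a verified apply."""
    saved = backup()
    try:
        apply_config()
        if verify():
            return True
        rollback(saved)
        return False
    except Exception:
        rollback(saved)
        return False
```

The invariant is that `rollback` always receives the backup taken before any change, so a partially applied config never survives.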
**Metrics Collection (Poller → NATS → Backend → Frontend):**
1. Poller collects health metrics (CPU, memory, disk), interface stats, wireless stats per poll cycle
2. Publishes `DeviceMetricsEvent` to `DEVICE_EVENTS` topic `device.metrics.{type}.{device_id}`
3. Backend `metrics_subscriber.py` aggregates into TimescaleDB hypertables
4. Frontend queries `/api/tenants/{tenant_id}/devices/{device_id}/metrics` for graphs
5. Alternatively, frontend SSE stream pushes metric updates for real-time graphs
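The aggregation in step 3 happens server-side in TimescaleDB; the semantics of its `time_bucket` grouping can be illustrated in plain Python (bucket width and the averaged metric are illustrative choices):

```python
from collections import defaultdict

def time_bucket(ts: float, width_s: int = 60) -> float:
    """Floor a unix timestamp to its bucket start, like
    TimescaleDB's time_bucket('60 seconds', ts)."""
    return ts - (ts % width_s)

def aggregate(samples: list[tuple[float, float]], width_s: int = 60) -> dict[float, float]:
    """Average metric samples per bucket -- roughly what a hypertable
    aggregate computes before the frontend queries it for graphs."""
    buckets: dict[float, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[time_bucket(ts, width_s)].append(value)
    return {b: sum(v) / len(v) for b, v in buckets.items()}
```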
**Real-Time Event Streaming (Backend → Frontend via SSE):**
1. Frontend calls `POST /api/auth/sse-token` to exchange session cookie for short-lived SSE bearer token
2. Token is valid for 25 seconds; the frontend refreshes it on a matching interval just before expiry
3. Frontend opens EventSource to `/api/sse?topics=device_status,alert_fired,config_push,firmware_progress,metric_update`
4. Backend maintains SSE connections, pushes events from NATS subscribers
5. Reconnection on disconnect with exponential backoff (1s → 30s max)
**Multi-Tenant Isolation (Request → Middleware → RLS):**
1. Frontend sends JWT token in Authorization header or httpOnly cookie
2. Backend `tenant_context.py` middleware extracts user from JWT, determines tenant_id
3. Middleware calls `SET LOCAL app.current_tenant = '{tenant_id}'` on the database session
4. All subsequent queries automatically filtered by RLS policy `(tenant_id = current_setting('app.current_tenant'))`
5. Superadmin can re-set tenant context to access any tenant
6. Admin sessions (migrations, NATS subscribers) use superuser connection, handle tenant routing explicitly
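Step 3's `SET LOCAL` can be expressed with PostgreSQL's `set_config(..., true)`, which is equivalent but accepts a bound parameter. A minimal sketch of building that per-request statement (the UUID check is an illustrative defence-in-depth choice, not necessarily what the middleware does):

```python
import uuid

def tenant_context_statement(tenant_id: str) -> tuple[str, dict[str, str]]:
    """Per-request statement: set_config with is_local=true scopes the
    setting to the current transaction, like SET LOCAL."""
    uuid.UUID(tenant_id)  # reject anything that is not a UUID
    return (
        "SELECT set_config('app.current_tenant', %(tenant_id)s, true)",
        {"tenant_id": tenant_id},
    )
```

RLS policies then read the value back via `current_setting('app.current_tenant')`, as in step 4.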
**State Management:**
- Frontend: TanStack Query for server state (device list, metrics, config), React Context for session/auth state
- Backend: Async SQLAlchemy ORM with automatic transaction management per request
- Poller: In-memory device state map with per-device circuit breaker tracking failures and backoff
- Shared: Redis for distributed locks, NATS for event persistence (JetStream replays)
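The distributed-lock semantics used for per-device poll coordination boil down to "set if absent, release only with the owner's token". A runnable sketch with an in-memory stand-in for Redis (the real poller uses the `redislock` package; `set_nx` models `SET key value NX PX ttl` and `delete_if_match` models the usual compare-and-delete Lua script):

```python
import uuid

class InMemoryStore:
    """Stand-in for Redis so the semantics are runnable here."""
    def __init__(self):
        self.data: dict[str, str] = {}

    def set_nx(self, key: str, value: str) -> bool:
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def delete_if_match(self, key: str, value: str) -> bool:
        if self.data.get(key) == value:
            del self.data[key]
            return True
        return False

class PollLock:
    """One lock per device: only one replica polls a device at a time."""
    def __init__(self, store: InMemoryStore, device_id: str):
        self.store = store
        self.key = f"poll-lock:{device_id}"
        self.token = uuid.uuid4().hex  # only the owner may release

    def acquire(self) -> bool:
        return self.store.set_nx(self.key, self.token)

    def release(self) -> bool:
        return self.store.delete_if_match(self.key, self.token)
```

The per-owner token prevents a replica from releasing a lock it lost (for example after its TTL expired and another replica acquired it).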
## Key Abstractions
**Device Client (`poller/internal/device/`):**
- Purpose: Binary API communication with RouterOS devices
- Files: `client.go`, `version.go`, `health.go`, `interfaces.go`, `wireless.go`, `firmware.go`, `cert_deploy.go`, `sftp.go`
- Pattern: RouterOS binary API command execution, metric parsing and extraction
- Usage: Worker polls device state and metrics in parallel goroutines
**Scheduler & Worker (`poller/internal/poller/scheduler.go`, `worker.go`):**
- Purpose: Orchestrate per-device polling goroutines with circuit breaker resilience
- Pattern: Per-device goroutine with Redis distributed locking to prevent duplicate polls across replicas
- Lifecycle: Discover new devices from the DB and spawn a goroutine per device; cancel the goroutine when a device is removed
- Circuit Breaker: Exponential backoff after N consecutive failures, resets on success
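The breaker described above can be sketched as a small stateful class. Threshold, base delay, and cap are illustrative values, not the poller's actual configuration:

```python
class CircuitBreaker:
    """Per-device breaker: after `threshold` consecutive failures, delay the
    next poll by base * 2^(failures - threshold), capped at max_delay;
    any success resets the breaker."""
    def __init__(self, threshold: int = 3, base: float = 5.0, max_delay: float = 300.0):
        self.threshold = threshold
        self.base = base
        self.max_delay = max_delay
        self.failures = 0

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1

    def delay(self) -> float:
        """Extra backoff to apply before the next poll attempt."""
        if self.failures < self.threshold:
            return 0.0
        return min(self.base * 2 ** (self.failures - self.threshold), self.max_delay)
```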
**NATS Publisher (`poller/internal/bus/publisher.go`):**
- Purpose: Publish typed device events to JetStream streams
- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent
- Each event includes device_id and tenant_id for multi-tenant routing
- Consumers: Backend subscribers, audit logging, alert evaluation
**Tunnel Manager (`poller/internal/tunnel/manager.go`):**
- Purpose: Manage WinBox TCP tunnels to devices (port-forwarded SOCKS proxies)
- Port pool: Allocate ephemeral local ports for tunnel endpoints
- Pattern: Accept local connections on port, tunnel to device's WinBox port via binary API
**SSH Relay (`poller/internal/sshrelay/server.go`, `session.go`, `bridge.go`):**
- Purpose: SSH terminal access to RouterOS devices for remote management
- Pattern: SSH server on poller, bridges SSH sessions to RouterOS via binary API terminal protocol
- Authentication: SSH key or password relay from frontend
**FastAPI Router Pattern (`backend/app/routers/`):**
- Files: `devices.py`, `auth.py`, `alerts.py`, `config_editor.py`, `templates.py`, `metrics.py`, etc.
- Pattern: APIRouter with Depends() for RBAC, tenant context, rate limiting
- All routes tenant-scoped under `/api/tenants/{tenant_id}/...`
- RLS enforcement: Automatic via `SET LOCAL app.current_tenant` in `get_current_user` middleware
**Async Service Layer (`backend/app/services/`):**
- Purpose: Business logic, database operations, integration with external systems
- Files: `device.py`, `auth.py`, `backup_service.py`, `ca_service.py`, `alert_evaluator.py`, etc.
- Pattern: Async functions using AsyncSession, composable for multiple operations in single transaction
- NATS Integration: Subscribers consume events, services update database accordingly
**NATS Subscribers (`backend/app/services/*_subscriber.py`):**
- Purpose: Consume events from NATS JetStream, update application state
- Lifecycle: Started/stopped in FastAPI lifespan context manager
- Examples: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware update tracking)
- Pattern: JetStream consumer with durable name, explicit message acking for reliability
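The ack-after-persist pattern is the core of these subscribers. A sketch of a handler body, with the database write stubbed out so the ordering is visible (the real handlers and their field names live in the `*_subscriber.py` modules):

```python
import asyncio
import json

async def handle_status(msg, update_db) -> None:
    """Ack only after the database update succeeds; an unacked message is
    redelivered to the durable consumer, giving at-least-once processing."""
    event = json.loads(msg.data)
    await update_db(event["device_id"], event["tenant_id"], event)
    await msg.ack()  # explicit ack, never before the write

# Wiring sketch (not run here): with nats-py this handler would hang off a
# durable push consumer, roughly js.subscribe("device.status.>",
# durable="backend-status", cb=...), with the durable name preserving the
# consumer's position across backend restarts.
```

Because of at-least-once delivery, the `update_db` side should be idempotent (e.g., an upsert keyed by device_id).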
**Frontend Router (`frontend/src/routes/`):**
- Pattern: TanStack Router file-based routing
- Structure: `_authenticated.tsx` (layout for logged-in users), `_authenticated/tenants/$tenantId/devices/...` (device management)
- Entry: `__root.tsx` (QueryClientProvider setup), `_authenticated.tsx` (auth check + layout)
**Frontend Event Stream Hook (`frontend/src/hooks/useEventStream.ts`):**
- Purpose: Manage SSE connection lifecycle, handle reconnection, parse event payloads
- Pattern: useRef for connection state, setInterval for token refresh, EventSource API
- Callbacks: Per-event-type handlers registered by components
- State: Managed in EventStreamContext for app-wide access
## Entry Points
**Poller Binary (`poller/cmd/poller/main.go`):**
- Location: `poller/cmd/poller/main.go`
- Triggers: Docker container start, Kubernetes pod initialization
- Responsibilities: Load config, initialize NATS/Redis/PostgreSQL connections, start scheduler, setup observability (Prometheus metrics, structured logging)
- Config source: Environment variables (see `poller/internal/config/config.go`)
**Backend API (`backend/app/main.py`):**
- Location: `backend/app/main.py`
- Triggers: Docker container start, uvicorn ASGI server
- Responsibilities: Configure logging, run migrations, bootstrap first admin, start NATS subscribers, setup middleware, register routers
- Lifespan: Async context manager handles startup/shutdown of services
- Health check: `/api/health` endpoint, `/api/readiness` for k8s
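The lifespan ordering can be sketched with a bare `asynccontextmanager`; the real manager in `backend/app/main.py` also runs migrations and bootstraps the first admin before the subscribers start, which this stub elides:

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def lifespan(app, log):
    """Everything before `yield` runs at startup, everything after at
    shutdown -- subscribers are stopped even if serving raised."""
    log.append("start subscribers")
    try:
        yield
    finally:
        log.append("stop subscribers")
```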
**Frontend Entry (`frontend/src/routes/__root.tsx`):**
- Location: `frontend/src/routes/__root.tsx`
- Triggers: Browser loads app at `/`
- Responsibilities: Wrap app in QueryClientProvider (TanStack Query), setup root error boundary
- Auth flow: Routes under `_authenticated` check JWT token, redirect to login if missing
- Real-time setup: Establish SSE connection via `useEventStream` hook in layout
## Error Handling
**Strategy:** Three-tier error handling across services
**Patterns:**
- **Poller**: Circuit breaker exponential backoff for device connection failures. Logs all errors to structured JSON with context (device_id, tenant_id, attempt number). Publishes failure events to NATS for alerting.
- **Backend**: FastAPI exception handlers convert service errors to HTTP responses. RLS violations return 403 Forbidden. Invalid tenant access returns 404. Database errors logged via structlog with request_id middleware for correlation.
- **Frontend**: TanStack Query retry logic (1 retry by default), error boundaries catch component crashes, toast notifications display user-friendly error messages, and request-ID middleware propagates correlation IDs.
## Cross-Cutting Concerns
**Logging:**
- Poller: `log/slog` with JSON handler, structured fields (service, device_id, tenant_id, operation)
- Backend: `structlog` with async logger, JSON output in production
- Frontend: Browser console + error tracking (if configured)
**Validation:**
- Backend: Pydantic models (`app/schemas/`) enforce request shape and types, custom validators for business logic (e.g., SRP challenge validation)
- Frontend: TanStack Form for client-side validation before submission
- Database: PostgreSQL CHECK constraints and unique indexes
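To show the shape of that first validation tier without pulling in Pydantic, here is a stdlib stand-in for a schema like the ones in `app/schemas/`; the field names and rules are illustrative, not the project's actual schema:

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class DeviceCreate:
    """Stdlib sketch of a request schema: reject bad input before it
    reaches the service layer, as the Pydantic models do."""
    name: str
    host: str
    port: int = 8729

    def __post_init__(self):
        if not self.name.strip():
            raise ValueError("name must be non-empty")
        ipaddress.ip_address(self.host)  # raises ValueError for non-IP hosts
        if not 1 <= self.port <= 65535:
            raise ValueError("port out of range")
```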
**Authentication:**
- Zero-knowledge SRP-6a for initial password enrollment (client never sends plaintext)
- JWT tokens issued after SRP enrollment, stored as httpOnly cookies
- Optional API keys with scoped access for programmatic use
- SSE token exchange for event stream access (short-lived, single-use)
**Authorization (RBAC):**
- Four roles: super_admin (all access), tenant_admin (full tenant access), operator (read+config), viewer (read-only)
- Role hierarchy enforced by `require_role()` dependency in routers
- API key scopes: subset of operator permissions (read, write_device, write_config, etc.)
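The `require_role()` hierarchy check reduces to a rank comparison. A minimal sketch (the real dependency raises an HTTP 403 via FastAPI rather than `PermissionError`):

```python
# Role hierarchy from above; a higher rank implies all lower privileges.
RANK = {"viewer": 0, "operator": 1, "tenant_admin": 2, "super_admin": 3}

def require_role(minimum: str):
    """Factory used as a router dependency: returns a checker that
    rejects any role ranked below `minimum`."""
    def check(user_role: str) -> str:
        if RANK[user_role] < RANK[minimum]:
            raise PermissionError(f"{user_role} lacks {minimum} access")
        return user_role
    return check
```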
**Rate Limiting:**
- Backend: Token bucket limiter on sensitive endpoints (login, token generation, device operations)
- Configuration: `app/middleware/rate_limit.py` defines limits per endpoint
- Redis-backed for distributed rate limit state
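The token bucket itself is a few lines of arithmetic; a local sketch with an injected clock (the production limiter keeps `tokens`/`last` in Redis so all backend replicas share one bucket per key):

```python
class TokenBucket:
    """capacity tokens, refilled at `rate` tokens per second; each allowed
    request spends one token."""
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```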
**Multi-Tenancy:**
- Database RLS: All tables have `tenant_id`, policy enforces current_tenant filter
- Tenant context: Middleware extracts from JWT, sets `app.current_tenant` local variable
- Superadmin bypass: Can re-set tenant context to access any tenant
- Admin operations: Use superuser connection, explicit tenant routing
---
*Architecture analysis: 2026-03-12*