Architecture

Analysis Date: 2026-03-12

Pattern Overview

Overall: Event-driven microservice architecture with asynchronous pub/sub messaging

Key Characteristics:

  • Three independent microservices: Go Poller, Python FastAPI Backend, React/TypeScript Frontend
  • NATS JetStream as central event bus for all inter-service communication
  • PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation at database layer
  • Real-time Server-Sent Events (SSE) for frontend event streaming
  • Distributed task coordination using Redis distributed locks
  • Per-tenant encryption via OpenBao Transit KMS engine

Layers

Device Polling Layer (Go Poller):

  • Purpose: Connects to RouterOS devices via binary API (port 8729), detects status/version, collects metrics, pushes configs, manages WinBox/SSH tunnels
  • Location: poller/
  • Contains: Device client, scheduler, SSH relay, WinBox tunnel manager, NATS publisher, Redis credential cache, OpenBao vault client
  • Depends on: NATS JetStream, Redis, PostgreSQL (read-only for device list), OpenBao
  • Used by: Backend services, which consume the events it publishes via NATS

Event Bus Layer (NATS JetStream):

  • Purpose: Central publish/subscribe message broker for all service-to-service communication
  • Streams: DEVICE_EVENTS, OPERATION_EVENTS, ALERT_EVENTS
  • Contains: Device status changes, metrics, config change notifications, push rollback triggers, alert events, session audit events
  • All events include device_id and tenant_id for multi-tenant routing

Backend API Layer (Python FastAPI):

  • Purpose: RESTful API, business logic, database persistence, event subscription and processing
  • Location: backend/app/
  • Contains: FastAPI routers, SQLAlchemy ORM models, async services, NATS subscribers, middleware (RBAC, tenant context, rate limiting)
  • Depends on: PostgreSQL (via RLS-enforced app_user connection), NATS JetStream, Redis, OpenBao, email/webhook services
  • Used by: Frontend (REST API), poller (reads device list, writes operation results)

Data Persistence Layer (PostgreSQL + TimescaleDB):

  • Purpose: Multi-tenant relational data store with RLS-enforced isolation
  • Connection: Two engines in backend/app/database.py
    • Admin engine (superuser): Migrations, bootstrap, admin operations
    • App engine (app_user role): All tenant-scoped API requests, RLS enforced
  • Row-Level Security: SET LOCAL app.current_tenant set per-request by get_current_user dependency
  • Contains: Devices, users, tenants, alerts, config backups, templates, VPN peers, certificates, audit logs, metrics aggregates

Caching/Locking Layer (Redis):

  • Purpose: Distributed locks (poller prevents duplicate device polls), session management, temporary data
  • Usage: redislock package in poller for per-device poll coordination across replicas

Secret Management Layer (OpenBao):

  • Purpose: Transit KMS for per-tenant envelope encryption, credential storage access control
  • Mode: Transit secret engine wrapping credentials for envelope encryption
  • Accessed by: Poller (fetch decrypted credentials), backend (re-encrypt on password change)
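A Transit encrypt call can be sketched as below. The path shape and base64-encoded plaintext follow the standard Vault/OpenBao Transit API; the per-tenant key name `tenant-{id}` and the `transit` mount path are assumptions for illustration — the actual key-naming convention may differ:

```python
import base64
import json

def transit_encrypt_request(tenant_id: str, secret: bytes,
                            mount: str = "transit") -> tuple[str, str]:
    """Build the path and JSON body for an OpenBao Transit encrypt call.
    Transit requires the plaintext base64-encoded; using one named key
    per tenant keeps ciphertexts cryptographically isolated between
    tenants (key name 'tenant-{id}' is assumed here)."""
    path = f"/v1/{mount}/encrypt/tenant-{tenant_id}"
    body = json.dumps({"plaintext": base64.b64encode(secret).decode()})
    return path, body
```

The response's `ciphertext` field is what gets stored; decryption is the mirror-image call against `/v1/{mount}/decrypt/{key}`.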

Frontend Layer (React 19 + TanStack):

  • Purpose: Web UI for fleet management, device control, configuration, monitoring
  • Location: frontend/src/
  • Contains: TanStack Router, TanStack Query, Tailwind CSS, SSE event stream integration, WebSocket tunnels
  • Depends on: Backend REST API, Server-Sent Events for real-time updates, WebSocket for terminal/remote access
  • Entry point: frontend/src/routes/__root.tsx (QueryClientProvider, root layout)

Data Flow

Device Status Polling (Poller → NATS → Backend):

  1. Poller scheduler periodically fetches device list from PostgreSQL
  2. For each device, poller's Worker connects to RouterOS binary API (port 8729 TLS)
  3. Worker collects device status (online/offline), version, system metrics
  4. Worker publishes DeviceStatusEvent to NATS stream DEVICE_EVENTS topic device.status.{device_id}
  5. Backend subscribes to device.status.> via nats_subscriber.py
  6. Subscriber updates device record in PostgreSQL via admin session (bypasses RLS)
  7. Frontend receives update via SSE subscription to /api/sse?topics=device_status
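The event published in step 4 can be sketched as follows. The subject shape `device.status.{device_id}` and the `device_id`/`tenant_id` fields come from this document; the remaining payload fields and JSON-on-the-wire encoding are assumptions for illustration:

```python
import json
import time

def device_status_event(device_id: str, tenant_id: str,
                        status: str) -> tuple[str, bytes]:
    """Build the NATS subject and payload for a DeviceStatusEvent.
    Every event carries device_id and tenant_id so subscribers can
    route per tenant."""
    subject = f"device.status.{device_id}"
    payload = {
        "device_id": device_id,
        "tenant_id": tenant_id,
        "status": status,          # e.g. "online" / "offline"
        "ts": int(time.time()),    # hypothetical timestamp field
    }
    return subject, json.dumps(payload).encode()
```

Because the device ID is the final subject token, the backend's single wildcard subscription `device.status.>` receives every device's status events without per-device subscriptions.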

Configuration Push (Frontend → Backend → Poller → Router):

  1. Frontend calls POST /api/tenants/{tenant_id}/devices/{device_id}/config with new configuration
  2. Backend stores config in PostgreSQL, publishes ConfigPushEvent to OPERATION_EVENTS
  3. Poller subscribes to push operation events, receives config delta
  4. Poller connects to device via binary API, executes RouterOS commands (two-phase: backup, apply, verify)
  5. On completion, poller publishes ConfigPushCompletedEvent to NATS
  6. Backend subscriber updates operation record with success/failure
  7. Frontend notifies user via SSE

Metrics Collection (Poller → NATS → Backend → Frontend):

  1. Poller collects health metrics (CPU, memory, disk), interface stats, wireless stats per poll cycle
  2. Publishes DeviceMetricsEvent to DEVICE_EVENTS topic device.metrics.{type}.{device_id}
  3. Backend metrics_subscriber.py aggregates into TimescaleDB hypertables
  4. Frontend queries /api/tenants/{tenant_id}/devices/{device_id}/metrics for graphs
  5. Alternatively, frontend SSE stream pushes metric updates for real-time graphs

Real-Time Event Streaming (Backend → Frontend via SSE):

  1. Frontend calls POST /api/auth/sse-token to exchange session cookie for short-lived SSE bearer token
  2. Token valid for 25 seconds; the frontend refreshes it on a timer shortly before each expiry
  3. Frontend opens EventSource to /api/sse?topics=device_status,alert_fired,config_push,firmware_progress,metric_update
  4. Backend maintains SSE connections, pushes events from NATS subscribers
  5. Reconnection on disconnect with exponential backoff (1s → 30s max)

Multi-Tenant Isolation (Request → Middleware → RLS):

  1. Frontend sends JWT token in Authorization header or httpOnly cookie
  2. Backend tenant_context.py middleware extracts user from JWT, determines tenant_id
  3. Middleware calls SET LOCAL app.current_tenant = '{tenant_id}' on the database session
  4. All subsequent queries automatically filtered by RLS policy (tenant_id = current_setting('app.current_tenant'))
  5. Superadmin can re-set tenant context to access any tenant
  6. Admin sessions (migrations, NATS subscribers) use superuser connection, handle tenant routing explicitly
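Step 3 can be sketched as a small helper. The `SET LOCAL app.current_tenant` statement is from this document; validating the tenant ID as a UUID before inlining it is an assumption added here for safety, since `SET` is a PostgreSQL utility statement and does not accept bind parameters (the parameterized alternative is the `set_config()` function):

```python
import uuid

def set_tenant_sql(tenant_id: str) -> str:
    """Build the per-request statement that scopes RLS to one tenant.
    uuid.UUID() raises ValueError on anything that is not a UUID, so the
    value is safe to inline into the statement text.
    Hypothetical usage inside the dependency:
        await session.execute(text(set_tenant_sql(tenant_id)))
    SET LOCAL confines the setting to the current transaction, so the
    tenant context cannot leak across pooled connections."""
    return f"SET LOCAL app.current_tenant = '{uuid.UUID(tenant_id)}'"
```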

State Management:

  • Frontend: TanStack Query for server state (device list, metrics, config), React Context for session/auth state
  • Backend: Async SQLAlchemy ORM with automatic transaction management per request
  • Poller: In-memory device state map with per-device circuit breaker tracking failures and backoff
  • Shared: Redis for distributed locks, NATS for event persistence (JetStream replays)

Key Abstractions

Device Client (poller/internal/device/):

  • Purpose: Binary API communication with RouterOS devices
  • Files: client.go, version.go, health.go, interfaces.go, wireless.go, firmware.go, cert_deploy.go, sftp.go
  • Pattern: RouterOS binary API command execution, metric parsing and extraction
  • Usage: Worker polls device state and metrics in parallel goroutines

Scheduler & Worker (poller/internal/poller/scheduler.go, worker.go):

  • Purpose: Orchestrate per-device polling goroutines with circuit breaker resilience
  • Pattern: Per-device goroutine with Redis distributed locking to prevent duplicate polls across replicas
  • Lifecycle: Discovers new devices from the DB and spawns a goroutine per device; when a device is removed, its goroutine is cancelled
  • Circuit Breaker: Exponential backoff after N consecutive failures, resets on success
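The circuit-breaker behavior can be sketched as follows. This is a Python illustration of the pattern, not the poller's Go code; the 5-second base delay and 300-second cap are assumed values, and here the backoff starts on the first failure rather than after a threshold of N:

```python
class CircuitBreaker:
    """Per-device failure tracking with exponential backoff: each
    consecutive failure doubles the wait before the next poll (up to a
    cap), and a single success resets the breaker."""
    def __init__(self, base_s: float = 5.0, cap_s: float = 300.0):
        self.base_s = base_s
        self.cap_s = cap_s
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        self.failures = 0

    def backoff_s(self) -> float:
        """Delay before the next attempt: base * 2^(failures-1), capped."""
        if self.failures == 0:
            return 0.0
        return min(self.base_s * 2 ** (self.failures - 1), self.cap_s)
```

The cap matters operationally: an unreachable device settles at one attempt every five minutes instead of doubling forever, and comes back to normal cadence on its first successful poll.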

NATS Publisher (poller/internal/bus/publisher.go):

  • Purpose: Publish typed device events to JetStream streams
  • Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent
  • Each event includes device_id and tenant_id for multi-tenant routing
  • Consumers: Backend subscribers, audit logging, alert evaluation

Tunnel Manager (poller/internal/tunnel/manager.go):

  • Purpose: Manage WinBox TCP tunnels to devices (port-forwarded SOCKS proxies)
  • Port pool: Allocate ephemeral local ports for tunnel endpoints
  • Pattern: Accept local connections on port, tunnel to device's WinBox port via binary API

SSH Relay (poller/internal/sshrelay/server.go, session.go, bridge.go):

  • Purpose: SSH terminal access to RouterOS devices for remote management
  • Pattern: SSH server on poller, bridges SSH sessions to RouterOS via binary API terminal protocol
  • Authentication: SSH key or password relay from frontend

FastAPI Router Pattern (backend/app/routers/):

  • Files: devices.py, auth.py, alerts.py, config_editor.py, templates.py, metrics.py, etc.
  • Pattern: APIRouter with Depends() for RBAC, tenant context, rate limiting
  • All routes tenant-scoped under /api/tenants/{tenant_id}/...
  • RLS enforcement: Automatic via SET LOCAL app.current_tenant in get_current_user middleware

Async Service Layer (backend/app/services/):

  • Purpose: Business logic, database operations, integration with external systems
  • Files: device.py, auth.py, backup_service.py, ca_service.py, alert_evaluator.py, etc.
  • Pattern: Async functions using AsyncSession, composable for multiple operations in single transaction
  • NATS Integration: Subscribers consume events, services update database accordingly

NATS Subscribers (backend/app/services/*_subscriber.py):

  • Purpose: Consume events from NATS JetStream, update application state
  • Lifecycle: Started/stopped in FastAPI lifespan context manager
  • Examples: nats_subscriber.py (device status), metrics_subscriber.py (metrics aggregation), firmware_subscriber.py (firmware update tracking)
  • Pattern: JetStream consumer with durable name, explicit message acking for reliability
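The explicit-ack pattern can be sketched by isolating the disposition decision from the transport. In nats-py the handler would call `msg.ack()`, `msg.nak()`, or `msg.term()`; this hedged sketch returns the decision as a string instead, and the `known_tenants` check is a hypothetical example of a transient condition:

```python
import json

def handle_status_message(raw: bytes, known_tenants: set[str]) -> str:
    """Decide how a JetStream consumer should dispose of one message:
    'ack'  - processed, remove from the stream's pending set
    'term' - malformed payload; redelivery cannot help (poison message)
    'nak'  - transient problem; JetStream should redeliver later"""
    try:
        event = json.loads(raw)
        device_id = event["device_id"]
        tenant_id = event["tenant_id"]
    except (ValueError, KeyError):
        return "term"
    if tenant_id not in known_tenants:
        return "nak"        # e.g. tenant row not visible to us yet
    # ... update the device record for device_id here, then acknowledge
    return "ack"
```

Pairing a durable consumer name with explicit acks is what makes the subscriber restart-safe: unacked messages are redelivered to the next instance that binds the same durable.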

Frontend Router (frontend/src/routes/):

  • Pattern: TanStack Router file-based routing
  • Structure: _authenticated.tsx (layout for logged-in users), _authenticated/tenants/$tenantId/devices/... (device management)
  • Entry: __root.tsx (QueryClientProvider setup), _authenticated.tsx (auth check + layout)

Frontend Event Stream Hook (frontend/src/hooks/useEventStream.ts):

  • Purpose: Manage SSE connection lifecycle, handle reconnection, parse event payloads
  • Pattern: useRef for connection state, setInterval for token refresh, EventSource API
  • Callbacks: Per-event-type handlers registered by components
  • State: Managed in EventStreamContext for app-wide access

Entry Points

Poller Binary (poller/cmd/poller/main.go):

  • Location: poller/cmd/poller/main.go
  • Triggers: Docker container start, Kubernetes pod initialization
  • Responsibilities: Load config, initialize NATS/Redis/PostgreSQL connections, start scheduler, setup observability (Prometheus metrics, structured logging)
  • Config source: Environment variables (see poller/internal/config/config.go)

Backend API (backend/app/main.py):

  • Location: backend/app/main.py
  • Triggers: Docker container start, uvicorn ASGI server
  • Responsibilities: Configure logging, run migrations, bootstrap first admin, start NATS subscribers, setup middleware, register routers
  • Lifespan: Async context manager handles startup/shutdown of services
  • Health check: /api/health endpoint, /api/readiness for k8s

Frontend Entry (frontend/src/routes/__root.tsx):

  • Location: frontend/src/routes/__root.tsx
  • Triggers: Browser loads app at /
  • Responsibilities: Wrap app in QueryClientProvider (TanStack Query), setup root error boundary
  • Auth flow: Routes under _authenticated check JWT token, redirect to login if missing
  • Real-time setup: Establish SSE connection via useEventStream hook in layout

Error Handling

Strategy: Three-tier error handling across services

Patterns:

  • Poller: Circuit breaker exponential backoff for device connection failures. Logs all errors to structured JSON with context (device_id, tenant_id, attempt number). Publishes failure events to NATS for alerting.

  • Backend: FastAPI exception handlers convert service errors to HTTP responses. RLS violations return 403 Forbidden. Invalid tenant access returns 404. Database errors logged via structlog with request_id middleware for correlation.

  • Frontend: TanStack Query retry logic (1 retry by default), error boundaries catch component crashes, toast notifications display user-friendly error messages, RequestID middleware propagates correlation IDs

Cross-Cutting Concerns

Logging:

  • Poller: log/slog with JSON handler, structured fields (service, device_id, tenant_id, operation)
  • Backend: structlog with async logger, JSON output in production
  • Frontend: Browser console + error tracking (if configured)

Validation:

  • Backend: Pydantic models (app/schemas/) enforce request shape and types, custom validators for business logic (e.g., SRP challenge validation)
  • Frontend: TanStack Form for client-side validation before submission
  • Database: PostgreSQL CHECK constraints and unique indexes

Authentication:

  • Zero-knowledge SRP-6a for initial password enrollment (client never sends plaintext)
  • JWT tokens issued after SRP enrollment, stored as httpOnly cookies
  • Optional API keys with scoped access for programmatic use
  • SSE token exchange for event stream access (short-lived, single-use)

Authorization (RBAC):

  • Four roles: super_admin (all access), tenant_admin (full tenant access), operator (read+config), viewer (read-only)
  • Role hierarchy enforced by require_role() dependency in routers
  • API key scopes: subset of operator permissions (read, write_device, write_config, etc.)
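The hierarchy check behind a `require_role()` dependency can be sketched as follows; the four role names are from this document, while the dependency-factory shape and `PermissionError` are illustrative stand-ins for the actual FastAPI wiring:

```python
# Ordered weakest to strongest; a stronger role implies all weaker ones.
ROLE_ORDER = ["viewer", "operator", "tenant_admin", "super_admin"]

def has_role(user_role: str, required: str) -> bool:
    """True when the user's role ranks at or above the required role."""
    return ROLE_ORDER.index(user_role) >= ROLE_ORDER.index(required)

def require_role(required: str):
    """Sketch of a dependency factory: the returned checker is what a
    router would wire in via Depends(); it raises when the caller's
    role is insufficient."""
    def checker(user_role: str) -> None:
        if not has_role(user_role, required):
            raise PermissionError(f"requires role {required} or above")
    return checker
```

Encoding the hierarchy as an ordered list keeps the check to one comparison, so adding a role is a one-line change rather than a matrix update.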

Rate Limiting:

  • Backend: Token bucket limiter on sensitive endpoints (login, token generation, device operations)
  • Configuration: app/middleware/rate_limit.py defines limits per endpoint
  • Redis-backed for distributed rate limit state
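The token-bucket refill math can be sketched in-memory; per this document the backend keeps the bucket state in Redis so all replicas share one budget, but the accounting is the same. The capacity and refill rate below are example values, and the injectable clock exists only to make the sketch deterministic:

```python
import time

class TokenBucket:
    """In-memory token bucket (sketch): tokens refill continuously at a
    fixed rate up to a capacity, and each request spends one token."""
    def __init__(self, capacity: int, refill_per_s: float,
                 now=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)   # start full
        self.now = now
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_per_s)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Unlike a fixed window, the bucket tolerates short bursts up to `capacity` while holding the long-run rate to `refill_per_s`, which suits endpoints like login where a few quick retries are legitimate.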

Multi-Tenancy:

  • Database RLS: All tables have tenant_id, policy enforces current_tenant filter
  • Tenant context: Middleware extracts from JWT, sets app.current_tenant local variable
  • Superadmin bypass: Can re-set tenant context to access any tenant
  • Admin operations: Use superuser connection, explicit tenant routing
