diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md new file mode 100644 index 0000000..385dd09 --- /dev/null +++ b/.planning/codebase/ARCHITECTURE.md @@ -0,0 +1,246 @@ +# Architecture + +**Analysis Date:** 2026-03-12 + +## Pattern Overview + +**Overall:** Event-driven microservice architecture with asynchronous pub/sub messaging + +**Key Characteristics:** +- Three independent microservices: Go Poller, Python FastAPI Backend, React/TypeScript Frontend +- NATS JetStream as central event bus for all inter-service communication +- PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation at database layer +- Real-time Server-Sent Events (SSE) for frontend event streaming +- Distributed task coordination using Redis distributed locks +- Per-tenant encryption via OpenBao Transit KMS engine + +## Layers + +**Device Polling Layer (Go Poller):** +- Purpose: Connects to RouterOS devices via binary API (port 8729), detects status/version, collects metrics, pushes configs, manages WinBox/SSH tunnels +- Location: `poller/` +- Contains: Device client, scheduler, SSH relay, WinBox tunnel manager, NATS publisher, Redis credential cache, OpenBao vault client +- Depends on: NATS JetStream, Redis, PostgreSQL (read-only for device list), OpenBao +- Used by: Publishes events to backend via NATS + +**Event Bus Layer (NATS JetStream):** +- Purpose: Central publish/subscribe message broker for all service-to-service communication +- Streams: DEVICE_EVENTS, OPERATION_EVENTS, ALERT_EVENTS +- Contains: Device status changes, metrics, config change notifications, push rollback triggers, alert events, session audit events +- All events include device_id and tenant_id for multi-tenant routing + +**Backend API Layer (Python FastAPI):** +- Purpose: RESTful API, business logic, database persistence, event subscription and processing +- Location: `backend/app/` +- Contains: FastAPI routers, SQLAlchemy ORM models, async services, NATS subscribers, 
middleware (RBAC, tenant context, rate limiting) +- Depends on: PostgreSQL (via RLS-enforced app_user connection), NATS JetStream, Redis, OpenBao, email/webhook services +- Used by: Frontend (REST API), poller (reads device list, writes operation results) + +**Data Persistence Layer (PostgreSQL + TimescaleDB):** +- Purpose: Multi-tenant relational data store with RLS-enforced isolation +- Connection: Two engines in `backend/app/database.py` + - Admin engine (superuser): Migrations, bootstrap, admin operations + - App engine (app_user role): All tenant-scoped API requests, RLS enforced +- Row-Level Security: `SET LOCAL app.current_tenant` set per-request by `get_current_user` dependency +- Contains: Devices, users, tenants, alerts, config backups, templates, VPN peers, certificates, audit logs, metrics aggregates + +**Caching/Locking Layer (Redis):** +- Purpose: Distributed locks (poller prevents duplicate device polls), session management, temporary data +- Usage: `redislock` package in poller for per-device poll coordination across replicas + +**Secret Management Layer (OpenBao):** +- Purpose: Transit KMS for per-tenant envelope encryption, credential storage access control +- Mode: Transit secret engine wrapping credentials for envelope encryption +- Accessed by: Poller (fetch decrypted credentials), backend (re-encrypt on password change) + +**Frontend Layer (React 19 + TanStack):** +- Purpose: Web UI for fleet management, device control, configuration, monitoring +- Location: `frontend/src/` +- Contains: TanStack Router, TanStack Query, Tailwind CSS, SSE event stream integration, WebSocket tunnels +- Depends on: Backend REST API, Server-Sent Events for real-time updates, WebSocket for terminal/remote access +- Entry point: `frontend/src/routes/__root.tsx` (QueryClientProvider, root layout) + +## Data Flow + +**Device Status Polling (Poller → NATS → Backend):** + +1. Poller scheduler periodically fetches device list from PostgreSQL +2. 
For each device, poller's `Worker` connects to RouterOS binary API (port 8729 TLS) +3. Worker collects device status (online/offline), version, system metrics +4. Worker publishes `DeviceStatusEvent` to NATS stream `DEVICE_EVENTS` topic `device.status.{device_id}` +5. Backend subscribes to `device.status.>` via `nats_subscriber.py` +6. Subscriber updates device record in PostgreSQL via admin session (bypasses RLS) +7. Frontend receives update via SSE subscription to `/api/sse?topics=device_status` + +**Configuration Push (Frontend → Backend → Poller → Router):** + +1. Frontend calls `POST /api/tenants/{tenant_id}/devices/{device_id}/config` with new configuration +2. Backend stores config in PostgreSQL, publishes `ConfigPushEvent` to `OPERATION_EVENTS` +3. Poller subscribes to push operation events, receives config delta +4. Poller connects to device via binary API, executes RouterOS commands (two-phase: backup, apply, verify) +5. On completion, poller publishes `ConfigPushCompletedEvent` to NATS +6. Backend subscriber updates operation record with success/failure +7. Frontend notifies user via SSE + +**Metrics Collection (Poller → NATS → Backend → Frontend):** + +1. Poller collects health metrics (CPU, memory, disk), interface stats, wireless stats per poll cycle +2. Publishes `DeviceMetricsEvent` to `DEVICE_EVENTS` topic `device.metrics.{type}.{device_id}` +3. Backend `metrics_subscriber.py` aggregates into TimescaleDB hypertables +4. Frontend queries `/api/tenants/{tenant_id}/devices/{device_id}/metrics` for graphs +5. Alternatively, frontend SSE stream pushes metric updates for real-time graphs + +**Real-Time Event Streaming (Backend → Frontend via SSE):** + +1. Frontend calls `POST /api/auth/sse-token` to exchange session cookie for short-lived SSE bearer token +2. Token is short-lived and refreshed every 25 seconds, before it expires (its lifetime must therefore exceed the 25-second refresh interval) +3. Frontend opens EventSource to `/api/sse?topics=device_status,alert_fired,config_push,firmware_progress,metric_update` +4. 
Backend maintains SSE connections, pushes events from NATS subscribers +5. Reconnection on disconnect with exponential backoff (1s → 30s max) + +**Multi-Tenant Isolation (Request → Middleware → RLS):** + +1. Frontend sends JWT token in Authorization header or httpOnly cookie +2. Backend `tenant_context.py` middleware extracts user from JWT, determines tenant_id +3. Middleware calls `SET LOCAL app.current_tenant = '{tenant_id}'` on the database session +4. All subsequent queries automatically filtered by RLS policy `(tenant_id = current_setting('app.current_tenant'))` +5. Superadmin can re-set tenant context to access any tenant +6. Admin sessions (migrations, NATS subscribers) use superuser connection, handle tenant routing explicitly + +**State Management:** + +- Frontend: TanStack Query for server state (device list, metrics, config), React Context for session/auth state +- Backend: Async SQLAlchemy ORM with automatic transaction management per request +- Poller: In-memory device state map with per-device circuit breaker tracking failures and backoff +- Shared: Redis for distributed locks, NATS for event persistence (JetStream replays) + +## Key Abstractions + +**Device Client (`poller/internal/device/`):** +- Purpose: Binary API communication with RouterOS devices +- Files: `client.go`, `version.go`, `health.go`, `interfaces.go`, `wireless.go`, `firmware.go`, `cert_deploy.go`, `sftp.go` +- Pattern: RouterOS binary API command execution, metric parsing and extraction +- Usage: Worker polls device state and metrics in parallel goroutines + +**Scheduler & Worker (`poller/internal/poller/scheduler.go`, `worker.go`):** +- Purpose: Orchestrate per-device polling goroutines with circuit breaker resilience +- Pattern: Per-device goroutine with Redis distributed locking to prevent duplicate polls across replicas +- Lifecycle: Discover new devices from DB, create goroutine; remove devices, cancel goroutine +- Circuit Breaker: Exponential backoff after N consecutive 
failures, resets on success + +**NATS Publisher (`poller/internal/bus/publisher.go`):** +- Purpose: Publish typed device events to JetStream streams +- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent +- Each event includes device_id and tenant_id for multi-tenant routing +- Consumers: Backend subscribers, audit logging, alert evaluation + +**Tunnel Manager (`poller/internal/tunnel/manager.go`):** +- Purpose: Manage WinBox TCP tunnels to devices (port-forwarded SOCKS proxies) +- Port pool: Allocate ephemeral local ports for tunnel endpoints +- Pattern: Accept local connections on port, tunnel to device's WinBox port via binary API + +**SSH Relay (`poller/internal/sshrelay/server.go`, `session.go`, `bridge.go`):** +- Purpose: SSH terminal access to RouterOS devices for remote management +- Pattern: SSH server on poller, bridges SSH sessions to RouterOS via binary API terminal protocol +- Authentication: SSH key or password relay from frontend + +**FastAPI Router Pattern (`backend/app/routers/`):** +- Files: `devices.py`, `auth.py`, `alerts.py`, `config_editor.py`, `templates.py`, `metrics.py`, etc. +- Pattern: APIRouter with Depends() for RBAC, tenant context, rate limiting +- All routes tenant-scoped under `/api/tenants/{tenant_id}/...` +- RLS enforcement: Automatic via `SET LOCAL app.current_tenant` in `get_current_user` middleware + +**Async Service Layer (`backend/app/services/`):** +- Purpose: Business logic, database operations, integration with external systems +- Files: `device.py`, `auth.py`, `backup_service.py`, `ca_service.py`, `alert_evaluator.py`, etc. 
+- Pattern: Async functions using AsyncSession, composable for multiple operations in single transaction +- NATS Integration: Subscribers consume events, services update database accordingly + +**NATS Subscribers (`backend/app/services/*_subscriber.py`):** +- Purpose: Consume events from NATS JetStream, update application state +- Lifecycle: Started/stopped in FastAPI lifespan context manager +- Examples: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware update tracking) +- Pattern: JetStream consumer with durable name, explicit message acking for reliability + +**Frontend Router (`frontend/src/routes/`):** +- Pattern: TanStack Router file-based routing +- Structure: `_authenticated.tsx` (layout for logged-in users), `_authenticated/tenants/$tenantId/devices/...` (device management) +- Entry: `__root.tsx` (QueryClientProvider setup), `_authenticated.tsx` (auth check + layout) + +**Frontend Event Stream Hook (`frontend/src/hooks/useEventStream.ts`):** +- Purpose: Manage SSE connection lifecycle, handle reconnection, parse event payloads +- Pattern: useRef for connection state, setInterval for token refresh, EventSource API +- Callbacks: Per-event-type handlers registered by components +- State: Managed in EventStreamContext for app-wide access + +## Entry Points + +**Poller Binary (`poller/cmd/poller/main.go`):** +- Location: `poller/cmd/poller/main.go` +- Triggers: Docker container start, Kubernetes pod initialization +- Responsibilities: Load config, initialize NATS/Redis/PostgreSQL connections, start scheduler, setup observability (Prometheus metrics, structured logging) +- Config source: Environment variables (see `poller/internal/config/config.go`) + +**Backend API (`backend/app/main.py`):** +- Location: `backend/app/main.py` +- Triggers: Docker container start, uvicorn ASGI server +- Responsibilities: Configure logging, run migrations, bootstrap first admin, start NATS subscribers, setup 
middleware, register routers +- Lifespan: Async context manager handles startup/shutdown of services +- Health check: `/api/health` endpoint, `/api/readiness` for k8s + +**Frontend Entry (`frontend/src/routes/__root.tsx`):** +- Location: `frontend/src/routes/__root.tsx` +- Triggers: Browser loads app at `/` +- Responsibilities: Wrap app in QueryClientProvider (TanStack Query), setup root error boundary +- Auth flow: Routes under `_authenticated` check JWT token, redirect to login if missing +- Real-time setup: Establish SSE connection via `useEventStream` hook in layout + +## Error Handling + +**Strategy:** Three-tier error handling across services + +**Patterns:** + +- **Poller**: Circuit breaker exponential backoff for device connection failures. Logs all errors to structured JSON with context (device_id, tenant_id, attempt number). Publishes failure events to NATS for alerting. + +- **Backend**: FastAPI exception handlers convert service errors to HTTP responses. RLS violations return 403 Forbidden. Invalid tenant access returns 404. Database errors logged via structlog with request_id middleware for correlation. 
+ +- **Frontend**: TanStack Query retry logic (1 retry by default), error boundaries catch component crashes, toast notifications display user-friendly error messages, RequestID middleware propagates correlation IDs + +## Cross-Cutting Concerns + +**Logging:** +- Poller: `log/slog` with JSON handler, structured fields (service, device_id, tenant_id, operation) +- Backend: `structlog` with async logger, JSON output in production +- Frontend: Browser console + error tracking (if configured) + +**Validation:** +- Backend: Pydantic models (`app/schemas/`) enforce request shape and types, custom validators for business logic (e.g., SRP challenge validation) +- Frontend: TanStack Form for client-side validation before submission +- Database: PostgreSQL CHECK constraints and unique indexes + +**Authentication:** +- Zero-knowledge SRP-6a for initial password enrollment (client never sends plaintext) +- JWT tokens issued after SRP enrollment, stored as httpOnly cookies +- Optional API keys with scoped access for programmatic use +- SSE token exchange for event stream access (short-lived, single-use) + +**Authorization (RBAC):** +- Four roles: super_admin (all access), tenant_admin (full tenant access), operator (read+config), viewer (read-only) +- Role hierarchy enforced by `require_role()` dependency in routers +- API key scopes: subset of operator permissions (read, write_device, write_config, etc.) 
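The role hierarchy described under Authorization (RBAC) can be illustrated with a minimal sketch. It assumes the four roles are strictly ordered (viewer < operator < tenant_admin < super_admin). The `Role` enum, the `PermissionDenied` exception, and this body of `require_role()` are hypothetical and not taken from the backend, where the real dependency is wired through FastAPI's `Depends()`.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles ordered from least to most privileged (the ordering is an assumption)."""
    VIEWER = 1
    OPERATOR = 2
    TENANT_ADMIN = 3
    SUPER_ADMIN = 4


class PermissionDenied(Exception):
    """Raised when the caller's role is below the required minimum (hypothetical)."""


def require_role(minimum: Role):
    """Return a checker that accepts any role at or above `minimum`."""
    def check(current: Role) -> Role:
        if current < minimum:
            raise PermissionDenied(
                f"role {current.name.lower()} is below required {minimum.name.lower()}"
            )
        return current
    return check
```

With this ordering, `require_role(Role.OPERATOR)` accepts a tenant_admin and rejects a viewer. In a router, the rejection would presumably be translated into an HTTP 403 response rather than a plain exception.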
+ +**Rate Limiting:** +- Backend: Token bucket limiter on sensitive endpoints (login, token generation, device operations) +- Configuration: `app/middleware/rate_limit.py` defines limits per endpoint +- Redis-backed for distributed rate limit state + +**Multi-Tenancy:** +- Database RLS: All tables have `tenant_id`, policy enforces current_tenant filter +- Tenant context: Middleware extracts from JWT, sets `app.current_tenant` local variable +- Superadmin bypass: Can re-set tenant context to access any tenant +- Admin operations: Use superuser connection, explicit tenant routing + +--- + +*Architecture analysis: 2026-03-12* diff --git a/.planning/codebase/CONCERNS.md b/.planning/codebase/CONCERNS.md new file mode 100644 index 0000000..eed2d42 --- /dev/null +++ b/.planning/codebase/CONCERNS.md @@ -0,0 +1,211 @@ +# Codebase Concerns + +**Analysis Date:** 2026-03-12 + +## Security Considerations + +**SSH Host Key Verification:** +- Risk: SSH connections skip host key verification using `ssh.InsecureIgnoreHostKey()` +- Files: `poller/internal/sshrelay/server.go:176`, `poller/internal/device/sftp.go:24`, `poller/internal/device/client.go:54-104` +- Current mitigation: RouterOS devices are internal infrastructure; client.go includes fallback strategy with TLS verification as primary mechanism +- Recommendations: Document the security model clearly. For SFTP in particular, consider implementing known_hosts validation or device certificate pinning if devices are externally accessible. Add security audit note to code. + +**TLS Verification Fallback:** +- Risk: When CA-verified TLS fails, automatic fallback to InsecureSkipVerify allows unverified connections (`poller/internal/device/client.go:92-104`) +- Files: `poller/internal/device/client.go` +- Current mitigation: This is intentional for unprovisioned devices; logging is present +- Recommendations: Add metrics to track fallback frequency. Consider implementing a whitelist of devices allowed to use insecure mode. 
Document operator-facing security implications. + +**SSH Session Count Rate Limiting:** +- Risk: No API-side SSH session count check before issuing tokens; limits only enforced at poller/SSH relay level +- Files: `backend/app/routers/remote_access.py:206-211` +- Current mitigation: WebSocket connect enforces tunnel.session limits per-user, per-device, global on relay side +- Recommendations: Add NATS subject exposing SSH session counts to API. Query before token issuance to provide earlier feedback (429 Too Many Requests). This prevents token waste when client will immediately be rate-limited. + +**Token Validation Security:** +- Risk: Single-use tokens stored in Redis with GETDEL; no IP binding or additional entropy validation beyond token string +- Files: `poller/internal/sshrelay/server.go:106-112`, token creation in `backend/app/routers/remote_access.py` +- Current mitigation: Token is single-use (GETDEL atomically retrieves and deletes). Short TTL (120s typical). Source IP validation present but not bound to token. +- Recommendations: Consider adding token IP binding (store expected source IP in payload, validate match). Add jti (JWT ID) tracking for revocation if needed. + +--- + +## Performance Bottlenecks + +**SSH Relay Idle Loop Polling:** +- Problem: Idle session cleanup uses time-based checks in a goroutine loop +- Files: `poller/internal/sshrelay/server.go:72`, session idling logic in `session.go` +- Cause: Periodic checks for idle sessions (LastActive timestamp) +- Improvement path: Consider using context.WithTimeout or timer channels for each session instead of global loop scanning all sessions. + +**Alert Rule Cache Staleness:** +- Problem: Alert rules cached for 60 seconds; maintenance windows for 30 seconds. 
During cache TTL, rule changes don't take effect immediately +- Files: `backend/app/services/alert_evaluator.py:33-40` +- Cause: In-memory cache to reduce DB queries on every metric evaluation (high frequency) +- Improvement path: Publish cache invalidation events to NATS when rules/windows change. Subscribers clear cache immediately rather than waiting for TTL. Current approach acceptable for non-critical alerts but documented assumption needed. + +**Large Router File Handling:** +- Problem: Alert evaluator aggregates metrics from all interfaces/wireless stations; no limits on result set size +- Files: `backend/app/services/alert_evaluator.py:180-212` +- Cause: Loop processes all returned metric rows without pagination or limits +- Improvement path: Add configurable max result limits. For high-interface-count devices (200+ interfaces), consider pre-aggregation or sampling. + +**N+1 Query Avoidance (Addressed):** +- Status: Already acknowledged in code comment at `backend/app/routers/metrics.py:404` +- Current approach: Metrics API uses bulk queries to avoid per-tenant loops +- No action needed + +--- + +## Tech Debt + +**Bandwidth Alerting Not Implemented:** +- Issue: Interface bandwidth alerting (rx_bps/tx_bps) requires computing delta between consecutive poll values +- Files: `backend/app/services/alert_evaluator.py:208-210` +- Impact: Alert rules table supports these metric types but evaluation is skipped; users cannot create rx_bps/tx_bps alerts +- Fix approach: Implement state tracking in Redis. Store previous poll value for each device:interface. On next poll, compute delta and evaluate against alert thresholds. Handle device offline/online transitions to avoid false alerts. 
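The fix approach for bandwidth alerting can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's code: it assumes each poll reports a cumulative byte counter per interface, uses a plain dictionary as a stand-in for Redis, and the names `CounterSample`, `compute_bps`, and `max_gap_s` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CounterSample:
    """One poll of a cumulative interface byte counter (hypothetical shape)."""
    byte_count: int
    timestamp: float  # seconds since epoch


# Stand-in for Redis: previous sample keyed by "device_id:interface".
_previous: Dict[str, CounterSample] = {}


def compute_bps(key: str, current: CounterSample, max_gap_s: float = 300.0) -> Optional[float]:
    """Return bits per second since the previous poll, or None if no valid delta exists."""
    previous = _previous.get(key)
    _previous[key] = current  # always remember the latest sample
    if previous is None:
        return None  # first poll: nothing to compare against
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0 or elapsed > max_gap_s:
        return None  # clock skew, or the device was offline too long to trust the delta
    delta = current.byte_count - previous.byte_count
    if delta < 0:
        return None  # counter reset, e.g. after a reboot
    return delta * 8 / elapsed
```

Returning `None` lets the caller skip rule evaluation for that poll, which is one way to avoid false alerts across offline/online transitions and counter resets.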
+ +**Global Redis/NATS Clients in Routers:** +- Issue: Multiple routers use module-level `global` statements to manage Redis and NATS client references +- Files: `backend/app/routers/auth.py:97`, `backend/app/routers/certificates.py:63`, `backend/app/routers/remote_access.py:50,58`, `backend/app/routers/sse.py:32`, `backend/app/routers/topology.py:50` +- Impact: Makes testing harder, hidden dependencies, potential race conditions on initialization +- Fix approach: Create a dependency injection container or use FastAPI's lifespan context manager (>=0.93) to manage client lifecycle. Pass clients as dependencies to router functions rather than global state. + +**SSH Session Publishing (NATS Wiring):** +- Issue: Code for publishing audit event on session end is present but not wired to NATS +- Files: `docs/superpowers/plans/2026-03-12-remote-access.md:1381` +- Impact: SSH session end events not tracked in audit logs; incomplete audit trail +- Fix approach: Wire the NATS publisher call in remote_access router. Create corresponding NATS subject consumer to record session end events. + +**Bare Exception Handling (Sparse):** +- Status: Codebase mostly avoids bare `except:` blocks; 56 linting suppressions (#pylint, #noqa, #type: ignore) present +- Files: Across backend Python code +- Impact: Controlled suppression use suggests deliberate choices; not a systemic problem +- Recommendation: Continue current practice; document why suppressions are needed when adding new ones. + +--- + +## Fragile Areas + +**SSH Relay Concurrent Session Management:** +- Files: `poller/internal/sshrelay/server.go:40-46` (sessions map), `poller/internal/sshrelay/server.go:114-118` (limit checks) +- Why fragile: Lock held during entire limit check; concurrent requests during peer limit transitions could temporarily exceed limits. Map access requires lock coordination. +- Safe modification: When adding session limits, ensure mutex is held for entire check+add operation. 
Consider using sync.Cond for blocked requests. Write tests for race conditions under high concurrency. +- Test coverage: Lock coverage appears adequate; consider adding stress test with sustained concurrent connect attempts exceeding limits. + +**Tunnel Port Pool Allocation:** +- Files: `poller/internal/tunnel/portpool.go`, `poller/internal/tunnel/manager.go:68-71` +- Why fragile: Port release timing; if tunnel closes between allocation and listener bind, port stays allocated. No automatic reaper. +- Safe modification: Ensure Release() is always called on error paths. Consider adding timeout-based port recovery (if unused for N seconds, auto-reclaim). Write integration test that exercises all error paths. +- Test coverage: portpool_test.go exists; verify boundary conditions (empty pool, full pool, Release before Allocate). + +**Vault Credential Cache Concurrency:** +- Files: `poller/internal/vault/cache.go:162` (timeout context creation) +- Why fragile: Cache uses module-level state; concurrent credential requests during cache miss trigger multiple Transit key operations +- Safe modification: Cache hit must be idempotent. For cache misses, consider request deduplication (one in-flight per device, others wait). Add metrics to track cache hit/miss/error rates. +- Test coverage: Need integration test for concurrent cache misses on same device. + +**Device Store Context Handling:** +- Files: `poller/internal/store/devices.go:77,133` (Query/QueryRow with context) +- Why fragile: If context cancels mid-query, result state is undefined. No timeout enforcement at DB level. +- Safe modification: Always pair Query/QueryRow with a timeout context. Test context cancellation scenarios. Add slog.Error on context timeout vs actual DB error. 
+ +--- + +## Scaling Limits + +**Redis Single Instance (Assumed):** +- Current capacity: Limited by single Redis instance throughput +- Limit: Under high device poll rates (1000+ devices, 10s polls), Redis lock contention and breach counter updates become a bottleneck +- Scaling path: Migrate to Redis Cluster for distributed locking and key sharding. Update distributed lock client library if needed. + +**PostgreSQL Connection Pool:** +- Current capacity: Default pool size (likely 5-10 connections) +- Limit: High concurrent tenant queries or bulk exports exhaust connection pool +- Scaling path: Increase pool size based on workload (concurrent route handlers). Add connection pool metrics. Monitor connection wait time. + +**WinBox Tunnel Port Allocation:** +- Current capacity: Configurable port range (e.g., 40000-60000 = 20k ports) +- Limit: On heavily subscribed instances, port exhaustion causes new tunnel requests to fail +- Scaling path: Implement port pool overflow with secondary ranges. Add metrics for port utilization %. Fail gracefully (409 Conflict) when exhausted with clear message. + +**SSH Relay Session Limits:** +- Current capacity: Configurable maxSessions, maxPerUser, maxPerDevice +- Limit: Under a DoS attack, legitimate users are blocked by exhausted limits +- Scaling path: Implement adaptive rate limiting (cost per source IP). Add token rate limiting (tokens/minute per IP) before WebSocket upgrade. Monitor breach events and publish alerts. + +--- + +## Known Bugs + +**SSH Relay Pipe Ignores Errors:** +- Symptoms: SSH session may silently fail if StdinPipe/StdoutPipe creation fails +- Files: `poller/internal/sshrelay/server.go:209-211` (ignores error on StderrPipe, StdinPipe, StdoutPipe) +- Trigger: Unusual SSH server behavior or resource exhaustion +- Workaround: None; errors are silently ignored and the Shell() call fails later with an unclear error +- Fix approach: Check error returns from StdinPipe/StdoutPipe/StderrPipe. Log and close session if pipes fail. 
+ +**Idle Duration Calculation Anomaly:** +- Symptoms: Session.IdleDuration() can return very large (or negative in edge cases) if LastActive is not set before first check +- Files: `poller/internal/sshrelay/session.go:26-28` +- Trigger: Session created but never marked active (LastActive = 0 unix timestamp) +- Workaround: Initialize LastActive in Session constructor +- Fix approach: In Session creation (`server.go` line ~200), set `atomic.StoreInt64(&s.LastActive, time.Now().UnixNano())`. + +**X-Forwarded-For Parsing:** +- Symptoms: If X-Forwarded-For has trailing comma or spaces, source IP extraction may be incorrect +- Files: `poller/internal/sshrelay/server.go:133-136` +- Trigger: Misconfigured proxy or malicious header +- Workaround: Inspect audit logs for unusual source IPs +- Fix approach: Add validation after split: `strings.TrimSpace()` on parts, skip empty entries, validate resulting IP format. + +--- + +## Missing Critical Features + +**SSH Session End Event Publishing:** +- Problem: Audit trail incomplete; sessions start logged but not end +- Blocks: Audit compliance; user session tracking; security incident investigation +- Priority: High - this is a compliance/audit gap + +**Bandwidth Alert Evaluation:** +- Problem: rx_bps/tx_bps metric types in alert rules table but not evaluated +- Blocks: Users cannot create bandwidth-based alerts despite UI suggesting it's possible +- Priority: Medium - feature is partially implemented + +**Device Connection State Observability:** +- Problem: No metrics for device online/offline transition frequency or duration +- Blocks: Operators cannot diagnose intermittent connectivity issues +- Priority: Medium - operational insight would help debugging + +--- + +## Test Coverage Gaps + +**SSH Relay Security Paths:** +- What's not tested: Token validation against tampered or expired tokens; concurrent session limits enforcement under stress; source IP mismatch scenarios +- Files: `poller/internal/sshrelay/server_test.go` +- 
Risk: Malformed token or token replay attacks could bypass validation +- Priority: High - security-critical path + +**Tunnel Port Pool Exhaustion:** +- What's not tested: Behavior when port pool is exhausted (Allocate returns error); cleanup on listener bind failure +- Files: `poller/internal/tunnel/portpool_test.go`, `poller/internal/tunnel/manager_test.go` +- Risk: Port leaks or silent allocation failures under stress +- Priority: High - affects tunnel availability + +**Alert Evaluator with Maintenance Windows:** +- What's not tested: Cache invalidation on maintenance window updates; concurrent cache access during updates +- Files: `backend/app/services/alert_evaluator.py` +- Risk: Stale maintenance windows suppress alerts unintentionally or too long +- Priority: Medium - affects alert suppression accuracy + +**Device Offline Circuit Breaker:** +- What's not tested: Exponential backoff behavior across scheduler restarts; lock timeout when device is permanently offline +- Files: `poller/internal/poller/scheduler.go`, `poller/internal/poller/worker.go` +- Risk: Hammering offline device with connection attempts or missing it when it comes back online +- Priority: Medium - affects device polling efficiency + +--- + +*Concerns audit: 2026-03-12* diff --git a/.planning/codebase/CONVENTIONS.md b/.planning/codebase/CONVENTIONS.md new file mode 100644 index 0000000..1510f14 --- /dev/null +++ b/.planning/codebase/CONVENTIONS.md @@ -0,0 +1,348 @@ +# Coding Conventions + +**Analysis Date:** 2026-03-12 + +## Naming Patterns + +**Files:** +- TypeScript/React: `kebab-case.ts`, `kebab-case.tsx` for most modules; hook files use `camelCase` (e.g., `error-boundary.tsx`, `useShortcut.ts`) +- Python: `snake_case.py` (e.g., `test_auth.py`, `auth_service.py`) +- Go: `snake_case.go` (e.g., `scheduler_test.go`, `main.go`) +- Component files: PascalCase for exported components in UI libraries (e.g., `Button` from `button.tsx`) +- Test files: `{module}.test.tsx`, `{module}.spec.tsx` (frontend), `test_{module}.py` (backend) + 
+**Functions:** +- TypeScript/JavaScript: `camelCase` (e.g., `useShortcut`, `createApiClient`, `renderWithProviders`) +- Python: `snake_case` (e.g., `hash_password`, `verify_token`, `get_redis`) +- Go: `PascalCase` for exported, `camelCase` for private (e.g., `FetchDevices`, `mockDeviceFetcher`) +- React hooks: Prefix with `use` (e.g., `useAuth`, `useShortcut`, `useSequenceShortcut`) + +**Variables:** +- TypeScript: `camelCase` (e.g., `mockLogin`, `authState`, `refreshPromise`) +- Python: `snake_case` (e.g., `user_id`, `tenant_id`, `credentials`) +- Constants: `UPPER_SNAKE_CASE` for module-level constants (e.g., `ACCESS_TOKEN_COOKIE`, `REFRESH_TOKEN_MAX_AGE`) + +**Types:** +- TypeScript interfaces: `PascalCase` with `I` prefix optional (e.g., `ButtonProps`, `AuthState`, `WrapperProps`) +- Python: `PascalCase` for classes (e.g., `User`, `UserRole`, `HTTPException`) +- Go: `PascalCase` for exported (e.g., `Scheduler`, `Device`), `camelCase` for private (e.g., `mockDeviceFetcher`) + +**Directories:** +- Feature/module directories: `kebab-case` (e.g., `remote-access`, `device-groups`) +- Functional directories: `kebab-case` (e.g., `__tests__`, `components`, `routers`) +- Python packages: `snake_case` (e.g., `app/models`, `app/services`) + +## Code Style + +**Formatting:** + +Frontend: +- Tool: ESLint + TypeScript ESLint (flat config at `frontend/eslint.config.js`) +- Indentation: 2 spaces +- Line length: No explicit limit in config, but code stays under 120 chars +- Quotes: Single quotes in JS/TS (ESLint recommended) +- Semicolons: Required +- Trailing commas: Yes (ES2020+) + +Backend (Python): +- Tool: Ruff for linting +- Line length: 100 characters (`ruff` configured in `pyproject.toml`) +- Indentation: 4 spaces (PEP 8) +- Type hints: Required on function signatures (Pydantic models and FastAPI handlers) + +Poller (Go): +- Gofmt standard (implicit) +- Line length: conventional Go style +- Error handling: `if err != nil` pattern + +**Linting:** + +Frontend: +- ESLint 
config: `@eslint/js`, `typescript-eslint`, `react-hooks`, `react-refresh` +- Run: `npm run lint` +- Rules: Recommended + React hooks rules +- No unused locals/parameters enforced via TypeScript `noUnusedLocals` and `noUnusedParameters` + +Backend (Python): +- Ruff enabled for style and lint +- Target version: Python 3.12 +- Line length: 100 + +## Import Organization + +**Frontend (TypeScript/React):** + +Order: +1. React and React-adjacent imports (`import { ... } from 'react'`) +2. Third-party libraries (`import { ... } from '@tanstack/react-query'`) +3. Local absolute imports using `@` alias (`import { ... } from '@/lib/api'`) +4. Local relative imports (`import { ... } from '../utils'`) + +Path Aliases: +- `@/*` maps to `src/*` (configured in `tsconfig.app.json`) + +Example from `useShortcut.ts`: +```typescript +import { useEffect, useRef } from 'react' +// (no third-party imports in this file) +// (no local imports needed) +``` + +Example from `auth.ts`: +```typescript +import { create } from 'zustand' +import { authApi, type UserMe } from './api' +import { keyStore } from './crypto/keyStore' +import { deriveKeysInWorker } from './crypto/keys' +``` + +**Backend (Python):** + +Order: +1. Standard library (`import uuid`, `from typing import ...`) +2. Third-party (`from fastapi import ...`, `from sqlalchemy import ...`) +3. Local imports (`from app.services.auth import ...`, `from app.models.user import ...`) + +Standard pattern in routers (e.g., `auth.py`): +```python +import logging +from datetime import UTC, datetime, timedelta +from typing import Optional + +import redis.asyncio as aioredis +from fastapi import APIRouter, Depends +from sqlalchemy import select + +from app.config import settings +from app.database import get_admin_db +from app.services.auth import verify_password +``` + +**Go:** + +Order: +1. Standard library (`"context"`, `"log/slog"`) +2. Third-party (`github.com/...`) +3. 
Local module imports (`github.com/mikrotik-portal/poller/...`) + +Example from `main.go`: +```go +import ( + "context" + "log/slog" + "net/http" + "os" + + "github.com/bsm/redislock" + "github.com/redis/go-redis/v9" + + "github.com/mikrotik-portal/poller/internal/bus" + "github.com/mikrotik-portal/poller/internal/config" +) +``` + +## Error Handling + +**Frontend (TypeScript):** + +- Try/catch for async operations with type guards: `const axiosErr = err as { response?: ... }` +- Error messages extracted to helpers: `getAuthErrorMessage(err)` in `lib/auth.ts` +- State-driven error UI: Store errors in Zustand (`error: string | null`), display conditionally +- Pattern: Set error, then throw to allow calling code to handle: + ```typescript + try { + // operation + } catch (err) { + const message = getAuthErrorMessage(err) + set({ error: message }) + throw new Error(message) + } + ``` + +**Backend (Python):** + +- HTTPException from FastAPI for API errors (with status codes) +- Structured logging with structlog for all operations +- Pattern in services: raise exceptions, let routers catch and convert to HTTP responses +- Example from `auth.py` (lines 95-100): + ```python + async def get_redis() -> aioredis.Redis: + global _redis + if _redis is None: + _redis = aioredis.from_url(settings.REDIS_URL, decode_responses=True) + return _redis + ``` +- Database operations wrapped in try/finally blocks for cleanup + +**Go:** + +- Explicit error returns: `(result, error)` pattern +- Check and return: `if err != nil { return nil, err }` +- Structured logging with `log/slog` including error context +- Example from `scheduler_test.go`: + ```go + err := sched.reconcileDevices(ctx, &wg) + require.NoError(t, err) + ``` + +## Logging + +**Frontend:** + +- Framework: `console` (no structured logging library) +- Pattern: Inline console.log/warn/error during development +- Production: Minimal logging, errors captured in state (`auth.error`) +- Example from `auth.ts` (line 182): + 
```typescript + console.warn('[auth] key set decryption failed (Tier 1 data will be inaccessible):', e) + ``` + +**Backend (Python):** + +- Framework: `structlog` for structured, JSON logging +- Logger acquisition: `logger = structlog.get_logger(__name__)` or `logging.getLogger(__name__)` +- Logging at startup/shutdown and error conditions +- Example from `main.py`: + ```python + logger = structlog.get_logger(__name__) + logger.info("migrations applied successfully") + logger.error("migration failed", stderr=result.stderr) + ``` + +**Go (Poller):** + +- Framework: `log/slog` (standard library) +- JSON output to stdout with service name in attributes +- Levels: Debug, Info, Warn, Error +- Example from `main.go`: + ```go + slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ + Level: slog.LevelInfo, + }).WithAttrs([]slog.Attr{ + slog.String("service", "poller"), + }))) + ``` + +## Comments + +**When to Comment:** + +- Complex logic that isn't self-documenting +- Important caveats or gotchas +- References to related issues or specs +- Example from `auth.ts` (lines 26-29): + ```typescript + // Response interceptor: handle 401 by attempting token refresh + client.interceptors.response.use( + (response) => response, + async (error) => { + ``` + +**JSDoc/TSDoc:** + +- Used for exported functions and hooks +- Example from `useShortcut.ts`: + ```typescript + /** + * Hook to register a single-key keyboard shortcut. + * Skips when focus is in INPUT, TEXTAREA, or contentEditable elements. + */ + export function useShortcut(key: string, callback: () => void, enabled = true) + ``` + +**Python Docstrings:** + +- Module-level docstring at top of file describing purpose +- Function docstrings for public functions +- Example from `test_auth.py`: + ```python + """Unit tests for the JWT authentication service. 
+ + Tests cover: + - Password hashing and verification (bcrypt) + - JWT access token creation and validation + """ + ``` + +**Go Comments:** + +- Package-level comment above package declaration +- Exported function/type comments above declaration +- Example from `main.go`: + ```go + // Command poller is the MikroTik device polling microservice. + // It connects to RouterOS devices via the binary API... + package main + ``` + +## Function Design + +**Size:** + +- Frontend: Prefer hooks/components under 100 lines; break larger logic into smaller hooks +- Backend: Services typically 100-200 lines per function; larger operations split across multiple methods +- Example: `auth.ts` `srpLogin` is 130 lines but handles distinct steps (1-10 commented) + +**Parameters:** + +- Frontend: Functions take specific parameters, avoid large option objects except for component props +- Backend (Python): Use Pydantic schemas for request bodies, dependency injection for services +- Go: Interfaces preferred for mocking/testing (e.g., `DeviceFetcher` in `scheduler_test.go`) + +**Return Values:** + +- Frontend: Single return or destructured object: `return { ...render(...), queryClient }` +- Backend (Python): Single value or tuple for multiple returns (not common) +- Go: Always return `(result, error)` pair + +## Module Design + +**Exports:** + +- TypeScript: Named exports preferred for functions/types, default export only for React components + - Example: `export function useShortcut(...)` instead of `export default useShortcut` + - React components: `export default AppInner` (in `App.tsx`) +- Python: All public functions/classes at module level; use `__all__` for large modules +- Go: Exported functions capitalized: `func NewScheduler(...) 
*Scheduler` + +**Barrel Files:** + +- Frontend: `test-utils.tsx` re-exports Testing Library: `export * from '@testing-library/react'` +- Backend: Not used (explicit imports preferred) +- Go: Not applicable (no barrel pattern) + +## Specific Patterns Observed + +**Zustand Stores (Frontend):** +- Created with `create((set, get) => ({ ... }))` +- State shape includes loading, error, and data fields +- Actions call `set(newState)` or `get()` to access state +- Example: `useAuth` store in `lib/auth.ts` (lines 31-276) + +**Zustand selectors:** +- Use selector functions for role checks: `isSuperAdmin(user)`, `isTenantAdmin(user)`, etc. +- Pattern: Pure functions that check user role + +**Class Variance Authority (Frontend):** +- Used for component variants in UI library (e.g., `button.tsx`) +- Variants defined with `cva()` function with variant/size/etc. options +- Applied via `className={cn(buttonVariants({ variant, size }), className)}` + +**FastAPI Routers (Backend):** +- Each feature area gets its own router file: `routers/auth.py`, `routers/devices.py` +- Routers mounted at `app.include_router(router)` in `main.py` +- Endpoints use dependency injection for auth, db, etc. 
+ +**pytest Fixtures (Backend):** +- `conftest.py` at the test root defines markers and shared fixtures +- Integration fixtures live in `tests/integration/conftest.py` +- Unit tests use mocks, no database access + +**Go Testing:** +- Table-driven tests were not observed, but mock interfaces are used (e.g., `mockDeviceFetcher`) +- Testify assertions: `assert.Len`, `require.NoError` +- Helper functions to create test data: `newTestScheduler` + +--- + +*Convention analysis: 2026-03-12* diff --git a/.planning/codebase/INTEGRATIONS.md b/.planning/codebase/INTEGRATIONS.md new file mode 100644 index 0000000..0a308d1 --- /dev/null +++ b/.planning/codebase/INTEGRATIONS.md @@ -0,0 +1,245 @@ +# External Integrations + +**Analysis Date:** 2026-03-12 + +## APIs & External Services + +**MikroTik RouterOS:** +- Binary API (TLS port 8729) - Device polling and command execution + - SDK/Client: go-routeros/v3 (Go poller) + - Protocol: Binary encoded commands, TLS mutual authentication + - Used in: `poller/cmd/poller/main.go`, `poller/internal/poller/` + +**SMTP (Transactional Email):** +- System email service (password reset, alerts, notifications) + - SDK/Client: aiosmtplib (async SMTP library) + - Configuration: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_USE_TLS` + - From address: `SMTP_FROM_ADDRESS` + - Implementation: `app/services/email_service.py` + - Supports TLS, STARTTLS, plain auth + +**WebSocket/SSH Tunneling:** +- Browser-based SSH terminal for remote device access + - SDK/Client: asyncssh (Python), xterm.js (frontend) + - Protocol: SSH protocol with port forwarding + - Implementation: `app/routers/remote_access.py`, `poller/internal/sshrelay/` + - Features: Session auditing, command logging to NATS + +## Data Storage + +**Databases:** +- PostgreSQL 17 (TimescaleDB extension in production) + - Async driver: asyncpg 0.30.0+ (Python backend) + - Sync driver: pgx/v5 (Go poller) + - ORM: SQLAlchemy 2.0+ async + - Migrations: Alembic 1.14.0+ + - RLS: Row-Level Security 
policies for multi-tenant isolation + - Models: `app/models/` (17+ model files) + - Connection: `DATABASE_URL`, `APP_USER_DATABASE_URL`, `POLLER_DATABASE_URL` + - Admin role: postgres (migrations only) + - App role: app_user (enforces RLS) + - Poller role: poller_user (direct access, no RLS) + +**File Storage:** +- Local filesystem only - No cloud storage integration + - Git store (bare repos): `/data/git-store` or `./git-store` (RWX PVC in production) + - Implementation: `app/services/git_store.py` + - Purpose: Version control for device configurations (one repo per tenant) + - Firmware cache: `/data/firmware-cache` + - Purpose: Downloaded RouterOS firmware images + - Service: `app/services/firmware_service.py` + - WireGuard config: `/data/wireguard` + - Purpose: VPN peer and configuration management + +**Caching:** +- Redis 7+ + - Async driver: redis 5.0.0+ (Python) + - Sync driver: redis/go-redis/v9 (Go) + - Use cases: + - Session storage for SRP auth flows: `app/routers/auth.py` (key: `srp:session:{session_id}`) + - Distributed locks: poller uses `bsm/redislock` to prevent duplicate polls across replicas + - Connection: `REDIS_URL` + +## Authentication & Identity + +**Auth Provider:** +- Custom SRP-6a implementation (zero-knowledge auth) + - Flow: SRP-6a password hash registration → no plaintext password stored + - Implementation: `app/services/srp_service.py`, `app/routers/auth.py` + - JWT tokens: HS256 signed with `JWT_SECRET_KEY` + - Token storage: httpOnly cookies (frontend sends via credentials) + - Refresh: 15-minute access tokens, 7-day refresh tokens + - Fallback: Legacy bcrypt password support during upgrade phase + +**User Roles:** +- Four role levels with RBAC: + - super_admin - Cross-tenant access, user/billing management + - admin - Full tenant management (invite users, config push, firmware) + - operator - Limited: config push, monitoring, alerts + - viewer - Read-only: dashboard, reports, audit logs + +**Credential Encryption:** +- Per-tenant 
envelope encryption via OpenBao Transit + - Service: `app/services/openbao_service.py` + - Cipher: AES-256-GCM via OpenBao Transit engine + - Key naming: `tenant_{uuid}` (created on tenant creation) + - Fallback: Legacy Fernet decryption for credentials created before Transit migration + +## Monitoring & Observability + +**Error Tracking:** +- Not integrated - No Sentry, DataDog, or equivalent +- Local structured logging only + +**Logs:** +- Structured logging via structlog (Python backend) + - Format: JSON (production), human-readable (dev) + - Configuration: `app/logging_config.py` + - Log level: Configurable via `LOG_LEVEL` env var +- Structured logging via slog (Go poller) + - Format: JSON with service name and instance hostname + - Configuration: `poller/cmd/poller/main.go` + +**Metrics:** +- Prometheus metrics export + - Library: prometheus-fastapi-instrumentator 7.0.0+ + - Setup: `app/observability.py` + - Endpoint: Exposed metrics in Prometheus text format + - Not scraped by default - requires external Prometheus instance + +**OpenTelemetry:** +- Minimal OTEL instrumentation in Go poller + - SDK: `go.opentelemetry.io/otel` 1.39.0+ + - Not actively used in Python backend + +## CI/CD & Deployment + +**Hosting:** +- Self-hosted (Docker Compose for local, Kubernetes for production) +- No cloud provider dependency +- Reverse proxy: Caddy (reference: user memory notes) + +**CI Pipeline:** +- GitHub Actions (`.github/workflows/`) +- Not fully analyzed - check workflows for details + +**Containers:** +- Docker multi-stage builds for all three services +- Images: `api` (FastAPI), `poller` (Go binary), `frontend` (Vite SPA) +- Profiles: `full` (all services), `mail-testing` (adds Mailpit) + +## Environment Configuration + +**Required env vars:** +- `DATABASE_URL` - PostgreSQL admin connection +- `SYNC_DATABASE_URL` - Alembic migrations connection +- `APP_USER_DATABASE_URL` - App-scoped RLS connection +- `POLLER_DATABASE_URL` - Poller service connection +- `REDIS_URL` 
- Redis connection +- `NATS_URL` - NATS JetStream connection +- `JWT_SECRET_KEY` - HS256 signing key (MUST be unique in production) +- `CREDENTIAL_ENCRYPTION_KEY` - Base64-encoded 32-byte AES key +- `OPENBAO_ADDR` - OpenBao server address +- `OPENBAO_TOKEN` - OpenBao authentication token +- `CORS_ORIGINS` - Frontend origins (comma-separated) +- `SMTP_HOST`, `SMTP_PORT` - Email server +- `FIRST_ADMIN_EMAIL`, `FIRST_ADMIN_PASSWORD` - Bootstrap account (dev only) + +**Secrets location:** +- `.env` file (git-ignored) - Development +- Environment variables in production (Kubernetes secrets, docker compose .env) +- OpenBao - Stores Transit encryption keys (not key material, only key references) + +**Security defaults validation:** +- `app/config.py` rejects known-insecure values in non-dev environments: + - `JWT_SECRET_KEY` hard-coded defaults + - `CREDENTIAL_ENCRYPTION_KEY` hard-coded defaults + - `OPENBAO_TOKEN` hard-coded defaults +- Fails startup with clear error message if production uses dev secrets + +## Webhooks & Callbacks + +**Incoming:** +- None detected - No external webhook subscriptions + +**Outgoing:** +- Slack notifications - Alert firing/resolution (planned/partial implementation) + - Router: `app/routers/alerts.py` + - Implementation status: Check alert evaluation service +- Email notifications - Alert notifications, password reset + - Service: `app/services/email_service.py` +- Custom webhooks - Extensible via notification service + - Service: `app/services/notification_service.py` + +## NATS JetStream Event Bus + +**Message Bus:** +- NATS 2.0+ with JetStream persistence + - Python client: nats-py 2.7.0+ + - Go client: nats.go 1.38.0+ + - Connection: `NATS_URL` + +**Event Topics (Go/Python publishers → Python subscribers):** +- `device.status.>` - Device online/offline status from Go poller + - Subscriber: `app/services/nats_subscriber.py` + - Payload: device_id, tenant_id, status, routeros_version, board_name, uptime + - Usage: Real-time device fleet 
updates + +- `firmware.progress.{tenant_id}.{device_id}` - Firmware upgrade progress + - Subscriber: `app/services/firmware_subscriber.py` + - Publisher: Firmware upgrade service + - Payload: stage (downloading, verifying, upgrading), progress %, message + - Usage: Live firmware upgrade tracking (SSE to frontend) + +- `config.push.{tenant_id}.{device_id}` - Configuration push progress + - Subscriber: `app/services/push_rollback_subscriber.py` + - Publisher: `app/services/restore_service.py` + - Payload: phase (pre-validate, backup, push, commit), status, errors + - Usage: Live config deployment tracking with rollback support + +- `alert.fired.{tenant_id}`, `alert.resolved.{tenant_id}` - Alert events + - Subscriber: `app/services/sse_manager.py` + - Publisher: `app/services/alert_evaluator.py` + - Payload: alert_id, device_id, rule_name, condition, value, timestamp + - Usage: Real-time alert notifications (SSE to frontend) + +- `audit.session.end` - SSH session audit events + - Subscriber: `app/services/session_audit_subscriber.py` + - Publisher: Go SSH relay (`poller/internal/sshrelay/`) + - Payload: session_id, user_id, device_id, start_time, end_time, command_log + - Usage: Session auditing and compliance logging + +- `config.change.{tenant_id}.{device_id}` - Device config change detection + - Subscriber: `app/services/config_change_subscriber.py` + - Payload: device_id, change_type, affected_subsystems, timestamp + - Usage: Track unapproved config changes + +- `metrics.sample.{tenant_id}.{device_id}` - Real-time CPU/memory/traffic samples + - Subscriber: `app/services/metrics_subscriber.py` + - Publisher: Go poller + - Payload: timestamp, cpu_percent, memory_percent, disk_percent, interfaces{name, rx_bytes, tx_bytes} + - Usage: Live metric streaming (SSE to frontend) + +**Server-Sent Events (SSE):** +- Frontend subscribes to per-tenant SSE streams + - Endpoint: `GET /api/sse/subscribe?tenant_id={tenant_id}` + - Connection: Long-lived HTTP persistent stream + - 
Implementation: `app/routers/sse.py`, `app/services/sse_manager.py` + - Payload format: SSE (text/event-stream) + - Events forwarded from NATS to frontend browser in real-time + - Used for: firmware progress, alerts, config push status, metrics + +## Git Integration + +**Version Control:** +- Bare git repositories stored per-tenant + - Library: pygit2 1.14.0+ + - Location: `{GIT_STORE_PATH}/tenant_{tenant_id}/` + - Purpose: Store device configuration history + - Commits created on: successful config push, manual save + - Restore: One-click revert to any previous commit + - Implementation: `app/services/git_store.py` + +--- + +*Integration audit: 2026-03-12* diff --git a/.planning/codebase/STACK.md b/.planning/codebase/STACK.md new file mode 100644 index 0000000..26bb6cf --- /dev/null +++ b/.planning/codebase/STACK.md @@ -0,0 +1,158 @@ +# Technology Stack + +**Analysis Date:** 2026-03-12 + +## Languages + +**Primary:** +- Python 3.12+ - Backend API (`/backend`) +- Go 1.24.0 - Poller service (`/poller`) +- TypeScript 5.9.3 - Frontend (`/frontend`) +- JavaScript - Frontend runtime + +**Secondary:** +- SQL - PostgreSQL database queries and migrations +- YAML - Docker Compose configuration +- Shell - Infrastructure scripts + +## Runtime + +**Environment:** +- Node.js runtime (frontend) +- Python 3.12+ runtime (backend) +- Go 1.24.0 runtime (poller) + +**Package Manager:** +- npm (Node.js) - Frontend dependencies +- pip/hatchling (Python) - Backend dependencies +- go mod (Go) - Poller dependencies + +## Frameworks + +**Core:** +- FastAPI 0.115.0+ - Backend REST API (`app/main.py`) +- React 19.2.0 - Frontend UI components +- TanStack React Router 1.161.3 - Frontend routing and navigation +- TanStack React Query 5.90.21 - Frontend data fetching and caching +- Vite 7.3.1 - Frontend build tool and dev server +- go-routeros/v3 - MikroTik RouterOS binary protocol client + +**Testing:** +- pytest 8.0.0+ - Backend unit/integration tests (`tests/`) +- vitest 4.0.18 - Frontend 
unit tests +- @playwright/test 1.58.2 - Frontend E2E tests +- testcontainers-go 0.40.0 - Go integration tests with Docker containers + +**Build/Dev:** +- TypeScript 5.9.3 - Frontend type checking via `tsc -b` +- ESLint 9.39.1 - Frontend linting +- Alembic 1.14.0 - Backend database migrations +- docker compose - Multi-service orchestration +- pytest-cov 5.0.0 - Backend test coverage reporting +- vitest coverage - Frontend test coverage + +## Key Dependencies + +**Critical:** +- SQLAlchemy 2.0+ (asyncio) - Backend ORM with async support (`app/database.py`) +- asyncpg 0.30.0+ - Async PostgreSQL driver for Python +- pgx/v5 - Sync PostgreSQL driver for Go poller +- nats-py 2.7.0+ - NATS JetStream client (Python, event publishing and subscribing) +- nats.go 1.38.0+ - NATS JetStream client (Go, event publishing and subscribing) +- redis 5.0.0+ (Python) - Redis async client for session storage (`app/routers/auth.py`) +- redis/go-redis/v9 - Redis client for Go (distributed locks) +- httpx 0.27.0+ - Async HTTP client for OpenBao API calls +- asyncssh 2.20.0+ - SSH library for remote device access + +**Infrastructure:** +- cryptography 42.0.0+ - Encryption/decryption, SSH key handling +- bcrypt 4.0.0-5.0.0 - Password hashing +- python-jose 3.3.0+ - JWT token creation and validation +- pydantic 2.0.0+ - Request/response validation, settings +- pydantic-settings 2.0.0+ - Environment variable configuration +- slowapi 0.1.9+ - Rate limiting middleware +- structlog 25.1.0+ - Structured logging +- prometheus-fastapi-instrumentator 7.0.0+ - Prometheus metrics export +- aiosmtplib 3.0.0+ - Async SMTP for email notifications +- weasyprint 62.0+ - PDF report generation +- pygit2 1.14.0+ - Git version control integration (`app/services/git_store.py`) +- apscheduler 3.10.0-4.0 - Background job scheduling + +**Frontend UI:** +- @radix-ui/* (v1-2) - Accessible component primitives +- Tailwind CSS 3.4.19 - Utility-first CSS framework +- lucide-react 0.575.0 - Icon library +- framer-motion 12.34.3 - 
Animation library +- recharts 3.7.0 - Chart library +- reactflow 11.11.4 - Network diagram rendering +- react-leaflet 5.0.0 - Map visualization +- xterm.js 6.0.0 - Terminal emulator for SSH (`@xterm/xterm`, `@xterm/addon-fit`) +- sonner 2.0.7 - Toast notifications +- zod 4.3.6 - Runtime schema validation +- zustand 5.0.11 - Lightweight state management +- axios 1.13.5 - HTTP client for API calls +- diff 8.0.3 - Diff computation for git-diff-view + +**Testing Libraries:** +- @testing-library/react 16.3.2 - React component testing utilities +- @testing-library/user-event 14.6.1 - User interaction simulation +- jsdom 28.1.0 - DOM implementation for Node.js tests + +## Configuration + +**Environment:** +- `.env` file (Pydantic BaseSettings) - Development environment variables +- `.env.example` - Template with safe defaults +- `.env.staging.example` - Staging environment template +- Environment validation in `app/config.py` - Rejects known-insecure defaults in non-dev environments + +**Key Environment Variables:** +- `ENVIRONMENT` - (dev|staging|production) +- `DATABASE_URL` - PostgreSQL async connection (admin role) +- `SYNC_DATABASE_URL` - PostgreSQL sync for Alembic migrations +- `APP_USER_DATABASE_URL` - PostgreSQL with app_user role (RLS enforced) +- `POLLER_DATABASE_URL` - PostgreSQL for Go poller (separate role) +- `REDIS_URL` - Redis connection for sessions and locks +- `NATS_URL` - NATS JetStream connection +- `JWT_SECRET_KEY` - HS256 signing key (must be unique in production) +- `CREDENTIAL_ENCRYPTION_KEY` - Base64-encoded 32-byte AES key for credential storage +- `OPENBAO_ADDR` - OpenBao HTTP endpoint +- `OPENBAO_TOKEN` - OpenBao auth token +- `CORS_ORIGINS` - Comma-separated allowed frontend origins +- `SMTP_HOST`, `SMTP_PORT` - Email configuration + +**Build:** +- `vite.config.ts` - Vite bundler configuration (frontend) +- `tsconfig.json` - TypeScript compiler options +- `pyproject.toml` - Python project metadata and dependencies +- `go.mod` / `go.sum` - Go 
module dependencies +- `Dockerfile` - Multi-stage builds for all three services +- `docker-compose.yml` - Local development stack + +## Platform Requirements + +**Development:** +- Python 3.12+ +- Node.js 20.19+ (npm; Vite 7 no longer supports Node.js 18) +- Go 1.24.0 +- Docker and Docker Compose +- PostgreSQL 17 (via Docker) +- Redis 7 (via Docker) +- NATS 2+ with JetStream (via Docker) +- OpenBao 2.1+ (via Docker) +- WireGuard (via Docker image) + +**Production:** +- Kubernetes or Docker Swarm for orchestration +- PostgreSQL 17+ with TimescaleDB extension +- Redis 7+ (standalone or cluster) +- NATS 2.0+ with JetStream persistence +- OpenBao 2.0+ for encryption key management +- WireGuard container for VPN tunneling +- TLS certificates for HTTPS (Caddy/nginx reverse proxy) +- Storage for git-backed configs (`/data/git-store` - ReadWriteMany PVC) +- Storage for firmware cache (`/data/firmware-cache`) + +--- + +*Stack analysis: 2026-03-12* diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md new file mode 100644 index 0000000..612ee9d --- /dev/null +++ b/.planning/codebase/STRUCTURE.md @@ -0,0 +1,293 @@ +# Codebase Structure + +**Analysis Date:** 2026-03-12 + +## Directory Layout + +``` +the-other-dude/ +├── backend/ # Python FastAPI backend microservice +│ ├── app/ +│ │ ├── main.py # FastAPI app entry point, lifespan setup +│ │ ├── config.py # Settings from environment +│ │ ├── database.py # SQLAlchemy engines, session factories +│ │ ├── models/ # SQLAlchemy ORM models +│ │ ├── schemas/ # Pydantic request/response schemas +│ │ ├── routers/ # APIRouter endpoints (devices, alerts, auth, etc.) 
+│ │ ├── services/ # Business logic, NATS subscribers, integrations +│ │ ├── middleware/ # RBAC, tenant context, rate limiting, headers +│ │ ├── security/ # SRP, JWT, auth utilities +│ │ └── templates/ # Jinja2 report templates +│ ├── alembic/ # Database migrations +│ ├── tests/ # Unit and integration tests +│ └── Dockerfile +│ +├── poller/ # Go device polling microservice +│ ├── cmd/poller/main.go # Entry point +│ ├── internal/ +│ │ ├── poller/ # Scheduler and Worker (device polling orchestration) +│ │ ├── device/ # RouterOS binary API client +│ │ ├── bus/ # NATS JetStream publisher +│ │ ├── tunnel/ # WinBox TCP tunnel manager +│ │ ├── sshrelay/ # SSH relay server +│ │ ├── config/ # Configuration loading +│ │ ├── store/ # PostgreSQL device list queries +│ │ ├── vault/ # OpenBao credential cache +│ │ ├── observability/ # Prometheus metrics, health checks +│ │ └── testutil/ # Test helpers +│ ├── go.mod / go.sum +│ └── Dockerfile +│ +├── frontend/ # React 19 TypeScript web UI +│ ├── src/ +│ │ ├── routes/ # TanStack Router file-based routes +│ │ │ ├── __root.tsx # Root layout, QueryClientProvider +│ │ │ ├── _authenticated.tsx # Auth guard, logged-in layout +│ │ │ └── _authenticated/ # Tenant and device-scoped pages +│ │ ├── components/ # React components by feature +│ │ │ ├── ui/ # Base UI components (button, card, dialog, etc.) +│ │ │ ├── dashboard/ +│ │ │ ├── fleet/ +│ │ │ ├── devices/ +│ │ │ ├── config/ +│ │ │ ├── alerts/ +│ │ │ ├── auth/ +│ │ │ ├── vpn/ +│ │ │ └── ... +│ │ ├── hooks/ # Custom React hooks (useEventStream, useShortcut, etc.) 
+│ │ ├── contexts/ # React Context (EventStreamContext) +│ │ ├── lib/ # Utilities (API client, crypto, helpers) +│ │ ├── assets/ # Fonts, images +│ │ └── main.tsx # Entry point +│ ├── public/ +│ ├── package.json / pnpm-lock.yaml +│ ├── tsconfig.json +│ ├── vite.config.ts +│ └── Dockerfile +│ +├── infrastructure/ # Deployment and observability configs +│ ├── docker/ # Docker build scripts +│ ├── helm/ # Kubernetes Helm charts +│ ├── observability/ # Grafana dashboards, OpenBao configs +│ └── openbao/ # OpenBao policy and plugin configs +│ +├── docs/ # Documentation +│ ├── website/ # Website source (theotherdude.net) +│ └── superpowers/ # Feature specs and plans +│ +├── scripts/ # Utility scripts +│ +├── docker-compose.yml # Development multi-container setup +├── docker-compose.override.yml # Local overrides (mounted volumes, etc.) +├── docker-compose.staging.yml # Staging environment +├── docker-compose.prod.yml # Production environment +├── docker-compose.observability.yml # Optional Prometheus/Grafana stack +│ +├── .env.example # Template environment variables +├── .github/ # GitHub Actions CI/CD workflows +├── .planning/ # GSD planning documents +│ +└── README.md # Main project documentation +``` + +## Directory Purposes + +**backend/app/models/:** +- Purpose: SQLAlchemy ORM model definitions with RLS support +- Contains: Device, User, Tenant, Alert, ConfigBackup, Certificate, AuditLog, Firmware, VPN models +- Key files: `device.py` (devices with status, version, uptime), `user.py` (users with role and tenant), `alert.py` (alert rules and event log) +- Pattern: All models include `tenant_id` column, RLS policies enforce isolation + +**backend/app/schemas/:** +- Purpose: Pydantic request/response validation schemas +- Contains: DeviceCreate, DeviceResponse, DeviceUpdate, AlertRuleCreate, ConfigPushRequest, etc. 
+- Pattern: Separate request/response schemas (response never includes credentials), nested schema reuse + +**backend/app/routers/:** +- Purpose: FastAPI APIRouter endpoints, organized by domain +- Key files: `devices.py` (CRUD + bulk ops), `auth.py` (login, SRP, SSE token), `alerts.py` (rules and events), `config_editor.py` (live device config), `metrics.py` (metrics queries), `templates.py` (config templates), `vpn.py` (WireGuard peers) +- Pattern: All routes tenant-scoped as `/api/tenants/{tenant_id}/...` or `/api/...` (user-scoped) +- Middleware: Depends(require_role(...)), Depends(get_current_user), rate limiting + +**backend/app/services/:** +- Purpose: Business logic, external integrations, NATS event handling +- Core services: `device.py` (device CRUD with encryption), `auth.py` (SRP, JWT, password hashing), `backup_service.py` (config backup versioning), `ca_service.py` (TLS certificate generation and deployment) +- NATS subscribers: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware tracking), `alert_evaluator.py` (alert rule evaluation), `push_rollback_subscriber.py`, `session_audit_subscriber.py` +- External integrations: `email_service.py`, `notification_service.py` (Slack, webhooks), `git_store.py` (config history), `openbao_service.py` (vault access) +- Schedulers: `backup_scheduler.py`, `firmware_subscriber.py` (started/stopped in lifespan) + +**backend/app/middleware/:** +- Purpose: Request/response middleware, RBAC, tenant context, rate limiting +- Key files: `tenant_context.py` (JWT extraction, tenant context setup, RLS configuration), `rbac.py` (role hierarchy, Depends factories), `rate_limit.py` (token bucket limiter), `request_id.py` (correlation ID), `security_headers.py` (CSP, HSTS) + +**backend/app/security/:** +- Purpose: Authentication and encryption utilities +- Pattern: SRP-6a client challenge/response, JWT token generation, password hashing (bcrypt), credential 
envelope encryption (Fernet + Transit KMS) + +**poller/internal/poller/:** +- Purpose: Device scheduling and polling orchestration +- Key files: `scheduler.go` (lifecycle management, discovery), `worker.go` (per-device polling loop), `interfaces.go` (device interfaces) +- Pattern: Per-device goroutine with Redis distributed locking, circuit breaker with exponential backoff + +**poller/internal/device/:** +- Purpose: RouterOS binary API client implementation +- Key files: `client.go` (connection, command execution), `version.go` (parse RouterOS version), `health.go` (CPU, memory, disk metrics), `interfaces.go` (interface stats), `wireless.go` (wireless stats), `firmware.go` (firmware info), `cert_deploy.go` (TLS cert SFTP), `sftp.go` (SFTP operations) +- Pattern: Binary API command builders, response parsers, error handling + +**poller/internal/bus/:** +- Purpose: NATS JetStream publisher for all device events +- Key file: `publisher.go` (typed event structs, publish methods) +- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent, SessionAuditEvent +- Pattern: Struct with nc/js connections, methods like PublishDeviceStatus(ctx, event) + +**poller/internal/tunnel/:** +- Purpose: WinBox TCP tunnel management +- Key files: `manager.go` (port allocation, tunnel lifecycle), `tunnel.go` (tunnel goroutine), `portpool.go` (ephemeral port pool) +- Pattern: SOCKS proxy forwarding, port reuse after timeout + +**poller/internal/sshrelay/:** +- Purpose: SSH server bridging to RouterOS terminal access +- Key files: `server.go` (SSH server setup), `session.go` (SSH session handling), `bridge.go` (SSH-to-device relay) +- Pattern: SSH key pair generation, session multiplexing, terminal protocol bridging + +**poller/internal/vault/:** +- Purpose: OpenBao credential caching and decryption +- Key file: `vault.go` +- Pattern: Cache credentials after decryption via Transit KMS, TTL-based eviction + +**frontend/src/routes/:** +- 
Purpose: TanStack Router file-based routing +- Structure: `__root.tsx` (app root, QueryClientProvider), `_authenticated.tsx` (requires JWT, layout), `_authenticated/tenants/$tenantId/index` (tenant home), `_authenticated/tenants/$tenantId/devices/...` (device pages) +- Pattern: Each file exports `Route` object with component and loader, nested routes inherit parent loaders + +**frontend/src/components/:** +- Purpose: React components organized by domain/feature +- Structure: `ui/` (base components: Button, Card, Dialog, Input, Select, Badge, Skeleton, etc.), then feature folders (dashboard, fleet, devices, config, alerts, auth, vpn, etc.) +- Pattern: Composition over inheritance, CSS Modules or Tailwind for styling + +**frontend/src/hooks/:** +- Purpose: Custom React hooks for reusable logic +- Key files: `useEventStream.ts` (SSE connection lifecycle), `useShortcut.ts` (keyboard shortcuts), `useConfigPanel.ts` (config editor state), `usePageTitle.ts` (document title), `useSimpleConfig.ts` (simple config wizard state) + +**frontend/src/lib/:** +- Purpose: Utility modules +- Key files: `api.ts` (axios instance, fetch wrapper), `crypto/` (SRP client, key derivation), helpers (date formatting, validation, etc.) 
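The file-based routing layout described under `frontend/src/routes/` can be illustrated with a small sketch. It assumes the common TanStack Router conventions (underscore-prefixed segments are pathless layouts, `$name` segments are path params, `index` matches the parent path) and only handles directory-style nesting; `routeFileToUrl` is a hypothetical helper for illustration, not something taken from the codebase.

```typescript
// Hypothetical helper (not part of the project): derive the URL pattern
// served by a route file path given relative to src/routes.
// Simplification: only directory nesting is handled, not flat "." file names.
function routeFileToUrl(filePath: string): string {
  const segments = filePath
    .replace(/\.tsx?$/, '') // drop the file extension
    .split('/')
    .filter((s) => s.length > 0)
    .filter((s) => !s.startsWith('_')) // pathless layouts (and __root) add no URL segment
    .filter((s) => s !== 'index') // index routes match the parent path
    .map((s) => (s.startsWith('$') ? `{${s.slice(1)}}` : s)) // $param -> {param}
  return '/' + segments.join('/')
}

console.log(routeFileToUrl('_authenticated/tenants/$tenantId/devices.tsx'))
// → /tenants/{tenantId}/devices
```

Under these assumptions the `_authenticated` segment contributes the auth guard and layout but no URL path, so tenant pages are served directly under `/tenants/...`.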
+ +**backend/alembic/:** +- Purpose: Database schema migrations +- Key files: `alembic/versions/*.py` (timestamped migration scripts) +- Pattern: `upgrade()` and `downgrade()` functions, SQL operations via `op` context + +**tests/:** +- Backend: `tests/unit/` (service/model tests), `tests/integration/` (API endpoint tests with test DB) +- Frontend: `tests/e2e/` (Playwright E2E tests), `src/components/__tests__/` (component tests) + +## Key File Locations + +**Entry Points:** +- Backend: `backend/app/main.py` (FastAPI app, lifespan management) +- Poller: `poller/cmd/poller/main.go` (scheduler initialization) +- Frontend: `frontend/src/main.tsx` (React root), `frontend/src/routes/__root.tsx` (router root) + +**Configuration:** +- Backend: `backend/app/config.py` (Settings from .env) +- Poller: `poller/internal/config/config.go` (Load environment) +- Frontend: `frontend/vite.config.ts` (build config), `frontend/tsconfig.json` (TypeScript config) + +**Core Logic:** +- Device management: `backend/app/services/device.py` (CRUD), `poller/internal/device/` (API client), `frontend/src/components/fleet/` (UI) +- Config push: `backend/app/routers/config_editor.py` (API), `poller/internal/poller/worker.go` (execution), `frontend/src/components/config-editor/` (UI) +- Alerts: `backend/app/services/alert_evaluator.py` (evaluation logic), `backend/app/routers/alerts.py` (API), `frontend/src/components/alerts/` (UI) +- Authentication: `backend/app/security/` (SRP, JWT), `frontend/src/components/auth/` (forms), `poller/internal/vault/` (credential cache) + +**Testing:** +- Backend unit: `backend/tests/unit/` +- Backend integration: `backend/tests/integration/` +- Frontend e2e: `frontend/tests/e2e/` (Playwright specs) +- Poller unit: `poller/internal/poller/*_test.go`, `poller/internal/device/*_test.go` + +## Naming Conventions + +**Files:** +- Backend Python: snake_case.py (e.g., `device_service.py`, `nats_subscriber.py`) +- Poller Go: snake_case.go (e.g., `poller.go`, 
`scheduler.go`) +- Frontend TypeScript: PascalCase.tsx for components (e.g., `FleetTable.tsx`), camelCase.ts for utilities (e.g., `useEventStream.ts`) +- Routes: File name maps to URL path (`_authenticated/tenants/$tenantId/devices.tsx` → `/tenants/{tenantId}/devices`; the underscore-prefixed `_authenticated` segment is a pathless layout route and does not appear in the URL) + +**Functions/Methods:** +- Backend: snake_case (async def list_devices(...)), service functions are async +- Poller: PascalCase for exported types and methods (Scheduler, Publisher, PublishDeviceStatus), camelCase for unexported methods (reconcileDevices) +- Frontend: camelCase for functions and hooks, PascalCase for component names + +**Variables:** +- Backend: snake_case (device_id, tenant_id, current_user) +- Poller: camelCase for small scope (ctx, result), PascalCase for types (DeviceState) +- Frontend: camelCase (connectionState, lastConnectedAt) + +**Types:** +- Backend: PascalCase classes (Device, User, DeviceCreate) +- Poller: Exported types PascalCase (DeviceStatusEvent), unexported camelCase (deviceState) +- Frontend: TypeScript interfaces PascalCase (SSEEvent, EventCallback), generics with T + +## Where to Add New Code + +**New Feature (e.g., new device capability):** +- Primary code: + - Backend API: `backend/app/routers/{feature}.py` (new router file) + - Backend service: `backend/app/services/{feature}.py` (business logic) + - Backend model: Add to `backend/app/models/{domain}.py` or new file + - Poller: `poller/internal/device/{capability}.go` (RouterOS API client method) + - Poller event: Add struct to `poller/internal/bus/publisher.go`, new publish method + - Backend subscriber: `backend/app/services/{feature}_subscriber.py` if async processing needed +- Tests: `backend/tests/integration/test_{feature}.py` (API tests), `backend/tests/unit/test_{service}.py` (service tests) +- Frontend: + - Route: `frontend/src/routes/_authenticated/{feature}.tsx` (if new top-level page) + - Component: `frontend/src/components/{feature}/{FeatureName}.tsx` + - Hook: `frontend/src/hooks/use{FeatureName}.ts` if shared state/logic +- Database: Migration in 
`backend/alembic/versions/{timestamp}_{description}.py` + +**New Component/Module:** +- Backend: Create in `backend/app/services/{module}.py` as async class with methods, import in relevant router/subscriber +- Poller: Create in `poller/internal/{package}/{module}.go`, follow interface pattern in `interfaces.go` +- Frontend: Create in `frontend/src/components/{feature}/{ModuleName}.tsx`, export as named export + +**Utilities/Helpers:** +- Backend: `backend/app/services/` (service-level) or `backend/app/` subdirectory (utility modules) +- Poller: `poller/internal/{package}/` (package-level utilities) +- Frontend: `frontend/src/lib/{utility}/` (organized by concern: api, crypto, helpers, etc.) + +## Special Directories + +**docker-data/:** +- Purpose: Docker volumes for persistent data (PostgreSQL, NATS, Redis, WireGuard configs, Git backups) +- Generated: Yes (created by Docker on first run) +- Committed: No (.gitignore) + +**alembic/versions/:** +- Purpose: Database migration history +- Generated: No (manually written by developers) +- Committed: Yes (part of source control for reproducible schema) + +**.env files:** +- `.env.example`: Template with non-secret defaults, always committed +- `.env`: Local development config, not committed, ignored by .gitignore +- `.env.staging.example`: Staging environment template + +**.planning/codebase/:** +- Purpose: GSD-generated codebase analysis documents (ARCHITECTURE.md, STRUCTURE.md, CONVENTIONS.md, TESTING.md, etc.) 
+- Generated: Yes (by GSD tools) +- Committed: Yes (reference for future development) + +**node_modules/ (frontend):** +- Purpose: npm/pnpm dependencies +- Generated: Yes (by pnpm install) +- Committed: No (.gitignore) + +**__pycache__ (backend), vendor (poller):** +- Purpose: Compiled bytecode and dependency caches +- Generated: Yes +- Committed: No (.gitignore) + +--- + +*Structure analysis: 2026-03-12* diff --git a/.planning/codebase/TESTING.md b/.planning/codebase/TESTING.md new file mode 100644 index 0000000..ebcc2c7 --- /dev/null +++ b/.planning/codebase/TESTING.md @@ -0,0 +1,751 @@ +# Testing Patterns + +**Analysis Date:** 2026-03-12 + +## Test Framework + +**Frontend:** + +Runner: +- Vitest 4.0.18 +- Config: `frontend/vitest.config.ts` +- Environment: jsdom (browser simulation) +- Globals enabled: true + +Assertion Library: +- Testing Library (React) - `@testing-library/react` +- Testing Library User Events - `@testing-library/user-event` +- Testing Library Jest DOM matchers - `@testing-library/jest-dom` +- Vitest's built-in expect (compatible with Jest) + +Run Commands: +```bash +npm run test # Run all tests once +npm run test:watch # Watch mode (re-runs on file change) +npm run test:coverage # Generate coverage report +npm run test:e2e # E2E tests with Playwright +npm run test:e2e:headed # E2E tests with visible browser +``` + +**Backend:** + +Runner: +- pytest 8.0.0 +- Config: `pyproject.toml` with `asyncio_mode = "auto"` +- Plugins: pytest-asyncio, pytest-mock, pytest-cov +- Markers: `integration` (marked tests requiring PostgreSQL) + +Run Commands: +```bash +pytest # Run all tests +pytest -m "not integration" # Run unit tests only +pytest -m integration # Run integration tests only +pytest --cov=app # Generate coverage report +pytest -v # Verbose output +``` + +**Go (Poller):** + +Runner: +- Go's built-in testing package +- Config: implicit (no config file) +- Assertions: testify/assert, testify/require +- Test containers for integration tests 
(PostgreSQL, Redis, NATS) + +Run Commands: +```bash +go test ./... # Run all tests +go test -v ./... # Verbose output +go test -run TestName ... # Run specific test +go test -race ./... # Race condition detection +``` + +## Test File Organization + +**Frontend:** + +Location: +- Co-located with components in `__tests__` subdirectory +- Pattern: `src/components/__tests__/{component}.test.tsx` +- Shared test utilities in `src/test/test-utils.tsx` +- Test setup in `src/test/setup.ts` + +Examples: +- `frontend/src/components/__tests__/LoginPage.test.tsx` +- `frontend/src/components/__tests__/DeviceList.test.tsx` +- `frontend/src/components/__tests__/TemplatePushWizard.test.tsx` + +Naming: +- Test files: `{Component}.test.tsx` (matches component name) +- Vitest config includes: `'src/**/*.test.{ts,tsx}'` + +**Backend:** + +Location: +- Separate `tests/` directory at project root +- Organization: `tests/unit/` and `tests/integration/` +- Pattern: `tests/unit/test_{module}.py` + +Examples: +- `backend/tests/unit/test_auth.py` +- `backend/tests/unit/test_security.py` +- `backend/tests/unit/test_crypto.py` +- `backend/tests/unit/test_audit_service.py` +- `backend/tests/conftest.py` (shared fixtures) +- `backend/tests/integration/conftest.py` (database fixtures) + +**Go:** + +Location: +- Co-located with implementation: `{file}.go` and `{file}_test.go` +- Pattern: `internal/poller/scheduler_test.go` alongside `scheduler.go` + +Examples: +- `poller/internal/poller/scheduler_test.go` +- `poller/internal/sshrelay/server_test.go` +- `poller/internal/poller/integration_test.go` + +## Test Structure + +**Frontend (Vitest + React Testing Library):** + +Suite Organization: +```typescript +/** + * Component tests -- description of what is tested + */ + +import { describe, it, expect, vi, beforeEach } from 'vitest' +import { render, screen, waitFor } from '@/test/test-utils' +import userEvent from '@testing-library/user-event' + +// 
-------------------------------------------------------------------------- +// Mocks +// -------------------------------------------------------------------------- + +const mockNavigate = vi.fn() +vi.mock('@tanstack/react-router', () => ({ + // mock implementation +})) + +// -------------------------------------------------------------------------- +// Tests +// -------------------------------------------------------------------------- + +describe('LoginPage', () => { + beforeEach(() => { + vi.clearAllMocks() + }) + + it('renders login form with email and password fields', () => { + render(<LoginPage />) + expect(screen.getByLabelText(/email/i)).toBeInTheDocument() + }) + + it('submits form with entered credentials', async () => { + render(<LoginPage />) + const user = userEvent.setup() + await user.type(screen.getByLabelText(/email/i), 'test@example.com') + await user.click(screen.getByRole('button', { name: /sign in/i })) + + await waitFor(() => { + expect(mockLogin).toHaveBeenCalledWith('test@example.com', expect.any(String)) + }) + }) +}) +``` + +Patterns: +- Mocks defined before imports, then imported components +- Section comments: `// ---------- Mocks ----------`, `// ---------- Tests ----------` +- `describe()` blocks for test suites +- `beforeEach()` for test isolation and cleanup +- `userEvent.setup()` for simulating user interactions +- `waitFor()` for async assertions +- Accessibility-first selectors: `getByLabelText`, `getByRole` over `getByTestId` + +**Backend (pytest):** + +Suite Organization: +```python +"""Unit tests for the JWT authentication service. + +Tests cover: +- Password hashing and verification (bcrypt) +- JWT access token creation and validation +""" + +import pytest +from unittest.mock import patch + +class TestPasswordHashing: + """Tests for bcrypt password hashing.""" + + def test_hash_returns_different_string(self): + password = "test-password-123!" 
+ hashed = hash_password(password) + assert hashed != password + + def test_hash_verify_roundtrip(self): + password = "test-password-123!" + hashed = hash_password(password) + assert verify_password(password, hashed) is True +``` + +Patterns: +- Module docstring describing test scope +- Test classes for grouping related tests: `class TestPasswordHashing:` +- Test methods: `def test_{behavior}(self):` +- Assertions: `assert condition` (pytest style) +- Fixtures defined in conftest.py for async/db setup + +**Go:** + +Suite Organization: +```go +package poller + +import ( + "context" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +// mockDeviceFetcher implements DeviceFetcher for testing. +type mockDeviceFetcher struct { + devices []store.Device + err error +} + +func (m *mockDeviceFetcher) FetchDevices(ctx context.Context) ([]store.Device, error) { + return m.devices, m.err +} + +func newTestScheduler(fetcher DeviceFetcher) *Scheduler { + // Create test instance with mocked dependencies + return &Scheduler{...} +} + +func TestReconcileDevices_StartsNewDevices(t *testing.T) { + devices := []store.Device{...} + fetcher := &mockDeviceFetcher{devices: devices} + sched := newTestScheduler(fetcher) + + var wg sync.WaitGroup + ctx, cancel := context.WithCancel(context.Background()) + defer cancel() + + err := sched.reconcileDevices(ctx, &wg) + require.NoError(t, err) + + sched.mu.Lock() + assert.Len(t, sched.activeDevices, 2) + sched.mu.Unlock() +} +``` + +Patterns: +- Mock types defined at package level (not inside test functions) +- Constructor helper: `newTest{Subject}(...)` for creating test instances +- Test function signature: `func Test{Subject}_{Scenario}(t *testing.T)` +- testify assertions: `assert.Len()`, `require.NoError()` +- Context management with defer for cleanup +- Concurrent access protected by locks (shown in assertions) + +## Mocking + +**Frontend:** + +Framework: vitest `vi` object + +Patterns: 
+```typescript +// Mock module imports +vi.mock('@tanstack/react-router', () => ({ + useNavigate: () => mockNavigate, + Link: ({ children, ...props }) => {children}, +})) + +// Mock with partial real imports +vi.mock('@/lib/api', async () => { + const actual = await vi.importActual('@/lib/api') + return { + ...actual, + devicesApi: { + ...actual.devicesApi, + list: (...args: unknown[]) => mockDevicesList(...args), + }, + } +}) + +// Create spy/mock functions +const mockLogin = vi.fn() +const mockNavigate = vi.fn() + +// Configure mock behavior +mockLogin.mockResolvedValueOnce(undefined) // Resolve once +mockLogin.mockRejectedValueOnce(new Error('...')) // Reject once +mockLogin.mockReturnValueOnce(new Promise(...)) // Return pending promise + +// Clear mocks between tests +beforeEach(() => { + vi.clearAllMocks() +}) + +// Assert mock was called +expect(mockLogin).toHaveBeenCalledWith('email', 'password') +expect(mockNavigate).toHaveBeenCalledWith({ to: '/' }) +``` + +What to Mock: +- External API calls (via axios/fetch) +- Router navigation (TanStack Router) +- Zustand store state (create mock `authState`) +- External libraries with complex behavior + +What NOT to Mock: +- DOM elements (use Testing Library queries instead) +- React hooks from react-testing-library +- Component rendering (test actual render unless circular dependency) + +**Backend (Python):** + +Framework: pytest's built-in `monkeypatch` fixture, pytest-mock (`mocker`), and unittest.mock + +Patterns: +```python +from unittest.mock import MagicMock, patch + +# Fixture-based mocking +@pytest.fixture +def mock_db(monkeypatch): + # monkeypatch.setattr(module, 'function', mock_fn) + pass + +# Patch in test (monkeypatch.setattr returns None, so keep your own reference to the mock) +def test_something(monkeypatch): + mock_hash = MagicMock(return_value='hashed') + monkeypatch.setattr('app.services.auth.hash_password', mock_hash) + +# Mock with context manager +def test_redis(): + with patch('app.routers.auth.get_redis') as mock_redis: + mock_redis.return_value = MagicMock() + # test code +``` + +What to Mock: +- Database queries (return test data) +- External HTTP 
calls +- Redis operations +- Email sending +- File I/O + +What NOT to Mock: +- Core business logic (hash_password, verify_token) +- Pydantic model validation +- SQLAlchemy relationship traversal (in integration tests) + +**Go:** + +Framework: testify/mock or simple interfaces + +Patterns: +```go +// Interface-based mocking +type mockDeviceFetcher struct { + devices []store.Device + err error +} + +func (m *mockDeviceFetcher) FetchDevices(ctx context.Context) ([]store.Device, error) { + return m.devices, m.err +} + +// Use interface, not concrete type +func newTestScheduler(fetcher DeviceFetcher) *Scheduler { + return &Scheduler{store: fetcher, ...} +} + +// Configure in test +sched := newTestScheduler(&mockDeviceFetcher{ + devices: []store.Device{...}, + err: nil, +}) +``` + +What to Mock: +- Database/store interfaces +- External service calls (HTTP, SSH) +- Redis operations + +What NOT to Mock: +- Standard library functions +- Core business logic + +## Fixtures and Factories + +**Frontend Test Data:** + +Approach: Inline test data in test file + +Example from `DeviceList.test.tsx`: +```typescript +const testDevices: DeviceListResponse = { + items: [ + { + id: 'dev-1', + hostname: 'router-office-1', + ip_address: '192.168.1.1', + api_port: 8728, + api_ssl_port: 8729, + model: 'RB4011', + serial_number: 'ABC123', + firmware_version: '7.12', + routeros_version: '7.12.1', + uptime_seconds: 86400, + last_seen: '2026-03-01T12:00:00Z', + latitude: null, + longitude: null, + status: 'online', + }, + ], + total: 1, +} +``` + +**Test Utilities:** + +Location: `frontend/src/test/test-utils.tsx` + +Wrapper with providers: +```typescript +function createTestQueryClient() { + return new QueryClient({ + defaultOptions: { + queries: { retry: false, gcTime: 0, staleTime: 0 }, + mutations: { retry: false }, + }, + }) +} + +export function renderWithProviders( + ui: React.ReactElement, + options?: Omit<RenderOptions, 'wrapper'> +) { + const queryClient = createTestQueryClient() + + function Wrapper({ 
children }: WrapperProps) { + return ( + <QueryClientProvider client={queryClient}> + {children} + </QueryClientProvider> + ) + } + + return { + ...render(ui, { wrapper: Wrapper, ...options }), + queryClient, + } +} + +export { renderWithProviders as render } +``` + +Usage: Import `render` from test-utils; it wraps the component under test in a React Query provider + +**Backend Fixtures:** + +Location: `backend/tests/conftest.py` (unit), `backend/tests/integration/conftest.py` (integration) + +Base conftest: +```python +def pytest_configure(config): + """Register custom markers.""" + config.addinivalue_line( + "markers", "integration: marks tests as integration tests requiring PostgreSQL" + ) +``` + +Integration fixtures (in `tests/integration/conftest.py`): +- Database fixtures (SQLAlchemy AsyncSession) +- Redis test instance (testcontainers) +- NATS JetStream test server + +**Go Test Helpers:** + +Location: Helper functions defined in `_test.go` files + +Example from `scheduler_test.go`: +```go +// mockDeviceFetcher implements DeviceFetcher for testing. +type mockDeviceFetcher struct { + devices []store.Device + err error +} + +func (m *mockDeviceFetcher) FetchDevices(ctx context.Context) ([]store.Device, error) { + return m.devices, m.err +} + +// newTestScheduler creates a Scheduler with a mock DeviceFetcher for testing. 
+func newTestScheduler(fetcher DeviceFetcher) *Scheduler { + testCache := vault.NewCredentialCache(64, 5*time.Minute, nil, make([]byte, 32), nil) + return &Scheduler{ + store: fetcher, + locker: nil, + publisher: nil, + credentialCache: testCache, + pollInterval: 24 * time.Hour, + connTimeout: time.Second, + cmdTimeout: time.Second, + refreshPeriod: time.Second, + maxFailures: 5, + baseBackoff: 30 * time.Second, + maxBackoff: 15 * time.Minute, + activeDevices: make(map[string]*deviceState), + } +} +``` + +## Coverage + +**Frontend:** + +Requirements: Not enforced (no threshold in vitest config) + +View Coverage: +```bash +npm run test:coverage +# Generates coverage in frontend/coverage/ directory +``` + +**Backend:** + +Requirements: Not enforced in config (but tracked) + +View Coverage: +```bash +pytest --cov=app --cov-report=term-missing +pytest --cov=app --cov-report=html # Generates htmlcov/index.html +``` + +**Go:** + +Requirements: Not enforced + +View Coverage: +```bash +go test -cover ./... 
+go test -coverprofile=coverage.out ./... # Write the profile that the next command reads +go tool cover -html=coverage.out # Visual report +``` + +## Test Types + +**Frontend Unit Tests:** + +Scope: +- Individual component rendering +- User interactions (click, type) +- Component state changes +- Props and variant rendering + +Approach: +- Render component with test-utils +- Simulate user events with userEvent +- Assert on rendered DOM + +Example from `LoginPage.test.tsx`: +```typescript +it('renders login form with email and password fields', () => { + render(<LoginPage />) + expect(screen.getByLabelText(/email/i)).toBeInTheDocument() + expect(screen.getByLabelText(/password/i)).toBeInTheDocument() +}) + +it('submits form with entered credentials', async () => { + mockLogin.mockResolvedValueOnce(undefined) + render(<LoginPage />) + const user = userEvent.setup() + await user.type(screen.getByLabelText(/email/i), 'admin@example.com') + await user.type(screen.getByLabelText(/password/i), 'secret123') + await user.click(screen.getByRole('button', { name: /sign in/i })) + await waitFor(() => { + expect(mockLogin).toHaveBeenCalledWith('admin@example.com', 'secret123') + }) +}) +``` + +**Frontend E2E Tests:** + +Framework: Playwright +Config: `frontend/playwright.config.ts` + +Approach: +- Launch real browser +- Navigate through app +- Test full user journeys +- Sequential execution (no parallelization) for stability + +Config highlights: +```typescript +fullyParallel: false, // Run sequentially for stability +workers: 1, // Single worker +timeout: 30000, // 30 second timeout per test +retries: process.env.CI ? 2 : 0, // Retry in CI +``` + +Location: `frontend/tests/e2e/` (referenced in playwright config) + +**Backend Unit Tests:** + +Scope: +- Pure function behavior (hash_password, verify_token) +- Service methods without database +- Validation logic + +Approach: +- No async/await needed unless using mocking +- Direct function calls +- Assert on return values + +Example from `test_auth.py`: +```python +class TestPasswordHashing: + def test_hash_returns_different_string(self): + password = "test-password-123!" 
+ hashed = hash_password(password) + assert hashed != password + + def test_hash_verify_roundtrip(self): + password = "test-password-123!" + hashed = hash_password(password) + assert verify_password(password, hashed) is True +``` + +**Backend Integration Tests:** + +Scope: +- Full request/response cycle +- Database operations with fixtures +- External service interactions (Redis, NATS) + +Approach: +- Marked with `@pytest.mark.integration` +- Use async fixtures for database +- Skip with `-m "not integration"` in CI (slow) + +Location: `backend/tests/integration/` + +Example: +```python +@pytest.mark.integration +async def test_login_creates_session(async_db, client): + # Creates user in test database + # Posts to /api/auth/login + # Asserts JWT tokens in response + pass +``` + +**Go Tests:** + +Scope: Unit tests for individual functions, integration tests for subsystems + +Unit test example: +```go +func TestReconcileDevices_StartsNewDevices(t *testing.T) { + devices := []store.Device{...} + fetcher := &mockDeviceFetcher{devices: devices} + sched := newTestScheduler(fetcher) + + var wg sync.WaitGroup + ctx, cancel := context.WithCancel(context.Background()) + defer cancel() + + err := sched.reconcileDevices(ctx, &wg) + require.NoError(t, err) + + sched.mu.Lock() + assert.Len(t, sched.activeDevices, 2) + sched.mu.Unlock() + + cancel() + wg.Wait() +} +``` + +Integration test: Uses testcontainers for PostgreSQL, Redis, NATS (e.g., `integration_test.go`) + +## Common Patterns + +**Async Testing (Frontend):** + +Pattern for testing async operations: +```typescript +it('navigates to home on successful login', async () => { + mockLogin.mockResolvedValueOnce(undefined) + + render(<LoginPage />) + + const user = userEvent.setup() + await user.type(screen.getByLabelText(/email/i), 'admin@example.com') + await user.type(screen.getByLabelText(/password/i), 'secret123') + await user.click(screen.getByRole('button', { name: /sign in/i })) + + await waitFor(() => { + 
expect(mockNavigate).toHaveBeenCalledWith({ to: '/' }) + }) +}) +``` + +- Use `userEvent.setup()` for user interactions +- Use `await waitFor()` for assertions on async results +- Mock promises with `mockFn.mockResolvedValueOnce()` or `mockRejectedValueOnce()` + +**Error Testing (Frontend):** + +Pattern for testing error states: +```typescript +it('shows error message on failed login', async () => { + mockLogin.mockRejectedValueOnce(new Error('Invalid credentials')) + authState.error = null + + render(<LoginPage />) + const user = userEvent.setup() + await user.type(screen.getByLabelText(/email/i), 'test@example.com') + await user.type(screen.getByLabelText(/password/i), 'wrongpassword') + await user.click(screen.getByRole('button', { name: /sign in/i })) + + authState.error = 'Invalid credentials' + render(<LoginPage />) + + expect(screen.getByText('Invalid credentials')).toBeInTheDocument() +}) +``` + +**Async Testing (Backend):** + +Pattern for async pytest: +```python +@pytest.mark.asyncio +async def test_get_redis(): + redis = await get_redis() + assert redis is not None +``` + +Configure in `pyproject.toml`: `asyncio_mode = "auto"` (enabled globally, so the explicit `@pytest.mark.asyncio` marker is optional) + +**Error Testing (Backend):** + +Pattern for testing exceptions: +```python +def test_verify_token_rejects_expired(): + token = create_access_token(user_id=uuid4(), expires_delta=timedelta(seconds=-1)) + with pytest.raises(HTTPException) as exc_info: + verify_token(token, expected_type="access") + assert exc_info.value.status_code == 401 +``` + +--- + +*Testing analysis: 2026-03-12*
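
**Async Mocking Sketch (Backend):**

The backend guidance above says to mock Redis operations and to write async tests, but does not show the two together. The following is a minimal sketch of one way to do that with the standard library's `AsyncMock`. `SessionService`, `create_session`, and the `session:{user_id}` key format are hypothetical names invented for illustration; they are not taken from the codebase.

```python
import asyncio
from unittest.mock import AsyncMock


class SessionService:
    """Hypothetical service that stores sessions via an async Redis-like client."""

    def __init__(self, redis):
        self.redis = redis

    async def create_session(self, user_id: str, token: str) -> bool:
        # Store the token under a per-user key with a one-hour expiry.
        await self.redis.set(f"session:{user_id}", token, ex=3600)
        return True


async def run_example() -> AsyncMock:
    # AsyncMock makes every attribute (e.g. .set) an awaitable mock,
    # so no real Redis connection is needed.
    fake_redis = AsyncMock()
    service = SessionService(fake_redis)

    created = await service.create_session("user-1", "jwt-token")

    assert created is True
    fake_redis.set.assert_awaited_once_with("session:user-1", "jwt-token", ex=3600)
    return fake_redis


# Outside pytest, drive the coroutine with asyncio.run.
fake = asyncio.run(run_example())
```

Under pytest with `asyncio_mode = "auto"`, the body of `run_example` could instead live directly in an `async def test_...` function, and the `asyncio.run` call would not be needed.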