diff --git a/.planning/PROJECT.md b/.planning/PROJECT.md deleted file mode 100644 index e57947e..0000000 --- a/.planning/PROJECT.md +++ /dev/null @@ -1,75 +0,0 @@ -# RouterOS Config Backup & Change Tracking (v9.6) - -## What This Is - -Automated RouterOS configuration backup and human-readable change tracking for TOD (The Other Dude). Periodically collects router configurations via SSH, stores versioned snapshots, generates diffs, and presents a change timeline in the device UI. Applies to RouterOS devices only. - -## Core Value - -Operators can see exactly what changed on a router and when, with reliable config snapshots available for download — visibility into network changes that would otherwise go unnoticed. - -## Requirements - -### Validated - - - -- ✓ Multi-tenant device management — existing -- ✓ Poller-based device monitoring via SSH — existing -- ✓ NATS message bus for poller↔API communication — existing -- ✓ Credential management with OpenBao Transit encryption — existing -- ✓ FastAPI backend with RBAC (viewer/operator/admin/super_admin) — existing -- ✓ React frontend with device detail pages — existing -- ✓ Remote access (SSH/WinBox tunneling) — existing (v9.5) - -### Active - -- [ ] Periodic config collection via SSH `/export show-sensitive` -- [ ] Manual backup trigger via API -- [ ] Config snapshot storage with SHA256 deduplication -- [ ] Unified diff generation between consecutive snapshots -- [ ] Structured change parsing (component, summary, raw line) -- [ ] Config history timeline API endpoints -- [ ] Full snapshot view/download API -- [ ] Configuration History section in device UI -- [ ] Timeline with change summaries and diff viewer -- [ ] Snapshot download as `.rsc` file -- [ ] RBAC: operator+ can trigger backups, viewers can read history -- [ ] Audit logging for snapshot/diff/trigger events -- [ ] 90-day retention with automatic cleanup -- [ ] Config text normalization (whitespace, timestamps, line endings) - -### Out of Scope - -- Config restore via UI — deferred to future version per spec -- Non-RouterOS device backup — spec explicitly scopes to RouterOS only -- Real-time config change detection — polling-based, not event-driven - -## Context - -- Poller is Go, runs SSH sessions to RouterOS devices, publishes to NATS -- Backend is Python/FastAPI with SQLAlchemy + Alembic migrations on PostgreSQL -- Frontend is React with TanStack Query, component library in `frontend/src/components/` -- Existing credential flow: poller requests creds from cache, decrypted via OpenBao Transit -- NATS subjects follow `{domain}.{entity}.{action}` pattern -- Device detail page already has Metrics and Remote Access sections - -## Constraints - -- **Tech stack**: Must use existing Go poller, Python backend, React frontend — no new services -- **Security**: Snapshots contain sensitive credentials (`show-sensitive`), must be encrypted at rest and RBAC-gated -- **NATS**: Config snapshots flow through NATS subject `config.snapshot.create` -- **Database**: New tables via Alembic migrations on existing PostgreSQL - -## Key Decisions - -| Decision | Rationale | Outcome | -|----------|-----------|---------| -| SSH `/export show-sensitive` for collection | Captures full config including secrets needed for restore | — Pending | -| SHA256 hash deduplication | Avoid storing identical configs, skip unnecessary diffs | — Pending | -| Unified diff format | Standard, well-understood, renderable in UI | — Pending | -| 6-hour default interval | Balance between freshness and SSH overhead | — Pending | -| NATS for poller→API transport | Consistent with existing poller architecture | — Pending | - ---- -*Last updated: 2026-03-12 after initialization* diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md deleted file mode 100644 index 148ecf3..0000000 --- a/.planning/REQUIREMENTS.md +++ /dev/null @@ -1,104 +0,0 @@ -# Requirements: RouterOS Config Backup & Change Tracking - -**Defined:** 2026-03-12 -**Core Value:** Operators can see exactly what changed on a router and when, with reliable config snapshots for download - -## v1 Requirements - -### Collection - -- [x] **COLL-01**: Poller collects RouterOS config via SSH `/export show-sensitive` on a configurable interval (default 6h) -- [x] **COLL-02**: Poller normalizes config output (trim whitespace, normalize line endings, remove timestamp headers) -- [x] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create` -- [x] **COLL-04**: Manual backup trigger via POST `/api/tenants/{tenant_id}/devices/{device_id}/backup` -- [x] **COLL-05**: Unreachable routers log warning and retry next interval -- [x] **COLL-06**: Collection interval configurable via `CONFIG_BACKUP_INTERVAL` environment variable - -### Storage - -- [x] **STOR-01**: API stores config snapshots in `router_config_snapshots` table with SHA256 hash -- [x] **STOR-02**: Duplicate snapshots (same hash as previous) are skipped, no diff generated -- [x] **STOR-03**: Snapshots retained for 90 days (configurable via `CONFIG_RETENTION_DAYS`) -- [x] **STOR-04**: Older snapshots automatically deleted by retention cleanup -- [x] **STOR-05**: Snapshots encrypted at rest, accessible only through RBAC - -### Diff & Parsing - -- [x] **DIFF-01**: Unified diff generated when new snapshot differs from previous -- [x] **DIFF-02**: Diffs stored in `router_config_diffs` table linking snapshot pairs -- [x] **DIFF-03**: Structured change parser extracts component, summary, and raw line as JSON -- [x] **DIFF-04**: Parsed changes stored in `router_config_changes` table - -### API - -- [x] **API-01**: GET `/api/tenants/{tid}/devices/{did}/config-history` returns change timeline -- [x] **API-02**: GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}` returns full snapshot -- [x] **API-03**: GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff` returns unified diff -- [x] **API-04**: RBAC enforced: operator+ can trigger backups, viewers can read history - -### Frontend - -- [x] **UI-01**: Device page shows Configuration History section below Remote Access -- [x] **UI-02**: Timeline displays change entries with component, summary, and timestamp -- [x] **UI-03**: Diff viewer shows unified diff with add/remove highlighting -- [x] **UI-04**: User can download snapshot as `router-{device_name}-{timestamp}.rsc` - -### Observability - -- [x] **OBS-01**: Audit events logged: `config_snapshot_created`, `config_snapshot_skipped_duplicate` -- [x] **OBS-02**: Audit events logged: `config_diff_generated`, `config_backup_manual_trigger` - -## v2 Requirements - -### Restore - -- **REST-01**: User can restore a config snapshot to a router via SSH -- **REST-02**: Restore confirmation dialog with diff preview - -## Out of Scope - -| Feature | Reason | -|---------|--------| -| Config restore | Explicitly deferred per v9.6 spec | -| Non-RouterOS device backup | Spec scopes to RouterOS only initially | -| Real-time change detection | Polling-based by design, not event-driven | -| Config comparison between arbitrary snapshots | Only consecutive snapshot diffs in v1 | - -## Traceability - -| Requirement | Phase | Status | -|-------------|-------|--------| -| COLL-01 | Phase 2: Poller Config Collection | Complete | -| COLL-02 | Phase 2: Poller Config Collection | Complete | -| COLL-03 | Phase 2: Poller Config Collection | Complete | -| COLL-04 | Phase 4: Manual Backup Trigger | Complete | -| COLL-05 | Phase 2: Poller Config Collection | Complete | -| COLL-06 | Phase 2: Poller Config Collection | Complete | -| STOR-01 | Phase 1: Database Schema | Complete | -| STOR-02 | Phase 3: Snapshot Ingestion | Complete | -| STOR-03 | Phase 9: Retention & Cleanup | Complete | -| STOR-04 | Phase 9: Retention & Cleanup | Complete | -| STOR-05 | Phase 1: Database Schema | Complete | -| DIFF-01 | Phase 5: Diff Engine | Complete | -| DIFF-02 | Phase 5: Diff Engine | Complete | -| DIFF-03 | Phase 5: Diff Engine | Complete | -| DIFF-04 | Phase 5: Diff Engine | Complete | -| API-01 | Phase 6: History API | Complete | -| API-02 | Phase 6: History API | Complete | -| API-03 | Phase 6: History API | Complete | -| API-04 | Phase 6: History API | Complete | -| UI-01 | Phase 7: Config History UI | Complete | -| UI-02 | Phase 7: Config History UI | Complete | -| UI-03 | Phase 8: Diff Viewer & Download | Complete | -| UI-04 | Phase 8: Diff Viewer & Download | Complete | -| OBS-01 | Phase 10: Audit & Observability | Complete | -| OBS-02 | Phase 10: Audit & Observability | Complete | - -**Coverage:** -- v1 requirements: 25 total -- Mapped to phases: 25 -- Unmapped: 0 - ---- -*Requirements defined: 2026-03-12* -*Last updated: 2026-03-12 after roadmap creation* diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md deleted file mode 100644 index d87c018..0000000 --- a/.planning/ROADMAP.md +++ /dev/null @@ -1,186 +0,0 @@ -# Roadmap: RouterOS Config Backup & Change Tracking (v9.6) - -## Overview - -This roadmap delivers automated RouterOS configuration backup and change tracking as a new feature within the existing TOD platform. Work flows from database schema through the Go poller (collection), Python backend (storage, diffing, API), and React frontend (timeline, diff viewer, download). Each phase delivers a verifiable layer that the next phase builds on, culminating in a complete config history workflow with retention management and audit logging. - -## Phases - -**Phase Numbering:** -- Integer phases (1, 2, 3): Planned milestone work -- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED) - -Decimal phases appear between their surrounding integers in numeric order. - -- [x] **Phase 1: Database Schema** - Config snapshot, diff, and change tables with encryption and RLS (completed 2026-03-13) -- [x] **Phase 2: Poller Config Collection** - SSH export, normalization, and NATS publishing from Go poller (completed 2026-03-13) -- [ ] **Phase 3: Snapshot Ingestion** - Backend NATS subscriber stores snapshots with SHA256 deduplication -- [x] **Phase 4: Manual Backup Trigger** - API endpoint for on-demand config backup via poller (completed 2026-03-13) -- [x] **Phase 5: Diff Engine** - Unified diff generation and structured change parsing (completed 2026-03-13) -- [x] **Phase 6: History API** - REST endpoints for timeline, snapshot view, and diff retrieval with RBAC (completed 2026-03-13) -- [x] **Phase 7: Config History UI** - Timeline section on device page with change summaries (completed 2026-03-13) -- [ ] **Phase 8: Diff Viewer & Download** - Unified diff display with syntax highlighting and .rsc download -- [x] **Phase 9: Retention & Cleanup** - 90-day retention policy with automatic snapshot deletion (completed 2026-03-13) -- [x] **Phase 10: Audit & Observability** - Audit event logging for all config backup operations (completed 2026-03-13) - -## Phase Details - -### Phase 1: Database Schema -**Goal**: Database tables exist to store config snapshots, diffs, and parsed changes with proper multi-tenant isolation and encryption -**Depends on**: Nothing (first phase) -**Requirements**: STOR-01, STOR-05 -**Success Criteria** (what must be TRUE): - 1. Alembic migration creates `router_config_snapshots`, `router_config_diffs`, and `router_config_changes` tables - 2. All tables include `tenant_id` with RLS policies enforcing tenant isolation - 3. Snapshot config_text column is encrypted at rest (field-level encryption via existing credential pattern) - 4. SQLAlchemy models exist and can be imported by services -**Plans**: 1 plan - -Plans: -- [ ] 01-01-PLAN.md — Alembic migration and SQLAlchemy models for config backup tables - -### Phase 2: Poller Config Collection -**Goal**: Go poller periodically connects to RouterOS devices via SSH, exports config, normalizes output, and publishes to NATS -**Depends on**: Phase 1 -**Requirements**: COLL-01, COLL-02, COLL-03, COLL-05, COLL-06 -**Success Criteria** (what must be TRUE): - 1. Poller runs `/export show-sensitive` via SSH on each RouterOS device at a configurable interval (default 6h) - 2. Config output is normalized (timestamps stripped, whitespace trimmed, line endings unified) before publishing - 3. Poller publishes config snapshot payload to NATS subject `config.snapshot.create` with device_id and tenant_id - 4. Unreachable devices log a warning and are retried on the next interval without blocking other devices - 5. Interval is configurable via `CONFIG_BACKUP_INTERVAL` environment variable -**Plans**: 2 plans - -Plans: -- [ ] 02-01-PLAN.md — SSH executor, config normalizer, env vars, NATS event type, device model extensions, Alembic migration -- [ ] 02-02-PLAN.md — Backup scheduler with per-device goroutines, concurrency control, retry logic, and main.go wiring - -### Phase 3: Snapshot Ingestion -**Goal**: Backend receives config snapshots from NATS, encrypts via Transit, deduplicates by SHA256, and stores new snapshots -**Depends on**: Phase 1, Phase 2 -**Requirements**: STOR-02 -**Success Criteria** (what must be TRUE): - 1. Backend NATS subscriber consumes `config.snapshot.create` messages and persists snapshots to `router_config_snapshots` - 2. When a snapshot has the same SHA256 hash as the device's most recent snapshot, it is skipped (no new row, no diff) - 3. Each stored snapshot includes device_id, tenant_id, config_text (encrypted), sha256_hash, and collected_at timestamp -**Plans**: 1 plan - -Plans: -- [ ] 03-01-PLAN.md — NATS subscriber for config snapshot ingestion with dedup, encryption, and main.py wiring - -### Phase 4: Manual Backup Trigger -**Goal**: Operators can trigger an immediate config backup for a specific device through the API -**Depends on**: Phase 2, Phase 3 -**Requirements**: COLL-04 -**Success Criteria** (what must be TRUE): - 1. POST `/api/tenants/{tenant_id}/devices/{device_id}/backup` triggers an immediate config collection for the specified device - 2. The triggered backup flows through the same collection and ingestion pipeline as scheduled backups - 3. Endpoint requires operator role or higher (viewers cannot trigger) -**Plans**: 1 plan - -Plans: -- [ ] 04-01-PLAN.md — Go BackupResponder (NATS request-reply) + Python API trigger endpoint - -### Phase 5: Diff Engine -**Goal**: When a new (non-duplicate) snapshot is stored, the system generates a unified diff against the previous snapshot and parses structured changes -**Depends on**: Phase 3 -**Requirements**: DIFF-01, DIFF-02, DIFF-03, DIFF-04 -**Success Criteria** (what must be TRUE): - 1. Unified diff is generated between consecutive snapshots when config content differs - 2. Diff is stored in `router_config_diffs` linking the two snapshot IDs - 3. Structured change parser extracts component name, human-readable summary, and raw diff line for each change - 4. Parsed changes are stored in `router_config_changes` as JSON-structured records -**Plans**: 2 plans - -Plans: -- [ ] 05-01-PLAN.md — Unified diff generation service with Transit decrypt and subscriber integration -- [ ] 05-02-PLAN.md — Structured change parser extracting components and summaries from diffs - -### Phase 6: History API -**Goal**: Frontend can query config change timeline, retrieve full snapshots, and view diffs through RBAC-protected endpoints -**Depends on**: Phase 5 -**Requirements**: API-01, API-02, API-03, API-04 -**Success Criteria** (what must be TRUE): - 1. GET `/api/tenants/{tid}/devices/{did}/config-history` returns paginated change timeline with component, summary, and timestamp - 2. GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}` returns full snapshot content - 3. GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff` returns unified diff text - 4. All endpoints enforce RBAC: viewer+ can read history, operator+ required for backup trigger - 5. Endpoints return proper 404 for nonexistent snapshots and 403 for unauthorized access -**Plans**: 2 plans - -Plans: -- [ ] 06-01-PLAN.md — Config history timeline endpoint with service, router, and tests -- [ ] 06-02-PLAN.md — Snapshot view and diff retrieval endpoints with Transit decrypt and RBAC - -### Phase 7: Config History UI -**Goal**: Device detail page displays a Configuration History section showing a timeline of config changes -**Depends on**: Phase 6 -**Requirements**: UI-01, UI-02 -**Success Criteria** (what must be TRUE): - 1. Device detail page shows a "Configuration History" section below the Remote Access section - 2. Timeline displays change entries with component badge, summary text, and relative timestamp - 3. Timeline loads via TanStack Query and shows loading/empty states appropriately -**Plans**: 1 plan - -Plans: -- [ ] 07-01-PLAN.md — API client, ConfigHistorySection component, and device detail page wiring - -### Phase 8: Diff Viewer & Download -**Goal**: Users can view unified diffs with syntax highlighting and download any snapshot as a .rsc file -**Depends on**: Phase 7 -**Requirements**: UI-03, UI-04 -**Success Criteria** (what must be TRUE): - 1. Clicking a timeline entry opens a diff viewer showing unified diff with add (green) / remove (red) line highlighting - 2. User can download any snapshot as `router-{device_name}-{timestamp}.rsc` file - 3. Diff viewer handles large configs without performance degradation -**Plans**: 2 plans - -Plans: -- [ ] 08-01-PLAN.md — Unified diff viewer component with syntax highlighting and clickable timeline entries -- [ ] 08-02-PLAN.md — Snapshot download as .rsc file with download button on timeline entries - -### Phase 9: Retention & Cleanup -**Goal**: Snapshots older than the retention period are automatically cleaned up, keeping storage bounded -**Depends on**: Phase 3 -**Requirements**: STOR-03, STOR-04 -**Success Criteria** (what must be TRUE): - 1. Snapshots older than 90 days (default) are automatically deleted along with their associated diffs and changes - 2. Retention period is configurable via `CONFIG_RETENTION_DAYS` environment variable - 3. Cleanup runs on a scheduled interval without blocking normal operations -**Plans**: 1 plan - -Plans: -- [ ] 09-01-PLAN.md — Retention cleanup service with APScheduler, configurable retention period, and cascading deletion - -### Phase 10: Audit & Observability -**Goal**: All config backup operations are logged as audit events for compliance and troubleshooting -**Depends on**: Phase 3, Phase 4, Phase 5 -**Requirements**: OBS-01, OBS-02 -**Success Criteria** (what must be TRUE): - 1. `config_snapshot_created` audit event logged when a new snapshot is stored - 2. `config_snapshot_skipped_duplicate` audit event logged when a duplicate snapshot is detected - 3. `config_diff_generated` audit event logged when a diff is created between snapshots - 4. `config_backup_manual_trigger` audit event logged when an operator triggers a manual backup -**Plans**: 1 plan - -Plans: -- [ ] 10-01-PLAN.md — Audit event emission for all config backup operations - -## Progress - -**Execution Order:** -Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10 -Note: Phase 9 depends only on Phase 3 and Phase 10 depends on Phases 3/4/5, so Phases 9 and 10 can execute in parallel with Phases 6-8 if desired. - -| Phase | Plans Complete | Status | Completed | -|-------|----------------|--------|-----------| -| 1. Database Schema | 1/1 | Complete | 2026-03-13 | -| 2. Poller Config Collection | 2/2 | Complete | 2026-03-13 | -| 3. Snapshot Ingestion | 0/1 | Not started | - | -| 4. Manual Backup Trigger | 1/1 | Complete | 2026-03-13 | -| 5. Diff Engine | 2/2 | Complete | 2026-03-13 | -| 6. History API | 2/2 | Complete | 2026-03-13 | -| 7. Config History UI | 1/1 | Complete | 2026-03-13 | -| 8. Diff Viewer & Download | 1/2 | In Progress| | -| 9. Retention & Cleanup | 1/1 | Complete | 2026-03-13 | -| 10. Audit & Observability | 1/1 | Complete | 2026-03-13 | diff --git a/.planning/STATE.md b/.planning/STATE.md deleted file mode 100644 index dd19a77..0000000 --- a/.planning/STATE.md +++ /dev/null @@ -1,116 +0,0 @@ ---- -gsd_state_version: 1.0 -milestone: v9.6 -milestone_name: milestone -status: completed -stopped_at: Completed 10-01-PLAN.md -last_updated: "2026-03-13T04:46:04Z" -last_activity: 2026-03-13 -- Completed 10-01 config backup audit events -progress: - total_phases: 10 - completed_phases: 10 - total_plans: 14 - completed_plans: 14 - percent: 100 ---- - -# Project State - -## Project Reference - -See: .planning/PROJECT.md (updated 2026-03-12) - -**Core value:** Operators can see exactly what changed on a router and when, with reliable config snapshots for download -**Current focus:** Phase 10: Audit & Observability -- COMPLETE - -## Current Position - -Phase: 10 of 10 (Audit & Observability) -- COMPLETE -Plan: 1 of 1 in current phase -Status: Phase 10 complete -Last activity: 2026-03-13 -- Completed 10-01 config backup audit events - -Progress: [██████████] 100% - -## Performance Metrics - -**Velocity:** -- Total plans completed: 5 -- Average duration: 5min -- Total execution time: 0.38 hours - -**By Phase:** - -| Phase | Plans | Total | Avg/Plan | -|-------|-------|-------|----------| -| 01-database-schema | 1 | 3min | 3min | -| 02-poller-config-collection | 2 | 9min | 4.5min | -| 03-snapshot-ingestion | 1 | 4min | 4min | -| 04-manual-backup-trigger | 1 | 7min | 7min | - -**Recent Trend:** -- Last 5 plans: 3min, 4min, 5min, 4min, 7min -- Trend: stable - -*Updated after each plan completion* -| Phase 05 P01 | 3min | 2 tasks | 4 files | -| Phase 05 P02 | 2min | 2 tasks | 4 files | -| Phase 06 P01 | 2min | 2 tasks | 4 files | -| Phase 06 P02 | 2min | 2 tasks | 3 files | -| Phase 07 P01 | 3min | 2 tasks | 3 files | -| Phase 08 P01 | 1min | 2 tasks | 3 files | -| Phase 08 P02 | 1min | 1 tasks | 3 files | -| Phase 09 P01 | 2min | 2 tasks | 4 files | -| Phase 10 P01 | 3min | 2 tasks | 4 files | - -## Accumulated Context - -### Decisions - -Decisions are logged in PROJECT.md Key Decisions table. -Recent decisions affecting current work: - -- [01-01] Models added to existing config_backup.py (same domain, consistent pattern) -- [01-01] config_text stores Transit ciphertext (vault:v1:...), plaintext never in DB -- [01-01] sha256_hash is of plaintext config for deduplication without decryption -- [02-01] TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey)) -- [02-01] NormalizationVersion=1 constant in NATS payloads for future re-processing -- [02-01] UpdateSSHHostKey uses COALESCE on first_seen to preserve original observation time -- [02-02] BackupScheduler runs independently from status poll scheduler with separate goroutines -- [02-02] Buffered channel semaphore for concurrency control (Go idiom, no external deps) -- [02-02] Devices with no Redis status key assumed potentially online for first backup -- [Phase 03]: Trust poller-provided SHA256 hash (no recompute on backend) -- [Phase 03]: Transit failure causes nak (NATS retry), plaintext never stored as fallback -- [Phase 04]: Interface-based DI (BackupExecutor, BackupLocker, DeviceGetter) for BackupResponder testability -- [Phase 04]: collectAndPublish refactored to return (hash, error) with public CollectAndPublish wrapper -- [Phase 04]: In-process nats-server/v2 for Go unit tests, reused routeros_proxy NATS conn for Python -- [Phase 05]: Diff service instantiates own OpenBaoTransitService per-call with close() for clean lifecycle -- [Phase 05]: RETURNING id on snapshot INSERT to capture new_snapshot_id without separate query -- [Phase 05]: Change parser is pure function; DB writes in diff service. RETURNING id on diff INSERT for linking. -- [Phase 06]: Raw SQL text() JOIN for timeline queries, consistent with config_diff_service pattern -- [Phase 06]: Pagination defaults limit=50, offset=0 with FastAPI Query validation (ge=1, le=200) -- [Phase 06]: Transit decrypt in get_snapshot with try/finally for clean openbao lifecycle -- [Phase 06]: 500 error wrapping for Transit decrypt failures in router layer, not service -- [Phase 07]: Reimplemented formatRelativeTime locally in ConfigHistorySection (matches BackupTimeline pattern) -- [Phase 07]: 60s refetchInterval polling for near-real-time config change visibility -- [Phase 08]: DiffViewer rendered inline above timeline (not modal) for context preservation -- [Phase 08]: Line classification function for unified diff: +green, -red, @@blue, ---/+++ muted -- [Phase 08]: Blob URL download pattern consistent with existing exportMyData and auditLogsApi.exportCsv patterns -- [Phase 09]: make_interval(days => :days) for parameterized PostgreSQL interval in retention cleanup -- [Phase 09]: 24h IntervalTrigger with 1h jitter for stagger; AdminAsyncSessionLocal for cross-tenant cleanup -- [Phase 10]: Module-level log_action import in subscriber, inline import in diff service/router for audit events -- [Phase 10]: All audit log_action calls wrapped in try/except Exception: pass (fire-and-forget pattern) - -### Pending Todos - -None yet. - -### Blockers/Concerns - -- OpenBao dev instance loses Transit keys on data wipe -- device creds need re-entry (from project memory, may affect snapshot encryption testing) - -## Session Continuity - -Last session: 2026-03-13T04:46:04Z -Stopped at: Completed 10-01-PLAN.md -Resume file: None diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md deleted file mode 100644 index 385dd09..0000000 --- a/.planning/codebase/ARCHITECTURE.md +++ /dev/null @@ -1,246 +0,0 @@ -# Architecture - -**Analysis Date:** 2026-03-12 - -## Pattern Overview - -**Overall:** Event-driven microservice architecture with asynchronous pub/sub messaging - -**Key Characteristics:** -- Three independent microservices: Go Poller, Python FastAPI Backend, React/TypeScript Frontend -- NATS JetStream as central event bus for all inter-service communication -- PostgreSQL with Row-Level Security (RLS) for multi-tenant isolation at database layer -- Real-time Server-Sent Events (SSE) for frontend event streaming -- Distributed task coordination using Redis distributed locks -- Per-tenant encryption via OpenBao Transit KMS engine - -## Layers - -**Device Polling Layer (Go Poller):** -- Purpose: Connects to RouterOS devices via binary API (port 8729), detects status/version, collects metrics, pushes configs, manages WinBox/SSH tunnels -- Location: `poller/` -- Contains: Device client, scheduler, SSH relay, WinBox tunnel manager, NATS publisher, Redis credential cache, OpenBao vault client -- Depends on: NATS JetStream, Redis, PostgreSQL (read-only for device list), OpenBao -- Used by: Publishes events to backend via NATS - -**Event Bus Layer (NATS JetStream):** -- Purpose: Central publish/subscribe message broker for all service-to-service communication -- Streams: DEVICE_EVENTS, OPERATION_EVENTS, ALERT_EVENTS -- Contains: Device status changes, metrics, config change notifications, push rollback triggers, alert events, session audit events -- All events include device_id and tenant_id for multi-tenant routing - -**Backend API Layer (Python FastAPI):** -- Purpose: RESTful API, business logic, database persistence, event subscription and processing -- Location: `backend/app/` -- Contains: FastAPI routers, SQLAlchemy ORM models, async services, NATS subscribers, middleware (RBAC, tenant context, rate limiting) -- Depends on: PostgreSQL (via RLS-enforced app_user connection), NATS JetStream, Redis, OpenBao, email/webhook services -- Used by: Frontend (REST API), poller (reads device list, writes operation results) - -**Data Persistence Layer (PostgreSQL + TimescaleDB):** -- Purpose: Multi-tenant relational data store with RLS-enforced isolation -- Connection: Two engines in `backend/app/database.py` - - Admin engine (superuser): Migrations, bootstrap, admin operations - - App engine (app_user role): All tenant-scoped API requests, RLS enforced -- Row-Level Security: `SET LOCAL app.current_tenant` set per-request by `get_current_user` dependency -- Contains: Devices, users, tenants, alerts, config backups, templates, VPN peers, certificates, audit logs, metrics aggregates - -**Caching/Locking Layer (Redis):** -- Purpose: Distributed locks (poller prevents duplicate device polls), session management, temporary data -- Usage: `redislock` package in poller for per-device poll coordination across replicas - -**Secret Management Layer (OpenBao):** -- Purpose: Transit KMS for per-tenant envelope encryption, credential storage access control -- Mode: Transit secret engine wrapping credentials for envelope encryption -- Accessed by: Poller (fetch decrypted credentials), backend (re-encrypt on password change) - -**Frontend Layer (React 19 + TanStack):** -- Purpose: Web UI for fleet management, device control, configuration, monitoring -- Location: `frontend/src/` -- Contains: TanStack Router, TanStack Query, Tailwind CSS, SSE event stream integration, WebSocket tunnels -- Depends on: Backend REST API, Server-Sent Events for real-time updates, WebSocket for terminal/remote access -- Entry point: `frontend/src/routes/__root.tsx` (QueryClientProvider, root layout) - -## Data Flow - -**Device Status Polling (Poller → NATS → Backend):** - -1. Poller scheduler periodically fetches device list from PostgreSQL -2. For each device, poller's `Worker` connects to RouterOS binary API (port 8729 TLS) -3. Worker collects device status (online/offline), version, system metrics -4. Worker publishes `DeviceStatusEvent` to NATS stream `DEVICE_EVENTS` topic `device.status.{device_id}` -5. Backend subscribes to `device.status.>` via `nats_subscriber.py` -6. Subscriber updates device record in PostgreSQL via admin session (bypasses RLS) -7. Frontend receives update via SSE subscription to `/api/sse?topics=device_status` - -**Configuration Push (Frontend → Backend → Poller → Router):** - -1. Frontend calls `POST /api/tenants/{tenant_id}/devices/{device_id}/config` with new configuration -2. Backend stores config in PostgreSQL, publishes `ConfigPushEvent` to `OPERATION_EVENTS` -3. Poller subscribes to push operation events, receives config delta -4. Poller connects to device via binary API, executes RouterOS commands (two-phase: backup, apply, verify) -5. On completion, poller publishes `ConfigPushCompletedEvent` to NATS -6. Backend subscriber updates operation record with success/failure -7. Frontend notifies user via SSE - -**Metrics Collection (Poller → NATS → Backend → Frontend):** - -1. Poller collects health metrics (CPU, memory, disk), interface stats, wireless stats per poll cycle -2. Publishes `DeviceMetricsEvent` to `DEVICE_EVENTS` topic `device.metrics.{type}.{device_id}` -3. Backend `metrics_subscriber.py` aggregates into TimescaleDB hypertables -4. Frontend queries `/api/tenants/{tenant_id}/devices/{device_id}/metrics` for graphs -5. Alternatively, frontend SSE stream pushes metric updates for real-time graphs - -**Real-Time Event Streaming (Backend → Frontend via SSE):** - -1. Frontend calls `POST /api/auth/sse-token` to exchange session cookie for short-lived SSE bearer token -2. Token valid for 25 seconds (refreshed every 25 seconds before expiry) -3. Frontend opens EventSource to `/api/sse?topics=device_status,alert_fired,config_push,firmware_progress,metric_update` -4. Backend maintains SSE connections, pushes events from NATS subscribers -5. Reconnection on disconnect with exponential backoff (1s → 30s max) - -**Multi-Tenant Isolation (Request → Middleware → RLS):** - -1. Frontend sends JWT token in Authorization header or httpOnly cookie -2. Backend `tenant_context.py` middleware extracts user from JWT, determines tenant_id -3. Middleware calls `SET LOCAL app.current_tenant = '{tenant_id}'` on the database session -4. All subsequent queries automatically filtered by RLS policy `(tenant_id = current_setting('app.current_tenant'))` -5. Superadmin can re-set tenant context to access any tenant -6. Admin sessions (migrations, NATS subscribers) use superuser connection, handle tenant routing explicitly - -**State Management:** - -- Frontend: TanStack Query for server state (device list, metrics, config), React Context for session/auth state -- Backend: Async SQLAlchemy ORM with automatic transaction management per request -- Poller: In-memory device state map with per-device circuit breaker tracking failures and backoff -- Shared: Redis for distributed locks, NATS for event persistence (JetStream replays) - -## Key Abstractions - -**Device Client (`poller/internal/device/`):** -- Purpose: Binary API communication with RouterOS devices -- Files: `client.go`, `version.go`, `health.go`, `interfaces.go`, `wireless.go`, `firmware.go`, `cert_deploy.go`, `sftp.go` -- Pattern: RouterOS binary API command execution, metric parsing and extraction -- Usage: Worker polls device state and metrics in parallel goroutines - -**Scheduler & Worker (`poller/internal/poller/scheduler.go`, `worker.go`):** -- Purpose: Orchestrate per-device polling goroutines with circuit breaker resilience -- Pattern: Per-device goroutine with Redis distributed locking to prevent duplicate polls across replicas -- Lifecycle: Discover new devices from DB, create goroutine; remove devices, cancel goroutine -- Circuit Breaker: Exponential backoff after N consecutive failures, resets on success - -**NATS Publisher (`poller/internal/bus/publisher.go`):** -- Purpose: Publish typed device events to JetStream streams -- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent -- Each event includes device_id and tenant_id for multi-tenant routing -- Consumers: Backend subscribers, audit logging, alert evaluation - -**Tunnel Manager (`poller/internal/tunnel/manager.go`):** -- Purpose: Manage WinBox TCP tunnels to devices (port-forwarded SOCKS proxies) -- Port pool: Allocate ephemeral local ports for tunnel endpoints -- Pattern: Accept local connections on port, tunnel to device's WinBox port via binary API - -**SSH Relay (`poller/internal/sshrelay/server.go`, `session.go`, `bridge.go`):** -- Purpose: SSH terminal access to RouterOS devices for remote management -- Pattern: SSH server on poller, bridges SSH sessions to RouterOS via binary API terminal protocol -- Authentication: SSH key or password relay from frontend - -**FastAPI Router Pattern (`backend/app/routers/`):** -- Files: `devices.py`, `auth.py`, `alerts.py`, `config_editor.py`, `templates.py`, `metrics.py`, etc. -- Pattern: APIRouter with Depends() for RBAC, tenant context, rate limiting -- All routes tenant-scoped under `/api/tenants/{tenant_id}/...` -- RLS enforcement: Automatic via `SET LOCAL app.current_tenant` in `get_current_user` middleware - -**Async Service Layer (`backend/app/services/`):** -- Purpose: Business logic, database operations, integration with external systems -- Files: `device.py`, `auth.py`, `backup_service.py`, `ca_service.py`, `alert_evaluator.py`, etc. -- Pattern: Async functions using AsyncSession, composable for multiple operations in single transaction -- NATS Integration: Subscribers consume events, services update database accordingly - -**NATS Subscribers (`backend/app/services/*_subscriber.py`):** -- Purpose: Consume events from NATS JetStream, update application state -- Lifecycle: Started/stopped in FastAPI lifespan context manager -- Examples: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware update tracking) -- Pattern: JetStream consumer with durable name, explicit message acking for reliability - -**Frontend Router (`frontend/src/routes/`):** -- Pattern: TanStack Router file-based routing -- Structure: `_authenticated.tsx` (layout for logged-in users), `_authenticated/tenants/$tenantId/devices/...` (device management) -- Entry: `__root.tsx` (QueryClientProvider setup), `_authenticated.tsx` (auth check + layout) - -**Frontend Event Stream Hook (`frontend/src/hooks/useEventStream.ts`):** -- Purpose: Manage SSE connection lifecycle, handle reconnection, parse event payloads -- Pattern: useRef for connection state, setInterval for token refresh, EventSource API -- Callbacks: Per-event-type handlers registered by components -- State: Managed in EventStreamContext for app-wide access - -## Entry Points - -**Poller Binary (`poller/cmd/poller/main.go`):** -- Location: `poller/cmd/poller/main.go` -- Triggers: Docker container start, Kubernetes pod initialization -- Responsibilities: Load config, initialize NATS/Redis/PostgreSQL connections, start scheduler, setup observability (Prometheus metrics, structured logging) -- Config source: Environment variables (see `poller/internal/config/config.go`) - -**Backend API (`backend/app/main.py`):** -- Location: `backend/app/main.py` -- Triggers: Docker container start, uvicorn ASGI server -- Responsibilities: Configure logging, run migrations, bootstrap first admin, start NATS subscribers, setup middleware, register routers -- Lifespan: Async context manager handles startup/shutdown of services -- Health check: `/api/health` endpoint, `/api/readiness` for k8s - -**Frontend Entry (`frontend/src/routes/__root.tsx`):** -- Location: `frontend/src/routes/__root.tsx` -- Triggers: Browser loads app at `/` -- Responsibilities: Wrap app in QueryClientProvider (TanStack Query), setup root error boundary -- Auth flow: Routes under `_authenticated` check JWT token, redirect to login if missing -- Real-time setup: Establish SSE connection via `useEventStream` hook in layout - -## Error Handling - -**Strategy:** Three-tier error handling across services - -**Patterns:** - -- **Poller**: Circuit breaker exponential backoff for device connection failures. Logs all errors to structured JSON with context (device_id, tenant_id, attempt number). Publishes failure events to NATS for alerting. - -- **Backend**: FastAPI exception handlers convert service errors to HTTP responses. RLS violations return 403 Forbidden. Invalid tenant access returns 404. Database errors logged via structlog with request_id middleware for correlation. - -- **Frontend**: TanStack Query retry logic (1 retry by default), error boundaries catch component crashes, toast notifications display user-friendly error messages, RequestID middleware propagates correlation IDs - -## Cross-Cutting Concerns - -**Logging:** -- Poller: `log/slog` with JSON handler, structured fields (service, device_id, tenant_id, operation) -- Backend: `structlog` with async logger, JSON output in production -- Frontend: Browser console + error tracking (if configured) - -**Validation:** -- Backend: Pydantic models (`app/schemas/`) enforce request shape and types, custom validators for business logic (e.g., SRP challenge validation) -- Frontend: TanStack Form for client-side validation before submission -- Database: PostgreSQL CHECK constraints and unique indexes - -**Authentication:** -- Zero-knowledge SRP-6a for initial password enrollment (client never sends plaintext) -- JWT tokens issued after SRP enrollment, stored as httpOnly cookies -- Optional API keys with scoped access for programmatic use -- SSE token exchange for event stream access (short-lived, single-use) - -**Authorization (RBAC):** -- Four roles: super_admin (all access), tenant_admin (full tenant access), operator (read+config), viewer (read-only) -- Role hierarchy enforced by `require_role()` dependency in routers -- API key scopes: subset of operator permissions (read, write_device, write_config, etc.) - -**Rate Limiting:** -- Backend: Token bucket limiter on sensitive endpoints (login, token generation, device operations) -- Configuration: `app/middleware/rate_limit.py` defines limits per endpoint -- Redis-backed for distributed rate limit state - -**Multi-Tenancy:** -- Database RLS: All tables have `tenant_id`, policy enforces current_tenant filter -- Tenant context: Middleware extracts from JWT, sets `app.current_tenant` local variable -- Superadmin bypass: Can re-set tenant context to access any tenant -- Admin operations: Use superuser connection, explicit tenant routing - ---- - -*Architecture analysis: 2026-03-12* diff --git a/.planning/codebase/CONCERNS.md b/.planning/codebase/CONCERNS.md deleted file mode 100644 index eed2d42..0000000 --- a/.planning/codebase/CONCERNS.md +++ /dev/null @@ -1,211 +0,0 @@ -# Codebase Concerns - -**Analysis Date:** 2026-03-12 - -## Security Considerations - -**SSH Host Key Verification:** -- Risk: SSH connections skip host key verification using `ssh.InsecureIgnoreHostKey()` -- Files: `poller/internal/sshrelay/server.go:176`, `poller/internal/device/sftp.go:24`, `poller/internal/device/client.go:54-104` -- Current mitigation: RouterOS devices are internal infrastructure; client.go includes fallback strategy with TLS verification as primary mechanism -- Recommendations: Document the security model clearly. For SFTP in particular, consider implementing known_hosts validation or device certificate pinning if devices are externally accessible. Add security audit note to code. - -**TLS Verification Fallback:** -- Risk: When CA-verified TLS fails, automatic fallback to InsecureSkipVerify allows unverified connections (`poller/internal/device/client.go:92-104`) -- Files: `poller/internal/device/client.go` -- Current mitigation: This is intentional for unprovisioned devices; logging is present -- Recommendations: Add metrics to track fallback frequency. Consider implementing a whitelist of devices allowed to use insecure mode. Document operator-facing security implications. - -**SSH Session Count Rate Limiting:** -- Risk: No API-side SSH session count check before issuing tokens; limits only enforced at poller/SSH relay level -- Files: `backend/app/routers/remote_access.py:206-211` -- Current mitigation: WebSocket connect enforces tunnel.session limits per-user, per-device, global on relay side -- Recommendations: Add NATS subject exposing SSH session counts to API. Query before token issuance to provide earlier feedback (429 Too Many Requests). This prevents token waste when client will immediately be rate-limited. - -**Token Validation Security:** -- Risk: Single-use tokens stored in Redis with GETDEL; no IP binding or additional entropy validation beyond token string -- Files: `poller/internal/sshrelay/server.go:106-112`, token creation in `backend/app/routers/remote_access.py` -- Current mitigation: Token is single-use (GETDEL atomically retrieves and deletes). Short TTL (120s typical). Source IP validation present but not bound to token. -- Recommendations: Consider adding token IP binding (store expected source IP in payload, validate match). Add jti (JWT ID) tracking for revocation if needed. - ---- - -## Performance Bottlenecks - -**SSH Relay Idle Loop Polling:** -- Problem: Idle session cleanup uses time-based checks in a goroutine loop -- Files: `poller/internal/sshrelay/server.go:72`, session idling logic in `session.go` -- Cause: Periodic checks for idle sessions (LastActive timestamp) -- Improvement path: Consider using context.WithTimeout or timer channels for each session instead of global loop scanning all sessions. - -**Alert Rule Cache Staleness:** -- Problem: Alert rules cached for 60 seconds; maintenance windows for 30 seconds. During cache TTL, rule changes don't take effect immediately -- Files: `backend/app/services/alert_evaluator.py:33-40` -- Cause: In-memory cache to reduce DB queries on every metric evaluation (high frequency) -- Improvement path: Publish cache invalidation events to NATS when rules/windows change. Subscribers clear cache immediately rather than waiting for TTL. Current approach acceptable for non-critical alerts but documented assumption needed. - -**Large Router File Handling:** -- Problem: Alert evaluator aggregates metrics from all interfaces/wireless stations; no limits on result set size -- Files: `backend/app/services/alert_evaluator.py:180-212` -- Cause: Loop processes all returned metric rows without pagination or limits -- Improvement path: Add configurable max result limits. For high-interface-count devices (200+ interfaces), consider pre-aggregation or sampling. - -**N+1 Query Avoidance (Addressed):** -- Status: Already acknowledged in code comment at `backend/app/routers/metrics.py:404` -- Current approach: Metrics API uses bulk queries to avoid per-tenant loops -- No action needed - ---- - -## Tech Debt - -**Bandwidth Alerting Not Implemented:** -- Issue: Interface bandwidth alerting (rx_bps/tx_bps) requires computing delta between consecutive poll values -- Files: `backend/app/services/alert_evaluator.py:208-210` -- Impact: Alert rules table supports these metric types but evaluation is skipped; users cannot create rx_bps/tx_bps alerts -- Fix approach: Implement state tracking in Redis. Store previous poll value for each device:interface. On next poll, compute delta and evaluate against alert thresholds. Handle device offline/online transitions to avoid false alerts. - -**Global Redis/NATS Clients in Routers:** -- Issue: Multiple routers use module-level `global` statements to manage Redis and NATS client references -- Files: `backend/app/routers/auth.py:97`, `backend/app/routers/certificates.py:63`, `backend/app/routers/remote_access.py:50,58`, `backend/app/routers/sse.py:32`, `backend/app/routers/topology.py:50` -- Impact: Makes testing harder, hidden dependencies, potential race conditions on initialization -- Fix approach: Create a dependency injection container or use FastAPI's lifespan context manager (>=0.93) to manage client lifecycle. Pass clients as dependencies to router functions rather than global state. - -**SSH Session Publishing (NATS Wiring):** -- Issue: Code for publishing audit event on session end is present but not wired to NATS -- Files: `docs/superpowers/plans/2026-03-12-remote-access.md:1381` -- Impact: SSH session end events not tracked in audit logs; incomplete audit trail -- Fix approach: Wire the NATS publisher call in remote_access router. Create corresponding NATS subject consumer to record session end events. - -**Bare Exception Handling (Sparse):** -- Status: Codebase mostly avoids bare `except:` blocks; 56 linting suppressions (#pylint, #noqa, #type: ignore) present -- Files: Across backend Python code -- Impact: Controlled suppression use suggests deliberate choices; not a systemic problem -- Recommendation: Continue current practice; document why suppressions are needed when adding new ones. - ---- - -## Fragile Areas - -**SSH Relay Concurrent Session Management:** -- Files: `poller/internal/sshrelay/server.go:40-46` (sessions map), `poller/internal/sshrelay/server.go:114-118` (limit checks) -- Why fragile: Lock held during entire limit check; concurrent requests during peer limit transitions could temporarily exceed limits. Map access requires lock coordination. -- Safe modification: When adding session limits, ensure mutex is held for entire check+add operation. Consider using sync.Cond for blocked requests. Write tests for race conditions under high concurrency. -- Test coverage: Lock coverage appears adequate; consider adding stress test with sustained concurrent connect attempts exceeding limits. - -**Tunnel Port Pool Allocation:** -- Files: `poller/internal/tunnel/portpool.go`, `poller/internal/tunnel/manager.go:68-71` -- Why fragile: Port release timing; if tunnel closes between allocation and listener bind, port stays allocated. No automatic reaper. -- Safe modification: Ensure Release() is always called on error paths. Consider adding timeout-based port recovery (if unused for N seconds, auto-reclaim). Write integration test that exercises all error paths. -- Test coverage: portpool_test.go exists; verify boundary conditions (empty pool, full pool, Release before Allocate). - -**Vault Credential Cache Concurrency:** -- Files: `poller/internal/vault/cache.go:162` (timeout context creation) -- Why fragile: Cache uses module-level state; concurrent credential requests during cache miss trigger multiple Transit key operations -- Safe modification: Cache hit must be idempotent. For cache misses, consider request deduplication (one in-flight per device, others wait). Add metrics to track cache hit/miss/error rates. -- Test coverage: Need integration test for concurrent cache misses on same device. - -**Device Store Context Handling:** -- Files: `poller/internal/store/devices.go:77,133` (Query/QueryRow with context) -- Why fragile: If context cancels mid-query, result state is undefined. No timeout enforcement at DB level. -- Safe modification: Always pair Query/QueryRow with a timeout context. Test context cancellation scenarios. Add slog.Error on context timeout vs actual DB error. - ---- - -## Scaling Limits - -**Redis Single Instance (Assumed):** -- Current capacity: Limited by single Redis instance throughput -- Limit: Under high device poll rates (1000+ devices, 10s polls), Redis lock contention and breach counter updates become bottleneck -- Scaling path: Migrate to Redis Cluster for distributed locking and key sharding. Update distributed lock client library if needed. - -**PostgreSQL Connection Pool:** -- Current capacity: Default pool size (likely 5-10 connections) -- Limit: High concurrent tenant queries or bulk exports exhaust connection pool -- Scaling path: Increase pool size based on workload (concurrent route handlers). Add connection pool metrics. Monitor connection wait time. - -**WinBox Tunnel Port Allocation:** -- Current capacity: Configurable port range (e.g., 40000-60000 = 20k ports) -- Limit: On heavily subscribed instances, port exhaustion closes new tunnel requests -- Scaling path: Implement port pool overflow with secondary ranges. Add metrics for port utilization %. Fail gracefully (409 Conflict) when exhausted with clear message. - -**SSH Relay Session Limits:** -- Current capacity: Configurable maxSessions, maxPerUser, maxPerDevice -- Limit: Under DOS, legitimate users blocked by exhausted limits -- Scaling path: Implement adaptive rate limiting (cost per source IP). Add token rate limiting (tokens/minute per IP) before WebSocket upgrade. Monitor breach events and publish alerts. - ---- - -## Known Bugs - -**SSH Relay Pipe Ignores Errors:** -- Symptoms: SSH session may silently fail if StdinPipe/StdoutPipe creation errors -- Files: `poller/internal/sshrelay/server.go:209-211` (ignores error on StderrPipe, StdinPipe, StdoutPipe) -- Trigger: Unusual SSH server behavior or resource exhaustion -- Workaround: Errors are silently ignored; Shell() call will fail later with unclear error -- Fix approach: Check error returns from StdinPipe/StdoutPipe/StderrPipe. Log and close session if pipes fail. - -**Idle Duration Calculation Anomaly:** -- Symptoms: Session.IdleDuration() can return very large (or negative in edge cases) if LastActive is not set before first check -- Files: `poller/internal/sshrelay/session.go:26-28` -- Trigger: Session created but never marked active (LastActive = 0 unix timestamp) -- Workaround: Initialize LastActive in Session constructor -- Fix approach: In Session creation (`server.go` line ~200), set `atomic.StoreInt64(&s.LastActive, time.Now().UnixNano())`. - -**X-Forwarded-For Parsing:** -- Symptoms: If X-Forwarded-For has trailing comma or spaces, source IP extraction may be incorrect -- Files: `poller/internal/sshrelay/server.go:133-136` -- Trigger: Misconfigured proxy or malicious header -- Workaround: Inspect audit logs for unusual source IPs -- Fix approach: Add validation after split: `strings.TrimSpace()` on parts, skip empty entries, validate resulting IP format. - ---- - -## Missing Critical Features - -**SSH Session End Event Publishing:** -- Problem: Audit trail incomplete; sessions start logged but not end -- Blocks: Audit compliance; user session tracking; security incident investigation -- Priority: High - this is a compliance/audit gap - -**Bandwidth Alert Evaluation:** -- Problem: rx_bps/tx_bps metric types in alert rules table but not evaluated -- Blocks: Users cannot create bandwidth-based alerts despite UI suggesting it's possible -- Priority: Medium - feature is partially implemented - -**Device Connection State Observability:** -- Problem: No metrics for device online/offline transition frequency or duration -- Blocks: Operators cannot diagnose intermittent connectivity issues -- Priority: Medium - operational insight would help debugging - ---- - -## Test Coverage Gaps - -**SSH Relay Security Paths:** -- What's not tested: Token validation against tampered or expired tokens; concurrent session limits enforcement under stress; source IP mismatch scenarios -- Files: `poller/internal/sshrelay/server_test.go` -- Risk: Malformed token or token replay attacks could bypass validation -- Priority: High - security-critical path - -**Tunnel Port Pool Exhaustion:** -- What's not tested: Behavior when port pool is exhausted (Allocate returns error); cleanup on listener bind failure -- Files: `poller/internal/tunnel/portpool_test.go`, `poller/internal/tunnel/manager_test.go` -- Risk: Port leaks or silent allocation failures under stress -- Priority: High - affects tunnel availability - -**Alert Evaluator with Maintenance Windows:** -- What's not tested: Cache invalidation on maintenance window updates; concurrent cache access during updates -- Files: `backend/app/services/alert_evaluator.py` -- Risk: Stale maintenance windows suppress alerts unintentionally or too long -- Priority: Medium - affects alert suppression accuracy - -**Device Offline Circuit Breaker:** -- What's not tested: Exponential backoff behavior across scheduler restarts; lock timeout when device is permanently offline -- Files: `poller/internal/poller/scheduler.go`, `poller/internal/poller/worker.go` -- Risk: Hammering offline device with connection attempts or missing it when it comes back online -- Priority: Medium - affects device polling efficiency - ---- - -*Concerns audit: 2026-03-12* diff --git a/.planning/codebase/CONVENTIONS.md b/.planning/codebase/CONVENTIONS.md deleted file mode 100644 index 1510f14..0000000 --- a/.planning/codebase/CONVENTIONS.md +++ /dev/null @@ -1,348 +0,0 @@ -# Coding Conventions - -**Analysis Date:** 2026-03-12 - -## Naming Patterns - -**Files:** -- TypeScript/React: `kebab-case.ts`, `kebab-case.tsx` (e.g., `useShortcut.ts`, `error-boundary.tsx`) -- Python: `snake_case.py` (e.g., `test_auth.py`, `auth_service.py`) -- Go: `snake_case.go` (e.g., `scheduler_test.go`, `main.go`) -- Component files: PascalCase for exported components in UI libraries (e.g., `Button` from `button.tsx`) -- Test files: `{module}.test.tsx`, `{module}.spec.tsx` (frontend), `test_{module}.py` (backend) - -**Functions:** -- TypeScript/JavaScript: `camelCase` (e.g., `useShortcut`, `createApiClient`, `renderWithProviders`) -- Python: `snake_case` (e.g., `hash_password`, `verify_token`, `get_redis`) -- Go: `PascalCase` for exported, `camelCase` for private (e.g., `FetchDevices`, `mockDeviceFetcher`) -- React hooks: Prefix with `use` (e.g., `useAuth`, `useShortcut`, `useSequenceShortcut`) - -**Variables:** -- TypeScript: `camelCase` (e.g., `mockLogin`, `authState`, `refreshPromise`) -- Python: `snake_case` (e.g., `user_id`, `tenant_id`, `credentials`) -- Constants: `UPPER_SNAKE_CASE` for module-level constants (e.g., `ACCESS_TOKEN_COOKIE`, `REFRESH_TOKEN_MAX_AGE`) - -**Types:** -- TypeScript interfaces: `PascalCase` with `I` prefix optional (e.g., `ButtonProps`, `AuthState`, `WrapperProps`) -- Python: `PascalCase` for classes (e.g., `User`, `UserRole`, `HTTPException`) -- Go: `PascalCase` for exported (e.g., `Scheduler`, `Device`), `camelCase` for private (e.g., `mockDeviceFetcher`) - -**Directories:** -- Feature/module directories: `kebab-case` (e.g., `remote-access`, `device-groups`) -- Functional directories: `kebab-case` (e.g., `__tests__`, `components`, `routers`) -- Python packages: `snake_case` (e.g., `app/models`, `app/services`) - -## Code Style - -**Formatting:** - -Frontend: -- Tool: ESLint + TypeScript ESLint (flat config at `frontend/eslint.config.js`) -- Indentation: 2 spaces -- Line length: No explicit limit in config, but code stays under 120 chars -- Quotes: Single quotes in JS/TS (ESLint recommended) -- Semicolons: Required -- Trailing commas: Yes (ES2020+) - -Backend (Python): -- Tool: Ruff for linting -- Line length: 100 characters (`ruff` configured in `pyproject.toml`) -- Indentation: 4 spaces (PEP 8) -- Type hints: Required on function signatures (Pydantic models and FastAPI handlers) - -Poller (Go): -- Gofmt standard (implicit) -- Line length: conventional Go style -- Error handling: `if err != nil` pattern - -**Linting:** - -Frontend: -- ESLint config: `@eslint/js`, `typescript-eslint`, `react-hooks`, `react-refresh` -- Run: `npm run lint` -- Rules: Recommended + React hooks rules -- No unused locals/parameters enforced via TypeScript `noUnusedLocals` and `noUnusedParameters` - -Backend (Python): -- Ruff enabled for style and lint -- Target version: Python 3.12 -- Line length: 100 - -## Import Organization - -**Frontend (TypeScript/React):** - -Order: -1. React and React-adjacent imports (`import { ... } from 'react'`) -2. Third-party libraries (`import { ... } from '@tanstack/react-query'`) -3. Local absolute imports using `@` alias (`import { ... } from '@/lib/api'`) -4. Local relative imports (`import { ... } from '../utils'`) - -Path Aliases: -- `@/*` maps to `src/*` (configured in `tsconfig.app.json`) - -Example from `useShortcut.ts`: -```typescript -import { useEffect, useRef } from 'react' -// (no third-party imports in this file) -// (no local imports needed) -``` - -Example from `auth.ts`: -```typescript -import { create } from 'zustand' -import { authApi, type UserMe } from './api' -import { keyStore } from './crypto/keyStore' -import { deriveKeysInWorker } from './crypto/keys' -``` - -**Backend (Python):** - -Order: -1. Standard library (`import uuid`, `from typing import ...`) -2. Third-party (`from fastapi import ...`, `from sqlalchemy import ...`) -3. Local imports (`from app.services.auth import ...`, `from app.models.user import ...`) - -Standard pattern in routers (e.g., `auth.py`): -```python -import logging -from datetime import UTC, datetime, timedelta -from typing import Optional - -import redis.asyncio as aioredis -from fastapi import APIRouter, Depends -from sqlalchemy import select - -from app.config import settings -from app.database import get_admin_db -from app.services.auth import verify_password -``` - -**Go:** - -Order: -1. Standard library (`"context"`, `"log/slog"`) -2. Third-party (`github.com/...`) -3. Local module imports (`github.com/mikrotik-portal/poller/...`) - -Example from `main.go`: -```go -import ( - "context" - "log/slog" - "net/http" - "os" - - "github.com/bsm/redislock" - "github.com/redis/go-redis/v9" - - "github.com/mikrotik-portal/poller/internal/bus" - "github.com/mikrotik-portal/poller/internal/config" -) -``` - -## Error Handling - -**Frontend (TypeScript):** - -- Try/catch for async operations with type guards: `const axiosErr = err as { response?: ... }` -- Error messages extracted to helpers: `getAuthErrorMessage(err)` in `lib/auth.ts` -- State-driven error UI: Store errors in Zustand (`error: string | null`), display conditionally -- Pattern: Set error, then throw to allow calling code to handle: - ```typescript - try { - // operation - } catch (err) { - const message = getAuthErrorMessage(err) - set({ error: message }) - throw new Error(message) - } - ``` - -**Backend (Python):** - -- HTTPException from FastAPI for API errors (with status codes) -- Structured logging with structlog for all operations -- Pattern in services: raise exceptions, let routers catch and convert to HTTP responses -- Example from `auth.py` (lines 95-100): - ```python - async def get_redis() -> aioredis.Redis: - global _redis - if _redis is None: - _redis = aioredis.from_url(settings.REDIS_URL, decode_responses=True) - return _redis - ``` -- Database operations wrapped in try/finally blocks for cleanup - -**Go:** - -- Explicit error returns: `(result, error)` pattern -- Check and return: `if err != nil { return nil, err }` -- Structured logging with `log/slog` including error context -- Example from `scheduler_test.go`: - ```go - err := sched.reconcileDevices(ctx, &wg) - require.NoError(t, err) - ``` - -## Logging - -**Frontend:** - -- Framework: `console` (no structured logging library) -- Pattern: Inline console.log/warn/error during development -- Production: Minimal logging, errors captured in state (`auth.error`) -- Example from `auth.ts` (line 182): - ```typescript - console.warn('[auth] key set decryption failed (Tier 1 data will be inaccessible):', e) - ``` - -**Backend (Python):** - -- Framework: `structlog` for structured, JSON logging -- Logger acquisition: `logger = structlog.get_logger(__name__)` or `logging.getLogger(__name__)` -- Logging at startup/shutdown and error conditions -- Example from `main.py`: - ```python - logger = structlog.get_logger(__name__) - logger.info("migrations applied successfully") - logger.error("migration failed", stderr=result.stderr) - ``` - -**Go (Poller):** - -- Framework: `log/slog` (standard library) -- JSON output to stdout with service name in attributes -- Levels: Debug, Info, Warn, Error -- Example from `main.go`: - ```go - slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ - Level: slog.LevelInfo, - }).WithAttrs([]slog.Attr{ - slog.String("service", "poller"), - }))) - ``` - -## Comments - -**When to Comment:** - -- Complex logic that isn't self-documenting -- Important caveats or gotchas -- References to related issues or specs -- Example from `auth.ts` (lines 26-29): - ```typescript - // Response interceptor: handle 401 by attempting token refresh - client.interceptors.response.use( - (response) => response, - async (error) => { - ``` - -**JSDoc/TSDoc:** - -- Used for exported functions and hooks -- Example from `useShortcut.ts`: - ```typescript - /** - * Hook to register a single-key keyboard shortcut. - * Skips when focus is in INPUT, TEXTAREA, or contentEditable elements. - */ - export function useShortcut(key: string, callback: () => void, enabled = true) - ``` - -**Python Docstrings:** - -- Module-level docstring at top of file describing purpose -- Function docstrings for public functions -- Example from `test_auth.py`: - ```python - """Unit tests for the JWT authentication service. - - Tests cover: - - Password hashing and verification (bcrypt) - - JWT access token creation and validation - """ - ``` - -**Go Comments:** - -- Package-level comment above package declaration -- Exported function/type comments above declaration -- Example from `main.go`: - ```go - // Command poller is the MikroTik device polling microservice. - // It connects to RouterOS devices via the binary API... - package main - ``` - -## Function Design - -**Size:** - -- Frontend: Prefer hooks/components under 100 lines; break larger logic into smaller hooks -- Backend: Services typically 100-200 lines per function; larger operations split across multiple methods -- Example: `auth.ts` `srpLogin` is 130 lines but handles distinct steps (1-10 commented) - -**Parameters:** - -- Frontend: Functions take specific parameters, avoid large option objects except for component props -- Backend (Python): Use Pydantic schemas for request bodies, dependency injection for services -- Go: Interfaces preferred for mocking/testing (e.g., `DeviceFetcher` in `scheduler_test.go`) - -**Return Values:** - -- Frontend: Single return or destructured object: `return { ...render(...), queryClient }` -- Backend (Python): Single value or tuple for multiple returns (not common) -- Go: Always return `(result, error)` pair - -## Module Design - -**Exports:** - -- TypeScript: Named exports preferred for functions/types, default export only for React components - - Example: `export function useShortcut(...)` instead of `export default useShortcut` - - React components: `export default AppInner` (in `App.tsx`) -- Python: All public functions/classes at module level; use `__all__` for large modules -- Go: Exported functions capitalized: `func NewScheduler(...) *Scheduler` - -**Barrel Files:** - -- Frontend: `test-utils.tsx` re-exports Testing Library: `export * from '@testing-library/react'` -- Backend: Not used (explicit imports preferred) -- Go: Not applicable (no barrel pattern) - -## Specific Patterns Observed - -**Zustand Stores (Frontend):** -- Created with `create((set, get) => ({ ... }))` -- State shape includes loading, error, and data fields -- Actions call `set(newState)` or `get()` to access state -- Example: `useAuth` store in `lib/auth.ts` (lines 31-276) - -**Zustand selectors:** -- Use selector functions for role checks: `isSuperAdmin(user)`, `isTenantAdmin(user)`, etc. -- Pattern: Pure functions that check user role - -**Class Variance Authority (Frontend):** -- Used for component variants in UI library (e.g., `button.tsx`) -- Variants defined with `cva()` function with variant/size/etc. options -- Applied via `className={cn(buttonVariants({ variant, size }), className)}` - -**FastAPI Routers (Backend):** -- Each feature area gets its own router file: `routers/auth.py`, `routers/devices.py` -- Routers mounted at `app.include_router(router)` in `main.py` -- Endpoints use dependency injection for auth, db, etc. - -**pytest Fixtures (Backend):** -- Conftest.py at test root defines markers and shared fixtures -- Integration tests in `tests/integration/conftest.py` -- Unit tests use mocks, no database access - -**Go Testing:** -- Table-driven tests not explicitly shown, but mock interfaces are (e.g., `mockDeviceFetcher`) -- Testify assertions: `assert.Len`, `require.NoError` -- Helper functions to create test data: `newTestScheduler` - ---- - -*Convention analysis: 2026-03-12* diff --git a/.planning/codebase/INTEGRATIONS.md b/.planning/codebase/INTEGRATIONS.md deleted file mode 100644 index 0a308d1..0000000 --- a/.planning/codebase/INTEGRATIONS.md +++ /dev/null @@ -1,245 +0,0 @@ -# External Integrations - -**Analysis Date:** 2026-03-12 - -## APIs & External Services - -**MikroTik RouterOS:** -- Binary API (TLS port 8729) - Device polling and command execution - - SDK/Client: go-routeros/v3 (Go poller) - - Protocol: Binary encoded commands, TLS mutual authentication - - Used in: `poller/cmd/poller/main.go`, `poller/internal/poller/` - -**SMTP (Transactional Email):** -- System email service (password reset, alerts, notifications) - - SDK/Client: aiosmtplib (async SMTP library) - - Configuration: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_USE_TLS` - - From address: `SMTP_FROM_ADDRESS` - - Implementation: `app/services/email_service.py` - - Supports TLS, STARTTLS, plain auth - -**WebSocket/SSH Tunneling:** -- Browser-based SSH terminal for remote device access - - SDK/Client: asyncssh (Python), xterm.js (frontend) - - Protocol: SSH protocol with port forwarding - - Implementation: `app/routers/remote_access.py`, `poller/internal/sshrelay/` - - Features: Session auditing, command logging to NATS - -## Data Storage - -**Databases:** -- PostgreSQL 17 (TimescaleDB extension in production) - - Async driver: asyncpg 0.30.0+ (Python backend) - - Sync driver: pgx/v5 (Go poller) - - ORM: SQLAlchemy 2.0+ async - - Migrations: Alembic 1.14.0+ - - RLS: Row-Level Security policies for multi-tenant isolation - - Models: `app/models/` (17+ model files) - - Connection: `DATABASE_URL`, `APP_USER_DATABASE_URL`, `POLLER_DATABASE_URL` - - Admin role: postgres (migrations only) - - App role: app_user (enforces RLS) - - Poller role: poller_user (direct access, no RLS) - -**File Storage:** -- Local filesystem only - No cloud storage integration - - Git store (bare repos): `/data/git-store` or `./git-store` (RWX PVC in production) - - Implementation: `app/services/git_store.py` - - Purpose: Version control for device configurations (one repo per tenant) - - Firmware cache: `/data/firmware-cache` - - Purpose: Downloaded RouterOS firmware images - - Service: `app/services/firmware_service.py` - - WireGuard config: `/data/wireguard` - - Purpose: VPN peer and configuration management - -**Caching:** -- Redis 7+ - - Async driver: redis 5.0.0+ (Python) - - Sync driver: redis/go-redis/v9 (Go) - - Use cases: - - Session storage for SRP auth flows: `app/routers/auth.py` (key: `srp:session:{session_id}`) - - Distributed locks: poller uses `bsm/redislock` to prevent duplicate polls across replicas - - Connection: `REDIS_URL` - -## Authentication & Identity - -**Auth Provider:** -- Custom SRP-6a implementation (zero-knowledge auth) - - Flow: SRP-6a password hash registration → no plaintext password stored - - Implementation: `app/services/srp_service.py`, `app/routers/auth.py` - - JWT tokens: HS256 signed with `JWT_SECRET_KEY` - - Token storage: httpOnly cookies (frontend sends via credentials) - - Refresh: 15-minute access tokens, 7-day refresh tokens - - Fallback: Legacy bcrypt password support during upgrade phase - -**User Roles:** -- Four role levels with RBAC: - - super_admin - Cross-tenant access, user/billing management - - admin - Full tenant management (invite users, config push, firmware) - - operator - Limited: config push, monitoring, alerts - - viewer - Read-only: dashboard, reports, audit logs - -**Credential Encryption:** -- Per-tenant envelope encryption via OpenBao Transit - - Service: `app/services/openbao_service.py` - - Cipher: AES-256-GCM via OpenBao Transit engine - - Key naming: `tenant_{uuid}` (created on tenant creation) - - Fallback: Legacy Fernet decryption for credentials created before Transit migration - -## Monitoring & Observability - -**Error Tracking:** -- Not integrated - No Sentry, DataDog, or equivalent -- Local structured logging only - -**Logs:** -- Structured logging via structlog (Python backend) - - Format: JSON (production), human-readable (dev) - - Configuration: `app/logging_config.py` - - Log level: Configurable via `LOG_LEVEL` env var -- Structured logging via slog (Go poller) - - Format: JSON with service name and instance hostname - - Configuration: `poller/cmd/poller/main.go` - -**Metrics:** -- Prometheus metrics export - - Library: prometheus-fastapi-instrumentator 7.0.0+ - - Setup: `app/observability.py` - - Endpoint: Exposed metrics in Prometheus text format - - Not scraped by default - requires external Prometheus instance - -**OpenTelemetry:** -- Minimal OTEL instrumentation in Go poller - - SDK: `go.opentelemetry.io/otel` 1.39.0+ - - Not actively used in Python backend - -## CI/CD & Deployment - -**Hosting:** -- Self-hosted (Docker Compose for local, Kubernetes for production) -- No cloud provider dependency -- Reverse proxy: Caddy (reference: user memory notes) - -**CI Pipeline:** -- GitHub Actions (`.github/workflows/`) -- Not fully analyzed - check workflows for details - -**Containers:** -- Docker multi-stage builds for all three services -- Images: `api` (FastAPI), `poller` (Go binary), `frontend` (Vite SPA) -- Profiles: `full` (all services), `mail-testing` (adds Mailpit) - -## Environment Configuration - -**Required env vars:** -- `DATABASE_URL` - PostgreSQL admin connection -- `SYNC_DATABASE_URL` - Alembic migrations connection -- `APP_USER_DATABASE_URL` - App-scoped RLS connection -- `POLLER_DATABASE_URL` - Poller service connection -- `REDIS_URL` - Redis connection -- `NATS_URL` - NATS JetStream connection -- `JWT_SECRET_KEY` - HS256 signing key (MUST be unique in production) -- `CREDENTIAL_ENCRYPTION_KEY` - Base64-encoded 32-byte AES key -- `OPENBAO_ADDR` - OpenBao server address -- `OPENBAO_TOKEN` - OpenBao authentication token -- `CORS_ORIGINS` - Frontend origins (comma-separated) -- `SMTP_HOST`, `SMTP_PORT` - Email server -- `FIRST_ADMIN_EMAIL`, `FIRST_ADMIN_PASSWORD` - Bootstrap account (dev only) - -**Secrets location:** -- `.env` file (git-ignored) - Development -- Environment variables in production (Kubernetes secrets, docker compose .env) -- OpenBao - Stores Transit encryption keys (not key material, only key references) - -**Security defaults validation:** -- `app/config.py` rejects known-insecure values in non-dev environments: - - `JWT_SECRET_KEY` hard-coded defaults - - `CREDENTIAL_ENCRYPTION_KEY` hard-coded defaults - - `OPENBAO_TOKEN` hard-coded defaults -- Fails startup with clear error message if production uses dev secrets - -## Webhooks & Callbacks - -**Incoming:** -- None detected - No external webhook subscriptions - -**Outgoing:** -- Slack notifications - Alert firing/resolution (planned/partial implementation) - - Router: `app/routers/alerts.py` - - Implementation status: Check alert evaluation service -- Email notifications - Alert notifications, password reset - - Service: `app/services/email_service.py` -- Custom webhooks - Extensible via notification service - - Service: `app/services/notification_service.py` - -## NATS JetStream Event Bus - -**Message Bus:** -- NATS 2.0+ with JetStream persistence - - Python client: nats-py 2.7.0+ - - Go client: nats.go 1.38.0+ - - Connection: `NATS_URL` - -**Event Topics (Python publisher → Go/Python subscribers):** -- `device.status.>` - Device online/offline status from Go poller - - Subscriber: `app/services/nats_subscriber.py` - - Payload: device_id, tenant_id, status, routeros_version, board_name, uptime - - Usage: Real-time device fleet updates - -- `firmware.progress.{tenant_id}.{device_id}` - Firmware upgrade progress - - Subscriber: `app/services/firmware_subscriber.py` - - Publisher: Firmware upgrade service - - Payload: stage (downloading, verifying, upgrading), progress %, message - - Usage: Live firmware upgrade tracking (SSE to frontend) - -- `config.push.{tenant_id}.{device_id}` - Configuration push progress - - Subscriber: `app/services/push_rollback_subscriber.py` - - Publisher: `app/services/restore_service.py` - - Payload: phase (pre-validate, backup, push, commit), status, errors - - Usage: Live config deployment tracking with rollback support - -- `alert.fired.{tenant_id}`, `alert.resolved.{tenant_id}` - Alert events - - Subscriber: `app/services/sse_manager.py` - - Publisher: `app/services/alert_evaluator.py` - - Payload: alert_id, device_id, rule_name, condition, value, timestamp - - Usage: Real-time alert notifications (SSE to frontend) - -- `audit.session.end` - SSH session audit events - - Subscriber: `app/services/session_audit_subscriber.py` - - Publisher: Go SSH relay (`poller/internal/sshrelay/`) - - Payload: session_id, user_id, device_id, start_time, end_time, command_log - - Usage: Session auditing and compliance logging - -- `config.change.{tenant_id}.{device_id}` - Device config change detection - - Subscriber: `app/services/config_change_subscriber.py` - - Payload: device_id, change_type, affected_subsystems, timestamp - - Usage: Track unapproved config changes - -- `metrics.sample.{tenant_id}.{device_id}` - Real-time CPU/memory/traffic samples - - Subscriber: `app/services/metrics_subscriber.py` - - Publisher: Go poller - - Payload: timestamp, cpu_percent, memory_percent, disk_percent, interfaces{name, rx_bytes, tx_bytes} - - Usage: Live metric streaming (SSE to frontend) - -**Server-Sent Events (SSE):** -- Frontend subscribes to per-tenant SSE streams - - Endpoint: `GET /api/sse/subscribe?tenant_id={tenant_id}` - - Connection: Long-lived HTTP persistent stream - - Implementation: `app/routers/sse.py`, `app/services/sse_manager.py` - - Payload format: SSE (text/event-stream) - - Events forwarded from NATS to frontend browser in real-time - - Used for: firmware progress, alerts, config push status, metrics - -## Git Integration - -**Version Control:** -- Bare git repositories stored per-tenant - - Library: pygit2 1.14.0+ - - Location: `{GIT_STORE_PATH}/tenant_{tenant_id}/` - - Purpose: Store device configuration history - - Commits created on: successful config push, manual save - - Restore: One-click revert to any previous commit - - Implementation: `app/services/git_store.py` - ---- - -*Integration audit: 2026-03-12* diff --git a/.planning/codebase/STACK.md b/.planning/codebase/STACK.md deleted file mode 100644 index 26bb6cf..0000000 --- a/.planning/codebase/STACK.md +++ /dev/null @@ -1,158 +0,0 @@ -# Technology Stack - -**Analysis Date:** 2026-03-12 - -## Languages - -**Primary:** -- Python 3.12+ - Backend API (`/backend`) -- Go 1.24.0 - Poller service (`/poller`) -- TypeScript 5.9.3 - Frontend (`/frontend`) -- JavaScript - Frontend runtime - -**Secondary:** -- SQL - PostgreSQL database queries and migrations -- YAML - Docker Compose configuration -- Shell - Infrastructure scripts - -## Runtime - -**Environment:** -- Node.js runtime (frontend) -- Python 3.12+ runtime (backend) -- Go 1.24.0 runtime (poller) - -**Package Manager:** -- npm (Node.js) - Frontend dependencies -- pip/hatchling (Python) - Backend dependencies -- go mod (Go) - Poller dependencies - -## Frameworks - -**Core:** -- FastAPI 0.115.0+ - Backend REST API (`app/main.py`) -- React 19.2.0 - Frontend UI components -- TanStack React Router 1.161.3 - Frontend routing and navigation -- TanStack React Query 5.90.21 - Frontend data fetching and caching -- Vite 7.3.1 - Frontend build tool and dev server -- go-routeros/v3 - MikroTik RouterOS binary protocol client - -**Testing:** -- pytest 8.0.0+ - Backend unit/integration tests (`tests/`) -- vitest 4.0.18 - Frontend unit tests -- @playwright/test 1.58.2 - Frontend E2E tests -- testcontainers-go 0.40.0 - Go integration tests with Docker containers - -**Build/Dev:** -- TypeScript 5.9.3 - Frontend type checking via `tsc -b` -- ESLint 9.39.1 - Frontend linting -- Alembic 1.14.0 - Backend database migrations -- docker compose - Multi-service orchestration -- pytest-cov 5.0.0 - Backend test coverage reporting -- vitest coverage - Frontend test coverage - -## Key Dependencies - -**Critical:** -- SQLAlchemy 2.0+ (asyncio) - Backend ORM with async support (`app/database.py`) -- asyncpg 0.30.0+ - Async PostgreSQL driver for Python -- pgx/v5 - Sync PostgreSQL driver for Go poller -- nats-py 2.7.0+ - NATS JetStream client (Python, event publishing) -- nats.go 1.38.0+ - NATS JetStream client (Go, event publishing and subscribing) -- redis (Python 5.0.0+) - Redis async client for session storage (`app/routers/auth.py`) -- redis/go-redis/v9 - Redis client for Go (distributed locks) -- httpx 0.27.0+ - Async HTTP client for OpenBao API calls -- asyncssh 2.20.0+ - SSH library for remote device access - -**Infrastructure:** -- cryptography 42.0.0+ - Encryption/decryption, SSH key handling -- bcrypt 4.0.0-5.0.0 - Password hashing -- python-jose 3.3.0+ - JWT token creation and validation -- pydantic 2.0.0+ - Request/response validation, settings -- pydantic-settings 2.0.0+ - Environment variable configuration -- slowapi 0.1.9+ - Rate limiting middleware -- structlog 25.1.0+ - Structured logging -- prometheus-fastapi-instrumentator 7.0.0+ - Prometheus metrics export -- aiosmtplib 3.0.0+ - Async SMTP for email notifications -- weasyprint 62.0+ - PDF report generation -- pygit2 1.14.0+ - Git version control integration (`app/services/git_store.py`) -- apscheduler 3.10.0-4.0 - Background job scheduling - -**Frontend UI:** -- @radix-ui/* (v1-2) - Accessible component primitives -- Tailwind CSS 3.4.19 - Utility-first CSS framework -- lucide-react 0.575.0 - Icon library -- framer-motion 12.34.3 - Animation library -- recharts 3.7.0 - Chart library -- reactflow 11.11.4 - Network diagram rendering -- react-leaflet 5.0.0 - Map visualization -- xterm.js 6.0.0 - Terminal emulator for SSH (`@xterm/xterm`, `@xterm/addon-fit`) -- sonner 2.0.7 - Toast notifications -- zod 4.3.6 - Runtime schema validation -- zustand 5.0.11 - Lightweight state management -- axios 1.13.5 - HTTP client for API calls -- diff 8.0.3 - Diff computation for git-diff-view - -**Testing Libraries:** -- @testing-library/react 16.3.2 - React component testing utilities -- @testing-library/user-event 14.6.1 - User interaction simulation -- jsdom 28.1.0 - DOM implementation for Node.js tests - -## Configuration - -**Environment:** -- `.env` file (Pydantic BaseSettings) - Development environment variables -- `.env.example` - Template with safe defaults -- `.env.staging.example` - Staging environment template -- Environment validation in `app/config.py` - Rejects known-insecure defaults in non-dev environments - -**Key Environment Variables:** -- `ENVIRONMENT` - (dev|staging|production) -- `DATABASE_URL` - PostgreSQL async connection (admin role) -- `SYNC_DATABASE_URL` - PostgreSQL sync for Alembic migrations -- `APP_USER_DATABASE_URL` - PostgreSQL with app_user role (RLS enforced) -- `POLLER_DATABASE_URL` - PostgreSQL for Go poller (separate role) -- `REDIS_URL` - Redis connection for sessions and locks -- `NATS_URL` - NATS JetStream connection -- `JWT_SECRET_KEY` - HS256 signing key (must be unique in production) -- `CREDENTIAL_ENCRYPTION_KEY` - Base64-encoded 32-byte AES key for credential storage -- `OPENBAO_ADDR` - OpenBao HTTP endpoint -- `OPENBAO_TOKEN` - OpenBao auth token -- `CORS_ORIGINS` - Comma-separated allowed frontend origins -- `SMTP_HOST`, `SMTP_PORT` - Email configuration - -**Build:** -- `vite.config.ts` - Vite bundler configuration (frontend) -- `tsconfig.json` - TypeScript compiler options -- `pyproject.toml` - Python project metadata and dependencies -- `go.mod` / `go.sum` - Go module dependencies -- `Dockerfile` - Multi-stage builds for all three services -- `docker-compose.yml` - Local development stack - -## Platform Requirements - -**Development:** -- Python 3.12+ -- Node.js 18+ (npm) -- Go 1.24.0 -- Docker and Docker Compose -- PostgreSQL 17 (via Docker) -- Redis 7 (via Docker) -- NATS 2+ with JetStream (via Docker) -- OpenBao 2.1+ (via Docker) -- WireGuard (via Docker image) - -**Production:** -- Kubernetes or Docker Swarm for orchestration -- PostgreSQL 17+ with TimescaleDB extension -- Redis 7+ (standalone or cluster) -- NATS 2.0+ with JetStream persistence -- OpenBao 2.0+ for encryption key management -- WireGuard container for VPN tunneling -- TLS certificates for HTTPS (Caddy/nginx reverse proxy) -- Storage for git-backed configs (`/data/git-store` - ReadWriteMany PVC) -- Storage for firmware cache (`/data/firmware-cache`) - ---- - -*Stack analysis: 2026-03-12* diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md deleted file mode 100644 index 612ee9d..0000000 --- a/.planning/codebase/STRUCTURE.md +++ /dev/null @@ -1,293 +0,0 @@ -# Codebase Structure - -**Analysis Date:** 2026-03-12 - -## Directory Layout - -``` -the-other-dude/ -├── backend/ # Python FastAPI backend microservice -│ ├── app/ -│ │ ├── main.py # FastAPI app entry point, lifespan setup -│ │ ├── config.py # Settings from environment -│ │ ├── database.py # SQLAlchemy engines, session factories -│ │ ├── models/ # SQLAlchemy ORM models -│ │ ├── schemas/ # Pydantic request/response schemas -│ │ ├── routers/ # APIRouter endpoints (devices, alerts, auth, etc.) -│ │ ├── services/ # Business logic, NATS subscribers, integrations -│ │ ├── middleware/ # RBAC, tenant context, rate limiting, headers -│ │ ├── security/ # SRP, JWT, auth utilities -│ │ └── templates/ # Jinja2 report templates -│ ├── alembic/ # Database migrations -│ ├── tests/ # Unit and integration tests -│ └── Dockerfile -│ -├── poller/ # Go device polling microservice -│ ├── cmd/poller/main.go # Entry point -│ ├── internal/ -│ │ ├── poller/ # Scheduler and Worker (device polling orchestration) -│ │ ├── device/ # RouterOS binary API client -│ │ ├── bus/ # NATS JetStream publisher -│ │ ├── tunnel/ # WinBox TCP tunnel manager -│ │ ├── sshrelay/ # SSH relay server -│ │ ├── config/ # Configuration loading -│ │ ├── store/ # PostgreSQL device list queries -│ │ ├── vault/ # OpenBao credential cache -│ │ ├── observability/ # Prometheus metrics, health checks -│ │ └── testutil/ # Test helpers -│ ├── go.mod / go.sum -│ └── Dockerfile -│ -├── frontend/ # React 19 TypeScript web UI -│ ├── src/ -│ │ ├── routes/ # TanStack Router file-based routes -│ │ │ ├── __root.tsx # Root layout, QueryClientProvider -│ │ │ ├── _authenticated.tsx # Auth guard, logged-in layout -│ │ │ └── _authenticated/ # Tenant and device-scoped pages -│ │ ├── components/ # React components by feature -│ │ │ ├── ui/ # Base UI components (button, card, dialog, etc.) -│ │ │ ├── dashboard/ -│ │ │ ├── fleet/ -│ │ │ ├── devices/ -│ │ │ ├── config/ -│ │ │ ├── alerts/ -│ │ │ ├── auth/ -│ │ │ ├── vpn/ -│ │ │ └── ... -│ │ ├── hooks/ # Custom React hooks (useEventStream, useShortcut, etc.) -│ │ ├── contexts/ # React Context (EventStreamContext) -│ │ ├── lib/ # Utilities (API client, crypto, helpers) -│ │ ├── assets/ # Fonts, images -│ │ └── main.tsx # Entry point -│ ├── public/ -│ ├── package.json / pnpm-lock.yaml -│ ├── tsconfig.json -│ ├── vite.config.ts -│ └── Dockerfile -│ -├── infrastructure/ # Deployment and observability configs -│ ├── docker/ # Docker build scripts -│ ├── helm/ # Kubernetes Helm charts -│ ├── observability/ # Grafana dashboards, OpenBao configs -│ └── openbao/ # OpenBao policy and plugin configs -│ -├── docs/ # Documentation -│ ├── website/ # Website source (theotherdude.net) -│ └── superpowers/ # Feature specs and plans -│ -├── scripts/ # Utility scripts -│ -├── docker-compose.yml # Development multi-container setup -├── docker-compose.override.yml # Local overrides (mounted volumes, etc.) -├── docker-compose.staging.yml # Staging environment -├── docker-compose.prod.yml # Production environment -├── docker-compose.observability.yml # Optional Prometheus/Grafana stack -│ -├── .env.example # Template environment variables -├── .github/ # GitHub Actions CI/CD workflows -├── .planning/ # GSD planning documents -│ -└── README.md # Main project documentation -``` - -## Directory Purposes - -**backend/app/models/:** -- Purpose: SQLAlchemy ORM model definitions with RLS support -- Contains: Device, User, Tenant, Alert, ConfigBackup, Certificate, AuditLog, Firmware, VPN models -- Key files: `device.py` (devices with status, version, uptime), `user.py` (users with role and tenant), `alert.py` (alert rules and event log) -- Pattern: All models include `tenant_id` column, RLS policies enforce isolation - -**backend/app/schemas/:** -- Purpose: Pydantic request/response validation schemas -- Contains: DeviceCreate, DeviceResponse, DeviceUpdate, AlertRuleCreate, ConfigPushRequest, etc. -- Pattern: Separate request/response schemas (response never includes credentials), nested schema reuse - -**backend/app/routers/:** -- Purpose: FastAPI APIRouter endpoints, organized by domain -- Key files: `devices.py` (CRUD + bulk ops), `auth.py` (login, SRP, SSE token), `alerts.py` (rules and events), `config_editor.py` (live device config), `metrics.py` (metrics queries), `templates.py` (config templates), `vpn.py` (WireGuard peers) -- Pattern: All routes tenant-scoped as `/api/tenants/{tenant_id}/...` or `/api/...` (user-scoped) -- Middleware: Depends(require_role(...)), Depends(get_current_user), rate limiting - -**backend/app/services/:** -- Purpose: Business logic, external integrations, NATS event handling -- Core services: `device.py` (device CRUD with encryption), `auth.py` (SRP, JWT, password hashing), `backup_service.py` (config backup versioning), `ca_service.py` (TLS certificate generation and deployment) -- NATS subscribers: `nats_subscriber.py` (device status), `metrics_subscriber.py` (metrics aggregation), `firmware_subscriber.py` (firmware tracking), `alert_evaluator.py` (alert rule evaluation), `push_rollback_subscriber.py`, `session_audit_subscriber.py` -- External integrations: `email_service.py`, `notification_service.py` (Slack, webhooks), `git_store.py` (config history), `openbao_service.py` (vault access) -- Schedulers: `backup_scheduler.py`, `firmware_subscriber.py` (started/stopped in lifespan) - -**backend/app/middleware/:** -- Purpose: Request/response middleware, RBAC, tenant context, rate limiting -- Key files: `tenant_context.py` (JWT extraction, tenant context setup, RLS configuration), `rbac.py` (role hierarchy, Depends factories), `rate_limit.py` (token bucket limiter), `request_id.py` (correlation ID), `security_headers.py` (CSP, HSTS) - -**backend/app/security/:** -- Purpose: Authentication and encryption utilities -- Pattern: SRP-6a client challenge/response, JWT token generation, password hashing (bcrypt), credential envelope encryption (Fernet + Transit KMS) - -**poller/internal/poller/:** -- Purpose: Device scheduling and polling orchestration -- Key files: `scheduler.go` (lifecycle management, discovery), `worker.go` (per-device polling loop), `interfaces.go` (device interfaces) -- Pattern: Per-device goroutine with Redis distributed locking, circuit breaker with exponential backoff - -**poller/internal/device/:** -- Purpose: RouterOS binary API client implementation -- Key files: `client.go` (connection, command execution), `version.go` (parse RouterOS version), `health.go` (CPU, memory, disk metrics), `interfaces.go` (interface stats), `wireless.go` (wireless stats), `firmware.go` (firmware info), `cert_deploy.go` (TLS cert SFTP), `sftp.go` (SFTP operations) -- Pattern: Binary API command builders, response parsers, error handling - -**poller/internal/bus/:** -- Purpose: NATS JetStream publisher for all device events -- Key file: `publisher.go` (typed event structs, publish methods) -- Event types: DeviceStatusEvent, DeviceMetricsEvent, ConfigChangedEvent, PushRollbackEvent, PushAlertEvent, SessionAuditEvent -- Pattern: Struct with nc/js connections, methods like PublishDeviceStatus(ctx, event) - -**poller/internal/tunnel/:** -- Purpose: WinBox TCP tunnel management -- Key files: `manager.go` (port allocation, tunnel lifecycle), `tunnel.go` (tunnel goroutine), `portpool.go` (ephemeral port pool) -- Pattern: SOCKS proxy forwarding, port reuse after timeout - -**poller/internal/sshrelay/:** -- Purpose: SSH server bridging to RouterOS terminal access -- Key files: `server.go` (SSH server setup), `session.go` (SSH session handling), `bridge.go` (SSH-to-device relay) -- Pattern: SSH key pair generation, session multiplexing, terminal protocol bridging - -**poller/internal/vault/:** -- Purpose: OpenBao credential caching and decryption -- Key file: `vault.go` -- Pattern: Cache credentials after decryption via Transit KMS, TTL-based eviction - -**frontend/src/routes/:** -- Purpose: TanStack Router file-based routing -- Structure: `__root.tsx` (app root, QueryClientProvider), `_authenticated.tsx` (requires JWT, layout), `_authenticated/tenants/$tenantId/index` (tenant home), `_authenticated/tenants/$tenantId/devices/...` (device pages) -- Pattern: Each file exports `Route` object with component and loader, nested routes inherit parent loaders - -**frontend/src/components/:** -- Purpose: React components organized by domain/feature -- Structure: `ui/` (base components: Button, Card, Dialog, Input, Select, Badge, Skeleton, etc.), then feature folders (dashboard, fleet, devices, config, alerts, auth, vpn, etc.) -- Pattern: Composition over inheritance, CSS Modules or Tailwind for styling - -**frontend/src/hooks/:** -- Purpose: Custom React hooks for reusable logic -- Key files: `useEventStream.ts` (SSE connection lifecycle), `useShortcut.ts` (keyboard shortcuts), `useConfigPanel.ts` (config editor state), `usePageTitle.ts` (document title), `useSimpleConfig.ts` (simple config wizard state) - -**frontend/src/lib/:** -- Purpose: Utility modules -- Key files: `api.ts` (axios instance, fetch wrapper), `crypto/` (SRP client, key derivation), helpers (date formatting, validation, etc.) - -**backend/alembic/:** -- Purpose: Database schema migrations -- Key files: `alembic/versions/*.py` (timestamped migration scripts) -- Pattern: `upgrade()` and `downgrade()` functions, SQL operations via `op` context - -**tests/:** -- Backend: `tests/unit/` (service/model tests), `tests/integration/` (API endpoint tests with test DB) -- Frontend: `tests/e2e/` (Playwright E2E tests), `src/components/__tests__/` (component tests) - -## Key File Locations - -**Entry Points:** -- Backend: `backend/app/main.py` (FastAPI app, lifespan management) -- Poller: `poller/cmd/poller/main.go` (scheduler initialization) -- Frontend: `frontend/src/main.tsx` (React root), `frontend/src/routes/__root.tsx` (router root) - -**Configuration:** -- Backend: `backend/app/config.py` (Settings from .env) -- Poller: `poller/internal/config/config.go` (Load environment) -- Frontend: `frontend/vite.config.ts` (build config), `frontend/tsconfig.json` (TypeScript config) - -**Core Logic:** -- Device management: `backend/app/services/device.py` (CRUD), `poller/internal/device/` (API client), `frontend/src/components/fleet/` (UI) -- Config push: `backend/app/routers/config_editor.py` (API), `poller/internal/poller/worker.go` (execution), `frontend/src/components/config-editor/` (UI) -- Alerts: `backend/app/services/alert_evaluator.py` (evaluation logic), `backend/app/routers/alerts.py` (API), `frontend/src/components/alerts/` (UI) -- Authentication: `backend/app/security/` (SRP, JWT), `frontend/src/components/auth/` (forms), `poller/internal/vault/` (credential cache) - -**Testing:** -- Backend unit: `backend/tests/unit/` -- Backend integration: `backend/tests/integration/` -- Frontend e2e: `frontend/tests/e2e/` (Playwright specs) -- Poller unit: `poller/internal/poller/*_test.go`, `poller/internal/device/*_test.go` - -## Naming Conventions - -**Files:** -- Backend Python: snake_case.py (e.g., `device_service.py`, `nats_subscriber.py`) -- Poller Go: snake_case.go (e.g., `poller.go`, `scheduler.go`) -- Frontend TypeScript: PascalCase.tsx for components (e.g., `FleetTable.tsx`), camelCase.ts for utilities (e.g., `useEventStream.ts`) -- Routes: File name maps to URL path (`_authenticated/tenants/$tenantId/devices.tsx` → `/authenticated/tenants/{id}/devices`) - -**Functions/Methods:** -- Backend: snake_case (async def list_devices(...)), service functions are async -- Poller: PascalCase for exported types (Scheduler, Publisher), camelCase for methods -- Frontend: camelCase for functions and hooks, PascalCase for component names - -**Variables:** -- Backend: snake_case (device_id, tenant_id, current_user) -- Poller: camelCase for small scope (ctx, result), PascalCase for types (DeviceState) -- Frontend: camelCase (connectionState, lastConnectedAt) - -**Types:** -- Backend: PascalCase classes (Device, User, DeviceCreate) -- Poller: Exported types PascalCase (DeviceStatusEvent), unexported lowercase (deviceState) -- Frontend: TypeScript interfaces PascalCase (SSEEvent, EventCallback), generics with T - -## Where to Add New Code - -**New Feature (e.g., new device capability):** -- Primary code: - - Backend API: `backend/app/routers/{feature}.py` (new router file) - - Backend service: `backend/app/services/{feature}.py` (business logic) - - Backend model: Add to `backend/app/models/{domain}.py` or new file - - Poller: `poller/internal/device/{capability}.go` (RouterOS API client method) - - Poller event: Add struct to `poller/internal/bus/publisher.go`, new publish method - - Backend subscriber: `backend/app/services/{feature}_subscriber.py` if async processing needed -- Tests: `backend/tests/integration/test_{feature}.py` (API tests), `backend/tests/unit/test_{service}.py` (service tests) -- Frontend: - - Route: `frontend/src/routes/_authenticated/{feature}.tsx` (if new top-level page) - - Component: `frontend/src/components/{feature}/{FeatureName}.tsx` - - Hook: `frontend/src/hooks/use{FeatureName}.ts` if shared state/logic -- Database: Migration in `backend/alembic/versions/{timestamp}_{description}.py` - -**New Component/Module:** -- Backend: Create in `backend/app/services/{module}.py` as async class with methods, import in relevant router/subscriber -- Poller: Create in `poller/internal/{package}/{module}.go`, follow interface pattern in `interfaces.go` -- Frontend: Create in `frontend/src/components/{feature}/{ModuleName}.tsx`, export as named export - -**Utilities/Helpers:** -- Backend: `backend/app/services/` (service-level) or `backend/app/` subdirectory (utility modules) -- Poller: `poller/internal/{package}/` (package-level utilities) -- Frontend: `frontend/src/lib/{utility}/` (organized by concern: api, crypto, helpers, etc.) - -## Special Directories - -**docker-data/:** -- Purpose: Docker volumes for persistent data (PostgreSQL, NATS, Redis, WireGuard configs, Git backups) -- Generated: Yes (created by Docker on first run) -- Committed: No (.gitignore) - -**alembic/versions/:** -- Purpose: Database migration history -- Generated: No (manually written by developers) -- Committed: Yes (part of source control for reproducible schema) - -**.env files:** -- `.env.example`: Template with non-secret defaults, always committed -- `.env`: Local development config, not committed, ignored by .gitignore -- `.env.staging.example`: Staging environment template - -**.planning/codebase/:** -- Purpose: GSD-generated codebase analysis documents (ARCHITECTURE.md, STRUCTURE.md, CONVENTIONS.md, TESTING.md, etc.) -- Generated: Yes (by GSD tools) -- Committed: Yes (reference for future development) - -**node_modules/ (frontend):** -- Purpose: npm/pnpm dependencies -- Generated: Yes (by pnpm install) -- Committed: No (.gitignore) - -**__pycache__ (backend), vendor (poller):** -- Purpose: Compiled bytecode and dependency caches -- Generated: Yes -- Committed: No (.gitignore) - ---- - -*Structure analysis: 2026-03-12* diff --git a/.planning/codebase/TESTING.md b/.planning/codebase/TESTING.md deleted file mode 100644 index ebcc2c7..0000000 --- a/.planning/codebase/TESTING.md +++ /dev/null @@ -1,751 +0,0 @@ -# Testing Patterns - -**Analysis Date:** 2026-03-12 - -## Test Framework - -**Frontend:** - -Runner: -- Vitest 4.0.18 -- Config: `frontend/vitest.config.ts` -- Environment: jsdom (browser simulation) -- Globals enabled: true - -Assertion Library: -- Testing Library (React) - `@testing-library/react` -- Testing Library User Events - `@testing-library/user-event` -- Testing Library Jest DOM matchers - `@testing-library/jest-dom` -- Vitest's built-in expect (compatible with Jest) - -Run Commands: -```bash -npm run test # Run all tests once -npm run test:watch # Watch mode (re-runs on file change) -npm run test:coverage # Generate coverage report -npm run test:e2e # E2E tests with Playwright -npm run test:e2e:headed # E2E tests with visible browser -``` - -**Backend:** - -Runner: -- pytest 8.0.0 -- Config: `pyproject.toml` with `asyncio_mode = "auto"` -- Plugins: pytest-asyncio, pytest-mock, pytest-cov -- Markers: `integration` (marked tests requiring PostgreSQL) - -Run Commands: -```bash -pytest # Run all tests -pytest -m "not integration" # Run unit tests only -pytest -m integration # Run integration tests only -pytest --cov=app # Generate coverage report -pytest -v # Verbose output -``` - -**Go (Poller):** - -Runner: -- Go's built-in testing package -- Config: implicit (no config file) -- Assertions: testify/assert, testify/require -- Test containers for integration tests (PostgreSQL, Redis, NATS) - -Run Commands: -```bash -go test ./... # Run all tests -go test -v ./... # Verbose output -go test -run TestName ... # Run specific test -go test -race ./... # Race condition detection -``` - -## Test File Organization - -**Frontend:** - -Location: -- Co-located with components in `__tests__` subdirectory -- Pattern: `src/components/__tests__/{component}.test.tsx` -- Shared test utilities in `src/test/test-utils.tsx` -- Test setup in `src/test/setup.ts` - -Examples: -- `frontend/src/components/__tests__/LoginPage.test.tsx` -- `frontend/src/components/__tests__/DeviceList.test.tsx` -- `frontend/src/components/__tests__/TemplatePushWizard.test.tsx` - -Naming: -- Test files: `{Component}.test.tsx` (matches component name) -- Vitest config includes: `'src/**/*.test.{ts,tsx}'` - -**Backend:** - -Location: -- Separate `tests/` directory at project root -- Organization: `tests/unit/` and `tests/integration/` -- Pattern: `tests/unit/test_{module}.py` - -Examples: -- `backend/tests/unit/test_auth.py` -- `backend/tests/unit/test_security.py` -- `backend/tests/unit/test_crypto.py` -- `backend/tests/unit/test_audit_service.py` -- `backend/tests/conftest.py` (shared fixtures) -- `backend/tests/integration/conftest.py` (database fixtures) - -**Go:** - -Location: -- Co-located with implementation: `{file}.go` and `{file}_test.go` -- Pattern: `internal/poller/scheduler_test.go` alongside `scheduler.go` - -Examples: -- `poller/internal/poller/scheduler_test.go` -- `poller/internal/sshrelay/server_test.go` -- `poller/internal/poller/integration_test.go` - -## Test Structure - -**Frontend (Vitest + React Testing Library):** - -Suite Organization: -```typescript -/** - * Component tests -- description of what is tested - */ - -import { describe, it, expect, vi, beforeEach } from 'vitest' -import { render, screen, waitFor } from '@/test/test-utils' -import userEvent from '@testing-library/user-event' - -// -------------------------------------------------------------------------- -// Mocks -// -------------------------------------------------------------------------- - -const mockNavigate = vi.fn() -vi.mock('@tanstack/react-router', () => ({ - // mock implementation -})) - -// -------------------------------------------------------------------------- -// Tests -// -------------------------------------------------------------------------- - -describe('LoginPage', () => { - beforeEach(() => { - vi.clearAllMocks() - }) - - it('renders login form with email and password fields', () => { - render() - expect(screen.getByLabelText(/email/i)).toBeInTheDocument() - }) - - it('submits form with entered credentials', async () => { - render() - const user = userEvent.setup() - await user.type(screen.getByLabelText(/email/i), 'test@example.com') - await user.click(screen.getByRole('button', { name: /sign in/i })) - - await waitFor(() => { - expect(mockLogin).toHaveBeenCalledWith('test@example.com', expect.any(String)) - }) - }) -}) -``` - -Patterns: -- Mocks defined before imports, then imported components -- Section comments: `// ---------- Mocks ----------`, `// ---------- Tests ----------` -- `describe()` blocks for test suites -- `beforeEach()` for test isolation and cleanup -- `userEvent.setup()` for simulating user interactions -- `waitFor()` for async assertions -- Accessibility-first selectors: `getByLabelText`, `getByRole` over `getByTestId` - -**Backend (pytest):** - -Suite Organization: -```python -"""Unit tests for the JWT authentication service. - -Tests cover: -- Password hashing and verification (bcrypt) -- JWT access token creation and validation -""" - -import pytest -from unittest.mock import patch - -class TestPasswordHashing: - """Tests for bcrypt password hashing.""" - - def test_hash_returns_different_string(self): - password = "test-password-123!" - hashed = hash_password(password) - assert hashed != password - - def test_hash_verify_roundtrip(self): - password = "test-password-123!" - hashed = hash_password(password) - assert verify_password(password, hashed) is True -``` - -Patterns: -- Module docstring describing test scope -- Test classes for grouping related tests: `class TestPasswordHashing:` -- Test methods: `def test_{behavior}(self):` -- Assertions: `assert condition` (pytest style) -- Fixtures defined in conftest.py for async/db setup - -**Go:** - -Suite Organization: -```go -package poller - -import ( - "context" - "testing" - - "github.com/stretchr/testify/assert" - "github.com/stretchr/testify/require" -) - -// mockDeviceFetcher implements DeviceFetcher for testing. -type mockDeviceFetcher struct { - devices []store.Device - err error -} - -func (m *mockDeviceFetcher) FetchDevices(ctx context.Context) ([]store.Device, error) { - return m.devices, m.err -} - -func newTestScheduler(fetcher DeviceFetcher) *Scheduler { - // Create test instance with mocked dependencies - return &Scheduler{...} -} - -func TestReconcileDevices_StartsNewDevices(t *testing.T) { - devices := []store.Device{...} - fetcher := &mockDeviceFetcher{devices: devices} - sched := newTestScheduler(fetcher) - - var wg sync.WaitGroup - ctx, cancel := context.WithCancel(context.Background()) - defer cancel() - - err := sched.reconcileDevices(ctx, &wg) - require.NoError(t, err) - - sched.mu.Lock() - assert.Len(t, sched.activeDevices, 2) - sched.mu.Unlock() -} -``` - -Patterns: -- Mock types defined at package level (not inside test functions) -- Constructor helper: `newTest{Subject}(...)` for creating test instances -- Test function signature: `func Test{Subject}_{Scenario}(t *testing.T)` -- testify assertions: `assert.Len()`, `require.NoError()` -- Context management with defer for cleanup -- Concurrent access protected by locks (shown in assertions) - -## Mocking - -**Frontend:** - -Framework: vitest `vi` object - -Patterns: -```typescript -// Mock module imports -vi.mock('@tanstack/react-router', () => ({ - useNavigate: () => mockNavigate, - Link: ({ children, ...props }) => {children}, -})) - -// Mock with partial real imports -vi.mock('@/lib/api', async () => { - const actual = await vi.importActual('@/lib/api') - return { - ...actual, - devicesApi: { - ...actual.devicesApi, - list: (...args: unknown[]) => mockDevicesList(...args), - }, - } -}) - -// Create spy/mock functions -const mockLogin = vi.fn() -const mockNavigate = vi.fn() - -// Configure mock behavior -mockLogin.mockResolvedValueOnce(undefined) // Resolve once -mockLogin.mockRejectedValueOnce(new Error('...')) // Reject once -mockLogin.mockReturnValueOnce(new Promise(...)) // Return pending promise - -// Clear mocks between tests -beforeEach(() => { - vi.clearAllMocks() -}) - -// Assert mock was called -expect(mockLogin).toHaveBeenCalledWith('email', 'password') -expect(mockNavigate).toHaveBeenCalledWith({ to: '/' }) -``` - -What to Mock: -- External API calls (via axios/fetch) -- Router navigation (TanStack Router) -- Zustand store state (create mock `authState`) -- External libraries with complex behavior - -What NOT to Mock: -- DOM elements (use Testing Library queries instead) -- React hooks from react-testing-library -- Component rendering (test actual render unless circular dependency) - -**Backend (Python):** - -Framework: pytest-mock (monkeypatch) and unittest.mock - -Patterns: -```python -# Fixture-based mocking -@pytest.fixture -def mock_db(monkeypatch): - # monkeypatch.setattr(module, 'function', mock_fn) - pass - -# Patch in test -def test_something(monkeypatch): - mock_fn = monkeypatch.setattr('app.services.auth.hash_password', mock_hash) - -# Mock with context manager -from unittest.mock import patch - -def test_redis(): - with patch('app.routers.auth.get_redis') as mock_redis: - mock_redis.return_value = MagicMock() - # test code -``` - -What to Mock: -- Database queries (return test data) -- External HTTP calls -- Redis operations -- Email sending -- File I/O - -What NOT to Mock: -- Core business logic (hash_password, verify_token) -- Pydantic model validation -- SQLAlchemy relationship traversal (in integration tests) - -**Go:** - -Framework: testify/mock or simple interfaces - -Patterns: -```go -// Interface-based mocking -type mockDeviceFetcher struct { - devices []store.Device - err error -} - -func (m *mockDeviceFetcher) FetchDevices(ctx context.Context) ([]store.Device, error) { - return m.devices, m.err -} - -// Use interface, not concrete type -func newTestScheduler(fetcher DeviceFetcher) *Scheduler { - return &Scheduler{store: fetcher, ...} -} - -// Configure in test -sched := newTestScheduler(&mockDeviceFetcher{ - devices: []store.Device{...}, - err: nil, -}) -``` - -What to Mock: -- Database/store interfaces -- External service calls (HTTP, SSH) -- Redis operations - -What NOT to Mock: -- Standard library functions -- Core business logic - -## Fixtures and Factories - -**Frontend Test Data:** - -Approach: Inline test data in test file - -Example from `DeviceList.test.tsx`: -```typescript -const testDevices: DeviceListResponse = { - items: [ - { - id: 'dev-1', - hostname: 'router-office-1', - ip_address: '192.168.1.1', - api_port: 8728, - api_ssl_port: 8729, - model: 'RB4011', - serial_number: 'ABC123', - firmware_version: '7.12', - routeros_version: '7.12.1', - uptime_seconds: 86400, - last_seen: '2026-03-01T12:00:00Z', - latitude: null, - longitude: null, - status: 'online', - }, - ], - total: 1, -} -``` - -**Test Utilities:** - -Location: `frontend/src/test/test-utils.tsx` - -Wrapper with providers: -```typescript -function createTestQueryClient() { - return new QueryClient({ - defaultOptions: { - queries: { retry: false, gcTime: 0, staleTime: 0 }, - mutations: { retry: false }, - }, - }) -} - -export function renderWithProviders( - ui: React.ReactElement, - options?: Omit -) { - const queryClient = createTestQueryClient() - - function Wrapper({ children }: WrapperProps) { - return ( - - {children} - - ) - } - - return { - ...render(ui, { wrapper: Wrapper, ...options }), - queryClient, - } -} - -export { renderWithProviders as render } -``` - -Usage: Import `render` from test-utils, which automatically provides React Query - -**Backend Fixtures:** - -Location: `backend/tests/conftest.py` (unit), `backend/tests/integration/conftest.py` (integration) - -Base conftest: -```python -def pytest_configure(config): - """Register custom markers.""" - config.addinivalue_line( - "markers", "integration: marks tests as integration tests requiring PostgreSQL" - ) -``` - -Integration fixtures (in `tests/integration/conftest.py`): -- Database fixtures (SQLAlchemy AsyncSession) -- Redis test instance (testcontainers) -- NATS JetStream test server - -**Go Test Helpers:** - -Location: Helper functions defined in `_test.go` files - -Example from `scheduler_test.go`: -```go -// mockDeviceFetcher implements DeviceFetcher for testing. -type mockDeviceFetcher struct { - devices []store.Device - err error -} - -func (m *mockDeviceFetcher) FetchDevices(ctx context.Context) ([]store.Device, error) { - return m.devices, m.err -} - -// newTestScheduler creates a Scheduler with a mock DeviceFetcher for testing. -func newTestScheduler(fetcher DeviceFetcher) *Scheduler { - testCache := vault.NewCredentialCache(64, 5*time.Minute, nil, make([]byte, 32), nil) - return &Scheduler{ - store: fetcher, - locker: nil, - publisher: nil, - credentialCache: testCache, - pollInterval: 24 * time.Hour, - connTimeout: time.Second, - cmdTimeout: time.Second, - refreshPeriod: time.Second, - maxFailures: 5, - baseBackoff: 30 * time.Second, - maxBackoff: 15 * time.Minute, - activeDevices: make(map[string]*deviceState), - } -} -``` - -## Coverage - -**Frontend:** - -Requirements: Not enforced (no threshold in vitest config) - -View Coverage: -```bash -npm run test:coverage -# Generates coverage in frontend/coverage/ directory -``` - -**Backend:** - -Requirements: Not enforced in config (but tracked) - -View Coverage: -```bash -pytest --cov=app --cov-report=term-missing -pytest --cov=app --cov-report=html # Generates htmlcov/index.html -``` - -**Go:** - -Requirements: Not enforced - -View Coverage: -```bash -go test -cover ./... -go tool cover -html=coverage.out # Visual report -``` - -## Test Types - -**Frontend Unit Tests:** - -Scope: -- Individual component rendering -- User interactions (click, type) -- Component state changes -- Props and variant rendering - -Approach: -- Render component with test-utils -- Simulate user events with userEvent -- Assert on rendered DOM - -Example from `LoginPage.test.tsx`: -```typescript -it('renders login form with email and password fields', () => { - render() - expect(screen.getByLabelText(/email/i)).toBeInTheDocument() - expect(screen.getByLabelText(/password/i)).toBeInTheDocument() -}) - -it('submits form with entered credentials', async () => { - mockLogin.mockResolvedValueOnce(undefined) - render() - const user = userEvent.setup() - await user.type(screen.getByLabelText(/email/i), 'admin@example.com') - await user.click(screen.getByRole('button', { name: /sign in/i })) - await waitFor(() => { - expect(mockLogin).toHaveBeenCalledWith('admin@example.com', 'secret123') - }) -}) -``` - -**Frontend E2E Tests:** - -Framework: Playwright -Config: `frontend/playwright.config.ts` - -Approach: -- Launch real browser -- Navigate through app -- Test full user journeys -- Sequential execution (no parallelization) for stability - -Config highlights: -```typescript -fullyParallel: false, // Run sequentially for stability -workers: 1, // Single worker -timeout: 30000, // 30 second timeout per test -retries: process.env.CI ? 2 : 0, // Retry in CI -``` - -Location: `frontend/tests/e2e/` (referenced in playwright config) - -**Backend Unit Tests:** - -Scope: -- Pure function behavior (hash_password, verify_token) -- Service methods without database -- Validation logic - -Approach: -- No async/await needed unless using mocking -- Direct function calls -- Assert on return values - -Example from `test_auth.py`: -```python -class TestPasswordHashing: - def test_hash_returns_different_string(self): - password = "test-password-123!" - hashed = hash_password(password) - assert hashed != password - - def test_hash_verify_roundtrip(self): - password = "test-password-123!" - hashed = hash_password(password) - assert verify_password(password, hashed) is True -``` - -**Backend Integration Tests:** - -Scope: -- Full request/response cycle -- Database operations with fixtures -- External service interactions (Redis, NATS) - -Approach: -- Marked with `@pytest.mark.integration` -- Use async fixtures for database -- Skip with `-m "not integration"` in CI (slow) - -Location: `backend/tests/integration/` - -Example: -```python -@pytest.mark.integration -async def test_login_creates_session(async_db, client): - # Creates user in test database - # Posts to /api/auth/login - # Asserts JWT tokens in response - pass -``` - -**Go Tests:** - -Scope: Unit tests for individual functions, integration tests for subsystems - -Unit test example: -```go -func TestReconcileDevices_StartsNewDevices(t *testing.T) { - devices := []store.Device{...} - fetcher := &mockDeviceFetcher{devices: devices} - sched := newTestScheduler(fetcher) - - var wg sync.WaitGroup - ctx, cancel := context.WithCancel(context.Background()) - defer cancel() - - err := sched.reconcileDevices(ctx, &wg) - require.NoError(t, err) - - sched.mu.Lock() - assert.Len(t, sched.activeDevices, 2) - sched.mu.Unlock() - - cancel() - wg.Wait() -} -``` - -Integration test: Uses testcontainers for PostgreSQL, Redis, NATS (e.g., `integration_test.go`) - -## Common Patterns - -**Async Testing (Frontend):** - -Pattern for testing async operations: -```typescript -it('navigates to home on successful login', async () => { - mockLogin.mockResolvedValueOnce(undefined) - - render() - - const user = userEvent.setup() - await user.type(screen.getByLabelText(/email/i), 'admin@example.com') - await user.type(screen.getByLabelText(/password/i), 'secret123') - await user.click(screen.getByRole('button', { name: /sign in/i })) - - await waitFor(() => { - expect(mockNavigate).toHaveBeenCalledWith({ to: '/' }) - }) -}) -``` - -- Use `userEvent.setup()` for user interactions -- Use `await waitFor()` for assertions on async results -- Mock promises with `mockFn.mockResolvedValueOnce()` or `mockRejectedValueOnce()` - -**Error Testing (Frontend):** - -Pattern for testing error states: -```typescript -it('shows error message on failed login', async () => { - mockLogin.mockRejectedValueOnce(new Error('Invalid credentials')) - authState.error = null - - render() - const user = userEvent.setup() - await user.type(screen.getByLabelText(/email/i), 'test@example.com') - await user.type(screen.getByLabelText(/password/i), 'wrongpassword') - await user.click(screen.getByRole('button', { name: /sign in/i })) - - authState.error = 'Invalid credentials' - render() - - expect(screen.getByText('Invalid credentials')).toBeInTheDocument() -}) -``` - -**Async Testing (Backend):** - -Pattern for async pytest: -```python -@pytest.mark.asyncio -async def test_get_redis(): - redis = await get_redis() - assert redis is not None -``` - -Configure in `pyproject.toml`: `asyncio_mode = "auto"` (enabled globally) - -**Error Testing (Backend):** - -Pattern for testing exceptions: -```python -def test_verify_token_rejects_expired(): - token = create_access_token(user_id=uuid4(), expires_delta=timedelta(seconds=-1)) - with pytest.raises(HTTPException) as exc_info: - verify_token(token, expected_type="access") - assert exc_info.value.status_code == 401 -``` - ---- - -*Testing analysis: 2026-03-12* diff --git a/.planning/phases/02-poller-config-collection/02-01-PLAN.md b/.planning/phases/02-poller-config-collection/02-01-PLAN.md deleted file mode 100644 index b17450d..0000000 --- a/.planning/phases/02-poller-config-collection/02-01-PLAN.md +++ /dev/null @@ -1,308 +0,0 @@ ---- -phase: 02-poller-config-collection -plan: 01 -type: execute -wave: 1 -depends_on: [] -files_modified: - - poller/internal/device/ssh_executor.go - - poller/internal/device/ssh_executor_test.go - - poller/internal/device/normalize.go - - poller/internal/device/normalize_test.go - - poller/internal/config/config.go - - poller/internal/bus/publisher.go - - poller/internal/observability/metrics.go - - poller/internal/store/devices.go - - backend/alembic/versions/028_device_ssh_host_key.py -autonomous: true -requirements: [COLL-01, COLL-02, COLL-06] - -must_haves: - truths: - - "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors" - - "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)" - - "SHA256 hash is computed on normalized output" - - "Config backup interval and concurrency are configurable via environment variables" - - "Host key fingerprint is stored on device record for TOFU verification" - artifacts: - - path: "poller/internal/device/ssh_executor.go" - provides: "RunCommand SSH executor with TOFU host key verification and typed errors" - exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"] - - path: "poller/internal/device/normalize.go" - provides: "NormalizeConfig function and SHA256 hashing" - exports: ["NormalizeConfig", "HashConfig"] - - path: "poller/internal/device/ssh_executor_test.go" - provides: "Unit tests for SSH executor error classification" - - path: "poller/internal/device/normalize_test.go" - provides: "Unit tests for config normalization with edge cases" - - path: "poller/internal/config/config.go" - provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars" - - path: "poller/internal/bus/publisher.go" - provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream" - - path: "poller/internal/store/devices.go" - provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method" - - path: "backend/alembic/versions/028_device_ssh_host_key.py" - provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table" - key_links: - - from: "poller/internal/device/ssh_executor.go" - to: "poller/internal/store/devices.go" - via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection" - pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint" - - from: "poller/internal/device/normalize.go" - to: "poller/internal/bus/publisher.go" - via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields" - pattern: "NormalizeConfig|HashConfig" ---- - - -Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions. - -Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns. -Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns. - - - -@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md -@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md - - - -@.planning/PROJECT.md -@.planning/ROADMAP.md -@.planning/STATE.md -@.planning/phases/02-poller-config-collection/02-CONTEXT.md -@.planning/phases/01-database-schema/01-01-SUMMARY.md - -@poller/internal/device/sftp.go -@poller/internal/bus/publisher.go -@poller/internal/config/config.go -@poller/internal/store/devices.go -@poller/internal/observability/metrics.go -@poller/internal/poller/scheduler.go -@poller/go.mod - - - - -From poller/internal/device/sftp.go: -```go -func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error) -// Uses ssh.InsecureIgnoreHostKey() — executor replaces this with TOFU callback -``` - -From poller/internal/store/devices.go: -```go -type Device struct { - ID string - TenantID string - IPAddress string - APIPort int - APISSLPort int - EncryptedCredentials []byte - EncryptedCredentialsTransit *string - RouterOSVersion *string - MajorVersion *int - TLSMode string - CACertPEM *string -} -// SSHPort and SSHHostKeyFingerprint need to be added -``` - -From poller/internal/bus/publisher.go: -```go -type Publisher struct { nc *nats.Conn; js jetstream.JetStream } -func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error -// Follow this pattern for PublishConfigSnapshot -// Stream subjects list needs "config.snapshot.>" added -``` - -From poller/internal/config/config.go: -```go -func Load() (*Config, error) -// Uses getEnv(key, default) and getEnvInt(key, default) helpers -``` - - - - - - - Task 1: SSH executor, normalizer, and their tests - - poller/internal/device/ssh_executor.go, - poller/internal/device/ssh_executor_test.go, - poller/internal/device/normalize.go, - poller/internal/device/normalize_test.go - - - SSH Executor (ssh_executor_test.go): - - Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown) - - Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error - - Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration, Error) - - Normalizer (normalize_test.go): - - Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line and following blank line, preserves software id comment - - Test line ending normalization: "\r\n" becomes "\n" - - Test trailing whitespace trimming: " /ip address \n" becomes "/ip address\n" - - Test blank line collapsing: three consecutive blank lines become one - - Test trailing newline: output always ends with exactly one "\n" - - Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved - - Test full normalization pipeline: realistic RouterOS export with all issues produces clean output - - Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars) - - Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input) - - - Create `poller/internal/device/ssh_executor.go`: - - 1. Define types: - - `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown` - - `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string` - - `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration` - - 2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`: - - Returns (result, observedFingerprint, error) - - Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU: - - If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint - - If knownFingerprint matches: accept - - If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch} - - Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` (same as ssh-keygen) - - Dial with context-aware timeout - - Create session, run command via session.Run() - - Capture stdout/stderr via session.StdoutPipe/StderrPipe or CombinedOutput pattern - - Classify errors using `classifySSHError(err)` helper that inspects error strings and types - - Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput} - - 3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns - - Create `poller/internal/device/normalize.go`: - - 1. `NormalizeConfig(raw string) string`: - - Use regexp to strip timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it - - Replace \r\n with \n (before other processing) - - Split into lines, trim trailing whitespace from each line - - Collapse consecutive blank lines (2+ empty lines become 1) - - Ensure single trailing newline - - Return normalized string - - 2. `HashConfig(normalized string) string`: - - Compute SHA256 of the normalized string bytes - - Return lowercase hex string (64 chars) - - 3. `const NormalizationVersion = 1` — for future tracking in NATS payload - - Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests). - - - cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1 - - - - RunCommand function compiles with correct signature returning (CommandResult, fingerprint, error) - - SSHError type with Kind field covers all 6 error classifications - - TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch - - NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline - - HashConfig returns 64-char lowercase hex SHA256 - - All unit tests pass - - - - - Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics - - poller/internal/config/config.go, - poller/internal/bus/publisher.go, - poller/internal/store/devices.go, - poller/internal/observability/metrics.go, - backend/alembic/versions/028_device_ssh_host_key.py - - - **1. Config env vars** (`config.go`): - Add three fields to the Config struct and load them in Load(): - - `ConfigBackupIntervalSeconds int` — `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s) - - `ConfigBackupMaxConcurrent int` — `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)` - - `ConfigBackupCommandTimeoutSeconds int` — `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)` - - **2. NATS event type and publisher** (`publisher.go`): - - Add `ConfigSnapshotEvent` struct: - ```go - type ConfigSnapshotEvent struct { - DeviceID string `json:"device_id"` - TenantID string `json:"tenant_id"` - RouterOSVersion string `json:"routeros_version,omitempty"` - CollectedAt string `json:"collected_at"` // RFC3339 - SHA256Hash string `json:"sha256_hash"` - ConfigText string `json:"config_text"` - NormalizationVersion int `json:"normalization_version"` - } - ``` - - Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics - - Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)` - - Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher` - - **3. Device model extensions** (`devices.go`): - - Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string` - - Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint` - - Update GetDevice query similarly - - Update both Scan calls to include the new fields - - Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore: - ```go - const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2` - ``` - (This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint) — handled in migration) - - **4. Alembic migration** (`028_device_ssh_host_key.py`): - Follow the raw SQL pattern from migration 027. Create migration that: - - `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22` - - `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT` - - `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ` - - `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ` - - `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user` - - Downgrade: `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified` - - `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user` - - **5. Prometheus metrics** (`metrics.go`): - Add config backup specific metrics: - - `ConfigBackupTotal` CounterVec with labels ["status"] — status: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked" - - `ConfigBackupDuration` Histogram — buckets: [1, 5, 10, 30, 60, 120, 300] - - `ConfigBackupActive` Gauge — number of concurrent backup jobs running - - - cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1 - - - - Config struct has 3 new backup config fields loading from env vars with correct defaults - - ConfigSnapshotEvent type exists with all required JSON fields - - PublishConfigSnapshot method exists following existing publisher pattern - - config.snapshot.> added to DEVICE_EVENTS stream subjects - - Device struct has SSHPort and SSHHostKeyFingerprint fields - - FetchDevices and GetDevice queries select and scan the new columns - - UpdateSSHHostKey method exists for TOFU fingerprint storage - - Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants - - Three new Prometheus metrics registered for config backup observability - - All existing tests still pass, project compiles clean - - - - - - -1. `cd poller && go build ./...` — entire project compiles -2. `cd poller && go vet ./...` — no static analysis issues -3. `cd poller && go test ./internal/device/ -v -count=1` — SSH executor and normalizer tests pass -4. `cd poller && go test ./internal/config/ -v -count=1` — config tests pass -5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py` - - - -- SSH executor RunCommand function exists with TOFU host key verification and typed error classification -- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically -- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout) -- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use -- Device model includes SSH port and host key fingerprint fields -- Database migration ready to add SSH columns to devices table -- Prometheus metrics registered for backup collection observability -- All tests pass, project compiles clean - - - -After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md` - diff --git a/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md b/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md deleted file mode 100644 index 7e94424..0000000 --- a/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md +++ /dev/null @@ -1,128 +0,0 @@ ---- -phase: 02-poller-config-collection -plan: 01 -subsystem: poller -tags: [ssh, tofu, routeros, config-normalization, sha256, nats, prometheus, alembic] - -requires: - - phase: 01-database-schema - provides: router_config_snapshots table for storing backup data -provides: - - SSH command executor with TOFU host key verification and typed error classification - - Config normalizer with deterministic SHA256 hashing - - ConfigSnapshotEvent NATS event type and PublishConfigSnapshot method - - Config backup environment variables (interval, concurrency, timeout) - - Device model SSH fields (port, host key fingerprint) with UpdateSSHHostKey method - - Alembic migration 028 for devices table SSH columns - - Prometheus metrics for config backup observability -affects: [02-02-backup-scheduler, 03-backend-subscriber] - -tech-stack: - added: [] - patterns: - - "TOFU host key verification via SHA256 fingerprint comparison" - - "Config normalization pipeline: line endings, timestamp strip, whitespace trim, blank collapse" - - "SSH error classification into typed SSHErrorKind enum" - -key-files: - created: - - poller/internal/device/ssh_executor.go - - poller/internal/device/ssh_executor_test.go - - poller/internal/device/normalize.go - - poller/internal/device/normalize_test.go - - backend/alembic/versions/028_device_ssh_host_key.py - modified: - - poller/internal/config/config.go - - poller/internal/bus/publisher.go - - poller/internal/store/devices.go - - poller/internal/observability/metrics.go - -key-decisions: - - "TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))" - - "NormalizationVersion=1 constant included in NATS payloads for future re-processing" - - "UpdateSSHHostKey sets first_seen via COALESCE to preserve original observation time" - -patterns-established: - - "SSH error classification: classifySSHError inspects error strings for auth/hostkey/timeout/refused patterns" - - "Config normalization: version-tracked deterministic pipeline for RouterOS export output" - -requirements-completed: [COLL-01, COLL-02, COLL-06] - -duration: 5min -completed: 2026-03-13 ---- - -# Phase 02 Plan 01: Config Backup Primitives Summary - -**SSH executor with TOFU host key verification, RouterOS config normalizer with SHA256 hashing, NATS snapshot event, and Alembic migration for device SSH columns** - -## Performance - -- **Duration:** 5 min -- **Started:** 2026-03-13T01:43:33Z -- **Completed:** 2026-03-13T01:48:38Z -- **Tasks:** 2 -- **Files modified:** 9 - -## Accomplishments -- SSH RunCommand executor with context-aware dialing, TOFU host key callback, and 6-kind typed error classification -- Deterministic config normalizer: strips RouterOS timestamps, normalizes line endings, trims whitespace, collapses blanks, computes SHA256 hash -- 22 unit tests covering error classification, TOFU flows (first connect/match/mismatch), normalization edge cases, idempotency -- Config backup env vars, NATS ConfigSnapshotEvent, device model SSH extensions, migration 028, Prometheus metrics - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: SSH executor, normalizer, and their tests** - `f1abb75` (feat) -2. **Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics** - `4ae39d2` (feat) - -_Note: Task 1 used TDD -- tests written first (RED), implementation second (GREEN)._ - -## Files Created/Modified -- `poller/internal/device/ssh_executor.go` - RunCommand SSH executor with TOFU host key verification and typed errors -- `poller/internal/device/ssh_executor_test.go` - Unit tests for SSH error classification, TOFU callbacks, CommandResult -- `poller/internal/device/normalize.go` - NormalizeConfig and HashConfig for RouterOS export output -- `poller/internal/device/normalize_test.go` - Table-driven tests for normalization pipeline edge cases -- `poller/internal/config/config.go` - Added ConfigBackupIntervalSeconds, ConfigBackupMaxConcurrent, ConfigBackupCommandTimeoutSeconds -- `poller/internal/bus/publisher.go` - Added ConfigSnapshotEvent type, PublishConfigSnapshot method, config.snapshot.> stream subject -- `poller/internal/store/devices.go` - Added SSHPort/SSHHostKeyFingerprint fields, UpdateSSHHostKey method, updated queries -- `poller/internal/observability/metrics.go` - Added ConfigBackupTotal, ConfigBackupDuration, ConfigBackupActive metrics -- `backend/alembic/versions/028_device_ssh_host_key.py` - Migration adding ssh_port, ssh_host_key_fingerprint, timestamp columns - -## Decisions Made -- TOFU fingerprint format uses SHA256:base64(sha256(pubkey)) to match ssh-keygen output format -- NormalizationVersion=1 constant is included in NATS payloads so consumers can detect algorithm changes -- UpdateSSHHostKey uses COALESCE on ssh_host_key_first_seen to preserve original observation timestamp - -## Deviations from Plan - -### Auto-fixed Issues - -**1. [Rule 1 - Bug] Fixed test key generation approach** -- **Found during:** Task 1 (GREEN phase) -- **Issue:** Embedded OpenSSH PEM test key had padding errors ("ssh: padding not as expected") -- **Fix:** Switched to programmatic ed25519 key generation via crypto/ed25519.GenerateKey -- **Files modified:** poller/internal/device/ssh_executor_test.go -- **Verification:** All 22 tests pass -- **Committed in:** f1abb75 (Task 1 commit) - ---- - -**Total deviations:** 1 auto-fixed (1 bug) -**Impact on plan:** Minimal -- test infrastructure fix only, no production code change. - -## Issues Encountered -None beyond the test key generation fix documented above. - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- All primitives ready for Plan 02 (backup scheduler) to wire together -- SSH executor, normalizer, NATS event, device model, config, and metrics are independently tested and compilable -- Migration 028 ready to apply before deploying the backup scheduler - ---- -*Phase: 02-poller-config-collection* -*Completed: 2026-03-13* diff --git a/.planning/phases/02-poller-config-collection/02-02-PLAN.md b/.planning/phases/02-poller-config-collection/02-02-PLAN.md deleted file mode 100644 index 574f2df..0000000 --- a/.planning/phases/02-poller-config-collection/02-02-PLAN.md +++ /dev/null @@ -1,394 +0,0 @@ ---- -phase: 02-poller-config-collection -plan: 02 -type: execute -wave: 2 -depends_on: ["02-01"] -files_modified: - - poller/internal/poller/backup_scheduler.go - - poller/internal/poller/backup_scheduler_test.go - - poller/internal/poller/interfaces.go - - poller/cmd/poller/main.go -autonomous: true -requirements: [COLL-01, COLL-03, COLL-05, COLL-06] - -must_haves: - truths: - - "Poller runs /export show-sensitive via SSH on each online RouterOS device at a configurable interval (default 6h)" - - "Poller publishes normalized config snapshot to NATS config.snapshot.create with device_id, tenant_id, sha256_hash, config_text" - - "Unreachable devices log a warning and are retried on the next interval without blocking other devices" - - "Backup interval is configurable via CONFIG_BACKUP_INTERVAL environment variable" - - "First backup runs with randomized jitter (30-300s) after device discovery" - - "Global concurrency is limited via CONFIG_BACKUP_MAX_CONCURRENT semaphore" - - "Auth failures and host key mismatches block retries until resolved" - artifacts: - - path: "poller/internal/poller/backup_scheduler.go" - provides: "BackupScheduler managing per-device backup goroutines with concurrency, retry, and NATS publishing" - exports: ["BackupScheduler", "NewBackupScheduler"] - min_lines: 200 - - path: "poller/internal/poller/backup_scheduler_test.go" - provides: "Unit tests for backup scheduling, jitter, concurrency, error handling" - - path: "poller/internal/poller/interfaces.go" - provides: "SSHHostKeyUpdater interface for device store dependency" - - path: "poller/cmd/poller/main.go" - provides: "BackupScheduler initialization and lifecycle wiring" - key_links: - - from: "poller/internal/poller/backup_scheduler.go" - to: "poller/internal/device/ssh_executor.go" - via: "Calls device.RunCommand to execute /export show-sensitive" - pattern: "device\\.RunCommand" - - from: "poller/internal/poller/backup_scheduler.go" - to: "poller/internal/device/normalize.go" - via: "Calls device.NormalizeConfig and device.HashConfig on SSH output" - pattern: "device\\.NormalizeConfig|device\\.HashConfig" - - from: "poller/internal/poller/backup_scheduler.go" - to: "poller/internal/bus/publisher.go" - via: "Calls publisher.PublishConfigSnapshot with ConfigSnapshotEvent" - pattern: "publisher\\.PublishConfigSnapshot|bus\\.ConfigSnapshotEvent" - - from: "poller/internal/poller/backup_scheduler.go" - to: "poller/internal/store/devices.go" - via: "Calls store.UpdateSSHHostKey for TOFU fingerprint storage" - pattern: "UpdateSSHHostKey" - - from: "poller/cmd/poller/main.go" - to: "poller/internal/poller/backup_scheduler.go" - via: "Creates and starts BackupScheduler in main goroutine lifecycle" - pattern: "NewBackupScheduler|backupScheduler\\.Run" ---- - - -Build the backup scheduler that orchestrates periodic SSH config collection from RouterOS devices, normalizes output, and publishes to NATS. Wire it into the poller's main lifecycle. - -Purpose: This is the core orchestration that ties together the SSH executor, normalizer, and NATS publisher from Plan 01 into a running backup collection system with proper scheduling, concurrency control, error handling, and retry logic. -Output: BackupScheduler module fully integrated into the poller's main.go lifecycle. - - - -@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md -@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md - - - -@.planning/PROJECT.md -@.planning/ROADMAP.md -@.planning/STATE.md -@.planning/phases/02-poller-config-collection/02-CONTEXT.md -@.planning/phases/02-poller-config-collection/02-01-SUMMARY.md - -@poller/internal/poller/scheduler.go -@poller/internal/poller/worker.go -@poller/internal/poller/interfaces.go -@poller/cmd/poller/main.go -@poller/internal/device/ssh_executor.go -@poller/internal/device/normalize.go -@poller/internal/bus/publisher.go -@poller/internal/config/config.go -@poller/internal/store/devices.go -@poller/internal/observability/metrics.go - - - - -From poller/internal/device/ssh_executor.go (created in Plan 01): -```go -type SSHErrorKind string -const ( - ErrAuthFailed SSHErrorKind = "auth_failed" - ErrHostKeyMismatch SSHErrorKind = "host_key_mismatch" - ErrTimeout SSHErrorKind = "timeout" - ErrTruncatedOutput SSHErrorKind = "truncated_output" - ErrConnectionRefused SSHErrorKind = "connection_refused" - ErrUnknown SSHErrorKind = "unknown" -) - -type SSHError struct { Kind SSHErrorKind; Err error; Message string } -type CommandResult struct { Stdout string; Stderr string; ExitCode int; Duration time.Duration } - -func RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error) -``` - -From poller/internal/device/normalize.go (created in Plan 01): -```go -func NormalizeConfig(raw string) string -func HashConfig(normalized string) string -const NormalizationVersion = 1 -``` - -From poller/internal/bus/publisher.go (modified in Plan 01): -```go -type ConfigSnapshotEvent struct { - DeviceID string `json:"device_id"` - TenantID string `json:"tenant_id"` - RouterOSVersion string `json:"routeros_version,omitempty"` - CollectedAt string `json:"collected_at"` - SHA256Hash string `json:"sha256_hash"` - ConfigText string `json:"config_text"` - NormalizationVersion int `json:"normalization_version"` -} -func (p *Publisher) PublishConfigSnapshot(ctx context.Context, event ConfigSnapshotEvent) error -``` - -From poller/internal/store/devices.go (modified in Plan 01): -```go -type Device struct { - // ... existing fields ... - SSHPort int - SSHHostKeyFingerprint *string -} -func (s *DeviceStore) UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error -``` - -From poller/internal/config/config.go (modified in Plan 01): -```go -type Config struct { - // ... existing fields ... - ConfigBackupIntervalSeconds int - ConfigBackupMaxConcurrent int - ConfigBackupCommandTimeoutSeconds int -} -``` - -From poller/internal/observability/metrics.go (modified in Plan 01): -```go -var ConfigBackupTotal *prometheus.CounterVec // labels: ["status"] -var ConfigBackupDuration prometheus.Histogram -var ConfigBackupActive prometheus.Gauge -``` - - - -From poller/internal/poller/scheduler.go: -```go -type Scheduler struct { ... } -func NewScheduler(...) *Scheduler -func (s *Scheduler) Run(ctx context.Context) error -func (s *Scheduler) reconcileDevices(ctx context.Context, wg *sync.WaitGroup) error -func (s *Scheduler) runDeviceLoop(ctx context.Context, dev store.Device, ds *deviceState) // per-device goroutine with ticker -``` - -From poller/internal/poller/interfaces.go: -```go -type DeviceFetcher interface { - FetchDevices(ctx context.Context) ([]store.Device, error) -} -``` - - - - - - - Task 1: BackupScheduler with per-device goroutines, concurrency control, and retry logic - - poller/internal/poller/backup_scheduler.go, - poller/internal/poller/backup_scheduler_test.go, - poller/internal/poller/interfaces.go - - - - Test jitter generation: randomJitter(30, 300) returns value in [30s, 300s] range - - Test backoff sequence: given consecutive failures, backoff returns 5m, 15m, 1h, then caps at 1h - - Test auth failure blocking: when last error is ErrAuthFailed, shouldRetry returns false - - Test host key mismatch blocking: when last error is ErrHostKeyMismatch, shouldRetry returns false - - Test online-only gating: backup is skipped for devices not currently marked online - - Test concurrency semaphore: when semaphore is full, backup waits (does not drop) - - - **1. Update interfaces.go:** - Add `SSHHostKeyUpdater` interface (consumer-side, Go best practice): - ```go - type SSHHostKeyUpdater interface { - UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error - } - ``` - - **2. Create backup_scheduler.go:** - - Define `backupDeviceState` struct tracking per-device backup state: - - `cancel context.CancelFunc` - - `lastAttemptAt time.Time` - - `lastSuccessAt time.Time` - - `lastStatus string` — "success", "error", "skipped_offline", "auth_blocked", "hostkey_blocked" - - `lastError string` - - `consecutiveFailures int` - - `backoffUntil time.Time` - - `lastErrorKind device.SSHErrorKind` — tracks whether error is auth/hostkey (blocks retry) - - Define `BackupScheduler` struct: - - `store DeviceFetcher` — reuse existing interface for FetchDevices - - `hostKeyStore SSHHostKeyUpdater` — for UpdateSSHHostKey - - `locker *redislock.Client` — per-device distributed lock - - `publisher *bus.Publisher` — for NATS publishing - - `credentialCache *vault.CredentialCache` — for decrypting device SSH creds - - `redisClient *redis.Client` — for tracking device online status - - `backupInterval time.Duration` - - `commandTimeout time.Duration` - - `refreshPeriod time.Duration` — how often to reconcile devices (reuse from existing scheduler, e.g., 60s) - - `semaphore chan struct{}` — buffered channel of size maxConcurrent - - `mu sync.Mutex` - - `activeDevices map[string]*backupDeviceState` - - `NewBackupScheduler(...)` constructor — accept all dependencies, create semaphore as `make(chan struct{}, maxConcurrent)`. - - `Run(ctx context.Context) error` — mirrors existing Scheduler.Run pattern: - - defer shutdown: cancel all device goroutines, wait for WaitGroup - - Loop: reconcileBackupDevices(ctx, &wg), then select on ctx.Done or time.After(refreshPeriod) - - `reconcileBackupDevices(ctx, wg)` — mirrors reconcileDevices: - - FetchDevices from store - - Start backup goroutines for new devices - - Stop goroutines for removed devices - - `runBackupLoop(ctx, dev, state)` — per-device backup goroutine: - - On first run: sleep for randomJitter(30, 300) seconds, then do initial backup - - After initial: ticker at backupInterval - - On each tick: - a. Check if device is online via Redis key `device:{id}:status` (set by status poll). If not online, log debug "skipped_offline", update state, increment ConfigBackupTotal("skipped_offline"), continue - b. Check if lastErrorKind is ErrAuthFailed — skip with "skipped_auth_blocked", log warning with guidance to update credentials - c. Check if lastErrorKind is ErrHostKeyMismatch — skip with "skipped_hostkey_blocked", log warning with guidance to reset host key - d. Check backoff: if time.Now().Before(state.backoffUntil), skip - e. Acquire semaphore (blocks if at max concurrency, does not drop) - f. Acquire Redis lock `backup:device:{id}` with TTL = commandTimeout + 30s - g. Call `collectAndPublish(ctx, dev, state)` - h. Release semaphore - i. Update state based on result - - `collectAndPublish(ctx, dev, state) error`: - - Increment ConfigBackupActive gauge - - Defer decrement ConfigBackupActive gauge - - Start timer for ConfigBackupDuration - - Decrypt credentials via credentialCache.GetCredentials - - Call `device.RunCommand(ctx, dev.IPAddress, dev.SSHPort, username, password, commandTimeout, knownFingerprint, "/export show-sensitive")` - - On error: classify error kind, update state, apply backoff (transient: 5m/15m/1h exponential; auth/hostkey: block), return - - If new fingerprint returned (TOFU first connect): call hostKeyStore.UpdateSSHHostKey - - Validate output is non-empty and looks like RouterOS config (basic sanity: contains "/") - - Call `device.NormalizeConfig(result.Stdout)` - - Call `device.HashConfig(normalized)` - - Build `bus.ConfigSnapshotEvent` with device_id, tenant_id, routeros_version (from device or Redis), collected_at (RFC3339 now), sha256_hash, config_text, normalization_version - - Call `publisher.PublishConfigSnapshot(ctx, event)` - - On success: reset consecutiveFailures, update lastSuccessAt, increment ConfigBackupTotal("success") - - Record ConfigBackupDuration - - `randomJitter(minSeconds, maxSeconds int) time.Duration` — uses math/rand for uniform distribution - - Backoff for transient errors: `calculateBackupBackoff(failures int) time.Duration`: - - 1 failure: 5 min - - 2 failures: 15 min - - 3+ failures: 1 hour (cap) - - Device online check via Redis: check if key `device:{id}:status` equals "online". This key is set by the existing status poll publisher flow. If key doesn't exist, assume device might be online (first poll hasn't happened yet) — allow backup attempt. - - RouterOS version: read from the Device struct's RouterOSVersion field (populated by store query). If nil, use empty string in the event. - - **Important implementation notes:** - - Use `log/slog` for all logging (structured JSON, matching existing pattern) - - Use existing `redislock` pattern from worker.go for per-device locking - - Semaphore pattern: `s.semaphore <- struct{}{}` to acquire, `<-s.semaphore` to release - - Do NOT share circuit breaker state with the status poll scheduler — these are independent - - Partial/truncated output (SSHError with Kind ErrTruncatedOutput) is treated as transient error — never publish, apply backoff - - - cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/poller/ -run "TestBackup|TestJitter|TestBackoff|TestShouldRetry" -v -count=1 - - - - BackupScheduler manages per-device backup goroutines independently from status poll scheduler - - First backup uses 30-300s random jitter delay - - Concurrency limited by buffered channel semaphore (default 10) - - Per-device Redis lock prevents duplicate backups across pods - - Auth failures and host key mismatches block retries with clear log messages - - Transient errors use 5m/15m/1h exponential backoff - - Offline devices are skipped without error - - Successful backups normalize config, compute SHA256, and publish to NATS - - TOFU fingerprint stored on first successful connection - - All unit tests pass - - - - - Task 2: Wire BackupScheduler into main.go lifecycle - poller/cmd/poller/main.go - - Add BackupScheduler initialization and startup to main.go, following the existing pattern of scheduler initialization (lines 250-278). - - After the existing scheduler creation (around line 270), add a new section: - - ``` - // ----------------------------------------------------------------------- - // Start the config backup scheduler - // ----------------------------------------------------------------------- - ``` - - 1. Convert config values to durations: - ```go - backupInterval := time.Duration(cfg.ConfigBackupIntervalSeconds) * time.Second - backupCmdTimeout := time.Duration(cfg.ConfigBackupCommandTimeoutSeconds) * time.Second - ``` - - 2. Create BackupScheduler: - ```go - backupScheduler := poller.NewBackupScheduler( - deviceStore, - deviceStore, // SSHHostKeyUpdater (DeviceStore satisfies this interface) - locker, - publisher, - credentialCache, - redisClient, - backupInterval, - backupCmdTimeout, - refreshPeriod, // reuse existing device refresh period - cfg.ConfigBackupMaxConcurrent, - ) - ``` - - 3. Start in a goroutine (runs parallel with the main status poll scheduler): - ```go - go func() { - slog.Info("starting config backup scheduler", - "interval", backupInterval, - "max_concurrent", cfg.ConfigBackupMaxConcurrent, - "command_timeout", backupCmdTimeout, - ) - if err := backupScheduler.Run(ctx); err != nil { - slog.Error("backup scheduler exited with error", "error", err) - } - }() - ``` - - The BackupScheduler shares the same ctx as everything else, so SIGINT/SIGTERM will trigger its shutdown via context cancellation. No additional shutdown logic needed — Run() returns when ctx is cancelled. - - Log the startup with the same pattern as the existing scheduler startup log (line 273-276). - - - cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./cmd/poller/ && echo "build successful" - - - - BackupScheduler created in main.go with all dependencies injected - - Runs as a goroutine parallel to the status poll scheduler - - Shares the same context for graceful shutdown - - Startup logged with interval, max_concurrent, and command_timeout - - Poller binary compiles successfully with the new scheduler wired in - - - - - - -1. `cd poller && go build ./cmd/poller/` — binary compiles with backup scheduler wired in -2. `cd poller && go vet ./...` — no static analysis issues -3. `cd poller && go test ./internal/poller/ -v -count=1` — all poller tests pass (existing + new backup tests) -4. `cd poller && go test ./... -count=1` — full test suite passes - - - -- BackupScheduler runs independently from status poll scheduler with its own per-device goroutines -- Devices get their first backup 30-300s after discovery, then every CONFIG_BACKUP_INTERVAL -- SSH command execution uses TOFU host key verification and stores fingerprints on first connect -- Config output is normalized, hashed, and published to NATS config.snapshot.create -- Concurrency limited to CONFIG_BACKUP_MAX_CONCURRENT parallel SSH sessions -- Auth/hostkey errors block retries; transient errors use exponential backoff (5m/15m/1h) -- Offline devices are skipped gracefully -- BackupScheduler is wired into main.go and starts/stops with the poller lifecycle -- All tests pass, project compiles clean - - - -After completion, create `.planning/phases/02-poller-config-collection/02-02-SUMMARY.md` - diff --git a/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md b/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md deleted file mode 100644 index 1d2ee3e..0000000 --- a/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md +++ /dev/null @@ -1,100 +0,0 @@ ---- -phase: 02-poller-config-collection -plan: 02 -subsystem: poller -tags: [ssh, backup, scheduler, nats, routeros, concurrency, tofu, redis] - -requires: - - phase: 02-poller-config-collection/01 - provides: SSH executor, config normalizer, NATS ConfigSnapshotEvent, Prometheus metrics, config fields -provides: - - BackupScheduler with per-device goroutines managing periodic SSH config collection - - Concurrency-limited config backup pipeline (SSH -> normalize -> hash -> NATS publish) - - TOFU host key verification with persistent fingerprint storage - - Auth/hostkey error blocking with transient error exponential backoff - - SSHHostKeyUpdater consumer-side interface -affects: [03-backend-snapshot-consumer, api, poller] - -tech-stack: - added: [] - patterns: [per-device goroutine lifecycle, buffered channel semaphore, Redis online gating] - -key-files: - created: - - poller/internal/poller/backup_scheduler.go - - poller/internal/poller/backup_scheduler_test.go - modified: - - poller/internal/poller/interfaces.go - - poller/cmd/poller/main.go - -key-decisions: - - "BackupScheduler runs independently from status poll scheduler with separate goroutines" - - "Semaphore uses buffered channel pattern matching existing codebase style" - - "Device with no Redis status key assumed potentially online (first poll not yet completed)" - -patterns-established: - - "Backup goroutine pattern: jitter -> initial backup -> ticker loop with gating checks" - - "Error classification: auth/hostkey block retries, transient errors use exponential backoff" - -requirements-completed: [COLL-01, COLL-03, COLL-05, COLL-06] - -duration: 4min -completed: 2026-03-13 ---- - -# Phase 2 Plan 2: Backup Scheduler Summary - -**BackupScheduler orchestrating periodic SSH config collection with per-device goroutines, concurrency semaphore, TOFU verification, and NATS publishing** - -## Performance - -- **Duration:** 4 min -- **Started:** 2026-03-13T01:51:27Z -- **Completed:** 2026-03-13T01:55:37Z -- **Tasks:** 2 -- **Files modified:** 4 - -## Accomplishments -- BackupScheduler manages per-device backup goroutines with 30-300s initial jitter -- Concurrency limited by configurable buffered channel semaphore (default 10) -- Auth failures and host key mismatches permanently block retries with clear log warnings -- Transient errors use stepped backoff (5m/15m/1h cap) -- Full pipeline wired into main.go running parallel to existing status poll scheduler - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: BackupScheduler with per-device goroutines** - `a884b09` (test) + `2653a32` (feat) -- TDD red/green -2. **Task 2: Wire BackupScheduler into main.go** - `d34817a` (feat) - -## Files Created/Modified -- `poller/internal/poller/backup_scheduler.go` - BackupScheduler with per-device goroutines, concurrency control, SSH collection, NATS publishing -- `poller/internal/poller/backup_scheduler_test.go` - Unit tests for jitter, backoff, retry blocking, online gating, semaphore, reconciliation -- `poller/internal/poller/interfaces.go` - Added SSHHostKeyUpdater consumer-side interface -- `poller/cmd/poller/main.go` - BackupScheduler initialization and goroutine startup - -## Decisions Made -- BackupScheduler runs independently from status poll scheduler -- separate goroutine pool, no shared state -- Semaphore uses buffered channel pattern (consistent with Go idioms, no external deps) -- Devices with no Redis status key assumed potentially online to avoid blocking first backup -- Locker nil-check allows tests to run without Redis lock infrastructure - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- Config backup pipeline complete: SSH -> normalize -> hash -> NATS publish -- Backend snapshot consumer (Phase 3) can subscribe to config.snapshot.create.> to receive snapshots -- Pre-existing integration test failures in poller package (missing certificate_authorities table) are unrelated to this work - ---- -*Phase: 02-poller-config-collection* -*Completed: 2026-03-13* diff --git a/.planning/phases/03-snapshot-ingestion/03-01-SUMMARY.md b/.planning/phases/03-snapshot-ingestion/03-01-SUMMARY.md deleted file mode 100644 index 134e78f..0000000 --- a/.planning/phases/03-snapshot-ingestion/03-01-SUMMARY.md +++ /dev/null @@ -1,108 +0,0 @@ ---- -phase: 03-snapshot-ingestion -plan: 01 -subsystem: api -tags: [nats, jetstream, openbao, transit, encryption, postgresql, prometheus, dedup] - -# Dependency graph -requires: - - phase: 01-database-schema - provides: RouterConfigSnapshot model and router_config_snapshots table - - phase: 02-poller-config-collection - provides: Go poller publishes config.snapshot.> NATS messages -provides: - - NATS subscriber consuming config.snapshot.> messages - - SHA256 dedup preventing duplicate snapshot storage - - OpenBao Transit encryption of config text before INSERT - - Prometheus metrics for ingestion monitoring -affects: [04-diff-engine, snapshot-api, config-timeline] - -# Tech tracking -tech-stack: - added: [prometheus_client] - patterns: [nats-subscriber-with-dedup, transit-encrypt-before-insert] - -key-files: - created: - - backend/app/services/config_snapshot_subscriber.py - - backend/tests/test_config_snapshot_subscriber.py - modified: - - backend/app/main.py - -key-decisions: - - "Trust poller-provided SHA256 hash (no recompute on backend)" - - "Raw SQL for dedup SELECT and INSERT (consistent with nats_subscriber.py pattern)" - - "OpenBao Transit service instantiated per-message with close() for connection hygiene" - -patterns-established: - - "Config snapshot ingestion: dedup by SHA256 -> encrypt -> INSERT -> ack" - - "Transit failure causes nak (NATS retry), plaintext never stored as fallback" - -requirements-completed: [STOR-02] - -# Metrics -duration: 4min -completed: 2026-03-13 ---- - -# Phase 3 Plan 1: Config Snapshot Subscriber Summary - -**NATS subscriber ingesting config snapshots with SHA256 dedup, OpenBao Transit encryption, and Prometheus metrics** - -## Performance - -- **Duration:** 4 min -- **Started:** 2026-03-13T02:44:01Z -- **Completed:** 2026-03-13T02:48:08Z -- **Tasks:** 2 -- **Files modified:** 3 - -## Accomplishments -- NATS subscriber consuming config.snapshot.> on DEVICE_EVENTS stream with durable consumer -- SHA256 dedup: duplicate snapshots silently skipped at debug level with Prometheus counter -- OpenBao Transit encryption: plaintext never stored in PostgreSQL, Transit failure causes nak -- Malformed and orphan device messages acked and discarded safely with warning logs -- 6 unit tests covering all handler paths (new, duplicate, encrypt fail, malformed, orphan, first) -- Wired into main.py lifespan with non-fatal startup pattern - -## Task Commits - -Each task was committed atomically: - -1. **Task 1 (RED): Failing tests** - `9d82741` (test) -2. **Task 1 (GREEN): Config snapshot subscriber** - `3ab9f27` (feat) -3. **Task 2: Wire into main.py lifespan** - `0db0641` (feat) - -_TDD task had RED + GREEN commits_ - -## Files Created/Modified -- `backend/app/services/config_snapshot_subscriber.py` - NATS subscriber with dedup, encryption, metrics -- `backend/tests/test_config_snapshot_subscriber.py` - 6 unit tests for all handler paths -- `backend/app/main.py` - Lifespan wiring for start/stop - -## Decisions Made -- Trust poller-provided SHA256 hash (no recompute on backend) -- per project decision -- Raw SQL for dedup SELECT and INSERT -- consistent with existing nats_subscriber.py pattern -- OpenBao Transit service instantiated per-message with close() -- connection hygiene -- config_text never appears in any log statement -- contains passwords and keys - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered - -None. - -## User Setup Required - -None - no external service configuration required. - -## Next Phase Readiness -- Config snapshot subscriber ready to receive messages from Go poller -- RouterConfigSnapshot rows will be available for diff engine (Phase 4) -- Prometheus metrics exposed for monitoring ingestion rate and errors - ---- -*Phase: 03-snapshot-ingestion* -*Completed: 2026-03-13* diff --git a/.planning/phases/04-manual-backup-trigger/04-01-SUMMARY.md b/.planning/phases/04-manual-backup-trigger/04-01-SUMMARY.md deleted file mode 100644 index 803c850..0000000 --- a/.planning/phases/04-manual-backup-trigger/04-01-SUMMARY.md +++ /dev/null @@ -1,115 +0,0 @@ ---- -phase: 04-manual-backup-trigger -plan: 01 -subsystem: api -tags: [nats, request-reply, backup, ssh, go, fastapi] - -# Dependency graph -requires: - - phase: 02-poller-config-collection - provides: BackupScheduler with SSH config collection pipeline - - phase: 03-snapshot-ingestion - provides: Config snapshot subscriber for NATS ingestion -provides: - - BackupResponder NATS handler for manual config backup triggers - - POST /config-snapshot/trigger API endpoint for on-demand backups - - Public CollectAndPublish method on BackupScheduler returning sha256 hash - - BackupExecutor/BackupLocker/DeviceGetter interfaces for testability -affects: [05-snapshot-list-api, 06-diff-api] - -# Tech tracking -tech-stack: - added: [nats-server/v2 (test dependency)] - patterns: [interface-based dependency injection for NATS responders, in-process NATS server for Go unit tests] - -key-files: - created: - - poller/internal/bus/backup_responder.go - - poller/internal/bus/backup_responder_test.go - - poller/internal/bus/redis_locker.go - - backend/tests/test_config_snapshot_trigger.py - modified: - - poller/internal/poller/backup_scheduler.go - - poller/cmd/poller/main.go - - backend/app/routers/config_backups.py - -key-decisions: - - "Used interface-based DI (BackupExecutor, BackupLocker, DeviceGetter) for BackupResponder testability" - - "Refactored collectAndPublish to return (string, error) with public CollectAndPublish wrapper" - - "Used in-process nats-server/v2 for fast Go unit tests instead of testcontainers" - - "Reused routeros_proxy NATS connection for Python endpoint instead of separate connection" - -patterns-established: - - "BackupExecutor interface: abstracts backup pipeline for manual trigger callers" - - "In-process NATS test server: startTestNATS helper for Go bus package tests" - -requirements-completed: [COLL-04] - -# Metrics -duration: 7min -completed: 2026-03-13 ---- - -# Phase 4 Plan 1: Manual Backup Trigger Summary - -**NATS request-reply manual backup trigger with Go BackupResponder and Python API endpoint returning synchronous success/failure/hash** - -## Performance - -- **Duration:** 7 min -- **Started:** 2026-03-13T03:03:57Z -- **Completed:** 2026-03-13T03:10:41Z -- **Tasks:** 2 -- **Files modified:** 7 - -## Accomplishments -- BackupResponder subscribes to config.backup.trigger (core NATS) and reuses BackupScheduler pipeline -- API endpoint POST /tenants/{tid}/devices/{did}/config-snapshot/trigger with operator role, 10/min rate limit -- Returns 201/409/502/504 with structured JSON including sha256 hash on success -- Per-device Redis lock prevents concurrent manual+scheduled backup collisions -- 12 total tests (6 Go, 6 Python) all passing - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Go BackupResponder with extracted collectAndPublish** - `9e102fd` (test: RED), `0851ece` (feat: GREEN) -2. **Task 2: Python API endpoint for manual config snapshot trigger** - `0e66415` (test: RED), `00f0a8b` (feat: GREEN) - -_TDD tasks have separate test and implementation commits._ - -## Files Created/Modified -- `poller/internal/bus/backup_responder.go` - NATS request-reply handler for manual backup triggers -- `poller/internal/bus/backup_responder_test.go` - 6 tests with in-process NATS server -- `poller/internal/bus/redis_locker.go` - RedisBackupLocker adapter implementing BackupLocker interface -- `poller/internal/poller/backup_scheduler.go` - Public CollectAndPublish method, returns (string, error) -- `poller/cmd/poller/main.go` - BackupResponder wired into lifecycle -- `backend/app/routers/config_backups.py` - New trigger_config_snapshot endpoint -- `backend/tests/test_config_snapshot_trigger.py` - 6 tests covering all response paths - -## Decisions Made -- Used interface-based dependency injection (BackupExecutor, BackupLocker, DeviceGetter) rather than direct struct dependencies for testability -- Refactored collectAndPublish to return hash string alongside error, enabling public CollectAndPublish wrapper -- Added nats-server/v2 as test dependency for fast in-process NATS testing instead of testcontainers -- Python tests use simulated handler logic to avoid import chain issues (rate_limit -> redis, auth -> bcrypt) -- Reused routeros_proxy NATS connection via _get_nats() import instead of duplicating lazy-init pattern - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -- Python test environment lacks redis and bcrypt packages, preventing direct import of app.routers.config_backups. Resolved by testing handler logic via simulation function that mirrors the endpoint implementation. - -## User Setup Required - -None - no external service configuration required. - -## Next Phase Readiness -- Manual backup trigger complete, ready for Phase 5 (snapshot list API) -- config.backup.trigger NATS subject uses core NATS (not JetStream), no stream config changes needed -- BackupExecutor interface available for any future caller needing programmatic backup triggers - ---- -*Phase: 04-manual-backup-trigger* -*Completed: 2026-03-13* diff --git a/.planning/phases/05-diff-engine/05-01-SUMMARY.md b/.planning/phases/05-diff-engine/05-01-SUMMARY.md deleted file mode 100644 index 611dfab..0000000 --- a/.planning/phases/05-diff-engine/05-01-SUMMARY.md +++ /dev/null @@ -1,115 +0,0 @@ ---- -phase: 05-diff-engine -plan: 01 -subsystem: api -tags: [difflib, unified-diff, openbao, transit, prometheus, nats] - -requires: - - phase: 03-snapshot-ingestion - provides: "config snapshot subscriber and router_config_snapshots table" - - phase: 01-database-schema - provides: "router_config_diffs table schema" -provides: - - "generate_and_store_diff() for unified diff between consecutive snapshots" - - "Prometheus metrics for diff generation success/failure/timing" - - "Subscriber integration calling diff after snapshot INSERT" -affects: [06-change-parser, 07-timeline-api] - -tech-stack: - added: [difflib] - patterns: [best-effort-secondary-operation, tdd-red-green] - -key-files: - created: - - backend/app/services/config_diff_service.py - - backend/tests/test_config_diff_service.py - modified: - - backend/app/services/config_snapshot_subscriber.py - - backend/tests/test_config_snapshot_subscriber.py - -key-decisions: - - "Diff service instantiates its own OpenBaoTransitService per-call with close() for clean lifecycle" - - "RETURNING id added to snapshot INSERT to capture new_snapshot_id for diff generation" - - "Subscriber tests mock generate_and_store_diff to isolate snapshot logic from diff logic" - -patterns-established: - - "Best-effort secondary operations: wrap in try/except, log+count errors, never block primary flow" - - "Line counting excludes unified diff headers (+++ and --- lines)" - -requirements-completed: [DIFF-01, DIFF-02] - -duration: 3min -completed: 2026-03-13 ---- - -# Phase 5 Plan 1: Config Diff Service Summary - -**Unified diff generation between consecutive config snapshots using difflib with Transit decrypt and best-effort error handling** - -## Performance - -- **Duration:** 3 min -- **Started:** 2026-03-13T03:30:07Z -- **Completed:** 2026-03-13T03:33:Z -- **Tasks:** 2 -- **Files modified:** 4 - -## Accomplishments -- Config diff service generates unified diffs between consecutive snapshots per device -- Transit decrypt of both old and new ciphertext before diffing in memory -- Best-effort pattern: decrypt/DB failures logged and counted, never block snapshot ack -- Prometheus metrics track diff success, errors (by type), and generation duration -- Subscriber wired to call diff generation after every successful snapshot INSERT - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Diff generation service (TDD RED)** - `79453fa` (test) -2. **Task 1: Diff generation service (TDD GREEN)** - `72d0ae2` (feat) -3. **Task 2: Wire diff into subscriber** - `eb76343` (feat) - -_TDD task had separate RED and GREEN commits_ - -## Files Created/Modified -- `backend/app/services/config_diff_service.py` - Diff generation with Transit decrypt, difflib, Prometheus metrics -- `backend/tests/test_config_diff_service.py` - 5 unit tests covering diff, first-snapshot, decrypt failure, line counts, empty diff -- `backend/app/services/config_snapshot_subscriber.py` - Added RETURNING id, generate_and_store_diff call after commit -- `backend/tests/test_config_snapshot_subscriber.py` - Updated to mock generate_and_store_diff - -## Decisions Made -- Diff service instantiates its own OpenBaoTransitService per-call (clean lifecycle, consistent with subscriber pattern) -- RETURNING id added to snapshot INSERT SQL to capture the new_snapshot_id without a separate query -- Subscriber tests mock generate_and_store_diff to keep snapshot tests isolated and unchanged in assertion counts - -## Deviations from Plan - -### Auto-fixed Issues - -**1. [Rule 1 - Bug] Updated subscriber test assertions for diff integration** -- **Found during:** Task 2 (wire diff into subscriber) -- **Issue:** Existing subscriber tests failed because generate_and_store_diff made additional DB calls through the shared mock session -- **Fix:** Added patch for generate_and_store_diff in subscriber tests that successfully INSERT (test 1 and test 6) -- **Files modified:** backend/tests/test_config_snapshot_subscriber.py -- **Verification:** All 11 tests pass -- **Committed in:** eb76343 (Task 2 commit) - ---- - -**Total deviations:** 1 auto-fixed (1 bug) -**Impact on plan:** Necessary to maintain test isolation. No scope creep. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- Diff generation is active and will produce diffs for every new non-duplicate snapshot -- router_config_diffs table populated with diff_text, line counts, and snapshot references -- Ready for change parser (Phase 6) to parse semantic changes from diff_text - ---- -*Phase: 05-diff-engine* -*Completed: 2026-03-13* diff --git a/.planning/phases/05-diff-engine/05-02-SUMMARY.md b/.planning/phases/05-diff-engine/05-02-SUMMARY.md deleted file mode 100644 index c6908f3..0000000 --- a/.planning/phases/05-diff-engine/05-02-SUMMARY.md +++ /dev/null @@ -1,112 +0,0 @@ ---- -phase: 05-diff-engine -plan: 02 -subsystem: api -tags: [parser, routeros, structured-changes, tdd] - -requires: - - phase: 05-diff-engine - plan: 01 - provides: "generate_and_store_diff() and router_config_diffs table" -provides: - - "parse_diff_changes() for extracting structured component changes from unified diffs" - - "router_config_changes rows linked to diff_id for timeline UI" -affects: [07-timeline-api] - -tech-stack: - added: [] - patterns: [tdd-red-green, best-effort-secondary-operation] - -key-files: - created: - - backend/app/services/config_change_parser.py - - backend/tests/test_config_change_parser.py - modified: - - backend/app/services/config_diff_service.py - - backend/tests/test_config_diff_service.py - -key-decisions: - - "Change parser is pure function (no DB/IO) for easy testing; DB writes happen in diff service" - - "RETURNING id added to diff INSERT to capture diff_id for linking changes" - - "Change parser errors are best-effort: diff is always stored, only changes are lost on parser failure" - -patterns-established: - - "RouterOS path to component: strip leading /, replace spaces with / (e.g., /ip firewall filter -> ip/firewall/filter)" - - "Fallback component system/general for diffs without RouterOS path headers" - -requirements-completed: [DIFF-03, DIFF-04] - -duration: 2min -completed: 2026-03-13 ---- - -# Phase 5 Plan 2: Structured Change Parser Summary - -**RouterOS diff change parser extracting component names, human-readable summaries, and raw lines from unified diffs with best-effort DB storage** - -## Performance - -- **Duration:** 2 min -- **Started:** 2026-03-13T03:34:48Z -- **Completed:** 2026-03-13T03:37:14Z -- **Tasks:** 2 -- **Files modified:** 4 - -## Accomplishments -- Pure-function change parser extracts component, summary, raw_line from RouterOS unified diffs -- RouterOS path detection converts section headers to component format (ip/firewall/filter) -- Human-readable summaries: Added/Removed/Modified N rules per component -- Diff service wired to call parser after INSERT and store results in router_config_changes -- Parser failures are best-effort: diff always stored, changes lost only on parser error - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Change parser TDD RED** - `7fddf35` (test) -2. **Task 1: Change parser TDD GREEN** - `b167831` (feat) -3. **Task 2: Wire parser into diff service** - `122b591` (feat) - -_TDD task had separate RED and GREEN commits_ - -## Files Created/Modified -- `backend/app/services/config_change_parser.py` - Pure parser: parse_diff_changes() with path detection, summary generation, raw line capture -- `backend/tests/test_config_change_parser.py` - 6 unit tests covering additions, multi-section, removals, modifications, fallback, raw_line -- `backend/app/services/config_diff_service.py` - Added RETURNING id, parse_diff_changes integration, change INSERT loop -- `backend/tests/test_config_diff_service.py` - Updated existing tests for RETURNING id, added 2 tests for change storage and parser error resilience - -## Decisions Made -- Change parser is a pure function (no DB/IO) for straightforward unit testing; DB writes are the diff service's responsibility -- RETURNING id added to diff INSERT SQL to get diff_id without separate query -- Change parser errors caught by separate try/except so diff is always committed first - -## Deviations from Plan - -### Auto-fixed Issues - -**1. [Rule 1 - Bug] Updated existing diff service tests for RETURNING id and parse_diff_changes integration** -- **Found during:** Task 2 -- **Issue:** Existing tests expected 3 execute calls without scalar_one on INSERT result; new RETURNING id and parse_diff_changes call changed the interaction pattern -- **Fix:** Added scalar_one mock to INSERT result, patched parse_diff_changes to return empty list in existing tests to isolate behavior -- **Files modified:** backend/tests/test_config_diff_service.py -- **Committed in:** 122b591 - ---- - -**Total deviations:** 1 auto-fixed (1 bug) -**Impact on plan:** Necessary test update for API change. No scope creep. - -## Issues Encountered -None - -## User Setup Required -None - -## Next Phase Readiness -- router_config_changes table populated with structured changes for every non-empty diff -- Changes linked to diff_id, device_id, tenant_id for timeline queries -- Ready for timeline API (Phase 7) to query changes per device - ---- -*Phase: 05-diff-engine* -*Completed: 2026-03-13* diff --git a/.planning/phases/06-history-api/06-01-SUMMARY.md b/.planning/phases/06-history-api/06-01-SUMMARY.md deleted file mode 100644 index 4fedf1a..0000000 --- a/.planning/phases/06-history-api/06-01-SUMMARY.md +++ /dev/null @@ -1,95 +0,0 @@ ---- -phase: 06-history-api -plan: 01 -subsystem: api -tags: [fastapi, sqlalchemy, pagination, timeline, rbac] - -# Dependency graph -requires: - - phase: 05-diff-engine - provides: router_config_changes and router_config_diffs tables with parsed change data -provides: - - GET /api/tenants/{tid}/devices/{did}/config-history endpoint - - get_config_history service function with pagination -affects: [06-02, frontend-config-history] - -# Tech tracking -tech-stack: - added: [] - patterns: [raw SQL text() joins for timeline queries, same RBAC pattern as config_backups] - -key-files: - created: - - backend/app/services/config_history_service.py - - backend/app/routers/config_history.py - - backend/tests/test_config_history_service.py - modified: - - backend/app/main.py - -key-decisions: - - "Raw SQL text() for JOIN query consistent with config_diff_service.py pattern" - - "Pagination defaults: limit=50, offset=0 with validation (ge=1, le=200 for limit)" - -patterns-established: - - "Config history queries use JOIN between changes and diffs tables for timeline view" - -requirements-completed: [API-01, API-04] - -# Metrics -duration: 2min -completed: 2026-03-13 ---- - -# Phase 6 Plan 1: Config History Timeline Summary - -**GET /config-history endpoint returning paginated change timeline with component, summary, timestamp, and diff metadata via JOIN query** - -## Performance - -- **Duration:** 2 min -- **Started:** 2026-03-13T03:58:03Z -- **Completed:** 2026-03-13T04:00:00Z -- **Tasks:** 2 -- **Files modified:** 4 - -## Accomplishments -- Config history service querying router_config_changes JOIN router_config_diffs for timeline entries -- REST endpoint with viewer+ RBAC and config:read scope enforcement -- 4 unit tests covering formatting, empty results, pagination, and ordering -- Router registered in main.py alongside existing config routers - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Config history service and tests (TDD)** - `f7d5aec` (feat) -2. **Task 2: Config history router and main.py registration** - `5c56344` (feat) - -## Files Created/Modified -- `backend/app/services/config_history_service.py` - Query function for paginated config change timeline -- `backend/app/routers/config_history.py` - REST endpoint with RBAC, pagination query params -- `backend/tests/test_config_history_service.py` - 4 unit tests with AsyncMock sessions -- `backend/app/main.py` - Router import and registration - -## Decisions Made -- Used raw SQL text() for the JOIN query, consistent with config_diff_service.py pattern -- Pagination limit constrained to 1-200 via FastAPI Query validation -- Copied _check_tenant_access helper (same pattern as config_backups.py) - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- Config history timeline endpoint ready for frontend consumption -- Plan 06-02 can build on this for detailed diff view endpoints - ---- -*Phase: 06-history-api* -*Completed: 2026-03-13* diff --git a/.planning/phases/06-history-api/06-02-SUMMARY.md b/.planning/phases/06-history-api/06-02-SUMMARY.md deleted file mode 100644 index c2eedec..0000000 --- a/.planning/phases/06-history-api/06-02-SUMMARY.md +++ /dev/null @@ -1,95 +0,0 @@ ---- -phase: 06-history-api -plan: 02 -subsystem: api -tags: [fastapi, sqlalchemy, openbao, transit-decrypt, rbac, snapshot] - -# Dependency graph -requires: - - phase: 06-history-api - provides: config_history_service.py with get_config_history, config_history router with RBAC - - phase: 05-diff-engine - provides: router_config_diffs and router_config_snapshots tables with encrypted config data -provides: - - GET /api/tenants/{tid}/devices/{did}/config/{snapshot_id} endpoint (decrypted snapshot) - - GET /api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff endpoint (unified diff) - - get_snapshot and get_snapshot_diff service functions -affects: [frontend-config-history, frontend-diff-viewer] - -# Tech tracking -tech-stack: - added: [] - patterns: [Transit decrypt in service layer with try/finally close, 404 for missing snapshots/diffs] - -key-files: - created: [] - modified: - - backend/app/services/config_history_service.py - - backend/app/routers/config_history.py - - backend/tests/test_config_history_service.py - -key-decisions: - - "Transit decrypt in get_snapshot with try/finally for clean openbao lifecycle" - - "500 error wrapping for Transit decrypt failures in router (not service)" - -patterns-established: - - "Snapshot retrieval filters by id + device_id + tenant_id for RLS-safe queries" - -requirements-completed: [API-02, API-03, API-04] - -# Metrics -duration: 2min -completed: 2026-03-13 ---- - -# Phase 6 Plan 2: Snapshot View and Diff Retrieval Summary - -**Snapshot view and diff retrieval endpoints with Transit decrypt for full config text and unified diff, enforcing viewer+ RBAC** - -## Performance - -- **Duration:** 2 min -- **Started:** 2026-03-13T04:01:58Z -- **Completed:** 2026-03-13T04:03:39Z -- **Tasks:** 2 -- **Files modified:** 3 - -## Accomplishments -- get_snapshot function decrypts config via OpenBao Transit and returns plaintext with metadata -- get_snapshot_diff function queries diff by new_snapshot_id for a device/tenant -- Two new router endpoints with viewer+ RBAC and config:read scope enforcement -- 4 new tests (8 total) covering decrypted content, not-found, diff retrieval, and no-diff cases - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Snapshot and diff service functions with tests (TDD)** - `83cd661` (feat) -2. **Task 2: Snapshot and diff router endpoints** - `af7007d` (feat) - -## Files Created/Modified -- `backend/app/services/config_history_service.py` - Added get_snapshot (Transit decrypt) and get_snapshot_diff query functions -- `backend/app/routers/config_history.py` - Two new GET endpoints with RBAC, 404/500 error handling -- `backend/tests/test_config_history_service.py` - 4 new tests with mocked Transit and DB sessions - -## Decisions Made -- Transit decrypt happens in service layer (get_snapshot), error wrapping in router layer (500 response) -- Query filters include device_id + tenant_id alongside snapshot_id for RLS-safe access - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- All 3 config history API endpoints complete (timeline, snapshot view, diff view) -- Phase 06 complete -- ready for frontend integration - ---- -*Phase: 06-history-api* -*Completed: 2026-03-13* diff --git a/.planning/phases/07-config-history-ui/07-01-SUMMARY.md b/.planning/phases/07-config-history-ui/07-01-SUMMARY.md deleted file mode 100644 index 481eb00..0000000 --- a/.planning/phases/07-config-history-ui/07-01-SUMMARY.md +++ /dev/null @@ -1,89 +0,0 @@ ---- -phase: 07-config-history-ui -plan: 01 -subsystem: ui -tags: [react, tanstack-query, timeline, config-history] - -requires: - - phase: 06-history-api - provides: GET /api/tenants/{tid}/devices/{did}/config-history endpoint -provides: - - ConfigHistorySection component with timeline rendering - - configHistoryApi.list() API client function - - Configuration history visible on device detail overview tab -affects: [07-config-history-ui] - -tech-stack: - added: [] - patterns: [timeline component pattern matching BackupTimeline.tsx] - -key-files: - created: - - frontend/src/components/config/ConfigHistorySection.tsx - modified: - - frontend/src/lib/api.ts - - frontend/src/routes/_authenticated/tenants/$tenantId/devices/$deviceId.tsx - -key-decisions: - - "Reimplemented formatRelativeTime locally rather than extracting shared util (matches BackupTimeline pattern)" - - "Poll interval 60s via refetchInterval for near-real-time change visibility" - -patterns-established: - - "Config history timeline: vertical dot timeline with component badge, summary, line delta, relative time" - -requirements-completed: [UI-01, UI-02] - -duration: 3min -completed: 2026-03-13 ---- - -# Phase 7 Plan 1: Config History UI Summary - -**ConfigHistorySection timeline component on device detail page, fetching change entries via TanStack Query with 60s polling** - -## Performance - -- **Duration:** 3 min -- **Started:** 2026-03-13T04:11:08Z -- **Completed:** 2026-03-13T04:14:00Z -- **Tasks:** 2 -- **Files modified:** 3 - -## Accomplishments -- Added configHistoryApi.list() and ConfigChangeEntry interface to api.ts -- Created ConfigHistorySection with vertical timeline, loading skeleton, and empty state -- Wired component into device detail overview tab below Interface Utilization - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: API client and ConfigHistorySection component** - `6bd2451` (feat) -2. **Task 2: Wire ConfigHistorySection into device detail page** - `36861ff` (feat) - -## Files Created/Modified -- `frontend/src/lib/api.ts` - Added ConfigChangeEntry interface and configHistoryApi.list() -- `frontend/src/components/config/ConfigHistorySection.tsx` - Timeline component with loading/empty/data states -- `frontend/src/routes/_authenticated/tenants/$tenantId/devices/$deviceId.tsx` - Import and render ConfigHistorySection - -## Decisions Made -- Reimplemented formatRelativeTime locally (same pattern as BackupTimeline.tsx) rather than extracting to shared util -- keeps components self-contained -- Used 60s refetchInterval for polling new config changes - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- Config history timeline renders on device overview tab -- Ready for any future detail/drill-down views on individual changes - ---- -*Phase: 07-config-history-ui* -*Completed: 2026-03-13* diff --git a/.planning/phases/08-diff-viewer-download/08-01-SUMMARY.md b/.planning/phases/08-diff-viewer-download/08-01-SUMMARY.md deleted file mode 100644 index 1fcbdad..0000000 --- a/.planning/phases/08-diff-viewer-download/08-01-SUMMARY.md +++ /dev/null @@ -1,92 +0,0 @@ ---- -phase: 08-diff-viewer-download -plan: 01 -subsystem: ui -tags: [react, diff-viewer, tanstack-query, tailwind] - -requires: - - phase: 07-config-history-ui - provides: ConfigHistorySection timeline component with ConfigChangeEntry data - - phase: 06-config-history-api - provides: GET /config/{snapshot_id}/diff endpoint returning DiffResponse -provides: - - DiffViewer component with unified diff rendering (green/red line highlighting) - - configHistoryApi.getDiff() API client method - - Clickable timeline entries in ConfigHistorySection -affects: [08-diff-viewer-download] - -tech-stack: - added: [] - patterns: [inline diff viewer with line-level classification] - -key-files: - created: - - frontend/src/components/config/DiffViewer.tsx - modified: - - frontend/src/lib/api.ts - - frontend/src/components/config/ConfigHistorySection.tsx - -key-decisions: - - "DiffViewer rendered inline above timeline (not modal) for context preservation" - - "Line classification function for unified diff: +green, -red, @@blue, ---/+++ muted" - -patterns-established: - - "Inline viewer pattern: state-driven component rendered above list, closed via callback" - -requirements-completed: [UI-03] - -duration: 1min -completed: 2026-03-13 ---- - -# Phase 8 Plan 1: Diff Viewer Summary - -**Inline diff viewer with green/red line highlighting, wired into clickable config history timeline entries** - -## Performance - -- **Duration:** 1 min -- **Started:** 2026-03-13T04:19:53Z -- **Completed:** 2026-03-13T04:20:56Z -- **Tasks:** 2 -- **Files modified:** 3 - -## Accomplishments -- DiffViewer component renders unified diffs with color-coded lines (green additions, red removals, blue hunk headers) -- API client getDiff method fetches diff data from backend endpoint -- Timeline entries in ConfigHistorySection are clickable with hover states - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Add diff API client and create DiffViewer component** - `dda00fb` (feat) -2. **Task 2: Wire DiffViewer into ConfigHistorySection timeline entries** - `2cf426f` (feat) - -## Files Created/Modified -- `frontend/src/components/config/DiffViewer.tsx` - Unified diff viewer with line-level color highlighting, loading skeleton, error state -- `frontend/src/lib/api.ts` - Added DiffResponse interface and configHistoryApi.getDiff() method -- `frontend/src/components/config/ConfigHistorySection.tsx` - Added click handlers, selectedSnapshotId state, inline DiffViewer rendering - -## Decisions Made -- Rendered DiffViewer inline above the timeline rather than in a modal, preserving context -- Used a classifyLine helper function for clean line-type detection (handles +++ and --- separately from + and -) -- Loading skeleton uses randomized widths for visual variety - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- Diff viewer complete, ready for config download functionality (plan 08-02) -- All TypeScript compiles cleanly - ---- -*Phase: 08-diff-viewer-download* -*Completed: 2026-03-13* diff --git a/.planning/phases/09-retention-cleanup/09-01-SUMMARY.md b/.planning/phases/09-retention-cleanup/09-01-SUMMARY.md deleted file mode 100644 index fec5ad3..0000000 --- a/.planning/phases/09-retention-cleanup/09-01-SUMMARY.md +++ /dev/null @@ -1,98 +0,0 @@ ---- -phase: 09-retention-cleanup -plan: 01 -subsystem: database -tags: [apscheduler, retention, postgresql, prometheus, cascade-delete] - -# Dependency graph -requires: - - phase: 01-database-schema - provides: router_config_snapshots table with CASCADE FK constraints -provides: - - Automatic retention cleanup of expired config snapshots - - CONFIG_RETENTION_DAYS env var for configurable retention period - - Prometheus metrics for cleanup observability -affects: [] - -# Tech tracking -tech-stack: - added: [] - patterns: [APScheduler IntervalTrigger for periodic maintenance jobs] - -key-files: - created: - - backend/app/services/retention_service.py - - backend/tests/test_retention_service.py - modified: - - backend/app/config.py - - backend/app/main.py - -key-decisions: - - "make_interval(days => :days) for parameterized PostgreSQL interval (no string concatenation)" - - "24h IntervalTrigger with 1h jitter to stagger cleanup across instances" - - "AdminAsyncSessionLocal (bypasses RLS) since retention is cross-tenant system operation" - -patterns-established: - - "IntervalTrigger pattern for periodic maintenance jobs (vs CronTrigger for scheduled backups)" - -requirements-completed: [STOR-03, STOR-04] - -# Metrics -duration: 2min -completed: 2026-03-13 ---- - -# Phase 9 Plan 1: Retention Cleanup Summary - -**Daily APScheduler job deletes config snapshots older than CONFIG_RETENTION_DAYS (default 90) with CASCADE FK cleanup of diffs and changes** - -## Performance - -- **Duration:** 2 min -- **Started:** 2026-03-13T04:31:48Z -- **Completed:** 2026-03-13T04:34:12Z -- **Tasks:** 2 -- **Files modified:** 4 - -## Accomplishments -- Retention service with parameterized SQL DELETE using make_interval for safe interval binding -- APScheduler IntervalTrigger running every 24h with 1h jitter for stagger -- Prometheus counter and histogram for cleanup observability -- Wired into main.py lifespan with non-fatal startup pattern - -## Task Commits - -Each task was committed atomically: - -1. **Task 1 (RED): Add failing tests** - `00bdde9` (test) -2. **Task 1 (GREEN): Implement retention service + config setting** - `a9f7a45` (feat) -3. **Task 2: Wire retention scheduler into lifespan** - `4d62bc9` (feat) - -## Files Created/Modified -- `backend/app/services/retention_service.py` - Retention cleanup logic, scheduler, Prometheus metrics -- `backend/tests/test_retention_service.py` - 4 unit tests for cleanup function -- `backend/app/config.py` - Added CONFIG_RETENTION_DAYS setting (default 90) -- `backend/app/main.py` - Wired start/stop retention scheduler into lifespan - -## Decisions Made -- Used make_interval(days => :days) for parameterized PostgreSQL interval (avoids string concatenation SQL injection risk) -- 24h IntervalTrigger with 1h jitter to stagger cleanup across instances -- AdminAsyncSessionLocal bypasses RLS since retention is a cross-tenant system operation - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. CONFIG_RETENTION_DAYS defaults to 90 if not set. - -## Next Phase Readiness -- Retention cleanup is fully operational, ready for phase 10 -- No blockers - ---- -*Phase: 09-retention-cleanup* -*Completed: 2026-03-13* diff --git a/.planning/phases/10-audit-observability/10-01-SUMMARY.md b/.planning/phases/10-audit-observability/10-01-SUMMARY.md deleted file mode 100644 index 0bfca02..0000000 --- a/.planning/phases/10-audit-observability/10-01-SUMMARY.md +++ /dev/null @@ -1,98 +0,0 @@ ---- -phase: 10-audit-observability -plan: 01 -subsystem: api -tags: [audit, logging, config-backup, nats, observability] - -# Dependency graph -requires: - - phase: 03-snapshot-ingestion - provides: config_snapshot_subscriber handle_config_snapshot handler - - phase: 05-config-diff - provides: config_diff_service generate_and_store_diff function - - phase: 04-manual-backup-trigger - provides: config_backups trigger_config_snapshot endpoint -provides: - - Audit trail for all config backup operations (4 event types) - - Tests verifying audit event emission -affects: [] - -# Tech tracking -tech-stack: - added: [] - patterns: [try/except-wrapped log_action calls for fire-and-forget audit, inline imports in diff service to avoid circular deps] - -key-files: - created: - - backend/tests/test_audit_config_backup.py - modified: - - backend/app/services/config_snapshot_subscriber.py - - backend/app/services/config_diff_service.py - - backend/app/routers/config_backups.py - -key-decisions: - - "Module-level import of log_action in snapshot subscriber (no circular risk), inline import in diff service and router (consistent with existing best-effort pattern)" - - "All audit calls wrapped in try/except Exception: pass to never break parent operations" - -patterns-established: - - "Audit event pattern: try/except-wrapped log_action calls at success points in NATS subscribers and API endpoints" - -requirements-completed: [OBS-01, OBS-02] - -# Metrics -duration: 3min -completed: 2026-03-13 ---- - -# Phase 10 Plan 01: Config Backup Audit Events Summary - -**Four audit event types (created, skipped_duplicate, diff_generated, manual_trigger) wired into config backup operations with try/except safety and 4 passing tests** - -## Performance - -- **Duration:** 3 min -- **Started:** 2026-03-13T04:43:11Z -- **Completed:** 2026-03-13T04:46:04Z -- **Tasks:** 2 -- **Files modified:** 4 - -## Accomplishments -- Added audit logging to all 4 config backup operations: snapshot creation, deduplication skip, diff generation, and manual backup trigger -- All log_action calls follow project pattern: try/except wrapped, fire-and-forget, with tenant_id, device_id, action, resource_type, and details -- 4 new tests verify correct audit action strings are emitted, all 17 tests pass (4 new + 13 existing) - -## Task Commits - -Each task was committed atomically: - -1. **Task 1: Add audit event emission to snapshot subscriber, diff service, and backup trigger endpoint** - `1a1ceb2` (feat) -2. **Task 2: Add tests verifying audit events are emitted** - `fb91fed` (test) - -## Files Created/Modified -- `backend/app/services/config_snapshot_subscriber.py` - Added config_snapshot_created and config_snapshot_skipped_duplicate audit events -- `backend/app/services/config_diff_service.py` - Added config_diff_generated audit event after diff INSERT -- `backend/app/routers/config_backups.py` - Added config_backup_manual_trigger audit event on manual trigger success -- `backend/tests/test_audit_config_backup.py` - 4 tests verifying all audit event types are emitted - -## Decisions Made -- Module-level import of log_action in snapshot subscriber (no circular dependency risk since audit_service has no deps on snapshot subscriber) -- Inline import in diff service try block (consistent with existing best-effort pattern and avoids any potential circular import) -- Inline import in config_backups router try block (same pattern as diff service) - -## Deviations from Plan - -None - plan executed exactly as written. - -## Issues Encountered -None - -## User Setup Required -None - no external service configuration required. - -## Next Phase Readiness -- Audit trail complete for all config backup operations -- All existing tests continue to pass with the new audit imports - ---- -*Phase: 10-audit-observability* -*Completed: 2026-03-13*