From 7ff3178b847cb92093b63040b1edf4dab6837727 Mon Sep 17 00:00:00 2001
From: Jason Staack
Date: Thu, 12 Mar 2026 20:50:27 -0500
Subject: [PATCH] docs(02-01): complete config backup primitives plan

Co-Authored-By: Claude Opus 4.6
---
 .planning/REQUIREMENTS.md                     | 104 ++++++++++++++
 .planning/STATE.md                            |  81 +++++++++++
 .../02-01-SUMMARY.md                          | 128 ++++++++++++++++++
 3 files changed, 313 insertions(+)
 create mode 100644 .planning/REQUIREMENTS.md
 create mode 100644 .planning/STATE.md
 create mode 100644 .planning/phases/02-poller-config-collection/02-01-SUMMARY.md

diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
new file mode 100644
index 0000000..f6bcd84
--- /dev/null
+++ b/.planning/REQUIREMENTS.md
@@ -0,0 +1,104 @@
+# Requirements: RouterOS Config Backup & Change Tracking
+
+**Defined:** 2026-03-12
+**Core Value:** Operators can see exactly what changed on a router and when, with reliable config snapshots for download
+
+## v1 Requirements
+
+### Collection
+
+- [x] **COLL-01**: Poller collects RouterOS config via SSH `/export show-sensitive` on a configurable interval (default 6h)
+- [x] **COLL-02**: Poller normalizes config output (trim whitespace, normalize line endings, remove timestamp headers)
+- [ ] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create`
+- [ ] **COLL-04**: Manual backup trigger via POST `/api/tenants/{tenant_id}/devices/{device_id}/backup`
+- [ ] **COLL-05**: Unreachable routers log warning and retry next interval
+- [x] **COLL-06**: Collection interval configurable via `CONFIG_BACKUP_INTERVAL` environment variable
+
+### Storage
+
+- [x] **STOR-01**: API stores config snapshots in `router_config_snapshots` table with SHA256 hash
+- [ ] **STOR-02**: Duplicate snapshots (same hash as previous) are skipped, no diff generated
+- [ ] **STOR-03**: Snapshots retained for 90 days (configurable via `CONFIG_RETENTION_DAYS`)
+- [ ] **STOR-04**: Older snapshots automatically deleted by retention cleanup
+- [x] **STOR-05**: Snapshots encrypted at rest, accessible only through RBAC
+
+### Diff & Parsing
+
+- [ ] **DIFF-01**: Unified diff generated when new snapshot differs from previous
+- [ ] **DIFF-02**: Diffs stored in `router_config_diffs` table linking snapshot pairs
+- [ ] **DIFF-03**: Structured change parser extracts component, summary, and raw line as JSON
+- [ ] **DIFF-04**: Parsed changes stored in `router_config_changes` table
+
+### API
+
+- [ ] **API-01**: GET `/api/tenants/{tid}/devices/{did}/config-history` returns change timeline
+- [ ] **API-02**: GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}` returns full snapshot
+- [ ] **API-03**: GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff` returns unified diff
+- [ ] **API-04**: RBAC enforced: operator+ can trigger backups, viewers can read history
+
+### Frontend
+
+- [ ] **UI-01**: Device page shows Configuration History section below Remote Access
+- [ ] **UI-02**: Timeline displays change entries with component, summary, and timestamp
+- [ ] **UI-03**: Diff viewer shows unified diff with add/remove highlighting
+- [ ] **UI-04**: User can download snapshot as `router-{device_name}-{timestamp}.rsc`
+
+### Observability
+
+- [ ] **OBS-01**: Audit events logged: `config_snapshot_created`, `config_snapshot_skipped_duplicate`
+- [ ] **OBS-02**: Audit events logged: `config_diff_generated`, `config_backup_manual_trigger`
+
+## v2 Requirements
+
+### Restore
+
+- **REST-01**: User can restore a config snapshot to a router via SSH
+- **REST-02**: Restore confirmation dialog with diff preview
+
+## Out of Scope
+
+| Feature | Reason |
+|---------|--------|
+| Config restore | Explicitly deferred per v9.6 spec |
+| Non-RouterOS device backup | Spec scopes to RouterOS only initially |
+| Real-time change detection | Polling-based by design, not event-driven |
+| Config comparison between arbitrary snapshots | Only consecutive snapshot diffs in v1 |
+
+## Traceability
+
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| COLL-01 | Phase 2: Poller Config Collection | Complete |
+| COLL-02 | Phase 2: Poller Config Collection | Complete |
+| COLL-03 | Phase 2: Poller Config Collection | Pending |
+| COLL-04 | Phase 4: Manual Backup Trigger | Pending |
+| COLL-05 | Phase 2: Poller Config Collection | Pending |
+| COLL-06 | Phase 2: Poller Config Collection | Complete |
+| STOR-01 | Phase 1: Database Schema | Complete |
+| STOR-02 | Phase 3: Snapshot Ingestion | Pending |
+| STOR-03 | Phase 9: Retention & Cleanup | Pending |
+| STOR-04 | Phase 9: Retention & Cleanup | Pending |
+| STOR-05 | Phase 1: Database Schema | Complete |
+| DIFF-01 | Phase 5: Diff Engine | Pending |
+| DIFF-02 | Phase 5: Diff Engine | Pending |
+| DIFF-03 | Phase 5: Diff Engine | Pending |
+| DIFF-04 | Phase 5: Diff Engine | Pending |
+| API-01 | Phase 6: History API | Pending |
+| API-02 | Phase 6: History API | Pending |
+| API-03 | Phase 6: History API | Pending |
+| API-04 | Phase 6: History API | Pending |
+| UI-01 | Phase 7: Config History UI | Pending |
+| UI-02 | Phase 7: Config History UI | Pending |
+| UI-03 | Phase 8: Diff Viewer & Download | Pending |
+| UI-04 | Phase 8: Diff Viewer & Download | Pending |
+| OBS-01 | Phase 10: Audit & Observability | Pending |
+| OBS-02 | Phase 10: Audit & Observability | Pending |
+
+**Coverage:**
+- v1 requirements: 25 total
+- Mapped to phases: 25
+- Unmapped: 0
+
+---
+*Requirements defined: 2026-03-12*
+*Last updated: 2026-03-12 after roadmap creation*
diff --git a/.planning/STATE.md b/.planning/STATE.md
new file mode 100644
index 0000000..37b0e80
--- /dev/null
+++ b/.planning/STATE.md
@@ -0,0 +1,81 @@
+---
+gsd_state_version: 1.0
+milestone: v9.6
+milestone_name: milestone
+status: in_progress
+stopped_at: Completed 02-01-PLAN.md
+last_updated: "2026-03-13T01:49:00Z"
+last_activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
+progress:
+  total_phases: 10
+  completed_phases: 1
+  total_plans: 3
+  completed_plans: 2
+  percent: 67
+---
+
+# Project State
+
+## Project Reference
+
+See: .planning/PROJECT.md (updated 2026-03-12)
+
+**Core value:** Operators can see exactly what changed on a router and when, with reliable config snapshots for download
+**Current focus:** Phase 2: Poller Config Collection
+
+## Current Position
+
+Phase: 2 of 10 (Poller Config Collection)
+Plan: 1 of 2 in current phase (02-01 complete)
+Status: Phase 2 in progress
+Last activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
+
+Progress: [███████░░░] 67%
+
+## Performance Metrics
+
+**Velocity:**
+- Total plans completed: 2
+- Average duration: 4min
+- Total execution time: 0.13 hours
+
+**By Phase:**
+
+| Phase | Plans | Total | Avg/Plan |
+|-------|-------|-------|----------|
+| 01-database-schema | 1 | 3min | 3min |
+| 02-poller-config-collection | 1 | 5min | 5min |
+
+**Recent Trend:**
+- Last 5 plans: none
+- Trend: N/A
+
+*Updated after each plan completion*
+
+## Accumulated Context
+
+### Decisions
+
+Decisions are logged in PROJECT.md Key Decisions table.
+Recent decisions affecting current work:
+
+- [01-01] Models added to existing config_backup.py (same domain, consistent pattern)
+- [01-01] config_text stores Transit ciphertext (vault:v1:...), plaintext never in DB
+- [01-01] sha256_hash is of plaintext config for deduplication without decryption
+- [02-01] TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))
+- [02-01] NormalizationVersion=1 constant in NATS payloads for future re-processing
+- [02-01] UpdateSSHHostKey uses COALESCE on first_seen to preserve original observation time
+
+### Pending Todos
+
+None yet.
+
+### Blockers/Concerns
+
+- OpenBao dev instance loses Transit keys on data wipe -- device creds need re-entry (from project memory, may affect snapshot encryption testing)
+
+## Session Continuity
+
+Last session: 2026-03-13T01:49:00Z
+Stopped at: Completed 02-01-PLAN.md
+Resume file: .planning/phases/02-poller-config-collection/02-02-PLAN.md
diff --git a/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md b/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
new file mode 100644
index 0000000..7e94424
--- /dev/null
+++ b/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
@@ -0,0 +1,128 @@
+---
+phase: 02-poller-config-collection
+plan: 01
+subsystem: poller
+tags: [ssh, tofu, routeros, config-normalization, sha256, nats, prometheus, alembic]
+
+requires:
+  - phase: 01-database-schema
+    provides: router_config_snapshots table for storing backup data
+provides:
+  - SSH command executor with TOFU host key verification and typed error classification
+  - Config normalizer with deterministic SHA256 hashing
+  - ConfigSnapshotEvent NATS event type and PublishConfigSnapshot method
+  - Config backup environment variables (interval, concurrency, timeout)
+  - Device model SSH fields (port, host key fingerprint) with UpdateSSHHostKey method
+  - Alembic migration 028 for devices table SSH columns
+  - Prometheus metrics for config backup observability
+affects: [02-02-backup-scheduler, 03-backend-subscriber]
+
+tech-stack:
+  added: []
+  patterns:
+    - "TOFU host key verification via SHA256 fingerprint comparison"
+    - "Config normalization pipeline: line endings, timestamp strip, whitespace trim, blank collapse"
+    - "SSH error classification into typed SSHErrorKind enum"
+
+key-files:
+  created:
+    - poller/internal/device/ssh_executor.go
+    - poller/internal/device/ssh_executor_test.go
+    - poller/internal/device/normalize.go
+    - poller/internal/device/normalize_test.go
+    - backend/alembic/versions/028_device_ssh_host_key.py
+  modified:
+    - poller/internal/config/config.go
+    - poller/internal/bus/publisher.go
+    - poller/internal/store/devices.go
+    - poller/internal/observability/metrics.go
+
+key-decisions:
+  - "TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))"
+  - "NormalizationVersion=1 constant included in NATS payloads for future re-processing"
+  - "UpdateSSHHostKey sets first_seen via COALESCE to preserve original observation time"
+
+patterns-established:
+  - "SSH error classification: classifySSHError inspects error strings for auth/hostkey/timeout/refused patterns"
+  - "Config normalization: version-tracked deterministic pipeline for RouterOS export output"
+
+requirements-completed: [COLL-01, COLL-02, COLL-06]
+
+duration: 5min
+completed: 2026-03-13
+---
+
+# Phase 02 Plan 01: Config Backup Primitives Summary
+
+**SSH executor with TOFU host key verification, RouterOS config normalizer with SHA256 hashing, NATS snapshot event, and Alembic migration for device SSH columns**
+
+## Performance
+
+- **Duration:** 5 min
+- **Started:** 2026-03-13T01:43:33Z
+- **Completed:** 2026-03-13T01:48:38Z
+- **Tasks:** 2
+- **Files modified:** 9
+
+## Accomplishments
+- SSH RunCommand executor with context-aware dialing, TOFU host key callback, and 6-kind typed error classification
+- Deterministic config normalizer: strips RouterOS timestamps, normalizes line endings, trims whitespace, collapses blanks, computes SHA256 hash
+- 22 unit tests covering error classification, TOFU flows (first connect/match/mismatch), normalization edge cases, idempotency
+- Config backup env vars, NATS ConfigSnapshotEvent, device model SSH extensions, migration 028, Prometheus metrics
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: SSH executor, normalizer, and their tests** - `f1abb75` (feat)
+2. **Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics** - `4ae39d2` (feat)
+
+_Note: Task 1 used TDD -- tests written first (RED), implementation second (GREEN)._
+
+## Files Created/Modified
+- `poller/internal/device/ssh_executor.go` - RunCommand SSH executor with TOFU host key verification and typed errors
+- `poller/internal/device/ssh_executor_test.go` - Unit tests for SSH error classification, TOFU callbacks, CommandResult
+- `poller/internal/device/normalize.go` - NormalizeConfig and HashConfig for RouterOS export output
+- `poller/internal/device/normalize_test.go` - Table-driven tests for normalization pipeline edge cases
+- `poller/internal/config/config.go` - Added ConfigBackupIntervalSeconds, ConfigBackupMaxConcurrent, ConfigBackupCommandTimeoutSeconds
+- `poller/internal/bus/publisher.go` - Added ConfigSnapshotEvent type, PublishConfigSnapshot method, config.snapshot.> stream subject
+- `poller/internal/store/devices.go` - Added SSHPort/SSHHostKeyFingerprint fields, UpdateSSHHostKey method, updated queries
+- `poller/internal/observability/metrics.go` - Added ConfigBackupTotal, ConfigBackupDuration, ConfigBackupActive metrics
+- `backend/alembic/versions/028_device_ssh_host_key.py` - Migration adding ssh_port, ssh_host_key_fingerprint, timestamp columns
+
+## Decisions Made
+- TOFU fingerprint format uses SHA256:base64(sha256(pubkey)) to match ssh-keygen output format
+- NormalizationVersion=1 constant is included in NATS payloads so consumers can detect algorithm changes
+- UpdateSSHHostKey uses COALESCE on ssh_host_key_first_seen to preserve original observation timestamp
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+**1. [Rule 1 - Bug] Fixed test key generation approach**
+- **Found during:** Task 1 (GREEN phase)
+- **Issue:** Embedded OpenSSH PEM test key had padding errors ("ssh: padding not as expected")
+- **Fix:** Switched to programmatic ed25519 key generation via crypto/ed25519.GenerateKey
+- **Files modified:** poller/internal/device/ssh_executor_test.go
+- **Verification:** All 22 tests pass
+- **Committed in:** f1abb75 (Task 1 commit)
+
+---
+
+**Total deviations:** 1 auto-fixed (1 bug)
+**Impact on plan:** Minimal -- test infrastructure fix only, no production code change.
+
+## Issues Encountered
+None beyond the test key generation fix documented above.
+
+## User Setup Required
+None - no external service configuration required.
+
+## Next Phase Readiness
+- All primitives ready for Plan 02 (backup scheduler) to wire together
+- SSH executor, normalizer, NATS event, device model, config, and metrics are independently tested and compilable
+- Migration 028 ready to apply before deploying the backup scheduler
+
+---
+*Phase: 02-poller-config-collection*
+*Completed: 2026-03-13*
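
The summary describes the normalization pipeline only in prose (line endings, timestamp strip, whitespace trim, blank collapse, then SHA256), and `normalize.go` itself is not part of this patch. The following is a minimal Go sketch of such a pipeline, not the committed implementation. The names `normalizeSketch` and `hashSketch` are hypothetical, and the header pattern assumes the export's timestamp line starts with `#` and contains `by RouterOS`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// timestampHeader is an ASSUMED pattern for the export's timestamp header
// line (a comment line containing "by RouterOS"); the real normalizer may
// match it differently.
var timestampHeader = regexp.MustCompile(`^# .+ by RouterOS`)

// normalizeSketch (hypothetical name) applies the four steps listed in the
// summary: normalize line endings, strip the timestamp header, trim
// trailing whitespace, and collapse runs of blank lines.
func normalizeSketch(raw string) string {
	s := strings.ReplaceAll(raw, "\r\n", "\n")
	s = strings.ReplaceAll(s, "\r", "\n")

	var out []string
	prevBlank := false
	for _, line := range strings.Split(s, "\n") {
		line = strings.TrimRight(line, " \t")
		if timestampHeader.MatchString(line) {
			continue // header changes on every export, so drop it
		}
		if line == "" {
			if prevBlank {
				continue // collapse consecutive blank lines into one
			}
			prevBlank = true
		} else {
			prevBlank = false
		}
		out = append(out, line)
	}
	// Trim leading/trailing blank lines and end with exactly one newline.
	return strings.TrimSpace(strings.Join(out, "\n")) + "\n"
}

// hashSketch (hypothetical name) returns the hex SHA256 of the normalized text.
func hashSketch(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Two exports of the same config taken at different times, with
	// different line endings and stray whitespace.
	a := "# 2026-03-12 20:50:27 by RouterOS 7.14\r\n/ip address\r\nadd address=10.0.0.1/24  \r\n\r\n\r\n/system identity\r\nset name=edge\r\n"
	b := "# 2026-03-13 02:50:27 by RouterOS 7.14\n/ip address\nadd address=10.0.0.1/24\n\n/system identity\nset name=edge\n"

	fmt.Println(hashSketch(normalizeSketch(a)) == hashSketch(normalizeSketch(b)))
}
```

Under these assumptions the two sample exports hash identically, which is the property duplicate detection (STOR-02) relies on. Running the function on its own output returns the same text, matching the idempotency the unit tests are said to cover.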