docs(02-01): complete config backup primitives plan

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jason Staack
2026-03-12 20:50:27 -05:00
parent 4ae39d2cb3
commit 7ff3178b84
3 changed files with 313 additions and 0 deletions

.planning/REQUIREMENTS.md Normal file

@@ -0,0 +1,104 @@
# Requirements: RouterOS Config Backup & Change Tracking
**Defined:** 2026-03-12
**Core Value:** Operators can see exactly what changed on a router and when, with reliable config snapshots for download
## v1 Requirements
### Collection
- [x] **COLL-01**: Poller collects RouterOS config via SSH `/export show-sensitive` on a configurable interval (default 6h)
- [x] **COLL-02**: Poller normalizes config output (trim whitespace, normalize line endings, remove timestamp headers)
- [ ] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create`
- [ ] **COLL-04**: Manual backup trigger via POST `/api/tenants/{tenant_id}/devices/{device_id}/backup`
- [ ] **COLL-05**: Unreachable routers log warning and retry next interval
- [x] **COLL-06**: Collection interval configurable via `CONFIG_BACKUP_INTERVAL` environment variable
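
As a rough illustration of COLL-01 and COLL-06, the sketch below resolves the collection interval from the environment and falls back to the 6h default. It assumes `CONFIG_BACKUP_INTERVAL` holds a positive integer number of seconds; the requirements do not state the value format, and the function name is hypothetical. The poller is written in Go, so this Python version only shows the fallback logic.

```python
import os

# 6h default taken from COLL-01.
DEFAULT_INTERVAL_SECONDS = 6 * 60 * 60


def backup_interval_seconds(env=None):
    """Return the backup interval in seconds.

    Assumption: CONFIG_BACKUP_INTERVAL is an integer number of seconds.
    Missing, non-numeric, or non-positive values fall back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get("CONFIG_BACKUP_INTERVAL", "").strip()
    if not raw:
        return DEFAULT_INTERVAL_SECONDS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_INTERVAL_SECONDS
    return value if value > 0 else DEFAULT_INTERVAL_SECONDS
```

Passing a plain dict as `env` keeps the function testable without touching the real process environment.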
### Storage
- [x] **STOR-01**: API stores config snapshots in `router_config_snapshots` table with SHA256 hash
- [ ] **STOR-02**: Duplicate snapshots (same hash as previous) are skipped, no diff generated
- [ ] **STOR-03**: Snapshots retained for 90 days (configurable via `CONFIG_RETENTION_DAYS`)
- [ ] **STOR-04**: Older snapshots automatically deleted by retention cleanup
- [x] **STOR-05**: Snapshots encrypted at rest, accessible only through RBAC
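
The duplicate check in STOR-02 could look roughly like the sketch below: hash the plaintext config with SHA256 and compare it against the hash of the previous snapshot. The function names are hypothetical, and the real ingestion path (Phase 3) may differ.

```python
import hashlib


def sha256_hex(config_text: str) -> str:
    """Hex-encoded SHA256 of the plaintext config."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def should_store(new_config: str, previous_hash):
    """Return (store, new_hash).

    store is False when the new config hashes to the same value as the
    previous snapshot, in which case no snapshot or diff would be written.
    previous_hash may be None when the device has no snapshot yet.
    """
    new_hash = sha256_hex(new_config)
    return new_hash != previous_hash, new_hash
```

Because the hash is taken over the plaintext, the comparison needs no decryption of the stored snapshot.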
### Diff & Parsing
- [ ] **DIFF-01**: Unified diff generated when new snapshot differs from previous
- [ ] **DIFF-02**: Diffs stored in `router_config_diffs` table linking snapshot pairs
- [ ] **DIFF-03**: Structured change parser extracts component, summary, and raw line as JSON
- [ ] **DIFF-04**: Parsed changes stored in `router_config_changes` table
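
A minimal sketch of DIFF-01 and DIFF-03, using Python's standard `difflib`. It assumes that RouterOS export sections start with a line beginning with `/` (for example `/ip dns`) and attributes each changed line to the most recent such header. The `summary` value here is just `added` or `removed`, a simplification of whatever the real parser will produce, and both function names are hypothetical.

```python
import difflib
import json


def unified_diff(old: str, new: str) -> str:
    """Return a unified diff of two snapshots, or "" when they are identical."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    return "\n".join(lines)


def parse_changes(old: str, new: str) -> list:
    """Extract component, summary, and raw line for every changed line.

    Assumption: a line starting with "/" opens a new section (component).
    """
    changes = []
    component = None
    for entry in difflib.ndiff(old.splitlines(), new.splitlines()):
        tag, text = entry[:2], entry[2:]
        if tag == "? ":
            continue  # ndiff hint lines carry no config content
        if text.startswith("/"):
            component = text.strip()
            continue
        if tag in ("+ ", "- "):
            changes.append({
                "component": component,
                "summary": "added" if tag == "+ " else "removed",
                "raw_line": text,
            })
    return changes
```

The result of `parse_changes` is a list of plain dicts, so `json.dumps(parse_changes(old, new))` yields the JSON form mentioned in DIFF-03.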
### API
- [ ] **API-01**: GET `/api/tenants/{tid}/devices/{did}/config-history` returns change timeline
- [ ] **API-02**: GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}` returns full snapshot
- [ ] **API-03**: GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff` returns unified diff
- [ ] **API-04**: RBAC enforced: operator+ can trigger backups, viewers can read history
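
API-04 can be read as a rank comparison between the caller's role and the minimum role for an action. The sketch below assumes a linear role order; `viewer` and `operator` come from the requirement, while `admin` and the action names are hypothetical placeholders.

```python
# Assumed linear role hierarchy; "admin" is a hypothetical higher role.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}

# Minimum role per action, following API-04. Action names are illustrative.
REQUIRED_ROLE = {
    "read_history": "viewer",
    "trigger_backup": "operator",
}


def is_allowed(role: str, action: str) -> bool:
    """Return True when the role meets the minimum rank for the action."""
    if role not in ROLE_RANK or action not in REQUIRED_ROLE:
        return False  # unknown roles and actions are denied
    return ROLE_RANK[role] >= ROLE_RANK[REQUIRED_ROLE[action]]
```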
### Frontend
- [ ] **UI-01**: Device page shows Configuration History section below Remote Access
- [ ] **UI-02**: Timeline displays change entries with component, summary, and timestamp
- [ ] **UI-03**: Diff viewer shows unified diff with add/remove highlighting
- [ ] **UI-04**: User can download snapshot as `router-{device_name}-{timestamp}.rsc`
### Observability
- [ ] **OBS-01**: Audit events logged: `config_snapshot_created`, `config_snapshot_skipped_duplicate`
- [ ] **OBS-02**: Audit events logged: `config_diff_generated`, `config_backup_manual_trigger`
## v2 Requirements
### Restore
- **REST-01**: User can restore a config snapshot to a router via SSH
- **REST-02**: Restore confirmation dialog with diff preview
## Out of Scope
| Feature | Reason |
|---------|--------|
| Config restore | Explicitly deferred per v9.6 spec |
| Non-RouterOS device backup | Spec scopes to RouterOS only initially |
| Real-time change detection | Polling-based by design, not event-driven |
| Config comparison between arbitrary snapshots | Only consecutive snapshot diffs in v1 |
## Traceability
| Requirement | Phase | Status |
|-------------|-------|--------|
| COLL-01 | Phase 2: Poller Config Collection | Complete |
| COLL-02 | Phase 2: Poller Config Collection | Complete |
| COLL-03 | Phase 2: Poller Config Collection | Pending |
| COLL-04 | Phase 4: Manual Backup Trigger | Pending |
| COLL-05 | Phase 2: Poller Config Collection | Pending |
| COLL-06 | Phase 2: Poller Config Collection | Complete |
| STOR-01 | Phase 1: Database Schema | Complete |
| STOR-02 | Phase 3: Snapshot Ingestion | Pending |
| STOR-03 | Phase 9: Retention & Cleanup | Pending |
| STOR-04 | Phase 9: Retention & Cleanup | Pending |
| STOR-05 | Phase 1: Database Schema | Complete |
| DIFF-01 | Phase 5: Diff Engine | Pending |
| DIFF-02 | Phase 5: Diff Engine | Pending |
| DIFF-03 | Phase 5: Diff Engine | Pending |
| DIFF-04 | Phase 5: Diff Engine | Pending |
| API-01 | Phase 6: History API | Pending |
| API-02 | Phase 6: History API | Pending |
| API-03 | Phase 6: History API | Pending |
| API-04 | Phase 6: History API | Pending |
| UI-01 | Phase 7: Config History UI | Pending |
| UI-02 | Phase 7: Config History UI | Pending |
| UI-03 | Phase 8: Diff Viewer & Download | Pending |
| UI-04 | Phase 8: Diff Viewer & Download | Pending |
| OBS-01 | Phase 10: Audit & Observability | Pending |
| OBS-02 | Phase 10: Audit & Observability | Pending |
**Coverage:**
- v1 requirements: 25 total
- Mapped to phases: 25
- Unmapped: 0
---
*Requirements defined: 2026-03-12*
*Last updated: 2026-03-12 after roadmap creation*

.planning/STATE.md Normal file

@@ -0,0 +1,81 @@
---
gsd_state_version: 1.0
milestone: v9.6
milestone_name: milestone
status: in_progress
stopped_at: Completed 02-01-PLAN.md
last_updated: "2026-03-13T01:49:00Z"
last_activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
progress:
total_phases: 10
completed_phases: 1
total_plans: 3
completed_plans: 2
percent: 67
---
# Project State
## Project Reference
See: .planning/PROJECT.md (updated 2026-03-12)
**Core value:** Operators can see exactly what changed on a router and when, with reliable config snapshots for download
**Current focus:** Phase 2: Poller Config Collection
## Current Position
Phase: 2 of 10 (Poller Config Collection)
Plan: 1 of 2 in current phase (02-01 complete)
Status: Phase 2 in progress
Last activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
Progress: [███████░░░] 67%
## Performance Metrics
**Velocity:**
- Total plans completed: 2
- Average duration: 4min
- Total execution time: 0.13 hours
**By Phase:**
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| 01-database-schema | 1 | 3min | 3min |
| 02-poller-config-collection | 1 | 5min | 5min |
**Recent Trend:**
- Last 5 plans: 3min, 5min
- Trend: N/A
*Updated after each plan completion*
## Accumulated Context
### Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- [01-01] Models added to existing config_backup.py (same domain, consistent pattern)
- [01-01] config_text stores Transit ciphertext (vault:v1:...), plaintext never in DB
- [01-01] sha256_hash is of plaintext config for deduplication without decryption
- [02-01] TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))
- [02-01] NormalizationVersion=1 constant in NATS payloads for future re-processing
- [02-01] UpdateSSHHostKey uses COALESCE on first_seen to preserve original observation time
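
To make the [02-01] fingerprint decision concrete, the sketch below computes `SHA256:base64(sha256(pubkey))` over the raw public key blob and runs the three TOFU outcomes (first connect, match, mismatch). `ssh-keygen` prints the base64 digest without trailing `=` padding, so the sketch strips it. The function names and the returned labels are hypothetical, and the actual executor is Go code in `ssh_executor.go`.

```python
import base64
import hashlib


def fingerprint(pubkey_blob: bytes) -> str:
    """Fingerprint of a wire-format public key, in ssh-keygen's SHA256 style."""
    digest = hashlib.sha256(pubkey_blob).digest()
    # ssh-keygen omits the "=" padding from the base64 text.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def tofu_check(stored_fingerprint, presented_blob: bytes) -> str:
    """Classify a connection under trust-on-first-use.

    Returns "first_connect" when nothing is stored yet, "match" when the
    presented key has the stored fingerprint, and "mismatch" otherwise.
    """
    presented = fingerprint(presented_blob)
    if not stored_fingerprint:
        return "first_connect"
    return "match" if presented == stored_fingerprint else "mismatch"
```

On `first_connect` the caller would persist the fingerprint; on `mismatch` it would refuse the connection rather than overwrite the stored value.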
### Pending Todos
None yet.
### Blockers/Concerns
- OpenBao dev instance loses Transit keys on data wipe -- device creds need re-entry (from project memory, may affect snapshot encryption testing)
## Session Continuity
Last session: 2026-03-13T01:49:00Z
Stopped at: Completed 02-01-PLAN.md
Resume file: .planning/phases/02-poller-config-collection/02-02-PLAN.md


@@ -0,0 +1,128 @@
---
phase: 02-poller-config-collection
plan: 01
subsystem: poller
tags: [ssh, tofu, routeros, config-normalization, sha256, nats, prometheus, alembic]
requires:
- phase: 01-database-schema
provides: router_config_snapshots table for storing backup data
provides:
- SSH command executor with TOFU host key verification and typed error classification
- Config normalizer with deterministic SHA256 hashing
- ConfigSnapshotEvent NATS event type and PublishConfigSnapshot method
- Config backup environment variables (interval, concurrency, timeout)
- Device model SSH fields (port, host key fingerprint) with UpdateSSHHostKey method
- Alembic migration 028 for devices table SSH columns
- Prometheus metrics for config backup observability
affects: [02-02-backup-scheduler, 03-backend-subscriber]
tech-stack:
added: []
patterns:
- "TOFU host key verification via SHA256 fingerprint comparison"
- "Config normalization pipeline: line endings, timestamp strip, whitespace trim, blank collapse"
- "SSH error classification into typed SSHErrorKind enum"
key-files:
created:
- poller/internal/device/ssh_executor.go
- poller/internal/device/ssh_executor_test.go
- poller/internal/device/normalize.go
- poller/internal/device/normalize_test.go
- backend/alembic/versions/028_device_ssh_host_key.py
modified:
- poller/internal/config/config.go
- poller/internal/bus/publisher.go
- poller/internal/store/devices.go
- poller/internal/observability/metrics.go
key-decisions:
- "TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))"
- "NormalizationVersion=1 constant included in NATS payloads for future re-processing"
- "UpdateSSHHostKey sets first_seen via COALESCE to preserve original observation time"
patterns-established:
- "SSH error classification: classifySSHError inspects error strings for auth/hostkey/timeout/refused patterns"
- "Config normalization: version-tracked deterministic pipeline for RouterOS export output"
requirements-completed: [COLL-01, COLL-02, COLL-06]
duration: 5min
completed: 2026-03-13
---
# Phase 02 Plan 01: Config Backup Primitives Summary
**SSH executor with TOFU host key verification, RouterOS config normalizer with SHA256 hashing, NATS snapshot event, and Alembic migration for device SSH columns**
## Performance
- **Duration:** 5 min
- **Started:** 2026-03-13T01:43:33Z
- **Completed:** 2026-03-13T01:48:38Z
- **Tasks:** 2
- **Files modified:** 9
## Accomplishments
- SSH RunCommand executor with context-aware dialing, TOFU host key callback, and 6-kind typed error classification
- Deterministic config normalizer: strips RouterOS timestamps, normalizes line endings, trims whitespace, collapses blanks, computes SHA256 hash
- 22 unit tests covering error classification, TOFU flows (first connect/match/mismatch), normalization edge cases, idempotency
- Config backup env vars, NATS ConfigSnapshotEvent, device model SSH extensions, migration 028, Prometheus metrics
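
The normalizer described above could be sketched as follows. This is not the code in `normalize.go`; it is a Python approximation under two assumptions: the RouterOS timestamp header looks like `# <date> <HH:MM:SS> by RouterOS <version>`, and "trims whitespace" means trailing whitespace per line. Function names are hypothetical.

```python
import hashlib
import re

NORMALIZATION_VERSION = 1  # mirrors the constant named in the decisions

# Assumed header shape: "# 2026-03-12 20:50:27 by RouterOS 7.14"
_TIMESTAMP_HEADER = re.compile(r"^#\s*\S+\s+\d{2}:\d{2}:\d{2}\s+by RouterOS\b")


def normalize_config(raw: str) -> str:
    """Normalize export output so equal configs produce equal text."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # line endings
    out = []
    for line in text.split("\n"):
        line = line.rstrip()  # trailing whitespace
        if _TIMESTAMP_HEADER.match(line):
            continue  # timestamp header changes on every export
        if line == "" and (not out or out[-1] == ""):
            continue  # drop leading blanks, collapse blank runs
        out.append(line)
    while out and out[-1] == "":
        out.pop()  # drop trailing blanks
    return "\n".join(out) + "\n" if out else ""


def hash_config(normalized: str) -> str:
    """Hex SHA256 of the normalized text."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two exports that differ only in timestamp, line endings, or stray whitespace normalize to the same text and therefore the same hash, and running `normalize_config` on its own output changes nothing.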
## Task Commits
Each task was committed atomically:
1. **Task 1: SSH executor, normalizer, and their tests** - `f1abb75` (feat)
2. **Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics** - `4ae39d2` (feat)
_Note: Task 1 used TDD -- tests written first (RED), implementation second (GREEN)._
## Files Created/Modified
- `poller/internal/device/ssh_executor.go` - RunCommand SSH executor with TOFU host key verification and typed errors
- `poller/internal/device/ssh_executor_test.go` - Unit tests for SSH error classification, TOFU callbacks, CommandResult
- `poller/internal/device/normalize.go` - NormalizeConfig and HashConfig for RouterOS export output
- `poller/internal/device/normalize_test.go` - Table-driven tests for normalization pipeline edge cases
- `poller/internal/config/config.go` - Added ConfigBackupIntervalSeconds, ConfigBackupMaxConcurrent, ConfigBackupCommandTimeoutSeconds
- `poller/internal/bus/publisher.go` - Added ConfigSnapshotEvent type, PublishConfigSnapshot method, config.snapshot.> stream subject
- `poller/internal/store/devices.go` - Added SSHPort/SSHHostKeyFingerprint fields, UpdateSSHHostKey method, updated queries
- `poller/internal/observability/metrics.go` - Added ConfigBackupTotal, ConfigBackupDuration, ConfigBackupActive metrics
- `backend/alembic/versions/028_device_ssh_host_key.py` - Migration adding ssh_port, ssh_host_key_fingerprint, timestamp columns
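
The typed error classification in `ssh_executor.go` inspects error strings. A rough Python equivalent is sketched below. The summary mentions six kinds but names only the auth, host key, timeout, and refused patterns, so this sketch covers those four plus a catch-all; the enum members and the exact substrings are assumptions, not the poller's values.

```python
from enum import Enum


class SSHErrorKind(Enum):
    AUTH_FAILED = "auth_failed"
    HOST_KEY_MISMATCH = "host_key_mismatch"
    TIMEOUT = "timeout"
    CONNECTION_REFUSED = "connection_refused"
    UNKNOWN = "unknown"


# Substrings are illustrative guesses at typical error text.
_PATTERNS = [
    (SSHErrorKind.HOST_KEY_MISMATCH, ("host key",)),
    (SSHErrorKind.AUTH_FAILED, ("unable to authenticate", "permission denied")),
    (SSHErrorKind.TIMEOUT, ("timeout", "timed out", "deadline exceeded")),
    (SSHErrorKind.CONNECTION_REFUSED, ("connection refused",)),
]


def classify_ssh_error(message: str) -> SSHErrorKind:
    """Map an error message to a kind by case-insensitive substring match."""
    lowered = message.lower()
    for kind, needles in _PATTERNS:
        if any(needle in lowered for needle in needles):
            return kind
    return SSHErrorKind.UNKNOWN
```

A typed kind lets callers decide per category, for example retrying on a timeout but not on a host key mismatch.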
## Decisions Made
- TOFU fingerprint format uses SHA256:base64(sha256(pubkey)) to match ssh-keygen output format
- NormalizationVersion=1 constant is included in NATS payloads so consumers can detect algorithm changes
- UpdateSSHHostKey uses COALESCE on ssh_host_key_first_seen to preserve original observation timestamp
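
The COALESCE decision can be illustrated with an in-memory SQLite table. This is only a sketch: the production database and the real `UpdateSSHHostKey` query in `devices.go` are not shown in this summary, and the table layout here is reduced to the columns named above.

```python
import sqlite3


def update_ssh_host_key(conn, device_id, fingerprint, now):
    """Store the fingerprint; set first_seen only if it is still NULL."""
    conn.execute(
        """
        UPDATE devices
           SET ssh_host_key_fingerprint = ?,
               ssh_host_key_first_seen = COALESCE(ssh_host_key_first_seen, ?)
         WHERE id = ?
        """,
        (fingerprint, now, device_id),
    )


def demo():
    """Run two updates and return (fingerprint, first_seen) for device 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE devices ("
        "id INTEGER PRIMARY KEY, "
        "ssh_host_key_fingerprint TEXT, "
        "ssh_host_key_first_seen TEXT)"
    )
    conn.execute("INSERT INTO devices (id) VALUES (1)")
    update_ssh_host_key(conn, 1, "SHA256:first", "2026-03-13T01:00:00Z")
    update_ssh_host_key(conn, 1, "SHA256:second", "2026-03-14T01:00:00Z")
    return conn.execute(
        "SELECT ssh_host_key_fingerprint, ssh_host_key_first_seen "
        "FROM devices WHERE id = 1"
    ).fetchone()
```

After both updates the fingerprint reflects the latest write, while `ssh_host_key_first_seen` keeps the timestamp from the first one.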
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 1 - Bug] Fixed test key generation approach**
- **Found during:** Task 1 (GREEN phase)
- **Issue:** Embedded OpenSSH PEM test key had padding errors ("ssh: padding not as expected")
- **Fix:** Switched to programmatic ed25519 key generation via crypto/ed25519.GenerateKey
- **Files modified:** poller/internal/device/ssh_executor_test.go
- **Verification:** All 22 tests pass
- **Committed in:** f1abb75 (Task 1 commit)
---
**Total deviations:** 1 auto-fixed (1 bug)
**Impact on plan:** Minimal -- test infrastructure fix only, no production code change.
## Issues Encountered
None beyond the test key generation fix documented above.
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- All primitives ready for Plan 02 (backup scheduler) to wire together
- SSH executor, normalizer, NATS event, device model, config, and metrics are independently tested and compilable
- Migration 028 ready to apply before deploying the backup scheduler
---
*Phase: 02-poller-config-collection*
*Completed: 2026-03-13*