From d456fe58e95854fbb8e2b7af5153bb97d2987cc2 Mon Sep 17 00:00:00 2001
From: Jason Staack
Date: Thu, 12 Mar 2026 20:57:47 -0500
Subject: [PATCH] docs(02-02): complete backup scheduler plan

- SUMMARY.md with execution metrics and decisions
- STATE.md updated: Phase 2 complete, 3 plans done
- ROADMAP.md updated: Phase 2 marked complete
- REQUIREMENTS.md: COLL-03, COLL-05 marked complete

Co-Authored-By: Claude Opus 4.6
---
 .planning/REQUIREMENTS.md |   8 +-
 .planning/ROADMAP.md      |   4 +-
 .planning/STATE.md        |  35 +++---
 .../02-02-SUMMARY.md      | 100 ++++++++++++++++++
 4 files changed, 125 insertions(+), 22 deletions(-)
 create mode 100644 .planning/phases/02-poller-config-collection/02-02-SUMMARY.md

diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index f6bcd84..2a5ae4e 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -9,9 +9,9 @@
 
 - [x] **COLL-01**: Poller collects RouterOS config via SSH `/export show-sensitive` on a configurable interval (default 6h)
 - [x] **COLL-02**: Poller normalizes config output (trim whitespace, normalize line endings, remove timestamp headers)
-- [ ] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create`
+- [x] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create`
 - [ ] **COLL-04**: Manual backup trigger via POST `/api/tenants/{tenant_id}/devices/{device_id}/backup`
-- [ ] **COLL-05**: Unreachable routers log warning and retry next interval
+- [x] **COLL-05**: Unreachable routers log warning and retry next interval
 - [x] **COLL-06**: Collection interval configurable via `CONFIG_BACKUP_INTERVAL` environment variable
 
 ### Storage
@@ -70,9 +70,9 @@
 |-------------|-------|--------|
 | COLL-01 | Phase 2: Poller Config Collection | Complete |
 | COLL-02 | Phase 2: Poller Config Collection | Complete |
-| COLL-03 | Phase 2: Poller Config Collection | Pending |
+| COLL-03 | Phase 2: Poller Config Collection | Complete |
 | COLL-04 | Phase 4: Manual Backup Trigger | Pending |
-| COLL-05 | Phase 2: Poller Config Collection | Pending |
+| COLL-05 | Phase 2: Poller Config Collection | Complete |
 | COLL-06 | Phase 2: Poller Config Collection | Complete |
 | STOR-01 | Phase 1: Database Schema | Complete |
 | STOR-02 | Phase 3: Snapshot Ingestion | Pending |
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 9a17c74..d2fdc55 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -13,7 +13,7 @@ This roadmap delivers automated RouterOS configuration backup and change trackin
 Decimal phases appear between their surrounding integers in numeric order.
 
 - [x] **Phase 1: Database Schema** - Config snapshot, diff, and change tables with encryption and RLS (completed 2026-03-13)
-- [ ] **Phase 2: Poller Config Collection** - SSH export, normalization, and NATS publishing from Go poller
+- [x] **Phase 2: Poller Config Collection** - SSH export, normalization, and NATS publishing from Go poller (completed 2026-03-13)
 - [ ] **Phase 3: Snapshot Ingestion** - Backend NATS subscriber stores snapshots with SHA256 deduplication
 - [ ] **Phase 4: Manual Backup Trigger** - API endpoint for on-demand config backup via poller
 - [ ] **Phase 5: Diff Engine** - Unified diff generation and structured change parsing
@@ -175,7 +175,7 @@ Note: Phase 9 depends only on Phase 3 and Phase 10 depends on Phases 3/4/5, so P
 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
 | 1. Database Schema | 1/1 | Complete | 2026-03-13 |
-| 2. Poller Config Collection | 0/2 | Not started | - |
+| 2. Poller Config Collection | 2/2 | Complete | 2026-03-13 |
 | 3. Snapshot Ingestion | 0/1 | Not started | - |
 | 4. Manual Backup Trigger | 0/1 | Not started | - |
 | 5. Diff Engine | 0/2 | Not started | - |
diff --git a/.planning/STATE.md b/.planning/STATE.md
index 37b0e80..dc367f1 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v9.6
 milestone_name: milestone
 status: in_progress
-stopped_at: Completed 02-01-PLAN.md
-last_updated: "2026-03-13T01:49:00Z"
-last_activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
+stopped_at: Completed 02-02-PLAN.md
+last_updated: "2026-03-13T01:55:37Z"
+last_activity: 2026-03-13 -- Completed 02-02 backup scheduler (per-device goroutines, concurrency, main.go wiring)
 progress:
   total_phases: 10
-  completed_phases: 1
+  completed_phases: 2
   total_plans: 3
-  completed_plans: 2
+  completed_plans: 3
   percent: 100
 ---
 
@@ -25,26 +25,26 @@ See: .planning/PROJECT.md (updated 2026-03-12)
 
 ## Current Position
 
-Phase: 2 of 10 (Poller Config Collection)
-Plan: 1 of 2 in current phase (02-01 complete)
-Status: Phase 2 in progress
-Last activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
+Phase: 2 of 10 (Poller Config Collection) -- COMPLETE
+Plan: 2 of 2 in current phase (02-02 complete)
+Status: Phase 2 complete
+Last activity: 2026-03-13 -- Completed 02-02 backup scheduler with per-device goroutines and main.go wiring
 
-Progress: [███████░░░] 67%
+Progress: [██████████] 100%
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 2
+- Total plans completed: 3
 - Average duration: 4min
-- Total execution time: 0.13 hours
+- Total execution time: 0.20 hours
 
 **By Phase:**
 
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
 | 01-database-schema | 1 | 3min | 3min |
-| 02-poller-config-collection | 1 | 5min | 5min |
+| 02-poller-config-collection | 2 | 9min | 4.5min |
 
 **Recent Trend:**
 - Last 5 plans: none
@@ -65,6 +65,9 @@ Recent decisions affecting current work:
 - [02-01] TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))
 - [02-01] NormalizationVersion=1 constant in NATS payloads for future re-processing
 - [02-01] UpdateSSHHostKey uses COALESCE on first_seen to preserve original observation time
+- [02-02] BackupScheduler runs independently from status poll scheduler with separate goroutines
+- [02-02] Buffered channel semaphore for concurrency control (Go idiom, no external deps)
+- [02-02] Devices with no Redis status key assumed potentially online for first backup
 
 ### Pending Todos
 
@@ -76,6 +79,6 @@ None yet.
 
 ## Session Continuity
 
-Last session: 2026-03-13T01:49:00Z
-Stopped at: Completed 02-01-PLAN.md
-Resume file: .planning/phases/02-poller-config-collection/02-02-PLAN.md
+Last session: 2026-03-13T01:55:37Z
+Stopped at: Completed 02-02-PLAN.md (Phase 2 complete)
+Resume file: Next phase (03)
diff --git a/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md b/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md
new file mode 100644
index 0000000..1d2ee3e
--- /dev/null
+++ b/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md
@@ -0,0 +1,100 @@
+---
+phase: 02-poller-config-collection
+plan: 02
+subsystem: poller
+tags: [ssh, backup, scheduler, nats, routeros, concurrency, tofu, redis]
+
+requires:
+  - phase: 02-poller-config-collection/01
+    provides: SSH executor, config normalizer, NATS ConfigSnapshotEvent, Prometheus metrics, config fields
+provides:
+  - BackupScheduler with per-device goroutines managing periodic SSH config collection
+  - Concurrency-limited config backup pipeline (SSH -> normalize -> hash -> NATS publish)
+  - TOFU host key verification with persistent fingerprint storage
+  - Auth/hostkey error blocking with transient error exponential backoff
+  - SSHHostKeyUpdater consumer-side interface
+affects: [03-backend-snapshot-consumer, api, poller]
+
+tech-stack:
+  added: []
+  patterns: [per-device goroutine lifecycle, buffered channel semaphore, Redis online gating]
+
+key-files:
+  created:
+    - poller/internal/poller/backup_scheduler.go
+    - poller/internal/poller/backup_scheduler_test.go
+  modified:
+    - poller/internal/poller/interfaces.go
+    - poller/cmd/poller/main.go
+
+key-decisions:
+  - "BackupScheduler runs independently from status poll scheduler with separate goroutines"
+  - "Semaphore uses buffered channel pattern matching existing codebase style"
+  - "Device with no Redis status key assumed potentially online (first poll not yet completed)"
+
+patterns-established:
+  - "Backup goroutine pattern: jitter -> initial backup -> ticker loop with gating checks"
+  - "Error classification: auth/hostkey block retries, transient errors use exponential backoff"
+
+requirements-completed: [COLL-01, COLL-03, COLL-05, COLL-06]
+
+duration: 4min
+completed: 2026-03-13
+---
+
+# Phase 2 Plan 2: Backup Scheduler Summary
+
+**BackupScheduler orchestrating periodic SSH config collection with per-device goroutines, concurrency semaphore, TOFU verification, and NATS publishing**
+
+## Performance
+
+- **Duration:** 4 min
+- **Started:** 2026-03-13T01:51:27Z
+- **Completed:** 2026-03-13T01:55:37Z
+- **Tasks:** 2
+- **Files modified:** 4
+
+## Accomplishments
+- BackupScheduler manages per-device backup goroutines with 30-300s initial jitter
+- Concurrency limited by configurable buffered channel semaphore (default 10)
+- Auth failures and host key mismatches permanently block retries with clear log warnings
+- Transient errors use stepped backoff (5m/15m/1h cap)
+- Full pipeline wired into main.go running parallel to existing status poll scheduler
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: BackupScheduler with per-device goroutines** - `a884b09` (test) + `2653a32` (feat) -- TDD red/green
+2. **Task 2: Wire BackupScheduler into main.go** - `d34817a` (feat)
+
+## Files Created/Modified
+- `poller/internal/poller/backup_scheduler.go` - BackupScheduler with per-device goroutines, concurrency control, SSH collection, NATS publishing
+- `poller/internal/poller/backup_scheduler_test.go` - Unit tests for jitter, backoff, retry blocking, online gating, semaphore, reconciliation
+- `poller/internal/poller/interfaces.go` - Added SSHHostKeyUpdater consumer-side interface
+- `poller/cmd/poller/main.go` - BackupScheduler initialization and goroutine startup
+
+## Decisions Made
+- BackupScheduler runs independently from status poll scheduler -- separate goroutine pool, no shared state
+- Semaphore uses buffered channel pattern (consistent with Go idioms, no external deps)
+- Devices with no Redis status key assumed potentially online to avoid blocking first backup
+- Locker nil-check allows tests to run without Redis lock infrastructure
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Issues Encountered
+None
+
+## User Setup Required
+None - no external service configuration required.
+
+## Next Phase Readiness
+- Config backup pipeline complete: SSH -> normalize -> hash -> NATS publish
+- Backend snapshot consumer (Phase 3) can subscribe to config.snapshot.create.> to receive snapshots
+- Pre-existing integration test failures in poller package (missing certificate_authorities table) are unrelated to this work
+
+---
+*Phase: 02-poller-config-collection*
+*Completed: 2026-03-13*
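The collection pipeline in this patch is SSH -> normalize -> hash -> NATS publish. Together with the COLL-02 normalization rules, it can be illustrated with a small Go sketch. This is not the poller's actual normalizer. The names `normalizeConfig` and `contentHash` are hypothetical. The sketch also assumes a heuristic: any comment line containing ` by RouterOS ` is treated as the export timestamp header. SHA-256 is used because the roadmap describes SHA256 deduplication for snapshot ingestion.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// normalizeConfig is a hypothetical normalizer. It converts CRLF/CR line
// endings to LF, trims trailing spaces/tabs on every line, drops comment lines
// that look like the RouterOS export timestamp header (assumed heuristic:
// "#" prefix plus " by RouterOS "), and strips leading/trailing blank lines.
func normalizeConfig(raw string) string {
	s := strings.ReplaceAll(raw, "\r\n", "\n")
	s = strings.ReplaceAll(s, "\r", "\n")
	lines := strings.Split(s, "\n")
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		line = strings.TrimRight(line, " \t")
		if strings.HasPrefix(line, "#") && strings.Contains(line, " by RouterOS ") {
			continue // timestamp header changes on every export; ignore it
		}
		out = append(out, line)
	}
	return strings.Trim(strings.Join(out, "\n"), "\n") + "\n"
}

// contentHash returns the hex SHA-256 of the normalized text, usable as a dedup key.
func contentHash(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := "# 2026-03-13 01:51:27 by RouterOS 7.14\r\n/system identity\r\nset name=core-1  \r\n"
	b := "# 2026-03-13 07:51:30 by RouterOS 7.14\n/system identity\nset name=core-1\n"
	fmt.Println(contentHash(normalizeConfig(a)) == contentHash(normalizeConfig(b))) // true
}
```

The header line carries the export time, so two exports of an unchanged router differ only in that line. Dropping it before hashing lets identical configurations produce the same hash.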
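The [02-01] decision records the TOFU fingerprint format as `SHA256:base64(sha256(pubkey))`, matching `ssh-keygen`. The following is a minimal standard-library sketch. It assumes the input is the wire-format public key blob. OpenSSH prints the base64 without `=` padding, so the sketch uses `base64.RawStdEncoding`. `fingerprintSHA256` and `tofuCheck` are illustrative names. When `golang.org/x/crypto/ssh` is available, `ssh.FingerprintSHA256` produces the same string from an `ssh.PublicKey`.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// fingerprintSHA256 formats a fingerprint like `ssh-keygen -l -E sha256`:
// "SHA256:" + unpadded standard base64 of the SHA-256 digest of the key blob.
// keyBlob is assumed to be the marshalled (wire-format) public key.
func fingerprintSHA256(keyBlob []byte) string {
	sum := sha256.Sum256(keyBlob)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// tofuCheck is a hypothetical trust-on-first-use decision. With no stored
// fingerprint, the key is accepted and should be persisted. Otherwise the
// presented fingerprint must match the stored one exactly.
func tofuCheck(stored, presented string) (accept bool, firstUse bool) {
	if stored == "" {
		return true, true
	}
	return stored == presented, false
}

func main() {
	fp := fingerprintSHA256([]byte("example-key-blob"))
	fmt.Println(fp)
	ok, first := tofuCheck("", fp)
	fmt.Println(ok, first) // true true
	ok, first = tofuCheck(fp, fingerprintSHA256([]byte("other-key-blob")))
	fmt.Println(ok, first) // false false
}
```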