docs(02-02): complete backup scheduler plan

- SUMMARY.md with execution metrics and decisions
- STATE.md updated: Phase 2 complete, 3 plans done
- ROADMAP.md updated: Phase 2 marked complete
- REQUIREMENTS.md: COLL-03, COLL-05 marked complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jason Staack
2026-03-12 20:57:47 -05:00
parent d34817a36c
commit d456fe58e9
4 changed files with 125 additions and 22 deletions


@@ -9,9 +9,9 @@
 - [x] **COLL-01**: Poller collects RouterOS config via SSH `/export show-sensitive` on a configurable interval (default 6h)
 - [x] **COLL-02**: Poller normalizes config output (trim whitespace, normalize line endings, remove timestamp headers)
-- [ ] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create`
+- [x] **COLL-03**: Poller sends config snapshot to API via NATS subject `config.snapshot.create`
 - [ ] **COLL-04**: Manual backup trigger via POST `/api/tenants/{tenant_id}/devices/{device_id}/backup`
-- [ ] **COLL-05**: Unreachable routers log warning and retry next interval
+- [x] **COLL-05**: Unreachable routers log warning and retry next interval
 - [x] **COLL-06**: Collection interval configurable via `CONFIG_BACKUP_INTERVAL` environment variable
 ### Storage
@@ -70,9 +70,9 @@
 |-------------|-------|--------|
 | COLL-01 | Phase 2: Poller Config Collection | Complete |
 | COLL-02 | Phase 2: Poller Config Collection | Complete |
-| COLL-03 | Phase 2: Poller Config Collection | Pending |
+| COLL-03 | Phase 2: Poller Config Collection | Complete |
 | COLL-04 | Phase 4: Manual Backup Trigger | Pending |
-| COLL-05 | Phase 2: Poller Config Collection | Pending |
+| COLL-05 | Phase 2: Poller Config Collection | Complete |
 | COLL-06 | Phase 2: Poller Config Collection | Complete |
 | STOR-01 | Phase 1: Database Schema | Complete |
 | STOR-02 | Phase 3: Snapshot Ingestion | Pending |


@@ -13,7 +13,7 @@ This roadmap delivers automated RouterOS configuration backup and change trackin
 Decimal phases appear between their surrounding integers in numeric order.
 - [x] **Phase 1: Database Schema** - Config snapshot, diff, and change tables with encryption and RLS (completed 2026-03-13)
-- [ ] **Phase 2: Poller Config Collection** - SSH export, normalization, and NATS publishing from Go poller
+- [x] **Phase 2: Poller Config Collection** - SSH export, normalization, and NATS publishing from Go poller (completed 2026-03-13)
 - [ ] **Phase 3: Snapshot Ingestion** - Backend NATS subscriber stores snapshots with SHA256 deduplication
 - [ ] **Phase 4: Manual Backup Trigger** - API endpoint for on-demand config backup via poller
 - [ ] **Phase 5: Diff Engine** - Unified diff generation and structured change parsing
@@ -175,7 +175,7 @@ Note: Phase 9 depends only on Phase 3 and Phase 10 depends on Phases 3/4/5, so P
 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
 | 1. Database Schema | 1/1 | Complete | 2026-03-13 |
-| 2. Poller Config Collection | 0/2 | Not started | - |
+| 2. Poller Config Collection | 2/2 | Complete | 2026-03-13 |
 | 3. Snapshot Ingestion | 0/1 | Not started | - |
 | 4. Manual Backup Trigger | 0/1 | Not started | - |
 | 5. Diff Engine | 0/2 | Not started | - |


@@ -3,14 +3,14 @@ gsd_state_version: 1.0
 milestone: v9.6
 milestone_name: milestone
 status: in_progress
-stopped_at: Completed 02-01-PLAN.md
-last_updated: "2026-03-13T01:49:00Z"
-last_activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
+stopped_at: Completed 02-02-PLAN.md
+last_updated: "2026-03-13T01:55:37Z"
+last_activity: 2026-03-13 -- Completed 02-02 backup scheduler (per-device goroutines, concurrency, main.go wiring)
 progress:
 total_phases: 10
-completed_phases: 1
+completed_phases: 2
 total_plans: 3
-completed_plans: 2
+completed_plans: 3
 percent: 100
 ---
@@ -25,26 +25,26 @@ See: .planning/PROJECT.md (updated 2026-03-12)
 ## Current Position
-Phase: 2 of 10 (Poller Config Collection)
-Plan: 1 of 2 in current phase (02-01 complete)
-Status: Phase 2 in progress
-Last activity: 2026-03-13 -- Completed 02-01 config backup primitives (SSH executor, normalizer, NATS event, migration)
+Phase: 2 of 10 (Poller Config Collection) -- COMPLETE
+Plan: 2 of 2 in current phase (02-02 complete)
+Status: Phase 2 complete
+Last activity: 2026-03-13 -- Completed 02-02 backup scheduler with per-device goroutines and main.go wiring
-Progress: [███████░░░] 67%
+Progress: [██████████] 100%
 ## Performance Metrics
 **Velocity:**
-- Total plans completed: 2
+- Total plans completed: 3
 - Average duration: 4min
-- Total execution time: 0.13 hours
+- Total execution time: 0.20 hours
 **By Phase:**
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
 | 01-database-schema | 1 | 3min | 3min |
-| 02-poller-config-collection | 1 | 5min | 5min |
+| 02-poller-config-collection | 2 | 9min | 4.5min |
 **Recent Trend:**
 - Last 5 plans: none
@@ -65,6 +65,9 @@ Recent decisions affecting current work:
 - [02-01] TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))
 - [02-01] NormalizationVersion=1 constant in NATS payloads for future re-processing
 - [02-01] UpdateSSHHostKey uses COALESCE on first_seen to preserve original observation time
+- [02-02] BackupScheduler runs independently from status poll scheduler with separate goroutines
+- [02-02] Buffered channel semaphore for concurrency control (Go idiom, no external deps)
+- [02-02] Devices with no Redis status key assumed potentially online for first backup
 ### Pending Todos
@@ -76,6 +79,6 @@ None yet.
 ## Session Continuity
-Last session: 2026-03-13T01:49:00Z
-Stopped at: Completed 02-01-PLAN.md
-Resume file: .planning/phases/02-poller-config-collection/02-02-PLAN.md
+Last session: 2026-03-13T01:55:37Z
+Stopped at: Completed 02-02-PLAN.md (Phase 2 complete)
+Resume file: Next phase (03)


@@ -0,0 +1,100 @@
---
phase: 02-poller-config-collection
plan: 02
subsystem: poller
tags: [ssh, backup, scheduler, nats, routeros, concurrency, tofu, redis]
requires:
- phase: 02-poller-config-collection/01
provides: SSH executor, config normalizer, NATS ConfigSnapshotEvent, Prometheus metrics, config fields
provides:
- BackupScheduler with per-device goroutines managing periodic SSH config collection
- Concurrency-limited config backup pipeline (SSH -> normalize -> hash -> NATS publish)
- TOFU host key verification with persistent fingerprint storage
- Auth/hostkey error blocking with transient error exponential backoff
- SSHHostKeyUpdater consumer-side interface
affects: [03-backend-snapshot-consumer, api, poller]
tech-stack:
added: []
patterns: [per-device goroutine lifecycle, buffered channel semaphore, Redis online gating]
key-files:
created:
- poller/internal/poller/backup_scheduler.go
- poller/internal/poller/backup_scheduler_test.go
modified:
- poller/internal/poller/interfaces.go
- poller/cmd/poller/main.go
key-decisions:
- "BackupScheduler runs independently from status poll scheduler with separate goroutines"
- "Semaphore uses buffered channel pattern matching existing codebase style"
- "Device with no Redis status key assumed potentially online (first poll not yet completed)"
patterns-established:
- "Backup goroutine pattern: jitter -> initial backup -> ticker loop with gating checks"
- "Error classification: auth/hostkey block retries, transient errors use exponential backoff"
requirements-completed: [COLL-01, COLL-03, COLL-05, COLL-06]
duration: 4min
completed: 2026-03-13
---
# Phase 2 Plan 2: Backup Scheduler Summary
**BackupScheduler orchestrating periodic SSH config collection with per-device goroutines, concurrency semaphore, TOFU verification, and NATS publishing**
## Performance
- **Duration:** 4 min
- **Started:** 2026-03-13T01:51:27Z
- **Completed:** 2026-03-13T01:55:37Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- BackupScheduler manages per-device backup goroutines with 30-300s initial jitter
- Concurrency limited by configurable buffered channel semaphore (default 10)
- Auth failures and host key mismatches permanently block retries with clear log warnings
- Transient errors use stepped backoff (5m/15m/1h cap)
- Full pipeline wired into main.go running parallel to existing status poll scheduler
## Task Commits
Each task was committed atomically:
1. **Task 1: BackupScheduler with per-device goroutines** - `a884b09` (test) + `2653a32` (feat) -- TDD red/green
2. **Task 2: Wire BackupScheduler into main.go** - `d34817a` (feat)
## Files Created/Modified
- `poller/internal/poller/backup_scheduler.go` - BackupScheduler with per-device goroutines, concurrency control, SSH collection, NATS publishing
- `poller/internal/poller/backup_scheduler_test.go` - Unit tests for jitter, backoff, retry blocking, online gating, semaphore, reconciliation
- `poller/internal/poller/interfaces.go` - Added SSHHostKeyUpdater consumer-side interface
- `poller/cmd/poller/main.go` - BackupScheduler initialization and goroutine startup
## Decisions Made
- BackupScheduler runs independently from status poll scheduler -- separate goroutine pool, no shared state
- Semaphore uses buffered channel pattern (consistent with Go idioms, no external deps)
- Devices with no Redis status key assumed potentially online to avoid blocking first backup
- Locker nil-check allows tests to run without Redis lock infrastructure
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Config backup pipeline complete: SSH -> normalize -> hash -> NATS publish
- Backend snapshot consumer (Phase 3) can subscribe to config.snapshot.create.> to receive snapshots
- Pre-existing integration test failures in poller package (missing certificate_authorities table) are unrelated to this work
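The normalize and hash stages of the pipeline can be illustrated with the standard library alone. This is a sketch under stated assumptions: the normalization rules are those listed in COLL-02, the header match (a comment line containing `by RouterOS`) is a guess at what a timestamp header looks like, and `normalize` and `snapshotHash` are hypothetical names rather than the poller's actual functions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// normalize applies the COLL-02 rules: unify line endings, trim trailing
// whitespace, and drop timestamp header lines.
func normalize(raw string) string {
	raw = strings.ReplaceAll(raw, "\r\n", "\n")
	raw = strings.ReplaceAll(raw, "\r", "\n")
	var out []string
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimRight(line, " \t")
		// Assumption: the timestamp header is a comment containing "by RouterOS".
		if strings.HasPrefix(line, "#") && strings.Contains(line, "by RouterOS") {
			continue
		}
		out = append(out, line)
	}
	return strings.TrimSpace(strings.Join(out, "\n")) + "\n"
}

// snapshotHash returns the hex SHA-256 of the normalized config, which a
// consumer could use to detect unchanged snapshots.
func snapshotHash(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := "# 2026-03-13 01:00:00 by RouterOS 7.14\r\n/ip address\r\nadd address=10.0.0.1/24 interface=ether1  \r\n"
	b := "# 2026-03-13 07:00:00 by RouterOS 7.14\n/ip address\nadd address=10.0.0.1/24 interface=ether1\n"
	fmt.Println("same hash:", snapshotHash(normalize(a)) == snapshotHash(normalize(b)))
}
```

Hashing after normalization is what lets two exports that differ only in export time or line endings resolve to the same snapshot.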
---
*Phase: 02-poller-config-collection*
*Completed: 2026-03-13*