docs(02): create phase plan for poller config collection

Two plans covering SSH executor, config normalization, NATS publishing,
backup scheduler, and main.go wiring for periodic RouterOS config backup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jason Staack
2026-03-12 20:39:47 -05:00
parent a7a17a5ecd
commit 33f888a6e2
3 changed files with 888 additions and 0 deletions

.planning/ROADMAP.md Normal file

@@ -0,0 +1,186 @@
# Roadmap: RouterOS Config Backup & Change Tracking (v9.6)
## Overview
This roadmap delivers automated RouterOS configuration backup and change tracking as a new feature within the existing TOD platform. Work flows from database schema through the Go poller (collection), Python backend (storage, diffing, API), and React frontend (timeline, diff viewer, download). Each phase delivers a verifiable layer that the next phase builds on, culminating in a complete config history workflow with retention management and audit logging.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- [x] **Phase 1: Database Schema** - Config snapshot, diff, and change tables with encryption and RLS (completed 2026-03-13)
- [ ] **Phase 2: Poller Config Collection** - SSH export, normalization, and NATS publishing from Go poller
- [ ] **Phase 3: Snapshot Ingestion** - Backend NATS subscriber stores snapshots with SHA256 deduplication
- [ ] **Phase 4: Manual Backup Trigger** - API endpoint for on-demand config backup via poller
- [ ] **Phase 5: Diff Engine** - Unified diff generation and structured change parsing
- [ ] **Phase 6: History API** - REST endpoints for timeline, snapshot view, and diff retrieval with RBAC
- [ ] **Phase 7: Config History UI** - Timeline section on device page with change summaries
- [ ] **Phase 8: Diff Viewer & Download** - Unified diff display with syntax highlighting and .rsc download
- [ ] **Phase 9: Retention & Cleanup** - 90-day retention policy with automatic snapshot deletion
- [ ] **Phase 10: Audit & Observability** - Audit event logging for all config backup operations
## Phase Details
### Phase 1: Database Schema
**Goal**: Database tables exist to store config snapshots, diffs, and parsed changes with proper multi-tenant isolation and encryption
**Depends on**: Nothing (first phase)
**Requirements**: STOR-01, STOR-05
**Success Criteria** (what must be TRUE):
1. Alembic migration creates `router_config_snapshots`, `router_config_diffs`, and `router_config_changes` tables
2. All tables include `tenant_id` with RLS policies enforcing tenant isolation
3. Snapshot config_text column is encrypted at rest (field-level encryption via existing credential pattern)
4. SQLAlchemy models exist and can be imported by services
**Plans**: 1 plan
Plans:
- [ ] 01-01-PLAN.md — Alembic migration and SQLAlchemy models for config backup tables
### Phase 2: Poller Config Collection
**Goal**: Go poller periodically connects to RouterOS devices via SSH, exports config, normalizes output, and publishes to NATS
**Depends on**: Phase 1
**Requirements**: COLL-01, COLL-02, COLL-03, COLL-05, COLL-06
**Success Criteria** (what must be TRUE):
1. Poller runs `/export show-sensitive` via SSH on each RouterOS device at a configurable interval (default 6h)
2. Config output is normalized (timestamps stripped, whitespace trimmed, line endings unified) before publishing
3. Poller publishes config snapshot payload to NATS subject `config.snapshot.create` with device_id and tenant_id
4. Unreachable devices log a warning and are retried on the next interval without blocking other devices
5. Interval is configurable via `CONFIG_BACKUP_INTERVAL` environment variable
**Plans**: 2 plans
Plans:
- [ ] 02-01-PLAN.md — SSH executor, config normalizer, env vars, NATS event type, device model extensions, Alembic migration
- [ ] 02-02-PLAN.md — Backup scheduler with per-device goroutines, concurrency control, retry logic, and main.go wiring
### Phase 3: Snapshot Ingestion
**Goal**: Backend receives config snapshots from NATS, computes SHA256 hash, and stores new snapshots while skipping duplicates
**Depends on**: Phase 1, Phase 2
**Requirements**: STOR-02
**Success Criteria** (what must be TRUE):
1. Backend NATS subscriber consumes `config.snapshot.create` messages and persists snapshots to `router_config_snapshots`
2. When a snapshot has the same SHA256 hash as the device's most recent snapshot, it is skipped (no new row, no diff)
3. Each stored snapshot includes device_id, tenant_id, config_text (encrypted), sha256_hash, and collected_at timestamp
**Plans**: TBD
Plans:
- [ ] 03-01: NATS subscriber for config snapshot ingestion with deduplication
### Phase 4: Manual Backup Trigger
**Goal**: Operators can trigger an immediate config backup for a specific device through the API
**Depends on**: Phase 2, Phase 3
**Requirements**: COLL-04
**Success Criteria** (what must be TRUE):
1. POST `/api/tenants/{tenant_id}/devices/{device_id}/backup` triggers an immediate config collection for the specified device
2. The triggered backup flows through the same collection and ingestion pipeline as scheduled backups
3. Endpoint requires operator role or higher (viewers cannot trigger)
**Plans**: TBD
Plans:
- [ ] 04-01: Manual backup trigger API endpoint and NATS request flow
### Phase 5: Diff Engine
**Goal**: When a new (non-duplicate) snapshot is stored, the system generates a unified diff against the previous snapshot and parses structured changes
**Depends on**: Phase 3
**Requirements**: DIFF-01, DIFF-02, DIFF-03, DIFF-04
**Success Criteria** (what must be TRUE):
1. Unified diff is generated between consecutive snapshots when config content differs
2. Diff is stored in `router_config_diffs` linking the two snapshot IDs
3. Structured change parser extracts component name, human-readable summary, and raw diff line for each change
4. Parsed changes are stored in `router_config_changes` as JSON-structured records
**Plans**: TBD
Plans:
- [ ] 05-01: Unified diff generation between consecutive snapshots
- [ ] 05-02: Structured change parser and storage
### Phase 6: History API
**Goal**: Frontend can query config change timeline, retrieve full snapshots, and view diffs through RBAC-protected endpoints
**Depends on**: Phase 5
**Requirements**: API-01, API-02, API-03, API-04
**Success Criteria** (what must be TRUE):
1. GET `/api/tenants/{tid}/devices/{did}/config-history` returns paginated change timeline with component, summary, and timestamp
2. GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}` returns full snapshot content
3. GET `/api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff` returns unified diff text
4. All endpoints enforce RBAC: viewer+ can read history, operator+ required for backup trigger
5. Endpoints return proper 404 for nonexistent snapshots and 403 for unauthorized access
**Plans**: TBD
Plans:
- [ ] 06-01: Config history timeline endpoint
- [ ] 06-02: Snapshot view and diff retrieval endpoints with RBAC
### Phase 7: Config History UI
**Goal**: Device detail page displays a Configuration History section showing a timeline of config changes
**Depends on**: Phase 6
**Requirements**: UI-01, UI-02
**Success Criteria** (what must be TRUE):
1. Device detail page shows a "Configuration History" section below the Remote Access section
2. Timeline displays change entries with component badge, summary text, and relative timestamp
3. Timeline loads via TanStack Query and shows loading/empty states appropriately
**Plans**: TBD
Plans:
- [ ] 07-01: Configuration History section and change timeline component
### Phase 8: Diff Viewer & Download
**Goal**: Users can view unified diffs with syntax highlighting and download any snapshot as a .rsc file
**Depends on**: Phase 7
**Requirements**: UI-03, UI-04
**Success Criteria** (what must be TRUE):
1. Clicking a timeline entry opens a diff viewer showing unified diff with add (green) / remove (red) line highlighting
2. User can download any snapshot as `router-{device_name}-{timestamp}.rsc` file
3. Diff viewer handles large configs without performance degradation
**Plans**: TBD
Plans:
- [ ] 08-01: Unified diff viewer component with syntax highlighting
- [ ] 08-02: Snapshot download as .rsc file
### Phase 9: Retention & Cleanup
**Goal**: Snapshots older than the retention period are automatically cleaned up, keeping storage bounded
**Depends on**: Phase 3
**Requirements**: STOR-03, STOR-04
**Success Criteria** (what must be TRUE):
1. Snapshots older than 90 days (default) are automatically deleted along with their associated diffs and changes
2. Retention period is configurable via `CONFIG_RETENTION_DAYS` environment variable
3. Cleanup runs on a scheduled interval without blocking normal operations
**Plans**: TBD
Plans:
- [ ] 09-01: Retention cleanup scheduler and cascading deletion
### Phase 10: Audit & Observability
**Goal**: All config backup operations are logged as audit events for compliance and troubleshooting
**Depends on**: Phase 3, Phase 4, Phase 5
**Requirements**: OBS-01, OBS-02
**Success Criteria** (what must be TRUE):
1. `config_snapshot_created` audit event logged when a new snapshot is stored
2. `config_snapshot_skipped_duplicate` audit event logged when a duplicate snapshot is detected
3. `config_diff_generated` audit event logged when a diff is created between snapshots
4. `config_backup_manual_trigger` audit event logged when an operator triggers a manual backup
**Plans**: TBD
Plans:
- [ ] 10-01: Audit event emission for all config backup operations
## Progress
**Execution Order:**
Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10
Note: Phase 9 depends only on Phase 3 and Phase 10 depends on Phases 3/4/5, so Phases 9 and 10 can execute in parallel with Phases 6-8 if desired.
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Database Schema | 1/1 | Complete | 2026-03-13 |
| 2. Poller Config Collection | 0/2 | Not started | - |
| 3. Snapshot Ingestion | 0/1 | Not started | - |
| 4. Manual Backup Trigger | 0/1 | Not started | - |
| 5. Diff Engine | 0/2 | Not started | - |
| 6. History API | 0/2 | Not started | - |
| 7. Config History UI | 0/1 | Not started | - |
| 8. Diff Viewer & Download | 0/2 | Not started | - |
| 9. Retention & Cleanup | 0/1 | Not started | - |
| 10. Audit & Observability | 0/1 | Not started | - |


@@ -0,0 +1,308 @@
---
phase: 02-poller-config-collection
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - poller/internal/device/ssh_executor.go
  - poller/internal/device/ssh_executor_test.go
  - poller/internal/device/normalize.go
  - poller/internal/device/normalize_test.go
  - poller/internal/config/config.go
  - poller/internal/bus/publisher.go
  - poller/internal/observability/metrics.go
  - poller/internal/store/devices.go
  - backend/alembic/versions/028_device_ssh_host_key.py
autonomous: true
requirements: [COLL-01, COLL-02, COLL-06]
must_haves:
  truths:
    - "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
    - "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
    - "SHA256 hash is computed on normalized output"
    - "Config backup interval and concurrency are configurable via environment variables"
    - "Host key fingerprint is stored on device record for TOFU verification"
  artifacts:
    - path: "poller/internal/device/ssh_executor.go"
      provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
      exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"]
    - path: "poller/internal/device/normalize.go"
      provides: "NormalizeConfig function and SHA256 hashing"
      exports: ["NormalizeConfig", "HashConfig"]
    - path: "poller/internal/device/ssh_executor_test.go"
      provides: "Unit tests for SSH executor error classification"
    - path: "poller/internal/device/normalize_test.go"
      provides: "Unit tests for config normalization with edge cases"
    - path: "poller/internal/config/config.go"
      provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
    - path: "poller/internal/bus/publisher.go"
      provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
    - path: "poller/internal/store/devices.go"
      provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
    - path: "backend/alembic/versions/028_device_ssh_host_key.py"
      provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
  key_links:
    - from: "poller/internal/device/ssh_executor.go"
      to: "poller/internal/store/devices.go"
      via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
      pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint"
    - from: "poller/internal/device/normalize.go"
      to: "poller/internal/bus/publisher.go"
      via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
      pattern: "NormalizeConfig|HashConfig"
---
<objective>
Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.
Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.
Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.
</objective>
<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/01-database-schema/01-01-SUMMARY.md
@poller/internal/device/sftp.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
@poller/internal/poller/scheduler.go
@poller/go.mod
<interfaces>
<!-- Existing patterns the executor must follow -->
From poller/internal/device/sftp.go:
```go
func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
// Uses ssh.InsecureIgnoreHostKey() — executor replaces this with TOFU callback
```
From poller/internal/store/devices.go:
```go
type Device struct {
	ID                          string
	TenantID                    string
	IPAddress                   string
	APIPort                     int
	APISSLPort                  int
	EncryptedCredentials        []byte
	EncryptedCredentialsTransit *string
	RouterOSVersion             *string
	MajorVersion                *int
	TLSMode                     string
	CACertPEM                   *string
}
// SSHPort and SSHHostKeyFingerprint need to be added
```
From poller/internal/bus/publisher.go:
```go
type Publisher struct { nc *nats.Conn; js jetstream.JetStream }
func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
// Follow this pattern for PublishConfigSnapshot
// Stream subjects list needs "config.snapshot.>" added
```
From poller/internal/config/config.go:
```go
func Load() (*Config, error)
// Uses getEnv(key, default) and getEnvInt(key, default) helpers
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: SSH executor, normalizer, and their tests</name>
<files>
poller/internal/device/ssh_executor.go,
poller/internal/device/ssh_executor_test.go,
poller/internal/device/normalize.go,
poller/internal/device/normalize_test.go
</files>
<behavior>
SSH Executor (ssh_executor_test.go):
- Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
- Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
- Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration)
Normalizer (normalize_test.go):
- Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line and following blank line, preserves software id comment
- Test line ending normalization: "\r\n" becomes "\n"
- Test trailing whitespace trimming: "/ip address   \n" becomes "/ip address\n"
- Test blank line collapsing: three consecutive blank lines become one
- Test trailing newline: output always ends with exactly one "\n"
- Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
- Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
- Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
- Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)
</behavior>
<action>
Create `poller/internal/device/ssh_executor.go`:
1. Define types:
- `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
- `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
- `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`
2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
- Returns (result, observedFingerprint, error)
- Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
- If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
- If knownFingerprint matches: accept
- If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
- Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` (same as ssh-keygen)
- Dial with context-aware timeout
- Create session, run command via session.Run()
- Capture stdout/stderr via session.StdoutPipe/StderrPipe or CombinedOutput pattern
- Classify errors using `classifySSHError(err)` helper that inspects error strings and types
- Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}
3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns
Create `poller/internal/device/normalize.go`:
1. `NormalizeConfig(raw string) string`:
- Replace \r\n with \n first, before any other processing
- Use regexp to strip the timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it
- Split into lines, trim trailing whitespace from each line
- Collapse consecutive blank lines (2+ empty lines become 1)
- Ensure single trailing newline
- Return normalized string
2. `HashConfig(normalized string) string`:
- Compute SHA256 of the normalized string bytes
- Return lowercase hex string (64 chars)
3. `const NormalizationVersion = 1` — for future tracking in NATS payload
Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1</automated>
</verify>
<done>
- RunCommand function compiles with correct signature returning (*CommandResult, fingerprint, error)
- SSHError type with Kind field covers all 6 error classifications
- TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
- NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
- HashConfig returns 64-char lowercase hex SHA256
- All unit tests pass
</done>
</task>
<task type="auto">
<name>Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics</name>
<files>
poller/internal/config/config.go,
poller/internal/bus/publisher.go,
poller/internal/store/devices.go,
poller/internal/observability/metrics.go,
backend/alembic/versions/028_device_ssh_host_key.py
</files>
<action>
**1. Config env vars** (`config.go`):
Add three fields to the Config struct and load them in Load():
- `ConfigBackupIntervalSeconds int`, loaded via `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
- `ConfigBackupMaxConcurrent int`, loaded via `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
- `ConfigBackupCommandTimeoutSeconds int`, loaded via `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`
**2. NATS event type and publisher** (`publisher.go`):
- Add `ConfigSnapshotEvent` struct:
```go
type ConfigSnapshotEvent struct {
	DeviceID             string `json:"device_id"`
	TenantID             string `json:"tenant_id"`
	RouterOSVersion      string `json:"routeros_version,omitempty"`
	CollectedAt          string `json:"collected_at"` // RFC3339
	SHA256Hash           string `json:"sha256_hash"`
	ConfigText           string `json:"config_text"`
	NormalizationVersion int    `json:"normalization_version"`
}
```
- Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
- Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
- Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`
**3. Device model extensions** (`devices.go`):
- Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
- Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
- Update GetDevice query similarly
- Update both Scan calls to include the new fields
- Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:
```go
const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
```
(This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint) — handled in migration)
**4. Alembic migration** (`028_device_ssh_host_key.py`):
Follow the raw SQL pattern from migration 027. Create migration that:
- `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
- `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
- Downgrade step 1: `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user` (must run while the columns still exist)
- Downgrade step 2: `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`
**5. Prometheus metrics** (`metrics.go`):
Add config backup specific metrics:
- `ConfigBackupTotal` CounterVec with labels ["status"] — status: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `ConfigBackupDuration` Histogram — buckets: [1, 5, 10, 30, 60, 120, 300]
- `ConfigBackupActive` Gauge — number of concurrent backup jobs running
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1</automated>
</verify>
<done>
- Config struct has 3 new backup config fields loading from env vars with correct defaults
- ConfigSnapshotEvent type exists with all required JSON fields
- PublishConfigSnapshot method exists following existing publisher pattern
- config.snapshot.> added to DEVICE_EVENTS stream subjects
- Device struct has SSHPort and SSHHostKeyFingerprint fields
- FetchDevices and GetDevice queries select and scan the new columns
- UpdateSSHHostKey method exists for TOFU fingerprint storage
- Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
- Three new Prometheus metrics registered for config backup observability
- All existing tests still pass, project compiles clean
</done>
</task>
</tasks>
<verification>
1. `cd poller && go build ./...` — entire project compiles
2. `cd poller && go vet ./...` — no static analysis issues
3. `cd poller && go test ./internal/device/ -v -count=1` — SSH executor and normalizer tests pass
4. `cd poller && go test ./internal/config/ -v -count=1` — config tests pass
5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`
</verification>
<success_criteria>
- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`
</output>


@@ -0,0 +1,394 @@
---
phase: 02-poller-config-collection
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
  - poller/internal/poller/backup_scheduler.go
  - poller/internal/poller/backup_scheduler_test.go
  - poller/internal/poller/interfaces.go
  - poller/cmd/poller/main.go
autonomous: true
requirements: [COLL-01, COLL-03, COLL-05, COLL-06]
must_haves:
  truths:
    - "Poller runs /export show-sensitive via SSH on each online RouterOS device at a configurable interval (default 6h)"
    - "Poller publishes normalized config snapshot to NATS config.snapshot.create with device_id, tenant_id, sha256_hash, config_text"
    - "Unreachable devices log a warning and are retried on the next interval without blocking other devices"
    - "Backup interval is configurable via CONFIG_BACKUP_INTERVAL environment variable"
    - "First backup runs with randomized jitter (30-300s) after device discovery"
    - "Global concurrency is limited via CONFIG_BACKUP_MAX_CONCURRENT semaphore"
    - "Auth failures and host key mismatches block retries until resolved"
  artifacts:
    - path: "poller/internal/poller/backup_scheduler.go"
      provides: "BackupScheduler managing per-device backup goroutines with concurrency, retry, and NATS publishing"
      exports: ["BackupScheduler", "NewBackupScheduler"]
      min_lines: 200
    - path: "poller/internal/poller/backup_scheduler_test.go"
      provides: "Unit tests for backup scheduling, jitter, concurrency, error handling"
    - path: "poller/internal/poller/interfaces.go"
      provides: "SSHHostKeyUpdater interface for device store dependency"
    - path: "poller/cmd/poller/main.go"
      provides: "BackupScheduler initialization and lifecycle wiring"
  key_links:
    - from: "poller/internal/poller/backup_scheduler.go"
      to: "poller/internal/device/ssh_executor.go"
      via: "Calls device.RunCommand to execute /export show-sensitive"
      pattern: "device\\.RunCommand"
    - from: "poller/internal/poller/backup_scheduler.go"
      to: "poller/internal/device/normalize.go"
      via: "Calls device.NormalizeConfig and device.HashConfig on SSH output"
      pattern: "device\\.NormalizeConfig|device\\.HashConfig"
    - from: "poller/internal/poller/backup_scheduler.go"
      to: "poller/internal/bus/publisher.go"
      via: "Calls publisher.PublishConfigSnapshot with ConfigSnapshotEvent"
      pattern: "publisher\\.PublishConfigSnapshot|bus\\.ConfigSnapshotEvent"
    - from: "poller/internal/poller/backup_scheduler.go"
      to: "poller/internal/store/devices.go"
      via: "Calls store.UpdateSSHHostKey for TOFU fingerprint storage"
      pattern: "UpdateSSHHostKey"
    - from: "poller/cmd/poller/main.go"
      to: "poller/internal/poller/backup_scheduler.go"
      via: "Creates and starts BackupScheduler in main goroutine lifecycle"
      pattern: "NewBackupScheduler|backupScheduler\\.Run"
---
<objective>
Build the backup scheduler that orchestrates periodic SSH config collection from RouterOS devices, normalizes output, and publishes to NATS. Wire it into the poller's main lifecycle.
Purpose: This is the core orchestration that ties together the SSH executor, normalizer, and NATS publisher from Plan 01 into a running backup collection system with proper scheduling, concurrency control, error handling, and retry logic.
Output: BackupScheduler module fully integrated into the poller's main.go lifecycle.
</objective>
<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
@poller/internal/poller/scheduler.go
@poller/internal/poller/worker.go
@poller/internal/poller/interfaces.go
@poller/cmd/poller/main.go
@poller/internal/device/ssh_executor.go
@poller/internal/device/normalize.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
<interfaces>
<!-- From Plan 01 outputs (executor and normalizer) -->
From poller/internal/device/ssh_executor.go (created in Plan 01):
```go
type SSHErrorKind string
const (
	ErrAuthFailed        SSHErrorKind = "auth_failed"
	ErrHostKeyMismatch   SSHErrorKind = "host_key_mismatch"
	ErrTimeout           SSHErrorKind = "timeout"
	ErrTruncatedOutput   SSHErrorKind = "truncated_output"
	ErrConnectionRefused SSHErrorKind = "connection_refused"
	ErrUnknown           SSHErrorKind = "unknown"
)
type SSHError struct { Kind SSHErrorKind; Err error; Message string }
type CommandResult struct { Stdout string; Stderr string; ExitCode int; Duration time.Duration }
func RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)
```
From poller/internal/device/normalize.go (created in Plan 01):
```go
func NormalizeConfig(raw string) string
func HashConfig(normalized string) string
const NormalizationVersion = 1
```
From poller/internal/bus/publisher.go (modified in Plan 01):
```go
type ConfigSnapshotEvent struct {
	DeviceID             string `json:"device_id"`
	TenantID             string `json:"tenant_id"`
	RouterOSVersion      string `json:"routeros_version,omitempty"`
	CollectedAt          string `json:"collected_at"`
	SHA256Hash           string `json:"sha256_hash"`
	ConfigText           string `json:"config_text"`
	NormalizationVersion int    `json:"normalization_version"`
}
func (p *Publisher) PublishConfigSnapshot(ctx context.Context, event ConfigSnapshotEvent) error
```
From poller/internal/store/devices.go (modified in Plan 01):
```go
type Device struct {
	// ... existing fields ...
	SSHPort               int
	SSHHostKeyFingerprint *string
}
func (s *DeviceStore) UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
```
From poller/internal/config/config.go (modified in Plan 01):
```go
type Config struct {
	// ... existing fields ...
	ConfigBackupIntervalSeconds       int
	ConfigBackupMaxConcurrent         int
	ConfigBackupCommandTimeoutSeconds int
}
```
From poller/internal/observability/metrics.go (modified in Plan 01):
```go
var ConfigBackupTotal *prometheus.CounterVec // labels: ["status"]
var ConfigBackupDuration prometheus.Histogram
var ConfigBackupActive prometheus.Gauge
```
<!-- Existing patterns to follow -->
From poller/internal/poller/scheduler.go:
```go
type Scheduler struct { ... }
func NewScheduler(...) *Scheduler
func (s *Scheduler) Run(ctx context.Context) error
func (s *Scheduler) reconcileDevices(ctx context.Context, wg *sync.WaitGroup) error
func (s *Scheduler) runDeviceLoop(ctx context.Context, dev store.Device, ds *deviceState) // per-device goroutine with ticker
```
From poller/internal/poller/interfaces.go:
```go
type DeviceFetcher interface {
FetchDevices(ctx context.Context) ([]store.Device, error)
}
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: BackupScheduler with per-device goroutines, concurrency control, and retry logic</name>
<files>
poller/internal/poller/backup_scheduler.go,
poller/internal/poller/backup_scheduler_test.go,
poller/internal/poller/interfaces.go
</files>
<behavior>
- Test jitter generation: randomJitter(30, 300) returns value in [30s, 300s] range
- Test backoff sequence: given consecutive failures, backoff returns 5m, 15m, 1h, then caps at 1h
- Test auth failure blocking: when last error is ErrAuthFailed, shouldRetry returns false
- Test host key mismatch blocking: when last error is ErrHostKeyMismatch, shouldRetry returns false
- Test online-only gating: backup is skipped for devices not currently marked online
- Test concurrency semaphore: when semaphore is full, backup waits (does not drop)
</behavior>
<action>
**1. Update interfaces.go:**
Add `SSHHostKeyUpdater` interface (consumer-side, Go best practice):
```go
type SSHHostKeyUpdater interface {
UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
}
```
**2. Create backup_scheduler.go:**
Define `backupDeviceState` struct tracking per-device backup state:
- `cancel context.CancelFunc`
- `lastAttemptAt time.Time`
- `lastSuccessAt time.Time`
- `lastStatus string` — "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `lastError string`
- `consecutiveFailures int`
- `backoffUntil time.Time`
- `lastErrorKind device.SSHErrorKind` — tracks whether error is auth/hostkey (blocks retry)
Define `BackupScheduler` struct:
- `store DeviceFetcher` — reuse existing interface for FetchDevices
- `hostKeyStore SSHHostKeyUpdater` — for UpdateSSHHostKey
- `locker *redislock.Client` — per-device distributed lock
- `publisher *bus.Publisher` — for NATS publishing
- `credentialCache *vault.CredentialCache` — for decrypting device SSH creds
- `redisClient *redis.Client` — for tracking device online status
- `backupInterval time.Duration`
- `commandTimeout time.Duration`
- `refreshPeriod time.Duration` — how often to reconcile devices (reuse from existing scheduler, e.g., 60s)
- `semaphore chan struct{}` — buffered channel of size maxConcurrent
- `mu sync.Mutex`
- `activeDevices map[string]*backupDeviceState`
`NewBackupScheduler(...)` constructor — accept all dependencies, create semaphore as `make(chan struct{}, maxConcurrent)`.
`Run(ctx context.Context) error` — mirrors existing Scheduler.Run pattern:
- defer shutdown: cancel all device goroutines, wait for WaitGroup
- Loop: reconcileBackupDevices(ctx, &wg), then select on ctx.Done or time.After(refreshPeriod)
`reconcileBackupDevices(ctx, wg)` — mirrors reconcileDevices:
- FetchDevices from store
- Start backup goroutines for new devices
- Stop goroutines for removed devices
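The reconcile step reduces to a set difference between the devices that already have a backup goroutine and the devices returned by FetchDevices. A minimal sketch, assuming devices are keyed by their ID string; `diffDeviceIDs` is a hypothetical helper and not part of the existing scheduler:

```go
package main

import "sort"

// diffDeviceIDs compares the IDs that currently have a backup goroutine
// (active) with the IDs just fetched from the store (fetched).
// It returns the IDs that need a new goroutine and the IDs whose goroutine
// should be cancelled. Results are sorted so the output is deterministic.
// NOTE: hypothetical helper for illustration only.
func diffDeviceIDs(active map[string]struct{}, fetched []string) (toStart, toStop []string) {
	seen := make(map[string]struct{}, len(fetched))
	for _, id := range fetched {
		if _, dup := seen[id]; dup {
			continue // ignore duplicate IDs in the fetched list
		}
		seen[id] = struct{}{}
		if _, ok := active[id]; !ok {
			toStart = append(toStart, id)
		}
	}
	for id := range active {
		if _, ok := seen[id]; !ok {
			toStop = append(toStop, id)
		}
	}
	sort.Strings(toStart)
	sort.Strings(toStop)
	return toStart, toStop
}
```

In the real scheduler the map values would be `*backupDeviceState`, and stopping a device would mean calling its `cancel` function and deleting the map entry while holding `mu`.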
`runBackupLoop(ctx, dev, state)` — per-device backup goroutine:
- On first run: sleep for the duration returned by randomJitter(30, 300) (30-300s), then do initial backup
- After initial: ticker at backupInterval
- On each tick:
a. Check if device is online via Redis key `device:{id}:status` (set by status poll). If not online, log debug "skipped_offline", update state, increment ConfigBackupTotal("skipped_offline"), continue
b. Check if lastErrorKind is ErrAuthFailed — skip with "skipped_auth_blocked", log warning with guidance to update credentials
c. Check if lastErrorKind is ErrHostKeyMismatch — skip with "skipped_hostkey_blocked", log warning with guidance to reset host key
d. Check backoff: if time.Now().Before(state.backoffUntil), skip
e. Acquire semaphore (blocks if at max concurrency, does not drop)
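f. Acquire Redis lock `backup:device:{id}` with TTL = commandTimeout + 30s. If the lock is already held (another poller instance is backing up this device), release the semaphore and skip this tick
g. Call `collectAndPublish(ctx, dev, state)`
h. Release the Redis lock, then release the semaphore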
i. Update state based on result
`collectAndPublish(ctx, dev, state) error`:
- Increment ConfigBackupActive gauge
- Defer decrement ConfigBackupActive gauge
- Start timer for ConfigBackupDuration
- Decrypt credentials via credentialCache.GetCredentials
- Call `device.RunCommand(ctx, dev.IPAddress, dev.SSHPort, username, password, commandTimeout, knownFingerprint, "/export show-sensitive")`
- On error: classify error kind, update state, apply backoff (transient: 5m/15m/1h exponential; auth/hostkey: block), return
- If new fingerprint returned (TOFU first connect): call hostKeyStore.UpdateSSHHostKey
- Validate output is non-empty and looks like RouterOS config (basic sanity: contains "/")
- Call `device.NormalizeConfig(result.Stdout)`
- Call `device.HashConfig(normalized)`
- Build `bus.ConfigSnapshotEvent` with device_id, tenant_id, routeros_version (from device or Redis), collected_at (RFC3339 now), sha256_hash, config_text, normalization_version
- Call `publisher.PublishConfigSnapshot(ctx, event)`
- On success: reset consecutiveFailures, update lastSuccessAt, increment ConfigBackupTotal("success")
- Record ConfigBackupDuration
`randomJitter(minSeconds, maxSeconds int) time.Duration` — uses math/rand for uniform distribution
Backoff for transient errors: `calculateBackupBackoff(failures int) time.Duration`:
- 1 failure: 5 min
- 2 failures: 15 min
- 3+ failures: 1 hour (cap)
Device online check via Redis: check if key `device:{id}:status` equals "online". This key is set by the existing status poll publisher flow. If key doesn't exist, assume device might be online (first poll hasn't happened yet) — allow backup attempt.
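The "missing key means maybe online" rule is easy to get backwards, so it may help to isolate it from the Redis call. A minimal sketch, assuming the caller has already read the key and knows whether it existed; both function names are hypothetical:

```go
package main

import "fmt"

// statusKey builds the Redis key written by the status poll flow.
func statusKey(deviceID string) string {
	return fmt.Sprintf("device:%s:status", deviceID)
}

// isDeviceOnline interprets the value read from statusKey.
// found is false when the key does not exist yet (no status poll has run),
// in which case the backup attempt is allowed.
// NOTE: hypothetical helper for illustration only.
func isDeviceOnline(status string, found bool) bool {
	if !found {
		return true
	}
	return status == "online"
}
```

How a Redis read error other than "key not found" should be treated is not covered by the plan and is left open here.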
RouterOS version: read from the Device struct's RouterOSVersion field (populated by store query). If nil, use empty string in the event.
**Important implementation notes:**
- Use `log/slog` for all logging (structured JSON, matching existing pattern)
- Use existing `redislock` pattern from worker.go for per-device locking
- Semaphore pattern: `s.semaphore <- struct{}{}` to acquire, `<-s.semaphore` to release
- Do NOT share circuit breaker state with the status poll scheduler — these are independent
- Partial/truncated output (SSHError with Kind ErrTruncatedOutput) is treated as transient error — never publish, apply backoff
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/poller/ -run "TestBackup|TestJitter|TestBackoff|TestShouldRetry" -v -count=1</automated>
</verify>
<done>
- BackupScheduler manages per-device backup goroutines independently from status poll scheduler
- First backup uses 30-300s random jitter delay
- Concurrency limited by buffered channel semaphore (default 10)
- Per-device Redis lock prevents duplicate backups across pods
- Auth failures and host key mismatches block retries with clear log messages
- Transient errors use 5m/15m/1h exponential backoff
- Offline devices are skipped without error
- Successful backups normalize config, compute SHA256, and publish to NATS
- TOFU fingerprint stored on first successful connection
- All unit tests pass
</done>
</task>
<task type="auto">
<name>Task 2: Wire BackupScheduler into main.go lifecycle</name>
<files>poller/cmd/poller/main.go</files>
<action>
Add BackupScheduler initialization and startup to main.go, following the existing pattern of scheduler initialization (lines 250-278).
After the existing scheduler creation (around line 270), add a new section:
```go
// -----------------------------------------------------------------------
// Start the config backup scheduler
// -----------------------------------------------------------------------
```
1. Convert config values to durations:
```go
backupInterval := time.Duration(cfg.ConfigBackupIntervalSeconds) * time.Second
backupCmdTimeout := time.Duration(cfg.ConfigBackupCommandTimeoutSeconds) * time.Second
```
2. Create BackupScheduler:
```go
backupScheduler := poller.NewBackupScheduler(
deviceStore,
deviceStore, // SSHHostKeyUpdater (DeviceStore satisfies this interface)
locker,
publisher,
credentialCache,
redisClient,
backupInterval,
backupCmdTimeout,
refreshPeriod, // reuse existing device refresh period
cfg.ConfigBackupMaxConcurrent,
)
```
3. Start in a goroutine (runs in parallel with the main status poll scheduler):
```go
go func() {
slog.Info("starting config backup scheduler",
"interval", backupInterval,
"max_concurrent", cfg.ConfigBackupMaxConcurrent,
"command_timeout", backupCmdTimeout,
)
if err := backupScheduler.Run(ctx); err != nil {
slog.Error("backup scheduler exited with error", "error", err)
}
}()
```
The BackupScheduler shares the same ctx as everything else, so SIGINT/SIGTERM will trigger its shutdown via context cancellation. No additional shutdown logic needed — Run() returns when ctx is cancelled.
Log the startup with the same pattern as the existing scheduler startup log (lines 273-276).
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./cmd/poller/ && echo "build successful"</automated>
</verify>
<done>
- BackupScheduler created in main.go with all dependencies injected
- Runs as a goroutine parallel to the status poll scheduler
- Shares the same context for graceful shutdown
- Startup logged with interval, max_concurrent, and command_timeout
- Poller binary compiles successfully with the new scheduler wired in
</done>
</task>
</tasks>
<verification>
1. `cd poller && go build ./cmd/poller/` — binary compiles with backup scheduler wired in
2. `cd poller && go vet ./...` — no static analysis issues
3. `cd poller && go test ./internal/poller/ -v -count=1` — all poller tests pass (existing + new backup tests)
4. `cd poller && go test ./... -count=1` — full test suite passes
</verification>
<success_criteria>
- BackupScheduler runs independently from status poll scheduler with its own per-device goroutines
- Devices get their first backup 30-300s after discovery, then every CONFIG_BACKUP_INTERVAL
- SSH command execution uses TOFU host key verification and stores fingerprints on first connect
- Config output is normalized, hashed, and published to NATS config.snapshot.create
- Concurrency limited to CONFIG_BACKUP_MAX_CONCURRENT parallel SSH sessions
- Auth/hostkey errors block retries; transient errors use exponential backoff (5m/15m/1h)
- Offline devices are skipped gracefully
- BackupScheduler is wired into main.go and starts/stops with the poller lifecycle
- All tests pass, project compiles clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-poller-config-collection/02-02-SUMMARY.md`
</output>