docs(02): create phase plan for poller config collection

Two plans covering SSH executor, config normalization, NATS publishing,
backup scheduler, and main.go wiring for periodic RouterOS config backup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jason Staack
2026-03-12 20:39:47 -05:00
parent a7a17a5ecd
commit 33f888a6e2
3 changed files with 888 additions and 0 deletions

@@ -0,0 +1,308 @@
---
phase: 02-poller-config-collection
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- poller/internal/device/ssh_executor.go
- poller/internal/device/ssh_executor_test.go
- poller/internal/device/normalize.go
- poller/internal/device/normalize_test.go
- poller/internal/config/config.go
- poller/internal/bus/publisher.go
- poller/internal/observability/metrics.go
- poller/internal/store/devices.go
- backend/alembic/versions/028_device_ssh_host_key.py
autonomous: true
requirements: [COLL-01, COLL-02, COLL-06]
must_haves:
truths:
- "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
- "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
- "SHA256 hash is computed on normalized output"
- "Config backup interval and concurrency are configurable via environment variables"
- "Host key fingerprint is stored on device record for TOFU verification"
artifacts:
- path: "poller/internal/device/ssh_executor.go"
provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"]
- path: "poller/internal/device/normalize.go"
provides: "NormalizeConfig function and SHA256 hashing"
exports: ["NormalizeConfig", "HashConfig"]
- path: "poller/internal/device/ssh_executor_test.go"
provides: "Unit tests for SSH executor error classification"
- path: "poller/internal/device/normalize_test.go"
provides: "Unit tests for config normalization with edge cases"
- path: "poller/internal/config/config.go"
provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
- path: "poller/internal/bus/publisher.go"
provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
- path: "poller/internal/store/devices.go"
provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
- path: "backend/alembic/versions/028_device_ssh_host_key.py"
provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
key_links:
- from: "poller/internal/device/ssh_executor.go"
to: "poller/internal/store/devices.go"
via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint"
- from: "poller/internal/device/normalize.go"
to: "poller/internal/bus/publisher.go"
via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
pattern: "NormalizeConfig|HashConfig"
---
<objective>
Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.
Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.
Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.
</objective>
<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/01-database-schema/01-01-SUMMARY.md
@poller/internal/device/sftp.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
@poller/internal/poller/scheduler.go
@poller/go.mod
<interfaces>
<!-- Existing patterns the executor must follow -->
From poller/internal/device/sftp.go:
```go
func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
// Uses ssh.InsecureIgnoreHostKey() — executor replaces this with TOFU callback
```
From poller/internal/store/devices.go:
```go
type Device struct {
ID string
TenantID string
IPAddress string
APIPort int
APISSLPort int
EncryptedCredentials []byte
EncryptedCredentialsTransit *string
RouterOSVersion *string
MajorVersion *int
TLSMode string
CACertPEM *string
}
// SSHPort and SSHHostKeyFingerprint need to be added
```
From poller/internal/bus/publisher.go:
```go
type Publisher struct { nc *nats.Conn; js jetstream.JetStream }
func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
// Follow this pattern for PublishConfigSnapshot
// Stream subjects list needs "config.snapshot.>" added
```
From poller/internal/config/config.go:
```go
func Load() (*Config, error)
// Uses getEnv(key, default) and getEnvInt(key, default) helpers
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: SSH executor, normalizer, and their tests</name>
<files>
poller/internal/device/ssh_executor.go,
poller/internal/device/ssh_executor_test.go,
poller/internal/device/normalize.go,
poller/internal/device/normalize_test.go
</files>
<behavior>
SSH Executor (ssh_executor_test.go):
- Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
- Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
- Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration); errors are returned separately as SSHError, not as a struct field
Normalizer (normalize_test.go):
- Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line (plus the blank line immediately after it, if present) and preserves the software id comment
- Test line ending normalization: "\r\n" becomes "\n"
- Test trailing whitespace trimming: " /ip address \n" becomes "/ip address\n"
- Test blank line collapsing: three consecutive blank lines become one
- Test trailing newline: output always ends with exactly one "\n"
- Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
- Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
- Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
- Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)
</behavior>
<action>
Create `poller/internal/device/ssh_executor.go`:
1. Define types:
- `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
- `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
- `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`
2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
- Returns (result, observedFingerprint, error)
- Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
- If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
- If knownFingerprint matches: accept
- If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
- Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` with unpadded base64 (same as ssh-keygen and `ssh.FingerprintSHA256`)
- Dial with context-aware timeout
- Create session, run command via session.Run()
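- Capture stdout and stderr separately via session.StdoutPipe/StderrPipe or by assigning buffers to session.Stdout and session.Stderr before session.Run(); avoid CombinedOutput, which merges both streams and cannot populate separate Stdout/Stderr fields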
- Classify errors using `classifySSHError(err)` helper that inspects error strings and types
- Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}
3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns
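A minimal sketch of the TOFU decision described in step 2, using only the standard library. `checkHostKey`, `fingerprintSHA256`, and `errHostKeyMismatch` are hypothetical names for illustration; in the real executor this logic would sit inside an `ssh.HostKeyCallback`, the bytes would come from `key.Marshal()`, and the mismatch would be reported as `SSHError{Kind: ErrHostKeyMismatch}`.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
)

// errHostKeyMismatch is a hypothetical stand-in for
// SSHError{Kind: ErrHostKeyMismatch} from the plan.
var errHostKeyMismatch = errors.New("ssh host key mismatch")

// fingerprintSHA256 formats raw public key bytes the way ssh-keygen does:
// "SHA256:" followed by the unpadded base64 of the SHA-256 digest.
func fingerprintSHA256(publicKey []byte) string {
	sum := sha256.Sum256(publicKey)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// checkHostKey applies the trust-on-first-use rule. It always returns the
// observed fingerprint so the caller can persist it after a first connect.
func checkHostKey(knownFingerprint string, publicKey []byte) (string, error) {
	observed := fingerprintSHA256(publicKey)
	if knownFingerprint == "" {
		// First connect: nothing to compare against, so trust and report.
		return observed, nil
	}
	if knownFingerprint != observed {
		return observed, errHostKeyMismatch
	}
	return observed, nil
}
```

Returning the observed fingerprint on every path lets RunCommand hand it back as its second return value without a separate code path for the first connection.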
Create `poller/internal/device/normalize.go`:
1. `NormalizeConfig(raw string) string`:
- Replace \r\n with \n first, so every later step sees Unix line endings
- Use regexp to strip the timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it
- Split into lines, trim trailing whitespace from each line
- Collapse consecutive blank lines (2+ empty lines become 1)
- Ensure single trailing newline
- Return normalized string
2. `HashConfig(normalized string) string`:
- Compute SHA256 of the normalized string bytes
- Return lowercase hex string (64 chars)
3. `const NormalizationVersion = 1` — for future tracking in NATS payload
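A possible shape for the normalizer, as a sketch rather than the final implementation. It swaps the multi-line regexp for line-by-line matching (the header pattern is applied to each trimmed line), which is an assumption made here to keep the blank-line handling in one loop; the function names follow the plan.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
	"strings"
)

// timestampHeader matches the export header, for example
// "# 2024/01/15 10:30:00 by RouterOS 7.x". Other "# " comments do not match.
var timestampHeader = regexp.MustCompile(`^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS`)

// NormalizeConfig unifies line endings, trims trailing whitespace, drops the
// timestamp header and the blank line right after it, collapses runs of
// blank lines, and ends the output with exactly one newline.
func NormalizeConfig(raw string) string {
	raw = strings.ReplaceAll(raw, "\r\n", "\n")
	var out []string
	afterHeader := false // true while the line after the header is pending
	prevBlank := false   // true when the last kept line was blank
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimRight(line, " \t\r")
		switch {
		case timestampHeader.MatchString(line):
			afterHeader = true
			continue
		case line == "" && (afterHeader || prevBlank):
			afterHeader = false
			continue
		}
		afterHeader = false
		prevBlank = line == ""
		out = append(out, line)
	}
	return strings.TrimRight(strings.Join(out, "\n"), "\n") + "\n"
}

// HashConfig returns the lowercase hex SHA-256 of the normalized text.
func HashConfig(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}
```

Hashing only ever runs on the output of NormalizeConfig, so two exports that differ only in timestamp or whitespace yield the same hash.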
Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).
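For the classification tests, a table like the one sketched here could drive a table-driven test without any real SSH connection. The sample error strings are illustrative inputs chosen to contain the patterns named in step 3, not captured library output, and the `net.Error` timeout check is an added assumption on top of the string matching.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"strings"
)

type SSHErrorKind string

const (
	ErrAuthFailed        SSHErrorKind = "auth_failed"
	ErrHostKeyMismatch   SSHErrorKind = "host_key_mismatch"
	ErrTimeout           SSHErrorKind = "timeout"
	ErrConnectionRefused SSHErrorKind = "connection_refused"
	ErrUnknown           SSHErrorKind = "unknown"
)

// classifySSHError checks typed timeouts first, then falls back to the
// string patterns listed in the plan.
func classifySSHError(err error) SSHErrorKind {
	if err == nil {
		return ErrUnknown // callers are expected to pass a non-nil error
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return ErrTimeout
	}
	msg := err.Error()
	switch {
	case strings.Contains(msg, "unable to authenticate"):
		return ErrAuthFailed
	case strings.Contains(msg, "host key"):
		return ErrHostKeyMismatch
	case strings.Contains(msg, "i/o timeout"):
		return ErrTimeout
	case strings.Contains(msg, "connection refused"):
		return ErrConnectionRefused
	default:
		return ErrUnknown
	}
}

// classifyCases is a hypothetical test table; each row is one subtest.
var classifyCases = []struct {
	name string
	err  error
	want SSHErrorKind
}{
	{"auth", errors.New("ssh: handshake failed: ssh: unable to authenticate"), ErrAuthFailed},
	{"host key", errors.New("ssh: handshake failed: host key mismatch"), ErrHostKeyMismatch},
	{"timeout text", errors.New("dial tcp 10.0.0.1:22: i/o timeout"), ErrTimeout},
	{"deadline", fmt.Errorf("run command: %w", context.DeadlineExceeded), ErrTimeout},
	{"refused", errors.New("dial tcp 10.0.0.1:22: connect: connection refused"), ErrConnectionRefused},
	{"other", errors.New("EOF"), ErrUnknown},
}
```

In `ssh_executor_test.go` the same table would be iterated with `t.Run(c.name, ...)`, comparing `classifySSHError(c.err)` against `c.want`.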
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1</automated>
</verify>
<done>
- RunCommand function compiles with correct signature returning (*CommandResult, fingerprint, error)
- SSHError type with Kind field covers all 6 error classifications
- TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
- NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
- HashConfig returns 64-char lowercase hex SHA256
- All unit tests pass
</done>
</task>
<task type="auto">
<name>Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics</name>
<files>
poller/internal/config/config.go,
poller/internal/bus/publisher.go,
poller/internal/store/devices.go,
poller/internal/observability/metrics.go,
backend/alembic/versions/028_device_ssh_host_key.py
</files>
<action>
**1. Config env vars** (`config.go`):
Add three fields to the Config struct and load them in Load():
- `ConfigBackupIntervalSeconds int`: `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
- `ConfigBackupMaxConcurrent int`: `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
- `ConfigBackupCommandTimeoutSeconds int`: `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`
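A sketch of how the three values might load. The real `getEnvInt` already exists in `config.go` and its exact behavior is not shown in this plan, so the version here (fall back to the default on a missing or unparsable value) is an assumption, and `parseIntOr`, `backupSettings`, and `loadBackupSettings` are hypothetical names.

```go
package main

import (
	"os"
	"strconv"
)

// parseIntOr returns fallback when raw is empty or not a valid integer.
func parseIntOr(raw string, fallback int) int {
	n, err := strconv.Atoi(raw)
	if err != nil {
		return fallback
	}
	return n
}

// getEnvInt is an assumed stand-in for the existing helper in config.go.
func getEnvInt(key string, fallback int) int {
	return parseIntOr(os.Getenv(key), fallback)
}

// backupSettings mirrors the three new Config fields.
type backupSettings struct {
	ConfigBackupIntervalSeconds       int
	ConfigBackupMaxConcurrent         int
	ConfigBackupCommandTimeoutSeconds int
}

// loadBackupSettings reads the env vars with the defaults from the plan.
func loadBackupSettings() backupSettings {
	return backupSettings{
		ConfigBackupIntervalSeconds:       getEnvInt("CONFIG_BACKUP_INTERVAL", 21600),
		ConfigBackupMaxConcurrent:         getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10),
		ConfigBackupCommandTimeoutSeconds: getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60),
	}
}
```

One thing worth considering in Load(): `time.NewTicker` panics on a non-positive duration, so an interval of 0 or below is safer rejected at startup than passed through to the scheduler.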
**2. NATS event type and publisher** (`publisher.go`):
- Add `ConfigSnapshotEvent` struct:
```go
type ConfigSnapshotEvent struct {
DeviceID string `json:"device_id"`
TenantID string `json:"tenant_id"`
RouterOSVersion string `json:"routeros_version,omitempty"`
CollectedAt string `json:"collected_at"` // RFC3339
SHA256Hash string `json:"sha256_hash"`
ConfigText string `json:"config_text"`
NormalizationVersion int `json:"normalization_version"`
}
```
- Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
- Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
- Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`
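To illustrate the payload and subject, here is a sketch that encodes an event with the standard library only. `snapshotSubject` and `encodeSnapshot` are hypothetical helpers; the actual publish call should copy whatever PublishStatus does with the JetStream client, which is not shown in this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ConfigSnapshotEvent repeats the struct from the plan so the sketch compiles alone.
type ConfigSnapshotEvent struct {
	DeviceID             string `json:"device_id"`
	TenantID             string `json:"tenant_id"`
	RouterOSVersion      string `json:"routeros_version,omitempty"`
	CollectedAt          string `json:"collected_at"` // RFC3339
	SHA256Hash           string `json:"sha256_hash"`
	ConfigText           string `json:"config_text"`
	NormalizationVersion int    `json:"normalization_version"`
}

// snapshotSubject builds the per-device subject covered by "config.snapshot.>".
func snapshotSubject(deviceID string) string {
	return fmt.Sprintf("config.snapshot.create.%s", deviceID)
}

// encodeSnapshot returns the subject and JSON payload for one event.
func encodeSnapshot(event ConfigSnapshotEvent) (string, []byte, error) {
	payload, err := json.Marshal(event)
	if err != nil {
		return "", nil, fmt.Errorf("marshal config snapshot: %w", err)
	}
	return snapshotSubject(event.DeviceID), payload, nil
}
```

Because of `omitempty`, `routeros_version` disappears from the payload when the device has no known version, so consumers should treat the field as optional.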
**3. Device model extensions** (`devices.go`):
- Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
- Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
- Update GetDevice query similarly
- Update both Scan calls to include the new fields
- Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:
```go
const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
```
(This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint) — handled in migration)
**4. Alembic migration** (`028_device_ssh_host_key.py`):
Follow the raw SQL pattern from migration 027. Create migration that:
- `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
- `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
- Downgrade, step 1: `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user` (must run before the columns are dropped)
- Downgrade, step 2: `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`
**5. Prometheus metrics** (`metrics.go`):
Add config backup specific metrics:
- `ConfigBackupTotal` CounterVec with labels ["status"] — status: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `ConfigBackupDuration` Histogram: buckets in seconds [1, 5, 10, 30, 60, 120, 300]
- `ConfigBackupActive` Gauge — number of concurrent backup jobs running
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1</automated>
</verify>
<done>
- Config struct has 3 new backup config fields loading from env vars with correct defaults
- ConfigSnapshotEvent type exists with all required JSON fields
- PublishConfigSnapshot method exists following existing publisher pattern
- config.snapshot.> added to DEVICE_EVENTS stream subjects
- Device struct has SSHPort and SSHHostKeyFingerprint fields
- FetchDevices and GetDevice queries select and scan the new columns
- UpdateSSHHostKey method exists for TOFU fingerprint storage
- Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
- Three new Prometheus metrics registered for config backup observability
- All existing tests still pass, project compiles clean
</done>
</task>
</tasks>
<verification>
1. `cd poller && go build ./...` — entire project compiles
2. `cd poller && go vet ./...` — no static analysis issues
3. `cd poller && go test ./internal/device/ -v -count=1` — SSH executor and normalizer tests pass
4. `cd poller && go test ./internal/config/ -v -count=1` — config tests pass
5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`
</verification>
<success_criteria>
- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`
</output>

@@ -0,0 +1,394 @@
---
phase: 02-poller-config-collection
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
- poller/internal/poller/backup_scheduler.go
- poller/internal/poller/backup_scheduler_test.go
- poller/internal/poller/interfaces.go
- poller/cmd/poller/main.go
autonomous: true
requirements: [COLL-01, COLL-03, COLL-05, COLL-06]
must_haves:
truths:
- "Poller runs /export show-sensitive via SSH on each online RouterOS device at a configurable interval (default 6h)"
- "Poller publishes normalized config snapshot to NATS config.snapshot.create with device_id, tenant_id, sha256_hash, config_text"
- "Unreachable devices log a warning and are retried on the next interval without blocking other devices"
- "Backup interval is configurable via CONFIG_BACKUP_INTERVAL environment variable"
- "First backup runs with randomized jitter (30-300s) after device discovery"
- "Global concurrency is limited via CONFIG_BACKUP_MAX_CONCURRENT semaphore"
- "Auth failures and host key mismatches block retries until resolved"
artifacts:
- path: "poller/internal/poller/backup_scheduler.go"
provides: "BackupScheduler managing per-device backup goroutines with concurrency, retry, and NATS publishing"
exports: ["BackupScheduler", "NewBackupScheduler"]
min_lines: 200
- path: "poller/internal/poller/backup_scheduler_test.go"
provides: "Unit tests for backup scheduling, jitter, concurrency, error handling"
- path: "poller/internal/poller/interfaces.go"
provides: "SSHHostKeyUpdater interface for device store dependency"
- path: "poller/cmd/poller/main.go"
provides: "BackupScheduler initialization and lifecycle wiring"
key_links:
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/device/ssh_executor.go"
via: "Calls device.RunCommand to execute /export show-sensitive"
pattern: "device\\.RunCommand"
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/device/normalize.go"
via: "Calls device.NormalizeConfig and device.HashConfig on SSH output"
pattern: "device\\.NormalizeConfig|device\\.HashConfig"
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/bus/publisher.go"
via: "Calls publisher.PublishConfigSnapshot with ConfigSnapshotEvent"
pattern: "publisher\\.PublishConfigSnapshot|bus\\.ConfigSnapshotEvent"
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/store/devices.go"
via: "Calls store.UpdateSSHHostKey for TOFU fingerprint storage"
pattern: "UpdateSSHHostKey"
- from: "poller/cmd/poller/main.go"
to: "poller/internal/poller/backup_scheduler.go"
via: "Creates and starts BackupScheduler in main goroutine lifecycle"
pattern: "NewBackupScheduler|backupScheduler\\.Run"
---
<objective>
Build the backup scheduler that orchestrates periodic SSH config collection from RouterOS devices, normalizes output, and publishes to NATS. Wire it into the poller's main lifecycle.
Purpose: This is the core orchestration that ties together the SSH executor, normalizer, and NATS publisher from Plan 01 into a running backup collection system with proper scheduling, concurrency control, error handling, and retry logic.
Output: BackupScheduler module fully integrated into the poller's main.go lifecycle.
</objective>
<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
@poller/internal/poller/scheduler.go
@poller/internal/poller/worker.go
@poller/internal/poller/interfaces.go
@poller/cmd/poller/main.go
@poller/internal/device/ssh_executor.go
@poller/internal/device/normalize.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
<interfaces>
<!-- From Plan 01 outputs (executor and normalizer) -->
From poller/internal/device/ssh_executor.go (created in Plan 01):
```go
type SSHErrorKind string
const (
ErrAuthFailed SSHErrorKind = "auth_failed"
ErrHostKeyMismatch SSHErrorKind = "host_key_mismatch"
ErrTimeout SSHErrorKind = "timeout"
ErrTruncatedOutput SSHErrorKind = "truncated_output"
ErrConnectionRefused SSHErrorKind = "connection_refused"
ErrUnknown SSHErrorKind = "unknown"
)
type SSHError struct { Kind SSHErrorKind; Err error; Message string }
type CommandResult struct { Stdout string; Stderr string; ExitCode int; Duration time.Duration }
func RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)
```
From poller/internal/device/normalize.go (created in Plan 01):
```go
func NormalizeConfig(raw string) string
func HashConfig(normalized string) string
const NormalizationVersion = 1
```
From poller/internal/bus/publisher.go (modified in Plan 01):
```go
type ConfigSnapshotEvent struct {
DeviceID string `json:"device_id"`
TenantID string `json:"tenant_id"`
RouterOSVersion string `json:"routeros_version,omitempty"`
CollectedAt string `json:"collected_at"`
SHA256Hash string `json:"sha256_hash"`
ConfigText string `json:"config_text"`
NormalizationVersion int `json:"normalization_version"`
}
func (p *Publisher) PublishConfigSnapshot(ctx context.Context, event ConfigSnapshotEvent) error
```
From poller/internal/store/devices.go (modified in Plan 01):
```go
type Device struct {
// ... existing fields ...
SSHPort int
SSHHostKeyFingerprint *string
}
func (s *DeviceStore) UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
```
From poller/internal/config/config.go (modified in Plan 01):
```go
type Config struct {
// ... existing fields ...
ConfigBackupIntervalSeconds int
ConfigBackupMaxConcurrent int
ConfigBackupCommandTimeoutSeconds int
}
```
From poller/internal/observability/metrics.go (modified in Plan 01):
```go
var ConfigBackupTotal *prometheus.CounterVec // labels: ["status"]
var ConfigBackupDuration prometheus.Histogram
var ConfigBackupActive prometheus.Gauge
```
<!-- Existing patterns to follow -->
From poller/internal/poller/scheduler.go:
```go
type Scheduler struct { ... }
func NewScheduler(...) *Scheduler
func (s *Scheduler) Run(ctx context.Context) error
func (s *Scheduler) reconcileDevices(ctx context.Context, wg *sync.WaitGroup) error
func (s *Scheduler) runDeviceLoop(ctx context.Context, dev store.Device, ds *deviceState) // per-device goroutine with ticker
```
From poller/internal/poller/interfaces.go:
```go
type DeviceFetcher interface {
FetchDevices(ctx context.Context) ([]store.Device, error)
}
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: BackupScheduler with per-device goroutines, concurrency control, and retry logic</name>
<files>
poller/internal/poller/backup_scheduler.go,
poller/internal/poller/backup_scheduler_test.go,
poller/internal/poller/interfaces.go
</files>
<behavior>
- Test jitter generation: randomJitter(30, 300) returns value in [30s, 300s] range
- Test backoff sequence: given consecutive failures, backoff returns 5m, 15m, 1h, then caps at 1h
- Test auth failure blocking: when last error is ErrAuthFailed, shouldRetry returns false
- Test host key mismatch blocking: when last error is ErrHostKeyMismatch, shouldRetry returns false
- Test online-only gating: backup is skipped for devices not currently marked online
- Test concurrency semaphore: when semaphore is full, backup waits (does not drop)
</behavior>
<action>
**1. Update interfaces.go:**
Add `SSHHostKeyUpdater` interface (consumer-side, Go best practice):
```go
type SSHHostKeyUpdater interface {
UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
}
```
**2. Create backup_scheduler.go:**
Define `backupDeviceState` struct tracking per-device backup state:
- `cancel context.CancelFunc`
- `lastAttemptAt time.Time`
- `lastSuccessAt time.Time`
- `lastStatus string`: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked" (same values as the ConfigBackupTotal status label)
- `lastError string`
- `consecutiveFailures int`
- `backoffUntil time.Time`
- `lastErrorKind device.SSHErrorKind` — tracks whether error is auth/hostkey (blocks retry)
Define `BackupScheduler` struct:
- `store DeviceFetcher` — reuse existing interface for FetchDevices
- `hostKeyStore SSHHostKeyUpdater` — for UpdateSSHHostKey
- `locker *redislock.Client` — per-device distributed lock
- `publisher *bus.Publisher` — for NATS publishing
- `credentialCache *vault.CredentialCache` — for decrypting device SSH creds
- `redisClient *redis.Client` — for tracking device online status
- `backupInterval time.Duration`
- `commandTimeout time.Duration`
- `refreshPeriod time.Duration` — how often to reconcile devices (reuse from existing scheduler, e.g., 60s)
- `semaphore chan struct{}` — buffered channel of size maxConcurrent
- `mu sync.Mutex`
- `activeDevices map[string]*backupDeviceState`
`NewBackupScheduler(...)` constructor — accept all dependencies, create semaphore as `make(chan struct{}, maxConcurrent)`.
`Run(ctx context.Context) error` — mirrors existing Scheduler.Run pattern:
- defer shutdown: cancel all device goroutines, wait for WaitGroup
- Loop: reconcileBackupDevices(ctx, &wg), then select on ctx.Done or time.After(refreshPeriod)
`reconcileBackupDevices(ctx, wg)` — mirrors reconcileDevices:
- FetchDevices from store
- Start backup goroutines for new devices
- Stop goroutines for removed devices
`runBackupLoop(ctx, dev, state)` — per-device backup goroutine:
- On first run: wait for `randomJitter(30, 300)` (a 30-300s duration), returning early if ctx is cancelled, then do the initial backup
- After initial: ticker at backupInterval
- On each tick:
a. Check if device is online via Redis key `device:{id}:status` (set by status poll). If not online, log debug "skipped_offline", update state, increment ConfigBackupTotal("skipped_offline"), continue
b. Check if lastErrorKind is ErrAuthFailed — skip with "skipped_auth_blocked", log warning with guidance to update credentials
c. Check if lastErrorKind is ErrHostKeyMismatch — skip with "skipped_hostkey_blocked", log warning with guidance to reset host key
d. Check backoff: if time.Now().Before(state.backoffUntil), skip
e. Acquire semaphore (blocks if at max concurrency, does not drop; stop waiting if ctx is cancelled)
f. Acquire Redis lock `backup:device:{id}` with TTL = commandTimeout + 30s
g. Call `collectAndPublish(ctx, dev, state)`
h. Release the Redis lock, then release the semaphore (on every path, including errors)
i. Update state based on result
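Checks a through d can be collapsed into one pure function, which keeps them easy to unit test. The sketch below is one way to do that; `backupGate` and the "proceed" and "skipped_backoff" outcomes are hypothetical, while the other outcome strings are the metric labels from Plan 01. A missing Redis status key is treated as "possibly online", following the online-check note in this plan.

```go
package main

import "time"

type SSHErrorKind string

const (
	ErrAuthFailed      SSHErrorKind = "auth_failed"
	ErrHostKeyMismatch SSHErrorKind = "host_key_mismatch"
)

// backupGate decides whether a tick should attempt a backup.
//   status, statusFound: value of device:{id}:status and whether the key exists
//   lastKind:            error kind of the previous attempt, "" if none
//   backoffRemaining:    time.Until(state.backoffUntil)
func backupGate(status string, statusFound bool, lastKind SSHErrorKind, backoffRemaining time.Duration) string {
	// (a) Skip only when the key exists and says the device is not online.
	if statusFound && status != "online" {
		return "skipped_offline"
	}
	// (b) and (c) Blocking errors need operator action before any retry.
	switch lastKind {
	case ErrAuthFailed:
		return "skipped_auth_blocked"
	case ErrHostKeyMismatch:
		return "skipped_hostkey_blocked"
	}
	// (d) Transient failures wait out their backoff window.
	if backoffRemaining > 0 {
		return "skipped_backoff"
	}
	return "proceed"
}
```

With this split, runBackupLoop only does I/O (Redis read, semaphore, lock) and the decision logic stays testable without Redis.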
`collectAndPublish(ctx, dev, state) error`:
- Increment ConfigBackupActive gauge
- Defer decrement ConfigBackupActive gauge
- Start timer for ConfigBackupDuration
- Decrypt credentials via credentialCache.GetCredentials
- Call `device.RunCommand(ctx, dev.IPAddress, dev.SSHPort, username, password, commandTimeout, knownFingerprint, "/export show-sensitive")`
- On error: classify error kind, update state, apply backoff (transient: stepped 5m/15m/1h, capped at 1h; auth/hostkey: block), return
- If new fingerprint returned (TOFU first connect): call hostKeyStore.UpdateSSHHostKey
- Validate output is non-empty and looks like RouterOS config (basic sanity: contains "/")
- Call `device.NormalizeConfig(result.Stdout)`
- Call `device.HashConfig(normalized)`
- Build `bus.ConfigSnapshotEvent` with device_id, tenant_id, routeros_version (from device or Redis), collected_at (RFC3339 now), sha256_hash, config_text, normalization_version
- Call `publisher.PublishConfigSnapshot(ctx, event)`
- On success: reset consecutiveFailures, update lastSuccessAt, increment ConfigBackupTotal("success")
- Record ConfigBackupDuration
`randomJitter(minSeconds, maxSeconds int) time.Duration` — uses math/rand for uniform distribution
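A minimal sketch of the jitter helper, assuming an inclusive range and a guard for a degenerate range (both are choices made here, not stated in the plan):

```go
package main

import (
	"math/rand"
	"time"
)

// randomJitter returns a uniformly distributed duration between minSeconds
// and maxSeconds, both inclusive. If the range is empty it returns minSeconds.
func randomJitter(minSeconds, maxSeconds int) time.Duration {
	if maxSeconds <= minSeconds {
		return time.Duration(minSeconds) * time.Second
	}
	// rand.Intn(n) yields [0, n), so +1 makes maxSeconds reachable.
	n := minSeconds + rand.Intn(maxSeconds-minSeconds+1)
	return time.Duration(n) * time.Second
}
```

The guard matters because `rand.Intn` panics when its argument is not positive.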
Backoff for transient errors: `calculateBackupBackoff(failures int) time.Duration`:
- 1 failure: 5 min
- 2 failures: 15 min
- 3+ failures: 1 hour (cap)
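The schedule above maps directly onto a switch. In this sketch the zero-failure case returning 0 is an assumption, added so callers can use the function unconditionally:

```go
package main

import "time"

// calculateBackupBackoff maps consecutive transient failures to a wait time:
// 5 minutes, then 15 minutes, then 1 hour for every failure after that.
func calculateBackupBackoff(failures int) time.Duration {
	switch {
	case failures <= 0:
		return 0 // no failures, no backoff (assumed behavior)
	case failures == 1:
		return 5 * time.Minute
	case failures == 2:
		return 15 * time.Minute
	default:
		return time.Hour
	}
}
```

After a failed attempt the scheduler would set `state.backoffUntil = time.Now().Add(calculateBackupBackoff(state.consecutiveFailures))`.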
Device online check via Redis: check if key `device:{id}:status` equals "online". This key is set by the existing status poll publisher flow. If key doesn't exist, assume device might be online (first poll hasn't happened yet) — allow backup attempt.
RouterOS version: read from the Device struct's RouterOSVersion field (populated by store query). If nil, use empty string in the event.
**Important implementation notes:**
- Use `log/slog` for all logging (structured JSON, matching existing pattern)
- Use existing `redislock` pattern from worker.go for per-device locking
- Semaphore pattern: `s.semaphore <- struct{}{}` to acquire, `<-s.semaphore` to release
- Do NOT share circuit breaker state with the status poll scheduler — these are independent
- Partial/truncated output (SSHError with Kind ErrTruncatedOutput) is treated as transient error — never publish, apply backoff
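The semaphore note can be made cancellation-aware with a `select`, so a goroutine waiting for a slot does not hang during shutdown. The sketch below is an assumption about how that might look; `semaphore`, `acquire`, `release`, and `peakConcurrency` are hypothetical names, and `peakConcurrency` exists only to demonstrate that the limit holds.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// semaphore is a buffered channel; its capacity is the concurrency limit.
type semaphore chan struct{}

// acquire blocks until a slot is free or ctx is cancelled.
func (s semaphore) acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release frees one slot; call it exactly once per successful acquire.
func (s semaphore) release() { <-s }

// peakConcurrency runs jobs goroutines through the semaphore and reports
// the highest number that held a slot at the same time.
func peakConcurrency(maxConcurrent, jobs int) int {
	sem := make(semaphore, maxConcurrent)
	ctx := context.Background()
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		active int
		peak   int
	)
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := sem.acquire(ctx); err != nil {
				return
			}
			defer sem.release()
			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()
			time.Sleep(time.Millisecond) // stand-in for the SSH session
			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}
```

Pairing `acquire` with `defer release()` covers every return path of collectAndPublish, including errors.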
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/poller/ -run "TestBackup|TestJitter|TestBackoff|TestShouldRetry" -v -count=1</automated>
</verify>
<done>
- BackupScheduler manages per-device backup goroutines independently from status poll scheduler
- First backup uses 30-300s random jitter delay
- Concurrency limited by buffered channel semaphore (default 10)
- Per-device Redis lock prevents duplicate backups across pods
- Auth failures and host key mismatches block retries with clear log messages
- Transient errors use stepped 5m/15m/1h backoff, capped at 1h
- Offline devices are skipped without error
- Successful backups normalize config, compute SHA256, and publish to NATS
- TOFU fingerprint stored on first successful connection
- All unit tests pass
</done>
</task>
<task type="auto">
<name>Task 2: Wire BackupScheduler into main.go lifecycle</name>
<files>poller/cmd/poller/main.go</files>
<action>
Add BackupScheduler initialization and startup to main.go, following the existing pattern of scheduler initialization (lines 250-278).
After the existing scheduler creation (around line 270), add a new section:
```
// -----------------------------------------------------------------------
// Start the config backup scheduler
// -----------------------------------------------------------------------
```
1. Convert config values to durations:
```go
backupInterval := time.Duration(cfg.ConfigBackupIntervalSeconds) * time.Second
backupCmdTimeout := time.Duration(cfg.ConfigBackupCommandTimeoutSeconds) * time.Second
```
2. Create BackupScheduler:
```go
backupScheduler := poller.NewBackupScheduler(
deviceStore,
deviceStore, // SSHHostKeyUpdater (DeviceStore satisfies this interface)
locker,
publisher,
credentialCache,
redisClient,
backupInterval,
backupCmdTimeout,
refreshPeriod, // reuse existing device refresh period
cfg.ConfigBackupMaxConcurrent,
)
```
3. Start in a goroutine (runs in parallel with the main status poll scheduler):
```go
go func() {
slog.Info("starting config backup scheduler",
"interval", backupInterval,
"max_concurrent", cfg.ConfigBackupMaxConcurrent,
"command_timeout", backupCmdTimeout,
)
if err := backupScheduler.Run(ctx); err != nil {
slog.Error("backup scheduler exited with error", "error", err)
}
}()
```
The BackupScheduler shares the same ctx as everything else, so SIGINT/SIGTERM will trigger its shutdown via context cancellation. No additional shutdown logic needed — Run() returns when ctx is cancelled.
Log the startup with the same pattern as the existing scheduler startup log (line 273-276).
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./cmd/poller/ && echo "build successful"</automated>
</verify>
<done>
- BackupScheduler created in main.go with all dependencies injected
- Runs as a goroutine parallel to the status poll scheduler
- Shares the same context for graceful shutdown
- Startup logged with interval, max_concurrent, and command_timeout
- Poller binary compiles successfully with the new scheduler wired in
</done>
</task>
</tasks>
<verification>
1. `cd poller && go build ./cmd/poller/` — binary compiles with backup scheduler wired in
2. `cd poller && go vet ./...` — no static analysis issues
3. `cd poller && go test ./internal/poller/ -v -count=1` — all poller tests pass (existing + new backup tests)
4. `cd poller && go test ./... -count=1` — full test suite passes
</verification>
<success_criteria>
- BackupScheduler runs independently from status poll scheduler with its own per-device goroutines
- Devices get their first backup 30-300s after discovery, then every CONFIG_BACKUP_INTERVAL
- SSH command execution uses TOFU host key verification and stores fingerprints on first connect
- Config output is normalized, hashed, and published to NATS config.snapshot.create
- Concurrency limited to CONFIG_BACKUP_MAX_CONCURRENT parallel SSH sessions
- Auth/hostkey errors block retries; transient errors use stepped backoff (5m/15m/1h, capped at 1h)
- Offline devices are skipped gracefully
- BackupScheduler is wired into main.go and starts/stops with the poller lifecycle
- All tests pass, project compiles clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-poller-config-collection/02-02-SUMMARY.md`
</output>