---
phase: 02-poller-config-collection
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - poller/internal/device/ssh_executor.go
  - poller/internal/device/ssh_executor_test.go
  - poller/internal/device/normalize.go
  - poller/internal/device/normalize_test.go
  - poller/internal/config/config.go
  - poller/internal/bus/publisher.go
  - poller/internal/observability/metrics.go
  - poller/internal/store/devices.go
  - backend/alembic/versions/028_device_ssh_host_key.py
autonomous: true
requirements: [COLL-01, COLL-02, COLL-06]
must_haves:
  truths:
    - "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
    - "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
    - "SHA256 hash is computed on normalized output"
    - "Config backup interval and concurrency are configurable via environment variables"
    - "Host key fingerprint is stored on device record for TOFU verification"
  artifacts:
    - path: "poller/internal/device/ssh_executor.go"
      provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
      exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"]
    - path: "poller/internal/device/normalize.go"
      provides: "NormalizeConfig function and SHA256 hashing"
      exports: ["NormalizeConfig", "HashConfig"]
    - path: "poller/internal/device/ssh_executor_test.go"
      provides: "Unit tests for SSH executor error classification"
    - path: "poller/internal/device/normalize_test.go"
      provides: "Unit tests for config normalization with edge cases"
    - path: "poller/internal/config/config.go"
      provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
    - path: "poller/internal/bus/publisher.go"
      provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
    - path: "poller/internal/store/devices.go"
      provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
    - path: "backend/alembic/versions/028_device_ssh_host_key.py"
      provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
  key_links:
    - from: "poller/internal/device/ssh_executor.go"
      to: "poller/internal/store/devices.go"
      via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
      pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint"
    - from: "poller/internal/device/normalize.go"
      to: "poller/internal/bus/publisher.go"
      via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
      pattern: "NormalizeConfig|HashConfig"
---

Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.

Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.

Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.

@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/01-database-schema/01-01-SUMMARY.md
@poller/internal/device/sftp.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
@poller/internal/poller/scheduler.go
@poller/go.mod

From poller/internal/device/sftp.go:

```go
func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
// Uses ssh.InsecureIgnoreHostKey(); the executor replaces this with a TOFU callback
```

From poller/internal/store/devices.go:

```go
type Device struct {
	ID                          string
	TenantID                    string
	IPAddress                   string
	APIPort                     int
	APISSLPort                  int
	EncryptedCredentials        []byte
	EncryptedCredentialsTransit *string
	RouterOSVersion             *string
	MajorVersion                *int
	TLSMode                     string
	CACertPEM                   *string
}
// SSHPort and SSHHostKeyFingerprint need to be added
```

From poller/internal/bus/publisher.go:

```go
type Publisher struct {
	nc *nats.Conn
	js jetstream.JetStream
}

func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
// Follow this pattern for PublishConfigSnapshot
// Stream subjects list needs "config.snapshot.>" added
```

From poller/internal/config/config.go:

```go
func Load() (*Config, error)
// Uses getEnv(key, default) and getEnvInt(key, default) helpers
```

Task 1: SSH executor, normalizer, and their tests

poller/internal/device/ssh_executor.go, poller/internal/device/ssh_executor_test.go, poller/internal/device/normalize.go, poller/internal/device/normalize_test.go

SSH Executor (ssh_executor_test.go):
- Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
- Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
- Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration); errors are returned separately as the third return value of RunCommand

Normalizer (normalize_test.go):
- Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line (and the blank line following it, if present) and preserves the software id comment
- Test line ending normalization: "\r\n" becomes "\n"
- Test trailing whitespace trimming: "/ip address  \n" becomes "/ip address\n"
- Test blank line collapsing: three consecutive blank lines become one
- Test trailing newline: output always ends with exactly one "\n"
- Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
- Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
- Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
- Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)

Create `poller/internal/device/ssh_executor.go`:

1. Define types:
   - `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
   - `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
   - `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`
2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
   - Returns (result, observedFingerprint, error)
   - Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
     - If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
     - If knownFingerprint matches: accept
     - If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
   - Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` (same as ssh-keygen)
   - Dial with context-aware timeout
   - Create session, run command via session.Run()
   - Capture stdout and stderr separately, via session.StdoutPipe/StderrPipe or buffers assigned to session.Stdout/session.Stderr (CombinedOutput would merge the two streams and cannot populate both fields)
   - Classify errors using `classifySSHError(err)` helper that inspects error strings and types
   - Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}
3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns

Create `poller/internal/device/normalize.go`:

1. `NormalizeConfig(raw string) string`:
   - Replace \r\n with \n first, before any other processing, so the header pattern only has to handle \n
   - Use regexp to strip timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it
   - Split into lines, trim trailing whitespace from each line
   - Collapse consecutive blank lines (2+ empty lines become 1)
   - Ensure single trailing newline
   - Return normalized string
2. `HashConfig(normalized string) string`:
   - Compute SHA256 of the normalized string bytes
   - Return lowercase hex string (64 chars)
3. `const NormalizationVersion = 1`, for future tracking in NATS payload

Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).

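The normalization and hashing steps above can be sketched with the standard library alone. This is a minimal illustration, not the final `normalize.go`: it applies the header check per line rather than as one multi-line regexp (an assumption that keeps the function idempotent even when the header is the last line without a newline), and the `main` function exists only for demonstration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// timestampHeader matches the RouterOS export header line, for example
// "# 2024/01/15 10:30:00 by RouterOS 7.x". It is applied to single lines.
var timestampHeader = regexp.MustCompile(`^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS`)

// NormalizeConfig unifies line endings, drops the timestamp header and the
// blank line directly after it, trims trailing whitespace, collapses runs of
// blank lines, and guarantees exactly one trailing newline.
func NormalizeConfig(raw string) string {
	lines := strings.Split(strings.ReplaceAll(raw, "\r\n", "\n"), "\n")
	out := make([]string, 0, len(lines))
	prevBlank, afterHeader := false, false
	for _, line := range lines {
		line = strings.TrimRight(line, " \t\r")
		if timestampHeader.MatchString(line) {
			afterHeader = true
			continue
		}
		blank := line == ""
		skip := blank && (prevBlank || afterHeader)
		afterHeader = false
		if skip {
			continue
		}
		out = append(out, line)
		prevBlank = blank
	}
	return strings.TrimRight(strings.Join(out, "\n"), "\n") + "\n"
}

// HashConfig returns the lowercase hex SHA256 of the normalized text.
func HashConfig(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	raw := "# 2024/01/15 10:30:00 by RouterOS 7.x\r\n# software id = XXXX\r\n\r\n\r\n\r\n/ip address   \r\nadd address=10.0.0.1/24\r\n\r\n"
	normalized := NormalizeConfig(raw)
	fmt.Printf("%q\n", normalized)
	fmt.Println(HashConfig(normalized))
}
```

Because the hash is taken over the normalized text, two exports that differ only in timestamp, line endings, or trailing whitespace produce the same digest, which is what makes change detection by hash comparison workable.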
cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1

- RunCommand function compiles with correct signature returning (*CommandResult, fingerprint string, error)
- SSHError type with Kind field covers all 6 error classifications
- TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
- NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
- HashConfig returns 64-char lowercase hex SHA256
- All unit tests pass

Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics

poller/internal/config/config.go, poller/internal/bus/publisher.go, poller/internal/store/devices.go, poller/internal/observability/metrics.go, backend/alembic/versions/028_device_ssh_host_key.py

**1. Config env vars** (`config.go`):

Add three fields to the Config struct and load them in Load():
- `ConfigBackupIntervalSeconds int`: `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
- `ConfigBackupMaxConcurrent int`: `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
- `ConfigBackupCommandTimeoutSeconds int`: `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`

**2. NATS event type and publisher** (`publisher.go`):

- Add `ConfigSnapshotEvent` struct:

```go
type ConfigSnapshotEvent struct {
	DeviceID             string `json:"device_id"`
	TenantID             string `json:"tenant_id"`
	RouterOSVersion      string `json:"routeros_version,omitempty"`
	CollectedAt          string `json:"collected_at"` // RFC3339
	SHA256Hash           string `json:"sha256_hash"`
	ConfigText           string `json:"config_text"`
	NormalizationVersion int    `json:"normalization_version"`
}
```

- Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
- Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
- Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`

**3. Device model extensions** (`devices.go`):

- Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
- Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
- Update GetDevice query similarly
- Update both Scan calls to include the new fields
- Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:

```go
const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
```

(This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint), which the migration grants.)

**4. Alembic migration** (`028_device_ssh_host_key.py`):

Follow the raw SQL pattern from migration 027.
Create migration that:

- Upgrade:
  - `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
  - `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
  - `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
  - `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
  - `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
- Downgrade (revoke first, because the column-level REVOKE fails once the columns it names have been dropped):
  - `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user`
  - `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`

**5. Prometheus metrics** (`metrics.go`):

Add config-backup-specific metrics:
- `ConfigBackupTotal` CounterVec with labels ["status"], where status is one of "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `ConfigBackupDuration` Histogram with buckets [1, 5, 10, 30, 60, 120, 300]
- `ConfigBackupActive` Gauge: number of concurrent backup jobs running

cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1

- Config struct has 3 new backup config fields loading from env vars with correct defaults
- ConfigSnapshotEvent type exists with all required JSON fields
- PublishConfigSnapshot method exists following existing publisher pattern
- config.snapshot.> added to DEVICE_EVENTS stream subjects
- Device struct has SSHPort and SSHHostKeyFingerprint fields
- FetchDevices and GetDevice queries select and scan the new columns
- UpdateSSHHostKey method exists for TOFU fingerprint storage
- Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
- Three new Prometheus metrics registered for config backup observability
- All existing tests still pass, project compiles clean

1. `cd poller && go build ./...`: entire project compiles
2. `cd poller && go vet ./...`: no static analysis issues
3. `cd poller && go test ./internal/device/ -v -count=1`: SSH executor and normalizer tests pass
4. `cd poller && go test ./internal/config/ -v -count=1`: config tests pass
5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`

- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean

After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`
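
For reference, the TOFU decision specified for the SSH executor in Task 1 can be sketched without any SSH dependency. This is an illustration under assumptions: `verifyTOFU`, `fingerprintSHA256`, and `errHostKeyMismatch` are hypothetical names, and the byte slice stands in for the wire-format public key that a real `HostKeyCallback` would obtain from `key.Marshal()`. In the actual executor, `ssh.FingerprintSHA256` from golang.org/x/crypto/ssh produces the same `SHA256:` format.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

// errHostKeyMismatch is a hypothetical sentinel standing in for
// SSHError{Kind: ErrHostKeyMismatch} from the plan.
var errHostKeyMismatch = errors.New("host key mismatch")

// fingerprintSHA256 formats a key like ssh-keygen does: "SHA256:" followed by
// the unpadded base64 encoding of the SHA-256 digest of the key bytes.
func fingerprintSHA256(keyBytes []byte) string {
	sum := sha256.Sum256(keyBytes)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// verifyTOFU applies trust-on-first-use: an empty known fingerprint accepts
// any key, a matching fingerprint accepts, and anything else is rejected.
// The observed fingerprint is always returned so the caller can store it.
func verifyTOFU(known string, keyBytes []byte) (string, error) {
	observed := fingerprintSHA256(keyBytes)
	if known == "" || known == observed {
		return observed, nil
	}
	return observed, fmt.Errorf("%w: expected %s, got %s", errHostKeyMismatch, known, observed)
}

func main() {
	key := []byte("example-public-key-bytes") // placeholder, not a real key blob

	first, _ := verifyTOFU("", key)
	fmt.Println("stored on first connect:", first)

	_, err := verifyTOFU(first, key)
	fmt.Println("same key accepted:", err == nil)

	_, err = verifyTOFU(first, []byte("different-key"))
	fmt.Println("mismatch detected:", errors.Is(err, errHostKeyMismatch))
}
```

Returning the observed fingerprint even on first connect lets the caller persist it through `UpdateSSHHostKey` without the verification code touching the database.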