the-other-dude/.planning/phases/02-poller-config-collection/02-01-PLAN.md
Jason Staack 33f888a6e2 docs(02): create phase plan for poller config collection
Two plans covering SSH executor, config normalization, NATS publishing,
backup scheduler, and main.go wiring for periodic RouterOS config backup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:39:47 -05:00

---
phase: 02-poller-config-collection
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - poller/internal/device/ssh_executor.go
  - poller/internal/device/ssh_executor_test.go
  - poller/internal/device/normalize.go
  - poller/internal/device/normalize_test.go
  - poller/internal/config/config.go
  - poller/internal/bus/publisher.go
  - poller/internal/observability/metrics.go
  - poller/internal/store/devices.go
  - backend/alembic/versions/028_device_ssh_host_key.py
autonomous: true
requirements:
  - COLL-01
  - COLL-02
  - COLL-06
must_haves:
  truths:
    - "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
    - "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
    - "SHA256 hash is computed on normalized output"
    - "Config backup interval and concurrency are configurable via environment variables"
    - "Host key fingerprint is stored on device record for TOFU verification"
  artifacts:
    - path: "poller/internal/device/ssh_executor.go"
      provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
      exports:
        - RunCommand
        - CommandResult
        - SSHError
        - SSHErrorKind
    - path: "poller/internal/device/normalize.go"
      provides: "NormalizeConfig function and SHA256 hashing"
      exports:
        - NormalizeConfig
        - HashConfig
    - path: "poller/internal/device/ssh_executor_test.go"
      provides: "Unit tests for SSH executor error classification"
    - path: "poller/internal/device/normalize_test.go"
      provides: "Unit tests for config normalization with edge cases"
    - path: "poller/internal/config/config.go"
      provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
    - path: "poller/internal/bus/publisher.go"
      provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
    - path: "poller/internal/store/devices.go"
      provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
    - path: "backend/alembic/versions/028_device_ssh_host_key.py"
      provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
  key_links:
    - from: "poller/internal/device/ssh_executor.go"
      to: "poller/internal/store/devices.go"
      via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
      pattern: "dev.SSHPort|dev.SSHHostKeyFingerprint"
    - from: "poller/internal/device/normalize.go"
      to: "poller/internal/bus/publisher.go"
      via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
      pattern: "NormalizeConfig|HashConfig"
---

Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.

Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.

Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.

<execution_context> @/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md @/Users/jasonstaack/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/02-poller-config-collection/02-CONTEXT.md @.planning/phases/01-database-schema/01-01-SUMMARY.md

@poller/internal/device/sftp.go @poller/internal/bus/publisher.go @poller/internal/config/config.go @poller/internal/store/devices.go @poller/internal/observability/metrics.go @poller/internal/poller/scheduler.go @poller/go.mod

From poller/internal/device/sftp.go:

func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
// Uses ssh.InsecureIgnoreHostKey() — executor replaces this with TOFU callback

From poller/internal/store/devices.go:

type Device struct {
    ID                          string
    TenantID                    string
    IPAddress                   string
    APIPort                     int
    APISSLPort                  int
    EncryptedCredentials        []byte
    EncryptedCredentialsTransit *string
    RouterOSVersion             *string
    MajorVersion                *int
    TLSMode                     string
    CACertPEM                   *string
}
// SSHPort and SSHHostKeyFingerprint need to be added

From poller/internal/bus/publisher.go:

type Publisher struct { nc *nats.Conn; js jetstream.JetStream }
func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
// Follow this pattern for PublishConfigSnapshot
// Stream subjects list needs "config.snapshot.>" added

From poller/internal/config/config.go:

func Load() (*Config, error)
// Uses getEnv(key, default) and getEnvInt(key, default) helpers
Task 1: SSH executor, normalizer, and their tests

Files: poller/internal/device/ssh_executor.go, poller/internal/device/ssh_executor_test.go, poller/internal/device/normalize.go, poller/internal/device/normalize_test.go

SSH Executor (ssh_executor_test.go):
- Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
- Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
- Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration, Error)

Normalizer (normalize_test.go):
- Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line and following blank line, preserves software id comment
- Test line ending normalization: "\r\n" becomes "\n"
- Test trailing whitespace trimming: "/ip address  \n" becomes "/ip address\n" (leading indentation is left untouched)
- Test blank line collapsing: three consecutive blank lines become one
- Test trailing newline: output always ends with exactly one "\n"
- Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
- Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
- Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
- Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)

Create `poller/internal/device/ssh_executor.go`:
1. Define types:
   - `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
   - `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
   - `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`

2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
   - Returns (result, observedFingerprint, error)
   - Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
     - If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
     - If knownFingerprint matches: accept
     - If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
   - Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` (same as ssh-keygen)
   - Dial with context-aware timeout
   - Create session, run command via session.Run()
   - Capture stdout/stderr via session.StdoutPipe/StderrPipe or CombinedOutput pattern
   - Classify errors using `classifySSHError(err)` helper that inspects error strings and types
   - Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}

3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns
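
The TOFU decision and the error classification described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the planned implementation: `checkTOFU` and `fingerprintSHA256` are hypothetical helper names, the string values of the `SSHErrorKind` constants are invented for illustration, and the key bytes stand in for what `ssh.PublicKey.Marshal()` would return inside a real `HostKeyCallback`.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"net"
	"strings"
)

// SSHErrorKind labels a failure class. The constant names follow the plan;
// the string values are assumptions made for this sketch.
type SSHErrorKind string

const (
	ErrAuthFailed        SSHErrorKind = "auth_failed"
	ErrHostKeyMismatch   SSHErrorKind = "host_key_mismatch"
	ErrTimeout           SSHErrorKind = "timeout"
	ErrTruncatedOutput   SSHErrorKind = "truncated_output"
	ErrConnectionRefused SSHErrorKind = "connection_refused"
	ErrUnknown           SSHErrorKind = "unknown"
)

// SSHError wraps an underlying error together with its classification.
type SSHError struct {
	Kind    SSHErrorKind
	Err     error
	Message string
}

func (e *SSHError) Error() string { return fmt.Sprintf("%s: %s", e.Kind, e.Message) }
func (e *SSHError) Unwrap() error { return e.Err }

// fingerprintSHA256 (hypothetical helper) formats a key the way ssh-keygen
// does: "SHA256:" followed by the unpadded base64 of the SHA256 digest.
func fingerprintSHA256(keyBytes []byte) string {
	sum := sha256.Sum256(keyBytes)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// checkTOFU (hypothetical helper) applies the trust-on-first-use rule:
// accept when nothing is stored yet or the stored fingerprint matches,
// reject with ErrHostKeyMismatch otherwise. It always returns the
// observed fingerprint so the caller can persist it.
func checkTOFU(known string, keyBytes []byte) (string, error) {
	observed := fingerprintSHA256(keyBytes)
	if known == "" || known == observed {
		return observed, nil
	}
	return observed, &SSHError{Kind: ErrHostKeyMismatch, Message: "host key fingerprint changed"}
}

// classifySSHError checks error types first and falls back to matching
// the message patterns listed in the plan.
func classifySSHError(err error) SSHErrorKind {
	if err == nil {
		return ErrUnknown
	}
	var sshErr *SSHError
	if errors.As(err, &sshErr) {
		return sshErr.Kind
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return ErrTimeout
	}
	msg := err.Error()
	switch {
	case strings.Contains(msg, "unable to authenticate"):
		return ErrAuthFailed
	case strings.Contains(msg, "host key"):
		return ErrHostKeyMismatch
	case strings.Contains(msg, "i/o timeout"):
		return ErrTimeout
	case strings.Contains(msg, "connection refused"):
		return ErrConnectionRefused
	}
	return ErrUnknown
}
```

Keeping the TOFU rule in a plain function like this would let the unit tests cover first-connect, match, and mismatch without opening a real SSH connection; the actual callback would only need to marshal the key and delegate.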

Create `poller/internal/device/normalize.go`:

1. `NormalizeConfig(raw string) string`:
   - Replace \r\n with \n first, so every later step only has to match \n
   - Use regexp to strip the timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it, if there is one
   - Split into lines, trim trailing whitespace from each line
   - Collapse consecutive blank lines (2+ empty lines become 1)
   - Ensure single trailing newline
   - Return normalized string

2. `HashConfig(normalized string) string`:
   - Compute SHA256 of the normalized string bytes
   - Return lowercase hex string (64 chars)

3. `const NormalizationVersion = 1` — for future tracking in NATS payload
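
The normalization steps above could look roughly like the following. It is a sketch rather than the final module: it assumes the timestamp header is the first line of the export (the pattern is anchored at the start of the text, as in the plan), and the optional blank-line group appended to the regexp is one possible reading of "the blank line immediately following it".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
	"strings"
)

// NormalizationVersion is bumped whenever the rules below change.
const NormalizationVersion = 1

// timestampHeader matches the export header at the start of the text plus
// one optional blank line right after it.
var timestampHeader = regexp.MustCompile(
	`^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n(?:[ \t]*\n)?`)

// NormalizeConfig makes a RouterOS export stable across runs so that equal
// configurations produce equal hashes.
func NormalizeConfig(raw string) string {
	// Unify line endings before any pattern matching.
	s := strings.ReplaceAll(raw, "\r\n", "\n")
	// Drop the volatile timestamp header.
	s = timestampHeader.ReplaceAllString(s, "")

	var out []string
	prevBlank := false
	for _, line := range strings.Split(s, "\n") {
		// Trim trailing whitespace only; leading indentation is kept.
		line = strings.TrimRight(line, " \t\r")
		blank := line == ""
		if blank && prevBlank {
			continue // collapse runs of blank lines into one
		}
		out = append(out, line)
		prevBlank = blank
	}
	// Exactly one trailing newline.
	return strings.TrimRight(strings.Join(out, "\n"), "\n") + "\n"
}

// HashConfig returns the lowercase hex SHA256 of the normalized text.
func HashConfig(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}
```

With this sketch, an export that begins with the timestamp line followed by `# software id = XXXX` keeps the software id comment, since the optional group only consumes a line that is actually blank.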

Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).

Verify: `cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1`

Done:
- RunCommand function compiles with correct signature returning (CommandResult, fingerprint, error)
- SSHError type with Kind field covers all 6 error classifications
- TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
- NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
- HashConfig returns 64-char lowercase hex SHA256
- All unit tests pass

Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics

Files: poller/internal/config/config.go, poller/internal/bus/publisher.go, poller/internal/store/devices.go, poller/internal/observability/metrics.go, backend/alembic/versions/028_device_ssh_host_key.py

**1. Config env vars** (`config.go`):
Add three fields to the Config struct and load them in Load():
- `ConfigBackupIntervalSeconds int`: `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
- `ConfigBackupMaxConcurrent int`: `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
- `ConfigBackupCommandTimeoutSeconds int`: `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`
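
To illustrate how the three defaults would be resolved, here is a small sketch. The body of `getEnvInt` is an assumption (the real helper in `config.go` is not shown here and may, for example, report malformed values instead of falling back), and `backupConfig` / `loadBackupConfig` are hypothetical stand-ins for the relevant slice of `Config` and `Load()`.

```go
package main

import (
	"os"
	"strconv"
)

// backupConfig is a hypothetical stand-in for the three new Config fields.
type backupConfig struct {
	ConfigBackupIntervalSeconds       int
	ConfigBackupMaxConcurrent         int
	ConfigBackupCommandTimeoutSeconds int
}

// getEnvInt is an assumed implementation: it returns def when the variable
// is unset or does not parse as an integer.
func getEnvInt(key string, def int) int {
	v, ok := os.LookupEnv(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return def
	}
	return n
}

// loadBackupConfig reads the three variables with the defaults from the plan.
func loadBackupConfig() backupConfig {
	return backupConfig{
		ConfigBackupIntervalSeconds:       getEnvInt("CONFIG_BACKUP_INTERVAL", 21600), // 6h
		ConfigBackupMaxConcurrent:         getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10),
		ConfigBackupCommandTimeoutSeconds: getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60),
	}
}
```

With none of the variables set, `loadBackupConfig()` yields 21600, 10, and 60, which is what the config tests would be expected to assert.
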
**2. NATS event type and publisher** (`publisher.go`):
- Add `ConfigSnapshotEvent` struct:
  ```go
  type ConfigSnapshotEvent struct {
      DeviceID             string `json:"device_id"`
      TenantID             string `json:"tenant_id"`
      RouterOSVersion      string `json:"routeros_version,omitempty"`
      CollectedAt          string `json:"collected_at"`          // RFC3339
      SHA256Hash           string `json:"sha256_hash"`
      ConfigText           string `json:"config_text"`
      NormalizationVersion int    `json:"normalization_version"`
  }
  ```
- Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
- Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
- Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`
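
The subject and payload that `PublishConfigSnapshot` would hand to JetStream can be sketched without the NATS client. `configSnapshotSubject` and `encodeConfigSnapshot` are hypothetical helper names used only for illustration; the real method is expected to follow the existing `PublishStatus` pattern and publish the bytes itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ConfigSnapshotEvent mirrors the struct defined in the plan.
type ConfigSnapshotEvent struct {
	DeviceID             string `json:"device_id"`
	TenantID             string `json:"tenant_id"`
	RouterOSVersion      string `json:"routeros_version,omitempty"`
	CollectedAt          string `json:"collected_at"` // RFC3339
	SHA256Hash           string `json:"sha256_hash"`
	ConfigText           string `json:"config_text"`
	NormalizationVersion int    `json:"normalization_version"`
}

// configSnapshotSubject (hypothetical helper) builds the per-device subject,
// which the "config.snapshot.>" wildcard on the stream would capture.
func configSnapshotSubject(deviceID string) string {
	return fmt.Sprintf("config.snapshot.create.%s", deviceID)
}

// encodeConfigSnapshot (hypothetical helper) returns the subject and JSON
// payload that would be passed to the JetStream publish call.
func encodeConfigSnapshot(event ConfigSnapshotEvent) (string, []byte, error) {
	payload, err := json.Marshal(event)
	if err != nil {
		return "", nil, fmt.Errorf("marshal config snapshot event: %w", err)
	}
	return configSnapshotSubject(event.DeviceID), payload, nil
}
```

Because `routeros_version` carries `omitempty`, a device whose version is still unknown produces a payload without that key, so consumers should treat it as optional.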

**3. Device model extensions** (`devices.go`):
- Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
- Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
- Update GetDevice query similarly
- Update both Scan calls to include the new fields
- Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:
  ```go
  const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
  ```
  (This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint) — handled in migration)

**4. Alembic migration** (`028_device_ssh_host_key.py`):
Follow the raw SQL pattern from migration 027. Create migration that:
- `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
- `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
- Downgrade, in this order:
  - `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user`
  - `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`

**5. Prometheus metrics** (`metrics.go`):
Add config backup specific metrics:
- `ConfigBackupTotal` CounterVec with labels ["status"] — status: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `ConfigBackupDuration` Histogram — buckets: [1, 5, 10, 30, 60, 120, 300]
- `ConfigBackupActive` Gauge — number of concurrent backup jobs running

Verify: `cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1`

Done:
- Config struct has 3 new backup config fields loading from env vars with correct defaults
- ConfigSnapshotEvent type exists with all required JSON fields
- PublishConfigSnapshot method exists following existing publisher pattern
- config.snapshot.> added to DEVICE_EVENTS stream subjects
- Device struct has SSHPort and SSHHostKeyFingerprint fields
- FetchDevices and GetDevice queries select and scan the new columns
- UpdateSSHHostKey method exists for TOFU fingerprint storage
- Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
- Three new Prometheus metrics registered for config backup observability
- All existing tests still pass, project compiles clean

Verification:
1. `cd poller && go build ./...`: entire project compiles
2. `cd poller && go vet ./...`: no static analysis issues
3. `cd poller && go test ./internal/device/ -v -count=1`: SSH executor and normalizer tests pass
4. `cd poller && go test ./internal/config/ -v -count=1`: config tests pass
5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`

<success_criteria>

- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean

</success_criteria>
After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`