docs(02): create phase plan for poller config collection
Two plans covering SSH executor, config normalization, NATS publishing, backup scheduler, and main.go wiring for periodic RouterOS config backup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
308
.planning/phases/02-poller-config-collection/02-01-PLAN.md
Normal file
@@ -0,0 +1,308 @@
---
phase: 02-poller-config-collection
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - poller/internal/device/ssh_executor.go
  - poller/internal/device/ssh_executor_test.go
  - poller/internal/device/normalize.go
  - poller/internal/device/normalize_test.go
  - poller/internal/config/config.go
  - poller/internal/bus/publisher.go
  - poller/internal/observability/metrics.go
  - poller/internal/store/devices.go
  - backend/alembic/versions/028_device_ssh_host_key.py
autonomous: true
requirements: [COLL-01, COLL-02, COLL-06]

must_haves:
  truths:
    - "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
    - "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
    - "SHA256 hash is computed on normalized output"
    - "Config backup interval and concurrency are configurable via environment variables"
    - "Host key fingerprint is stored on device record for TOFU verification"
  artifacts:
    - path: "poller/internal/device/ssh_executor.go"
      provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
      exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"]
    - path: "poller/internal/device/normalize.go"
      provides: "NormalizeConfig function and SHA256 hashing"
      exports: ["NormalizeConfig", "HashConfig"]
    - path: "poller/internal/device/ssh_executor_test.go"
      provides: "Unit tests for SSH executor error classification"
    - path: "poller/internal/device/normalize_test.go"
      provides: "Unit tests for config normalization with edge cases"
    - path: "poller/internal/config/config.go"
      provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
    - path: "poller/internal/bus/publisher.go"
      provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
    - path: "poller/internal/store/devices.go"
      provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
    - path: "backend/alembic/versions/028_device_ssh_host_key.py"
      provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
  key_links:
    - from: "poller/internal/device/ssh_executor.go"
      to: "poller/internal/store/devices.go"
      via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
      pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint"
    - from: "poller/internal/device/normalize.go"
      to: "poller/internal/bus/publisher.go"
      via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
      pattern: "NormalizeConfig|HashConfig"
---

<objective>
Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.

Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.
Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.
</objective>

<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/01-database-schema/01-01-SUMMARY.md

@poller/internal/device/sftp.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
@poller/internal/poller/scheduler.go
@poller/go.mod

<interfaces>
<!-- Existing patterns the executor must follow -->

From poller/internal/device/sftp.go:
```go
func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
// Uses ssh.InsecureIgnoreHostKey(); the executor replaces this with a TOFU callback
```

From poller/internal/store/devices.go:
```go
type Device struct {
    ID                          string
    TenantID                    string
    IPAddress                   string
    APIPort                     int
    APISSLPort                  int
    EncryptedCredentials        []byte
    EncryptedCredentialsTransit *string
    RouterOSVersion             *string
    MajorVersion                *int
    TLSMode                     string
    CACertPEM                   *string
}
// SSHPort and SSHHostKeyFingerprint need to be added
```

From poller/internal/bus/publisher.go:
```go
type Publisher struct { nc *nats.Conn; js jetstream.JetStream }
func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
// Follow this pattern for PublishConfigSnapshot
// Stream subjects list needs "config.snapshot.>" added
```

From poller/internal/config/config.go:
```go
func Load() (*Config, error)
// Uses getEnv(key, default) and getEnvInt(key, default) helpers
```
</interfaces>
</context>

<tasks>

<task type="auto" tdd="true">
<name>Task 1: SSH executor, normalizer, and their tests</name>
<files>
poller/internal/device/ssh_executor.go,
poller/internal/device/ssh_executor_test.go,
poller/internal/device/normalize.go,
poller/internal/device/normalize_test.go
</files>
<behavior>
SSH Executor (ssh_executor_test.go):
- Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
- Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
- Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration); failures are reported through the separate error return value (an SSHError), not through a field on the struct
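
The TOFU decision described above can be exercised without an SSH connection. The following is a minimal sketch using only the standard library; `verifyTOFU`, `fingerprintSHA256`, and `errHostKeyMismatch` are hypothetical names, and the sentinel error stands in for `SSHError{Kind: ErrHostKeyMismatch}`. In the real executor the key bytes would presumably come from the `ssh.PublicKey` passed to the host key callback, and `ssh.FingerprintSHA256` from `golang.org/x/crypto/ssh` produces the same string format.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

// errHostKeyMismatch is a hypothetical sentinel standing in for
// SSHError{Kind: ErrHostKeyMismatch} from the plan.
var errHostKeyMismatch = errors.New("ssh host key mismatch")

// fingerprintSHA256 formats key bytes the way ssh-keygen prints them:
// "SHA256:" followed by the unpadded base64 of the SHA-256 digest.
func fingerprintSHA256(keyBytes []byte) string {
	sum := sha256.Sum256(keyBytes)
	return "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:])
}

// verifyTOFU applies trust-on-first-use: an empty known fingerprint accepts
// any key, a matching fingerprint accepts, and anything else is rejected.
// The observed fingerprint is always returned so the caller can persist it.
func verifyTOFU(known string, keyBytes []byte) (string, error) {
	observed := fingerprintSHA256(keyBytes)
	if known == "" || known == observed {
		return observed, nil
	}
	return observed, fmt.Errorf("%w: expected %s, got %s", errHostKeyMismatch, known, observed)
}

func main() {
	key := []byte("example-public-key-bytes") // placeholder, not a real SSH key blob
	first, _ := verifyTOFU("", key)           // first connect: accepted, fingerprint recorded
	_, err := verifyTOFU(first, []byte("a-different-key"))
	fmt.Println(first, errors.Is(err, errHostKeyMismatch))
}
```

Keeping the decision in a plain function like this lets the three test cases (first connect, match, mismatch) run as ordinary table-driven tests.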

Normalizer (normalize_test.go):
- Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line (plus a blank line immediately following it, if present) and preserves the software id comment
- Test line ending normalization: "\r\n" becomes "\n"
- Test trailing whitespace trimming: "/ip address   \n" becomes "/ip address\n"
- Test blank line collapsing: three consecutive blank lines become one
- Test trailing newline: output always ends with exactly one "\n"
- Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
- Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
- Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
- Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)
</behavior>
<action>
Create `poller/internal/device/ssh_executor.go`:

1. Define types:
   - `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
   - `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
   - `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`

2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
   - Returns (result, observedFingerprint, error)
   - Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
     - If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
     - If knownFingerprint matches: accept
     - If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
   - Fingerprint format: `SHA256:` followed by the unpadded base64 encoding of `sha256(publicKeyBytes)`, where the bytes are the key's SSH wire encoding (same as ssh-keygen)
   - Dial with context-aware timeout
   - Create session, run command via session.Run()
   - Capture stdout and stderr separately, via session.StdoutPipe/StderrPipe or by assigning buffers to session.Stdout/session.Stderr (CombinedOutput merges both streams, so it cannot fill Stdout and Stderr independently)
   - Classify errors using `classifySSHError(err)` helper that inspects error strings and types
   - Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}

3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns
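
One way the classification helper could look is sketched below, under stated assumptions: the string values of the `SSHErrorKind` constants and the `classifyMessage` helper are illustrative and not specified by this plan, and typed checks (`context.DeadlineExceeded`, `net.Error` timeouts, `ECONNREFUSED`) are tried before falling back to the message patterns listed above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"strings"
	"syscall"
)

// SSHErrorKind constant names come from the plan; the string values are assumptions.
type SSHErrorKind string

const (
	ErrAuthFailed        SSHErrorKind = "auth_failed"
	ErrHostKeyMismatch   SSHErrorKind = "host_key_mismatch"
	ErrTimeout           SSHErrorKind = "timeout"
	ErrTruncatedOutput   SSHErrorKind = "truncated_output" // set by RunCommand, not by classification
	ErrConnectionRefused SSHErrorKind = "connection_refused"
	ErrUnknown           SSHErrorKind = "unknown"
)

// classifyMessage is a hypothetical fallback that matches the message
// patterns named in the plan, case-insensitively.
func classifyMessage(msg string) SSHErrorKind {
	msg = strings.ToLower(msg)
	switch {
	case strings.Contains(msg, "unable to authenticate"):
		return ErrAuthFailed
	case strings.Contains(msg, "host key"):
		return ErrHostKeyMismatch
	case strings.Contains(msg, "i/o timeout"):
		return ErrTimeout
	case strings.Contains(msg, "connection refused"):
		return ErrConnectionRefused
	default:
		return ErrUnknown
	}
}

// classifySSHError prefers typed checks and only then inspects the message.
// Callers are expected to check err != nil first; nil maps to ErrUnknown here.
func classifySSHError(err error) SSHErrorKind {
	if err == nil {
		return ErrUnknown
	}
	var netErr net.Error
	if errors.Is(err, context.DeadlineExceeded) || (errors.As(err, &netErr) && netErr.Timeout()) {
		return ErrTimeout
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return ErrConnectionRefused
	}
	return classifyMessage(err.Error())
}

func main() {
	refused := fmt.Errorf("dial tcp 192.0.2.1:22: %w", syscall.ECONNREFUSED)
	fmt.Println(classifySSHError(refused))
	fmt.Println(classifySSHError(context.DeadlineExceeded))
}
```

Typed checks are less brittle than substring matching when a library rewords its messages, which is why the sketch tries them first; the string patterns remain as the fallback the plan asks for.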

Create `poller/internal/device/normalize.go`:

1. `NormalizeConfig(raw string) string`:
   - Replace \r\n with \n first, before any other processing
   - Use regexp to strip the timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it
   - Split into lines, trim trailing whitespace from each line
   - Collapse consecutive blank lines (2+ empty lines become 1)
   - Ensure single trailing newline
   - Return normalized string

2. `HashConfig(normalized string) string`:
   - Compute SHA256 of the normalized string bytes
   - Return lowercase hex string (64 chars)

3. `const NormalizationVersion = 1`, for future tracking in the NATS payload
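
The normalization steps above can be sketched as follows. This is a minimal, line-based illustration rather than the final implementation: it matches the header per line instead of with a single multi-line regexp, and the header pattern is the one given in this plan (other RouterOS releases may print the date differently, in which case the pattern would need widening).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// timestampHeader matches the export header line described in the plan.
var timestampHeader = regexp.MustCompile(`^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS`)

// NormalizeConfig unifies line endings, drops the timestamp header (and one
// blank line directly after it), trims trailing whitespace, collapses runs of
// blank lines, and guarantees exactly one trailing newline.
func NormalizeConfig(raw string) string {
	lines := strings.Split(strings.ReplaceAll(raw, "\r\n", "\n"), "\n")
	out := make([]string, 0, len(lines))
	afterHeader := false
	for _, line := range lines {
		line = strings.TrimRight(line, " \t\r")
		switch {
		case timestampHeader.MatchString(line):
			afterHeader = true
			continue
		case line == "" && afterHeader:
			afterHeader = false // skip the single blank line following the header
			continue
		case line == "" && len(out) > 0 && out[len(out)-1] == "":
			continue // collapse consecutive blank lines
		}
		afterHeader = false
		out = append(out, line)
	}
	return strings.TrimRight(strings.Join(out, "\n"), "\n") + "\n"
}

// HashConfig returns the lowercase hex SHA-256 of the normalized text.
func HashConfig(normalized string) string {
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	raw := "# 2024/01/15 10:30:00 by RouterOS 7.x\r\n# software id = XXXX\r\n\r\n\r\n/ip address   \r\n"
	norm := NormalizeConfig(raw)
	fmt.Printf("%q\n%s\n", norm, HashConfig(norm))
}
```

Because the output contains no header, no trailing whitespace, and no consecutive blank lines, running it through `NormalizeConfig` a second time leaves it unchanged, which is the idempotency property the tests check.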

Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1</automated>
</verify>
<done>
- RunCommand function compiles with correct signature returning (*CommandResult, fingerprint string, error)
- SSHError type with Kind field covers all 6 error classifications
- TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
- NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
- HashConfig returns 64-char lowercase hex SHA256
- All unit tests pass
</done>
</task>

<task type="auto">
<name>Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics</name>
<files>
poller/internal/config/config.go,
poller/internal/bus/publisher.go,
poller/internal/store/devices.go,
poller/internal/observability/metrics.go,
backend/alembic/versions/028_device_ssh_host_key.py
</files>
<action>
**1. Config env vars** (`config.go`):
Add three fields to the Config struct and load them in Load():
- `ConfigBackupIntervalSeconds int`: `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
- `ConfigBackupMaxConcurrent int`: `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
- `ConfigBackupCommandTimeoutSeconds int`: `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`
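
As a rough illustration of how these three settings might load, the sketch below uses a stand-in `getEnvInt`; the real helper already exists in `config.go` and its fallback behavior on malformed values is not shown in this plan, so the version here is an assumption. `backupConfig`, `loadBackupConfig`, and `Interval` are hypothetical names used only for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// getEnvInt is a stand-in for the existing helper in config.go. Falling back
// to the default on a missing or non-numeric value is an assumption.
func getEnvInt(key string, def int) int {
	v, ok := os.LookupEnv(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return def
	}
	return n
}

// backupConfig holds only the three new fields; in the plan they live on Config.
type backupConfig struct {
	ConfigBackupIntervalSeconds       int
	ConfigBackupMaxConcurrent         int
	ConfigBackupCommandTimeoutSeconds int
}

func loadBackupConfig() backupConfig {
	return backupConfig{
		ConfigBackupIntervalSeconds:       getEnvInt("CONFIG_BACKUP_INTERVAL", 21600),
		ConfigBackupMaxConcurrent:         getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10),
		ConfigBackupCommandTimeoutSeconds: getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60),
	}
}

// Interval converts the integer seconds into a time.Duration for schedulers.
func (c backupConfig) Interval() time.Duration {
	return time.Duration(c.ConfigBackupIntervalSeconds) * time.Second
}

func main() {
	cfg := loadBackupConfig()
	fmt.Println(cfg.Interval(), cfg.ConfigBackupMaxConcurrent, cfg.ConfigBackupCommandTimeoutSeconds)
}
```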

**2. NATS event type and publisher** (`publisher.go`):
- Add `ConfigSnapshotEvent` struct:
```go
type ConfigSnapshotEvent struct {
    DeviceID             string `json:"device_id"`
    TenantID             string `json:"tenant_id"`
    RouterOSVersion      string `json:"routeros_version,omitempty"`
    CollectedAt          string `json:"collected_at"` // RFC3339
    SHA256Hash           string `json:"sha256_hash"`
    ConfigText           string `json:"config_text"`
    NormalizationVersion int    `json:"normalization_version"`
}
```
- Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
- Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
- Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`
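
To make the wire format concrete, the sketch below builds the subject and JSON payload for one event using only the standard library. `snapshotSubject` and `encodeSnapshot` are hypothetical helpers; the actual JetStream publish call is omitted because it should follow the existing `PublishStatus` pattern, which is not reproduced in this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ConfigSnapshotEvent mirrors the struct defined in the plan.
type ConfigSnapshotEvent struct {
	DeviceID             string `json:"device_id"`
	TenantID             string `json:"tenant_id"`
	RouterOSVersion      string `json:"routeros_version,omitempty"`
	CollectedAt          string `json:"collected_at"` // RFC3339
	SHA256Hash           string `json:"sha256_hash"`
	ConfigText           string `json:"config_text"`
	NormalizationVersion int    `json:"normalization_version"`
}

// snapshotSubject builds the per-device subject, which falls under the
// "config.snapshot.>" wildcard added to the stream.
func snapshotSubject(deviceID string) string {
	return fmt.Sprintf("config.snapshot.create.%s", deviceID)
}

// encodeSnapshot serializes the event as the message payload.
func encodeSnapshot(ev ConfigSnapshotEvent) ([]byte, error) {
	return json.Marshal(ev)
}

func main() {
	ev := ConfigSnapshotEvent{
		DeviceID:             "device-1", // placeholder IDs for illustration
		TenantID:             "tenant-1",
		CollectedAt:          time.Now().UTC().Format(time.RFC3339),
		SHA256Hash:           "placeholder-hash",
		ConfigText:           "/ip address\n",
		NormalizationVersion: 1,
	}
	payload, err := encodeSnapshot(ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(snapshotSubject(ev.DeviceID))
	fmt.Println(string(payload))
}
```

Note that `omitempty` drops `routeros_version` from the payload when the version is unknown, so consumers should treat that key as optional.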

**3. Device model extensions** (`devices.go`):
- Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
- Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
- Update GetDevice query similarly
- Update both Scan calls to include the new fields
- Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:
```go
const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
```
(This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint); the grant is handled in the migration)

**4. Alembic migration** (`028_device_ssh_host_key.py`):
Follow the raw SQL pattern from migration 027. Create migration that:
- `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
- `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
- Downgrade, in this order (the revoke must run while the columns still exist):
  - `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user`
  - `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`

**5. Prometheus metrics** (`metrics.go`):
Add config backup specific metrics:
- `ConfigBackupTotal` CounterVec with labels ["status"]; status values: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `ConfigBackupDuration` Histogram with buckets (in seconds): [1, 5, 10, 30, 60, 120, 300]
- `ConfigBackupActive` Gauge: number of concurrent backup jobs running
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1</automated>
</verify>
<done>
- Config struct has 3 new backup config fields loading from env vars with correct defaults
- ConfigSnapshotEvent type exists with all required JSON fields
- PublishConfigSnapshot method exists following existing publisher pattern
- config.snapshot.> added to DEVICE_EVENTS stream subjects
- Device struct has SSHPort and SSHHostKeyFingerprint fields
- FetchDevices and GetDevice queries select and scan the new columns
- UpdateSSHHostKey method exists for TOFU fingerprint storage
- Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
- Three new Prometheus metrics registered for config backup observability
- All existing tests still pass, project compiles clean
</done>
</task>

</tasks>

<verification>
1. `cd poller && go build ./...`: entire project compiles
2. `cd poller && go vet ./...`: no static analysis issues
3. `cd poller && go test ./internal/device/ -v -count=1`: SSH executor and normalizer tests pass
4. `cd poller && go test ./internal/config/ -v -count=1`: config tests pass
5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`
</verification>

<success_criteria>
- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean
</success_criteria>

<output>
After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`
</output>