chore: remove .planning from tracking (already in .gitignore)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Jason Staack
2026-03-13 06:55:28 -05:00
parent ed3ad8eb17
commit 7af08276ea
25 changed files with 0 additions and 4680 deletions

View File

@@ -1,308 +0,0 @@
---
phase: 02-poller-config-collection
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- poller/internal/device/ssh_executor.go
- poller/internal/device/ssh_executor_test.go
- poller/internal/device/normalize.go
- poller/internal/device/normalize_test.go
- poller/internal/config/config.go
- poller/internal/bus/publisher.go
- poller/internal/observability/metrics.go
- poller/internal/store/devices.go
- backend/alembic/versions/028_device_ssh_host_key.py
autonomous: true
requirements: [COLL-01, COLL-02, COLL-06]
must_haves:
truths:
- "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
- "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
- "SHA256 hash is computed on normalized output"
- "Config backup interval and concurrency are configurable via environment variables"
- "Host key fingerprint is stored on device record for TOFU verification"
artifacts:
- path: "poller/internal/device/ssh_executor.go"
provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"]
- path: "poller/internal/device/normalize.go"
provides: "NormalizeConfig function and SHA256 hashing"
exports: ["NormalizeConfig", "HashConfig"]
- path: "poller/internal/device/ssh_executor_test.go"
provides: "Unit tests for SSH executor error classification"
- path: "poller/internal/device/normalize_test.go"
provides: "Unit tests for config normalization with edge cases"
- path: "poller/internal/config/config.go"
provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
- path: "poller/internal/bus/publisher.go"
provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
- path: "poller/internal/store/devices.go"
provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
- path: "backend/alembic/versions/028_device_ssh_host_key.py"
provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
key_links:
- from: "poller/internal/device/ssh_executor.go"
to: "poller/internal/store/devices.go"
via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint"
- from: "poller/internal/device/normalize.go"
to: "poller/internal/bus/publisher.go"
via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
pattern: "NormalizeConfig|HashConfig"
---
<objective>
Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.
Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.
Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.
</objective>
<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/01-database-schema/01-01-SUMMARY.md
@poller/internal/device/sftp.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
@poller/internal/poller/scheduler.go
@poller/go.mod
<interfaces>
<!-- Existing patterns the executor must follow -->
From poller/internal/device/sftp.go:
```go
func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
// Uses ssh.InsecureIgnoreHostKey() — executor replaces this with TOFU callback
```
From poller/internal/store/devices.go:
```go
type Device struct {
ID string
TenantID string
IPAddress string
APIPort int
APISSLPort int
EncryptedCredentials []byte
EncryptedCredentialsTransit *string
RouterOSVersion *string
MajorVersion *int
TLSMode string
CACertPEM *string
}
// SSHPort and SSHHostKeyFingerprint need to be added
```
From poller/internal/bus/publisher.go:
```go
type Publisher struct { nc *nats.Conn; js jetstream.JetStream }
func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
// Follow this pattern for PublishConfigSnapshot
// Stream subjects list needs "config.snapshot.>" added
```
From poller/internal/config/config.go:
```go
func Load() (*Config, error)
// Uses getEnv(key, default) and getEnvInt(key, default) helpers
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: SSH executor, normalizer, and their tests</name>
<files>
poller/internal/device/ssh_executor.go,
poller/internal/device/ssh_executor_test.go,
poller/internal/device/normalize.go,
poller/internal/device/normalize_test.go
</files>
<behavior>
SSH Executor (ssh_executor_test.go):
- Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
- Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
- Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration, Error)
Normalizer (normalize_test.go):
- Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line and following blank line, preserves software id comment
- Test line ending normalization: "\r\n" becomes "\n"
- Test trailing whitespace trimming: " /ip address \n" becomes "/ip address\n"
- Test blank line collapsing: three consecutive blank lines become one
- Test trailing newline: output always ends with exactly one "\n"
- Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
- Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
- Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
- Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)
</behavior>
<action>
Create `poller/internal/device/ssh_executor.go`:
1. Define types:
- `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
- `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
- `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`
2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
- Returns (result, observedFingerprint, error)
- Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
- If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
- If knownFingerprint matches: accept
- If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
- Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` (same as ssh-keygen)
- Dial with context-aware timeout
- Create session, run command via session.Run()
- Capture stdout/stderr via session.StdoutPipe/StderrPipe or CombinedOutput pattern
- Classify errors using `classifySSHError(err)` helper that inspects error strings and types
- Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}
3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns
Create `poller/internal/device/normalize.go`:
1. `NormalizeConfig(raw string) string`:
- Use regexp to strip timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it
- Replace \r\n with \n (before other processing)
- Split into lines, trim trailing whitespace from each line
- Collapse consecutive blank lines (2+ empty lines become 1)
- Ensure single trailing newline
- Return normalized string
2. `HashConfig(normalized string) string`:
- Compute SHA256 of the normalized string bytes
- Return lowercase hex string (64 chars)
3. `const NormalizationVersion = 1` — for future tracking in NATS payload
Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1</automated>
</verify>
<done>
- RunCommand function compiles with correct signature returning (CommandResult, fingerprint, error)
- SSHError type with Kind field covers all 6 error classifications
- TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
- NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
- HashConfig returns 64-char lowercase hex SHA256
- All unit tests pass
</done>
</task>
<task type="auto">
<name>Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics</name>
<files>
poller/internal/config/config.go,
poller/internal/bus/publisher.go,
poller/internal/store/devices.go,
poller/internal/observability/metrics.go,
backend/alembic/versions/028_device_ssh_host_key.py
</files>
<action>
**1. Config env vars** (`config.go`):
Add three fields to the Config struct and load them in Load():
- `ConfigBackupIntervalSeconds int``getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
- `ConfigBackupMaxConcurrent int``getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
- `ConfigBackupCommandTimeoutSeconds int``getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`
**2. NATS event type and publisher** (`publisher.go`):
- Add `ConfigSnapshotEvent` struct:
```go
type ConfigSnapshotEvent struct {
DeviceID string `json:"device_id"`
TenantID string `json:"tenant_id"`
RouterOSVersion string `json:"routeros_version,omitempty"`
CollectedAt string `json:"collected_at"` // RFC3339
SHA256Hash string `json:"sha256_hash"`
ConfigText string `json:"config_text"`
NormalizationVersion int `json:"normalization_version"`
}
```
- Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
- Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
- Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`
**3. Device model extensions** (`devices.go`):
- Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
- Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
- Update GetDevice query similarly
- Update both Scan calls to include the new fields
- Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:
```go
const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
```
(This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint) — handled in migration)
**4. Alembic migration** (`028_device_ssh_host_key.py`):
Follow the raw SQL pattern from migration 027. Create migration that:
- `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
- `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
- `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
- Downgrade: `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`
- `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user`
**5. Prometheus metrics** (`metrics.go`):
Add config backup specific metrics:
- `ConfigBackupTotal` CounterVec with labels ["status"] — status: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
- `ConfigBackupDuration` Histogram — buckets: [1, 5, 10, 30, 60, 120, 300]
- `ConfigBackupActive` Gauge — number of concurrent backup jobs running
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1</automated>
</verify>
<done>
- Config struct has 3 new backup config fields loading from env vars with correct defaults
- ConfigSnapshotEvent type exists with all required JSON fields
- PublishConfigSnapshot method exists following existing publisher pattern
- config.snapshot.> added to DEVICE_EVENTS stream subjects
- Device struct has SSHPort and SSHHostKeyFingerprint fields
- FetchDevices and GetDevice queries select and scan the new columns
- UpdateSSHHostKey method exists for TOFU fingerprint storage
- Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
- Three new Prometheus metrics registered for config backup observability
- All existing tests still pass, project compiles clean
</done>
</task>
</tasks>
<verification>
1. `cd poller && go build ./...` — entire project compiles
2. `cd poller && go vet ./...` — no static analysis issues
3. `cd poller && go test ./internal/device/ -v -count=1` — SSH executor and normalizer tests pass
4. `cd poller && go test ./internal/config/ -v -count=1` — config tests pass
5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`
</verification>
<success_criteria>
- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`
</output>

View File

@@ -1,128 +0,0 @@
---
phase: 02-poller-config-collection
plan: 01
subsystem: poller
tags: [ssh, tofu, routeros, config-normalization, sha256, nats, prometheus, alembic]
requires:
- phase: 01-database-schema
provides: router_config_snapshots table for storing backup data
provides:
- SSH command executor with TOFU host key verification and typed error classification
- Config normalizer with deterministic SHA256 hashing
- ConfigSnapshotEvent NATS event type and PublishConfigSnapshot method
- Config backup environment variables (interval, concurrency, timeout)
- Device model SSH fields (port, host key fingerprint) with UpdateSSHHostKey method
- Alembic migration 028 for devices table SSH columns
- Prometheus metrics for config backup observability
affects: [02-02-backup-scheduler, 03-backend-subscriber]
tech-stack:
added: []
patterns:
- "TOFU host key verification via SHA256 fingerprint comparison"
- "Config normalization pipeline: line endings, timestamp strip, whitespace trim, blank collapse"
- "SSH error classification into typed SSHErrorKind enum"
key-files:
created:
- poller/internal/device/ssh_executor.go
- poller/internal/device/ssh_executor_test.go
- poller/internal/device/normalize.go
- poller/internal/device/normalize_test.go
- backend/alembic/versions/028_device_ssh_host_key.py
modified:
- poller/internal/config/config.go
- poller/internal/bus/publisher.go
- poller/internal/store/devices.go
- poller/internal/observability/metrics.go
key-decisions:
- "TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))"
- "NormalizationVersion=1 constant included in NATS payloads for future re-processing"
- "UpdateSSHHostKey sets first_seen via COALESCE to preserve original observation time"
patterns-established:
- "SSH error classification: classifySSHError inspects error strings for auth/hostkey/timeout/refused patterns"
- "Config normalization: version-tracked deterministic pipeline for RouterOS export output"
requirements-completed: [COLL-01, COLL-02, COLL-06]
duration: 5min
completed: 2026-03-13
---
# Phase 02 Plan 01: Config Backup Primitives Summary
**SSH executor with TOFU host key verification, RouterOS config normalizer with SHA256 hashing, NATS snapshot event, and Alembic migration for device SSH columns**
## Performance
- **Duration:** 5 min
- **Started:** 2026-03-13T01:43:33Z
- **Completed:** 2026-03-13T01:48:38Z
- **Tasks:** 2
- **Files modified:** 9
## Accomplishments
- SSH RunCommand executor with context-aware dialing, TOFU host key callback, and 6-kind typed error classification
- Deterministic config normalizer: strips RouterOS timestamps, normalizes line endings, trims whitespace, collapses blanks, computes SHA256 hash
- 22 unit tests covering error classification, TOFU flows (first connect/match/mismatch), normalization edge cases, idempotency
- Config backup env vars, NATS ConfigSnapshotEvent, device model SSH extensions, migration 028, Prometheus metrics
## Task Commits
Each task was committed atomically:
1. **Task 1: SSH executor, normalizer, and their tests** - `f1abb75` (feat)
2. **Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics** - `4ae39d2` (feat)
_Note: Task 1 used TDD -- tests written first (RED), implementation second (GREEN)._
## Files Created/Modified
- `poller/internal/device/ssh_executor.go` - RunCommand SSH executor with TOFU host key verification and typed errors
- `poller/internal/device/ssh_executor_test.go` - Unit tests for SSH error classification, TOFU callbacks, CommandResult
- `poller/internal/device/normalize.go` - NormalizeConfig and HashConfig for RouterOS export output
- `poller/internal/device/normalize_test.go` - Table-driven tests for normalization pipeline edge cases
- `poller/internal/config/config.go` - Added ConfigBackupIntervalSeconds, ConfigBackupMaxConcurrent, ConfigBackupCommandTimeoutSeconds
- `poller/internal/bus/publisher.go` - Added ConfigSnapshotEvent type, PublishConfigSnapshot method, config.snapshot.> stream subject
- `poller/internal/store/devices.go` - Added SSHPort/SSHHostKeyFingerprint fields, UpdateSSHHostKey method, updated queries
- `poller/internal/observability/metrics.go` - Added ConfigBackupTotal, ConfigBackupDuration, ConfigBackupActive metrics
- `backend/alembic/versions/028_device_ssh_host_key.py` - Migration adding ssh_port, ssh_host_key_fingerprint, timestamp columns
## Decisions Made
- TOFU fingerprint format uses SHA256:base64(sha256(pubkey)) to match ssh-keygen output format
- NormalizationVersion=1 constant is included in NATS payloads so consumers can detect algorithm changes
- UpdateSSHHostKey uses COALESCE on ssh_host_key_first_seen to preserve original observation timestamp
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 1 - Bug] Fixed test key generation approach**
- **Found during:** Task 1 (GREEN phase)
- **Issue:** Embedded OpenSSH PEM test key had padding errors ("ssh: padding not as expected")
- **Fix:** Switched to programmatic ed25519 key generation via crypto/ed25519.GenerateKey
- **Files modified:** poller/internal/device/ssh_executor_test.go
- **Verification:** All 22 tests pass
- **Committed in:** f1abb75 (Task 1 commit)
---
**Total deviations:** 1 auto-fixed (1 bug)
**Impact on plan:** Minimal -- test infrastructure fix only, no production code change.
## Issues Encountered
None beyond the test key generation fix documented above.
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- All primitives ready for Plan 02 (backup scheduler) to wire together
- SSH executor, normalizer, NATS event, device model, config, and metrics are independently tested and compilable
- Migration 028 ready to apply before deploying the backup scheduler
---
*Phase: 02-poller-config-collection*
*Completed: 2026-03-13*

View File

@@ -1,394 +0,0 @@
---
phase: 02-poller-config-collection
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
- poller/internal/poller/backup_scheduler.go
- poller/internal/poller/backup_scheduler_test.go
- poller/internal/poller/interfaces.go
- poller/cmd/poller/main.go
autonomous: true
requirements: [COLL-01, COLL-03, COLL-05, COLL-06]
must_haves:
truths:
- "Poller runs /export show-sensitive via SSH on each online RouterOS device at a configurable interval (default 6h)"
- "Poller publishes normalized config snapshot to NATS config.snapshot.create with device_id, tenant_id, sha256_hash, config_text"
- "Unreachable devices log a warning and are retried on the next interval without blocking other devices"
- "Backup interval is configurable via CONFIG_BACKUP_INTERVAL environment variable"
- "First backup runs with randomized jitter (30-300s) after device discovery"
- "Global concurrency is limited via CONFIG_BACKUP_MAX_CONCURRENT semaphore"
- "Auth failures and host key mismatches block retries until resolved"
artifacts:
- path: "poller/internal/poller/backup_scheduler.go"
provides: "BackupScheduler managing per-device backup goroutines with concurrency, retry, and NATS publishing"
exports: ["BackupScheduler", "NewBackupScheduler"]
min_lines: 200
- path: "poller/internal/poller/backup_scheduler_test.go"
provides: "Unit tests for backup scheduling, jitter, concurrency, error handling"
- path: "poller/internal/poller/interfaces.go"
provides: "SSHHostKeyUpdater interface for device store dependency"
- path: "poller/cmd/poller/main.go"
provides: "BackupScheduler initialization and lifecycle wiring"
key_links:
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/device/ssh_executor.go"
via: "Calls device.RunCommand to execute /export show-sensitive"
pattern: "device\\.RunCommand"
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/device/normalize.go"
via: "Calls device.NormalizeConfig and device.HashConfig on SSH output"
pattern: "device\\.NormalizeConfig|device\\.HashConfig"
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/bus/publisher.go"
via: "Calls publisher.PublishConfigSnapshot with ConfigSnapshotEvent"
pattern: "publisher\\.PublishConfigSnapshot|bus\\.ConfigSnapshotEvent"
- from: "poller/internal/poller/backup_scheduler.go"
to: "poller/internal/store/devices.go"
via: "Calls store.UpdateSSHHostKey for TOFU fingerprint storage"
pattern: "UpdateSSHHostKey"
- from: "poller/cmd/poller/main.go"
to: "poller/internal/poller/backup_scheduler.go"
via: "Creates and starts BackupScheduler in main goroutine lifecycle"
pattern: "NewBackupScheduler|backupScheduler\\.Run"
---
<objective>
Build the backup scheduler that orchestrates periodic SSH config collection from RouterOS devices, normalizes output, and publishes to NATS. Wire it into the poller's main lifecycle.
Purpose: This is the core orchestration that ties together the SSH executor, normalizer, and NATS publisher from Plan 01 into a running backup collection system with proper scheduling, concurrency control, error handling, and retry logic.
Output: BackupScheduler module fully integrated into the poller's main.go lifecycle.
</objective>
<execution_context>
@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-poller-config-collection/02-CONTEXT.md
@.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
@poller/internal/poller/scheduler.go
@poller/internal/poller/worker.go
@poller/internal/poller/interfaces.go
@poller/cmd/poller/main.go
@poller/internal/device/ssh_executor.go
@poller/internal/device/normalize.go
@poller/internal/bus/publisher.go
@poller/internal/config/config.go
@poller/internal/store/devices.go
@poller/internal/observability/metrics.go
<interfaces>
<!-- From Plan 01 outputs (executor and normalizer) -->
From poller/internal/device/ssh_executor.go (created in Plan 01):
```go
type SSHErrorKind string
const (
ErrAuthFailed SSHErrorKind = "auth_failed"
ErrHostKeyMismatch SSHErrorKind = "host_key_mismatch"
ErrTimeout SSHErrorKind = "timeout"
ErrTruncatedOutput SSHErrorKind = "truncated_output"
ErrConnectionRefused SSHErrorKind = "connection_refused"
ErrUnknown SSHErrorKind = "unknown"
)
type SSHError struct { Kind SSHErrorKind; Err error; Message string }
type CommandResult struct { Stdout string; Stderr string; ExitCode int; Duration time.Duration }
func RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)
```
From poller/internal/device/normalize.go (created in Plan 01):
```go
func NormalizeConfig(raw string) string
func HashConfig(normalized string) string
const NormalizationVersion = 1
```
From poller/internal/bus/publisher.go (modified in Plan 01):
```go
type ConfigSnapshotEvent struct {
DeviceID string `json:"device_id"`
TenantID string `json:"tenant_id"`
RouterOSVersion string `json:"routeros_version,omitempty"`
CollectedAt string `json:"collected_at"`
SHA256Hash string `json:"sha256_hash"`
ConfigText string `json:"config_text"`
NormalizationVersion int `json:"normalization_version"`
}
func (p *Publisher) PublishConfigSnapshot(ctx context.Context, event ConfigSnapshotEvent) error
```
From poller/internal/store/devices.go (modified in Plan 01):
```go
type Device struct {
// ... existing fields ...
SSHPort int
SSHHostKeyFingerprint *string
}
func (s *DeviceStore) UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
```
From poller/internal/config/config.go (modified in Plan 01):
```go
type Config struct {
// ... existing fields ...
ConfigBackupIntervalSeconds int
ConfigBackupMaxConcurrent int
ConfigBackupCommandTimeoutSeconds int
}
```
From poller/internal/observability/metrics.go (modified in Plan 01):
```go
var ConfigBackupTotal *prometheus.CounterVec // labels: ["status"]
var ConfigBackupDuration prometheus.Histogram
var ConfigBackupActive prometheus.Gauge
```
<!-- Existing patterns to follow -->
From poller/internal/poller/scheduler.go:
```go
type Scheduler struct { ... }
func NewScheduler(...) *Scheduler
func (s *Scheduler) Run(ctx context.Context) error
func (s *Scheduler) reconcileDevices(ctx context.Context, wg *sync.WaitGroup) error
func (s *Scheduler) runDeviceLoop(ctx context.Context, dev store.Device, ds *deviceState) // per-device goroutine with ticker
```
From poller/internal/poller/interfaces.go:
```go
type DeviceFetcher interface {
FetchDevices(ctx context.Context) ([]store.Device, error)
}
```
</interfaces>
</context>
<tasks>
<task type="auto" tdd="true">
<name>Task 1: BackupScheduler with per-device goroutines, concurrency control, and retry logic</name>
<files>
poller/internal/poller/backup_scheduler.go,
poller/internal/poller/backup_scheduler_test.go,
poller/internal/poller/interfaces.go
</files>
<behavior>
- Test jitter generation: randomJitter(30, 300) returns value in [30s, 300s] range
- Test backoff sequence: given consecutive failures, backoff returns 5m, 15m, 1h, then caps at 1h
- Test auth failure blocking: when last error is ErrAuthFailed, shouldRetry returns false
- Test host key mismatch blocking: when last error is ErrHostKeyMismatch, shouldRetry returns false
- Test online-only gating: backup is skipped for devices not currently marked online
- Test concurrency semaphore: when semaphore is full, backup waits (does not drop)
</behavior>
<action>
**1. Update interfaces.go:**
Add `SSHHostKeyUpdater` interface (consumer-side, Go best practice):
```go
type SSHHostKeyUpdater interface {
UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
}
```
**2. Create backup_scheduler.go:**
Define `backupDeviceState` struct tracking per-device backup state:
- `cancel context.CancelFunc`
- `lastAttemptAt time.Time`
- `lastSuccessAt time.Time`
- `lastStatus string` — "success", "error", "skipped_offline", "auth_blocked", "hostkey_blocked"
- `lastError string`
- `consecutiveFailures int`
- `backoffUntil time.Time`
- `lastErrorKind device.SSHErrorKind` — tracks whether error is auth/hostkey (blocks retry)
Define `BackupScheduler` struct:
- `store DeviceFetcher` — reuse existing interface for FetchDevices
- `hostKeyStore SSHHostKeyUpdater` — for UpdateSSHHostKey
- `locker *redislock.Client` — per-device distributed lock
- `publisher *bus.Publisher` — for NATS publishing
- `credentialCache *vault.CredentialCache` — for decrypting device SSH creds
- `redisClient *redis.Client` — for tracking device online status
- `backupInterval time.Duration`
- `commandTimeout time.Duration`
- `refreshPeriod time.Duration` — how often to reconcile devices (reuse from existing scheduler, e.g., 60s)
- `semaphore chan struct{}` — buffered channel of size maxConcurrent
- `mu sync.Mutex`
- `activeDevices map[string]*backupDeviceState`
`NewBackupScheduler(...)` constructor — accept all dependencies, create semaphore as `make(chan struct{}, maxConcurrent)`.
`Run(ctx context.Context) error` — mirrors existing Scheduler.Run pattern:
- defer shutdown: cancel all device goroutines, wait for WaitGroup
- Loop: reconcileBackupDevices(ctx, &wg), then select on ctx.Done or time.After(refreshPeriod)
`reconcileBackupDevices(ctx, wg)` — mirrors reconcileDevices:
- FetchDevices from store
- Start backup goroutines for new devices
- Stop goroutines for removed devices
`runBackupLoop(ctx, dev, state)` — per-device backup goroutine:
- On first run: sleep for randomJitter(30, 300) seconds, then do initial backup
- After initial: ticker at backupInterval
- On each tick:
a. Check if device is online via Redis key `device:{id}:status` (set by status poll). If not online, log debug "skipped_offline", update state, increment ConfigBackupTotal("skipped_offline"), continue
b. Check if lastErrorKind is ErrAuthFailed — skip with "skipped_auth_blocked", log warning with guidance to update credentials
c. Check if lastErrorKind is ErrHostKeyMismatch — skip with "skipped_hostkey_blocked", log warning with guidance to reset host key
d. Check backoff: if time.Now().Before(state.backoffUntil), skip
e. Acquire semaphore (blocks if at max concurrency, does not drop)
f. Acquire Redis lock `backup:device:{id}` with TTL = commandTimeout + 30s
g. Call `collectAndPublish(ctx, dev, state)`
h. Release semaphore
i. Update state based on result
`collectAndPublish(ctx, dev, state) error`:
- Increment ConfigBackupActive gauge
- Defer decrement ConfigBackupActive gauge
- Start timer for ConfigBackupDuration
- Decrypt credentials via credentialCache.GetCredentials
- Call `device.RunCommand(ctx, dev.IPAddress, dev.SSHPort, username, password, commandTimeout, knownFingerprint, "/export show-sensitive")`
- On error: classify error kind, update state, apply backoff (transient: 5m/15m/1h exponential; auth/hostkey: block), return
- If new fingerprint returned (TOFU first connect): call hostKeyStore.UpdateSSHHostKey
- Validate output is non-empty and looks like RouterOS config (basic sanity: contains "/")
- Call `device.NormalizeConfig(result.Stdout)`
- Call `device.HashConfig(normalized)`
- Build `bus.ConfigSnapshotEvent` with device_id, tenant_id, routeros_version (from device or Redis), collected_at (RFC3339 now), sha256_hash, config_text, normalization_version
- Call `publisher.PublishConfigSnapshot(ctx, event)`
- On success: reset consecutiveFailures, update lastSuccessAt, increment ConfigBackupTotal("success")
- Record ConfigBackupDuration
`randomJitter(minSeconds, maxSeconds int) time.Duration` — uses math/rand for uniform distribution
Backoff for transient errors: `calculateBackupBackoff(failures int) time.Duration`:
- 1 failure: 5 min
- 2 failures: 15 min
- 3+ failures: 1 hour (cap)
Device online check via Redis: check if key `device:{id}:status` equals "online". This key is set by the existing status poll publisher flow. If key doesn't exist, assume device might be online (first poll hasn't happened yet) — allow backup attempt.
RouterOS version: read from the Device struct's RouterOSVersion field (populated by store query). If nil, use empty string in the event.
**Important implementation notes:**
- Use `log/slog` for all logging (structured JSON, matching existing pattern)
- Use existing `redislock` pattern from worker.go for per-device locking
- Semaphore pattern: `s.semaphore <- struct{}{}` to acquire, `<-s.semaphore` to release
- Do NOT share circuit breaker state with the status poll scheduler — these are independent
- Partial/truncated output (SSHError with Kind ErrTruncatedOutput) is treated as transient error — never publish, apply backoff
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/poller/ -run "TestBackup|TestJitter|TestBackoff|TestShouldRetry" -v -count=1</automated>
</verify>
<done>
- BackupScheduler manages per-device backup goroutines independently from status poll scheduler
- First backup uses 30-300s random jitter delay
- Concurrency limited by buffered channel semaphore (default 10)
- Per-device Redis lock prevents duplicate backups across pods
- Auth failures and host key mismatches block retries with clear log messages
- Transient errors use 5m/15m/1h exponential backoff
- Offline devices are skipped without error
- Successful backups normalize config, compute SHA256, and publish to NATS
- TOFU fingerprint stored on first successful connection
- All unit tests pass
</done>
</task>
<task type="auto">
<name>Task 2: Wire BackupScheduler into main.go lifecycle</name>
<files>poller/cmd/poller/main.go</files>
<action>
Add BackupScheduler initialization and startup to main.go, following the existing pattern of scheduler initialization (lines 250-278).
After the existing scheduler creation (around line 270), add a new section:
```
// -----------------------------------------------------------------------
// Start the config backup scheduler
// -----------------------------------------------------------------------
```
1. Convert config values to durations:
```go
backupInterval := time.Duration(cfg.ConfigBackupIntervalSeconds) * time.Second
backupCmdTimeout := time.Duration(cfg.ConfigBackupCommandTimeoutSeconds) * time.Second
```
2. Create BackupScheduler:
```go
backupScheduler := poller.NewBackupScheduler(
deviceStore,
deviceStore, // SSHHostKeyUpdater (DeviceStore satisfies this interface)
locker,
publisher,
credentialCache,
redisClient,
backupInterval,
backupCmdTimeout,
refreshPeriod, // reuse existing device refresh period
cfg.ConfigBackupMaxConcurrent,
)
```
3. Start in a goroutine (runs parallel with the main status poll scheduler):
```go
go func() {
slog.Info("starting config backup scheduler",
"interval", backupInterval,
"max_concurrent", cfg.ConfigBackupMaxConcurrent,
"command_timeout", backupCmdTimeout,
)
if err := backupScheduler.Run(ctx); err != nil {
slog.Error("backup scheduler exited with error", "error", err)
}
}()
```
The BackupScheduler shares the same ctx as everything else, so SIGINT/SIGTERM will trigger its shutdown via context cancellation. No additional shutdown logic needed — Run() returns when ctx is cancelled.
Log the startup with the same pattern as the existing scheduler startup log (line 273-276).
</action>
<verify>
<automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./cmd/poller/ && echo "build successful"</automated>
</verify>
<done>
- BackupScheduler created in main.go with all dependencies injected
- Runs as a goroutine parallel to the status poll scheduler
- Shares the same context for graceful shutdown
- Startup logged with interval, max_concurrent, and command_timeout
- Poller binary compiles successfully with the new scheduler wired in
</done>
</task>
</tasks>
<verification>
1. `cd poller && go build ./cmd/poller/` — binary compiles with backup scheduler wired in
2. `cd poller && go vet ./...` — no static analysis issues
3. `cd poller && go test ./internal/poller/ -v -count=1` — all poller tests pass (existing + new backup tests)
4. `cd poller && go test ./... -count=1` — full test suite passes
</verification>
<success_criteria>
- BackupScheduler runs independently from status poll scheduler with its own per-device goroutines
- Devices get their first backup 30-300s after discovery, then every CONFIG_BACKUP_INTERVAL
- SSH command execution uses TOFU host key verification and stores fingerprints on first connect
- Config output is normalized, hashed, and published to NATS config.snapshot.create
- Concurrency limited to CONFIG_BACKUP_MAX_CONCURRENT parallel SSH sessions
- Auth/hostkey errors block retries; transient errors use exponential backoff (5m/15m/1h)
- Offline devices are skipped gracefully
- BackupScheduler is wired into main.go and starts/stops with the poller lifecycle
- All tests pass, project compiles clean
</success_criteria>
<output>
After completion, create `.planning/phases/02-poller-config-collection/02-02-SUMMARY.md`
</output>

View File

@@ -1,100 +0,0 @@
---
phase: 02-poller-config-collection
plan: 02
subsystem: poller
tags: [ssh, backup, scheduler, nats, routeros, concurrency, tofu, redis]
requires:
- phase: 02-poller-config-collection/01
provides: SSH executor, config normalizer, NATS ConfigSnapshotEvent, Prometheus metrics, config fields
provides:
- BackupScheduler with per-device goroutines managing periodic SSH config collection
- Concurrency-limited config backup pipeline (SSH -> normalize -> hash -> NATS publish)
- TOFU host key verification with persistent fingerprint storage
- Auth/hostkey error blocking with transient error exponential backoff
- SSHHostKeyUpdater consumer-side interface
affects: [03-backend-snapshot-consumer, api, poller]
tech-stack:
added: []
patterns: [per-device goroutine lifecycle, buffered channel semaphore, Redis online gating]
key-files:
created:
- poller/internal/poller/backup_scheduler.go
- poller/internal/poller/backup_scheduler_test.go
modified:
- poller/internal/poller/interfaces.go
- poller/cmd/poller/main.go
key-decisions:
- "BackupScheduler runs independently from status poll scheduler with separate goroutines"
- "Semaphore uses buffered channel pattern matching existing codebase style"
- "Device with no Redis status key assumed potentially online (first poll not yet completed)"
patterns-established:
- "Backup goroutine pattern: jitter -> initial backup -> ticker loop with gating checks"
- "Error classification: auth/hostkey block retries, transient errors use exponential backoff"
requirements-completed: [COLL-01, COLL-03, COLL-05, COLL-06]
duration: 4min
completed: 2026-03-13
---
# Phase 2 Plan 2: Backup Scheduler Summary
**BackupScheduler orchestrating periodic SSH config collection with per-device goroutines, concurrency semaphore, TOFU verification, and NATS publishing**
## Performance
- **Duration:** 4 min
- **Started:** 2026-03-13T01:51:27Z
- **Completed:** 2026-03-13T01:55:37Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- BackupScheduler manages per-device backup goroutines with 30-300s initial jitter
- Concurrency limited by configurable buffered channel semaphore (default 10)
- Auth failures and host key mismatches permanently block retries with clear log warnings
- Transient errors use stepped backoff (5m/15m/1h cap)
- Full pipeline wired into main.go running parallel to existing status poll scheduler
## Task Commits
Each task was committed atomically:
1. **Task 1: BackupScheduler with per-device goroutines** - `a884b09` (test) + `2653a32` (feat) -- TDD red/green
2. **Task 2: Wire BackupScheduler into main.go** - `d34817a` (feat)
## Files Created/Modified
- `poller/internal/poller/backup_scheduler.go` - BackupScheduler with per-device goroutines, concurrency control, SSH collection, NATS publishing
- `poller/internal/poller/backup_scheduler_test.go` - Unit tests for jitter, backoff, retry blocking, online gating, semaphore, reconciliation
- `poller/internal/poller/interfaces.go` - Added SSHHostKeyUpdater consumer-side interface
- `poller/cmd/poller/main.go` - BackupScheduler initialization and goroutine startup
## Decisions Made
- BackupScheduler runs independently from status poll scheduler -- separate goroutine pool, no shared state
- Semaphore uses buffered channel pattern (consistent with Go idioms, no external deps)
- Devices with no Redis status key assumed potentially online to avoid blocking first backup
- Locker nil-check allows tests to run without Redis lock infrastructure
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Config backup pipeline complete: SSH -> normalize -> hash -> NATS publish
- Backend snapshot consumer (Phase 3) can subscribe to config.snapshot.create.> to receive snapshots
- Pre-existing integration test failures in poller package (missing certificate_authorities table) are unrelated to this work
---
*Phase: 02-poller-config-collection*
*Completed: 2026-03-13*

View File

@@ -1,108 +0,0 @@
---
phase: 03-snapshot-ingestion
plan: 01
subsystem: api
tags: [nats, jetstream, openbao, transit, encryption, postgresql, prometheus, dedup]
# Dependency graph
requires:
- phase: 01-database-schema
provides: RouterConfigSnapshot model and router_config_snapshots table
- phase: 02-poller-config-collection
provides: Go poller publishes config.snapshot.> NATS messages
provides:
- NATS subscriber consuming config.snapshot.> messages
- SHA256 dedup preventing duplicate snapshot storage
- OpenBao Transit encryption of config text before INSERT
- Prometheus metrics for ingestion monitoring
affects: [04-diff-engine, snapshot-api, config-timeline]
# Tech tracking
tech-stack:
added: [prometheus_client]
patterns: [nats-subscriber-with-dedup, transit-encrypt-before-insert]
key-files:
created:
- backend/app/services/config_snapshot_subscriber.py
- backend/tests/test_config_snapshot_subscriber.py
modified:
- backend/app/main.py
key-decisions:
- "Trust poller-provided SHA256 hash (no recompute on backend)"
- "Raw SQL for dedup SELECT and INSERT (consistent with nats_subscriber.py pattern)"
- "OpenBao Transit service instantiated per-message with close() for connection hygiene"
patterns-established:
- "Config snapshot ingestion: dedup by SHA256 -> encrypt -> INSERT -> ack"
- "Transit failure causes nak (NATS retry), plaintext never stored as fallback"
requirements-completed: [STOR-02]
# Metrics
duration: 4min
completed: 2026-03-13
---
# Phase 3 Plan 1: Config Snapshot Subscriber Summary
**NATS subscriber ingesting config snapshots with SHA256 dedup, OpenBao Transit encryption, and Prometheus metrics**
## Performance
- **Duration:** 4 min
- **Started:** 2026-03-13T02:44:01Z
- **Completed:** 2026-03-13T02:48:08Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- NATS subscriber consuming config.snapshot.> on DEVICE_EVENTS stream with durable consumer
- SHA256 dedup: duplicate snapshots silently skipped at debug level with Prometheus counter
- OpenBao Transit encryption: plaintext never stored in PostgreSQL, Transit failure causes nak
- Malformed and orphan device messages acked and discarded safely with warning logs
- 6 unit tests covering all handler paths (new, duplicate, encrypt fail, malformed, orphan, first)
- Wired into main.py lifespan with non-fatal startup pattern
## Task Commits
Each task was committed atomically:
1. **Task 1 (RED): Failing tests** - `9d82741` (test)
2. **Task 1 (GREEN): Config snapshot subscriber** - `3ab9f27` (feat)
3. **Task 2: Wire into main.py lifespan** - `0db0641` (feat)
_TDD task had RED + GREEN commits_
## Files Created/Modified
- `backend/app/services/config_snapshot_subscriber.py` - NATS subscriber with dedup, encryption, metrics
- `backend/tests/test_config_snapshot_subscriber.py` - 6 unit tests for all handler paths
- `backend/app/main.py` - Lifespan wiring for start/stop
## Decisions Made
- Trust poller-provided SHA256 hash (no recompute on backend) -- per project decision
- Raw SQL for dedup SELECT and INSERT -- consistent with existing nats_subscriber.py pattern
- OpenBao Transit service instantiated per-message with close() -- connection hygiene
- config_text never appears in any log statement -- contains passwords and keys
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None.
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Config snapshot subscriber ready to receive messages from Go poller
- RouterConfigSnapshot rows will be available for diff engine (Phase 4)
- Prometheus metrics exposed for monitoring ingestion rate and errors
---
*Phase: 03-snapshot-ingestion*
*Completed: 2026-03-13*

View File

@@ -1,115 +0,0 @@
---
phase: 04-manual-backup-trigger
plan: 01
subsystem: api
tags: [nats, request-reply, backup, ssh, go, fastapi]
# Dependency graph
requires:
- phase: 02-poller-config-collection
provides: BackupScheduler with SSH config collection pipeline
- phase: 03-snapshot-ingestion
provides: Config snapshot subscriber for NATS ingestion
provides:
- BackupResponder NATS handler for manual config backup triggers
- POST /config-snapshot/trigger API endpoint for on-demand backups
- Public CollectAndPublish method on BackupScheduler returning sha256 hash
- BackupExecutor/BackupLocker/DeviceGetter interfaces for testability
affects: [05-snapshot-list-api, 06-diff-api]
# Tech tracking
tech-stack:
added: [nats-server/v2 (test dependency)]
patterns: [interface-based dependency injection for NATS responders, in-process NATS server for Go unit tests]
key-files:
created:
- poller/internal/bus/backup_responder.go
- poller/internal/bus/backup_responder_test.go
- poller/internal/bus/redis_locker.go
- backend/tests/test_config_snapshot_trigger.py
modified:
- poller/internal/poller/backup_scheduler.go
- poller/cmd/poller/main.go
- backend/app/routers/config_backups.py
key-decisions:
- "Used interface-based DI (BackupExecutor, BackupLocker, DeviceGetter) for BackupResponder testability"
- "Refactored collectAndPublish to return (string, error) with public CollectAndPublish wrapper"
- "Used in-process nats-server/v2 for fast Go unit tests instead of testcontainers"
- "Reused routeros_proxy NATS connection for Python endpoint instead of separate connection"
patterns-established:
- "BackupExecutor interface: abstracts backup pipeline for manual trigger callers"
- "In-process NATS test server: startTestNATS helper for Go bus package tests"
requirements-completed: [COLL-04]
# Metrics
duration: 7min
completed: 2026-03-13
---
# Phase 4 Plan 1: Manual Backup Trigger Summary
**NATS request-reply manual backup trigger with Go BackupResponder and Python API endpoint returning synchronous success/failure/hash**
## Performance
- **Duration:** 7 min
- **Started:** 2026-03-13T03:03:57Z
- **Completed:** 2026-03-13T03:10:41Z
- **Tasks:** 2
- **Files modified:** 7
## Accomplishments
- BackupResponder subscribes to config.backup.trigger (core NATS) and reuses BackupScheduler pipeline
- API endpoint POST /tenants/{tid}/devices/{did}/config-snapshot/trigger with operator role, 10/min rate limit
- Returns 201/409/502/504 with structured JSON including sha256 hash on success
- Per-device Redis lock prevents concurrent manual+scheduled backup collisions
- 12 total tests (6 Go, 6 Python) all passing
## Task Commits
Each task was committed atomically:
1. **Task 1: Go BackupResponder with extracted collectAndPublish** - `9e102fd` (test: RED), `0851ece` (feat: GREEN)
2. **Task 2: Python API endpoint for manual config snapshot trigger** - `0e66415` (test: RED), `00f0a8b` (feat: GREEN)
_TDD tasks have separate test and implementation commits._
## Files Created/Modified
- `poller/internal/bus/backup_responder.go` - NATS request-reply handler for manual backup triggers
- `poller/internal/bus/backup_responder_test.go` - 6 tests with in-process NATS server
- `poller/internal/bus/redis_locker.go` - RedisBackupLocker adapter implementing BackupLocker interface
- `poller/internal/poller/backup_scheduler.go` - Public CollectAndPublish method, returns (string, error)
- `poller/cmd/poller/main.go` - BackupResponder wired into lifecycle
- `backend/app/routers/config_backups.py` - New trigger_config_snapshot endpoint
- `backend/tests/test_config_snapshot_trigger.py` - 6 tests covering all response paths
## Decisions Made
- Used interface-based dependency injection (BackupExecutor, BackupLocker, DeviceGetter) rather than direct struct dependencies for testability
- Refactored collectAndPublish to return hash string alongside error, enabling public CollectAndPublish wrapper
- Added nats-server/v2 as test dependency for fast in-process NATS testing instead of testcontainers
- Python tests use simulated handler logic to avoid import chain issues (rate_limit -> redis, auth -> bcrypt)
- Reused routeros_proxy NATS connection via _get_nats() import instead of duplicating lazy-init pattern
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
- Python test environment lacks redis and bcrypt packages, preventing direct import of app.routers.config_backups. Resolved by testing handler logic via simulation function that mirrors the endpoint implementation.
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Manual backup trigger complete, ready for Phase 5 (snapshot list API)
- config.backup.trigger NATS subject uses core NATS (not JetStream), no stream config changes needed
- BackupExecutor interface available for any future caller needing programmatic backup triggers
---
*Phase: 04-manual-backup-trigger*
*Completed: 2026-03-13*

View File

@@ -1,115 +0,0 @@
---
phase: 05-diff-engine
plan: 01
subsystem: api
tags: [difflib, unified-diff, openbao, transit, prometheus, nats]
requires:
- phase: 03-snapshot-ingestion
provides: "config snapshot subscriber and router_config_snapshots table"
- phase: 01-database-schema
provides: "router_config_diffs table schema"
provides:
- "generate_and_store_diff() for unified diff between consecutive snapshots"
- "Prometheus metrics for diff generation success/failure/timing"
- "Subscriber integration calling diff after snapshot INSERT"
affects: [06-change-parser, 07-timeline-api]
tech-stack:
added: [difflib]
patterns: [best-effort-secondary-operation, tdd-red-green]
key-files:
created:
- backend/app/services/config_diff_service.py
- backend/tests/test_config_diff_service.py
modified:
- backend/app/services/config_snapshot_subscriber.py
- backend/tests/test_config_snapshot_subscriber.py
key-decisions:
- "Diff service instantiates its own OpenBaoTransitService per-call with close() for clean lifecycle"
- "RETURNING id added to snapshot INSERT to capture new_snapshot_id for diff generation"
- "Subscriber tests mock generate_and_store_diff to isolate snapshot logic from diff logic"
patterns-established:
- "Best-effort secondary operations: wrap in try/except, log+count errors, never block primary flow"
- "Line counting excludes unified diff headers (+++ and --- lines)"
requirements-completed: [DIFF-01, DIFF-02]
duration: 3min
completed: 2026-03-13
---
# Phase 5 Plan 1: Config Diff Service Summary
**Unified diff generation between consecutive config snapshots using difflib with Transit decrypt and best-effort error handling**
## Performance
- **Duration:** 3 min
- **Started:** 2026-03-13T03:30:07Z
- **Completed:** 2026-03-13T03:33:Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- Config diff service generates unified diffs between consecutive snapshots per device
- Transit decrypt of both old and new ciphertext before diffing in memory
- Best-effort pattern: decrypt/DB failures logged and counted, never block snapshot ack
- Prometheus metrics track diff success, errors (by type), and generation duration
- Subscriber wired to call diff generation after every successful snapshot INSERT
## Task Commits
Each task was committed atomically:
1. **Task 1: Diff generation service (TDD RED)** - `79453fa` (test)
2. **Task 1: Diff generation service (TDD GREEN)** - `72d0ae2` (feat)
3. **Task 2: Wire diff into subscriber** - `eb76343` (feat)
_TDD task had separate RED and GREEN commits_
## Files Created/Modified
- `backend/app/services/config_diff_service.py` - Diff generation with Transit decrypt, difflib, Prometheus metrics
- `backend/tests/test_config_diff_service.py` - 5 unit tests covering diff, first-snapshot, decrypt failure, line counts, empty diff
- `backend/app/services/config_snapshot_subscriber.py` - Added RETURNING id, generate_and_store_diff call after commit
- `backend/tests/test_config_snapshot_subscriber.py` - Updated to mock generate_and_store_diff
## Decisions Made
- Diff service instantiates its own OpenBaoTransitService per-call (clean lifecycle, consistent with subscriber pattern)
- RETURNING id added to snapshot INSERT SQL to capture the new_snapshot_id without a separate query
- Subscriber tests mock generate_and_store_diff to keep snapshot tests isolated and unchanged in assertion counts
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 1 - Bug] Updated subscriber test assertions for diff integration**
- **Found during:** Task 2 (wire diff into subscriber)
- **Issue:** Existing subscriber tests failed because generate_and_store_diff made additional DB calls through the shared mock session
- **Fix:** Added patch for generate_and_store_diff in subscriber tests that successfully INSERT (test 1 and test 6)
- **Files modified:** backend/tests/test_config_snapshot_subscriber.py
- **Verification:** All 11 tests pass
- **Committed in:** eb76343 (Task 2 commit)
---
**Total deviations:** 1 auto-fixed (1 bug)
**Impact on plan:** Necessary to maintain test isolation. No scope creep.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Diff generation is active and will produce diffs for every new non-duplicate snapshot
- router_config_diffs table populated with diff_text, line counts, and snapshot references
- Ready for change parser (Phase 6) to parse semantic changes from diff_text
---
*Phase: 05-diff-engine*
*Completed: 2026-03-13*

View File

@@ -1,112 +0,0 @@
---
phase: 05-diff-engine
plan: 02
subsystem: api
tags: [parser, routeros, structured-changes, tdd]
requires:
- phase: 05-diff-engine
plan: 01
provides: "generate_and_store_diff() and router_config_diffs table"
provides:
- "parse_diff_changes() for extracting structured component changes from unified diffs"
- "router_config_changes rows linked to diff_id for timeline UI"
affects: [07-timeline-api]
tech-stack:
added: []
patterns: [tdd-red-green, best-effort-secondary-operation]
key-files:
created:
- backend/app/services/config_change_parser.py
- backend/tests/test_config_change_parser.py
modified:
- backend/app/services/config_diff_service.py
- backend/tests/test_config_diff_service.py
key-decisions:
- "Change parser is pure function (no DB/IO) for easy testing; DB writes happen in diff service"
- "RETURNING id added to diff INSERT to capture diff_id for linking changes"
- "Change parser errors are best-effort: diff is always stored, only changes are lost on parser failure"
patterns-established:
- "RouterOS path to component: strip leading /, replace spaces with / (e.g., /ip firewall filter -> ip/firewall/filter)"
- "Fallback component system/general for diffs without RouterOS path headers"
requirements-completed: [DIFF-03, DIFF-04]
duration: 2min
completed: 2026-03-13
---
# Phase 5 Plan 2: Structured Change Parser Summary
**RouterOS diff change parser extracting component names, human-readable summaries, and raw lines from unified diffs with best-effort DB storage**
## Performance
- **Duration:** 2 min
- **Started:** 2026-03-13T03:34:48Z
- **Completed:** 2026-03-13T03:37:14Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- Pure-function change parser extracts component, summary, raw_line from RouterOS unified diffs
- RouterOS path detection converts section headers to component format (ip/firewall/filter)
- Human-readable summaries: Added/Removed/Modified N rules per component
- Diff service wired to call parser after INSERT and store results in router_config_changes
- Parser failures are best-effort: diff always stored, changes lost only on parser error
## Task Commits
Each task was committed atomically:
1. **Task 1: Change parser TDD RED** - `7fddf35` (test)
2. **Task 1: Change parser TDD GREEN** - `b167831` (feat)
3. **Task 2: Wire parser into diff service** - `122b591` (feat)
_TDD task had separate RED and GREEN commits_
## Files Created/Modified
- `backend/app/services/config_change_parser.py` - Pure parser: parse_diff_changes() with path detection, summary generation, raw line capture
- `backend/tests/test_config_change_parser.py` - 6 unit tests covering additions, multi-section, removals, modifications, fallback, raw_line
- `backend/app/services/config_diff_service.py` - Added RETURNING id, parse_diff_changes integration, change INSERT loop
- `backend/tests/test_config_diff_service.py` - Updated existing tests for RETURNING id, added 2 tests for change storage and parser error resilience
## Decisions Made
- Change parser is a pure function (no DB/IO) for straightforward unit testing; DB writes are the diff service's responsibility
- RETURNING id added to diff INSERT SQL to get diff_id without separate query
- Change parser errors caught by separate try/except so diff is always committed first
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 1 - Bug] Updated existing diff service tests for RETURNING id and parse_diff_changes integration**
- **Found during:** Task 2
- **Issue:** Existing tests expected 3 execute calls without scalar_one on INSERT result; new RETURNING id and parse_diff_changes call changed the interaction pattern
- **Fix:** Added scalar_one mock to INSERT result, patched parse_diff_changes to return empty list in existing tests to isolate behavior
- **Files modified:** backend/tests/test_config_diff_service.py
- **Committed in:** 122b591
---
**Total deviations:** 1 auto-fixed (1 bug)
**Impact on plan:** Necessary test update for API change. No scope creep.
## Issues Encountered
None
## User Setup Required
None
## Next Phase Readiness
- router_config_changes table populated with structured changes for every non-empty diff
- Changes linked to diff_id, device_id, tenant_id for timeline queries
- Ready for timeline API (Phase 7) to query changes per device
---
*Phase: 05-diff-engine*
*Completed: 2026-03-13*

View File

@@ -1,95 +0,0 @@
---
phase: 06-history-api
plan: 01
subsystem: api
tags: [fastapi, sqlalchemy, pagination, timeline, rbac]
# Dependency graph
requires:
- phase: 05-diff-engine
provides: router_config_changes and router_config_diffs tables with parsed change data
provides:
- GET /api/tenants/{tid}/devices/{did}/config-history endpoint
- get_config_history service function with pagination
affects: [06-02, frontend-config-history]
# Tech tracking
tech-stack:
added: []
patterns: [raw SQL text() joins for timeline queries, same RBAC pattern as config_backups]
key-files:
created:
- backend/app/services/config_history_service.py
- backend/app/routers/config_history.py
- backend/tests/test_config_history_service.py
modified:
- backend/app/main.py
key-decisions:
- "Raw SQL text() for JOIN query consistent with config_diff_service.py pattern"
- "Pagination defaults: limit=50, offset=0 with validation (ge=1, le=200 for limit)"
patterns-established:
- "Config history queries use JOIN between changes and diffs tables for timeline view"
requirements-completed: [API-01, API-04]
# Metrics
duration: 2min
completed: 2026-03-13
---
# Phase 6 Plan 1: Config History Timeline Summary
**GET /config-history endpoint returning paginated change timeline with component, summary, timestamp, and diff metadata via JOIN query**
## Performance
- **Duration:** 2 min
- **Started:** 2026-03-13T03:58:03Z
- **Completed:** 2026-03-13T04:00:00Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- Config history service querying router_config_changes JOIN router_config_diffs for timeline entries
- REST endpoint with viewer+ RBAC and config:read scope enforcement
- 4 unit tests covering formatting, empty results, pagination, and ordering
- Router registered in main.py alongside existing config routers
## Task Commits
Each task was committed atomically:
1. **Task 1: Config history service and tests (TDD)** - `f7d5aec` (feat)
2. **Task 2: Config history router and main.py registration** - `5c56344` (feat)
## Files Created/Modified
- `backend/app/services/config_history_service.py` - Query function for paginated config change timeline
- `backend/app/routers/config_history.py` - REST endpoint with RBAC, pagination query params
- `backend/tests/test_config_history_service.py` - 4 unit tests with AsyncMock sessions
- `backend/app/main.py` - Router import and registration
## Decisions Made
- Used raw SQL text() for the JOIN query, consistent with config_diff_service.py pattern
- Pagination limit constrained to 1-200 via FastAPI Query validation
- Copied _check_tenant_access helper (same pattern as config_backups.py)
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Config history timeline endpoint ready for frontend consumption
- Plan 06-02 can build on this for detailed diff view endpoints
---
*Phase: 06-history-api*
*Completed: 2026-03-13*

View File

@@ -1,95 +0,0 @@
---
phase: 06-history-api
plan: 02
subsystem: api
tags: [fastapi, sqlalchemy, openbao, transit-decrypt, rbac, snapshot]
# Dependency graph
requires:
- phase: 06-history-api
provides: config_history_service.py with get_config_history, config_history router with RBAC
- phase: 05-diff-engine
provides: router_config_diffs and router_config_snapshots tables with encrypted config data
provides:
- GET /api/tenants/{tid}/devices/{did}/config/{snapshot_id} endpoint (decrypted snapshot)
- GET /api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff endpoint (unified diff)
- get_snapshot and get_snapshot_diff service functions
affects: [frontend-config-history, frontend-diff-viewer]
# Tech tracking
tech-stack:
added: []
patterns: [Transit decrypt in service layer with try/finally close, 404 for missing snapshots/diffs]
key-files:
created: []
modified:
- backend/app/services/config_history_service.py
- backend/app/routers/config_history.py
- backend/tests/test_config_history_service.py
key-decisions:
- "Transit decrypt in get_snapshot with try/finally for clean openbao lifecycle"
- "500 error wrapping for Transit decrypt failures in router (not service)"
patterns-established:
- "Snapshot retrieval filters by id + device_id + tenant_id for RLS-safe queries"
requirements-completed: [API-02, API-03, API-04]
# Metrics
duration: 2min
completed: 2026-03-13
---
# Phase 6 Plan 2: Snapshot View and Diff Retrieval Summary
**Snapshot view and diff retrieval endpoints with Transit decrypt for full config text and unified diff, enforcing viewer+ RBAC**
## Performance
- **Duration:** 2 min
- **Started:** 2026-03-13T04:01:58Z
- **Completed:** 2026-03-13T04:03:39Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- get_snapshot function decrypts config via OpenBao Transit and returns plaintext with metadata
- get_snapshot_diff function queries diff by new_snapshot_id for a device/tenant
- Two new router endpoints with viewer+ RBAC and config:read scope enforcement
- 4 new tests (8 total) covering decrypted content, not-found, diff retrieval, and no-diff cases
## Task Commits
Each task was committed atomically:
1. **Task 1: Snapshot and diff service functions with tests (TDD)** - `83cd661` (feat)
2. **Task 2: Snapshot and diff router endpoints** - `af7007d` (feat)
## Files Created/Modified
- `backend/app/services/config_history_service.py` - Added get_snapshot (Transit decrypt) and get_snapshot_diff query functions
- `backend/app/routers/config_history.py` - Two new GET endpoints with RBAC, 404/500 error handling
- `backend/tests/test_config_history_service.py` - 4 new tests with mocked Transit and DB sessions
## Decisions Made
- Transit decrypt happens in service layer (get_snapshot), error wrapping in router layer (500 response)
- Query filters include device_id + tenant_id alongside snapshot_id for RLS-safe access
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- All 3 config history API endpoints complete (timeline, snapshot view, diff view)
- Phase 06 complete -- ready for frontend integration
---
*Phase: 06-history-api*
*Completed: 2026-03-13*

View File

@@ -1,89 +0,0 @@
---
phase: 07-config-history-ui
plan: 01
subsystem: ui
tags: [react, tanstack-query, timeline, config-history]
requires:
- phase: 06-history-api
provides: GET /api/tenants/{tid}/devices/{did}/config-history endpoint
provides:
- ConfigHistorySection component with timeline rendering
- configHistoryApi.list() API client function
- Configuration history visible on device detail overview tab
affects: [07-config-history-ui]
tech-stack:
added: []
patterns: [timeline component pattern matching BackupTimeline.tsx]
key-files:
created:
- frontend/src/components/config/ConfigHistorySection.tsx
modified:
- frontend/src/lib/api.ts
- frontend/src/routes/_authenticated/tenants/$tenantId/devices/$deviceId.tsx
key-decisions:
- "Reimplemented formatRelativeTime locally rather than extracting shared util (matches BackupTimeline pattern)"
- "Poll interval 60s via refetchInterval for near-real-time change visibility"
patterns-established:
- "Config history timeline: vertical dot timeline with component badge, summary, line delta, relative time"
requirements-completed: [UI-01, UI-02]
duration: 3min
completed: 2026-03-13
---
# Phase 7 Plan 1: Config History UI Summary
**ConfigHistorySection timeline component on device detail page, fetching change entries via TanStack Query with 60s polling**
## Performance
- **Duration:** 3 min
- **Started:** 2026-03-13T04:11:08Z
- **Completed:** 2026-03-13T04:14:00Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- Added configHistoryApi.list() and ConfigChangeEntry interface to api.ts
- Created ConfigHistorySection with vertical timeline, loading skeleton, and empty state
- Wired component into device detail overview tab below Interface Utilization
## Task Commits
Each task was committed atomically:
1. **Task 1: API client and ConfigHistorySection component** - `6bd2451` (feat)
2. **Task 2: Wire ConfigHistorySection into device detail page** - `36861ff` (feat)
## Files Created/Modified
- `frontend/src/lib/api.ts` - Added ConfigChangeEntry interface and configHistoryApi.list()
- `frontend/src/components/config/ConfigHistorySection.tsx` - Timeline component with loading/empty/data states
- `frontend/src/routes/_authenticated/tenants/$tenantId/devices/$deviceId.tsx` - Import and render ConfigHistorySection
## Decisions Made
- Reimplemented formatRelativeTime locally (same pattern as BackupTimeline.tsx) rather than extracting to shared util -- keeps components self-contained
- Used 60s refetchInterval for polling new config changes
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Config history timeline renders on device overview tab
- Ready for any future detail/drill-down views on individual changes
---
*Phase: 07-config-history-ui*
*Completed: 2026-03-13*

View File

@@ -1,92 +0,0 @@
---
phase: 08-diff-viewer-download
plan: 01
subsystem: ui
tags: [react, diff-viewer, tanstack-query, tailwind]
requires:
- phase: 07-config-history-ui
provides: ConfigHistorySection timeline component with ConfigChangeEntry data
- phase: 06-config-history-api
provides: GET /config/{snapshot_id}/diff endpoint returning DiffResponse
provides:
- DiffViewer component with unified diff rendering (green/red line highlighting)
- configHistoryApi.getDiff() API client method
- Clickable timeline entries in ConfigHistorySection
affects: [08-diff-viewer-download]
tech-stack:
added: []
patterns: [inline diff viewer with line-level classification]
key-files:
created:
- frontend/src/components/config/DiffViewer.tsx
modified:
- frontend/src/lib/api.ts
- frontend/src/components/config/ConfigHistorySection.tsx
key-decisions:
- "DiffViewer rendered inline above timeline (not modal) for context preservation"
- "Line classification function for unified diff: +green, -red, @@blue, ---/+++ muted"
patterns-established:
- "Inline viewer pattern: state-driven component rendered above list, closed via callback"
requirements-completed: [UI-03]
duration: 1min
completed: 2026-03-13
---
# Phase 8 Plan 1: Diff Viewer Summary
**Inline diff viewer with green/red line highlighting, wired into clickable config history timeline entries**
## Performance
- **Duration:** 1 min
- **Started:** 2026-03-13T04:19:53Z
- **Completed:** 2026-03-13T04:20:56Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- DiffViewer component renders unified diffs with color-coded lines (green additions, red removals, blue hunk headers)
- API client getDiff method fetches diff data from backend endpoint
- Timeline entries in ConfigHistorySection are clickable with hover states
## Task Commits
Each task was committed atomically:
1. **Task 1: Add diff API client and create DiffViewer component** - `dda00fb` (feat)
2. **Task 2: Wire DiffViewer into ConfigHistorySection timeline entries** - `2cf426f` (feat)
## Files Created/Modified
- `frontend/src/components/config/DiffViewer.tsx` - Unified diff viewer with line-level color highlighting, loading skeleton, error state
- `frontend/src/lib/api.ts` - Added DiffResponse interface and configHistoryApi.getDiff() method
- `frontend/src/components/config/ConfigHistorySection.tsx` - Added click handlers, selectedSnapshotId state, inline DiffViewer rendering
## Decisions Made
- Rendered DiffViewer inline above the timeline rather than in a modal, preserving context
- Used a classifyLine helper function for clean line-type detection (handles +++ and --- separately from + and -)
- Loading skeleton uses randomized widths for visual variety
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Diff viewer complete, ready for config download functionality (plan 08-02)
- All TypeScript compiles cleanly
---
*Phase: 08-diff-viewer-download*
*Completed: 2026-03-13*

View File

@@ -1,98 +0,0 @@
---
phase: 09-retention-cleanup
plan: 01
subsystem: database
tags: [apscheduler, retention, postgresql, prometheus, cascade-delete]
# Dependency graph
requires:
- phase: 01-database-schema
provides: router_config_snapshots table with CASCADE FK constraints
provides:
- Automatic retention cleanup of expired config snapshots
- CONFIG_RETENTION_DAYS env var for configurable retention period
- Prometheus metrics for cleanup observability
affects: []
# Tech tracking
tech-stack:
added: []
patterns: [APScheduler IntervalTrigger for periodic maintenance jobs]
key-files:
created:
- backend/app/services/retention_service.py
- backend/tests/test_retention_service.py
modified:
- backend/app/config.py
- backend/app/main.py
key-decisions:
- "make_interval(days => :days) for parameterized PostgreSQL interval (no string concatenation)"
- "24h IntervalTrigger with 1h jitter to stagger cleanup across instances"
- "AdminAsyncSessionLocal (bypasses RLS) since retention is cross-tenant system operation"
patterns-established:
- "IntervalTrigger pattern for periodic maintenance jobs (vs CronTrigger for scheduled backups)"
requirements-completed: [STOR-03, STOR-04]
# Metrics
duration: 2min
completed: 2026-03-13
---
# Phase 9 Plan 1: Retention Cleanup Summary
**Daily APScheduler job deletes config snapshots older than CONFIG_RETENTION_DAYS (default 90) with CASCADE FK cleanup of diffs and changes**
## Performance
- **Duration:** 2 min
- **Started:** 2026-03-13T04:31:48Z
- **Completed:** 2026-03-13T04:34:12Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- Retention service with parameterized SQL DELETE using make_interval for safe interval binding
- APScheduler IntervalTrigger running every 24h with 1h jitter for stagger
- Prometheus counter and histogram for cleanup observability
- Wired into main.py lifespan with non-fatal startup pattern
## Task Commits
Each task was committed atomically:
1. **Task 1 (RED): Add failing tests** - `00bdde9` (test)
2. **Task 1 (GREEN): Implement retention service + config setting** - `a9f7a45` (feat)
3. **Task 2: Wire retention scheduler into lifespan** - `4d62bc9` (feat)
## Files Created/Modified
- `backend/app/services/retention_service.py` - Retention cleanup logic, scheduler, Prometheus metrics
- `backend/tests/test_retention_service.py` - 4 unit tests for cleanup function
- `backend/app/config.py` - Added CONFIG_RETENTION_DAYS setting (default 90)
- `backend/app/main.py` - Wired start/stop retention scheduler into lifespan
## Decisions Made
- Used make_interval(days => :days) for parameterized PostgreSQL interval (avoids string concatenation SQL injection risk)
- 24h IntervalTrigger with 1h jitter to stagger cleanup across instances
- AdminAsyncSessionLocal bypasses RLS since retention is a cross-tenant system operation
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required. CONFIG_RETENTION_DAYS defaults to 90 if not set.
## Next Phase Readiness
- Retention cleanup is fully operational, ready for phase 10
- No blockers
---
*Phase: 09-retention-cleanup*
*Completed: 2026-03-13*

View File

@@ -1,98 +0,0 @@
---
phase: 10-audit-observability
plan: 01
subsystem: api
tags: [audit, logging, config-backup, nats, observability]
# Dependency graph
requires:
- phase: 03-snapshot-ingestion
provides: config_snapshot_subscriber handle_config_snapshot handler
- phase: 05-config-diff
provides: config_diff_service generate_and_store_diff function
- phase: 04-manual-backup-trigger
provides: config_backups trigger_config_snapshot endpoint
provides:
- Audit trail for all config backup operations (4 event types)
- Tests verifying audit event emission
affects: []
# Tech tracking
tech-stack:
added: []
patterns: [try/except-wrapped log_action calls for fire-and-forget audit, inline imports in diff service to avoid circular deps]
key-files:
created:
- backend/tests/test_audit_config_backup.py
modified:
- backend/app/services/config_snapshot_subscriber.py
- backend/app/services/config_diff_service.py
- backend/app/routers/config_backups.py
key-decisions:
- "Module-level import of log_action in snapshot subscriber (no circular risk), inline import in diff service and router (consistent with existing best-effort pattern)"
- "All audit calls wrapped in try/except Exception: pass to never break parent operations"
patterns-established:
- "Audit event pattern: try/except-wrapped log_action calls at success points in NATS subscribers and API endpoints"
requirements-completed: [OBS-01, OBS-02]
# Metrics
duration: 3min
completed: 2026-03-13
---
# Phase 10 Plan 01: Config Backup Audit Events Summary
**Four audit event types (created, skipped_duplicate, diff_generated, manual_trigger) wired into config backup operations with try/except safety and 4 passing tests**
## Performance
- **Duration:** 3 min
- **Started:** 2026-03-13T04:43:11Z
- **Completed:** 2026-03-13T04:46:04Z
- **Tasks:** 2
- **Files modified:** 4
## Accomplishments
- Added audit logging to all 4 config backup operations: snapshot creation, deduplication skip, diff generation, and manual backup trigger
- All log_action calls follow project pattern: try/except wrapped, fire-and-forget, with tenant_id, device_id, action, resource_type, and details
- 4 new tests verify correct audit action strings are emitted, all 17 tests pass (4 new + 13 existing)
## Task Commits
Each task was committed atomically:
1. **Task 1: Add audit event emission to snapshot subscriber, diff service, and backup trigger endpoint** - `1a1ceb2` (feat)
2. **Task 2: Add tests verifying audit events are emitted** - `fb91fed` (test)
## Files Created/Modified
- `backend/app/services/config_snapshot_subscriber.py` - Added config_snapshot_created and config_snapshot_skipped_duplicate audit events
- `backend/app/services/config_diff_service.py` - Added config_diff_generated audit event after diff INSERT
- `backend/app/routers/config_backups.py` - Added config_backup_manual_trigger audit event on manual trigger success
- `backend/tests/test_audit_config_backup.py` - 4 tests verifying all audit event types are emitted
## Decisions Made
- Module-level import of log_action in snapshot subscriber (no circular dependency risk since audit_service has no deps on snapshot subscriber)
- Inline import in diff service try block (consistent with existing best-effort pattern and avoids any potential circular import)
- Inline import in config_backups router try block (same pattern as diff service)
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- Audit trail complete for all config backup operations
- All existing tests continue to pass with the new audit imports
---
*Phase: 10-audit-observability*
*Completed: 2026-03-13*