chore: remove .planning from tracking (already in .gitignore)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 06:55:28 -05:00
parent ed3ad8eb17
commit 7af08276ea
25 changed files with 0 additions and 4680 deletions
--- a/.planning/phases/02-poller-config-collection/02-01-PLAN.md
+++ b/.planning/phases/02-poller-config-collection/02-01-PLAN.md
@@ -1,308 +0,0 @@
---
-phase: 02-poller-config-collection
-plan: 01
-type: execute
-wave: 1
-depends_on: []
-files_modified:
-  - poller/internal/device/ssh_executor.go
-  - poller/internal/device/ssh_executor_test.go
-  - poller/internal/device/normalize.go
-  - poller/internal/device/normalize_test.go
-  - poller/internal/config/config.go
-  - poller/internal/bus/publisher.go
-  - poller/internal/observability/metrics.go
-  - poller/internal/store/devices.go
-  - backend/alembic/versions/028_device_ssh_host_key.py
-autonomous: true
-requirements: [COLL-01, COLL-02, COLL-06]
-
-must_haves:
-  truths:
-    - "SSH executor can run a command on a RouterOS device and return stdout, stderr, exit code, duration, and typed errors"
-    - "Config output is normalized deterministically (timestamp stripped, whitespace trimmed, line endings unified, blank lines collapsed)"
-    - "SHA256 hash is computed on normalized output"
-    - "Config backup interval and concurrency are configurable via environment variables"
-    - "Host key fingerprint is stored on device record for TOFU verification"
-  artifacts:
-    - path: "poller/internal/device/ssh_executor.go"
-      provides: "RunCommand SSH executor with TOFU host key verification and typed errors"
-      exports: ["RunCommand", "CommandResult", "SSHError", "SSHErrorKind"]
-    - path: "poller/internal/device/normalize.go"
-      provides: "NormalizeConfig function and SHA256 hashing"
-      exports: ["NormalizeConfig", "HashConfig"]
-    - path: "poller/internal/device/ssh_executor_test.go"
-      provides: "Unit tests for SSH executor error classification"
-    - path: "poller/internal/device/normalize_test.go"
-      provides: "Unit tests for config normalization with edge cases"
-    - path: "poller/internal/config/config.go"
-      provides: "CONFIG_BACKUP_INTERVAL, CONFIG_BACKUP_MAX_CONCURRENT, CONFIG_BACKUP_COMMAND_TIMEOUT env vars"
-    - path: "poller/internal/bus/publisher.go"
-      provides: "ConfigSnapshotEvent type and PublishConfigSnapshot method, config.snapshot.create subject in stream"
-    - path: "poller/internal/store/devices.go"
-      provides: "SSHPort and SSHHostKeyFingerprint fields on Device struct, UpdateSSHHostKey method"
-    - path: "backend/alembic/versions/028_device_ssh_host_key.py"
-      provides: "Migration adding ssh_port, ssh_host_key_fingerprint columns to devices table"
-  key_links:
-    - from: "poller/internal/device/ssh_executor.go"
-      to: "poller/internal/store/devices.go"
-      via: "Uses Device.SSHPort and Device.SSHHostKeyFingerprint for connection"
-      pattern: "dev\\.SSHPort|dev\\.SSHHostKeyFingerprint"
-    - from: "poller/internal/device/normalize.go"
-      to: "poller/internal/bus/publisher.go"
-      via: "Normalized config text and SHA256 hash populate ConfigSnapshotEvent fields"
-      pattern: "NormalizeConfig|HashConfig"
---
-
-<objective>
-Build the reusable primitives for config backup collection: SSH command executor with TOFU host key verification, config output normalizer with SHA256 hashing, environment variable configuration, NATS event type, and device model extensions.
-
-Purpose: These are the building blocks that the backup scheduler (Plan 02) wires together. Each is independently testable and follows existing codebase patterns.
-Output: SSH executor module, normalization module, extended config/store/bus/metrics, Alembic migration for device SSH columns.
-</objective>
-
-<execution_context>
-@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
-@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
-</execution_context>
-
-<context>
-@.planning/PROJECT.md
-@.planning/ROADMAP.md
-@.planning/STATE.md
-@.planning/phases/02-poller-config-collection/02-CONTEXT.md
-@.planning/phases/01-database-schema/01-01-SUMMARY.md
-
-@poller/internal/device/sftp.go
-@poller/internal/bus/publisher.go
-@poller/internal/config/config.go
-@poller/internal/store/devices.go
-@poller/internal/observability/metrics.go
-@poller/internal/poller/scheduler.go
-@poller/go.mod
-
-<interfaces>
-<!-- Existing patterns the executor must follow -->
-
-From poller/internal/device/sftp.go:
-```go
-func NewSSHClient(ip string, port int, username, password string, timeout time.Duration) (*ssh.Client, error)
-// Uses ssh.InsecureIgnoreHostKey() — executor replaces this with TOFU callback
-```
-
-From poller/internal/store/devices.go:
-```go
-type Device struct {
-    ID                          string
-    TenantID                    string
-    IPAddress                   string
-    APIPort                     int
-    APISSLPort                  int
-    EncryptedCredentials        []byte
-    EncryptedCredentialsTransit *string
-    RouterOSVersion             *string
-    MajorVersion                *int
-    TLSMode                     string
-    CACertPEM                   *string
-}
-// SSHPort and SSHHostKeyFingerprint need to be added
-```
-
-From poller/internal/bus/publisher.go:
-```go
-type Publisher struct { nc *nats.Conn; js jetstream.JetStream }
-func (p *Publisher) PublishStatus(ctx context.Context, event DeviceStatusEvent) error
-// Follow this pattern for PublishConfigSnapshot
-// Stream subjects list needs "config.snapshot.>" added
-```
-
-From poller/internal/config/config.go:
-```go
-func Load() (*Config, error)
-// Uses getEnv(key, default) and getEnvInt(key, default) helpers
-```
-</interfaces>
-</context>
-
-<tasks>
-
-<task type="auto" tdd="true">
-  <name>Task 1: SSH executor, normalizer, and their tests</name>
-  <files>
-    poller/internal/device/ssh_executor.go,
-    poller/internal/device/ssh_executor_test.go,
-    poller/internal/device/normalize.go,
-    poller/internal/device/normalize_test.go
-  </files>
-  <behavior>
-    SSH Executor (ssh_executor_test.go):
-    - Test SSHErrorKind classification: given various ssh/net error types, classifySSHError returns correct kind (AuthFailed, HostKeyMismatch, Timeout, ConnectionRefused, Unknown)
-    - Test TOFU host key callback: when fingerprint is empty (first connect), callback accepts and returns fingerprint; when fingerprint matches, callback accepts; when fingerprint mismatches, callback rejects with HostKeyMismatch error
-    - Test CommandResult: verify struct fields (Stdout, Stderr, ExitCode, Duration, Error)
-
-    Normalizer (normalize_test.go):
-    - Test timestamp stripping: input with "# 2024/01/15 10:30:00 by RouterOS 7.x\n# software id = XXXX\n" strips only the timestamp line and following blank line, preserves software id comment
-    - Test line ending normalization: "\r\n" becomes "\n"
-    - Test trailing whitespace trimming: "  /ip address  \n" becomes "/ip address\n"
-    - Test blank line collapsing: three consecutive blank lines become one
-    - Test trailing newline: output always ends with exactly one "\n"
-    - Test comment preservation: lines starting with "# " that are NOT the timestamp header are preserved
-    - Test full normalization pipeline: realistic RouterOS export with all issues produces clean output
-    - Test HashConfig: returns lowercase hex SHA256 of the normalized string (64 chars)
-    - Test idempotency: NormalizeConfig(NormalizeConfig(input)) == NormalizeConfig(input)
-  </behavior>
-  <action>
-    Create `poller/internal/device/ssh_executor.go`:
-
-    1. Define types:
-       - `SSHErrorKind` string enum: `ErrAuthFailed`, `ErrHostKeyMismatch`, `ErrTimeout`, `ErrTruncatedOutput`, `ErrConnectionRefused`, `ErrUnknown`
-       - `SSHError` struct implementing `error`: `Kind SSHErrorKind`, `Err error`, `Message string`
-       - `CommandResult` struct: `Stdout string`, `Stderr string`, `ExitCode int`, `Duration time.Duration`
-
-    2. `RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)`:
-       - Returns (result, observedFingerprint, error)
-       - Build ssh.ClientConfig with password auth and custom HostKeyCallback for TOFU:
-         - If knownFingerprint == "": accept any key, compute and return SHA256 fingerprint
-         - If knownFingerprint matches: accept
-         - If knownFingerprint mismatches: reject with SSHError{Kind: ErrHostKeyMismatch}
-       - Fingerprint format: `SHA256:base64(sha256(publicKeyBytes))` (same as ssh-keygen)
-       - Dial with context-aware timeout
-       - Create session, run command via session.Run()
-       - Capture stdout/stderr via session.StdoutPipe/StderrPipe or CombinedOutput pattern
-       - Classify errors using `classifySSHError(err)` helper that inspects error strings and types
-       - Detect truncated output: if command times out mid-stream, return SSHError{Kind: ErrTruncatedOutput}
-
-    3. `classifySSHError(err error) SSHErrorKind`: inspect error for "unable to authenticate", "host key", "i/o timeout", "connection refused" patterns
-
-    Create `poller/internal/device/normalize.go`:
-
-    1. `NormalizeConfig(raw string) string`:
-       - Use regexp to strip timestamp header line matching `^# \d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} by RouterOS.*\n` and the blank line immediately following it
-       - Replace \r\n with \n (before other processing)
-       - Split into lines, trim trailing whitespace from each line
-       - Collapse consecutive blank lines (2+ empty lines become 1)
-       - Ensure single trailing newline
-       - Return normalized string
-
-    2. `HashConfig(normalized string) string`:
-       - Compute SHA256 of the normalized string bytes
-       - Return lowercase hex string (64 chars)
-
-    3. `const NormalizationVersion = 1` — for future tracking in NATS payload
-
-    Write tests FIRST (RED), then implement (GREEN). Tests for normalizer use table-driven test style matching Go conventions. SSH executor tests use mock/classification tests (no real SSH connection needed for unit tests).
-  </action>
-  <verify>
-    <automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/device/ -run "TestNormalize|TestHash|TestSSH|TestClassify|TestTOFU" -v -count=1</automated>
-  </verify>
-  <done>
-    - RunCommand function compiles with correct signature returning (CommandResult, fingerprint, error)
-    - SSHError type with Kind field covers all 6 error classifications
-    - TOFU host key callback accepts on first connect, validates on subsequent, rejects on mismatch
-    - NormalizeConfig strips timestamp, normalizes line endings, trims whitespace, collapses blanks, ensures trailing newline
-    - HashConfig returns 64-char lowercase hex SHA256
-    - All unit tests pass
-  </done>
-</task>
-
-<task type="auto">
-  <name>Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics</name>
-  <files>
-    poller/internal/config/config.go,
-    poller/internal/bus/publisher.go,
-    poller/internal/store/devices.go,
-    poller/internal/observability/metrics.go,
-    backend/alembic/versions/028_device_ssh_host_key.py
-  </files>
-  <action>
-    **1. Config env vars** (`config.go`):
-    Add three fields to the Config struct and load them in Load():
-    - `ConfigBackupIntervalSeconds int` — `getEnvInt("CONFIG_BACKUP_INTERVAL", 21600)` (6h = 21600s)
-    - `ConfigBackupMaxConcurrent int` — `getEnvInt("CONFIG_BACKUP_MAX_CONCURRENT", 10)`
-    - `ConfigBackupCommandTimeoutSeconds int` — `getEnvInt("CONFIG_BACKUP_COMMAND_TIMEOUT", 60)`
-
-    **2. NATS event type and publisher** (`publisher.go`):
-    - Add `ConfigSnapshotEvent` struct:
-      ```go
-      type ConfigSnapshotEvent struct {
-          DeviceID             string `json:"device_id"`
-          TenantID             string `json:"tenant_id"`
-          RouterOSVersion      string `json:"routeros_version,omitempty"`
-          CollectedAt          string `json:"collected_at"`          // RFC3339
-          SHA256Hash           string `json:"sha256_hash"`
-          ConfigText           string `json:"config_text"`
-          NormalizationVersion int    `json:"normalization_version"`
-      }
-      ```
-    - Add `PublishConfigSnapshot(ctx, event) error` method on Publisher following the exact pattern of PublishStatus/PublishMetrics
-    - Subject: `fmt.Sprintf("config.snapshot.create.%s", event.DeviceID)`
-    - Add `"config.snapshot.>"` to the DEVICE_EVENTS stream subjects list in `NewPublisher`
-
-    **3. Device model extensions** (`devices.go`):
-    - Add fields to Device struct: `SSHPort int`, `SSHHostKeyFingerprint *string`
-    - Update FetchDevices query to SELECT `COALESCE(d.ssh_port, 22)` and `d.ssh_host_key_fingerprint`
-    - Update GetDevice query similarly
-    - Update both Scan calls to include the new fields
-    - Add `UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error` method on DeviceStore:
-      ```go
-      const query = `UPDATE devices SET ssh_host_key_fingerprint = $1 WHERE id = $2`
-      ```
-      (This requires poller_user to have UPDATE on devices(ssh_host_key_fingerprint) — handled in migration)
-
-    **4. Alembic migration** (`028_device_ssh_host_key.py`):
-    Follow the raw SQL pattern from migration 027. Create migration that:
-    - `ALTER TABLE devices ADD COLUMN ssh_port INTEGER DEFAULT 22`
-    - `ALTER TABLE devices ADD COLUMN ssh_host_key_fingerprint TEXT`
-    - `ALTER TABLE devices ADD COLUMN ssh_host_key_first_seen TIMESTAMPTZ`
-    - `ALTER TABLE devices ADD COLUMN ssh_host_key_last_verified TIMESTAMPTZ`
-    - `GRANT UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices TO poller_user`
-    - Downgrade: `ALTER TABLE devices DROP COLUMN ssh_port, DROP COLUMN ssh_host_key_fingerprint, DROP COLUMN ssh_host_key_first_seen, DROP COLUMN ssh_host_key_last_verified`
-    - `REVOKE UPDATE (ssh_host_key_fingerprint, ssh_host_key_first_seen, ssh_host_key_last_verified) ON devices FROM poller_user`
-
-    **5. Prometheus metrics** (`metrics.go`):
-    Add config backup specific metrics:
-    - `ConfigBackupTotal` CounterVec with labels ["status"] — status: "success", "error", "skipped_offline", "skipped_auth_blocked", "skipped_hostkey_blocked"
-    - `ConfigBackupDuration` Histogram — buckets: [1, 5, 10, 30, 60, 120, 300]
-    - `ConfigBackupActive` Gauge — number of concurrent backup jobs running
-  </action>
-  <verify>
-    <automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./... && go vet ./... && go test ./internal/config/ -v -count=1</automated>
-  </verify>
-  <done>
-    - Config struct has 3 new backup config fields loading from env vars with correct defaults
-    - ConfigSnapshotEvent type exists with all required JSON fields
-    - PublishConfigSnapshot method exists following existing publisher pattern
-    - config.snapshot.> added to DEVICE_EVENTS stream subjects
-    - Device struct has SSHPort and SSHHostKeyFingerprint fields
-    - FetchDevices and GetDevice queries select and scan the new columns
-    - UpdateSSHHostKey method exists for TOFU fingerprint storage
-    - Alembic migration 028 adds ssh_port, ssh_host_key_fingerprint, timestamp columns with correct grants
-    - Three new Prometheus metrics registered for config backup observability
-    - All existing tests still pass, project compiles clean
-  </done>
-</task>
-
-</tasks>
-
-<verification>
-1. `cd poller && go build ./...` — entire project compiles
-2. `cd poller && go vet ./...` — no static analysis issues
-3. `cd poller && go test ./internal/device/ -v -count=1` — SSH executor and normalizer tests pass
-4. `cd poller && go test ./internal/config/ -v -count=1` — config tests pass
-5. Migration file exists at `backend/alembic/versions/028_device_ssh_host_key.py`
-</verification>
-
-<success_criteria>
- SSH executor RunCommand function exists with TOFU host key verification and typed error classification
- Config normalizer strips timestamps, normalizes whitespace, and computes SHA256 hashes deterministically
- All config backup environment variables load with correct defaults (6h interval, 10 concurrent, 60s timeout)
- ConfigSnapshotEvent and PublishConfigSnapshot are ready for the scheduler to use
- Device model includes SSH port and host key fingerprint fields
- Database migration ready to add SSH columns to devices table
- Prometheus metrics registered for backup collection observability
- All tests pass, project compiles clean
-</success_criteria>
-
-<output>
-After completion, create `.planning/phases/02-poller-config-collection/02-01-SUMMARY.md`
-</output>
--- a/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
+++ b/.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
@@ -1,128 +0,0 @@
---
-phase: 02-poller-config-collection
-plan: 01
-subsystem: poller
-tags: [ssh, tofu, routeros, config-normalization, sha256, nats, prometheus, alembic]
-
-requires:
-  - phase: 01-database-schema
-    provides: router_config_snapshots table for storing backup data
-provides:
-  - SSH command executor with TOFU host key verification and typed error classification
-  - Config normalizer with deterministic SHA256 hashing
-  - ConfigSnapshotEvent NATS event type and PublishConfigSnapshot method
-  - Config backup environment variables (interval, concurrency, timeout)
-  - Device model SSH fields (port, host key fingerprint) with UpdateSSHHostKey method
-  - Alembic migration 028 for devices table SSH columns
-  - Prometheus metrics for config backup observability
-affects: [02-02-backup-scheduler, 03-backend-subscriber]
-
-tech-stack:
-  added: []
-  patterns:
-    - "TOFU host key verification via SHA256 fingerprint comparison"
-    - "Config normalization pipeline: line endings, timestamp strip, whitespace trim, blank collapse"
-    - "SSH error classification into typed SSHErrorKind enum"
-
-key-files:
-  created:
-    - poller/internal/device/ssh_executor.go
-    - poller/internal/device/ssh_executor_test.go
-    - poller/internal/device/normalize.go
-    - poller/internal/device/normalize_test.go
-    - backend/alembic/versions/028_device_ssh_host_key.py
-  modified:
-    - poller/internal/config/config.go
-    - poller/internal/bus/publisher.go
-    - poller/internal/store/devices.go
-    - poller/internal/observability/metrics.go
-
-key-decisions:
-  - "TOFU fingerprint format matches ssh-keygen: SHA256:base64(sha256(pubkey))"
-  - "NormalizationVersion=1 constant included in NATS payloads for future re-processing"
-  - "UpdateSSHHostKey sets first_seen via COALESCE to preserve original observation time"
-
-patterns-established:
-  - "SSH error classification: classifySSHError inspects error strings for auth/hostkey/timeout/refused patterns"
-  - "Config normalization: version-tracked deterministic pipeline for RouterOS export output"
-
-requirements-completed: [COLL-01, COLL-02, COLL-06]
-
-duration: 5min
-completed: 2026-03-13
---
-
-# Phase 02 Plan 01: Config Backup Primitives Summary
-
-**SSH executor with TOFU host key verification, RouterOS config normalizer with SHA256 hashing, NATS snapshot event, and Alembic migration for device SSH columns**
-
-## Performance
-
- **Duration:** 5 min
- **Started:** 2026-03-13T01:43:33Z
- **Completed:** 2026-03-13T01:48:38Z
- **Tasks:** 2
- **Files modified:** 9
-
-## Accomplishments
- SSH RunCommand executor with context-aware dialing, TOFU host key callback, and 6-kind typed error classification
- Deterministic config normalizer: strips RouterOS timestamps, normalizes line endings, trims whitespace, collapses blanks, computes SHA256 hash
- 22 unit tests covering error classification, TOFU flows (first connect/match/mismatch), normalization edge cases, idempotency
- Config backup env vars, NATS ConfigSnapshotEvent, device model SSH extensions, migration 028, Prometheus metrics
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: SSH executor, normalizer, and their tests** - `f1abb75` (feat)
-2. **Task 2: Config env vars, NATS event type, device model extensions, Alembic migration, metrics** - `4ae39d2` (feat)
-
-_Note: Task 1 used TDD -- tests written first (RED), implementation second (GREEN)._
-
-## Files Created/Modified
- `poller/internal/device/ssh_executor.go` - RunCommand SSH executor with TOFU host key verification and typed errors
- `poller/internal/device/ssh_executor_test.go` - Unit tests for SSH error classification, TOFU callbacks, CommandResult
- `poller/internal/device/normalize.go` - NormalizeConfig and HashConfig for RouterOS export output
- `poller/internal/device/normalize_test.go` - Table-driven tests for normalization pipeline edge cases
- `poller/internal/config/config.go` - Added ConfigBackupIntervalSeconds, ConfigBackupMaxConcurrent, ConfigBackupCommandTimeoutSeconds
- `poller/internal/bus/publisher.go` - Added ConfigSnapshotEvent type, PublishConfigSnapshot method, config.snapshot.> stream subject
- `poller/internal/store/devices.go` - Added SSHPort/SSHHostKeyFingerprint fields, UpdateSSHHostKey method, updated queries
- `poller/internal/observability/metrics.go` - Added ConfigBackupTotal, ConfigBackupDuration, ConfigBackupActive metrics
- `backend/alembic/versions/028_device_ssh_host_key.py` - Migration adding ssh_port, ssh_host_key_fingerprint, timestamp columns
-
-## Decisions Made
- TOFU fingerprint format uses SHA256:base64(sha256(pubkey)) to match ssh-keygen output format
- NormalizationVersion=1 constant is included in NATS payloads so consumers can detect algorithm changes
- UpdateSSHHostKey uses COALESCE on ssh_host_key_first_seen to preserve original observation timestamp
-
-## Deviations from Plan
-
-### Auto-fixed Issues
-
-**1. [Rule 1 - Bug] Fixed test key generation approach**
- **Found during:** Task 1 (GREEN phase)
- **Issue:** Embedded OpenSSH PEM test key had padding errors ("ssh: padding not as expected")
- **Fix:** Switched to programmatic ed25519 key generation via crypto/ed25519.GenerateKey
- **Files modified:** poller/internal/device/ssh_executor_test.go
- **Verification:** All 22 tests pass
- **Committed in:** f1abb75 (Task 1 commit)
-
---
-
-**Total deviations:** 1 auto-fixed (1 bug)
-**Impact on plan:** Minimal -- test infrastructure fix only, no production code change.
-
-## Issues Encountered
-None beyond the test key generation fix documented above.
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- All primitives ready for Plan 02 (backup scheduler) to wire together
- SSH executor, normalizer, NATS event, device model, config, and metrics are independently tested and compilable
- Migration 028 ready to apply before deploying the backup scheduler
-
---
-*Phase: 02-poller-config-collection*
-*Completed: 2026-03-13*
--- a/.planning/phases/02-poller-config-collection/02-02-PLAN.md
+++ b/.planning/phases/02-poller-config-collection/02-02-PLAN.md
@@ -1,394 +0,0 @@
---
-phase: 02-poller-config-collection
-plan: 02
-type: execute
-wave: 2
-depends_on: ["02-01"]
-files_modified:
-  - poller/internal/poller/backup_scheduler.go
-  - poller/internal/poller/backup_scheduler_test.go
-  - poller/internal/poller/interfaces.go
-  - poller/cmd/poller/main.go
-autonomous: true
-requirements: [COLL-01, COLL-03, COLL-05, COLL-06]
-
-must_haves:
-  truths:
-    - "Poller runs /export show-sensitive via SSH on each online RouterOS device at a configurable interval (default 6h)"
-    - "Poller publishes normalized config snapshot to NATS config.snapshot.create with device_id, tenant_id, sha256_hash, config_text"
-    - "Unreachable devices log a warning and are retried on the next interval without blocking other devices"
-    - "Backup interval is configurable via CONFIG_BACKUP_INTERVAL environment variable"
-    - "First backup runs with randomized jitter (30-300s) after device discovery"
-    - "Global concurrency is limited via CONFIG_BACKUP_MAX_CONCURRENT semaphore"
-    - "Auth failures and host key mismatches block retries until resolved"
-  artifacts:
-    - path: "poller/internal/poller/backup_scheduler.go"
-      provides: "BackupScheduler managing per-device backup goroutines with concurrency, retry, and NATS publishing"
-      exports: ["BackupScheduler", "NewBackupScheduler"]
-      min_lines: 200
-    - path: "poller/internal/poller/backup_scheduler_test.go"
-      provides: "Unit tests for backup scheduling, jitter, concurrency, error handling"
-    - path: "poller/internal/poller/interfaces.go"
-      provides: "SSHHostKeyUpdater interface for device store dependency"
-    - path: "poller/cmd/poller/main.go"
-      provides: "BackupScheduler initialization and lifecycle wiring"
-  key_links:
-    - from: "poller/internal/poller/backup_scheduler.go"
-      to: "poller/internal/device/ssh_executor.go"
-      via: "Calls device.RunCommand to execute /export show-sensitive"
-      pattern: "device\\.RunCommand"
-    - from: "poller/internal/poller/backup_scheduler.go"
-      to: "poller/internal/device/normalize.go"
-      via: "Calls device.NormalizeConfig and device.HashConfig on SSH output"
-      pattern: "device\\.NormalizeConfig|device\\.HashConfig"
-    - from: "poller/internal/poller/backup_scheduler.go"
-      to: "poller/internal/bus/publisher.go"
-      via: "Calls publisher.PublishConfigSnapshot with ConfigSnapshotEvent"
-      pattern: "publisher\\.PublishConfigSnapshot|bus\\.ConfigSnapshotEvent"
-    - from: "poller/internal/poller/backup_scheduler.go"
-      to: "poller/internal/store/devices.go"
-      via: "Calls store.UpdateSSHHostKey for TOFU fingerprint storage"
-      pattern: "UpdateSSHHostKey"
-    - from: "poller/cmd/poller/main.go"
-      to: "poller/internal/poller/backup_scheduler.go"
-      via: "Creates and starts BackupScheduler in main goroutine lifecycle"
-      pattern: "NewBackupScheduler|backupScheduler\\.Run"
---
-
-<objective>
-Build the backup scheduler that orchestrates periodic SSH config collection from RouterOS devices, normalizes output, and publishes to NATS. Wire it into the poller's main lifecycle.
-
-Purpose: This is the core orchestration that ties together the SSH executor, normalizer, and NATS publisher from Plan 01 into a running backup collection system with proper scheduling, concurrency control, error handling, and retry logic.
-Output: BackupScheduler module fully integrated into the poller's main.go lifecycle.
-</objective>
-
-<execution_context>
-@/Users/jasonstaack/.claude/get-shit-done/workflows/execute-plan.md
-@/Users/jasonstaack/.claude/get-shit-done/templates/summary.md
-</execution_context>
-
-<context>
-@.planning/PROJECT.md
-@.planning/ROADMAP.md
-@.planning/STATE.md
-@.planning/phases/02-poller-config-collection/02-CONTEXT.md
-@.planning/phases/02-poller-config-collection/02-01-SUMMARY.md
-
-@poller/internal/poller/scheduler.go
-@poller/internal/poller/worker.go
-@poller/internal/poller/interfaces.go
-@poller/cmd/poller/main.go
-@poller/internal/device/ssh_executor.go
-@poller/internal/device/normalize.go
-@poller/internal/bus/publisher.go
-@poller/internal/config/config.go
-@poller/internal/store/devices.go
-@poller/internal/observability/metrics.go
-
-<interfaces>
-<!-- From Plan 01 outputs (executor and normalizer) -->
-
-From poller/internal/device/ssh_executor.go (created in Plan 01):
-```go
-type SSHErrorKind string
-const (
-    ErrAuthFailed       SSHErrorKind = "auth_failed"
-    ErrHostKeyMismatch  SSHErrorKind = "host_key_mismatch"
-    ErrTimeout          SSHErrorKind = "timeout"
-    ErrTruncatedOutput  SSHErrorKind = "truncated_output"
-    ErrConnectionRefused SSHErrorKind = "connection_refused"
-    ErrUnknown          SSHErrorKind = "unknown"
-)
-
-type SSHError struct { Kind SSHErrorKind; Err error; Message string }
-type CommandResult struct { Stdout string; Stderr string; ExitCode int; Duration time.Duration }
-
-func RunCommand(ctx context.Context, ip string, port int, username, password string, timeout time.Duration, knownFingerprint string, command string) (*CommandResult, string, error)
-```
-
-From poller/internal/device/normalize.go (created in Plan 01):
-```go
-func NormalizeConfig(raw string) string
-func HashConfig(normalized string) string
-const NormalizationVersion = 1
-```
-
-From poller/internal/bus/publisher.go (modified in Plan 01):
-```go
-type ConfigSnapshotEvent struct {
-    DeviceID             string `json:"device_id"`
-    TenantID             string `json:"tenant_id"`
-    RouterOSVersion      string `json:"routeros_version,omitempty"`
-    CollectedAt          string `json:"collected_at"`
-    SHA256Hash           string `json:"sha256_hash"`
-    ConfigText           string `json:"config_text"`
-    NormalizationVersion int    `json:"normalization_version"`
-}
-func (p *Publisher) PublishConfigSnapshot(ctx context.Context, event ConfigSnapshotEvent) error
-```
-
-From poller/internal/store/devices.go (modified in Plan 01):
-```go
-type Device struct {
-    // ... existing fields ...
-    SSHPort                int
-    SSHHostKeyFingerprint  *string
-}
-func (s *DeviceStore) UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
-```
-
-From poller/internal/config/config.go (modified in Plan 01):
-```go
-type Config struct {
-    // ... existing fields ...
-    ConfigBackupIntervalSeconds      int
-    ConfigBackupMaxConcurrent        int
-    ConfigBackupCommandTimeoutSeconds int
-}
-```
-
-From poller/internal/observability/metrics.go (modified in Plan 01):
-```go
-var ConfigBackupTotal    *prometheus.CounterVec  // labels: ["status"]
-var ConfigBackupDuration prometheus.Histogram
-var ConfigBackupActive   prometheus.Gauge
-```
-
-<!-- Existing patterns to follow -->
-
-From poller/internal/poller/scheduler.go:
-```go
-type Scheduler struct { ... }
-func NewScheduler(...) *Scheduler
-func (s *Scheduler) Run(ctx context.Context) error
-func (s *Scheduler) reconcileDevices(ctx context.Context, wg *sync.WaitGroup) error
-func (s *Scheduler) runDeviceLoop(ctx context.Context, dev store.Device, ds *deviceState) // per-device goroutine with ticker
-```
-
-From poller/internal/poller/interfaces.go:
-```go
-type DeviceFetcher interface {
-    FetchDevices(ctx context.Context) ([]store.Device, error)
-}
-```
-</interfaces>
-</context>
-
-<tasks>
-
-<task type="auto" tdd="true">
-  <name>Task 1: BackupScheduler with per-device goroutines, concurrency control, and retry logic</name>
-  <files>
-    poller/internal/poller/backup_scheduler.go,
-    poller/internal/poller/backup_scheduler_test.go,
-    poller/internal/poller/interfaces.go
-  </files>
-  <behavior>
-    - Test jitter generation: randomJitter(30, 300) returns value in [30s, 300s] range
-    - Test backoff sequence: given consecutive failures, backoff returns 5m, 15m, 1h, then caps at 1h
-    - Test auth failure blocking: when last error is ErrAuthFailed, shouldRetry returns false
-    - Test host key mismatch blocking: when last error is ErrHostKeyMismatch, shouldRetry returns false
-    - Test online-only gating: backup is skipped for devices not currently marked online
-    - Test concurrency semaphore: when semaphore is full, backup waits (does not drop)
-  </behavior>
-  <action>
-    **1. Update interfaces.go:**
-    Add `SSHHostKeyUpdater` interface (consumer-side, Go best practice):
-    ```go
-    type SSHHostKeyUpdater interface {
-        UpdateSSHHostKey(ctx context.Context, deviceID string, fingerprint string) error
-    }
-    ```
-
-    **2. Create backup_scheduler.go:**
-
-    Define `backupDeviceState` struct tracking per-device backup state:
-    - `cancel context.CancelFunc`
-    - `lastAttemptAt time.Time`
-    - `lastSuccessAt time.Time`
-    - `lastStatus string` — "success", "error", "skipped_offline", "auth_blocked", "hostkey_blocked"
-    - `lastError string`
-    - `consecutiveFailures int`
-    - `backoffUntil time.Time`
-    - `lastErrorKind device.SSHErrorKind` — tracks whether error is auth/hostkey (blocks retry)
-
-    Define `BackupScheduler` struct:
-    - `store DeviceFetcher` — reuse existing interface for FetchDevices
-    - `hostKeyStore SSHHostKeyUpdater` — for UpdateSSHHostKey
-    - `locker *redislock.Client` — per-device distributed lock
-    - `publisher *bus.Publisher` — for NATS publishing
-    - `credentialCache *vault.CredentialCache` — for decrypting device SSH creds
-    - `redisClient *redis.Client` — for tracking device online status
-    - `backupInterval time.Duration`
-    - `commandTimeout time.Duration`
-    - `refreshPeriod time.Duration` — how often to reconcile devices (reuse from existing scheduler, e.g., 60s)
-    - `semaphore chan struct{}` — buffered channel of size maxConcurrent
-    - `mu sync.Mutex`
-    - `activeDevices map[string]*backupDeviceState`
-
-    `NewBackupScheduler(...)` constructor — accept all dependencies, create semaphore as `make(chan struct{}, maxConcurrent)`.
-
-    `Run(ctx context.Context) error` — mirrors existing Scheduler.Run pattern:
-    - defer shutdown: cancel all device goroutines, wait for WaitGroup
-    - Loop: reconcileBackupDevices(ctx, &wg), then select on ctx.Done or time.After(refreshPeriod)
-
-    `reconcileBackupDevices(ctx, wg)` — mirrors reconcileDevices:
-    - FetchDevices from store
-    - Start backup goroutines for new devices
-    - Stop goroutines for removed devices
-
-    `runBackupLoop(ctx, dev, state)` — per-device backup goroutine:
-    - On first run: sleep for randomJitter(30, 300) seconds, then do initial backup
-    - After initial: ticker at backupInterval
-    - On each tick:
-      a. Check if device is online via Redis key `device:{id}:status` (set by status poll). If not online, log debug "skipped_offline", update state, increment ConfigBackupTotal("skipped_offline"), continue
-      b. Check if lastErrorKind is ErrAuthFailed — skip with "skipped_auth_blocked", log warning with guidance to update credentials
-      c. Check if lastErrorKind is ErrHostKeyMismatch — skip with "skipped_hostkey_blocked", log warning with guidance to reset host key
-      d. Check backoff: if time.Now().Before(state.backoffUntil), skip
-      e. Acquire semaphore (blocks if at max concurrency, does not drop)
-      f. Acquire Redis lock `backup:device:{id}` with TTL = commandTimeout + 30s
-      g. Call `collectAndPublish(ctx, dev, state)`
-      h. Release semaphore
-      i. Update state based on result
-
-    `collectAndPublish(ctx, dev, state) error`:
-    - Increment ConfigBackupActive gauge
-    - Defer decrement ConfigBackupActive gauge
-    - Start timer for ConfigBackupDuration
-    - Decrypt credentials via credentialCache.GetCredentials
-    - Call `device.RunCommand(ctx, dev.IPAddress, dev.SSHPort, username, password, commandTimeout, knownFingerprint, "/export show-sensitive")`
-    - On error: classify error kind, update state, apply backoff (transient: 5m/15m/1h exponential; auth/hostkey: block), return
-    - If new fingerprint returned (TOFU first connect): call hostKeyStore.UpdateSSHHostKey
-    - Validate output is non-empty and looks like RouterOS config (basic sanity: contains "/")
-    - Call `device.NormalizeConfig(result.Stdout)`
-    - Call `device.HashConfig(normalized)`
-    - Build `bus.ConfigSnapshotEvent` with device_id, tenant_id, routeros_version (from device or Redis), collected_at (RFC3339 now), sha256_hash, config_text, normalization_version
-    - Call `publisher.PublishConfigSnapshot(ctx, event)`
-    - On success: reset consecutiveFailures, update lastSuccessAt, increment ConfigBackupTotal("success")
-    - Record ConfigBackupDuration
-
-    `randomJitter(minSeconds, maxSeconds int) time.Duration` — uses math/rand for uniform distribution
-
-    Backoff for transient errors: `calculateBackupBackoff(failures int) time.Duration`:
-    - 1 failure: 5 min
-    - 2 failures: 15 min
-    - 3+ failures: 1 hour (cap)
-
-    Device online check via Redis: check if key `device:{id}:status` equals "online". This key is set by the existing status poll publisher flow. If key doesn't exist, assume device might be online (first poll hasn't happened yet) — allow backup attempt.
-
-    RouterOS version: read from the Device struct's RouterOSVersion field (populated by store query). If nil, use empty string in the event.
-
-    **Important implementation notes:**
-    - Use `log/slog` for all logging (structured JSON, matching existing pattern)
-    - Use existing `redislock` pattern from worker.go for per-device locking
-    - Semaphore pattern: `s.semaphore <- struct{}{}` to acquire, `<-s.semaphore` to release
-    - Do NOT share circuit breaker state with the status poll scheduler — these are independent
-    - Partial/truncated output (SSHError with Kind ErrTruncatedOutput) is treated as transient error — never publish, apply backoff
-  </action>
-  <verify>
-    <automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go test ./internal/poller/ -run "TestBackup|TestJitter|TestBackoff|TestShouldRetry" -v -count=1</automated>
-  </verify>
-  <done>
-    - BackupScheduler manages per-device backup goroutines independently from status poll scheduler
-    - First backup uses 30-300s random jitter delay
-    - Concurrency limited by buffered channel semaphore (default 10)
-    - Per-device Redis lock prevents duplicate backups across pods
-    - Auth failures and host key mismatches block retries with clear log messages
-    - Transient errors use 5m/15m/1h exponential backoff
-    - Offline devices are skipped without error
-    - Successful backups normalize config, compute SHA256, and publish to NATS
-    - TOFU fingerprint stored on first successful connection
-    - All unit tests pass
-  </done>
-</task>
-
-<task type="auto">
-  <name>Task 2: Wire BackupScheduler into main.go lifecycle</name>
-  <files>poller/cmd/poller/main.go</files>
-  <action>
-    Add BackupScheduler initialization and startup to main.go, following the existing pattern of scheduler initialization (lines 250-278).
-
-    After the existing scheduler creation (around line 270), add a new section:
-
-    ```
-    // -----------------------------------------------------------------------
-    // Start the config backup scheduler
-    // -----------------------------------------------------------------------
-    ```
-
-    1. Convert config values to durations:
-       ```go
-       backupInterval := time.Duration(cfg.ConfigBackupIntervalSeconds) * time.Second
-       backupCmdTimeout := time.Duration(cfg.ConfigBackupCommandTimeoutSeconds) * time.Second
-       ```
-
-    2. Create BackupScheduler:
-       ```go
-       backupScheduler := poller.NewBackupScheduler(
-           deviceStore,
-           deviceStore,        // SSHHostKeyUpdater (DeviceStore satisfies this interface)
-           locker,
-           publisher,
-           credentialCache,
-           redisClient,
-           backupInterval,
-           backupCmdTimeout,
-           refreshPeriod,      // reuse existing device refresh period
-           cfg.ConfigBackupMaxConcurrent,
-       )
-       ```
-
-    3. Start in a goroutine (runs parallel with the main status poll scheduler):
-       ```go
-       go func() {
-           slog.Info("starting config backup scheduler",
-               "interval", backupInterval,
-               "max_concurrent", cfg.ConfigBackupMaxConcurrent,
-               "command_timeout", backupCmdTimeout,
-           )
-           if err := backupScheduler.Run(ctx); err != nil {
-               slog.Error("backup scheduler exited with error", "error", err)
-           }
-       }()
-       ```
-
-    The BackupScheduler shares the same ctx as everything else, so SIGINT/SIGTERM will trigger its shutdown via context cancellation. No additional shutdown logic needed — Run() returns when ctx is cancelled.
-
-    Log the startup with the same pattern as the existing scheduler startup log (line 273-276).
-  </action>
-  <verify>
-    <automated>cd /Volumes/ssd01/v9/the-other-dude/poller && go build ./cmd/poller/ && echo "build successful"</automated>
-  </verify>
-  <done>
-    - BackupScheduler created in main.go with all dependencies injected
-    - Runs as a goroutine parallel to the status poll scheduler
-    - Shares the same context for graceful shutdown
-    - Startup logged with interval, max_concurrent, and command_timeout
-    - Poller binary compiles successfully with the new scheduler wired in
-  </done>
-</task>
-
-</tasks>
-
-<verification>
-1. `cd poller && go build ./cmd/poller/` — binary compiles with backup scheduler wired in
-2. `cd poller && go vet ./...` — no static analysis issues
-3. `cd poller && go test ./internal/poller/ -v -count=1` — all poller tests pass (existing + new backup tests)
-4. `cd poller && go test ./... -count=1` — full test suite passes
-</verification>
-
-<success_criteria>
- BackupScheduler runs independently from status poll scheduler with its own per-device goroutines
- Devices get their first backup 30-300s after discovery, then every CONFIG_BACKUP_INTERVAL
- SSH command execution uses TOFU host key verification and stores fingerprints on first connect
- Config output is normalized, hashed, and published to NATS config.snapshot.create
- Concurrency limited to CONFIG_BACKUP_MAX_CONCURRENT parallel SSH sessions
- Auth/hostkey errors block retries; transient errors use exponential backoff (5m/15m/1h)
- Offline devices are skipped gracefully
- BackupScheduler is wired into main.go and starts/stops with the poller lifecycle
- All tests pass, project compiles clean
-</success_criteria>
-
-<output>
-After completion, create `.planning/phases/02-poller-config-collection/02-02-SUMMARY.md`
-</output>
--- a/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md
+++ b/.planning/phases/02-poller-config-collection/02-02-SUMMARY.md
@@ -1,100 +0,0 @@
---
-phase: 02-poller-config-collection
-plan: 02
-subsystem: poller
-tags: [ssh, backup, scheduler, nats, routeros, concurrency, tofu, redis]
-
-requires:
-  - phase: 02-poller-config-collection/01
-    provides: SSH executor, config normalizer, NATS ConfigSnapshotEvent, Prometheus metrics, config fields
-provides:
-  - BackupScheduler with per-device goroutines managing periodic SSH config collection
-  - Concurrency-limited config backup pipeline (SSH -> normalize -> hash -> NATS publish)
-  - TOFU host key verification with persistent fingerprint storage
-  - Auth/hostkey error blocking with transient error exponential backoff
-  - SSHHostKeyUpdater consumer-side interface
-affects: [03-backend-snapshot-consumer, api, poller]
-
-tech-stack:
-  added: []
-  patterns: [per-device goroutine lifecycle, buffered channel semaphore, Redis online gating]
-
-key-files:
-  created:
-    - poller/internal/poller/backup_scheduler.go
-    - poller/internal/poller/backup_scheduler_test.go
-  modified:
-    - poller/internal/poller/interfaces.go
-    - poller/cmd/poller/main.go
-
-key-decisions:
-  - "BackupScheduler runs independently from status poll scheduler with separate goroutines"
-  - "Semaphore uses buffered channel pattern matching existing codebase style"
-  - "Device with no Redis status key assumed potentially online (first poll not yet completed)"
-
-patterns-established:
-  - "Backup goroutine pattern: jitter -> initial backup -> ticker loop with gating checks"
-  - "Error classification: auth/hostkey block retries, transient errors use exponential backoff"
-
-requirements-completed: [COLL-01, COLL-03, COLL-05, COLL-06]
-
-duration: 4min
-completed: 2026-03-13
---
-
-# Phase 2 Plan 2: Backup Scheduler Summary
-
-**BackupScheduler orchestrating periodic SSH config collection with per-device goroutines, concurrency semaphore, TOFU verification, and NATS publishing**
-
-## Performance
-
- **Duration:** 4 min
- **Started:** 2026-03-13T01:51:27Z
- **Completed:** 2026-03-13T01:55:37Z
- **Tasks:** 2
- **Files modified:** 4
-
-## Accomplishments
- BackupScheduler manages per-device backup goroutines with 30-300s initial jitter
- Concurrency limited by configurable buffered channel semaphore (default 10)
- Auth failures and host key mismatches permanently block retries with clear log warnings
- Transient errors use stepped backoff (5m/15m/1h cap)
- Full pipeline wired into main.go running parallel to existing status poll scheduler
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: BackupScheduler with per-device goroutines** - `a884b09` (test) + `2653a32` (feat) -- TDD red/green
-2. **Task 2: Wire BackupScheduler into main.go** - `d34817a` (feat)
-
-## Files Created/Modified
- `poller/internal/poller/backup_scheduler.go` - BackupScheduler with per-device goroutines, concurrency control, SSH collection, NATS publishing
- `poller/internal/poller/backup_scheduler_test.go` - Unit tests for jitter, backoff, retry blocking, online gating, semaphore, reconciliation
- `poller/internal/poller/interfaces.go` - Added SSHHostKeyUpdater consumer-side interface
- `poller/cmd/poller/main.go` - BackupScheduler initialization and goroutine startup
-
-## Decisions Made
- BackupScheduler runs independently from status poll scheduler -- separate goroutine pool, no shared state
- Semaphore uses buffered channel pattern (consistent with Go idioms, no external deps)
- Devices with no Redis status key assumed potentially online to avoid blocking first backup
- Locker nil-check allows tests to run without Redis lock infrastructure
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- Config backup pipeline complete: SSH -> normalize -> hash -> NATS publish
- Backend snapshot consumer (Phase 3) can subscribe to config.snapshot.create.> to receive snapshots
- Pre-existing integration test failures in poller package (missing certificate_authorities table) are unrelated to this work
-
---
-*Phase: 02-poller-config-collection*
-*Completed: 2026-03-13*
--- a/.planning/phases/03-snapshot-ingestion/03-01-SUMMARY.md
+++ b/.planning/phases/03-snapshot-ingestion/03-01-SUMMARY.md
@@ -1,108 +0,0 @@
---
-phase: 03-snapshot-ingestion
-plan: 01
-subsystem: api
-tags: [nats, jetstream, openbao, transit, encryption, postgresql, prometheus, dedup]
-
-# Dependency graph
-requires:
-  - phase: 01-database-schema
-    provides: RouterConfigSnapshot model and router_config_snapshots table
-  - phase: 02-poller-config-collection
-    provides: Go poller publishes config.snapshot.> NATS messages
-provides:
-  - NATS subscriber consuming config.snapshot.> messages
-  - SHA256 dedup preventing duplicate snapshot storage
-  - OpenBao Transit encryption of config text before INSERT
-  - Prometheus metrics for ingestion monitoring
-affects: [04-diff-engine, snapshot-api, config-timeline]
-
-# Tech tracking
-tech-stack:
-  added: [prometheus_client]
-  patterns: [nats-subscriber-with-dedup, transit-encrypt-before-insert]
-
-key-files:
-  created:
-    - backend/app/services/config_snapshot_subscriber.py
-    - backend/tests/test_config_snapshot_subscriber.py
-  modified:
-    - backend/app/main.py
-
-key-decisions:
-  - "Trust poller-provided SHA256 hash (no recompute on backend)"
-  - "Raw SQL for dedup SELECT and INSERT (consistent with nats_subscriber.py pattern)"
-  - "OpenBao Transit service instantiated per-message with close() for connection hygiene"
-
-patterns-established:
-  - "Config snapshot ingestion: dedup by SHA256 -> encrypt -> INSERT -> ack"
-  - "Transit failure causes nak (NATS retry), plaintext never stored as fallback"
-
-requirements-completed: [STOR-02]
-
-# Metrics
-duration: 4min
-completed: 2026-03-13
---
-
-# Phase 3 Plan 1: Config Snapshot Subscriber Summary
-
-**NATS subscriber ingesting config snapshots with SHA256 dedup, OpenBao Transit encryption, and Prometheus metrics**
-
-## Performance
-
- **Duration:** 4 min
- **Started:** 2026-03-13T02:44:01Z
- **Completed:** 2026-03-13T02:48:08Z
- **Tasks:** 2
- **Files modified:** 3
-
-## Accomplishments
- NATS subscriber consuming config.snapshot.> on DEVICE_EVENTS stream with durable consumer
- SHA256 dedup: duplicate snapshots silently skipped at debug level with Prometheus counter
- OpenBao Transit encryption: plaintext never stored in PostgreSQL, Transit failure causes nak
- Malformed and orphan device messages acked and discarded safely with warning logs
- 6 unit tests covering all handler paths (new, duplicate, encrypt fail, malformed, orphan, first)
- Wired into main.py lifespan with non-fatal startup pattern
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1 (RED): Failing tests** - `9d82741` (test)
-2. **Task 1 (GREEN): Config snapshot subscriber** - `3ab9f27` (feat)
-3. **Task 2: Wire into main.py lifespan** - `0db0641` (feat)
-
-_TDD task had RED + GREEN commits_
-
-## Files Created/Modified
- `backend/app/services/config_snapshot_subscriber.py` - NATS subscriber with dedup, encryption, metrics
- `backend/tests/test_config_snapshot_subscriber.py` - 6 unit tests for all handler paths
- `backend/app/main.py` - Lifespan wiring for start/stop
-
-## Decisions Made
- Trust poller-provided SHA256 hash (no recompute on backend) -- per project decision
- Raw SQL for dedup SELECT and INSERT -- consistent with existing nats_subscriber.py pattern
- OpenBao Transit service instantiated per-message with close() -- connection hygiene
- config_text never appears in any log statement -- contains passwords and keys
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-
-None.
-
-## User Setup Required
-
-None - no external service configuration required.
-
-## Next Phase Readiness
- Config snapshot subscriber ready to receive messages from Go poller
- RouterConfigSnapshot rows will be available for diff engine (Phase 4)
- Prometheus metrics exposed for monitoring ingestion rate and errors
-
---
-*Phase: 03-snapshot-ingestion*
-*Completed: 2026-03-13*
--- a/.planning/phases/04-manual-backup-trigger/04-01-SUMMARY.md
+++ b/.planning/phases/04-manual-backup-trigger/04-01-SUMMARY.md
@@ -1,115 +0,0 @@
---
-phase: 04-manual-backup-trigger
-plan: 01
-subsystem: api
-tags: [nats, request-reply, backup, ssh, go, fastapi]
-
-# Dependency graph
-requires:
-  - phase: 02-poller-config-collection
-    provides: BackupScheduler with SSH config collection pipeline
-  - phase: 03-snapshot-ingestion
-    provides: Config snapshot subscriber for NATS ingestion
-provides:
-  - BackupResponder NATS handler for manual config backup triggers
-  - POST /config-snapshot/trigger API endpoint for on-demand backups
-  - Public CollectAndPublish method on BackupScheduler returning sha256 hash
-  - BackupExecutor/BackupLocker/DeviceGetter interfaces for testability
-affects: [05-snapshot-list-api, 06-diff-api]
-
-# Tech tracking
-tech-stack:
-  added: [nats-server/v2 (test dependency)]
-  patterns: [interface-based dependency injection for NATS responders, in-process NATS server for Go unit tests]
-
-key-files:
-  created:
-    - poller/internal/bus/backup_responder.go
-    - poller/internal/bus/backup_responder_test.go
-    - poller/internal/bus/redis_locker.go
-    - backend/tests/test_config_snapshot_trigger.py
-  modified:
-    - poller/internal/poller/backup_scheduler.go
-    - poller/cmd/poller/main.go
-    - backend/app/routers/config_backups.py
-
-key-decisions:
-  - "Used interface-based DI (BackupExecutor, BackupLocker, DeviceGetter) for BackupResponder testability"
-  - "Refactored collectAndPublish to return (string, error) with public CollectAndPublish wrapper"
-  - "Used in-process nats-server/v2 for fast Go unit tests instead of testcontainers"
-  - "Reused routeros_proxy NATS connection for Python endpoint instead of separate connection"
-
-patterns-established:
-  - "BackupExecutor interface: abstracts backup pipeline for manual trigger callers"
-  - "In-process NATS test server: startTestNATS helper for Go bus package tests"
-
-requirements-completed: [COLL-04]
-
-# Metrics
-duration: 7min
-completed: 2026-03-13
---
-
-# Phase 4 Plan 1: Manual Backup Trigger Summary
-
-**NATS request-reply manual backup trigger with Go BackupResponder and Python API endpoint returning synchronous success/failure/hash**
-
-## Performance
-
- **Duration:** 7 min
- **Started:** 2026-03-13T03:03:57Z
- **Completed:** 2026-03-13T03:10:41Z
- **Tasks:** 2
- **Files modified:** 7
-
-## Accomplishments
- BackupResponder subscribes to config.backup.trigger (core NATS) and reuses BackupScheduler pipeline
- API endpoint POST /tenants/{tid}/devices/{did}/config-snapshot/trigger with operator role, 10/min rate limit
- Returns 201/409/502/504 with structured JSON including sha256 hash on success
- Per-device Redis lock prevents concurrent manual+scheduled backup collisions
- 12 total tests (6 Go, 6 Python) all passing
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Go BackupResponder with extracted collectAndPublish** - `9e102fd` (test: RED), `0851ece` (feat: GREEN)
-2. **Task 2: Python API endpoint for manual config snapshot trigger** - `0e66415` (test: RED), `00f0a8b` (feat: GREEN)
-
-_TDD tasks have separate test and implementation commits._
-
-## Files Created/Modified
- `poller/internal/bus/backup_responder.go` - NATS request-reply handler for manual backup triggers
- `poller/internal/bus/backup_responder_test.go` - 6 tests with in-process NATS server
- `poller/internal/bus/redis_locker.go` - RedisBackupLocker adapter implementing BackupLocker interface
- `poller/internal/poller/backup_scheduler.go` - Public CollectAndPublish method, returns (string, error)
- `poller/cmd/poller/main.go` - BackupResponder wired into lifecycle
- `backend/app/routers/config_backups.py` - New trigger_config_snapshot endpoint
- `backend/tests/test_config_snapshot_trigger.py` - 6 tests covering all response paths
-
-## Decisions Made
- Used interface-based dependency injection (BackupExecutor, BackupLocker, DeviceGetter) rather than direct struct dependencies for testability
- Refactored collectAndPublish to return hash string alongside error, enabling public CollectAndPublish wrapper
- Added nats-server/v2 as test dependency for fast in-process NATS testing instead of testcontainers
- Python tests use simulated handler logic to avoid import chain issues (rate_limit -> redis, auth -> bcrypt)
- Reused routeros_proxy NATS connection via _get_nats() import instead of duplicating lazy-init pattern
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
- Python test environment lacks redis and bcrypt packages, preventing direct import of app.routers.config_backups. Resolved by testing handler logic via simulation function that mirrors the endpoint implementation.
-
-## User Setup Required
-
-None - no external service configuration required.
-
-## Next Phase Readiness
- Manual backup trigger complete, ready for Phase 5 (snapshot list API)
- config.backup.trigger NATS subject uses core NATS (not JetStream), no stream config changes needed
- BackupExecutor interface available for any future caller needing programmatic backup triggers
-
---
-*Phase: 04-manual-backup-trigger*
-*Completed: 2026-03-13*
--- a/.planning/phases/05-diff-engine/05-01-SUMMARY.md
+++ b/.planning/phases/05-diff-engine/05-01-SUMMARY.md
@@ -1,115 +0,0 @@
---
-phase: 05-diff-engine
-plan: 01
-subsystem: api
-tags: [difflib, unified-diff, openbao, transit, prometheus, nats]
-
-requires:
-  - phase: 03-snapshot-ingestion
-    provides: "config snapshot subscriber and router_config_snapshots table"
-  - phase: 01-database-schema
-    provides: "router_config_diffs table schema"
-provides:
-  - "generate_and_store_diff() for unified diff between consecutive snapshots"
-  - "Prometheus metrics for diff generation success/failure/timing"
-  - "Subscriber integration calling diff after snapshot INSERT"
-affects: [06-change-parser, 07-timeline-api]
-
-tech-stack:
-  added: [difflib]
-  patterns: [best-effort-secondary-operation, tdd-red-green]
-
-key-files:
-  created:
-    - backend/app/services/config_diff_service.py
-    - backend/tests/test_config_diff_service.py
-  modified:
-    - backend/app/services/config_snapshot_subscriber.py
-    - backend/tests/test_config_snapshot_subscriber.py
-
-key-decisions:
-  - "Diff service instantiates its own OpenBaoTransitService per-call with close() for clean lifecycle"
-  - "RETURNING id added to snapshot INSERT to capture new_snapshot_id for diff generation"
-  - "Subscriber tests mock generate_and_store_diff to isolate snapshot logic from diff logic"
-
-patterns-established:
-  - "Best-effort secondary operations: wrap in try/except, log+count errors, never block primary flow"
-  - "Line counting excludes unified diff headers (+++ and --- lines)"
-
-requirements-completed: [DIFF-01, DIFF-02]
-
-duration: 3min
-completed: 2026-03-13
---
-
-# Phase 5 Plan 1: Config Diff Service Summary
-
-**Unified diff generation between consecutive config snapshots using difflib with Transit decrypt and best-effort error handling**
-
-## Performance
-
- **Duration:** 3 min
- **Started:** 2026-03-13T03:30:07Z
- **Completed:** 2026-03-13T03:33:Z
- **Tasks:** 2
- **Files modified:** 4
-
-## Accomplishments
- Config diff service generates unified diffs between consecutive snapshots per device
- Transit decrypt of both old and new ciphertext before diffing in memory
- Best-effort pattern: decrypt/DB failures logged and counted, never block snapshot ack
- Prometheus metrics track diff success, errors (by type), and generation duration
- Subscriber wired to call diff generation after every successful snapshot INSERT
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Diff generation service (TDD RED)** - `79453fa` (test)
-2. **Task 1: Diff generation service (TDD GREEN)** - `72d0ae2` (feat)
-3. **Task 2: Wire diff into subscriber** - `eb76343` (feat)
-
-_TDD task had separate RED and GREEN commits_
-
-## Files Created/Modified
- `backend/app/services/config_diff_service.py` - Diff generation with Transit decrypt, difflib, Prometheus metrics
- `backend/tests/test_config_diff_service.py` - 5 unit tests covering diff, first-snapshot, decrypt failure, line counts, empty diff
- `backend/app/services/config_snapshot_subscriber.py` - Added RETURNING id, generate_and_store_diff call after commit
- `backend/tests/test_config_snapshot_subscriber.py` - Updated to mock generate_and_store_diff
-
-## Decisions Made
- Diff service instantiates its own OpenBaoTransitService per-call (clean lifecycle, consistent with subscriber pattern)
- RETURNING id added to snapshot INSERT SQL to capture the new_snapshot_id without a separate query
- Subscriber tests mock generate_and_store_diff to keep snapshot tests isolated and unchanged in assertion counts
-
-## Deviations from Plan
-
-### Auto-fixed Issues
-
-**1. [Rule 1 - Bug] Updated subscriber test assertions for diff integration**
- **Found during:** Task 2 (wire diff into subscriber)
- **Issue:** Existing subscriber tests failed because generate_and_store_diff made additional DB calls through the shared mock session
- **Fix:** Added patch for generate_and_store_diff in subscriber tests that successfully INSERT (test 1 and test 6)
- **Files modified:** backend/tests/test_config_snapshot_subscriber.py
- **Verification:** All 11 tests pass
- **Committed in:** eb76343 (Task 2 commit)
-
---
-
-**Total deviations:** 1 auto-fixed (1 bug)
-**Impact on plan:** Necessary to maintain test isolation. No scope creep.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- Diff generation is active and will produce diffs for every new non-duplicate snapshot
- router_config_diffs table populated with diff_text, line counts, and snapshot references
- Ready for change parser (Phase 6) to parse semantic changes from diff_text
-
---
-*Phase: 05-diff-engine*
-*Completed: 2026-03-13*
--- a/.planning/phases/05-diff-engine/05-02-SUMMARY.md
+++ b/.planning/phases/05-diff-engine/05-02-SUMMARY.md
@@ -1,112 +0,0 @@
---
-phase: 05-diff-engine
-plan: 02
-subsystem: api
-tags: [parser, routeros, structured-changes, tdd]
-
-requires:
-  - phase: 05-diff-engine
-    plan: 01
-    provides: "generate_and_store_diff() and router_config_diffs table"
-provides:
-  - "parse_diff_changes() for extracting structured component changes from unified diffs"
-  - "router_config_changes rows linked to diff_id for timeline UI"
-affects: [07-timeline-api]
-
-tech-stack:
-  added: []
-  patterns: [tdd-red-green, best-effort-secondary-operation]
-
-key-files:
-  created:
-    - backend/app/services/config_change_parser.py
-    - backend/tests/test_config_change_parser.py
-  modified:
-    - backend/app/services/config_diff_service.py
-    - backend/tests/test_config_diff_service.py
-
-key-decisions:
-  - "Change parser is pure function (no DB/IO) for easy testing; DB writes happen in diff service"
-  - "RETURNING id added to diff INSERT to capture diff_id for linking changes"
-  - "Change parser errors are best-effort: diff is always stored, only changes are lost on parser failure"
-
-patterns-established:
-  - "RouterOS path to component: strip leading /, replace spaces with / (e.g., /ip firewall filter -> ip/firewall/filter)"
-  - "Fallback component system/general for diffs without RouterOS path headers"
-
-requirements-completed: [DIFF-03, DIFF-04]
-
-duration: 2min
-completed: 2026-03-13
---
-
-# Phase 5 Plan 2: Structured Change Parser Summary
-
-**RouterOS diff change parser extracting component names, human-readable summaries, and raw lines from unified diffs with best-effort DB storage**
-
-## Performance
-
- **Duration:** 2 min
- **Started:** 2026-03-13T03:34:48Z
- **Completed:** 2026-03-13T03:37:14Z
- **Tasks:** 2
- **Files modified:** 4
-
-## Accomplishments
- Pure-function change parser extracts component, summary, raw_line from RouterOS unified diffs
- RouterOS path detection converts section headers to component format (ip/firewall/filter)
- Human-readable summaries: Added/Removed/Modified N rules per component
- Diff service wired to call parser after INSERT and store results in router_config_changes
- Parser failures are best-effort: diff always stored, changes lost only on parser error
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Change parser TDD RED** - `7fddf35` (test)
-2. **Task 1: Change parser TDD GREEN** - `b167831` (feat)
-3. **Task 2: Wire parser into diff service** - `122b591` (feat)
-
-_TDD task had separate RED and GREEN commits_
-
-## Files Created/Modified
- `backend/app/services/config_change_parser.py` - Pure parser: parse_diff_changes() with path detection, summary generation, raw line capture
- `backend/tests/test_config_change_parser.py` - 6 unit tests covering additions, multi-section, removals, modifications, fallback, raw_line
- `backend/app/services/config_diff_service.py` - Added RETURNING id, parse_diff_changes integration, change INSERT loop
- `backend/tests/test_config_diff_service.py` - Updated existing tests for RETURNING id, added 2 tests for change storage and parser error resilience
-
-## Decisions Made
- Change parser is a pure function (no DB/IO) for straightforward unit testing; DB writes are the diff service's responsibility
- RETURNING id added to diff INSERT SQL to get diff_id without separate query
- Change parser errors caught by separate try/except so diff is always committed first
-
-## Deviations from Plan
-
-### Auto-fixed Issues
-
-**1. [Rule 1 - Bug] Updated existing diff service tests for RETURNING id and parse_diff_changes integration**
- **Found during:** Task 2
- **Issue:** Existing tests expected 3 execute calls without scalar_one on INSERT result; new RETURNING id and parse_diff_changes call changed the interaction pattern
- **Fix:** Added scalar_one mock to INSERT result, patched parse_diff_changes to return empty list in existing tests to isolate behavior
- **Files modified:** backend/tests/test_config_diff_service.py
- **Committed in:** 122b591
-
---
-
-**Total deviations:** 1 auto-fixed (1 bug)
-**Impact on plan:** Necessary test update for API change. No scope creep.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None
-
-## Next Phase Readiness
- router_config_changes table populated with structured changes for every non-empty diff
- Changes linked to diff_id, device_id, tenant_id for timeline queries
- Ready for timeline API (Phase 7) to query changes per device
-
---
-*Phase: 05-diff-engine*
-*Completed: 2026-03-13*
--- a/.planning/phases/06-history-api/06-01-SUMMARY.md
+++ b/.planning/phases/06-history-api/06-01-SUMMARY.md
@@ -1,95 +0,0 @@
---
-phase: 06-history-api
-plan: 01
-subsystem: api
-tags: [fastapi, sqlalchemy, pagination, timeline, rbac]
-
-# Dependency graph
-requires:
-  - phase: 05-diff-engine
-    provides: router_config_changes and router_config_diffs tables with parsed change data
-provides:
-  - GET /api/tenants/{tid}/devices/{did}/config-history endpoint
-  - get_config_history service function with pagination
-affects: [06-02, frontend-config-history]
-
-# Tech tracking
-tech-stack:
-  added: []
-  patterns: [raw SQL text() joins for timeline queries, same RBAC pattern as config_backups]
-
-key-files:
-  created:
-    - backend/app/services/config_history_service.py
-    - backend/app/routers/config_history.py
-    - backend/tests/test_config_history_service.py
-  modified:
-    - backend/app/main.py
-
-key-decisions:
-  - "Raw SQL text() for JOIN query consistent with config_diff_service.py pattern"
-  - "Pagination defaults: limit=50, offset=0 with validation (ge=1, le=200 for limit)"
-
-patterns-established:
-  - "Config history queries use JOIN between changes and diffs tables for timeline view"
-
-requirements-completed: [API-01, API-04]
-
-# Metrics
-duration: 2min
-completed: 2026-03-13
---
-
-# Phase 6 Plan 1: Config History Timeline Summary
-
-**GET /config-history endpoint returning paginated change timeline with component, summary, timestamp, and diff metadata via JOIN query**
-
-## Performance
-
- **Duration:** 2 min
- **Started:** 2026-03-13T03:58:03Z
- **Completed:** 2026-03-13T04:00:00Z
- **Tasks:** 2
- **Files modified:** 4
-
-## Accomplishments
- Config history service querying router_config_changes JOIN router_config_diffs for timeline entries
- REST endpoint with viewer+ RBAC and config:read scope enforcement
- 4 unit tests covering formatting, empty results, pagination, and ordering
- Router registered in main.py alongside existing config routers
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Config history service and tests (TDD)** - `f7d5aec` (feat)
-2. **Task 2: Config history router and main.py registration** - `5c56344` (feat)
-
-## Files Created/Modified
- `backend/app/services/config_history_service.py` - Query function for paginated config change timeline
- `backend/app/routers/config_history.py` - REST endpoint with RBAC, pagination query params
- `backend/tests/test_config_history_service.py` - 4 unit tests with AsyncMock sessions
- `backend/app/main.py` - Router import and registration
-
-## Decisions Made
- Used raw SQL text() for the JOIN query, consistent with config_diff_service.py pattern
- Pagination limit constrained to 1-200 via FastAPI Query validation
- Copied _check_tenant_access helper (same pattern as config_backups.py)
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- Config history timeline endpoint ready for frontend consumption
- Plan 06-02 can build on this for detailed diff view endpoints
-
---
-*Phase: 06-history-api*
-*Completed: 2026-03-13*
--- a/.planning/phases/06-history-api/06-02-SUMMARY.md
+++ b/.planning/phases/06-history-api/06-02-SUMMARY.md
@@ -1,95 +0,0 @@
---
-phase: 06-history-api
-plan: 02
-subsystem: api
-tags: [fastapi, sqlalchemy, openbao, transit-decrypt, rbac, snapshot]
-
-# Dependency graph
-requires:
-  - phase: 06-history-api
-    provides: config_history_service.py with get_config_history, config_history router with RBAC
-  - phase: 05-diff-engine
-    provides: router_config_diffs and router_config_snapshots tables with encrypted config data
-provides:
-  - GET /api/tenants/{tid}/devices/{did}/config/{snapshot_id} endpoint (decrypted snapshot)
-  - GET /api/tenants/{tid}/devices/{did}/config/{snapshot_id}/diff endpoint (unified diff)
-  - get_snapshot and get_snapshot_diff service functions
-affects: [frontend-config-history, frontend-diff-viewer]
-
-# Tech tracking
-tech-stack:
-  added: []
-  patterns: [Transit decrypt in service layer with try/finally close, 404 for missing snapshots/diffs]
-
-key-files:
-  created: []
-  modified:
-    - backend/app/services/config_history_service.py
-    - backend/app/routers/config_history.py
-    - backend/tests/test_config_history_service.py
-
-key-decisions:
-  - "Transit decrypt in get_snapshot with try/finally for clean openbao lifecycle"
-  - "500 error wrapping for Transit decrypt failures in router (not service)"
-
-patterns-established:
-  - "Snapshot retrieval filters by id + device_id + tenant_id for RLS-safe queries"
-
-requirements-completed: [API-02, API-03, API-04]
-
-# Metrics
-duration: 2min
-completed: 2026-03-13
---
-
-# Phase 6 Plan 2: Snapshot View and Diff Retrieval Summary
-
-**Snapshot view and diff retrieval endpoints with Transit decrypt for full config text and unified diff, enforcing viewer+ RBAC**
-
-## Performance
-
- **Duration:** 2 min
- **Started:** 2026-03-13T04:01:58Z
- **Completed:** 2026-03-13T04:03:39Z
- **Tasks:** 2
- **Files modified:** 3
-
-## Accomplishments
- get_snapshot function decrypts config via OpenBao Transit and returns plaintext with metadata
- get_snapshot_diff function queries diff by new_snapshot_id for a device/tenant
- Two new router endpoints with viewer+ RBAC and config:read scope enforcement
- 4 new tests (8 total) covering decrypted content, not-found, diff retrieval, and no-diff cases
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Snapshot and diff service functions with tests (TDD)** - `83cd661` (feat)
-2. **Task 2: Snapshot and diff router endpoints** - `af7007d` (feat)
-
-## Files Created/Modified
- `backend/app/services/config_history_service.py` - Added get_snapshot (Transit decrypt) and get_snapshot_diff query functions
- `backend/app/routers/config_history.py` - Two new GET endpoints with RBAC, 404/500 error handling
- `backend/tests/test_config_history_service.py` - 4 new tests with mocked Transit and DB sessions
-
-## Decisions Made
- Transit decrypt happens in service layer (get_snapshot), error wrapping in router layer (500 response)
- Query filters include device_id + tenant_id alongside snapshot_id for RLS-safe access
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- All 3 config history API endpoints complete (timeline, snapshot view, diff view)
- Phase 06 complete -- ready for frontend integration
-
---
-*Phase: 06-history-api*
-*Completed: 2026-03-13*
--- a/.planning/phases/07-config-history-ui/07-01-SUMMARY.md
+++ b/.planning/phases/07-config-history-ui/07-01-SUMMARY.md
@@ -1,89 +0,0 @@
---
-phase: 07-config-history-ui
-plan: 01
-subsystem: ui
-tags: [react, tanstack-query, timeline, config-history]
-
-requires:
-  - phase: 06-history-api
-    provides: GET /api/tenants/{tid}/devices/{did}/config-history endpoint
-provides:
-  - ConfigHistorySection component with timeline rendering
-  - configHistoryApi.list() API client function
-  - Configuration history visible on device detail overview tab
-affects: [07-config-history-ui]
-
-tech-stack:
-  added: []
-  patterns: [timeline component pattern matching BackupTimeline.tsx]
-
-key-files:
-  created:
-    - frontend/src/components/config/ConfigHistorySection.tsx
-  modified:
-    - frontend/src/lib/api.ts
-    - frontend/src/routes/_authenticated/tenants/$tenantId/devices/$deviceId.tsx
-
-key-decisions:
-  - "Reimplemented formatRelativeTime locally rather than extracting shared util (matches BackupTimeline pattern)"
-  - "Poll interval 60s via refetchInterval for near-real-time change visibility"
-
-patterns-established:
-  - "Config history timeline: vertical dot timeline with component badge, summary, line delta, relative time"
-
-requirements-completed: [UI-01, UI-02]
-
-duration: 3min
-completed: 2026-03-13
---
-
-# Phase 7 Plan 1: Config History UI Summary
-
-**ConfigHistorySection timeline component on device detail page, fetching change entries via TanStack Query with 60s polling**
-
-## Performance
-
- **Duration:** 3 min
- **Started:** 2026-03-13T04:11:08Z
- **Completed:** 2026-03-13T04:14:00Z
- **Tasks:** 2
- **Files modified:** 3
-
-## Accomplishments
- Added configHistoryApi.list() and ConfigChangeEntry interface to api.ts
- Created ConfigHistorySection with vertical timeline, loading skeleton, and empty state
- Wired component into device detail overview tab below Interface Utilization
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: API client and ConfigHistorySection component** - `6bd2451` (feat)
-2. **Task 2: Wire ConfigHistorySection into device detail page** - `36861ff` (feat)
-
-## Files Created/Modified
- `frontend/src/lib/api.ts` - Added ConfigChangeEntry interface and configHistoryApi.list()
- `frontend/src/components/config/ConfigHistorySection.tsx` - Timeline component with loading/empty/data states
- `frontend/src/routes/_authenticated/tenants/$tenantId/devices/$deviceId.tsx` - Import and render ConfigHistorySection
-
-## Decisions Made
- Reimplemented formatRelativeTime locally (same pattern as BackupTimeline.tsx) rather than extracting to shared util -- keeps components self-contained
- Used 60s refetchInterval for polling new config changes
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- Config history timeline renders on device overview tab
- Ready for any future detail/drill-down views on individual changes
-
---
-*Phase: 07-config-history-ui*
-*Completed: 2026-03-13*
--- a/.planning/phases/08-diff-viewer-download/08-01-SUMMARY.md
+++ b/.planning/phases/08-diff-viewer-download/08-01-SUMMARY.md
@@ -1,92 +0,0 @@
---
-phase: 08-diff-viewer-download
-plan: 01
-subsystem: ui
-tags: [react, diff-viewer, tanstack-query, tailwind]
-
-requires:
-  - phase: 07-config-history-ui
-    provides: ConfigHistorySection timeline component with ConfigChangeEntry data
-  - phase: 06-config-history-api
-    provides: GET /config/{snapshot_id}/diff endpoint returning DiffResponse
-provides:
-  - DiffViewer component with unified diff rendering (green/red line highlighting)
-  - configHistoryApi.getDiff() API client method
-  - Clickable timeline entries in ConfigHistorySection
-affects: [08-diff-viewer-download]
-
-tech-stack:
-  added: []
-  patterns: [inline diff viewer with line-level classification]
-
-key-files:
-  created:
-    - frontend/src/components/config/DiffViewer.tsx
-  modified:
-    - frontend/src/lib/api.ts
-    - frontend/src/components/config/ConfigHistorySection.tsx
-
-key-decisions:
-  - "DiffViewer rendered inline above timeline (not modal) for context preservation"
-  - "Line classification function for unified diff: +green, -red, @@blue, ---/+++ muted"
-
-patterns-established:
-  - "Inline viewer pattern: state-driven component rendered above list, closed via callback"
-
-requirements-completed: [UI-03]
-
-duration: 1min
-completed: 2026-03-13
---
-
-# Phase 8 Plan 1: Diff Viewer Summary
-
-**Inline diff viewer with green/red line highlighting, wired into clickable config history timeline entries**
-
-## Performance
-
- **Duration:** 1 min
- **Started:** 2026-03-13T04:19:53Z
- **Completed:** 2026-03-13T04:20:56Z
- **Tasks:** 2
- **Files modified:** 3
-
-## Accomplishments
- DiffViewer component renders unified diffs with color-coded lines (green additions, red removals, blue hunk headers)
- API client getDiff method fetches diff data from backend endpoint
- Timeline entries in ConfigHistorySection are clickable with hover states
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Add diff API client and create DiffViewer component** - `dda00fb` (feat)
-2. **Task 2: Wire DiffViewer into ConfigHistorySection timeline entries** - `2cf426f` (feat)
-
-## Files Created/Modified
- `frontend/src/components/config/DiffViewer.tsx` - Unified diff viewer with line-level color highlighting, loading skeleton, error state
- `frontend/src/lib/api.ts` - Added DiffResponse interface and configHistoryApi.getDiff() method
- `frontend/src/components/config/ConfigHistorySection.tsx` - Added click handlers, selectedSnapshotId state, inline DiffViewer rendering
-
-## Decisions Made
- Rendered DiffViewer inline above the timeline rather than in a modal, preserving context
- Used a classifyLine helper function for clean line-type detection (handles +++ and --- separately from + and -)
- Loading skeleton uses randomized widths for visual variety
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- Diff viewer complete, ready for config download functionality (plan 08-02)
- All TypeScript compiles cleanly
-
---
-*Phase: 08-diff-viewer-download*
-*Completed: 2026-03-13*
--- a/.planning/phases/09-retention-cleanup/09-01-SUMMARY.md
+++ b/.planning/phases/09-retention-cleanup/09-01-SUMMARY.md
@@ -1,98 +0,0 @@
---
-phase: 09-retention-cleanup
-plan: 01
-subsystem: database
-tags: [apscheduler, retention, postgresql, prometheus, cascade-delete]
-
-# Dependency graph
-requires:
-  - phase: 01-database-schema
-    provides: router_config_snapshots table with CASCADE FK constraints
-provides:
-  - Automatic retention cleanup of expired config snapshots
-  - CONFIG_RETENTION_DAYS env var for configurable retention period
-  - Prometheus metrics for cleanup observability
-affects: []
-
-# Tech tracking
-tech-stack:
-  added: []
-  patterns: [APScheduler IntervalTrigger for periodic maintenance jobs]
-
-key-files:
-  created:
-    - backend/app/services/retention_service.py
-    - backend/tests/test_retention_service.py
-  modified:
-    - backend/app/config.py
-    - backend/app/main.py
-
-key-decisions:
-  - "make_interval(days => :days) for parameterized PostgreSQL interval (no string concatenation)"
-  - "24h IntervalTrigger with 1h jitter to stagger cleanup across instances"
-  - "AdminAsyncSessionLocal (bypasses RLS) since retention is cross-tenant system operation"
-
-patterns-established:
-  - "IntervalTrigger pattern for periodic maintenance jobs (vs CronTrigger for scheduled backups)"
-
-requirements-completed: [STOR-03, STOR-04]
-
-# Metrics
-duration: 2min
-completed: 2026-03-13
---
-
-# Phase 9 Plan 1: Retention Cleanup Summary
-
-**Daily APScheduler job deletes config snapshots older than CONFIG_RETENTION_DAYS (default 90) with CASCADE FK cleanup of diffs and changes**
-
-## Performance
-
- **Duration:** 2 min
- **Started:** 2026-03-13T04:31:48Z
- **Completed:** 2026-03-13T04:34:12Z
- **Tasks:** 2
- **Files modified:** 4
-
-## Accomplishments
- Retention service with parameterized SQL DELETE using make_interval for safe interval binding
- APScheduler IntervalTrigger running every 24h with 1h jitter for stagger
- Prometheus counter and histogram for cleanup observability
- Wired into main.py lifespan with non-fatal startup pattern
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1 (RED): Add failing tests** - `00bdde9` (test)
-2. **Task 1 (GREEN): Implement retention service + config setting** - `a9f7a45` (feat)
-3. **Task 2: Wire retention scheduler into lifespan** - `4d62bc9` (feat)
-
-## Files Created/Modified
- `backend/app/services/retention_service.py` - Retention cleanup logic, scheduler, Prometheus metrics
- `backend/tests/test_retention_service.py` - 4 unit tests for cleanup function
- `backend/app/config.py` - Added CONFIG_RETENTION_DAYS setting (default 90)
- `backend/app/main.py` - Wired start/stop retention scheduler into lifespan
-
-## Decisions Made
- Used make_interval(days => :days) for parameterized PostgreSQL interval (avoids string concatenation SQL injection risk)
- 24h IntervalTrigger with 1h jitter to stagger cleanup across instances
- AdminAsyncSessionLocal bypasses RLS since retention is a cross-tenant system operation
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required. CONFIG_RETENTION_DAYS defaults to 90 if not set.
-
-## Next Phase Readiness
- Retention cleanup is fully operational, ready for phase 10
- No blockers
-
---
-*Phase: 09-retention-cleanup*
-*Completed: 2026-03-13*
--- a/.planning/phases/10-audit-observability/10-01-SUMMARY.md
+++ b/.planning/phases/10-audit-observability/10-01-SUMMARY.md
@@ -1,98 +0,0 @@
---
-phase: 10-audit-observability
-plan: 01
-subsystem: api
-tags: [audit, logging, config-backup, nats, observability]
-
-# Dependency graph
-requires:
-  - phase: 03-snapshot-ingestion
-    provides: config_snapshot_subscriber handle_config_snapshot handler
-  - phase: 05-config-diff
-    provides: config_diff_service generate_and_store_diff function
-  - phase: 04-manual-backup-trigger
-    provides: config_backups trigger_config_snapshot endpoint
-provides:
-  - Audit trail for all config backup operations (4 event types)
-  - Tests verifying audit event emission
-affects: []
-
-# Tech tracking
-tech-stack:
-  added: []
-  patterns: [try/except-wrapped log_action calls for fire-and-forget audit, inline imports in diff service to avoid circular deps]
-
-key-files:
-  created:
-    - backend/tests/test_audit_config_backup.py
-  modified:
-    - backend/app/services/config_snapshot_subscriber.py
-    - backend/app/services/config_diff_service.py
-    - backend/app/routers/config_backups.py
-
-key-decisions:
-  - "Module-level import of log_action in snapshot subscriber (no circular risk), inline import in diff service and router (consistent with existing best-effort pattern)"
-  - "All audit calls wrapped in try/except Exception: pass to never break parent operations"
-
-patterns-established:
-  - "Audit event pattern: try/except-wrapped log_action calls at success points in NATS subscribers and API endpoints"
-
-requirements-completed: [OBS-01, OBS-02]
-
-# Metrics
-duration: 3min
-completed: 2026-03-13
---
-
-# Phase 10 Plan 01: Config Backup Audit Events Summary
-
-**Four audit event types (created, skipped_duplicate, diff_generated, manual_trigger) wired into config backup operations with try/except safety and 4 passing tests**
-
-## Performance
-
- **Duration:** 3 min
- **Started:** 2026-03-13T04:43:11Z
- **Completed:** 2026-03-13T04:46:04Z
- **Tasks:** 2
- **Files modified:** 4
-
-## Accomplishments
- Added audit logging to all 4 config backup operations: snapshot creation, deduplication skip, diff generation, and manual backup trigger
- All log_action calls follow project pattern: try/except wrapped, fire-and-forget, with tenant_id, device_id, action, resource_type, and details
- 4 new tests verify correct audit action strings are emitted, all 17 tests pass (4 new + 13 existing)
-
-## Task Commits
-
-Each task was committed atomically:
-
-1. **Task 1: Add audit event emission to snapshot subscriber, diff service, and backup trigger endpoint** - `1a1ceb2` (feat)
-2. **Task 2: Add tests verifying audit events are emitted** - `fb91fed` (test)
-
-## Files Created/Modified
- `backend/app/services/config_snapshot_subscriber.py` - Added config_snapshot_created and config_snapshot_skipped_duplicate audit events
- `backend/app/services/config_diff_service.py` - Added config_diff_generated audit event after diff INSERT
- `backend/app/routers/config_backups.py` - Added config_backup_manual_trigger audit event on manual trigger success
- `backend/tests/test_audit_config_backup.py` - 4 tests verifying all audit event types are emitted
-
-## Decisions Made
- Module-level import of log_action in snapshot subscriber (no circular dependency risk since audit_service has no deps on snapshot subscriber)
- Inline import in diff service try block (consistent with existing best-effort pattern and avoids any potential circular import)
- Inline import in config_backups router try block (same pattern as diff service)
-
-## Deviations from Plan
-
-None - plan executed exactly as written.
-
-## Issues Encountered
-None
-
-## User Setup Required
-None - no external service configuration required.
-
-## Next Phase Readiness
- Audit trail complete for all config backup operations
- All existing tests continue to pass with the new audit imports
-
---
-*Phase: 10-audit-observability*
-*Completed: 2026-03-13*