Commit Graph

88 Commits

Author SHA1 Message Date
Jason Staack
db5bb3fa96 docs(04-01): complete manual backup trigger plan
- Summary with 12 tests (6 Go, 6 Python), all passing
- STATE.md updated: Phase 4 complete, decisions logged
- ROADMAP.md updated: Phase 4 plan progress
- REQUIREMENTS.md: COLL-04 marked complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:12:33 -05:00
Jason Staack
00f0a8b507 feat(04-01): add config snapshot trigger endpoint with NATS request-reply
- POST /tenants/{tid}/devices/{did}/config-snapshot/trigger endpoint
- Requires operator role, rate limited 10/minute
- Returns 201 success, 404 device not found, 409 lock held, 502 failure, 504 timeout
- Reuses NATS connection from routeros_proxy module
- 6 tests covering all response paths including connection errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:10:25 -05:00
Jason Staack
0e664150e7 test(04-01): add failing tests for config snapshot trigger endpoint
- Test success returns 201 with sha256_hash
- Test NATS timeout returns 504
- Test poller failure returns 502
- Test device not found returns 404
- Test lock contention returns 409

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:08:13 -05:00
Jason Staack
0851eced36 feat(04-01): implement BackupResponder with extracted CollectAndPublish
- Create BackupResponder for NATS request-reply on config.backup.trigger
- Extract public CollectAndPublish from BackupScheduler returning sha256 hash
- Define BackupExecutor/BackupLocker/DeviceGetter interfaces for testability
- Create RedisBackupLocker adapter wrapping redislock.Client
- Wire BackupResponder into main.go lifecycle
- All 6 tests pass with in-process NATS server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:07:35 -05:00
Jason Staack
9e102fda20 test(04-01): add failing tests for BackupResponder
- Test subscribe registers subscription
- Test valid request returns success with sha256_hash
- Test lock held returns locked status
- Test invalid JSON returns error
- Test Stop unsubscribes cleanly
- Test device not found returns failed status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 22:04:44 -05:00
Jason Staack
bf3fb509ed docs(03-01): complete config snapshot subscriber plan
- SUMMARY.md with task commits and decisions
- STATE.md updated to Phase 3 complete
- ROADMAP.md progress updated
- REQUIREMENTS.md: STOR-02 marked complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 21:49:43 -05:00
Jason Staack
0db06419e7 feat(03-01): wire config snapshot subscriber into main.py lifespan
- Start config_snapshot_subscriber in lifespan startup (non-fatal)
- Stop config_snapshot_subscriber in lifespan shutdown
- Placed after push_rollback_subscriber (near config-related subscribers)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 21:47:51 -05:00
Jason Staack
3ab9f27d49 feat(03-01): implement config snapshot subscriber with dedup and encryption
- NATS subscriber for config.snapshot.> on DEVICE_EVENTS stream
- Dedup by SHA256 hash against latest snapshot per device
- OpenBao Transit encryption before INSERT (plaintext never stored)
- Malformed/orphan messages acked and discarded safely
- Transit failure causes nak for NATS retry
- Prometheus metrics: ingested, dedup_skipped, errors, duration
- All 6 unit tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 21:47:07 -05:00
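The ack/nak rules listed in the commit above can be sketched as a single decision function. This is a minimal illustration and not the project's subscriber: the function name `handle_snapshot`, the payload field names `device_id` and `config_text`, and the `encrypt` and `store` callables are all assumptions introduced for the example. Only the outcomes (ingested, dedup skipped, error, ack versus nak) come from the commit message.

```python
import json
from typing import Callable, Dict, Set, Tuple


def handle_snapshot(
    payload: bytes,
    known_devices: Set[str],
    latest_hash: Dict[str, str],
    encrypt: Callable[[str], str],
    store: Callable[[str, str, str], None],
) -> Tuple[str, str]:
    """Return (action, outcome), where action is 'ack' or 'nak'."""
    try:
        event = json.loads(payload)
        device_id = event["device_id"]      # hypothetical field name
        digest = event["sha256_hash"]
        config_text = event["config_text"]  # hypothetical field name
    except (ValueError, KeyError, TypeError):
        # Malformed messages are acked so they are not redelivered.
        return ("ack", "malformed")

    if device_id not in known_devices:
        # Orphan device: nothing to attach the snapshot to, discard it.
        return ("ack", "orphan")

    if latest_hash.get(device_id) == digest:
        # Same hash as the latest stored snapshot: skip the insert.
        return ("ack", "dedup_skipped")

    try:
        ciphertext = encrypt(config_text)
    except Exception:
        # Encryption failed: nak so the message is retried later.
        return ("nak", "error")

    # Only ciphertext reaches storage; plaintext is never stored.
    store(device_id, digest, ciphertext)
    latest_hash[device_id] = digest
    return ("ack", "ingested")
```

Keeping the decision separate from the NATS and database calls makes each of the six test cases named in the preceding test commit easy to express without a broker.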
Jason Staack
9d8274158a test(03-01): add failing tests for config snapshot subscriber
- 6 tests: new snapshot stored, duplicate skipped, encrypt failure naks,
  malformed acked, orphan device acked, first-snapshot stored
- Tests mock NATS msg, AdminAsyncSessionLocal, OpenBaoTransitService

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 21:45:41 -05:00
Jason Staack
d456fe58e9 docs(02-02): complete backup scheduler plan
- SUMMARY.md with execution metrics and decisions
- STATE.md updated: Phase 2 complete, 3 plans done
- ROADMAP.md updated: Phase 2 marked complete
- REQUIREMENTS.md: COLL-03, COLL-05 marked complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:57:47 -05:00
Jason Staack
d34817a36c feat(02-02): wire BackupScheduler into main.go lifecycle
- Create BackupScheduler with all dependencies injected
- Run as goroutine parallel to status poll scheduler
- Shares same context for graceful shutdown via SIGINT/SIGTERM
- Startup logged with interval, max_concurrent, command_timeout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:55:06 -05:00
Jason Staack
2653a32d6f feat(02-02): implement BackupScheduler with per-device goroutines and concurrency control
- BackupScheduler manages per-device backup goroutines independently from status poll
- First backup uses 30-300s random jitter delay to spread load
- Concurrency limited by buffered channel semaphore (configurable max)
- Per-device Redis lock prevents duplicate backups across pods
- Auth failures and host key mismatches block retries with clear warnings
- Transient errors use 5m/15m/1h exponential backoff with cap
- Offline devices skipped via Redis status key check
- TOFU fingerprint stored on first successful SSH connection
- Config output validated, normalized, hashed, published to NATS
- SSHHostKeyUpdater interface added to interfaces.go
- All 12 backup unit tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:54:23 -05:00
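The retry timing described in the commit above can be illustrated with a short sketch. It is written in Python for brevity, while the scheduler itself is Go. The names `backoff_seconds`, `should_retry`, and `first_backup_delay`, the error-kind strings, and the way failures are counted are assumptions. The 5m/15m/1h steps, the cap, and the 30-300s jitter range come from the commit message.

```python
import random

BACKOFF_STEPS = [5 * 60, 15 * 60, 60 * 60]  # 5m, 15m, 1h, in seconds
JITTER_RANGE = (30, 300)                     # first-backup delay, in seconds

# Hypothetical labels for the errors that block retries entirely.
BLOCKING_ERRORS = {"auth", "hostkey"}


def backoff_seconds(consecutive_failures: int) -> int:
    """Delay before the next retry; stays at the last step once reached."""
    if consecutive_failures < 1:
        return 0
    index = min(consecutive_failures, len(BACKOFF_STEPS)) - 1
    return BACKOFF_STEPS[index]


def should_retry(error_kind: str) -> bool:
    """Auth failures and host key mismatches are not retried."""
    return error_kind not in BLOCKING_ERRORS


def first_backup_delay(rng: random.Random) -> float:
    """Random delay that spreads the first backups over the jitter range."""
    return rng.uniform(*JITTER_RANGE)
```

A table of fixed steps is used here instead of a multiplier because the listed delays (5m, 15m, 1h) do not share a constant ratio.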
Jason Staack
a884b0945d test(02-02): add failing tests for BackupScheduler
- Jitter range, backoff sequence, shouldRetry blocking logic
- Online-only gating via Redis, concurrency semaphore behavior
- Reconciliation start/stop device lifecycle

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:52:29 -05:00
Jason Staack
7ff3178b84 docs(02-01): complete config backup primitives plan
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:50:27 -05:00
Jason Staack
4ae39d2cb3 feat(02-01): add config backup env vars, NATS event, device SSH fields, migration, metrics
- Config: CONFIG_BACKUP_INTERVAL (21600s), CONFIG_BACKUP_MAX_CONCURRENT (10), CONFIG_BACKUP_COMMAND_TIMEOUT (60s)
- NATS: ConfigSnapshotEvent type, PublishConfigSnapshot method, config.snapshot.> stream subject
- Device: SSHPort/SSHHostKeyFingerprint fields, UpdateSSHHostKey method, updated queries/scans
- Migration 028: ssh_port, ssh_host_key_fingerprint, timestamp columns with poller_user grants
- Metrics: ConfigBackupTotal (counter), ConfigBackupDuration (histogram), ConfigBackupActive (gauge)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:48:12 -05:00
Jason Staack
f1abb75cab feat(02-01): add SSH executor with TOFU host key verification and config normalizer
- SSH RunCommand with typed error classification (auth, hostkey, timeout, connection refused, truncated)
- TOFU host key callback: accept-on-first-connect, verify-on-subsequent, reject-on-mismatch
- NormalizeConfig strips timestamps, normalizes line endings, trims whitespace, collapses blanks
- HashConfig returns 64-char lowercase hex SHA256 of normalized config
- 22 unit tests covering all error kinds, TOFU flows, normalization edge cases, idempotency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:46:04 -05:00
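The normalize-then-hash step from the commit above can be sketched as follows. This is an illustration in Python, not the Go implementation. The `TIMESTAMP_LINE` pattern is an assumption, since the commit message does not show which timestamp lines the real normalizer strips.

```python
import hashlib
import re

# Hypothetical pattern: a comment line that contains a clock time.
TIMESTAMP_LINE = re.compile(r"^#.*\d{1,2}:\d{2}:\d{2}.*$")


def normalize_config(raw: str) -> str:
    """Normalize line endings, drop timestamp lines, trim, collapse blanks."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for line in text.split("\n"):
        line = line.rstrip()
        if TIMESTAMP_LINE.match(line):
            continue
        if line == "" and (not out or out[-1] == ""):
            continue  # skip leading blanks and collapse runs of blank lines
        out.append(line)
    while out and out[-1] == "":
        out.pop()  # drop trailing blank lines
    return "\n".join(out) + "\n" if out else ""


def hash_config(normalized: str) -> str:
    """64-character lowercase hex SHA-256 digest of the normalized text."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Hashing the normalized text rather than the raw output means two exports that differ only in line endings or in the generated-at header produce the same digest, which is what makes hash-based deduplication workable.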
Jason Staack
33f888a6e2 docs(02): create phase plan for poller config collection
Two plans covering SSH executor, config normalization, NATS publishing,
backup scheduler, and main.go wiring for periodic RouterOS config backup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:39:47 -05:00
Jason Staack
a7a17a5ecd feat(01-01): add Alembic migration 027 for config snapshot tables with RLS
- Create router_config_snapshots table with Transit ciphertext storage
- Create router_config_diffs table with snapshot pair FK references
- Create router_config_changes table for parsed semantic changes
- Add RLS tenant isolation (ENABLE + FORCE + USING + WITH CHECK) on all 3
- Add GRANT SELECT/INSERT/DELETE to app_user on all 3
- Add performance indexes: device+collected_at, device+hash, snapshot pair, diff_id

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:04:18 -05:00
Jason Staack
8fe275e6f3 feat(01-01): add RouterConfigSnapshot/Diff/Change ORM models and tests
- Add RouterConfigSnapshot model with Transit ciphertext config_text
  and SHA-256 plaintext hash for deduplication
- Add RouterConfigDiff model for unified diffs between snapshots
- Add RouterConfigChange model for parsed semantic changes
- Export all three from app.models barrel file
- Add unit tests for importability, table names, columns, and types

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:03:43 -05:00
Jason Staack
7e2d637e0d docs: initialize project
2026-03-12 19:37:09 -05:00
Jason Staack
70126980a4 docs: map existing codebase
2026-03-12 19:33:26 -05:00
Jason Staack
5beede9502 Merge branch 'feature/remote-access'
2026-03-12 19:04:14 -05:00
Jason Staack
c2eea6847f fix: WinBox tunnel bind address, port range, and proxy support
- Bind tunnel listeners to 0.0.0.0 instead of 127.0.0.1 so tunnels
  are reachable through reverse proxies and container networks
- Reduce port range to 49000-49004 (5 concurrent tunnels)
- Derive WinBox URI host from request Host header instead of
  hardcoding 127.0.0.1, enabling use behind reverse proxies
- Add README security warning about default encryption keys

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 19:03:53 -05:00
Jason Staack
8cce0ef750 docs: add v9.5 remote access to website docs
WinBox tunnels, SSH terminal, NATS request-reply architecture,
session management, security notes, and updated port tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 17:02:02 -05:00
Jason Staack
acf1790bed feat: add audit.session.end NATS pipeline for SSH session tracking
Poller publishes session end events via JetStream when SSH sessions
close (normal disconnect or idle timeout). Backend subscribes with a
durable consumer and writes ssh_session_end audit log entries with
duration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:07:10 -05:00
Jason Staack
7aaaeaa1d1 fix: address spec compliance gaps - tenant check, XFF fallback, rate limiting
- Gap 1: Add tenant ID verification after device lookup in SSH relay handleSSH,
  closing cross-tenant token reuse vulnerability
- Gap 2: Add X-Forwarded-For fallback (last entry) when X-Real-IP is absent in
  SSH relay source IP extraction; import strings package
- Gap 3: Add @limiter.limit("10/minute") to POST /winbox-session and POST
  /ssh-session using existing slowapi pattern from app.middleware.rate_limit
- Gap 4: Add TODO comment in open_ssh_session explaining that SSH session count
  enforcement is at the poller level; no NATS subject exists yet for API-side
  pre-check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:51:14 -05:00
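The source IP rule in Gap 2 above can be sketched as a small function. This is an illustration in Python, while the relay itself is Go. The function name `client_ip` and the final fallback to the socket's remote address are assumptions. The order (X-Real-IP first, then the last X-Forwarded-For entry) comes from the commit message.

```python
from typing import Mapping


def client_ip(headers: Mapping[str, str], remote_addr: str) -> str:
    """Pick the client IP from proxy headers.

    Assumes `headers` uses the exact header names below as keys;
    a real HTTP library would normally look them up case-insensitively.
    """
    real_ip = headers.get("X-Real-IP", "").strip()
    if real_ip:
        return real_ip

    forwarded = headers.get("X-Forwarded-For", "")
    entries = [part.strip() for part in forwarded.split(",") if part.strip()]
    if entries:
        # Last entry: the one appended by the proxy closest to this server.
        return entries[-1]

    # Assumed fallback when neither header is present.
    return remote_addr
```

Taking the last entry rather than the first matters because earlier entries in X-Forwarded-For are supplied by the client and can be forged, while the last one is written by the nearest proxy.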
Jason Staack
a4e1c78744 docs: update documentation for v9.5 remote access feature
Add tunnel manager, SSH relay, new env vars, security model, and
Remote Access key feature entry across ARCHITECTURE, DEPLOYMENT,
SECURITY, CONFIGURATION, and README.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:47:03 -05:00
Jason Staack
d2471278ab feat(frontend): integrate WinBox and SSH buttons into device page
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:45:14 -05:00
Jason Staack
b76fdb3240 feat(frontend): add SSH terminal component with xterm.js
2026-03-12 15:43:31 -05:00
Jason Staack
b3b2f87beb feat(frontend): add WinBox tunnel button component
2026-03-12 15:43:03 -05:00
Jason Staack
79afd2a1ad feat(frontend): add remote access API client methods
2026-03-12 15:42:42 -05:00
Jason Staack
e5a9758f58 chore(frontend): add xterm.js dependencies for SSH terminal
2026-03-12 15:42:29 -05:00
Jason Staack
27f4403856 feat(infra): add nginx WebSocket proxy and SSH relay config to compose files
- Add WebSocket upgrade map to nginx and proxy /ws/ssh to poller:8080
- Update CSP connect-src to allow ws: and wss: for terminal connections
- Add tunnel port range 49000-49100, SSH relay env vars, ulimits, and healthcheck to poller in both override and prod compose files
- Increase poller memory limit to 512M in prod for tunnel/SSH overhead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:40:53 -05:00
Jason Staack
4860fad643 feat(api): add remote access endpoints for WinBox tunnels and SSH sessions
Implements four operator-gated endpoints under /api/tenants/{tenant_id}/devices/{device_id}/:
- POST /winbox-session: opens a WinBox tunnel via NATS request-reply to poller
- POST /ssh-session: mints a single-use Redis token (120s TTL) for WebSocket SSH relay
- DELETE /winbox-session/{tunnel_id}: idempotently closes a WinBox tunnel
- GET /sessions: lists active WinBox tunnels via NATS tunnel.status.list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:39:24 -05:00
Jason Staack
63fa45ffdd feat(api): add remote access pydantic schemas
2026-03-12 15:36:36 -05:00
Jason Staack
cb427272ed feat(poller): wire tunnel manager and SSH relay into main
Add TunnelManager, TunnelResponder, SSH relay server, and SSH relay HTTP
server to the poller startup sequence with env-configurable port ranges,
idle timeouts, and session limits. Extends graceful shutdown to cover the
HTTP server (5s context), tunnel manager, and SSH relay server via defer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:35:55 -05:00
Jason Staack
c73466c5e0 feat(poller): add SSH relay server with WebSocket-to-PTY bridge
Implements the SSH relay server (Task 2.1) that validates single-use
Redis tokens via GETDEL, dials SSH to the target device with PTY,
and bridges WebSocket binary/text frames to SSH stdin/stdout/stderr
with idle timeout and per-user/per-device session limits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:33:48 -05:00
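The single-use token check described in the commit above relies on reading and deleting the token in one step, which is what Redis GETDEL provides. The sketch below imitates that behaviour with an in-memory dictionary so it runs without Redis. The class `TokenStore`, its method names, and the injectable clock are assumptions for illustration. The 120s TTL is taken from the API commit that mints the token.

```python
import secrets
import time
from typing import Callable, Dict, Optional, Tuple


class TokenStore:
    """In-memory stand-in for a Redis key with a TTL, consumed via GETDEL."""

    def __init__(self, ttl_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def mint(self, payload: str) -> str:
        """Create a random token bound to `payload` and return it."""
        token = secrets.token_urlsafe(32)
        self._items[token] = (self._clock() + self._ttl, payload)
        return token

    def consume(self, token: str) -> Optional[str]:
        """Return the payload once, then never again (read and delete)."""
        item = self._items.pop(token, None)
        if item is None:
            return None
        expires_at, payload = item
        if self._clock() >= expires_at:
            return None  # expired tokens are rejected and stay deleted
        return payload
```

Because the lookup removes the entry, a second WebSocket connection that presents the same token gets nothing back, even if it arrives within the TTL.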
Jason Staack
d3d3e36192 feat(poller): add NATS tunnel responder for WinBox tunnel management
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:30:34 -05:00
Jason Staack
7a6ebdca89 feat(poller): add tunnel manager with idle cleanup and status tracking
Implements Manager which orchestrates WinBox tunnel lifecycle: open,
close, idle cleanup, and status queries. Uses PortPool and Tunnel from
Tasks 1.2/1.3. DeviceStore and CredentialCache wired in for Task 1.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:28:56 -05:00
Jason Staack
8105b995ff feat(poller): add TCP tunnel with bidirectional proxy and activity tracking
Implements Tunnel type that listens on a local port, accepts WinBox client
connections, dials the remote RouterOS device, and proxies traffic
bidirectionally. Uses activityReader to atomically update LastActive on
each read for idle timeout detection. Per-connection contexts derive from
the tunnel context so Close() terminates all connections cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:26:47 -05:00
Jason Staack
d885f9b4b6 feat(poller): add port pool for WinBox tunnel allocation
Implements PortPool with mutex-protected allocation, bind verification
to skip ports already in use by the OS, and release-for-reuse semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 15:25:01 -05:00
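The allocation scheme described in the commit above (a lock around the pool, a bind check to skip ports the OS already has in use, and release for reuse) can be sketched as follows. This is a Python illustration, not the Go `PortPool`. The `can_bind` helper, the `probe` parameter, and the choice of 127.0.0.1 as the probe address are assumptions.

```python
import socket
import threading
from typing import Callable, Optional, Set


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True


class PortPool:
    """Hands out ports from an inclusive range, one holder per port."""

    def __init__(self, low: int, high: int,
                 probe: Callable[[int], bool] = can_bind) -> None:
        self._ports = range(low, high + 1)
        self._in_use: Set[int] = set()
        self._lock = threading.Lock()
        self._probe = probe

    def allocate(self) -> Optional[int]:
        """Return the lowest free port, or None if the pool is exhausted."""
        with self._lock:
            for port in self._ports:
                if port in self._in_use:
                    continue
                if not self._probe(port):
                    continue  # bound by something outside the pool; skip it
                self._in_use.add(port)
                return port
        return None

    def release(self, port: int) -> None:
        """Mark a port as free so a later allocate() can reuse it."""
        with self._lock:
            self._in_use.discard(port)
```

The probe is injectable so the pool logic can be tested without opening sockets. Note that a bind check only reports the state at probe time, so the caller still has to handle a bind failure when it opens the real listener.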
Jason Staack
5f9410fa54 chore(poller): add websocket dependency for remote access
2026-03-12 15:23:48 -05:00
Jason Staack
c0304da2dd docs: add remote access (v9.5) implementation plan
Six-chunk TDD implementation plan for WinBox TCP tunnels and SSH terminal relay through the Go poller. Covers tunnel manager, SSH relay, API endpoints, infrastructure, frontend, and documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:20:04 -05:00
Jason Staack
d16a5c991f docs: add remote access design spec (WinBox tunnels + SSH terminal)
Comprehensive design for v9.5 remote access feature:
- WinBox TCP tunnel through poller with localhost port allocation
- Browser SSH terminal via xterm.js + WebSocket to poller SSH relay
- RBAC enforcement (operator+), audit logging, session tokens
- Infrastructure: nginx WebSocket proxy, Docker port range mapping

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:10:40 -05:00
Jason Staack
bb9176fb9c docs: update docs to reflect recent fixes and actual codebase state
- Fix Go version (1.23 → 1.24), router count (21 → 25), add settings router
- Document vault key decryption on login and refresh token cookie delivery
- Document audit log self-commit behavior for reliability
- Add firmware cache volume and nginx dynamic DNS resolver to deployment guide
- Fix placeholder clone URL to actual repository

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:13:57 -05:00
Cog
6b22741f54 fix: audit logs never persisted + firmware-cache permission denied
Two bugs fixed:

1. audit_service.py: log_action() inserted into audit_logs using the
   caller's DB session but never committed. Any router that called
   db.commit() before log_action() (firmware, devices, config_editor,
   alerts, certificates) had its audit rows silently rolled back when
   the request session closed.
   Fix: log_action now opens its own AdminAsyncSessionLocal and self-
   commits, making audit persistence independent of the caller's
   transaction. The 'db' parameter is kept for backward compat but
   unused. Affects 5 routers (firmware, devices, config_editor,
   alerts, certificates).

2. docker-compose.override.yml: /data/firmware-cache had no volume
   mount so the directory didn't exist in the container, causing
   firmware downloads to fail with Permission denied.
   Fix: bind-mount docker-data/firmware-cache:/data/firmware-cache
   so firmware images survive container restarts.
2026-03-12 14:05:40 -05:00
Cog
21b8ce029f fix: nginx 502 after API container restart (dynamic DNS resolver)
Without a resolver directive, nginx resolves upstream hostnames once at
startup and caches the IP forever. When the API container restarts it gets
a new Docker-assigned IP, causing 502 Bad Gateway until nginx is reloaded.

Fix:
- Add 'resolver 127.0.0.11 valid=10s' (Docker embedded DNS)
- Use a variable in proxy_pass ('set \ api') so nginx
  re-resolves on every request using the resolver above
- Variable proxy_pass passes the full request URI as-is, so /api/...
  correctly maps to http://api:8000/api/... without double-pathing
2026-03-12 14:05:40 -05:00
Cog
58597ad4fd fix: CRLF/BOM line endings + restart policies + gitattributes
- poller/docker-entrypoint.sh: convert from CRLF+BOM to LF (UTF-8 no BOM)
  Windows saved the file with a UTF-8 BOM which made the Linux kernel
  reject the shebang with 'exec format error', crashing the poller.

- infrastructure/openbao/init.sh: same CRLF -> LF fix

- poller/Dockerfile: add sed to strip CRLF and BOM at image build time
  as a defensive measure for future Windows edits

- docker-compose.override.yml: add 'restart: on-failure' to api and poller
  so they recover from the postgres startup race (TimescaleDB restarts
  postgres after initdb, briefly causing connection refused on first boot)

- .gitattributes: enforce LF for all text/script/code files so git
  normalises line endings on checkout and prevents this class of bug
2026-03-12 14:05:40 -05:00
Cog
57e754bb27 fix: implement vault key decryption on login + fix token refresh via cookie
Three bugs fixed:

1. Phase 30 (auth.ts): After SRP login the encrypted_key_set was returned
   from the server but the vault key and RSA private key were never unwrapped
   with the AUK. keyStore.getVaultKey() was always null, causing Tier 1
   config-backup diffs to crash with a TypeError.
   Fix: unwrap vault key and private key using crypto.subtle.unwrapKey after
   successful SRP verification. Non-fatal: warns to console if decryption
   fails so login always succeeds.

2. Token refresh (auth.py): The /refresh endpoint required refresh_token in
   the request body, but the frontend never stored or sent it. After the 15-
   minute access token TTL, all authenticated API calls would fail silently
   because the interceptor sent an empty body and received 422 (not 401),
   so the retry loop never fired.
   Fix: login/srpVerify now set an httpOnly refresh_token cookie scoped to
   /api/auth/refresh. The refresh endpoint now accepts the token from either
   cookie (preferred) or body (legacy). Logout clears both cookies.
   RefreshRequest.refresh_token is now Optional to allow empty-body calls.

3. Silent token rotation: the /refresh endpoint now also rotates the refresh
   token cookie on each use (issues a fresh token), reducing the window for
   stolen refresh token replay.
2026-03-12 14:05:40 -05:00
Jason Staack
d0548bec86 fix(crypto): use 27 base-30 chars for Secret Key to prevent data loss
The Secret Key encoder used 26 base-30 characters which can only
represent 30^26 ≈ 2^127.58 values. Since the key is 128 bits,
~25% of generated keys silently lost their high bits during
formatting, making the Emergency Kit key unable to reconstruct
the original bytes on a new browser.

Changed KEY_CHAR_LENGTH from 26 to 27 (30^27 > 2^128). Parser
accepts both old 26-char and new 27-char keys for backward
compatibility. Format: A3-XXXXXX-XXXXXX-XXXXXX-XXXXXX-XXX

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 14:04:24 -05:00
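The capacity argument in the commit above can be checked with a few lines of arithmetic. The sketch is in Python and not the project's frontend code, and the helper name `min_chars` is hypothetical. The 128-bit key size and the base-30 alphabet come from the commit message.

```python
import math

KEY_BITS = 128  # Secret Key size stated in the commit message
BASE = 30       # alphabet size stated in the commit message


def min_chars(bits: int, base: int) -> int:
    """Smallest n such that base**n >= 2**bits, using exact integers."""
    n = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= base
        n += 1
    return n


# Bits representable by 26 and by 27 base-30 characters.
bits_26 = 26 * math.log2(BASE)
bits_27 = 27 * math.log2(BASE)

# Fraction of all 128-bit values that are too large for 26 characters.
lost_fraction = 1 - (BASE ** 26) / (2 ** KEY_BITS)

print(f"26 chars: {bits_26:.2f} bits, 27 chars: {bits_27:.2f} bits")
print(f"keys that overflow 26 chars: {lost_fraction:.1%}")
print(f"minimum length: {min_chars(KEY_BITS, BASE)}")
```

The overflow fraction is 1 - 2^(127.58 - 128), which is where the roughly 25% figure in the commit message comes from.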