From d16a5c991f454d19151287918b30f72591135990 Mon Sep 17 00:00:00 2001 From: Jason Staack Date: Thu, 12 Mar 2026 15:09:31 -0500 Subject: [PATCH] docs: add remote access design spec (WinBox tunnels + SSH terminal) Comprehensive design for v9.5 remote access feature: - WinBox TCP tunnel through poller with localhost port allocation - Browser SSH terminal via xterm.js + WebSocket to poller SSH relay - RBAC enforcement (operator+), audit logging, session tokens - Infrastructure: nginx WebSocket proxy, Docker port range mapping Co-Authored-By: Claude Opus 4.6 --- .../specs/2026-03-12-remote-access-design.md | 841 ++++++++++++++++++ 1 file changed, 841 insertions(+) create mode 100644 docs/superpowers/specs/2026-03-12-remote-access-design.md diff --git a/docs/superpowers/specs/2026-03-12-remote-access-design.md b/docs/superpowers/specs/2026-03-12-remote-access-design.md new file mode 100644 index 0000000..3644fa3 --- /dev/null +++ b/docs/superpowers/specs/2026-03-12-remote-access-design.md @@ -0,0 +1,841 @@ +# Remote Access Design — WinBox Tunnels + SSH Terminal (v9.5) + +## Overview + +Add remote WinBox and SSH terminal access to TOD. Users connect to RouterOS devices behind NAT through the TOD controller without direct network access to the router. + +- **WinBox**: TCP tunnel through the poller container. User's native WinBox app connects to `127.0.0.1:`. +- **SSH Terminal**: Browser-based xterm.js terminal. WebSocket to poller, which bridges to SSH PTY on the router. + +### Device Type Scope + +- **WinBox tunnels**: RouterOS devices only (WinBox is MikroTik-specific, port 8291) +- **SSH terminal**: All device types that support SSH (RouterOS and future `linuxrtr` devices) +- The frontend should show/hide the "Open WinBox" button based on device type. The "SSH Terminal" button renders for all SSH-capable device types. 
+ +## System Architecture + +``` + ┌─────────────────────────────────┐ + │ User's Machine │ + │ │ + │ Browser (TOD UI) │ + │ ├─ xterm.js SSH terminal │ + │ └─ "Open WinBox" button │ + │ │ + │ WinBox app │ + │ └─ connects 127.0.0.1:491xx │ + └──────────┬──────────┬───────────┘ + │ │ + WebSocket TCP (WinBox) + /ws/ssh/ 127.0.0.1:49000-49100 + │ │ +┌────────────────────────────────────┼──────────┼────────────────┐ +│ Docker Network: tod │ │ │ +│ │ │ │ +│ ┌──────────────┐ │ │ │ +│ │ nginx │──────────────────┘ │ │ +│ │ port 3000 │ (proxy /ws/ssh → poller) │ │ +│ │ │ (proxy /api → api) │ │ +│ └──────┬───────┘ │ │ +│ │ │ │ +│ ┌──────▼───────┐ NATS ┌───────────────▼──────────┐ │ +│ │ API │◄───────────►│ Poller │ │ +│ │ FastAPI │ │ Go │ │ +│ │ │ │ ├─ tunnel manager │ │ +│ │ - RBAC │ session │ │ (TCP proxy :49000+) │ │ +│ │ - audit log │ tokens │ ├─ SSH relay │ │ +│ │ - session │ (Redis) │ │ (WebSocket ↔ PTY) │ │ +│ │ tokens │ │ ├─ device poller │ │ +│ └──────────────┘ │ └─ cmd responder │ │ +│ └───────────────┬───────────┘ │ +│ │ │ +│ ┌───────────────▼───────────┐ │ +│ │ WireGuard │ │ +│ │ 10.10.0.1/24 │ │ +│ │ port 51820/udp │ │ +│ └───────────────┬───────────┘ │ +└───────────────────────────────────────────────┼────────────────┘ + │ + ┌─────────────────────┼──────────────┐ + │ │ │ + RouterOS RouterOS RouterOS + (direct IP) (VPN peer) (VPN peer) + :8291 :22 10.10.0.x 10.10.0.y + :8291 :22 :8291 :22 +``` + +**Key data paths:** + +- **WinBox**: Browser click → API (auth+audit) → NATS → Poller allocates port → Docker maps `127.0.0.1:491xx` → Poller TCP proxy → WireGuard → Router:8291 +- **SSH**: Browser click → API (auth+audit+token) → Browser opens WebSocket → nginx → Poller validates token → SSH+PTY → Router:22 +- **Auth boundary**: API handles all RBAC and audit logging. Poller validates single-use session tokens but never does primary auth. + +## RBAC + +Roles allowed for remote access: `operator`, `admin`, `super_admin`. + +`viewer` role receives 403 Forbidden. 
The API is the enforcement point; frontend hides buttons for viewers but does not rely on that for security. + +Every remote access operation produces an audit log entry: + +- `user_id`, `tenant_id`, `device_id`, `session_type`, `source_ip`, `timestamp` +- SSH sessions additionally log `start_time` and `end_time` + +## Poller: Tunnel Manager + +New package: `poller/internal/tunnel/` + +### Data Structures + +```go +type TunnelManager struct { + mu sync.Mutex + tunnels map[string]*Tunnel // keyed by tunnel ID (uuid) + portPool *PortPool // tracks available ports 49000-49100 + idleTime time.Duration // 5 minutes + deviceStore *store.DeviceStore // DB lookup for device connection details + credCache *vault.CredentialCache +} + +type Tunnel struct { + ID string + DeviceID string + TenantID string + UserID string + LocalPort int + RemoteAddr string // router IP:8291 + CreatedAt time.Time + LastActive int64 // atomic, unix nanoseconds + listener net.Listener + cancel context.CancelFunc + conns sync.WaitGroup + activeConns int64 // atomic counter +} +``` + +### LastActive Concurrency + +`LastActive` stored as `int64` (unix nanoseconds) using atomic operations: + +- Write: `atomic.StoreInt64(&t.LastActive, time.Now().UnixNano())` +- Read: `time.Since(time.Unix(0, atomic.LoadInt64(&t.LastActive)))` + +### Port Pool + +```go +type PortPool struct { + mu sync.Mutex + ports []bool // true = in use + base int // 49000 +} +``` + +- `Allocate()` returns next free port or error if exhausted +- `Release()` marks port as free +- Before allocation, attempt bind to verify port is actually free (handles stale Docker mappings after restart) +- All operations protected by mutex + +### Tunnel Lifecycle + +1. NATS message arrives on `tunnel.open` +2. Manager looks up device from database via `DeviceStore.GetDevice(deviceID)` to obtain encrypted credentials and connection details (same pattern as `CmdResponder`) +3. Decrypts device credentials via credential cache +4. 
Allocates port from pool (verify bind succeeds) +5. Starts TCP listener on `127.0.0.1:` (never `0.0.0.0`) +6. Returns allocated port via NATS reply +7. For each incoming TCP connection: + - `t.conns.Add(1)`, increment `activeConns` + - Dial `router_ip:8291` through WireGuard (10s timeout) + - If dial fails: close client connection, decrement counter, do not update LastActive + - Bidirectional proxy with context cancellation (see below) + - On exit: decrement `activeConns`, `t.conns.Done()` +8. Background goroutine checks every 30s: + - If idle > 5 minutes AND `activeConns == 0`: close tunnel +9. Never close a tunnel while WinBox has an active socket + +### TCP Proxy (per connection) + +```go +func (t *Tunnel) handleConn(tunnelCtx context.Context, clientConn net.Conn) { + defer t.conns.Done() + defer atomic.AddInt64(&t.activeConns, -1) + + routerConn, err := net.DialTimeout("tcp", t.RemoteAddr, 10*time.Second) + if err != nil { + clientConn.Close() + return + } + + ctx, cancel := context.WithCancel(tunnelCtx) // derived from tunnel context for shutdown propagation + defer cancel() // ensure context cleanup on all exit paths + + go func() { + io.Copy(routerConn, newActivityReader(clientConn, &t.LastActive)) + cancel() + }() + go func() { + io.Copy(clientConn, newActivityReader(routerConn, &t.LastActive)) + cancel() + }() + + <-ctx.Done() + clientConn.Close() + routerConn.Close() +} +``` + +`activityReader` wraps `io.Reader` and calls `atomic.StoreInt64` on every `Read()`. + +### Tunnel Shutdown Order + +```go +func (t *Tunnel) Close() { + t.listener.Close() // 1. stop accepting new connections + t.cancel() // 2. cancel context + t.conns.Wait() // 3. wait for active connections + // 4. release port (done by manager) + // 5. 
delete from manager map (done by manager) +} +``` + +### NATS Subjects + +- `tunnel.open` — Request: `{device_id, tenant_id, user_id, target_port}` → Reply: `{tunnel_id, local_port}` +- `tunnel.close` — Request: `{tunnel_id}` → Reply: `{ok}` +- `tunnel.status` — Request: `{tunnel_id}` → Reply: `{active, local_port, connected_clients, idle_seconds}` +- `tunnel.status.list` — Request: `{device_id}` → Reply: list of active tunnels + +### Logging + +Structured JSON logs for: tunnel creation, port allocation, client connection, client disconnect, idle timeout, tunnel close. Fields: `tunnel_id`, `device_id`, `tenant_id`, `local_port`, `remote_addr`. + +## Poller: SSH Relay + +New package: `poller/internal/sshrelay/` + +### Data Structures + +```go +type Server struct { + redis *redis.Client + credCache *vault.CredentialCache + deviceStore *store.DeviceStore + sessions map[string]*Session + mu sync.Mutex + idleTime time.Duration // 15 minutes + maxSessions int // 200 + maxPerUser int // 10 + maxPerDevice int // 20 +} + +type Session struct { + ID string // uuid + DeviceID string + TenantID string + UserID string + SourceIP string + StartTime time.Time + LastActive int64 // atomic, unix nanoseconds + sshClient *ssh.Client + sshSession *ssh.Session + ptyCols int + ptyRows int + cancel context.CancelFunc +} +``` + +### HTTP Server + +Runs on port 8080 inside the container (configurable via `SSH_RELAY_PORT`). Not exposed to host — only accessible through nginx on Docker network. + +Endpoints: + +- `/ws/ssh?token=` — WebSocket upgrade for SSH terminal +- `/healthz` — Health check (returns `{"status":"ok"}`) + +### Connection Flow + +1. Browser opens `ws://host/ws/ssh?token=` +2. nginx proxies to poller `:8080/ws/ssh` +3. Poller validates single-use token via Redis `GETDEL` +4. Token must contain: `device_id`, `tenant_id`, `user_id`, `source_ip`, `cols`, `rows`, `created_at` +5. Verify `tenant_id` matches device's tenant +6. 
Check session limits (200 total, 10 per user, 20 per device); reject with close frame if exceeded +7. Upgrade to WebSocket with hardening: + - `SetReadLimit(1 << 20)` (1 MiB) + - Read deadline management + - Ping/pong keepalive + - Origin validation +8. Decrypt device credentials via credential cache +9. SSH dial to router (port 22, password auth, `InsecureIgnoreHostKey`) + - Log host key fingerprint on first connect + - If dial fails: close WebSocket with error message, clean up +10. Open SSH session, request PTY (`xterm-256color`, initial cols/rows from token) +11. Obtain stdin, stdout, stderr pipes +12. Start shell +13. Bridge WebSocket ↔ SSH PTY + +### WebSocket Message Protocol + +- **Binary frames**: Terminal data, forwarded directly to/from SSH PTY +- **Text frames**: JSON control messages + +``` +{"type": "resize", "cols": 120, "rows": 40} +{"type": "ping"} +``` + +Resize validation: `cols > 0 && cols <= 500 && rows > 0 && rows <= 200`. Reject invalid values. + +### Bridge Function + +```go +// Parameter types are assumptions: golang.org/x/crypto/ssh for the session and a +// context-aware WebSocket library (Read(ctx)/Write(ctx, ...) match nhooyr.io/websocket). +func bridge(ctx context.Context, cancel context.CancelFunc, + wsConn *websocket.Conn, sshSession *ssh.Session, + stdin io.Writer, stdout, stderr io.Reader, lastActive *int64) { + + // WebSocket → SSH stdin + go func() { + defer cancel() + for { + msgType, data, err := wsConn.Read(ctx) + if err != nil { return } + atomic.StoreInt64(lastActive, time.Now().UnixNano()) + + if msgType == websocket.MessageText { + var ctrl ControlMsg + if json.Unmarshal(data, &ctrl) != nil { continue } // ignore malformed control frames + if ctrl.Type == "resize" { + // validate bounds before touching the PTY + if ctrl.Cols > 0 && ctrl.Cols <= 500 && ctrl.Rows > 0 && ctrl.Rows <= 200 { + sshSession.WindowChange(ctrl.Rows, ctrl.Cols) // (height, width) + } + } + continue + } + if _, err := stdin.Write(data); err != nil { return } + } + }() + + // SSH stdout → WebSocket + go func() { + defer cancel() + buf := make([]byte, 4096) + for { + n, err := stdout.Read(buf) + if n > 0 { + atomic.StoreInt64(lastActive, time.Now().UnixNano()) + if werr := wsConn.Write(ctx, websocket.MessageBinary, buf[:n]); werr != nil { return } + } + if err != nil { return } + } + }() + + // SSH stderr → WebSocket (merged into same 
stream) + go func() { + defer cancel() // stderr EOF also triggers cleanup + io.Copy(wsWriterAdapter(wsConn), stderr) + }() + + <-ctx.Done() +} +``` + +### Session Cleanup Order + +1. Cancel context (triggers bridge shutdown) +2. Close WebSocket +3. Close SSH session +4. Close SSH client +5. Remove session from server map (under mutex) +6. Publish audit event via NATS: `audit.session.end` with payload `{session_id, user_id, tenant_id, device_id, start_time, end_time, source_ip, reason}` + +### Audit End-Time Pipeline + +The API subscribes to the NATS subject `audit.session.end` (durable consumer, same pattern as existing NATS subscribers in `backend/app/services/nats_subscribers.py`). When a message arrives, the subscriber calls `log_action("ssh_session_end", ...)` with the session details including `end_time` and duration. This uses the existing self-committing audit service, so no new persistence mechanism is needed. + +### Idle Timeout + +Per-session goroutine, every 30s: + +```go +idle := time.Since(time.Unix(0, atomic.LoadInt64(&sess.LastActive))) +if idle > 15*time.Minute { + cancel() // triggers the session cleanup order above +} +``` + +### Source IP + +Extracted from the `X-Real-IP` header (set by nginx from `$remote_addr`), falling back to the last `X-Forwarded-For` entry (the one nginx appends), then to `r.RemoteAddr`. Using `X-Real-IP` as primary avoids client-spoofed `X-Forwarded-For` entries. + +### Logging + +Structured JSON logs for: session start, session end (with duration and reason: disconnect/idle/error). Fields: `session_id`, `device_id`, `tenant_id`, `user_id`, `source_ip`. + +## API: Remote Access Endpoints + +New router: `backend/app/routers/remote_access.py` + +### WinBox Tunnel + +``` +POST /api/tenants/{tenant_id}/devices/{device_id}/winbox-session + +RBAC: operator+ +``` + +Flow: + +1. Validate JWT, require `operator+` +2. Verify device exists, belongs to tenant, is active (not disabled/deleted) +3. Return 404 if the device is not found within the tenant in the URL (a device owned by another tenant is also a 404, so cross-tenant existence never leaks); return 403 if the user's tenant does not match `tenant_id` +4. 
Extract source IP from `X-Real-IP` header (preferred, set by nginx), fallback to `request.client.host` +5. Audit log: `log_action("winbox_tunnel_open", ...)` +6. NATS request to `tunnel.open` (10s timeout) +7. If timeout or error: return 503 +8. Validate returned port is in range 49000–49100 +9. Response: + +```json +{ + "tunnel_id": "uuid", + "host": "127.0.0.1", + "port": 49023, + "winbox_uri": "winbox://127.0.0.1:49023", + "idle_timeout_seconds": 300 +} +``` + +`host` is always hardcoded to `"127.0.0.1"` — never overridden by poller response. + +Rate limit: 10 requests/min per user. + +### SSH Session Token + +``` +POST /api/tenants/{tenant_id}/devices/{device_id}/ssh-session + +RBAC: operator+ + +Body: {"cols": 80, "rows": 24} +``` + +Flow: + +1. Validate JWT, require `operator+` +2. Verify device exists, belongs to tenant, is active +3. Check session limits (10 per user, 20 per device) — return 429 if exceeded +4. Audit log: `log_action("ssh_session_open", ...)` +5. Generate token: `secrets.token_urlsafe(32)` +6. Store in Redis with SETEX (atomic), 120s TTL. Key format: `ssh:token:` + +```json +{ + "device_id": "uuid", + "tenant_id": "uuid", + "user_id": "uuid", + "source_ip": "1.2.3.4", + "cols": 80, + "rows": 24, + "created_at": 1710288000 +} +``` + +7. Response: + +```json +{ + "token": "...", + "websocket_url": "/ws/ssh?token=", + "idle_timeout_seconds": 900 +} +``` + +Rate limit: 10 requests/min per user. + +Input validation: `cols` 1–500, `rows` 1–200. + +### Tunnel Close + +``` +DELETE /api/tenants/{tenant_id}/devices/{device_id}/winbox-session/{tunnel_id} + +RBAC: operator+ +``` + +Idempotent — returns 200 even if tunnel already closed. Audit log recorded. + +### Active Sessions + +``` +GET /api/tenants/{tenant_id}/devices/{device_id}/sessions + +RBAC: operator+ +``` + +NATS request to poller. If poller doesn't respond within 10s, return empty session lists (degrade gracefully). 
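The token payload stored by the SSH Session Token flow above is read back by the poller when the WebSocket connects. The following is a minimal Go sketch of that check, under stated assumptions: `TokenPayload`, `validateToken`, the `tenantOf` lookup callback, and the error messages are hypothetical names introduced for illustration, and the Redis `GETDEL` call that fetches `raw` is left out so the example stays self-contained. It mirrors the documented rules only: all fields present, `cols` 1–500, `rows` 1–200, 120s lifetime, and the token's tenant matching the device's tenant.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// TokenPayload mirrors the JSON document the API stores under the token key.
type TokenPayload struct {
	DeviceID  string `json:"device_id"`
	TenantID  string `json:"tenant_id"`
	UserID    string `json:"user_id"`
	SourceIP  string `json:"source_ip"`
	Cols      int    `json:"cols"`
	Rows      int    `json:"rows"`
	CreatedAt int64  `json:"created_at"` // unix seconds
}

// tokenTTLSeconds matches the 120s TTL the API sets on the Redis key.
const tokenTTLSeconds = 120

// validateToken decodes raw and applies the checks from the connection flow.
// tenantOf is a hypothetical stand-in for the device store lookup; it returns
// the owning tenant of a device and whether the device exists.
func validateToken(raw []byte, tenantOf func(deviceID string) (string, bool), nowUnix int64) (*TokenPayload, error) {
	var p TokenPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, fmt.Errorf("decode token payload: %w", err)
	}
	if p.DeviceID == "" || p.TenantID == "" || p.UserID == "" || p.SourceIP == "" || p.CreatedAt == 0 {
		return nil, errors.New("token payload is missing a required field")
	}
	if p.Cols < 1 || p.Cols > 500 || p.Rows < 1 || p.Rows > 200 {
		return nil, errors.New("terminal size out of range")
	}
	// Redis expiry already bounds the lifetime; this is a defensive second check.
	if age := nowUnix - p.CreatedAt; age < 0 || age > tokenTTLSeconds {
		return nil, errors.New("token expired or created in the future")
	}
	tenant, ok := tenantOf(p.DeviceID)
	if !ok || tenant != p.TenantID {
		return nil, errors.New("device not found for tenant")
	}
	return &p, nil
}

func main() {
	raw := []byte(`{"device_id":"d1","tenant_id":"t1","user_id":"u1",` +
		`"source_ip":"1.2.3.4","cols":80,"rows":24,"created_at":1710288000}`)
	lookup := func(deviceID string) (string, bool) { return "t1", deviceID == "d1" }

	p, err := validateToken(raw, lookup, 1710288030)
	fmt.Println(p != nil, err)
}
```

Passing the clock and the tenant lookup in as arguments keeps the function free of Redis and database dependencies, which makes the "valid / expired / wrong tenant" unit tests listed under Testing Strategy straightforward to write.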
+ +### Schemas + +```python +class WinboxSessionResponse(BaseModel): + tunnel_id: str + host: str = "127.0.0.1" + port: int + winbox_uri: str + idle_timeout_seconds: int = 300 + +class SSHSessionRequest(BaseModel): + cols: int = Field(default=80, gt=0, le=500) + rows: int = Field(default=24, gt=0, le=200) + +class SSHSessionResponse(BaseModel): + token: str + websocket_url: str + idle_timeout_seconds: int = 900 +``` + +### Error Responses + +- 403: insufficient role or tenant mismatch +- 404: device not found +- 429: session or rate limit exceeded +- 503: poller unavailable or port range exhausted + +## Frontend: Remote Access UI + +### Dependencies + +New: `@xterm/xterm` (v5+), `@xterm/addon-fit`, `@xterm/addon-web-links`. No other new dependencies. + +### Device Page + +Remote access buttons render in the device header for `operator+` roles: + +``` +┌──────────────────────────────────────────┐ +│ site-branch-01 Online ● │ +│ 10.10.0.5 RB4011 RouterOS 7.16 │ +│ │ +│ [ Open WinBox ] [ SSH Terminal ] │ +│ │ +└──────────────────────────────────────────┘ +``` + +### WinBox Button + +States: `idle`, `requesting`, `ready`, `closing`, `error`. + +On click: + +1. Mutation: `POST .../winbox-session` +2. On success, display: + +``` +WinBox tunnel ready + +Connect to: 127.0.0.1:49023 + +[ Copy Address ] [ Close Tunnel ] + +Tunnel closes after 5 min of inactivity +``` + +3. Attempt deep link on Windows only (detect via `navigator.userAgent`): `window.open("winbox://127.0.0.1:49023")` — must fire directly inside the click handler chain (no setTimeout) to avoid browser blocking. On macOS/Linux, skip the deep link attempt and rely on the copy-address fallback. +4. Copy button with clipboard fallback for HTTP environments (textarea + `execCommand("copy")`) +5. Navigating away does not close the tunnel — backend idle timeout handles cleanup +6. 
Close button disabled while DELETE request is in flight + +### SSH Terminal + +Two phases: + +**Phase 1 — Token acquisition:** + +``` +POST .../ssh-session { cols, rows } +→ { token, websocket_url } +``` + +**Phase 2 — Terminal session:** + +```typescript +const term = new Terminal({ + cursorBlink: true, + fontFamily: 'Geist Mono, monospace', + fontSize: 14, + scrollback: 2000, + convertEol: true, + theme: darkMode ? darkTheme : lightTheme +}) +const fitAddon = new FitAddon() +term.loadAddon(fitAddon) +term.open(containerRef) +// fit after font load +fitAddon.fit() +``` + +WebSocket scheme derived dynamically: `location.protocol === "https:" ? "wss" : "ws"` + +**Data flow:** + +- User keystroke → `term.onData` → `ws.send(binaryFrame)` → poller → SSH stdin +- Router output → SSH stdout → poller → `ws.onmessage` → `term.write(new Uint8Array(data))` +- Resize → `term.onResize` → throttled (75ms) → `ws.send(JSON.stringify({type:"resize", cols, rows}))` + +**WebSocket lifecycle:** + +- `onopen`: `term.write("Connecting to router...\r\n")` +- `onmessage`: binary → `term.write`, text → parse control +- `onclose`: display "Session closed." in red, disable input, show Reconnect button +- `onerror`: display "Connection error." in red +- Abnormal close codes (1006, 1008, 1011) display appropriate messages + +**Reconnect**: Always requests a new token. Never reuses WebSocket or token. 
+ +**Cleanup on unmount:** + +```typescript +useEffect(() => { + return () => { + term?.dispose() + ws?.close() + } +}, []) +``` + +**Terminal UI:** + +``` +┌──────────────────────────────────────────────────┐ +│ SSH: site-branch-01 [ Disconnect ] │ +├──────────────────────────────────────────────────┤ +│ │ +│ [admin@site-branch-01] > │ +│ │ +└──────────────────────────────────────────────────┘ +SSH session active — idle timeout: 15 min +``` + +- Inline on device page by default, expandable to full viewport +- Auto-expand to full viewport on screens < 900px width +- Dark/light theme maps to existing Tailwind HSL tokens (no hardcoded hex) +- `tabindex=0` on terminal container for keyboard focus +- Active session indicator when sessions list returns data + +### API Client Extension + +```typescript +const remoteAccessApi = { + openWinbox: (tenantId: string, deviceId: string) => + client.post( + `/tenants/${tenantId}/devices/${deviceId}/winbox-session` + ), + closeWinbox: (tenantId: string, deviceId: string, tunnelId: string) => + client.delete( + `/tenants/${tenantId}/devices/${deviceId}/winbox-session/${tunnelId}` + ), + openSSH: (tenantId: string, deviceId: string, req: SSHSessionRequest) => + client.post( + `/tenants/${tenantId}/devices/${deviceId}/ssh-session`, req + ), + getSessions: (tenantId: string, deviceId: string) => + client.get( + `/tenants/${tenantId}/devices/${deviceId}/sessions` + ), +} +``` + +## Infrastructure + +### nginx — WebSocket Proxy + +Add to `infrastructure/docker/nginx-spa.conf`: + +```nginx +# WebSocket upgrade mapping (top-level, outside server block) +map $http_upgrade $connection_upgrade { + default upgrade; + '' close; +} + +# Inside server block: +location /ws/ssh { + resolver 127.0.0.11 valid=10s ipv6=off; + set $poller_upstream http://poller:8080; + + proxy_pass $poller_upstream; + proxy_http_version 1.1; + + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection $connection_upgrade; + proxy_set_header 
X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header Host $host; + + proxy_read_timeout 1800s; + proxy_send_timeout 1800s; + + proxy_buffering off; + proxy_request_buffering off; + proxy_busy_buffers_size 512k; + proxy_buffers 8 512k; + +} +``` + +**CSP**: The existing `connect-src 'self'` should be sufficient for same-origin WebSocket connections in modern browsers (CSP `self` matches same-origin `ws://` and `wss://`). For maximum compatibility across all environments, explicitly add `ws: wss:` to the `connect-src` directive. HTTPS-only deployments can restrict to just `wss:`. + +### Docker Compose + +**Poller service additions — apply to these specific files:** + +- `docker-compose.override.yml` (dev): ports, environment, ulimits, healthcheck +- `docker-compose.prod.yml` (production): ports, environment, ulimits, healthcheck, increased memory limit +- `docker-compose.staging.yml` (staging): same as prod + +```yaml +poller: + ports: + - "127.0.0.1:49000-49100:49000-49100" + ulimits: + nofile: + soft: 8192 + hard: 8192 + environment: + TUNNEL_PORT_MIN: 49000 + TUNNEL_PORT_MAX: 49100 + TUNNEL_IDLE_TIMEOUT: 300 + SSH_RELAY_PORT: 8080 + SSH_IDLE_TIMEOUT: 900 + SSH_MAX_SESSIONS: 200 + SSH_MAX_PER_USER: 10 + SSH_MAX_PER_DEVICE: 20 + healthcheck: + test: ["CMD-SHELL", "wget --spider -q http://localhost:8080/healthz || exit 1"] + interval: 30s + timeout: 3s + retries: 3 +``` + +**Production memory limit**: Increase poller from 256MB to 384–512MB. + +**Redis dependency**: Ensure `depends_on: redis: condition: service_started`. + +**Docker proxy note**: The 101-port range mapping creates individual `docker-proxy` processes. For production, set `"userland-proxy": false` in `/etc/docker/daemon.json` to use iptables-based forwarding instead, which avoids spawning 101 proxy processes and improves startup time. 
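The port range mapped above (`TUNNEL_PORT_MIN` to `TUNNEL_PORT_MAX`) is the set the tunnel manager's port pool allocates from. The following is a minimal Go sketch of the Port Pool behaviour described earlier (next free port, bind check before allocation, release, mutex on every operation). The constructor `NewPortPool`, the injectable `canBind` hook, `loopbackBindable`, and `ErrPoolExhausted` are hypothetical names added for illustration; only the `PortPool` fields `mu`, `ports`, and `base` come from the spec.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
)

// ErrPoolExhausted is returned when no port in the range can be allocated.
var ErrPoolExhausted = errors.New("tunnel port pool exhausted")

type PortPool struct {
	mu      sync.Mutex
	ports   []bool              // true = in use
	base    int                 // first port in the range, e.g. 49000
	canBind func(port int) bool // bind probe; injectable so tests stay deterministic
}

// NewPortPool covers the inclusive range [min, max]. A nil canBind selects a
// real loopback bind probe.
func NewPortPool(min, max int, canBind func(int) bool) *PortPool {
	if canBind == nil {
		canBind = loopbackBindable
	}
	return &PortPool{ports: make([]bool, max-min+1), base: min, canBind: canBind}
}

// loopbackBindable reports whether 127.0.0.1:port can currently be bound.
func loopbackBindable(port int) bool {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	ln.Close()
	return true
}

// Allocate returns the lowest free port that also passes the bind probe.
func (p *PortPool) Allocate() (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i, used := range p.ports {
		if used {
			continue
		}
		port := p.base + i
		if !p.canBind(port) {
			continue // held by something else, e.g. a stale mapping
		}
		p.ports[i] = true
		return port, nil
	}
	return 0, ErrPoolExhausted
}

// Release marks a port as free; ports outside the range are ignored.
func (p *PortPool) Release(port int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if i := port - p.base; i >= 0 && i < len(p.ports) {
		p.ports[i] = false
	}
}

func main() {
	pool := NewPortPool(49000, 49100, nil)
	port, err := pool.Allocate()
	if err != nil {
		fmt.Println("allocate failed:", err)
		return
	}
	fmt.Println("allocated", port)
	pool.Release(port)
}
```

Probing and then re-listening leaves a small window in which another process could take the port, so a real implementation might prefer to keep the probe's listener and hand it to the tunnel instead of closing it. With the default range the pool holds 101 ports, so an `ErrPoolExhausted` here corresponds to the 503 "port range exhausted" case in the API error list.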
+ +### Poller HTTP Server + +```go +httpServer := &http.Server{ + Addr: ":" + cfg.SSHRelayPort, + Handler: sshrelay.NewServer(redisClient, credCache).Handler(), +} +go func() { + // http.ErrServerClosed is the expected result of a graceful Shutdown + if err := httpServer.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) { + log.Printf("ssh relay http server: %v", err) + } +}() + +// On poller shutdown: graceful stop with 5s timeout. +// Shutdown does not close hijacked connections such as WebSockets, +// so SSH sessions are terminated separately (see Graceful Shutdown). +shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) +defer cancel() +httpServer.Shutdown(shutdownCtx) +``` + +### New Environment Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `TUNNEL_PORT_MIN` | `49000` | Start of WinBox tunnel port range | +| `TUNNEL_PORT_MAX` | `49100` | End of WinBox tunnel port range | +| `TUNNEL_IDLE_TIMEOUT` | `300` | WinBox tunnel idle timeout (seconds) | +| `SSH_RELAY_PORT` | `8080` | Internal HTTP/WebSocket port for SSH relay | +| `SSH_IDLE_TIMEOUT` | `900` | SSH session idle timeout (seconds) | +| `SSH_MAX_SESSIONS` | `200` | Max concurrent SSH sessions per poller | +| `SSH_MAX_PER_USER` | `10` | Max concurrent SSH sessions per user | +| `SSH_MAX_PER_DEVICE` | `20` | Max concurrent SSH sessions per device | + +### Graceful Shutdown + +When the poller container shuts down: + +1. Stop accepting new tunnels and SSH sessions +2. Close HTTP/WebSocket server (5s timeout) +3. Gracefully terminate SSH sessions +4. Close all tunnel listeners +5. Wait for active connections +6. 
Release tunnel ports + +## Testing Strategy + +### Unit Tests + +**Poller (Go):** + +- Port pool: allocation, release, reuse after close, concurrent access, exhaustion, bind failure retry +- Tunnel manager: lifecycle, idle detection with zero active connections, multiple concurrent connections on same tunnel, cleanup when listener creation fails +- TCP proxy: activity tracking (atomic), bidirectional shutdown, dial failure cleanup +- SSH relay: token validation (valid/expired/reused/wrong tenant), session limits, resize parsing and validation, malformed control messages, invalid JSON frames, binary frame size limits, resize flood protection, cleanup on SSH dial failure, cleanup on abrupt WebSocket close + +**Backend (Python):** + +- RBAC: viewer gets 403, operator gets 200 +- Device validation: wrong tenant gets 404, disabled device rejected +- Token generation: stored in Redis with correct TTL +- Rate limiting: 11th request within a minute gets 429 +- Session limits: exceeding per-user/per-device limits returns 429 +- Source IP extraction from `X-Real-IP`, with fallback to `request.client.host` +- NATS timeout returns 503 +- Redis unavailable during token storage +- Malformed request payloads rejected + +### Integration Tests + +- **Tunnel end-to-end**: API → NATS → poller allocates port → verify listening on 127.0.0.1 → TCP connect → data forwarded to mock router +- **SSH end-to-end**: API issues token → WebSocket → poller validates → SSH to mock SSHD → verify keystroke round-trip and resize +- **Token lifecycle**: consumed on first use, second use rejected, expired token rejected +- **Idle timeout**: open tunnel, no traffic, verify it closes after 5 min; open SSH, no activity, verify it closes after 15 min +- **Concurrent sessions**: 10 SSH from same user succeeds, 11th rejected +- **Tunnel stress**: 50 concurrent tunnels, verify unique ports, verify cleanup +- **SSH stress**: many simultaneous WebSocket sessions, verify limits and stability +- **Router unreachable**: SSH dial fails, WebSocket closes with error, no zombie 
session +- **Poller restart**: sessions terminate, frontend shows disconnect, reconnect works +- **Backward compatibility**: existing polling, config push, NATS subjects unchanged + +### Security Tests + +- Token replay: reuse consumed token → rejected +- Cross-tenant: user from tenant A accesses device from tenant B → rejected +- Malformed token: invalid base64, wrong length → rejected without panic + +### Resource Leak Detection + +During integration testing, monitor: open file descriptors, goroutine count, memory usage. Verify SSH sessions and tunnels release all resources after closure. + +### Manual Testing + +- WinBox tunnel to router behind WireGuard — full WinBox functionality +- SSH terminal — tab completion, arrow keys, command history, line wrapping after resize +- Deep link `winbox://` on Windows — auto-launch +- Copy address fallback on macOS/Linux +- Navigate away with open tunnel — stays open, closes on idle +- Poller restart — frontend handles disconnect, reconnect works +- Multiple SSH terminals to different devices simultaneously +- Dark/light mode terminal theme +- Chrome, Firefox, Safari — WebSocket stability, clipboard, deep link, resize + +### Observability Verification + +Verify structured JSON logs exist with correct fields for: tunnel created/closed, port allocated, SSH session started/ended (with duration and reason), idle timeout events. + +## Rollout Sequence + +1. Deploy poller changes to staging (tunnel manager, SSH relay, HTTP server, NATS subjects) +2. Deploy infrastructure changes (docker-compose ports, nginx WebSocket config, CSP, ulimits) +3. Validate tunnels and SSH relay in staging +4. Deploy API endpoints (remote access router, session tokens, audit logging, rate limiting) +5. Deploy frontend (WinBox button, SSH terminal, API client) +6. Update documentation (ARCHITECTURE, DEPLOYMENT, SECURITY, CONFIGURATION, README) +7. 
Tag as v9.5 with release notes covering: WinBox remote access, browser SSH terminal, new env vars, port range requirement + +Never deploy frontend before backend endpoints exist. + +## Out of Scope + +- WinBox protocol reimplementation in browser +- SSH key authentication (password only, matching existing credential model) +- Session recording/playback +- File transfer through SSH terminal +- Multi-user shared terminal sessions