the-other-dude/docs/superpowers/specs/2026-03-14-vpn-isolation-design.md
2026-03-14 12:43:53 -05:00

# Per-Tenant VPN Network Isolation — Design Spec
## Overview
Isolate WireGuard VPN networks per tenant so that devices in one tenant's VPN cannot reach devices in another tenant's VPN. Each tenant gets a unique `/24` subnet auto-allocated from `10.10.0.0/16`, with iptables rules blocking cross-subnet traffic.
**Branch:** `main` (this is a security fix, not SaaS-specific)
## Design Decisions
- **Single `wg0` interface** — WireGuard handles thousands of peers on one interface with negligible performance impact. No need for per-tenant interfaces.
- **Per-tenant `/24` subnets** — allocated from `10.10.0.0/16`, giving 255 tenants (index 1–255). Index 0 is reserved. Expandable to `10.0.0.0/8` if needed (note: `_next_available_ip()` materializes all hosts in the subnet, so subnets larger than `/24` require refactoring that function).
- **Auto-allocation only** — `setup_vpn()` picks the next available subnet. No manual override.
- **Global config sync** — one `wg0.conf` with all tenants' peers. Rebuilt on any VPN change. Protected by a PostgreSQL advisory lock to prevent concurrent writes.
- **Global server keypair** — a single WireGuard server keypair stored in `system_settings`, replacing per-tenant server keys. Generated on first `setup_vpn()` call or during migration.
- **iptables isolation** — cross-subnet traffic blocked at the WireGuard container's firewall. IPv6 blocked too.
- **Device-side config is untrusted** — isolation relies entirely on server-side enforcement (AllowedIPs `/32` + iptables DROP). A malicious device operator changing their `allowed-address` to `10.10.0.0/16` on their router gains nothing — the server only routes their assigned `/32`.
## Data Model Changes
### Modified: `vpn_config`
| Column | Change | Description |
|--------|--------|-------------|
| `subnet_index` | **New column**, integer, unique, not null | Maps to third octet: index 1 = `10.10.1.0/24` |
| `subnet` | Default changes | No longer `10.10.0.0/24`; derived from `subnet_index` |
| `server_address` | Default changes | No longer `10.10.0.1/24`; derived as `10.10.{index}.1/24` |
| `server_private_key` | **Deprecated** | Kept in table for rollback safety but no longer used. Global key in `system_settings` is authoritative. |
| `server_public_key` | **Deprecated** | Same — kept but unused. All peers use the global public key. |
### New: `system_settings` entries
| Key | Description |
|-----|-------------|
| `vpn_server_private_key` | Global WireGuard server private key (encrypted with CREDENTIAL_ENCRYPTION_KEY) |
| `vpn_server_public_key` | Global WireGuard server public key (plaintext) |
### Allocation Logic
```
subnet_index = first available integer in range [1, 255] not already in vpn_config
subnet = 10.10.{subnet_index}.0/24
server_address = 10.10.{subnet_index}.1/24
```
Allocation query (gap-filling; a single statement, but not atomic with the insert that follows it):
```sql
SELECT MIN(x) FROM generate_series(1, 255) AS x
WHERE x NOT IN (SELECT subnet_index FROM vpn_config)
```
If no index available → 422 "VPN subnet pool exhausted".
Unique constraint on `subnet_index` provides safety against race conditions. On conflict, retry once.
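The allocation rule can also be expressed in application code, which may help when unit-testing it without a database. The following is a minimal sketch, not the service's actual implementation: `next_subnet_index`, `derive_addresses`, and `SubnetPoolExhausted` are hypothetical names, and the list of used indexes is assumed to come from `vpn_config`.

```python
import ipaddress

POOL_PREFIX = "10.10"          # from the spec: subnets come out of 10.10.0.0/16
MIN_INDEX, MAX_INDEX = 1, 255  # index 0 is reserved


class SubnetPoolExhausted(Exception):
    """Hypothetical error; the API layer would map it to HTTP 422."""


def next_subnet_index(used):
    """Return the lowest free index in [1, 255], filling gaps first."""
    taken = set(used)
    for index in range(MIN_INDEX, MAX_INDEX + 1):
        if index not in taken:
            return index
    raise SubnetPoolExhausted("VPN subnet pool exhausted")


def derive_addresses(index):
    """Derive (subnet, server_address) strings from a subnet index."""
    subnet = ipaddress.ip_network(f"{POOL_PREFIX}.{index}.0/24")
    server = ipaddress.ip_interface(f"{POOL_PREFIX}.{index}.1/24")
    return str(subnet), str(server)
```

For example, with indexes 1, 2, and 4 in use, `next_subnet_index([1, 2, 4])` returns `3`, and `derive_addresses(3)` returns `("10.10.3.0/24", "10.10.3.1/24")`. The unique constraint remains the real safeguard against concurrent allocations.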
## VPN Service Changes
### `setup_vpn(db, tenant_id, endpoint)`
Current behavior: creates VpnConfig with hardcoded `10.10.0.0/24` and generates a per-tenant server keypair.
New behavior:
1. **Get or create global server keypair:** check `system_settings` for `vpn_server_private_key`. If not found, generate a new keypair and store both the private key (encrypted) and public key. This happens on the first `setup_vpn()` call on a fresh install.
2. Allocate next `subnet_index` using the gap-filling query
3. Set `subnet = 10.10.{index}.0/24`
4. Set `server_address = 10.10.{index}.1/24`
5. Store the global public key in `server_public_key` (for backward compat / display)
6. Call `sync_wireguard_config(db)` (global, not per-tenant)
### `sync_wireguard_config(db)`
Current signature: `sync_wireguard_config(db, tenant_id)` — builds config for one tenant.
New signature: `sync_wireguard_config(db)` — builds config for ALL tenants.
**Concurrency protection:** acquire a PostgreSQL advisory lock (`pg_advisory_xact_lock(hash)`) before writing. This prevents two simultaneous peer additions from producing a corrupt `wg0.conf`.
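`pg_advisory_xact_lock()` takes a signed 64-bit integer key, so the `hash` above has to be stable across processes. Python's built-in `hash()` is randomized per process for strings and is not suitable. A minimal sketch of one way to derive the key, assuming a hypothetical lock name `wireguard-config-sync`:

```python
import hashlib

LOCK_NAME = "wireguard-config-sync"  # hypothetical name, not from the spec


def advisory_lock_key(name):
    """Map a lock name to a stable signed 64-bit key for pg_advisory_xact_lock."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # PostgreSQL expects a bigint, so keep 8 bytes and interpret them as signed.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

The resulting integer would be passed as a bound parameter, for example `SELECT pg_advisory_xact_lock(:key)`.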
**Atomic write:** write to a temp file in the same directory as `wg0.conf`, then `os.rename()` it over `wg0.conf`. A rename within one filesystem is atomic on POSIX, so the WireGuard container never reads a partially written file.
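A minimal sketch of the temp-file-then-rename pattern follows. `write_atomically` is a hypothetical helper name. It uses `os.replace()` in place of `os.rename()`; on POSIX the two behave the same here, and `os.replace()` also overwrites an existing target on Windows.

```python
import os
import tempfile


def write_atomically(path, content):
    """Write content to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic. mkstemp creates it with 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".wg0-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Whether the default `0600` permissions suit the WireGuard container's user is an assumption that would need checking in the real deployment.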
New behavior:
1. Acquire advisory lock
2. Read global server private key from `system_settings` (decrypt it)
3. Query ALL enabled `VpnConfig` rows (across all tenants, using admin engine to bypass RLS)
4. For each, query enabled `VpnPeer` rows
5. Build single `wg0.conf`:
```ini
[Interface]
Address = 10.10.0.1/16
ListenPort = 51820
PrivateKey = {global_server_private_key}
# --- Tenant: {tenant_name} (10.10.1.0/24) ---
[Peer]
PublicKey = {peer_public_key}
PresharedKey = {preshared_key}
AllowedIPs = 10.10.1.2/32
# --- Tenant: {tenant_name_2} (10.10.2.0/24) ---
[Peer]
PublicKey = {peer_public_key}
PresharedKey = {preshared_key}
AllowedIPs = 10.10.2.2/32
```
6. Write to temp file, `os.rename()` to `wg0.conf`
7. Touch `.reload` flag
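8. Release advisory lock (with `pg_advisory_xact_lock`, this happens automatically when the transaction commits or rolls back)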
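Step 5 above can be sketched as a pure function that takes already-loaded data and returns the file content. This is an illustration under assumptions: `Peer`, `TenantVpn`, and `render_wg0_conf` are hypothetical names, and the real code would read these values from `VpnConfig` and `VpnPeer` rows.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    public_key: str
    preshared_key: str
    assigned_ip: str  # e.g. "10.10.1.2"


@dataclass
class TenantVpn:
    name: str
    subnet: str  # e.g. "10.10.1.0/24"
    peers: list = field(default_factory=list)


def render_wg0_conf(server_private_key, tenants, listen_port=51820):
    """Build the single global wg0.conf for all enabled tenants."""
    lines = [
        "[Interface]",
        "Address = 10.10.0.1/16",
        f"ListenPort = {listen_port}",
        f"PrivateKey = {server_private_key}",
    ]
    for tenant in tenants:
        # Collapse whitespace so a tenant name containing a newline
        # cannot inject extra config lines through the comment.
        safe_name = " ".join(tenant.name.split())
        lines.append(f"# --- Tenant: {safe_name} ({tenant.subnet}) ---")
        for peer in tenant.peers:
            lines += [
                "[Peer]",
                f"PublicKey = {peer.public_key}",
                f"PresharedKey = {peer.preshared_key}",
                f"AllowedIPs = {peer.assigned_ip}/32",
            ]
    return "\n".join(lines) + "\n"
```

With an empty tenant list the output contains only the `[Interface]` section, which matches the empty-state case in the test plan.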
### `_next_available_ip(db, tenant_id, config)`
No changes needed — already scoped to `tenant_id` and uses the config's subnet. With unique subnets per tenant, IPs are naturally isolated. Note: this function materializes all `/24` hosts into a list, which is fine for `/24` (253 entries) but must be refactored if subnets larger than `/24` are ever used.
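If that refactoring is ever needed, one option is to walk the subnet lazily instead of building a list. The sketch below is not the existing `_next_available_ip()`; `next_available_ip` is a hypothetical standalone function that takes the subnet and the already-used addresses directly, and it assumes the server always holds the first host address.

```python
import ipaddress


def next_available_ip(subnet, used_ips):
    """Return the first free host address in subnet, or None if it is full."""
    network = ipaddress.ip_network(subnet)
    server_ip = network.network_address + 1  # assumed: server holds the .1 address
    taken = {ipaddress.ip_address(ip) for ip in used_ips}
    # hosts() is iterated lazily, so no full host list is kept in memory.
    for host in network.hosts():
        if host != server_ip and host not in taken:
            return str(host)
    return None
```

The memory cost is then bounded by the number of used addresses, not by the subnet size.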
### `add_peer(db, tenant_id, device_id, ...)`
Changes:
- Calls `sync_wireguard_config(db)` instead of `sync_wireguard_config(db, tenant_id)`
- **Validate `additional_allowed_ips`:** if provided, reject any subnet that overlaps with `10.10.0.0/16` (the VPN address space). Only non-VPN subnets are allowed (e.g., `192.168.1.0/24` for site-to-site routing). This prevents a tenant from claiming another tenant's VPN subnet in their AllowedIPs.
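The overlap check can be sketched with the standard `ipaddress` module. `validate_additional_allowed_ips` is a hypothetical helper name, and the input is assumed to be a list of CIDR strings. Note that a supernet such as `10.0.0.0/8` or `0.0.0.0/0` also overlaps the VPN space and is rejected.

```python
import ipaddress

VPN_ADDRESS_SPACE = ipaddress.ip_network("10.10.0.0/16")


def validate_additional_allowed_ips(cidrs):
    """Return normalized CIDRs, rejecting any that overlap the VPN address space."""
    validated = []
    for cidr in cidrs:
        # ip_network raises ValueError on malformed input; strict=False
        # normalizes host bits (e.g. 192.168.1.5/24 becomes 192.168.1.0/24).
        network = ipaddress.ip_network(cidr.strip(), strict=False)
        if network.version == 4 and network.overlaps(VPN_ADDRESS_SPACE):
            raise ValueError(
                "Additional allowed IPs must not overlap the VPN address space (10.10.0.0/16)"
            )
        validated.append(str(network))
    return validated
```

How IPv6 entries should be treated is left open here; given that IPv6 is blocked at the firewall, rejecting them outright may be the simpler policy.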
### `remove_peer(db, tenant_id, peer_id)`
Minor change: calls `sync_wireguard_config(db)` instead of `sync_wireguard_config(db, tenant_id)`.
### Tenant deletion hook
When a tenant is deleted (CASCADE deletes vpn_config and vpn_peers), call `sync_wireguard_config(db)` to regenerate `wg0.conf` without the deleted tenant's peers. Add this to the tenant deletion endpoint.
### `read_wg_status()`
No changes — status is keyed by peer public key, which is unique globally. The existing `get_peer_handshake()` lookup continues to work.
## WireGuard Container Changes
### iptables Isolation Rules
Update `docker-data/wireguard/custom-cont-init.d/10-forwarding.sh`:
```bash
#!/bin/sh
# Enable forwarding between Docker network and WireGuard tunnel
# Idempotent: check before adding to prevent duplicates on restart
iptables -C FORWARD -i eth0 -o wg0 -j ACCEPT 2>/dev/null || iptables -A FORWARD -i eth0 -o wg0 -j ACCEPT
iptables -C FORWARD -i wg0 -o eth0 -j ACCEPT 2>/dev/null || iptables -A FORWARD -i wg0 -o eth0 -j ACCEPT
# Block all peer-to-peer traffic on wg0 (tenant isolation)
# Peers in 10.10.1.0/24 cannot reach peers in 10.10.2.0/24;
# this rule also drops traffic between peers of the same tenant
iptables -C FORWARD -i wg0 -o wg0 -j DROP 2>/dev/null || iptables -A FORWARD -i wg0 -o wg0 -j DROP
# Block IPv6 forwarding on wg0 (prevent link-local bypass)
ip6tables -C FORWARD -i wg0 -j DROP 2>/dev/null || ip6tables -A FORWARD -i wg0 -j DROP
# NAT for return traffic
iptables -t nat -C POSTROUTING -o wg0 -j MASQUERADE 2>/dev/null || iptables -t nat -A POSTROUTING -o wg0 -j MASQUERADE
echo "WireGuard forwarding and tenant isolation rules applied"
```
Rules use `iptables -C` (check) before `-A` (append) to be idempotent across container restarts.
The key isolation layers:
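1. **WireGuard AllowedIPs** — the server accepts a peer's packets only when the source address falls within that peer's AllowedIPs (its assigned `/32`), and routes only those addresses back to it (cryptographic enforcement)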
2. **iptables `wg0 → wg0` DROP** — blocks any traffic that enters and exits the tunnel interface (peer-to-peer)
3. **iptables IPv6 DROP** — prevents link-local IPv6 bypass
4. **Separate subnets** — no IP collisions between tenants
5. **`additional_allowed_ips` validation** — blocks tenants from claiming VPN address space
### Server Address
The `[Interface] Address` changes from `10.10.0.1/24` to `10.10.0.1/16` so the server can route to all tenant subnets.
## Routing Changes
### Poller & API
No changes needed. Both already route `10.10.0.0/16` via the WireGuard container.
### setup.py
Update `prepare_data_dirs()` to write the updated forwarding script with idempotent rules and IPv6 blocking.
## RouterOS Command Generation
### `onboard_device()` and `get_peer_config()`
These generate RouterOS commands for device setup. Changes:
- `allowed-address` changes from `10.10.0.0/24` to `10.10.{index}.0/24` (tenant's specific subnet)
- `endpoint-address` and `endpoint-port` unchanged
- Server public key changes to the global server public key (read from `system_settings`)
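A rough sketch of how the tenant-specific values could be substituted into the generated commands is shown below. It is not the output of `onboard_device()` or `get_peer_config()`: the function name, the interface name `wg-tod`, and the exact set of commands are assumptions for illustration, using RouterOS 7 WireGuard command syntax.

```python
def routeros_peer_commands(subnet_index, host_octet, server_public_key,
                           endpoint_address, endpoint_port=51820,
                           interface="wg-tod"):
    """Build RouterOS commands that point a device at its tenant subnet."""
    subnet = f"10.10.{subnet_index}.0/24"                 # tenant's own /24 only
    address = f"10.10.{subnet_index}.{host_octet}/24"     # device's VPN address
    return [
        "/interface wireguard peers add"
        f" interface={interface}"
        f' public-key="{server_public_key}"'              # global server key
        f" endpoint-address={endpoint_address}"
        f" endpoint-port={endpoint_port}"
        f" allowed-address={subnet}",
        f"/ip address add address={address} interface={interface}",
    ]
```

Because device-side config is untrusted, these values are a convenience for correct setup; isolation does not depend on them.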
## Migration
### Database Migration
1. Generate global server keypair:
   - Create keypair using `generate_wireguard_keypair()`
   - Store in `system_settings`: `vpn_server_private_key` (encrypted), `vpn_server_public_key` (plaintext)
2. Add `subnet_index` column to `vpn_config` (integer, unique). Add it as nullable first, then set it to not null once step 3 has backfilled existing rows; adding a not-null column without a default fails on a table that already has rows.
3. For existing VpnConfig rows (may be multiple if multiple tenants have VPN):
   - Assign sequential `subnet_index` values starting from 1
   - Update `subnet` to `10.10.{index}.0/24`
   - Update `server_address` to `10.10.{index}.1/24`
4. For existing VpnPeer rows:
   - Remap IPs: `10.10.0.X` → `10.10.{tenant's index}.X` (preserve the host octet)
   - Example: Tenant A (index 1) peer at `10.10.0.2` → `10.10.1.2`. Tenant B (index 2) peer at `10.10.0.2` → `10.10.2.2`. No collision.
5. Regenerate `wg0.conf` using the new global sync function
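The index assignment and IP remapping in the migration steps above can be sketched as two small functions. Both names are hypothetical, and the sketch assumes peer addresses are stored either bare (`10.10.0.2`) or with a prefix length (`10.10.0.2/24`); the real column format should be checked before use.

```python
import ipaddress

LEGACY_SUBNET = ipaddress.ip_network("10.10.0.0/24")  # shared pre-migration subnet


def assign_subnet_indexes(tenant_ids):
    """Assign sequential subnet indexes starting from 1, in a stable order."""
    return {tenant_id: index
            for index, tenant_id in enumerate(sorted(tenant_ids), start=1)}


def remap_peer_ip(old_ip, subnet_index):
    """Move a legacy 10.10.0.X address into the tenant's subnet, keeping X."""
    address = ipaddress.ip_interface(old_ip).ip  # strips any prefix length
    if address not in LEGACY_SUBNET:
        raise ValueError(f"{old_ip} is not in the legacy subnet {LEGACY_SUBNET}")
    host_octet = int(address) & 0xFF  # last octet of the IPv4 address
    return f"10.10.{subnet_index}.{host_octet}"
```

For the example in step 4, `remap_peer_ip("10.10.0.2", 1)` returns `"10.10.1.2"` and `remap_peer_ip("10.10.0.2", 2)` returns `"10.10.2.2"`.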
### Device-Side Update Required
This is a **breaking change** for existing VPN peers. After migration:
- Devices need updated RouterOS commands:
  - New server public key (global key replaces per-tenant key)
  - New VPN IP address (`10.10.0.X` → `10.10.{index}.X`)
  - New allowed-address (`10.10.{index}.0/24`)
- The API should expose a "regenerate commands" endpoint or show a banner in the UI indicating that VPN reconfiguration is needed.
### Migration Communication
After the migration runs:
- Log a warning with the list of affected devices
- Show a banner in the VPN UI: "VPN network updated — devices need reconfiguration. Click here for updated commands."
- The existing "View Setup Commands" button in the UI will show the correct updated commands.
## API Changes
### Modified Endpoints
| Method | Path | Change |
|--------|------|--------|
| `POST` | `/api/tenants/{id}/vpn` | `setup_vpn` allocates subnet_index, uses global server key |
| `GET` | `/api/tenants/{id}/vpn` | Returns tenant's specific subnet info |
| `GET` | `/api/tenants/{id}/vpn/peers/{id}/config` | Returns commands with tenant-specific subnet and global server key |
| `POST` | `/api/tenants/{id}/vpn/peers` | Validates `additional_allowed_ips` doesn't overlap `10.10.0.0/16` |
| `DELETE` | `/api/tenants/{id}` | Calls `sync_wireguard_config(db)` after cascade delete |
### No New Endpoints
The isolation is transparent — tenants don't need to know about it.
## Error Handling
| Scenario | HTTP Status | Message |
|----------|-------------|---------|
| No available subnet index (255 tenants with VPN) | 422 | "VPN subnet pool exhausted" |
| Subnet index conflict (race condition) | — | Retry allocation once |
| `additional_allowed_ips` overlaps VPN space | 422 | "Additional allowed IPs must not overlap the VPN address space (10.10.0.0/16)" |
## Testing
- Create two tenants with VPN enabled → verify they get different subnets (`10.10.1.0/24`, `10.10.2.0/24`)
- Add peers in both → verify IPs don't collide
- From tenant A's device, attempt to ping tenant B's device → verify it's blocked
- Verify `wg0.conf` contains peers from both tenants with correct subnets
- Verify iptables rules are in place after container restart (idempotent)
- Verify `additional_allowed_ips` with `10.10.x.x` subnet is rejected
- Delete a tenant → verify `wg0.conf` is regenerated without its peers
- Disable a tenant's VPN → verify peers excluded from `wg0.conf`
- Empty state (no enabled tenants) → verify `wg0.conf` has only `[Interface]` section
- Migration: multiple tenants sharing `10.10.0.0/24` → verify correct remapping to unique subnets
## Audit Logging
- Subnet allocated (tenant_id, subnet_index, subnet)
- Global server keypair generated (first-run event)
- VPN config regenerated (triggered by which operation)
## Out of Scope
- Multiple WireGuard interfaces (not needed at current scale)
- Manual subnet assignment
- IPv6 VPN support (IPv6 is blocked as a security measure)
- Per-tenant WireGuard listen ports
- VPN-level rate limiting or bandwidth quotas