feat: The Other Dude v9.0.1 — full-featured email system

ci: add GitHub Pages deployment workflow for docs site

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Jason Staack
2026-03-08 17:46:37 -05:00
commit b840047e19
511 changed files with 106948 additions and 0 deletions

docs/DEPLOYMENT.md Normal file

@@ -0,0 +1,257 @@
# TOD - The Other Dude — Deployment Guide
## Overview
TOD (The Other Dude) is a containerized fleet management platform for RouterOS devices. This guide covers Docker Compose deployment for production environments.
### Architecture
- **Backend API** (Python/FastAPI) -- REST API with JWT authentication and PostgreSQL RLS
- **Go Poller** -- Polls RouterOS devices via binary API, publishes events to NATS
- **Frontend** (React/nginx) -- Single-page application served by nginx
- **PostgreSQL + TimescaleDB** -- Primary database with time-series extensions
- **Redis** -- Distributed locking and rate limiting
- **NATS JetStream** -- Message bus for device events
## Prerequisites
- Docker Engine 24+ with Docker Compose v2
- 4 GB RAM recommended (2 GB absolute minimum; image builds are memory-intensive)
- External SSD or fast storage recommended for Docker volumes
- Network access to RouterOS devices on ports 8728 (API) and 8729 (API-SSL)
## Quick Start
### 1. Clone and Configure
```bash
git clone <repository-url> tod
cd tod
# Copy environment template
cp .env.example .env.prod
```
### 2. Generate Secrets
```bash
# Generate JWT secret
python3 -c "import secrets; print(secrets.token_urlsafe(64))"
# Generate credential encryption key (32 bytes, base64-encoded)
python3 -c "import secrets, base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"
```
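The commands above print values without checking them. As a minimal sketch, the helpers below confirm that a secret meets a minimum length and that an encryption key decodes to exactly 32 bytes. The function names are illustrative and not part of TOD, and `base64 -d` assumes GNU coreutils or a recent macOS.

```bash
# Illustrative helpers (hypothetical names, not shipped with TOD).

# Succeeds if the string in $1 is at least $2 characters long.
has_min_length() {
  [ "${#1}" -ge "$2" ]
}

# Succeeds if $1 is valid base64 that decodes to exactly 32 bytes.
is_b64_32_bytes() {
  local n
  # pipefail makes a base64 decoding error fail the whole pipeline.
  n=$(set -o pipefail; printf '%s' "$1" | base64 -d 2>/dev/null | wc -c) || return 1
  [ "$n" -eq 32 ]
}
```

For example, `is_b64_32_bytes "$CREDENTIAL_ENCRYPTION_KEY" || echo "bad key"` would flag a key that was truncated while being pasted.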
Edit `.env.prod` with the generated values:
```env
ENVIRONMENT=production
JWT_SECRET_KEY=<generated-jwt-secret>
CREDENTIAL_ENCRYPTION_KEY=<generated-encryption-key>
POSTGRES_PASSWORD=<strong-password>
# First admin user (created on first startup)
FIRST_ADMIN_EMAIL=admin@example.com
FIRST_ADMIN_PASSWORD=<strong-password>
```
### 3. Build Images
Build images **one at a time** to avoid out-of-memory crashes on constrained hosts:
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
```
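The three build commands differ only in the service name, so a loop can preserve the one-at-a-time order. The following is a sketch; `tod_compose` and `build_images` are hypothetical helper names, not part of the repository.

```bash
# Hypothetical wrapper: the two -f flags used by every production command.
tod_compose() {
  docker compose -f docker-compose.yml -f docker-compose.prod.yml "$@"
}

# Build the images sequentially and stop at the first failure.
build_images() {
  local svc
  for svc in api poller frontend; do
    tod_compose build "$svc" || return 1
  done
}
```

Because the wrapper forwards its arguments unchanged, `tod_compose --env-file .env.prod up -d` expands to the same start command used in the next step.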
### 4. Start the Stack
```bash
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
```
### 5. Verify
```bash
# Check all services are running
docker compose ps
# Check API health (liveness)
curl http://localhost:8000/health
# Check readiness (PostgreSQL, Redis, NATS connected)
curl http://localhost:8000/health/ready
# Access the portal (`open` is macOS-only; on Linux use `xdg-open`)
open http://localhost
```
Log in with the `FIRST_ADMIN_EMAIL` and `FIRST_ADMIN_PASSWORD` credentials set in step 2.
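Right after `up -d` the readiness endpoint may fail for a short time while dependencies connect. The sketch below polls a URL until it answers with a 2xx status. It assumes that `/health/ready` returns a non-2xx status while the API is not ready, which this guide does not state; `wait_for_ready` is a hypothetical helper name.

```bash
# Poll a URL until it returns HTTP 2xx, or give up after N attempts.
# Usage: wait_for_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_ready() {
  local url=$1 attempts=${2:-30} delay=${3:-2} i
  for ((i = 1; i <= attempts; i++)); do
    # -f turns HTTP error statuses into a non-zero exit code.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}
```

A typical call would be `wait_for_ready http://localhost:8000/health/ready`, which retries for roughly one minute with the default settings.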
## Environment Configuration
### Required Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `ENVIRONMENT` | Deployment environment | `production` |
| `JWT_SECRET_KEY` | JWT signing secret (min 32 chars) | `<generated>` |
| `CREDENTIAL_ENCRYPTION_KEY` | AES-256 key for device credentials (base64) | `<generated>` |
| `POSTGRES_PASSWORD` | PostgreSQL superuser password | `<strong-password>` |
| `FIRST_ADMIN_EMAIL` | Initial admin account email | `admin@example.com` |
| `FIRST_ADMIN_PASSWORD` | Initial admin account password | `<strong-password>` |
### Optional Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `GUNICORN_WORKERS` | `2` | API worker process count |
| `DB_POOL_SIZE` | `20` | App database connection pool size |
| `DB_MAX_OVERFLOW` | `40` | Max overflow connections above pool |
| `DB_ADMIN_POOL_SIZE` | `10` | Admin database connection pool size |
| `DB_ADMIN_MAX_OVERFLOW` | `20` | Admin max overflow connections |
| `POLL_INTERVAL_SECONDS` | `60` | Device polling interval |
| `CONNECTION_TIMEOUT_SECONDS` | `10` | RouterOS connection timeout |
| `COMMAND_TIMEOUT_SECONDS` | `30` | RouterOS per-command timeout |
| `CIRCUIT_BREAKER_MAX_FAILURES` | `5` | Consecutive failures before backoff |
| `CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS` | `30` | Initial backoff duration |
| `CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS` | `900` | Maximum backoff (15 min) |
| `LOG_LEVEL` | `info` | Logging verbosity (`debug`/`info`/`warn`/`error`) |
| `CORS_ORIGINS` | `http://localhost:3000` | Comma-separated CORS origins |
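The pool settings interact with the worker count. As a rough worked example, the sketch below computes the worst-case number of PostgreSQL connections under the assumption that every gunicorn worker keeps its own app pool and admin pool; that per-worker behavior is an assumption, not something this guide states. `max_db_connections` is a hypothetical helper name.

```bash
# Worst-case PostgreSQL connections opened by the API.
# ASSUMPTION: each gunicorn worker holds its own app and admin pool.
max_db_connections() {
  local workers=${GUNICORN_WORKERS:-2}
  local pool=${DB_POOL_SIZE:-20} overflow=${DB_MAX_OVERFLOW:-40}
  local admin_pool=${DB_ADMIN_POOL_SIZE:-10}
  local admin_overflow=${DB_ADMIN_MAX_OVERFLOW:-20}
  echo $(( workers * (pool + overflow + admin_pool + admin_overflow) ))
}
```

With the defaults from the table this yields 2 × (20 + 40 + 10 + 20) = 180. Stock PostgreSQL ships with `max_connections = 100`, so if the assumption holds it is worth comparing the result against `SHOW max_connections;` on the deployed database, which may be tuned differently.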
### Security Notes
- **Never use default secrets in production.** The application refuses to start if it detects known insecure defaults (like the dev JWT secret) in non-dev environments.
- **Credential encryption key** is used to encrypt RouterOS device passwords at rest. Losing this key means re-entering all device credentials.
- **CORS_ORIGINS** should be set to your actual domain in production.
- **RLS enforcement**: The `app_user` database role is subject to PostgreSQL row-level security, so queries made for one tenant cannot read another tenant's data. A compromised JWT exposes only the tenant it was issued for.
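The startup check described above targets known insecure defaults. Whether it also rejects template placeholders such as `<strong-password>` is not stated, so a preflight check along the following lines may help. It is a sketch; `check_env_placeholders` is a hypothetical helper, not part of TOD.

```bash
# Hypothetical preflight check, not part of TOD.
# Prints every KEY=<placeholder> line left in the env file given as $1
# and fails if any remain or the file cannot be read.
check_env_placeholders() {
  local file=$1
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  if grep -nE '^[A-Za-z_][A-Za-z0-9_]*=<[^>]*>[[:space:]]*$' "$file"; then
    echo "placeholder values remain in $file" >&2
    return 1
  fi
}
```

Running `check_env_placeholders .env.prod` before starting the stack lists the offending lines with their line numbers.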
## Storage Configuration
Service data is bind-mounted from the host filesystem. Default locations are configured in `docker-compose.yml`:
- **PostgreSQL data**: `./docker-data/postgres`
- **Redis data**: `./docker-data/redis`
- **NATS data**: `./docker-data/nats`
- **Git store (config backups)**: `./docker-data/git-store`
To change storage locations, edit the volume mounts in `docker-compose.yml`.
## Resource Limits
Container memory limits are set in `docker-compose.prod.yml` so that no single service can exhaust host memory:
| Service | Memory Limit |
|---------|-------------|
| PostgreSQL | 512MB |
| Redis | 128MB |
| NATS | 128MB |
| API | 512MB |
| Poller | 256MB |
| Frontend | 64MB |
Adjust under `deploy.resources.limits.memory` in `docker-compose.prod.yml`.
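A quick worked sum shows how much memory the stack may claim in total. The figures come straight from the table; the comparison with a 2 GB host is only an illustration of why that size is described as an absolute minimum.

```bash
# Memory limits in MB, in table order:
# PostgreSQL, Redis, NATS, API, Poller, Frontend
limits_mb=(512 128 128 512 256 64)

total=0
for mb in "${limits_mb[@]}"; do
  total=$(( total + mb ))
done

echo "combined limit: ${total} MB"
echo "headroom on a 2 GB host: $(( 2048 - total )) MB"
```

The combined limit is 1600 MB, which leaves 448 MB for the operating system and the Docker daemon on a 2 GB host.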
## API Documentation
The backend serves interactive API documentation at:
- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`
All endpoints include descriptions, request/response schemas, and authentication requirements.
## Monitoring (Optional)
Enable Prometheus and Grafana monitoring with the observability compose overlay:
```bash
docker compose \
-f docker-compose.yml \
-f docker-compose.prod.yml \
-f docker-compose.observability.yml \
--env-file .env.prod up -d
```
- **Prometheus**: `http://localhost:9090`
- **Grafana**: `http://localhost:3001` (default: admin/admin)
### Exported Metrics
The API and poller export Prometheus metrics:
| Metric | Source | Description |
|--------|--------|-------------|
| `http_requests_total` | API | HTTP request count by method, path, status |
| `http_request_duration_seconds` | API | Request latency histogram |
| `mikrotik_poll_total` | Poller | Poll cycles by status (success/error/skipped) |
| `mikrotik_poll_duration_seconds` | Poller | Poll cycle duration histogram |
| `mikrotik_devices_active` | Poller | Number of devices being polled |
| `mikrotik_circuit_breaker_skips_total` | Poller | Polls skipped due to backoff |
| `mikrotik_nats_publish_total` | Poller | NATS publishes by subject and status |
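As one way to use these counters without Grafana, the sketch below reads Prometheus text exposition from standard input and prints the share of poll cycles that ended in error. It assumes the label is literally named `status`, which the table implies but does not confirm; `poll_error_ratio` is a hypothetical helper, and the URL of the poller's metrics endpoint is not documented here.

```bash
# Read Prometheus text exposition on stdin and print the fraction of
# mikrotik_poll_total samples labeled status="error".
# ASSUMPTION: the label name is "status" and samples carry no timestamp.
poll_error_ratio() {
  awk '
    /^mikrotik_poll_total[{]/ {
      total += $NF
      if ($0 ~ /status="error"/) errors += $NF
    }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", errors / total
    }'
}
```

Pipe the output of a `curl -s` call against the poller's metrics endpoint into `poll_error_ratio` to get a value between 0 and 1.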
## Maintenance
### Backup Strategy
- **Database**: Use `pg_dump` or configure PostgreSQL streaming replication
- **Config backups**: Stored as Git repositories in the git-store volume and created automatically every night; include `./docker-data/git-store` in host-level backups
- **Encryption key**: Store `CREDENTIAL_ENCRYPTION_KEY` securely -- required to decrypt device credentials
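The `pg_dump` option can be scripted roughly as follows. This is a sketch under stated assumptions: the Compose service name `postgres`, the superuser `postgres`, and the database name `postgres` are all guesses that must be checked against `docker-compose.yml`, and both function names are hypothetical.

```bash
# Build a timestamped dump filename in UTC, e.g. tod-20260308T224637Z.sql.gz
backup_filename() {
  printf 'tod-%s.sql.gz\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Dump the database from the running container into ./backups/.
# ASSUMPTIONS: service "postgres", user "postgres", database "postgres".
backup_database() {
  local out
  out="backups/$(backup_filename)"
  mkdir -p backups
  # pipefail ensures a pg_dump failure is not hidden by gzip succeeding.
  if ! (set -o pipefail
        docker compose exec -T postgres pg_dump -U postgres postgres | gzip > "$out"); then
    rm -f "$out"
    echo "backup failed" >&2
    return 1
  fi
  echo "wrote $out"
}
```

The `-T` flag disables pseudo-TTY allocation so the dump can be piped cleanly. A dump is only useful together with the `CREDENTIAL_ENCRYPTION_KEY`, which should be stored separately from it.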
### Updating
```bash
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml build api
docker compose -f docker-compose.yml -f docker-compose.prod.yml build poller
docker compose -f docker-compose.yml -f docker-compose.prod.yml build frontend
docker compose -f docker-compose.yml -f docker-compose.prod.yml --env-file .env.prod up -d
```
Database migrations run automatically on API startup via Alembic.
### Logs
```bash
# All services
docker compose logs -f
# Specific service
docker compose logs -f api
# Filter structured JSON logs with jq (-R with fromjson? skips non-JSON lines)
docker compose logs api --no-log-prefix 2>&1 | jq -R 'fromjson? | objects | select(.event != null)'
# View audit logs (config editor operations)
docker compose logs api --no-log-prefix 2>&1 | jq -R 'fromjson? | objects | select(.event | strings | startswith("routeros_"))'
```
### Graceful Shutdown
All services handle SIGTERM for graceful shutdown:
- **API (gunicorn)**: Finishes in-flight requests within `GUNICORN_GRACEFUL_TIMEOUT` (default 30s), then disposes database connection pools
- **Poller (Go)**: Cancels all device polling goroutines via context propagation, waits for in-flight polls to complete
- **Frontend (nginx)**: Stops accepting new connections and finishes serving active requests
```bash
# Graceful stop (sends SIGTERM; -t 30 allows 30s before SIGKILL, Compose's default is 10s)
docker compose stop -t 30
# Restart a single service
docker compose restart api
```
## Troubleshooting
| Issue | Solution |
|-------|----------|
| API won't start with secret error | Generate production secrets (see step 2 above) |
| Build crashes with OOM | Build images one at a time (see step 3 above) |
| Device shows offline | Check network access to device API port (8728/8729) |
| Health check fails | Check `docker compose logs api` for startup errors |
| Rate limited (429) | Wait 60 seconds or check Redis connectivity |
| Migration fails | Check `docker compose logs api` for Alembic errors |
| NATS subscriber won't start | Non-fatal -- API runs without NATS; check NATS container health |
| Poller circuit breaker active | Device unreachable; check `CIRCUIT_BREAKER_*` env vars to tune backoff |
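To get a feel for how the `CIRCUIT_BREAKER_*` values interact, the sketch below computes a backoff delay from a failure count. It assumes the delay doubles with each failure beyond the threshold; this guide documents only the threshold, the initial value, and the cap, so the poller's actual growth curve may differ. `backoff_seconds` is a hypothetical helper name.

```bash
# Backoff in seconds after $1 consecutive failures.
# ASSUMPTION: the delay doubles on each failure past the threshold.
backoff_seconds() {
  local failures=$1
  local threshold=${CIRCUIT_BREAKER_MAX_FAILURES:-5}
  local base=${CIRCUIT_BREAKER_BASE_BACKOFF_SECONDS:-30}
  local cap=${CIRCUIT_BREAKER_MAX_BACKOFF_SECONDS:-900}
  # Below the threshold the device is polled normally.
  if (( failures < threshold )); then
    echo 0
    return
  fi
  local delay=$base i
  for ((i = threshold; i < failures && delay < cap; i++)); do
    delay=$(( delay * 2 ))
  done
  (( delay > cap )) && delay=$cap
  echo "$delay"
}
```

Under that assumption and the default settings, the delay starts at 30 seconds on the fifth consecutive failure and reaches the 900-second cap on the tenth.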