# Server Consolidation & Security Hardening Plan Audit-driven plan to (a) reduce 12 PM2 processes to 3 application servers + 1 auth server, (b) put every API endpoint behind real authentication, and (c) standardize on ESM across all Node services. Approach is "do it properly the first time" — no half-finished pieces, no deferred cleanup. --- ## Status (2026-05-23) | Phase | Status | Notes | |---|---|---| | 1 — Decommission dead services | **Complete** | aircall/gorgias/clarity/legacy-auth-server deleted from repo + PM2 + Caddyfile + ecosystem.cjs | | 2 — Build shared `lib/` | **Complete** | Lives at `inventory-server/shared/` (see Deviations). `/verify` endpoint live on auth-server | | 3 — Convert auth-server + inventory-server to ESM | Not started | Next up. Auth-server is still CJS but has the new `/verify` route added in CJS form | | 4 — Build `dashboard-server` (the merge) | Not started | klaviyo/meta/google/typeform still run as 4 separate PM2 apps | | 5 — Convert `acot-server` to ESM | Not started | | | 6 — Auth hardening | Not started | Shared modules exist (`shared/rate-limit`, `shared/cors`, `shared/logging`) but no service consumes them yet. JWT_SECRET footgun discovered — see 6.4 | | 7 — Caddyfile final form | Partial | Dead routes removed; `forward_auth` gate + `/uploads/*` gating + per-vendor cleanup deferred to after Phase 4 | | 8 — ecosystem.config.cjs final form | Partial | Dead apps removed; final shape depends on Phase 4 merge | **Live PM2 count: 10** (down from 13). Target after Phase 4: 5 application apps + acot-phone-server + lt-wordlist-api. --- ## Goals - Every public-facing endpoint requires a valid auth token (Caddy gate + per-server middleware + per-route permission checks for sensitive operations). - Reduce service count from 12 PM2 processes to 4: `inventory-server`, `acot-server`, `dashboard-server`, `auth-server`. - Standardize on ESM (`"type": "module"`) across all Node services. - Decommission `aircall-server`, `gorgias-server`, `clarity-server`, and the legacy `auth-server` (port 3003). - Eliminate dependency duplication: one Redis client, one Postgres pool helper, one logger, one auth middleware — shared across services. ## Non-goals - Rewriting business logic. Route handlers move as-is unless they break under ESM or shared middleware. - Switching auth providers (we keep JWT + bcrypt + Postgres). - Replacing PM2 or Caddy. - Migrating Klaviyo/Meta/Google/Typeform's external API contracts. --- ## Target architecture ``` ┌──────────────────────────┐ │ tools.acherryontop.com │ │ (Caddy) │ │ forward_auth gate ─────┼──► auth-server:3011 └────────────┬─────────────┘ /verify endpoint │ ┌────────────────────────────────┼────────────────────────────────┐ ▼ ▼ ▼ ┌─────────────────────┐ ┌──────────────────────┐ ┌─────────────────────┐ │ inventory-server │ │ dashboard-server │ │ acot-server │ │ :3010 (ESM) │ │ :3015 (ESM) │ │ :3012 (ESM) │ │ │ │ │ │ │ │ /api/products │ │ /api/klaviyo/* │ │ /api/acot/* │ │ /api/orders │ │ /api/meta/* │ │ (MySQL via SSH) │ │ /api/analytics │ │ /api/google-*/* │ │ │ │ /api/dashboard │ │ /api/typeform/* │ │ │ │ ... (~25 routers) │ │ │ │ │ └─────────────────────┘ └──────────────────────┘ └─────────────────────┘ │ │ │ ├── Postgres (inventory_db) ├── Postgres (klaviyo) └── MySQL (workpi, via ssh2 tunnel) ├── shared lib/ ◄────────────────┤ │ - auth middleware ├── Redis (shared client) │ - permission helper └── shared lib/ ◄─────────────────┐ │ - logger │ │ - pg pool factory │ │ - error formatter │ └─────────────────────────────────────────────────────────────────┘ │ ┌──────────────────┴───┐ │ auth-server │ │ :3011 (ESM) │ │ /login, /me, │ │ /verify, user mgmt │ └──────────────────────┘ ``` PM2 process count: **12 → 4** (plus `acot-phone-server` and `lt-wordlist-api`, which stay as-is — out of scope). --- ## Phase 1 — Decommission dead/leaving services Status: **Complete (2026-05-23)**. All four services removed from repo, PM2, Caddyfile, and ecosystem.config.cjs. Frontend widgets (`AircallDashboard.jsx`, `GorgiasOverview.jsx`) and their dashboard.ts/Navigation.jsx/vite.config.ts wiring also removed. Verification: smoke-tested `https://tools.acherryontop.com/api/{aircall,gorgias,clarity}/*` → 404. Backups left at `/home/matt/{ecosystem.config.cjs,Caddyfile}.bak.2026-05-23`. ### To remove | Service | Reason | Steps | |---|---|---| | `aircall-server` (3002) | Migrating off Aircall | `pm2 delete aircall-server`; remove from `ecosystem.config.cjs`; remove `/api/aircall/*` from Caddyfile; drop `inventory/dashboard/aircall-server/` directory; remove MongoDB connection from any frontend code; cancel Mongo if it was only feeding Aircall | | `gorgias-server` (3006) | Migrating off Gorgias | same pattern; check frontend for `/api/gorgias/*` callers and delete the dashboards/widgets that use them | | `clarity-server` (3009) | Already dead (no `.js` files, not in ecosystem) | remove `/api/clarity/*` from Caddyfile; delete `inventory/dashboard/clarity-server/` directory | | `auth-server` (3003, legacy) | Replaced by `new-auth-server` on 3011 | grep entire codebase for `dashboard-auth` and `localhost:3003`; redirect or remove callers; `pm2 delete auth-server`; remove from ecosystem; remove `/dashboard-auth/*` from Caddyfile; delete `inventory/dashboard/auth-server/` directory | ### Verification before deletion ```bash # from inventory/ root — find any references before removing grep -rn "aircall\|/api/aircall" inventory/src/ inventory-server/src/ grep -rn "gorgias\|/api/gorgias" inventory/src/ inventory-server/src/ grep -rn "/dashboard-auth\|localhost:3003" inventory/src/ inventory-server/src/ grep -rn "/api/clarity" inventory/src/ inventory-server/src/ ``` Any remaining callers must be deleted or repointed before the server is removed. Do **not** leave a 502 response in production. ### Database/secret cleanup - Drop the MongoDB instance feeding Aircall (after confirming no other consumers). - Rotate any Gorgias/Aircall API keys still in `.env` files (defense in depth — they'll be useless soon anyway, but commit hygiene matters). - Remove `MONGODB_URI`, `AIRCALL_*`, `GORGIAS_*` from any `.env` files. --- ## Phase 2 — Build the shared `lib/` Status: **Complete (2026-05-23)**. All 11 modules written under `inventory-server/shared/` (NOT repo root — see Deviations). `/verify` endpoint added to auth-server in CJS form (will move to shared/auth/verify.js usage during Phase 3 ESM conversion). Smoke-tested with no-token / bad-token / expired-token / valid-token cases. No service consumes shared/ yet; that happens in Phases 3–5. ### Location A single shared directory at the repo root: `shared/` (sibling of `inventory/` and `acot-phone/`). Each service imports from it via a relative path. We do **not** introduce npm workspaces yet — relative imports are fine for three consumers and avoid the npm-link / hoisting headaches. ### Modules to create ``` shared/ ├── package.json # "type": "module" ├── auth/ │ ├── middleware.js # authenticate(), requirePermission(), requireAdmin() │ └── verify.js # verifyToken() — pure function, no Express dependency ├── db/ │ ├── pg.js # createPool(envPrefix) — returns configured Pool │ └── redis.js # createRedis() — single client, lazy-connect ├── logging/ │ ├── logger.js # pino-based, redacts Authorization/Cookie │ └── request-log.js # Express middleware, structured access log ├── errors/ │ └── handler.js # consistent error envelope, no leak in prod ├── cors/ │ └── policy.js # single allowed-origins list, exported as cors() options └── rate-limit/ └── login.js # express-rate-limit config for /login ``` ### Auth middleware spec (`shared/auth/middleware.js`) ```js // Pseudocode — final implementation matches the existing pattern in // inventory/auth/routes.js authenticate() but factored out. export function authenticate({ pool }) { return async (req, res, next) => { const header = req.headers.authorization; if (!header?.startsWith('Bearer ')) { return res.status(401).json({ error: 'Authentication required' }); } try { const decoded = jwt.verify(header.slice(7), process.env.JWT_SECRET); // Short-circuit DB hit with an in-memory cache, 60s TTL keyed by token jti const user = await loadUserCached(pool, decoded.userId); if (!user.is_active) return res.status(403).json({ error: 'Account inactive' }); req.user = user; next(); } catch { res.status(401).json({ error: 'Invalid token' }); } }; } export function requirePermission(code) { return (req, res, next) => { if (req.user.is_admin) return next(); if (req.user.permissions?.includes(code)) return next(); res.status(403).json({ error: 'Insufficient permissions' }); }; } export const requireAdmin = (req, res, next) => req.user.is_admin ? next() : res.status(403).json({ error: 'Admin only' }); ``` ### Why a 60s in-memory user cache `forward_auth` in Caddy will call `auth-server` on every request. Each per-server `authenticate()` middleware also has a DB lookup to load permissions. Without caching, every API request becomes 1 SQL query for the user row + 1 for permissions. 60s TTL is short enough that deactivating a user takes effect within a minute, long enough that Klaviyo dashboards (which fire dozens of requests on load) don't hammer Postgres. ### Add to `auth-server`: a `/verify` endpoint Caddy's `forward_auth` only needs "is this token valid? give me a user-id." Today's `/me` does that but with a full permissions join. Add a lightweight `/verify` that: - Verifies JWT signature only (no DB hit). - Returns `200` with `X-User-Id` and `X-User-Is-Admin` response headers (which Caddy `copy_headers` will pass to upstream). - Returns `401` on bad token. **Decision: each service re-verifies the JWT independently.** Caddy's `forward_auth` is a fast first-pass reject for obviously bad tokens, but the security boundary is the per-server `authenticate()` middleware. Cost is negligible (one HMAC-SHA256 per request); the upside is that a misconfigured Caddyfile can never let an unauthenticated request reach a backend. Upstream services do **not** trust any `X-User-*` headers from Caddy — they parse the `Authorization` header themselves. --- ## Phase 3 — Convert `auth-server` and `inventory-server` to ESM Status: **Not started.** Lift the easy ones first. These two stay standalone (don't merge into anything), so they're isolated changes. The auth-server's new `/verify` route (added in Phase 2) is currently CJS — refactor it during this phase to import from `../shared/auth/verify.js`. ### Mechanical conversion Per service: 1. Add `"type": "module"` to `package.json`. 2. Convert `require()` → `import`. `module.exports` → `export` / `export default`. 3. Fix `__dirname`/`__filename` (use `import.meta.url` + `fileURLToPath`). 4. Convert any dynamic require (e.g., conditional plugin loading) to `await import()`. 5. Update any sub-imports that don't include the file extension — ESM requires `./foo.js`, not `./foo`. 6. Update `ecosystem.config.cjs` if any service entry depended on CJS semantics. The ecosystem file itself can stay `.cjs` — PM2 reads it as config, doesn't matter what the apps it spawns are. 7. Update nodemon config / scripts. ### Risk areas in inventory-server - `routes/ai.js` does a lazy init (`aiRouter.initInBackground()` called from `server.js`) — confirm the export shape still works as a default export of an Express router with a sidecar function. May need to split into `export default router; export function initInBackground() {}`. - Multer setup in `routes/import.js` — straightforward, no ESM-specific concerns. - SSE setup in `server.js` — moves over cleanly, no module-system entanglement. - The `child_process.spawn` calls for metrics calculation: ESM doesn't change `child_process` behavior, but if any spawned script uses `require()` of a sibling, that sibling must also be ESM (or stay CJS with a `.cjs` extension). ### Test strategy - After conversion, `pm2 start ecosystem.config.cjs --only inventory-server` on the server, watch logs for require/import errors at startup. - Hit `/health`, then the most exercised endpoints (`/api/products`, `/api/dashboard/overview`, `/api/analytics/...`). If startup is clean and three smoke endpoints work, ESM conversion is done. Functional correctness is preserved because no logic changed. ### Auth-server Already small (~200 LOC server.js + ~few hundred in routes.js + permissions.js). 1-day conversion. Add the new `/verify` endpoint as part of this work. --- ## Phase 4 — Build `dashboard-server` (the merge) Status: **Not started.** The big merge. Klaviyo + Meta + Google + Typeform → one ESM service. Highest-risk phase — see Rollback strategy for the per-vendor cutover plan. ### Layout ``` inventory/dashboard/ ├── server.js # entry: load env, init Postgres+Redis, mount routes, listen ├── package.json # "type": "module", deps from all 4 source servers (deduped) ├── .env # KLAVIYO_*, META_*, GOOGLE_*, TYPEFORM_*, shared DB_*, REDIS_URL ├── routes/ │ ├── klaviyo/ # absorbed from dashboard/klaviyo-server/src/ │ ├── meta/ # absorbed from dashboard/meta-server/ │ ├── google/ # absorbed from dashboard/google-server/ │ └── typeform/ # absorbed from dashboard/typeform-server/ ├── services/ # per-vendor API clients (Klaviyo SDK calls, etc.) ├── scripts/ │ └── import-campaign-products.js # one-shot, moved from klaviyo-server/scripts/ └── logs/ ``` ### Mount points ```js // server.js (sketch) import { authenticate, requirePermission } from '../../shared/auth/middleware.js'; import { createPool } from '../../shared/db/pg.js'; import { createRedis } from '../../shared/db/redis.js'; import { logger, requestLog } from '../../shared/logging/index.js'; import corsPolicy from '../../shared/cors/policy.js'; import errorHandler from '../../shared/errors/handler.js'; import klaviyoRouter from './routes/klaviyo/index.js'; import metaRouter from './routes/meta/index.js'; import googleRouter from './routes/google/index.js'; import typeformRouter from './routes/typeform/index.js'; const app = express(); const pool = await createPool('KLAVIYO_DB'); // klaviyo has its own DB; others can share or have none const redis = await createRedis(); app.use(requestLog); app.use(cors(corsPolicy)); app.use(express.json({ limit: '10mb' })); // Everything below this line requires a valid token. app.use('/api', authenticate({ pool })); app.use('/api/klaviyo', klaviyoRouter({ pool, redis })); app.use('/api/meta', metaRouter({ redis })); app.use('/api/google-analytics', googleRouter({ redis })); // matches Caddy /api/dashboard-analytics rewrite app.use('/api/typeform', typeformRouter({ redis })); app.get('/health', (req, res) => res.json({ ok: true })); app.use(errorHandler); app.listen(process.env.DASHBOARD_PORT || 3015); ``` ### Per-vendor routers Each vendor's existing route file becomes a factory that takes the shared `pool`/`redis` and returns an Express router. Replace each server's per-instance pool/redis with the injected one. ### Permission gates (sensitive routes only) Authenticated-only is the default after `app.use('/api', authenticate(...))`. For sensitive operations, add `requirePermission` per route: - Anything that mutates Klaviyo lists/segments → `requirePermission('klaviyo_write')` - Triggering a campaign sync → `requirePermission('klaviyo_admin')` - Read-only dashboards → no extra check beyond authenticate. Define the new permission codes in the `permissions` table via a migration in Phase 6. ### Dependency dedup **Decision: standardize on `ioredis`.** Klaviyo's larger codebase already uses it, and `ioredis` has better cluster/sentinel support if we ever need it. Update `meta`/`google`/`typeform` call sites — each is a handful of `get`/`set` calls, mechanical conversion. Remove the `redis` package from `dashboard-server`'s `package.json`. ### Env consolidation Single `.env` at `inventory/dashboard/.env`, prefixed keys: ``` DASHBOARD_PORT=3015 KLAVIYO_API_KEY=... KLAVIYO_DB_HOST=... KLAVIYO_DB_NAME=... META_ACCESS_TOKEN=... GOOGLE_SERVICE_ACCOUNT_KEY=... TYPEFORM_TOKEN=... REDIS_URL=... JWT_SECRET=... # shared with auth-server; same secret means same tokens valid here ``` ### Klaviyo's `scripts/import-campaign-products.js` One-shot script — keep it, but run it from the merged dashboard-server's directory. Update the script's imports to ESM. If it's run via cron, update the cron entry to the new path. ### Risk: shared error states When all four vendors share a Redis client, a Redis hiccup affects all four. Make sure the connection has retry config (`ioredis` defaults are reasonable but verify) and that vendor routes degrade gracefully when Redis is unavailable (most use it as a cache, so cache-miss → fall through to upstream API is the right behavior). --- ## Phase 5 — Convert `acot-server` to ESM (stays standalone) Status: **Not started.** Largest single conversion (~5K LOC), but no merge involved. ### Special concern: ssh2 tunnel `acot-server` opens an SSH tunnel via `ssh2` to access the production MySQL at `192.168.1.5:3309`. The tunnel must be: - Established before the HTTP listener starts (so no requests fail with "no DB connection"). - Re-established on disconnect (`ssh2` connection's `close` event → recreate). - Cleanly torn down on `SIGTERM`/`SIGINT` so PM2 restarts don't leak file descriptors. Verify (or add) this lifecycle handling as part of the conversion. If it's already correct, conversion is mechanical; if not, this is a good moment to fix it. ### Test strategy Same as inventory-server: start with PM2, smoke-test the most-used `/api/acot/*` endpoints, watch logs for unhandled rejection or tunnel-close events. --- ## Phase 6 — Auth hardening Status: **Not started.** This is the security work that justifies the whole refactor. Runs in parallel with phases 3–5 where possible. Shared building blocks already exist (`shared/rate-limit/login.js`, `shared/cors/policy.js`, `shared/logging/request-log.js`, `shared/errors/handler.js`) — Phase 6 is about *applying* them per-service. ### 6.1 Caddy `forward_auth` gate Add to the `tools.acherryontop.com` block, before the `@api_routes` handler: ```caddyfile # Forward-auth gate for all API traffic @needs_auth path /api/* /chat-api/* handle @needs_auth { forward_auth localhost:3011 { uri /verify copy_headers Authorization # On 401/403, Caddy returns the auth-server's response body verbatim } # Existing per-vendor handle blocks remain below this line } # /auth-inv/* stays public (you need to log in!) handle /auth-inv/* { uri strip_prefix /auth-inv reverse_proxy localhost:3011 } ``` The `forward_auth` directive subrequests `/verify` on the auth-server. If it returns 2xx, the request proceeds upstream. If 401/403, Caddy returns that response to the client and never hits the backend. This is the **first** line of defense. Per-server middleware (`shared/auth/middleware.js`) is the **second** line — re-verifies the JWT independently. Defense in depth: a Caddyfile typo can't open a hole. ### 6.2 Per-route permission gates After per-server `authenticate()`, add `requirePermission(code)` to destructive or sensitive routes. Audit needed in: - `inventory-server/src/routes/config.js` — global config writes → `admin` - `inventory-server/src/routes/import.js` — uploads, deletes, generate-upc → `product_import` - `inventory-server/src/routes/data-management.js` — CSV operations → `data_management` - `inventory-server/src/routes/ai-prompts.js` — prompt edits → `ai_admin` - `inventory-server/src/routes/templates.js` — template writes → `templates_write` - `inventory-server/src/routes/reusable-images.js` — image management → `image_admin` - `inventory-server/src/routes/products.js` — only one POST (`/resolve-identifiers`); evaluate whether it needs a permission code or authenticated-only is fine - `inventory-server/src/routes/product-editor-audit-log.js` and `import-audit-log.js` — read-only by sensitive users → `audit_read` - `dashboard-server` Klaviyo/Meta/Google/Typeform write endpoints → vendor-specific codes per above Migration: a single SQL script that inserts the new permission codes into the `permissions` table and assigns them to existing admin users. Non-admin users get permissions explicitly granted via the user management UI. ```sql INSERT INTO permissions (code, name) VALUES ('product_import', 'Product Import'), ('data_management', 'Data Management'), ('ai_admin', 'AI Settings Admin'), ('templates_write', 'Template Editing'), ('image_admin', 'Image Management'), ('audit_read', 'Audit Log Access'), ('klaviyo_write', 'Klaviyo Write'), ('klaviyo_admin', 'Klaviyo Admin'), ('meta_write', 'Meta Write'), ('google_write', 'Google Analytics Write'), ('typeform_write', 'Typeform Write'), ('acot_admin', 'ACOT Server Admin') ON CONFLICT (code) DO NOTHING; ``` ### 6.3 Rate limiting on login `shared/rate-limit/login.js`: ```js import rateLimit from 'express-rate-limit'; export const loginLimiter = rateLimit({ windowMs: 15 * 60 * 1000, // 15 minutes max: 10, // 10 attempts per IP per window message: { error: 'Too many login attempts, try again later' }, standardHeaders: true, legacyHeaders: false, }); ``` Apply in `auth-server` on the `/login` route. Consider also rate-limiting `/verify` and `/me` (much higher cap, ~600/min — they're called legitimately by every page load). ### 6.4 JWT secret rotation - Rotate `JWT_SECRET` to a fresh 32-byte random string as part of the deployment. - Document that rotation logs out all users — acceptable for an internal tool, do it during off-hours. - Add `JWT_SECRET` to the env var validation block in `auth-server/server.js` (refuse to start if not set). - **Fix the existing footgun**: `/var/www/ecosystem.config.cjs` currently has `JWT_SECRET: process.env.JWT_SECRET` *after* `...inventoryEnv` in the new-auth-server block. This shadows the `.env` value with whatever the shell exported when PM2 was started — which has already silently diverged at least once (detected and fixed 2026-05-23 by a clean PM2 restart in a shell without JWT_SECRET exported). Delete that override line during rotation; let `.env` be the single source of truth. ### 6.5 Request logging `shared/logging/request-log.js` — log method, path, status, duration, user-id (if authenticated). **Never** log `Authorization` or `Cookie` headers. Remove the current `server.js:79-87` debug middleware in inventory-server (it logs full headers including the bearer token). ### 6.6 CORS lockdown Current `middleware/cors.js` allows `192.168.*.*` and `10.*.*.*` with `credentials: true`. Tighten to explicit known origins: ```js origin: [ 'https://tools.acherryontop.com', 'https://inventory.kent.pw', /^http:\/\/localhost:(5174|5175)$/, ] ``` If anyone genuinely needs LAN access, add their specific IP, not a `/16` range. ### 6.7 Upload hardening `POST /api/import/upload-image` (multer-backed) needs: - File-size limit set on multer config (current limit may be defaulted — verify). - MIME-type allowlist (image/jpeg, image/png, image/webp; reject everything else). - Filename sanitization (no `..`, no absolute paths, generate UUID-based names server-side). - The Caddy `/uploads/*` handler currently serves any file in the uploads directory publicly. Move this **behind** the auth gate: include `/uploads/*` in `@needs_auth`. If some images are referenced from public emails (Klaviyo newsletter), put **those** in a separate public bucket; everything else stays gated. ### 6.8 Frontend token storage **Decision: stay on `localStorage`.** This is an internal tool with no untrusted user-generated HTML being rendered, so the XSS-token-theft surface is small. The `forward_auth` gate is the main security gap we're addressing; cookie-based auth would be a larger, separate project (cookie-parser, CSRF double-submit pattern, AuthContext refactor) that doesn't change the threat model for an internal tool with no public sign-up. Sanity check during this refactor: grep the React codebase for `dangerouslySetInnerHTML`. If any usages exist, verify each one is rendering trusted (server-controlled, not user-supplied) content. If a user-supplied content path exists, that's a real XSS vector and needs separate remediation regardless of token-storage choice. ### 6.9 Remove debug middleware [inventory-server/src/server.js:79-87](inventory-server/src/server.js#L79-L87) logs full request headers including `Authorization`. Delete this block. Replace with `shared/logging/request-log.js`. ### 6.10 `lt-wordlist-api` token `ADD_WORD_TOKEN` is currently hardcoded in `/var/www/ecosystem.config.cjs`. Move to `/opt/lt-wordlist-api/.env`, rotate the token value, update any callers. ### 6.11 Audit logging for sensitive operations Already have `import-audit-log` and `product-editor-audit-log` tables. Extend the pattern: - Log `user_id`, `endpoint`, `params`, `result` for `config.js` writes and `data-management.js` operations. - Schema: reuse the existing audit table pattern or add a generic `system_audit_log` table. - Don't log request bodies wholesale (may contain large blobs); log the action and the target ID. --- ## Phase 7 — Caddyfile final form After all phases, the `tools.acherryontop.com` block looks like: ```caddyfile tools.acherryontop.com { import security_headers # Public: login endpoint handle /auth-inv/* { uri strip_prefix /auth-inv reverse_proxy localhost:3011 } # Public: static frontend assets @static path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2 handle @static { header Cache-Control "public, max-age=2592000" root * /var/www/inventory/frontend/build file_server } # All API + uploads: auth gate first @gated path /api/* /chat-api/* /uploads/* handle @gated { forward_auth localhost:3011 { uri /verify copy_headers Authorization } # Uploaded files handle /uploads/* { root * /var/www/inventory file_server } # Vendor dashboard routes → merged dashboard-server handle /api/klaviyo/* { reverse_proxy localhost:3015 } handle /api/meta/* { reverse_proxy localhost:3015 } handle /api/google-analytics/* { reverse_proxy localhost:3015 } handle /api/typeform/* { reverse_proxy localhost:3015 } # ACOT-specific handle /api/acot/* { reverse_proxy localhost:3012 } # Chat handle /chat-api/* { uri strip_prefix /chat-api reverse_proxy localhost:3014 } # Catch-all: inventory-server handle /api/* { reverse_proxy localhost:3010 } } handle /health { reverse_proxy localhost:3010 } # SPA fallback handle { root * /var/www/inventory/frontend/build try_files {path} /index.html file_server encode gzip } handle_errors { respond "{err.status_code} {err.status_text}" } } ``` Removed: `/dashboard-auth/*`, `/api/aircall/*`, `/api/gorgias/*`, `/api/clarity/*`, the LAN/`Access-Control-Allow-Origin "*"` permissive defaults on `/api/*`. Kept: `/apiv2/*` and `/apiv2-test/*` proxies to backend.acherryontop.com (out of scope, separate system). --- ## Phase 8 — ecosystem.config.cjs final form ```js module.exports = { apps: [ { name: 'auth-server', script: './inventory/auth/server.js', cwd: '/var/www', env: { NODE_ENV: 'production', AUTH_PORT: 3011 }, ...commonSettings, }, { name: 'inventory-server', script: './inventory/src/server.js', cwd: '/var/www', env: { NODE_ENV: 'production', PORT: 3010, UPLOADS_DIR: '/var/www/inventory/uploads' }, ...commonSettings, }, { name: 'dashboard-server', script: './inventory/dashboard/server.js', cwd: '/var/www', env: { NODE_ENV: 'production', DASHBOARD_PORT: 3015 }, ...commonSettings, }, { name: 'acot-server', script: './inventory/dashboard/acot-server/server.js', cwd: '/var/www', env: { NODE_ENV: 'production', ACOT_PORT: 3012 }, ...commonSettings, }, { name: 'chat-server', script: './inventory/chat/server.js', cwd: '/var/www', env: { NODE_ENV: 'production', PORT: 3014 }, ...commonSettings, }, // acot-phone-server and lt-wordlist-api unchanged ], }; ``` Five entries instead of twelve. Each app loads its own `.env` from its directory (already handled by `dotenv.config`). --- ## Sequencing & dependencies ``` Phase 1 (decommission) ──┬─────────────────────────────────────────┐ │ │ ▼ │ Phase 2 (shared lib/) │ │ │ ┌──────────────┼──────────────┐ │ ▼ ▼ ▼ ▼ Phase 3a Phase 3b Phase 4 Phase 6 (auth hardening inventory-server auth-server dashboard-server runs alongside 3+4+5, to ESM to ESM + /verify build & test completes after them) │ │ │ │ └──────────────┼──────────────┘ │ ▼ │ Phase 5 (acot-server to ESM) ──────────────────►│ ▼ Phase 7 (Caddy cutover) │ ▼ Phase 8 (PM2 final state) ``` Phase 1 unblocks everything (fewer services to convert). Phase 2 is the foundation; nothing else can start until shared `lib/` exists. Phases 3–5 can run in parallel; they touch independent services. Phase 6's sub-items can be developed alongside 3–5 but **enabled** only after them (no point adding `requirePermission` to a route that doesn't yet have `authenticate`). Phase 7 is the cutover: Caddyfile flip happens when all backend changes are deployed. Phase 8 is cleanup: remove dead PM2 entries. Estimated effort, end-to-end: **~3 weeks of focused work** by one engineer. Phase 1 ≈ 1 day, Phase 2 ≈ 2 days, Phase 3 ≈ 3 days (both services), Phase 4 ≈ 5–7 days (the merge), Phase 5 ≈ 2–3 days, Phase 6 ≈ 3–4 days, Phase 7+8 ≈ 1 day. --- ## Testing strategy No formal test suite exists today (per CLAUDE.md). For a refactor this size, that's a gap to close — but writing tests retroactively for 15K LOC of routes is a separate, larger project. For this refactor: ### Manual smoke testing per phase A checklist of representative endpoints to hit after each deploy: - `inventory-server`: `/api/products`, `/api/dashboard/overview`, `/api/analytics/revenue`, `/api/orders`, `/api/purchase-orders`, `/api/import/list-uploads`, `/api/config/global` - `dashboard-server`: `/api/klaviyo/campaigns`, `/api/meta/insights`, `/api/google-analytics/...`, `/api/typeform/responses` - `acot-server`: `/api/acot/...` (top-3 endpoints by call volume — pull from access logs) - `auth-server`: `/login`, `/me`, `/verify` Each smoke test runs (a) without a token → expect 401, (b) with an invalid token → expect 401, (c) with a valid token → expect 2xx. ### Frontend integration check After deploys, log into the SPA and exercise each major page (Overview, Products, Analytics, Dashboard, Klaviyo, Meta, etc.). If everything loads and dashboards populate, the auth + routing layer is intact. ### Test scaffold during Phase 2 (committed) While building `shared/`, set up `vitest` (lightweight, ESM-native, fast) as the standard test runner for the repo. Initial coverage focuses on the security-critical surface only: - `shared/auth/verify.js` — known good token, expired token, wrong-signature token, malformed token, missing token. - `shared/auth/middleware.js` — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with `req.user` populated. - `shared/auth/middleware.js` user-cache TTL: same token within 60s → one DB hit; same token after 61s → two DB hits. `package.json` gets a `"test": "vitest run"` script at the repo root and per-service. Set up but don't backfill broader test coverage — that's a separate, larger project. The vitest scaffold gives future work a foothold; this refactor commits to having tests for the auth boundary specifically because that's what's load-bearing for the whole security model. --- ## Rollback strategy Each phase produces an independently deployable state. Rollback per phase: - **Phase 1**: re-add removed services to ecosystem; restore from git. Don't roll back data deletions — only do those after a week of stable production. - **Phases 3, 5**: ESM conversion is per-service; if one service breaks, `pm2 restart ` to the previous commit. Other services unaffected. - **Phase 4**: the dashboard-server merge is the highest-risk change. Plan: deploy `dashboard-server` to a non-conflicting port (3015) while leaving the old per-vendor servers running. Cut over Caddy routes one vendor at a time (start with Meta — smallest). If any vendor breaks, point Caddy back to the old server (still running) for that vendor, debug, retry. Only delete the old servers after all four are stable on `dashboard-server`. - **Phases 6, 7**: Caddy config is git-tracked. `git revert` + `caddy reload` rolls back in seconds. Auth changes are additive (defense in depth) — if `forward_auth` causes problems, comment it out and per-server middleware continues protecting routes. --- ## Out of scope (intentional) These came up in the audit but aren't part of this refactor: - `httpOnly` cookie auth (deferred — current `localStorage` acceptable for internal tool). - Replacing PM2 with systemd or Docker. - Test coverage beyond the auth-critical surface. - `apiv2`/`apiv2-test` proxies to `backend.acherryontop.com` — separate system, not touched. - `acot-phone-server` and `lt-wordlist-api` — staying as-is. - Centralized observability stack (Prometheus, Grafana). The logger work in Phase 6.5 sets up the data, but shipping it somewhere is future work. --- ## Concrete deliverables When this is done: - 4 application PM2 processes instead of 12 (plus 2 unchanged: acot-phone, lt-wordlist). - All `/api/*` and `/chat-api/*` requests gated at Caddy and re-verified at each upstream. - Sensitive endpoints additionally gated by per-permission checks. - One ESM standard across the entire Node codebase. - One shared `lib/` for auth, logging, DB, errors, CORS. - Login rate-limited. - `JWT_SECRET` rotated. - Old auth-server, Aircall, Gorgias, Clarity directories deleted from the repo. - Caddyfile slimmed to one auth-gated block. - Permission codes inserted into `permissions` table for granular authorization. - No half-finished pieces, no `// TODO: add auth later` comments, no deferred secrets cleanup. --- ## Deviations from original plan (recorded during execution) These are decisions made during Phase 1/2 implementation that amend the spec above. Future phases should follow the deviated path, not the original sketch. 1. **`shared/` location.** Original plan placed `shared/` at the repo root as a sibling of `inventory/` and `acot-phone/`. Implemented at `inventory-server/shared/` (= `/var/www/inventory/shared/` on the server) instead. Reason: the actual project root *is* `/var/www/inventory/`; placing shared/ outside it would have meant building a deployment story for it that doesn't exist. Import paths change accordingly: - From `inventory-server/{auth,src,chat}/server.js` → `../shared/...` - From `inventory-server/dashboard/{vendor}-server/server.js` → `../../shared/...` 2. **`/verify` response headers.** Plan specified `X-User-Id` + `X-User-Is-Admin`. Implemented as `X-User-Id` + `X-User-Username` (both available from the JWT payload). `X-User-Is-Admin` was dropped because `is_admin` isn't in the JWT today and returning it would require a DB lookup — violating the "no DB hit" principle. To restore `X-User-Is-Admin`, enrich the JWT payload at login time (one-line change in `auth/routes.js`) during Phase 6, then echo from `/verify`. Upstreams don't trust these headers anyway (they re-verify), so the omission is informational, not security-relevant. 3. **User cache key in `shared/auth/middleware.js`.** Plan sketch mentioned "60s TTL keyed by token jti". Implemented as keyed by `userId` instead — the JWT doesn't currently include a `jti` claim, and the cache's invalidation semantics are "this user was deactivated/changed permissions" (per-user), not "this token was revoked" (per-token). The plan's pseudocode already used `loadUserCached(pool, decoded.userId)` so this matches the spirit. 4. **Redis client safety.** `shared/db/redis.js` sets `enableOfflineQueue: false` and `lazyConnect: true`. Plan didn't specify but these defaults mean a Redis hiccup fails fast (route fall-through to upstream API as designed in Phase 4 risk notes) rather than queueing commands indefinitely. 5. **CORS allowed origins kept `https://acot.site`.** Plan example listed three origins; production has acot.site as a redirect to tools.acherryontop.com but also reaches the API directly in some flows. Kept it to avoid breakage. LAN wildcards (`192.168.*`, `10.*`) and `Access-Control-Allow-Origin "*"` are NOT included in the new `shared/cors/policy.js` per the plan's Phase 6.6 spirit, but the legacy `inventory-server/src/middleware/cors.js` still has them until services are migrated to consume `shared/cors/`. 6. **Defunct permission codes left in DB.** Removed the `dashboard:gorgias` and `dashboard:calls` Protected blocks from the frontend, but the corresponding permission rows in the `permissions` table are still there (assigned to some users). They're inert (no UI references them) but should be cleaned up alongside the Phase 6.2 permissions migration. 7. **PM2 process names retained `new-auth-server` (not `auth-server`).** Plan's Phase 8 final form names it `auth-server` (after the legacy 3003 one is removed). Decided to keep the existing `new-auth-server` name through Phase 2 to avoid a rename mid-stream. Phase 8 can rename if desired, but it's cosmetic — all wiring is by port (3011) not name. 8. **Caddyfile changes via admin API on `:2020`.** The Caddyfile is owned by root and matt has no passwordless sudo. Cutover used `curl -X POST .../load` on the Caddy admin port (which matt can hit), then a separate `sudo cp /home/matt/Caddyfile.new /etc/caddy/Caddyfile` step to persist the on-disk file. Future Caddyfile changes can follow the same pattern. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`. 9. **Path-naming.** Plan uses `inventory/` as the top-level (server-side path convention). Locally the equivalent is `inventory-server/`. Whenever the plan says `inventory/dashboard/foo/`, read that as `/var/www/inventory/dashboard/foo/` on the server or `inventory-server/dashboard/foo/` locally.