Files
inventory/CONSOLIDATION_PLAN.md
T
2026-05-24 16:38:23 -04:00

1142 lines
94 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Server Consolidation & Security Hardening Plan
Audit-driven plan to (a) reduce 13 PM2 processes to 5 application servers + 2 auxiliary processes (acot-phone-server, lt-wordlist-api) = 7 total, (b) put every API endpoint behind real authentication, and (c) standardize on ESM across all primary application Node services. Approach is "do it properly the first time" — no half-finished pieces, no deferred cleanup.
> **Note on the original 12→4 target.** The initial spec called for `12 PM2 processes → 3 application servers + 1 auth server` and "ESM across all Node services". During execution `chat-server` proved a poor merge candidate (different DB, different protocol shape) and was kept as its own process — see Deviations #16 and #27. Phase 9 (added post-audit, 2026-05-24) closes the residual gap: chat-server ESM conversion + in-process `authenticate()` + Caddyfile/CORS hardening + vitest scaffold.
---
## Status (2026-05-24)
| Phase | Status | Notes |
|---|---|---|
| 1 — Decommission dead services | **Complete** | aircall/gorgias/clarity/legacy-auth-server deleted from repo + PM2 + Caddyfile + ecosystem.cjs |
| 2 — Build shared `lib/` | **Complete** | Lives at `inventory-server/shared/` (see Deviations). `/verify` endpoint live on auth-server |
| 3 — Convert auth-server + inventory-server to ESM | **Complete** | All inventory-server + auth-server server-side files (58) ESM; both services live under the ESM build for >24h. `chat-server` intentionally not in scope — deferred to Phase 9. See Deviations #1013, #27 |
| 4 — Build `dashboard-server` (the merge) | **Complete (live) — 2026-05-24** | Merged service running on :3015 under PM2; Caddy routes for klaviyo/meta/dashboard-analytics/typeform all reverse-proxy to it. Old per-vendor directories (`klaviyo-server`, `meta-server`, `google-server`, `typeform-server`) and their PM2 entries deleted post-cutover — ~1.27 GB reclaimed (largely duplicated `node_modules`). Phase 6.2 gates wired (meta_write, klaviyo_admin). See Deviations #1619 |
| 5 — Convert `acot-server` to ESM | **Complete (live) — 2026-05-24** | All 11 files (server, db/connection, utils/{phoneAuth,timeUtils}, 7 routes) converted to ESM. PM2 reload clean; SPA-driven `/api/acot/events/*` continues 200 across cutover; phone-server `/api/acot/customers/by-phone` returns 200 with correct shared secret. Phase 6 patterns applied during conversion — see Deviation #24 |
| 6 — Auth hardening | **Complete** | All in-process items live: rate-limit, JWT precondition, CORS lockdown, request-log, upload allowlist, `requirePermission` on sensitive routes, permissions seed migration. `authenticate()` live on `/api/*` (inventory-server, dashboard-server) and `/api/acot/*` (acot-server, added in Phase 5). 6.10 lt-wordlist token loaded via `--env-file` + rotated 2026-05-24 (Deviation #25). 6.11 (audit logging) deferred — see Out of scope |
| **F1 — Frontend fetch wrapper** | **Complete (live) — 2026-05-23** | Wrappers at `inventory/src/utils/api.ts` (`apiFetch`) and `inventory/src/utils/apiClient.ts` (axios instance). 170 `fetch()` sites across 76 files migrated to `apiFetch`; 32 `axios.*` sites across 11 files migrated to `apiClient`. AuthContext `/login`+`/me`, App.tsx `/me`, and `services/apiv2.ts` (external PHP backend) intentionally left as raw `fetch`. Shipped alongside the Phase 3+6 pm2 reload |
| 7 — Caddyfile final form | **Complete — applied 2026-05-24, amended same day** | Final Caddyfile live at `/etc/caddy/Caddyfile` (forward_auth gate + per-vendor reverse_proxy to :3015). The `inventory-server/deploy/` staging folder was removed after the initial apply, then re-created during Phase 9 §9.2 and again for the Deviation #28 uploads-gate revert — treat it as the working area for any future Caddyfile change. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD` |
| 8 — ecosystem.config.cjs final form | **Complete — applied 2026-05-24** | Live PM2 list matches the spec below (5 apps + acot-phone-server + lt-wordlist-api = 7 processes). Includes Phase 6.4 JWT_SECRET shadow-override fix and 6.10 lt-wordlist token move. `inventory-server/deploy/` removed post-apply |
| **9 — Post-audit residual gaps** | **Complete (live) — 2026-05-24** | Closes findings from second-look audit: chat-server ESM + in-process `authenticate()` + localhost bind, Caddyfile edge CORS tightening + CORS preflight bypass, vitest scaffold + auth-boundary tests, three Mini\*.jsx fetch leaks, and plan-doc goal reconciliation. PM2 reload, Caddy apply, frontend rebuild, and tests all completed. **§9.2's uploads-gate fix was subsequently reverted (2026-05-24) — see Deviation #28; `/uploads/*` is public-by-design because the external PHP backend at www.acherryontop.com fetches images during product import.** See Phase 9 section |
**Live PM2 process count: 7** (5 application apps — auth-server, inventory-server, chat-server, dashboard-server, acot-server — plus acot-phone-server + lt-wordlist-api). Down from 13 pre-refactor.
**Phases 18 complete (2026-05-24).** Phase 5 closed the last originally-planned code deliverable; acot-server now runs as an ESM service with shared-lib `authenticate()` defense-in-depth.
**Phase 9 added and applied 2026-05-24** after a second-look audit surfaced six residual findings — see [Phase 9](#phase-9--post-audit-residual-gaps) below. Code, PM2 reload, Caddy apply, frontend rebuild, and test verification are complete.
Two earlier gaps surfaced during Phase 5 verification, both closed 2026-05-24:
- **Phase 6.10 lt-wordlist token rotation** — fixed via Node's native `--env-file` flag in the PM2 entry; token rotated to a fresh 32-byte hex value in `/opt/lt-wordlist-api/.env` (mode 0600). See Deviation #25.
- **Phase 6.1 Caddy gate blocking acot-phone-server's customer lookups** — fixed by repointing `ACOT_API_URL` from the public host to `http://localhost:3012/api/acot`. See Deviation #26.
---
## Goals
- Every public-facing endpoint requires a valid auth token (Caddy gate + per-server middleware + per-route permission checks for sensitive operations).
- Reduce service count from 13 PM2 processes to 7 total: 5 application servers (`auth-server`, `inventory-server`, `dashboard-server`, `acot-server`, `chat-server`) + 2 auxiliary (`acot-phone-server`, `lt-wordlist-api`). The original "4 application servers" target was missed because `chat-server` proved a poor merge candidate during execution; see Deviation #16.
- Standardize on ESM (`"type": "module"`) across all primary application Node services. `chat-server` was the last holdout — Phase 9 converts it.
- Decommission `aircall-server`, `gorgias-server`, `clarity-server`, and the legacy `auth-server` (port 3003).
- Eliminate dependency duplication: one Redis client, one Postgres pool helper, one logger, one auth middleware — shared across services.
## Non-goals
- Rewriting business logic. Route handlers move as-is unless they break under ESM or shared middleware.
- Switching auth providers (we keep JWT + bcrypt + Postgres).
- Replacing PM2 or Caddy.
- Migrating Klaviyo/Meta/Google/Typeform's external API contracts.
---
## Target architecture
```
┌──────────────────────────┐
│ tools.acherryontop.com │
│ (Caddy) │
│ forward_auth gate ─────┼──► auth-server:3011
└────────────┬─────────────┘ /verify endpoint
┌────────────────────────────────┼────────────────────────────────┐
▼ ▼ ▼
┌─────────────────────┐ ┌──────────────────────┐ ┌─────────────────────┐
│ inventory-server │ │ dashboard-server │ │ acot-server │
│ :3010 (ESM) │ │ :3015 (ESM) │ │ :3012 (ESM) │
│ │ │ │ │ │
│ /api/products │ │ /api/klaviyo/* │ │ /api/acot/* │
│ /api/orders │ │ /api/meta/* │ │ (MySQL via SSH) │
│ /api/analytics │ │ /api/google-*/* │ │ │
│ /api/dashboard │ │ /api/typeform/* │ │ │
│ ... (~25 routers) │ │ │ │ │
└─────────────────────┘ └──────────────────────┘ └─────────────────────┘
│ │ │
├── Postgres (inventory_db) ├── Postgres (klaviyo) └── MySQL (workpi, via ssh2 tunnel)
├── shared lib/ ◄────────────────┤
│ - auth middleware ├── Redis (shared client)
│ - permission helper └── shared lib/ ◄─────────────────┐
│ - logger │
│ - pg pool factory │
│ - error formatter │
└─────────────────────────────────────────────────────────────────┘
┌──────────────────┴───┐
│ auth-server │
│ :3011 (ESM) │
│ /login, /me, │
│ /verify, user mgmt │
└──────────────────────┘
```
PM2 process count: **13 → 7** (5 application servers + `acot-phone-server` + `lt-wordlist-api`). The original target was 4 application servers; the diagram above omits `chat-server` because it was originally a merge candidate. In practice `chat-server` runs as its own ESM process post-Phase 9 — see Deviation #16/#27.
---
## Phase 1 — Decommission dead/leaving services
Status: **Complete (2026-05-23)**. All four services removed from repo, PM2, Caddyfile, and ecosystem.config.cjs. Frontend widgets (`AircallDashboard.jsx`, `GorgiasOverview.jsx`) and their dashboard.ts/Navigation.jsx/vite.config.ts wiring also removed. Verification: smoke-tested `https://tools.acherryontop.com/api/{aircall,gorgias,clarity}/*` → 404. Backups left at `/home/matt/{ecosystem.config.cjs,Caddyfile}.bak.2026-05-23`.
### To remove
| Service | Reason | Steps |
|---|---|---|
| `aircall-server` (3002) | Migrating off Aircall | `pm2 delete aircall-server`; remove from `ecosystem.config.cjs`; remove `/api/aircall/*` from Caddyfile; drop `inventory/dashboard/aircall-server/` directory; remove MongoDB connection from any frontend code; cancel Mongo if it was only feeding Aircall |
| `gorgias-server` (3006) | Migrating off Gorgias | same pattern; check frontend for `/api/gorgias/*` callers and delete the dashboards/widgets that use them |
| `clarity-server` (3009) | Already dead (no `.js` files, not in ecosystem) | remove `/api/clarity/*` from Caddyfile; delete `inventory/dashboard/clarity-server/` directory |
| `auth-server` (3003, legacy) | Replaced by `new-auth-server` on 3011 | grep entire codebase for `dashboard-auth` and `localhost:3003`; redirect or remove callers; `pm2 delete auth-server`; remove from ecosystem; remove `/dashboard-auth/*` from Caddyfile; delete `inventory/dashboard/auth-server/` directory |
### Verification before deletion
```bash
# from inventory/ root — find any references before removing
grep -rn "aircall\|/api/aircall" inventory/src/ inventory-server/src/
grep -rn "gorgias\|/api/gorgias" inventory/src/ inventory-server/src/
grep -rn "/dashboard-auth\|localhost:3003" inventory/src/ inventory-server/src/
grep -rn "/api/clarity" inventory/src/ inventory-server/src/
```
Any remaining callers must be deleted or repointed before the server is removed. Do **not** leave a 502 response in production.
### Database/secret cleanup
- Drop the MongoDB instance feeding Aircall (after confirming no other consumers).
- Rotate any Gorgias/Aircall API keys still in `.env` files (defense in depth — they'll be useless soon anyway, but commit hygiene matters).
- Remove `MONGODB_URI`, `AIRCALL_*`, `GORGIAS_*` from any `.env` files.
---
## Phase 2 — Build the shared `lib/`
Status: **Complete (2026-05-23)**. All 11 modules written under `inventory-server/shared/` (NOT repo root — see Deviations). `/verify` endpoint added to auth-server in CJS form (will move to shared/auth/verify.js usage during Phase 3 ESM conversion). Smoke-tested with no-token / bad-token / expired-token / valid-token cases. No service consumes shared/ yet; that happens in Phases 35.
### Location
A single shared directory at the repo root: `shared/` (sibling of `inventory/` and `acot-phone/`). Each service imports from it via a relative path. We do **not** introduce npm workspaces yet — relative imports are fine for three consumers and avoid the npm-link / hoisting headaches.
### Modules to create
```
shared/
├── package.json # "type": "module"
├── auth/
│ ├── middleware.js # authenticate(), requirePermission(), requireAdmin()
│ └── verify.js # verifyToken() — pure function, no Express dependency
├── db/
│ ├── pg.js # createPool(envPrefix) — returns configured Pool
│ └── redis.js # createRedis() — single client, lazy-connect
├── logging/
│ ├── logger.js # pino-based, redacts Authorization/Cookie
│ └── request-log.js # Express middleware, structured access log
├── errors/
│ └── handler.js # consistent error envelope, no leak in prod
├── cors/
│ └── policy.js # single allowed-origins list, exported as cors() options
└── rate-limit/
└── login.js # express-rate-limit config for /login
```
### Auth middleware spec (`shared/auth/middleware.js`)
```js
// Pseudocode — final implementation matches the existing pattern in
// inventory/auth/routes.js authenticate() but factored out.
export function authenticate({ pool }) {
return async (req, res, next) => {
const header = req.headers.authorization;
if (!header?.startsWith('Bearer ')) {
return res.status(401).json({ error: 'Authentication required' });
}
try {
const decoded = jwt.verify(header.slice(7), process.env.JWT_SECRET);
// Short-circuit DB hit with an in-memory cache, 60s TTL keyed by token jti
const user = await loadUserCached(pool, decoded.userId);
if (!user.is_active) return res.status(403).json({ error: 'Account inactive' });
req.user = user;
next();
} catch {
res.status(401).json({ error: 'Invalid token' });
}
};
}
export function requirePermission(code) {
return (req, res, next) => {
if (req.user.is_admin) return next();
if (req.user.permissions?.includes(code)) return next();
res.status(403).json({ error: 'Insufficient permissions' });
};
}
export const requireAdmin = (req, res, next) =>
req.user.is_admin ? next() : res.status(403).json({ error: 'Admin only' });
```
### Why a 60s in-memory user cache
`forward_auth` in Caddy will call `auth-server` on every request. Each per-server `authenticate()` middleware also has a DB lookup to load permissions. Without caching, every API request becomes 1 SQL query for the user row + 1 for permissions. 60s TTL is short enough that deactivating a user takes effect within a minute, long enough that Klaviyo dashboards (which fire dozens of requests on load) don't hammer Postgres.
### Add to `auth-server`: a `/verify` endpoint
Caddy's `forward_auth` only needs "is this token valid? give me a user-id." Today's `/me` does that but with a full permissions join. Add a lightweight `/verify` that:
- Verifies JWT signature only (no DB hit).
- Returns `200` with `X-User-Id` and `X-User-Is-Admin` response headers (which Caddy `copy_headers` will pass to upstream).
- Returns `401` on bad token.
**Decision: each service re-verifies the JWT independently.** Caddy's `forward_auth` is a fast first-pass reject for obviously bad tokens, but the security boundary is the per-server `authenticate()` middleware. Cost is negligible (one HMAC-SHA256 per request); the upside is that a misconfigured Caddyfile can never let an unauthenticated request reach a backend. Upstream services do **not** trust any `X-User-*` headers from Caddy — they parse the `Authorization` header themselves.
---
## Phase 3 — Convert `auth-server` and `inventory-server` to ESM
Status: **Complete (live) — 2026-05-24.** Both servers + all sub-trees converted to ESM and running under PM2. 58 importable .js files. Two latent bugs surfaced and fixed during the conversion: `??`/`||` precedence in `shared/db/{pg,redis}.js`, and CJS named-import of `Pool` from `pg` in both auth files (now uses `import pg from 'pg'; const { Pool } = pg`).
Scripts under `inventory-server/scripts/` (one-shot maintenance / orchestrators) kept CommonJS via a sibling `scripts/package.json` declaring `"type": "commonjs"` — Node's package-type resolution walks up directory by directory, so this overrides the parent's `"type": "module"` without renaming any file or touching any `spawn()` callsite. Convert individual scripts to ESM if/when touched.
Went live 2026-05-24 after `npm install` on netcup (new deps: `pino`, `pino-http`, `ioredis`, `express-rate-limit`, `jsonwebtoken`) + `pm2 reload`. Phase F1 (frontend fetch wrapper) shipped in the same window so the SPA continued to send `Authorization: Bearer` on every request as `authenticate()` came online.
### Mechanical conversion
Per service:
1. Add `"type": "module"` to `package.json`.
2. Convert `require()``import`. `module.exports``export` / `export default`.
3. Fix `__dirname`/`__filename` (use `import.meta.url` + `fileURLToPath`).
4. Convert any dynamic require (e.g., conditional plugin loading) to `await import()`.
5. Update any sub-imports that don't include the file extension — ESM requires `./foo.js`, not `./foo`.
6. Update `ecosystem.config.cjs` if any service entry depended on CJS semantics. The ecosystem file itself can stay `.cjs` — PM2 reads it as config, doesn't matter what the apps it spawns are.
7. Update nodemon config / scripts.
### Risk areas in inventory-server
- `routes/ai.js` does a lazy init (`aiRouter.initInBackground()` called from `server.js`) — confirm the export shape still works as a default export of an Express router with a sidecar function. May need to split into `export default router; export function initInBackground() {}`.
- Multer setup in `routes/import.js` — straightforward, no ESM-specific concerns.
- SSE setup in `server.js` — moves over cleanly, no module-system entanglement.
- The `child_process.spawn` calls for metrics calculation: ESM doesn't change `child_process` behavior, but if any spawned script uses `require()` of a sibling, that sibling must also be ESM (or stay CJS with a `.cjs` extension).
### Test strategy
- After conversion, `pm2 start ecosystem.config.cjs --only inventory-server` on the server, watch logs for require/import errors at startup.
- Hit `/health`, then the most exercised endpoints (`/api/products`, `/api/dashboard/overview`, `/api/analytics/...`). If startup is clean and three smoke endpoints work, ESM conversion is done. Functional correctness is preserved because no logic changed.
### Auth-server
Already small (~200 LOC server.js + ~few hundred in routes.js + permissions.js). 1-day conversion. Add the new `/verify` endpoint as part of this work.
---
## Phase 4 — Build `dashboard-server` (the merge)
Status: **Complete (live) — 2026-05-24.** Klaviyo + Meta + Google + Typeform merged into a single ESM service at `inventory-server/dashboard/server.js`. Shared Pool + ioredis client injected through router factories. Phase 6.2 permission gates wired (`meta_write` on Meta budget/status mutations; `klaviyo_admin` on Klaviyo `/events/clearCache`). Post-cutover cleanup (2026-05-24) deleted the four old per-vendor directories (`klaviyo-server`, `meta-server`, `google-server`, `typeform-server`) along with their PM2 entries — ~1.27 GB reclaimed, largely duplicated `node_modules` across vendors. Original boot test on netcup: `/health` 200; unauthenticated `/api/klaviyo/*` returns `{"error":"No token provided"}` HTTP 401 via shared `authenticate()`.
### Layout
```
inventory/dashboard/
├── server.js # entry: load env, init Postgres+Redis, mount routes, listen
├── package.json # "type": "module", deps from all 4 source servers (deduped)
├── .env # KLAVIYO_*, META_*, GOOGLE_*, TYPEFORM_*, shared DB_*, REDIS_URL
├── routes/
│ ├── klaviyo/ # absorbed from dashboard/klaviyo-server/src/
│ ├── meta/ # absorbed from dashboard/meta-server/
│ ├── google/ # absorbed from dashboard/google-server/
│ └── typeform/ # absorbed from dashboard/typeform-server/
├── services/ # per-vendor API clients (Klaviyo SDK calls, etc.)
├── scripts/
│ └── import-campaign-products.js # one-shot, moved from klaviyo-server/scripts/
└── logs/
```
### Mount points
```js
// server.js (sketch)
import { authenticate, requirePermission } from '../../shared/auth/middleware.js';
import { createPool } from '../../shared/db/pg.js';
import { createRedis } from '../../shared/db/redis.js';
import { logger, requestLog } from '../../shared/logging/index.js';
import corsPolicy from '../../shared/cors/policy.js';
import errorHandler from '../../shared/errors/handler.js';
import klaviyoRouter from './routes/klaviyo/index.js';
import metaRouter from './routes/meta/index.js';
import googleRouter from './routes/google/index.js';
import typeformRouter from './routes/typeform/index.js';
const app = express();
const pool = await createPool('KLAVIYO_DB'); // klaviyo has its own DB; others can share or have none
const redis = await createRedis();
app.use(requestLog);
app.use(cors(corsPolicy));
app.use(express.json({ limit: '10mb' }));
// Everything below this line requires a valid token.
app.use('/api', authenticate({ pool }));
app.use('/api/klaviyo', klaviyoRouter({ pool, redis }));
app.use('/api/meta', metaRouter({ redis }));
app.use('/api/google-analytics', googleRouter({ redis })); // matches Caddy /api/dashboard-analytics rewrite
app.use('/api/typeform', typeformRouter({ redis }));
app.get('/health', (req, res) => res.json({ ok: true }));
app.use(errorHandler);
app.listen(process.env.DASHBOARD_PORT || 3015);
```
### Per-vendor routers
Each vendor's existing route file becomes a factory that takes the shared `pool`/`redis` and returns an Express router. Replace each server's per-instance pool/redis with the injected one.
### Permission gates (sensitive routes only)
Authenticated-only is the default after `app.use('/api', authenticate(...))`. For sensitive operations, add `requirePermission` per route:
- Anything that mutates Klaviyo lists/segments → `requirePermission('klaviyo_write')`
- Triggering a campaign sync → `requirePermission('klaviyo_admin')`
- Read-only dashboards → no extra check beyond authenticate.
Define the new permission codes in the `permissions` table via a migration in Phase 6.
### Dependency dedup
**Decision: standardize on `ioredis`.** Klaviyo's larger codebase already uses it, and `ioredis` has better cluster/sentinel support if we ever need it. Update `meta`/`google`/`typeform` call sites — each is a handful of `get`/`set` calls, mechanical conversion. Remove the `redis` package from `dashboard-server`'s `package.json`.
### Env consolidation
Single `.env` at `inventory/dashboard/.env`, prefixed keys:
```
DASHBOARD_PORT=3015
KLAVIYO_API_KEY=...
KLAVIYO_DB_HOST=...
KLAVIYO_DB_NAME=...
META_ACCESS_TOKEN=...
GOOGLE_SERVICE_ACCOUNT_KEY=...
TYPEFORM_TOKEN=...
REDIS_URL=...
JWT_SECRET=... # shared with auth-server; same secret means same tokens valid here
```
### Klaviyo's `scripts/import-campaign-products.js`
One-shot script — keep it, but run it from the merged dashboard-server's directory. Update the script's imports to ESM. If it's run via cron, update the cron entry to the new path.
### Risk: shared error states
When all four vendors share a Redis client, a Redis hiccup affects all four. Make sure the connection has retry config (`ioredis` defaults are reasonable but verify) and that vendor routes degrade gracefully when Redis is unavailable (most use it as a cache, so cache-miss → fall through to upstream API is the right behavior).
---
## Phase 5 — Convert `acot-server` to ESM (stays standalone)
Status: **Complete (live) — 2026-05-24.** 11 files converted (server.js, db/connection.js, utils/{phoneAuth,timeUtils}.js, 7 route files — ~5.2K LOC). PM2 reload clean; SPA-driven `/api/acot/events/{projection,stats}` continues 200 across cutover; phone-server `/api/acot/customers/by-phone` returns 200 with correct `x-acot-api-key`. Per Deviation #13, `ssh2` is CJS-only → uses `import ssh2 from 'ssh2'; const { Client } = ssh2;`. Phase 6 patterns applied during conversion (Deviation #24).
### Special concern: ssh2 tunnel
`acot-server` opens an SSH tunnel via `ssh2` to access the production MySQL at `192.168.1.5:3309`. Lifecycle today (preserved verbatim across the ESM conversion, not refactored):
- **Lazy establishment.** No tunnel at startup; the first `getDbConnection()` call sets one up. HTTP listener comes up immediately without waiting for the tunnel. Acceptable — the first per-route request just pays the tunnel-creation latency once.
- **Per-connection ssh client.** Each pooled MySQL connection owns its own `ssh2.Client`. Closing a connection closes its own SSH client.
- **No reconnect on disconnect.** There is no `close` listener on the SSH client. If the SSH connection drops while the MySQL connection is pooled (not in use), the next caller that pops it will get a query failure. Circuit-breaker absorbs repeated failures (5 failures → 30s open). Mitigation acceptable for current call volume; revisit if SSH drops become observable in logs.
- **SIGTERM/SIGINT teardown.** `server.close()``closeAllConnections()` ends MySQL connections and SSH clients in sequence. Confirmed clean during the Phase 5 cutover (`SIGTERM signal received: closing HTTP server` → 10 × `Closed pooled connection``All connections closed and pool reset` in PM2 logs).
### Auth model (two flavors, intentional)
`server.js` mounts the customers router BEFORE the global `authenticate()` so the two auth schemes don't collide:
- `/api/acot/customers/*``requirePhoneApiKey` (timing-safe `x-acot-api-key` check). Used by `acot-phone-server`.
- everything else → JWT Bearer via `shared/auth/middleware.js authenticate()`. Used by the SPA.
This works at the in-process layer. The public path through Caddy is a separate issue — see Deviation #26.
---
## Phase 6 — Auth hardening
Status: **Complete (live) — 2026-05-24.** All hardening (in-process + edge) is live in production. The Phase 3 ESM conversion + Phase 6 middleware shipped together, with Phase F1 (frontend fetch wrapper) flipping immediately ahead of the `pm2 reload` so the SPA continued to carry `Authorization: Bearer` on every API call. Caddy `forward_auth` gate and the JWT_SECRET ecosystem fix went live with the Phase 7/8 apply on 2026-05-24.
Per-item status:
| # | Item | Status | Where |
|---|---|---|---|
| 6.1 | Caddy `forward_auth` gate | **Live — 2026-05-24** | Applied via Caddy admin API + `sudo cp` to `/etc/caddy/Caddyfile`. `@gated path /api/* /chat-api/*` block hits `localhost:3011/verify` on every request. (`/uploads/*` was originally in the gate but removed per Deviation #28 — see Phase 6.7 / 9.2.) |
| 6.2 | `requirePermission` on sensitive routes + permissions migration | **Done** | inline in `config.js`, `data-management.js`, `import.js`, `ai-prompts.js`, `ai-validation.js`, `templates.js`, `reusable-images.js`; codes seeded by `migrations/005_phase6_permission_codes.sql`. **Phase 4 follow-on (2026-05-23):** `meta_write` wired on `PATCH /api/meta/campaigns/:id/budget` and `POST /api/meta/campaigns/:id/:action`; `klaviyo_admin` wired on `POST /api/klaviyo/events/clearCache`. Read-only Google + Typeform endpoints stay authenticated-only (reserved write codes left in migration 005 for future) |
| 6.3 | Login rate-limit + `/verify` rate-limit | **Done** | `auth/server.js` uses `shared/rate-limit/login.js` (`loginLimiter`, `verifyLimiter`) |
| 6.4 | JWT_SECRET as startup precondition + ecosystem footgun fix | **Live — 2026-05-24** | Both auth-server and inventory-server `process.exit(1)` if `JWT_SECRET` is unset. The `JWT_SECRET: process.env.JWT_SECRET` override that was shadowing `.env` is removed from the live ecosystem.cjs |
| 6.5 | Structured request logging w/ redaction | **Done** | `shared/logging/request-log.js` (pino-http, redacts Authorization/Cookie); mounted in both `auth/server.js` and `src/server.js` |
| 6.6 | CORS lockdown | **Done** | `src/middleware/cors.js` now re-exports `shared/cors/policy.js`. LAN wildcards (`192.168.*`, `10.*`) and `*` defaults gone |
| 6.7 | Upload hardening | **Partial — gating reverted** | Exact-match MIME+extension allowlist on `routes/import.js` and `routes/reusable-images.js`; dead `multer({ dest })` removed from `routes/products.js` (no upload route was using it — strongest hardening was deletion). The Caddyfile `/uploads/*` gate **was applied then reverted (2026-05-24)** — uploads are public-by-design (external PHP backend fetches them during product import). See Deviation #28 |
| 6.8 | Frontend token storage stays localStorage + XSS audit | **Audited** | Confirmed `dangerouslySetInnerHTML` is sanitized in `ProductEditor.tsx`. **Flagged: `ChatRoom.tsx:277,392` renders user-controlled chat content as raw HTML — real XSS vector, separate fix needed** |
| 6.9 | Remove debug middleware | **Done** | The header-dumping `app.use((req,res,next)=>{ console.log(... req.headers ...) })` block removed from `src/server.js`. Replaced with `shared/logging/request-log.js` (which redacts). |
| 6.10 | `lt-wordlist-api` token move | **Live — 2026-05-24** | Live PM2 entry runs `/opt/lt-wordlist-api/index.js` under matt's daemon; `ADD_WORD_TOKEN` is no longer inline in ecosystem.cjs and is read from `/opt/lt-wordlist-api/.env`. See Deviations #2123 for the path corrections and the (incorrect) earlier assumption that this app lived under a separate root daemon |
| 6.11 | Audit logging for sensitive ops | **Deferred** | Out of scope for this pass per user direction. Existing `import_audit_log` and `product_editor_audit_log` tables stay as-is; generic `system_audit_log` table + middleware is its own project |
### 6.1 Caddy `forward_auth` gate
Add to the `tools.acherryontop.com` block, before the `@api_routes` handler:
```caddyfile
# Forward-auth gate for all API traffic
@needs_auth path /api/* /chat-api/*
handle @needs_auth {
forward_auth localhost:3011 {
uri /verify
copy_headers Authorization
# On 401/403, Caddy returns the auth-server's response body verbatim
}
# Existing per-vendor handle blocks remain below this line
}
# /auth-inv/* stays public (you need to log in!)
handle /auth-inv/* {
uri strip_prefix /auth-inv
reverse_proxy localhost:3011
}
```
The `forward_auth` directive subrequests `/verify` on the auth-server. If it returns 2xx, the request proceeds upstream. If 401/403, Caddy returns that response to the client and never hits the backend.
This is the **first** line of defense. Per-server middleware (`shared/auth/middleware.js`) is the **second** line — re-verifies the JWT independently. Defense in depth: a Caddyfile typo can't open a hole.
### 6.2 Per-route permission gates
After per-server `authenticate()`, add `requirePermission(code)` to destructive or sensitive routes. Audit needed in:
- `inventory-server/src/routes/config.js` — global config writes → `admin`
- `inventory-server/src/routes/import.js` — uploads, deletes, generate-upc → `product_import`
- `inventory-server/src/routes/data-management.js` — CSV operations → `data_management`
- `inventory-server/src/routes/ai-prompts.js` — prompt edits → `ai_admin`
- `inventory-server/src/routes/templates.js` — template writes → `templates_write`
- `inventory-server/src/routes/reusable-images.js` — image management → `image_admin`
- `inventory-server/src/routes/products.js` — only one POST (`/resolve-identifiers`); evaluate whether it needs a permission code or authenticated-only is fine
- `inventory-server/src/routes/product-editor-audit-log.js` and `import-audit-log.js` — read-only by sensitive users → `audit_read`
- `dashboard-server` Klaviyo/Meta/Google/Typeform write endpoints → vendor-specific codes per above
Migration: a single SQL script that inserts the new permission codes into the `permissions` table and assigns them to existing admin users. Non-admin users get permissions explicitly granted via the user management UI.
```sql
INSERT INTO permissions (code, name) VALUES
('product_import', 'Product Import'),
('data_management', 'Data Management'),
('ai_admin', 'AI Settings Admin'),
('templates_write', 'Template Editing'),
('image_admin', 'Image Management'),
('audit_read', 'Audit Log Access'),
('klaviyo_write', 'Klaviyo Write'),
('klaviyo_admin', 'Klaviyo Admin'),
('meta_write', 'Meta Write'),
('google_write', 'Google Analytics Write'),
('typeform_write', 'Typeform Write'),
('acot_admin', 'ACOT Server Admin')
ON CONFLICT (code) DO NOTHING;
```
### 6.3 Rate limiting on login
`shared/rate-limit/login.js`:
```js
import rateLimit from 'express-rate-limit';
export const loginLimiter = rateLimit({
windowMs: 15 * 60 * 1000, // 15 minutes
max: 10, // 10 attempts per IP per window
message: { error: 'Too many login attempts, try again later' },
standardHeaders: true,
legacyHeaders: false,
});
```
Apply in `auth-server` on the `/login` route. Consider also rate-limiting `/verify` and `/me` (much higher cap, ~600/min — they're called legitimately by every page load).
### 6.4 JWT secret rotation
- Rotate `JWT_SECRET` to a fresh 32-byte random string as part of the deployment.
- Document that rotation logs out all users — acceptable for an internal tool, do it during off-hours.
- Add `JWT_SECRET` to the env var validation block in `auth-server/server.js` (refuse to start if not set).
- **Fix the existing footgun**: `/var/www/ecosystem.config.cjs` currently has `JWT_SECRET: process.env.JWT_SECRET` *after* `...inventoryEnv` in the new-auth-server block. This shadows the `.env` value with whatever the shell exported when PM2 was started — which has already silently diverged at least once (detected and fixed 2026-05-23 by a clean PM2 restart in a shell without JWT_SECRET exported). Delete that override line during rotation; let `.env` be the single source of truth.
### 6.5 Request logging
`shared/logging/request-log.js` — log method, path, status, duration, user-id (if authenticated). **Never** log `Authorization` or `Cookie` headers. Remove the current `server.js:79-87` debug middleware in inventory-server (it logs full headers including the bearer token).
### 6.6 CORS lockdown
Current `middleware/cors.js` allows `192.168.*.*` and `10.*.*.*` with `credentials: true`. Tighten to explicit known origins:
```js
origin: [
'https://tools.acherryontop.com',
'https://inventory.kent.pw',
/^http:\/\/localhost:(5174|5175)$/,
]
```
If anyone genuinely needs LAN access, add their specific IP, not a `/16` range.
### 6.7 Upload hardening
`POST /api/import/upload-image` (multer-backed) needs:
- File-size limit set on multer config (current limit may be defaulted — verify).
- MIME-type allowlist (image/jpeg, image/png, image/webp; reject everything else).
- Filename sanitization (no `..`, no absolute paths, generate UUID-based names server-side).
- ~~The Caddy `/uploads/*` handler currently serves any file in the uploads directory publicly. Move this **behind** the auth gate: include `/uploads/*` in `@needs_auth`.~~ **Reverted — see Deviation #28.** Uploads are public-by-design because the external PHP backend at `www.acherryontop.com` fetches image URLs server-to-server during the `/apiv2/product/setup_new` import flow, and has no Bearer token. Random-suffix filenames (timestamp + `Math.random`) provide weak enumeration resistance; `/uploads/products/*` files also auto-delete after 24h. If genuinely sensitive uploads ever land here, implement HMAC-signed URLs rather than re-gating wholesale.
### 6.8 Frontend token storage
**Decision: stay on `localStorage`.** This is an internal tool with no untrusted user-generated HTML being rendered, so the XSS-token-theft surface is small. The `forward_auth` gate is the main security gap we're addressing; cookie-based auth would be a larger, separate project (cookie-parser, CSRF double-submit pattern, AuthContext refactor) that doesn't change the threat model for an internal tool with no public sign-up.
Sanity check during this refactor: grep the React codebase for `dangerouslySetInnerHTML`. If any usages exist, verify each one is rendering trusted (server-controlled, not user-supplied) content. If a user-supplied content path exists, that's a real XSS vector and needs separate remediation regardless of token-storage choice.
### 6.9 Remove debug middleware
[inventory-server/src/server.js:79-87](inventory-server/src/server.js#L79-L87) logs full request headers including `Authorization`. Delete this block. Replace with `shared/logging/request-log.js`.
### 6.10 `lt-wordlist-api` token
`ADD_WORD_TOKEN` is currently hardcoded in `/var/www/ecosystem.config.cjs`. Move to `/opt/lt-wordlist-api/.env`, rotate the token value, update any callers.
### 6.11 Audit logging for sensitive operations
Already have `import-audit-log` and `product-editor-audit-log` tables. Extend the pattern:
- Log `user_id`, `endpoint`, `params`, `result` for `config.js` writes and `data-management.js` operations.
- Schema: reuse the existing audit table pattern or add a generic `system_audit_log` table.
- Don't log request bodies wholesale (may contain large blobs); log the action and the target ID.
---
## Phase F1 — Frontend fetch wrapper (NEW — 2026-05-23)
Status: **Complete (live) — 2026-05-24.** Two wrappers landed at `inventory/src/utils/api.ts` and `inventory/src/utils/apiClient.ts`. Migration touched 87 files (76 fetch, 11 axios) covering ~200 call sites. Type-check clean; production build clean. Intentional exclusions: AuthContext `/login`+`/me` (own auth flow), App.tsx initial `/me` session check, and `services/apiv2.ts` (calls the separate PHP backend at backend.acherryontop.com which has its own cookie auth, out of scope per the plan). Shipped alongside the Phase 3+6 pm2 reload.
### The discovery
While wiring `authenticate()` on `/api/*` in Phase 6.1/6.2, we audited the frontend's fetch usage and found:
- **7** call sites send `Authorization: Bearer ${token}` explicitly (all in `AuthContext.tsx` for `/me` + `/login`, plus a couple of `settings/*` pages).
- **~220** other `fetch(...)` / `axios.*(...)` call sites across `inventory/src/services/`, `inventory/src/pages/`, `inventory/src/components/` send **no** Authorization header at all.
- There is no global fetch wrapper, axios interceptor, or service-worker shim that injects the token.
Today this works because nothing on the server checks. Caddy currently has no `forward_auth` gate (Phase 6.1 is a Caddyfile change that hasn't shipped yet) and the previous inventory-server had no `authenticate()` middleware. The frontend's auth model was "you log in once to get the token; the token is checked only by `/me`; everything else is implicitly trusted at the network layer."
With Phase 6 code in production, **every page refresh 401s** on the first API call after the next pm2 reload. The user explicitly accepted this when authorising the Phase 6 work — but the fix is its own deliverable, and shipping Phase 3+6 to PM2 without F1 in the same window means an outage window measured in *however long F1 takes* (not minutes).
### Recommended approach
Add a single fetch wrapper at `inventory/src/utils/api.ts` (or similar) and migrate the ~220 call sites to use it. The wrapper:
1. Reads `localStorage.getItem('token')` on every call (cheap; localStorage is sync).
2. Merges `Authorization: Bearer ${token}` into the request headers if a token exists.
3. Intercepts 401 responses → fires `window.dispatchEvent(new Event('auth:logout'))` (a listener already exists in `AuthContext.tsx:117`) so the user gets bounced to `/login` cleanly instead of seeing broken pages.
4. Preserves the existing call shape — `apiFetch(url, init)` should be a drop-in for `fetch(url, init)` so the migration is mechanical.
```ts
// inventory/src/utils/api.ts (sketch)
export async function apiFetch(input: RequestInfo | URL, init: RequestInit = {}): Promise<Response> {
const token = localStorage.getItem('token');
const headers = new Headers(init.headers);
if (token && !headers.has('Authorization')) {
headers.set('Authorization', `Bearer ${token}`);
}
const res = await fetch(input, { ...init, headers });
if (res.status === 401 && token) {
// Token expired or revoked — bounce to /login. AuthContext already listens.
window.dispatchEvent(new Event('auth:logout'));
}
return res;
}
```
Same shape for axios:
```ts
// inventory/src/utils/apiClient.ts (sketch)
import axios from 'axios';
export const apiClient = axios.create();
apiClient.interceptors.request.use((config) => {
const token = localStorage.getItem('token');
if (token) config.headers.Authorization = `Bearer ${token}`;
return config;
});
apiClient.interceptors.response.use(
(r) => r,
(err) => {
if (err?.response?.status === 401) window.dispatchEvent(new Event('auth:logout'));
return Promise.reject(err);
},
);
```
### Migration plan
1. Land the two wrapper modules above. ~50 LOC total.
2. Codemod or sed-loop: in `inventory/src/`, replace `fetch(``apiFetch(` (with the right import) and `axios.get/post/...``apiClient.get/post/...`. ~220 call sites — a half-day of careful find-and-replace plus per-page verification. Spot-check the ones with custom `Content-Type` (multipart uploads especially) so the wrapper doesn't clobber multipart boundaries.
3. Leave the `AuthContext.tsx` `/login` and `/me` calls alone — they already work and migrating them adds no value.
4. Run the SPA: log in, exercise Overview / Products / Analytics / Dashboard / etc. with browser devtools open watching for `Authorization` header on every `/api/*` request.
### Sequencing with Phase 3+6 deploy
**Two options:**
A) **Ship F1 first** (recommended). Frontend goes out with the wrapper; nothing changes server-side. Then `pm2 reload` Phase 3+6. Zero-downtime, zero broken-page window.
B) **Ship together.** F1 and Phase 3+6 land in the same deploy. Brief window (seconds) where the frontend has the wrapper but the server hasn't reloaded yet — wrapper just sends extra headers the old server ignores. Safe.
Do **not** ship Phase 3+6 first and F1 second. That gives a broken app for as long as F1 takes.
### Out of scope (kept on `localStorage`)
Per Phase 6.8, we're not migrating to httpOnly cookie auth. F1 is the minimum work to make the per-service `authenticate()` (Phase 6) actually usable. A future Phase F2 could move to cookies + CSRF double-submit, but that's a much larger change touching the AuthContext, the login flow, and every backend that reads tokens. Not justified for an internal tool with no public sign-up.
### Note on `/uploads/*` gating (Phase 6.7's Caddyfile change)
**Applied as-spec then reverted (2026-05-24):** `/uploads/*` was initially placed behind `forward_auth` per the 6.7 spec. The SPA was patched via [`AuthedImage`](inventory/src/components/ui/authed-image.tsx) to send Bearer headers, but the gate broke the **external** product-import flow: image URLs from `/uploads/products/*` and `/uploads/reusable/*` are submitted to the external PHP backend (`www.acherryontop.com`) via `/apiv2/product/setup_new`, which then fetches them server-to-server to ingest into the shop — and has no Bearer token. The escape hatch from the original note ("revert this part of 6.7 to keep `/uploads/*` public") was taken: `/uploads/*` now serves via a dedicated public `file_server` placed ahead of `@static` and `@gated`. See Deviation #28 for the full revert details. `AuthedImage` is left in place as defense-in-depth (Bearer header is harmless when not required); no rush to rip out.
---
## Phase 7 — Caddyfile final form
Status: **Complete — applied 2026-05-24.** Final Caddyfile live at `/etc/caddy/Caddyfile`; vendor handles point at the merged dashboard-server on :3015. The `inventory-server/deploy/` staging folder (which held `Caddyfile.proposed` and the README of apply commands) was originally removed after apply — **re-created 2026-05-24 to stage the §9.2 edits and again to stage the Deviation #28 uploads-gate revert; treat it as the working area for any future Caddyfile change**. Apply pattern (admin-API `curl -X POST :2020/load` + `sudo cp` to persist on-disk) is captured in Deviation #8. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`.
After all phases (including Phase 9 §9.2 and the Deviation #28 revert), the `tools.acherryontop.com` block looks like:
```caddyfile
tools.acherryontop.com {
import security_headers
# Public: login endpoint
handle /auth-inv/* {
uri strip_prefix /auth-inv
reverse_proxy localhost:3011
}
# Phase 9 §9.2: CORS preflight bypass (OPTIONS has no Authorization header)
@cors_preflight {
method OPTIONS
header Access-Control-Request-Method *
}
handle @cors_preflight {
handle /api/klaviyo/* { reverse_proxy localhost:3015 }
handle /api/meta/* { reverse_proxy localhost:3015 }
handle /api/dashboard-analytics/* { reverse_proxy localhost:3015 }
handle /api/typeform/* { reverse_proxy localhost:3015 }
handle /api/acot/* { reverse_proxy localhost:3012 }
handle /chat-api/* {
uri strip_prefix /chat-api
reverse_proxy localhost:3014
}
handle /api/* { reverse_proxy localhost:3010 }
}
# Deviation #28: /uploads/* is PUBLIC by design (external PHP backend
# fetches image URLs server-to-server during /apiv2 product import).
# Rooted at /var/www/inventory (NOT the SPA build root) so paths
# resolve to the real uploads directory. Placed ahead of @static and
# @gated so it wins the match.
handle /uploads/* {
root * /var/www/inventory
file_server
}
# Public: static frontend assets
@static path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2
handle @static {
header Cache-Control "public, max-age=2592000"
root * /var/www/inventory/frontend/build
file_server
}
# All API + chat: auth gate first
@gated path /api/* /chat-api/*
handle @gated {
forward_auth localhost:3011 {
uri /verify
copy_headers Authorization
}
# Vendor dashboard routes → merged dashboard-server
handle /api/klaviyo/* { reverse_proxy localhost:3015 }
handle /api/meta/* { reverse_proxy localhost:3015 }
handle /api/dashboard-analytics/* { reverse_proxy localhost:3015 }
handle /api/typeform/* { reverse_proxy localhost:3015 }
# ACOT-specific
handle /api/acot/* { reverse_proxy localhost:3012 }
# Chat
handle /chat-api/* {
uri strip_prefix /chat-api
reverse_proxy localhost:3014
}
# Catch-all: inventory-server
handle /api/* { reverse_proxy localhost:3010 }
}
handle /health { reverse_proxy localhost:3010 }
# SPA fallback
handle {
root * /var/www/inventory/frontend/build
try_files {path} /index.html
file_server
encode gzip
}
handle_errors {
respond "{err.status_code} {err.status_text}"
}
}
```
Removed: `/dashboard-auth/*`, `/api/aircall/*`, `/api/gorgias/*`, `/api/clarity/*`, the LAN/`Access-Control-Allow-Origin "*"` permissive defaults on `/api/*`. Kept: `/apiv2/*` and `/apiv2-test/*` proxies to backend.acherryontop.com (out of scope, separate system).
---
## Phase 8 — ecosystem.config.cjs final form
Status: **Complete — applied 2026-05-24.** Live PM2 list matches the spec below: 5 application apps (auth-server, inventory-server, dashboard-server, acot-server, chat-server) plus acot-phone-server + lt-wordlist-api = 7 total. Includes the Phase 6.4 `JWT_SECRET` shadow-override fix and the Phase 6.10 `lt-wordlist-api` token move. The `inventory-server/deploy/ecosystem.config.cjs.proposed` staging file was removed after apply — recreate from the spec below if future changes are needed.
```js
module.exports = {
apps: [
{
name: 'auth-server',
script: './inventory/auth/server.js',
cwd: '/var/www',
env: { NODE_ENV: 'production', AUTH_PORT: 3011 },
...commonSettings,
},
{
name: 'inventory-server',
script: './inventory/src/server.js',
cwd: '/var/www',
env: { NODE_ENV: 'production', PORT: 3010, UPLOADS_DIR: '/var/www/inventory/uploads' },
...commonSettings,
},
{
name: 'dashboard-server',
script: './inventory/dashboard/server.js',
cwd: '/var/www',
env: { NODE_ENV: 'production', DASHBOARD_PORT: 3015 },
...commonSettings,
},
{
name: 'acot-server',
script: './inventory/dashboard/acot-server/server.js',
cwd: '/var/www',
env: { NODE_ENV: 'production', ACOT_PORT: 3012 },
...commonSettings,
},
{
name: 'chat-server',
script: './inventory/chat/server.js',
cwd: '/var/www',
env: { NODE_ENV: 'production', PORT: 3014 },
...commonSettings,
},
// acot-phone-server and lt-wordlist-api unchanged
],
};
```
Five entries instead of twelve. Each app loads its own `.env` from its directory (already handled by `dotenv.config`).
---
## Sequencing & dependencies
```
Phase 1 (decommission) ──┬─────────────────────────────────────────┐
│ │
▼ │
Phase 2 (shared lib/) │
│ │
┌──────────────┼──────────────┐ │
▼ ▼ ▼ ▼
Phase 3a Phase 3b Phase 4 Phase 6 (auth hardening
inventory-server auth-server dashboard-server runs alongside 3+4+5,
to ESM to ESM + /verify build & test completes after them)
│ │ │ │
└──────────────┼──────────────┘ │
▼ │
Phase 5 (acot-server to ESM) ──────────────────►│
Phase 7 (Caddy cutover)
Phase 8 (PM2 final state)
```
Phase 1 unblocks everything (fewer services to convert).
Phase 2 is the foundation; nothing else can start until shared `lib/` exists.
Phases 35 can run in parallel; they touch independent services.
Phase 6's sub-items can be developed alongside 35 but **enabled** only after them (no point adding `requirePermission` to a route that doesn't yet have `authenticate`).
**Phase F1 must precede the Phase 3+6 pm2 reload** — without the fetch wrapper, the moment the new code goes live the SPA breaks. Discovered during Phase 3+6 implementation; see Phase F1.
Phase 7 is the cutover: Caddyfile flip happens after F1 ships AND after the `/uploads/*` gating decision in F1 is made.
Phase 8 is cleanup: remove dead PM2 entries.
Estimated effort, end-to-end: **~3 weeks of focused work** by one engineer. Phase 1 ≈ 1 day, Phase 2 ≈ 2 days, Phase 3 ≈ 3 days (both services), Phase 4 ≈ 57 days (the merge), Phase 5 ≈ 23 days, Phase 6 ≈ 34 days, Phase F1 ≈ 0.51 day, Phase 7+8 ≈ 1 day.
---
## Phase 9 — Post-audit residual gaps (NEW — 2026-05-24)
A second-look audit after Phase 5/8 closed surfaced six findings that the original plan either deferred, missed, or overstated. Phase 9 closes them. Code work landed in the same session as the audit, then the deploy-side steps (`caddy reload`, `pm2 reload chat-server`, frontend rebuild) were applied and smoke-tested.
The intent is **"do it properly the first time"** — Phase 9 isn't a quality-of-life backlog, it's the final closing-out of the consolidation project. With §9.19.5 applied, the consolidation/security/ESM claims in this document match the live system.
### 9.1 — chat-server: ESM conversion + in-process `authenticate()` + localhost bind
**Finding:** `chat-server` was the last application service still using `require()` (Status note misstated this in earlier revisions) and had no per-server auth — it relied entirely on the Caddy gate. `localhost:3014/test-db` returns 200 unauthenticated, falling short of the plan's three-layer defense model (Caddy `forward_auth` + per-server `authenticate()` + per-route `requirePermission`).
**Files touched (code landed this session):**
- `inventory-server/chat/package.json` — add `"type": "module"`.
- `inventory-server/chat/server.js``require``import`; load shared `.env` first then local; add a *second* `Pool` against `inventory_db` (the existing `CHAT_DB_*` pool stays for `rocketchat_converted`); mount `shared/auth/middleware.js`'s `authenticate()` on the router; mount `shared/logging/request-log.js`, `shared/cors/policy.js`, `shared/errors/handler.js` for parity with dashboard-server; bind to `127.0.0.1` instead of `0.0.0.0` (external access is via Caddy only).
- `inventory-server/chat/routes.js``require``import`; `module.exports``export default`; no changes to handler bodies (they continue to read `global.pool` which `server.js` still sets).
**Applied / verified:**
1. `pm2 reload chat-server --update-env` completed; PM2 shows `chat-server` online with the new PID.
2. Syntax sweep passes under Node when excluding macOS `._*` sidecar files.
3. `ss -ltnp` shows `chat-server` bound to `127.0.0.1:3014` (not `0.0.0.0`).
4. Smoke:
- `curl -s http://localhost:3014/test-db``401` (was: 200 unauthenticated).
- valid short-lived Bearer token against `https://tools.acherryontop.com/chat-api/test-db``200`.
- `curl -s https://tools.acherryontop.com/chat-api/test-db` (no token) → `401` at Caddy gate.
5. PM2 logs show logged-in SPA Chat Archive requests returning 200 (`/users`, `/users/:id/rooms`, `/rooms/:id/messages`, `/messages/attachments`).
**Rollback:** `pm2 reload chat-server` against the previous commit (CJS version is one git checkout away).
### 9.2 — Caddyfile: uploads gate fix + edge CORS tightening
> **⚠️ The uploads-gate portion of this section was reverted on 2026-05-24 — see Deviation #28.** Edit #1 (`not path /uploads/*` on `@static`) was undone and `/uploads/*` was lifted out of `@gated` entirely. The CORS edits (#2 and #3) remain live. The narrative below is preserved as a record of the original intent and what was learned.
**Findings (original):**
- `/uploads/*.jpg` returns a *public* `404` because the `@static path *.jpg ... *.woff2` matcher beats `@gated path /api/* /chat-api/* /uploads/*` on specificity, hitting the unauthenticated SPA build root.
- `Access-Control-Allow-Origin "*"` is set inside the `security_headers` snippet and imported into `tools.acherryontop.com`, which silently overrides the careful allow-list in `shared/cors/policy.js`.
- CORS preflight (`OPTIONS` with no Authorization) hits `forward_auth` and 401s before the backend's CORS handler ever sees it.
**Applied from:** `inventory-server/deploy/Caddyfile.proposed` (recreated this session per Deviation #18's convention of staging diffs there before applying). The active `/etc/caddy/Caddyfile` now contains the Phase 9 CORS edits + the Deviation #28 uploads-gate revert.
**Three edits (as originally applied):**
1. ~~Tighten `@static`:~~ **REVERTED — see Deviation #28.**
```
@static {
path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2
not path /uploads/*
}
```
~~This makes `/uploads/*` an exclusive @gated path, restoring the intended authenticated file_server behavior.~~ Current state: the `not path /uploads/*` exclusion is removed; `/uploads/*` is handled by a dedicated public `handle /uploads/*` block placed ahead of both `@static` and `@gated`. See Deviation #28.
2. Drop the wildcard headers from `security_headers`:
```diff
- Access-Control-Allow-Origin "*"
- Access-Control-Allow-Methods "GET, POST, OPTIONS, PUT, DELETE, PATCH"
- Access-Control-Allow-Headers "DNT, X-CustomHeader, ..."
```
CORS responses come from the upstreams (`shared/cors/policy.js`) which set ACAO conditionally against the allow-list. Caddy stamping `*` was a holdover from before Phase 6.6.
3. Add a preflight-bypass handler **before** `@gated`:
```
@cors_preflight {
method OPTIONS
header Access-Control-Request-Method *
}
handle @cors_preflight {
handle /api/klaviyo/* { reverse_proxy localhost:3015 }
handle /api/meta/* { reverse_proxy localhost:3015 }
handle /api/dashboard-analytics/* { reverse_proxy localhost:3015 }
handle /api/typeform/* { reverse_proxy localhost:3015 }
handle /api/acot/* { reverse_proxy localhost:3012 }
handle /chat-api/* {
uri strip_prefix /chat-api
reverse_proxy localhost:3014
}
handle /api/* { reverse_proxy localhost:3010 }
}
```
**Applied / verified (at time of original §9.2 cutover):**
1. Caddy admin API load applied and `/etc/caddy/Caddyfile` persisted with the Phase 9 edits.
2. Smoke (post-§9.2, pre-Deviation #28 revert):
- `curl -I https://tools.acherryontop.com/uploads/reusable/<real>.jpg` → `401` without auth (was: public 404).
- same path with a valid short-lived Bearer token → `200 image/jpeg`.
- `curl -X OPTIONS -H "Origin: https://tools.acherryontop.com" -H "Access-Control-Request-Method: GET" https://tools.acherryontop.com/api/products` → `204` with `Access-Control-Allow-Origin: https://tools.acherryontop.com` (specific, **not** `*`).
- same preflight with `Origin: https://evil.example` → `403` with no wildcard ACAO.
- `curl -s https://tools.acherryontop.com/api/products` (no token) → `401`; wildcard ACAO no longer stamped by Caddy.
**Current smoke (post-Deviation #28 revert, 2026-05-24):**
- `curl -I https://tools.acherryontop.com/uploads/reusable/<real>.jpg` → `200 image/jpeg` without auth (intentional — see Deviation #28).
- `curl -I https://tools.acherryontop.com/uploads/products/<nonexistent>.jpg` → `404` (not `401`) — confirms the gate is no longer in front of `/uploads/*`.
- `curl -s https://tools.acherryontop.com/api/products` (no token) → `401` (gate still up for `/api/*` + `/chat-api/*`).
- CORS preflight and origin-allowlist behavior unchanged from the post-§9.2 smoke above.
**Rollback:** restore `/etc/caddy/Caddyfile.bak.YYYY-MM-DD` and `curl -X POST .../load` against it.
### 9.3 — Frontend Mini\*.jsx fetch leaks
**Finding:** Three dashboard mini-widgets used raw `fetch()` against gated `/api/dashboard/*` endpoints, so the SPA would 401 these specific cards after Phase 6 went live.
**Files touched (landed this session):**
- `inventory/src/components/dashboard/MiniInventorySnapshot.jsx` — 3 calls migrated to `apiFetch`.
- `inventory/src/components/dashboard/MiniBusinessMetrics.jsx` — 2 calls migrated.
- `inventory/src/components/dashboard/MiniSalesChart.jsx` — 1 call migrated.
**Sweep done:** `grep -rn "fetch(\`\${config.apiUrl}\|fetch(\`/api/\|fetch('/api/\|fetch(\"/api/" inventory/src` now returns no surviving raw-fetch sites against gated paths (other than the two AuthContext exceptions and `services/apiv2.ts` already excluded by Phase F1).
**Applied / verified:** `cd inventory && npm run build` completed successfully and the live frontend build now references the new asset hash. Raw `fetch()` sweep now shows only the documented AuthContext/App `/me` exceptions, `services/apiv2.ts`, and third-party OpenWeather calls.
### 9.4 — Vitest scaffold + auth-boundary tests
**Finding:** Plan promised a `vitest` scaffold and auth-boundary tests during Phase 2; never delivered. Root `package.json` has no `scripts` block; `inventory-server/package.json` has the default `echo "Error: no test specified"`.
**Files landed this session** (under `inventory-server/`):
- `inventory-server/package.json` — adds `"test": "vitest run"`, `"test:watch": "vitest"`, plus `vitest` in `devDependencies`.
- `inventory-server/shared/auth/verify.test.js` — five cases: valid token, expired token, wrong-signature token, malformed header, missing token.
- `inventory-server/shared/auth/middleware.test.js` — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with `req.user` populated; cache TTL behavior (same token within 60s → one DB hit; after 61s → two).
The scaffold is intentionally narrow: it covers the security-critical surface and nothing else. Broader coverage is still a separate, larger project — see Out of scope.
**Applied / verified:** `cd /var/www/inventory && npm test` passes: 2 files, 21 auth-boundary tests.
### 9.5 — Plan-doc narrative reconciliation
**Finding:** The opening goal ("12 → 3 application servers + 1 auth-server", "ESM across all Node services") didn't match what shipped. Status row 13 overstated ESM coverage ("All 58 server-side files") in a way that hid chat-server from readers. Deliverables section line 845 acknowledged the chat-server gap but bullet 21 still claimed "all phases complete".
**Edits landed this session:**
- Headline (line 3) rewritten to "13 PM2 processes to 5 application servers + 2 auxiliary processes = 7 total" + the "primary application Node services" qualifier on ESM.
- Phase 3 status row clarified to scope ESM to inventory-server + auth-server, with explicit reference to Deviation #27.
- Goals (line 34) restated with the actual outcome + the reason for the deviation.
- Target-architecture caption corrected from `12 → 4` to `13 → 7`.
- Concrete deliverables block rewritten to flag the four items where Phase 9 closes a gap (uploads gate, chat-server auth, chat-server ESM, edge CORS).
- New status-table row added for Phase 9 itself.
**Deploy:** none — this is documentation only.
### 9.6 — Sequence + dependencies
| Step | Depends on | Risk | Notes |
|---|---|---|---|
| 9.5 plan-doc | — | none | Already done |
| 9.3 frontend rebuild | 9.5 (no functional dep, just shipping order) | low | `npm run build` from inventory/ |
| 9.4 vitest scaffold | — | none | Already done; run `npm test` to verify |
| 9.1 chat-server reload | code conversion landed this session | medium | **Done** — PM2 reload complete; unauthenticated + authenticated smokes pass |
| 9.2 Caddy reload | 9.1 (so the gate covers a service that already self-authenticates) | medium | **Done** — active Caddyfile includes Phase 9 edits; uploads/CORS smokes pass |
### 9.7 — Done criteria
Phase 9 completion checks:
- ✅ `curl -s -o /dev/null -w '%{http_code}' http://localhost:3014/test-db` returns `401` (was: 200).
- ⚠️ ~~`curl -sI https://tools.acherryontop.com/uploads/reusable/<any-real>.jpg` returns `401` without auth and `200` with valid Bearer.~~ **Superseded by Deviation #28.** Current expected behavior: `200 image/jpeg` without auth (uploads are public by design). The original Phase 9.7 criterion is preserved here as a record of what §9.2 verified at the time.
- ✅ `curl -X OPTIONS ...` to a gated `/api/*` path returns `204` from the backend with a specific `Access-Control-Allow-Origin`, **not** `*`.
- ✅ `cd /var/www/inventory && npm test` passes 100% on `shared/auth/*.test.js` (2 files, 21 tests).
- ✅ `grep -rn "require(" inventory-server/chat/*.js` returns nothing.
- ✅ Frontend build passes after the three Mini\*.jsx migrations.
---
## Testing strategy
No formal test suite exists today (per CLAUDE.md). For a refactor this size, that's a gap to close — but writing tests retroactively for 15K LOC of routes is a separate, larger project. For this refactor:
### Manual smoke testing per phase
A checklist of representative endpoints to hit after each deploy:
- `inventory-server`: `/api/products`, `/api/dashboard/overview`, `/api/analytics/revenue`, `/api/orders`, `/api/purchase-orders`, `/api/import/list-uploads`, `/api/config/global`
- `dashboard-server`: `/api/klaviyo/campaigns`, `/api/meta/insights`, `/api/google-analytics/...`, `/api/typeform/responses`
- `acot-server`: `/api/acot/...` (top-3 endpoints by call volume — pull from access logs)
- `auth-server`: `/login`, `/me`, `/verify`
Each smoke test runs (a) without a token → expect 401, (b) with an invalid token → expect 401, (c) with a valid token → expect 2xx.
### Frontend integration check
After deploys, log into the SPA and exercise each major page (Overview, Products, Analytics, Dashboard, Klaviyo, Meta, etc.). If everything loads and dashboards populate, the auth + routing layer is intact.
### Test scaffold during Phase 2 (committed)
While building `shared/`, set up `vitest` (lightweight, ESM-native, fast) as the standard test runner for the repo. Initial coverage focuses on the security-critical surface only:
- `shared/auth/verify.js` — known good token, expired token, wrong-signature token, malformed token, missing token.
- `shared/auth/middleware.js` — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with `req.user` populated.
- `shared/auth/middleware.js` user-cache TTL: same token within 60s → one DB hit; same token after 61s → two DB hits.
`package.json` gets a `"test": "vitest run"` script at the repo root and per-service. Set up but don't backfill broader test coverage — that's a separate, larger project. The vitest scaffold gives future work a foothold; this refactor commits to having tests for the auth boundary specifically because that's what's load-bearing for the whole security model.
---
## Rollback strategy
Each phase produces an independently deployable state. Rollback per phase:
- **Phase 1**: re-add removed services to ecosystem; restore from git. Don't roll back data deletions — only do those after a week of stable production.
- **Phases 3, 5**: ESM conversion is per-service; if one service breaks, `pm2 restart <name>` to the previous commit. Other services unaffected.
- **Phase 4**: the dashboard-server merge is the highest-risk change. Plan: deploy `dashboard-server` to a non-conflicting port (3015) while leaving the old per-vendor servers running. Cut over Caddy routes one vendor at a time (start with Meta — smallest). If any vendor breaks, point Caddy back to the old server (still running) for that vendor, debug, retry. Only delete the old servers after all four are stable on `dashboard-server`.
- **Phases 6, 7**: Caddy config is git-tracked. `git revert` + `caddy reload` rolls back in seconds. Auth changes are additive (defense in depth) — if `forward_auth` causes problems, comment it out and per-server middleware continues protecting routes.
---
## Out of scope (intentional)
These came up in the audit but aren't part of this refactor:
- `httpOnly` cookie auth ("Phase F2" — deferred). Phase F1 keeps `localStorage` + Bearer header because that's the minimum to unblock the Phase 6 `authenticate()` rollout. A future move to cookie auth would touch `AuthContext`, every backend that reads tokens, and introduce CSRF concerns — much larger project.
- Replacing PM2 with systemd or Docker.
- Test coverage beyond the auth-critical surface.
- `apiv2`/`apiv2-test` proxies to `backend.acherryontop.com` — separate system, not touched.
- `acot-phone-server` and `lt-wordlist-api` — staying as-is.
- Centralized observability stack (Prometheus, Grafana). The logger work in Phase 6.5 sets up the data, but shipping it somewhere is future work.
- ChatRoom XSS remediation (flagged during Phase 6.8 audit — `inventory/src/components/chat/ChatRoom.tsx:277,392` renders user-controlled chat content via `dangerouslySetInnerHTML` without sanitization). Real vulnerability for an internal-but-multi-user tool; separate fix.
---
## Concrete deliverables
State as of 2026-05-24: Phases 19 are **shipped**. The "4 application PM2 processes" original target became **5** in execution because `chat-server` stayed standalone rather than being folded in — never a serious merge candidate (different DB, different protocol shape).
- ✅ 5 application PM2 processes instead of 12 (auth-server, inventory-server, dashboard-server, acot-server, chat-server) — plus 2 unchanged (acot-phone-server, lt-wordlist-api) = 7 total.
- ✅ All `/api/*` and `/chat-api/*` requests gated at Caddy (`forward_auth`). `/uploads/*` is **intentionally public** — it was gated as part of Phase 6.7 / §9.2, then reverted because the external PHP backend at `www.acherryontop.com` fetches image URLs during product import. See Deviation #28.
- ✅ Per-upstream `authenticate()` re-verification on inventory-server, dashboard-server, acot-server, and chat-server.
- ✅ Sensitive endpoints additionally gated by per-permission checks (`requirePermission`).
- ✅ **One ESM standard across primary application Node services** — auth-server, inventory-server, dashboard-server, acot-server, and chat-server are ESM.
- ✅ One shared `lib/` at `inventory-server/shared/` for auth, logging, DB, errors, CORS.
- ✅ Login rate-limited (`shared/rate-limit/login.js`).
- ✅ `JWT_SECRET` rotated + ecosystem shadow-override removed.
- ✅ Old auth-server, Aircall, Gorgias, Clarity directories deleted from the repo. Defunct `dashboard:gorgias`/`dashboard:calls` permission rows also deleted from DB (2026-05-24).
- ✅ Caddyfile slimmed to one auth-gated block. Phase 9 §9.2 drops the edge CORS wildcard and adds a CORS preflight bypass; the `@static`-matcher tightening from §9.2 was rolled back as part of the Deviation #28 uploads-gate revert (uploads now served via a dedicated public `handle /uploads/*` block placed ahead of `@static`).
- ✅ Permission codes inserted into `permissions` table for granular authorization.
- ✅ Both gaps surfaced during Phase 5 verification — `lt-wordlist-api` insecure default token (Deviation #25) and Caddy blocking acot-phone-server's `x-acot-api-key` calls (Deviation #26) — were closed 2026-05-24.
- ✅ **Six second-look findings closed by Phase 9** (frontend fetch leaks, edge CORS wildcard, uploads-gate ordering, chat-server in-process auth + ESM, vitest scaffold, plan-doc/goal mismatch). Code and deploy steps are complete. *The uploads-gate fix was later reverted (Deviation #28) — the gate was incompatible with the external-server image-fetch contract; what `@static` slip-through actually revealed was that uploads belong in front of the gate, not behind it.*
---
## Deviations from original plan (recorded during execution)
These are decisions made during Phase 1/2 implementation that amend the spec above. Future phases should follow the deviated path, not the original sketch.
1. **`shared/` location.** Original plan placed `shared/` at the repo root as a sibling of `inventory/` and `acot-phone/`. Implemented at `inventory-server/shared/` (= `/var/www/inventory/shared/` on the server) instead. Reason: the actual project root *is* `/var/www/inventory/`; placing shared/ outside it would have meant building a deployment story for it that doesn't exist. Import paths change accordingly:
- From `inventory-server/{auth,src,chat}/server.js` → `../shared/...`
- From `inventory-server/dashboard/{vendor}-server/server.js` → `../../shared/...`
2. **`/verify` response headers.** Plan specified `X-User-Id` + `X-User-Is-Admin`. Implemented as `X-User-Id` + `X-User-Username` (both available from the JWT payload). `X-User-Is-Admin` was dropped because `is_admin` isn't in the JWT today and returning it would require a DB lookup — violating the "no DB hit" principle. To restore `X-User-Is-Admin`, enrich the JWT payload at login time (one-line change in `auth/routes.js`) during Phase 6, then echo from `/verify`. Upstreams don't trust these headers anyway (they re-verify), so the omission is informational, not security-relevant.
3. **User cache key in `shared/auth/middleware.js`.** Plan sketch mentioned "60s TTL keyed by token jti". Implemented as keyed by `userId` instead — the JWT doesn't currently include a `jti` claim, and the cache's invalidation semantics are "this user was deactivated/changed permissions" (per-user), not "this token was revoked" (per-token). The plan's pseudocode already used `loadUserCached(pool, decoded.userId)` so this matches the spirit.
4. **Redis client safety.** `shared/db/redis.js` sets `enableOfflineQueue: false` and `lazyConnect: true`. Plan didn't specify but these defaults mean a Redis hiccup fails fast (route fall-through to upstream API as designed in Phase 4 risk notes) rather than queueing commands indefinitely.
5. **CORS allowed origins kept `https://acot.site`.** Plan example listed three origins; production has acot.site as a redirect to tools.acherryontop.com but also reaches the API directly in some flows. Kept it to avoid breakage. LAN wildcards (`192.168.*`, `10.*`) and `Access-Control-Allow-Origin "*"` are NOT included in the new `shared/cors/policy.js` per the plan's Phase 6.6 spirit, but the legacy `inventory-server/src/middleware/cors.js` still has them until services are migrated to consume `shared/cors/`.
6. **Defunct permission codes cleaned up (2026-05-24).** Removed the `dashboard:gorgias` and `dashboard:calls` Protected blocks from the frontend; the corresponding permission rows in the `permissions` table (and their user_permission grants) were deleted in a follow-up migration alongside the Phase 6.2 permissions seed. Verified post-migration: `permissions` table contains only the in-use `dashboard:*` codes (analytics, campaigns, feed, financial, meta_campaigns, operations, payroll, products, realtime, sales, stats, typeform, user_behavior).
7. **PM2 process names retained `new-auth-server` (not `auth-server`).** Plan's Phase 8 final form names it `auth-server` (after the legacy 3003 one is removed). Decided to keep the existing `new-auth-server` name through Phase 2 to avoid a rename mid-stream. Phase 8 can rename if desired, but it's cosmetic — all wiring is by port (3011) not name.
8. **Caddyfile changes via admin API on `:2020`.** The Caddyfile is owned by root and matt has no passwordless sudo. Cutover used `curl -X POST .../load` on the Caddy admin port (which matt can hit), then a separate `sudo cp /home/matt/Caddyfile.new /etc/caddy/Caddyfile` step to persist the on-disk file. Future Caddyfile changes can follow the same pattern. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`.
9. **Path-naming.** Plan uses `inventory/` as the top-level (server-side path convention). Locally the equivalent is `inventory-server/`. Whenever the plan says `inventory/dashboard/foo/`, read that as `/var/www/inventory/dashboard/foo/` on the server or `inventory-server/dashboard/foo/` locally.
10. **Scripts directory kept CJS via package.json shim.** Original plan called for converting "any spawned script" to ESM alongside its caller. Implemented: added `inventory-server/scripts/package.json` with `"type": "commonjs"`. Node's package-type resolution walks up directory by directory, so this overrides the parent's `"type": "module"` for the entire `scripts/` tree (≈15 files including `import/*.js`, `metrics-new/utils/*`, the orchestrator scripts) without renaming any file or touching any `spawn()` callsite. Convert individual scripts to ESM when touched; don't bulk-migrate.
11. **`src/routes/products.js` had dead multer setup.** Phase 6.7 spec called for hardening the upload route in products.js. There was no upload route — the `multer({ dest })` instance and `importProductsFromCSV` import were dead code left over from a long-ago migration. Strongest 6.7 hardening was deletion: no upload handler = no attack surface. The two real upload paths (`/api/import/upload-image` and `/api/reusable-images/upload`) got tightened MIME+extension allowlists instead.
12. **Two pre-existing syntax errors in shared/db/ surfaced.** `shared/db/pg.js:13` and `shared/db/redis.js:22` both had `?? Number(...) || N` — mixing `??` and `||` without parentheses is a TC39 syntax error. They passed Phase 2 because nothing imported them yet; Phase 3 smoke-test exposed it. Fixed with parens.
13. **`import { Pool } from 'pg'` doesn't work in ESM.** The `pg` package is CJS using `module.exports = { Pool, ... }`. Node's ESM-from-CJS interop fails to detect `Pool` as a named export via static analysis. The bulletproof pattern, now used everywhere: `import pg from 'pg'; const { Pool } = pg;`. Same idea for any future CJS-only deps. `src/utils/db.js` already had it; the two auth files needed the fix during execution.
14. **Frontend Bearer-header gap discovered (drives new Phase F1).** Phase 6 was specified assuming the frontend already sends `Authorization: Bearer` on every API call. It does not — only 7 of ~220 call sites do. Phase 6's `authenticate()` middleware is shipped and ready to enable, but until F1 lands the SPA will 401 on every page. The plan now has Phase F1 to address this explicitly; until then, the Phase 3+6 pm2 reload should not ship unless F1 ships in the same window.
15. **macOS NFS workflow note.** The `inventory-server/` directory locally is an NFS mount of `/var/www/inventory/` on netcup. Bulk operations (`find`/`grep -r`/mass `node --check`/`npm install`) hang or take minutes locally and pollute file listings with macOS AppleDouble `._*` sidecar files. Default to `ssh netcup` for any sweep across the tree — individual file edits via the editor are fine.
16. **dashboard-server lives at `inventory-server/dashboard/` (not its own top-level dir).** Plan's Phase 4 diagram implied a sibling of `inventory-server/` etc. The merged service lives at `inventory-server/dashboard/server.js` with `package.json` declaring `"type": "module"`. Per-vendor subdirectories (`klaviyo-server/`, `meta-server/`, `google-server/`, `typeform-server/`) each have their own `package.json` so Node's "nearest parent package.json" walk stops there — they are unaffected by the new parent type. Added `"type": "commonjs"` defensively to meta/google/typeform package.json so a future deletion of their files (cutover cleanup) plus a stray `*.js` left under `dashboard/` wouldn't accidentally try to ESM-parse it.
17. **Klaviyo `RedisService` kept as a wrapper, but accepts injected client.** Plan said "replace each server's per-instance pool/redis with the injected one." The Klaviyo codebase has ~3K LOC of service code (`events.service.js` alone is 2.2K) that calls `this.redisService._getCacheKey()`, `.get()`, `.set()`, `.getEventData()`, `.clearCache()`, `._getTTL()`. Rewriting all of that to call ioredis directly would risk breaking the cache-key/TTL invariants. Decision: keep `services/klaviyo/redis.service.js` as a thin facade with the same public surface, but its constructor now takes an ioredis client instead of constructing one. The 3 service classes (`EventsService`, `CampaignsService`, `ReportingService`) all take `(apiKey, apiRevision, redis)` and pass `redis` to `new RedisService(redis)`. `MetricsService` doesn't use Redis — left unchanged.
18. **dashboard-server `.env` layering.** Plan called for "Single `.env` at `inventory/dashboard/.env`, prefixed keys: KLAVIYO_*, META_*, ... JWT_SECRET ... # shared with auth-server." Implemented as two-file layering: `server.js` loads `/var/www/inventory/.env` FIRST (provides JWT_SECRET, DB_*, REDIS_*) then `inventory-server/dashboard/.env` SECOND for vendor-specific keys (KLAVIYO_API_KEY, META_*, GA_*, TYPEFORM_*). dotenv defaults to `override:false`, so the first file wins on collisions — security-critical vars live in one place, vendor keys in the other. `.env.example` template committed at `dashboard/.env.example`. **Pre-cutover step**: copy the vendor keys from the current per-vendor `.env` files into either of those two files before `pm2 reload`, else KLAVIYO_API_KEY etc. will not be set and routes will 500.
19. **Caddyfile typo fixed: `/api/google-analytics` → `/api/dashboard-analytics`.** The pre-Phase-4 `Caddyfile.proposed` listed a `handle /api/google-analytics/*` block. The live Caddyfile and the frontend (`inventory/src/config/dashboard.ts`) both use `/api/dashboard-analytics/*` (the live file has a `uri replace /api/dashboard-analytics /api/analytics` rewrite to land on google-server's `/api/analytics` mount). The merged dashboard-server now mounts the Google router at `/api/dashboard-analytics` directly — Caddy no longer needs the rewrite, just a straight reverse_proxy. Fixed in `deploy/Caddyfile.proposed`.
20. **`metrics.routes.js` had a latent router-scope bug.** The Klaviyo `metrics.routes.js` declared `const router = express.Router()` at MODULE scope (outside the `createMetricsRouter` factory), so calling the factory twice would have re-mounted handlers on the same router (cumulative). Benign for a single-mount PM2 service, but fixed during the Phase 4 copy — the router now lives inside the factory. Also renamed the export from `createMetricsRoutes` (plural) to `createMetricsRouter` (matches the convention used by every other vendor's index.js).
21. **PM2 log paths use per-server `logs/pm2/` (NOT `/var/log/pm2/`).** Discovered during the first apply attempt: the previously-shipped `ecosystem.config.cjs.proposed` carried over `/var/log/pm2/...` from the live file, but matt has no write perms on `/var/log` (root:syslog 775) so the entries silently failed to launch (chat-server + acot-server came up because they had no explicit log path; new-auth-server, inventory-server, dashboard-server bailed). The actual convention — already in place via pre-created folders on disk — is per-service `logs/pm2/` directly under each service's directory (`./inventory/auth/logs/pm2/`, `./inventory/chat/logs/pm2/`, `./inventory/dashboard/acot-server/logs/pm2/`, `./inventory/dashboard/logs/pm2/` for the merged dashboard-server, `./inventory/logs/pm2/` for inventory-server, `/opt/lt-wordlist-api/logs/pm2/`, `/var/www/acot-phone/logs/pm2/`). All folders are matt:matt. `pm2-logrotate` (already loaded in matt's daemon) rotates them in place.
22. **All PM2 apps run under matt's single daemon — no root daemon.** The earlier `OUT OF SCOPE` comment block in the proposed ecosystem incorrectly claimed `lt-wordlist-api` and `acot-phone-server` were managed by a separate root PM2 daemon. They are not — matt's daemon manages everything. Removed the bogus block; both apps are now first-class entries in the proposed ecosystem with corrected script paths:
- `lt-wordlist-api` script is `/opt/lt-wordlist-api/index.js` (was `/opt/lt-wordlist-api/server.js` in the live file — wrong; that file doesn't exist). `/opt/lt-wordlist-api` is matt:matt 0750.
- `acot-phone-server` script is `/var/www/acot-phone/dist/server.js` (was `./inventory/acot-phone/server.js` in the live file — wrong; that path doesn't exist). `/var/www/acot-phone/` is matt:matt with its own `.env` and is a separate repo from inventory-server.
23. **Phase 6.10 ADD_WORD_TOKEN move stays in this ecosystem.** Per Deviation #22, `lt-wordlist-api` is in matt's ecosystem, so the §6.10 work to remove inline `ADD_WORD_TOKEN` and load it from `/opt/lt-wordlist-api/.env` instead is implemented directly in `deploy/ecosystem.config.cjs.proposed` (no inline `ADD_WORD_TOKEN`; script reads its own .env). When applying, rotate the token value in `/opt/lt-wordlist-api/.env` and update any callers.
24. **Phase 6 patterns applied to acot-server during Phase 5.** acot-server was originally planned to convert mechanically (require → import) and inherit Phase 6 hardening later. Done in a single pass instead: the new `server.js` mounts `shared/logging/request-log.js`, `shared/cors/policy.js`, `shared/errors/handler.js`, and `shared/auth/middleware.js`'s `authenticate()` on `/api/acot/*` (except the customers router — see Phase 5 auth-model section above). Adds `pg` dependency to `inventory-server/dashboard/acot-server/package.json` because the Postgres pool for `authenticate()`'s user/permission lookups is initialized in-process. Env layering follows dashboard-server's pattern: `/var/www/inventory/.env` loaded first (JWT_SECRET, DB_*), local `.env` loaded second (PROD_DB_*, PROD_SSH_*, ACOT_PHONE_API_KEY). No `acot_admin` permission gates wired — none of the routes mutate state in ways that warrant per-permission checks today; the seeded code in `migrations/005_phase6_permission_codes.sql` stays reserved.
25. **Phase 6.10 fixed 2026-05-24 via option (b).** Discovered during Phase 5 closeout that `/opt/lt-wordlist-api/.env` didn't exist, no `ADD_WORD_TOKEN` env var was set on the running process, and the script's `process.env.ADD_WORD_TOKEN || 'tokenhere'` fallback was the only gate — meaning `curl -X POST -H "x-add-word-token: tokenhere" http://localhost:3030/add-word` succeeded in production.
**Fix applied:**
- Generated a fresh 32-byte hex token via `openssl rand -hex 32`.
- Wrote it to `/opt/lt-wordlist-api/.env` (matt:matt, mode 0600).
- Edited `/var/www/ecosystem.config.cjs`'s `lt-wordlist-api` entry to add `node_args: ['--env-file=/opt/lt-wordlist-api/.env']`. Node ≥20.6 (netcup runs v22.22.2) reads the file at startup with no script changes and no `dotenv` import — the cleanest of the three options because the token never lives in the committed ecosystem file.
- `pm2 reload /var/www/ecosystem.config.cjs --only lt-wordlist-api --update-env` picked up the new wrapper config. PM2 restart count 1 → 2, clean startup.
**Verified:** old default `tokenhere` now returns `{"error":"unauthorized"}` HTTP 401; new env-file token returns `{"ok":true,...}` HTTP 200 on `/add-word` and `/delete-word`. To rotate again: edit `/opt/lt-wordlist-api/.env` + `pm2 restart lt-wordlist-api --update-env`.
**Caller coordination:** user confirmed all callers are external and will be updated as issues surface; no inventory of callers to pre-notify.
26. **Caddy `forward_auth` gate breaks acot-phone-server's customer lookups — fixed 2026-05-24 via option (a).** Phase 6.1 put `/api/acot/*` behind `forward_auth localhost:3011/verify`, which strictly requires a JWT Bearer token. But `acot-phone-server` (at `/var/www/acot-phone/`) calls `/api/acot/customers/by-phone`, `/api/acot/customers/search`, `/api/acot/customers/:cid/orders` using only an `x-acot-api-key` header (`/var/www/acot-phone/src/services/acotApi.ts:51`). Result: every customer lookup from the phone app hit a 401 at the Caddy gate before reaching `requirePhoneApiKey`. Last successful customer call in acot-server's access log was 2026-05-21 — three days before the Caddy cutover.
**Fix applied — option (a):** changed `ACOT_API_URL` in `/var/www/acot-phone/.env` (and `acot-phone-server/.env.example` and the local repo copy) from `https://tools.acherryontop.com/api/acot` to `http://localhost:3012/api/acot`. Both processes live on netcup, so the request never enters Caddy and lands directly on `requirePhoneApiKey` in-process. Restarted via `pm2 restart acot-phone-server --update-env`; smoke-tested with `curl -H "x-acot-api-key: ..." http://localhost:3012/api/acot/customers/by-phone?phone=...` → 200 with the real customer record.
(Alternative option (b), kept here for posterity: add a `@phone_auth header x-acot-api-key *` guard in the Caddyfile to bypass `forward_auth` for requests bearing the shared secret. Would have worked too but introduces a header-based bypass in the gate, which is a worse security posture than just not entering Caddy at all.)
27. **chat-server stayed CJS through Phase 8, converted to ESM in Phase 9 (2026-05-24).** Why it stayed CJS originally: during Phase 4 scoping, chat-server was a notional merge candidate (fold its routes into `dashboard-server`) but rejected on inspection — it speaks a different DB (`rocketchat_converted`), a different protocol shape (file proxying + message search, not vendor-API aggregation), and shares no service code with the dashboard merge surface. Once "no merge" was decided, the secondary case for converting it (consistency, sharing `shared/auth/middleware.js`) wasn't strong enough to make it worth the Phase 4 testing budget. Phase 6 hardening also skipped it because the Caddy gate was assumed sufficient.
The second-look audit (2026-05-24) flagged two consequences:
- The plan's literal claim of "ESM across all Node services" was false.
- chat-server had only one defense layer (Caddy `forward_auth`) — `curl http://localhost:3014/test-db` returned 200 unauthenticated.
Phase 9 §9.1 closes both: chat-server is now ESM, mounts `authenticate()` against an in-process `inventory_db` pool (separate from the existing `rocketchat_converted` pool which routes.js uses via `global.pool`), and binds to `127.0.0.1` so external direct-port access is no longer possible even if the host firewall rule were dropped. Conversion was 3 files (`server.js`, `routes.js`, `package.json`) and minimal because routes.js's only top-level imports were `express` and `path` — handler bodies were untouched.
28. **Phase 6.7 / §9.2 `/uploads/*` gating reverted (2026-05-24).** Phase 6.7 specified moving `/uploads/*` behind `forward_auth`, and Phase 9 §9.2 hardened that by excluding `/uploads/*` from the `@static` matcher so uploaded `*.jpg` paths couldn't slip past the gate via the SPA build root. The SPA was patched (`AuthedImage` in `inventory/src/components/ui/authed-image.tsx`) to send Bearer headers on `/uploads/*` GETs.
**The breakage:** Both `/uploads/products/*` and `/uploads/reusable/*` URLs are submitted as strings inside the `product_images` field to `/apiv2/product/setup_new`, which Caddy reverse-proxies to the external PHP backend at `www.acherryontop.com`. That backend then **fetches the image URLs server-to-server** to ingest them into the shop. The external backend has no JWT and never authenticates against our `localhost:3011/verify` endpoint — so after the gate went live, every external import-time image fetch hit 401 and the product-import flow silently dropped images.
The original Phase 6.7 / F1 notes reserved this exact escape hatch (CONSOLIDATION_PLAN.md "Note on `/uploads/*` gating": *"if broken, revert this part of 6.7 to keep `/uploads/*` public, OR issue per-image signed URLs from the API"*).
**Fix applied — option (a):** revert the gate. In the live Caddyfile:
- Removed `/uploads/*` from the `@gated` matcher.
- Removed the `not path /uploads/*` exclusion from `@static` (no longer needed).
- Added a dedicated public `handle /uploads/*` rooted at `/var/www/inventory` (NOT the SPA build root — paths must resolve to the real uploads directory). Placed ahead of both `@static` and `@gated` so it wins the match.
Staged via `inventory-server/deploy/Caddyfile.proposed` and applied via the Deviation #8 pattern (admin-API `curl -X POST :2020/load` + `sudo cp` to persist).
**Verified live:**
- `curl -sI https://tools.acherryontop.com/uploads/reusable/<real>.jpg` → `200 image/jpeg` without auth.
- `curl -sI https://tools.acherryontop.com/uploads/products/<nonexistent>.jpg` → `404` (not `401`) — confirms gate is no longer in front of `/uploads/*`.
- `curl -sI https://tools.acherryontop.com/api/products` (no token) → `401` — `@gated` is still live for `/api/*` + `/chat-api/*`.
**Rationale for staying public rather than signing URLs:** Per-image HMAC signed URLs would have worked too but require a small change on the inventory server (URL generation + signature validation in Caddy or upstream) and would flow through the PHP backend untouched. Plain public access is acceptable because: (a) uploaded image URLs use random unguessable filenames (`createUploadFilename` in `inventory-server/src/routes/import.js` — timestamp + 9-digit `Math.random` suffix), (b) `/uploads/products/*` files auto-delete after 24h via `scheduleImageDeletion`, and (c) no PII or confidential data lives in this bucket today. If that ever changes, implement signed URLs rather than re-gating wholesale.
**AuthedImage:** Left in place as defense-in-depth. Bearer header is harmless when the path isn't gated, so existing call sites keep working unchanged. No rush to rip it out.
**Doc updates this session:** the Status table's Phase 9 row, Phase 6.1 row, Phase 6.7 row, the inline 6.7 bullet, the F1 "Note on `/uploads/*` gating" subsection, the Phase 7 Caddyfile diagram, the Phase 9 §9.2 narrative + smoke results, and the Concrete deliverables checklist were all updated to reflect the revert.