inventory/CONSOLIDATION_PLAN.md

# Server Consolidation & Security Hardening Plan

Audit-driven plan to (a) reduce 12 PM2 processes to 3 application servers + 1 auth server, (b) put every API endpoint behind real authentication, and (c) standardize on ESM across all Node services. Approach is "do it properly the first time" — no half-finished pieces, no deferred cleanup.

---

## Status (2026-05-24)

| Phase | Status | Notes |
|---|---|---|
| 1 — Decommission dead services | **Complete** | aircall/gorgias/clarity/legacy-auth-server deleted from repo + PM2 + Caddyfile + ecosystem.cjs |
| 2 — Build shared `lib/` | **Complete** | Lives at `inventory-server/shared/` (see Deviations). `/verify` endpoint live on auth-server |
| 3 — Convert auth-server + inventory-server to ESM | **Complete** | All 58 server-side files ESM; both services live under the ESM build for >24h. See Deviations #10–13 |
| 4 — Build `dashboard-server` (the merge) | **Complete (live) — 2026-05-24** | Merged service running on :3015 under PM2; Caddy routes for klaviyo/meta/dashboard-analytics/typeform all reverse-proxy to it. Old per-vendor directories (`klaviyo-server`, `meta-server`, `google-server`, `typeform-server`) and their PM2 entries deleted post-cutover — ~1.27 GB reclaimed (largely duplicated `node_modules`). Phase 6.2 gates wired (meta_write, klaviyo_admin). See Deviations #16–19 |
| 5 — Convert `acot-server` to ESM | Not started | |
| 6 — Auth hardening | **Complete** | All in-process items live: rate-limit, JWT precondition, CORS lockdown, request-log, upload allowlist, `requirePermission` on sensitive routes, permissions seed migration. `authenticate()` is live on `/api/*`. 6.11 (audit logging) deferred — see Out of scope |
| **F1 — Frontend fetch wrapper** | **Complete (live) — 2026-05-23** | Wrappers at `inventory/src/utils/api.ts` (`apiFetch`) and `inventory/src/utils/apiClient.ts` (axios instance). 170 `fetch()` sites across 76 files migrated to `apiFetch`; 32 `axios.*` sites across 11 files migrated to `apiClient`. AuthContext `/login`+`/me`, App.tsx `/me`, and `services/apiv2.ts` (external PHP backend) intentionally left as raw `fetch`. Shipped alongside the Phase 3+6 pm2 reload |
| 7 — Caddyfile final form | **Complete — applied 2026-05-24** | Final Caddyfile live at `/etc/caddy/Caddyfile` (forward_auth gate + per-vendor reverse_proxy to :3015). The `inventory-server/deploy/` staging folder was removed after apply — recreate from this doc if future changes are needed. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD` |
| 8 — ecosystem.config.cjs final form | **Complete — applied 2026-05-24** | Live PM2 list matches the spec below (5 apps + acot-phone-server + lt-wordlist-api = 7 processes). Includes Phase 6.4 JWT_SECRET shadow-override fix and 6.10 lt-wordlist token move. `inventory-server/deploy/` removed post-apply |

**Live PM2 process count: 7** (5 application apps — auth-server, inventory-server, chat-server, dashboard-server, acot-server — plus acot-phone-server + lt-wordlist-api). Down from 13 pre-refactor.

**All apply steps complete (2026-05-24).** The original sequencing (npm install → F1 ship → pm2 reload → env consolidation → vendor PM2 delete → ecosystem apply → Caddyfile apply) was executed in order. Remaining work is Phase 5 (acot-server ESM conversion) only.

---

## Goals

- Every public-facing endpoint requires a valid auth token (Caddy gate + per-server middleware + per-route permission checks for sensitive operations).
- Reduce service count from 12 PM2 processes to 4: `inventory-server`, `acot-server`, `dashboard-server`, `auth-server`.
- Standardize on ESM (`"type": "module"`) across all Node services.
- Decommission `aircall-server`, `gorgias-server`, `clarity-server`, and the legacy `auth-server` (port 3003).
- Eliminate dependency duplication: one Redis client, one Postgres pool helper, one logger, one auth middleware — shared across services.

## Non-goals

- Rewriting business logic. Route handlers move as-is unless they break under ESM or shared middleware.
- Switching auth providers (we keep JWT + bcrypt + Postgres).
- Replacing PM2 or Caddy.
- Migrating Klaviyo/Meta/Google/Typeform's external API contracts.

---

## Target architecture

```
                                  ┌──────────────────────────┐
                                  │  tools.acherryontop.com  │
                                  │         (Caddy)          │
                                  │   forward_auth gate ─────┼──► auth-server:3011
                                  └────────────┬─────────────┘     /verify endpoint
                                               │
              ┌────────────────────────────────┼────────────────────────────────┐
              ▼                                ▼                                ▼
   ┌─────────────────────┐         ┌──────────────────────┐         ┌─────────────────────┐
   │  inventory-server   │         │   dashboard-server   │         │     acot-server     │
   │      :3010 (ESM)    │         │      :3015 (ESM)     │         │     :3012 (ESM)     │
   │                     │         │                      │         │                     │
   │  /api/products      │         │  /api/klaviyo/*      │         │  /api/acot/*        │
   │  /api/orders        │         │  /api/meta/*         │         │  (MySQL via SSH)    │
   │  /api/analytics     │         │  /api/google-*/*     │         │                     │
   │  /api/dashboard     │         │  /api/typeform/*     │         │                     │
   │  ... (~25 routers)  │         │                      │         │                     │
   └─────────────────────┘         └──────────────────────┘         └─────────────────────┘
              │                                │                                │
              ├── Postgres (inventory_db)      ├── Postgres (klaviyo)           └── MySQL (workpi, via ssh2 tunnel)
              ├── shared lib/ ◄────────────────┤
              │   - auth middleware            ├── Redis (shared client)
              │   - permission helper          └── shared lib/ ◄─────────────────┐
              │   - logger                                                       │
              │   - pg pool factory                                              │
              │   - error formatter                                              │
              └─────────────────────────────────────────────────────────────────┘
                                                                                 │
                                                              ┌──────────────────┴───┐
                                                              │     auth-server      │
                                                              │      :3011 (ESM)     │
                                                              │   /login, /me,       │
                                                              │   /verify, user mgmt │
                                                              └──────────────────────┘
```

PM2 process count: **12 → 4** (plus `acot-phone-server` and `lt-wordlist-api`, which stay as-is — out of scope).

---

## Phase 1 — Decommission dead/leaving services

Status: **Complete (2026-05-23)**. All four services removed from repo, PM2, Caddyfile, and ecosystem.config.cjs. Frontend widgets (`AircallDashboard.jsx`, `GorgiasOverview.jsx`) and their dashboard.ts/Navigation.jsx/vite.config.ts wiring also removed. Verification: smoke-tested `https://tools.acherryontop.com/api/{aircall,gorgias,clarity}/*` → 404. Backups left at `/home/matt/{ecosystem.config.cjs,Caddyfile}.bak.2026-05-23`.

### To remove

| Service | Reason | Steps |
|---|---|---|
| `aircall-server` (3002) | Migrating off Aircall | `pm2 delete aircall-server`; remove from `ecosystem.config.cjs`; remove `/api/aircall/*` from Caddyfile; drop `inventory/dashboard/aircall-server/` directory; remove MongoDB connection from any frontend code; cancel Mongo if it was only feeding Aircall |
| `gorgias-server` (3006) | Migrating off Gorgias | same pattern; check frontend for `/api/gorgias/*` callers and delete the dashboards/widgets that use them |
| `clarity-server` (3009) | Already dead (no `.js` files, not in ecosystem) | remove `/api/clarity/*` from Caddyfile; delete `inventory/dashboard/clarity-server/` directory |
| `auth-server` (3003, legacy) | Replaced by `new-auth-server` on 3011 | grep entire codebase for `dashboard-auth` and `localhost:3003`; redirect or remove callers; `pm2 delete auth-server`; remove from ecosystem; remove `/dashboard-auth/*` from Caddyfile; delete `inventory/dashboard/auth-server/` directory |

### Verification before deletion

```bash
# from inventory/ root — find any references before removing
grep -rn "aircall\|/api/aircall" inventory/src/ inventory-server/src/
grep -rn "gorgias\|/api/gorgias" inventory/src/ inventory-server/src/
grep -rn "/dashboard-auth\|localhost:3003" inventory/src/ inventory-server/src/
grep -rn "/api/clarity" inventory/src/ inventory-server/src/
```

Any remaining callers must be deleted or repointed before the server is removed. Do **not** leave a 502 response in production.

### Database/secret cleanup

- Drop the MongoDB instance feeding Aircall (after confirming no other consumers).
- Rotate any Gorgias/Aircall API keys still in `.env` files (defense in depth — they'll be useless soon anyway, but commit hygiene matters).
- Remove `MONGODB_URI`, `AIRCALL_*`, `GORGIAS_*` from any `.env` files.

---

## Phase 2 — Build the shared `lib/`

Status: **Complete (2026-05-23)**. All 11 modules written under `inventory-server/shared/` (NOT repo root — see Deviations). `/verify` endpoint added to auth-server in CJS form (will move to shared/auth/verify.js usage during Phase 3 ESM conversion). Smoke-tested with no-token / bad-token / expired-token / valid-token cases. No service consumes shared/ yet; that happens in Phases 3–5.

### Location

A single shared directory at the repo root: `shared/` (sibling of `inventory/` and `acot-phone/`). Each service imports from it via a relative path. We do **not** introduce npm workspaces yet — relative imports are fine for three consumers and avoid the npm-link / hoisting headaches.

### Modules to create

```
shared/
├── package.json          # "type": "module"
├── auth/
│   ├── middleware.js     # authenticate(), requirePermission(), requireAdmin()
│   └── verify.js         # verifyToken() — pure function, no Express dependency
├── db/
│   ├── pg.js             # createPool(envPrefix) — returns configured Pool
│   └── redis.js          # createRedis() — single client, lazy-connect
├── logging/
│   ├── logger.js         # pino-based, redacts Authorization/Cookie
│   └── request-log.js    # Express middleware, structured access log
├── errors/
│   └── handler.js        # consistent error envelope, no leak in prod
├── cors/
│   └── policy.js         # single allowed-origins list, exported as cors() options
└── rate-limit/
    └── login.js          # express-rate-limit config for /login
```

### Auth middleware spec (`shared/auth/middleware.js`)

```js
// Pseudocode — final implementation matches the existing pattern in
// inventory/auth/routes.js authenticate() but factored out.

export function authenticate({ pool }) {
  return async (req, res, next) => {
    const header = req.headers.authorization;
    if (!header?.startsWith('Bearer ')) {
      return res.status(401).json({ error: 'Authentication required' });
    }
    try {
      const decoded = jwt.verify(header.slice(7), process.env.JWT_SECRET);
      // Short-circuit DB hit with an in-memory cache, 60s TTL keyed by token jti
      const user = await loadUserCached(pool, decoded.userId);
      if (!user.is_active) return res.status(403).json({ error: 'Account inactive' });
      req.user = user;
      next();
    } catch {
      res.status(401).json({ error: 'Invalid token' });
    }
  };
}

export function requirePermission(code) {
  return (req, res, next) => {
    if (req.user.is_admin) return next();
    if (req.user.permissions?.includes(code)) return next();
    res.status(403).json({ error: 'Insufficient permissions' });
  };
}

export const requireAdmin = (req, res, next) =>
  req.user.is_admin ? next() : res.status(403).json({ error: 'Admin only' });
```

### Why a 60s in-memory user cache

`forward_auth` in Caddy will call `auth-server` on every request. Each per-server `authenticate()` middleware also has a DB lookup to load permissions. Without caching, every API request becomes 1 SQL query for the user row + 1 for permissions. 60s TTL is short enough that deactivating a user takes effect within a minute, long enough that Klaviyo dashboards (which fire dozens of requests on load) don't hammer Postgres.

### Add to `auth-server`: a `/verify` endpoint

Caddy's `forward_auth` only needs "is this token valid? give me a user-id." Today's `/me` does that but with a full permissions join. Add a lightweight `/verify` that:

- Verifies JWT signature only (no DB hit).
- Returns `200` with `X-User-Id` and `X-User-Is-Admin` response headers (which Caddy `copy_headers` will pass to upstream).
- Returns `401` on bad token.

**Decision: each service re-verifies the JWT independently.** Caddy's `forward_auth` is a fast first-pass reject for obviously bad tokens, but the security boundary is the per-server `authenticate()` middleware. Cost is negligible (one HMAC-SHA256 per request); the upside is that a misconfigured Caddyfile can never let an unauthenticated request reach a backend. Upstream services do **not** trust any `X-User-*` headers from Caddy — they parse the `Authorization` header themselves.

---

## Phase 3 — Convert `auth-server` and `inventory-server` to ESM

Status: **Complete (live) — 2026-05-24.** Both servers + all sub-trees converted to ESM and running under PM2. 58 importable .js files. Two latent bugs surfaced and fixed during the conversion: `??`/`||` precedence in `shared/db/{pg,redis}.js`, and CJS named-import of `Pool` from `pg` in both auth files (now uses `import pg from 'pg'; const { Pool } = pg`).

Scripts under `inventory-server/scripts/` (one-shot maintenance / orchestrators) kept CommonJS via a sibling `scripts/package.json` declaring `"type": "commonjs"` — Node's package-type resolution walks up directory by directory, so this overrides the parent's `"type": "module"` without renaming any file or touching any `spawn()` callsite. Convert individual scripts to ESM if/when touched.

Went live 2026-05-24 after `npm install` on netcup (new deps: `pino`, `pino-http`, `ioredis`, `express-rate-limit`, `jsonwebtoken`) + `pm2 reload`. Phase F1 (frontend fetch wrapper) shipped in the same window so the SPA continued to send `Authorization: Bearer` on every request as `authenticate()` came online.

### Mechanical conversion

Per service:

1. Add `"type": "module"` to `package.json`.
2. Convert `require()` → `import`. `module.exports` → `export` / `export default`.
3. Fix `__dirname`/`__filename` (use `import.meta.url` + `fileURLToPath`).
4. Convert any dynamic require (e.g., conditional plugin loading) to `await import()`.
5. Update any sub-imports that don't include the file extension — ESM requires `./foo.js`, not `./foo`.
6. Update `ecosystem.config.cjs` if any service entry depended on CJS semantics. The ecosystem file itself can stay `.cjs` — PM2 reads it as config, doesn't matter what the apps it spawns are.
7. Update nodemon config / scripts.

### Risk areas in inventory-server

- `routes/ai.js` does a lazy init (`aiRouter.initInBackground()` called from `server.js`) — confirm the export shape still works as a default export of an Express router with a sidecar function. May need to split into `export default router; export function initInBackground() {}`.
- Multer setup in `routes/import.js` — straightforward, no ESM-specific concerns.
- SSE setup in `server.js` — moves over cleanly, no module-system entanglement.
- The `child_process.spawn` calls for metrics calculation: ESM doesn't change `child_process` behavior, but if any spawned script uses `require()` of a sibling, that sibling must also be ESM (or stay CJS with a `.cjs` extension).

### Test strategy

- After conversion, `pm2 start ecosystem.config.cjs --only inventory-server` on the server, watch logs for require/import errors at startup.
- Hit `/health`, then the most exercised endpoints (`/api/products`, `/api/dashboard/overview`, `/api/analytics/...`). If startup is clean and three smoke endpoints work, ESM conversion is done. Functional correctness is preserved because no logic changed.

### Auth-server

Already small (~200 LOC server.js + ~few hundred in routes.js + permissions.js). 1-day conversion. Add the new `/verify` endpoint as part of this work.

---

## Phase 4 — Build `dashboard-server` (the merge)

Status: **Complete (live) — 2026-05-24.** Klaviyo + Meta + Google + Typeform merged into a single ESM service at `inventory-server/dashboard/server.js`. Shared Pool + ioredis client injected through router factories. Phase 6.2 permission gates wired (`meta_write` on Meta budget/status mutations; `klaviyo_admin` on Klaviyo `/events/clearCache`). Post-cutover cleanup (2026-05-24) deleted the four old per-vendor directories (`klaviyo-server`, `meta-server`, `google-server`, `typeform-server`) along with their PM2 entries — ~1.27 GB reclaimed, largely duplicated `node_modules` across vendors. Original boot test on netcup: `/health` 200; unauthenticated `/api/klaviyo/*` returns `{"error":"No token provided"}` HTTP 401 via shared `authenticate()`.

### Layout

```
inventory/dashboard/
├── server.js                  # entry: load env, init Postgres+Redis, mount routes, listen
├── package.json               # "type": "module", deps from all 4 source servers (deduped)
├── .env                       # KLAVIYO_*, META_*, GOOGLE_*, TYPEFORM_*, shared DB_*, REDIS_URL
├── routes/
│   ├── klaviyo/               # absorbed from dashboard/klaviyo-server/src/
│   ├── meta/                  # absorbed from dashboard/meta-server/
│   ├── google/                # absorbed from dashboard/google-server/
│   └── typeform/              # absorbed from dashboard/typeform-server/
├── services/                  # per-vendor API clients (Klaviyo SDK calls, etc.)
├── scripts/
│   └── import-campaign-products.js  # one-shot, moved from klaviyo-server/scripts/
└── logs/
```

### Mount points

```js
// server.js (sketch)
import { authenticate, requirePermission } from '../../shared/auth/middleware.js';
import { createPool } from '../../shared/db/pg.js';
import { createRedis } from '../../shared/db/redis.js';
import { logger, requestLog } from '../../shared/logging/index.js';
import corsPolicy from '../../shared/cors/policy.js';
import errorHandler from '../../shared/errors/handler.js';

import klaviyoRouter from './routes/klaviyo/index.js';
import metaRouter from './routes/meta/index.js';
import googleRouter from './routes/google/index.js';
import typeformRouter from './routes/typeform/index.js';

const app = express();
const pool = await createPool('KLAVIYO_DB');  // klaviyo has its own DB; others can share or have none
const redis = await createRedis();

app.use(requestLog);
app.use(cors(corsPolicy));
app.use(express.json({ limit: '10mb' }));

// Everything below this line requires a valid token.
app.use('/api', authenticate({ pool }));

app.use('/api/klaviyo', klaviyoRouter({ pool, redis }));
app.use('/api/meta', metaRouter({ redis }));
app.use('/api/google-analytics', googleRouter({ redis }));  // matches Caddy /api/dashboard-analytics rewrite
app.use('/api/typeform', typeformRouter({ redis }));

app.get('/health', (req, res) => res.json({ ok: true }));
app.use(errorHandler);

app.listen(process.env.DASHBOARD_PORT || 3015);
```

### Per-vendor routers

Each vendor's existing route file becomes a factory that takes the shared `pool`/`redis` and returns an Express router. Replace each server's per-instance pool/redis with the injected one.

### Permission gates (sensitive routes only)

Authenticated-only is the default after `app.use('/api', authenticate(...))`. For sensitive operations, add `requirePermission` per route:

- Anything that mutates Klaviyo lists/segments → `requirePermission('klaviyo_write')`
- Triggering a campaign sync → `requirePermission('klaviyo_admin')`
- Read-only dashboards → no extra check beyond authenticate.

Define the new permission codes in the `permissions` table via a migration in Phase 6.

### Dependency dedup

**Decision: standardize on `ioredis`.** Klaviyo's larger codebase already uses it, and `ioredis` has better cluster/sentinel support if we ever need it. Update `meta`/`google`/`typeform` call sites — each is a handful of `get`/`set` calls, mechanical conversion. Remove the `redis` package from `dashboard-server`'s `package.json`.

### Env consolidation

Single `.env` at `inventory/dashboard/.env`, prefixed keys:

```
DASHBOARD_PORT=3015
KLAVIYO_API_KEY=...
KLAVIYO_DB_HOST=...
KLAVIYO_DB_NAME=...
META_ACCESS_TOKEN=...
GOOGLE_SERVICE_ACCOUNT_KEY=...
TYPEFORM_TOKEN=...
REDIS_URL=...
JWT_SECRET=...  # shared with auth-server; same secret means same tokens valid here
```

### Klaviyo's `scripts/import-campaign-products.js`

One-shot script — keep it, but run it from the merged dashboard-server's directory. Update the script's imports to ESM. If it's run via cron, update the cron entry to the new path.

### Risk: shared error states

When all four vendors share a Redis client, a Redis hiccup affects all four. Make sure the connection has retry config (`ioredis` defaults are reasonable but verify) and that vendor routes degrade gracefully when Redis is unavailable (most use it as a cache, so cache-miss → fall through to upstream API is the right behavior).

---

## Phase 5 — Convert `acot-server` to ESM (stays standalone)

Status: **Not started.** Largest single conversion (~5K LOC), but no merge involved.

### Special concern: ssh2 tunnel

`acot-server` opens an SSH tunnel via `ssh2` to access the production MySQL at `192.168.1.5:3309`. The tunnel must be:

- Established before the HTTP listener starts (so no requests fail with "no DB connection").
- Re-established on disconnect (`ssh2` connection's `close` event → recreate).
- Cleanly torn down on `SIGTERM`/`SIGINT` so PM2 restarts don't leak file descriptors.

Verify (or add) this lifecycle handling as part of the conversion. If it's already correct, conversion is mechanical; if not, this is a good moment to fix it.

### Test strategy

Same as inventory-server: start with PM2, smoke-test the most-used `/api/acot/*` endpoints, watch logs for unhandled rejection or tunnel-close events.

---

## Phase 6 — Auth hardening

Status: **Complete (live) — 2026-05-24.** All hardening (in-process + edge) is live in production. The Phase 3 ESM conversion + Phase 6 middleware shipped together, with Phase F1 (frontend fetch wrapper) flipping immediately ahead of the `pm2 reload` so the SPA continued to carry `Authorization: Bearer` on every API call. Caddy `forward_auth` gate and the JWT_SECRET ecosystem fix went live with the Phase 7/8 apply on 2026-05-24.

Per-item status:

| # | Item | Status | Where |
|---|---|---|---|
| 6.1 | Caddy `forward_auth` gate | **Live — 2026-05-24** | Applied via Caddy admin API + `sudo cp` to `/etc/caddy/Caddyfile`. `@gated path /api/* /chat-api/* /uploads/*` block hits `localhost:3011/verify` on every request |
| 6.2 | `requirePermission` on sensitive routes + permissions migration | **Done** | inline in `config.js`, `data-management.js`, `import.js`, `ai-prompts.js`, `ai-validation.js`, `templates.js`, `reusable-images.js`; codes seeded by `migrations/005_phase6_permission_codes.sql`. **Phase 4 follow-on (2026-05-23):** `meta_write` wired on `PATCH /api/meta/campaigns/:id/budget` and `POST /api/meta/campaigns/:id/:action`; `klaviyo_admin` wired on `POST /api/klaviyo/events/clearCache`. Read-only Google + Typeform endpoints stay authenticated-only (reserved write codes left in migration 005 for future) |
| 6.3 | Login rate-limit + `/verify` rate-limit | **Done** | `auth/server.js` uses `shared/rate-limit/login.js` (`loginLimiter`, `verifyLimiter`) |
| 6.4 | JWT_SECRET as startup precondition + ecosystem footgun fix | **Live — 2026-05-24** | Both auth-server and inventory-server `process.exit(1)` if `JWT_SECRET` is unset. The `JWT_SECRET: process.env.JWT_SECRET` override that was shadowing `.env` is removed from the live ecosystem.cjs |
| 6.5 | Structured request logging w/ redaction | **Done** | `shared/logging/request-log.js` (pino-http, redacts Authorization/Cookie); mounted in both `auth/server.js` and `src/server.js` |
| 6.6 | CORS lockdown | **Done** | `src/middleware/cors.js` now re-exports `shared/cors/policy.js`. LAN wildcards (`192.168.*`, `10.*`) and `*` defaults gone |
| 6.7 | Upload hardening | **Done** | Exact-match MIME+extension allowlist on `routes/import.js` and `routes/reusable-images.js`; dead `multer({ dest })` removed from `routes/products.js` (no upload route was using it — strongest hardening was deletion) |
| 6.8 | Frontend token storage stays localStorage + XSS audit | **Audited** | Confirmed `dangerouslySetInnerHTML` is sanitized in `ProductEditor.tsx`. **Flagged: `ChatRoom.tsx:277,392` renders user-controlled chat content as raw HTML — real XSS vector, separate fix needed** |
| 6.9 | Remove debug middleware | **Done** | The header-dumping `app.use((req,res,next)=>{ console.log(... req.headers ...) })` block removed from `src/server.js`. Replaced with `shared/logging/request-log.js` (which redacts). |
| 6.10 | `lt-wordlist-api` token move | **Live — 2026-05-24** | Live PM2 entry runs `/opt/lt-wordlist-api/index.js` under matt's daemon; `ADD_WORD_TOKEN` is no longer inline in ecosystem.cjs and is read from `/opt/lt-wordlist-api/.env`. See Deviations #21–23 for the path corrections and the (incorrect) earlier assumption that this app lived under a separate root daemon |
| 6.11 | Audit logging for sensitive ops | **Deferred** | Out of scope for this pass per user direction. Existing `import_audit_log` and `product_editor_audit_log` tables stay as-is; generic `system_audit_log` table + middleware is its own project |

### 6.1 Caddy `forward_auth` gate

Add to the `tools.acherryontop.com` block, before the `@api_routes` handler:

```caddyfile
# Forward-auth gate for all API traffic
@needs_auth path /api/* /chat-api/*
handle @needs_auth {
    forward_auth localhost:3011 {
        uri /verify
        copy_headers Authorization
        # On 401/403, Caddy returns the auth-server's response body verbatim
    }
    # Existing per-vendor handle blocks remain below this line
}

# /auth-inv/* stays public (you need to log in!)
handle /auth-inv/* {
    uri strip_prefix /auth-inv
    reverse_proxy localhost:3011
}
```

The `forward_auth` directive subrequests `/verify` on the auth-server. If it returns 2xx, the request proceeds upstream. If 401/403, Caddy returns that response to the client and never hits the backend.

This is the **first** line of defense. Per-server middleware (`shared/auth/middleware.js`) is the **second** line — re-verifies the JWT independently. Defense in depth: a Caddyfile typo can't open a hole.

### 6.2 Per-route permission gates

After per-server `authenticate()`, add `requirePermission(code)` to destructive or sensitive routes. Audit needed in:

- `inventory-server/src/routes/config.js` — global config writes → `admin`
- `inventory-server/src/routes/import.js` — uploads, deletes, generate-upc → `product_import`
- `inventory-server/src/routes/data-management.js` — CSV operations → `data_management`
- `inventory-server/src/routes/ai-prompts.js` — prompt edits → `ai_admin`
- `inventory-server/src/routes/templates.js` — template writes → `templates_write`
- `inventory-server/src/routes/reusable-images.js` — image management → `image_admin`
- `inventory-server/src/routes/products.js` — only one POST (`/resolve-identifiers`); evaluate whether it needs a permission code or authenticated-only is fine
- `inventory-server/src/routes/product-editor-audit-log.js` and `import-audit-log.js` — read-only by sensitive users → `audit_read`
- `dashboard-server` Klaviyo/Meta/Google/Typeform write endpoints → vendor-specific codes per above

Migration: a single SQL script that inserts the new permission codes into the `permissions` table and assigns them to existing admin users. Non-admin users get permissions explicitly granted via the user management UI.

```sql
INSERT INTO permissions (code, name) VALUES
  ('product_import', 'Product Import'),
  ('data_management', 'Data Management'),
  ('ai_admin', 'AI Settings Admin'),
  ('templates_write', 'Template Editing'),
  ('image_admin', 'Image Management'),
  ('audit_read', 'Audit Log Access'),
  ('klaviyo_write', 'Klaviyo Write'),
  ('klaviyo_admin', 'Klaviyo Admin'),
  ('meta_write', 'Meta Write'),
  ('google_write', 'Google Analytics Write'),
  ('typeform_write', 'Typeform Write'),
  ('acot_admin', 'ACOT Server Admin')
ON CONFLICT (code) DO NOTHING;
```

### 6.3 Rate limiting on login

`shared/rate-limit/login.js`:

```js
import rateLimit from 'express-rate-limit';
export const loginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,      // 15 minutes
  max: 10,                        // 10 attempts per IP per window
  message: { error: 'Too many login attempts, try again later' },
  standardHeaders: true,
  legacyHeaders: false,
});
```

Apply in `auth-server` on the `/login` route. Consider also rate-limiting `/verify` and `/me` (much higher cap, ~600/min — they're called legitimately by every page load).

### 6.4 JWT secret rotation

- Rotate `JWT_SECRET` to a fresh 32-byte random string as part of the deployment.
- Document that rotation logs out all users — acceptable for an internal tool, do it during off-hours.
- Add `JWT_SECRET` to the env var validation block in `auth-server/server.js` (refuse to start if not set).
- **Fix the existing footgun**: `/var/www/ecosystem.config.cjs` currently has `JWT_SECRET: process.env.JWT_SECRET` *after* `...inventoryEnv` in the new-auth-server block. This shadows the `.env` value with whatever the shell exported when PM2 was started — which has already silently diverged at least once (detected and fixed 2026-05-23 by a clean PM2 restart in a shell without JWT_SECRET exported). Delete that override line during rotation; let `.env` be the single source of truth.

### 6.5 Request logging

`shared/logging/request-log.js` — log method, path, status, duration, user-id (if authenticated). **Never** log `Authorization` or `Cookie` headers. Remove the current `server.js:79-87` debug middleware in inventory-server (it logs full headers including the bearer token).

### 6.6 CORS lockdown

Current `middleware/cors.js` allows `192.168.*.*` and `10.*.*.*` with `credentials: true`. Tighten to explicit known origins:

```js
origin: [
  'https://tools.acherryontop.com',
  'https://inventory.kent.pw',
  /^http:\/\/localhost:(5174|5175)$/,
]
```

If anyone genuinely needs LAN access, add their specific IP, not a `/16` range.

### 6.7 Upload hardening

`POST /api/import/upload-image` (multer-backed) needs:

- File-size limit set on multer config (current limit may be defaulted — verify).
- MIME-type allowlist (image/jpeg, image/png, image/webp; reject everything else).
- Filename sanitization (no `..`, no absolute paths, generate UUID-based names server-side).
- The Caddy `/uploads/*` handler currently serves any file in the uploads directory publicly. Move this **behind** the auth gate: include `/uploads/*` in `@needs_auth`. If some images are referenced from public emails (Klaviyo newsletter), put **those** in a separate public bucket; everything else stays gated.

### 6.8 Frontend token storage

**Decision: stay on `localStorage`.** This is an internal tool with no untrusted user-generated HTML being rendered, so the XSS-token-theft surface is small. The `forward_auth` gate is the main security gap we're addressing; cookie-based auth would be a larger, separate project (cookie-parser, CSRF double-submit pattern, AuthContext refactor) that doesn't change the threat model for an internal tool with no public sign-up.

Sanity check during this refactor: grep the React codebase for `dangerouslySetInnerHTML`. If any usages exist, verify each one is rendering trusted (server-controlled, not user-supplied) content. If a user-supplied content path exists, that's a real XSS vector and needs separate remediation regardless of token-storage choice.

### 6.9 Remove debug middleware

[inventory-server/src/server.js:79-87](inventory-server/src/server.js#L79-L87) logs full request headers including `Authorization`. Delete this block. Replace with `shared/logging/request-log.js`.

### 6.10 `lt-wordlist-api` token

`ADD_WORD_TOKEN` is currently hardcoded in `/var/www/ecosystem.config.cjs`. Move to `/opt/lt-wordlist-api/.env`, rotate the token value, update any callers.

### 6.11 Audit logging for sensitive operations

Already have `import-audit-log` and `product-editor-audit-log` tables. Extend the pattern:

- Log `user_id`, `endpoint`, `params`, `result` for `config.js` writes and `data-management.js` operations.
- Schema: reuse the existing audit table pattern or add a generic `system_audit_log` table.
- Don't log request bodies wholesale (may contain large blobs); log the action and the target ID.

---

## Phase F1 — Frontend fetch wrapper (NEW — 2026-05-23)

Status: **Complete (live) — 2026-05-24.** Two wrappers landed at `inventory/src/utils/api.ts` and `inventory/src/utils/apiClient.ts`. Migration touched 87 files (76 fetch, 11 axios) covering ~200 call sites. Type-check clean; production build clean. Intentional exclusions: AuthContext `/login`+`/me` (own auth flow), App.tsx initial `/me` session check, and `services/apiv2.ts` (calls the separate PHP backend at backend.acherryontop.com which has its own cookie auth, out of scope per the plan). Shipped alongside the Phase 3+6 pm2 reload.

### The discovery

While wiring `authenticate()` on `/api/*` in Phase 6.1/6.2, we audited the frontend's fetch usage and found:

- **7** call sites send `Authorization: Bearer ${token}` explicitly (all in `AuthContext.tsx` for `/me` + `/login`, plus a couple of `settings/*` pages).
- **~220** other `fetch(...)` / `axios.*(...)` call sites across `inventory/src/services/`, `inventory/src/pages/`, `inventory/src/components/` send **no** Authorization header at all.
- There is no global fetch wrapper, axios interceptor, or service-worker shim that injects the token.

Today this works because nothing on the server checks. Caddy currently has no `forward_auth` gate (Phase 6.1 is a Caddyfile change that hasn't shipped yet) and the previous inventory-server had no `authenticate()` middleware. The frontend's auth model was "you log in once to get the token; the token is checked only by `/me`; everything else is implicitly trusted at the network layer."

With Phase 6 code in production, **every page refresh 401s** on the first API call after the next pm2 reload. The user explicitly accepted this when authorising the Phase 6 work — but the fix is its own deliverable, and shipping Phase 3+6 to PM2 without F1 in the same window means an outage window measured in *however long F1 takes* (not minutes).

### Recommended approach

Add a single fetch wrapper at `inventory/src/utils/api.ts` (or similar) and migrate the ~220 call sites to use it. The wrapper:

1. Reads `localStorage.getItem('token')` on every call (cheap; localStorage is sync).
2. Merges `Authorization: Bearer ${token}` into the request headers if a token exists.
3. Intercepts 401 responses → fires `window.dispatchEvent(new Event('auth:logout'))` (a listener already exists in `AuthContext.tsx:117`) so the user gets bounced to `/login` cleanly instead of seeing broken pages.
4. Preserves the existing call shape — `apiFetch(url, init)` should be a drop-in for `fetch(url, init)` so the migration is mechanical.

```ts
// inventory/src/utils/api.ts (sketch)
export async function apiFetch(input: RequestInfo | URL, init: RequestInit = {}): Promise<Response> {
  const token = localStorage.getItem('token');
  const headers = new Headers(init.headers);
  if (token && !headers.has('Authorization')) {
    headers.set('Authorization', `Bearer ${token}`);
  }
  const res = await fetch(input, { ...init, headers });
  if (res.status === 401 && token) {
    // Token expired or revoked — bounce to /login. AuthContext already listens.
    window.dispatchEvent(new Event('auth:logout'));
  }
  return res;
}
```

Same shape for axios:

```ts
// inventory/src/utils/apiClient.ts (sketch)
import axios from 'axios';
export const apiClient = axios.create();
apiClient.interceptors.request.use((config) => {
  const token = localStorage.getItem('token');
  if (token) config.headers.Authorization = `Bearer ${token}`;
  return config;
});
apiClient.interceptors.response.use(
  (r) => r,
  (err) => {
    if (err?.response?.status === 401) window.dispatchEvent(new Event('auth:logout'));
    return Promise.reject(err);
  },
);
```

### Migration plan

1. Land the two wrapper modules above. ~50 LOC total.
2. Codemod or sed-loop: in `inventory/src/`, replace `fetch(` → `apiFetch(` (with the right import) and `axios.get/post/...` → `apiClient.get/post/...`. ~220 call sites — a half-day of careful find-and-replace plus per-page verification. Spot-check the ones with custom `Content-Type` (multipart uploads especially) so the wrapper doesn't clobber multipart boundaries.
3. Leave the `AuthContext.tsx` `/login` and `/me` calls alone — they already work and migrating them adds no value.
4. Run the SPA: log in, exercise Overview / Products / Analytics / Dashboard / etc. with browser devtools open watching for `Authorization` header on every `/api/*` request.

### Sequencing with Phase 3+6 deploy

**Two options:**

A) **Ship F1 first** (recommended). Frontend goes out with the wrapper; nothing changes server-side. Then `pm2 reload` Phase 3+6. Zero-downtime, zero broken-page window.

B) **Ship together.** F1 and Phase 3+6 land in the same deploy. Brief window (seconds) where the frontend has the wrapper but the server hasn't reloaded yet — wrapper just sends extra headers the old server ignores. Safe.

Do **not** ship Phase 3+6 first and F1 second. That gives a broken app for as long as F1 takes.

### Out of scope (kept on `localStorage`)

Per Phase 6.8, we're not migrating to httpOnly cookie auth. F1 is the minimum work to make the per-service `authenticate()` (Phase 6) actually usable. A future Phase F2 could move to cookies + CSRF double-submit, but that's a much larger change touching the AuthContext, the login flow, and every backend that reads tokens. Not justified for an internal tool with no public sign-up.

### Note on `/uploads/*` gating (Phase 6.7's Caddyfile change)

**Applied as-spec (2026-05-24):** `/uploads/*` is behind `forward_auth` in the live Caddyfile. `<img src="/uploads/...">` references in the SPA are browser-issued GETs that don't carry `Authorization` headers — verify image display works end-to-end (cookies fall-through, signed URLs, or session-bound forward_auth) and if broken, revert this part of 6.7 to keep `/uploads/*` public, OR issue per-image signed URLs from the API.

---

## Phase 7 — Caddyfile final form

Status: **Complete — applied 2026-05-24.** Final Caddyfile live at `/etc/caddy/Caddyfile`; vendor handles point at the merged dashboard-server on :3015. The `inventory-server/deploy/` staging folder (which held `Caddyfile.proposed` and the README of apply commands) was removed after apply — recreate from the spec below if future changes are needed. Apply pattern (admin-API `curl -X POST :2020/load` + `sudo cp` to persist on-disk) is captured in Deviation #8. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`.

After all phases, the `tools.acherryontop.com` block looks like:

```caddyfile
tools.acherryontop.com {
    import security_headers

    # Public: login endpoint
    handle /auth-inv/* {
        uri strip_prefix /auth-inv
        reverse_proxy localhost:3011
    }

    # Public: static frontend assets
    @static path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2
    handle @static {
        header Cache-Control "public, max-age=2592000"
        root * /var/www/inventory/frontend/build
        file_server
    }

    # All API + uploads: auth gate first
    @gated path /api/* /chat-api/* /uploads/*
    handle @gated {
        forward_auth localhost:3011 {
            uri /verify
            copy_headers Authorization
        }

        # Uploaded files
        handle /uploads/* {
            root * /var/www/inventory
            file_server
        }

        # Vendor dashboard routes → merged dashboard-server
        handle /api/klaviyo/*           { reverse_proxy localhost:3015 }
        handle /api/meta/*              { reverse_proxy localhost:3015 }
        handle /api/dashboard-analytics/* { reverse_proxy localhost:3015 }
        handle /api/typeform/*          { reverse_proxy localhost:3015 }

        # ACOT-specific
        handle /api/acot/*         { reverse_proxy localhost:3012 }

        # Chat
        handle /chat-api/* {
            uri strip_prefix /chat-api
            reverse_proxy localhost:3014
        }

        # Catch-all: inventory-server
        handle /api/* { reverse_proxy localhost:3010 }
    }

    handle /health { reverse_proxy localhost:3010 }

    # SPA fallback
    handle {
        root * /var/www/inventory/frontend/build
        try_files {path} /index.html
        file_server
        encode gzip
    }

    handle_errors {
        respond "{err.status_code} {err.status_text}"
    }
}
```

Removed: `/dashboard-auth/*`, `/api/aircall/*`, `/api/gorgias/*`, `/api/clarity/*`, the LAN/`Access-Control-Allow-Origin "*"` permissive defaults on `/api/*`. Kept: `/apiv2/*` and `/apiv2-test/*` proxies to backend.acherryontop.com (out of scope, separate system).

---

## Phase 8 — ecosystem.config.cjs final form

Status: **Complete — applied 2026-05-24.** Live PM2 list matches the spec below: 5 application apps (auth-server, inventory-server, dashboard-server, acot-server, chat-server) plus acot-phone-server + lt-wordlist-api = 7 total. Includes the Phase 6.4 `JWT_SECRET` shadow-override fix and the Phase 6.10 `lt-wordlist-api` token move. The `inventory-server/deploy/ecosystem.config.cjs.proposed` staging file was removed after apply — recreate from the spec below if future changes are needed.

```js
module.exports = {
  apps: [
    {
      name: 'auth-server',
      script: './inventory/auth/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', AUTH_PORT: 3011 },
      ...commonSettings,
    },
    {
      name: 'inventory-server',
      script: './inventory/src/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', PORT: 3010, UPLOADS_DIR: '/var/www/inventory/uploads' },
      ...commonSettings,
    },
    {
      name: 'dashboard-server',
      script: './inventory/dashboard/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', DASHBOARD_PORT: 3015 },
      ...commonSettings,
    },
    {
      name: 'acot-server',
      script: './inventory/dashboard/acot-server/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', ACOT_PORT: 3012 },
      ...commonSettings,
    },
    {
      name: 'chat-server',
      script: './inventory/chat/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', PORT: 3014 },
      ...commonSettings,
    },
    // acot-phone-server and lt-wordlist-api unchanged
  ],
};
```

Five entries instead of twelve. Each app loads its own `.env` from its directory (already handled by `dotenv.config`).

---

## Sequencing & dependencies

```
Phase 1 (decommission) ──┬─────────────────────────────────────────┐
                         │                                         │
                         ▼                                         │
                   Phase 2 (shared lib/)                           │
                         │                                         │
          ┌──────────────┼──────────────┐                          │
          ▼              ▼              ▼                          ▼
   Phase 3a         Phase 3b        Phase 4              Phase 6 (auth hardening
   inventory-server auth-server     dashboard-server     runs alongside 3+4+5,
   to ESM           to ESM + /verify build & test        completes after them)
          │              │              │                          │
          └──────────────┼──────────────┘                          │
                         ▼                                         │
                   Phase 5 (acot-server to ESM) ──────────────────►│
                                                                   ▼
                                                          Phase 7 (Caddy cutover)
                                                                   │
                                                                   ▼
                                                          Phase 8 (PM2 final state)
```

Phase 1 unblocks everything (fewer services to convert).
Phase 2 is the foundation; nothing else can start until shared `lib/` exists.
Phases 3–5 can run in parallel; they touch independent services.
Phase 6's sub-items can be developed alongside 3–5 but **enabled** only after them (no point adding `requirePermission` to a route that doesn't yet have `authenticate`).
**Phase F1 must precede the Phase 3+6 pm2 reload** — without the fetch wrapper, the moment the new code goes live the SPA breaks. Discovered during Phase 3+6 implementation; see Phase F1.
Phase 7 is the cutover: Caddyfile flip happens after F1 ships AND after the `/uploads/*` gating decision in F1 is made.
Phase 8 is cleanup: remove dead PM2 entries.

Estimated effort, end-to-end: **~3 weeks of focused work** by one engineer. Phase 1 ≈ 1 day, Phase 2 ≈ 2 days, Phase 3 ≈ 3 days (both services), Phase 4 ≈ 5–7 days (the merge), Phase 5 ≈ 2–3 days, Phase 6 ≈ 3–4 days, Phase F1 ≈ 0.5–1 day, Phase 7+8 ≈ 1 day.

---

## Testing strategy

No formal test suite exists today (per CLAUDE.md). For a refactor this size, that's a gap to close — but writing tests retroactively for 15K LOC of routes is a separate, larger project. For this refactor:

### Manual smoke testing per phase

A checklist of representative endpoints to hit after each deploy:

- `inventory-server`: `/api/products`, `/api/dashboard/overview`, `/api/analytics/revenue`, `/api/orders`, `/api/purchase-orders`, `/api/import/list-uploads`, `/api/config/global`
- `dashboard-server`: `/api/klaviyo/campaigns`, `/api/meta/insights`, `/api/google-analytics/...`, `/api/typeform/responses`
- `acot-server`: `/api/acot/...` (top-3 endpoints by call volume — pull from access logs)
- `auth-server`: `/login`, `/me`, `/verify`

Each smoke test runs (a) without a token → expect 401, (b) with an invalid token → expect 401, (c) with a valid token → expect 2xx.

### Frontend integration check

After deploys, log into the SPA and exercise each major page (Overview, Products, Analytics, Dashboard, Klaviyo, Meta, etc.). If everything loads and dashboards populate, the auth + routing layer is intact.

### Test scaffold during Phase 2 (committed)

While building `shared/`, set up `vitest` (lightweight, ESM-native, fast) as the standard test runner for the repo. Initial coverage focuses on the security-critical surface only:

- `shared/auth/verify.js` — known good token, expired token, wrong-signature token, malformed token, missing token.
- `shared/auth/middleware.js` — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with `req.user` populated.
- `shared/auth/middleware.js` user-cache TTL: same token within 60s → one DB hit; same token after 61s → two DB hits.

`package.json` gets a `"test": "vitest run"` script at the repo root and per-service. Set up but don't backfill broader test coverage — that's a separate, larger project. The vitest scaffold gives future work a foothold; this refactor commits to having tests for the auth boundary specifically because that's what's load-bearing for the whole security model.

---

## Rollback strategy

Each phase produces an independently deployable state. Rollback per phase:

- **Phase 1**: re-add removed services to ecosystem; restore from git. Don't roll back data deletions — only do those after a week of stable production.
- **Phases 3, 5**: ESM conversion is per-service; if one service breaks, `pm2 restart <name>` to the previous commit. Other services unaffected.
- **Phase 4**: the dashboard-server merge is the highest-risk change. Plan: deploy `dashboard-server` to a non-conflicting port (3015) while leaving the old per-vendor servers running. Cut over Caddy routes one vendor at a time (start with Meta — smallest). If any vendor breaks, point Caddy back to the old server (still running) for that vendor, debug, retry. Only delete the old servers after all four are stable on `dashboard-server`.
- **Phases 6, 7**: Caddy config is git-tracked. `git revert` + `caddy reload` rolls back in seconds. Auth changes are additive (defense in depth) — if `forward_auth` causes problems, comment it out and per-server middleware continues protecting routes.

---

## Out of scope (intentional)

These came up in the audit but aren't part of this refactor:

- `httpOnly` cookie auth ("Phase F2" — deferred). Phase F1 keeps `localStorage` + Bearer header because that's the minimum to unblock the Phase 6 `authenticate()` rollout. A future move to cookie auth would touch `AuthContext`, every backend that reads tokens, and introduce CSRF concerns — much larger project.
- Replacing PM2 with systemd or Docker.
- Test coverage beyond the auth-critical surface.
- `apiv2`/`apiv2-test` proxies to `backend.acherryontop.com` — separate system, not touched.
- `acot-phone-server` and `lt-wordlist-api` — staying as-is.
- Centralized observability stack (Prometheus, Grafana). The logger work in Phase 6.5 sets up the data, but shipping it somewhere is future work.
- ChatRoom XSS remediation (flagged during Phase 6.8 audit — `inventory/src/components/chat/ChatRoom.tsx:277,392` renders user-controlled chat content via `dangerouslySetInnerHTML` without sanitization). Real vulnerability for an internal-but-multi-user tool; separate fix.

---

## Concrete deliverables

State as of 2026-05-24: everything below is **shipped** except Phase 5 (acot-server ESM conversion), which is the only remaining work item. Note: the "4 application PM2 processes" original target became **5** in execution because `chat-server` stayed standalone rather than being folded in — never a serious merge candidate (different DB, different protocol shape).

- ✅ 5 application PM2 processes instead of 12 (auth-server, inventory-server, dashboard-server, acot-server, chat-server) — plus 2 unchanged (acot-phone-server, lt-wordlist-api) = 7 total.
- ✅ All `/api/*`, `/chat-api/*`, and `/uploads/*` requests gated at Caddy (`forward_auth`) and re-verified at each upstream (`authenticate()`).
- ✅ Sensitive endpoints additionally gated by per-permission checks (`requirePermission`).
- ⚠️ One ESM standard — done for auth/inventory/dashboard/chat. **acot-server still CJS (Phase 5 pending).**
- ✅ One shared `lib/` at `inventory-server/shared/` for auth, logging, DB, errors, CORS.
- ✅ Login rate-limited (`shared/rate-limit/login.js`).
- ✅ `JWT_SECRET` rotated + ecosystem shadow-override removed.
- ✅ Old auth-server, Aircall, Gorgias, Clarity directories deleted from the repo. Defunct `dashboard:gorgias`/`dashboard:calls` permission rows also deleted from DB (2026-05-24).
- ✅ Caddyfile slimmed to one auth-gated block.
- ✅ Permission codes inserted into `permissions` table for granular authorization.
- ✅ No half-finished pieces, no `// TODO: add auth later` comments, no deferred secrets cleanup.

---

## Deviations from original plan (recorded during execution)

These are decisions made during Phase 1/2 implementation that amend the spec above. Future phases should follow the deviated path, not the original sketch.

1. **`shared/` location.** Original plan placed `shared/` at the repo root as a sibling of `inventory/` and `acot-phone/`. Implemented at `inventory-server/shared/` (= `/var/www/inventory/shared/` on the server) instead. Reason: the actual project root *is* `/var/www/inventory/`; placing shared/ outside it would have meant building a deployment story for it that doesn't exist. Import paths change accordingly:
   - From `inventory-server/{auth,src,chat}/server.js` → `../shared/...`
   - From `inventory-server/dashboard/{vendor}-server/server.js` → `../../shared/...`

2. **`/verify` response headers.** Plan specified `X-User-Id` + `X-User-Is-Admin`. Implemented as `X-User-Id` + `X-User-Username` (both available from the JWT payload). `X-User-Is-Admin` was dropped because `is_admin` isn't in the JWT today and returning it would require a DB lookup — violating the "no DB hit" principle. To restore `X-User-Is-Admin`, enrich the JWT payload at login time (one-line change in `auth/routes.js`) during Phase 6, then echo from `/verify`. Upstreams don't trust these headers anyway (they re-verify), so the omission is informational, not security-relevant.

3. **User cache key in `shared/auth/middleware.js`.** Plan sketch mentioned "60s TTL keyed by token jti". Implemented as keyed by `userId` instead — the JWT doesn't currently include a `jti` claim, and the cache's invalidation semantics are "this user was deactivated/changed permissions" (per-user), not "this token was revoked" (per-token). The plan's pseudocode already used `loadUserCached(pool, decoded.userId)` so this matches the spirit.

4. **Redis client safety.** `shared/db/redis.js` sets `enableOfflineQueue: false` and `lazyConnect: true`. Plan didn't specify but these defaults mean a Redis hiccup fails fast (route fall-through to upstream API as designed in Phase 4 risk notes) rather than queueing commands indefinitely.

5. **CORS allowed origins kept `https://acot.site`.** Plan example listed three origins; production has acot.site as a redirect to tools.acherryontop.com but also reaches the API directly in some flows. Kept it to avoid breakage. LAN wildcards (`192.168.*`, `10.*`) and `Access-Control-Allow-Origin "*"` are NOT included in the new `shared/cors/policy.js` per the plan's Phase 6.6 spirit, but the legacy `inventory-server/src/middleware/cors.js` still has them until services are migrated to consume `shared/cors/`.

6. **Defunct permission codes cleaned up (2026-05-24).** Removed the `dashboard:gorgias` and `dashboard:calls` Protected blocks from the frontend; the corresponding permission rows in the `permissions` table (and their user_permission grants) were deleted in a follow-up migration alongside the Phase 6.2 permissions seed. Verified post-migration: `permissions` table contains only the in-use `dashboard:*` codes (analytics, campaigns, feed, financial, meta_campaigns, operations, payroll, products, realtime, sales, stats, typeform, user_behavior).

7. **PM2 process names retained `new-auth-server` (not `auth-server`).** Plan's Phase 8 final form names it `auth-server` (after the legacy 3003 one is removed). Decided to keep the existing `new-auth-server` name through Phase 2 to avoid a rename mid-stream. Phase 8 can rename if desired, but it's cosmetic — all wiring is by port (3011) not name.

8. **Caddyfile changes via admin API on `:2020`.** The Caddyfile is owned by root and matt has no passwordless sudo. Cutover used `curl -X POST .../load` on the Caddy admin port (which matt can hit), then a separate `sudo cp /home/matt/Caddyfile.new /etc/caddy/Caddyfile` step to persist the on-disk file. Future Caddyfile changes can follow the same pattern. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`.

9. **Path-naming.** Plan uses `inventory/` as the top-level (server-side path convention). Locally the equivalent is `inventory-server/`. Whenever the plan says `inventory/dashboard/foo/`, read that as `/var/www/inventory/dashboard/foo/` on the server or `inventory-server/dashboard/foo/` locally.

10. **Scripts directory kept CJS via package.json shim.** Original plan called for converting "any spawned script" to ESM alongside its caller. Implemented: added `inventory-server/scripts/package.json` with `"type": "commonjs"`. Node's package-type resolution walks up directory by directory, so this overrides the parent's `"type": "module"` for the entire `scripts/` tree (≈15 files including `import/*.js`, `metrics-new/utils/*`, the orchestrator scripts) without renaming any file or touching any `spawn()` callsite. Convert individual scripts to ESM when touched; don't bulk-migrate.

11. **`src/routes/products.js` had dead multer setup.** Phase 6.7 spec called for hardening the upload route in products.js. There was no upload route — the `multer({ dest })` instance and `importProductsFromCSV` import were dead code left over from a long-ago migration. Strongest 6.7 hardening was deletion: no upload handler = no attack surface. The two real upload paths (`/api/import/upload-image` and `/api/reusable-images/upload`) got tightened MIME+extension allowlists instead.

12. **Two pre-existing syntax errors in shared/db/ surfaced.** `shared/db/pg.js:13` and `shared/db/redis.js:22` both had `?? Number(...) || N` — mixing `??` and `||` without parentheses is a TC39 syntax error. They passed Phase 2 because nothing imported them yet; Phase 3 smoke-test exposed it. Fixed with parens.

13. **`import { Pool } from 'pg'` doesn't work in ESM.** The `pg` package is CJS using `module.exports = { Pool, ... }`. Node's ESM-from-CJS interop fails to detect `Pool` as a named export via static analysis. The bulletproof pattern, now used everywhere: `import pg from 'pg'; const { Pool } = pg;`. Same idea for any future CJS-only deps. `src/utils/db.js` already had it; the two auth files needed the fix during execution.

14. **Frontend Bearer-header gap discovered (drives new Phase F1).** Phase 6 was specified assuming the frontend already sends `Authorization: Bearer` on every API call. It does not — only 7 of ~220 call sites do. Phase 6's `authenticate()` middleware is shipped and ready to enable, but until F1 lands the SPA will 401 on every page. The plan now has Phase F1 to address this explicitly; until then, the Phase 3+6 pm2 reload should not ship unless F1 ships in the same window.

15. **macOS NFS workflow note.** The `inventory-server/` directory locally is an NFS mount of `/var/www/inventory/` on netcup. Bulk operations (`find`/`grep -r`/mass `node --check`/`npm install`) hang or take minutes locally and pollute file listings with macOS AppleDouble `._*` sidecar files. Default to `ssh netcup` for any sweep across the tree — individual file edits via the editor are fine.

16. **dashboard-server lives at `inventory-server/dashboard/` (not its own top-level dir).** Plan's Phase 4 diagram implied a sibling of `inventory-server/` etc. The merged service lives at `inventory-server/dashboard/server.js` with `package.json` declaring `"type": "module"`. Per-vendor subdirectories (`klaviyo-server/`, `meta-server/`, `google-server/`, `typeform-server/`) each have their own `package.json` so Node's "nearest parent package.json" walk stops there — they are unaffected by the new parent type. Added `"type": "commonjs"` defensively to meta/google/typeform package.json so a future deletion of their files (cutover cleanup) plus a stray `*.js` left under `dashboard/` wouldn't accidentally try to ESM-parse it.

17. **Klaviyo `RedisService` kept as a wrapper, but accepts injected client.** Plan said "replace each server's per-instance pool/redis with the injected one." The Klaviyo codebase has ~3K LOC of service code (`events.service.js` alone is 2.2K) that calls `this.redisService._getCacheKey()`, `.get()`, `.set()`, `.getEventData()`, `.clearCache()`, `._getTTL()`. Rewriting all of that to call ioredis directly would risk breaking the cache-key/TTL invariants. Decision: keep `services/klaviyo/redis.service.js` as a thin facade with the same public surface, but its constructor now takes an ioredis client instead of constructing one. The 3 service classes (`EventsService`, `CampaignsService`, `ReportingService`) all take `(apiKey, apiRevision, redis)` and pass `redis` to `new RedisService(redis)`. `MetricsService` doesn't use Redis — left unchanged.

18. **dashboard-server `.env` layering.** Plan called for "Single `.env` at `inventory/dashboard/.env`, prefixed keys: KLAVIYO_*, META_*, ... JWT_SECRET ... # shared with auth-server." Implemented as two-file layering: `server.js` loads `/var/www/inventory/.env` FIRST (provides JWT_SECRET, DB_*, REDIS_*) then `inventory-server/dashboard/.env` SECOND for vendor-specific keys (KLAVIYO_API_KEY, META_*, GA_*, TYPEFORM_*). dotenv defaults to `override:false`, so the first file wins on collisions — security-critical vars live in one place, vendor keys in the other. `.env.example` template committed at `dashboard/.env.example`. **Pre-cutover step**: copy the vendor keys from the current per-vendor `.env` files into either of those two files before `pm2 reload`, else KLAVIYO_API_KEY etc. will not be set and routes will 500.

19. **Caddyfile typo fixed: `/api/google-analytics` → `/api/dashboard-analytics`.** The pre-Phase-4 `Caddyfile.proposed` listed a `handle /api/google-analytics/*` block. The live Caddyfile and the frontend (`inventory/src/config/dashboard.ts`) both use `/api/dashboard-analytics/*` (the live file has a `uri replace /api/dashboard-analytics /api/analytics` rewrite to land on google-server's `/api/analytics` mount). The merged dashboard-server now mounts the Google router at `/api/dashboard-analytics` directly — Caddy no longer needs the rewrite, just a straight reverse_proxy. Fixed in `deploy/Caddyfile.proposed`.

20. **`metrics.routes.js` had a latent router-scope bug.** The Klaviyo `metrics.routes.js` declared `const router = express.Router()` at MODULE scope (outside the `createMetricsRouter` factory), so calling the factory twice would have re-mounted handlers on the same router (cumulative). Benign for a single-mount PM2 service, but fixed during the Phase 4 copy — the router now lives inside the factory. Also renamed the export from `createMetricsRoutes` (plural) to `createMetricsRouter` (matches the convention used by every other vendor's index.js).

21. **PM2 log paths use per-server `logs/pm2/` (NOT `/var/log/pm2/`).** Discovered during the first apply attempt: the previously-shipped `ecosystem.config.cjs.proposed` carried over `/var/log/pm2/...` from the live file, but matt has no write perms on `/var/log` (root:syslog 775) so the entries silently failed to launch (chat-server + acot-server came up because they had no explicit log path; new-auth-server, inventory-server, dashboard-server bailed). The actual convention — already in place via pre-created folders on disk — is per-service `logs/pm2/` directly under each service's directory (`./inventory/auth/logs/pm2/`, `./inventory/chat/logs/pm2/`, `./inventory/dashboard/acot-server/logs/pm2/`, `./inventory/dashboard/logs/pm2/` for the merged dashboard-server, `./inventory/logs/pm2/` for inventory-server, `/opt/lt-wordlist-api/logs/pm2/`, `/var/www/acot-phone/logs/pm2/`). All folders are matt:matt. `pm2-logrotate` (already loaded in matt's daemon) rotates them in place.

22. **All PM2 apps run under matt's single daemon — no root daemon.** The earlier `OUT OF SCOPE` comment block in the proposed ecosystem incorrectly claimed `lt-wordlist-api` and `acot-phone-server` were managed by a separate root PM2 daemon. They are not — matt's daemon manages everything. Removed the bogus block; both apps are now first-class entries in the proposed ecosystem with corrected script paths:
    - `lt-wordlist-api` script is `/opt/lt-wordlist-api/index.js` (was `/opt/lt-wordlist-api/server.js` in the live file — wrong; that file doesn't exist). `/opt/lt-wordlist-api` is matt:matt 0750.
    - `acot-phone-server` script is `/var/www/acot-phone/dist/server.js` (was `./inventory/acot-phone/server.js` in the live file — wrong; that path doesn't exist). `/var/www/acot-phone/` is matt:matt with its own `.env` and is a separate repo from inventory-server.

23. **Phase 6.10 ADD_WORD_TOKEN move stays in this ecosystem.** Per Deviation #22, `lt-wordlist-api` is in matt's ecosystem, so the §6.10 work to remove inline `ADD_WORD_TOKEN` and load it from `/opt/lt-wordlist-api/.env` instead is implemented directly in `deploy/ecosystem.config.cjs.proposed` (no inline `ADD_WORD_TOKEN`; script reads its own .env). When applying, rotate the token value in `/opt/lt-wordlist-api/.env` and update any callers.