Files
inventory/CONSOLIDATION_PLAN.md
T
2026-05-24 09:41:06 -04:00

72 KiB
Raw Blame History

Server Consolidation & Security Hardening Plan

Audit-driven plan to (a) reduce 12 PM2 processes to 3 application servers + 1 auth server, (b) put every API endpoint behind real authentication, and (c) standardize on ESM across all Node services. Approach is "do it properly the first time" — no half-finished pieces, no deferred cleanup.


Status (2026-05-24)

Phase Status Notes
1 — Decommission dead services Complete aircall/gorgias/clarity/legacy-auth-server deleted from repo + PM2 + Caddyfile + ecosystem.cjs
2 — Build shared lib/ Complete Lives at inventory-server/shared/ (see Deviations). /verify endpoint live on auth-server
3 — Convert auth-server + inventory-server to ESM Complete All 58 server-side files ESM; both services live under the ESM build for >24h. See Deviations #1013
4 — Build dashboard-server (the merge) Complete (live) — 2026-05-24 Merged service running on :3015 under PM2; Caddy routes for klaviyo/meta/dashboard-analytics/typeform all reverse-proxy to it. Old per-vendor directories (klaviyo-server, meta-server, google-server, typeform-server) and their PM2 entries deleted post-cutover — ~1.27 GB reclaimed (largely duplicated node_modules). Phase 6.2 gates wired (meta_write, klaviyo_admin). See Deviations #1619
5 — Convert acot-server to ESM Complete (live) — 2026-05-24 All 11 files (server, db/connection, utils/{phoneAuth,timeUtils}, 7 routes) converted to ESM. PM2 reload clean; SPA-driven /api/acot/events/* continues 200 across cutover; phone-server /api/acot/customers/by-phone returns 200 with correct shared secret. Phase 6 patterns applied during conversion — see Deviation #24
6 — Auth hardening Complete All in-process items live: rate-limit, JWT precondition, CORS lockdown, request-log, upload allowlist, requirePermission on sensitive routes, permissions seed migration. authenticate() live on /api/* (inventory-server, dashboard-server) and /api/acot/* (acot-server, added in Phase 5). 6.10 lt-wordlist token loaded via --env-file + rotated 2026-05-24 (Deviation #25). 6.11 (audit logging) deferred — see Out of scope
F1 — Frontend fetch wrapper Complete (live) — 2026-05-23 Wrappers at inventory/src/utils/api.ts (apiFetch) and inventory/src/utils/apiClient.ts (axios instance). 170 fetch() sites across 76 files migrated to apiFetch; 32 axios.* sites across 11 files migrated to apiClient. AuthContext /login+/me, App.tsx /me, and services/apiv2.ts (external PHP backend) intentionally left as raw fetch. Shipped alongside the Phase 3+6 pm2 reload
7 — Caddyfile final form Complete — applied 2026-05-24 Final Caddyfile live at /etc/caddy/Caddyfile (forward_auth gate + per-vendor reverse_proxy to :3015). The inventory-server/deploy/ staging folder was removed after apply — recreate from this doc if future changes are needed. Backup convention: /etc/caddy/Caddyfile.bak.YYYY-MM-DD
8 — ecosystem.config.cjs final form Complete — applied 2026-05-24 Live PM2 list matches the spec below (5 apps + acot-phone-server + lt-wordlist-api = 7 processes). Includes Phase 6.4 JWT_SECRET shadow-override fix and 6.10 lt-wordlist token move. inventory-server/deploy/ removed post-apply

Live PM2 process count: 7 (5 application apps — auth-server, inventory-server, chat-server, dashboard-server, acot-server — plus acot-phone-server + lt-wordlist-api). Down from 13 pre-refactor.

All planned phases complete (2026-05-24). Phase 5 was the last code-level deliverable; acot-server now runs as an ESM service with shared-lib authenticate() defense-in-depth.

All originally planned phases shipped. Two real gaps surfaced during Phase 5 verification — both closed 2026-05-24:

  • Phase 6.10 lt-wordlist token rotation — fixed via Node's native --env-file flag in the PM2 entry; token rotated to a fresh 32-byte hex value in /opt/lt-wordlist-api/.env (mode 0600). See Deviation #25.
  • Phase 6.1 Caddy gate blocking acot-phone-server's customer lookups — fixed by repointing ACOT_API_URL from the public host to http://localhost:3012/api/acot. See Deviation #26.

Goals

  • Every public-facing endpoint requires a valid auth token (Caddy gate + per-server middleware + per-route permission checks for sensitive operations).
  • Reduce service count from 12 PM2 processes to 4: inventory-server, acot-server, dashboard-server, auth-server.
  • Standardize on ESM ("type": "module") across all Node services.
  • Decommission aircall-server, gorgias-server, clarity-server, and the legacy auth-server (port 3003).
  • Eliminate dependency duplication: one Redis client, one Postgres pool helper, one logger, one auth middleware — shared across services.

Non-goals

  • Rewriting business logic. Route handlers move as-is unless they break under ESM or shared middleware.
  • Switching auth providers (we keep JWT + bcrypt + Postgres).
  • Replacing PM2 or Caddy.
  • Migrating Klaviyo/Meta/Google/Typeform's external API contracts.

Target architecture

                                  ┌──────────────────────────┐
                                  │  tools.acherryontop.com  │
                                  │         (Caddy)          │
                                  │   forward_auth gate ─────┼──► auth-server:3011
                                  └────────────┬─────────────┘     /verify endpoint
                                               │
              ┌────────────────────────────────┼────────────────────────────────┐
              ▼                                ▼                                ▼
   ┌─────────────────────┐         ┌──────────────────────┐         ┌─────────────────────┐
   │  inventory-server   │         │   dashboard-server   │         │     acot-server     │
   │      :3010 (ESM)    │         │      :3015 (ESM)     │         │     :3012 (ESM)     │
   │                     │         │                      │         │                     │
   │  /api/products      │         │  /api/klaviyo/*      │         │  /api/acot/*        │
   │  /api/orders        │         │  /api/meta/*         │         │  (MySQL via SSH)    │
   │  /api/analytics     │         │  /api/google-*/*     │         │                     │
   │  /api/dashboard     │         │  /api/typeform/*     │         │                     │
   │  ... (~25 routers)  │         │                      │         │                     │
   └─────────────────────┘         └──────────────────────┘         └─────────────────────┘
              │                                │                                │
              ├── Postgres (inventory_db)      ├── Postgres (klaviyo)           └── MySQL (workpi, via ssh2 tunnel)
              ├── shared lib/ ◄────────────────┤                                
              │   - auth middleware            ├── Redis (shared client)
              │   - permission helper          └── shared lib/ ◄─────────────────┐
              │   - logger                                                       │
              │   - pg pool factory                                              │
              │   - error formatter                                              │
              └─────────────────────────────────────────────────────────────────┘
                                                                                 │
                                                              ┌──────────────────┴───┐
                                                              │     auth-server      │
                                                              │      :3011 (ESM)     │
                                                              │   /login, /me,       │
                                                              │   /verify, user mgmt │
                                                              └──────────────────────┘

PM2 process count: 12 → 4 (plus acot-phone-server and lt-wordlist-api, which stay as-is — out of scope).


Phase 1 — Decommission dead/leaving services

Status: Complete (2026-05-23). All four services removed from repo, PM2, Caddyfile, and ecosystem.config.cjs. Frontend widgets (AircallDashboard.jsx, GorgiasOverview.jsx) and their dashboard.ts/Navigation.jsx/vite.config.ts wiring also removed. Verification: smoke-tested https://tools.acherryontop.com/api/{aircall,gorgias,clarity}/* → 404. Backups left at /home/matt/{ecosystem.config.cjs,Caddyfile}.bak.2026-05-23.

To remove

Service Reason Steps
aircall-server (3002) Migrating off Aircall pm2 delete aircall-server; remove from ecosystem.config.cjs; remove /api/aircall/* from Caddyfile; drop inventory/dashboard/aircall-server/ directory; remove MongoDB connection from any frontend code; cancel Mongo if it was only feeding Aircall
gorgias-server (3006) Migrating off Gorgias same pattern; check frontend for /api/gorgias/* callers and delete the dashboards/widgets that use them
clarity-server (3009) Already dead (no .js files, not in ecosystem) remove /api/clarity/* from Caddyfile; delete inventory/dashboard/clarity-server/ directory
auth-server (3003, legacy) Replaced by new-auth-server on 3011 grep entire codebase for dashboard-auth and localhost:3003; redirect or remove callers; pm2 delete auth-server; remove from ecosystem; remove /dashboard-auth/* from Caddyfile; delete inventory/dashboard/auth-server/ directory

Verification before deletion

# from inventory/ root — find any references before removing
grep -rn "aircall\|/api/aircall" inventory/src/ inventory-server/src/
grep -rn "gorgias\|/api/gorgias" inventory/src/ inventory-server/src/
grep -rn "/dashboard-auth\|localhost:3003" inventory/src/ inventory-server/src/
grep -rn "/api/clarity" inventory/src/ inventory-server/src/

Any remaining callers must be deleted or repointed before the server is removed. Do not leave a 502 response in production.

Database/secret cleanup

  • Drop the MongoDB instance feeding Aircall (after confirming no other consumers).
  • Rotate any Gorgias/Aircall API keys still in .env files (defense in depth — they'll be useless soon anyway, but commit hygiene matters).
  • Remove MONGODB_URI, AIRCALL_*, GORGIAS_* from any .env files.

Phase 2 — Build the shared lib/

Status: Complete (2026-05-23). All 11 modules written under inventory-server/shared/ (NOT repo root — see Deviations). /verify endpoint added to auth-server in CJS form (will move to shared/auth/verify.js usage during Phase 3 ESM conversion). Smoke-tested with no-token / bad-token / expired-token / valid-token cases. No service consumes shared/ yet; that happens in Phases 35.

Location

A single shared directory at the repo root: shared/ (sibling of inventory/ and acot-phone/). Each service imports from it via a relative path. We do not introduce npm workspaces yet — relative imports are fine for three consumers and avoid the npm-link / hoisting headaches.

Modules to create

shared/
├── package.json          # "type": "module"
├── auth/
│   ├── middleware.js     # authenticate(), requirePermission(), requireAdmin()
│   └── verify.js         # verifyToken() — pure function, no Express dependency
├── db/
│   ├── pg.js             # createPool(envPrefix) — returns configured Pool
│   └── redis.js          # createRedis() — single client, lazy-connect
├── logging/
│   ├── logger.js         # pino-based, redacts Authorization/Cookie
│   └── request-log.js    # Express middleware, structured access log
├── errors/
│   └── handler.js        # consistent error envelope, no leak in prod
├── cors/
│   └── policy.js         # single allowed-origins list, exported as cors() options
└── rate-limit/
    └── login.js          # express-rate-limit config for /login

Auth middleware spec (shared/auth/middleware.js)

// Pseudocode — final implementation matches the existing pattern in
// inventory/auth/routes.js authenticate() but factored out.

export function authenticate({ pool }) {
  return async (req, res, next) => {
    const header = req.headers.authorization;
    if (!header?.startsWith('Bearer ')) {
      return res.status(401).json({ error: 'Authentication required' });
    }
    try {
      const decoded = jwt.verify(header.slice(7), process.env.JWT_SECRET);
      // Short-circuit DB hit with an in-memory cache, 60s TTL keyed by token jti
      const user = await loadUserCached(pool, decoded.userId);
      if (!user.is_active) return res.status(403).json({ error: 'Account inactive' });
      req.user = user;
      next();
    } catch {
      res.status(401).json({ error: 'Invalid token' });
    }
  };
}

export function requirePermission(code) {
  return (req, res, next) => {
    if (req.user.is_admin) return next();
    if (req.user.permissions?.includes(code)) return next();
    res.status(403).json({ error: 'Insufficient permissions' });
  };
}

export const requireAdmin = (req, res, next) =>
  req.user.is_admin ? next() : res.status(403).json({ error: 'Admin only' });

Why a 60s in-memory user cache

forward_auth in Caddy will call auth-server on every request. Each per-server authenticate() middleware also has a DB lookup to load permissions. Without caching, every API request becomes 1 SQL query for the user row + 1 for permissions. 60s TTL is short enough that deactivating a user takes effect within a minute, long enough that Klaviyo dashboards (which fire dozens of requests on load) don't hammer Postgres.

Add to auth-server: a /verify endpoint

Caddy's forward_auth only needs "is this token valid? give me a user-id." Today's /me does that but with a full permissions join. Add a lightweight /verify that:

  • Verifies JWT signature only (no DB hit).
  • Returns 200 with X-User-Id and X-User-Is-Admin response headers (which Caddy copy_headers will pass to upstream).
  • Returns 401 on bad token.

Decision: each service re-verifies the JWT independently. Caddy's forward_auth is a fast first-pass reject for obviously bad tokens, but the security boundary is the per-server authenticate() middleware. Cost is negligible (one HMAC-SHA256 per request); the upside is that a misconfigured Caddyfile can never let an unauthenticated request reach a backend. Upstream services do not trust any X-User-* headers from Caddy — they parse the Authorization header themselves.


Phase 3 — Convert auth-server and inventory-server to ESM

Status: Complete (live) — 2026-05-24. Both servers + all sub-trees converted to ESM and running under PM2. 58 importable .js files. Two latent bugs surfaced and fixed during the conversion: ??/|| precedence in shared/db/{pg,redis}.js, and CJS named-import of Pool from pg in both auth files (now uses import pg from 'pg'; const { Pool } = pg).

Scripts under inventory-server/scripts/ (one-shot maintenance / orchestrators) kept CommonJS via a sibling scripts/package.json declaring "type": "commonjs" — Node's package-type resolution walks up directory by directory, so this overrides the parent's "type": "module" without renaming any file or touching any spawn() callsite. Convert individual scripts to ESM if/when touched.

Went live 2026-05-24 after npm install on netcup (new deps: pino, pino-http, ioredis, express-rate-limit, jsonwebtoken) + pm2 reload. Phase F1 (frontend fetch wrapper) shipped in the same window so the SPA continued to send Authorization: Bearer on every request as authenticate() came online.

Mechanical conversion

Per service:

  1. Add "type": "module" to package.json.
  2. Convert require()import. module.exportsexport / export default.
  3. Fix __dirname/__filename (use import.meta.url + fileURLToPath).
  4. Convert any dynamic require (e.g., conditional plugin loading) to await import().
  5. Update any sub-imports that don't include the file extension — ESM requires ./foo.js, not ./foo.
  6. Update ecosystem.config.cjs if any service entry depended on CJS semantics. The ecosystem file itself can stay .cjs — PM2 reads it as config, doesn't matter what the apps it spawns are.
  7. Update nodemon config / scripts.

Risk areas in inventory-server

  • routes/ai.js does a lazy init (aiRouter.initInBackground() called from server.js) — confirm the export shape still works as a default export of an Express router with a sidecar function. May need to split into export default router; export function initInBackground() {}.
  • Multer setup in routes/import.js — straightforward, no ESM-specific concerns.
  • SSE setup in server.js — moves over cleanly, no module-system entanglement.
  • The child_process.spawn calls for metrics calculation: ESM doesn't change child_process behavior, but if any spawned script uses require() of a sibling, that sibling must also be ESM (or stay CJS with a .cjs extension).

Test strategy

  • After conversion, pm2 start ecosystem.config.cjs --only inventory-server on the server, watch logs for require/import errors at startup.
  • Hit /health, then the most exercised endpoints (/api/products, /api/dashboard/overview, /api/analytics/...). If startup is clean and three smoke endpoints work, ESM conversion is done. Functional correctness is preserved because no logic changed.

Auth-server

Already small (~200 LOC server.js + ~few hundred in routes.js + permissions.js). 1-day conversion. Add the new /verify endpoint as part of this work.


Phase 4 — Build dashboard-server (the merge)

Status: Complete (live) — 2026-05-24. Klaviyo + Meta + Google + Typeform merged into a single ESM service at inventory-server/dashboard/server.js. Shared Pool + ioredis client injected through router factories. Phase 6.2 permission gates wired (meta_write on Meta budget/status mutations; klaviyo_admin on Klaviyo /events/clearCache). Post-cutover cleanup (2026-05-24) deleted the four old per-vendor directories (klaviyo-server, meta-server, google-server, typeform-server) along with their PM2 entries — ~1.27 GB reclaimed, largely duplicated node_modules across vendors. Original boot test on netcup: /health 200; unauthenticated /api/klaviyo/* returns {"error":"No token provided"} HTTP 401 via shared authenticate().

Layout

inventory/dashboard/
├── server.js                  # entry: load env, init Postgres+Redis, mount routes, listen
├── package.json               # "type": "module", deps from all 4 source servers (deduped)
├── .env                       # KLAVIYO_*, META_*, GOOGLE_*, TYPEFORM_*, shared DB_*, REDIS_URL
├── routes/
│   ├── klaviyo/               # absorbed from dashboard/klaviyo-server/src/
│   ├── meta/                  # absorbed from dashboard/meta-server/
│   ├── google/                # absorbed from dashboard/google-server/
│   └── typeform/              # absorbed from dashboard/typeform-server/
├── services/                  # per-vendor API clients (Klaviyo SDK calls, etc.)
├── scripts/
│   └── import-campaign-products.js  # one-shot, moved from klaviyo-server/scripts/
└── logs/

Mount points

// server.js (sketch)
import { authenticate, requirePermission } from '../../shared/auth/middleware.js';
import { createPool } from '../../shared/db/pg.js';
import { createRedis } from '../../shared/db/redis.js';
import { logger, requestLog } from '../../shared/logging/index.js';
import corsPolicy from '../../shared/cors/policy.js';
import errorHandler from '../../shared/errors/handler.js';

import klaviyoRouter from './routes/klaviyo/index.js';
import metaRouter from './routes/meta/index.js';
import googleRouter from './routes/google/index.js';
import typeformRouter from './routes/typeform/index.js';

const app = express();
const pool = await createPool('KLAVIYO_DB');  // klaviyo has its own DB; others can share or have none
const redis = await createRedis();

app.use(requestLog);
app.use(cors(corsPolicy));
app.use(express.json({ limit: '10mb' }));

// Everything below this line requires a valid token.
app.use('/api', authenticate({ pool }));

app.use('/api/klaviyo', klaviyoRouter({ pool, redis }));
app.use('/api/meta', metaRouter({ redis }));
app.use('/api/google-analytics', googleRouter({ redis }));  // matches Caddy /api/dashboard-analytics rewrite
app.use('/api/typeform', typeformRouter({ redis }));

app.get('/health', (req, res) => res.json({ ok: true }));
app.use(errorHandler);

app.listen(process.env.DASHBOARD_PORT || 3015);

Per-vendor routers

Each vendor's existing route file becomes a factory that takes the shared pool/redis and returns an Express router. Replace each server's per-instance pool/redis with the injected one.

Permission gates (sensitive routes only)

Authenticated-only is the default after app.use('/api', authenticate(...)). For sensitive operations, add requirePermission per route:

  • Anything that mutates Klaviyo lists/segments → requirePermission('klaviyo_write')
  • Triggering a campaign sync → requirePermission('klaviyo_admin')
  • Read-only dashboards → no extra check beyond authenticate.

Define the new permission codes in the permissions table via a migration in Phase 6.

Dependency dedup

Decision: standardize on ioredis. Klaviyo's larger codebase already uses it, and ioredis has better cluster/sentinel support if we ever need it. Update meta/google/typeform call sites — each is a handful of get/set calls, mechanical conversion. Remove the redis package from dashboard-server's package.json.

Env consolidation

Single .env at inventory/dashboard/.env, prefixed keys:

DASHBOARD_PORT=3015
KLAVIYO_API_KEY=...
KLAVIYO_DB_HOST=...
KLAVIYO_DB_NAME=...
META_ACCESS_TOKEN=...
GOOGLE_SERVICE_ACCOUNT_KEY=...
TYPEFORM_TOKEN=...
REDIS_URL=...
JWT_SECRET=...  # shared with auth-server; same secret means same tokens valid here

Klaviyo's scripts/import-campaign-products.js

One-shot script — keep it, but run it from the merged dashboard-server's directory. Update the script's imports to ESM. If it's run via cron, update the cron entry to the new path.

Risk: shared error states

When all four vendors share a Redis client, a Redis hiccup affects all four. Make sure the connection has retry config (ioredis defaults are reasonable but verify) and that vendor routes degrade gracefully when Redis is unavailable (most use it as a cache, so cache-miss → fall through to upstream API is the right behavior).


Phase 5 — Convert acot-server to ESM (stays standalone)

Status: Complete (live) — 2026-05-24. 11 files converted (server.js, db/connection.js, utils/{phoneAuth,timeUtils}.js, 7 route files — ~5.2K LOC). PM2 reload clean; SPA-driven /api/acot/events/{projection,stats} continues 200 across cutover; phone-server /api/acot/customers/by-phone returns 200 with correct x-acot-api-key. Per Deviation #13, ssh2 is CJS-only → uses import ssh2 from 'ssh2'; const { Client } = ssh2;. Phase 6 patterns applied during conversion (Deviation #24).

Special concern: ssh2 tunnel

acot-server opens an SSH tunnel via ssh2 to access the production MySQL at 192.168.1.5:3309. Lifecycle today (preserved verbatim across the ESM conversion, not refactored):

  • Lazy establishment. No tunnel at startup; the first getDbConnection() call sets one up. HTTP listener comes up immediately without waiting for the tunnel. Acceptable — the first per-route request just pays the tunnel-creation latency once.
  • Per-connection ssh client. Each pooled MySQL connection owns its own ssh2.Client. Closing a connection closes its own SSH client.
  • No reconnect on disconnect. There is no close listener on the SSH client. If the SSH connection drops while the MySQL connection is pooled (not in use), the next caller that pops it will get a query failure. Circuit-breaker absorbs repeated failures (5 failures → 30s open). Mitigation acceptable for current call volume; revisit if SSH drops become observable in logs.
  • SIGTERM/SIGINT teardown. server.close()closeAllConnections() ends MySQL connections and SSH clients in sequence. Confirmed clean during the Phase 5 cutover (SIGTERM signal received: closing HTTP server → 10 × Closed pooled connectionAll connections closed and pool reset in PM2 logs).

Auth model (two flavors, intentional)

server.js mounts the customers router BEFORE the global authenticate() so the two auth schemes don't collide:

  • /api/acot/customers/*requirePhoneApiKey (timing-safe x-acot-api-key check). Used by acot-phone-server.
  • everything else → JWT Bearer via shared/auth/middleware.js authenticate(). Used by the SPA.

This works at the in-process layer. The public path through Caddy is a separate issue — see Deviation #26.


Phase 6 — Auth hardening

Status: Complete (live) — 2026-05-24. All hardening (in-process + edge) is live in production. The Phase 3 ESM conversion + Phase 6 middleware shipped together, with Phase F1 (frontend fetch wrapper) flipping immediately ahead of the pm2 reload so the SPA continued to carry Authorization: Bearer on every API call. Caddy forward_auth gate and the JWT_SECRET ecosystem fix went live with the Phase 7/8 apply on 2026-05-24.

Per-item status:

# Item Status Where
6.1 Caddy forward_auth gate Live — 2026-05-24 Applied via Caddy admin API + sudo cp to /etc/caddy/Caddyfile. @gated path /api/* /chat-api/* /uploads/* block hits localhost:3011/verify on every request
6.2 requirePermission on sensitive routes + permissions migration Done inline in config.js, data-management.js, import.js, ai-prompts.js, ai-validation.js, templates.js, reusable-images.js; codes seeded by migrations/005_phase6_permission_codes.sql. Phase 4 follow-on (2026-05-23): meta_write wired on PATCH /api/meta/campaigns/:id/budget and POST /api/meta/campaigns/:id/:action; klaviyo_admin wired on POST /api/klaviyo/events/clearCache. Read-only Google + Typeform endpoints stay authenticated-only (reserved write codes left in migration 005 for future)
6.3 Login rate-limit + /verify rate-limit Done auth/server.js uses shared/rate-limit/login.js (loginLimiter, verifyLimiter)
6.4 JWT_SECRET as startup precondition + ecosystem footgun fix Live — 2026-05-24 Both auth-server and inventory-server process.exit(1) if JWT_SECRET is unset. The JWT_SECRET: process.env.JWT_SECRET override that was shadowing .env is removed from the live ecosystem.cjs
6.5 Structured request logging w/ redaction Done shared/logging/request-log.js (pino-http, redacts Authorization/Cookie); mounted in both auth/server.js and src/server.js
6.6 CORS lockdown Done src/middleware/cors.js now re-exports shared/cors/policy.js. LAN wildcards (192.168.*, 10.*) and * defaults gone
6.7 Upload hardening Done Exact-match MIME+extension allowlist on routes/import.js and routes/reusable-images.js; dead multer({ dest }) removed from routes/products.js (no upload route was using it — strongest hardening was deletion)
6.8 Frontend token storage stays localStorage + XSS audit Audited Confirmed dangerouslySetInnerHTML is sanitized in ProductEditor.tsx. Flagged: ChatRoom.tsx:277,392 renders user-controlled chat content as raw HTML — real XSS vector, separate fix needed
6.9 Remove debug middleware Done The header-dumping app.use((req,res,next)=>{ console.log(... req.headers ...) }) block removed from src/server.js. Replaced with shared/logging/request-log.js (which redacts).
6.10 lt-wordlist-api token move Live — 2026-05-24 Live PM2 entry runs /opt/lt-wordlist-api/index.js under matt's daemon; ADD_WORD_TOKEN is no longer inline in ecosystem.cjs and is read from /opt/lt-wordlist-api/.env. See Deviations #2123 for the path corrections and the (incorrect) earlier assumption that this app lived under a separate root daemon
6.11 Audit logging for sensitive ops Deferred Out of scope for this pass per user direction. Existing import_audit_log and product_editor_audit_log tables stay as-is; generic system_audit_log table + middleware is its own project

6.1 Caddy forward_auth gate

Add to the tools.acherryontop.com block, before the @api_routes handler:

# Forward-auth gate for all API traffic
@needs_auth path /api/* /chat-api/*
handle @needs_auth {
    forward_auth localhost:3011 {
        uri /verify
        copy_headers Authorization
        # On 401/403, Caddy returns the auth-server's response body verbatim
    }
    # Existing per-vendor handle blocks remain below this line
}

# /auth-inv/* stays public (you need to log in!)
handle /auth-inv/* {
    uri strip_prefix /auth-inv
    reverse_proxy localhost:3011
}

The forward_auth directive subrequests /verify on the auth-server. If it returns 2xx, the request proceeds upstream. If 401/403, Caddy returns that response to the client and never hits the backend.

This is the first line of defense. Per-server middleware (shared/auth/middleware.js) is the second line — re-verifies the JWT independently. Defense in depth: a Caddyfile typo can't open a hole.

6.2 Per-route permission gates

After per-server authenticate(), add requirePermission(code) to destructive or sensitive routes. Audit needed in:

  • inventory-server/src/routes/config.js — global config writes → admin
  • inventory-server/src/routes/import.js — uploads, deletes, generate-upc → product_import
  • inventory-server/src/routes/data-management.js — CSV operations → data_management
  • inventory-server/src/routes/ai-prompts.js — prompt edits → ai_admin
  • inventory-server/src/routes/templates.js — template writes → templates_write
  • inventory-server/src/routes/reusable-images.js — image management → image_admin
  • inventory-server/src/routes/products.js — only one POST (/resolve-identifiers); evaluate whether it needs a permission code or authenticated-only is fine
  • inventory-server/src/routes/product-editor-audit-log.js and import-audit-log.js — read-only by sensitive users → audit_read
  • dashboard-server Klaviyo/Meta/Google/Typeform write endpoints → vendor-specific codes per above

Migration: a single SQL script that inserts the new permission codes into the permissions table and assigns them to existing admin users. Non-admin users get permissions explicitly granted via the user management UI.

INSERT INTO permissions (code, name) VALUES
  ('product_import', 'Product Import'),
  ('data_management', 'Data Management'),
  ('ai_admin', 'AI Settings Admin'),
  ('templates_write', 'Template Editing'),
  ('image_admin', 'Image Management'),
  ('audit_read', 'Audit Log Access'),
  ('klaviyo_write', 'Klaviyo Write'),
  ('klaviyo_admin', 'Klaviyo Admin'),
  ('meta_write', 'Meta Write'),
  ('google_write', 'Google Analytics Write'),
  ('typeform_write', 'Typeform Write'),
  ('acot_admin', 'ACOT Server Admin')
ON CONFLICT (code) DO NOTHING;

6.3 Rate limiting on login

shared/rate-limit/login.js:

import rateLimit from 'express-rate-limit';
export const loginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,      // 15 minutes
  max: 10,                        // 10 attempts per IP per window
  message: { error: 'Too many login attempts, try again later' },
  standardHeaders: true,
  legacyHeaders: false,
});

Apply in auth-server on the /login route. Consider also rate-limiting /verify and /me (much higher cap, ~600/min — they're called legitimately by every page load).

6.4 JWT secret rotation

  • Rotate JWT_SECRET to a fresh 32-byte random string as part of the deployment.
  • Document that rotation logs out all users — acceptable for an internal tool, do it during off-hours.
  • Add JWT_SECRET to the env var validation block in auth-server/server.js (refuse to start if not set).
  • Fix the existing footgun: /var/www/ecosystem.config.cjs currently has JWT_SECRET: process.env.JWT_SECRET after ...inventoryEnv in the new-auth-server block. This shadows the .env value with whatever the shell exported when PM2 was started — which has already silently diverged at least once (detected and fixed 2026-05-23 by a clean PM2 restart in a shell without JWT_SECRET exported). Delete that override line during rotation; let .env be the single source of truth.

6.5 Request logging

shared/logging/request-log.js — log method, path, status, duration, user-id (if authenticated). Never log Authorization or Cookie headers. Remove the current server.js:79-87 debug middleware in inventory-server (it logs full headers including the bearer token).

6.6 CORS lockdown

Current middleware/cors.js allows 192.168.*.* and 10.*.*.* with credentials: true. Tighten to explicit known origins:

origin: [
  'https://tools.acherryontop.com',
  'https://inventory.kent.pw',
  /^http:\/\/localhost:(5174|5175)$/,
]

If anyone genuinely needs LAN access, add their specific IP, not a /16 range.

6.7 Upload hardening

POST /api/import/upload-image (multer-backed) needs:

  • File-size limit set on multer config (current limit may be defaulted — verify).
  • MIME-type allowlist (image/jpeg, image/png, image/webp; reject everything else).
  • Filename sanitization (no .., no absolute paths, generate UUID-based names server-side).
  • The Caddy /uploads/* handler currently serves any file in the uploads directory publicly. Move this behind the auth gate: include /uploads/* in @needs_auth. If some images are referenced from public emails (Klaviyo newsletter), put those in a separate public bucket; everything else stays gated.

6.8 Frontend token storage

Decision: stay on localStorage. This is an internal tool with no untrusted user-generated HTML being rendered, so the XSS-token-theft surface is small. The forward_auth gate is the main security gap we're addressing; cookie-based auth would be a larger, separate project (cookie-parser, CSRF double-submit pattern, AuthContext refactor) that doesn't change the threat model for an internal tool with no public sign-up.

Sanity check during this refactor: grep the React codebase for dangerouslySetInnerHTML. If any usages exist, verify each one is rendering trusted (server-controlled, not user-supplied) content. If a user-supplied content path exists, that's a real XSS vector and needs separate remediation regardless of token-storage choice.

6.9 Remove debug middleware

inventory-server/src/server.js:79-87 logs full request headers including Authorization. Delete this block. Replace with shared/logging/request-log.js.

6.10 lt-wordlist-api token

ADD_WORD_TOKEN is currently hardcoded in /var/www/ecosystem.config.cjs. Move to /opt/lt-wordlist-api/.env, rotate the token value, update any callers.

6.11 Audit logging for sensitive operations

Already have import-audit-log and product-editor-audit-log tables. Extend the pattern:

  • Log user_id, endpoint, params, result for config.js writes and data-management.js operations.
  • Schema: reuse the existing audit table pattern or add a generic system_audit_log table.
  • Don't log request bodies wholesale (may contain large blobs); log the action and the target ID.

Phase F1 — Frontend fetch wrapper (NEW — 2026-05-23)

Status: Complete (live) — 2026-05-24. Two wrappers landed at inventory/src/utils/api.ts and inventory/src/utils/apiClient.ts. Migration touched 87 files (76 fetch, 11 axios) covering ~200 call sites. Type-check clean; production build clean. Intentional exclusions: AuthContext /login+/me (own auth flow), App.tsx initial /me session check, and services/apiv2.ts (calls the separate PHP backend at backend.acherryontop.com which has its own cookie auth, out of scope per the plan). Shipped alongside the Phase 3+6 pm2 reload.

The discovery

While wiring authenticate() on /api/* in Phase 6.1/6.2, we audited the frontend's fetch usage and found:

  • 7 call sites send Authorization: Bearer ${token} explicitly (all in AuthContext.tsx for /me + /login, plus a couple of settings/* pages).
  • ~220 other fetch(...) / axios.*(...) call sites across inventory/src/services/, inventory/src/pages/, inventory/src/components/ send no Authorization header at all.
  • There is no global fetch wrapper, axios interceptor, or service-worker shim that injects the token.

Today this works because nothing on the server checks. Caddy currently has no forward_auth gate (Phase 6.1 is a Caddyfile change that hasn't shipped yet) and the previous inventory-server had no authenticate() middleware. The frontend's auth model was "you log in once to get the token; the token is checked only by /me; everything else is implicitly trusted at the network layer."

With Phase 6 code in production, every page refresh 401s on the first API call after the next pm2 reload. The user explicitly accepted this when authorising the Phase 6 work — but the fix is its own deliverable, and shipping Phase 3+6 to PM2 without F1 in the same window means an outage window measured in however long F1 takes (not minutes).

Add a single fetch wrapper at inventory/src/utils/api.ts (or similar) and migrate the ~220 call sites to use it. The wrapper:

  1. Reads localStorage.getItem('token') on every call (cheap; localStorage is sync).
  2. Merges Authorization: Bearer ${token} into the request headers if a token exists.
  3. Intercepts 401 responses → fires window.dispatchEvent(new Event('auth:logout')) (a listener already exists in AuthContext.tsx:117) so the user gets bounced to /login cleanly instead of seeing broken pages.
  4. Preserves the existing call shape — apiFetch(url, init) should be a drop-in for fetch(url, init) so the migration is mechanical.
// inventory/src/utils/api.ts (sketch)
export async function apiFetch(input: RequestInfo | URL, init: RequestInit = {}): Promise<Response> {
  const token = localStorage.getItem('token');
  const headers = new Headers(init.headers);
  if (token && !headers.has('Authorization')) {
    headers.set('Authorization', `Bearer ${token}`);
  }
  const res = await fetch(input, { ...init, headers });
  if (res.status === 401 && token) {
    // Token expired or revoked — bounce to /login. AuthContext already listens.
    window.dispatchEvent(new Event('auth:logout'));
  }
  return res;
}

Same shape for axios:

// inventory/src/utils/apiClient.ts (sketch)
import axios from 'axios';
export const apiClient = axios.create();
apiClient.interceptors.request.use((config) => {
  const token = localStorage.getItem('token');
  if (token) config.headers.Authorization = `Bearer ${token}`;
  return config;
});
apiClient.interceptors.response.use(
  (r) => r,
  (err) => {
    if (err?.response?.status === 401) window.dispatchEvent(new Event('auth:logout'));
    return Promise.reject(err);
  },
);

Migration plan

  1. Land the two wrapper modules above. ~50 LOC total.
  2. Codemod or sed-loop: in inventory/src/, replace fetch(apiFetch( (with the right import) and axios.get/post/...apiClient.get/post/.... ~220 call sites — a half-day of careful find-and-replace plus per-page verification. Spot-check the ones with custom Content-Type (multipart uploads especially) so the wrapper doesn't clobber multipart boundaries.
  3. Leave the AuthContext.tsx /login and /me calls alone — they already work and migrating them adds no value.
  4. Run the SPA: log in, exercise Overview / Products / Analytics / Dashboard / etc. with browser devtools open watching for Authorization header on every /api/* request.

Sequencing with Phase 3+6 deploy

Two options:

A) Ship F1 first (recommended). Frontend goes out with the wrapper; nothing changes server-side. Then pm2 reload Phase 3+6. Zero-downtime, zero broken-page window.

B) Ship together. F1 and Phase 3+6 land in the same deploy. Brief window (seconds) where the frontend has the wrapper but the server hasn't reloaded yet — wrapper just sends extra headers the old server ignores. Safe.

Do not ship Phase 3+6 first and F1 second. That gives a broken app for as long as F1 takes.

Out of scope (kept on localStorage)

Per Phase 6.8, we're not migrating to httpOnly cookie auth. F1 is the minimum work to make the per-service authenticate() (Phase 6) actually usable. A future Phase F2 could move to cookies + CSRF double-submit, but that's a much larger change touching the AuthContext, the login flow, and every backend that reads tokens. Not justified for an internal tool with no public sign-up.

Note on /uploads/* gating (Phase 6.7's Caddyfile change)

Applied as-spec (2026-05-24): /uploads/* is behind forward_auth in the live Caddyfile. <img src="/uploads/..."> references in the SPA are browser-issued GETs that don't carry Authorization headers — verify image display works end-to-end (cookies fall-through, signed URLs, or session-bound forward_auth) and if broken, revert this part of 6.7 to keep /uploads/* public, OR issue per-image signed URLs from the API.


Phase 7 — Caddyfile final form

Status: Complete — applied 2026-05-24. Final Caddyfile live at /etc/caddy/Caddyfile; vendor handles point at the merged dashboard-server on :3015. The inventory-server/deploy/ staging folder (which held Caddyfile.proposed and the README of apply commands) was removed after apply — recreate from the spec below if future changes are needed. Apply pattern (admin-API curl -X POST :2020/load + sudo cp to persist on-disk) is captured in Deviation #8. Backup convention: /etc/caddy/Caddyfile.bak.YYYY-MM-DD.

After all phases, the tools.acherryontop.com block looks like:

tools.acherryontop.com {
    import security_headers

    # Public: login endpoint
    handle /auth-inv/* {
        uri strip_prefix /auth-inv
        reverse_proxy localhost:3011
    }

    # Public: static frontend assets
    @static path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2
    handle @static {
        header Cache-Control "public, max-age=2592000"
        root * /var/www/inventory/frontend/build
        file_server
    }

    # All API + uploads: auth gate first
    @gated path /api/* /chat-api/* /uploads/*
    handle @gated {
        forward_auth localhost:3011 {
            uri /verify
            copy_headers Authorization
        }

        # Uploaded files
        handle /uploads/* {
            root * /var/www/inventory
            file_server
        }

        # Vendor dashboard routes → merged dashboard-server
        handle /api/klaviyo/*           { reverse_proxy localhost:3015 }
        handle /api/meta/*              { reverse_proxy localhost:3015 }
        handle /api/dashboard-analytics/* { reverse_proxy localhost:3015 }
        handle /api/typeform/*          { reverse_proxy localhost:3015 }

        # ACOT-specific
        handle /api/acot/*         { reverse_proxy localhost:3012 }

        # Chat
        handle /chat-api/* {
            uri strip_prefix /chat-api
            reverse_proxy localhost:3014
        }

        # Catch-all: inventory-server
        handle /api/* { reverse_proxy localhost:3010 }
    }

    handle /health { reverse_proxy localhost:3010 }

    # SPA fallback
    handle {
        root * /var/www/inventory/frontend/build
        try_files {path} /index.html
        file_server
        encode gzip
    }

    handle_errors {
        respond "{err.status_code} {err.status_text}"
    }
}

Removed: /dashboard-auth/*, /api/aircall/*, /api/gorgias/*, /api/clarity/*, the LAN/Access-Control-Allow-Origin "*" permissive defaults on /api/*. Kept: /apiv2/* and /apiv2-test/* proxies to backend.acherryontop.com (out of scope, separate system).


Phase 8 — ecosystem.config.cjs final form

Status: Complete — applied 2026-05-24. Live PM2 list matches the spec below: 5 application apps (auth-server, inventory-server, dashboard-server, acot-server, chat-server) plus acot-phone-server + lt-wordlist-api = 7 total. Includes the Phase 6.4 JWT_SECRET shadow-override fix and the Phase 6.10 lt-wordlist-api token move. The inventory-server/deploy/ecosystem.config.cjs.proposed staging file was removed after apply — recreate from the spec below if future changes are needed.

module.exports = {
  apps: [
    {
      name: 'auth-server',
      script: './inventory/auth/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', AUTH_PORT: 3011 },
      ...commonSettings,
    },
    {
      name: 'inventory-server',
      script: './inventory/src/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', PORT: 3010, UPLOADS_DIR: '/var/www/inventory/uploads' },
      ...commonSettings,
    },
    {
      name: 'dashboard-server',
      script: './inventory/dashboard/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', DASHBOARD_PORT: 3015 },
      ...commonSettings,
    },
    {
      name: 'acot-server',
      script: './inventory/dashboard/acot-server/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', ACOT_PORT: 3012 },
      ...commonSettings,
    },
    {
      name: 'chat-server',
      script: './inventory/chat/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', PORT: 3014 },
      ...commonSettings,
    },
    // acot-phone-server and lt-wordlist-api unchanged
  ],
};

Five entries instead of twelve. Each app loads its own .env from its directory (already handled by dotenv.config).


Sequencing & dependencies

Phase 1 (decommission) ──┬─────────────────────────────────────────┐
                         │                                         │
                         ▼                                         │
                   Phase 2 (shared lib/)                           │
                         │                                         │
          ┌──────────────┼──────────────┐                          │
          ▼              ▼              ▼                          ▼
   Phase 3a         Phase 3b        Phase 4              Phase 6 (auth hardening
   inventory-server auth-server     dashboard-server     runs alongside 3+4+5,
   to ESM           to ESM + /verify build & test        completes after them)
          │              │              │                          │
          └──────────────┼──────────────┘                          │
                         ▼                                         │
                   Phase 5 (acot-server to ESM) ──────────────────►│
                                                                   ▼
                                                          Phase 7 (Caddy cutover)
                                                                   │
                                                                   ▼
                                                          Phase 8 (PM2 final state)

Phase 1 unblocks everything (fewer services to convert). Phase 2 is the foundation; nothing else can start until shared lib/ exists. Phases 35 can run in parallel; they touch independent services. Phase 6's sub-items can be developed alongside 35 but enabled only after them (no point adding requirePermission to a route that doesn't yet have authenticate). Phase F1 must precede the Phase 3+6 pm2 reload — without the fetch wrapper, the moment the new code goes live the SPA breaks. Discovered during Phase 3+6 implementation; see Phase F1. Phase 7 is the cutover: Caddyfile flip happens after F1 ships AND after the /uploads/* gating decision in F1 is made. Phase 8 is cleanup: remove dead PM2 entries.

Estimated effort, end-to-end: ~3 weeks of focused work by one engineer. Phase 1 ≈ 1 day, Phase 2 ≈ 2 days, Phase 3 ≈ 3 days (both services), Phase 4 ≈ 57 days (the merge), Phase 5 ≈ 23 days, Phase 6 ≈ 34 days, Phase F1 ≈ 0.51 day, Phase 7+8 ≈ 1 day.


Testing strategy

No formal test suite exists today (per CLAUDE.md). For a refactor this size, that's a gap to close — but writing tests retroactively for 15K LOC of routes is a separate, larger project. For this refactor:

Manual smoke testing per phase

A checklist of representative endpoints to hit after each deploy:

  • inventory-server: /api/products, /api/dashboard/overview, /api/analytics/revenue, /api/orders, /api/purchase-orders, /api/import/list-uploads, /api/config/global
  • dashboard-server: /api/klaviyo/campaigns, /api/meta/insights, /api/google-analytics/..., /api/typeform/responses
  • acot-server: /api/acot/... (top-3 endpoints by call volume — pull from access logs)
  • auth-server: /login, /me, /verify

Each smoke test runs (a) without a token → expect 401, (b) with an invalid token → expect 401, (c) with a valid token → expect 2xx.

Frontend integration check

After deploys, log into the SPA and exercise each major page (Overview, Products, Analytics, Dashboard, Klaviyo, Meta, etc.). If everything loads and dashboards populate, the auth + routing layer is intact.

Test scaffold during Phase 2 (committed)

While building shared/, set up vitest (lightweight, ESM-native, fast) as the standard test runner for the repo. Initial coverage focuses on the security-critical surface only:

  • shared/auth/verify.js — known good token, expired token, wrong-signature token, malformed token, missing token.
  • shared/auth/middleware.js — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with req.user populated.
  • shared/auth/middleware.js user-cache TTL: same token within 60s → one DB hit; same token after 61s → two DB hits.

package.json gets a "test": "vitest run" script at the repo root and per-service. Set up but don't backfill broader test coverage — that's a separate, larger project. The vitest scaffold gives future work a foothold; this refactor commits to having tests for the auth boundary specifically because that's what's load-bearing for the whole security model.


Rollback strategy

Each phase produces an independently deployable state. Rollback per phase:

  • Phase 1: re-add removed services to ecosystem; restore from git. Don't roll back data deletions — only do those after a week of stable production.
  • Phases 3, 5: ESM conversion is per-service; if one service breaks, pm2 restart <name> to the previous commit. Other services unaffected.
  • Phase 4: the dashboard-server merge is the highest-risk change. Plan: deploy dashboard-server to a non-conflicting port (3015) while leaving the old per-vendor servers running. Cut over Caddy routes one vendor at a time (start with Meta — smallest). If any vendor breaks, point Caddy back to the old server (still running) for that vendor, debug, retry. Only delete the old servers after all four are stable on dashboard-server.
  • Phases 6, 7: Caddy config is git-tracked. git revert + caddy reload rolls back in seconds. Auth changes are additive (defense in depth) — if forward_auth causes problems, comment it out and per-server middleware continues protecting routes.

Out of scope (intentional)

These came up in the audit but aren't part of this refactor:

  • httpOnly cookie auth ("Phase F2" — deferred). Phase F1 keeps localStorage + Bearer header because that's the minimum to unblock the Phase 6 authenticate() rollout. A future move to cookie auth would touch AuthContext, every backend that reads tokens, and introduce CSRF concerns — much larger project.
  • Replacing PM2 with systemd or Docker.
  • Test coverage beyond the auth-critical surface.
  • apiv2/apiv2-test proxies to backend.acherryontop.com — separate system, not touched.
  • acot-phone-server and lt-wordlist-api — staying as-is.
  • Centralized observability stack (Prometheus, Grafana). The logger work in Phase 6.5 sets up the data, but shipping it somewhere is future work.
  • ChatRoom XSS remediation (flagged during Phase 6.8 audit — inventory/src/components/chat/ChatRoom.tsx:277,392 renders user-controlled chat content via dangerouslySetInnerHTML without sanitization). Real vulnerability for an internal-but-multi-user tool; separate fix.

Concrete deliverables

State as of 2026-05-24: all planned phases are shipped. Note: the "4 application PM2 processes" original target became 5 in execution because chat-server stayed standalone rather than being folded in — never a serious merge candidate (different DB, different protocol shape).

  • 5 application PM2 processes instead of 12 (auth-server, inventory-server, dashboard-server, acot-server, chat-server) — plus 2 unchanged (acot-phone-server, lt-wordlist-api) = 7 total.
  • All /api/*, /chat-api/*, and /uploads/* requests gated at Caddy (forward_auth).
  • Per-upstream authenticate() re-verification on inventory-server, dashboard-server, and acot-server. (chat-server still relies on the Caddy gate alone — see asterisk below.)
  • Sensitive endpoints additionally gated by per-permission checks (requirePermission).
  • ⚠️ One ESM standard — done for auth/inventory/dashboard/acot. chat-server is still CJS (the prior version of this document erroneously claimed it had been converted; verified 2026-05-24 — its server.js still uses require() and its package.json has no "type": "module"). Out of scope for this refactor; tracked as future work.
  • One shared lib/ at inventory-server/shared/ for auth, logging, DB, errors, CORS.
  • Login rate-limited (shared/rate-limit/login.js).
  • JWT_SECRET rotated + ecosystem shadow-override removed.
  • Old auth-server, Aircall, Gorgias, Clarity directories deleted from the repo. Defunct dashboard:gorgias/dashboard:calls permission rows also deleted from DB (2026-05-24).
  • Caddyfile slimmed to one auth-gated block.
  • Permission codes inserted into permissions table for granular authorization.
  • No half-finished pieces remain. Both gaps surfaced during Phase 5 verification — lt-wordlist-api insecure default token (Deviation #25) and Caddy blocking acot-phone-server's x-acot-api-key calls (Deviation #26) — were closed 2026-05-24.

Deviations from original plan (recorded during execution)

These are decisions made during Phase 1/2 implementation that amend the spec above. Future phases should follow the deviated path, not the original sketch.

  1. shared/ location. Original plan placed shared/ at the repo root as a sibling of inventory/ and acot-phone/. Implemented at inventory-server/shared/ (= /var/www/inventory/shared/ on the server) instead. Reason: the actual project root is /var/www/inventory/; placing shared/ outside it would have meant building a deployment story for it that doesn't exist. Import paths change accordingly:

    • From inventory-server/{auth,src,chat}/server.js../shared/...
    • From inventory-server/dashboard/{vendor}-server/server.js../../shared/...
  2. /verify response headers. Plan specified X-User-Id + X-User-Is-Admin. Implemented as X-User-Id + X-User-Username (both available from the JWT payload). X-User-Is-Admin was dropped because is_admin isn't in the JWT today and returning it would require a DB lookup — violating the "no DB hit" principle. To restore X-User-Is-Admin, enrich the JWT payload at login time (one-line change in auth/routes.js) during Phase 6, then echo from /verify. Upstreams don't trust these headers anyway (they re-verify), so the omission is informational, not security-relevant.

  3. User cache key in shared/auth/middleware.js. Plan sketch mentioned "60s TTL keyed by token jti". Implemented as keyed by userId instead — the JWT doesn't currently include a jti claim, and the cache's invalidation semantics are "this user was deactivated/changed permissions" (per-user), not "this token was revoked" (per-token). The plan's pseudocode already used loadUserCached(pool, decoded.userId) so this matches the spirit.

  4. Redis client safety. shared/db/redis.js sets enableOfflineQueue: false and lazyConnect: true. Plan didn't specify but these defaults mean a Redis hiccup fails fast (route fall-through to upstream API as designed in Phase 4 risk notes) rather than queueing commands indefinitely.

  5. CORS allowed origins kept https://acot.site. Plan example listed three origins; production has acot.site as a redirect to tools.acherryontop.com but also reaches the API directly in some flows. Kept it to avoid breakage. LAN wildcards (192.168.*, 10.*) and Access-Control-Allow-Origin "*" are NOT included in the new shared/cors/policy.js per the plan's Phase 6.6 spirit, but the legacy inventory-server/src/middleware/cors.js still has them until services are migrated to consume shared/cors/.

  6. Defunct permission codes cleaned up (2026-05-24). Removed the dashboard:gorgias and dashboard:calls Protected blocks from the frontend; the corresponding permission rows in the permissions table (and their user_permission grants) were deleted in a follow-up migration alongside the Phase 6.2 permissions seed. Verified post-migration: permissions table contains only the in-use dashboard:* codes (analytics, campaigns, feed, financial, meta_campaigns, operations, payroll, products, realtime, sales, stats, typeform, user_behavior).

  7. PM2 process names retained new-auth-server (not auth-server). Plan's Phase 8 final form names it auth-server (after the legacy 3003 one is removed). Decided to keep the existing new-auth-server name through Phase 2 to avoid a rename mid-stream. Phase 8 can rename if desired, but it's cosmetic — all wiring is by port (3011) not name.

  8. Caddyfile changes via admin API on :2020. The Caddyfile is owned by root and matt has no passwordless sudo. Cutover used curl -X POST .../load on the Caddy admin port (which matt can hit), then a separate sudo cp /home/matt/Caddyfile.new /etc/caddy/Caddyfile step to persist the on-disk file. Future Caddyfile changes can follow the same pattern. Backup convention: /etc/caddy/Caddyfile.bak.YYYY-MM-DD.

  9. Path-naming. Plan uses inventory/ as the top-level (server-side path convention). Locally the equivalent is inventory-server/. Whenever the plan says inventory/dashboard/foo/, read that as /var/www/inventory/dashboard/foo/ on the server or inventory-server/dashboard/foo/ locally.

  10. Scripts directory kept CJS via package.json shim. Original plan called for converting "any spawned script" to ESM alongside its caller. Implemented: added inventory-server/scripts/package.json with "type": "commonjs". Node's package-type resolution walks up directory by directory, so this overrides the parent's "type": "module" for the entire scripts/ tree (≈15 files including import/*.js, metrics-new/utils/*, the orchestrator scripts) without renaming any file or touching any spawn() callsite. Convert individual scripts to ESM when touched; don't bulk-migrate.

  11. src/routes/products.js had dead multer setup. Phase 6.7 spec called for hardening the upload route in products.js. There was no upload route — the multer({ dest }) instance and importProductsFromCSV import were dead code left over from a long-ago migration. Strongest 6.7 hardening was deletion: no upload handler = no attack surface. The two real upload paths (/api/import/upload-image and /api/reusable-images/upload) got tightened MIME+extension allowlists instead.

  12. Two pre-existing syntax errors in shared/db/ surfaced. shared/db/pg.js:13 and shared/db/redis.js:22 both had ?? Number(...) || N — mixing ?? and || without parentheses is a TC39 syntax error. They passed Phase 2 because nothing imported them yet; Phase 3 smoke-test exposed it. Fixed with parens.

  13. import { Pool } from 'pg' doesn't work in ESM. The pg package is CJS using module.exports = { Pool, ... }. Node's ESM-from-CJS interop fails to detect Pool as a named export via static analysis. The bulletproof pattern, now used everywhere: import pg from 'pg'; const { Pool } = pg;. Same idea for any future CJS-only deps. src/utils/db.js already had it; the two auth files needed the fix during execution.

  14. Frontend Bearer-header gap discovered (drives new Phase F1). Phase 6 was specified assuming the frontend already sends Authorization: Bearer on every API call. It does not — only 7 of ~220 call sites do. Phase 6's authenticate() middleware is shipped and ready to enable, but until F1 lands the SPA will 401 on every page. The plan now has Phase F1 to address this explicitly; until then, the Phase 3+6 pm2 reload should not ship unless F1 ships in the same window.

  15. macOS NFS workflow note. The inventory-server/ directory locally is an NFS mount of /var/www/inventory/ on netcup. Bulk operations (find/grep -r/mass node --check/npm install) hang or take minutes locally and pollute file listings with macOS AppleDouble ._* sidecar files. Default to ssh netcup for any sweep across the tree — individual file edits via the editor are fine.

  16. dashboard-server lives at inventory-server/dashboard/ (not its own top-level dir). Plan's Phase 4 diagram implied a sibling of inventory-server/ etc. The merged service lives at inventory-server/dashboard/server.js with package.json declaring "type": "module". Per-vendor subdirectories (klaviyo-server/, meta-server/, google-server/, typeform-server/) each have their own package.json so Node's "nearest parent package.json" walk stops there — they are unaffected by the new parent type. Added "type": "commonjs" defensively to meta/google/typeform package.json so a future deletion of their files (cutover cleanup) plus a stray *.js left under dashboard/ wouldn't accidentally try to ESM-parse it.

  17. Klaviyo RedisService kept as a wrapper, but accepts injected client. Plan said "replace each server's per-instance pool/redis with the injected one." The Klaviyo codebase has ~3K LOC of service code (events.service.js alone is 2.2K) that calls this.redisService._getCacheKey(), .get(), .set(), .getEventData(), .clearCache(), ._getTTL(). Rewriting all of that to call ioredis directly would risk breaking the cache-key/TTL invariants. Decision: keep services/klaviyo/redis.service.js as a thin facade with the same public surface, but its constructor now takes an ioredis client instead of constructing one. The 3 service classes (EventsService, CampaignsService, ReportingService) all take (apiKey, apiRevision, redis) and pass redis to new RedisService(redis). MetricsService doesn't use Redis — left unchanged.

  18. dashboard-server .env layering. Plan called for "Single .env at inventory/dashboard/.env, prefixed keys: KLAVIYO_, META_, ... JWT_SECRET ... # shared with auth-server." Implemented as two-file layering: server.js loads /var/www/inventory/.env FIRST (provides JWT_SECRET, DB_, REDIS_) then inventory-server/dashboard/.env SECOND for vendor-specific keys (KLAVIYO_API_KEY, META_, GA_, TYPEFORM_*). dotenv defaults to override:false, so the first file wins on collisions — security-critical vars live in one place, vendor keys in the other. .env.example template committed at dashboard/.env.example. Pre-cutover step: copy the vendor keys from the current per-vendor .env files into either of those two files before pm2 reload, else KLAVIYO_API_KEY etc. will not be set and routes will 500.

  19. Caddyfile typo fixed: /api/google-analytics/api/dashboard-analytics. The pre-Phase-4 Caddyfile.proposed listed a handle /api/google-analytics/* block. The live Caddyfile and the frontend (inventory/src/config/dashboard.ts) both use /api/dashboard-analytics/* (the live file has a uri replace /api/dashboard-analytics /api/analytics rewrite to land on google-server's /api/analytics mount). The merged dashboard-server now mounts the Google router at /api/dashboard-analytics directly — Caddy no longer needs the rewrite, just a straight reverse_proxy. Fixed in deploy/Caddyfile.proposed.

  20. metrics.routes.js had a latent router-scope bug. The Klaviyo metrics.routes.js declared const router = express.Router() at MODULE scope (outside the createMetricsRouter factory), so calling the factory twice would have re-mounted handlers on the same router (cumulative). Benign for a single-mount PM2 service, but fixed during the Phase 4 copy — the router now lives inside the factory. Also renamed the export from createMetricsRoutes (plural) to createMetricsRouter (matches the convention used by every other vendor's index.js).

  21. PM2 log paths use per-server logs/pm2/ (NOT /var/log/pm2/). Discovered during the first apply attempt: the previously-shipped ecosystem.config.cjs.proposed carried over /var/log/pm2/... from the live file, but matt has no write perms on /var/log (root:syslog 775) so the entries silently failed to launch (chat-server + acot-server came up because they had no explicit log path; new-auth-server, inventory-server, dashboard-server bailed). The actual convention — already in place via pre-created folders on disk — is per-service logs/pm2/ directly under each service's directory (./inventory/auth/logs/pm2/, ./inventory/chat/logs/pm2/, ./inventory/dashboard/acot-server/logs/pm2/, ./inventory/dashboard/logs/pm2/ for the merged dashboard-server, ./inventory/logs/pm2/ for inventory-server, /opt/lt-wordlist-api/logs/pm2/, /var/www/acot-phone/logs/pm2/). All folders are matt:matt. pm2-logrotate (already loaded in matt's daemon) rotates them in place.

  22. All PM2 apps run under matt's single daemon — no root daemon. The earlier OUT OF SCOPE comment block in the proposed ecosystem incorrectly claimed lt-wordlist-api and acot-phone-server were managed by a separate root PM2 daemon. They are not — matt's daemon manages everything. Removed the bogus block; both apps are now first-class entries in the proposed ecosystem with corrected script paths:

    • lt-wordlist-api script is /opt/lt-wordlist-api/index.js (was /opt/lt-wordlist-api/server.js in the live file — wrong; that file doesn't exist). /opt/lt-wordlist-api is matt:matt 0750.
    • acot-phone-server script is /var/www/acot-phone/dist/server.js (was ./inventory/acot-phone/server.js in the live file — wrong; that path doesn't exist). /var/www/acot-phone/ is matt:matt with its own .env and is a separate repo from inventory-server.
  23. Phase 6.10 ADD_WORD_TOKEN move stays in this ecosystem. Per Deviation #22, lt-wordlist-api is in matt's ecosystem, so the §6.10 work to remove inline ADD_WORD_TOKEN and load it from /opt/lt-wordlist-api/.env instead is implemented directly in deploy/ecosystem.config.cjs.proposed (no inline ADD_WORD_TOKEN; script reads its own .env). When applying, rotate the token value in /opt/lt-wordlist-api/.env and update any callers.

  24. Phase 6 patterns applied to acot-server during Phase 5. acot-server was originally planned to convert mechanically (require → import) and inherit Phase 6 hardening later. Done in a single pass instead: the new server.js mounts shared/logging/request-log.js, shared/cors/policy.js, shared/errors/handler.js, and shared/auth/middleware.js's authenticate() on /api/acot/* (except the customers router — see Phase 5 auth-model section above). Adds pg dependency to inventory-server/dashboard/acot-server/package.json because the Postgres pool for authenticate()'s user/permission lookups is initialized in-process. Env layering follows dashboard-server's pattern: /var/www/inventory/.env loaded first (JWT_SECRET, DB_), local .env loaded second (PROD_DB_, PROD_SSH_*, ACOT_PHONE_API_KEY). No acot_admin permission gates wired — none of the routes mutate state in ways that warrant per-permission checks today; the seeded code in migrations/005_phase6_permission_codes.sql stays reserved.

  25. Phase 6.10 fixed 2026-05-24 via option (b). Discovered during Phase 5 closeout that /opt/lt-wordlist-api/.env didn't exist, no ADD_WORD_TOKEN env var was set on the running process, and the script's process.env.ADD_WORD_TOKEN || 'tokenhere' fallback was the only gate — meaning curl -X POST -H "x-add-word-token: tokenhere" http://localhost:3030/add-word succeeded in production.

    Fix applied:

    • Generated a fresh 32-byte hex token via openssl rand -hex 32.
    • Wrote it to /opt/lt-wordlist-api/.env (matt:matt, mode 0600).
    • Edited /var/www/ecosystem.config.cjs's lt-wordlist-api entry to add node_args: ['--env-file=/opt/lt-wordlist-api/.env']. Node ≥20.6 (netcup runs v22.22.2) reads the file at startup with no script changes and no dotenv import — the cleanest of the three options because the token never lives in the committed ecosystem file.
    • pm2 reload /var/www/ecosystem.config.cjs --only lt-wordlist-api --update-env picked up the new wrapper config. PM2 restart count 1 → 2, clean startup.

    Verified: old default tokenhere now returns {"error":"unauthorized"} HTTP 401; new env-file token returns {"ok":true,...} HTTP 200 on /add-word and /delete-word. To rotate again: edit /opt/lt-wordlist-api/.env + pm2 restart lt-wordlist-api --update-env.

    Caller coordination: user confirmed all callers are external and will be updated as issues surface; no inventory of callers to pre-notify.

  26. Caddy forward_auth gate breaks acot-phone-server's customer lookups — fixed 2026-05-24 via option (a). Phase 6.1 put /api/acot/* behind forward_auth localhost:3011/verify, which strictly requires a JWT Bearer token. But acot-phone-server (at /var/www/acot-phone/) calls /api/acot/customers/by-phone, /api/acot/customers/search, /api/acot/customers/:cid/orders using only an x-acot-api-key header (/var/www/acot-phone/src/services/acotApi.ts:51). Result: every customer lookup from the phone app hit a 401 at the Caddy gate before reaching requirePhoneApiKey. Last successful customer call in acot-server's access log was 2026-05-21 — three days before the Caddy cutover.

    Fix applied — option (a): changed ACOT_API_URL in /var/www/acot-phone/.env (and acot-phone-server/.env.example and the local repo copy) from https://tools.acherryontop.com/api/acot to http://localhost:3012/api/acot. Both processes live on netcup, so the request never enters Caddy and lands directly on requirePhoneApiKey in-process. Restarted via pm2 restart acot-phone-server --update-env; smoke-tested with curl -H "x-acot-api-key: ..." http://localhost:3012/api/acot/customers/by-phone?phone=... → 200 with the real customer record.

    (Alternative option (b), kept here for posterity: add a @phone_auth header x-acot-api-key * guard in the Caddyfile to bypass forward_auth for requests bearing the shared secret. Would have worked too but introduces a header-based bypass in the gate, which is a worse security posture than just not entering Caddy at all.)