# Server Consolidation & Security Hardening Plan
Audit-driven plan to (a) reduce 12 PM2 processes to 3 application servers + 1 auth server, (b) put every API endpoint behind real authentication, and (c) standardize on ESM across all Node services. Approach is "do it properly the first time" — no half-finished pieces, no deferred cleanup.
---
## Status (2026-05-23)
| Phase | Status | Notes |
|---|---|---|
| 1 — Decommission dead services | **Complete** | aircall/gorgias/clarity/legacy-auth-server deleted from repo + PM2 + Caddyfile + ecosystem.cjs |
| 2 — Build shared `lib/` | **Complete** | Lives at `inventory-server/shared/` (see Deviations). `/verify` endpoint live on auth-server |
| 3 — Convert auth-server + inventory-server to ESM | Not started | Next up. Auth-server is still CJS but has the new `/verify` route added in CJS form |
| 4 — Build `dashboard-server` (the merge) | Not started | klaviyo/meta/google/typeform still run as 4 separate PM2 apps |
| 5 — Convert `acot-server` to ESM | Not started | |
| 6 — Auth hardening | Not started | Shared modules exist (`shared/rate-limit`, `shared/cors`, `shared/logging`) but no service consumes them yet. JWT_SECRET footgun discovered — see 6.4 |
| 7 — Caddyfile final form | Partial | Dead routes removed; `forward_auth` gate + `/uploads/*` gating + per-vendor cleanup deferred to after Phase 4 |
| 8 — ecosystem.config.cjs final form | Partial | Dead apps removed; final shape depends on Phase 4 merge |
**Live PM2 count: 10** (down from 13). Target after Phase 4: 5 application apps + acot-phone-server + lt-wordlist-api.
---
## Goals
- Every public-facing endpoint requires a valid auth token (Caddy gate + per-server middleware + per-route permission checks for sensitive operations).
- Reduce service count from 12 PM2 processes to 4: `inventory-server`, `acot-server`, `dashboard-server`, `auth-server`.
- Standardize on ESM (`"type": "module"`) across all Node services.
- Decommission `aircall-server`, `gorgias-server`, `clarity-server`, and the legacy `auth-server` (port 3003).
- Eliminate dependency duplication: one Redis client, one Postgres pool helper, one logger, one auth middleware — shared across services.
## Non-goals
- Rewriting business logic. Route handlers move as-is unless they break under ESM or shared middleware.
- Switching auth providers (we keep JWT + bcrypt + Postgres).
- Replacing PM2 or Caddy.
- Migrating Klaviyo/Meta/Google/Typeform's external API contracts.
---
## Target architecture
```
                               ┌──────────────────────────┐
                               │  tools.acherryontop.com  │
                               │         (Caddy)          │
                               │  forward_auth gate  ─────┼──► auth-server:3011
                               └────────────┬─────────────┘    /verify endpoint
           ┌────────────────────────────────┼────────────────────────────────┐
           ▼                                ▼                                ▼
┌─────────────────────┐          ┌──────────────────────┐         ┌─────────────────────┐
│  inventory-server   │          │  dashboard-server    │         │  acot-server        │
│  :3010 (ESM)        │          │  :3015 (ESM)         │         │  :3012 (ESM)        │
│                     │          │                      │         │                     │
│  /api/products      │          │  /api/klaviyo/*      │         │  /api/acot/*        │
│  /api/orders        │          │  /api/meta/*         │         │  (MySQL via SSH)    │
│  /api/analytics     │          │  /api/google-*/*     │         │                     │
│  /api/dashboard     │          │  /api/typeform/*     │         │                     │
│  ... (~25 routers)  │          │                      │         │                     │
└─────────────────────┘          └──────────────────────┘         └─────────────────────┘
           │                                │                                │
           ├── Postgres (inventory_db)      ├── Postgres (klaviyo)           └── MySQL (workpi, via ssh2 tunnel)
           ├── shared lib/ ◄────────────────┤
           │     - auth middleware          ├── Redis (shared client)
           │     - permission helper        └── shared lib/ ◄────────────────┐
           │     - logger                                                    │
           │     - pg pool factory                                           │
           │     - error formatter                                           │
           └─────────────────────────────────────────────────────────────────┘
                         ┌──────────────────┴───┐
                         │  auth-server         │
                         │  :3011 (ESM)         │
                         │  /login, /me,        │
                         │  /verify, user mgmt  │
                         └──────────────────────┘
```
PM2 process count: **12 → 4** (plus `acot-phone-server` and `lt-wordlist-api`, which stay as-is — out of scope).
---
## Phase 1 — Decommission dead/leaving services
Status: **Complete (2026-05-23)**. All four services removed from repo, PM2, Caddyfile, and ecosystem.config.cjs. Frontend widgets (`AircallDashboard.jsx`, `GorgiasOverview.jsx`) and their dashboard.ts/Navigation.jsx/vite.config.ts wiring also removed. Verification: smoke-tested `https://tools.acherryontop.com/api/{aircall,gorgias,clarity}/*` → 404. Backups left at `/home/matt/{ecosystem.config.cjs,Caddyfile}.bak.2026-05-23`.
### To remove
| Service | Reason | Steps |
|---|---|---|
| `aircall-server` (3002) | Migrating off Aircall | `pm2 delete aircall-server`; remove from `ecosystem.config.cjs`; remove `/api/aircall/*` from Caddyfile; drop `inventory/dashboard/aircall-server/` directory; remove MongoDB connection from any frontend code; cancel Mongo if it was only feeding Aircall |
| `gorgias-server` (3006) | Migrating off Gorgias | same pattern; check frontend for `/api/gorgias/*` callers and delete the dashboards/widgets that use them |
| `clarity-server` (3009) | Already dead (no `.js` files, not in ecosystem) | remove `/api/clarity/*` from Caddyfile; delete `inventory/dashboard/clarity-server/` directory |
| `auth-server` (3003, legacy) | Replaced by `new-auth-server` on 3011 | grep entire codebase for `dashboard-auth` and `localhost:3003`; redirect or remove callers; `pm2 delete auth-server`; remove from ecosystem; remove `/dashboard-auth/*` from Caddyfile; delete `inventory/dashboard/auth-server/` directory |
### Verification before deletion
```bash
# from inventory/ root — find any references before removing
grep -rn "aircall\|/api/aircall" inventory/src/ inventory-server/src/
grep -rn "gorgias\|/api/gorgias" inventory/src/ inventory-server/src/
grep -rn "/dashboard-auth\|localhost:3003" inventory/src/ inventory-server/src/
grep -rn "/api/clarity" inventory/src/ inventory-server/src/
```
Any remaining callers must be deleted or repointed before the server is removed. Do **not** leave a 502 response in production.
### Database/secret cleanup
- Drop the MongoDB instance feeding Aircall (after confirming no other consumers).
- Rotate any Gorgias/Aircall API keys still in `.env` files (defense in depth — they'll be useless soon anyway, but commit hygiene matters).
- Remove `MONGODB_URI`, `AIRCALL_*`, `GORGIAS_*` from any `.env` files.
---
## Phase 2 — Build the shared `lib/`
Status: **Complete (2026-05-23)**. All 11 modules written under `inventory-server/shared/` (NOT repo root; see Deviations). `/verify` endpoint added to auth-server in CJS form (it will switch to importing `shared/auth/verify.js` during the Phase 3 ESM conversion). Smoke-tested with no-token / bad-token / expired-token / valid-token cases. No service consumes `shared/` yet; that happens in Phases 3-5.
### Location
A single shared directory at the repo root: `shared/` (sibling of `inventory/` and `acot-phone/`). Each service imports from it via a relative path. We do **not** introduce npm workspaces yet — relative imports are fine for three consumers and avoid the npm-link / hoisting headaches.
### Modules to create
```
shared/
├── package.json # "type": "module"
├── auth/
│ ├── middleware.js # authenticate(), requirePermission(), requireAdmin()
│ └── verify.js # verifyToken() — pure function, no Express dependency
├── db/
│ ├── pg.js # createPool(envPrefix) — returns configured Pool
│ └── redis.js # createRedis() — single client, lazy-connect
├── logging/
│ ├── logger.js # pino-based, redacts Authorization/Cookie
│ └── request-log.js # Express middleware, structured access log
├── errors/
│ └── handler.js # consistent error envelope, no leak in prod
├── cors/
│ └── policy.js # single allowed-origins list, exported as cors() options
└── rate-limit/
└── login.js # express-rate-limit config for /login
```
### Auth middleware spec (`shared/auth/middleware.js`)
```js
// Sketch: the final implementation follows the existing authenticate() in
// inventory/auth/routes.js, factored out for reuse.
import jwt from 'jsonwebtoken';

// loadUserCached(pool, userId) is assumed to be defined alongside this code:
// it resolves to the user row (with permissions) or null, using the 60s cache.

export function authenticate({ pool }) {
  return async (req, res, next) => {
    const header = req.headers.authorization;
    if (!header?.startsWith('Bearer ')) {
      return res.status(401).json({ error: 'Authentication required' });
    }
    try {
      // Pin the algorithm so a token cannot choose its own.
      const decoded = jwt.verify(header.slice(7), process.env.JWT_SECRET, {
        algorithms: ['HS256'],
      });
      // Short-circuit the DB hit with an in-memory cache, 60s TTL keyed by user id
      const user = await loadUserCached(pool, decoded.userId);
      if (!user) return res.status(401).json({ error: 'Invalid token' });
      if (!user.is_active) return res.status(403).json({ error: 'Account inactive' });
      req.user = user;
      next();
    } catch {
      res.status(401).json({ error: 'Invalid token' });
    }
  };
}

// Admins pass every permission check; everyone else needs the specific code.
export function requirePermission(code) {
  return (req, res, next) => {
    if (req.user?.is_admin) return next();
    if (req.user?.permissions?.includes(code)) return next();
    res.status(403).json({ error: 'Insufficient permissions' });
  };
}

export const requireAdmin = (req, res, next) =>
  req.user?.is_admin ? next() : res.status(403).json({ error: 'Admin only' });
```
### Why a 60s in-memory user cache
`forward_auth` in Caddy will call `auth-server` on every request. Each per-server `authenticate()` middleware also has a DB lookup to load permissions. Without caching, every API request becomes 1 SQL query for the user row + 1 for permissions. 60s TTL is short enough that deactivating a user takes effect within a minute, long enough that Klaviyo dashboards (which fire dozens of requests on load) don't hammer Postgres.
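A minimal sketch of the TTL cache described above, assuming a closure-based shape rather than the `loadUserCached(pool, userId)` signature used in the middleware spec. `createUserCache`, `loadUser`, and the injectable `now` clock are hypothetical names introduced for illustration; `loadUser` stands in for the real Postgres query.

```javascript
// Hypothetical TTL cache around a user lookup. `loadUser` stands in for the
// real Postgres query; `now` is injectable so the TTL can be tested.
function createUserCache(loadUser, { ttlMs = 60_000, now = Date.now } = {}) {
  const entries = new Map(); // userId -> { user, expiresAt }

  return async function loadUserCached(userId) {
    const hit = entries.get(userId);
    if (hit && hit.expiresAt > now()) return hit.user;

    const user = await loadUser(userId);
    // Only cache real rows so a missing user is re-checked on the next request.
    if (user) entries.set(userId, { user, expiresAt: now() + ttlMs });
    return user;
  };
}
```

The map is never pruned in this sketch, which is tolerable for a small internal user base but would need eviction if the user count grew.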
### Add to `auth-server`: a `/verify` endpoint
Caddy's `forward_auth` only needs "is this token valid? give me a user-id." Today's `/me` does that but with a full permissions join. Add a lightweight `/verify` that:
- Verifies JWT signature only (no DB hit).
- Returns `200` with `X-User-Id` and `X-User-Is-Admin` response headers (which Caddy `copy_headers` will pass to upstream).
- Returns `401` on bad token.
**Decision: each service re-verifies the JWT independently.** Caddy's `forward_auth` is a fast first-pass reject for obviously bad tokens, but the security boundary is the per-server `authenticate()` middleware. Cost is negligible (one HMAC-SHA256 per request); the upside is that a misconfigured Caddyfile can never let an unauthenticated request reach a backend. Upstream services do **not** trust any `X-User-*` headers from Caddy — they parse the `Authorization` header themselves.
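The `/verify` contract above can be sketched as a framework-agnostic function. This is an illustration under assumptions, not the code that shipped: `verifyRequest` is a hypothetical name, `verifyToken` is injected and assumed to throw on a bad or expired token (as `jwt.verify` does), and the `is_admin` claim name is a guess at the token payload.

```javascript
// Core of a /verify handler: signature check only, no DB access.
// `verifyToken` is injected; it must throw on an invalid or expired token.
function verifyRequest(authorizationHeader, verifyToken) {
  if (typeof authorizationHeader !== 'string' || !authorizationHeader.startsWith('Bearer ')) {
    return { status: 401, headers: {} };
  }
  try {
    const claims = verifyToken(authorizationHeader.slice(7));
    return {
      status: 200,
      headers: {
        'X-User-Id': String(claims.userId),
        // `is_admin` as a claim name is an assumption about the token payload.
        'X-User-Is-Admin': claims.is_admin ? 'true' : 'false',
      },
    };
  } catch {
    return { status: 401, headers: {} };
  }
}
```

An Express route would pass `req.headers.authorization` in, then copy the returned status and headers onto `res`. Keeping the core free of Express makes the four token cases easy to unit-test.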
---
## Phase 3 — Convert `auth-server` and `inventory-server` to ESM
Status: **Not started.** Lift the easy ones first. These two stay standalone (don't merge into anything), so they're isolated changes. The auth-server's new `/verify` route (added in Phase 2) is currently CJS — refactor it during this phase to import from `../shared/auth/verify.js`.
### Mechanical conversion
Per service:
1. Add `"type": "module"` to `package.json`.
2. Convert `require()` → `import`, and `module.exports` → `export` / `export default`.
3. Fix `__dirname`/`__filename` (use `import.meta.url` + `fileURLToPath`).
4. Convert any dynamic require (e.g., conditional plugin loading) to `await import()`.
5. Update any sub-imports that don't include the file extension — ESM requires `./foo.js`, not `./foo`.
6. Update `ecosystem.config.cjs` if any service entry depended on CJS semantics. The ecosystem file itself can stay `.cjs`: PM2 reads it only as config, regardless of which module system the apps it spawns use.
7. Update nodemon config / scripts.
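Steps 3 and 4 are the ones that tend to need hand edits. The sketch below shows one way to handle them; `moduleDir` and `loadOptional` are hypothetical helper names, not functions that exist in the codebase.

```javascript
import path from 'node:path';
import { fileURLToPath } from 'node:url';

// CJS provided __filename/__dirname as module-scope variables; in ESM they are
// derived from the module's own URL.
function moduleDir(metaUrl) {
  return path.dirname(fileURLToPath(metaUrl));
}

// In a real module: const __dirname = moduleDir(import.meta.url);

// A conditional require() becomes await import(). When the target is still a
// CJS module, its module.exports arrives as the namespace's `default` export.
async function loadOptional(specifier) {
  try {
    const ns = await import(specifier);
    return ns.default ?? ns;
  } catch (err) {
    if (err.code === 'ERR_MODULE_NOT_FOUND') return null; // optional module absent
    throw err;
  }
}
```

Relative specifiers passed to `import()` need the file extension as well, same as static imports (step 5).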
### Risk areas in inventory-server
- `routes/ai.js` does a lazy init (`aiRouter.initInBackground()` called from `server.js`) — confirm the export shape still works as a default export of an Express router with a sidecar function. May need to split into `export default router; export function initInBackground() {}`.
- Multer setup in `routes/import.js` — straightforward, no ESM-specific concerns.
- SSE setup in `server.js` — moves over cleanly, no module-system entanglement.
- The `child_process.spawn` calls for metrics calculation: ESM doesn't change `child_process` behavior, but if any spawned script uses `require()` of a sibling, that sibling must also be ESM (or stay CJS with a `.cjs` extension).
### Test strategy
- After conversion, `pm2 start ecosystem.config.cjs --only inventory-server` on the server, watch logs for require/import errors at startup.
- Hit `/health`, then the most exercised endpoints (`/api/products`, `/api/dashboard/overview`, `/api/analytics/...`). If startup is clean and three smoke endpoints work, ESM conversion is done. Functional correctness is preserved because no logic changed.
### Auth-server
Already small (~200 LOC in `server.js`, plus a few hundred across `routes.js` and `permissions.js`). Roughly a 1-day conversion. Refactor the `/verify` endpoint (added in CJS form during Phase 2) to import from `shared/` as part of this work.
---
## Phase 4 — Build `dashboard-server` (the merge)
Status: **Not started.** The big merge. Klaviyo + Meta + Google + Typeform → one ESM service. Highest-risk phase — see Rollback strategy for the per-vendor cutover plan.
### Layout
```
inventory/dashboard/
├── server.js # entry: load env, init Postgres+Redis, mount routes, listen
├── package.json # "type": "module", deps from all 4 source servers (deduped)
├── .env # KLAVIYO_*, META_*, GOOGLE_*, TYPEFORM_*, shared DB_*, REDIS_URL
├── routes/
│ ├── klaviyo/ # absorbed from dashboard/klaviyo-server/src/
│ ├── meta/ # absorbed from dashboard/meta-server/
│ ├── google/ # absorbed from dashboard/google-server/
│ └── typeform/ # absorbed from dashboard/typeform-server/
├── services/ # per-vendor API clients (Klaviyo SDK calls, etc.)
├── scripts/
│ └── import-campaign-products.js # one-shot, moved from klaviyo-server/scripts/
└── logs/
```
### Mount points
```js
// server.js (sketch)
import express from 'express';
import cors from 'cors';
import { authenticate, requirePermission } from '../../shared/auth/middleware.js';
import { createPool } from '../../shared/db/pg.js';
import { createRedis } from '../../shared/db/redis.js';
import { logger, requestLog } from '../../shared/logging/index.js';
import corsPolicy from '../../shared/cors/policy.js';
import errorHandler from '../../shared/errors/handler.js';
import klaviyoRouter from './routes/klaviyo/index.js';
import metaRouter from './routes/meta/index.js';
import googleRouter from './routes/google/index.js';
import typeformRouter from './routes/typeform/index.js';

const app = express();
const pool = await createPool('KLAVIYO_DB'); // klaviyo has its own DB; others can share or have none
const redis = await createRedis();

app.use(requestLog);
app.use(cors(corsPolicy));
app.use(express.json({ limit: '10mb' }));

// Everything below this line requires a valid token.
app.use('/api', authenticate({ pool }));
app.use('/api/klaviyo', klaviyoRouter({ pool, redis }));
app.use('/api/meta', metaRouter({ redis }));
app.use('/api/google-analytics', googleRouter({ redis })); // matches Caddy /api/dashboard-analytics rewrite
app.use('/api/typeform', typeformRouter({ redis }));

// /health sits outside /api, so it stays reachable without a token.
app.get('/health', (req, res) => res.json({ ok: true }));
app.use(errorHandler);
app.listen(process.env.DASHBOARD_PORT || 3015);
```
### Per-vendor routers
Each vendor's existing route file becomes a factory that takes the shared `pool`/`redis` and returns an Express router. Replace each server's per-instance pool/redis with the injected one.
### Permission gates (sensitive routes only)
Authenticated-only is the default after `app.use('/api', authenticate(...))`. For sensitive operations, add `requirePermission` per route:
- Anything that mutates Klaviyo lists/segments → `requirePermission('klaviyo_write')`
- Triggering a campaign sync → `requirePermission('klaviyo_admin')`
- Read-only dashboards → no extra check beyond authenticate.
Define the new permission codes in the `permissions` table via a migration in Phase 6.
### Dependency dedup
**Decision: standardize on `ioredis`.** Klaviyo's larger codebase already uses it, and `ioredis` has better cluster/sentinel support if we ever need it. Update `meta`/`google`/`typeform` call sites — each is a handful of `get`/`set` calls, mechanical conversion. Remove the `redis` package from `dashboard-server`'s `package.json`.
### Env consolidation
Single `.env` at `inventory/dashboard/.env`, prefixed keys:
```
DASHBOARD_PORT=3015
KLAVIYO_API_KEY=...
KLAVIYO_DB_HOST=...
KLAVIYO_DB_NAME=...
META_ACCESS_TOKEN=...
GOOGLE_SERVICE_ACCOUNT_KEY=...
TYPEFORM_TOKEN=...
REDIS_URL=...
JWT_SECRET=... # shared with auth-server; same secret means same tokens valid here
```
### Klaviyo's `scripts/import-campaign-products.js`
One-shot script — keep it, but run it from the merged dashboard-server's directory. Update the script's imports to ESM. If it's run via cron, update the cron entry to the new path.
### Risk: shared error states
When all four vendors share a Redis client, a Redis hiccup affects all four. Make sure the connection has retry config (`ioredis` defaults are reasonable but verify) and that vendor routes degrade gracefully when Redis is unavailable (most use it as a cache, so cache-miss → fall through to upstream API is the right behavior).
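The "cache miss → fall through" behavior could look roughly like this. `cachedFetch` and `fetchUpstream` are hypothetical names; `redis` is assumed to be any client exposing `get` and an ioredis-style `set(key, value, 'EX', seconds)`.

```javascript
// Read-through cache that treats any Redis failure as a cache miss.
// `fetchUpstream` is a hypothetical function that calls the vendor API.
async function cachedFetch(redis, key, ttlSeconds, fetchUpstream) {
  try {
    const cached = await redis.get(key);
    if (cached !== null) return JSON.parse(cached);
  } catch (err) {
    // Redis down (or a corrupt entry): fall through instead of failing the request.
    console.warn(`cache read failed for ${key}: ${err.message}`);
  }

  const fresh = await fetchUpstream();

  try {
    await redis.set(key, JSON.stringify(fresh), 'EX', ttlSeconds);
  } catch (err) {
    // A failed write only costs the next request a cache miss.
    console.warn(`cache write failed for ${key}: ${err.message}`);
  }
  return fresh;
}
```

With this shape, a Redis outage makes the four vendor dashboards slower and chattier against upstream APIs, but not broken. Upstream rate limits then become the thing to watch.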
---
## Phase 5 — Convert `acot-server` to ESM (stays standalone)
Status: **Not started.** Largest single conversion (~5K LOC), but no merge involved.
### Special concern: ssh2 tunnel
`acot-server` opens an SSH tunnel via `ssh2` to access the production MySQL at `192.168.1.5:3309`. The tunnel must be:
- Established before the HTTP listener starts (so no requests fail with "no DB connection").
- Re-established on disconnect (`ssh2` connection's `close` event → recreate).
- Cleanly torn down on `SIGTERM`/`SIGINT` so PM2 restarts don't leak file descriptors.
Verify (or add) this lifecycle handling as part of the conversion. If it's already correct, conversion is mechanical; if not, this is a good moment to fix it.
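One possible shape for that lifecycle, written against injected functions so it does not depend on how the current tunnel code is organized. `runWithTunnel`, `openTunnel`, and `startHttp` are hypothetical names. `openTunnel` is assumed to resolve with an `ssh2` `Client`-like object (supporting `.on('close', fn)` and `.end()`) once the tunnel is ready.

```javascript
async function runWithTunnel(openTunnel, startHttp, { retryMs = 2000 } = {}) {
  let tunnel = null;
  let stopping = false;

  async function connect() {
    const t = await openTunnel();
    if (stopping) return t.end(); // shutdown raced with a reconnect
    t.on('close', () => {
      if (!stopping) reconnect(); // unexpected drop
    });
    tunnel = t;
  }

  async function reconnect() {
    while (!stopping) {
      try {
        await connect();
        return;
      } catch (err) {
        console.error(`tunnel reconnect failed: ${err.message}`);
        await new Promise((resolve) => setTimeout(resolve, retryMs));
      }
    }
  }

  await connect();                  // 1. tunnel up before any request can arrive
  const server = await startHttp(); // 2. only then open the HTTP listener

  function shutdown() {
    stopping = true;
    server.close();
    if (tunnel) tunnel.end();
  }
  process.once('SIGTERM', shutdown);
  process.once('SIGINT', shutdown);
  return { shutdown };
}
```

Requests that arrive while a reconnect is in flight still fail in this sketch. Whether to queue them or return a 503 is a separate decision.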
### Test strategy
Same as inventory-server: start with PM2, smoke-test the most-used `/api/acot/*` endpoints, watch logs for unhandled rejection or tunnel-close events.
---
## Phase 6 — Auth hardening
Status: **Not started.** This is the security work that justifies the whole refactor. Runs in parallel with Phases 3-5 where possible. Shared building blocks already exist (`shared/rate-limit/login.js`, `shared/cors/policy.js`, `shared/logging/request-log.js`, `shared/errors/handler.js`); Phase 6 is about *applying* them per-service.
### 6.1 Caddy `forward_auth` gate
Add to the `tools.acherryontop.com` block, before the `@api_routes` handler:
```caddyfile
# Forward-auth gate for all API traffic
@needs_auth path /api/* /chat-api/*
handle @needs_auth {
    forward_auth localhost:3011 {
        uri /verify
        copy_headers Authorization
        # On 401/403, Caddy returns the auth-server's response body verbatim
    }
    # Existing per-vendor handle blocks remain below this line
}
# /auth-inv/* stays public (you need to log in!)
handle /auth-inv/* {
    uri strip_prefix /auth-inv
    reverse_proxy localhost:3011
}
```
The `forward_auth` directive subrequests `/verify` on the auth-server. If it returns 2xx, the request proceeds upstream. If 401/403, Caddy returns that response to the client and never hits the backend.
This is the **first** line of defense. Per-server middleware (`shared/auth/middleware.js`) is the **second** line — re-verifies the JWT independently. Defense in depth: a Caddyfile typo can't open a hole.
### 6.2 Per-route permission gates
After per-server `authenticate()`, add `requirePermission(code)` to destructive or sensitive routes. Audit needed in:
- `inventory-server/src/routes/config.js` — global config writes → `admin`
- `inventory-server/src/routes/import.js` — uploads, deletes, generate-upc → `product_import`
- `inventory-server/src/routes/data-management.js` — CSV operations → `data_management`
- `inventory-server/src/routes/ai-prompts.js` — prompt edits → `ai_admin`
- `inventory-server/src/routes/templates.js` — template writes → `templates_write`
- `inventory-server/src/routes/reusable-images.js` — image management → `image_admin`
- `inventory-server/src/routes/products.js` — only one POST (`/resolve-identifiers`); evaluate whether it needs a permission code or authenticated-only is fine
- `inventory-server/src/routes/product-editor-audit-log.js` and `import-audit-log.js` — read-only by sensitive users → `audit_read`
- `dashboard-server` Klaviyo/Meta/Google/Typeform write endpoints → vendor-specific codes per above
Migration: a single SQL script that inserts the new permission codes into the `permissions` table and assigns them to existing admin users. Non-admin users get permissions explicitly granted via the user management UI.
```sql
INSERT INTO permissions (code, name) VALUES
('product_import', 'Product Import'),
('data_management', 'Data Management'),
('ai_admin', 'AI Settings Admin'),
('templates_write', 'Template Editing'),
('image_admin', 'Image Management'),
('audit_read', 'Audit Log Access'),
('klaviyo_write', 'Klaviyo Write'),
('klaviyo_admin', 'Klaviyo Admin'),
('meta_write', 'Meta Write'),
('google_write', 'Google Analytics Write'),
('typeform_write', 'Typeform Write'),
('acot_admin', 'ACOT Server Admin')
ON CONFLICT (code) DO NOTHING;
```
### 6.3 Rate limiting on login
`shared/rate-limit/login.js`:
```js
import rateLimit from 'express-rate-limit';

export const loginLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 10,                  // 10 attempts per IP per window
  message: { error: 'Too many login attempts, try again later' },
  standardHeaders: true,
  legacyHeaders: false,
});
```
Apply in `auth-server` on the `/login` route. Consider also rate-limiting `/verify` and `/me` (much higher cap, ~600/min — they're called legitimately by every page load).
### 6.4 JWT secret rotation
- Rotate `JWT_SECRET` to a fresh 32-byte random string as part of the deployment.
- Document that rotation logs out all users — acceptable for an internal tool, do it during off-hours.
- Add `JWT_SECRET` to the env var validation block in `auth-server/server.js` (refuse to start if not set).
- **Fix the existing footgun**: `/var/www/ecosystem.config.cjs` currently has `JWT_SECRET: process.env.JWT_SECRET` *after* `...inventoryEnv` in the new-auth-server block. This shadows the `.env` value with whatever the shell exported when PM2 was started — which has already silently diverged at least once (detected and fixed 2026-05-23 by a clean PM2 restart in a shell without JWT_SECRET exported). Delete that override line during rotation; let `.env` be the single source of truth.
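The shadowing comes down to object-spread ordering, which the sketch below reproduces with stand-in values. `shellEnv` plays the role of `process.env` at the moment `pm2 start` ran, and `requireEnv` is a hypothetical helper for the "refuse to start if not set" check.

```javascript
// Values parsed from the service's .env file (stand-in for `inventoryEnv`).
const inventoryEnv = { JWT_SECRET: 'value-from-dotenv', DB_HOST: 'localhost' };

// Shell environment when PM2 was started; the value here may be stale or absent.
const shellEnv = { JWT_SECRET: 'stale-value-from-old-shell' };

// Current shape: the explicit key comes after the spread, so it wins.
const shadowed = { ...inventoryEnv, JWT_SECRET: shellEnv.JWT_SECRET };

// Fixed shape: no override, so .env is the only source.
const fixed = { ...inventoryEnv };

// Startup guard: refuse to run without the required variables.
function requireEnv(env, names) {
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
}
```

In `auth-server/server.js` the guard would run as `requireEnv(process.env, ['JWT_SECRET'])` before the listener starts.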
### 6.5 Request logging
`shared/logging/request-log.js` — log method, path, status, duration, user-id (if authenticated). **Never** log `Authorization` or `Cookie` headers. Remove the current `server.js:79-87` debug middleware in inventory-server (it logs full headers including the bearer token).
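One way to guarantee that headers never reach the log is to build each entry from an explicit allowlist of fields rather than redacting after the fact. The sketch below assumes a factory named `createRequestLog` with an injected `log` function, and assumes the user id lives at `req.user.id`. Both are illustrative and may not match the actual module.

```javascript
// Express-style middleware that logs one structured line per response.
// Only the fields named below are ever written; headers are never touched.
function createRequestLog(log = console.log) {
  return (req, res, next) => {
    const startedAt = process.hrtime.bigint();
    res.on('finish', () => {
      log({
        method: req.method,
        path: req.path, // excludes the query string
        status: res.statusCode,
        durationMs: Number(process.hrtime.bigint() - startedAt) / 1e6,
        userId: req.user?.id ?? null, // set by authenticate(), if it ran
      });
    });
    next();
  };
}
```

Logging `req.path` instead of the full URL also keeps anything passed in a query string out of the access log.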
### 6.6 CORS lockdown
Current `middleware/cors.js` allows `192.168.*.*` and `10.*.*.*` with `credentials: true`. Tighten to explicit known origins:
```js
origin: [
'https://tools.acherryontop.com',
'https://inventory.kent.pw',
/^http:\/\/localhost:(5174|5175)$/,
]
```
If anyone genuinely needs LAN access, add their specific origin, not a whole `192.168.0.0/16` or `10.0.0.0/8` range.
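To make the effect of the tightened list concrete, here is a small matcher over the same entries. `isAllowedOrigin` is a hypothetical helper for illustration; the `cors` package accepts the array directly as its `origin` option.

```javascript
const allowedOrigins = [
  'https://tools.acherryontop.com',
  'https://inventory.kent.pw',
  /^http:\/\/localhost:(5174|5175)$/,
];

// Strings must match exactly; regular expressions are tested against the origin.
function isAllowedOrigin(origin) {
  return allowedOrigins.some((rule) =>
    (rule instanceof RegExp ? rule.test(origin) : rule === origin));
}
```

Exact string comparison matters here: a suffix or substring check would accept look-alike hosts such as `https://tools.acherryontop.com.evil.example`.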
### 6.7 Upload hardening
`POST /api/import/upload-image` (multer-backed) needs:
- File-size limit set on multer config (current limit may be defaulted — verify).
- MIME-type allowlist (image/jpeg, image/png, image/webp; reject everything else).
- Filename sanitization (no `..`, no absolute paths, generate UUID-based names server-side).
- The Caddy `/uploads/*` handler currently serves any file in the uploads directory publicly. Move this **behind** the auth gate: include `/uploads/*` in `@needs_auth`. If some images are referenced from public emails (Klaviyo newsletter), put **those** in a separate public bucket; everything else stays gated.
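The allowlist and filename rules above can be combined in one small helper. `safeUploadName` and `EXTENSION_BY_MIME` are hypothetical names, and this is a sketch rather than the existing multer setup.

```javascript
import { randomUUID } from 'node:crypto';

// Allowed upload types and the extension each one is stored under.
const EXTENSION_BY_MIME = new Map([
  ['image/jpeg', '.jpg'],
  ['image/png', '.png'],
  ['image/webp', '.webp'],
]);

// Returns a server-generated filename, or null if the type is not allowed.
// The client-supplied original name is deliberately never used, which rules
// out `..` segments and absolute paths by construction.
function safeUploadName(mimetype) {
  const ext = EXTENSION_BY_MIME.get(mimetype);
  return ext ? `${randomUUID()}${ext}` : null;
}
```

In multer terms, `fileFilter` would reject the file when this returns `null`, the disk storage `filename` callback would use the returned name, and `limits.fileSize` would carry the size cap. The MIME type multer reports is client-declared, so this is a first filter rather than proof of file content.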
### 6.8 Frontend token storage
**Decision: stay on `localStorage`.** This is an internal tool with no untrusted user-generated HTML being rendered, so the XSS-token-theft surface is small. The `forward_auth` gate is the main security gap we're addressing; cookie-based auth would be a larger, separate project (cookie-parser, CSRF double-submit pattern, AuthContext refactor) that doesn't change the threat model for an internal tool with no public sign-up.
Sanity check during this refactor: grep the React codebase for `dangerouslySetInnerHTML`. If any usages exist, verify each one is rendering trusted (server-controlled, not user-supplied) content. If a user-supplied content path exists, that's a real XSS vector and needs separate remediation regardless of token-storage choice.
### 6.9 Remove debug middleware
[inventory-server/src/server.js:79-87](inventory-server/src/server.js#L79-L87) logs full request headers including `Authorization`. Delete this block. Replace with `shared/logging/request-log.js`.
### 6.10 `lt-wordlist-api` token
`ADD_WORD_TOKEN` is currently hardcoded in `/var/www/ecosystem.config.cjs`. Move to `/opt/lt-wordlist-api/.env`, rotate the token value, update any callers.
### 6.11 Audit logging for sensitive operations
Already have `import-audit-log` and `product-editor-audit-log` tables. Extend the pattern:
- Log `user_id`, `endpoint`, `params`, `result` for `config.js` writes and `data-management.js` operations.
- Schema: reuse the existing audit table pattern or add a generic `system_audit_log` table.
- Don't log request bodies wholesale (may contain large blobs); log the action and the target ID.
---
## Phase 7 — Caddyfile final form
After all phases, the `tools.acherryontop.com` block looks like:
```caddyfile
tools.acherryontop.com {
    import security_headers
    # Public: login endpoint
    handle /auth-inv/* {
        uri strip_prefix /auth-inv
        reverse_proxy localhost:3011
    }
    # Public: static frontend assets
    @static path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2
    handle @static {
        header Cache-Control "public, max-age=2592000"
        root * /var/www/inventory/frontend/build
        file_server
    }
    # All API + uploads: auth gate first
    @gated path /api/* /chat-api/* /uploads/*
    handle @gated {
        forward_auth localhost:3011 {
            uri /verify
            copy_headers Authorization
        }
        # Uploaded files
        handle /uploads/* {
            root * /var/www/inventory
            file_server
        }
        # Vendor dashboard routes → merged dashboard-server
        handle /api/klaviyo/* { reverse_proxy localhost:3015 }
        handle /api/meta/* { reverse_proxy localhost:3015 }
        handle /api/google-analytics/* { reverse_proxy localhost:3015 }
        handle /api/typeform/* { reverse_proxy localhost:3015 }
        # ACOT-specific
        handle /api/acot/* { reverse_proxy localhost:3012 }
        # Chat
        handle /chat-api/* {
            uri strip_prefix /chat-api
            reverse_proxy localhost:3014
        }
        # Catch-all: inventory-server
        handle /api/* { reverse_proxy localhost:3010 }
    }
    handle /health { reverse_proxy localhost:3010 }
    # SPA fallback
    handle {
        root * /var/www/inventory/frontend/build
        try_files {path} /index.html
        file_server
        encode gzip
    }
    handle_errors {
        respond "{err.status_code} {err.status_text}"
    }
}
```
Removed: `/dashboard-auth/*`, `/api/aircall/*`, `/api/gorgias/*`, `/api/clarity/*`, the LAN/`Access-Control-Allow-Origin "*"` permissive defaults on `/api/*`. Kept: `/apiv2/*` and `/apiv2-test/*` proxies to backend.acherryontop.com (out of scope, separate system).
---
## Phase 8 — ecosystem.config.cjs final form
```js
// commonSettings (shared PM2 options) is assumed to be defined earlier in the
// real file; it is not shown in this sketch.
module.exports = {
  apps: [
    {
      name: 'auth-server',
      script: './inventory/auth/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', AUTH_PORT: 3011 },
      ...commonSettings,
    },
    {
      name: 'inventory-server',
      script: './inventory/src/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', PORT: 3010, UPLOADS_DIR: '/var/www/inventory/uploads' },
      ...commonSettings,
    },
    {
      name: 'dashboard-server',
      script: './inventory/dashboard/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', DASHBOARD_PORT: 3015 },
      ...commonSettings,
    },
    {
      name: 'acot-server',
      script: './inventory/dashboard/acot-server/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', ACOT_PORT: 3012 },
      ...commonSettings,
    },
    {
      name: 'chat-server',
      script: './inventory/chat/server.js',
      cwd: '/var/www',
      env: { NODE_ENV: 'production', PORT: 3014 },
      ...commonSettings,
    },
    // acot-phone-server and lt-wordlist-api unchanged
  ],
};
```
Five entries instead of twelve. Each app loads its own `.env` from its directory (already handled by `dotenv.config`).
---
## Sequencing & dependencies
```
Phase 1 (decommission) ──┬─────────────────────────────────────────┐
│ │
▼ │
Phase 2 (shared lib/) │
│ │
┌──────────────┼──────────────┐ │
▼ ▼ ▼ ▼
Phase 3a Phase 3b Phase 4 Phase 6 (auth hardening
inventory-server auth-server dashboard-server runs alongside 3+4+5,
to ESM to ESM + /verify build & test completes after them)
│ │ │ │
└──────────────┼──────────────┘ │
▼ │
Phase 5 (acot-server to ESM) ──────────────────►│
Phase 7 (Caddy cutover)
Phase 8 (PM2 final state)
```
Phase 1 unblocks everything (fewer services to convert).
Phase 2 is the foundation; nothing else can start until shared `lib/` exists.
Phases 3-5 can run in parallel; they touch independent services.
Phase 6's sub-items can be developed alongside Phases 3-5 but **enabled** only after them (no point adding `requirePermission` to a route that doesn't yet have `authenticate`).
Phase 7 is the cutover: Caddyfile flip happens when all backend changes are deployed.
Phase 8 is cleanup: remove dead PM2 entries.
Estimated effort, end-to-end: **~3 weeks of focused work** by one engineer. Phase 1 ≈ 1 day, Phase 2 ≈ 2 days, Phase 3 ≈ 3 days (both services), Phase 4 ≈ 5-7 days (the merge), Phase 5 ≈ 2-3 days, Phase 6 ≈ 3-4 days, Phase 7+8 ≈ 1 day.
---
## Testing strategy
No formal test suite exists today (per CLAUDE.md). For a refactor this size, that's a gap to close — but writing tests retroactively for 15K LOC of routes is a separate, larger project. For this refactor:
### Manual smoke testing per phase
A checklist of representative endpoints to hit after each deploy:
- `inventory-server`: `/api/products`, `/api/dashboard/overview`, `/api/analytics/revenue`, `/api/orders`, `/api/purchase-orders`, `/api/import/list-uploads`, `/api/config/global`
- `dashboard-server`: `/api/klaviyo/campaigns`, `/api/meta/insights`, `/api/google-analytics/...`, `/api/typeform/responses`
- `acot-server`: `/api/acot/...` (top-3 endpoints by call volume — pull from access logs)
- `auth-server`: `/login`, `/me`, `/verify`
Each smoke test runs (a) without a token → expect 401, (b) with an invalid token → expect 401, (c) with a valid token → expect 2xx.
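The three-case check is mechanical enough to script. A minimal sketch, assuming Node 18+ for the global `fetch`; `smokeCheck` is a hypothetical helper, and `fetchFn` is injectable so the logic can be dry-run without touching production.

```javascript
// Runs the no-token / invalid-token / valid-token cases against one endpoint
// and reports the observed status for each.
async function smokeCheck(url, validToken, fetchFn = fetch) {
  const cases = [
    { name: 'no token', headers: {}, ok: (s) => s === 401 },
    { name: 'invalid token', headers: { Authorization: 'Bearer not-a-real-token' }, ok: (s) => s === 401 },
    { name: 'valid token', headers: { Authorization: `Bearer ${validToken}` }, ok: (s) => s >= 200 && s < 300 },
  ];
  const results = [];
  for (const c of cases) {
    const res = await fetchFn(url, { headers: c.headers });
    results.push({ url, case: c.name, status: res.status, pass: c.ok(res.status) });
  }
  return results;
}
```

Looping this over the endpoint checklist and printing any result where `pass` is false gives a repeatable post-deploy check without committing to a full test suite.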
### Frontend integration check
After deploys, log into the SPA and exercise each major page (Overview, Products, Analytics, Dashboard, Klaviyo, Meta, etc.). If everything loads and dashboards populate, the auth + routing layer is intact.
### Test scaffold during Phase 2 (committed)
While building `shared/`, set up `vitest` (lightweight, ESM-native, fast) as the standard test runner for the repo. Initial coverage focuses on the security-critical surface only:
- `shared/auth/verify.js` — known good token, expired token, wrong-signature token, malformed token, missing token.
- `shared/auth/middleware.js` — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with `req.user` populated.
- `shared/auth/middleware.js` user-cache TTL: same token within 60s → one DB hit; same token after 61s → two DB hits.
`package.json` gets a `"test": "vitest run"` script at the repo root and per-service. Set up but don't backfill broader test coverage — that's a separate, larger project. The vitest scaffold gives future work a foothold; this refactor commits to having tests for the auth boundary specifically because that's what's load-bearing for the whole security model.
---
## Rollback strategy
Each phase produces an independently deployable state. Rollback per phase:
- **Phase 1**: re-add removed services to ecosystem; restore from git. Don't roll back data deletions — only do those after a week of stable production.
- **Phases 3, 5**: ESM conversion is per-service; if one service breaks, `pm2 restart <name>` to the previous commit. Other services unaffected.
- **Phase 4**: the dashboard-server merge is the highest-risk change. Plan: deploy `dashboard-server` to a non-conflicting port (3015) while leaving the old per-vendor servers running. Cut over Caddy routes one vendor at a time (start with Meta — smallest). If any vendor breaks, point Caddy back to the old server (still running) for that vendor, debug, retry. Only delete the old servers after all four are stable on `dashboard-server`.
- **Phases 6, 7**: Caddy config is git-tracked. `git revert` + `caddy reload` rolls back in seconds. Auth changes are additive (defense in depth) — if `forward_auth` causes problems, comment it out and per-server middleware continues protecting routes.
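The Phase 4 cutover rule above (one vendor at a time, roll back a single vendor, delete the old servers only once all four are stable) can be modelled as a small routing table. This is only an illustration of the state being tracked; the real switch happens in the Caddyfile. The function names and the `'legacy'` / `'dashboard-server'` labels are hypothetical.

```javascript
// Hypothetical model of the per-vendor cutover state during Phase 4.
const VENDORS = ['meta', 'klaviyo', 'google', 'typeform'];

function assertVendor(vendor) {
  if (!VENDORS.includes(vendor)) throw new Error(`Unknown vendor: ${vendor}`);
}

// Every vendor starts on its old per-vendor server.
function initialRouting() {
  return Object.fromEntries(VENDORS.map((vendor) => [vendor, 'legacy']));
}

// Point one vendor at the merged dashboard-server (port 3015 in the plan).
function cutOver(routing, vendor) {
  assertVendor(vendor);
  return { ...routing, [vendor]: 'dashboard-server' };
}

// Point one vendor back at its old server, which is still running.
function rollBack(routing, vendor) {
  assertVendor(vendor);
  return { ...routing, [vendor]: 'legacy' };
}

// The old servers may be deleted only when no vendor still depends on them.
function safeToDeleteLegacy(routing) {
  return VENDORS.every((vendor) => routing[vendor] === 'dashboard-server');
}

let routing = initialRouting();
routing = cutOver(routing, 'meta'); // start with the smallest vendor
routing = cutOver(routing, 'klaviyo');
routing = rollBack(routing, 'klaviyo'); // klaviyo misbehaved: back to its old server
const canDeleteLegacy = safeToDeleteLegacy(routing);
```

Because each vendor's state is independent, a rollback never touches vendors that are already stable on `dashboard-server`.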
---
## Out of scope (intentional)
These came up in the audit but aren't part of this refactor:
- `httpOnly` cookie auth (deferred; current `localStorage` token storage is acceptable for an internal tool).
- Replacing PM2 with systemd or Docker.
- Test coverage beyond the auth-critical surface.
- `apiv2`/`apiv2-test` proxies to `backend.acherryontop.com` — separate system, not touched.
- `acot-phone-server` and `lt-wordlist-api` — staying as-is.
- Centralized observability stack (Prometheus, Grafana). The logger work in Phase 6.5 sets up the data, but shipping it somewhere is future work.
---
## Concrete deliverables
When this is done:
- 4 application PM2 processes instead of 12 (plus 2 unchanged: acot-phone, lt-wordlist).
- All `/api/*` and `/chat-api/*` requests gated at Caddy and re-verified at each upstream.
- Sensitive endpoints additionally gated by per-permission checks.
- One ESM standard across the entire Node codebase.
- One shared library (`inventory-server/shared/`, see Deviations) for auth, logging, DB, errors, CORS.
- Login rate-limited.
- `JWT_SECRET` rotated.
- Old auth-server, Aircall, Gorgias, Clarity directories deleted from the repo.
- Caddyfile slimmed to one auth-gated block.
- Permission codes inserted into `permissions` table for granular authorization.
- No half-finished pieces, no `// TODO: add auth later` comments, no deferred secrets cleanup.
---
## Deviations from original plan (recorded during execution)
These are decisions made during Phase 1/2 implementation that amend the spec above. Future phases should follow the deviated path, not the original sketch.
1. **`shared/` location.** Original plan placed `shared/` at the repo root as a sibling of `inventory/` and `acot-phone/`. Implemented at `inventory-server/shared/` (= `/var/www/inventory/shared/` on the server) instead. Reason: the actual project root *is* `/var/www/inventory/`; placing shared/ outside it would have meant building a deployment story for it that doesn't exist. Import paths change accordingly:
- From `inventory-server/{auth,src,chat}/server.js` → `../shared/...`
- From `inventory-server/dashboard/{vendor}-server/server.js` → `../../shared/...`
2. **`/verify` response headers.** Plan specified `X-User-Id` + `X-User-Is-Admin`. Implemented as `X-User-Id` + `X-User-Username` (both available from the JWT payload). `X-User-Is-Admin` was dropped because `is_admin` isn't in the JWT today and returning it would require a DB lookup — violating the "no DB hit" principle. To restore `X-User-Is-Admin`, enrich the JWT payload at login time (one-line change in `auth/routes.js`) during Phase 6, then echo from `/verify`. Upstreams don't trust these headers anyway (they re-verify), so the omission is informational, not security-relevant.
3. **User cache key in `shared/auth/middleware.js`.** Plan sketch mentioned "60s TTL keyed by token jti". Implemented as keyed by `userId` instead — the JWT doesn't currently include a `jti` claim, and the cache's invalidation semantics are "this user was deactivated/changed permissions" (per-user), not "this token was revoked" (per-token). The plan's pseudocode already used `loadUserCached(pool, decoded.userId)` so this matches the spirit.
4. **Redis client safety.** `shared/db/redis.js` sets `enableOfflineQueue: false` and `lazyConnect: true`. The plan didn't specify these, but the settings mean a Redis hiccup fails fast (the route falls through to the upstream API, as designed in the Phase 4 risk notes) rather than queueing commands indefinitely.
5. **CORS allowed origins kept `https://acot.site`.** Plan example listed three origins. In production, acot.site redirects to tools.acherryontop.com, but some flows still reach the API directly from acot.site, so it was kept to avoid breakage. LAN wildcards (`192.168.*`, `10.*`) and `Access-Control-Allow-Origin "*"` are NOT included in the new `shared/cors/policy.js`, in the spirit of the plan's Phase 6.6, but the legacy `inventory-server/src/middleware/cors.js` still has them until services are migrated to consume `shared/cors/`.
6. **Defunct permission codes left in DB.** Removed the `dashboard:gorgias` and `dashboard:calls` Protected blocks from the frontend, but the corresponding permission rows in the `permissions` table are still there (assigned to some users). They're inert (no UI references them) but should be cleaned up alongside the Phase 6.2 permissions migration.
7. **PM2 process name kept as `new-auth-server` (not `auth-server`).** Plan's Phase 8 final form names it `auth-server` (after the legacy 3003 one is removed). Decided to keep the existing `new-auth-server` name through Phase 2 to avoid a rename mid-stream. Phase 8 can rename if desired, but it's cosmetic: all wiring is by port (3011), not by name.
8. **Caddyfile changes via admin API on `:2020`.** The Caddyfile is owned by root and matt has no passwordless sudo. Cutover used `curl -X POST .../load` on the Caddy admin port (which matt can hit), then a separate `sudo cp /home/matt/Caddyfile.new /etc/caddy/Caddyfile` step to persist the on-disk file. Future Caddyfile changes can follow the same pattern. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`.
9. **Path-naming.** Plan uses `inventory/` as the top-level (server-side path convention). Locally the equivalent is `inventory-server/`. Whenever the plan says `inventory/dashboard/foo/`, read that as `/var/www/inventory/dashboard/foo/` on the server or `inventory-server/dashboard/foo/` locally.
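
Deviation 5 describes an exact-match origin allowlist with no LAN wildcards and no `*`. A minimal sketch of that policy follows. It is not the contents of `shared/cors/policy.js`: the function name `corsHeadersFor` is hypothetical, and the list holds only the two origins named in this document, with `https://tools.acherryontop.com` assumed to be an allowed origin because the SPA is served from there.

```javascript
// Hypothetical sketch of an exact-match CORS allowlist.
// Assumption: only the two origins named in this plan; the real list may differ.
const ALLOWED_ORIGINS = new Set([
  'https://acot.site',
  'https://tools.acherryontop.com',
]);

// Returns the CORS headers for an allowed origin, or null when the origin
// is missing or not on the list (no wildcard or LAN-range matching).
function corsHeadersFor(origin) {
  if (!origin || !ALLOWED_ORIGINS.has(origin)) return null;
  return {
    'Access-Control-Allow-Origin': origin, // echo the exact origin, never "*"
    Vary: 'Origin', // responses differ per origin, so caches must key on it
  };
}

const allowed = corsHeadersFor('https://acot.site');
const lanOrigin = corsHeadersFor('http://192.168.1.20:5173');
const noOrigin = corsHeadersFor(undefined);
```

Exact string matching is what keeps `192.168.*`-style origins out: anything not listed verbatim gets no CORS headers.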