Phase 1-2 of server consolidation + security hardening
This commit is contained in:
@@ -0,0 +1,751 @@
|
||||
# Server Consolidation & Security Hardening Plan
|
||||
|
||||
Audit-driven plan to (a) reduce 12 PM2 processes to 3 application servers + 1 auth server, (b) put every API endpoint behind real authentication, and (c) standardize on ESM across all Node services. Approach is "do it properly the first time" — no half-finished pieces, no deferred cleanup.
|
||||
|
||||
---
|
||||
|
||||
## Status (2026-05-23)
|
||||
|
||||
| Phase | Status | Notes |
|
||||
|---|---|---|
|
||||
| 1 — Decommission dead services | **Complete** | aircall/gorgias/clarity/legacy-auth-server deleted from repo + PM2 + Caddyfile + ecosystem.cjs |
|
||||
| 2 — Build shared `lib/` | **Complete** | Lives at `inventory-server/shared/` (see Deviations). `/verify` endpoint live on auth-server |
|
||||
| 3 — Convert auth-server + inventory-server to ESM | Not started | Next up. Auth-server is still CJS but has the new `/verify` route added in CJS form |
|
||||
| 4 — Build `dashboard-server` (the merge) | Not started | klaviyo/meta/google/typeform still run as 4 separate PM2 apps |
|
||||
| 5 — Convert `acot-server` to ESM | Not started | |
|
||||
| 6 — Auth hardening | Not started | Shared modules exist (`shared/rate-limit`, `shared/cors`, `shared/logging`) but no service consumes them yet. JWT_SECRET footgun discovered — see 6.4 |
|
||||
| 7 — Caddyfile final form | Partial | Dead routes removed; `forward_auth` gate + `/uploads/*` gating + per-vendor cleanup deferred to after Phase 4 |
|
||||
| 8 — ecosystem.config.cjs final form | Partial | Dead apps removed; final shape depends on Phase 4 merge |
|
||||
|
||||
**Live PM2 count: 10** (down from 13). Target after Phase 4: 5 application apps + acot-phone-server + lt-wordlist-api.
|
||||
|
||||
---
|
||||
|
||||
## Goals
|
||||
|
||||
- Every public-facing endpoint requires a valid auth token (Caddy gate + per-server middleware + per-route permission checks for sensitive operations).
|
||||
- Reduce service count from 12 PM2 processes to 4: `inventory-server`, `acot-server`, `dashboard-server`, `auth-server`.
|
||||
- Standardize on ESM (`"type": "module"`) across all Node services.
|
||||
- Decommission `aircall-server`, `gorgias-server`, `clarity-server`, and the legacy `auth-server` (port 3003).
|
||||
- Eliminate dependency duplication: one Redis client, one Postgres pool helper, one logger, one auth middleware — shared across services.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Rewriting business logic. Route handlers move as-is unless they break under ESM or shared middleware.
|
||||
- Switching auth providers (we keep JWT + bcrypt + Postgres).
|
||||
- Replacing PM2 or Caddy.
|
||||
- Migrating Klaviyo/Meta/Google/Typeform's external API contracts.
|
||||
|
||||
---
|
||||
|
||||
## Target architecture
|
||||
|
||||
```
|
||||
┌──────────────────────────┐
|
||||
│ tools.acherryontop.com │
|
||||
│ (Caddy) │
|
||||
│ forward_auth gate ─────┼──► auth-server:3011
|
||||
└────────────┬─────────────┘ /verify endpoint
|
||||
│
|
||||
┌────────────────────────────────┼────────────────────────────────┐
|
||||
▼ ▼ ▼
|
||||
┌─────────────────────┐ ┌──────────────────────┐ ┌─────────────────────┐
|
||||
│ inventory-server │ │ dashboard-server │ │ acot-server │
|
||||
│ :3010 (ESM) │ │ :3015 (ESM) │ │ :3012 (ESM) │
|
||||
│ │ │ │ │ │
|
||||
│ /api/products │ │ /api/klaviyo/* │ │ /api/acot/* │
|
||||
│ /api/orders │ │ /api/meta/* │ │ (MySQL via SSH) │
|
||||
│ /api/analytics │ │ /api/google-*/* │ │ │
|
||||
│ /api/dashboard │ │ /api/typeform/* │ │ │
|
||||
│ ... (~25 routers) │ │ │ │ │
|
||||
└─────────────────────┘ └──────────────────────┘ └─────────────────────┘
|
||||
│ │ │
|
||||
├── Postgres (inventory_db) ├── Postgres (klaviyo) └── MySQL (workpi, via ssh2 tunnel)
|
||||
├── shared lib/ ◄────────────────┤
|
||||
│ - auth middleware ├── Redis (shared client)
|
||||
│ - permission helper └── shared lib/ ◄─────────────────┐
|
||||
│ - logger │
|
||||
│ - pg pool factory │
|
||||
│ - error formatter │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
┌──────────────────┴───┐
|
||||
│ auth-server │
|
||||
│ :3011 (ESM) │
|
||||
│ /login, /me, │
|
||||
│ /verify, user mgmt │
|
||||
└──────────────────────┘
|
||||
```
|
||||
|
||||
PM2 process count: **12 → 4** (plus `acot-phone-server` and `lt-wordlist-api`, which stay as-is — out of scope).
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Decommission dead/leaving services
|
||||
|
||||
Status: **Complete (2026-05-23)**. All four services removed from repo, PM2, Caddyfile, and ecosystem.config.cjs. Frontend widgets (`AircallDashboard.jsx`, `GorgiasOverview.jsx`) and their dashboard.ts/Navigation.jsx/vite.config.ts wiring also removed. Verification: smoke-tested `https://tools.acherryontop.com/api/{aircall,gorgias,clarity}/*` → 404. Backups left at `/home/matt/{ecosystem.config.cjs,Caddyfile}.bak.2026-05-23`.
|
||||
|
||||
### To remove
|
||||
|
||||
| Service | Reason | Steps |
|
||||
|---|---|---|
|
||||
| `aircall-server` (3002) | Migrating off Aircall | `pm2 delete aircall-server`; remove from `ecosystem.config.cjs`; remove `/api/aircall/*` from Caddyfile; drop `inventory/dashboard/aircall-server/` directory; remove MongoDB connection from any frontend code; cancel Mongo if it was only feeding Aircall |
|
||||
| `gorgias-server` (3006) | Migrating off Gorgias | same pattern; check frontend for `/api/gorgias/*` callers and delete the dashboards/widgets that use them |
|
||||
| `clarity-server` (3009) | Already dead (no `.js` files, not in ecosystem) | remove `/api/clarity/*` from Caddyfile; delete `inventory/dashboard/clarity-server/` directory |
|
||||
| `auth-server` (3003, legacy) | Replaced by `new-auth-server` on 3011 | grep entire codebase for `dashboard-auth` and `localhost:3003`; redirect or remove callers; `pm2 delete auth-server`; remove from ecosystem; remove `/dashboard-auth/*` from Caddyfile; delete `inventory/dashboard/auth-server/` directory |
|
||||
|
||||
### Verification before deletion
|
||||
|
||||
```bash
|
||||
# from inventory/ root — find any references before removing
|
||||
grep -rn "aircall\|/api/aircall" inventory/src/ inventory-server/src/
|
||||
grep -rn "gorgias\|/api/gorgias" inventory/src/ inventory-server/src/
|
||||
grep -rn "/dashboard-auth\|localhost:3003" inventory/src/ inventory-server/src/
|
||||
grep -rn "/api/clarity" inventory/src/ inventory-server/src/
|
||||
```
|
||||
|
||||
Any remaining callers must be deleted or repointed before the server is removed. Do **not** leave a 502 response in production.
|
||||
|
||||
### Database/secret cleanup
|
||||
|
||||
- Drop the MongoDB instance feeding Aircall (after confirming no other consumers).
|
||||
- Rotate any Gorgias/Aircall API keys still in `.env` files (defense in depth — they'll be useless soon anyway, but commit hygiene matters).
|
||||
- Remove `MONGODB_URI`, `AIRCALL_*`, `GORGIAS_*` from any `.env` files.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Build the shared `lib/`
|
||||
|
||||
Status: **Complete (2026-05-23)**. All 11 modules written under `inventory-server/shared/` (NOT repo root — see Deviations). `/verify` endpoint added to auth-server in CJS form (will move to shared/auth/verify.js usage during Phase 3 ESM conversion). Smoke-tested with no-token / bad-token / expired-token / valid-token cases. No service consumes shared/ yet; that happens in Phases 3–5.
|
||||
|
||||
### Location
|
||||
|
||||
A single shared directory at the repo root: `shared/` (sibling of `inventory/` and `acot-phone/`). Each service imports from it via a relative path. We do **not** introduce npm workspaces yet — relative imports are fine for three consumers and avoid the npm-link / hoisting headaches.
|
||||
|
||||
### Modules to create
|
||||
|
||||
```
|
||||
shared/
|
||||
├── package.json # "type": "module"
|
||||
├── auth/
|
||||
│ ├── middleware.js # authenticate(), requirePermission(), requireAdmin()
|
||||
│ └── verify.js # verifyToken() — pure function, no Express dependency
|
||||
├── db/
|
||||
│ ├── pg.js # createPool(envPrefix) — returns configured Pool
|
||||
│ └── redis.js # createRedis() — single client, lazy-connect
|
||||
├── logging/
|
||||
│ ├── logger.js # pino-based, redacts Authorization/Cookie
|
||||
│ └── request-log.js # Express middleware, structured access log
|
||||
├── errors/
|
||||
│ └── handler.js # consistent error envelope, no leak in prod
|
||||
├── cors/
|
||||
│ └── policy.js # single allowed-origins list, exported as cors() options
|
||||
└── rate-limit/
|
||||
└── login.js # express-rate-limit config for /login
|
||||
```
|
||||
|
||||
### Auth middleware spec (`shared/auth/middleware.js`)
|
||||
|
||||
```js
|
||||
// Pseudocode — final implementation matches the existing pattern in
|
||||
// inventory/auth/routes.js authenticate() but factored out.
|
||||
|
||||
export function authenticate({ pool }) {
|
||||
return async (req, res, next) => {
|
||||
const header = req.headers.authorization;
|
||||
if (!header?.startsWith('Bearer ')) {
|
||||
return res.status(401).json({ error: 'Authentication required' });
|
||||
}
|
||||
try {
|
||||
const decoded = jwt.verify(header.slice(7), process.env.JWT_SECRET);
|
||||
// Short-circuit DB hit with an in-memory cache, 60s TTL keyed by token jti
|
||||
const user = await loadUserCached(pool, decoded.userId);
|
||||
if (!user.is_active) return res.status(403).json({ error: 'Account inactive' });
|
||||
req.user = user;
|
||||
next();
|
||||
} catch {
|
||||
res.status(401).json({ error: 'Invalid token' });
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
export function requirePermission(code) {
|
||||
return (req, res, next) => {
|
||||
if (req.user.is_admin) return next();
|
||||
if (req.user.permissions?.includes(code)) return next();
|
||||
res.status(403).json({ error: 'Insufficient permissions' });
|
||||
};
|
||||
}
|
||||
|
||||
export const requireAdmin = (req, res, next) =>
|
||||
req.user.is_admin ? next() : res.status(403).json({ error: 'Admin only' });
|
||||
```
|
||||
|
||||
### Why a 60s in-memory user cache
|
||||
|
||||
`forward_auth` in Caddy will call `auth-server` on every request. Each per-server `authenticate()` middleware also has a DB lookup to load permissions. Without caching, every API request becomes 1 SQL query for the user row + 1 for permissions. 60s TTL is short enough that deactivating a user takes effect within a minute, long enough that Klaviyo dashboards (which fire dozens of requests on load) don't hammer Postgres.
|
||||
|
||||
### Add to `auth-server`: a `/verify` endpoint
|
||||
|
||||
Caddy's `forward_auth` only needs "is this token valid? give me a user-id." Today's `/me` does that but with a full permissions join. Add a lightweight `/verify` that:
|
||||
|
||||
- Verifies JWT signature only (no DB hit).
|
||||
- Returns `200` with `X-User-Id` and `X-User-Is-Admin` response headers (which Caddy `copy_headers` will pass to upstream).
|
||||
- Returns `401` on bad token.
|
||||
|
||||
**Decision: each service re-verifies the JWT independently.** Caddy's `forward_auth` is a fast first-pass reject for obviously bad tokens, but the security boundary is the per-server `authenticate()` middleware. Cost is negligible (one HMAC-SHA256 per request); the upside is that a misconfigured Caddyfile can never let an unauthenticated request reach a backend. Upstream services do **not** trust any `X-User-*` headers from Caddy — they parse the `Authorization` header themselves.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Convert `auth-server` and `inventory-server` to ESM
|
||||
|
||||
Status: **Not started.** Lift the easy ones first. These two stay standalone (don't merge into anything), so they're isolated changes. The auth-server's new `/verify` route (added in Phase 2) is currently CJS — refactor it during this phase to import from `../shared/auth/verify.js`.
|
||||
|
||||
### Mechanical conversion
|
||||
|
||||
Per service:
|
||||
|
||||
1. Add `"type": "module"` to `package.json`.
|
||||
2. Convert `require()` → `import`. `module.exports` → `export` / `export default`.
|
||||
3. Fix `__dirname`/`__filename` (use `import.meta.url` + `fileURLToPath`).
|
||||
4. Convert any dynamic require (e.g., conditional plugin loading) to `await import()`.
|
||||
5. Update any sub-imports that don't include the file extension — ESM requires `./foo.js`, not `./foo`.
|
||||
6. Update `ecosystem.config.cjs` if any service entry depended on CJS semantics. The ecosystem file itself can stay `.cjs` — PM2 reads it as config, doesn't matter what the apps it spawns are.
|
||||
7. Update nodemon config / scripts.
|
||||
|
||||
### Risk areas in inventory-server
|
||||
|
||||
- `routes/ai.js` does a lazy init (`aiRouter.initInBackground()` called from `server.js`) — confirm the export shape still works as a default export of an Express router with a sidecar function. May need to split into `export default router; export function initInBackground() {}`.
|
||||
- Multer setup in `routes/import.js` — straightforward, no ESM-specific concerns.
|
||||
- SSE setup in `server.js` — moves over cleanly, no module-system entanglement.
|
||||
- The `child_process.spawn` calls for metrics calculation: ESM doesn't change `child_process` behavior, but if any spawned script uses `require()` of a sibling, that sibling must also be ESM (or stay CJS with a `.cjs` extension).
|
||||
|
||||
### Test strategy
|
||||
|
||||
- After conversion, `pm2 start ecosystem.config.cjs --only inventory-server` on the server, watch logs for require/import errors at startup.
|
||||
- Hit `/health`, then the most exercised endpoints (`/api/products`, `/api/dashboard/overview`, `/api/analytics/...`). If startup is clean and three smoke endpoints work, ESM conversion is done. Functional correctness is preserved because no logic changed.
|
||||
|
||||
### Auth-server
|
||||
|
||||
Already small (~200 LOC server.js + ~few hundred in routes.js + permissions.js). 1-day conversion. Add the new `/verify` endpoint as part of this work.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Build `dashboard-server` (the merge)
|
||||
|
||||
Status: **Not started.** The big merge. Klaviyo + Meta + Google + Typeform → one ESM service. Highest-risk phase — see Rollback strategy for the per-vendor cutover plan.
|
||||
|
||||
### Layout
|
||||
|
||||
```
|
||||
inventory/dashboard/
|
||||
├── server.js # entry: load env, init Postgres+Redis, mount routes, listen
|
||||
├── package.json # "type": "module", deps from all 4 source servers (deduped)
|
||||
├── .env # KLAVIYO_*, META_*, GOOGLE_*, TYPEFORM_*, shared DB_*, REDIS_URL
|
||||
├── routes/
|
||||
│ ├── klaviyo/ # absorbed from dashboard/klaviyo-server/src/
|
||||
│ ├── meta/ # absorbed from dashboard/meta-server/
|
||||
│ ├── google/ # absorbed from dashboard/google-server/
|
||||
│ └── typeform/ # absorbed from dashboard/typeform-server/
|
||||
├── services/ # per-vendor API clients (Klaviyo SDK calls, etc.)
|
||||
├── scripts/
|
||||
│ └── import-campaign-products.js # one-shot, moved from klaviyo-server/scripts/
|
||||
└── logs/
|
||||
```
|
||||
|
||||
### Mount points
|
||||
|
||||
```js
|
||||
// server.js (sketch)
|
||||
import { authenticate, requirePermission } from '../../shared/auth/middleware.js';
|
||||
import { createPool } from '../../shared/db/pg.js';
|
||||
import { createRedis } from '../../shared/db/redis.js';
|
||||
import { logger, requestLog } from '../../shared/logging/index.js';
|
||||
import corsPolicy from '../../shared/cors/policy.js';
|
||||
import errorHandler from '../../shared/errors/handler.js';
|
||||
|
||||
import klaviyoRouter from './routes/klaviyo/index.js';
|
||||
import metaRouter from './routes/meta/index.js';
|
||||
import googleRouter from './routes/google/index.js';
|
||||
import typeformRouter from './routes/typeform/index.js';
|
||||
|
||||
const app = express();
|
||||
const pool = await createPool('KLAVIYO_DB'); // klaviyo has its own DB; others can share or have none
|
||||
const redis = await createRedis();
|
||||
|
||||
app.use(requestLog);
|
||||
app.use(cors(corsPolicy));
|
||||
app.use(express.json({ limit: '10mb' }));
|
||||
|
||||
// Everything below this line requires a valid token.
|
||||
app.use('/api', authenticate({ pool }));
|
||||
|
||||
app.use('/api/klaviyo', klaviyoRouter({ pool, redis }));
|
||||
app.use('/api/meta', metaRouter({ redis }));
|
||||
app.use('/api/google-analytics', googleRouter({ redis })); // matches Caddy /api/dashboard-analytics rewrite
|
||||
app.use('/api/typeform', typeformRouter({ redis }));
|
||||
|
||||
app.get('/health', (req, res) => res.json({ ok: true }));
|
||||
app.use(errorHandler);
|
||||
|
||||
app.listen(process.env.DASHBOARD_PORT || 3015);
|
||||
```
|
||||
|
||||
### Per-vendor routers
|
||||
|
||||
Each vendor's existing route file becomes a factory that takes the shared `pool`/`redis` and returns an Express router. Replace each server's per-instance pool/redis with the injected one.
|
||||
|
||||
### Permission gates (sensitive routes only)
|
||||
|
||||
Authenticated-only is the default after `app.use('/api', authenticate(...))`. For sensitive operations, add `requirePermission` per route:
|
||||
|
||||
- Anything that mutates Klaviyo lists/segments → `requirePermission('klaviyo_write')`
|
||||
- Triggering a campaign sync → `requirePermission('klaviyo_admin')`
|
||||
- Read-only dashboards → no extra check beyond authenticate.
|
||||
|
||||
Define the new permission codes in the `permissions` table via a migration in Phase 6.
|
||||
|
||||
### Dependency dedup
|
||||
|
||||
**Decision: standardize on `ioredis`.** Klaviyo's larger codebase already uses it, and `ioredis` has better cluster/sentinel support if we ever need it. Update `meta`/`google`/`typeform` call sites — each is a handful of `get`/`set` calls, mechanical conversion. Remove the `redis` package from `dashboard-server`'s `package.json`.
|
||||
|
||||
### Env consolidation
|
||||
|
||||
Single `.env` at `inventory/dashboard/.env`, prefixed keys:
|
||||
|
||||
```
|
||||
DASHBOARD_PORT=3015
|
||||
KLAVIYO_API_KEY=...
|
||||
KLAVIYO_DB_HOST=...
|
||||
KLAVIYO_DB_NAME=...
|
||||
META_ACCESS_TOKEN=...
|
||||
GOOGLE_SERVICE_ACCOUNT_KEY=...
|
||||
TYPEFORM_TOKEN=...
|
||||
REDIS_URL=...
|
||||
JWT_SECRET=... # shared with auth-server; same secret means same tokens valid here
|
||||
```
|
||||
|
||||
### Klaviyo's `scripts/import-campaign-products.js`
|
||||
|
||||
One-shot script — keep it, but run it from the merged dashboard-server's directory. Update the script's imports to ESM. If it's run via cron, update the cron entry to the new path.
|
||||
|
||||
### Risk: shared error states
|
||||
|
||||
When all four vendors share a Redis client, a Redis hiccup affects all four. Make sure the connection has retry config (`ioredis` defaults are reasonable but verify) and that vendor routes degrade gracefully when Redis is unavailable (most use it as a cache, so cache-miss → fall through to upstream API is the right behavior).
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Convert `acot-server` to ESM (stays standalone)
|
||||
|
||||
Status: **Not started.** Largest single conversion (~5K LOC), but no merge involved.
|
||||
|
||||
### Special concern: ssh2 tunnel
|
||||
|
||||
`acot-server` opens an SSH tunnel via `ssh2` to access the production MySQL at `192.168.1.5:3309`. The tunnel must be:
|
||||
|
||||
- Established before the HTTP listener starts (so no requests fail with "no DB connection").
|
||||
- Re-established on disconnect (`ssh2` connection's `close` event → recreate).
|
||||
- Cleanly torn down on `SIGTERM`/`SIGINT` so PM2 restarts don't leak file descriptors.
|
||||
|
||||
Verify (or add) this lifecycle handling as part of the conversion. If it's already correct, conversion is mechanical; if not, this is a good moment to fix it.
|
||||
|
||||
### Test strategy
|
||||
|
||||
Same as inventory-server: start with PM2, smoke-test the most-used `/api/acot/*` endpoints, watch logs for unhandled rejection or tunnel-close events.
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 — Auth hardening
|
||||
|
||||
Status: **Not started.** This is the security work that justifies the whole refactor. Runs in parallel with phases 3–5 where possible. Shared building blocks already exist (`shared/rate-limit/login.js`, `shared/cors/policy.js`, `shared/logging/request-log.js`, `shared/errors/handler.js`) — Phase 6 is about *applying* them per-service.
|
||||
|
||||
### 6.1 Caddy `forward_auth` gate
|
||||
|
||||
Add to the `tools.acherryontop.com` block, before the `@api_routes` handler:
|
||||
|
||||
```caddyfile
|
||||
# Forward-auth gate for all API traffic
|
||||
@needs_auth path /api/* /chat-api/*
|
||||
handle @needs_auth {
|
||||
forward_auth localhost:3011 {
|
||||
uri /verify
|
||||
copy_headers Authorization
|
||||
# On 401/403, Caddy returns the auth-server's response body verbatim
|
||||
}
|
||||
# Existing per-vendor handle blocks remain below this line
|
||||
}
|
||||
|
||||
# /auth-inv/* stays public (you need to log in!)
|
||||
handle /auth-inv/* {
|
||||
uri strip_prefix /auth-inv
|
||||
reverse_proxy localhost:3011
|
||||
}
|
||||
```
|
||||
|
||||
The `forward_auth` directive subrequests `/verify` on the auth-server. If it returns 2xx, the request proceeds upstream. If 401/403, Caddy returns that response to the client and never hits the backend.
|
||||
|
||||
This is the **first** line of defense. Per-server middleware (`shared/auth/middleware.js`) is the **second** line — re-verifies the JWT independently. Defense in depth: a Caddyfile typo can't open a hole.
|
||||
|
||||
### 6.2 Per-route permission gates
|
||||
|
||||
After per-server `authenticate()`, add `requirePermission(code)` to destructive or sensitive routes. Audit needed in:
|
||||
|
||||
- `inventory-server/src/routes/config.js` — global config writes → `admin`
|
||||
- `inventory-server/src/routes/import.js` — uploads, deletes, generate-upc → `product_import`
|
||||
- `inventory-server/src/routes/data-management.js` — CSV operations → `data_management`
|
||||
- `inventory-server/src/routes/ai-prompts.js` — prompt edits → `ai_admin`
|
||||
- `inventory-server/src/routes/templates.js` — template writes → `templates_write`
|
||||
- `inventory-server/src/routes/reusable-images.js` — image management → `image_admin`
|
||||
- `inventory-server/src/routes/products.js` — only one POST (`/resolve-identifiers`); evaluate whether it needs a permission code or authenticated-only is fine
|
||||
- `inventory-server/src/routes/product-editor-audit-log.js` and `import-audit-log.js` — read-only by sensitive users → `audit_read`
|
||||
- `dashboard-server` Klaviyo/Meta/Google/Typeform write endpoints → vendor-specific codes per above
|
||||
|
||||
Migration: a single SQL script that inserts the new permission codes into the `permissions` table and assigns them to existing admin users. Non-admin users get permissions explicitly granted via the user management UI.
|
||||
|
||||
```sql
|
||||
INSERT INTO permissions (code, name) VALUES
|
||||
('product_import', 'Product Import'),
|
||||
('data_management', 'Data Management'),
|
||||
('ai_admin', 'AI Settings Admin'),
|
||||
('templates_write', 'Template Editing'),
|
||||
('image_admin', 'Image Management'),
|
||||
('audit_read', 'Audit Log Access'),
|
||||
('klaviyo_write', 'Klaviyo Write'),
|
||||
('klaviyo_admin', 'Klaviyo Admin'),
|
||||
('meta_write', 'Meta Write'),
|
||||
('google_write', 'Google Analytics Write'),
|
||||
('typeform_write', 'Typeform Write'),
|
||||
('acot_admin', 'ACOT Server Admin')
|
||||
ON CONFLICT (code) DO NOTHING;
|
||||
```
|
||||
|
||||
### 6.3 Rate limiting on login
|
||||
|
||||
`shared/rate-limit/login.js`:
|
||||
|
||||
```js
|
||||
import rateLimit from 'express-rate-limit';
|
||||
export const loginLimiter = rateLimit({
|
||||
windowMs: 15 * 60 * 1000, // 15 minutes
|
||||
max: 10, // 10 attempts per IP per window
|
||||
message: { error: 'Too many login attempts, try again later' },
|
||||
standardHeaders: true,
|
||||
legacyHeaders: false,
|
||||
});
|
||||
```
|
||||
|
||||
Apply in `auth-server` on the `/login` route. Consider also rate-limiting `/verify` and `/me` (much higher cap, ~600/min — they're called legitimately by every page load).
|
||||
|
||||
### 6.4 JWT secret rotation
|
||||
|
||||
- Rotate `JWT_SECRET` to a fresh 32-byte random string as part of the deployment.
|
||||
- Document that rotation logs out all users — acceptable for an internal tool, do it during off-hours.
|
||||
- Add `JWT_SECRET` to the env var validation block in `auth-server/server.js` (refuse to start if not set).
|
||||
- **Fix the existing footgun**: `/var/www/ecosystem.config.cjs` currently has `JWT_SECRET: process.env.JWT_SECRET` *after* `...inventoryEnv` in the new-auth-server block. This shadows the `.env` value with whatever the shell exported when PM2 was started — which has already silently diverged at least once (detected and fixed 2026-05-23 by a clean PM2 restart in a shell without JWT_SECRET exported). Delete that override line during rotation; let `.env` be the single source of truth.
|
||||
|
||||
### 6.5 Request logging
|
||||
|
||||
`shared/logging/request-log.js` — log method, path, status, duration, user-id (if authenticated). **Never** log `Authorization` or `Cookie` headers. Remove the current `server.js:79-87` debug middleware in inventory-server (it logs full headers including the bearer token).
|
||||
|
||||
### 6.6 CORS lockdown
|
||||
|
||||
Current `middleware/cors.js` allows `192.168.*.*` and `10.*.*.*` with `credentials: true`. Tighten to explicit known origins:
|
||||
|
||||
```js
|
||||
origin: [
|
||||
'https://tools.acherryontop.com',
|
||||
'https://inventory.kent.pw',
|
||||
/^http:\/\/localhost:(5174|5175)$/,
|
||||
]
|
||||
```
|
||||
|
||||
If anyone genuinely needs LAN access, add their specific IP, not a `/16` range.
|
||||
|
||||
### 6.7 Upload hardening
|
||||
|
||||
`POST /api/import/upload-image` (multer-backed) needs:
|
||||
|
||||
- File-size limit set on multer config (current limit may be defaulted — verify).
|
||||
- MIME-type allowlist (image/jpeg, image/png, image/webp; reject everything else).
|
||||
- Filename sanitization (no `..`, no absolute paths, generate UUID-based names server-side).
|
||||
- The Caddy `/uploads/*` handler currently serves any file in the uploads directory publicly. Move this **behind** the auth gate: include `/uploads/*` in `@needs_auth`. If some images are referenced from public emails (Klaviyo newsletter), put **those** in a separate public bucket; everything else stays gated.
|
||||
|
||||
### 6.8 Frontend token storage
|
||||
|
||||
**Decision: stay on `localStorage`.** This is an internal tool with no untrusted user-generated HTML being rendered, so the XSS-token-theft surface is small. The `forward_auth` gate is the main security gap we're addressing; cookie-based auth would be a larger, separate project (cookie-parser, CSRF double-submit pattern, AuthContext refactor) that doesn't change the threat model for an internal tool with no public sign-up.
|
||||
|
||||
Sanity check during this refactor: grep the React codebase for `dangerouslySetInnerHTML`. If any usages exist, verify each one is rendering trusted (server-controlled, not user-supplied) content. If a user-supplied content path exists, that's a real XSS vector and needs separate remediation regardless of token-storage choice.
|
||||
|
||||
### 6.9 Remove debug middleware
|
||||
|
||||
[inventory-server/src/server.js:79-87](inventory-server/src/server.js#L79-L87) logs full request headers including `Authorization`. Delete this block. Replace with `shared/logging/request-log.js`.
|
||||
|
||||
### 6.10 `lt-wordlist-api` token
|
||||
|
||||
`ADD_WORD_TOKEN` is currently hardcoded in `/var/www/ecosystem.config.cjs`. Move to `/opt/lt-wordlist-api/.env`, rotate the token value, update any callers.
|
||||
|
||||
### 6.11 Audit logging for sensitive operations
|
||||
|
||||
Already have `import-audit-log` and `product-editor-audit-log` tables. Extend the pattern:
|
||||
|
||||
- Log `user_id`, `endpoint`, `params`, `result` for `config.js` writes and `data-management.js` operations.
|
||||
- Schema: reuse the existing audit table pattern or add a generic `system_audit_log` table.
|
||||
- Don't log request bodies wholesale (may contain large blobs); log the action and the target ID.
|
||||
|
||||
---
|
||||
|
||||
## Phase 7 — Caddyfile final form
|
||||
|
||||
After all phases, the `tools.acherryontop.com` block looks like:
|
||||
|
||||
```caddyfile
|
||||
tools.acherryontop.com {
|
||||
import security_headers
|
||||
|
||||
# Public: login endpoint
|
||||
handle /auth-inv/* {
|
||||
uri strip_prefix /auth-inv
|
||||
reverse_proxy localhost:3011
|
||||
}
|
||||
|
||||
# Public: static frontend assets
|
||||
@static path *.js *.css *.png *.jpg *.jpeg *.gif *.ico *.svg *.woff *.woff2
|
||||
handle @static {
|
||||
header Cache-Control "public, max-age=2592000"
|
||||
root * /var/www/inventory/frontend/build
|
||||
file_server
|
||||
}
|
||||
|
||||
# All API + uploads: auth gate first
|
||||
@gated path /api/* /chat-api/* /uploads/*
|
||||
handle @gated {
|
||||
forward_auth localhost:3011 {
|
||||
uri /verify
|
||||
copy_headers Authorization
|
||||
}
|
||||
|
||||
# Uploaded files
|
||||
handle /uploads/* {
|
||||
root * /var/www/inventory
|
||||
file_server
|
||||
}
|
||||
|
||||
# Vendor dashboard routes → merged dashboard-server
|
||||
handle /api/klaviyo/* { reverse_proxy localhost:3015 }
|
||||
handle /api/meta/* { reverse_proxy localhost:3015 }
|
||||
handle /api/google-analytics/* { reverse_proxy localhost:3015 }
|
||||
handle /api/typeform/* { reverse_proxy localhost:3015 }
|
||||
|
||||
# ACOT-specific
|
||||
handle /api/acot/* { reverse_proxy localhost:3012 }
|
||||
|
||||
# Chat
|
||||
handle /chat-api/* {
|
||||
uri strip_prefix /chat-api
|
||||
reverse_proxy localhost:3014
|
||||
}
|
||||
|
||||
# Catch-all: inventory-server
|
||||
handle /api/* { reverse_proxy localhost:3010 }
|
||||
}
|
||||
|
||||
handle /health { reverse_proxy localhost:3010 }
|
||||
|
||||
# SPA fallback
|
||||
handle {
|
||||
root * /var/www/inventory/frontend/build
|
||||
try_files {path} /index.html
|
||||
file_server
|
||||
encode gzip
|
||||
}
|
||||
|
||||
handle_errors {
|
||||
respond "{err.status_code} {err.status_text}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Removed: `/dashboard-auth/*`, `/api/aircall/*`, `/api/gorgias/*`, `/api/clarity/*`, the LAN/`Access-Control-Allow-Origin "*"` permissive defaults on `/api/*`. Kept: `/apiv2/*` and `/apiv2-test/*` proxies to backend.acherryontop.com (out of scope, separate system).
|
||||
|
||||
---
|
||||
|
||||
## Phase 8 — ecosystem.config.cjs final form
|
||||
|
||||
```js
|
||||
module.exports = {
|
||||
apps: [
|
||||
{
|
||||
name: 'auth-server',
|
||||
script: './inventory/auth/server.js',
|
||||
cwd: '/var/www',
|
||||
env: { NODE_ENV: 'production', AUTH_PORT: 3011 },
|
||||
...commonSettings,
|
||||
},
|
||||
{
|
||||
name: 'inventory-server',
|
||||
script: './inventory/src/server.js',
|
||||
cwd: '/var/www',
|
||||
env: { NODE_ENV: 'production', PORT: 3010, UPLOADS_DIR: '/var/www/inventory/uploads' },
|
||||
...commonSettings,
|
||||
},
|
||||
{
|
||||
name: 'dashboard-server',
|
||||
script: './inventory/dashboard/server.js',
|
||||
cwd: '/var/www',
|
||||
env: { NODE_ENV: 'production', DASHBOARD_PORT: 3015 },
|
||||
...commonSettings,
|
||||
},
|
||||
{
|
||||
name: 'acot-server',
|
||||
script: './inventory/dashboard/acot-server/server.js',
|
||||
cwd: '/var/www',
|
||||
env: { NODE_ENV: 'production', ACOT_PORT: 3012 },
|
||||
...commonSettings,
|
||||
},
|
||||
{
|
||||
name: 'chat-server',
|
||||
script: './inventory/chat/server.js',
|
||||
cwd: '/var/www',
|
||||
env: { NODE_ENV: 'production', PORT: 3014 },
|
||||
...commonSettings,
|
||||
},
|
||||
// acot-phone-server and lt-wordlist-api unchanged
|
||||
],
|
||||
};
|
||||
```
|
||||
|
||||
Five entries instead of twelve. Each app loads its own `.env` from its directory (already handled by `dotenv.config`).
|
||||
|
||||
---
|
||||
|
||||
## Sequencing & dependencies
|
||||
|
||||
```
|
||||
Phase 1 (decommission) ──┬─────────────────────────────────────────┐
|
||||
│ │
|
||||
▼ │
|
||||
Phase 2 (shared lib/) │
|
||||
│ │
|
||||
┌──────────────┼──────────────┐ │
|
||||
▼ ▼ ▼ ▼
|
||||
Phase 3a Phase 3b Phase 4 Phase 6 (auth hardening
|
||||
inventory-server auth-server dashboard-server runs alongside 3+4+5,
|
||||
to ESM to ESM + /verify build & test completes after them)
|
||||
│ │ │ │
|
||||
└──────────────┼──────────────┘ │
|
||||
▼ │
|
||||
Phase 5 (acot-server to ESM) ──────────────────►│
|
||||
▼
|
||||
Phase 7 (Caddy cutover)
|
||||
│
|
||||
▼
|
||||
Phase 8 (PM2 final state)
|
||||
```
|
||||
|
||||
Phase 1 unblocks everything (fewer services to convert).
|
||||
Phase 2 is the foundation; nothing else can start until shared `lib/` exists.
|
||||
Phases 3–5 can run in parallel; they touch independent services.
|
||||
Phase 6's sub-items can be developed alongside 3–5 but **enabled** only after them (no point adding `requirePermission` to a route that doesn't yet have `authenticate`).
|
||||
Phase 7 is the cutover: Caddyfile flip happens when all backend changes are deployed.
|
||||
Phase 8 is cleanup: remove dead PM2 entries.
|
||||
|
||||
Estimated effort, end-to-end: **~3 weeks of focused work** by one engineer. Phase 1 ≈ 1 day, Phase 2 ≈ 2 days, Phase 3 ≈ 3 days (both services), Phase 4 ≈ 5–7 days (the merge), Phase 5 ≈ 2–3 days, Phase 6 ≈ 3–4 days, Phase 7+8 ≈ 1 day.
|
||||
|
||||
---
|
||||
|
||||
## Testing strategy
|
||||
|
||||
No formal test suite exists today (per CLAUDE.md). For a refactor this size, that's a gap to close — but writing tests retroactively for 15K LOC of routes is a separate, larger project. For this refactor:
|
||||
|
||||
### Manual smoke testing per phase
|
||||
|
||||
A checklist of representative endpoints to hit after each deploy:
|
||||
|
||||
- `inventory-server`: `/api/products`, `/api/dashboard/overview`, `/api/analytics/revenue`, `/api/orders`, `/api/purchase-orders`, `/api/import/list-uploads`, `/api/config/global`
|
||||
- `dashboard-server`: `/api/klaviyo/campaigns`, `/api/meta/insights`, `/api/google-analytics/...`, `/api/typeform/responses`
|
||||
- `acot-server`: `/api/acot/...` (top-3 endpoints by call volume — pull from access logs)
|
||||
- `auth-server`: `/login`, `/me`, `/verify`
|
||||
|
||||
Each smoke test runs (a) without a token → expect 401, (b) with an invalid token → expect 401, (c) with a valid token → expect 2xx.
|
||||
|
||||
### Frontend integration check
|
||||
|
||||
After deploys, log into the SPA and exercise each major page (Overview, Products, Analytics, Dashboard, Klaviyo, Meta, etc.). If everything loads and dashboards populate, the auth + routing layer is intact.
|
||||
|
||||
### Test scaffold during Phase 2 (committed)
|
||||
|
||||
While building `shared/`, set up `vitest` (lightweight, ESM-native, fast) as the standard test runner for the repo. Initial coverage focuses on the security-critical surface only:
|
||||
|
||||
- `shared/auth/verify.js` — known good token, expired token, wrong-signature token, malformed token, missing token.
|
||||
- `shared/auth/middleware.js` — request with no header → 401; bad header → 401; valid token + inactive user → 403; valid token + missing permission → 403; valid token + correct permission → next() called with `req.user` populated.
|
||||
- `shared/auth/middleware.js` user-cache TTL: same token within 60s → one DB hit; same token after 61s → two DB hits.
|
||||
|
||||
`package.json` gets a `"test": "vitest run"` script at the repo root and per-service. Set up but don't backfill broader test coverage — that's a separate, larger project. The vitest scaffold gives future work a foothold; this refactor commits to having tests for the auth boundary specifically because that's what's load-bearing for the whole security model.
|
||||
|
||||
---
|
||||
|
||||
## Rollback strategy
|
||||
|
||||
Each phase produces an independently deployable state. Rollback per phase:
|
||||
|
||||
- **Phase 1**: re-add removed services to ecosystem; restore from git. Don't roll back data deletions — only do those after a week of stable production.
|
||||
- **Phases 3, 5**: ESM conversion is per-service; if one service breaks, `pm2 restart <name>` to the previous commit. Other services unaffected.
|
||||
- **Phase 4**: the dashboard-server merge is the highest-risk change. Plan: deploy `dashboard-server` to a non-conflicting port (3015) while leaving the old per-vendor servers running. Cut over Caddy routes one vendor at a time (start with Meta — smallest). If any vendor breaks, point Caddy back to the old server (still running) for that vendor, debug, retry. Only delete the old servers after all four are stable on `dashboard-server`.
|
||||
- **Phases 6, 7**: Caddy config is git-tracked. `git revert` + `caddy reload` rolls back in seconds. Auth changes are additive (defense in depth) — if `forward_auth` causes problems, comment it out and per-server middleware continues protecting routes.
|
||||
|
||||
---
|
||||
|
||||
## Out of scope (intentional)
|
||||
|
||||
These came up in the audit but aren't part of this refactor:
|
||||
|
||||
- `httpOnly` cookie auth (deferred — current `localStorage` acceptable for internal tool).
|
||||
- Replacing PM2 with systemd or Docker.
|
||||
- Test coverage beyond the auth-critical surface.
|
||||
- `apiv2`/`apiv2-test` proxies to `backend.acherryontop.com` — separate system, not touched.
|
||||
- `acot-phone-server` and `lt-wordlist-api` — staying as-is.
|
||||
- Centralized observability stack (Prometheus, Grafana). The logger work in Phase 6.5 sets up the data, but shipping it somewhere is future work.
|
||||
|
||||
---
|
||||
|
||||
## Concrete deliverables
|
||||
|
||||
When this is done:
|
||||
|
||||
- 4 application PM2 processes instead of 12 (plus 2 unchanged: acot-phone, lt-wordlist).
|
||||
- All `/api/*` and `/chat-api/*` requests gated at Caddy and re-verified at each upstream.
|
||||
- Sensitive endpoints additionally gated by per-permission checks.
|
||||
- One ESM standard across the entire Node codebase.
|
||||
- One shared `lib/` for auth, logging, DB, errors, CORS.
|
||||
- Login rate-limited.
|
||||
- `JWT_SECRET` rotated.
|
||||
- Old auth-server, Aircall, Gorgias, Clarity directories deleted from the repo.
|
||||
- Caddyfile slimmed to one auth-gated block.
|
||||
- Permission codes inserted into `permissions` table for granular authorization.
|
||||
- No half-finished pieces, no `// TODO: add auth later` comments, no deferred secrets cleanup.
|
||||
|
||||
---
|
||||
|
||||
## Deviations from original plan (recorded during execution)
|
||||
|
||||
These are decisions made during Phase 1/2 implementation that amend the spec above. Future phases should follow the deviated path, not the original sketch.
|
||||
|
||||
1. **`shared/` location.** Original plan placed `shared/` at the repo root as a sibling of `inventory/` and `acot-phone/`. Implemented at `inventory-server/shared/` (= `/var/www/inventory/shared/` on the server) instead. Reason: the actual project root *is* `/var/www/inventory/`; placing shared/ outside it would have meant building a deployment story for it that doesn't exist. Import paths change accordingly:
|
||||
- From `inventory-server/{auth,src,chat}/server.js` → `../shared/...`
|
||||
- From `inventory-server/dashboard/{vendor}-server/server.js` → `../../shared/...`
|
||||
|
||||
2. **`/verify` response headers.** Plan specified `X-User-Id` + `X-User-Is-Admin`. Implemented as `X-User-Id` + `X-User-Username` (both available from the JWT payload). `X-User-Is-Admin` was dropped because `is_admin` isn't in the JWT today and returning it would require a DB lookup — violating the "no DB hit" principle. To restore `X-User-Is-Admin`, enrich the JWT payload at login time (one-line change in `auth/routes.js`) during Phase 6, then echo from `/verify`. Upstreams don't trust these headers anyway (they re-verify), so the omission is informational, not security-relevant.
|
||||
|
||||
3. **User cache key in `shared/auth/middleware.js`.** Plan sketch mentioned "60s TTL keyed by token jti". Implemented as keyed by `userId` instead — the JWT doesn't currently include a `jti` claim, and the cache's invalidation semantics are "this user was deactivated/changed permissions" (per-user), not "this token was revoked" (per-token). The plan's pseudocode already used `loadUserCached(pool, decoded.userId)` so this matches the spirit.
|
||||
|
||||
4. **Redis client safety.** `shared/db/redis.js` sets `enableOfflineQueue: false` and `lazyConnect: true`. Plan didn't specify but these defaults mean a Redis hiccup fails fast (route fall-through to upstream API as designed in Phase 4 risk notes) rather than queueing commands indefinitely.
|
||||
|
||||
5. **CORS allowed origins kept `https://acot.site`.** Plan example listed three origins; production has acot.site as a redirect to tools.acherryontop.com but also reaches the API directly in some flows. Kept it to avoid breakage. LAN wildcards (`192.168.*`, `10.*`) and `Access-Control-Allow-Origin "*"` are NOT included in the new `shared/cors/policy.js` per the plan's Phase 6.6 spirit, but the legacy `inventory-server/src/middleware/cors.js` still has them until services are migrated to consume `shared/cors/`.
|
||||
|
||||
6. **Defunct permission codes left in DB.** Removed the `dashboard:gorgias` and `dashboard:calls` Protected blocks from the frontend, but the corresponding permission rows in the `permissions` table are still there (assigned to some users). They're inert (no UI references them) but should be cleaned up alongside the Phase 6.2 permissions migration.
|
||||
|
||||
7. **PM2 process names retained `new-auth-server` (not `auth-server`).** Plan's Phase 8 final form names it `auth-server` (after the legacy 3003 one is removed). Decided to keep the existing `new-auth-server` name through Phase 2 to avoid a rename mid-stream. Phase 8 can rename if desired, but it's cosmetic — all wiring is by port (3011) not name.
|
||||
|
||||
8. **Caddyfile changes via admin API on `:2020`.** The Caddyfile is owned by root and matt has no passwordless sudo. Cutover used `curl -X POST .../load` on the Caddy admin port (which matt can hit), then a separate `sudo cp /home/matt/Caddyfile.new /etc/caddy/Caddyfile` step to persist the on-disk file. Future Caddyfile changes can follow the same pattern. Backup convention: `/etc/caddy/Caddyfile.bak.YYYY-MM-DD`.
|
||||
|
||||
9. **Path-naming.** Plan uses `inventory/` as the top-level (server-side path convention). Locally the equivalent is `inventory-server/`. Whenever the plan says `inventory/dashboard/foo/`, read that as `/var/www/inventory/dashboard/foo/` on the server or `inventory-server/dashboard/foo/` locally.
|
||||
Reference in New Issue
Block a user