diff --git a/FORECAST_FIX_PLAN.md b/FORECAST_FIX_PLAN.md new file mode 100644 index 0000000..95e1f09 --- /dev/null +++ b/FORECAST_FIX_PLAN.md @@ -0,0 +1,343 @@ +# Forecast Accuracy Fix Plan + +**Written:** 2026-06-10, from a code + live-data review of the forecasting pipeline. +**Goal:** eliminate the systematic ~1.7–2x over-forecast bias, recover demand the model currently ignores, and fix the accuracy measurement so improvements are visible and long-lead forecasts are validated. + +Read this whole document before starting. Fixes are grouped into phases; each phase is independently deployable and has its own validation step. Line numbers are as of 2026-06-10 — re-locate by function name if the file has drifted. + +--- + +## 1. Diagnosis summary (measured 2026-06-10) + +The dashboard headline is **202% WMAPE**. Decomposition of that number, all measured against `forecast_accuracy` run 129 and ad-hoc queries: + +| Finding | Evidence | +|---|---| +| Daily-grain WMAPE has a ~190% *floor* for this catalog | Avg demand ≈ 0.11 units/product/day. A perfect rate forecast of intermittent demand scores ≈ 2e^−λ ≈ 190%. A trivial trailing-30d-average naive forecast scores **204%** on the same products/days; the engine scores 221% (slightly *worse than naive*). | +| Same forecasts at 21-day-per-product grain: **109%**; bias-corrected: **75%** | Half the headline is metric grain, most of the rest is bias. | +| Aggregate over-forecast **+70%** (227,690 forecast vs 133,861 actual units) | Portfolio daily ratio is 1.5–2.5x on most days. | +| Decay phase 2.47x over (fc 51,675 / act 20,915) | Root cause F1: velocity inflated **4.07x** (measured: 1.353 vs true 0.332 units/day) by averaging over sparse snapshot rows. | +| Preorder phase 2.15x over (fc 67,212 / act 31,189) | Root cause F4: launch curve applied at age=0 starting *today*, ignoring that the product hasn't arrived. 
| +| Mature phase 1.69x over (fc 57,857 / act 34,313) | Root causes F2 (history edge truncation) + F3 (seasonal double-count). | +| Dormant products sold **16,180 units** (~11% of demand) against zero forecasts | Root cause F5; also excluded from the headline metric, so invisible. | +| All 879,800 accuracy samples are in the **1–7d lead bucket** | Root cause F7: archiving design only ever saves yesterday's slice. 30–90d forecasts (what purchasing uses) are never validated. | +| Launch phase is healthy: WMAPE 100%, bias −6%, beats naive | The lifecycle-curve concept works; its calibration inputs are broken. Don't redesign it. | + +**Key data fact** underlying several fixes: `daily_product_snapshots` is **activity-based and sparse** — only ~500–1,800 of ~38K products have a row on a given day. Verified: every pid-day with an order DOES have a snapshot row and units match (5,234/5,234 pid-days, 8,980 vs 8,984 units over 7 days). So *missing row = zero sales*, and any query that aggregates over only the rows that exist is averaging over sold-days. + +--- + +## 2. Environment & operational notes + +- **Files:** engine is `inventory-server/scripts/forecast/forecast_engine.py`; orchestrator `run_forecast.js` in the same dir; consumer endpoints in `inventory-server/src/routes/dashboard.js` (`/forecast/metrics` ~line 308, `/forecast/accuracy` ~line 647); overview UI in `inventory/src/components/overview/ForecastMetrics.tsx` and `ForecastAccuracy.tsx`. +- **Local `inventory-server/` is NFS-mounted to `/var/www/inventory/` on the netcup server.** Edits made locally appear on the server immediately — no copy step. Do NOT run bulk `grep`/`find`/`node --check` over `inventory-server/` locally (the mount hangs); `ssh netcup` and run them there. +- **Avoid the glob tool** for search in this repo; use bash (`grep`/`rg` via ssh for server-side trees). 
+- **Scheduling:** the engine runs daily at **09:30:01 server time** (runs table is conclusive), but the cron entry is NOT in matt's crontab, `/etc/cron.d`, or pm2. Likely root's crontab (`sudo crontab -l` to confirm). You do not need to touch the schedule for these fixes; just know a run fires at 09:30 daily and occasionally skips days (e.g. 2026-06-07/08). +- **Manual test runs:** `ssh netcup`, then `cd /var/www/inventory/scripts/forecast && node run_forecast.js`. Takes ~3.5–4 min. Safe to run any time: the engine TRUNCATEs and rebuilds `product_forecasts`, archives prior past-dated rows, and records a new `forecast_runs` row. Python deps live in the server venv (`venv/`); `run_forecast.js` handles env + venv automatically. +- **DB access for validation:** `ssh netcup`, then `PGPASSWORD=6D3GUkxuFgi2UghwgnUd psql -h localhost -U inventory_readonly -d inventory_db`. The engine itself connects with the write user via env vars loaded from `/var/www/inventory/.env` — schema changes should be made idempotently *inside the engine code* (the file already uses `CREATE TABLE IF NOT EXISTS` / `CREATE INDEX IF NOT EXISTS`; use `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` the same way) so no manual migration is needed. +- **Python gotchas already handled in this file (don't regress):** numpy types must go through the registered psycopg2 adapters; `pd.Series.combine_first()` keeps zeros over real data — use `reindex(..., fill_value=0.0)`. +- Engine runtime budget: currently ~212–227s. Phases 1–2 shouldn't move it meaningfully; Phase 3's extra archiving adds one INSERT…SELECT. If runtime balloons past ~6 min, investigate before shipping. +- `--backfill` mode (`backfill_accuracy_data`) is an in-sample backtest using the *old* formulas. **Do not run it anymore**; there is enough real out-of-sample history. Updating it to match the new logic is optional/low priority (F11). + +--- + +## Phase 1 — Bias bugs in the engine (no schema changes) + +### F1. 
Decay velocity: stop averaging over sparse snapshot rows + +**Where:** `forecast_engine.py`, `batch_load_product_data()`, the decay query (~lines 697–710). + +**Problem:** `AVG(COALESCE(dps.units_sold, 0))` runs over only the snapshot rows that exist — mostly sold-days. Measured inflation on the current 975 decay products: **4.07x** (1.353 vs 0.332 true units/day). This feeds `compute_scale_factor()` for the decay phase and is the single largest bias source. + +**Fix:** divide the sum by calendar days in the window, clipped to the product's age (decay products are 14–60 days old, so a 20-day-old product's window is 20 days, not 30): + +```sql +SELECT dps.pid, + SUM(COALESCE(dps.units_sold, 0))::float + / GREATEST(LEAST(30, (CURRENT_DATE - pm.date_first_received::date)), 1) AS avg_daily +FROM daily_product_snapshots dps +JOIN product_metrics pm ON pm.pid = dps.pid +WHERE dps.pid = ANY(%s) + AND dps.snapshot_date >= CURRENT_DATE - INTERVAL '30 days' + AND dps.snapshot_date >= pm.date_first_received::date +GROUP BY dps.pid, pm.date_first_received +``` + +No Python-side changes needed; `data['decay_velocity']` keeps the same shape. Products with zero snapshot rows in the window still get no entry → existing `scale = 1.0` fallback applies (acceptable: decay classification requires `sales_velocity_daily > 0`, so truly dead products don't reach this path). + +### F2. Mature history: reindex over the full calendar window + +**Where:** `forecast_engine.py`, `forecast_mature()` (~lines 833–836). + +**Problem:** `hist.set_index('snapshot_date').resample('D').sum()` only spans first-snapshot → last-snapshot. Interior gaps correctly become zeros, but **leading and trailing quiet periods are absent**, so the Holt level is fitted on the product's busy span. A marginal mature product whose activity clusters in 2 of the last 8 weeks gets a level ~4x too high. 
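To make the edge-truncation effect concrete, here is a minimal sketch on synthetic data. The dates, the 56-day window, and the 2 units/day figure are illustrative assumptions, not values taken from the engine. It compares the span that `resample('D')` covers with the full calendar window for a product whose activity sits in 2 of the last 8 weeks.

```python
import pandas as pd

# Hypothetical 8-week calendar window ending "yesterday" (dates are made up).
window_days = 56
full_index = pd.date_range(end=pd.Timestamp("2026-06-09"),
                           periods=window_days, freq="D")

# Sparse snapshot rows: the product only has rows on 14 consecutive active
# days in the middle of the window, selling 2 units on each of them.
active_days = full_index[21:35]
hist = pd.Series(2.0, index=active_days)

# resample('D') spans only first-snapshot -> last-snapshot, so the quiet
# weeks before and after the busy span never enter the series.
busy_span = hist.resample("D").sum()

# Reindexing over the full calendar window fills every quiet day with 0.
full_window = hist.reindex(full_index, fill_value=0.0)

print(len(busy_span), busy_span.mean())      # days covered, mean units/day
print(len(full_window), full_window.mean())  # days covered, mean units/day
print(busy_span.mean() / full_window.mean()) # inflation of the fitted level
```

On this toy series the busy-span mean is four times the full-window mean, which is the same order as the ~4x level inflation described above. Real products will differ depending on how their activity is distributed.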
+ +**Fix:** replace the resample with an explicit reindex over the full `EXP_SMOOTHING_WINDOW` ending yesterday: + +```python +hist = history_df.copy() +hist['snapshot_date'] = pd.to_datetime(hist['snapshot_date']) +hist = hist.set_index('snapshot_date')['units_sold'] +full_index = pd.date_range( + end=pd.Timestamp(date.today() - timedelta(days=1)), + periods=EXP_SMOOTHING_WINDOW, freq='D') +series = hist.reindex(full_index, fill_value=0.0).values.astype(float) +``` + +Notes: (pid, snapshot_date) is unique in `daily_product_snapshots`, so no duplicate-index risk. `observed_mean` and the `cap` recompute over the full window automatically (intended — the cap gets correspondingly tighter). Mature products are by definition >60 days old, so the 60-day window never predates first receipt. Do NOT use `combine_first` (see gotchas above). + +### F3. Stop double-applying the monthly seasonal index + +**Where:** `forecast_engine.py`, `generate_all_forecasts()` — the `seasonal_multipliers` pre-compute (~lines 959–961) and application (~line 1050). + +**Problem:** every per-product calibration (decay velocity, mature Holt level, launch first-week scale, preorder rate, slow-mover velocity) is fitted on *raw recent actuals*, which already embed the current month's seasonal level. The forecast then multiplies by the **absolute** monthly index of the target date. Example from the live indices (`forecast_runs.phase_counts` for run 129): May = 1.224 (sale month), June = 0.982. Early-June forecasts were calibrated on May-sale-inflated velocities and barely discounted — a structural ~25% over-forecast at that transition, and it'll be worse around November (1.316). + +**Fix:** apply the seasonal index *relative to the calibration period*. 
Compute a calibration index as the average monthly index over the trailing 30 calendar days (robust at month boundaries), then divide: + +```python +today = date.today() +trailing = [today - timedelta(days=i) for i in range(1, 31)] +calibration_index = float(np.mean([monthly_indices.get(d.month, 1.0) for d in trailing])) +seasonal_multipliers = [ + monthly_indices.get(d.month, 1.0) / max(calibration_index, 0.1) + for d in forecast_dates +] +``` + +Leave the DOW multipliers absolute — every calibration is a multi-week average and therefore DOW-neutral, so reshaping by absolute DOW indices is correct. + +**Optional sub-fix (same area, low priority):** the monthly indices are computed from a single trailing 365-day window, so each month appears once and YoY growth contaminates "seasonality". A cheap improvement is widening `SEASONAL_LOOKBACK_DAYS` to 730 and averaging the two observations of each month. Do this only after the main fixes are validated. + +### Phase 1 validation + +Deploy (edit locally; NFS propagates), run the engine manually once, wait for 3–5 daily cycles, then: + +```sql +-- Portfolio ratio per day (target: drifts from ~2.0 toward 0.8–1.3) +WITH ranked AS ( + SELECT pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.lifecycle_phase, + ROW_NUMBER() OVER (PARTITION BY pfh.pid, pfh.forecast_date ORDER BY fr.started_at DESC) rn + FROM product_forecasts_history pfh + JOIN forecast_runs fr ON fr.id = pfh.run_id + WHERE pfh.forecast_date >= CURRENT_DATE - 7) +SELECT r.forecast_date, round(SUM(r.forecast_units),0) AS fc, + SUM(COALESCE(dps.units_sold,0)) AS act, + round(SUM(r.forecast_units)/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS ratio +FROM ranked r +LEFT JOIN daily_product_snapshots dps ON dps.pid = r.pid AND dps.snapshot_date = r.forecast_date +WHERE r.rn = 1 AND r.lifecycle_phase != 'dormant' +GROUP BY 1 ORDER BY 1; +``` + +Also check `forecast_accuracy` `by_phase` rows for the newest run: decay bias should fall from +0.35 toward ~0, mature 
from +0.17 toward ~0. (Accuracy lags ~1 day behind each fix since it evaluates yesterday's forecasts.) + +--- + +## Phase 2 — Demand the model currently ignores or mistimes + +### F4. Preorder: forecast the preorder rate until arrival, launch curve after + +**Where:** `forecast_engine.py` — `batch_load_product_data()` (add arrival dates), `generate_all_forecasts()` preorder branch (~lines 1005–1009), and `forecast_from_curve()` (or a small wrapper). + +**Problem:** preorder products run the launch curve from `age=0` starting **today**, i.e. full first-week launch sales while the product is still weeks from arriving. Actual preorder-period sales are a much slower trickle. + +**Fix:** + +1. Batch-load each preorder product's expected arrival from `purchase_orders` (line-item grain: it has `pid` and `expected_date` directly). Open statuses verified against live data: `created`, `ordered`, `electronically_sent`, `receiving_started` (~705 open line items currently have a future `expected_date`): + +```sql +SELECT pid, MIN(expected_date) AS expected_arrival +FROM purchase_orders +WHERE pid = ANY(%s) + AND status IN ('created', 'ordered', 'electronically_sent', 'receiving_started') + AND expected_date IS NOT NULL + AND expected_date >= CURRENT_DATE +GROUP BY pid +``` + +Fallbacks, in order: (a) an open PO with a *past* `expected_date` → assume arrival in 7 days; (b) no PO at all → arrival in 14 days (and log a counter of how many hit this default). + +2. In the preorder branch, build the daily array piecewise. Let `days_until_arrival = (expected_arrival - today).days`: + - Days `0 .. days_until_arrival-1`: flat observed preorder daily rate = `preorder_sales[pid] / max(preorder_days[pid], 1)` (both already batch-loaded), clamped to ≤ the curve's scaled week-0 daily value. + - Days `days_until_arrival .. horizon`: `forecast_from_curve(curve_info, scale, age_days=0, ...)` shifted so the curve's day 0 lands on the arrival date (i.e. 
pass `horizon_days - days_until_arrival` and offset into the output array). + - Keep the existing `compute_scale_factor('preorder', ...)` for the post-arrival curve; the pre-arrival segment doesn't use it. + +This is consistent with how the reference curves were built: historical preorder units were recorded on their **order dates** (pre-arrival), so week-0 of the fitted curves reflects post-receipt orders, not the backlog. + +### F5. Dormant products: small positive rate instead of hard zero, and count them + +**Where:** `forecast_engine.py` — `generate_all_forecasts()` dormant branch (~lines 1040–1042), `batch_load_product_data()`, and `compute_accuracy()`. + +**Problem:** all ~28K dormant products are forecast at exactly 0, yet they sold 16,180 units in the eval window (~11% of all demand) — restocks, promos, long-tail. Worse, dormant is *excluded* from the headline accuracy filter, so this miss is invisible. + +**Fix (cheap version, do this now):** + +1. Batch-load a trailing-180-day order rate for dormant products (11,362 of them have ≥1 sale in 180d — verified): + +```sql +SELECT o.pid, SUM(o.quantity) / 180.0 AS rate +FROM orders o +WHERE o.pid = ANY(%s) + AND o.canceled IS DISTINCT FROM TRUE + AND o.date >= CURRENT_DATE - INTERVAL '180 days' +GROUP BY o.pid +``` + +2. Dormant branch: if the product has a rate > 0, forecast it flat with `method = 'velocity'`; else keep zeros with `method = 'zero'`. Apply the same DOW/seasonal multipliers as everything else (automatic — they're applied after the branch). +3. In `compute_accuracy()`, add a second overall row: `metric_type='overall', dimension_value='all_incl_dormant'` with no dormant filter (keep the existing `'all'` row unchanged for trend continuity). One extra entry in the `dimensions`/`filter_clauses` dicts. + +**Upgrade path (optional, Phase 4):** replace flat rates for `slow_mover` + dormant-with-sales with TSB (Teunter–Syntetos–Babai), the standard intermittent-demand method with obsolescence handling. 
Per product over a daily series `d_t` (build it from snapshots the F2 way — full calendar reindex): + +``` +if d_t > 0: p_t = p_{t-1} + β·(1 − p_{t-1}); z_t = z_{t-1} + α·(d_t − z_{t-1}) +else: p_t = p_{t-1}·(1 − β); z_t = z_{t-1} +forecast = p_T · z_T (flat across horizon) +``` + +Start with α=0.1, β=0.05, initialize p = (nonzero days / total days), z = mean of nonzero demands. Scope: slow_mover (~6K) + dormant with 180d sales (~11K); series from up to 180 days of snapshots (sparse rows → ~manageable volume). Only do this after Phase 3 measurement exists to prove it beats the flat rates. + +### Phase 2 validation + +After 3–5 cycles: preorder `by_phase` bias should drop from +0.85 toward < +0.3; the new `all_incl_dormant` row should appear and its `total_actual_units` minus `'all'`'s should be largely *covered* rather than all-miss (dormant `bias` rising from −1.36 toward ~−0.3 or better). + +--- + +## Phase 3 — Fix the measurement (schema + engine + API + UI) + +> Without this phase you cannot see whether Phases 1–2 worked except by ad-hoc SQL, the lead-time chart stays a single bucket forever, and the dashboard keeps displaying a number with a 190% floor in red. + +### F7. Archive long-lead forecasts so 15/30/60/90d accuracy exists + +**Where:** `forecast_engine.py` — `archive_forecasts()` (~lines 1086–1154), `compute_accuracy()` CTE (~lines 1201–1228). + +**Problem:** the current design archives only *past-dated* rows of the previous run before truncation. With daily runs, that's only ever the 1-day-ahead slice — all 879,800 accuracy samples sit in the '1-7d' bucket and the longer buckets in the UI chart can never populate. Purchasing decisions ride on 30–60d forecasts that are never validated. + +**Fix:** + +1. Keep the existing past-date archiving exactly as is (it provides dense short-lead coverage). +2. 
After `generate_all_forecasts()` completes, additionally archive a **sampled set of future leads** from the new run, non-dormant only, attributed to the *current* run id (correct attribution, unlike the past-date path which attributes to the previous run): + +```sql +INSERT INTO product_forecasts_history + (run_id, pid, forecast_date, forecast_units, forecast_revenue, + lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at) +SELECT %(run_id)s, pid, forecast_date, forecast_units, forecast_revenue, + lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at +FROM product_forecasts +WHERE lifecycle_phase != 'dormant' + AND forecast_date - CURRENT_DATE IN (7, 14, 30, 60, 89) +ON CONFLICT (run_id, pid, forecast_date) DO NOTHING +``` + +Volume: ~10K non-dormant products × 5 leads ≈ 50K rows/day; the existing 90-day prune (`forecast_date < CURRENT_DATE - 90`) bounds steady state at a few million rows. Note future-dated rows survive until their date passes + 90 days — that's intended. + +3. **CRITICAL companion change** in `compute_accuracy()`: the accuracy CTE must now exclude not-yet-realized rows, or future-dated archives get scored against actual=0: + +```sql +FROM product_forecasts_history pfh +JOIN forecast_runs fr ON fr.id = pfh.run_id +WHERE pfh.forecast_date < CURRENT_DATE -- ADD THIS +``` + +4. **Dedup semantics change.** Today's `ROW_NUMBER() OVER (PARTITION BY pid, forecast_date ORDER BY started_at DESC)` keeps only the latest (= shortest-lead) row per pid/date, which would silently discard all the new long-lead rows. Restructure: + - Compute `lead_days = forecast_date - started_at::date` and the lead bucket *inside* `ranked_history`. + - For `by_lead_time`: dedup `PARTITION BY pid, forecast_date, lead_bucket` (one sample per pid/date/bucket, latest run wins within a bucket). 
+ - For everything else (`overall`, `by_phase`, `by_method`, `daily`, and the new weekly metric below): restrict to `lead_days BETWEEN 0 AND 6` and keep the existing per-(pid, date) dedup. This preserves the current meaning of the headline metrics (short-lead) while the lead-time table becomes real. + +### F8. Track a naive baseline (forecast value-added) + +**Where:** `archive_forecasts()` (both INSERT paths), `compute_accuracy()`, `forecast_accuracy` schema, `/forecast/accuracy` endpoint. + +**Problem:** the engine currently *loses* to a trailing-average naive forecast (221% vs 204% daily WMAPE) and nothing on the dashboard would ever reveal that. Every accuracy improvement should be judged as value-over-naive. + +**Fix:** + +1. Schema (idempotent, in the ensure blocks): `ALTER TABLE product_forecasts_history ADD COLUMN IF NOT EXISTS naive_units NUMERIC(10,2);` and `ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS naive_wmape NUMERIC(10,4), ADD COLUMN IF NOT EXISTS fva NUMERIC(10,4);` +2. Populate `naive_units` during both archive INSERTs via a join — naive = flat trailing-28-day average daily units as of archive time (28 days = DOW-balanced; information available at generation; same value at every lead, which is exactly what a naive baseline means): + +```sql +LEFT JOIN ( + SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily + FROM orders o + WHERE o.canceled IS DISTINCT FROM TRUE + AND o.date >= CURRENT_DATE - INTERVAL '28 days' AND o.date < CURRENT_DATE + GROUP BY o.pid +) nv ON nv.pid = pf.pid +-- select COALESCE(nv.naive_daily, 0) AS naive_units +``` + +3. In `compute_accuracy()`, add to each dimension's aggregate: `SUM(ABS(naive_units - actual_units)) / NULLIF(SUM(actual_units),0) AS naive_wmape` and store `fva = 1 - wmape / naive_wmape` (NULL-safe). Rows archived before this change have `naive_units` NULL — treat NULL as excluded (`FILTER (WHERE naive_units IS NOT NULL)` on the naive sums) rather than as zero. +4. 
Endpoint: include `naiveWmape` and `fva` in the `overall` (and per-phase) payload of `/dashboard/forecast/accuracy` in `dashboard.js`. + +### F9. Weekly-grain headline metric + bias as a percentage + +**Where:** `compute_accuracy()`, `/forecast/accuracy` endpoint, `ForecastAccuracy.tsx`. + +**Problem:** daily-grain WMAPE on this catalog has a ~190% floor — as a headline it's noise. The informative numbers are (a) weekly-per-product WMAPE (currently ~109%, target ~70–85% post-fix) and (b) aggregate bias, which the UI currently renders as `+0.108 units` — indistinguishable from zero while the reality is +70%. + +**Fix:** + +1. New metric in `compute_accuracy()`: `metric_type='overall_weekly', dimension_value='all'`. Definition: using the short-lead deduped rows (lead ≤ 6, non-dormant), aggregate per `(pid, date_trunc('week', forecast_date))` keeping only complete weeks (`COUNT(*) = 7`), then `WMAPE = SUM(ABS(fc_week − act_week)) / SUM(act_week)`, excluding pid-weeks where both are 0. Store sample_size = number of pid-weeks. Compute `naive_wmape`/`fva` the same way from `naive_units`. +2. Endpoint: expose as `overallWeekly`; also add a weekly variant to the `accuracyTrend` query (`metric_type='overall_weekly'`). The trend will start empty (old runs lack the row) — that's fine; don't backfill. +3. `ForecastAccuracy.tsx`: + - Headline WMAPE → `overallWeekly.wmape`, labeled "WMAPE (weekly)". Keep daily WMAPE available in a tooltip if desired. + - Color thresholds for weekly grain: green ≤ 60, yellow ≤ 90, red above (tunable; document that they're calibrated for intermittent retail demand). + - Replace the bias row: show `(totalForecast / totalActual − 1)` as a signed percentage labeled "Forecast vs actual" (both totals already arrive in `overall`). Keep MAE. + - Add a "vs naive" line: naive weekly WMAPE and FVA. FVA > 0 = engine adds value. 
+ - The lead-time chart needs no code change — buckets will populate as F7 rows mature (7d lead evaluable after 7 days, 30d after 30, etc.). +4. `confidenceLevel` in `/forecast/metrics` (`dashboard.js` ~line 360) is "share of products forecast via lifecycle curves", not confidence. It only feeds a per-day tooltip field — rename the JSON field to `curveCoverage` and update the one consumer in `ForecastMetrics.tsx`, or leave it and add a comment; low priority. + +### Phase 3 validation + +- Next run after deploy: `forecast_accuracy` contains `overall_weekly` and `fva` values; `/dashboard/forecast/accuracy` returns them; the overview popover renders weekly WMAPE, bias %, and the naive comparison. +- After 14/30/60 days: `by_lead_time` rows appear for the '8-14d', '15-30d', '31-60d' buckets respectively (61-90d after ~89 days), because each sampled lead (14, 30, 60, 89) is only evaluable once its forecast date has passed. +- Confirm engine runtime still < ~5 min and `product_forecasts_history` growth ≈ 50–70K rows/day. + +--- + +## Phase 4 — Optional / after the above is proven + +- **F6. TSB for slow movers + dormant** (spec in F5). Gate on Phase 3 measurement: ship only if weekly FVA improves on those phases. +- **F10. Confidence-margin source:** `load_accuracy_margins()` feeds daily-grain per-phase WMAPE (clamped to 1.0) into the intervals, so every interval is ±100% — uninformative. Once `overall_weekly` exists, add per-phase weekly rows (`by_phase_weekly`) and source margins from those instead. +- **F11.** Update or delete `backfill_accuracy_data()` (it encodes the old formulas). Until then, just don't run `--backfill`. +- **F12.** `compute_dow_indices()` weights by revenue but the multipliers are applied to units — switch `SUM(o.price * o.quantity)` to `SUM(o.quantity)`. Tiny effect. +- **F13.** Longer term: for reorder decisions the right target is P(lead-time demand > stock), not a point forecast. Evaluate quantile (pinball) loss at lead-time horizons using the existing confidence-interval columns. Design separately. + +--- + +## 4. Success criteria + +1. 
Rolling-14-day portfolio forecast/actual ratio within **0.8–1.25** (currently 1.5–2.5). +2. Weekly-grain WMAPE ≤ **90%** and **FVA > 0** (engine beats naive) sustained for 2+ weeks. +3. Decay/preorder/mature per-phase bias within ±0.1 units/day (currently +0.35 / +0.85 / +0.17). +4. `all_incl_dormant` actuals covered: dormant bias better than −0.4 (currently −1.36, i.e. 100% miss). +5. Lead-time buckets through 31–60d populated with ≥10K samples each within ~9 weeks (the 60d sampled lead, the only one that lands in the 31–60d bucket, needs 60 days to mature). +6. Launch phase stays healthy (bias within ±0.15, WMAPE not degraded) — regression guard for F3/F4 changes. + +## 5. Re-measurement appendix + +The naive-vs-engine comparison used in the diagnosis (rerun any time; adjust dates): + +```sql +WITH ranked AS ( + SELECT pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.lifecycle_phase, + ROW_NUMBER() OVER (PARTITION BY pfh.pid, pfh.forecast_date ORDER BY fr.started_at DESC) rn + FROM product_forecasts_history pfh + JOIN forecast_runs fr ON fr.id = pfh.run_id + WHERE pfh.forecast_date BETWEEN CURRENT_DATE - 9 AND CURRENT_DATE - 1), +eng AS (SELECT * FROM ranked WHERE rn = 1 AND lifecycle_phase != 'dormant'), +naive AS ( + SELECT o.pid, SUM(o.quantity)/30.0 AS naive_daily FROM orders o + WHERE o.canceled IS DISTINCT FROM TRUE + AND o.date >= CURRENT_DATE - 39 AND o.date < CURRENT_DATE - 9 + GROUP BY o.pid) +SELECT e.lifecycle_phase, COUNT(*) AS n, SUM(COALESCE(dps.units_sold,0)) AS actual, + round(SUM(e.forecast_units),0) AS engine_fc, round(SUM(COALESCE(nv.naive_daily,0)),0) AS naive_fc, + round(SUM(ABS(e.forecast_units - COALESCE(dps.units_sold,0)))/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS engine_wmape, + round(SUM(ABS(COALESCE(nv.naive_daily,0) - COALESCE(dps.units_sold,0)))/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS naive_wmape +FROM eng e +LEFT JOIN naive nv ON nv.pid = e.pid +LEFT JOIN daily_product_snapshots dps ON dps.pid = e.pid AND dps.snapshot_date = e.forecast_date +GROUP BY ROLLUP(e.lifecycle_phase) ORDER BY 1; +``` + +Baseline 
numbers to beat (June 1–9, 2026): engine 221% / naive 204% daily WMAPE; engine_fc/actual = 1.82; per-phase table in §1. diff --git a/inventory-server/scripts/forecast/__pycache__/forecast_engine.cpython-312.pyc b/inventory-server/scripts/forecast/__pycache__/forecast_engine.cpython-312.pyc new file mode 100644 index 0000000..ba931b4 Binary files /dev/null and b/inventory-server/scripts/forecast/__pycache__/forecast_engine.cpython-312.pyc differ diff --git a/inventory-server/scripts/forecast/forecast_engine.py b/inventory-server/scripts/forecast/forecast_engine.py index 22abcb0..3e10097 100644 --- a/inventory-server/scripts/forecast/forecast_engine.py +++ b/inventory-server/scripts/forecast/forecast_engine.py @@ -634,6 +634,52 @@ def forecast_from_curve(curve_params, scale_factor, age_days, horizon_days): return np.array(forecasts) +def forecast_preorder(curve_params, scale_factor, days_until_arrival, + preorder_daily_rate, horizon_days): + """ + Piecewise pre-order forecast: a flat observed pre-order trickle until the + product is expected to arrive, then the scaled launch curve from age 0. + + The launch curve was fit on POST-receipt order history, so running it from + today (while the product is still weeks from arriving) front-loads full + first-week launch volume that hasn't happened yet — the main driver of the + ~2.15x preorder over-forecast. Instead we forecast the slow pre-order rate + up to the arrival date, then start the curve's day 0 on that date. + See FORECAST_FIX_PLAN F4. + + Args: + curve_params: (amplitude, decay_rate, baseline, ...) 
weekly curve + scale_factor: per-product multiplier for the post-arrival curve envelope + days_until_arrival: calendar days from today until expected arrival + preorder_daily_rate: observed pre-order units/day (trickle) + horizon_days: forecast horizon length + + Returns: + array of daily forecast values of length horizon_days + """ + amplitude, decay_rate, baseline = curve_params[:3] + forecasts = np.zeros(horizon_days) + + # Clamp the arrival offset into the horizon + dua = int(max(0, min(days_until_arrival, horizon_days))) + + # Pre-arrival segment: flat pre-order trickle, capped at the curve's scaled + # week-0 daily value (a pre-order day shouldn't out-sell the launch peak). + if dua > 0: + week0_daily = (amplitude / 7.0) * scale_factor + (baseline / 7.0) + pre_rate = preorder_daily_rate + if week0_daily > 0: + pre_rate = min(pre_rate, week0_daily) + forecasts[:dua] = max(0.0, pre_rate) + + # Post-arrival segment: scaled launch curve, curve day 0 = arrival date. + if dua < horizon_days: + curve_part = forecast_from_curve(curve_params, scale_factor, 0, horizon_days - dua) + forecasts[dua:] = curve_part + + return forecasts + + # --------------------------------------------------------------------------- # Batch data loading (eliminates N+1 per-product queries) # --------------------------------------------------------------------------- @@ -651,9 +697,11 @@ def batch_load_product_data(conn, products): data = { 'preorder_sales': {}, 'preorder_days': {}, + 'preorder_arrival_days': {}, 'launch_sales': {}, 'decay_velocity': {}, 'mature_history': {}, + 'dormant_rate': {}, } # Pre-order sales: orders placed BEFORE first received date @@ -677,6 +725,39 @@ def batch_load_product_data(conn, products): data['preorder_days'][int(row['pid'])] = float(row['preorder_days']) log.info(f"Batch loaded pre-order sales for {len(data['preorder_sales'])}/{len(preorder_pids)} preorder products") + # Expected arrival per pre-order product, to time the launch curve. 
+ # Prefer the soonest FUTURE expected_date on an open PO; if the only open + # PO has a past expected_date assume 7 days; if there's no open PO at all + # assume 14 days. See FORECAST_FIX_PLAN F4. + arrival_sql = """ + SELECT pid, + MIN(expected_date) FILTER ( + WHERE expected_date IS NOT NULL AND expected_date >= CURRENT_DATE + ) AS future_arrival + FROM purchase_orders + WHERE pid = ANY(%s) + AND status IN ('created', 'ordered', 'electronically_sent', 'receiving_started') + GROUP BY pid + """ + adf = execute_query(conn, arrival_sql, [preorder_pids]) + today = date.today() + for _, row in adf.iterrows(): + pid = int(row['pid']) + fa = row['future_arrival'] + if pd.notna(fa): + fa_date = pd.Timestamp(fa).date() + data['preorder_arrival_days'][pid] = max(0, (fa_date - today).days) + else: + data['preorder_arrival_days'][pid] = 7 # open PO, expected_date already past + no_po = 0 + for pid in preorder_pids: + if int(pid) not in data['preorder_arrival_days']: + data['preorder_arrival_days'][int(pid)] = 14 # no open PO at all + no_po += 1 + log.info(f"Batch loaded preorder arrival for " + f"{len(data['preorder_arrival_days']) - no_po}/{len(preorder_pids)} via open POs, " + f"{no_po} defaulted to 14d") + # Launch sales: first 14 days after first received launch_pids = products[products['phase'] == 'launch']['pid'].tolist() if launch_pids: @@ -694,15 +775,23 @@ def batch_load_product_data(conn, products): data['launch_sales'][int(row['pid'])] = float(row['total_sold']) log.info(f"Batch loaded launch sales for {len(data['launch_sales'])}/{len(launch_pids)} launch products") - # Decay recent velocity: average daily sales over last 30 days + # Decay recent velocity: TRUE calendar-daily average over the last 30 days. + # We divide the summed units by calendar days (clipped to the product's age), + # NOT by the number of snapshot rows. 
Snapshots are sparse and mostly land on + # sold-days, so AVG(units_sold) averaged over sold-days only and inflated the + # decay rate ~4x (measured 1.353 vs true 0.332 units/day). See FORECAST_FIX_PLAN F1. decay_pids = products[products['phase'] == 'decay']['pid'].tolist() if decay_pids: sql = """ - SELECT dps.pid, AVG(COALESCE(dps.units_sold, 0)) AS avg_daily + SELECT dps.pid, + SUM(COALESCE(dps.units_sold, 0))::float + / GREATEST(LEAST(30, (CURRENT_DATE - pm.date_first_received::date)), 1) AS avg_daily FROM daily_product_snapshots dps + JOIN product_metrics pm ON pm.pid = dps.pid WHERE dps.pid = ANY(%s) AND dps.snapshot_date >= CURRENT_DATE - INTERVAL '30 days' - GROUP BY dps.pid + AND dps.snapshot_date >= pm.date_first_received::date + GROUP BY dps.pid, pm.date_first_received """ df = execute_query(conn, sql, [decay_pids]) for _, row in df.iterrows(): @@ -724,6 +813,25 @@ def batch_load_product_data(conn, products): data['mature_history'][int(pid)] = group.copy() log.info(f"Batch loaded history for {len(data['mature_history'])}/{len(mature_pids)} mature products") + # Dormant trailing order rate: dormant products forecast 0 by default, but + # ~11K of them still sell (restocks, promos, long-tail), i.e. ~11% of all demand + # currently forecast as a hard zero. Load a trailing-180-day daily order rate + # so the dormant branch can carry a small positive rate. See FORECAST_FIX_PLAN F5.
+ dormant_pids = products[products['phase'] == 'dormant']['pid'].tolist() + if dormant_pids: + sql = """ + SELECT o.pid, SUM(o.quantity) / 180.0 AS rate + FROM orders o + WHERE o.pid = ANY(%s) + AND o.canceled IS DISTINCT FROM TRUE + AND o.date >= CURRENT_DATE - INTERVAL '180 days' + GROUP BY o.pid + """ + df = execute_query(conn, sql, [dormant_pids]) + for _, row in df.iterrows(): + data['dormant_rate'][int(row['pid'])] = float(row['rate']) + log.info(f"Batch loaded dormant order rate for {len(data['dormant_rate'])}/{len(dormant_pids)} dormant products") + return data @@ -829,11 +937,20 @@ def forecast_mature(product, history_df): # Not enough data — flat velocity return np.full(FORECAST_HORIZON_DAYS, velocity) - # Fill date gaps with 0 sales (days where product had no snapshot = no sales) + # Reindex over the FULL calendar window ending yesterday, not just the span + # between the first and last snapshot. resample() only covers first→last + # snapshot, so leading/trailing quiet periods are absent and the Holt level + # is fitted only on the product's busy span (can run ~4x too high). An + # explicit reindex fills every quiet calendar day with 0. (pid, snapshot_date) + # is unique so there is no duplicate-index risk; do NOT use combine_first + # (it keeps zeros over real data). See FORECAST_FIX_PLAN F2. 
hist = history_df.copy() hist['snapshot_date'] = pd.to_datetime(hist['snapshot_date']) - hist = hist.set_index('snapshot_date').resample('D').sum().fillna(0) - series = hist['units_sold'].values.astype(float) + hist = hist.set_index('snapshot_date')['units_sold'] + full_index = pd.date_range( + end=pd.Timestamp(date.today() - timedelta(days=1)), + periods=EXP_SMOOTHING_WINDOW, freq='D') + series = hist.reindex(full_index, fill_value=0.0).values.astype(float) # Need at least 2 non-zero values for smoothing if np.count_nonzero(series) < 2: @@ -956,9 +1073,24 @@ def generate_all_forecasts(conn, curves_df, dow_indices, monthly_indices=None, today = date.today() forecast_dates = [today + timedelta(days=i) for i in range(FORECAST_HORIZON_DAYS)] - # Pre-compute DOW and seasonal multipliers for each forecast date + # Pre-compute DOW and seasonal multipliers for each forecast date. + # DOW multipliers stay ABSOLUTE — every calibration is a multi-week average + # and therefore DOW-neutral, so reshaping by absolute DOW indices is correct. + # Seasonal indices must be applied RELATIVE to the calibration period: + # each per-product calibration (decay velocity, mature Holt level, launch / + # preorder scale) is fitted on raw recent actuals that already embed the + # current month's seasonal level. Multiplying by the absolute target-month + # index double-counts seasonality (~25% over-forecast at the May→June sale + # transition, worse near November). Divide by the trailing-30-day average + # index so only the seasonal *change* from calibration to target applies. + # See FORECAST_FIX_PLAN F3. 
dow_multipliers = [dow_indices.get(d.isoweekday(), 1.0) for d in forecast_dates] - seasonal_multipliers = [monthly_indices.get(d.month, 1.0) for d in forecast_dates] + trailing = [today - timedelta(days=i) for i in range(1, 31)] + calibration_index = float(np.mean([monthly_indices.get(d.month, 1.0) for d in trailing])) + seasonal_multipliers = [ + monthly_indices.get(d.month, 1.0) / max(calibration_index, 0.1) + for d in forecast_dates + ] # TRUNCATE before streaming writes with conn.cursor() as cur: @@ -1002,9 +1134,33 @@ def generate_all_forecasts(conn, curves_df, dow_indices, monthly_indices=None, try: curve_info = get_curve_for_product(product, curves_df) - if phase in ('preorder', 'launch'): + if phase == 'preorder': if curve_info: - scale = compute_scale_factor(phase, product, curve_info, batch_data) + scale = compute_scale_factor('preorder', product, curve_info, batch_data) + # Time the launch curve to expected arrival instead of + # running it from today (F4). Pre-arrival days carry the + # observed pre-order trickle rate. 
+ days_until_arrival = batch_data['preorder_arrival_days'].get(pid, 14) + preorder_units = batch_data['preorder_sales'].get(pid, 0) + preorder_days = batch_data['preorder_days'].get(pid, 1) + preorder_daily_rate = preorder_units / max(preorder_days, 1) + forecasts = forecast_preorder( + curve_info, scale, days_until_arrival, + preorder_daily_rate, FORECAST_HORIZON_DAYS) + method = 'lifecycle_curve' + else: + # No reliable curve — fall back to velocity if available + velocity = product.get('sales_velocity_daily') or 0 + if velocity > 0: + forecasts = np.full(FORECAST_HORIZON_DAYS, velocity) + method = 'velocity' + else: + forecasts = forecast_dormant() + method = 'zero' + + elif phase == 'launch': + if curve_info: + scale = compute_scale_factor('launch', product, curve_info, batch_data) forecasts = forecast_from_curve(curve_info, scale, age, FORECAST_HORIZON_DAYS) method = 'lifecycle_curve' else: @@ -1038,8 +1194,16 @@ def generate_all_forecasts(conn, curves_df, dow_indices, monthly_indices=None, method = 'velocity' else: # dormant - forecasts = forecast_dormant() - method = 'zero' + # Carry a small positive rate for dormant products that still + # trickle sales (restocks/promos/long-tail); only truly dead + # products stay at zero. See FORECAST_FIX_PLAN F5. + rate = batch_data['dormant_rate'].get(pid, 0) + if rate > 0: + forecasts = np.full(FORECAST_HORIZON_DAYS, rate) + method = 'velocity' + else: + forecasts = forecast_dormant() + method = 'zero' # Confidence interval: use accuracy-calibrated margins per phase base_margin = accuracy_margins.get(phase, 0.5) @@ -1108,6 +1272,8 @@ def archive_forecasts(conn, run_id): """) cur.execute("CREATE INDEX IF NOT EXISTS idx_pfh_date ON product_forecasts_history(forecast_date)") cur.execute("CREATE INDEX IF NOT EXISTS idx_pfh_pid_date ON product_forecasts_history(pid, forecast_date)") + # Naive-baseline column for forecast value-added (FVA). See FORECAST_FIX_PLAN F8. 
+ cur.execute("ALTER TABLE product_forecasts_history ADD COLUMN IF NOT EXISTS naive_units NUMERIC(10,2)") # Find the previous completed run (whose forecasts are still in product_forecasts) cur.execute(""" @@ -1124,15 +1290,27 @@ def archive_forecasts(conn, run_id): prev_run_id = prev_run[0] - # Archive only past-date forecasts (where actuals now exist) + # Archive only past-date forecasts (where actuals now exist). Attach the + # naive baseline (flat trailing-28-day daily average) at the same time so + # forecast value-added can be measured. See FORECAST_FIX_PLAN F8. cur.execute(""" INSERT INTO product_forecasts_history (run_id, pid, forecast_date, forecast_units, forecast_revenue, - lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at) - SELECT %s, pid, forecast_date, forecast_units, forecast_revenue, - lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at - FROM product_forecasts - WHERE forecast_date < CURRENT_DATE + lifecycle_phase, forecast_method, confidence_lower, confidence_upper, + generated_at, naive_units) + SELECT %s, pf.pid, pf.forecast_date, pf.forecast_units, pf.forecast_revenue, + pf.lifecycle_phase, pf.forecast_method, pf.confidence_lower, pf.confidence_upper, + pf.generated_at, COALESCE(nv.naive_daily, 0) + FROM product_forecasts pf + LEFT JOIN ( + SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily + FROM orders o + WHERE o.canceled IS DISTINCT FROM TRUE + AND o.date >= CURRENT_DATE - INTERVAL '28 days' + AND o.date < CURRENT_DATE + GROUP BY o.pid + ) nv ON nv.pid = pf.pid + WHERE pf.forecast_date < CURRENT_DATE ON CONFLICT (run_id, pid, forecast_date) DO NOTHING """, (prev_run_id,)) @@ -1154,6 +1332,48 @@ def archive_forecasts(conn, run_id): return archived +def archive_future_leads(conn, run_id): + """ + Archive a sampled set of FUTURE-lead forecasts from the just-generated + product_forecasts, attributed to the current run. 
+ + The past-date archive in archive_forecasts() only ever captures the 1-day + slice that just elapsed, so every accuracy sample lands in the '1-7d' lead + bucket and the 15/30/60/90-day forecasts that purchasing actually rides on + are never validated. Here we snapshot the 7/14/30/60/89-day-ahead leads + (non-dormant) so that, once each date passes, compute_accuracy() can score + them in their lead bucket. The naive baseline is attached the same way as in + the past-date path. Future-dated rows survive the 90-day prune until their + own date passes. See FORECAST_FIX_PLAN F7. + """ + with conn.cursor() as cur: + cur.execute(""" + INSERT INTO product_forecasts_history + (run_id, pid, forecast_date, forecast_units, forecast_revenue, + lifecycle_phase, forecast_method, confidence_lower, confidence_upper, + generated_at, naive_units) + SELECT %s, pf.pid, pf.forecast_date, pf.forecast_units, pf.forecast_revenue, + pf.lifecycle_phase, pf.forecast_method, pf.confidence_lower, pf.confidence_upper, + pf.generated_at, COALESCE(nv.naive_daily, 0) + FROM product_forecasts pf + LEFT JOIN ( + SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily + FROM orders o + WHERE o.canceled IS DISTINCT FROM TRUE + AND o.date >= CURRENT_DATE - INTERVAL '28 days' + AND o.date < CURRENT_DATE + GROUP BY o.pid + ) nv ON nv.pid = pf.pid + WHERE pf.lifecycle_phase != 'dormant' + AND pf.forecast_date - CURRENT_DATE IN (7, 14, 30, 60, 89) + ON CONFLICT (run_id, pid, forecast_date) DO NOTHING + """, (run_id,)) + archived = cur.rowcount + conn.commit() + log.info(f"Archived {archived} future-lead forecast rows (7/14/30/60/89d) for run {run_id}") + return archived + + def compute_accuracy(conn, run_id): """ Compute forecast accuracy metrics from archived history vs. actual sales. @@ -1162,11 +1382,18 @@ def compute_accuracy(conn, run_id): (pid, forecast_date = snapshot_date) to compare forecasted vs. actual units. 
Stores results in forecast_accuracy table, broken down by: - - overall: single aggregate row + - overall: two rows — 'all' (non-dormant) and 'all_incl_dormant' (F5) + - overall_weekly: per-product weekly-grain WMAPE — the informative headline + for intermittent demand (daily grain has a ~190% floor) (F9) - by_phase: per lifecycle phase - - by_lead_time: bucketed by how far ahead the forecast was + - by_lead_time: bucketed by how far ahead the forecast was — long-lead + buckets populate as the future-lead archives mature (F7) - by_method: per forecast method - daily: per forecast_date (for trend charts) + + Every dimension also stores naive_wmape (flat trailing-28d baseline) and + fva = 1 - wmape/naive_wmape, so the engine can be judged as value-over-naive + (F8). Only realized dates (forecast_date < CURRENT_DATE) are scored. """ with conn.cursor() as cur: # Ensure accuracy table exists @@ -1186,6 +1413,10 @@ def compute_accuracy(conn, run_id): PRIMARY KEY (run_id, metric_type, dimension_value) ) """) + # Naive-baseline WMAPE and forecast value-added (FVA = 1 - wmape/naive_wmape). + # See FORECAST_FIX_PLAN F8. + cur.execute("ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS naive_wmape NUMERIC(10,4)") + cur.execute("ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS fva NUMERIC(10,4)") conn.commit() # Check if we have any history to analyze @@ -1195,124 +1426,199 @@ def compute_accuracy(conn, run_id): log.info("No forecast history available for accuracy computation") return - # For each (pid, forecast_date) pair, keep only the most recent run's - # forecast row. This prevents double-counting when multiple runs have - # archived forecasts for the same product×date combination. - accuracy_cte = """ - WITH ranked_history AS ( + # Base CTEs (FORECAST_FIX_PLAN F7): + # - Only score realized dates (forecast_date < CURRENT_DATE); future-lead + # archives are excluded until their date passes. 
+ # - short_lead*: lead 0-6 deduped per (pid, forecast_date) — preserves the + # meaning of the existing headline metrics. short_lead_eval keeps the + # raw snapshot grid (incl. zero-zero days) for complete-week detection; + # `accuracy` drops zero-zero days for daily-grain metrics. + # - lead_dedup/lead_accuracy: deduped per (pid, forecast_date, lead_bucket) + # so each long-lead bucket gets its own sample (the by_lead_time table). + base_cte = """ + WITH ranked_all AS ( SELECT - pfh.*, + pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.naive_units, + pfh.lifecycle_phase, pfh.forecast_method, fr.started_at, - ROW_NUMBER() OVER ( - PARTITION BY pfh.pid, pfh.forecast_date - ORDER BY fr.started_at DESC - ) AS rn + (pfh.forecast_date - fr.started_at::date) AS lead_days, + CASE + WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 0 AND 6 THEN '1-7d' + WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 7 AND 13 THEN '8-14d' + WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 14 AND 29 THEN '15-30d' + WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 30 AND 59 THEN '31-60d' + ELSE '61-90d' + END AS lead_bucket FROM product_forecasts_history pfh JOIN forecast_runs fr ON fr.id = pfh.run_id + WHERE pfh.forecast_date < CURRENT_DATE + ), + short_lead AS ( + SELECT *, + ROW_NUMBER() OVER ( + PARTITION BY pid, forecast_date ORDER BY started_at DESC + ) AS rn + FROM ranked_all + WHERE lead_days BETWEEN 0 AND 6 + ), + short_lead_eval AS ( + SELECT sl.pid, sl.lifecycle_phase, sl.forecast_method, sl.forecast_date, + sl.forecast_units, sl.naive_units, + COALESCE(dps.units_sold, 0) AS actual_units, + (sl.forecast_units - COALESCE(dps.units_sold, 0)) AS error, + ABS(sl.forecast_units - COALESCE(dps.units_sold, 0)) AS abs_error + FROM short_lead sl + LEFT JOIN daily_product_snapshots dps + ON dps.pid = sl.pid AND dps.snapshot_date = sl.forecast_date + WHERE sl.rn = 1 ), accuracy AS ( - SELECT - rh.lifecycle_phase, - rh.forecast_method, - rh.forecast_date, - 
(rh.forecast_date - rh.started_at::date) AS lead_days, - rh.forecast_units, + SELECT * FROM short_lead_eval + WHERE NOT (forecast_units = 0 AND actual_units = 0) + ), + lead_dedup AS ( + SELECT *, + ROW_NUMBER() OVER ( + PARTITION BY pid, forecast_date, lead_bucket ORDER BY started_at DESC + ) AS rn + FROM ranked_all + ), + lead_accuracy AS ( + SELECT ld.lead_bucket, ld.forecast_units, ld.naive_units, COALESCE(dps.units_sold, 0) AS actual_units, - (rh.forecast_units - COALESCE(dps.units_sold, 0)) AS error, - ABS(rh.forecast_units - COALESCE(dps.units_sold, 0)) AS abs_error - FROM ranked_history rh + (ld.forecast_units - COALESCE(dps.units_sold, 0)) AS error, + ABS(ld.forecast_units - COALESCE(dps.units_sold, 0)) AS abs_error + FROM lead_dedup ld LEFT JOIN daily_product_snapshots dps - ON dps.pid = rh.pid AND dps.snapshot_date = rh.forecast_date - WHERE rh.rn = 1 - AND NOT (rh.forecast_units = 0 AND COALESCE(dps.units_sold, 0) = 0) + ON dps.pid = ld.pid AND dps.snapshot_date = ld.forecast_date + WHERE ld.rn = 1 + AND ld.lifecycle_phase != 'dormant' + AND NOT (ld.forecast_units = 0 AND COALESCE(dps.units_sold, 0) = 0) ) """ - # Compute and insert metrics for each dimension - dimensions = { - 'overall': "SELECT 'all' AS dim", - 'by_phase': "SELECT DISTINCT lifecycle_phase AS dim FROM accuracy", - 'by_lead_time': """ - SELECT DISTINCT - CASE - WHEN lead_days BETWEEN 0 AND 6 THEN '1-7d' - WHEN lead_days BETWEEN 7 AND 13 THEN '8-14d' - WHEN lead_days BETWEEN 14 AND 29 THEN '15-30d' - WHEN lead_days BETWEEN 30 AND 59 THEN '31-60d' - ELSE '61-90d' - END AS dim - FROM accuracy - """, - 'by_method': "SELECT DISTINCT forecast_method AS dim FROM accuracy", - 'daily': "SELECT DISTINCT forecast_date::text AS dim FROM accuracy", - } - - filter_clauses = { - 'overall': "lifecycle_phase != 'dormant'", - 'by_phase': "lifecycle_phase = dims.dim", - 'by_lead_time': """ - CASE - WHEN lead_days BETWEEN 0 AND 6 THEN '1-7d' - WHEN lead_days BETWEEN 7 AND 13 THEN '8-14d' - WHEN lead_days 
BETWEEN 14 AND 29 THEN '15-30d' - WHEN lead_days BETWEEN 30 AND 59 THEN '31-60d' - ELSE '61-90d' - END = dims.dim - """, - 'by_method': "forecast_method = dims.dim", - 'daily': "forecast_date::text = dims.dim", - } - - total_inserted = 0 - - for metric_type, dim_query in dimensions.items(): - filter_clause = filter_clauses[metric_type] - - sql = f""" - {accuracy_cte}, - dims AS ({dim_query}) + # Daily-grain aggregate over a source CTE aliased `a`, computing the + # engine WMAPE plus the naive-baseline WMAPE (NULL-safe: rows archived + # before F8 have naive_units NULL and are excluded from the naive sums). + def daily_agg(dim_expr, source, where=None, group_by=None): + where_sql = f"WHERE {where}" if where else "" + group_sql = f"GROUP BY {group_by}" if group_by else "" + return f""" SELECT - dims.dim, + {dim_expr} AS dim, COUNT(*) AS sample_size, COALESCE(SUM(a.actual_units), 0) AS total_actual, COALESCE(SUM(a.forecast_units), 0) AS total_forecast, AVG(a.abs_error) AS mae, CASE WHEN SUM(a.actual_units) > 0 - THEN SUM(a.abs_error) / SUM(a.actual_units) - ELSE NULL END AS wmape, + THEN SUM(a.abs_error) / SUM(a.actual_units) ELSE NULL END AS wmape, AVG(a.error) AS bias, - SQRT(AVG(POWER(a.error, 2))) AS rmse - FROM dims - CROSS JOIN accuracy a - WHERE {filter_clause} - GROUP BY dims.dim + SQRT(AVG(POWER(a.error, 2))) AS rmse, + CASE WHEN SUM(a.actual_units) FILTER (WHERE a.naive_units IS NOT NULL) > 0 + THEN SUM(ABS(a.naive_units - a.actual_units)) FILTER (WHERE a.naive_units IS NOT NULL) + / SUM(a.actual_units) FILTER (WHERE a.naive_units IS NOT NULL) + ELSE NULL END AS naive_wmape + FROM {source} a + {where_sql} + {group_sql} """ - cur.execute(sql) - rows = cur.fetchall() + insert_sql = """ + INSERT INTO forecast_accuracy + (run_id, metric_type, dimension_value, sample_size, + total_actual_units, total_forecast_units, mae, wmape, bias, rmse, + naive_wmape, fva) + VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) + ON CONFLICT (run_id, metric_type, 
dimension_value) + DO UPDATE SET + sample_size = EXCLUDED.sample_size, + total_actual_units = EXCLUDED.total_actual_units, + total_forecast_units = EXCLUDED.total_forecast_units, + mae = EXCLUDED.mae, wmape = EXCLUDED.wmape, + bias = EXCLUDED.bias, rmse = EXCLUDED.rmse, + naive_wmape = EXCLUDED.naive_wmape, fva = EXCLUDED.fva, + computed_at = NOW() + """ - for row in rows: - dim_val, sample_size, total_actual, total_forecast, mae, wmape, bias, rmse = row - cur.execute(""" - INSERT INTO forecast_accuracy - (run_id, metric_type, dimension_value, sample_size, - total_actual_units, total_forecast_units, mae, wmape, bias, rmse) - VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) - ON CONFLICT (run_id, metric_type, dimension_value) - DO UPDATE SET - sample_size = EXCLUDED.sample_size, - total_actual_units = EXCLUDED.total_actual_units, - total_forecast_units = EXCLUDED.total_forecast_units, - mae = EXCLUDED.mae, wmape = EXCLUDED.wmape, - bias = EXCLUDED.bias, rmse = EXCLUDED.rmse, - computed_at = NOW() - """, (run_id, metric_type, dim_val, sample_size, - float(total_actual), float(total_forecast), - float(mae) if mae is not None else None, - float(wmape) if wmape is not None else None, - float(bias) if bias is not None else None, - float(rmse) if rmse is not None else None)) - total_inserted += 1 + def _f(x): + return float(x) if x is not None else None + + def run_and_insert(metric_type, sql): + cur.execute(base_cte + sql) + n = 0 + for row in cur.fetchall(): + (dim_val, sample_size, total_actual, total_forecast, + mae, wmape, bias, rmse, naive_wmape) = row + fva = None + if wmape is not None and naive_wmape is not None and float(naive_wmape) > 0: + fva = 1.0 - float(wmape) / float(naive_wmape) + cur.execute(insert_sql, ( + run_id, metric_type, dim_val, sample_size, + _f(total_actual), _f(total_forecast), _f(mae), _f(wmape), + _f(bias), _f(rmse), _f(naive_wmape), _f(fva))) + n += 1 + return n + + total_inserted = 0 + + # overall: two rows — 'all' (non-dormant, the 
headline) and + # 'all_incl_dormant' (everything, so the ~11% dormant demand stops being + # invisible). Both are short-lead (lead 0-6). F5. + overall_source = """( + SELECT a.*, 'all'::text AS dim FROM accuracy a WHERE a.lifecycle_phase != 'dormant' + UNION ALL + SELECT a.*, 'all_incl_dormant'::text AS dim FROM accuracy a + )""" + total_inserted += run_and_insert('overall', + daily_agg('a.dim', overall_source, group_by='a.dim')) + + # by_phase / by_method / daily — short-lead daily-grain over `accuracy`. + total_inserted += run_and_insert('by_phase', + daily_agg('a.lifecycle_phase', 'accuracy', group_by='a.lifecycle_phase')) + total_inserted += run_and_insert('by_method', + daily_agg('a.forecast_method', 'accuracy', group_by='a.forecast_method')) + total_inserted += run_and_insert('daily', + daily_agg('a.forecast_date::text', 'accuracy', + where="a.lifecycle_phase != 'dormant'", group_by='a.forecast_date')) + + # by_lead_time — one sample per (pid, date, lead bucket) over `lead_accuracy`. + # Buckets beyond '1-7d' populate as the future-lead archives (F7) mature. + total_inserted += run_and_insert('by_lead_time', + daily_agg('a.lead_bucket', 'lead_accuracy', group_by='a.lead_bucket')) + + # overall_weekly — the informative headline for intermittent retail demand. + # Aggregate the short-lead rows to (pid, complete week), then WMAPE over + # pid-weeks. Daily-grain WMAPE has a ~190% floor on this catalog; weekly + # grain is ~109% and responds to real improvement. F9. 
+ weekly_sql = """, + weekly AS ( + SELECT pid, date_trunc('week', forecast_date) AS wk, + SUM(forecast_units) AS fc_week, + SUM(actual_units) AS act_week, + SUM(naive_units) AS naive_week, + bool_and(naive_units IS NOT NULL) AS naive_complete + FROM short_lead_eval + WHERE lifecycle_phase != 'dormant' + GROUP BY pid, date_trunc('week', forecast_date) + HAVING COUNT(*) = 7 + ) + SELECT 'all'::text AS dim, + COUNT(*) AS sample_size, + COALESCE(SUM(act_week), 0) AS total_actual, + COALESCE(SUM(fc_week), 0) AS total_forecast, + AVG(ABS(fc_week - act_week)) AS mae, + CASE WHEN SUM(act_week) > 0 + THEN SUM(ABS(fc_week - act_week)) / SUM(act_week) ELSE NULL END AS wmape, + AVG(fc_week - act_week) AS bias, + SQRT(AVG(POWER(fc_week - act_week, 2))) AS rmse, + CASE WHEN SUM(act_week) FILTER (WHERE naive_complete) > 0 + THEN SUM(ABS(naive_week - act_week)) FILTER (WHERE naive_complete) + / SUM(act_week) FILTER (WHERE naive_complete) + ELSE NULL END AS naive_wmape + FROM weekly + WHERE NOT (fc_week = 0 AND act_week = 0) + """ + total_inserted += run_and_insert('overall_weekly', weekly_sql) conn.commit() @@ -1562,6 +1868,10 @@ def main(): conn, curves_df, dow_indices, monthly_indices, accuracy_margins ) + # Phase 4b: Snapshot sampled future-lead forecasts (7/14/30/60/89d) from + # the fresh run so long-lead accuracy populates once those dates pass (F7). 
+ archive_future_leads(conn, run_id) + duration = time.time() - start_time # Record run completion (include DOW indices in metadata) diff --git a/inventory-server/src/routes/dashboard.js b/inventory-server/src/routes/dashboard.js index 7f677d9..25563a9 100644 --- a/inventory-server/src/routes/dashboard.js +++ b/inventory-server/src/routes/dashboard.js @@ -357,6 +357,9 @@ router.get('/forecast/metrics', async (req, res) => { const active = parseInt(totals.active_products) || 1; const curveProducts = parseInt(totals.curve_products) || 0; + // NOTE: despite the name, this is "share of active products forecast via + // lifecycle curves" (curve coverage), NOT a statistical confidence. It only + // feeds a per-day tooltip field. See FORECAST_FIX_PLAN F9 (point 4). const confidenceLevel = parseFloat((curveProducts / active).toFixed(2)); // Daily series from actual forecast @@ -687,14 +690,29 @@ router.get('/forecast/accuracy', async (req, res) => { const { rows: metrics } = await executeQuery(` SELECT metric_type, dimension_value, sample_size, total_actual_units, total_forecast_units, - mae, wmape, bias, rmse + mae, wmape, bias, rmse, naive_wmape, fva FROM forecast_accuracy WHERE run_id = $1 ORDER BY metric_type, dimension_value `, [latestRunId]); + // Shared shaping for an "overall"-style aggregate row (daily or weekly grain). + const shapeOverall = (m) => m ? { + sampleSize: parseInt(m.sample_size), + totalActual: parseFloat(m.total_actual_units) || 0, + totalForecast: parseFloat(m.total_forecast_units) || 0, + mae: m.mae != null ? parseFloat(parseFloat(m.mae).toFixed(4)) : null, + wmape: m.wmape != null ? parseFloat((parseFloat(m.wmape) * 100).toFixed(1)) : null, + bias: m.bias != null ? parseFloat(parseFloat(m.bias).toFixed(4)) : null, + rmse: m.rmse != null ? parseFloat(parseFloat(m.rmse).toFixed(4)) : null, + naiveWmape: m.naive_wmape != null ? parseFloat((parseFloat(m.naive_wmape) * 100).toFixed(1)) : null, + fva: m.fva != null ? 
parseFloat(parseFloat(m.fva).toFixed(3)) : null, + } : null; + // Organize into response structure - const overall = metrics.find(m => m.metric_type === 'overall'); + const overall = metrics.find(m => m.metric_type === 'overall' && m.dimension_value === 'all'); + const overallInclDormant = metrics.find(m => m.metric_type === 'overall' && m.dimension_value === 'all_incl_dormant'); + const overallWeekly = metrics.find(m => m.metric_type === 'overall_weekly'); const byPhase = metrics .filter(m => m.metric_type === 'by_phase') .map(m => ({ @@ -706,6 +724,8 @@ router.get('/forecast/accuracy', async (req, res) => { wmape: m.wmape != null ? parseFloat((parseFloat(m.wmape) * 100).toFixed(1)) : null, bias: m.bias != null ? parseFloat(parseFloat(m.bias).toFixed(4)) : null, rmse: m.rmse != null ? parseFloat(parseFloat(m.rmse).toFixed(4)) : null, + naiveWmape: m.naive_wmape != null ? parseFloat((parseFloat(m.naive_wmape) * 100).toFixed(1)) : null, + fva: m.fva != null ? parseFloat(parseFloat(m.fva).toFixed(3)) : null, })) .sort((a, b) => (b.totalActual || 0) - (a.totalActual || 0)); @@ -763,6 +783,26 @@ router.get('/forecast/accuracy', async (req, res) => { sampleSize: parseInt(r.sample_size), })); + // Weekly-grain trend across runs (starts empty for old runs that predate + // the overall_weekly metric; that's expected, no backfill). F9. + const { rows: weeklyTrendRows } = await executeQuery(` + SELECT fr.finished_at::date AS run_date, + fa.wmape, fa.naive_wmape, fa.fva, fa.sample_size + FROM forecast_accuracy fa + JOIN forecast_runs fr ON fr.id = fa.run_id + WHERE fa.metric_type = 'overall_weekly' + AND fa.dimension_value = 'all' + ORDER BY fr.finished_at + `); + + const accuracyTrendWeekly = weeklyTrendRows.map(r => ({ + date: r.run_date instanceof Date ? r.run_date.toISOString().split('T')[0] : r.run_date, + wmape: r.wmape != null ? parseFloat((parseFloat(r.wmape) * 100).toFixed(1)) : null, + naiveWmape: r.naive_wmape != null ?
parseFloat((parseFloat(r.naive_wmape) * 100).toFixed(1)) : null, + fva: r.fva != null ? parseFloat(parseFloat(r.fva).toFixed(3)) : null, + sampleSize: parseInt(r.sample_size), + })); + res.json({ hasData: true, computedAt, @@ -775,20 +815,15 @@ router.get('/forecast/accuracy', async (req, res) => { ? historyInfo.latest_date.toISOString().split('T')[0] : historyInfo.latest_date, }, - overall: overall ? { - sampleSize: parseInt(overall.sample_size), - totalActual: parseFloat(overall.total_actual_units) || 0, - totalForecast: parseFloat(overall.total_forecast_units) || 0, - mae: overall.mae != null ? parseFloat(parseFloat(overall.mae).toFixed(4)) : null, - wmape: overall.wmape != null ? parseFloat((parseFloat(overall.wmape) * 100).toFixed(1)) : null, - bias: overall.bias != null ? parseFloat(parseFloat(overall.bias).toFixed(4)) : null, - rmse: overall.rmse != null ? parseFloat(parseFloat(overall.rmse).toFixed(4)) : null, - } : null, + overall: shapeOverall(overall), + overallInclDormant: shapeOverall(overallInclDormant), + overallWeekly: shapeOverall(overallWeekly), byPhase, byLeadTime, byMethod, dailyTrend, accuracyTrend, + accuracyTrendWeekly, }); } catch (err) { console.error('Error fetching forecast accuracy:', err); diff --git a/inventory/src/components/overview/ForecastAccuracy.tsx b/inventory/src/components/overview/ForecastAccuracy.tsx index c0c3d9a..77e81cf 100644 --- a/inventory/src/components/overview/ForecastAccuracy.tsx +++ b/inventory/src/components/overview/ForecastAccuracy.tsx @@ -2,7 +2,7 @@ import { useQuery } from "@tanstack/react-query" import { apiFetch } from '@/utils/api'; import { BarChart, Bar, ResponsiveContainer, XAxis, YAxis, Tooltip as RechartsTooltip, Cell, LineChart, Line } from "recharts" import config from "@/config" -import { Target, TrendingDown, ArrowUpDown } from "lucide-react" +import { Target, TrendingDown, ArrowUpDown, Swords } from "lucide-react" import { Tooltip as UITooltip, TooltipContent, TooltipProvider, TooltipTrigger } 
from "@/components/ui/tooltip" import { PHASE_CONFIG } from "@/utils/lifecyclePhases" @@ -14,6 +14,8 @@ interface OverallMetrics { wmape: number | null bias: number | null rmse: number | null + naiveWmape?: number | null + fva?: number | null } interface PhaseAccuracy { @@ -25,6 +27,8 @@ interface PhaseAccuracy { wmape: number | null bias: number | null rmse: number | null + naiveWmape?: number | null + fva?: number | null } interface LeadTimeAccuracy { @@ -51,11 +55,14 @@ interface AccuracyData { daysOfHistory?: number historyRange?: { from: string; to: string } overall?: OverallMetrics + overallInclDormant?: OverallMetrics + overallWeekly?: OverallMetrics byPhase?: PhaseAccuracy[] byLeadTime?: LeadTimeAccuracy[] byMethod?: { method: string; sampleSize: number; mae: number | null; wmape: number | null; bias: number | null }[] dailyTrend?: { date: string; mae: number | null; wmape: number | null; bias: number | null }[] accuracyTrend?: AccuracyTrendPoint[] + accuracyTrendWeekly?: { date: string; wmape: number | null; naiveWmape: number | null; fva: number | null; sampleSize: number }[] } function MetricSkeleton() { @@ -74,12 +81,30 @@ function formatBias(bias: number | null): string { } function getAccuracyColor(wmape: number | null): string { + // Daily-grain thresholds (used for the by-phase / lead-time bars). if (wmape === null) return "text-muted-foreground" if (wmape <= 30) return "text-green-600" if (wmape <= 50) return "text-yellow-600" return "text-red-600" } +function getWeeklyAccuracyColor(wmape: number | null): string { + // Weekly per-product grain has a much lower achievable floor than daily grain + // on this intermittent-demand catalog, so the headline uses its own thresholds. 
+ if (wmape === null) return "text-muted-foreground" + if (wmape <= 60) return "text-green-600" + if (wmape <= 90) return "text-yellow-600" + return "text-red-600" +} + +function formatSignedPct(ratio: number | null, digits = 0): string { + // ratio is a fraction (0.7 => +70%); null-safe. + if (ratio === null || ratio === undefined) return "N/A" + const pct = ratio * 100 + const sign = pct > 0 ? "+" : "" + return `${sign}${pct.toFixed(digits)}%` +} + export function ForecastAccuracy() { const { data, error, isLoading } = useQuery({ queryKey: ["forecast-accuracy"], @@ -133,6 +158,24 @@ export function ForecastAccuracy() { sampleSize: lt.sampleSize, })) + // Headline prefers the weekly-grain WMAPE (informative); falls back to the + // daily-grain number until enough complete weeks of history exist. + const weeklyWmape = data?.overallWeekly?.wmape ?? null + const usingWeekly = weeklyWmape !== null + const headlineWmape = usingWeekly ? weeklyWmape : (data?.overall?.wmape ?? null) + const headlineColor = usingWeekly + ? getWeeklyAccuracyColor(headlineWmape) + : getAccuracyColor(headlineWmape) + // Net forecast-vs-actual ratio (e.g. +70% = over-forecasting), from the + // daily 'all' totals — far more legible than bias in raw units. + const totalFc = data?.overall?.totalForecast ?? 0 + const totalAct = data?.overall?.totalActual ?? 0 + const fcVsAct = totalAct > 0 ? (totalFc / totalAct - 1) : null + // Value over the naive baseline; prefer weekly grain to match the headline. + const naiveSource = data?.overallWeekly ?? data?.overall + const naiveWmape = naiveSource?.naiveWmape ?? null + const fva = naiveSource?.fva ?? null + return (

Forecast Accuracy

@@ -148,10 +191,24 @@ export function ForecastAccuracy() {
-

WMAPE

+

+ WMAPE ({usingWeekly ? "weekly" : "daily"}) +

-

- {formatWmape(data?.overall?.wmape ?? null)} +

+ {formatWmape(headlineWmape)} +

+
+
+
+ +

Forecast vs actual

+
+

+ {formatSignedPct(fcVsAct)} + + {(fcVsAct ?? 0) > 0 ? "over" : (fcVsAct ?? 0) < 0 ? "under" : ""} +

@@ -160,20 +217,24 @@ export function ForecastAccuracy() {

MAE

- {data?.overall?.mae !== null ? data?.overall?.mae?.toFixed(2) : "N/A"} + {data?.overall?.mae != null ? data?.overall?.mae?.toFixed(2) : "N/A"} units

- -

Bias

+ +

vs naive

- {formatBias(data?.overall?.bias ?? null)} - - {(data?.overall?.bias ?? 0) > 0 ? "over" : (data?.overall?.bias ?? 0) < 0 ? "under" : ""} + 0 ? "text-green-600" : "text-red-600") : "text-muted-foreground"}> + {fva != null ? `${formatSignedPct(fva)} FVA` : "N/A"} + {naiveWmape != null && ( + + naive {formatWmape(naiveWmape)} + + )}