Forecast improvements

2026-06-11 14:55:33 -04:00
parent 9ff744399f
commit 3b2f51e6b8
5 changed files with 887 additions and 138 deletions
@@ -0,0 +1,343 @@
# Forecast Accuracy Fix Plan
**Written:** 2026-06-10, from a code + live-data review of the forecasting pipeline.
**Goal:** eliminate the systematic ~1.7–2x over-forecast bias, recover demand the model currently ignores, and fix the accuracy measurement so improvements are visible and long-lead forecasts are validated.
Read this whole document before starting. Fixes are grouped into phases; each phase is independently deployable and has its own validation step. Line numbers are as of 2026-06-10 — re-locate by function name if the file has drifted.
---
## 1. Diagnosis summary (measured 2026-06-10)
The dashboard headline is **202% WMAPE**. Decomposition of that number, all measured against `forecast_accuracy` run 129 and ad-hoc queries:
| Finding | Evidence |
|---|---|
| Daily-grain WMAPE has a ~190% *floor* for this catalog | Avg demand ≈ 0.11 units/product/day. A perfect rate forecast of intermittent demand scores ≈ 2e^−λ ≈ 190%. A trivial trailing-30d-average naive forecast scores **204%** on the same products/days; the engine scores 221% (slightly *worse than naive*). |
| Same forecasts at 21-day-per-product grain: **109%**; bias-corrected: **75%** | Half the headline is metric grain, most of the rest is bias. |
| Aggregate over-forecast **+70%** (227,690 forecast vs 133,861 actual units) | Portfolio daily ratio is 1.5–2.5x on most days. |
| Decay phase 2.47x over (fc 51,675 / act 20,915) | Root cause F1: velocity inflated **4.07x** (measured: 1.353 vs true 0.332 units/day) by averaging over sparse snapshot rows. |
| Preorder phase 2.15x over (fc 67,212 / act 31,189) | Root cause F4: launch curve applied at age=0 starting *today*, ignoring that the product hasn't arrived. |
| Mature phase 1.69x over (fc 57,857 / act 34,313) | Root causes F2 (history edge truncation) + F3 (seasonal double-count). |
| Dormant products sold **16,180 units** (~11% of demand) against zero forecasts | Root cause F5; also excluded from the headline metric, so invisible. |
| All 879,800 accuracy samples are in the **1–7d lead bucket** | Root cause F7: archiving design only ever saves yesterday's slice. 30–90d forecasts (what purchasing uses) are never validated. |
| Launch phase is healthy: WMAPE 100%, bias 6%, beats naive | The lifecycle-curve concept works; its calibration inputs are broken. Don't redesign it. |
**Key data fact** underlying several fixes: `daily_product_snapshots` is **activity-based and sparse**: only ~500–1,800 of ~38K products have a row on a given day. Verified: every pid-day with an order DOES have a snapshot row and units match (5,234/5,234 pid-days, 8,980 vs 8,984 units over 7 days). So *missing row = zero sales*, and any query that aggregates over only the rows that exist is averaging over sold-days.
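A minimal sketch of why this matters, using made-up data rather than anything from the live tables: averaging over only the rows that exist divides by sold-days, while the true daily rate divides by calendar days. The function name and the sample values are illustrative assumptions.

```python
from datetime import date, timedelta


def sparse_vs_calendar_mean(rows, window_end, window_days=7):
    """Compare an average over existing rows with a true calendar-day average.

    rows: {date: units_sold}, with entries only for days that have a snapshot row.
    """
    window = [window_end - timedelta(days=i) for i in range(window_days)]
    in_window = [rows[d] for d in window if d in rows]
    total = sum(in_window)
    # What AVG() over the existing rows sees: the divisor is the row count.
    row_mean = total / len(in_window) if in_window else 0.0
    # Missing row = zero sales, so the divisor is the number of calendar days.
    calendar_mean = total / window_days
    return row_mean, calendar_mean


# Hypothetical product with two sold-days in a 7-day window.
end = date(2026, 6, 9)
rows = {end: 2, end - timedelta(days=4): 1}
row_mean, calendar_mean = sparse_vs_calendar_mean(rows, end)
print(row_mean, round(calendar_mean, 3))
```

In this toy case the row-based average is 3.5x the calendar-day rate, the same mechanism behind the inflation measured in F1.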
---
## 2. Environment & operational notes
- **Files:** engine is `inventory-server/scripts/forecast/forecast_engine.py`; orchestrator `run_forecast.js` in the same dir; consumer endpoints in `inventory-server/src/routes/dashboard.js` (`/forecast/metrics` ~line 308, `/forecast/accuracy` ~line 647); overview UI in `inventory/src/components/overview/ForecastMetrics.tsx` and `ForecastAccuracy.tsx`.
- **Local `inventory-server/` is NFS-mounted to `/var/www/inventory/` on the netcup server.** Edits made locally appear on the server immediately — no copy step. Do NOT run bulk `grep`/`find`/`node --check` over `inventory-server/` locally (the mount hangs); `ssh netcup` and run them there.
- **Avoid the glob tool** for search in this repo; use bash (`grep`/`rg` via ssh for server-side trees).
- **Scheduling:** the engine runs daily at **09:30:01 server time** (runs table is conclusive), but the cron entry is NOT in matt's crontab, `/etc/cron.d`, or pm2. Likely root's crontab (`sudo crontab -l` to confirm). You do not need to touch the schedule for these fixes; just know a run fires at 09:30 daily and occasionally skips days (e.g. 2026-06-07/08).
- **Manual test runs:** `ssh netcup`, then `cd /var/www/inventory/scripts/forecast && node run_forecast.js`. Takes ~3.5–4 min. Safe to run any time: the engine TRUNCATEs and rebuilds `product_forecasts`, archives prior past-dated rows, and records a new `forecast_runs` row. Python deps live in the server venv (`venv/`); `run_forecast.js` handles env + venv automatically.
- **DB access for validation:** `ssh netcup`, then `PGPASSWORD=6D3GUkxuFgi2UghwgnUd psql -h localhost -U inventory_readonly -d inventory_db`. The engine itself connects with the write user via env vars loaded from `/var/www/inventory/.env` — schema changes should be made idempotently *inside the engine code* (the file already uses `CREATE TABLE IF NOT EXISTS` / `CREATE INDEX IF NOT EXISTS`; use `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` the same way) so no manual migration is needed.
- **Python gotchas already handled in this file (don't regress):** numpy types must go through the registered psycopg2 adapters; `pd.Series.combine_first()` keeps zeros over real data — use `reindex(..., fill_value=0.0)`.
- Engine runtime budget: currently ~212–227s. Phases 1–2 shouldn't move it meaningfully; Phase 3's extra archiving adds one INSERT…SELECT. If runtime balloons past ~6 min, investigate before shipping.
- `--backfill` mode (`backfill_accuracy_data`) is an in-sample backtest using the *old* formulas. **Do not run it anymore**; there is enough real out-of-sample history. Updating it to match the new logic is optional/low priority (F11).
---
## Phase 1 — Bias bugs in the engine (no schema changes)
### F1. Decay velocity: stop averaging over sparse snapshot rows
**Where:** `forecast_engine.py`, `batch_load_product_data()`, the decay query (~lines 697–710).
**Problem:** `AVG(COALESCE(dps.units_sold, 0))` runs over only the snapshot rows that exist — mostly sold-days. Measured inflation on the current 975 decay products: **4.07x** (1.353 vs 0.332 true units/day). This feeds `compute_scale_factor()` for the decay phase and is the single largest bias source.
**Fix:** divide the sum by calendar days in the window, clipped to the product's age (decay products are 14–60 days old, so a 20-day-old product's window is 20 days, not 30):
```sql
SELECT dps.pid,
SUM(COALESCE(dps.units_sold, 0))::float
/ GREATEST(LEAST(30, (CURRENT_DATE - pm.date_first_received::date)), 1) AS avg_daily
FROM daily_product_snapshots dps
JOIN product_metrics pm ON pm.pid = dps.pid
WHERE dps.pid = ANY(%s)
AND dps.snapshot_date >= CURRENT_DATE - INTERVAL '30 days'
AND dps.snapshot_date >= pm.date_first_received::date
GROUP BY dps.pid, pm.date_first_received
```
No Python-side changes needed; `data['decay_velocity']` keeps the same shape. Products with zero snapshot rows in the window still get no entry → existing `scale = 1.0` fallback applies (acceptable: decay classification requires `sales_velocity_daily > 0`, so truly dead products don't reach this path).
### F2. Mature history: reindex over the full calendar window
**Where:** `forecast_engine.py`, `forecast_mature()` (~lines 833–836).
**Problem:** `hist.set_index('snapshot_date').resample('D').sum()` only spans first-snapshot → last-snapshot. Interior gaps correctly become zeros, but **leading and trailing quiet periods are absent**, so the Holt level is fitted on the product's busy span. A marginal mature product whose activity clusters in 2 of the last 8 weeks gets a level ~4x too high.
**Fix:** replace the resample with an explicit reindex over the full `EXP_SMOOTHING_WINDOW` ending yesterday:
```python
hist = history_df.copy()
hist['snapshot_date'] = pd.to_datetime(hist['snapshot_date'])
hist = hist.set_index('snapshot_date')['units_sold']
full_index = pd.date_range(
end=pd.Timestamp(date.today() - timedelta(days=1)),
periods=EXP_SMOOTHING_WINDOW, freq='D')
series = hist.reindex(full_index, fill_value=0.0).values.astype(float)
```
Notes: (pid, snapshot_date) is unique in `daily_product_snapshots`, so no duplicate-index risk. `observed_mean` and the `cap` recompute over the full window automatically (intended — the cap gets correspondingly tighter). Mature products are by definition >60 days old, so the 60-day window never predates first receipt. Do NOT use `combine_first` (see gotchas above).
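To make the edge-truncation effect concrete, here is a stdlib-only sketch (no pandas) that contrasts the first-to-last-snapshot span with the full calendar window. Both helper names and the sample data are hypothetical; the engine itself uses the pandas reindex shown above.

```python
from datetime import date, timedelta


def first_to_last_series(snapshots):
    """Daily series covering only first snapshot .. last snapshot (the resample span)."""
    if not snapshots:
        return []
    start, stop = min(snapshots), max(snapshots)
    n_days = (stop - start).days + 1
    return [float(snapshots.get(start + timedelta(days=i), 0.0)) for i in range(n_days)]


def full_window_series(snapshots, window_days, today):
    """Daily series over the full window ending yesterday; missing days become 0."""
    end = today - timedelta(days=1)
    days = [end - timedelta(days=window_days - 1 - i) for i in range(window_days)]
    return [float(snapshots.get(d, 0.0)) for d in days]


# Hypothetical marginal product: two sold-days close together, quiet otherwise.
today = date(2026, 6, 10)
snaps = {date(2026, 5, 20): 3, date(2026, 5, 22): 3}

busy = first_to_last_series(snaps)           # 3 days long
full = full_window_series(snaps, 60, today)  # 60 days long

print(sum(busy) / len(busy), sum(full) / len(full))
```

The same six units average 2.0/day over the busy span but 0.1/day over the 60-day window, which is the level the smoother should actually be fitted on.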
### F3. Stop double-applying the monthly seasonal index
**Where:** `forecast_engine.py`, `generate_all_forecasts()`: the `seasonal_multipliers` pre-compute (~lines 959–961) and application (~line 1050).
**Problem:** every per-product calibration (decay velocity, mature Holt level, launch first-week scale, preorder rate, slow-mover velocity) is fitted on *raw recent actuals*, which already embed the current month's seasonal level. The forecast then multiplies by the **absolute** monthly index of the target date. Example from the live indices (`forecast_runs.phase_counts` for run 129): May = 1.224 (sale month), June = 0.982. Early-June forecasts were calibrated on May-sale-inflated velocities and barely discounted — a structural ~25% over-forecast at that transition, and it'll be worse around November (1.316).
**Fix:** apply the seasonal index *relative to the calibration period*. Compute a calibration index as the average monthly index over the trailing 30 calendar days (robust at month boundaries), then divide:
```python
today = date.today()
trailing = [today - timedelta(days=i) for i in range(1, 31)]
calibration_index = float(np.mean([monthly_indices.get(d.month, 1.0) for d in trailing]))
seasonal_multipliers = [
monthly_indices.get(d.month, 1.0) / max(calibration_index, 0.1)
for d in forecast_dates
]
```
Leave the DOW multipliers absolute — every calibration is a multi-week average and therefore DOW-neutral, so reshaping by absolute DOW indices is correct.
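A small worked example of the relative index, using the May and June values quoted above. This is a sketch under assumptions: the function name is made up, and `today` is passed in explicitly so the result is reproducible.

```python
from datetime import date, timedelta


def relative_seasonal_multipliers(monthly_indices, today, forecast_dates, lookback_days=30):
    """Seasonal multipliers expressed relative to the trailing calibration period."""
    trailing = [today - timedelta(days=i) for i in range(1, lookback_days + 1)]
    calibration_index = sum(monthly_indices.get(d.month, 1.0) for d in trailing) / len(trailing)
    calibration_index = max(calibration_index, 0.1)  # guard against a degenerate index
    multipliers = [monthly_indices.get(d.month, 1.0) / calibration_index for d in forecast_dates]
    return calibration_index, multipliers


monthly_indices = {5: 1.224, 6: 0.982}  # May and June indices from run 129
today = date(2026, 6, 1)                # trailing 30 days fall entirely in May
forecast_dates = [today + timedelta(days=i) for i in range(3)]

calibration_index, multipliers = relative_seasonal_multipliers(
    monthly_indices, today, forecast_dates)
print(round(calibration_index, 3), [round(m, 3) for m in multipliers])
```

On June 1 the calibration index equals the May index, so early-June forecasts are scaled by about 0.80 instead of the absolute 0.982. Once the trailing window sits entirely inside the forecast month, the multiplier returns to 1.0.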
**Optional sub-fix (same area, low priority):** the monthly indices are computed from a single trailing 365-day window, so each month appears once and YoY growth contaminates "seasonality". A cheap improvement is widening `SEASONAL_LOOKBACK_DAYS` to 730 and averaging the two observations of each month. Do this only after the main fixes are validated.
### Phase 1 validation
Deploy (edit locally; NFS propagates), run the engine manually once, wait for 3–5 daily cycles, then:
```sql
-- Portfolio ratio per day (target: drifts from ~2.0 toward 0.8–1.3)
WITH ranked AS (
SELECT pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.lifecycle_phase,
ROW_NUMBER() OVER (PARTITION BY pfh.pid, pfh.forecast_date ORDER BY fr.started_at DESC) rn
FROM product_forecasts_history pfh
JOIN forecast_runs fr ON fr.id = pfh.run_id
WHERE pfh.forecast_date >= CURRENT_DATE - 7)
SELECT r.forecast_date, round(SUM(r.forecast_units),0) AS fc,
SUM(COALESCE(dps.units_sold,0)) AS act,
round(SUM(r.forecast_units)/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS ratio
FROM ranked r
LEFT JOIN daily_product_snapshots dps ON dps.pid = r.pid AND dps.snapshot_date = r.forecast_date
WHERE r.rn = 1 AND r.lifecycle_phase != 'dormant'
GROUP BY 1 ORDER BY 1;
```
Also check `forecast_accuracy` `by_phase` rows for the newest run: decay bias should fall from +0.35 toward ~0, mature from +0.17 toward ~0. (Accuracy lags ~1 day behind each fix since it evaluates yesterday's forecasts.)
---
## Phase 2 — Demand the model currently ignores or mistimes
### F4. Preorder: forecast the preorder rate until arrival, launch curve after
**Where:** `forecast_engine.py`: `batch_load_product_data()` (add arrival dates), `generate_all_forecasts()` preorder branch (~lines 1005–1009), and `forecast_from_curve()` (or a small wrapper).
**Problem:** preorder products run the launch curve from `age=0` starting **today**, i.e. full first-week launch sales while the product is still weeks from arriving. Actual preorder-period sales are a much slower trickle.
**Fix:**
1. Batch-load each preorder product's expected arrival from `purchase_orders` (line-item grain: it has `pid` and `expected_date` directly). Open statuses verified against live data: `created`, `ordered`, `electronically_sent`, `receiving_started` (~705 open line items currently have a future `expected_date`):
```sql
SELECT pid, MIN(expected_date) AS expected_arrival
FROM purchase_orders
WHERE pid = ANY(%s)
AND status IN ('created', 'ordered', 'electronically_sent', 'receiving_started')
AND expected_date IS NOT NULL
AND expected_date >= CURRENT_DATE
GROUP BY pid
```
Fallbacks, in order: (a) an open PO with a *past* `expected_date` → assume arrival in 7 days; (b) no PO at all → arrival in 14 days (and log a counter of how many hit this default).
2. In the preorder branch, build the daily array piecewise. Let `days_until_arrival = (expected_arrival - today).days`:
- Days `0 .. days_until_arrival-1`: flat observed preorder daily rate = `preorder_sales[pid] / max(preorder_days[pid], 1)` (both already batch-loaded), clamped to ≤ the curve's scaled week-0 daily value.
- Days `days_until_arrival .. horizon`: `forecast_from_curve(curve_info, scale, age_days=0, ...)` shifted so the curve's day 0 lands on the arrival date (i.e. pass `horizon_days - days_until_arrival` and offset into the output array).
- Keep the existing `compute_scale_factor('preorder', ...)` for the post-arrival curve; the pre-arrival segment doesn't use it.
This is consistent with how the reference curves were built: historical preorder units were recorded on their **order dates** (pre-arrival), so week-0 of the fitted curves reflects post-receipt orders, not the backlog.
### F5. Dormant products: small positive rate instead of hard zero, and count them
**Where:** `forecast_engine.py`: `generate_all_forecasts()` dormant branch (~lines 1040–1042), `batch_load_product_data()`, and `compute_accuracy()`.
**Problem:** all ~28K dormant products are forecast at exactly 0, yet they sold 16,180 units in the eval window (~11% of all demand) — restocks, promos, long-tail. Worse, dormant is *excluded* from the headline accuracy filter, so this miss is invisible.
**Fix (cheap version, do this now):**
1. Batch-load a trailing-180-day order rate for dormant products (11,362 of them have ≥1 sale in 180d — verified):
```sql
SELECT o.pid, SUM(o.quantity) / 180.0 AS rate
FROM orders o
WHERE o.pid = ANY(%s)
AND o.canceled IS DISTINCT FROM TRUE
AND o.date >= CURRENT_DATE - INTERVAL '180 days'
GROUP BY o.pid
```
2. Dormant branch: if the product has a rate > 0, forecast it flat with `method = 'velocity'`; else keep zeros with `method = 'zero'`. Apply the same DOW/seasonal multipliers as everything else (automatic — they're applied after the branch).
3. In `compute_accuracy()`, add a second overall row: `metric_type='overall', dimension_value='all_incl_dormant'` with no dormant filter (keep the existing `'all'` row unchanged for trend continuity). One extra entry in the `dimensions`/`filter_clauses` dicts.
**Upgrade path (optional, Phase 4):** replace flat rates for `slow_mover` + dormant-with-sales with TSB (Teunter–Syntetos–Babai), the standard intermittent-demand method with obsolescence handling. Per product over a daily series `d_t` (build it from snapshots the F2 way, with a full calendar reindex):
```
if d_t > 0:  p_t = p_{t-1} + β·(1 − p_{t-1});   z_t = z_{t-1} + α·(d_t − z_{t-1})
else:        p_t = p_{t-1}·(1 − β);             z_t = z_{t-1}
forecast = p_T · z_T   (flat across horizon)
```
Start with α=0.1, β=0.05, initialize p = (nonzero days / total days), z = mean of nonzero demands. Scope: slow_mover (~6K) + dormant with 180d sales (~11K); series from up to 180 days of snapshots (sparse rows → ~manageable volume). Only do this after Phase 3 measurement exists to prove it beats the flat rates.
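The update rules above can be sketched in plain Python as follows. This is an illustration of the recurrence with the stated starting values, not engine code; the function name is hypothetical and the input is assumed to be a full-calendar daily series with zeros on quiet days.

```python
def tsb_forecast(series, alpha=0.1, beta=0.05):
    """Flat daily TSB forecast: demand probability p times demand size z.

    series: daily units over the full calendar window (zeros included).
    """
    nonzero = [d for d in series if d > 0]
    if not nonzero:
        return 0.0  # no demand observed in the window: nothing to forecast
    # Initialisation as described in the plan.
    p = len(nonzero) / len(series)    # share of days with demand
    z = sum(nonzero) / len(nonzero)   # mean size of nonzero demands
    for d in series:
        if d > 0:
            p = p + beta * (1.0 - p)  # demand occurred: probability moves toward 1
            z = z + alpha * (d - z)   # and the size estimate is updated
        else:
            p = p * (1.0 - beta)      # no demand: probability decays, size unchanged
    return p * z


# The same single sale scores lower the longer ago it happened (obsolescence).
old_sale = tsb_forecast([1, 0, 0, 0])
recent_sale = tsb_forecast([0, 0, 0, 1])
print(round(old_sale, 4), round(recent_sale, 4))
```

Because `p` decays on every zero-demand day, a product that stopped selling drifts toward zero on its own, which a flat trailing rate does not do until the sale leaves the window.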
### Phase 2 validation
After 3–5 cycles: preorder `by_phase` bias should drop from +0.85 toward < +0.3; the new `all_incl_dormant` row should appear and its `total_actual_units` minus `'all'`'s should be largely *covered* rather than all-miss (dormant `bias` rising from −1.36 toward ~−0.3 or better).
---
## Phase 3 — Fix the measurement (schema + engine + API + UI)
> Without this phase you cannot see whether Phases 1–2 worked except by ad-hoc SQL, the lead-time chart stays a single bucket forever, and the dashboard keeps displaying a number with a 190% floor in red.
### F7. Archive long-lead forecasts so 15/30/60/90d accuracy exists
**Where:** `forecast_engine.py`: `archive_forecasts()` (~lines 1086–1154), `compute_accuracy()` CTE (~lines 1201–1228).
**Problem:** the current design archives only *past-dated* rows of the previous run before truncation. With daily runs, that's only ever the 1-day-ahead slice: all 879,800 accuracy samples sit in the '1-7d' bucket and the longer buckets in the UI chart can never populate. Purchasing decisions ride on 30–60d forecasts that are never validated.
**Fix:**
1. Keep the existing past-date archiving exactly as is (it provides dense short-lead coverage).
2. After `generate_all_forecasts()` completes, additionally archive a **sampled set of future leads** from the new run, non-dormant only, attributed to the *current* run id (correct attribution, unlike the past-date path which attributes to the previous run):
```sql
INSERT INTO product_forecasts_history
(run_id, pid, forecast_date, forecast_units, forecast_revenue,
lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at)
SELECT %(run_id)s, pid, forecast_date, forecast_units, forecast_revenue,
lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at
FROM product_forecasts
WHERE lifecycle_phase != 'dormant'
AND forecast_date - CURRENT_DATE IN (7, 14, 30, 60, 89)
ON CONFLICT (run_id, pid, forecast_date) DO NOTHING
```
Volume: ~10K non-dormant products × 5 leads ≈ 50K rows/day; the existing 90-day prune (`forecast_date < CURRENT_DATE - 90`) bounds steady state at a few million rows. Note future-dated rows survive until their date passes + 90 days — that's intended.
3. **CRITICAL companion change** in `compute_accuracy()`: the accuracy CTE must now exclude not-yet-realized rows, or future-dated archives get scored against actual=0:
```sql
FROM product_forecasts_history pfh
JOIN forecast_runs fr ON fr.id = pfh.run_id
WHERE pfh.forecast_date < CURRENT_DATE -- ADD THIS
```
4. **Dedup semantics change.** Today's `ROW_NUMBER() OVER (PARTITION BY pid, forecast_date ORDER BY started_at DESC)` keeps only the latest (= shortest-lead) row per pid/date, which would silently discard all the new long-lead rows. Restructure:
- Compute `lead_days = forecast_date - started_at::date` and the lead bucket *inside* `ranked_history`.
- For `by_lead_time`: dedup `PARTITION BY pid, forecast_date, lead_bucket` (one sample per pid/date/bucket, latest run wins within a bucket).
- For everything else (`overall`, `by_phase`, `by_method`, `daily`, and the new weekly metric below): restrict to `lead_days BETWEEN 0 AND 6` and keep the existing per-(pid, date) dedup. This preserves the current meaning of the headline metrics (short-lead) while the lead-time table becomes real.
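The per-bucket dedup can be sketched in Python as below. The bucket boundaries are an assumption inferred from the Phase 3 validation notes (a 7-day lead lands in '8-14d', so leads are treated as zero-based); check them against the existing bucket expression in `compute_accuracy()`. The row shape and the `run_date` field, standing in for `started_at::date`, are hypothetical.

```python
from datetime import date

# (min_lead, max_lead, label); assumed zero-based boundaries, see note above.
LEAD_BUCKETS = [
    (0, 6, '1-7d'), (7, 13, '8-14d'), (14, 29, '15-30d'),
    (30, 59, '31-60d'), (60, 89, '61-90d'),
]


def lead_bucket(forecast_date, run_date):
    """Return the lead-time bucket label, or None if the lead is out of range."""
    lead_days = (forecast_date - run_date).days
    for lo, hi, label in LEAD_BUCKETS:
        if lo <= lead_days <= hi:
            return label
    return None


def dedup_by_lead_bucket(rows):
    """Keep one row per (pid, forecast_date, bucket); the latest run wins."""
    best = {}
    for row in rows:
        bucket = lead_bucket(row['forecast_date'], row['run_date'])
        if bucket is None:
            continue
        key = (row['pid'], row['forecast_date'], bucket)
        if key not in best or row['run_date'] > best[key]['run_date']:
            best[key] = row
    return best


target = date(2026, 7, 10)
rows = [
    {'pid': 1, 'forecast_date': target, 'run_date': date(2026, 6, 10), 'forecast_units': 0.4},
    {'pid': 1, 'forecast_date': target, 'run_date': date(2026, 7, 8), 'forecast_units': 0.3},
    {'pid': 1, 'forecast_date': target, 'run_date': date(2026, 7, 9), 'forecast_units': 0.2},
]
result = dedup_by_lead_bucket(rows)
print(sorted(key[2] for key in result))
```

The 30-day-lead row survives alongside the short-lead one, whereas the current per-(pid, date) dedup would keep only the 2026-07-09 run.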
### F8. Track a naive baseline (forecast value-added)
**Where:** `archive_forecasts()` (both INSERT paths), `compute_accuracy()`, `forecast_accuracy` schema, `/forecast/accuracy` endpoint.
**Problem:** the engine currently *loses* to a trailing-average naive forecast (221% vs 204% daily WMAPE) and nothing on the dashboard would ever reveal that. Every accuracy improvement should be judged as value-over-naive.
**Fix:**
1. Schema (idempotent, in the ensure blocks): `ALTER TABLE product_forecasts_history ADD COLUMN IF NOT EXISTS naive_units NUMERIC(10,2);` and `ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS naive_wmape NUMERIC(10,4), ADD COLUMN IF NOT EXISTS fva NUMERIC(10,4);`
2. Populate `naive_units` during both archive INSERTs via a join — naive = flat trailing-28-day average daily units as of archive time (28 days = DOW-balanced; information available at generation; same value at every lead, which is exactly what a naive baseline means):
```sql
LEFT JOIN (
SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily
FROM orders o
WHERE o.canceled IS DISTINCT FROM TRUE
AND o.date >= CURRENT_DATE - INTERVAL '28 days' AND o.date < CURRENT_DATE
GROUP BY o.pid
) nv ON nv.pid = pf.pid
-- select COALESCE(nv.naive_daily, 0) AS naive_units
```
3. In `compute_accuracy()`, add to each dimension's aggregate: `SUM(ABS(naive_units - actual_units)) / NULLIF(SUM(actual_units),0) AS naive_wmape` and store `fva = 1 - wmape / naive_wmape` (NULL-safe). Rows archived before this change have `naive_units` NULL — treat NULL as excluded (`FILTER (WHERE naive_units IS NOT NULL)` on the naive sums) rather than as zero.
4. Endpoint: include `naiveWmape` and `fva` in the `overall` (and per-phase) payload of `/dashboard/forecast/accuracy` in `dashboard.js`.
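As an illustration of the NULL-safe arithmetic in step 3, here is a minimal Python sketch. The function names and row shape are hypothetical; `None` stands in for a NULL `naive_units`, and those rows are left out of both naive sums rather than counted as zero.

```python
def wmape(pairs):
    """WMAPE over (predicted, actual) pairs; None when total actuals are zero."""
    pairs = list(pairs)
    total_actual = sum(actual for _, actual in pairs)
    if total_actual == 0:
        return None
    return sum(abs(pred - actual) for pred, actual in pairs) / total_actual


def forecast_value_added(rows):
    """rows: (forecast_units, naive_units or None, actual_units) tuples.

    Returns (engine_wmape, naive_wmape, fva); fva is None when it cannot be computed.
    """
    rows = list(rows)
    engine = wmape((fc, act) for fc, _, act in rows)
    # Rows archived before the naive column existed are excluded, not zero-filled.
    naive = wmape((nv, act) for _, nv, act in rows if nv is not None)
    if engine is None or not naive:
        return engine, naive, None
    return engine, naive, 1.0 - engine / naive


# Hypothetical sample: the third row predates the naive_units column.
rows = [(2.0, 0.5, 1.0), (0.0, 0.5, 1.0), (1.0, None, 0.0)]
engine_wmape, naive_wmape, fva = forecast_value_added(rows)
print(engine_wmape, naive_wmape, fva)
```

A negative FVA, as in this toy sample, means the engine is doing worse than the naive baseline, which is the situation the diagnosis found.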
### F9. Weekly-grain headline metric + bias as a percentage
**Where:** `compute_accuracy()`, `/forecast/accuracy` endpoint, `ForecastAccuracy.tsx`.
**Problem:** daily-grain WMAPE on this catalog has a ~190% floor, so as a headline it's noise. The informative numbers are (a) weekly-per-product WMAPE (currently ~109%, target ~70–85% post-fix) and (b) aggregate bias, which the UI currently renders as `+0.108 units`, indistinguishable from zero while the reality is +70%.
**Fix:**
1. New metric in `compute_accuracy()`: `metric_type='overall_weekly', dimension_value='all'`. Definition: using the short-lead deduped rows (lead ≤ 6, non-dormant), aggregate per `(pid, date_trunc('week', forecast_date))` keeping only complete weeks (`COUNT(*) = 7`), then `WMAPE = SUM(ABS(fc_week - act_week)) / SUM(act_week)`, excluding pid-weeks where both are 0. Store sample_size = number of pid-weeks. Compute `naive_wmape`/`fva` the same way from `naive_units`.
2. Endpoint: expose as `overallWeekly`; also add a weekly variant to the `accuracyTrend` query (`metric_type='overall_weekly'`). The trend will start empty (old runs lack the row) — that's fine; don't backfill.
3. `ForecastAccuracy.tsx`:
- Headline WMAPE → `overallWeekly.wmape`, labeled "WMAPE (weekly)". Keep daily WMAPE available in a tooltip if desired.
- Color thresholds for weekly grain: green ≤ 60, yellow ≤ 90, red above (tunable; document that they're calibrated for intermittent retail demand).
- Replace the bias row: show `(totalForecast / totalActual - 1)` as a signed percentage labeled "Forecast vs actual" (both totals already arrive in `overall`). Keep MAE.
- Add a "vs naive" line: naive weekly WMAPE and FVA. FVA > 0 = engine adds value.
- The lead-time chart needs no code change — buckets will populate as F7 rows mature (7d lead evaluable after 7 days, 30d after 30, etc.).
4. `confidenceLevel` in `/forecast/metrics` (`dashboard.js` ~line 360) is "share of products forecast via lifecycle curves", not confidence. It only feeds a per-day tooltip field; rename the JSON field to `curveCoverage` and update the one consumer in `ForecastMetrics.tsx`, or leave it and add a comment. Low priority.
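The weekly metric defined in step 1 can be sketched in Python like this. It is an illustration only: the function name and tuple layout are assumptions, and the input is assumed to be already deduped to short-lead, non-dormant rows. Weeks start on Monday, matching PostgreSQL's `date_trunc('week', ...)`.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_wmape(rows):
    """rows: (pid, forecast_date, forecast_units, actual_units) tuples.

    Returns (wmape or None, number of pid-weeks used).
    """
    weeks = defaultdict(lambda: [0, 0.0, 0.0])  # [day count, forecast sum, actual sum]
    for pid, day, fc, act in rows:
        monday = day - timedelta(days=day.weekday())  # Monday-based week start
        bucket = weeks[(pid, monday)]
        bucket[0] += 1
        bucket[1] += fc
        bucket[2] += act
    # Keep complete weeks only, and drop pid-weeks where forecast and actual are both 0.
    kept = [(fc, act) for n, fc, act in weeks.values()
            if n == 7 and not (fc == 0 and act == 0)]
    total_actual = sum(act for _, act in kept)
    if total_actual == 0:
        return None, len(kept)
    return sum(abs(fc - act) for fc, act in kept) / total_actual, len(kept)


monday = date(2026, 6, 1)
rows = [(1, monday + timedelta(days=i), 0.2, 1.0 if i == 3 else 0.0) for i in range(7)]
rows += [(2, monday + timedelta(days=i), 0.5, 0.0) for i in range(3)]  # incomplete week
value, sample_size = weekly_wmape(rows)
print(round(value, 3), sample_size)
```

For product 1 the daily-grain WMAPE of the same rows would be 200% (absolute error 2.0 against 1 unit), while the weekly grain gives 40%: the forecast was close in total and only wrong about which day the sale landed on.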
### Phase 3 validation
- Next run after deploy: `forecast_accuracy` contains `overall_weekly` and `fva` values; `/dashboard/forecast/accuracy` returns them; the overview popover renders weekly WMAPE, bias %, and the naive comparison.
- After 7/14/30 days: `by_lead_time` rows appear for '8-14d', '15-30d', '31-60d' buckets respectively (61-90d after ~60 days).
- Confirm engine runtime still < ~5 min and `product_forecasts_history` growth ≈ 50–70K rows/day.
---
## Phase 4 — Optional / after the above is proven
- **F6. TSB for slow movers + dormant** (spec in F5). Gate on Phase 3 measurement: ship only if weekly FVA improves on those phases.
- **F10. Confidence-margin source:** `load_accuracy_margins()` feeds daily-grain per-phase WMAPE (clamped to 1.0) into the intervals, so every interval is ±100% — uninformative. Once `overall_weekly` exists, add per-phase weekly rows (`by_phase_weekly`) and source margins from those instead.
- **F11.** Update or delete `backfill_accuracy_data()` (it encodes the old formulas). Until then, just don't run `--backfill`.
- **F12.** `compute_dow_indices()` weights by revenue but the multipliers are applied to units — switch `SUM(o.price * o.quantity)` to `SUM(o.quantity)`. Tiny effect.
- **F13.** Longer term: for reorder decisions the right target is P(lead-time demand > stock), not a point forecast. Evaluate quantile (pinball) loss at lead-time horizons using the existing confidence-interval columns. Design separately.
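For F13, the pinball (quantile) loss itself is a one-liner. The sketch below only shows its standard definition. Which quantile levels `confidence_lower` and `confidence_upper` represent is not stated here, so the `q` value in the usage line is purely illustrative.

```python
def pinball_loss(actual, quantile_forecast, q):
    """Pinball loss of one quantile forecast at level q (0 < q < 1)."""
    diff = actual - quantile_forecast
    # Under-forecasts are weighted by q, over-forecasts by (1 - q).
    return q * diff if diff >= 0 else (q - 1.0) * diff


# Illustrative: at q = 0.9, falling short by 2 units costs 9x as much as overshooting by 2.
print(pinball_loss(10, 8, 0.9), pinball_loss(10, 12, 0.9))
```

Averaged over products at a given lead time, a lower value means the stated quantile is better calibrated for stock-out decisions than a point-forecast error would show.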
---
## 4. Success criteria
1. Rolling-14-day portfolio forecast/actual ratio within **0.8–1.25** (currently 1.5–2.5).
2. Weekly-grain WMAPE ≤ **90%** and **FVA > 0** (engine beats naive) sustained for 2+ weeks.
3. Decay/preorder/mature per-phase bias within ±0.1 units/day (currently +0.35 / +0.85 / +0.17).
4. `all_incl_dormant` actuals covered: dormant bias better than −0.4 (currently −1.36, i.e. 100% miss).
5. Lead-time buckets through 31–60d populated with ≥10K samples each within ~6 weeks.
6. Launch phase stays healthy (bias within ±0.15, WMAPE not degraded) — regression guard for F3/F4 changes.
## 5. Re-measurement appendix
The naive-vs-engine comparison used in the diagnosis (rerun any time; adjust dates):
```sql
WITH ranked AS (
SELECT pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.lifecycle_phase,
ROW_NUMBER() OVER (PARTITION BY pfh.pid, pfh.forecast_date ORDER BY fr.started_at DESC) rn
FROM product_forecasts_history pfh
JOIN forecast_runs fr ON fr.id = pfh.run_id
WHERE pfh.forecast_date BETWEEN CURRENT_DATE - 9 AND CURRENT_DATE - 1),
eng AS (SELECT * FROM ranked WHERE rn = 1 AND lifecycle_phase != 'dormant'),
naive AS (
SELECT o.pid, SUM(o.quantity)/30.0 AS naive_daily FROM orders o
WHERE o.canceled IS DISTINCT FROM TRUE
AND o.date >= CURRENT_DATE - 39 AND o.date < CURRENT_DATE - 9
GROUP BY o.pid)
SELECT e.lifecycle_phase, COUNT(*) AS n, SUM(COALESCE(dps.units_sold,0)) AS actual,
round(SUM(e.forecast_units),0) AS engine_fc, round(SUM(COALESCE(nv.naive_daily,0)),0) AS naive_fc,
round(SUM(ABS(e.forecast_units - COALESCE(dps.units_sold,0)))/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS engine_wmape,
round(SUM(ABS(COALESCE(nv.naive_daily,0) - COALESCE(dps.units_sold,0)))/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS naive_wmape
FROM eng e
LEFT JOIN naive nv ON nv.pid = e.pid
LEFT JOIN daily_product_snapshots dps ON dps.pid = e.pid AND dps.snapshot_date = e.forecast_date
GROUP BY ROLLUP(e.lifecycle_phase) ORDER BY 1;
```
Baseline numbers to beat (June 1–9, 2026): engine 221% / naive 204% daily WMAPE; engine_fc/actual = 1.82; per-phase table in §1.
@@ -634,6 +634,52 @@ def forecast_from_curve(curve_params, scale_factor, age_days, horizon_days):
     return np.array(forecasts)
 
 
+def forecast_preorder(curve_params, scale_factor, days_until_arrival,
+                      preorder_daily_rate, horizon_days):
+    """
+    Piecewise pre-order forecast: a flat observed pre-order trickle until the
+    product is expected to arrive, then the scaled launch curve from age 0.
+
+    The launch curve was fit on POST-receipt order history, so running it from
+    today (while the product is still weeks from arriving) front-loads full
+    first-week launch volume that hasn't happened yet, the main driver of the
+    ~2.15x preorder over-forecast. Instead we forecast the slow pre-order rate
+    up to the arrival date, then start the curve's day 0 on that date.
+
+    See FORECAST_FIX_PLAN F4.
+
+    Args:
+        curve_params: (amplitude, decay_rate, baseline, ...) weekly curve
+        scale_factor: per-product multiplier for the post-arrival curve envelope
+        days_until_arrival: calendar days from today until expected arrival
+        preorder_daily_rate: observed pre-order units/day (trickle)
+        horizon_days: forecast horizon length
+
+    Returns:
+        array of daily forecast values of length horizon_days
+    """
+    amplitude, decay_rate, baseline = curve_params[:3]
+    forecasts = np.zeros(horizon_days)
+
+    # Clamp the arrival offset into the horizon
+    dua = int(max(0, min(days_until_arrival, horizon_days)))
+
+    # Pre-arrival segment: flat pre-order trickle, capped at the curve's scaled
+    # week-0 daily value (a pre-order day shouldn't out-sell the launch peak).
+    if dua > 0:
+        week0_daily = (amplitude / 7.0) * scale_factor + (baseline / 7.0)
+        pre_rate = preorder_daily_rate
+        if week0_daily > 0:
+            pre_rate = min(pre_rate, week0_daily)
+        forecasts[:dua] = max(0.0, pre_rate)
+
+    # Post-arrival segment: scaled launch curve, curve day 0 = arrival date.
+    if dua < horizon_days:
+        curve_part = forecast_from_curve(curve_params, scale_factor, 0, horizon_days - dua)
+        forecasts[dua:] = curve_part
+    return forecasts
+
+
 # ---------------------------------------------------------------------------
 # Batch data loading (eliminates N+1 per-product queries)
 # ---------------------------------------------------------------------------
@@ -651,9 +697,11 @@ def batch_load_product_data(conn, products):
     data = {
         'preorder_sales': {},
         'preorder_days': {},
+        'preorder_arrival_days': {},
         'launch_sales': {},
         'decay_velocity': {},
         'mature_history': {},
+        'dormant_rate': {},
     }
 
     # Pre-order sales: orders placed BEFORE first received date
@@ -677,6 +725,39 @@ def batch_load_product_data(conn, products):
             data['preorder_days'][int(row['pid'])] = float(row['preorder_days'])
         log.info(f"Batch loaded pre-order sales for {len(data['preorder_sales'])}/{len(preorder_pids)} preorder products")
 
+        # Expected arrival per pre-order product, to time the launch curve.
+        # Prefer the soonest FUTURE expected_date on an open PO; if the only open
+        # PO has a past expected_date assume 7 days; if there's no open PO at all
+        # assume 14 days. See FORECAST_FIX_PLAN F4.
+        arrival_sql = """
+            SELECT pid,
+                   MIN(expected_date) FILTER (
+                       WHERE expected_date IS NOT NULL AND expected_date >= CURRENT_DATE
+                   ) AS future_arrival
+            FROM purchase_orders
+            WHERE pid = ANY(%s)
+              AND status IN ('created', 'ordered', 'electronically_sent', 'receiving_started')
+            GROUP BY pid
+        """
+        adf = execute_query(conn, arrival_sql, [preorder_pids])
+        today = date.today()
+        for _, row in adf.iterrows():
+            pid = int(row['pid'])
+            fa = row['future_arrival']
+            if pd.notna(fa):
+                fa_date = pd.Timestamp(fa).date()
+                data['preorder_arrival_days'][pid] = max(0, (fa_date - today).days)
+            else:
+                data['preorder_arrival_days'][pid] = 7  # open PO, expected_date already past
+        no_po = 0
+        for pid in preorder_pids:
+            if int(pid) not in data['preorder_arrival_days']:
+                data['preorder_arrival_days'][int(pid)] = 14  # no open PO at all
+                no_po += 1
+        log.info(f"Batch loaded preorder arrival for "
+                 f"{len(data['preorder_arrival_days']) - no_po}/{len(preorder_pids)} via open POs, "
+                 f"{no_po} defaulted to 14d")
+
     # Launch sales: first 14 days after first received
     launch_pids = products[products['phase'] == 'launch']['pid'].tolist()
     if launch_pids:
@@ -694,15 +775,23 @@ def batch_load_product_data(conn, products):
             data['launch_sales'][int(row['pid'])] = float(row['total_sold'])
         log.info(f"Batch loaded launch sales for {len(data['launch_sales'])}/{len(launch_pids)} launch products")
 
-    # Decay recent velocity: average daily sales over last 30 days
+    # Decay recent velocity: TRUE calendar-daily average over the last 30 days.
+    # We divide the summed units by calendar days (clipped to the product's age),
+    # NOT by the number of snapshot rows. Snapshots are sparse and mostly land on
+    # sold-days, so AVG(units_sold) averages over sold-days only and inflated the
+    # decay rate ~4x (measured 1.353 vs true 0.332 units/day). See FORECAST_FIX_PLAN F1.
     decay_pids = products[products['phase'] == 'decay']['pid'].tolist()
     if decay_pids:
         sql = """
-            SELECT dps.pid, AVG(COALESCE(dps.units_sold, 0)) AS avg_daily
+            SELECT dps.pid,
+                   SUM(COALESCE(dps.units_sold, 0))::float
+                   / GREATEST(LEAST(30, (CURRENT_DATE - pm.date_first_received::date)), 1) AS avg_daily
             FROM daily_product_snapshots dps
+            JOIN product_metrics pm ON pm.pid = dps.pid
             WHERE dps.pid = ANY(%s)
               AND dps.snapshot_date >= CURRENT_DATE - INTERVAL '30 days'
-            GROUP BY dps.pid
+              AND dps.snapshot_date >= pm.date_first_received::date
+            GROUP BY dps.pid, pm.date_first_received
         """
         df = execute_query(conn, sql, [decay_pids])
         for _, row in df.iterrows():
@@ -724,6 +813,25 @@ def batch_load_product_data(conn, products):
             data['mature_history'][int(pid)] = group.copy()
         log.info(f"Batch loaded history for {len(data['mature_history'])}/{len(mature_pids)} mature products")
 
+    # Dormant trailing order rate: dormant products forecast 0 by default, but
+    # ~11K of them still sell (restocks, promos, long-tail); that is ~11% of all
+    # demand currently forecast as a hard zero. Load a trailing-180-day daily order
+    # rate so the dormant branch can carry a small positive rate. See FORECAST_FIX_PLAN F5.
+    dormant_pids = products[products['phase'] == 'dormant']['pid'].tolist()
+    if dormant_pids:
+        sql = """
+            SELECT o.pid, SUM(o.quantity) / 180.0 AS rate
+            FROM orders o
+            WHERE o.pid = ANY(%s)
+              AND o.canceled IS DISTINCT FROM TRUE
+              AND o.date >= CURRENT_DATE - INTERVAL '180 days'
+            GROUP BY o.pid
+        """
+        df = execute_query(conn, sql, [dormant_pids])
+        for _, row in df.iterrows():
+            data['dormant_rate'][int(row['pid'])] = float(row['rate'])
+        log.info(f"Batch loaded dormant order rate for {len(data['dormant_rate'])}/{len(dormant_pids)} dormant products")
+
     return data
@@ -829,11 +937,20 @@ def forecast_mature(product, history_df):
# Not enough data — flat velocity
return np.full(FORECAST_HORIZON_DAYS, velocity)
# Fill date gaps with 0 sales (days where product had no snapshot = no sales)
# Reindex over the FULL calendar window ending yesterday, not just the span
# between the first and last snapshot. resample() only covers first→last
# snapshot, so leading/trailing quiet periods are absent and the Holt level
# is fitted only on the product's busy span (can run ~4x too high). An
# explicit reindex fills every quiet calendar day with 0. (pid, snapshot_date)
# is unique so there is no duplicate-index risk; do NOT use combine_first
# (it keeps zeros over real data). See FORECAST_FIX_PLAN F2.
hist = history_df.copy()
hist['snapshot_date'] = pd.to_datetime(hist['snapshot_date'])
hist = hist.set_index('snapshot_date').resample('D').sum().fillna(0)
series = hist['units_sold'].values.astype(float)
hist = hist.set_index('snapshot_date')['units_sold']
full_index = pd.date_range(
end=pd.Timestamp(date.today() - timedelta(days=1)),
periods=EXP_SMOOTHING_WINDOW, freq='D')
series = hist.reindex(full_index, fill_value=0.0).values.astype(float)
# Need at least 2 non-zero values for smoothing
if np.count_nonzero(series) < 2:
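Review note on the reindex change: a minimal pure-Python sketch of why the span between first and last snapshot overstates the level. It only imitates the shape of `resample('D')` versus a fixed-window `reindex`; the helper names, the 28-day window, and the data are assumptions for illustration (the real window is `EXP_SMOOTHING_WINDOW`, whose value is not shown in this diff).

```python
from datetime import date, timedelta


def span_series(snapshots):
    """Zero-fill only between the first and last snapshot (resample-like)."""
    first, last = min(snapshots), max(snapshots)
    n_days = (last - first).days + 1
    return [float(snapshots.get(first + timedelta(days=i), 0.0))
            for i in range(n_days)]


def full_window_series(snapshots, today, window_days):
    """Zero-fill a fixed calendar window ending yesterday (reindex-like)."""
    start = today - timedelta(days=window_days)
    return [float(snapshots.get(start + timedelta(days=i), 0.0))
            for i in range(window_days)]


# Hypothetical product: a short busy burst, then silence.
today = date(2026, 6, 10)
snapshots = {today - timedelta(days=d): u
             for d, u in [(20, 3), (19, 2), (18, 4), (17, 3)]}

span = span_series(snapshots)
full = full_window_series(snapshots, today, window_days=28)

print(len(span), sum(span) / len(span))  # level over the busy span only
print(len(full), sum(full) / len(full))  # level over the whole window
```

The same 12 units are spread over 4 days in one series and 28 days in the other, so any level fitted on the first series starts far too high.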
@@ -956,9 +1073,24 @@ def generate_all_forecasts(conn, curves_df, dow_indices, monthly_indices=None,
today = date.today()
forecast_dates = [today + timedelta(days=i) for i in range(FORECAST_HORIZON_DAYS)]
# Pre-compute DOW and seasonal multipliers for each forecast date
# Pre-compute DOW and seasonal multipliers for each forecast date.
# DOW multipliers stay ABSOLUTE — every calibration is a multi-week average
# and therefore DOW-neutral, so reshaping by absolute DOW indices is correct.
# Seasonal indices must be applied RELATIVE to the calibration period:
# each per-product calibration (decay velocity, mature Holt level, launch /
# preorder scale) is fitted on raw recent actuals that already embed the
# current month's seasonal level. Multiplying by the absolute target-month
# index double-counts seasonality (~25% over-forecast at the May→June sale
# transition, worse near November). Divide by the trailing-30-day average
# index so only the seasonal *change* from calibration to target applies.
# See FORECAST_FIX_PLAN F3.
dow_multipliers = [dow_indices.get(d.isoweekday(), 1.0) for d in forecast_dates]
seasonal_multipliers = [monthly_indices.get(d.month, 1.0) for d in forecast_dates]
trailing = [today - timedelta(days=i) for i in range(1, 31)]
calibration_index = float(np.mean([monthly_indices.get(d.month, 1.0) for d in trailing]))
seasonal_multipliers = [
monthly_indices.get(d.month, 1.0) / max(calibration_index, 0.1)
for d in forecast_dates
]
# TRUNCATE before streaming writes
with conn.cursor() as cur:
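Review note on the relative seasonal index: the sketch below reproduces the calculation in isolation so the effect is easy to check by hand. The function name and the index values are hypothetical; only the formula (target-month index divided by the trailing-30-day average index, floored at 0.1) follows the diff.

```python
from datetime import date, timedelta


def relative_seasonal_multipliers(monthly_indices, today, horizon_days,
                                  calibration_days=30):
    """Seasonal multipliers relative to the calibration period's average index."""
    trailing = [today - timedelta(days=i) for i in range(1, calibration_days + 1)]
    calibration_index = sum(monthly_indices.get(d.month, 1.0)
                            for d in trailing) / len(trailing)
    calibration_index = max(calibration_index, 0.1)  # guard against tiny divisors
    forecast_dates = [today + timedelta(days=i) for i in range(horizon_days)]
    return [monthly_indices.get(d.month, 1.0) / calibration_index
            for d in forecast_dates]


# Hypothetical indices: November and December both run at 1.5x, January at 0.9x.
indices = {11: 1.5, 12: 1.5, 1: 0.9}
multipliers = relative_seasonal_multipliers(indices, date(2026, 12, 1), 40)

print(multipliers[0])   # December target, calibrated on November
print(multipliers[31])  # January 1st target
```

Because the calibration data already sits at the November level, December forecasts are left unscaled instead of being multiplied by 1.5 a second time, while January is scaled down by the change between the two periods.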
@@ -1002,9 +1134,33 @@ def generate_all_forecasts(conn, curves_df, dow_indices, monthly_indices=None,
try:
curve_info = get_curve_for_product(product, curves_df)
if phase in ('preorder', 'launch'):
if phase == 'preorder':
if curve_info:
scale = compute_scale_factor(phase, product, curve_info, batch_data)
scale = compute_scale_factor('preorder', product, curve_info, batch_data)
# Time the launch curve to expected arrival instead of
# running it from today (F4). Pre-arrival days carry the
# observed pre-order trickle rate.
days_until_arrival = batch_data['preorder_arrival_days'].get(pid, 14)
preorder_units = batch_data['preorder_sales'].get(pid, 0)
preorder_days = batch_data['preorder_days'].get(pid, 1)
preorder_daily_rate = preorder_units / max(preorder_days, 1)
forecasts = forecast_preorder(
curve_info, scale, days_until_arrival,
preorder_daily_rate, FORECAST_HORIZON_DAYS)
method = 'lifecycle_curve'
else:
# No reliable curve — fall back to velocity if available
velocity = product.get('sales_velocity_daily') or 0
if velocity > 0:
forecasts = np.full(FORECAST_HORIZON_DAYS, velocity)
method = 'velocity'
else:
forecasts = forecast_dormant()
method = 'zero'
elif phase == 'launch':
if curve_info:
scale = compute_scale_factor('launch', product, curve_info, batch_data)
forecasts = forecast_from_curve(curve_info, scale, age, FORECAST_HORIZON_DAYS)
method = 'lifecycle_curve'
else:
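Review note on the preorder branch: `forecast_preorder` itself is not part of this hunk, so the following is only a sketch of the behaviour the comment describes (trickle rate before arrival, launch curve from age 0 on arrival day). The function name, the list-based curve, and the hold-last-value tail are all assumptions, not the actual implementation.

```python
def sketch_preorder_forecast(curve_daily, scale, days_until_arrival,
                             preorder_daily_rate, horizon_days):
    """Hypothetical stand-in for forecast_preorder().

    curve_daily: non-empty list of launch-curve values indexed by product age
    (an assumed representation; the real curve_info structure may differ).
    """
    forecast = []
    for day in range(horizon_days):
        if day < days_until_arrival:
            # Product has not arrived yet: carry the observed pre-order trickle.
            forecast.append(preorder_daily_rate)
        else:
            # Curve age starts at 0 on the expected arrival day, not today.
            age = day - days_until_arrival
            base = curve_daily[age] if age < len(curve_daily) else curve_daily[-1]
            forecast.append(scale * base)
    return forecast


print(sketch_preorder_forecast([10, 8, 6], 0.5, 2, 1.0, 6))
```

The point of the sketch is the offset: before the fix the launch peak would have been placed on day 0 of the horizon even though no stock could be sold yet.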
@@ -1038,8 +1194,16 @@ def generate_all_forecasts(conn, curves_df, dow_indices, monthly_indices=None,
method = 'velocity'
else: # dormant
forecasts = forecast_dormant()
method = 'zero'
# Carry a small positive rate for dormant products that still
# trickle sales (restocks/promos/long-tail); only truly dead
# products stay at zero. See FORECAST_FIX_PLAN F5.
rate = batch_data['dormant_rate'].get(pid, 0)
if rate > 0:
forecasts = np.full(FORECAST_HORIZON_DAYS, rate)
method = 'velocity'
else:
forecasts = forecast_dormant()
method = 'zero'
# Confidence interval: use accuracy-calibrated margins per phase
base_margin = accuracy_margins.get(phase, 0.5)
@@ -1108,6 +1272,8 @@ def archive_forecasts(conn, run_id):
""")
cur.execute("CREATE INDEX IF NOT EXISTS idx_pfh_date ON product_forecasts_history(forecast_date)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_pfh_pid_date ON product_forecasts_history(pid, forecast_date)")
# Naive-baseline column for forecast value-added (FVA). See FORECAST_FIX_PLAN F8.
cur.execute("ALTER TABLE product_forecasts_history ADD COLUMN IF NOT EXISTS naive_units NUMERIC(10,2)")
# Find the previous completed run (whose forecasts are still in product_forecasts)
cur.execute("""
@@ -1124,15 +1290,27 @@ def archive_forecasts(conn, run_id):
prev_run_id = prev_run[0]
# Archive only past-date forecasts (where actuals now exist)
# Archive only past-date forecasts (where actuals now exist). Attach the
# naive baseline (flat trailing-28-day daily average) at the same time so
# forecast value-added can be measured. NOTE: the 28-day window ends at
# archive time, so it can overlap the dates being scored. See FORECAST_FIX_PLAN F8.
cur.execute("""
INSERT INTO product_forecasts_history
(run_id, pid, forecast_date, forecast_units, forecast_revenue,
lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at)
SELECT %s, pid, forecast_date, forecast_units, forecast_revenue,
lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at
FROM product_forecasts
WHERE forecast_date < CURRENT_DATE
lifecycle_phase, forecast_method, confidence_lower, confidence_upper,
generated_at, naive_units)
SELECT %s, pf.pid, pf.forecast_date, pf.forecast_units, pf.forecast_revenue,
pf.lifecycle_phase, pf.forecast_method, pf.confidence_lower, pf.confidence_upper,
pf.generated_at, COALESCE(nv.naive_daily, 0)
FROM product_forecasts pf
LEFT JOIN (
SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily
FROM orders o
WHERE o.canceled IS DISTINCT FROM TRUE
AND o.date >= CURRENT_DATE - INTERVAL '28 days'
AND o.date < CURRENT_DATE
GROUP BY o.pid
) nv ON nv.pid = pf.pid
WHERE pf.forecast_date < CURRENT_DATE
ON CONFLICT (run_id, pid, forecast_date) DO NOTHING
""", (prev_run_id,))
@@ -1154,6 +1332,48 @@ def archive_forecasts(conn, run_id):
return archived
def archive_future_leads(conn, run_id):
"""
Archive a sampled set of FUTURE-lead forecasts from the just-generated
product_forecasts, attributed to the current run.
The past-date archive in archive_forecasts() only ever captures the 1-day
slice that just elapsed, so every accuracy sample lands in the '1-7d' lead
bucket and the 15/30/60/90-day forecasts that purchasing actually rides on
are never validated. Here we snapshot the 7/14/30/60/89-day-ahead leads
(non-dormant) so that, once each date passes, compute_accuracy() can score
them in their lead bucket. The naive baseline is attached the same way as in
the past-date path. Future-dated rows survive the 90-day prune until their
own date passes. See FORECAST_FIX_PLAN F7.
"""
with conn.cursor() as cur:
cur.execute("""
INSERT INTO product_forecasts_history
(run_id, pid, forecast_date, forecast_units, forecast_revenue,
lifecycle_phase, forecast_method, confidence_lower, confidence_upper,
generated_at, naive_units)
SELECT %s, pf.pid, pf.forecast_date, pf.forecast_units, pf.forecast_revenue,
pf.lifecycle_phase, pf.forecast_method, pf.confidence_lower, pf.confidence_upper,
pf.generated_at, COALESCE(nv.naive_daily, 0)
FROM product_forecasts pf
LEFT JOIN (
SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily
FROM orders o
WHERE o.canceled IS DISTINCT FROM TRUE
AND o.date >= CURRENT_DATE - INTERVAL '28 days'
AND o.date < CURRENT_DATE
GROUP BY o.pid
) nv ON nv.pid = pf.pid
WHERE pf.lifecycle_phase != 'dormant'
AND pf.forecast_date - CURRENT_DATE IN (7, 14, 30, 60, 89)
ON CONFLICT (run_id, pid, forecast_date) DO NOTHING
""", (run_id,))
archived = cur.rowcount
conn.commit()
log.info(f"Archived {archived} future-lead forecast rows (7/14/30/60/89d) for run {run_id}")
return archived
def compute_accuracy(conn, run_id):
"""
Compute forecast accuracy metrics from archived history vs. actual sales.
@@ -1162,11 +1382,18 @@ def compute_accuracy(conn, run_id):
(pid, forecast_date = snapshot_date) to compare forecasted vs. actual units.
Stores results in forecast_accuracy table, broken down by:
- overall: single aggregate row
- overall: two rows — 'all' (non-dormant) and 'all_incl_dormant' (F5)
- overall_weekly: per-product weekly-grain WMAPE — the informative headline
for intermittent demand (daily grain has a ~190% floor) (F9)
- by_phase: per lifecycle phase
- by_lead_time: bucketed by how far ahead the forecast was
- by_lead_time: bucketed by how far ahead the forecast was — long-lead
buckets populate as the future-lead archives mature (F7)
- by_method: per forecast method
- daily: per forecast_date (for trend charts)
Every dimension also stores naive_wmape (flat trailing-28d baseline) and
fva = 1 - wmape/naive_wmape, so the engine can be judged as value-over-naive
(F8). Only realized dates (forecast_date < CURRENT_DATE) are scored.
"""
with conn.cursor() as cur:
# Ensure accuracy table exists
@@ -1186,6 +1413,10 @@ def compute_accuracy(conn, run_id):
PRIMARY KEY (run_id, metric_type, dimension_value)
)
""")
# Naive-baseline WMAPE and forecast value-added (FVA = 1 - wmape/naive_wmape).
# See FORECAST_FIX_PLAN F8.
cur.execute("ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS naive_wmape NUMERIC(10,4)")
cur.execute("ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS fva NUMERIC(10,4)")
conn.commit()
# Check if we have any history to analyze
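Review note on lead buckets: a small sketch that mirrors the SQL `CASE` used for `lead_bucket` and shows where the sampled future leads (7/14/30/60/89 days) end up. The Python function is a hypothetical restatement for checking the mapping, not code from the pipeline.

```python
def lead_bucket(lead_days):
    """Mirror of the SQL CASE: anything outside 0-59 falls into '61-90d'."""
    if 0 <= lead_days <= 6:
        return '1-7d'
    if 7 <= lead_days <= 13:
        return '8-14d'
    if 14 <= lead_days <= 29:
        return '15-30d'
    if 30 <= lead_days <= 59:
        return '31-60d'
    return '61-90d'


sampled_leads = [7, 14, 30, 60, 89]
mapping = {lead: lead_bucket(lead) for lead in sampled_leads}
print(mapping)
```

Two of the five sampled leads (60 and 89) share the last bucket, and no sampled lead lands in '1-7d', which stays fed by the past-date archive.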
@@ -1195,124 +1426,199 @@ def compute_accuracy(conn, run_id):
log.info("No forecast history available for accuracy computation")
return
# For each (pid, forecast_date) pair, keep only the most recent run's
# forecast row. This prevents double-counting when multiple runs have
# archived forecasts for the same product×date combination.
accuracy_cte = """
WITH ranked_history AS (
# Base CTEs (FORECAST_FIX_PLAN F7):
# - Only score realized dates (forecast_date < CURRENT_DATE); future-lead
# archives are excluded until their date passes.
# - short_lead*: lead 0-6 deduped per (pid, forecast_date) — preserves the
# meaning of the existing headline metrics. short_lead_eval keeps the
# raw snapshot grid (incl. zero-zero days) for complete-week detection;
# `accuracy` drops zero-zero days for daily-grain metrics.
# - lead_dedup/lead_accuracy: deduped per (pid, forecast_date, lead_bucket)
# so each long-lead bucket gets its own sample (the by_lead_time table).
base_cte = """
WITH ranked_all AS (
SELECT
pfh.*,
pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.naive_units,
pfh.lifecycle_phase, pfh.forecast_method,
fr.started_at,
ROW_NUMBER() OVER (
PARTITION BY pfh.pid, pfh.forecast_date
ORDER BY fr.started_at DESC
) AS rn
(pfh.forecast_date - fr.started_at::date) AS lead_days,
CASE
WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 0 AND 6 THEN '1-7d'
WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 7 AND 13 THEN '8-14d'
WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 14 AND 29 THEN '15-30d'
WHEN (pfh.forecast_date - fr.started_at::date) BETWEEN 30 AND 59 THEN '31-60d'
ELSE '61-90d'
END AS lead_bucket
FROM product_forecasts_history pfh
JOIN forecast_runs fr ON fr.id = pfh.run_id
WHERE pfh.forecast_date < CURRENT_DATE
),
short_lead AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY pid, forecast_date ORDER BY started_at DESC
) AS rn
FROM ranked_all
WHERE lead_days BETWEEN 0 AND 6
),
short_lead_eval AS (
SELECT sl.pid, sl.lifecycle_phase, sl.forecast_method, sl.forecast_date,
sl.forecast_units, sl.naive_units,
COALESCE(dps.units_sold, 0) AS actual_units,
(sl.forecast_units - COALESCE(dps.units_sold, 0)) AS error,
ABS(sl.forecast_units - COALESCE(dps.units_sold, 0)) AS abs_error
FROM short_lead sl
LEFT JOIN daily_product_snapshots dps
ON dps.pid = sl.pid AND dps.snapshot_date = sl.forecast_date
WHERE sl.rn = 1
),
accuracy AS (
SELECT
rh.lifecycle_phase,
rh.forecast_method,
rh.forecast_date,
(rh.forecast_date - rh.started_at::date) AS lead_days,
rh.forecast_units,
SELECT * FROM short_lead_eval
WHERE NOT (forecast_units = 0 AND actual_units = 0)
),
lead_dedup AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY pid, forecast_date, lead_bucket ORDER BY started_at DESC
) AS rn
FROM ranked_all
),
lead_accuracy AS (
SELECT ld.lead_bucket, ld.forecast_units, ld.naive_units,
COALESCE(dps.units_sold, 0) AS actual_units,
(rh.forecast_units - COALESCE(dps.units_sold, 0)) AS error,
ABS(rh.forecast_units - COALESCE(dps.units_sold, 0)) AS abs_error
FROM ranked_history rh
(ld.forecast_units - COALESCE(dps.units_sold, 0)) AS error,
ABS(ld.forecast_units - COALESCE(dps.units_sold, 0)) AS abs_error
FROM lead_dedup ld
LEFT JOIN daily_product_snapshots dps
ON dps.pid = rh.pid AND dps.snapshot_date = rh.forecast_date
WHERE rh.rn = 1
AND NOT (rh.forecast_units = 0 AND COALESCE(dps.units_sold, 0) = 0)
ON dps.pid = ld.pid AND dps.snapshot_date = ld.forecast_date
WHERE ld.rn = 1
AND ld.lifecycle_phase != 'dormant'
AND NOT (ld.forecast_units = 0 AND COALESCE(dps.units_sold, 0) = 0)
)
"""
# Compute and insert metrics for each dimension
dimensions = {
'overall': "SELECT 'all' AS dim",
'by_phase': "SELECT DISTINCT lifecycle_phase AS dim FROM accuracy",
'by_lead_time': """
SELECT DISTINCT
CASE
WHEN lead_days BETWEEN 0 AND 6 THEN '1-7d'
WHEN lead_days BETWEEN 7 AND 13 THEN '8-14d'
WHEN lead_days BETWEEN 14 AND 29 THEN '15-30d'
WHEN lead_days BETWEEN 30 AND 59 THEN '31-60d'
ELSE '61-90d'
END AS dim
FROM accuracy
""",
'by_method': "SELECT DISTINCT forecast_method AS dim FROM accuracy",
'daily': "SELECT DISTINCT forecast_date::text AS dim FROM accuracy",
}
filter_clauses = {
'overall': "lifecycle_phase != 'dormant'",
'by_phase': "lifecycle_phase = dims.dim",
'by_lead_time': """
CASE
WHEN lead_days BETWEEN 0 AND 6 THEN '1-7d'
WHEN lead_days BETWEEN 7 AND 13 THEN '8-14d'
WHEN lead_days BETWEEN 14 AND 29 THEN '15-30d'
WHEN lead_days BETWEEN 30 AND 59 THEN '31-60d'
ELSE '61-90d'
END = dims.dim
""",
'by_method': "forecast_method = dims.dim",
'daily': "forecast_date::text = dims.dim",
}
total_inserted = 0
for metric_type, dim_query in dimensions.items():
filter_clause = filter_clauses[metric_type]
sql = f"""
{accuracy_cte},
dims AS ({dim_query})
# Daily-grain aggregate over a source CTE aliased `a`, computing the
# engine WMAPE plus the naive-baseline WMAPE (NULL-safe: rows archived
# before F8 have naive_units NULL and are excluded from the naive sums).
def daily_agg(dim_expr, source, where=None, group_by=None):
where_sql = f"WHERE {where}" if where else ""
group_sql = f"GROUP BY {group_by}" if group_by else ""
return f"""
SELECT
dims.dim,
{dim_expr} AS dim,
COUNT(*) AS sample_size,
COALESCE(SUM(a.actual_units), 0) AS total_actual,
COALESCE(SUM(a.forecast_units), 0) AS total_forecast,
AVG(a.abs_error) AS mae,
CASE WHEN SUM(a.actual_units) > 0
THEN SUM(a.abs_error) / SUM(a.actual_units)
ELSE NULL END AS wmape,
THEN SUM(a.abs_error) / SUM(a.actual_units) ELSE NULL END AS wmape,
AVG(a.error) AS bias,
SQRT(AVG(POWER(a.error, 2))) AS rmse
FROM dims
CROSS JOIN accuracy a
WHERE {filter_clause}
GROUP BY dims.dim
SQRT(AVG(POWER(a.error, 2))) AS rmse,
CASE WHEN SUM(a.actual_units) FILTER (WHERE a.naive_units IS NOT NULL) > 0
THEN SUM(ABS(a.naive_units - a.actual_units)) FILTER (WHERE a.naive_units IS NOT NULL)
/ SUM(a.actual_units) FILTER (WHERE a.naive_units IS NOT NULL)
ELSE NULL END AS naive_wmape
FROM {source} a
{where_sql}
{group_sql}
"""
cur.execute(sql)
rows = cur.fetchall()
insert_sql = """
INSERT INTO forecast_accuracy
(run_id, metric_type, dimension_value, sample_size,
total_actual_units, total_forecast_units, mae, wmape, bias, rmse,
naive_wmape, fva)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (run_id, metric_type, dimension_value)
DO UPDATE SET
sample_size = EXCLUDED.sample_size,
total_actual_units = EXCLUDED.total_actual_units,
total_forecast_units = EXCLUDED.total_forecast_units,
mae = EXCLUDED.mae, wmape = EXCLUDED.wmape,
bias = EXCLUDED.bias, rmse = EXCLUDED.rmse,
naive_wmape = EXCLUDED.naive_wmape, fva = EXCLUDED.fva,
computed_at = NOW()
"""
for row in rows:
dim_val, sample_size, total_actual, total_forecast, mae, wmape, bias, rmse = row
cur.execute("""
INSERT INTO forecast_accuracy
(run_id, metric_type, dimension_value, sample_size,
total_actual_units, total_forecast_units, mae, wmape, bias, rmse)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (run_id, metric_type, dimension_value)
DO UPDATE SET
sample_size = EXCLUDED.sample_size,
total_actual_units = EXCLUDED.total_actual_units,
total_forecast_units = EXCLUDED.total_forecast_units,
mae = EXCLUDED.mae, wmape = EXCLUDED.wmape,
bias = EXCLUDED.bias, rmse = EXCLUDED.rmse,
computed_at = NOW()
""", (run_id, metric_type, dim_val, sample_size,
float(total_actual), float(total_forecast),
float(mae) if mae is not None else None,
float(wmape) if wmape is not None else None,
float(bias) if bias is not None else None,
float(rmse) if rmse is not None else None))
total_inserted += 1
def _f(x):
return float(x) if x is not None else None
def run_and_insert(metric_type, sql):
cur.execute(base_cte + sql)
n = 0
for row in cur.fetchall():
(dim_val, sample_size, total_actual, total_forecast,
mae, wmape, bias, rmse, naive_wmape) = row
fva = None
if wmape is not None and naive_wmape is not None and float(naive_wmape) > 0:
fva = 1.0 - float(wmape) / float(naive_wmape)
cur.execute(insert_sql, (
run_id, metric_type, dim_val, sample_size,
_f(total_actual), _f(total_forecast), _f(mae), _f(wmape),
_f(bias), _f(rmse), _f(naive_wmape), _f(fva)))
n += 1
return n
total_inserted = 0
# overall: two rows — 'all' (non-dormant, the headline) and
# 'all_incl_dormant' (everything, so the ~11% dormant demand stops being
# invisible). Both are short-lead (lead 0-6). F5.
overall_source = """(
SELECT a.*, 'all'::text AS dim FROM accuracy a WHERE a.lifecycle_phase != 'dormant'
UNION ALL
SELECT a.*, 'all_incl_dormant'::text AS dim FROM accuracy a
)"""
total_inserted += run_and_insert('overall',
daily_agg('a.dim', overall_source, group_by='a.dim'))
# by_phase / by_method / daily — short-lead daily-grain over `accuracy`.
total_inserted += run_and_insert('by_phase',
daily_agg('a.lifecycle_phase', 'accuracy', group_by='a.lifecycle_phase'))
total_inserted += run_and_insert('by_method',
daily_agg('a.forecast_method', 'accuracy', group_by='a.forecast_method'))
total_inserted += run_and_insert('daily',
daily_agg('a.forecast_date::text', 'accuracy',
where="a.lifecycle_phase != 'dormant'", group_by='a.forecast_date'))
# by_lead_time — one sample per (pid, date, lead bucket) over `lead_accuracy`.
# Buckets beyond '1-7d' populate as the future-lead archives (F7) mature.
total_inserted += run_and_insert('by_lead_time',
daily_agg('a.lead_bucket', 'lead_accuracy', group_by='a.lead_bucket'))
# overall_weekly — the informative headline for intermittent retail demand.
# Aggregate the short-lead rows to (pid, complete week), then WMAPE over
# pid-weeks. Daily-grain WMAPE has a ~190% floor on this catalog; weekly
# grain is ~109% and responds to real improvement. F9.
weekly_sql = """,
weekly AS (
SELECT pid, date_trunc('week', forecast_date) AS wk,
SUM(forecast_units) AS fc_week,
SUM(actual_units) AS act_week,
SUM(naive_units) AS naive_week,
bool_and(naive_units IS NOT NULL) AS naive_complete
FROM short_lead_eval
WHERE lifecycle_phase != 'dormant'
GROUP BY pid, date_trunc('week', forecast_date)
HAVING COUNT(*) = 7
)
SELECT 'all'::text AS dim,
COUNT(*) AS sample_size,
COALESCE(SUM(act_week), 0) AS total_actual,
COALESCE(SUM(fc_week), 0) AS total_forecast,
AVG(ABS(fc_week - act_week)) AS mae,
CASE WHEN SUM(act_week) > 0
THEN SUM(ABS(fc_week - act_week)) / SUM(act_week) ELSE NULL END AS wmape,
AVG(fc_week - act_week) AS bias,
SQRT(AVG(POWER(fc_week - act_week, 2))) AS rmse,
CASE WHEN SUM(act_week) FILTER (WHERE naive_complete) > 0
THEN SUM(ABS(naive_week - act_week)) FILTER (WHERE naive_complete)
/ SUM(act_week) FILTER (WHERE naive_complete)
ELSE NULL END AS naive_wmape
FROM weekly
WHERE NOT (fc_week = 0 AND act_week = 0)
"""
total_inserted += run_and_insert('overall_weekly', weekly_sql)
conn.commit()
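Review note on the metrics: the sketch below restates WMAPE and FVA in plain Python and shows, on one made-up product-week, why daily grain punishes a correct rate forecast of intermittent demand while weekly grain does not. Function names and data are hypothetical; the formulas follow the SQL and `run_and_insert` above.

```python
def wmape(forecasts, actuals):
    """Sum of absolute errors over sum of actuals; None when actuals sum to 0."""
    total_actual = sum(actuals)
    if total_actual <= 0:
        return None
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / total_actual


def forecast_value_added(engine_wmape, naive_wmape):
    """FVA = 1 - wmape / naive_wmape; None when it cannot be computed."""
    if engine_wmape is None or naive_wmape is None or naive_wmape <= 0:
        return None
    return 1.0 - engine_wmape / naive_wmape


# Hypothetical product-week: one unit sold on one day, forecast is the true rate.
actuals = [0, 0, 1, 0, 0, 0, 0]
forecasts = [1 / 7] * 7

daily = wmape(forecasts, actuals)                  # scored day by day
weekly = wmape([sum(forecasts)], [sum(actuals)])   # scored on the weekly total

print(daily, weekly)
print(forecast_value_added(0.75, 1.0))
```

A forecast that gets the weekly total exactly right still scores well above 100% at daily grain here, which is the reason the weekly number is the more informative headline.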
@@ -1562,6 +1868,10 @@ def main():
conn, curves_df, dow_indices, monthly_indices, accuracy_margins
)
# Phase 4b: Snapshot sampled future-lead forecasts (7/14/30/60/89d) from
# the fresh run so long-lead accuracy populates once those dates pass (F7).
archive_future_leads(conn, run_id)
duration = time.time() - start_time
# Record run completion (include DOW indices in metadata)
+46 -11
@@ -357,6 +357,9 @@ router.get('/forecast/metrics', async (req, res) => {
const active = parseInt(totals.active_products) || 1;
const curveProducts = parseInt(totals.curve_products) || 0;
// NOTE: despite the name, this is "share of active products forecast via
// lifecycle curves" (curve coverage), NOT a statistical confidence. It only
// feeds a per-day tooltip field. See FORECAST_FIX_PLAN F9 (point 4).
const confidenceLevel = parseFloat((curveProducts / active).toFixed(2));
// Daily series from actual forecast
@@ -687,14 +690,29 @@ router.get('/forecast/accuracy', async (req, res) => {
const { rows: metrics } = await executeQuery(`
SELECT metric_type, dimension_value, sample_size,
total_actual_units, total_forecast_units,
mae, wmape, bias, rmse
mae, wmape, bias, rmse, naive_wmape, fva
FROM forecast_accuracy
WHERE run_id = $1
ORDER BY metric_type, dimension_value
`, [latestRunId]);
// Shared shaping for an "overall"-style aggregate row (daily or weekly grain).
const shapeOverall = (m) => m ? {
sampleSize: parseInt(m.sample_size),
totalActual: parseFloat(m.total_actual_units) || 0,
totalForecast: parseFloat(m.total_forecast_units) || 0,
mae: m.mae != null ? parseFloat(parseFloat(m.mae).toFixed(4)) : null,
wmape: m.wmape != null ? parseFloat((parseFloat(m.wmape) * 100).toFixed(1)) : null,
bias: m.bias != null ? parseFloat(parseFloat(m.bias).toFixed(4)) : null,
rmse: m.rmse != null ? parseFloat(parseFloat(m.rmse).toFixed(4)) : null,
naiveWmape: m.naive_wmape != null ? parseFloat((parseFloat(m.naive_wmape) * 100).toFixed(1)) : null,
fva: m.fva != null ? parseFloat(parseFloat(m.fva).toFixed(3)) : null,
} : null;
// Organize into response structure
const overall = metrics.find(m => m.metric_type === 'overall');
const overall = metrics.find(m => m.metric_type === 'overall' && m.dimension_value === 'all');
const overallInclDormant = metrics.find(m => m.metric_type === 'overall' && m.dimension_value === 'all_incl_dormant');
const overallWeekly = metrics.find(m => m.metric_type === 'overall_weekly');
const byPhase = metrics
.filter(m => m.metric_type === 'by_phase')
.map(m => ({
@@ -706,6 +724,8 @@ router.get('/forecast/accuracy', async (req, res) => {
wmape: m.wmape != null ? parseFloat((parseFloat(m.wmape) * 100).toFixed(1)) : null,
bias: m.bias != null ? parseFloat(parseFloat(m.bias).toFixed(4)) : null,
rmse: m.rmse != null ? parseFloat(parseFloat(m.rmse).toFixed(4)) : null,
naiveWmape: m.naive_wmape != null ? parseFloat((parseFloat(m.naive_wmape) * 100).toFixed(1)) : null,
fva: m.fva != null ? parseFloat(parseFloat(m.fva).toFixed(3)) : null,
}))
.sort((a, b) => (b.totalActual || 0) - (a.totalActual || 0));
@@ -763,6 +783,26 @@ router.get('/forecast/accuracy', async (req, res) => {
sampleSize: parseInt(r.sample_size),
}));
// Weekly-grain trend across runs (starts empty for old runs that predate
// the overall_weekly metric — that's expected, no backfill). F9.
const { rows: weeklyTrendRows } = await executeQuery(`
SELECT fr.finished_at::date AS run_date,
fa.wmape, fa.naive_wmape, fa.fva, fa.sample_size
FROM forecast_accuracy fa
JOIN forecast_runs fr ON fr.id = fa.run_id
WHERE fa.metric_type = 'overall_weekly'
AND fa.dimension_value = 'all'
ORDER BY fr.finished_at
`);
const accuracyTrendWeekly = weeklyTrendRows.map(r => ({
date: r.run_date instanceof Date ? r.run_date.toISOString().split('T')[0] : r.run_date,
wmape: r.wmape != null ? parseFloat((parseFloat(r.wmape) * 100).toFixed(1)) : null,
naiveWmape: r.naive_wmape != null ? parseFloat((parseFloat(r.naive_wmape) * 100).toFixed(1)) : null,
fva: r.fva != null ? parseFloat(parseFloat(r.fva).toFixed(3)) : null,
sampleSize: parseInt(r.sample_size),
}));
res.json({
hasData: true,
computedAt,
@@ -775,20 +815,15 @@ router.get('/forecast/accuracy', async (req, res) => {
? historyInfo.latest_date.toISOString().split('T')[0]
: historyInfo.latest_date,
},
overall: overall ? {
sampleSize: parseInt(overall.sample_size),
totalActual: parseFloat(overall.total_actual_units) || 0,
totalForecast: parseFloat(overall.total_forecast_units) || 0,
mae: overall.mae != null ? parseFloat(parseFloat(overall.mae).toFixed(4)) : null,
wmape: overall.wmape != null ? parseFloat((parseFloat(overall.wmape) * 100).toFixed(1)) : null,
bias: overall.bias != null ? parseFloat(parseFloat(overall.bias).toFixed(4)) : null,
rmse: overall.rmse != null ? parseFloat(parseFloat(overall.rmse).toFixed(4)) : null,
} : null,
overall: shapeOverall(overall),
overallInclDormant: shapeOverall(overallInclDormant),
overallWeekly: shapeOverall(overallWeekly),
byPhase,
byLeadTime,
byMethod,
dailyTrend,
accuracyTrend,
accuracyTrendWeekly,
});
} catch (err) {
console.error('Error fetching forecast accuracy:', err);
@@ -2,7 +2,7 @@ import { useQuery } from "@tanstack/react-query"
import { apiFetch } from '@/utils/api';
import { BarChart, Bar, ResponsiveContainer, XAxis, YAxis, Tooltip as RechartsTooltip, Cell, LineChart, Line } from "recharts"
import config from "@/config"
import { Target, TrendingDown, ArrowUpDown } from "lucide-react"
import { Target, TrendingDown, ArrowUpDown, Swords } from "lucide-react"
import { Tooltip as UITooltip, TooltipContent, TooltipProvider, TooltipTrigger } from "@/components/ui/tooltip"
import { PHASE_CONFIG } from "@/utils/lifecyclePhases"
@@ -14,6 +14,8 @@ interface OverallMetrics {
wmape: number | null
bias: number | null
rmse: number | null
naiveWmape?: number | null
fva?: number | null
}
interface PhaseAccuracy {
@@ -25,6 +27,8 @@ interface PhaseAccuracy {
wmape: number | null
bias: number | null
rmse: number | null
naiveWmape?: number | null
fva?: number | null
}
interface LeadTimeAccuracy {
@@ -51,11 +55,14 @@ interface AccuracyData {
daysOfHistory?: number
historyRange?: { from: string; to: string }
overall?: OverallMetrics
overallInclDormant?: OverallMetrics
overallWeekly?: OverallMetrics
byPhase?: PhaseAccuracy[]
byLeadTime?: LeadTimeAccuracy[]
byMethod?: { method: string; sampleSize: number; mae: number | null; wmape: number | null; bias: number | null }[]
dailyTrend?: { date: string; mae: number | null; wmape: number | null; bias: number | null }[]
accuracyTrend?: AccuracyTrendPoint[]
accuracyTrendWeekly?: { date: string; wmape: number | null; naiveWmape: number | null; fva: number | null; sampleSize: number }[]
}
function MetricSkeleton() {
@@ -74,12 +81,30 @@ function formatBias(bias: number | null): string {
}
function getAccuracyColor(wmape: number | null): string {
// Daily-grain thresholds (used for the by-phase / lead-time bars).
if (wmape === null) return "text-muted-foreground"
if (wmape <= 30) return "text-green-600"
if (wmape <= 50) return "text-yellow-600"
return "text-red-600"
}
function getWeeklyAccuracyColor(wmape: number | null): string {
// Weekly per-product grain has a much lower achievable floor than daily grain
// on this intermittent-demand catalog, so the headline uses its own thresholds.
if (wmape === null) return "text-muted-foreground"
if (wmape <= 60) return "text-green-600"
if (wmape <= 90) return "text-yellow-600"
return "text-red-600"
}
function formatSignedPct(ratio: number | null, digits = 0): string {
// ratio is a fraction (0.7 => +70%); null-safe.
if (ratio === null || ratio === undefined) return "N/A"
const pct = ratio * 100
const sign = pct > 0 ? "+" : ""
return `${sign}${pct.toFixed(digits)}%`
}
export function ForecastAccuracy() {
const { data, error, isLoading } = useQuery<AccuracyData>({
queryKey: ["forecast-accuracy"],
@@ -133,6 +158,24 @@ export function ForecastAccuracy() {
sampleSize: lt.sampleSize,
}))
// Headline prefers the weekly-grain WMAPE (informative); falls back to the
// daily-grain number until enough complete weeks of history exist.
const weeklyWmape = data?.overallWeekly?.wmape ?? null
const usingWeekly = weeklyWmape !== null
const headlineWmape = usingWeekly ? weeklyWmape : (data?.overall?.wmape ?? null)
const headlineColor = usingWeekly
? getWeeklyAccuracyColor(headlineWmape)
: getAccuracyColor(headlineWmape)
// Net forecast-vs-actual ratio (e.g. +70% = over-forecasting), from the
// daily 'all' totals — far more legible than bias in raw units.
const totalFc = data?.overall?.totalForecast ?? 0
const totalAct = data?.overall?.totalActual ?? 0
const fcVsAct = totalAct > 0 ? (totalFc / totalAct - 1) : null
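// Value over the naive baseline, read from the same grain as the headline.
// The overall_weekly row can exist with a null wmape (no complete weeks yet),
// so key off usingWeekly rather than the row's mere presence.
const naiveSource = usingWeekly ? data?.overallWeekly : data?.overall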
const naiveWmape = naiveSource?.naiveWmape ?? null
const fva = naiveSource?.fva ?? null
return (
<div>
<h3 className="text-lg font-medium mb-3">Forecast Accuracy</h3>
@@ -148,10 +191,24 @@ export function ForecastAccuracy() {
<div className="flex items-baseline justify-between">
<div className="flex items-center gap-2">
<Target className="h-4 w-4 text-muted-foreground" />
<p className="text-sm font-medium text-muted-foreground">WMAPE</p>
<p className="text-sm font-medium text-muted-foreground">
WMAPE <span className="text-[10px] opacity-70">({usingWeekly ? "weekly" : "daily"})</span>
</p>
</div>
<p className={`text-lg font-bold ${getAccuracyColor(data?.overall?.wmape ?? null)}`}>
{formatWmape(data?.overall?.wmape ?? null)}
<p className={`text-lg font-bold ${headlineColor}`}>
{formatWmape(headlineWmape)}
</p>
</div>
<div className="flex items-baseline justify-between">
<div className="flex items-center gap-2">
<ArrowUpDown className="h-4 w-4 text-muted-foreground" />
<p className="text-sm font-medium text-muted-foreground">Forecast vs actual</p>
</div>
<p className="text-lg font-bold">
{formatSignedPct(fcVsAct)}
<span className="text-xs font-normal text-muted-foreground ml-1">
{(fcVsAct ?? 0) > 0 ? "over" : (fcVsAct ?? 0) < 0 ? "under" : ""}
</span>
</p>
</div>
<div className="flex items-baseline justify-between">
@@ -160,20 +217,24 @@ export function ForecastAccuracy() {
<p className="text-sm font-medium text-muted-foreground">MAE</p>
</div>
<p className="text-lg font-bold">
{data?.overall?.mae !== null ? data?.overall?.mae?.toFixed(2) : "N/A"}
{data?.overall?.mae != null ? data?.overall?.mae?.toFixed(2) : "N/A"}
<span className="text-xs font-normal text-muted-foreground ml-1">units</span>
</p>
</div>
<div className="flex items-baseline justify-between">
<div className="flex items-center gap-2">
<ArrowUpDown className="h-4 w-4 text-muted-foreground" />
<p className="text-sm font-medium text-muted-foreground">Bias</p>
<Swords className="h-4 w-4 text-muted-foreground" />
<p className="text-sm font-medium text-muted-foreground">vs naive</p>
</div>
<p className="text-lg font-bold">
{formatBias(data?.overall?.bias ?? null)}
<span className="text-xs font-normal text-muted-foreground ml-1">
{(data?.overall?.bias ?? 0) > 0 ? "over" : (data?.overall?.bias ?? 0) < 0 ? "under" : ""}
<span className={fva != null ? (fva > 0 ? "text-green-600" : "text-red-600") : "text-muted-foreground"}>
{fva != null ? `${formatSignedPct(fva)} FVA` : "N/A"}
</span>
{naiveWmape != null && (
<span className="text-xs font-normal text-muted-foreground ml-1">
naive {formatWmape(naiveWmape)}
</span>
)}
</p>
</div>
</div>