Forecast Accuracy Fix Plan

Written: 2026-06-10, from a code + live-data review of the forecasting pipeline. Goal: eliminate the systematic ~1.72x over-forecast bias, recover demand the model currently ignores, and fix the accuracy measurement so improvements are visible and long-lead forecasts are validated.

Read this whole document before starting. Fixes are grouped into phases; each phase is independently deployable and has its own validation step. Line numbers are as of 2026-06-10 — re-locate by function name if the file has drifted.


1. Diagnosis summary (measured 2026-06-10)

The dashboard headline is 202% WMAPE. Decomposition of that number, all measured against forecast_accuracy run 129 and ad-hoc queries:

| Finding | Evidence |
|---|---|
| Daily-grain WMAPE has a ~190% floor for this catalog | Avg demand ≈ 0.11 units/product/day. A perfect rate forecast of intermittent demand scores ≈ 2e^−λ ≈ 190%. A trivial trailing-30d-average naive forecast scores 204% on the same products/days; the engine scores 221% (slightly worse than naive). |
| Same forecasts at 21-day-per-product grain: 109%; bias-corrected: 75% | Half the headline is metric grain, most of the rest is bias. |
| Aggregate over-forecast +70% (227,690 forecast vs 133,861 actual units) | Portfolio daily ratio is 1.5–2.5x on most days. |
| Decay phase 2.47x over (fc 51,675 / act 20,915) | Root cause F1: velocity inflated 4.07x (measured: 1.353 vs true 0.332 units/day) by averaging over sparse snapshot rows. |
| Preorder phase 2.15x over (fc 67,212 / act 31,189) | Root cause F4: launch curve applied at age=0 starting today, ignoring that the product hasn't arrived. |
| Mature phase 1.69x over (fc 57,857 / act 34,313) | Root causes F2 (history edge truncation) + F3 (seasonal double-count). |
| Dormant products sold 16,180 units (~11% of demand) against zero forecasts | Root cause F5; also excluded from the headline metric, so invisible. |
| All 879,800 accuracy samples are in the 1–7d lead bucket | Root cause F7: archiving design only ever saves yesterday's slice. 30–90d forecasts (what purchasing uses) are never validated. |
| Launch phase is healthy: WMAPE 100%, bias 6%, beats naive | The lifecycle-curve concept works; its calibration inputs are broken. Don't redesign it. |

Key data fact underlying several fixes: daily_product_snapshots is activity-based and sparse: only ~500–1,800 of ~38K products have a row on a given day. Verified: every pid-day with an order DOES have a snapshot row and units match (5,234/5,234 pid-days, 8,980 vs 8,984 units over 7 days). So missing row = zero sales, and any query that aggregates over only the rows that exist is averaging over sold-days.


2. Environment & operational notes

  • Files: engine is inventory-server/scripts/forecast/forecast_engine.py; orchestrator run_forecast.js in the same dir; consumer endpoints in inventory-server/src/routes/dashboard.js (/forecast/metrics ~line 308, /forecast/accuracy ~line 647); overview UI in inventory/src/components/overview/ForecastMetrics.tsx and ForecastAccuracy.tsx.
  • Local inventory-server/ is NFS-mounted to /var/www/inventory/ on the netcup server. Edits made locally appear on the server immediately — no copy step. Do NOT run bulk grep/find/node --check over inventory-server/ locally (the mount hangs); ssh netcup and run them there.
  • Avoid the glob tool for search in this repo; use bash (grep/rg via ssh for server-side trees).
  • Scheduling: the engine runs daily at 09:30:01 server time (runs table is conclusive), but the cron entry is NOT in matt's crontab, /etc/cron.d, or pm2. Likely root's crontab (sudo crontab -l to confirm). You do not need to touch the schedule for these fixes; just know a run fires at 09:30 daily and occasionally skips days (e.g. 2026-06-07/08).
  • Manual test runs: ssh netcup, then cd /var/www/inventory/scripts/forecast && node run_forecast.js. Takes ~3.5–4 min. Safe to run any time: the engine TRUNCATEs and rebuilds product_forecasts, archives prior past-dated rows, and records a new forecast_runs row. Python deps live in the server venv (venv/); run_forecast.js handles env + venv automatically.
  • DB access for validation: ssh netcup, then PGPASSWORD=6D3GUkxuFgi2UghwgnUd psql -h localhost -U inventory_readonly -d inventory_db. The engine itself connects with the write user via env vars loaded from /var/www/inventory/.env — schema changes should be made idempotently inside the engine code (the file already uses CREATE TABLE IF NOT EXISTS / CREATE INDEX IF NOT EXISTS; use ALTER TABLE ... ADD COLUMN IF NOT EXISTS the same way) so no manual migration is needed.
  • Python gotchas already handled in this file (don't regress): numpy types must go through the registered psycopg2 adapters; pd.Series.combine_first() keeps zeros over real data — use reindex(..., fill_value=0.0).
  • Engine runtime budget: currently ~212–227s. Phases 1–2 shouldn't move it meaningfully; Phase 3's extra archiving adds one INSERT…SELECT. If runtime balloons past ~6 min, investigate before shipping.
  • --backfill mode (backfill_accuracy_data) is an in-sample backtest using the old formulas. Do not run it anymore; there is enough real out-of-sample history. Updating it to match the new logic is optional/low priority (F11).

Phase 1 — Bias bugs in the engine (no schema changes)

F1. Decay velocity: stop averaging over sparse snapshot rows

Where: forecast_engine.py, batch_load_product_data(), the decay query (~lines 697–710).

Problem: AVG(COALESCE(dps.units_sold, 0)) runs over only the snapshot rows that exist — mostly sold-days. Measured inflation on the current 975 decay products: 4.07x (1.353 vs 0.332 true units/day). This feeds compute_scale_factor() for the decay phase and is the single largest bias source.

Fix: divide the sum by calendar days in the window, clipped to the product's age (decay products are 14–60 days old, so a 20-day-old product's window is 20 days, not 30):

SELECT dps.pid,
    SUM(COALESCE(dps.units_sold, 0))::float
      / GREATEST(LEAST(30, (CURRENT_DATE - pm.date_first_received::date)), 1) AS avg_daily
FROM daily_product_snapshots dps
JOIN product_metrics pm ON pm.pid = dps.pid
WHERE dps.pid = ANY(%s)
  AND dps.snapshot_date >= CURRENT_DATE - INTERVAL '30 days'
  AND dps.snapshot_date >= pm.date_first_received::date
GROUP BY dps.pid, pm.date_first_received

No Python-side changes needed; data['decay_velocity'] keeps the same shape. Products with zero snapshot rows in the window still get no entry → existing scale = 1.0 fallback applies (acceptable: decay classification requires sales_velocity_daily > 0, so truly dead products don't reach this path).
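To make the size of the F1 bias concrete, here is a minimal Python sketch of the two averaging rules. The function names and the sample product (45 days old, 2 units sold on each of 5 days) are hypothetical and exist only for illustration; the real computation stays in the SQL above.

```python
from datetime import date, timedelta

def sparse_row_average(rows):
    """Biased form: mean of units_sold over only the snapshot rows that exist."""
    return sum(rows.values()) / len(rows) if rows else 0.0

def calendar_day_average(rows, today, first_received, window_days=30):
    """Fixed form: units in the window divided by calendar days, clipped to product age."""
    window_start = max(today - timedelta(days=window_days), first_received)
    units = sum(u for d, u in rows.items() if d >= window_start)
    days = max(min(window_days, (today - first_received).days), 1)
    return units / days

today = date(2026, 6, 10)
# Hypothetical decay product: sold 2 units on each of 5 days in the last 30.
rows = {today - timedelta(days=k): 2 for k in (3, 9, 14, 20, 27)}
biased = sparse_row_average(rows)
fixed = calendar_day_average(rows, today, today - timedelta(days=45))
print(f"sparse-row average: {biased:.3f}, calendar-day average: {fixed:.3f}")
```

In this made-up case the sparse-row average is six times the calendar-day rate, the same mechanism as the measured 4.07x inflation.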

F2. Mature history: reindex over the full calendar window

Where: forecast_engine.py, forecast_mature() (~lines 833–836).

Problem: hist.set_index('snapshot_date').resample('D').sum() only spans first-snapshot → last-snapshot. Interior gaps correctly become zeros, but leading and trailing quiet periods are absent, so the Holt level is fitted on the product's busy span. A marginal mature product whose activity clusters in 2 of the last 8 weeks gets a level ~4x too high.

Fix: replace the resample with an explicit reindex over the full EXP_SMOOTHING_WINDOW ending yesterday:

hist = history_df.copy()
hist['snapshot_date'] = pd.to_datetime(hist['snapshot_date'])
hist = hist.set_index('snapshot_date')['units_sold']
full_index = pd.date_range(
    end=pd.Timestamp(date.today() - timedelta(days=1)),
    periods=EXP_SMOOTHING_WINDOW, freq='D')
series = hist.reindex(full_index, fill_value=0.0).values.astype(float)

Notes: (pid, snapshot_date) is unique in daily_product_snapshots, so no duplicate-index risk. observed_mean and the cap recompute over the full window automatically (intended — the cap gets correspondingly tighter). Mature products are by definition >60 days old, so the 60-day window never predates first receipt. Do NOT use combine_first (see gotchas above).
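The edge-truncation effect can be reproduced without pandas. The sketch below is illustrative only: the helper names are hypothetical, and it uses a 56-day (8-week) window to match the "2 of the last 8 weeks" example rather than the engine's EXP_SMOOTHING_WINDOW.

```python
from datetime import date, timedelta

def busy_span_series(history):
    """Daily series from first to last snapshot only (the span resample('D') covers)."""
    first, last = min(history), max(history)
    n = (last - first).days + 1
    return [history.get(first + timedelta(days=i), 0.0) for i in range(n)]

def full_window_series(history, today, window_days):
    """Daily series over the full window ending yesterday, zero-filled at both edges."""
    start = today - timedelta(days=window_days)
    return [history.get(start + timedelta(days=i), 0.0) for i in range(window_days)]

today = date(2026, 6, 10)
# Hypothetical product: 1 unit/day for 14 consecutive days, silent before and after.
history = {today - timedelta(days=k): 1.0 for k in range(21, 35)}

busy = busy_span_series(history)
full = full_window_series(history, today, window_days=56)
busy_mean = sum(busy) / len(busy)
full_mean = sum(full) / len(full)
print(f"busy-span mean {busy_mean:.2f} vs full-window mean {full_mean:.2f}")
```

A level fitted on the busy span starts from a mean four times the full-window mean here, which is the inflation the reindex removes.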

F3. Stop double-applying the monthly seasonal index

Where: forecast_engine.py, generate_all_forecasts() — the seasonal_multipliers pre-compute (~lines 959–961) and application (~line 1050).

Problem: every per-product calibration (decay velocity, mature Holt level, launch first-week scale, preorder rate, slow-mover velocity) is fitted on raw recent actuals, which already embed the current month's seasonal level. The forecast then multiplies by the absolute monthly index of the target date. Example from the live indices (forecast_runs.phase_counts for run 129): May = 1.224 (sale month), June = 0.982. Early-June forecasts were calibrated on May-sale-inflated velocities and barely discounted — a structural ~25% over-forecast at that transition, and it'll be worse around November (1.316).

Fix: apply the seasonal index relative to the calibration period. Compute a calibration index as the average monthly index over the trailing 30 calendar days (robust at month boundaries), then divide:

today = date.today()
trailing = [today - timedelta(days=i) for i in range(1, 31)]
calibration_index = float(np.mean([monthly_indices.get(d.month, 1.0) for d in trailing]))
seasonal_multipliers = [
    monthly_indices.get(d.month, 1.0) / max(calibration_index, 0.1)
    for d in forecast_dates
]

Leave the DOW multipliers absolute — every calibration is a multi-week average and therefore DOW-neutral, so reshaping by absolute DOW indices is correct.
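As a worked check of the relative index, the sketch below wraps the same arithmetic in a function that takes `today` as a parameter so it can be evaluated at a fixed date. The function name and the choice of June 3 as the run date are assumptions for illustration; the May and June indices are the run-129 values quoted above.

```python
from datetime import date, timedelta

def relative_seasonal_multipliers(monthly_indices, today, forecast_dates):
    """Monthly index of each target date divided by the trailing-30-day calibration index."""
    trailing = [today - timedelta(days=i) for i in range(1, 31)]
    calibration_index = sum(monthly_indices.get(d.month, 1.0) for d in trailing) / len(trailing)
    denom = max(calibration_index, 0.1)  # guard against a degenerate index
    multipliers = [monthly_indices.get(d.month, 1.0) / denom for d in forecast_dates]
    return multipliers, calibration_index

monthly_indices = {5: 1.224, 6: 0.982}  # values quoted from run 129
today = date(2026, 6, 3)                # assumed run date for the example
forecast_dates = [today + timedelta(days=i) for i in range(7)]

multipliers, calibration = relative_seasonal_multipliers(monthly_indices, today, forecast_dates)
print(f"calibration index {calibration:.3f}, June multiplier {multipliers[0]:.3f}")
```

On June 3 the trailing window holds 28 May days and 2 June days, so the calibration index sits close to the May value and the June multiplier falls well below the absolute 0.982. When the calibration window and the target date share a month, the multiplier is 1.0.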

Optional sub-fix (same area, low priority): the monthly indices are computed from a single trailing 365-day window, so each month appears once and YoY growth contaminates "seasonality". A cheap improvement is widening SEASONAL_LOOKBACK_DAYS to 730 and averaging the two observations of each month. Do this only after the main fixes are validated.

Phase 1 validation

Deploy (edit locally; NFS propagates), run the engine manually once, wait for 3–5 daily cycles, then:

-- Portfolio ratio per day (target: drifts from ~2.0 toward 0.8–1.3)
WITH ranked AS (
  SELECT pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.lifecycle_phase,
    ROW_NUMBER() OVER (PARTITION BY pfh.pid, pfh.forecast_date ORDER BY fr.started_at DESC) rn
  FROM product_forecasts_history pfh
  JOIN forecast_runs fr ON fr.id = pfh.run_id
  WHERE pfh.forecast_date >= CURRENT_DATE - 7)
SELECT r.forecast_date, round(SUM(r.forecast_units),0) AS fc,
  SUM(COALESCE(dps.units_sold,0)) AS act,
  round(SUM(r.forecast_units)/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS ratio
FROM ranked r
LEFT JOIN daily_product_snapshots dps ON dps.pid = r.pid AND dps.snapshot_date = r.forecast_date
WHERE r.rn = 1 AND r.lifecycle_phase != 'dormant'
GROUP BY 1 ORDER BY 1;

Also check forecast_accuracy by_phase rows for the newest run: decay bias should fall from +0.35 toward ~0, mature from +0.17 toward ~0. (Accuracy lags ~1 day behind each fix since it evaluates yesterday's forecasts.)


Phase 2 — Demand the model currently ignores or mistimes

F4. Preorder: forecast the preorder rate until arrival, launch curve after

Where: forecast_engine.py, batch_load_product_data() (add arrival dates), generate_all_forecasts() preorder branch (~lines 1005–1009), and forecast_from_curve() (or a small wrapper).

Problem: preorder products run the launch curve from age=0 starting today, i.e. full first-week launch sales while the product is still weeks from arriving. Actual preorder-period sales are a much slower trickle.

Fix:

  1. Batch-load each preorder product's expected arrival from purchase_orders (line-item grain: it has pid and expected_date directly). Open statuses verified against live data: created, ordered, electronically_sent, receiving_started (~705 open line items currently have a future expected_date):
SELECT pid, MIN(expected_date) AS expected_arrival
FROM purchase_orders
WHERE pid = ANY(%s)
  AND status IN ('created', 'ordered', 'electronically_sent', 'receiving_started')
  AND expected_date IS NOT NULL
  AND expected_date >= CURRENT_DATE
GROUP BY pid

Fallbacks, in order: (a) an open PO with a past expected_date → assume arrival in 7 days; (b) no PO at all → arrival in 14 days (and log a counter of how many hit this default).

  2. In the preorder branch, build the daily array piecewise. Let days_until_arrival = (expected_arrival - today).days:
    • Days 0 .. days_until_arrival-1: flat observed preorder daily rate = preorder_sales[pid] / max(preorder_days[pid], 1) (both already batch-loaded), clamped to ≤ the curve's scaled week-0 daily value.
    • Days days_until_arrival .. horizon: forecast_from_curve(curve_info, scale, age_days=0, ...) shifted so the curve's day 0 lands on the arrival date (i.e. pass horizon_days - days_until_arrival and offset into the output array).
    • Keep the existing compute_scale_factor('preorder', ...) for the post-arrival curve; the pre-arrival segment doesn't use it.

This is consistent with how the reference curves were built: historical preorder units were recorded on their order dates (pre-arrival), so week-0 of the fitted curves reflects post-receipt orders, not the backlog.
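The fallback order and the piecewise array can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's code: `resolve_arrival` and `preorder_forecast` are hypothetical names, `curve_daily` stands in for the already-scaled output of forecast_from_curve() at age 0, and its first element is used as the "week-0 daily value" for the clamp.

```python
from datetime import date, timedelta

def resolve_arrival(today, future_expected=None, has_past_due_open_po=False):
    """Return (arrival_date, used_default) following the fallback order in the text."""
    if future_expected is not None:
        return future_expected, False
    if has_past_due_open_po:
        return today + timedelta(days=7), False
    return today + timedelta(days=14), True  # caller should count these defaults

def preorder_forecast(preorder_units, preorder_days, curve_daily,
                      days_until_arrival, horizon_days):
    """Flat preorder rate until arrival, then the launch curve from its day 0."""
    pre = min(max(days_until_arrival, 0), horizon_days)
    post = horizon_days - pre
    if len(curve_daily) < max(post, 1):
        raise ValueError("curve_daily must cover the post-arrival days")
    rate = preorder_units / max(preorder_days, 1)
    rate = min(rate, curve_daily[0])  # clamp to the curve's first-day value (assumption)
    return [rate] * pre + list(curve_daily[:post])

today = date(2026, 6, 10)
arrival, used_default = resolve_arrival(today, future_expected=date(2026, 6, 20))
curve = [2.0] * 7 + [1.0] * 83  # hypothetical scaled launch curve
forecast = preorder_forecast(12, 30, curve, (arrival - today).days, horizon_days=90)
print(forecast[:12])
```

The output array keeps the full horizon length, so the DOW and seasonal multipliers applied after the branch need no changes.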

F5. Dormant products: small positive rate instead of hard zero, and count them

Where: forecast_engine.py, generate_all_forecasts() dormant branch (~lines 1040–1042), batch_load_product_data(), and compute_accuracy().

Problem: all ~28K dormant products are forecast at exactly 0, yet they sold 16,180 units in the eval window (~11% of all demand) — restocks, promos, long-tail. Worse, dormant is excluded from the headline accuracy filter, so this miss is invisible.

Fix (cheap version, do this now):

  1. Batch-load a trailing-180-day order rate for dormant products (11,362 of them have ≥1 sale in 180d — verified):
SELECT o.pid, SUM(o.quantity) / 180.0 AS rate
FROM orders o
WHERE o.pid = ANY(%s)
  AND o.canceled IS DISTINCT FROM TRUE
  AND o.date >= CURRENT_DATE - INTERVAL '180 days'
GROUP BY o.pid
  2. Dormant branch: if the product has a rate > 0, forecast it flat with method = 'velocity'; else keep zeros with method = 'zero'. Apply the same DOW/seasonal multipliers as everything else (automatic — they're applied after the branch).
  3. In compute_accuracy(), add a second overall row: metric_type='overall', dimension_value='all_incl_dormant' with no dormant filter (keep the existing 'all' row unchanged for trend continuity). One extra entry in the dimensions/filter_clauses dicts.

Upgrade path (optional, Phase 4): replace flat rates for slow_mover + dormant-with-sales with TSB (Teunter–Syntetos–Babai), the standard intermittent-demand method with obsolescence handling. Per product over a daily series d_t (build it from snapshots the F2 way, with a full calendar reindex):

if d_t > 0:  p_t = p_{t-1} + β·(1 - p_{t-1});  z_t = z_{t-1} + α·(d_t - z_{t-1})
else:        p_t = p_{t-1}·(1 - β);            z_t = z_{t-1}
forecast = p_T · z_T   (flat across horizon)

Start with α=0.1, β=0.05, initialize p = (nonzero days / total days), z = mean of nonzero demands. Scope: slow_mover (~6K) + dormant with 180d sales (~11K); series from up to 180 days of snapshots (sparse rows → ~manageable volume). Only do this after Phase 3 measurement exists to prove it beats the flat rates.
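The update rules above translate directly into a few lines of Python. This is a minimal sketch with a hypothetical function name, using the starting parameters and initialization suggested in the text; it has not been tuned or validated against this catalog.

```python
def tsb_forecast(demand, alpha=0.1, beta=0.05):
    """TSB flat daily forecast p_T * z_T for a zero-filled daily demand series."""
    nonzero = [d for d in demand if d > 0]
    if not nonzero:
        return 0.0
    p = len(nonzero) / len(demand)   # initial demand probability
    z = sum(nonzero) / len(nonzero)  # initial demand size
    for d in demand:
        if d > 0:
            p = p + beta * (1.0 - p)
            z = z + alpha * (d - z)
        else:
            p = p * (1.0 - beta)     # probability decays on every zero day
    return p * z

# Two hypothetical 60-day series with the same total demand:
stale = [1] * 10 + [0] * 50    # sold early, then went quiet
recent = [0] * 50 + [1] * 10   # quiet, then started selling
print(f"stale {tsb_forecast(stale):.3f} vs recent {tsb_forecast(recent):.3f}")
```

A flat trailing rate would score both series identically; TSB decays the stale one toward zero, which is the obsolescence handling that motivates it for dormant products.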

Phase 2 validation

After 3–5 cycles: preorder by_phase bias should drop from +0.85 toward < +0.3; the new all_incl_dormant row should appear, and the difference between its total_actual_units and that of the 'all' row (the dormant actuals) should be largely covered rather than all-miss (dormant bias rising from −1.36 toward ~−0.3 or better).


Phase 3 — Fix the measurement (schema + engine + API + UI)

Without this phase you cannot see whether Phases 1–2 worked except by ad-hoc SQL, the lead-time chart stays a single bucket forever, and the dashboard keeps displaying a number with a 190% floor in red.

F7. Archive long-lead forecasts so 15/30/60/90d accuracy exists

Where: forecast_engine.py, archive_forecasts() (~lines 1086–1154), compute_accuracy() CTE (~lines 1201–1228).

Problem: the current design archives only past-dated rows of the previous run before truncation. With daily runs, that's only ever the 1-day-ahead slice, so all 879,800 accuracy samples sit in the '1-7d' bucket and the longer buckets in the UI chart can never populate. Purchasing decisions ride on 30–60d forecasts that are never validated.

Fix:

  1. Keep the existing past-date archiving exactly as is (it provides dense short-lead coverage).
  2. After generate_all_forecasts() completes, additionally archive a sampled set of future leads from the new run, non-dormant only, attributed to the current run id (correct attribution, unlike the past-date path which attributes to the previous run):
INSERT INTO product_forecasts_history
    (run_id, pid, forecast_date, forecast_units, forecast_revenue,
     lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at)
SELECT %(run_id)s, pid, forecast_date, forecast_units, forecast_revenue,
    lifecycle_phase, forecast_method, confidence_lower, confidence_upper, generated_at
FROM product_forecasts
WHERE lifecycle_phase != 'dormant'
  AND forecast_date - CURRENT_DATE IN (7, 14, 30, 60, 89)
ON CONFLICT (run_id, pid, forecast_date) DO NOTHING

Volume: ~10K non-dormant products × 5 leads ≈ 50K rows/day; the existing 90-day prune (forecast_date < CURRENT_DATE - 90) bounds steady state at a few million rows. Note future-dated rows survive until their date passes + 90 days — that's intended.

  3. CRITICAL companion change in compute_accuracy(): the accuracy CTE must now exclude not-yet-realized rows, or future-dated archives get scored against actual=0:
FROM product_forecasts_history pfh
JOIN forecast_runs fr ON fr.id = pfh.run_id
WHERE pfh.forecast_date < CURRENT_DATE          -- ADD THIS
  4. Dedup semantics change. Today's ROW_NUMBER() OVER (PARTITION BY pid, forecast_date ORDER BY started_at DESC) keeps only the latest (= shortest-lead) row per pid/date, which would silently discard all the new long-lead rows. Restructure:
    • Compute lead_days = forecast_date - started_at::date and the lead bucket inside ranked_history.
    • For by_lead_time: dedup PARTITION BY pid, forecast_date, lead_bucket (one sample per pid/date/bucket, latest run wins within a bucket).
    • For everything else (overall, by_phase, by_method, daily, and the new weekly metric below): restrict to lead_days BETWEEN 0 AND 6 and keep the existing per-(pid, date) dedup. This preserves the current meaning of the headline metrics (short-lead) while the lead-time table becomes real.
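The two dedup paths can be prototyped in Python before touching the CTE. The sketch below is an illustration only: `dedup_history` is a hypothetical name, and the bucket boundaries in `LEAD_BUCKETS` are an assumption inferred from this plan (lead 0 to 6 mapping to '1-7d', and so on); check them against the bucket definition in compute_accuracy() before relying on them.

```python
from datetime import date

# ASSUMPTION: bucket edges inferred from this plan, not read from the engine.
LEAD_BUCKETS = [(0, 6, "1-7d"), (7, 13, "8-14d"), (14, 29, "15-30d"),
                (30, 59, "31-60d"), (60, 89, "61-90d")]

def lead_bucket(lead_days):
    for lo, hi, label in LEAD_BUCKETS:
        if lo <= lead_days <= hi:
            return label
    return None

def dedup_history(rows, today):
    """Return (headline, by_lead): latest short-lead row per (pid, date), and
    latest row per (pid, date, bucket). Rows are dicts with pid, forecast_date,
    run_started and forecast_units."""
    headline, by_lead = {}, {}
    # Ascending sort so a later run overwrites an earlier one ("latest run wins").
    for r in sorted(rows, key=lambda r: r["run_started"]):
        if r["forecast_date"] >= today:
            continue  # not yet realized: must not be scored against actual=0
        lead_days = (r["forecast_date"] - r["run_started"]).days
        bucket = lead_bucket(lead_days)
        if bucket is None:
            continue
        by_lead[(r["pid"], r["forecast_date"], bucket)] = r
        if 0 <= lead_days <= 6:
            headline[(r["pid"], r["forecast_date"])] = r
    return headline, by_lead
```

With this structure a forecast date scored at leads of 1, 7 and 30 days yields one headline sample and three lead-time samples, instead of the single shortest-lead row the current ROW_NUMBER() keeps.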

F8. Track a naive baseline (forecast value-added)

Where: archive_forecasts() (both INSERT paths), compute_accuracy(), forecast_accuracy schema, /forecast/accuracy endpoint.

Problem: the engine currently loses to a trailing-average naive forecast (221% vs 204% daily WMAPE) and nothing on the dashboard would ever reveal that. Every accuracy improvement should be judged as value-over-naive.

Fix:

  1. Schema (idempotent, in the ensure blocks): ALTER TABLE product_forecasts_history ADD COLUMN IF NOT EXISTS naive_units NUMERIC(10,2); and ALTER TABLE forecast_accuracy ADD COLUMN IF NOT EXISTS naive_wmape NUMERIC(10,4), ADD COLUMN IF NOT EXISTS fva NUMERIC(10,4);
  2. Populate naive_units during both archive INSERTs via a join — naive = flat trailing-28-day average daily units as of archive time (28 days = DOW-balanced; information available at generation; same value at every lead, which is exactly what a naive baseline means):
LEFT JOIN (
    SELECT o.pid, SUM(o.quantity) / 28.0 AS naive_daily
    FROM orders o
    WHERE o.canceled IS DISTINCT FROM TRUE
      AND o.date >= CURRENT_DATE - INTERVAL '28 days' AND o.date < CURRENT_DATE
    GROUP BY o.pid
) nv ON nv.pid = pf.pid
-- select COALESCE(nv.naive_daily, 0) AS naive_units
  3. In compute_accuracy(), add to each dimension's aggregate: SUM(ABS(naive_units - actual_units)) / NULLIF(SUM(actual_units),0) AS naive_wmape and store fva = 1 - wmape / naive_wmape (NULL-safe). Rows archived before this change have naive_units NULL; treat NULL as excluded (FILTER (WHERE naive_units IS NOT NULL) on the naive sums) rather than as zero.
  4. Endpoint: include naiveWmape and fva in the overall (and per-phase) payload of /dashboard/forecast/accuracy in dashboard.js.
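The NULL-safe FVA arithmetic from the steps above can be sketched in Python. The function names are hypothetical, and one interpretation is assumed: the NULL filter is applied to both the numerator and the denominator of the naive WMAPE, so pre-change rows drop out of the naive score entirely.

```python
def wmape(forecasts, actuals):
    """SUM(|forecast - actual|) / SUM(actual), or None when there is no demand."""
    total = sum(actuals)
    if total == 0:
        return None
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / total

def fva(engine_wmape, naive_wmape):
    """Forecast value added: 1 - wmape / naive_wmape, None if either side is missing."""
    if engine_wmape is None or not naive_wmape:
        return None
    return 1.0 - engine_wmape / naive_wmape

def score(rows):
    """rows: (forecast_units, naive_units or None, actual_units) tuples."""
    eng = wmape([r[0] for r in rows], [r[2] for r in rows])
    with_naive = [r for r in rows if r[1] is not None]  # exclude pre-change rows
    nv = wmape([r[1] for r in with_naive], [r[2] for r in with_naive])
    return eng, nv, fva(eng, nv)

# Diagnosis figures from section 1: engine 221% vs naive 204% daily WMAPE.
print(f"FVA at diagnosis: {fva(2.21, 2.04):+.3f}")
```

A negative FVA, as with the diagnosis figures, means the engine is currently doing worse than the trailing-average baseline.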

F9. Weekly-grain headline metric + bias as a percentage

Where: compute_accuracy(), /forecast/accuracy endpoint, ForecastAccuracy.tsx.

Problem: daily-grain WMAPE on this catalog has a ~190% floor, so as a headline it's noise. The informative numbers are (a) weekly-per-product WMAPE (currently ~109%, target ~70–85% post-fix) and (b) aggregate bias, which the UI currently renders as +0.108 units — indistinguishable from zero while the reality is +70%.

Fix:

  1. New metric in compute_accuracy(): metric_type='overall_weekly', dimension_value='all'. Definition: using the short-lead deduped rows (lead ≤ 6, non-dormant), aggregate per (pid, date_trunc('week', forecast_date)) keeping only complete weeks (COUNT(*) = 7), then WMAPE = SUM(ABS(fc_week - act_week)) / SUM(act_week), excluding pid-weeks where both are 0. Store sample_size = number of pid-weeks. Compute naive_wmape/fva the same way from naive_units.
  2. Endpoint: expose as overallWeekly; also add a weekly variant to the accuracyTrend query (metric_type='overall_weekly'). The trend will start empty (old runs lack the row) — that's fine; don't backfill.
  3. ForecastAccuracy.tsx:
    • Headline WMAPE → overallWeekly.wmape, labeled "WMAPE (weekly)". Keep daily WMAPE available in a tooltip if desired.
    • Color thresholds for weekly grain: green ≤ 60, yellow ≤ 90, red above (tunable; document that they're calibrated for intermittent retail demand).
    • Replace the bias row: show (totalForecast / totalActual − 1) as a signed percentage labeled "Forecast vs actual" (both totals already arrive in overall). Keep MAE.
    • Add a "vs naive" line: naive weekly WMAPE and FVA. FVA > 0 = engine adds value.
    • The lead-time chart needs no code change — buckets will populate as F7 rows mature (7d lead evaluable after 7 days, 30d after 30, etc.).
  4. confidenceLevel in /forecast/metrics (dashboard.js ~line 360) is "share of products forecast via lifecycle curves", not confidence. It only feeds a per-day tooltip field. Either rename the JSON field to curveCoverage and update the one consumer in ForecastMetrics.tsx, or leave it and add a comment; low priority.
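The weekly-grain definition and the bias percentage can be prototyped in Python before writing the SQL. This is a sketch under stated assumptions: the function names are hypothetical, the input rows are taken to be already short-lead, deduped and non-dormant, and weeks start on Monday to mirror PostgreSQL's date_trunc('week', ...).

```python
from collections import defaultdict
from datetime import date, timedelta

def week_start(d):
    """Monday of d's week, matching PostgreSQL date_trunc('week', d)."""
    return d - timedelta(days=d.weekday())

def weekly_wmape(rows):
    """rows: (pid, forecast_date, forecast_units, actual_units).
    Returns (wmape or None, sample_size in pid-weeks)."""
    weeks = defaultdict(lambda: [0, 0.0, 0.0])  # count, forecast sum, actual sum
    for pid, d, fc, act in rows:
        w = weeks[(pid, week_start(d))]
        w[0] += 1
        w[1] += fc
        w[2] += act
    # Keep complete weeks only, and drop pid-weeks where both totals are zero.
    kept = [(fc, act) for n, fc, act in weeks.values()
            if n == 7 and not (fc == 0 and act == 0)]
    total_act = sum(act for _, act in kept)
    if total_act == 0:
        return None, len(kept)
    return sum(abs(fc - act) for fc, act in kept) / total_act, len(kept)

def bias_pct(total_forecast, total_actual):
    """Signed 'forecast vs actual' ratio; +0.70 means a 70% over-forecast."""
    return None if total_actual == 0 else total_forecast / total_actual - 1.0

# Aggregate totals from the diagnosis table in section 1.
print(f"Forecast vs actual: {bias_pct(227690, 133861):+.1%}")
```

Summing forecast and actual to the week before taking absolute errors is what lets a correct low rate net out against occasional single-unit sales, which the daily grain cannot do.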

Phase 3 validation

  • Next run after deploy: forecast_accuracy contains overall_weekly and fva values; /dashboard/forecast/accuracy returns them; the overview popover renders weekly WMAPE, bias %, and the naive comparison.
  • After 7/14/30 days: by_lead_time rows appear for '8-14d', '15-30d', '31-60d' buckets respectively (61-90d after ~60 days).
  • Confirm engine runtime still < ~5 min and product_forecasts_history growth ≈ 50–70K rows/day.

Phase 4 — Optional / after the above is proven

  • F6. TSB for slow movers + dormant (spec in F5). Gate on Phase 3 measurement: ship only if weekly FVA improves on those phases.
  • F10. Confidence-margin source: load_accuracy_margins() feeds daily-grain per-phase WMAPE (clamped to 1.0) into the intervals, so every interval is ±100% — uninformative. Once overall_weekly exists, add per-phase weekly rows (by_phase_weekly) and source margins from those instead.
  • F11. Update or delete backfill_accuracy_data() (it encodes the old formulas). Until then, just don't run --backfill.
  • F12. compute_dow_indices() weights by revenue but the multipliers are applied to units — switch SUM(o.price * o.quantity) to SUM(o.quantity). Tiny effect.
  • F13. Longer term: for reorder decisions the right target is P(lead-time demand > stock), not a point forecast. Evaluate quantile (pinball) loss at lead-time horizons using the existing confidence-interval columns. Design separately.

4. Success criteria

  1. Rolling-14-day portfolio forecast/actual ratio within 0.8–1.25 (currently 1.5–2.5).
  2. Weekly-grain WMAPE ≤ 90% and FVA > 0 (engine beats naive) sustained for 2+ weeks.
  3. Decay/preorder/mature per-phase bias within ±0.1 units/day (currently +0.35 / +0.85 / +0.17).
  4. all_incl_dormant actuals covered: dormant bias better than −0.4 (currently −1.36, i.e. 100% miss).
  5. Lead-time buckets through 31–60d populated with ≥10K samples each within ~6 weeks.
  6. Launch phase stays healthy (bias within ±0.15, WMAPE not degraded) — regression guard for F3/F4 changes.

5. Re-measurement appendix

The naive-vs-engine comparison used in the diagnosis (rerun any time; adjust dates):

WITH ranked AS (
  SELECT pfh.pid, pfh.forecast_date, pfh.forecast_units, pfh.lifecycle_phase,
    ROW_NUMBER() OVER (PARTITION BY pfh.pid, pfh.forecast_date ORDER BY fr.started_at DESC) rn
  FROM product_forecasts_history pfh
  JOIN forecast_runs fr ON fr.id = pfh.run_id
  WHERE pfh.forecast_date BETWEEN CURRENT_DATE - 9 AND CURRENT_DATE - 1),
eng AS (SELECT * FROM ranked WHERE rn = 1 AND lifecycle_phase != 'dormant'),
naive AS (
  SELECT o.pid, SUM(o.quantity)/30.0 AS naive_daily FROM orders o
  WHERE o.canceled IS DISTINCT FROM TRUE
    AND o.date >= CURRENT_DATE - 39 AND o.date < CURRENT_DATE - 9
  GROUP BY o.pid)
SELECT e.lifecycle_phase, COUNT(*) AS n, SUM(COALESCE(dps.units_sold,0)) AS actual,
  round(SUM(e.forecast_units),0) AS engine_fc, round(SUM(COALESCE(nv.naive_daily,0)),0) AS naive_fc,
  round(SUM(ABS(e.forecast_units - COALESCE(dps.units_sold,0)))/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS engine_wmape,
  round(SUM(ABS(COALESCE(nv.naive_daily,0) - COALESCE(dps.units_sold,0)))/NULLIF(SUM(COALESCE(dps.units_sold,0)),0),2) AS naive_wmape
FROM eng e
LEFT JOIN naive nv ON nv.pid = e.pid
LEFT JOIN daily_product_snapshots dps ON dps.pid = e.pid AND dps.snapshot_date = e.forecast_date
GROUP BY ROLLUP(e.lifecycle_phase) ORDER BY 1;

Baseline numbers to beat (June 1–9, 2026): engine 221% / naive 204% daily WMAPE; engine_fc/actual = 1.82; per-phase table in §1.