inventory/docs/METRICS_AUDIT.md

Metrics Calculation Pipeline Audit

Date: 2026-02-07
Scope: All 6 SQL calculation scripts, custom DB functions, import pipeline, and live data verification

Overview

The metrics pipeline in inventory-server/scripts/calculate-metrics-new.js runs 6 SQL scripts sequentially:

  1. update_daily_snapshots.sql — Aggregates daily per-product sales/receiving data
  2. update_product_metrics.sql — Calculates the main product_metrics table (KPIs, forecasting, status)
  3. update_periodic_metrics.sql — ABC classification, average lead time
  4. calculate_brand_metrics.sql — Brand-level aggregated metrics
  5. calculate_vendor_metrics.sql — Vendor-level aggregated metrics
  6. calculate_category_metrics.sql — Category-level metrics with hierarchy rollups

Database Scale

| Table | Row Count |
| --- | --- |
| products | 681,912 |
| orders | 2,883,982 |
| purchase_orders | 256,809 |
| receivings | 313,036 |
| daily_product_snapshots | 678,312 (601 distinct dates, since 2024-06-01) |
| product_metrics | 681,912 |
| brand_metrics | 1,789 |
| vendor_metrics | 281 |
| category_metrics | 610 |

Issues Found

ISSUE 1: [HIGH] Order status filter is non-functional — numeric codes vs text comparison

Files: update_daily_snapshots.sql lines 86-101; update_product_metrics.sql lines 89, 178-183
Confirmed by data: All order statuses are numeric strings ('100', '50', '55', etc.)
Status mappings from: docs/prod_registry.class.php

Description: The SQL filters COALESCE(o.status, 'pending') NOT IN ('canceled', 'returned') and o.status NOT IN ('canceled', 'returned') are used throughout the pipeline to exclude canceled/returned orders. However, the import pipeline stores order statuses as their raw numeric codes from the production MySQL database (e.g., '100', '50', '55', '90', '92'). There are zero text status values in the orders table.

This means these filters never exclude any rows — every comparison is '100' NOT IN ('canceled', 'returned') which is always true.

Actual status distribution (with confirmed meanings):

| Status | Meaning | Count | Negative Qty | Assessment |
| --- | --- | --- | --- | --- |
| 100 | shipped | 2,862,792 | 3,352 | Completed — correct to include |
| 50 | awaiting_products | 11,109 | 0 | In-progress — not yet shipped |
| 55 | shipping_later | 5,689 | 0 | In-progress — not yet shipped |
| 56 | shipping_together | 2,863 | 0 | In-progress — not yet shipped |
| 90 | awaiting_shipment | 38 | 0 | Near-complete — not yet shipped |
| 92 | awaiting_pickup | 71 | 0 | Near-complete — awaiting customer |
| 95 | shipped_confirmed | 5 | 0 | Completed — correct to include |
| 15 | cancelled | 1 | 0 | Should be excluded |

Full status reference (from prod_registry.class.php):

  • 0=created, 10=unfinished, 15=cancelled, 16=combined, 20=placed, 22=placed_incomplete
  • 30=cancelled_old (historical), 40=awaiting_payment, 50=awaiting_products
  • 55=shipping_later, 56=shipping_together, 60=ready, 61=flagged
  • 62=fix_before_pick, 65=manual_picking, 70=in_pt, 80=picked
  • 90=awaiting_shipment, 91=remote_wait, 92=awaiting_pickup, 93=fix_before_ship
  • 95=shipped_confirmed, 100=shipped

Severity revised to HIGH (from CRITICAL): Now that we know the actual meanings, no cancelled/refunded orders are being miscounted (only 1 cancelled order exists, status=15). The real concern is twofold:

  1. The text-based filter is dead code — it can never match any row. Either map statuses to text during import (like POs do) or change SQL to use numeric comparisons.
  2. ~19,770 unfulfilled orders (statuses 50/55/56/90/92) are counted as completed sales. These are orders in various stages of fulfillment that have not yet shipped. While most will eventually ship, counting them now inflates current-period metrics. At 0.69% of total orders, the financial impact is modest, but the filter should work correctly on principle.

Note: PO statuses ARE properly mapped to text ('canceled', 'done', etc.) in the import pipeline. Only order statuses are numeric.
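Under option 2 above, a minimal sketch of a functional numeric filter. The exact set of statuses to include is a product decision; the predicates below are illustrations, not the scripts' current code:

```sql
-- Shipped-only semantics (statuses confirmed as completed in the table above):
WHERE o.status IN ('95', '100')

-- Or, if in-progress orders should still count, exclude only cancelled/unfinished
-- codes. 30 = cancelled_old must be listed explicitly: a plain
-- o.status::int >= 20 check would let it through.
WHERE COALESCE(o.status, '0')::int NOT IN (0, 10, 15, 16, 30)
```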


ISSUE 2: [CRITICAL] Daily Snapshots use current stock instead of historical EOD stock

File: update_daily_snapshots.sql, lines 126-135, 173
Confirmed by data: Top product (pid 666925) shows eod_stock_quantity = 0 for ALL dates even though it sold 28 units on Jan 28 (clearly had stock then)

Description: The CurrentStock CTE reads stock_quantity directly from the products table at query execution time. When the script processes historical dates (today minus 1-4 days), it writes today's stock as if it were the end-of-day stock for those past dates.

Cascading impact on product_metrics:

  • avg_stock_units_30d / avg_stock_cost_30d — Wrong averages
  • stockout_days_30d — Undercounts (only based on current stock state, not historical)
  • stockout_rate_30d, service_level_30d, fill_rate_30d — All derived from wrong stockout data
  • gmroi_30d — Wrong denominator (avg stock cost)
  • stockturn_30d — Wrong denominator (avg stock units)
  • sell_through_30d — Affected by stock level inaccuracy
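One possible fix, sketched here as a starting point: trust products.stock_quantity only for today's snapshot, and reconstruct the 1-4 backfill days by replaying deltas backwards from the current stock. Column names beyond those quoted in this audit (o.quantity, r.quantity, o.date, r.date) are assumptions:

```sql
-- Sketch: EOD stock for a past _target_date, derived by undoing everything
-- that happened after that date.
SELECT p.pid,
       p.stock_quantity
         + COALESCE((SELECT SUM(o.quantity) FROM orders o          -- add back units sold since
                     WHERE o.pid = p.pid AND o.date > _target_date), 0)
         - COALESCE((SELECT SUM(r.quantity) FROM receivings r      -- remove units received since
                     WHERE r.pid = p.pid AND r.date > _target_date), 0)
         AS eod_stock_quantity
FROM products p;
```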

ISSUE 3: [CRITICAL] Snapshot coverage is 0.17% — most products have no snapshot data

Confirmed by data: 678,312 snapshot rows across 601 dates = ~1,128 products/day out of 681,912 total

Description: The daily snapshots script only creates rows for products with sales or receiving activity on that date (ProductsWithActivity CTE, line 136). This means:

  • 91.1% of products (621,221) have NULL sales_30d — they had no orders in the last 30 days so no snapshot rows exist
  • AVG(eod_stock_quantity) averages only across days with activity, not 30 days
  • stockout_days_30d only counts stockout days where there was ALSO some activity
  • A product out of stock with zero sales gets zero stockout_days even though it was stocked out

This is by design (to avoid creating 681K rows/day) but means stock-related metrics are systematically biased.
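If full coverage is acceptable, a sketch of a daily full-population pass. Writing only CURRENT_DATE avoids reintroducing the Issue 2 problem of stamping today's stock onto past dates; the real snapshot table has more columns than shown, and restricting to replenishable products to control row growth is an assumption worth considering:

```sql
-- Sketch: one snapshot row per product for today (~681K rows/day at full scale).
INSERT INTO daily_product_snapshots (pid, snapshot_date, eod_stock_quantity)
SELECT p.pid, CURRENT_DATE, p.stock_quantity
FROM products p
ON CONFLICT (pid, snapshot_date)
DO UPDATE SET eod_stock_quantity = EXCLUDED.eod_stock_quantity;
```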


ISSUE 4: [HIGH] costeach fallback to 50% of price in import pipeline

File: inventory-server/scripts/import/orders.js (line ~573)

Description: When the MySQL order_costs table has no record for an order item, costeach defaults to price * 0.5. There is no flag in the PostgreSQL data to distinguish actual costs from estimated ones.

Data impact: 385,545 products (56.5%) have current_cost_price = 0 AND current_landing_cost_price = 0. For these products, the COGS calculation in daily_snapshots falls through the chain:

  1. o.costeach — May be the 50% estimate from import
  2. get_weighted_avg_cost() — Returns NULL if no receivings exist
  3. p.landing_cost_price — Always NULL (hardcoded in import)
  4. p.cost_price — 0 for 56.5% of products

Only 27 products have zero COGS with positive sales, meaning the costeach field is doing its job for products that sell, but the 50% fallback means margins for those products are estimates, not actuals.
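A sketch of the flag proposed in the recommendations below; the column name costeach_estimated is a proposal, not existing schema:

```sql
-- Let downstream margin reports separate actual costs from the 50% estimates.
ALTER TABLE orders
  ADD COLUMN IF NOT EXISTS costeach_estimated BOOLEAN NOT NULL DEFAULT FALSE;
-- The import would set costeach_estimated = TRUE whenever it falls back
-- to price * 0.5 for a missing order_costs record.
```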


ISSUE 5: [HIGH] landing_cost_price is always NULL

File: inventory-server/scripts/import/products.js (line ~175)

Description: The import explicitly sets landing_cost_price = NULL for all products. The daily_snapshots COGS calculation uses it as a fallback: COALESCE(o.costeach, get_weighted_avg_cost(...), p.landing_cost_price, p.cost_price). Since it's always NULL, this fallback step is useless and the chain jumps straight to cost_price.

The product_metrics field current_landing_cost_price is populated as COALESCE(p.landing_cost_price, p.cost_price, 0.00), so it equals cost_price for all products. Any UI showing "landing cost" is actually just showing cost_price.
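Until landing_cost_price is actually imported, the fallback chain could be written without the dead step. A sketch (argument names for get_weighted_avg_cost are assumed from its signature described later in this audit):

```sql
-- Behaviorally identical to the current chain today, since landing_cost_price
-- is always NULL, but makes the real fallback order explicit:
COALESCE(o.costeach,
         get_weighted_avg_cost(o.pid, s.snapshot_date),
         p.cost_price) AS unit_cogs
```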


ISSUE 6: [HIGH] Vendor lead time is drastically wrong — missing supplier_id join

File: calculate_vendor_metrics.sql, lines 62-82
Confirmed by data: Vendor-level lead times are 2-10x higher than product-level lead times

Description: The vendor metrics lead time joins POs to receivings only by pid:

LEFT JOIN public.receivings r ON r.pid = po.pid

But the periodic metrics lead time correctly matches supplier:

JOIN public.receivings r ON r.pid = po.pid AND r.supplier_id = po.supplier_id

Without supplier matching, a PO for product X from Vendor A can match a receiving of product X from Vendor B, creating inflated/wrong lead times.

Measured discrepancies:

| Vendor | Vendor Metrics Lead Time | Avg Product Lead Time |
| --- | --- | --- |
| doodlebug design inc. | 66 days | 14 days |
| Notions | 55 days | 4 days |
| Simple Stories | 59 days | 27 days |
| Ranger Industries | 31 days | 5 days |
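The fix is the join condition already used by the periodic metrics script, applied to calculate_vendor_metrics.sql:

```sql
-- Match receivings on supplier as well as product, mirroring
-- update_periodic_metrics.sql, so Vendor A's POs cannot match Vendor B's receivings.
LEFT JOIN public.receivings r
       ON r.pid = po.pid
      AND r.supplier_id = po.supplier_id
```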

ISSUE 7: [MEDIUM] Net revenue does not subtract returns

File: update_daily_snapshots.sql, line 184

Description: net_revenue = gross_revenue - discounts. Standard accounting: net_revenue = gross_revenue - discounts - returns. The returns_revenue is calculated separately but not deducted.

Data impact: There are 3,352 orders with negative quantities (returns), totaling -5,499 units. These returns are tracked in returns_revenue but not reflected in net_revenue, which means all downstream revenue-based metrics are slightly overstated.
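The corrected expression, using the returns_revenue the script already computes:

```sql
-- Standard net revenue: deduct returns as well as discounts.
gross_revenue - discounts - returns_revenue AS net_revenue
```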


ISSUE 8: [MEDIUM] Lifetime revenue subquery references wrong table columns

File: update_product_metrics.sql, lines 323-329

Description: The lifetime revenue estimation fallback queries:

```sql
SELECT revenue_7d / NULLIF(sales_7d, 0)
FROM daily_product_snapshots
WHERE pid = ci.pid AND sales_7d > 0
```

But daily_product_snapshots does NOT have revenue_7d or sales_7d columns — those exist in product_metrics. This subquery either errors silently or returns NULL. The effect is that the estimation always falls back to current_price * total_sold.
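A sketch of the corrected fallback. net_revenue and units_sold are assumed to be the per-day column names on daily_product_snapshots (the recommendation below uses the same names); summing over the product's history yields an average realized unit price:

```sql
-- Average unit price from snapshot history, as a lifetime-revenue estimator.
SELECT SUM(net_revenue) / NULLIF(SUM(units_sold), 0)
FROM daily_product_snapshots
WHERE pid = ci.pid AND units_sold > 0
```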


ISSUE 9: [MEDIUM] Brand/Vendor metrics COGS filter inflates margins

Files: calculate_brand_metrics.sql line 31; calculate_vendor_metrics.sql line 32

Description: SUM(CASE WHEN pm.cogs_30d > 0 THEN pm.cogs_30d ELSE 0 END) excludes products with zero COGS. But if a product has sales revenue and zero COGS (missing cost data), the brand/vendor totals will include the revenue but not the COGS, artificially inflating the margin.

Data context: Brand metrics revenue matches product_metrics aggregation exactly for sales counts, but shows small discrepancies in revenue (e.g., Stamperia: $7,613.98 brand vs $7,611.11 actual). These tiny diffs come from the > 0 filtering excluding products with negative revenue.
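The fix proposed in the recommendations, sketched:

```sql
-- Plain sums keep revenue and COGS symmetric: negative or zero rows net out
-- instead of being dropped from one side of the margin only.
SUM(pm.cogs_30d)    AS cogs_30d,
SUM(pm.revenue_30d) AS revenue_30d
```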


ISSUE 10: [MEDIUM] Extreme margin values from $0.01 price orders

Confirmed by data: 73 products with margin > 100%, 119 with margin < -100%

Examples:

| Product | Revenue | COGS | Margin |
| --- | --- | --- | --- |
| Flower Gift Box Die (pid 624756) | $0.02 | $29.98 | -149,800% |
| Special Flowers Stamp Set (pid 614513) | $0.01 | $11.97 | -119,632% |

These are products with extremely low prices (likely samples, promos, or data errors) where the order price was $0.01. The margin calculation is mathematically correct but these outliers skew any aggregate margin statistics.
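One way to keep these outliers out of aggregate statistics, sketched with assumed names (avg_selling_price and the $0.50 threshold are illustrations to tune, not existing schema):

```sql
-- Exclude penny-priced products from aggregate margin statistics.
AVG(margin_30d) FILTER (WHERE avg_selling_price >= 0.50) AS avg_margin_30d
```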


ISSUE 11: [MEDIUM] Sell-through rate has edge cases yielding negative/extreme values

File: update_product_metrics.sql, lines 358-361
Confirmed by data: 30 products with negative sell-through, 10 with sell-through > 200%

Description: Beginning inventory is approximated as current_stock + sales - received + returns. When inventory adjustments, shrinkage, or manual corrections occur, this approximation breaks. Edge cases:

  • Products with many manual stock adjustments → negative denominator → negative sell-through
  • Products with beginning stock near zero but decent sales → sell-through > 100%
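The clamp-and-flag approach from the recommendations, sketched:

```sql
-- Clamp stored sell-through to a sane range and surface the outliers
-- rather than letting broken approximations flow into aggregates.
GREATEST(0, LEAST(sell_through_30d, 200)) AS sell_through_30d,
(sell_through_30d < 0 OR sell_through_30d > 200) AS sell_through_outlier
```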

ISSUE 12: [MEDIUM] total_sold uses different status filter than orders import

Import pipeline confirmed:

  • Orders import: order_status >= 15 (includes processing/pending orders)
  • total_sold in products: order_status >= 20 (more restrictive)

This means lifetime_sales (from total_sold) is systematically lower than what you'd calculate by summing the orders table. The discrepancy is confirmed:

| Product | total_sold | orders sum | Gap |
| --- | --- | --- | --- |
| pid 31286 | 13,786 | 4,241 | 9,545 |
| pid 44309 | 11,978 | 3,119 | 8,859 |

The large gaps are because the orders table only has data from the import start date (~2024), while total_sold includes all-time sales from MySQL. This is expected behavior, not a bug, but it means the lifetime_revenue_quality flag is important — most products show 'estimated' quality.


ISSUE 13: [MEDIUM] Category rollup may double-count products in multiple hierarchy levels

File: calculate_category_metrics.sql, lines 42-66

Description: The RolledUpMetrics CTE uses:

```sql
dcm.cat_id = ch.cat_id
  OR dcm.cat_id = ANY(SELECT cat_id FROM category_hierarchy
                      WHERE ch.cat_id = ANY(ancestor_ids))
```

If products are assigned to categories at multiple levels in the same branch (e.g., both "Paper Crafts" and "Scrapbook Paper" which is a child of "Paper Crafts"), those products' metrics would be counted twice in the parent's rollup.


ISSUE 14: [LOW] exclude_forecast removes products from metrics entirely

File: update_product_metrics.sql, line 509

Description: WHERE s.exclude_forecast IS FALSE OR s.exclude_forecast IS NULL sits on the main INSERT's WHERE clause, so products with exclude_forecast = TRUE are dropped from product_metrics entirely rather than just having their forecast fields nulled. All 681,912 products are currently present in product_metrics, so the clause affects no rows yet.


ISSUE 15: [LOW] Daily snapshots only look back 5 days

File: update_daily_snapshots.sql, line 14 — _process_days INT := 5

If import data arrives late (>5 days), those days will never get snapshots populated. There is a separate backfill/rebuild_daily_snapshots.sql for historical rebuilds.


ISSUE 16: [INFO] Timezone risk in order date import

File: inventory-server/scripts/import/orders.js

MySQL DATETIME values are timezone-naive. The import uses new Date(order.date) which interprets them using the import server's local timezone. The SSH config specifies timezone: '-05:00' for MySQL (always EST). If the import server is in a different timezone, orders near midnight could land on the wrong date in the daily snapshots calculation.


Custom Functions Review

calculate_sales_velocity(sales_30d, stockout_days_30d)

  • Divides sales_30d by effective selling days: GREATEST(30 - stockout_days, CASE WHEN sales > 0 THEN 14 ELSE 30 END)
  • The 14-day floor prevents extreme velocity for products mostly out of stock
  • Sound approach — the only concern is that stockout_days is unreliable (Issues 2, 3)
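As described, the function body reduces to roughly the following expression (a paraphrase of the behavior above, not the function's verbatim source):

```sql
-- Velocity = sales over effective selling days. The 14-day floor keeps
-- mostly-stocked-out products from producing extreme velocities.
sales_30d::numeric
  / GREATEST(30 - stockout_days_30d,
             CASE WHEN sales_30d > 0 THEN 14 ELSE 30 END)
```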

get_weighted_avg_cost(pid, date)

  • Weighted average of last 10 receivings by cost*qty/qty
  • Returns NULL if no receivings — sound fallback behavior
  • Correct implementation

safe_divide(numerator, denominator)

  • Returns NULL on divide-by-zero — correct

std_numeric(value, precision)

  • Rounds to precision digits — correct

classify_demand_pattern(avg_demand, cv)

  • Uses coefficient of variation thresholds: ≤0.2 = stable, ≤0.5 = variable, low-volume+high-CV = sporadic, else lumpy
  • Reasonable classification, though only based on 30-day window

detect_seasonal_pattern(pid)

  • CROSS JOIN LATERAL (runs per product) — expensive: queries daily_product_snapshots twice per product
  • Compares current month average to yearly average — very simplistic
  • Functional but could be a performance bottleneck with 681K products

category_hierarchy (materialized view)

  • Recursive CTE building tree from categories — correct implementation
  • Refreshed concurrently before category metrics calculation — good practice

Data Health Summary

| Metric | Count | % of Total |
| --- | --- | --- |
| Products with zero cost_price | 385,545 | 56.5% |
| Products with NULL sales_30d | 621,221 | 91.1% |
| Products with no lifetime_sales | 321,321 | 47.1% |
| Products with zero COGS but positive sales | 27 | <0.01% |
| Products with margin > 100% | 73 | <0.01% |
| Products with margin < -100% | 119 | <0.01% |
| Products with negative sell-through | 30 | <0.01% |
| Products with NULL status | 0 | 0% |
| Duplicate daily snapshots (same pid+date) | 0 | 0% |
| Net revenue formula mismatches | 0 | 0% |

ABC Classification Distribution (replenishable products only)

| Class | Products | Revenue % |
| --- | --- | --- |
| A | 7,727 | 80.72% |
| B | 12,048 | 15.10% |
| C | 113,647 | 4.18% |

ABC distribution looks healthy — A ≈ 80%, A+B ≈ 96%.

Brand Metrics Consistency

Product counts and sales_30d match exactly between brand_metrics and direct aggregation from product_metrics. Revenue shows sub-dollar discrepancies due to the > 0 filter excluding products with negative revenue. Consistent within expected tolerance.


Priority Recommendations

Must Fix (Correctness Issues)

  1. Issue 1: Fix order status handling — The text-based filter (NOT IN ('canceled', 'returned')) is dead code against numeric statuses. Two options: (a) map numeric statuses to text during import (as POs already do), or (b) change the SQL to filter on numeric codes (e.g., o.status IN ('95', '100') for shipped-only, or an explicit exclusion list — note that a plain o.status::int >= 20 would still admit 30=cancelled_old). The ~19.7K unfulfilled orders (0.69%) are a minor financial impact, but the filter should be functional.
  2. Issue 6: Add supplier_id join to vendor lead time — One-line fix in calculate_vendor_metrics.sql
  3. Issue 8: Fix lifetime revenue subquery — Use correct column names from daily_product_snapshots (e.g., net_revenue / NULLIF(units_sold, 0))

Should Fix (Data Quality)

  1. Issue 2/3: Snapshot coverage — Consider creating snapshot rows for all in-stock products, not just those with activity. Or at minimum, calculate stockout metrics by comparing snapshot existence to product existence.
  2. Issue 5: Populate landing_cost_price — If available in the source system, import it. Otherwise remove references to avoid confusion.
  3. Issue 7: Subtract returns from net_revenue — net_revenue = gross_revenue - discounts - returns_revenue
  4. Issue 9: Remove > 0 filter on COGS — Use SUM(pm.cogs_30d) instead of conditional sums

Nice to Fix (Edge Cases)

  1. Issue 4: Flag estimated costs — Add a costeach_estimated BOOLEAN to orders during import
  2. Issue 10: Cap or flag extreme margins — Exclude $0.01-price orders from margin calculations
  3. Issue 11: Clamp sell-through — GREATEST(0, LEAST(sell_through_30d, 200)) or flag outliers
  4. Issue 13: Category rollup — Verify with real data that no double-counting occurs, and check whether products are assigned to leaf categories only