# Metrics Calculation Pipeline Audit

**Date:** 2026-02-07
**Scope:** All 6 SQL calculation scripts, custom DB functions, import pipeline, and live data verification

## Overview

The metrics pipeline in `inventory-server/scripts/calculate-metrics-new.js` runs 6 SQL scripts sequentially:

1. `update_daily_snapshots.sql`: aggregates daily per-product sales and receiving data
2. `update_product_metrics.sql`: calculates the main `product_metrics` table (KPIs, forecasting, status)
3. `update_periodic_metrics.sql`: ABC classification, average lead time
4. `calculate_brand_metrics.sql`: brand-level aggregated metrics
5. `calculate_vendor_metrics.sql`: vendor-level aggregated metrics
6. `calculate_category_metrics.sql`: category-level metrics with hierarchy rollups

### Database Scale

| Table | Row Count |
|---|---|
| products | 681,912 |
| orders | 2,883,982 |
| purchase_orders | 256,809 |
| receivings | 313,036 |
| daily_product_snapshots | 678,312 (601 distinct dates, since 2024-06-01) |
| product_metrics | 681,912 |
| brand_metrics | 1,789 |
| vendor_metrics | 281 |
| category_metrics | 610 |

---

## Issues Found

### ISSUE 1: [HIGH] Order status filter is non-functional (numeric codes vs text comparison)

**Files:** `update_daily_snapshots.sql` lines 86-101, `update_product_metrics.sql` lines 89, 178-183
**Confirmed by data:** All order statuses are numeric strings ('100', '50', '55', etc.)
**Status mappings from:** `docs/prod_registry.class.php`

**Description:** The SQL filters `COALESCE(o.status, 'pending') NOT IN ('canceled', 'returned')` and `o.status NOT IN ('canceled', 'returned')` are used throughout the pipeline to exclude canceled/returned orders. However, the import pipeline stores order statuses as their **raw numeric codes** from the production MySQL database (e.g., '100', '50', '55', '90', '92'). There are **zero text status values** in the orders table.

This means these filters **never exclude any rows**: every comparison has the form `'100' NOT IN ('canceled', 'returned')`, which is always true.

**Actual status distribution (with confirmed meanings):**

| Status | Meaning | Count | Negative Qty | Assessment |
|---|---|---|---|---|
| 100 | shipped | 2,862,792 | 3,352 | Completed; correct to include |
| 50 | awaiting_products | 11,109 | 0 | In-progress; not yet shipped |
| 55 | shipping_later | 5,689 | 0 | In-progress; not yet shipped |
| 56 | shipping_together | 2,863 | 0 | In-progress; not yet shipped |
| 90 | awaiting_shipment | 38 | 0 | Near-complete; not yet shipped |
| 92 | awaiting_pickup | 71 | 0 | Near-complete; awaiting customer |
| 95 | shipped_confirmed | 5 | 0 | Completed; correct to include |
| 15 | cancelled | 1 | 0 | Should be excluded |

**Full status reference (from prod_registry.class.php):**

- 0=created, 10=unfinished, **15=cancelled**, 16=combined, 20=placed, 22=placed_incomplete
- 30=cancelled_old (historical), 40=awaiting_payment, 50=awaiting_products
- 55=shipping_later, 56=shipping_together, 60=ready, 61=flagged
- 62=fix_before_pick, 65=manual_picking, 70=in_pt, 80=picked
- 90=awaiting_shipment, 91=remote_wait, **92=awaiting_pickup**, 93=fix_before_ship
- **95=shipped_confirmed**, **100=shipped**

**Severity revised to HIGH (from CRITICAL):** Now that the actual status meanings are known, cancelled or refunded orders are not being miscounted in any material way (only 1 cancelled order exists, status=15). The real concern is twofold:

1. **The text-based filter is dead code.** It can never match any row. Either map statuses to text during import (as the PO import does) or change the SQL to compare numeric codes.
2. **~19,770 unfulfilled orders** (statuses 50/55/56/90/92: 11,109 + 5,689 + 2,863 + 38 + 71) are counted as completed sales. These orders are in various stages of fulfillment and have not shipped yet. Most will eventually ship, but counting them now inflates current-period metrics. At 0.69% of total orders, the financial impact is modest, but the filter should work correctly on principle.

**Note:** PO statuses ARE properly mapped to text ('canceled', 'done', etc.) in the import pipeline. Only order statuses are numeric.
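
The dead-code behaviour is easy to reproduce outside the database. The following is a minimal sketch in JavaScript (the language of the import scripts), not the pipeline's actual code: `passesTextFilter` mirrors the SQL predicate quoted above, while `isCompletedSale` and the `SHIPPED` set are hypothetical names for a shipped-only rule built from the status table.

```javascript
// Status codes come from the status table above. Treating only 95 and 100
// as "completed" is an assumption made for this sketch.
const SHIPPED = new Set(['95', '100']);

// The current SQL predicate expressed in JS: numeric strings are compared
// against text labels, so no observed status is ever excluded.
function passesTextFilter(status) {
  return !['canceled', 'returned'].includes(status ?? 'pending');
}

// Hypothetical replacement: count an order as a sale only once it has shipped.
function isCompletedSale(status) {
  return SHIPPED.has(String(status));
}

const observed = ['100', '50', '55', '56', '90', '92', '95', '15'];
console.log(observed.filter(passesTextFilter).length); // 8: every status passes
console.log(observed.filter(isCompletedSale));         // [ '100', '95' ]
```

Whether in-progress statuses should count as sales is a business decision; the sketch only shows that a numeric comparison makes that decision enforceable.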

---

### ISSUE 2: [CRITICAL] Daily Snapshots use current stock instead of historical EOD stock

**File:** `update_daily_snapshots.sql`, lines 126-135, 173
**Confirmed by data:** Top product (pid 666925) shows `eod_stock_quantity = 0` for ALL dates even though it sold 28 units on Jan 28 (clearly had stock then)

**Description:** The `CurrentStock` CTE reads `stock_quantity` directly from the `products` table at query execution time. When the script processes historical dates (today minus 1-4 days), it writes **today's stock** as if it were the end-of-day stock for those past dates.

**Cascading impact on product_metrics:**

- `avg_stock_units_30d` / `avg_stock_cost_30d`: wrong averages
- `stockout_days_30d`: miscounted (based only on the stock level at run time, not the historical level)
- `stockout_rate_30d`, `service_level_30d`, `fill_rate_30d`: all derived from wrong stockout data
- `gmroi_30d`: wrong denominator (avg stock cost)
- `stockturn_30d`: wrong denominator (avg stock units)
- `sell_through_30d`: affected by stock level inaccuracy

---

### ISSUE 3: [CRITICAL] Snapshot coverage is 0.17% (most products have no snapshot data)

**Confirmed by data:** 678,312 snapshot rows across 601 dates = ~1,128 products/day out of 681,912 total

**Description:** The daily snapshots script only creates rows for products with sales or receiving activity on that date (`ProductsWithActivity` CTE, line 136). This means:

- 91.1% of products (621,221) have NULL `sales_30d`: they had no orders in the last 30 days, so no snapshot rows exist
- `AVG(eod_stock_quantity)` averages only across days with activity, not all 30 days
- `stockout_days_30d` only counts stockout days on which there was ALSO some activity
- A product that is out of stock and has zero sales gets zero `stockout_days` even though it was stocked out

This is by design (to avoid creating 681K rows/day), but it means stock-related metrics are systematically biased.
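
To make the bias concrete, here is a small illustrative sketch with invented day numbers and quantities. `stockoutDaysSparse` mimics counting over existing snapshot rows only. `stockoutDaysCarriedForward` is a hypothetical alternative, not something the pipeline does today, that carries the last known stock level across the window.

```javascript
// Sparse snapshots: rows exist only on days with sales or receiving activity.
const snapshots = [
  { day: 3, eodStock: 5 },
  { day: 10, eodStock: 0 }, // sold out on day 10, no activity afterwards
];

// Counting over existing rows only: a stockout is seen just once.
function stockoutDaysSparse(rows) {
  return rows.filter((r) => r.eodStock === 0).length;
}

// Carry the last known stock level forward across the whole window.
// Days before the first snapshot are skipped because their stock is unknown.
function stockoutDaysCarriedForward(rows, windowDays) {
  const byDay = new Map(rows.map((r) => [r.day, r.eodStock]));
  let last = null;
  let count = 0;
  for (let day = 1; day <= windowDays; day++) {
    if (byDay.has(day)) last = byDay.get(day);
    if (last === 0) count++;
  }
  return count;
}

console.log(stockoutDaysSparse(snapshots));             // 1
console.log(stockoutDaysCarriedForward(snapshots, 30)); // 21 (days 10 through 30)
```

Carrying values forward only helps once the stored `eod_stock_quantity` is itself historical (Issue 2); otherwise it propagates the wrong number.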

---

### ISSUE 4: [HIGH] `costeach` fallback to 50% of price in import pipeline

**File:** `inventory-server/scripts/import/orders.js` (line ~573)

**Description:** When the MySQL `order_costs` table has no record for an order item, `costeach` defaults to `price * 0.5`. There is **no flag** in the PostgreSQL data to distinguish actual costs from estimated ones.

**Data impact:** 385,545 products (56.5%) have `current_cost_price = 0` AND `current_landing_cost_price = 0`. For these products, the COGS calculation in daily_snapshots falls through the chain:

1. `o.costeach`: may be the 50% estimate from import
2. `get_weighted_avg_cost()`: returns NULL if no receivings exist
3. `p.landing_cost_price`: always NULL (hardcoded in import)
4. `p.cost_price`: 0 for 56.5% of products

Only 27 products have zero COGS with positive sales, so the `costeach` field is doing its job for products that sell. However, because of the 50% fallback, the margins for those products may be estimates rather than actuals, and the data cannot tell the two apart.
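
One way to make the estimate visible is to return a flag next to the cost at import time. This is a hedged sketch only: `resolveCostEach` and `costeachEstimated` are hypothetical names, and `recordedCost` stands in for whatever the import reads from MySQL `order_costs`.

```javascript
// Hypothetical helper for the order import. `recordedCost` is null or
// undefined when `order_costs` has no record for the order item.
function resolveCostEach(price, recordedCost) {
  if (recordedCost !== null && recordedCost !== undefined) {
    return { costeach: recordedCost, costeachEstimated: false };
  }
  // Same 50% fallback as today, but the row is now marked as an estimate.
  return { costeach: price * 0.5, costeachEstimated: true };
}

console.log(resolveCostEach(10, 3.25)); // { costeach: 3.25, costeachEstimated: false }
console.log(resolveCostEach(10, null)); // { costeach: 5, costeachEstimated: true }
```

A recorded cost of 0 is treated as an actual value here; whether that is right depends on how the source system records free items.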

---

### ISSUE 5: [HIGH] `landing_cost_price` is always NULL

**File:** `inventory-server/scripts/import/products.js` (line ~175)

**Description:** The import explicitly sets `landing_cost_price = NULL` for all products. The daily_snapshots COGS calculation uses it as a fallback: `COALESCE(o.costeach, get_weighted_avg_cost(...), p.landing_cost_price, p.cost_price)`. Since it is always NULL, this fallback step is useless and the chain jumps straight to `cost_price`.

The `product_metrics` field `current_landing_cost_price` is populated as `COALESCE(p.landing_cost_price, p.cost_price, 0.00)`, so it equals `cost_price` for all products. Any UI showing "landing cost" is actually just showing `cost_price`.

---

### ISSUE 6: [HIGH] Vendor lead time is drastically wrong (missing supplier_id join)

**File:** `calculate_vendor_metrics.sql`, lines 62-82
**Confirmed by data:** Vendor-level lead times are roughly 2-14x higher than product-level lead times

**Description:** The vendor metrics lead time joins POs to receivings only by `pid`:

```sql
LEFT JOIN public.receivings r ON r.pid = po.pid
```

But the periodic metrics lead time correctly matches supplier:

```sql
JOIN public.receivings r ON r.pid = po.pid AND r.supplier_id = po.supplier_id
```

Without supplier matching, a PO for product X from Vendor A can match a receiving of product X from Vendor B, creating inflated/wrong lead times.

**Measured discrepancies:**

| Vendor | Vendor Metrics Lead Time | Avg Product Lead Time |
|---|---|---|
| doodlebug design inc. | 66 days | 14 days |
| Notions | 55 days | 4 days |
| Simple Stories | 59 days | 27 days |
| Ranger Industries | 31 days | 5 days |
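
The effect of the missing join condition can be simulated in memory. The sketch below uses invented rows and hypothetical field names (`supplierId`, `date`); it is not the SQL from either script, only an illustration of how one cross-vendor match drags the average up.

```javascript
// Toy rows; pids, supplier ids, and dates are invented for illustration.
const purchaseOrders = [{ pid: 1, supplierId: 'A', date: '2026-01-01' }];
const receivings = [
  { pid: 1, supplierId: 'A', date: '2026-01-11' }, // the matching delivery
  { pid: 1, supplierId: 'B', date: '2026-03-02' }, // same product, other vendor
];

const DAY_MS = 24 * 60 * 60 * 1000;
const daysBetween = (from, to) =>
  (Date.parse(to + 'T00:00:00Z') - Date.parse(from + 'T00:00:00Z')) / DAY_MS;

// Average lead time over all PO/receiving pairs accepted by `match`.
function avgLeadTime(match) {
  const leadTimes = [];
  for (const po of purchaseOrders) {
    for (const r of receivings) {
      if (match(po, r)) leadTimes.push(daysBetween(po.date, r.date));
    }
  }
  return leadTimes.reduce((a, b) => a + b, 0) / leadTimes.length;
}

const pidOnly = avgLeadTime((po, r) => po.pid === r.pid);
const pidAndSupplier = avgLeadTime(
  (po, r) => po.pid === r.pid && po.supplierId === r.supplierId
);
console.log(pidOnly, pidAndSupplier); // 35 10
```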

---

### ISSUE 7: [MEDIUM] Net revenue does not subtract returns

**File:** `update_daily_snapshots.sql`, line 184

**Description:** `net_revenue = gross_revenue - discounts`. Standard accounting: `net_revenue = gross_revenue - discounts - returns`. The `returns_revenue` is calculated separately but not deducted.

**Data impact:** There are 3,352 orders with negative quantities (returns), totaling -5,499 units. These returns are tracked in `returns_revenue` but not reflected in `net_revenue`, so all downstream revenue-based metrics are slightly overstated.
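
A worked example of the two formulas, with made-up amounts. The sketch assumes `returns_revenue` is stored as a positive amount; the audit does not state the sign convention, and if the column holds negative values it would have to be added instead of subtracted.

```javascript
// Assumption: returnsRevenue is a positive amount. Verify the sign convention
// of the real `returns_revenue` column before applying this formula.
function netRevenue({ grossRevenue, discounts, returnsRevenue }) {
  return grossRevenue - discounts - returnsRevenue;
}

const day = { grossRevenue: 1000, discounts: 50, returnsRevenue: 120 };
console.log(day.grossRevenue - day.discounts); // 950: current formula
console.log(netRevenue(day));                   // 830: returns deducted as well
```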

---

### ISSUE 8: [MEDIUM] Lifetime revenue subquery references wrong table columns

**File:** `update_product_metrics.sql`, lines 323-329

**Description:** The lifetime revenue estimation fallback queries:

```sql
SELECT revenue_7d / NULLIF(sales_7d, 0)
FROM daily_product_snapshots
WHERE pid = ci.pid AND sales_7d > 0
```

But `daily_product_snapshots` does NOT have `revenue_7d` or `sales_7d` columns; those exist in `product_metrics`. The subquery therefore cannot compute a per-unit price from snapshot data: it either fails or yields NULL. Which of the two happens is worth confirming, because PostgreSQL normally rejects an unknown column name unless it resolves to a column of an enclosing query. The observed effect is that the estimation always falls back to `current_price * total_sold`.

---

### ISSUE 9: [MEDIUM] Brand/Vendor metrics COGS filter inflates margins

**Files:** `calculate_brand_metrics.sql` line 31, `calculate_vendor_metrics.sql` line 32

**Description:** `SUM(CASE WHEN pm.cogs_30d > 0 THEN pm.cogs_30d ELSE 0 END)` drops non-positive COGS values from the total. A zero contributes nothing to a sum either way, so in practice the condition only discards negative `cogs_30d` values, which makes the brand/vendor totals disagree with a plain aggregation of `product_metrics`. Separately, if a product has sales revenue and zero COGS (missing cost data), the brand/vendor totals include the revenue but no COGS, artificially inflating the margin; that case is a cost-data problem and is not changed by removing the filter.

**Data context:** Brand metrics match the product_metrics aggregation exactly for sales counts, but show small discrepancies in revenue (e.g., Stamperia: $7,613.98 brand vs $7,611.11 actual). These small differences come from the `> 0` filtering excluding products with negative revenue.
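
The difference between a conditional sum and a plain sum shows up only when negative values are present. The following sketch uses invented per-product numbers; `positiveOnly` mirrors the `CASE WHEN x > 0 THEN x ELSE 0 END` expression.

```javascript
// Toy per-product values for one brand; numbers are invented.
const products = [
  { revenue: 100, cogs: 60 },
  { revenue: 40, cogs: 0 },    // missing cost data
  { revenue: -20, cogs: -12 }, // net returns in the period
];

const sum = (xs) => xs.reduce((a, b) => a + b, 0);
const positiveOnly = (x) => (x > 0 ? x : 0);

const filteredCogs = sum(products.map((p) => positiveOnly(p.cogs)));
const plainCogs = sum(products.map((p) => p.cogs));
const filteredRevenue = sum(products.map((p) => positiveOnly(p.revenue)));
const plainRevenue = sum(products.map((p) => p.revenue));

console.log(filteredCogs, plainCogs);       // 60 48
console.log(filteredRevenue, plainRevenue); // 140 120
```

The zero-COGS row leaves both COGS totals unchanged, which is why missing cost data has to be addressed at the source rather than in the aggregation.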

---

### ISSUE 10: [MEDIUM] Extreme margin values from $0.01 price orders

**Confirmed by data:** 73 products with margin > 100%, 119 with margin < -100%

**Examples:**

| Product | Revenue | COGS | Margin |
|---|---|---|---|
| Flower Gift Box Die (pid 624756) | $0.02 | $29.98 | -149,800% |
| Special Flowers Stamp Set (pid 614513) | $0.01 | $11.97 | -119,632% |

These are products with extremely low prices (likely samples, promos, or data errors) where the order price was $0.01. The margin calculation is mathematically correct, but these outliers skew any aggregate margin statistics.
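
The first example row can be reproduced with the usual margin formula, (revenue - COGS) / revenue. The sketch below also includes a hypothetical guard, `isLowRevenue` with an assumed threshold `MIN_REVENUE`, for flagging rows before they enter aggregate statistics; the threshold value is illustrative, not taken from the audit.

```javascript
// Margin as a percentage of revenue; null when revenue is zero.
function marginPct(revenue, cogs) {
  if (revenue === 0) return null;
  return ((revenue - cogs) / revenue) * 100;
}

// Hypothetical guard: flag rows whose revenue is too small for the
// margin to be meaningful. The 1.00 threshold is an assumption.
const MIN_REVENUE = 1.0;
function isLowRevenue(revenue) {
  return Math.abs(revenue) < MIN_REVENUE;
}

console.log(Math.round(marginPct(0.02, 29.98))); // -149800
console.log(isLowRevenue(0.02));                 // true
```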

---

### ISSUE 11: [MEDIUM] Sell-through rate has edge cases yielding negative/extreme values

**File:** `update_product_metrics.sql`, lines 358-361
**Confirmed by data:** 30 products with negative sell-through, 10 with sell-through > 200%

**Description:** Beginning inventory is approximated as `current_stock + sales - received + returns`. When inventory adjustments, shrinkage, or manual corrections occur, this approximation breaks. Edge cases:

- Products with many manual stock adjustments → negative denominator → negative sell-through
- Products with beginning stock near zero but decent sales → sell-through > 100%
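
Both edge cases can be reproduced with a few lines. This sketch uses the beginning-inventory approximation quoted above, but the sell-through expression itself (`sales / beginning inventory`) is an assumption, since the audit does not quote the exact denominator used in `update_product_metrics.sql`. The clamp follows the `GREATEST(0, LEAST(x, 200))` idea.

```javascript
// Beginning inventory, approximated as described in the audit text.
function approxBeginningInventory({ currentStock, sales, received, returns }) {
  return currentStock + sales - received + returns;
}

// Assumed formula: sales as a percentage of approximate beginning inventory.
function sellThroughPct(m) {
  const beginning = approxBeginningInventory(m);
  if (beginning === 0) return null;
  return (m.sales / beginning) * 100;
}

// Clamp to the 0..200 range; null stays null.
function clampSellThrough(pct) {
  return pct === null ? null : Math.max(0, Math.min(pct, 200));
}

const adjusted = { currentStock: 2, sales: 5, received: 20, returns: 0 };
const nearZero = { currentStock: 0, sales: 12, received: 8, returns: 0 };
console.log(approxBeginningInventory(adjusted));          // -13
console.log(clampSellThrough(sellThroughPct(adjusted)));  // 0
console.log(clampSellThrough(sellThroughPct(nearZero)));  // 200 (raw value is 300)
```

Clamping hides the symptom; flagging the affected rows may be preferable where the raw value is needed for diagnosis.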

---

### ISSUE 12: [MEDIUM] `total_sold` uses different status filter than orders import

**Import pipeline confirmed:**

- Orders import: `order_status >= 15` (includes processing/pending orders, and status 15 itself, which is `cancelled`)
- `total_sold` in products: `order_status >= 20` (more restrictive)

All else being equal, the stricter filter would make `lifetime_sales` (from `total_sold`) slightly lower than a sum over the orders table. In the live data the gap runs the other way, and it is large:

| Product | total_sold | orders sum | Gap |
|---|---|---|---|
| pid 31286 | 13,786 | 4,241 | 9,545 |
| pid 44309 | 11,978 | 3,119 | 8,859 |

The large gaps arise because the orders table only has data from the import start date (~2024), while `total_sold` includes all-time sales from MySQL. This is expected behavior, not a bug, but it makes the `lifetime_revenue_quality` flag important: most products show 'estimated' quality.

---

### ISSUE 13: [MEDIUM] Category rollup may double-count products in multiple hierarchy levels

**File:** `calculate_category_metrics.sql`, lines 42-66

**Description:** The `RolledUpMetrics` CTE uses:

```sql
dcm.cat_id = ch.cat_id OR dcm.cat_id = ANY(SELECT cat_id FROM category_hierarchy WHERE ch.cat_id = ANY(ancestor_ids))
```

If products are assigned to categories at multiple levels in the same branch (e.g., both "Paper Crafts" and "Scrapbook Paper" which is a child of "Paper Crafts"), those products' metrics would be counted twice in the parent's rollup.
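
The double-counting scenario can be checked on a toy tree. In this sketch the category ids, the `ancestors` map, and the per-pid deduplication in `dedupedRollup` are all illustrative assumptions; the real fix would live in the SQL and depends on how assignments are stored.

```javascript
// Toy hierarchy: category 2 ("Scrapbook Paper") is a child of 1 ("Paper Crafts").
const ancestors = new Map([[1, []], [2, [1]]]);

// Product 100 is assigned to both the parent and the child category.
const assignments = [
  { pid: 100, catId: 1, revenue: 50 },
  { pid: 100, catId: 2, revenue: 50 },
  { pid: 200, catId: 2, revenue: 30 },
];

// A row rolls up into `catId` if it is assigned there or to any descendant.
const inSubtree = (row, catId) =>
  row.catId === catId || ancestors.get(row.catId).includes(catId);

// Sums every matching assignment row, so product 100 is counted twice.
function naiveRollup(catId) {
  return assignments
    .filter((r) => inSubtree(r, catId))
    .reduce((total, r) => total + r.revenue, 0);
}

// Counts each product at most once per rollup target.
function dedupedRollup(catId) {
  const seen = new Set();
  let total = 0;
  for (const r of assignments) {
    if (inSubtree(r, catId) && !seen.has(r.pid)) {
      seen.add(r.pid);
      total += r.revenue;
    }
  }
  return total;
}

console.log(naiveRollup(1), dedupedRollup(1)); // 130 80
```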

---

### ISSUE 14: [LOW] `exclude_forecast` removes products from metrics entirely

**File:** `update_product_metrics.sql`, line 509

**Description:** `WHERE s.exclude_forecast IS FALSE OR s.exclude_forecast IS NULL` is the WHERE clause of the main INSERT. Products with `exclude_forecast = TRUE` will not appear in `product_metrics` at all, rather than just having their forecast fields nulled. Currently all 681,912 products are in `product_metrics`, so this does not appear to affect any products yet.

---

### ISSUE 15: [LOW] Daily snapshots only look back 5 days

**File:** `update_daily_snapshots.sql`, line 14: `_process_days INT := 5`

If import data arrives late (>5 days), those days will never get snapshots populated. There is a separate `backfill/rebuild_daily_snapshots.sql` for historical rebuilds.

---

### ISSUE 16: [INFO] Timezone risk in order date import

**File:** `inventory-server/scripts/import/orders.js`

MySQL `DATETIME` values are timezone-naive. The import uses `new Date(order.date)`, which interprets them using the import server's local timezone. The SSH config specifies `timezone: '-05:00'` for MySQL (always EST). If the import server is in a different timezone, orders near midnight could land on the wrong date in the daily snapshots calculation.
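
A zone-independent parse is one way to remove the dependency on the server's local time. The sketch below assumes the value reaches the import as a `'YYYY-MM-DD HH:MM:SS'` string in fixed UTC-05:00 wall-clock time; if the MySQL driver already hands over a `Date` object, the conversion has to be configured on the driver instead. `parseSourceDateTime` and `sourceCalendarDate` are hypothetical helper names.

```javascript
// Assumption: naive DATETIME strings are in fixed UTC-05:00 wall-clock time.
const SOURCE_OFFSET = '-05:00';

// Build an unambiguous ISO 8601 timestamp instead of relying on the
// import server's local timezone.
function parseSourceDateTime(naive) {
  return new Date(naive.replace(' ', 'T') + SOURCE_OFFSET);
}

// Calendar date of the order in the source zone, independent of server settings.
function sourceCalendarDate(naive) {
  return naive.slice(0, 10);
}

const lateOrder = '2026-02-06 23:30:00';
console.log(parseSourceDateTime(lateOrder).toISOString()); // 2026-02-07T04:30:00.000Z
console.log(sourceCalendarDate(lateOrder));                // 2026-02-06
```

The example order is placed at 23:30 source time: its UTC instant already falls on the next day, which is exactly the case where a date derived in the wrong zone lands in the wrong daily snapshot.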

---

## Custom Functions Review

### `calculate_sales_velocity(sales_30d, stockout_days_30d)`

- Divides `sales_30d` by effective selling days: `GREATEST(30 - stockout_days, CASE WHEN sales > 0 THEN 14 ELSE 30 END)`
- The 14-day floor prevents extreme velocity for products that were mostly out of stock
- **Sound approach.** The only concern is that `stockout_days` is unreliable (Issues 2, 3)
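
For readers who want to check values by hand, the formula above translates directly into JavaScript. This is a port of the expression as quoted in this audit, not the database function itself, and it ignores NULL handling, which the audit does not describe.

```javascript
// sales_30d divided by effective selling days, with a 14-day floor for
// products that sold anything and a 30-day denominator for those that did not.
function calculateSalesVelocity(sales30d, stockoutDays30d) {
  const floorDays = sales30d > 0 ? 14 : 30;
  const effectiveDays = Math.max(30 - stockoutDays30d, floorDays);
  return sales30d / effectiveDays;
}

console.log(calculateSalesVelocity(30, 0));  // 1
console.log(calculateSalesVelocity(28, 25)); // 2: the 14-day floor applies, not 5 days
console.log(calculateSalesVelocity(0, 10));  // 0
```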

### `get_weighted_avg_cost(pid, date)`

- Quantity-weighted average cost over the last 10 receivings: `SUM(cost * qty) / SUM(qty)`
- Returns NULL if no receivings exist, which is sound fallback behavior
- **Correct implementation**
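
A quantity-weighted average of this kind can be sketched as follows. The field names `cost` and `qty`, the newest-first ordering, and the handling of a zero total quantity are assumptions for illustration; the database function may differ in those details.

```javascript
// `receivings` is assumed to be sorted newest first.
function weightedAvgCost(receivings, limit = 10) {
  const recent = receivings.slice(0, limit);
  const totalQty = recent.reduce((s, r) => s + r.qty, 0);
  // No receivings (or no received units): no cost can be derived.
  if (recent.length === 0 || totalQty === 0) return null;
  const totalCost = recent.reduce((s, r) => s + r.cost * r.qty, 0);
  return totalCost / totalQty;
}

console.log(weightedAvgCost([{ cost: 2, qty: 10 }, { cost: 4, qty: 30 }])); // 3.5
console.log(weightedAvgCost([]));                                           // null
```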

### `safe_divide(numerator, denominator)`

- Returns NULL on divide-by-zero: **correct**

### `std_numeric(value, precision)`

- Rounds to precision digits: **correct**

### `classify_demand_pattern(avg_demand, cv)`

- Uses coefficient of variation thresholds: ≤0.2 = stable, ≤0.5 = variable, low-volume+high-CV = sporadic, else lumpy
- **Reasonable classification**, though only based on a 30-day window

### `detect_seasonal_pattern(pid)`

- CROSS JOIN LATERAL (runs per product) is **expensive**: it queries `daily_product_snapshots` twice per product
- Compares current month average to yearly average, which is very simplistic
- **Functional but could be a performance bottleneck** with 681K products

### `category_hierarchy` (materialized view)

- Recursive CTE building the tree from categories: **correct implementation**
- Refreshed concurrently before category metrics calculation: **good practice**

---

## Data Health Summary

| Metric | Count | % of Total |
|---|---|---|
| Products with zero cost_price | 385,545 | 56.5% |
| Products with NULL sales_30d | 621,221 | 91.1% |
| Products with no lifetime_sales | 321,321 | 47.1% |
| Products with zero COGS but positive sales | 27 | <0.01% |
| Products with margin > 100% | 73 | 0.01% |
| Products with margin < -100% | 119 | 0.02% |
| Products with negative sell-through | 30 | <0.01% |
| Products with NULL status | 0 | 0% |
| Duplicate daily snapshots (same pid+date) | 0 | 0% |
| Net revenue formula mismatches | 0 | 0% |

### ABC Classification Distribution (replenishable products only)

| Class | Products | Revenue % |
|---|---|---|
| A | 7,727 | 80.72% |
| B | 12,048 | 15.10% |
| C | 113,647 | 4.18% |

ABC distribution looks healthy: A ≈ 80% of revenue, A+B ≈ 96%.

### Brand Metrics Consistency

Product counts and sales_30d match exactly between `brand_metrics` and direct aggregation from `product_metrics`. Revenue shows small discrepancies (for example, $2.87 for Stamperia) due to the `> 0` filter excluding products with negative revenue. **Consistent within expected tolerance.**

---

## Priority Recommendations

### Must Fix (Correctness Issues)

1. **Issue 1: Fix order status handling.** The text-based filter (`NOT IN ('canceled', 'returned')`) is dead code against numeric statuses. Two options: (a) map numeric statuses to text during import (as the PO import already does), or (b) change the SQL to filter on numeric codes (e.g., `o.status::int >= 20` to exclude cancelled/unfinished orders, or `o.status IN ('100', '95')` for shipped-only). Note that `>= 20` still admits 30 (`cancelled_old`), so a range test alone is not a complete cancellation filter. The ~19.8K unfulfilled orders (0.69%) have a minor financial impact, but the filter should be functional.
2. **Issue 6: Add supplier_id join to vendor lead time.** One-line fix in `calculate_vendor_metrics.sql`
3. **Issue 8: Fix lifetime revenue subquery.** Use correct column names from `daily_product_snapshots` (e.g., `net_revenue / NULLIF(units_sold, 0)`)

### Should Fix (Data Quality)

4. **Issue 2/3: Snapshot coverage.** Consider creating snapshot rows for all in-stock products, not just those with activity. At minimum, calculate stockout metrics by comparing snapshot existence to product existence.
5. **Issue 5: Populate landing_cost_price.** If it is available in the source system, import it. Otherwise remove references to avoid confusion.
6. **Issue 7: Subtract returns from net_revenue.** `net_revenue = gross_revenue - discounts - returns_revenue`
7. **Issue 9: Remove > 0 filter on COGS.** Use `SUM(pm.cogs_30d)` instead of conditional sums

### Nice to Fix (Edge Cases)

8. **Issue 4: Flag estimated costs.** Add a `costeach_estimated BOOLEAN` to orders during import
9. **Issue 10: Cap or flag extreme margins.** Exclude $0.01-price orders from margin calculations
10. **Issue 11: Clamp sell-through.** `GREATEST(0, LEAST(sell_through_30d, 200))` or flag outliers
11. **Issue 13: Verify category assignment policy.** Check whether products are assigned to leaf categories only
12. **Issue 13: Category rollup query.** Verify with actual data that nothing is double-counted