Restore accidentally removed files, a few additional import/calculation fixes

2026-02-09 10:19:35 -05:00
parent 6aefc1b40d
commit 38b12c188f
209 changed files with 69925 additions and 412 deletions
@@ -0,0 +1,276 @@
+# Metrics Pipeline Audit Report
+
+**Date:** 2026-02-08
+**Scope:** All 6 SQL scripts in `inventory-server/scripts/metrics-new/`, import pipeline, custom functions, and post-calculation data verification.
+
+---
+
+## Executive Summary
+
+The metrics pipeline is architecturally sound and the core calculations are mostly correct. The 30-day sales, revenue, replenishment, and aggregate metrics (brand/vendor/category) all cross-check accurately between the snapshots, product_metrics, and direct orders queries. However, several issues were found ranging from **critical data bugs** to **design limitations** that affect accuracy of specific metrics.
+
+**Issues found: 13** (3 Critical, 4 Medium, 6 Low/Informational)
+
+---
+
+## CRITICAL Issues
+
+### C1. `net_revenue` in daily snapshots never subtracts returns ($35.6K affected)
+
+**Location:** `update_daily_snapshots.sql`, line 181
+**Symptom:** `net_revenue` is stored as `gross_revenue - discounts` but should be `gross_revenue - discounts - returns_revenue`.
+
+The SQL formula on line 181 appears correct:
+```sql
+COALESCE(sd.gross_revenue_unadjusted, 0.00) - COALESCE(sd.discounts, 0.00) - COALESCE(sd.returns_revenue, 0.00) AS net_revenue
+```
+
+However, actual data shows `net_revenue = gross_revenue - discounts` for ALL 3,252 snapshots that have returns. Total returns not subtracted: **$35,630.03** across 2,946 products. This may be caused by the `returns_revenue` in the SalesData CTE not properly flowing through to the INSERT, or by a prior version of the code that stored these values differently. The profit column (line 184) has the same issue: `(gross - discounts) - cogs` instead of `(gross - discounts - returns) - cogs`.
+
+**Impact:** Net revenue and profit are overstated by the amount of returns. This cascades to all metrics derived from snapshots: `revenue_30d`, `profit_30d`, `margin_30d`, `avg_ros_30d`, and all brand/vendor/category aggregate revenue.
+
+**Recommended fix:** Debug why the returns subtraction isn't taking effect. The formula in the SQL looks correct, so this may be a data-type issue or an execution path issue. After fixing, rebuild snapshots.
+
+**Status:** Owner will resolve. Code formula is correct; snapshots need rebuilding after prior fix deployment.
+
+---
+
+### C2. `eod_stock_quantity` uses CURRENT stock, not historical end-of-day stock
+
+**Location:** `update_daily_snapshots.sql`, lines 123-132 (CurrentStock CTE)
+**Symptom:** Every snapshot for a given product shows the same stock quantity regardless of the snapshot date.
+
+The `CurrentStock` CTE simply reads `stock_quantity` from the `products` table:
+```sql
+SELECT pid, stock_quantity, ... FROM public.products
+```
+
+This means a snapshot from January 10 shows the SAME stock as today (February 8). Verified in data:
+- Product 662561: stock = 36 on every date (Feb 1-7)
+- Product 665397: stock = 25 on every date (Feb 1-7)
+- All products checked show identical stock across all snapshot dates
+
+**Impact:** All stock-derived metrics are inaccurate for historical analysis:
+- `eod_stock_cost`, `eod_stock_retail`, `eod_stock_gross` (all wrong for past dates)
+- `stockout_flag` (based on current stock, not historical)
+- `stockout_days_30d` (undercounted since stockout_flag uses current stock)
+- `avg_stock_units_30d`, `avg_stock_cost_30d` (no variance, just current stock repeated)
+- `gmroi_30d`, `stockturn_30d` (based on avg_stock which is flat)
+- `sell_through_30d` (denominator uses current stock assumption)
+- `service_level_30d`, `fill_rate_30d`
+
+**This is a known architectural limitation** noted in MEMORY.md. Fixing requires either:
+1. Storing stock snapshots separately at end-of-day (ideally via a cron job that records stock before any changes)
+2. Reconstructing historical stock from orders and receivings (complex but possible)
+
+**Status: FIXED.** MySQL's `snap_product_value` table (daily EOD stock per product since 2012) is now imported into PostgreSQL `stock_snapshots` table via `scripts/import/stock-snapshots.js`. The `CurrentStock` CTE in `update_daily_snapshots.sql` now uses `LEFT JOIN stock_snapshots` for historical stock, falling back to `products.stock_quantity` when no historical data exists. Requires: run import, then rebuild daily snapshots.
+
+---
+
+### C3. `ON CONFLICT DO UPDATE WHERE` check skips 91%+ of product_metrics updates
+
+**Location:** `update_product_metrics.sql`, lines 558-574
+**Symptom:** 623,205 of 681,912 products (91.4%) have `last_calculated` older than 1 day. 592,369 are over 30 days old. 914 products with active 30-day sales haven't been updated in over 7 days.
+
+The upsert's `WHERE` clause only updates if specific fields changed:
+```sql
+WHERE product_metrics.current_stock IS DISTINCT FROM EXCLUDED.current_stock OR
+      product_metrics.current_price IS DISTINCT FROM EXCLUDED.current_price OR ...
+```
+
+Fields NOT checked include: `stockout_days_30d`, `margin_30d`, `gmroi_30d`, `demand_pattern`, `seasonality_index`, `sales_growth_*`, `service_level_30d`, and many others. If a product's stock, price, sales, and revenue haven't changed, the entire row is skipped even though growth metrics, variability, and other derived fields may need updating.
+
+**Impact:** Most derived metrics (growth, demand patterns, seasonality) are stale for the majority of products. Products with steady sales but unchanged stock/price never get their growth metrics recalculated.
+
+**Recommended fix:** Either:
+1. Remove the `WHERE` clause entirely (accept the performance cost of writing all rows every run)
+2. Add `last_calculated` age check: `OR product_metrics.last_calculated < NOW() - INTERVAL '7 days'`
+3. Add the missing fields to the change-detection check
+
+**Status: FIXED.** Added 12 derived fields to the `IS DISTINCT FROM` check (`profit_30d`, `cogs_30d`, `margin_30d`, `stockout_days_30d`, `sell_through_30d`, `sales_growth_30d_vs_prev`, `revenue_growth_30d_vs_prev`, `demand_pattern`, `seasonal_pattern`, `seasonality_index`, `service_level_30d`, `fill_rate_30d`) plus a time-based safety net: `OR product_metrics.last_calculated < NOW() - INTERVAL '1 day'`. This guarantees every row is refreshed at least daily.
+
+---
+
+## MEDIUM Issues
+
+### M1. Demand variability calculated only over activity days, not full 30-day window
+
+**Location:** `update_product_metrics.sql`, DemandVariability CTE (lines 206-223)
+**Symptom:** Variance, std_dev, and CV are computed over only the days that appear in snapshots (activity days), not the full 30-day period including zero-sales days.
+
+Example: Product 41141 (Mexican Poppy) sold 102 units in 30 days across only 3 snapshot days (1, 1, 100). The variance/CV is calculated over just those 3 data points instead of 30 (with 27 zero-sales days).
+
+**Impact:**
+- CV is computed on sparse data (3-10 points instead of 30), making it statistically unreliable
+- Products with sporadic large orders appear less variable than they really are
+- `demand_pattern` classification is affected (stable/variable/sporadic/lumpy)
+
+**Recommended fix:** Join against a generated 30-day date series and COALESCE missing days to 0 units sold before computing variance/stddev/CV.
+
+**Status: FIXED.** Rewrote `DemandVariability` CTE to use `generate_series()` for the full 30-day date range, `CROSS JOIN` with distinct PIDs from snapshots, and `LEFT JOIN` actual snapshot data with `COALESCE(dps.units_sold, 0)` for missing days. Variance/stddev/CV now computed over all 30 data points.
+
+---
+
+### M2. `costeach` fallback to `price * 0.5` affects 32.5% of recent orders
+
+**Location:** `orders.js`, line 600 and 634
+**Symptom:** When no cost record exists in `order_costs`, the import falls back to `price * 0.5`.
+
+Data shows 9,839 of 30,266 recent orders (32.5%) use this fallback. Among these, 79 paid products have `costeach = 0` because `price = 0 * 0.5 = 0`, even though the product has a real cost_price.
+
+The daily snapshot has a second line of defense (using `get_weighted_avg_cost()` and then `p.cost_price`), but the orders table's `costeach` column itself contains inaccurate data for ~1/3 of orders.
+
+**Impact:** COGS calculations at the order level are approximate for 1/3 of orders. The snapshot's fallback chain mitigates this somewhat, but any analytics using `orders.costeach` directly will be affected.
+
+**Status: FIXED.** Added `products.cost_price` as intermediate fallback: `COALESCE(oc.costeach, p.cost_price, oi.price * 0.5)`. The products table join was added to both the `order_totals` CTE and the outer SELECT in `orders.js`. Requires a full orders re-import to apply retroactively.
+
+---
+
+### M3. `lifetime_sales` uses MySQL `total_sold` (status >= 20) but orders import uses status >= 15
+
+**Location:** `products.js` line 200 vs `orders.js` line 69
+**Symptom:** `total_sold` in the products table comes from MySQL with `order_status >= 20`, excluding status 15 (canceled) and 16 (combined). But the orders import fetches orders with `order_status >= 15`.
+
+Verified in MySQL: For product 31286, `total_sold` (>=20) = 13,786 vs (>=15) = 13,905 (difference of 119 units).
+
+**Impact:** `lifetime_sales` in product_metrics (sourced from `products.total_sold`) slightly understates compared to what the orders table contains. The `lifetime_revenue_quality` field correctly flags most as "estimated" since the orders table only covers ~5 years while `total_sold` is all-time. This is a minor inconsistency (< 1% difference).
+
+**Status:** Accepted. < 1% difference, not worth the complexity of aligning thresholds.
+
+---
+
+### M4. `sell_through_30d` has 868 NULL values and 547 anomalous values for products with sales
+
+**Location:** `update_product_metrics.sql`, lines 356-361
+**Formula:** `(sales_30d / (current_stock + sales_30d + returns_units_30d - received_qty_30d)) * 100`
+
+- 868 products with sales but NULL sell_through (denominator = 0, which happens when `current_stock + sales - received = 0`, i.e. all stock came from receiving and was sold)
+- 259 products with sell_through > 100%
+- 288 products with negative sell_through
+
+**Impact:** Sell-through rate is unreliable for products with significant receiving activity in the same period. The formula tries to approximate "beginning inventory" but the approximation breaks when current stock ≠ actual beginning stock (which is always, per issue C2).
+
+**Status:** Will improve once C2 fix (historical stock) is deployed and snapshots are rebuilt, since `current_stock` in the formula will then reflect actual beginning inventory.
+
+---
+
+## LOW / INFORMATIONAL Issues
+
+### L1. Snapshots only cover ~1,167 products/day out of 681K
+
+Only products with order or receiving activity on a given day get snapshots. This is by design (the `ProductsWithActivity` CTE on line 133 of `update_daily_snapshots.sql`), but it means:
+- 560K+ products have zero snapshot history
+- Stockout tracking is impossible for products with no sales (they can't appear in snapshots)
+- The "avg_stock" metrics (avg_stock_units_30d, etc.) only average over activity days, not all 30 days
+
+This is acceptable for storage efficiency but should be understood when interpreting metrics.
+
+**Status:** Accepted (by design).
+
+---
+
+### L2. `detect_seasonal_pattern` function only compares current month to yearly average
+
+The seasonality detection is simplistic: it compares current month's avg daily sales to yearly avg. This means:
+- It can only detect if the CURRENT month is above average, not identify historical seasonal patterns
+- Running in January vs July will give completely different results for the same product
+- The "peak_season" field always shows the current month/quarter when seasonal (not the actual peak)
+
+This is noted as a P5 (low priority) feature and is adequate for a first pass but should not be relied upon for demand planning.
+
+**Status: FIXED.** Rewrote `detect_seasonal_pattern` function to compare monthly average sales across the full last 12 months. Uses CV across months + peak-to-average ratio for classification: `strong` (CV > 0.5, peak > 150%), `moderate` (CV > 0.3, peak > 120%), `none`. Peak season now identifies the actual highest-sales month. Requires at least 3 months of data. Saved in `db/functions.sql`.
+
+---
+
+### L3. Free product with negative revenue in top sellers
+
+Product 476848 ("Thank You, From ACOT!") shows 254 sales with -$1.00 revenue because one order applied a $1 discount to a $0 product. This is a data oddity, not a calculation bug. Could be addressed by excluding $0-price products from revenue metrics or by data cleanup.
+
+**Status:** Accepted (data oddity, not a bug).
+
+---
+
+### L4. `landing_cost_price` is always NULL
+
+`current_landing_cost_price` in product_metrics is mapped from `current_effective_cost` which is just `cost_price`. The `landing_cost_price` concept (cost + shipping + duties) is not implemented. The field exists but has no meaningful data.
+
+**Status: FIXED.** Removed `landing_cost_price` from `db/schema.sql`, `current_landing_cost_price` from `db/metrics-schema-new.sql`, `update_product_metrics.sql`, and `backfill/populate_initial_product_metrics.sql`. Column should be dropped from the live database via `ALTER TABLE`.
+
+---
+
+### L5. Custom SQL functions not tracked in version control
+
+All 6 custom functions (`calculate_sales_velocity`, `get_weighted_avg_cost`, `safe_divide`, `std_numeric`, `classify_demand_pattern`, `detect_seasonal_pattern`) and the `category_hierarchy` materialized view exist only in the database. They are not defined in any migration or schema file in the repository.
+
+If the database needs to be recreated, these would be lost.
+
+**Status: FIXED.** All 6 functions and the `category_hierarchy` materialized view definition saved to `inventory-server/db/functions.sql`. File is re-runnable via `psql -f functions.sql`.
+
+---
+
+### L6. `get_weighted_avg_cost` limited to last 10 receivings
+
+The function `LIMIT 10` for performance, but this means products with many small receivings may not accurately reflect the true weighted average cost if the cost has changed significantly beyond the last 10 receiving records.
+
+**Status: FIXED.** Removed `LIMIT 10` from `get_weighted_avg_cost`. Data shows max receivings per product is 142 (p95 = 11, avg = 3), so performance impact is negligible. Updated definition in `db/functions.sql`.
+
+---
+
+## Verification Summary
+
+### What's Working Correctly
+
+| Check | Result |
+|-------|--------|
+| 30d sales: product_metrics vs orders vs snapshots | **MATCH** (verified top 10 sellers) |
+| Replenishment formula: manual calc vs stored | **MATCH** (verified 10 products) |
+| Brand metrics vs sum of product_metrics | **MATCH** (0 difference across all brands) |
+| Order status mapping (numeric → text) | **CORRECT** (all statuses mapped, no numeric remain) |
+| Cost price: PostgreSQL vs MySQL source | **MATCH** (within rounding, verified 5 products) |
+| total_sold: PostgreSQL vs MySQL source | **MATCH** (verified 5 products) |
+| Category rollups (rolled-up > direct for parents) | **CORRECT** |
+| ABC classification distribution | **REASONABLE** (A: 8K, B: 12.5K, C: 113K) |
+| Lead time calculation (PO → receiving) | **CORRECT** (verified examples) |
+
+### Data Overview
+
+| Metric | Value |
+|--------|-------|
+| Total products | 681,912 |
+| Products in product_metrics | 681,912 (100%) |
+| Products with 30d sales | 10,291 (1.5%) |
+| Products with negative profit & revenue | 139 (mostly cost > price) |
+| Products with negative stock | 0 |
+| Snapshot date range | 2020-06-18 to 2026-02-08 |
+| Avg products per snapshot day | 1,167 |
+| Order date range | 2020-06-18 to 2026-02-08 |
+| Total orders | 2,885,825 |
+| 'returned' status orders | 0 (returns via negative quantity only) |
+
+---
+
+## Fix Status Summary
+
+| Issue | Severity | Status | Deployment Action Needed |
+|-------|----------|--------|--------------------------|
+| C1 | Critical | Owner resolving | Rebuild daily snapshots |
+| C2 | Critical | **FIXED** | Run import, rebuild daily snapshots |
+| C3 | Critical | **FIXED** | Deploy updated `update_product_metrics.sql` |
+| M1 | Medium | **FIXED** | Deploy updated `update_product_metrics.sql` |
+| M2 | Medium | **FIXED** | Full orders re-import (`--full`) |
+| M3 | Medium | Accepted | None |
+| M4 | Medium | Pending C2 | Will improve after C2 deployment |
+| L1 | Low | Accepted | None |
+| L2 | Low | **FIXED** | Deploy `db/functions.sql` to database |
+| L3 | Low | Accepted | None |
+| L4 | Low | **FIXED** | `ALTER TABLE` to drop columns |
+| L5 | Low | **FIXED** | None (file committed) |
+| L6 | Low | **FIXED** | Deploy `db/functions.sql` to database |
+
+### Deployment Steps
+
+1. Deploy `db/functions.sql` to PostgreSQL: `psql -d inventory_db -f db/functions.sql` (L2, L6)
+2. Run import (includes stock snapshots first load) (C2, M2)
+3. Drop stale columns: `ALTER TABLE products DROP COLUMN IF EXISTS landing_cost_price; ALTER TABLE product_metrics DROP COLUMN IF EXISTS current_landing_cost_price;` (L4)
+4. Rebuild daily snapshots (C1, C2)
+5. Re-run metrics calculation (C3, M1 take effect automatically)