Files
inventory/docs/import-from-prod-data-mapping.md
2025-03-25 19:12:41 -04:00

342 lines
21 KiB
Markdown

# MySQL to PostgreSQL Import Process Documentation
This document outlines the data import process from the production MySQL database to the local PostgreSQL database, focusing on column mappings, data transformations, and the overall import architecture.
## Table of Contents
1. [Overview](#overview)
2. [Import Architecture](#import-architecture)
3. [Column Mappings](#column-mappings)
- [Categories](#categories)
- [Products](#products)
- [Product Categories (Relationship)](#product-categories-relationship)
- [Orders](#orders)
- [Purchase Orders](#purchase-orders)
- [Metadata Tables](#metadata-tables)
4. [Special Calculations](#special-calculations)
5. [Implementation Notes](#implementation-notes)
## Overview
The import process extracts data from a MySQL 5.7 production database and imports it into a PostgreSQL database. It can operate in two modes:
- **Full Import**: Imports all data regardless of last sync time
- **Incremental Import**: Only imports data that has changed since the last import
The process handles four main data types:
- Categories (product categorization hierarchy)
- Products (inventory items)
- Orders (sales records)
- Purchase Orders (vendor orders)
## Import Architecture
The import process follows these steps:
1. **Establish Connection**: Creates a SSH tunnel to the production server and establishes database connections
2. **Setup Import History**: Creates a record of the current import operation
3. **Import Categories**: Processes product categories in hierarchical order
4. **Import Products**: Processes products with their attributes and category relationships
5. **Import Orders**: Processes customer orders with line items, taxes, and discounts
6. **Import Purchase Orders**: Processes vendor purchase orders with line items
7. **Record Results**: Updates the import history with results
8. **Close Connections**: Cleans up connections and resources
Each import step uses temporary tables for processing and wraps operations in transactions to ensure data consistency.
## Column Mappings
### Categories
| PostgreSQL Column | MySQL Source | Transformation |
|-------------------|---------------------------------|----------------------------------------------|
| cat_id | product_categories.cat_id | Direct mapping |
| name | product_categories.name | Direct mapping |
| type | product_categories.type | Direct mapping |
| parent_id | product_categories.master_cat_id| NULL for top-level categories (types 10, 20) |
| description | product_categories.combined_name| Direct mapping |
| status | N/A | Hard-coded 'active' |
| created_at | N/A | Current timestamp |
| updated_at | N/A | Current timestamp |
**Notes:**
- Categories are processed in hierarchical order by type: [10, 20, 11, 21, 12, 13]
- Type 10/20 are top-level categories with no parent
- Types 11/21/12/13 are child categories that reference parent categories
### Products
| PostgreSQL Column | MySQL Source | Transformation |
|----------------------|----------------------------------|---------------------------------------------------------------|
| pid | products.pid | Direct mapping |
| title | products.description | Direct mapping |
| description | products.notes | Direct mapping |
| sku | products.itemnumber | Fallback to 'NO-SKU' if empty |
| stock_quantity | shop_inventory.available_local | Capped at 5000, minimum 0 |
| preorder_count | current_inventory.onpreorder | Default 0 |
| notions_inv_count | product_notions_b2b.inventory | Default 0 |
| price | product_current_prices.price_each| Default 0, filtered on active=1 |
| regular_price | products.sellingprice | Default 0 |
| cost_price | product_inventory | Weighted average: SUM(costeach * count) / SUM(count) when count > 0, or latest costeach |
| vendor | suppliers.companyname | Via supplier_item_data.supplier_id |
| vendor_reference | supplier_item_data | supplier_itemnumber or notions_itemnumber based on vendor |
| notions_reference | supplier_item_data.notions_itemnumber | Direct mapping |
| brand | product_categories.name | Linked via products.company |
| line | product_categories.name | Linked via products.line |
| subline | product_categories.name | Linked via products.subline |
| artist | product_categories.name | Linked via products.artist |
| categories | product_category_index | Comma-separated list of category IDs |
| created_at | products.date_created | Validated date, NULL if invalid |
| first_received | products.datein | Validated date, NULL if invalid |
| landing_cost_price | NULL | Not set |
| barcode | products.upc | Direct mapping |
| harmonized_tariff_code| products.harmonized_tariff_code | Direct mapping |
| updated_at | products.stamp | Validated date, NULL if invalid |
| visible | shop_inventory | Calculated from show + buyable > 0 |
| managing_stock | N/A | Hard-coded true |
| replenishable | Multiple fields | Complex calculation based on reorder, dates, etc. |
| permalink | N/A | Constructed URL with product ID |
| moq | supplier_item_data | notions_qty_per_unit or supplier_qty_per_unit, minimum 1 |
| uom | N/A | Hard-coded 1 |
| rating | products.rating | Direct mapping |
| reviews | products.rating_votes | Direct mapping |
| weight | products.weight | Direct mapping |
| length | products.length | Direct mapping |
| width | products.width | Direct mapping |
| height | products.height | Direct mapping |
| country_of_origin | products.country_of_origin | Direct mapping |
| location | products.location | Direct mapping |
| total_sold | order_items | SUM(qty_ordered) for all order_items where prod_pid = pid |
| baskets | mybasket | COUNT of records where mb.item = pid and qty > 0 |
| notifies | product_notify | COUNT of records where pn.pid = pid |
| date_last_sold | product_last_sold.date_sold | Validated date, NULL if invalid |
| image | N/A | Constructed from pid and image URL pattern |
| image_175 | N/A | Constructed from pid and image URL pattern |
| image_full | N/A | Constructed from pid and image URL pattern |
| options | NULL | Not set |
| tags | NULL | Not set |
**Notes:**
- Replenishable calculation:
```javascript
CASE
WHEN p.reorder < 0 THEN 0
WHEN (
(COALESCE(pls.date_sold, '0000-00-00') = '0000-00-00' OR pls.date_sold <= DATE_SUB(CURRENT_DATE, INTERVAL 5 YEAR))
AND (p.datein = '0000-00-00 00:00:00' OR p.datein <= DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 5 YEAR))
AND (p.date_refill = '0000-00-00 00:00:00' OR p.date_refill <= DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 5 YEAR))
) THEN 0
ELSE 1
END
```
In business terms, a product is considered NOT replenishable only if:
- It was manually flagged as not replenishable (negative reorder value)
- OR it shows no activity across ALL metrics (no sales AND no receipts AND no refills in the past 5 years)
- Image URLs are constructed using this pattern:
```javascript
const paddedPid = pid.toString().padStart(6, '0');
const prefix = paddedPid.slice(0, 3);
const basePath = `${imageUrlBase}${prefix}/${pid}`;
return {
image: `${basePath}-t-${iid}.jpg`,
image_175: `${basePath}-175x175-${iid}.jpg`,
image_full: `${basePath}-o-${iid}.jpg`
};
```
### Product Categories (Relationship)
| PostgreSQL Column | MySQL Source | Transformation |
|-------------------|-----------------------------------|---------------------------------------------------------------|
| pid | products.pid | Direct mapping |
| cat_id | product_category_index.cat_id | Direct mapping, filtered by category types |
**Notes:**
- Only categories of types 10, 20, 11, 21, 12, 13 are imported
- Categories 16 and 17 are explicitly excluded
### Orders
| PostgreSQL Column | MySQL Source | Transformation |
|-------------------|-----------------------------------|---------------------------------------------------------------|
| order_number | order_items.order_id | Direct mapping |
| pid | order_items.prod_pid | Direct mapping |
| sku | order_items.prod_itemnumber | Fallback to 'NO-SKU' if empty |
| date | _order.date_placed_onlydate | Via join to _order table |
| price | order_items.prod_price | Direct mapping |
| quantity | order_items.qty_ordered | Direct mapping |
| discount | Multiple sources | Complex calculation (see notes) |
| tax | order_tax_info_products.item_taxes_to_collect | Via latest order_tax_info record |
| tax_included | N/A | Hard-coded false |
| shipping | N/A | Hard-coded 0 |
| customer | _order.order_cid | Direct mapping |
| customer_name | users | CONCAT(users.firstname, ' ', users.lastname) |
| status | _order.order_status | Direct mapping |
| canceled | _order.date_cancelled | Boolean: true if date_cancelled is not '0000-00-00 00:00:00' |
| costeach | order_costs | From latest record or fallback to price * 0.5 |
**Notes:**
- Only orders with order_status >= 15 and with a valid date_placed are processed
- For incremental imports, only orders modified since last sync are processed
- Discount calculation combines three sources:
1. Base discount: order_items.prod_price_reg - order_items.prod_price
2. Promo discount: SUM of order_discount_items.amount
3. Proportional order discount: Calculation based on order subtotal proportion
```javascript
(oi.base_discount +
COALESCE(ot.promo_discount, 0) +
CASE
WHEN om.summary_discount > 0 AND om.summary_subtotal > 0 THEN
ROUND((om.summary_discount * (oi.price * oi.quantity)) / NULLIF(om.summary_subtotal, 0), 2)
ELSE 0
END)::DECIMAL(10,2)
```
- Taxes are taken from the latest tax record for an order
- Cost data is taken from the latest non-pending cost record
### Purchase Orders
| PostgreSQL Column | MySQL Source | Transformation |
|-------------------|-----------------------------------|---------------------------------------------------------------|
| po_id | po.po_id | Default 0 if NULL |
| pid | po_products.pid | Direct mapping |
| sku | products.itemnumber | Fallback to 'NO-SKU' if empty |
| name | products.description | Fallback to 'Unknown Product' |
| cost_price | po_products.cost_each | Direct mapping |
| po_cost_price | po_products.cost_each | Duplicate of cost_price |
| vendor | suppliers.companyname | Fallback to 'Unknown Vendor' if empty |
| date | po.date_ordered | Fallback to po.date_created if NULL |
| expected_date | po.date_estin | Direct mapping |
| status | po.status | Default 1 if NULL |
| notes | po.short_note | Fallback to po.notes if NULL |
| ordered | po_products.qty_each | Direct mapping |
| received | N/A | Hard-coded 0 |
| receiving_status | N/A | Hard-coded 1 |
**Notes:**
- Only POs created within last 1 year (incremental) or 5 years (full) are processed
- For incremental imports, only POs modified since last sync are processed
### Metadata Tables
#### import_history
| PostgreSQL Column | Source | Notes |
|-------------------|-----------------------------------|---------------------------------------------------------------|
| id | Auto-increment | Primary key |
| table_name | Code | 'all_tables' for overall import |
| start_time | NOW() | Import start time |
| end_time | NOW() | Import completion time |
| duration_seconds | Calculation | Elapsed seconds |
| is_incremental | INCREMENTAL_UPDATE | Flag from config |
| records_added | Calculation | Sum from all imports |
| records_updated | Calculation | Sum from all imports |
| status | Code | 'running', 'completed', 'failed', or 'cancelled' |
| error_message | Exception | Error message if failed |
| additional_info | JSON | Configuration and results |
#### sync_status
| PostgreSQL Column | Source | Notes |
|----------------------|--------------------------------|---------------------------------------------------------------|
| table_name | Code | Name of imported table |
| last_sync_timestamp | NOW() | Timestamp of successful sync |
| last_sync_id | NULL | Not used currently |
## Special Calculations
### Date Validation
MySQL dates are validated before insertion into PostgreSQL:
```javascript
function validateDate(mysqlDate) {
if (!mysqlDate || mysqlDate === '0000-00-00' || mysqlDate === '0000-00-00 00:00:00') {
return null;
}
// Check if the date is valid
const date = new Date(mysqlDate);
return isNaN(date.getTime()) ? null : mysqlDate;
}
```
### Retry Mechanism
Operations that might fail temporarily are retried with exponential backoff:
```javascript
async function withRetry(operation, errorMessage) {
let lastError;
for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error;
console.error(`${errorMessage} (Attempt ${attempt}/${MAX_RETRIES}):`, error);
if (attempt < MAX_RETRIES) {
const backoffTime = RETRY_DELAY * Math.pow(2, attempt - 1);
await new Promise(resolve => setTimeout(resolve, backoffTime));
}
}
}
throw lastError;
}
```
### Progress Tracking
Progress is tracked with estimated time remaining:
```javascript
function estimateRemaining(startTime, current, total) {
if (current === 0) return "Calculating...";
const elapsedSeconds = (Date.now() - startTime) / 1000;
const itemsPerSecond = current / elapsedSeconds;
const remainingItems = total - current;
const remainingSeconds = remainingItems / itemsPerSecond;
return formatElapsedTime(remainingSeconds);
}
```
## Implementation Notes
### Transaction Management
All imports use transactions to ensure data consistency:
- **Categories**: Uses savepoints for each category type
- **Products**: Uses a single transaction for the entire import
- **Orders**: Uses a single transaction with temporary tables
- **Purchase Orders**: Uses a single transaction with temporary tables
### Memory Usage Optimization
To minimize memory usage when processing large datasets:
1. Data is processed in batches (100-5000 records per batch)
2. Temporary tables are used for intermediate data
3. Some queries use cursors to avoid loading all results at once
### MySQL vs PostgreSQL Compatibility
The scripts handle differences between MySQL and PostgreSQL:
1. MySQL-specific syntax like `USE INDEX` is removed for PostgreSQL
2. `GROUP_CONCAT` in MySQL becomes string operations in PostgreSQL
3. Transaction syntax differences are abstracted in the connection wrapper
4. PostgreSQL's `ON CONFLICT` replaces MySQL's `ON DUPLICATE KEY UPDATE`
### SSH Tunnel
Database connections go through an SSH tunnel for security:
```javascript
ssh.forwardOut(
"127.0.0.1",
0,
sshConfig.prodDbConfig.host,
sshConfig.prodDbConfig.port,
async (err, stream) => {
if (err) reject(err);
resolve({ ssh, stream });
}
);
```