Filling gaps in dates and variables within groups - PostgreSQL

Time: 2018-11-12 03:17:44

Tags: sql postgresql gaps-and-islands

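I have a price table with gaps across two main variables: the date (sales_date) and the sales channel (channel). I need to fill those gaps for every possible combination of SKU (ean) and customer (id_client).

So far I have been able to fill in the dates and the channels, but in some cases the same date has prices for more than one channel, and in those "weird" cases my approach duplicates everything.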

Tables

create table prices_master (
   id_price serial primary key,
   sales_date date,
   ean varchar(15),
   id_client int,
   channel varchar(15),
  price float
);

create table channels_master (
   id_channel serial primary key, 
   channel varchar(15)
);

insert into prices_master (sales_date, ean, id_client, channel, price) 
values
('2015-07-01', '7506205801143', 7, 'COMERCIAL',47655),  
('2015-08-01', '7506205801143', 7, 'COMERCIAL',51655),
('2015-12-01', '7506205801143', 7, 'COMERCIAL', 55667),
('2015-12-01', '7506205801143', 7, 'DISTRIBUIDOR', 35667),
('2015-07-01', '5052197008555', 7, 'DISTRIBUIDOR', 7224),
('2015-10-01', '5052197008555', 7, 'DISTRIBUIDOR', 8224);

insert into channels_master (channel) values 
('DISTRIBUIDOR'), ('INSTITUCIONAL'), ('NON_TRADE'), ('COMERCIAL');

My approach

WITH full_dates AS (
    WITH min_max AS (
      SELECT min(prm.sales_date) AS min_date, ((max(prm.sales_date))) :: date AS max_date
      FROM prices_master prm
)
  SELECT generate_series((min_max.min_date) :: timestamp with time zone,
                       (min_max.max_date) :: timestamp with time zone, '1 mon' :: interval) AS sales_date
  FROM min_max), 
completechannels AS (
  SELECT DISTINCT channel
  FROM channels_master
 ), 
temp AS (
  SELECT prices_master.sales_date,
         prices_master.id_client,
         prices_master.ean,
         prices_master.channel,
         prices_master.price,
         lead(
           prices_master.sales_date) OVER (PARTITION BY prices_master.id_client, prices_master.ean, prices_master.channel ORDER BY prices_master.sales_date) AS next_sales_date
  FROM prices_master
  ORDER BY prices_master.id_client, prices_master.ean, prices_master.channel, prices_master.sales_date
 )
SELECT (full_dates.sales_date) :: date AS sales_date,
     temp.id_client,
     temp.ean,
     completechannels.channel,
     price
FROM full_dates
     JOIN temp ON full_dates.sales_date >= temp.sales_date AND 
     (full_dates.sales_date < temp.next_sales_date OR temp.next_sales_date IS NULL)
     JOIN completechannels ON 1=1
     ORDER BY temp.id_client, temp.ean, completechannels.channel, 
     full_dates.sales_date;

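My problem shows up for ean 7506205801143 on sales_date 2015-12-01. Because this code has a price for both the DISTRIBUIDOR and the COMERCIAL channel on that date, my approach duplicates the rows:

Result of my approach (bad)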

+------------+-----------+---------------+---------------+-------+
| sales_date | id_client |      ean      |    channel    | price |
+------------+-----------+---------------+---------------+-------+
| 2015-12-01 |         7 | 7506205801143 | COMERCIAL     | 55667 |
| 2015-12-01 |         7 | 7506205801143 | COMERCIAL     | 35667 |
| 2015-12-01 |         7 | 7506205801143 | DISTRIBUIDOR  | 55667 |
| 2015-12-01 |         7 | 7506205801143 | DISTRIBUIDOR  | 35667 |
| 2015-12-01 |         7 | 7506205801143 | INSTITUCIONAL | 35667 |
| 2015-12-01 |         7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 |         7 | 7506205801143 | NON_TRADE     | 55667 |
| 2015-12-01 |         7 | 7506205801143 | NON_TRADE     | 35667 |
+------------+-----------+---------------+---------------+-------+
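The eight rows above are what a cross join produces when two source rows share a date: each of the 2 price rows is paired with each of the 4 channels. The sketch below reproduces only that multiplication. It is an illustration under assumptions, not the original query: it runs the SQL through Python's built-in sqlite3 module so it is self-contained (in-memory SQLite stands in for PostgreSQL), and the CTE names `price_rows` and `completechannels` are placeholders for the `temp` and `completechannels` CTEs of the question.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the statement uses only
# VALUES, CTEs and CROSS JOIN, which PostgreSQL accepts as well.
conn = sqlite3.connect(":memory:")

rows = conn.execute(
    """
    WITH price_rows(sales_date, id_client, ean, channel, price) AS (
        -- the two rows from the question that share 2015-12-01
        VALUES ('2015-12-01', 7, '7506205801143', 'COMERCIAL',    55667),
               ('2015-12-01', 7, '7506205801143', 'DISTRIBUIDOR', 35667)
    ),
    completechannels(channel) AS (
        VALUES ('DISTRIBUIDOR'), ('INSTITUCIONAL'), ('NON_TRADE'), ('COMERCIAL')
    )
    SELECT completechannels.channel, price_rows.price
    FROM price_rows
    CROSS JOIN completechannels      -- same effect as JOIN ... ON 1=1
    ORDER BY completechannels.channel, price_rows.price DESC
    """
).fetchall()

# 2 price rows x 4 channels: every channel appears once per price.
for channel, price in rows:
    print(channel, price)
```

Because the join condition never looks at `price_rows.channel`, the channel a price was recorded for is lost, which is why COMERCIAL also shows the DISTRIBUIDOR price and vice versa.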

Expected result (good)

+------------+-----------+---------------+---------------+-------+
| sales_date | id_client |      ean      |    channel    | price |
+------------+-----------+---------------+---------------+-------+
| 2015-12-01 |         7 | 7506205801143 | COMERCIAL     | 55667 |
| 2015-12-01 |         7 | 7506205801143 | DISTRIBUIDOR  | 35667 |
| 2015-12-01 |         7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 |         7 | 7506205801143 | NON_TRADE     | 55667 |
+------------+-----------+---------------+---------------+-------+

For INSTITUCIONAL and NON_TRADE, which have no price of their own, the highest available price is used to fill the gap.

2 answers:

Answer 0 (score: 1)

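You can try using the ROW_NUMBER window function in a subquery, ordered by sales_date DESC, to get the most recent row for each channel.

Then use coalesce together with the MAX window function to fill in the missing values.

Query 1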

WITH pricesCTE as (
   SELECT price,sales_date,id_client,ean,cm.channel,ROW_NUMBER() OVER(PARTITION BY cm.channel ORDER BY sales_date DESC) rn
   FROM (SELECT DISTINCT channel FROM channels_master) cm 
   LEFT JOIN prices_master pm on pm.channel = cm.channel
)
SELECT 
      coalesce(sales_date,MAX(sales_date) OVER(ORDER BY coalesce(price,0) DESC)) sales_date,
      coalesce(id_client,MAX(id_client) OVER(ORDER BY coalesce(price,0) DESC)) id_client,
      coalesce(ean,MAX(ean) OVER(ORDER BY coalesce(price,0) DESC)) ean,
      channel,
      coalesce(price,MAX(price) OVER(ORDER BY coalesce(price,0) DESC)) price
FROM 
(
  select *
  from pricesCTE 
  where rn = 1
) t1

Results

| sales_date | id_client |           ean |       channel | price |
|------------|-----------|---------------|---------------|-------|
| 2015-12-01 |         7 | 7506205801143 |     COMERCIAL | 55667 |
| 2015-12-01 |         7 | 7506205801143 |  DISTRIBUIDOR | 35667 |
| 2015-12-01 |         7 | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 |         7 | 7506205801143 |     NON_TRADE | 55667 |
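The first step of this answer, picking the most recent row per channel with ROW_NUMBER, can be tried on its own. The following is a minimal sketch under assumptions: it loads the sample data from the question into in-memory SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or later), stores dates as ISO-8601 text so they sort correctly as strings, and uses the hypothetical alias `ranked` for the subquery. In PostgreSQL the columns would keep the types declared in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prices_master (sales_date TEXT, ean TEXT, id_client INT,
                                channel TEXT, price REAL);
    CREATE TABLE channels_master (channel TEXT);

    INSERT INTO prices_master VALUES
      ('2015-07-01', '7506205801143', 7, 'COMERCIAL',    47655),
      ('2015-08-01', '7506205801143', 7, 'COMERCIAL',    51655),
      ('2015-12-01', '7506205801143', 7, 'COMERCIAL',    55667),
      ('2015-12-01', '7506205801143', 7, 'DISTRIBUIDOR', 35667),
      ('2015-07-01', '5052197008555', 7, 'DISTRIBUIDOR', 7224),
      ('2015-10-01', '5052197008555', 7, 'DISTRIBUIDOR', 8224);

    INSERT INTO channels_master VALUES
      ('DISTRIBUIDOR'), ('INSTITUCIONAL'), ('NON_TRADE'), ('COMERCIAL');
    """
)

latest = conn.execute(
    """
    SELECT channel, sales_date, price
    FROM (
        SELECT cm.channel, pm.sales_date, pm.price,
               -- rn = 1 marks the newest row inside each channel
               ROW_NUMBER() OVER (PARTITION BY cm.channel
                                  ORDER BY pm.sales_date DESC) AS rn
        FROM channels_master cm
        LEFT JOIN prices_master pm ON pm.channel = cm.channel
    ) AS ranked
    WHERE rn = 1
    ORDER BY channel
    """
).fetchall()

# Channels without any price come back with NULLs (None in Python);
# those are the values the COALESCE/MAX step then fills in.
for row in latest:
    print(row)
```

Note that the partition is by channel only, so this yields one row per channel for the whole table. With several eans or several dates to report, the PARTITION BY clause would presumably have to include those columns as well.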

Answer 1 (score: 1)

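You can do this fairly easily by treating the prices in the master table as overrides. That is, you want to build a base table that holds only the (maximum) price per date / client / ean tuple, and ignore the channel until later.

First, add the following CTE to the ones you already have (formatting and naming changed to my usual style):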
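The base-plus-overrides idea can be shown in miniature before building it step by step. This is a minimal sketch restricted to the two 2015-12-01 rows, not the full query of this answer: it runs through Python's sqlite3 module against in-memory SQLite so it is self-contained, and the CTE names `overrides`, `channels` and `base` are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

filled = conn.execute(
    """
    WITH overrides(sales_date, id_client, ean, channel, price) AS (
        -- prices that exist for a specific channel
        VALUES ('2015-12-01', 7, '7506205801143', 'COMERCIAL',    55667),
               ('2015-12-01', 7, '7506205801143', 'DISTRIBUIDOR', 35667)
    ),
    channels(channel) AS (
        VALUES ('DISTRIBUIDOR'), ('INSTITUCIONAL'), ('NON_TRADE'), ('COMERCIAL')
    ),
    base AS (
        -- one row per date / client / ean, carrying the highest price
        SELECT sales_date, id_client, ean, MAX(price) AS price
        FROM overrides
        GROUP BY sales_date, id_client, ean
    )
    SELECT base.sales_date, base.id_client, base.ean, channels.channel,
           -- channel-specific price if there is one, else the base price
           COALESCE(overrides.price, base.price) AS price
    FROM base
    CROSS JOIN channels
    LEFT JOIN overrides
           ON overrides.channel    = channels.channel
          AND overrides.sales_date = base.sales_date
          AND overrides.id_client  = base.id_client
          AND overrides.ean        = base.ean
    ORDER BY channels.channel
    """
).fetchall()

for row in filled:
    print(row)
```

Since `base` has exactly one row per date / client / ean, the cross join yields one row per channel, and the LEFT JOIN can match at most one override for each of them, so no duplicates appear.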

Maximum_Price_Per_Date AS (
    SELECT Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean, 
           MAX(Price_Date_Range.price) AS price
    FROM Date_Range
    JOIN Price_Date_Range -- aka TEMP in your original query
      ON Price_Date_Range.sales_date <= Date_Range.sales_date
          AND (Price_Date_Range.next_sales_date > Date_Range.sales_date OR Price_Date_Range.next_sales_date IS NULL)
    GROUP BY Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean
)

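This makes the set multiplication of the Cartesian product (JOIN completechannels ON 1=1, although that is normally written as CROSS JOIN) work in your favor: there will no longer be any extra rows: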

SELECT Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean,
       Channel.channel, 
       Maximum_Price_Per_Date.price
FROM Maximum_Price_Per_Date
CROSS JOIN (SELECT DISTINCT channel
            FROM Channels_Master) Channel

Which produces (uninteresting rows omitted):

| sales_date | id_client | ean           | channel       | price |
|------------|-----------|---------------|---------------|-------|
| 2015-12-01 | 7         | 7506205801143 | DISTRIBUIDOR  | 55667 |
| 2015-12-01 | 7         | 7506205801143 | COMERCIAL     | 55667 |
| 2015-12-01 | 7         | 7506205801143 | NON_TRADE     | 55667 |
| 2015-12-01 | 7         | 7506205801143 | INSTITUCIONAL | 55667 |

Now we simply LEFT JOIN back to the Price_Date_Range CTE (again) and use the price from there if one exists:

-- Note that you should have a Calendar table, which would remove this.
WITH Date_Range AS (
    -- You probably should be using an explicit range here, to account for future dates.
    WITH Min_Max AS (
        SELECT MIN(sales_date) AS min_date, MAX(sales_date) AS max_date
        FROM Prices_Master
    ),
    Timezone_Range AS (
        SELECT GENERATE_SERIES(min_date, max_date, CAST('1 mon' AS INTERVAL)) AS sales_date
        FROM Min_Max
    )
    SELECT CAST(sales_date AS DATE) AS sales_date
    FROM Timezone_Range
),
-- This would really benefit by being a MQT - materialized query table
Price_Date_Range AS (
    SELECT sales_date, lead(sales_date) OVER (PARTITION BY id_client, ean, channel ORDER BY sales_date) AS next_sales_date,
           id_client, ean, channel, price
    FROM Prices_Master
), 
Maximum_Price_Per_Date AS (
    SELECT Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean, 
           MAX(Price_Date_Range.price) AS price
    FROM Date_Range
    JOIN Price_Date_Range
      ON Price_Date_Range.sales_date <= Date_Range.sales_date
          AND (Price_Date_Range.next_sales_date > Date_Range.sales_date OR Price_Date_Range.next_sales_date IS NULL)
    GROUP BY Date_Range.sales_date, Price_Date_Range.id_client, Price_Date_Range.ean
)
SELECT Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean,
       Channel.channel, 
       COALESCE(Price_Date_Range.price, Maximum_Price_Per_Date.price) AS price
FROM Maximum_Price_Per_Date
CROSS JOIN (SELECT DISTINCT channel
            FROM Channels_Master) Channel
LEFT JOIN Price_Date_Range
       ON Price_Date_Range.channel = Channel.channel
          AND Price_Date_Range.id_client = Maximum_Price_Per_Date.id_client
          AND Price_Date_Range.ean = Maximum_Price_Per_Date.ean
          AND Price_Date_Range.sales_date <= Maximum_Price_Per_Date.sales_date
          AND (Price_Date_Range.next_sales_date > Maximum_Price_Per_Date.sales_date OR Price_Date_Range.next_sales_date IS NULL)
ORDER BY Maximum_Price_Per_Date.sales_date, Maximum_Price_Per_Date.id_client, Maximum_Price_Per_Date.ean, Channel.channel

Fiddle example (thanks to @D-Shih for the setup)

Which produces (uninteresting rows omitted):

| sales_date | id_client | ean           | channel       | price |
|------------|-----------|---------------|---------------|-------|
| 2015-12-01 | 7         | 7506205801143 | COMERCIAL     | 55667 |
| 2015-12-01 | 7         | 7506205801143 | DISTRIBUIDOR  | 35667 |
| 2015-12-01 | 7         | 7506205801143 | INSTITUCIONAL | 55667 |
| 2015-12-01 | 7         | 7506205801143 | NON_TRADE     | 55667 |