SQL: Filling in missing records with a condition

Date: 2019-10-16 00:19:48

Tags: sql google-bigquery

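I need to count the number of products that exist in inventory on each date. However, the database only records a product on days when a consumer views it.

For example, consider this basic table structure: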

date      |  productId |   views
July 1    |  A         |   8
July 2    |  A         |   6
July 2    |  B         |   4
July 3    |  A         |   2
July 4    |  A         |   8
July 4    |  B         |   6
July 4    |  C         |   4
July 5    |  C         |   2
July 10   |  A         |   17

With the following query, I tried to determine the number of products in inventory on a given date.

select date, count(distinct productId) as Inventory, sum(views) as views
from (
   select date, productId, count(*) as views
   from SomeTable
   group by date, productID
   order by date asc, productID asc
)
group by date

This is the output:

date      |  Inventory |   views
July 1    |  1         |   8
July 2    |  2         |   10
July 3    |  1         |   2
July 4    |  3         |   18
July 5    |  1         |   2
July 10   |  1         |   17

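Because of the missing rows, my output does not accurately reflect how many products are in inventory.

The correct understanding of the inventory is as follows:
- Product A is in inventory from July 1 to July 10.
- Product B is in inventory from July 2 to July 4.
- Product C is in inventory from July 4 to July 5.

The correct SQL output should be: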

date      |  Inventory |   views
July 1    |  1         |   8
July 2    |  2         |   10
July 3    |  2         |   2
July 4    |  3         |   18
July 5    |  2         |   2
July 6    |  1         |   0
July 7    |  1         |   0
July 8    |  1         |   0
July 9    |  1         |   0
July 10   |  1         |   17

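If you are still following, let me confirm that I am happy to define "in inventory" as the range of dates between a product's first view and its last view.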
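
As a minimal sketch of that definition, the Python below derives each product's first and last view date from the sample rows. The year 2019 is an assumption borrowed from the dates used in the answer's sample data, and `inventory_ranges` is a hypothetical helper name, not something from the original post.

```python
from datetime import date

# Sample rows from the question: (date, productId, views).
# The year 2019 is an assumption; the question only gives month and day.
ROWS = [
    (date(2019, 7, 1), "A", 8),
    (date(2019, 7, 2), "A", 6),
    (date(2019, 7, 2), "B", 4),
    (date(2019, 7, 3), "A", 2),
    (date(2019, 7, 4), "A", 8),
    (date(2019, 7, 4), "B", 6),
    (date(2019, 7, 4), "C", 4),
    (date(2019, 7, 5), "C", 2),
    (date(2019, 7, 10), "A", 17),
]


def inventory_ranges(rows):
    """Map each productId to (first view date, last view date)."""
    ranges = {}
    for day, product_id, _views in rows:
        first, last = ranges.get(product_id, (day, day))
        ranges[product_id] = (min(first, day), max(last, day))
    return ranges


for product_id, (first, last) in sorted(inventory_ranges(ROWS).items()):
    print(product_id, first.isoformat(), last.isoformat())
```

The three ranges it prints correspond to the bullet list above: A spans July 1 to July 10, B spans July 2 to July 4, and C spans July 4 to July 5.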

I followed this flawed process:

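First, I created a table that is the Cartesian product of every productId and every date:

with Dates as (
   select date
   from SomeTable
   group by date
),
Products as (
   select productId
   from SomeTable
   group by productId
)
select Dates.date, Products.productId
from Dates cross join Products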
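
One thing worth checking about this step: a date list built from the table itself can only contain dates on which something was viewed. The Python sketch below mimics the cross join on the nine sample rows only (the year 2019 is an assumption), so it may not match a larger real table where other products fill in the missing days.

```python
from itertools import product

# (date, productId) pairs from the question's sample; the year is assumed.
ROWS = [
    ("2019-07-01", "A"),
    ("2019-07-02", "A"),
    ("2019-07-02", "B"),
    ("2019-07-03", "A"),
    ("2019-07-04", "A"),
    ("2019-07-04", "B"),
    ("2019-07-04", "C"),
    ("2019-07-05", "C"),
    ("2019-07-10", "A"),
]

# Equivalent of the Dates and Products CTEs: distinct values from the table.
dates = sorted({day for day, _ in ROWS})
products = sorted({product_id for _, product_id in ROWS})

# Equivalent of "Dates cross join Products".
cartesian = list(product(dates, products))

print(len(dates), len(products), len(cartesian))
print(("2019-07-06", "A") in cartesian)
```

On the sample alone, July 6 through July 9 never enter the product, because no row carries those dates. Generating a continuous calendar between the minimum and maximum date avoids depending on what happens to be in the table.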

Then I tried a right outer join to reduce it to the missing records:

with Records as (
select date, productId, count(*) as views
from SomeTable
group by date, productId
),
Cartesian as (
{See query above}
)
Select Cartesian.date, Cartesian.productId, 0 as views #for upcoming union
from Cartesian right outer join Records 
on Cartesian.date = Records.date 
where Records.productId is null

Then, with the missing rows in hand, I unioned them back into Records. Doing so creates a new problem: extra rows.

date      |  productId |   views
July 1    |  A         |   8
July 1    |  B         |   0
July 1    |  C         |   0
July 2    |  A         |   6
July 2    |  B         |   4
July 2    |  C         |   0
July 3    |  A         |   2
July 3    |  B         |   0
July 3    |  C         |   0
July 4    |  A         |   8
July 4    |  B         |   6
July 4    |  C         |   4
July 5    |  A         |   0
July 5    |  B         |   0
July 5    |  C         |   2
July 6    |  A         |   0
July 6    |  B         |   0
July 6    |  C         |   0
July 7    |  A         |   0
July 7    |  B         |   0
July 7    |  C         |   0
July 8    |  A         |   0
July 8    |  B         |   0
July 8    |  C         |   0
July 9    |  A         |   0
July 9    |  B         |   0
July 9    |  C         |   0
July 10   |  A         |   17
July 10   |  B         |   0
July 10   |  C         |   0

When I run my simple query select date, count(distinct productId) as Inventory, sum(views) as views against that table, I again get the wrong output:

date      |  Inventory |   views
July 1    |  3         |   8
July 2    |  3         |   10
July 3    |  3         |   2
July 4    |  3         |   18
July 5    |  3         |   2
July 6    |  3         |   0
July 7    |  3         |   0
July 8    |  3         |   0
July 9    |  3         |   0
July 10   |  3         |   17

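My next idea is to go through each productId, determine its first and last date, and then join that with the Cartesian table on the condition that Cartesian.date falls between that product's first and last date.
There must be a simpler way to do this. Thanks.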
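
The idea described above can be sketched in plain Python to check that it gives the expected numbers before writing any SQL. This is only an illustration on the sample rows: the year 2019 is an assumption, and `daily_inventory` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, productId, views). Year assumed.
ROWS = [
    (date(2019, 7, 1), "A", 8),
    (date(2019, 7, 2), "A", 6),
    (date(2019, 7, 2), "B", 4),
    (date(2019, 7, 3), "A", 2),
    (date(2019, 7, 4), "A", 8),
    (date(2019, 7, 4), "B", 6),
    (date(2019, 7, 4), "C", 4),
    (date(2019, 7, 5), "C", 2),
    (date(2019, 7, 10), "A", 17),
]


def daily_inventory(rows):
    """Return (day, products in inventory, total views) for every calendar day."""
    ranges = {}       # productId -> (first view date, last view date)
    views_by_day = {}  # day -> total views recorded on that day
    for day, product_id, views in rows:
        first, last = ranges.get(product_id, (day, day))
        ranges[product_id] = (min(first, day), max(last, day))
        views_by_day[day] = views_by_day.get(day, 0) + views

    start = min(first for first, _ in ranges.values())
    end = max(last for _, last in ranges.values())

    result = []
    day = start
    while day <= end:
        # A product counts as "in inventory" when the day lies inside its range.
        in_inventory = sum(1 for first, last in ranges.values() if first <= day <= last)
        result.append((day, in_inventory, views_by_day.get(day, 0)))
        day += timedelta(days=1)
    return result


for day, inventory, views in daily_inventory(ROWS):
    print(day.isoformat(), inventory, views)
```

The key point is that the date condition is applied per product, so a day only counts products whose own first-to-last range covers it, rather than every product in the table.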

1 answer:

Answer 0 (score: 1)

The following is for BigQuery Standard SQL:

#standardSQL
WITH dates AS (
  SELECT day FROM (
    SELECT MIN(day) min_day, MAX(day) max_day
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
  SELECT productId, MIN(day) min_day, MAX(day) max_day
  FROM `project.dataset.table` t
  GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r 
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day 
GROUP BY day

Applied to the sample data from your question, it looks like the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2019-07-01' day, 'A' productId, 8 views UNION ALL
  SELECT '2019-07-02', 'A', 6 UNION ALL
  SELECT '2019-07-02', 'B', 4 UNION ALL
  SELECT '2019-07-03', 'A', 2 UNION ALL
  SELECT '2019-07-04', 'A', 8 UNION ALL
  SELECT '2019-07-04', 'B', 6 UNION ALL
  SELECT '2019-07-04', 'C', 4 UNION ALL
  SELECT '2019-07-05', 'C', 2 UNION ALL
  SELECT '2019-07-10', 'A', 17 
), dates AS (
  SELECT day FROM (
    SELECT MIN(day) min_day, MAX(day) max_day
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) day
), ranges AS (
  SELECT productId, MIN(day) min_day, MAX(day) max_day
  FROM `project.dataset.table` t
  GROUP BY productId
)
SELECT day, COUNT(DISTINCT productId) Inventory, SUM(IFNULL(views, 0)) views
FROM dates d, ranges r 
LEFT JOIN `project.dataset.table` USING(day, productId)
WHERE day BETWEEN min_day AND max_day 
GROUP BY day
-- ORDER BY day

The result is:

Row day         Inventory   views    
1   2019-07-01  1           8    
2   2019-07-02  2           10   
3   2019-07-03  2           2    
4   2019-07-04  3           18   
5   2019-07-05  2           2    
6   2019-07-06  1           0    
7   2019-07-07  1           0    
8   2019-07-08  1           0    
9   2019-07-09  1           0    
10  2019-07-10  1           17   
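
For readers without BigQuery access, the same shape of query can be tried locally. The sketch below is an adaptation, not the query above: it uses Python's built-in `sqlite3` module, replaces `GENERATE_DATE_ARRAY` (which SQLite does not have) with a recursive CTE, and spells out the joins explicitly. The table name `product_views` is hypothetical.

```python
import sqlite3

# Sample rows from the question, with ISO dates as used in the answer.
ROWS = [
    ("2019-07-01", "A", 8),
    ("2019-07-02", "A", 6),
    ("2019-07-02", "B", 4),
    ("2019-07-03", "A", 2),
    ("2019-07-04", "A", 8),
    ("2019-07-04", "B", 6),
    ("2019-07-04", "C", 4),
    ("2019-07-05", "C", 2),
    ("2019-07-10", "A", 17),
]

QUERY = """
WITH RECURSIVE dates(day) AS (
  -- Every calendar day between the first and last recorded date.
  SELECT MIN(day) FROM product_views
  UNION ALL
  SELECT date(dates.day, '+1 day')
  FROM dates
  WHERE dates.day < (SELECT MAX(pv.day) FROM product_views AS pv)
),
ranges AS (
  -- First and last view date per product.
  SELECT productId, MIN(day) AS min_day, MAX(day) AS max_day
  FROM product_views
  GROUP BY productId
)
SELECT d.day,
       COUNT(DISTINCT r.productId) AS Inventory,
       SUM(IFNULL(v.views, 0)) AS views
FROM dates AS d
JOIN ranges AS r ON d.day BETWEEN r.min_day AND r.max_day
LEFT JOIN product_views AS v
  ON v.day = d.day AND v.productId = r.productId
GROUP BY d.day
ORDER BY d.day
"""


def daily_inventory(rows):
    """Load the rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE product_views (day TEXT, productId TEXT, views INTEGER)"
        )
        conn.executemany("INSERT INTO product_views VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for row in daily_inventory(ROWS):
    print(*row)
```

Storing dates as ISO `YYYY-MM-DD` text is what makes `MIN`, `MAX`, and `BETWEEN` behave correctly here, since SQLite compares them as strings.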