Adding missing date rows for each ID subgroup in a time-series table

Asked: 2018-10-31 19:35:58

Tags: mysql

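I am working with a table that only has data for working days. The data is essentially the end-of-day balance for each day. It looks like this: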

ID  Name        Some Val    Other Val   Date

10  Somebody    33001.93    33001.93    2018-10-01
10  Somebody    33481.93    33481.93    2018-10-02
10  Somebody    33001.93    33001.93    2018-10-03
10  Somebody    33582.76    33582.76    2018-10-04
10  Somebody    33582.73    33582.79    2018-10-05
------- Missing row for 2018-10-06 ---------------
------- Missing row for 2018-10-07 ---------------
10  Somebody    33582.76    33582.76    2018-10-08
------- Missing row for 2018-10-09 ---------------
10  Somebody    33462.76    33462.76    2018-10-10

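My task is to calculate the average daily balance (the sum of the end-of-day balances divided by the total number of days). To do that, I need to make sure I have a row for every day, so each missing row has to be filled in from the last row that precedes it.

What I need is: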

ID  Name        Some Val    Other Val   Date

10  Somebody    33001.93    33001.93    2018-10-01
10  Somebody    33481.93    33481.93    2018-10-02
10  Somebody    33001.93    33001.93    2018-10-03
10  Somebody    33582.76    33582.76    2018-10-04
10  Somebody    33582.73    33582.79    2018-10-05    
10  Somebody    33582.73    33582.79    2018-10-06
10  Somebody    33582.73    33582.79    2018-10-07    
10  Somebody    33582.76    33582.76    2018-10-08
10  Somebody    33582.76    33582.76    2018-10-09
10  Somebody    33462.76    33462.76    2018-10-10

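Essentially, row 5 is copied into the missing rows 6 and 7, and row 8 is copied into the missing row 9.

I worked around the problem by creating a calendar table and then using the following query: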
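With the table in this shape, the calculation itself is a plain aggregate. A minimal sketch, assuming the gap-filled rows live in a hypothetical table named filled_balances and that SomeVal is the balance being averaged (both are assumptions made for illustration):

```sql
-- Hypothetical table: one row per ID per calendar day, gaps already filled.
CREATE TABLE filled_balances (
  ID       INT NOT NULL,
  Name     VARCHAR(50),
  SomeVal  DECIMAL(12,2),
  OtherVal DECIMAL(12,2),
  `Date`   DATE NOT NULL,
  PRIMARY KEY (ID, `Date`)
);

-- A few rows taken from the desired output shown earlier.
INSERT INTO filled_balances VALUES
  (10, 'Somebody', 33582.73, 33582.79, '2018-10-05'),
  (10, 'Somebody', 33582.73, 33582.79, '2018-10-06'),  -- carried forward
  (10, 'Somebody', 33582.73, 33582.79, '2018-10-07'),  -- carried forward
  (10, 'Somebody', 33582.76, 33582.76, '2018-10-08');

-- Average daily balance per ID: sum of end-of-day balances / number of days.
SELECT ID,
       SUM(SomeVal) / COUNT(*) AS avg_daily_balance
FROM filled_balances
GROUP BY ID;
```

Because every calendar day contributes exactly one row, COUNT(*) is the number of days in the period for that ID.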

SELECT  
CASE WHEN ID IS NULL THEN (SELECT ID 
                        FROM T tt 
                        WHERE tt.Date < t1.minDt
                        ORDER BY tt.Date DESC
                        LIMIT 1)  
ELSE ID END ID,
CASE WHEN Name IS NULL THEN (SELECT Name 
                        FROM T tt 
                        WHERE tt.Date < t1.minDt
                        ORDER BY tt.Date DESC
                        LIMIT 1) 
ELSE Name END Name,
CASE WHEN SomeVal IS NULL THEN (SELECT SomeVal 
                        FROM T tt 
                        WHERE tt.Date < t1.minDt
                        ORDER BY tt.Date DESC
                        LIMIT 1) 
ELSE SomeVal END SomeVal,
CASE WHEN OtherVal IS NULL THEN (SELECT OtherVal 
                        FROM T tt 
                        WHERE tt.Date < t1.minDt
                        ORDER BY tt.Date DESC
                        LIMIT 1) 
ELSE OtherVal END OtherVal,
minDt
FROM calendar t1 
LEFT JOIN T t2 ON t1.minDt = t2.Date
ORDER BY t1.minDT;

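This solution works when the ID value is constant. I have since realized that my data set has thousands of records with hundreds of unique ID values, and any ID may have missing dates. The query above only fills in the top part of the data rather than all of it, so I need to run the same logic for each ID. I am guessing PARTITION BY might work in MySQL, but I am not sure how to go about it.

The data actually looks like this: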

10,'Somebody',33001.93,33001.93,'2018-10-01'
10,'Somebody',33481.93,33481.93,'2018-10-02'
10,'Somebody',33001.93,33001.93,'2018-10-03'
10,'Somebody',33582.76,33582.76,'2018-10-04'
10,'Somebody',33582.73,33582.79,'2018-10-05'
10,'Somebody',33582.76,33582.76,'2018-10-08'
15,'someone else',33462.76,33462.76,'2018-10-1'
15,'someone else',33582.76,33582.76,'2018-10-04'
15,'someone else',33582.73,33582.79,'2018-10-05'
15,'someone else',33582.76,33582.76,'2018-10-08'
15,'someone else',33462.76,33462.76,'2018-10-10'

You can try it here with dummy data and the query above:

View on DB Fiddle

The MySQL version I am using is:

mysql  Ver 14.14 Distrib 5.7.24, for Linux (x86_64) using  EditLine wrapper

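2 Answers:

Answer 0 (score: 1)

You can fill in the table data using MySQL variables. The trick is to JOIN the calendar table to the list of distinct ID values in your table, which gives a table with one row for every ID and every date in the range. That can then be LEFT JOINed to the data table to pick up the values where they exist, and MySQL variables can be used to fill in the gaps: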

-- Carry the last seen value forward whenever the joined data row is missing (NULL)
SELECT thedate,
       @name := coalesce(Name, @name) AS Name,
       @someval := coalesce(SomeVal, @someval) AS SomeVal,
       @otherval := coalesce(OtherVal, @otherval) AS OtherVal,
       @id := id AS id
FROM (SELECT c.thedate, i.id, t.Name, t.SomeVal, t.OtherVal
      FROM calendar c
      -- JOIN without ON: every calendar date is paired with every distinct id
      JOIN (SELECT DISTINCT id FROM t) i
      LEFT JOIN t ON t.date = c.thedate AND t.id = i.id) g
-- Initialise the variables once for the whole statement
CROSS JOIN (SELECT @id := 0, @name := '', @someval := 0, @otherval := 0) v
ORDER BY id, thedate;

Output for the sample data:

thedate     Name            SomeVal     OtherVal    id
2018-10-01  Somebody        33001.93    33001.93    10
2018-10-02  Somebody        33481.93    33481.93    10
2018-10-03  Somebody        33001.93    33001.93    10
2018-10-04  Somebody        33582.76    33582.76    10
2018-10-05  Somebody        33582.73    33582.79    10
2018-10-06  Somebody        33582.73    33582.79    10
2018-10-07  Somebody        33582.73    33582.79    10
2018-10-08  Somebody        33582.76    33582.76    10
2018-10-09  Somebody        33582.76    33582.76    10
2018-10-10  Somebody        33582.76    33582.76    10
2018-10-01  someone else    33462.76    33462.76    15
2018-10-02  someone else    33462.76    33462.76    15
2018-10-03  someone else    33462.76    33462.76    15
2018-10-04  someone else    33582.76    33582.76    15
2018-10-05  someone else    33582.73    33582.79    15
2018-10-06  someone else    33582.73    33582.79    15
2018-10-07  someone else    33582.73    33582.79    15
2018-10-08  someone else    33582.76    33582.76    15
2018-10-09  someone else    33582.76    33582.76    15
2018-10-10  someone else    33462.76    33462.76    15

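I have created a demo on dbfiddle that shows how the various parts fit together (including my calendar table, which only covers the date range of the table).

Answer 1 (score: 0)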
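The calendar table itself is not shown in this answer. A minimal sketch of one way to build it on MySQL 5.7, which has no recursive CTEs, assuming a table named calendar with a single thedate column (as the query uses) and a hard-coded range of 2018-10-01 to 2018-10-10 chosen to match the sample data:

```sql
-- Assumed layout: one row per calendar day; column name taken from the query.
CREATE TABLE calendar (
  thedate DATE NOT NULL PRIMARY KEY
);

-- Build day offsets 0..99 from two digit lists, then keep those inside the range.
INSERT INTO calendar (thedate)
SELECT DATE_ADD('2018-10-01', INTERVAL (tens.n * 10 + ones.n) DAY)
FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) ones
CROSS JOIN
     (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) tens
-- DATEDIFF(end, start) is the largest offset that still falls inside the range.
WHERE tens.n * 10 + ones.n <= DATEDIFF('2018-10-10', '2018-10-01');
```

For a longer period, a third digit list extends the offsets to 0..999, and the two literal dates could instead be taken from the MIN and MAX of the date column in the data table.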

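I think I have made some progress using the same logic as above. I had to create the calendar lookup table with the ID data included, and I match at both the date and the ID level. The resulting table ends up with a lot of duplicate/empty records, but de-duplicating the data gets me pretty much what I need.

This is surely not the most elegant solution, because the temporary data set I am working with is very large. There must be a cleaner solution, but this works for me for now.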
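For comparison, the same date-and-ID matching could probably be written without a de-duplication pass. A minimal sketch, assuming (id, date) is unique in t and reusing the table and column names from the other answer; the schema and the handful of sample rows below are illustrative assumptions, not the real data set:

```sql
-- Assumed schema, cut down from the question's sample data.
CREATE TABLE t (
  id       INT NOT NULL,
  Name     VARCHAR(50),
  SomeVal  DECIMAL(12,2),
  OtherVal DECIMAL(12,2),
  `date`   DATE NOT NULL,
  PRIMARY KEY (id, `date`)   -- assumption: at most one row per id per day
);
CREATE TABLE calendar (thedate DATE NOT NULL PRIMARY KEY);

INSERT INTO t VALUES
  (10, 'Somebody',     33582.73, 33582.79, '2018-10-05'),
  (10, 'Somebody',     33582.76, 33582.76, '2018-10-08'),
  (15, 'someone else', 33582.76, 33582.76, '2018-10-04'),
  (15, 'someone else', 33582.76, 33582.76, '2018-10-08');
INSERT INTO calendar VALUES
  ('2018-10-04'), ('2018-10-05'), ('2018-10-06'), ('2018-10-07'), ('2018-10-08');

-- For every (id, calendar date) pair, pick the row of that id with the latest
-- date on or before the calendar date, i.e. carry the last known row forward.
SELECT i.id, src.Name, src.SomeVal, src.OtherVal, c.thedate
FROM calendar c
CROSS JOIN (SELECT DISTINCT id FROM t) i
JOIN t src ON src.id = i.id
WHERE src.`date` = (SELECT MAX(t2.`date`)
                    FROM t t2
                    WHERE t2.id = i.id
                      AND t2.`date` <= c.thedate)
ORDER BY i.id, c.thedate;
```

Calendar dates that fall before an ID's first recorded row produce no output row here, since there is nothing to carry forward. On a large table the correlated subquery benefits from an index on (id, date), which the assumed primary key provides.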