MySQL:通过删除/聚合重复项和常量值来清理数据

时间:2017-01-03 22:53:02

标签: mysql aggregate

在一个包含测量和状态数据的大型数据库中,我打算在不丢失大量信息的情况下减少数据。我研究了几个例子,但我的SQL技能似乎太有限了,无法成功......

该表有数百万个数据。表定义是

TIMESTAMP TIMESTAMP, DEVICE varchar(32), TYPE varchar(32), EVENT varchar(512), READING varchar(32), VALUE varchar(32), UNIT varchar(32)

一些示例数据。完整的表有许多不同的设备和读数,每个应分别处理:

+---------------------+----------+------+---------+---------+-------+------+
| TIMESTAMP           | DEVICE   | TYPE | EVENT   | READING | VALUE | UNIT |
+---------------------+----------+------+---------+---------+-------+------+
| 2016-03-27 10:17:45 | KNX_428c | KNX  |  49 mA  | state   | 49    | mA   |
| 2016-03-27 10:19:45 | KNX_428c | KNX  |  47 mA  | state   | 47    | mA   |
| 2016-03-27 10:21:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:23:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:23:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:25:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:25:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:27:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:27:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:29:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:31:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:31:44 | KNX_428c | KNX  |  47 mA  | state   | 47    | mA   |
| 2016-03-27 10:33:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:33:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:34:04 | KNX_428c | KNX  |  136 mA | state   | 136   | mA   |
| 2016-03-27 10:34:04 | KNX_428c | KNX  |  165 mA | state   | 165   | mA   |
| 2016-03-27 10:34:05 | KNX_428c | KNX  |  136 mA | state   | 136   | mA   |
| 2016-03-27 10:34:05 | KNX_428c | KNX  |  107 mA | state   | 107   | mA   |
| 2016-03-27 10:34:05 | KNX_428c | KNX  |  79 mA  | state   | 79    | mA   |
| 2016-03-27 10:34:06 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   |
| 2016-03-27 10:34:29 | KNX_428c | KNX  |  107 mA | state   | 107   | mA   |
| 2016-03-27 10:34:29 | KNX_428c | KNX  |  136 mA | state   | 136   | mA   |
| 2016-03-27 10:34:30 | KNX_428c | KNX  |  165 mA | state   | 165   | mA   |
| 2016-03-27 10:34:30 | KNX_428c | KNX  |  139 mA | state   | 139   | mA   |
| 2016-03-27 10:34:30 | KNX_428c | KNX  |  107 mA | state   | 107   | mA   |
| 2016-03-27 10:34:31 | KNX_428c | KNX  |  51 mA  | state   | 51    | mA   |
| 2016-03-27 10:34:44 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:35:44 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:37:44 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:37:44 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:39:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:41:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:43:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:45:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:47:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:47:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |
| 2016-03-27 10:49:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |

我打算做两件事:

  • 使用max()
  • 将值与相同的时间戳,设备和读数进行汇总
  • 删除连续的相同值,但常量值的第一个和最后一个除外。

我在使用group的select语句中实现的第一件事。但我不知道如何实际更改数据库。

SELECT *,MAX(VALUE) FROM filelog
GROUP BY TIMESTAMP,DEVICE,READING

对于第二步,我找到了几个例子,但它们总是将重复项组合在一个记录中而不是两个(第一个和最后一个),因为我打算这样做。通常这些示例使用JOIN,我认为这对数百万数据集来说是不可能的。

结果如下:

| 2016-03-27 10:17:45 | KNX_428c | KNX  |  49 mA  | state   | 49    | mA   | 
| 2016-03-27 10:19:45 | KNX_428c | KNX  |  47 mA  | state   | 47    | mA   | 
| 2016-03-27 10:21:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   | 
| 2016-03-27 10:33:44 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   | 
| 2016-03-27 10:34:04 | KNX_428c | KNX  |  136 mA | state   | 165   | mA   | 
| 2016-03-27 10:34:05 | KNX_428c | KNX  |  136 mA | state   | 136   | mA   | 
| 2016-03-27 10:34:06 | KNX_428c | KNX  |  50 mA  | state   | 50    | mA   | 
| 2016-03-27 10:34:29 | KNX_428c | KNX  |  107 mA | state   | 136   | mA   | 
| 2016-03-27 10:34:30 | KNX_428c | KNX  |  165 mA | state   | 165   | mA   | 
| 2016-03-27 10:34:31 | KNX_428c | KNX  |  51 mA  | state   | 51    | mA   | 
| 2016-03-27 10:34:44 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   | 
| 2016-03-27 10:49:43 | KNX_428c | KNX  |  0 mA   | state   | 0     | mA   |   

感谢您的支持。

1 个答案:

答案 0 :(得分:0)

对于第一个查询,如果要在聚合后获取完整记录,则需要执行比建议更多的工作。一种方法是进行额外的加入:

SELECT t1.*
FROM filelog t1
INNER JOIN
(
    SELECT TIMESTAMP, DEVICE, READING, MAX(VALUE) AS VALUE
    FROM filelog
    GROUP BY TIMESTAMP, DEVICE, READING
) t2
    ON t1.TIMESTAMP = t2.TIMESTAMP AND
       t1.DEVICE    = t2.DEVICE    AND
       t1.READING   = t2.READING   AND
       t1.VALUE     = t2.VALUE