Avoid joining a table to itself 100 times

Date: 2014-12-05 17:31:48

Tags: mysql sql database join

I have a huge table (millions of rows) that looks, in essence, like this:

 datatime               tagname      interesting somemore columns
 2014-12-04 20:00:00   grp1_tagA          77        0       0
 2014-12-04 20:00:00   grp1_tagB          88        0       0
 2014-12-04 20:00:00   grp1_tagC          99        0       0
 2014-12-04 20:00:00   grp2_tagA          11        0       0
 2014-12-04 20:00:00   grp2_tagB          22        0       0
 2014-12-04 20:00:00   grp2_tagC          13        0       0
 2014-12-04 21:00:00   grp1_tagA          17        0       0
 2014-12-04 21:00:00   grp1_tagB          28        0       0
 2014-12-04 21:00:00   grp1_tagC          29        0       0
 2014-12-04 21:00:00   grp2_tagA          31        0       0
 2014-12-04 21:00:00   grp2_tagB          62        0       0
 2014-12-04 21:00:00   grp2_tagC          53        0       0
 2014-12-04 22:00:00   grp1_tagA          87        0       0
 2014-12-04 22:00:00   grp1_tagB          48        0       0
 2014-12-04 22:00:00   grp1_tagC          99        0       0
 2014-12-04 22:00:00   grp2_tagA          51        0       0
 2014-12-04 22:00:00   grp2_tagB          42        0       0
 2014-12-04 22:00:00   grp2_tagC          53        0       0

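In the real table there are a few dozen groups with ~100 tags each, and for every group and tag there are several years of hourly data (on the order of ten thousand rows per tag name), which currently adds up to about 8 million rows. At a later stage, other tables with smaller time intervals, and therefore more rows, will come into play.

I need a fast way to get all the data in the table that belongs to a certain group (e.g. group 1, i.e. tag names starting with "grp1") within some date range (on datatime), so that it can be sent to a customer's browser for visualization.

So I would like to produce a "group 1 summary" table like this: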

group1_1

The simple query would be something like this (leaving out the date constraint for now):

SELECT A.`datatime` as `datatime`,
A.`interesting` as tagA, B.`interesting` as tagB, C.`interesting` as tagC 
FROM `everything` A, `everything` B, `everything` C
WHERE 
A.`datatime` = B.`datatime` AND
A.`datatime` = C.`datatime` AND
A.`tagname` = "grp1_tagA" AND
B.`tagname` = "grp1_tagB" AND
C.`tagname` = "grp1_tagC"
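
Actually it is a bit more complicated, because at some dates some tags may have data while others do not, and I want the rows with partial data as well. With one more row in the big table:

[image]

what I would like to get is: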

group1_2

A possible query for this purpose is:

SELECT GLUE.thyme, A.iwant as tagA, B.iwant as tagB, C.iwant as tagC FROM
(SELECT distinct `datatime` as thyme from `everything`) GLUE left join
(SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagA") A on GLUE.thyme = A.thyme left join
(SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagB") B on GLUE.thyme = B.thyme left join
(SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagC") C on GLUE.thyme = C.thyme
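
The difference between the two query shapes above can be checked on a handful of rows. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, keeps just three columns, and deliberately leaves out grp1_tagB at 21:00 to mimic an hour with partial data. The sample rows are an assumption made for illustration, not the real table.

```python
import sqlite3

# Toy rows (datatime, tagname, interesting); grp1_tagB is deliberately
# missing at 21:00 to mimic an hour with partial data.
ROWS = [
    ("2014-12-04 20:00:00", "grp1_tagA", 77),
    ("2014-12-04 20:00:00", "grp1_tagB", 88),
    ("2014-12-04 20:00:00", "grp1_tagC", 99),
    ("2014-12-04 21:00:00", "grp1_tagA", 17),
    ("2014-12-04 21:00:00", "grp1_tagC", 29),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE everything ("
    " datatime TEXT, tagname TEXT, interesting INTEGER,"
    " PRIMARY KEY (datatime, tagname))"
)
conn.executemany("INSERT INTO everything VALUES (?, ?, ?)", ROWS)

# Inner self-join: an hour survives only if every tag has a row for it.
INNER_SQL = """
SELECT A.datatime, A.interesting, B.interesting, C.interesting
FROM everything A, everything B, everything C
WHERE A.datatime = B.datatime AND A.datatime = C.datatime
  AND A.tagname = 'grp1_tagA'
  AND B.tagname = 'grp1_tagB'
  AND C.tagname = 'grp1_tagC'
ORDER BY A.datatime
"""

# Left joins starting from the distinct timestamps: every hour survives,
# and a missing tag shows up as NULL (None in Python).
LEFT_SQL = """
SELECT GLUE.thyme, A.iwant, B.iwant, C.iwant
FROM (SELECT DISTINCT datatime AS thyme FROM everything) GLUE
LEFT JOIN (SELECT datatime AS thyme, interesting AS iwant
           FROM everything WHERE tagname = 'grp1_tagA') A ON GLUE.thyme = A.thyme
LEFT JOIN (SELECT datatime AS thyme, interesting AS iwant
           FROM everything WHERE tagname = 'grp1_tagB') B ON GLUE.thyme = B.thyme
LEFT JOIN (SELECT datatime AS thyme, interesting AS iwant
           FROM everything WHERE tagname = 'grp1_tagC') C ON GLUE.thyme = C.thyme
ORDER BY GLUE.thyme
"""

inner_rows = conn.execute(INNER_SQL).fetchall()
left_rows = conn.execute(LEFT_SQL).fetchall()
print(inner_rows)  # only the 20:00 row
print(left_rows)   # 20:00 and 21:00, with None for the missing tagB
```

On this data the inner join silently drops the 21:00 hour, while the left-join version keeps it with an empty tagB cell, which is the behaviour asked for.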

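Problem: the "real world" version of this query is not fast enough. I tested the query structure above with 34 tag names (i.e. joining 35 tables) and added a date constraint such as where/and datatime >= '2013-12-04' to each of the subqueries, so that 8760 rows in total (i.e. one year of data) are returned. The resulting run time is 2.5 minutes. I am aiming for less than half a minute, which is about the time it takes to transfer the data over the internet.

The big table has a composite primary key on datatime and tagname, plus an additional index (key) on datatime.

The overall question is: how can I get the data faster?

Question 1: Can the query above be improved?

That would be the preferred solution.

Update: The accepted answer provides the preferred solution. The query can be written without any joins, and it is much faster (down from 2.5 minutes to a few seconds, just tested). There is no need to read the rest of the question, and no additional tables are needed.

If that is not possible, one could maintain an additional table group1 that holds all the data of the query result for the entire available date range, and keep it in sync with the big table by some means, possibly triggers. That is what I currently do, but I suspect my triggers do not run fast enough.

So, create the new table: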

CREATE TABLE `group1` (
  `datatime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `tagA` int(32) DEFAULT NULL,
  `tagB` int(32) DEFAULT NULL,
  `tagC` int(32) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Transfer the data from the big table to the new table:

INSERT INTO group1 (`datatime`) SELECT DISTINCT `datatime` from `everything`;

UPDATE group1 g, (SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagA") as source set g.`tagA` = iwant WHERE g.`datatime`= source.thyme;
UPDATE group1 g, (SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagB") as source set g.`tagB` = iwant WHERE g.`datatime`= source.thyme;
UPDATE group1 g, (SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagC") as source set g.`tagC` = iwant WHERE g.`datatime`= source.thyme;

Trigger to keep the new table in sync with the big table:

DELIMITER //
CREATE TRIGGER everything_group1_after_insert
AFTER INSERT
   ON `everything` FOR EACH ROW
BEGIN
    DECLARE counter INT;
    SET counter = (SELECT count(*) FROM `group1` WHERE datatime = NEW.`datatime`);
    IF counter = 0 THEN
        INSERT INTO `group1` (`datatime`) VALUES (NEW.`datatime`);
    END IF;
    IF NEW.TAGNAME = "grp1_tagA" THEN UPDATE `group1` SET `tagA` = NEW.`interesting` WHERE `group1`.`datatime` = NEW.`datatime`; END IF;  
    IF NEW.TAGNAME = "grp1_tagB" THEN UPDATE `group1` SET `tagB` = NEW.`interesting` WHERE `group1`.`datatime` = NEW.`datatime`; END IF;  
    IF NEW.TAGNAME = "grp1_tagC" THEN UPDATE `group1` SET `tagC` = NEW.`interesting` WHERE `group1`.`datatime` = NEW.`datatime`; END IF;  
END; //
DELIMITER ;

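Question 2: How can the run time of the trigger be improved? Or can the tables be kept in sync in some other way (not necessarily with triggers)? Is one IF statement per tag unavoidable?

Question 3: Suppose a new tag is added to the group. Is it possible to write the trigger (or the query, see Question 1) in such a way that it does not have to be rewritten to account for the new tag/column in the result table? For the query I am fairly sure this is impossible (it would require joining an unspecified number of tables), but maybe it is possible for the trigger?

You can download an SQL dump of the toy database above here: toy database

Update: I forgot the primary key on group1: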

alter table `group1` add primary key (datatime)

2 Answers:

Answer 0: (score: 3)

Try a GROUP BY on the datatime column together with CASE expressions, as shown below.

SELECT a.datatime
    , sum(case when a.tagname = 'grp1_tagA' then a.interesting else NULL end) as tagA
    , sum(case when a.tagname = 'grp1_tagB' then a.interesting else NULL end) as tagB
    , sum(case when a.tagname = 'grp1_tagC' then a.interesting else NULL end) as tagC
FROM everything AS a
WHERE a.datatime >= '2013-12-04'
GROUP BY a.datatime
;
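
To see why this works without any join, here is a small sketch that runs the same query shape against a few toy rows. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows (including the missing grp1_tagB at 21:00) are assumptions made for illustration. Because the primary key is (datatime, tagname), each CASE matches at most one row per hour, so SUM simply passes that single value through, and it returns NULL when no row matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE everything ("
    " datatime TEXT, tagname TEXT, interesting INTEGER,"
    " PRIMARY KEY (datatime, tagname))"
)
conn.executemany(
    "INSERT INTO everything VALUES (?, ?, ?)",
    [
        ("2014-12-04 20:00:00", "grp1_tagA", 77),
        ("2014-12-04 20:00:00", "grp1_tagB", 88),
        ("2014-12-04 20:00:00", "grp1_tagC", 99),
        ("2014-12-04 20:00:00", "grp2_tagA", 11),
        ("2014-12-04 21:00:00", "grp1_tagA", 17),
        # no grp1_tagB row at 21:00: an hour with partial data
        ("2014-12-04 21:00:00", "grp1_tagC", 29),
        ("2014-12-04 21:00:00", "grp2_tagA", 31),
    ],
)

# One pass over the table: each output column only "sees" the rows of its
# own tag, every other row contributes NULL and is ignored by SUM.
PIVOT_SQL = """
SELECT a.datatime
    , SUM(CASE WHEN a.tagname = 'grp1_tagA' THEN a.interesting ELSE NULL END) AS tagA
    , SUM(CASE WHEN a.tagname = 'grp1_tagB' THEN a.interesting ELSE NULL END) AS tagB
    , SUM(CASE WHEN a.tagname = 'grp1_tagC' THEN a.interesting ELSE NULL END) AS tagC
FROM everything AS a
WHERE a.datatime >= ?
GROUP BY a.datatime
ORDER BY a.datatime
"""

pivot_rows = conn.execute(PIVOT_SQL, ("2014-12-04",)).fetchall()
for row in pivot_rows:
    print(row)
```

The 21:00 row comes back with None in the tagB position, so hours with partial data are kept just as in the left-join version of the query.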

Answer 1: (score: 0)

Tests on the huge table with millions of rows showed that BateTech's excellent answer can still be improved a little, like this:

SELECT a.datatime
    , sum(case when a.tagname = 'grp1_tagA' then a.interesting else NULL end) as tagA
    , sum(case when a.tagname = 'grp1_tagB' then a.interesting else NULL end) as tagB
    , sum(case when a.tagname = 'grp1_tagC' then a.interesting else NULL end) as tagC
FROM (SELECT * FROM everything WHERE datatime >= '2013-12-04' and tagname like "grp1_%") AS a
GROUP BY a.datatime
;
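
Regarding Question 3: since the GROUP BY form needs only one CASE expression per tag, the query text can be generated from the list of tag names instead of being rewritten by hand whenever a tag is added. The following is a rough sketch of that idea, not a tested production solution. The helper names build_pivot_sql and group_tags are made up for this example, sqlite3 stands in for MySQL, and the `?` placeholders are sqlite3's style (most MySQL drivers for Python use `%s` instead).

```python
import re
import sqlite3


def build_pivot_sql(tagnames, start):
    """Hypothetical helper: return (sql, params) with one SUM(CASE ...) per tag."""
    if not tagnames:
        raise ValueError("at least one tag name is required")
    columns = []
    for tag in tagnames:
        alias = tag.partition("_")[2]  # 'grp1_tagA' -> 'tagA'
        # Column aliases cannot be bound as parameters, so accept only
        # plain identifiers before putting them into the SQL text.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", alias):
            raise ValueError(f"unsupported tag name: {tag!r}")
        columns.append(
            "    , SUM(CASE WHEN a.tagname = ? THEN a.interesting"
            f" ELSE NULL END) AS {alias}"
        )
    placeholders = ", ".join("?" for _ in tagnames)
    sql = (
        "SELECT a.datatime\n"
        + "\n".join(columns)
        + "\nFROM everything AS a\n"
        + f"WHERE a.datatime >= ? AND a.tagname IN ({placeholders})\n"
        + "GROUP BY a.datatime\nORDER BY a.datatime"
    )
    # Parameter order follows placeholder order: CASEs, date, IN list.
    params = list(tagnames) + [start] + list(tagnames)
    return sql, params


def group_tags(conn, prefix):
    """Hypothetical helper: tag names of one group, read from the table itself."""
    names = conn.execute("SELECT DISTINCT tagname FROM everything")
    return sorted(name for (name,) in names if name.startswith(prefix))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE everything ("
    " datatime TEXT, tagname TEXT, interesting INTEGER,"
    " PRIMARY KEY (datatime, tagname))"
)
conn.executemany(
    "INSERT INTO everything VALUES (?, ?, ?)",
    [
        ("2014-12-04 20:00:00", "grp1_tagA", 77),
        ("2014-12-04 20:00:00", "grp1_tagB", 88),
        ("2014-12-04 20:00:00", "grp1_tagC", 99),
        ("2014-12-04 20:00:00", "grp2_tagA", 11),
        ("2014-12-04 21:00:00", "grp1_tagA", 17),
        ("2014-12-04 21:00:00", "grp1_tagC", 29),
    ],
)

sql, params = build_pivot_sql(group_tags(conn, "grp1_"), "2014-12-04")
before = conn.execute(sql, params).fetchall()

# A new tag appears in the group: the same code yields a fourth column.
conn.execute(
    "INSERT INTO everything VALUES ('2014-12-04 21:00:00', 'grp1_tagD', 5)"
)
sql, params = build_pivot_sql(group_tags(conn, "grp1_"), "2014-12-04")
after = conn.execute(sql, params).fetchall()
print(before)
print(after)
```

Reading the tag list with SELECT DISTINCT scans the table, so on a table of this size it may be preferable to keep the tag names in a small lookup table; that choice is left open here.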