Consider the following tables:
tweets                              daterange
--------------------------------    ----------
tweet_id   nyse_date    class       _date
--------------------------------    ----------
1          2011-03-12   2           2011-03-11
2          2011-03-12   1           2011-03-12
3          2011-03-12   1           2011-03-13
4          2011-03-12   1           2011-03-14
5          2011-03-12   0           2011-03-15
7          2011-03-13   1
8          2011-03-13   2
9          2011-03-13   3
10         2011-03-14   3
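Each tweet is assigned a "class", which is 1, 2 or 3. I need an overview of the number of tweets in each class for every date in the date range. So even though there are no tweets on 2011-03-11 and 2011-03-15, I still need those dates included in the result set, like this: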
nyse_date total class1 class2 class3
-----------------------------------------
2011-03-11 0 0 0 0
2011-03-12 5 3 1 0
2011-03-13 3 1 1 1
2011-03-14 1 0 0 1
2011-03-15 0 0 0 0
I tried the following query, but it just times out (which it shouldn't, because the database isn't that large):
SELECT
t.nyse_date,
COUNT(CASE WHEN t.nyse_date = d._date THEN 1 END) total,
SUM(t.class=1) as neu,
SUM(t.class=2) as pos,
SUM(t.class=3) as neg
FROM tweets t
CROSS JOIN
daterange d
GROUP BY t.nyse_date
ORDER BY t.nyse_date ASC
Here is the EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
---------------------------------------------------------------------------------------------------
1 SIMPLE d ALL NULL NULL NULL NULL 148 Using temporary; Using filesort
1 SIMPLE t ALL NULL NULL NULL NULL 560783 Using join buffer
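What am I doing wrong? Is there a more efficient way to make sure that all dates from the daterange table are included?
Edit: I also tried this query, but the result stays the same: it keeps running until it times out.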
SELECT
t.nyse_date,
COUNT(t.tweet_id) AS total,
SUM(t.class=1) AS neu,
SUM(t.class=2) AS pos,
SUM(t.class=3) AS neg
FROM tweets t
LEFT JOIN
daterange d
ON t.nyse_date = d._date
GROUP BY t.nyse_date
ORDER BY t.nyse_date ASC
Here is the EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
-------------------------------------------------------------------------------------------------
1 SIMPLE t ALL NULL NULL NULL NULL 560783 Using temporary; Using filesort
1 SIMPLE d ALL NULL NULL NULL NULL 148
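Answer 0 (score: 3)
The reason your query runs slowly is that it isn't using any index on the tweets table.
What you want to do is create a composite index on the (sp100_id, nyse_date) columns of the tweets table, then run this query: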
SELECT
a.sp100_id,
b._date,
COALESCE(c.total,0) AS total,
COALESCE(c.neu,0) AS neu,
COALESCE(c.pos,0) AS pos,
COALESCE(c.neg,0) AS neg,
COALESCE(c.spamneu,0) AS spamneu
FROM
sp100 a
CROSS JOIN
daterange b
LEFT JOIN
(
SELECT
sp100_id,
nyse_date,
COUNT(1) AS total,
COUNT(CASE class WHEN 1 THEN 1 END) AS neu,
COUNT(CASE class WHEN 2 THEN 1 END) AS pos,
COUNT(CASE class WHEN 3 THEN 1 END) AS neg,
COUNT(CASE WHEN class = 1 AND type = 1 THEN 1 END) AS spamneu
FROM tweets
GROUP BY sp100_id, nyse_date
) c ON
a.sp100_id = c.sp100_id AND b._date = c.nyse_date
ORDER BY
a.sp100_id, b._date
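The composite index mentioned above could be created along these lines. This is only a sketch: the index name idx_tweets_sp100_date is made up for illustration, and the EXPLAIN is simply one way to check whether MySQL picks the index up for the grouped subquery.

```sql
-- Hypothetical index name; the column list follows the suggestion above.
CREATE INDEX idx_tweets_sp100_date ON tweets (sp100_id, nyse_date);

-- Check the plan of the aggregating part on its own.
-- If the index is used, the "key" column should no longer be NULL.
EXPLAIN
SELECT sp100_id, nyse_date, COUNT(1) AS total
FROM tweets
GROUP BY sp100_id, nyse_date;
```

Aggregating tweets first and only then joining the small per-day result to the date list keeps the join from having to touch every tweet row once per date.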
Answer 1 (score: 1)
I think you're close. But you probably want the dates on the left side of the join.
SELECT
d.nyse_date,
COUNT(t.tweet_id) AS total,
SUM(t.class=1) AS neu,
SUM(t.class=2) AS pos,
SUM(t.class=3) AS neg
FROM daterange d LEFT OUTER JOIN tweets t ON t.nyse_date = d._date
GROUP BY d.nyse_date
ORDER BY d.nyse_date ASC
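There's no need to jump to conclusions about indexes. Before assuming too much, just try the query written the right way first.
Edit:
When I first wrote this, I hadn't noticed that your tables use different names for the date columns. I wrote the query with an invalid column: there is no d.nyse_date. If you changed it to t.nyse_date, or just removed the qualifying alias instead of changing it to the correct column reference d._date, then I think that explains the problem we are seeing, because grouping on the value from the tweets table will not return the dates that have no tweets.
Here is a version that should work: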
SELECT
d._date,
COUNT(t.tweet_id) AS total,
SUM(t.class=1) AS neu,
SUM(t.class=2) AS pos,
SUM(t.class=3) AS neg
FROM daterange d LEFT OUTER JOIN tweets t ON t.nyse_date = d._date
GROUP BY d._date
ORDER BY d._date ASC
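One detail to watch for: on dates with no matching tweets, t.class is NULL, so SUM(t.class=1) returns NULL rather than 0, while COUNT(t.tweet_id) returns 0. If you want zeros in the class columns, as in your expected output, a variant like the following sketch might do. It assumes the same tables and columns as above and only wraps the sums in COALESCE.

```sql
SELECT
    d._date,
    COUNT(t.tweet_id)             AS total,  -- counts only matched tweets, so 0 on empty dates
    COALESCE(SUM(t.class = 1), 0) AS neu,    -- SUM over no non-NULL values is NULL; map it to 0
    COALESCE(SUM(t.class = 2), 0) AS pos,
    COALESCE(SUM(t.class = 3), 0) AS neg
FROM daterange d
LEFT OUTER JOIN tweets t ON t.nyse_date = d._date
GROUP BY d._date
ORDER BY d._date ASC;
```

With the sample data from the question, the rows for 2011-03-11 and 2011-03-15 would then show 0 in every count column instead of NULL.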