Best way to index and query an analytics table in MySQL

Time: 2012-05-10 20:30:40

Tags: mysql sql indexing subquery analytics

I have an analytics table (5M rows and growing) with the following structure:

Hits 
  id int(11) NOT NULL AUTO_INCREMENT,
  hit_date datetime NOT NULL,
  hit_day int(11) DEFAULT NULL,
  gender varchar(255) DEFAULT NULL,
  age_range_id int(11) DEFAULT NULL,
  klout_range_id int(11) DEFAULT NULL,
  frequency int(11) DEFAULT NULL,
  count int(11) DEFAULT NULL,
  location_id int(11) DEFAULT NULL,
  source_id int(11) DEFAULT NULL,
  target_id int(11) DEFAULT NULL,

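Most queries against the table filter on a range between two datetimes plus a particular subset of columns, and they sum the count column over all matching rows. For example: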

SELECT target_id,
   SUM(CASE gender WHEN 'm' THEN count END) AS 'gender_male',
   SUM(CASE gender WHEN 'f' THEN count END) AS 'gender_female',
   SUM(CASE age_range_id WHEN 1 THEN count END) AS 'age_18 - 20',
   SUM(CASE target_id WHEN 1 THEN count END) AS 'target_test',
   SUM(CASE location_id WHEN 1 THEN count END) AS 'location_NY'
FROM Hits
WHERE (location_id = 1 OR location_id = 2)
  AND (target_id = 40 OR target_id = 22)
  AND CAST(hit_date AS date) BETWEEN '2012-5-4' AND '2012-5-10'
GROUP BY target_id

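What is interesting about queries on this table is that the WHERE clause can contain any permutation of Hits column names and values, since those are what we filter on. The specific query above gets the number of males and females between the ages of 18 and 20 (age_range_id 1) in NY who belong to a target called "test". However, there are more than 8 age groups, 10 klout ranges, 45 locations, 10 sources, and so on (all foreign key references).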

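I currently have one index on hit_date and another on target_id. What is the best way to index this table properly? A composite index on all of the columns seems inherently wrong.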

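Is there any other way to run this query without using a subquery to sum up all the counts? I did some research, and this seems to be the best way to get the data set I need, but is there a more efficient way to handle this query?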

1 answer:

Answer 0: (score: 2)

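Here is your optimized query. The idea is to get rid of the ORs and the CAST() function on hit_date so that MySQL can use a compound index that covers each subset of the data. You will want a compound index on (location_id, target_id, hit_date), in that order.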

SELECT id,
   SUM(gender_male) AS gender_male,
   SUM(gender_female) AS gender_female,
   SUM(`age_18 - 20`) AS `age_18 - 20`,
   SUM(target_test) AS target_test,
   SUM(location_NY) AS location_NY
FROM
(
-- location 1, target 40
SELECT target_id AS id,
   SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
   SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
   SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
   SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
   SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 1)
  AND (target_id = 40)
  AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id

UNION ALL

-- location 2, target 40
SELECT target_id AS id,
   SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
   SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
   SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
   SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
   SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 2)
  AND (target_id = 40)
  AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id

UNION ALL

-- location 1, target 22
SELECT target_id AS id,
   SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
   SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
   SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
   SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
   SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 1)
  AND (target_id = 22)
  AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id

UNION ALL

-- location 2, target 22
SELECT target_id AS id,
   SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
   SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
   SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
   SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
   SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE (location_id = 2)
  AND (target_id = 22)
  AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id
) a
GROUP BY id
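
The compound index mentioned above could be created along these lines. This is only a sketch: the index name idx_loc_target_date is made up for illustration, and the EXPLAIN runs a single branch of the UNION to check whether MySQL actually picks the index.

```sql
-- Hypothetical index name. Column order follows the answer:
-- the two equality columns first, the range column (hit_date) last.
ALTER TABLE Hits
  ADD INDEX idx_loc_target_date (location_id, target_id, hit_date);

-- Check one branch of the UNION. If the index is chosen, the `key` column
-- of the EXPLAIN output should show idx_loc_target_date.
EXPLAIN
SELECT target_id,
   SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male
FROM Hits
WHERE location_id = 1
  AND target_id = 40
  AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id;
```

The column order matters: once MySQL hits a range condition on one part of a compound index, it cannot use the parts to its right for range access, so the equality columns go before hit_date.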

If your selection is so large that this brings no improvement, you may as well keep scanning all the rows, as you are already doing.
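
For comparison, the same filter can also be kept as a single query with IN lists. This variant is not part of the answer above; it is a sketch that assumes a compound index on (location_id, target_id, hit_date) exists, and whether the optimizer uses that index for it should be verified with EXPLAIN on your own data.

```sql
SELECT target_id AS id,
   SUM(CASE gender WHEN 'm' THEN 1 END) AS gender_male,
   SUM(CASE gender WHEN 'f' THEN 1 END) AS gender_female,
   SUM(CASE age_range_id WHEN 1 THEN 1 END) AS `age_18 - 20`,
   SUM(CASE target_id WHEN 1 THEN 1 END) AS target_test,
   SUM(CASE location_id WHEN 1 THEN 1 END) AS location_NY
FROM Hits
WHERE location_id IN (1, 2)
  AND target_id IN (40, 22)
  -- Half-open range on the raw column: no CAST(), so an index containing
  -- hit_date stays usable, and the whole of 2012-05-10 is included.
  AND hit_date >= '2012-05-04'
  AND hit_date <  '2012-05-11'
GROUP BY target_id;
```

If both forms end up using the index, the single query is shorter to maintain as the number of filter combinations grows; if only the UNION ALL form does, that is the one to keep.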

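Note: surround aliases with backticks rather than single quotes, which are deprecated. I also fixed the CASE expressions, which used count instead of 1.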
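
The two CASE forms answer different questions, which can be seen by putting them side by side. A sketch, assuming (as the question describes) that every row of Hits stores a count value; the aliases male_rows and male_hits are made up for illustration.

```sql
SELECT target_id,
   -- THEN 1: number of matching rows
   SUM(CASE gender WHEN 'm' THEN 1 END)       AS male_rows,
   -- THEN `count`: total of the stored count column over matching rows
   SUM(CASE gender WHEN 'm' THEN `count` END) AS male_hits
FROM Hits
WHERE location_id = 1
  AND target_id = 40
  AND hit_date BETWEEN '2012-05-04 00:00:00' AND '2012-05-10 23:59:59'
GROUP BY target_id;
```

The two columns agree only when count is 1 on every matching row, so which form is right depends on whether each row represents one hit or a pre-aggregated total.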