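I have a very slow MySQL query (typically close to 60 seconds) that tries to find correlations between how users voted in one poll and how they voted in all earlier polls.
Basically, we collect the user IDs of everyone who voted for one particular option in a given poll.
Then we look at how that subgroup voted in each previous poll and compare those results with how everyone (not just the subgroup) voted in that poll. The difference between the subgroup result and the overall result is the deviation, and this query sorts by deviation to find the strongest correlations.
The query is a bit of a mess: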
(SELECT p_id as poll_id, o_id AS option_id, description, optCount AS option_count, subgroup_percent, total_percent, ABS(total_percent - subgroup_percent) AS deviation
FROM(
SELECT poll_id AS p_id,
option_id AS o_id,
(SELECT description FROM `option` WHERE id = o_id) AS description,
COUNT(*) AS optCount,
(SELECT COUNT(*) FROM response INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id WHERE option_id = o_id ) /
(SELECT COUNT(*) FROM response INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id WHERE poll_id = p_id) AS subgroup_percent,
(SELECT COUNT(*) FROM response WHERE option_id = o_id) /
(SELECT COUNT(*) FROM response WHERE poll_id = p_id) AS total_percent
FROM response
INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id
WHERE poll_id < '61'
GROUP BY option_id DESC
) AS derived_table_122
)
ORDER BY deviation DESC, option_count DESC
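Note that user_ids_122 is a temporary table created beforehand, containing the IDs of all users who voted for option ID 122.
The "response" table has about 65,000 rows, the "user" table has about 7,000 rows, and the "option" table has about 130 rows.
Update:
Here is the EXPLAIN table...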
id  select_type         table         type    possible_keys      key        key_len  ref                              rows  Extra
1   PRIMARY             <derived2>    ALL     NULL               NULL       NULL     NULL                             121   Using filesort
2   DERIVED             user_ids_122  ALL     NULL               NULL       NULL     NULL                             74    Using temporary; Using filesort
2   DERIVED             response      ref     poll_id,user_id    user_id    4        correlated.user_ids_122.user_id  780   Using where
7   DEPENDENT SUBQUERY  response      ref     poll_id            poll_id    4        func                             7800  Using index
6   DEPENDENT SUBQUERY  response      ref     option_id          option_id  4        func                             7800  Using index
5   DEPENDENT SUBQUERY  user_ids_122  ALL     NULL               NULL       NULL     NULL                             74
5   DEPENDENT SUBQUERY  response      ref     poll_id,user_id    poll_id    4        func                             7800  Using where
4   DEPENDENT SUBQUERY  user_ids_122  ALL     NULL               NULL       NULL     NULL                             74
4   DEPENDENT SUBQUERY  response      ref     user_id,option_id  user_id    4        correlated.user_ids_122.user_id  780   Using where
3   DEPENDENT SUBQUERY  option        eq_ref  PRIMARY            PRIMARY    4        func                             1
Update 2:
Each row in the "response" table looks like this:
id (INT)  poll_id (INT)  user_id (INT)  option_id (INT)  created (DATETIME)
7         7              1              14               2011-03-17 09:25:10
Each row in the "option" table looks like this:
id (INT)  poll_id (INT)  text (TEXT)  description (TEXT)
14        7              No           people who dislike country music
Each row in the "user" table looks like this:
id (INT)  email (TEXT)      created (DATETIME)
1         user@example.com  2011-02-15 11:16:03
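Answer 0 (score: 3)
Three things:
So when you compute "vote counts by option_id" (which requires scanning the big table) and then need "vote counts by poll_id", don't hit the big table again; just reuse the previous result!
You can do this with ROLLUP.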
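As a rough sketch of the ROLLUP idea in MySQL, assuming the question's response table and assuming user_ids_122 holds each user_id only once (otherwise the LEFT JOIN would inflate the counts):

```sql
-- One pass over response yields per-option counts plus per-poll subtotals.
-- all_votes counts every response; target_votes counts only responses
-- from users present in user_ids_122 (COUNT skips the NULLs produced by
-- the LEFT JOIN for everyone else).
SELECT r.poll_id,
       r.option_id,
       COUNT(*)         AS all_votes,
       COUNT(u.user_id) AS target_votes
FROM response AS r
LEFT JOIN user_ids_122 AS u ON u.user_id = r.user_id
WHERE r.poll_id < 61
GROUP BY r.poll_id, r.option_id WITH ROLLUP;
```

Rows where option_id is NULL carry the per-poll subtotals, and the single row where both poll_id and option_id are NULL is the grand total, so both levels of aggregation come out of one scan of response.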
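Here is a query that does what you need; it runs on Postgres.
To make MySQL do this, you will need to replace every "WITH foo AS (SELECT ...)" clause with a temporary table. That's easy. MySQL in-memory temporary tables are fast, so don't be afraid to use them: they let you reuse the results of earlier steps and save a lot of computation.
I generated random test data and it seems to work. Execution time is 0.3 seconds...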
WITH
-- users of interest : target group
uids AS (
SELECT DISTINCT user_id
FROM options
JOIN responses USING (option_id)
WHERE poll_id=22
),
-- votes of everyone and target group
votes AS (
SELECT poll_id, option_id, sum(all_votes) AS all_votes, sum(target_votes) AS target_votes
FROM (
SELECT option_id, count(*) AS all_votes, count(uids.user_id) AS target_votes
FROM responses
LEFT JOIN uids USING (user_id)
GROUP BY option_id
) v
JOIN options USING (option_id)
GROUP BY poll_id, option_id
),
-- totals for all polls (reuse previous result)
totals AS (
SELECT poll_id, sum(all_votes) AS all_votes, sum(target_votes) AS target_votes
FROM votes
GROUP BY poll_id
),
poll_options AS (
SELECT poll_id, count(*) AS poll_option_count
FROM options
GROUP BY poll_id
)
-- reuse previous tables to get some stats
SELECT *, ABS(total_percent - subgroup_percent) AS deviation
FROM (
SELECT
poll_id,
option_id,
v.target_votes / v.all_votes AS subgroup_percent,
t.target_votes / t.all_votes AS total_percent,
poll_option_count
FROM votes v
JOIN totals t USING (poll_id)
JOIN poll_options po USING (poll_id)
) AS foo
ORDER BY deviation DESC, poll_option_count DESC;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=14910.46..14910.56 rows=40 width=144) (actual time=299.844..299.862 rows=200 loops=1)
Sort Key: (abs(((t.target_votes / t.all_votes) - (v.target_votes / v.all_votes)))), po.poll_option_count
Sort Method: quicksort Memory: 52kB
CTE uids
-> HashAggregate (cost=1801.43..1850.52 rows=4909 width=4) (actual time=3.935..4.793 rows=4860 loops=1)
-> Nested Loop (cost=0.00..1789.16 rows=4909 width=4) (actual time=0.029..2.555 rows=4860 loops=1)
-> Seq Scan on options (cost=0.00..3.50 rows=5 width=4) (actual time=0.008..0.032 rows=5 loops=1)
Filter: (poll_id = 22)
-> Index Scan using responses_option_id_key on responses (cost=0.00..344.86 rows=982 width=8) (actual time=0.012..0.298 rows=972 loops=5)
Index Cond: (public.responses.option_id = public.options.option_id)
CTE votes
-> HashAggregate (cost=13029.43..13032.43 rows=200 width=24) (actual time=298.255..298.317 rows=200 loops=1)
-> Hash Join (cost=13019.68..13027.43 rows=200 width=24) (actual time=297.953..298.103 rows=200 loops=1)
Hash Cond: (public.responses.option_id = public.options.option_id)
-> HashAggregate (cost=13014.18..13017.18 rows=200 width=8) (actual time=297.839..297.879 rows=200 loops=1)
-> Merge Left Join (cost=399.13..11541.43 rows=196366 width=8) (actual time=9.301..230.467 rows=196366 loops=1)
Merge Cond: (public.responses.user_id = uids.user_id)
-> Index Scan using responses_pkey on responses (cost=0.00..8585.75 rows=196366 width=8) (actual time=0.015..121.971 rows=196366 loops=1)
-> Sort (cost=399.13..411.40 rows=4909 width=4) (actual time=9.281..22.044 rows=137645 loops=1)
Sort Key: uids.user_id
Sort Method: quicksort Memory: 420kB
-> CTE Scan on uids (cost=0.00..98.18 rows=4909 width=4) (actual time=3.937..6.549 rows=4860 loops=1)
-> Hash (cost=3.00..3.00 rows=200 width=8) (actual time=0.095..0.095 rows=200 loops=1)
-> Seq Scan on options (cost=0.00..3.00 rows=200 width=8) (actual time=0.007..0.043 rows=200 loops=1)
CTE totals
-> HashAggregate (cost=5.50..8.50 rows=200 width=68) (actual time=298.629..298.640 rows=40 loops=1)
-> CTE Scan on votes (cost=0.00..4.00 rows=200 width=68) (actual time=298.257..298.425 rows=200 loops=1)
CTE poll_options
-> HashAggregate (cost=4.00..4.50 rows=40 width=4) (actual time=0.091..0.101 rows=40 loops=1)
-> Seq Scan on options (cost=0.00..3.00 rows=200 width=4) (actual time=0.005..0.020 rows=200 loops=1)
-> Hash Join (cost=6.95..13.45 rows=40 width=144) (actual time=298.994..299.554 rows=200 loops=1)
Hash Cond: (t.poll_id = v.poll_id)
-> CTE Scan on totals t (cost=0.00..4.00 rows=200 width=68) (actual time=298.632..298.669 rows=40 loops=1)
-> Hash (cost=6.45..6.45 rows=40 width=84) (actual time=0.335..0.335 rows=200 loops=1)
-> Hash Join (cost=1.30..6.45 rows=40 width=84) (actual time=0.140..0.263 rows=200 loops=1)
Hash Cond: (v.poll_id = po.poll_id)
-> CTE Scan on votes v (cost=0.00..4.00 rows=200 width=72) (actual time=0.001..0.030 rows=200 loops=1)
-> Hash (cost=0.80..0.80 rows=40 width=12) (actual time=0.130..0.130 rows=40 loops=1)
-> CTE Scan on poll_options po (cost=0.00..0.80 rows=40 width=12) (actual time=0.093..0.119 rows=40 loops=1)
Total runtime: 300.132 ms
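The temporary-table translation for MySQL described above could look roughly like the following. This is only a sketch under assumptions: it uses the question's schema (response with poll_id, option_id, user_id, plus the existing user_ids_122 table with one row per user), the names tmp_votes and tmp_totals are made up for illustration, and it computes the percentages the way the question defines them rather than exactly as the Postgres query above does. MySQL 8.0 and later accept WITH directly, so there the temporary tables are optional.

```sql
-- Step 1: scan response once; votes per option for everyone and for the subgroup.
CREATE TEMPORARY TABLE tmp_votes ENGINE=MEMORY
SELECT r.poll_id,
       r.option_id,
       COUNT(*)         AS all_votes,
       COUNT(u.user_id) AS target_votes
FROM response AS r
LEFT JOIN user_ids_122 AS u ON u.user_id = r.user_id
WHERE r.poll_id < 61
GROUP BY r.poll_id, r.option_id;

-- Step 2: per-poll totals, derived from the small table instead of response.
CREATE TEMPORARY TABLE tmp_totals ENGINE=MEMORY
SELECT poll_id,
       SUM(all_votes)    AS all_votes,
       SUM(target_votes) AS target_votes
FROM tmp_votes
GROUP BY poll_id;

-- Step 3: combine both levels. Each temporary table appears only once,
-- because MySQL cannot reference a TEMPORARY table twice in one query.
SELECT v.poll_id,
       v.option_id,
       v.all_votes                     AS option_count,
       v.target_votes / t.target_votes AS subgroup_percent,
       v.all_votes / t.all_votes       AS total_percent,
       ABS(v.all_votes / t.all_votes
           - v.target_votes / t.target_votes) AS deviation
FROM tmp_votes AS v
JOIN tmp_totals AS t ON t.poll_id = v.poll_id
ORDER BY deviation DESC, option_count DESC;
```

For polls in which nobody from the subgroup voted, t.target_votes is 0, the division yields NULL, and those rows sort last under ORDER BY deviation DESC.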
Answer 1 (score: 0)
Try breaking it up into bite-sized chunks:
-- Compute the average you're looking for.
select ..., agg1, agg2, avg(...)
from (
-- Use max() to merge the retrieved aggregates as individual rows.
-- (This will be faster than joins if you're dealing with tons of rows.)
select ..., max(agg1) as agg1, max(agg2) as agg2, ...
from (
-- Compute individual aggregates without nested loops.
select ..., count(*) as agg1, null as agg2, ...
from ...
where ...
group by ...
union all
select ..., null as agg1, count(*) as agg2, ...
from ...
where ...
group by ...
union all
...
) as aggs
group by ...
) as rows
group by ...
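If it's still slow after that (and I suspect it will be), consider maintaining the intermediate results with triggers (if the query is used all the time), or consider using temporary tables (if it's a one-off query that only gets run every so often).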
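A minimal sketch of the trigger approach, assuming the question's response table; the summary table option_vote_count and the trigger name are hypothetical, and only inserts are handled here (deletes or updates on response would need their own triggers):

```sql
-- Summary table: one row per option with its running vote count.
CREATE TABLE option_vote_count (
    option_id INT NOT NULL PRIMARY KEY,
    poll_id   INT NOT NULL,
    votes     INT NOT NULL DEFAULT 0
);

-- Backfill from the existing responses before the trigger takes over.
INSERT INTO option_vote_count (option_id, poll_id, votes)
SELECT option_id, poll_id, COUNT(*)
FROM response
GROUP BY option_id, poll_id;

-- Keep the count current on every new response.
CREATE TRIGGER response_after_insert
AFTER INSERT ON response
FOR EACH ROW
    INSERT INTO option_vote_count (option_id, poll_id, votes)
    VALUES (NEW.option_id, NEW.poll_id, 1)
    ON DUPLICATE KEY UPDATE votes = votes + 1;
```

With such a table in place, the "total" side of the comparison reads about 130 rows instead of scanning all of response; the subgroup counts still depend on which option defines the subgroup, so they would be computed per request.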
-
Update following the comments. For example:
(SELECT COUNT(*) FROM response WHERE option_id = o_id) /
(SELECT COUNT(*) FROM response WHERE poll_id = p_id) as total_percent
would be rewritten as:
SELECT [fields you need],
MAX(total_reponses_by_option_id) / MAX(total_reponses_by_poll_id) as total_percent
FROM (
SELECT [fields you need],
COUNT(*) as total_reponses_by_option_id,
NULL as total_reponses_by_poll_id
FROM response
[join/where as needed]
GROUP BY [fields you need]
UNION ALL
SELECT [fields you need],
NULL as total_reponses_by_option_id,
COUNT(*) as total_reponses_by_poll_id
FROM response
[join/where as needed]
GROUP BY [fields you need]
) as agg
GROUP BY [fields you need];
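To make the pattern concrete, here is one possible instance with the placeholders filled in. It is a sketch under assumptions: it uses the question's response and user_ids_122 tables, and it merges two aggregates that share the same grouping key (option_id), which is the case where the MAX() merge lines up directly.

```sql
-- Each UNION ALL branch computes one aggregate and leaves the other NULL;
-- the outer MAX() collapses the two rows per option into one, since MAX
-- ignores NULLs.
SELECT option_id,
       MAX(all_votes)                 AS all_votes,
       COALESCE(MAX(target_votes), 0) AS target_votes
FROM (
    -- votes per option from everyone
    SELECT option_id,
           COUNT(*) AS all_votes,
           NULL     AS target_votes
    FROM response
    WHERE poll_id < 61
    GROUP BY option_id
    UNION ALL
    -- votes per option from the subgroup only
    SELECT r.option_id,
           NULL,
           COUNT(*)
    FROM response AS r
    JOIN user_ids_122 AS u ON u.user_id = r.user_id
    WHERE r.poll_id < 61
    GROUP BY r.option_id
) AS agg
GROUP BY option_id;
```

The COALESCE covers options that nobody in the subgroup chose, for which the second branch produces no row at all.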
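Answer 2 (score: 0)
I think the messiness of your query is making this harder than it should be. I may be close rather than exactly right, but I'll try to walk through what I'm doing. First, it appears you need a denominator on a per-poll basis... so my first query does just that: how many responses per poll (grouped by poll).
Next, you want to know how many answers were given for each option within each poll. That's what I did with the second query (grouped by poll AND option).
Since you are dealing with statistics, it doesn't matter who answered what in the response table. Who cares what their names are... In the second query I don't care about the option description either, only the counts.
Now that "PreQuery 1" and "PreQuery 2" are done, I can join 1 to 2 on the common Poll_ID, and then join 2 to the option table to get the description needed for the final analysis.
As for the aggregates, at the end of the joins you would end up with something like
(Result from PreQuery 1 on just the poll counts)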
Poll  Count
1     50
2     30
(Result from PreQuery 2 on poll AND Option)
Poll  Option  Count
1     1       30
1     2       12
1     3       5
1     4       3
2     5       8
2     6       12
2     7       10
Final join should have
Poll  Option  Description  PerPollAndOption  SubGroup_Percent  PerPollResponse
1     1       Descrip 1    30                .60               50
1     2       Descrip 2    12                .24               50
1     3       Descrip 3    5                 .10               50
1     4       Descrip 4    3                 .06               50
2     5       Descrip 5    8                 .27               30
2     6       Descrip 6    12                .40               30
2     7       Descrip 7    10                .33               30
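So, for the final sorting, grouping, etc., you should be able to simplify a great deal with all the numbers provided directly here. There is no need to go to the users table, as stated earlier. If I'm missing something important, let me know... maybe this solution helps simplify whatever is left...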
SELECT
      ByPoll.Poll_ID,
      ByPollOption.Option_ID,
      opt.Description,
      ByPollOption.PerPollAndOption,
      ByPollOption.PerPollAndOption / ByPoll.PerPollResponse as SubGroup_Percent,
      ByPoll.PerPollResponse
   FROM
      -- PreQuery 1: responses per poll
      ( select
              Poll_ID,
              COUNT(*) as PerPollResponse
           from
              response
           where
              Poll_ID < '61'
           group by
              Poll_ID ) ByPoll
      -- PreQuery 2: responses per poll AND option
      JOIN ( select r.Poll_ID,
                    r.Option_ID,
                    COUNT(*) as PerPollAndOption
                from
                   response r
                      join `option` o
                         ON r.Option_ID = o.id
                where
                   r.Poll_ID < '61'
                group by
                   r.Poll_ID,
                   r.Option_ID ) ByPollOption
         ON ByPoll.Poll_ID = ByPollOption.Poll_ID
      -- option is a reserved word in MySQL, so the table name is backtick-quoted
      JOIN `option` opt
         ON ByPollOption.Option_ID = opt.id