我有四张MySQL表:
用户(身份证,姓名)
民意调查(身份证明,文字)
选项(id,poll_id,text)
回复(id,poll_id,option_id,user_id)
鉴于特定的民意调查和特定选项,我想生成一个表格,显示其他民意调查中哪些选项的关联性最强。
假设这是我们的数据集:
TABLE users:
+------+-------+
| id | name |
+------+-------+
| 1 | Abe |
| 2 | Bob |
| 3 | Che |
| 4 | Den |
+------+-------+
TABLE polls:
+------+-----------------------+
| id | text |
+------+-----------------------+
| 1 | Do you like apples? |
| 2 | What is your gender? |
| 3 | What is your height? |
| 4 | Do you like polls? |
+------+-----------------------+
TABLE options:
+------+----------+---------+
| id | poll_id | text |
+------+----------+---------+
| 1 | 1 | Yes |
| 2 | 1 | No |
| 3 | 2 | Male |
| 4 | 2 | Female |
| 5 | 3 | Short |
| 6 | 3 | Tall |
| 7 | 4 | Yes |
| 8 | 4 | No |
+------+----------+---------+
TABLE responses:
+------+----------+------------+----------+
| id | poll_id | option_id | user_id |
+------+----------+------------+----------+
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 3 |
| 4 | 1 | 2 | 4 |
| 5 | 2 | 3 | 1 |
| 6 | 2 | 3 | 2 |
| 7 | 2 | 3 | 3 |
| 8 | 2 | 4 | 4 |
| 9 | 3 | 5 | 1 |
| 10 | 3 | 6 | 2 |
| 10 | 3 | 5 | 3 |
| 10 | 3 | 6 | 4 |
| 10 | 4 | 7 | 1 |
| 10 | 4 | 7 | 2 |
| 10 | 4 | 7 | 3 |
| 10 | 4 | 7 | 4 |
+------+----------+------------+----------+
鉴于轮询ID 1和选项ID 2,生成的表应该是这样的:
+----------+------------+-----------------------+
| poll_id | option_id | percent_correlated |
+----------+------------+-----------------------+
| 4 | 7 | 100 |
| 2 | 3 | 66.66 |
| 3 | 6 | 66.66 |
| 2 | 4 | 33.33 |
| 3 | 5 | 33.33 |
| 4 | 8 | 0 |
+----------+------------+-----------------------+
基本上,我们确定了所有响应投票ID 1和所选选项ID 2的用户,我们正在查看所有其他民意调查,看看他们中有多少百分比也选择了其他选项。
答案 0 :(得分:1)
没有方便测试的实例,你能看出这是否得到了正确的结果:
select
poll_id,
option_id,
((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
n
from
(
select
poll_id,
option_id,
SUM(score) AS sum1,
SUM(score_rev) AS sum2,
SUM(score * score) AS sum1sq,
SUM(score_rev * score_rev) AS sum2sq,
SUM(score * score_rev) AS psum,
COUNT(*) AS n
from
(
select
responses.poll_id,
responses.option_id,
CASE
WHEN user_resp.user_id IS NULL THEN SELECT 0
ELSE SELECT 1
END CASE as score,
CASE
WHEN user_resp.user_id IS NULL THEN SELECT 1
ELSE SELECT 0
END CASE as score_rev,
from responses left outer join
(
select
user_id
from
responses
where
poll_id = 1 and
option_id = 2
)user_resp
ON (user_resp.user_id = responses.user_id)
) temp1
group by
poll_id,
option_id
)components
答案 1 :(得分:1)
经过几个小时的反复试验,我设法将一个正常运行的查询放在一起:
SELECT poll_id AS p_id,
option_id AS o_id,
COUNT(*) AS optCount,
(SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /
(SELECT COUNT(*) FROM response WHERE poll_id = p_id AND user_id IN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2'))
AS percentage
FROM response
INNER JOIN
(SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids
ON response.user_id = user_ids.user_id
WHERE poll_id != '1'
GROUP BY option_id DESC
ORDER BY percentage DESC, optCount DESC
基于具有小数据集的测试,此查询看起来相当快,但我想修改它以使“IN”子查询不重复三次。有什么建议吗?
答案 2 :(得分:1)
这似乎给了我正确的结果:
select poll_stats.poll_id,
option_stats.option_id,
(100 * option_responses / poll_responses) as percent_correlated
from (select response.poll_id,
count(*) as poll_responses
from response selecting_response
join response on response.user_id = selecting_response.user_id
where selecting_response.poll_id = 1 and selecting_response.option_id = 2
group by response.poll_id) poll_stats
join (select options.poll_id,
options.id as option_id,
count(response.id) as option_responses
from options
left join response on response.poll_id = options.poll_id
and response.option_id = options.id
and exists (
select 1 from response selecting_response
where selecting_response.user_id = response.user_id
and selecting_response.poll_id = 1
and selecting_response.option_id = 2)
group by options.poll_id, options.id
) as option_stats
on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc