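I have a table containing data for many seq_ids. Each seq_id has many hits (hit_name_id), one per row. What I want to do is group the seqs whose hits are similar (i.e. they share roughly 70-80% of their hits). For example, in the table below, sequences 1, 2 and 4 are actually very similar, so they are most likely the same. I would like to assign a group ID to all similar matches so that I can later extract the unique seqs.
I wrote this query to show that each seq_id can have many hits, which may or may not be shared: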
mysql> SELECT seq_id,GROUP_CONCAT(hit_name_id ORDER BY hit_name_id), count(hit_name_id) FROM polished_data
-> GROUP BY seq_id;
+--------+------------------------------------------------+--------------------+
| seq_id | GROUP_CONCAT(hit_name_id ORDER BY hit_name_id) | count(hit_name_id) |
+--------+------------------------------------------------+--------------------+
|      1 | 4,5,6,9,10,14,19,20,21                         |                  9 |
|      2 | 4,6,9,10,14,18,19,20,21                        |                  9 |
|      3 | 6,12,13,14,18,20                               |                  6 |
|      4 | 4,7,8,11,14,18,19,20,21                        |                  9 |
|      5 | 1,2,3,15,16,17,32                              |                  7 |
+--------+------------------------------------------------+--------------------+
I'm not sure whether I can do this in MySQL, or whether I need to write this step in my linked program.
Answer 0 (score: 1)
This counts, for every other seq_id, how many hits it shares with a given seq_id.
SELECT seq_id, COUNT(*) AS same
FROM polished_data
WHERE
    hit_name_id IN (SELECT hit_name_id FROM polished_data WHERE seq_id = ###)
    AND seq_id != ###
GROUP BY seq_id
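You can then extend this to also count how many hits differ (hits that occur in one of the two seqs but not in both), and join the results together.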
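SELECT *, (same / (same + diff)) AS similarity
FROM
(
    SELECT
        t.seq_id,
        -- seq_ids with no shared hits have no row in s, so default to 0
        COALESCE(s.same, 0) AS same,
        -- hits only in this seq_id plus hits only in the reference seq_id
        ((t.total - COALESCE(s.same, 0)) + (ct.total - COALESCE(s.same, 0))) AS diff
    FROM
        -- total hits per seq_id
        (SELECT seq_id, COUNT(*) AS total FROM polished_data
         GROUP BY seq_id) AS t
    LEFT JOIN
        -- hits each seq_id shares with the reference seq_id
        (SELECT seq_id, COUNT(*) AS same
         FROM polished_data
         WHERE
             hit_name_id IN
                 (SELECT hit_name_id FROM polished_data
                  WHERE seq_id = ###)
         GROUP BY seq_id) AS s
        ON t.seq_id = s.seq_id
    CROSS JOIN
        -- total hits of the reference seq_id (a single row)
        (SELECT COUNT(*) AS total FROM polished_data
         WHERE seq_id = ###) AS ct
) AS result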
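To sanity-check the arithmetic, here is a minimal Python sketch that applies the same same/diff/similarity calculation to the five sample rows from the question. The `hits` dictionary and the `compare` helper are names I made up for illustration; the data is copied by hand from the GROUP_CONCAT output, and I'm assuming each hit_name_id appears at most once per seq_id.

```python
# Hits per seq_id, copied from the GROUP_CONCAT output in the question.
hits = {
    1: {4, 5, 6, 9, 10, 14, 19, 20, 21},
    2: {4, 6, 9, 10, 14, 18, 19, 20, 21},
    3: {6, 12, 13, 14, 18, 20},
    4: {4, 7, 8, 11, 14, 18, 19, 20, 21},
    5: {1, 2, 3, 15, 16, 17, 32},
}


def compare(ref_id, other_id):
    """Return (same, diff, similarity) with the same arithmetic as the SQL."""
    ref, other = hits[ref_id], hits[other_id]
    same = len(ref & other)  # hits present in both seqs
    diff = (len(other) - same) + (len(ref) - same)  # hits present in only one
    return same, diff, same / (same + diff)


# Compare every seq_id against seq_id 1 (the "###" in the SQL).
rows = {seq_id: compare(1, seq_id) for seq_id in hits}
for seq_id, (same, diff, similarity) in rows.items():
    print(f"{seq_id}\t{same}\t{diff}\t{similarity:.4f}")
```

On this sample, seq 2 shares 8 hits with seq 1 and differs in 2, giving 0.8. Seq 4 shares only 5 hits with seq 1 and scores about 0.38, so by this measure it would not reach a 70-80% cutoff against seq 1, even though the question describes it as similar.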
With random data you get results like the following (tested with ### replaced by 1).
+--------+------+------+------------+
| seq_id | same | diff | similarity |
+--------+------+------+------------+
|      1 |   22 |    0 |     1.0000 |
|      2 |    4 |   45 |     0.0816 |
|      3 |    5 |   57 |     0.0806 |
|      4 |    8 |   34 |     0.1905 |
|      5 |    9 |   47 |     0.1607 |
|      6 |    3 |   36 |     0.0769 |
|      7 |    7 |   45 |     0.1346 |
|      8 |    3 |   48 |     0.0588 |
|      9 |    9 |   46 |     0.1636 |
|     10 |    4 |   48 |     0.0769 |
+--------+------+------+------------+
Change ### in the SQL above to the seq_id you want to compare against.
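The SQL above only compares everything against one seq_id at a time. If you end up doing the grouping step in your own program instead, one possible approach is sketched below: compute the same similarity for every pair and merge pairs above a cutoff into one group. This is only a sketch under my own assumptions: the names `jaccard` and `assign_groups` are made up, the 0.7 cutoff is taken from the "70-80%" in the question, and groups are merged transitively (if A matches B and B matches C, all three share a group), which may or may not be what you want.

```python
from itertools import combinations

# Hits per seq_id, copied from the GROUP_CONCAT output in the question.
hits = {
    1: {4, 5, 6, 9, 10, 14, 19, 20, 21},
    2: {4, 6, 9, 10, 14, 18, 19, 20, 21},
    3: {6, 12, 13, 14, 18, 20},
    4: {4, 7, 8, 11, 14, 18, 19, 20, 21},
    5: {1, 2, 3, 15, 16, 17, 32},
}


def jaccard(a, b):
    """Shared hits divided by all distinct hits, i.e. same / (same + diff)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_groups(hits, threshold=0.7):
    """Map each seq_id to a group ID (the smallest seq_id in its group)."""
    parent = {seq_id: seq_id for seq_id in hits}

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sorted(hits), 2):
        if jaccard(hits[a], hits[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller seq_id as the root so group IDs are stable.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {seq_id: find(seq_id) for seq_id in hits}


groups = assign_groups(hits)
print(groups)
```

With the 0.7 cutoff only seqs 1 and 2 end up together on the sample data. Calling `assign_groups(hits, threshold=0.5)` also pulls in seq 4 (it scores 0.5 against seq 2), which gives the 1, 2, 4 grouping described in the question, so the cutoff probably needs tuning against real data. Comparing all pairs is quadratic in the number of seq_ids, which may matter for a large table.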