Group rows by ID, then compare the groups with each other

Time: 2011-07-29 13:08:21

Tags: mysql

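I have a table that contains data for many seq_ids. Each seq_id has many hits (hit_name_id), each on its own row. What I want to do is group the seqs whose hits are similar (i.e. they share roughly 70-80% of their hits). For example, in the table below, sequences 1, 2 and 4 are actually very similar, so they are most likely the same thing. I would like to assign a group ID to all similar matches so that I can later extract the unique seqs.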

I wrote this query to show that each seq_id can have many hits, which may or may not be shared with other seq_ids:

mysql> SELECT seq_id,GROUP_CONCAT(hit_name_id ORDER BY hit_name_id), count(hit_name_id) FROM polished_data
    -> GROUP BY seq_id;
+--------+------------------------------------------------+--------------------+
| seq_id | GROUP_CONCAT(hit_name_id ORDER BY hit_name_id) | count(hit_name_id) |
+--------+------------------------------------------------+--------------------+
|      1 | 4,5,6,9,10,14,19,20,21                         |                  9 |
|      2 | 4,6,9,10,14,18,19,20,21                        |                  9 |
|      3 | 6,12,13,14,18,20                               |                  6 |
|      4 | 4,7,8,11,14,18,19,20,21                        |                  9 |
|      5 | 1,2,3,15,16,17,32                              |                  7 |
+--------+------------------------------------------------+--------------------+

I am not sure whether I can do this in MySQL, or whether I need to write this step in my linking program.

1 answer:

Answer 0: (score: 1)

This counts, for every other seq_id, how many hits it shares with a given seq_id (### stands for that seq_id).

SELECT seq_id, COUNT(*) AS same
FROM polished_data
WHERE 
    hit_name_id IN (SELECT hit_name_id FROM polished_data WHERE seq_id = ###) 
    AND seq_id != ### 
GROUP BY seq_id

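You can then extend this to also count how many hits differ (hits that appear in one of the two sequences but not in both), and join the results together.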

SELECT *, (same/(same+diff)) AS similarity   
FROM
(
    SELECT 
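        -- take seq_id from t and default same to 0, so sequences that
        -- share no hits are kept instead of becoming NULL rows
        t.seq_id, 
        COALESCE(s.same, 0) AS same,
        ((t.total - COALESCE(s.same, 0)) + (ct.total - COALESCE(s.same, 0))) AS diff 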

    FROM 

        (SELECT seq_id, COUNT(*) as total FROM polished_data
         GROUP BY seq_id) AS t  

    LEFT JOIN

        (SELECT seq_id, COUNT(*) AS same
         FROM polished_data
         WHERE 
             hit_name_id IN 
                 (SELECT hit_name_id FROM polished_data 
                  WHERE seq_id = ###) 
         GROUP BY seq_id) AS s

    ON t.seq_id = s.seq_id

    JOIN

        (SELECT COUNT(*) as total FROM polished_data
         WHERE seq_id = ###) AS ct  

) as result   

With random data this gives a result like the following (tested with ### replaced by 1).

+--------+------+------+------------+
| seq_id | same | diff | similarity |
+--------+------+------+------------+
|      1 |   22 |    0 |     1.0000 |
|      2 |    4 |   45 |     0.0816 |
|      3 |    5 |   57 |     0.0806 |
|      4 |    8 |   34 |     0.1905 |
|      5 |    9 |   47 |     0.1607 |
|      6 |    3 |   36 |     0.0769 |
|      7 |    7 |   45 |     0.1346 |
|      8 |    3 |   48 |     0.0588 |
|      9 |    9 |   46 |     0.1636 |
|     10 |    4 |   48 |     0.0769 |
+--------+------+------+------------+

Change ### in the SQL above to the seq_id you want to compare against.
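If the grouping step ends up in application code rather than in MySQL, the same same/(same+diff) ratio can be computed for every pair of seq_ids and then used to assign groups. The following is only a rough sketch, not a definitive implementation: the hit sets are copied by hand from the GROUP_CONCAT output in the question, and the grouping rule (link two seq_ids when their similarity reaches a threshold, then merge everything that is linked) is an assumption of mine, as are the names `similarity` and `group_sequences`.

```python
from itertools import combinations

# Hit sets copied from the GROUP_CONCAT output in the question.
hits = {
    1: {4, 5, 6, 9, 10, 14, 19, 20, 21},
    2: {4, 6, 9, 10, 14, 18, 19, 20, 21},
    3: {6, 12, 13, 14, 18, 20},
    4: {4, 7, 8, 11, 14, 18, 19, 20, 21},
    5: {1, 2, 3, 15, 16, 17, 32},
}


def similarity(a, b):
    """same / (same + diff), the ratio used in the SQL above."""
    same = len(a & b)   # hits present in both sets
    diff = len(a ^ b)   # hits present in exactly one of the two sets
    return same / (same + diff) if (same + diff) else 0.0


def group_sequences(hits, threshold):
    """Assumed rule: link seq_ids whose similarity >= threshold,
    then merge all linked seq_ids into one group (union-find)."""
    parent = {s: s for s in hits}

    def find(s):
        # Follow parent links to the group's root, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(sorted(hits), 2):
        if similarity(hits[a], hits[b]) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for s in sorted(hits):
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


print(similarity(hits[1], hits[2]))   # 8 shared, 2 differing -> 0.8
print(group_sequences(hits, 0.7))     # [[1, 2], [3], [4], [5]]
print(group_sequences(hits, 0.5))     # [[1, 2, 4], [3], [5]]
```

With this measure, sequences 1 and 2 score 0.8, but sequence 4 only reaches 0.5 against sequence 2, so it joins the group only if the threshold is lowered to 0.5. Because linked seq_ids are merged transitively, a group can contain members that are not directly similar to each other; whether that is acceptable depends on your data.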