Best algorithm for finding the most frequent combinations of values in a dataset

Time: 2017-03-07 13:32:03

Tags: php sql data-science

----------------------------------------
   ColumnA  |  ColumnB      | ColumnC  | 
----------------------------------------
      Cat   |     Shirt     |   Pencil | 
      Dog   |     Shirt     |   Eraser | 
      Worm  |     Dress     |   Pen    | 
      Cow   |     Shirt     |   Pen    | 
      Cat   |     Shirt     |   Pen    | 
      Cat   |     Jacket    |   Pen    | 
      Cow   |     Shirt     |   Pen    | 
      Cat   |     Shirt     |   Pen    | 
      Cat   |     Jacket    |   Pen    | 
      Cow   |     Shirt     |   Pen    | 
      Cat   |     Shirt     |   Pen    | 
      Cat   |     Jacket    |   Pen    | 

Given the sample data above, I am trying to find the most frequently recurring combinations of two or more values.

For example:

Shirt,Pen 6
Cat,Pen 6
Cat,Shirt 4
Jacket,Pen 3
Pen,Cow 3
Cat,Shirt,Pen 3
Cat,Jacket,Pen 3
Cow,Shirt,Pen 3

I need this to work for up to 10 columns of data.

Cat,Shirt is the same as Shirt,Cat.

What is the best algorithm to use? Preferably in SQL, but I could also try PHP.

2 answers:

Answer 0: (score: 3)

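You can do this in SQL by identifying each row and adding a "null" element. Note: this assumes that the values in each column are distinct, or at least interchangeable (not tied to the first column).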

I also assume that each row has a unique id:

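-- Unpivot each row into (id, value) pairs, plus one NULL placeholder per row
with t as (
      select d.id, v.col
      from data d outer apply
           (values (d.col1), (d.col2), (d.col3), (NULL)) v(col)
     )
select t1.col, t2.col, t3.col, count(*) as cnt
from t t1 join
     t t2
     -- t2.col > t1.col keeps each pair once, in alphabetical order, and drops NULLs
     on t1.id = t2.id and t2.col > t1.col join
     t t3
     -- a NULL third element marks a combination of size 2
     on t1.id = t3.id and (t3.col > t2.col or t3.col is null)
group by t1.col, t2.col, t3.col
order by count(*) desc;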
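
To cross-check the query on a small sample, or for the up-to-10-columns case where the self-joins become unwieldy, the same idea (put each row's values in a fixed order, then count every subset of size 2 or more) can be sketched outside SQL. This is a minimal Python sketch and not part of the original answer; the function name `count_combinations` and the in-memory `rows` list are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def count_combinations(rows, min_size=2):
    """Count how often each unordered combination of values shares a row."""
    counts = Counter()
    for row in rows:
        # Sorting (and de-duplicating) the values makes ("Cat", "Shirt")
        # and ("Shirt", "Cat") map to the same key.
        values = sorted({v for v in row if v is not None})
        for size in range(min_size, len(values) + 1):
            counts.update(combinations(values, size))
    return counts


# The sample data from the question.
rows = [
    ("Cat", "Shirt", "Pencil"),
    ("Dog", "Shirt", "Eraser"),
    ("Worm", "Dress", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
    ("Cat", "Jacket", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
    ("Cat", "Jacket", "Pen"),
    ("Cow", "Shirt", "Pen"),
    ("Cat", "Shirt", "Pen"),
    ("Cat", "Jacket", "Pen"),
]

counts = count_combinations(rows)
for combo, n in counts.most_common(8):
    print(",".join(combo), n)
```

Each row with k distinct values contributes 2^k - k - 1 combinations, so a 10-column row yields at most 1013 keys. That stays manageable per row, though the total number of distinct keys depends on how varied the data is.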

Answer 1: (score: 3)

This could be one way:

SELECT c1, c2, c3, count(*) AS cnt FROM (
    -- one branch per pair of columns; c3 is NULL for combinations of size 2
    SELECT ColumnA AS c1,  ColumnB AS c2, NULL AS c3 FROM your_table
    UNION ALL
    SELECT ColumnA AS c1,  ColumnC AS c2, NULL AS c3 FROM your_table
    UNION ALL
    SELECT ColumnB AS c1,  ColumnC AS c2, NULL AS c3 FROM your_table
    UNION ALL
    -- the single combination of all three columns
    SELECT ColumnA AS c1,  ColumnB AS c2, ColumnC AS c3 FROM your_table
) tt
GROUP BY c1, c2, c3
ORDER BY count(*) DESC;
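
Writing one UNION ALL branch per column subset by hand gets tedious as the column count grows, so the query text could be generated instead. The following is a minimal Python sketch under that assumption; `build_union_query` is a hypothetical helper, not something from the original answer, and it interpolates identifiers directly, so it should only be given trusted table and column names.

```python
from itertools import combinations


def build_union_query(table, columns, min_size=2):
    """Build a UNION ALL query with one branch per subset of columns."""
    width = len(columns)
    aliases = [f"c{i + 1}" for i in range(width)]
    branches = []
    for size in range(min_size, width + 1):
        for subset in combinations(columns, size):
            # Pad unused output slots with NULL so every branch has the
            # same number of columns.
            padded = list(subset) + ["NULL"] * (width - size)
            select_list = ", ".join(
                f"{expr} AS {alias}" for expr, alias in zip(padded, aliases)
            )
            branches.append(f"    SELECT {select_list} FROM {table}")
    inner = "\n    UNION ALL\n".join(branches)
    alias_list = ", ".join(aliases)
    return (
        f"SELECT {alias_list}, COUNT(*) AS cnt FROM (\n"
        f"{inner}\n"
        f") tt\n"
        f"GROUP BY {alias_list}\n"
        f"ORDER BY cnt DESC"
    )


sql = build_union_query("your_table", ["ColumnA", "ColumnB", "ColumnC"])
print(sql)
```

For 10 columns this produces 1013 branches, which is a large statement to hand to a database. Like the hand-written version, it also keeps each value in its source column's position, so Cat,Shirt and Shirt,Cat only collapse into one group if a value never appears in more than one column.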