Question

My data contains an ID and several columns of integer values ranging from -5 to 5, including 0.

╔════╦══════╦══════╦══════╦══════╗
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║
╠════╬══════╬══════╬══════╬══════╣
║ A  ║   -5 ║   -2 ║    0 ║   -2 ║
║ B  ║    0 ║    1 ║   -1 ║    3 ║
║ C  ║    1 ║   -2 ║   -3 ║    1 ║
║ D  ║   -1 ║   -1 ║    5 ║    0 ║
║ E  ║    2 ║   -3 ║    1 ║   -2 ║
║ F  ║   -3 ║    1 ║   -2 ║   -1 ║
║ G  ║   -4 ║   -1 ║   -1 ║   -3 ║
╚════╩══════╩══════╩══════╩══════╝

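I would like to group the IDs by their similarity across all columns. For example, IDs A and G should be rated as highly similar, because their values in each column are very close.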

╔════╦══════╦══════╦══════╦══════╗
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║
╠════╬══════╬══════╬══════╬══════╣
║ A  ║   -5 ║   -2 ║    0 ║   -2 ║
║ G  ║   -4 ║   -1 ║   -1 ║   -3 ║
╚════╩══════╩══════╩══════╩══════╝

On the other hand, A and B are quite different:

╔════╦══════╦══════╦══════╦══════╗
║ ID ║ COL1 ║ COL2 ║ COL3 ║ COL4 ║
╠════╬══════╬══════╬══════╬══════╣
║ A  ║   -5 ║   -2 ║    0 ║   -2 ║
║ B  ║    0 ║    1 ║   -1 ║    3 ║
╚════╩══════╩══════╩══════╩══════╝

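For a given pair of IDs, I am thinking of computing the difference in each column and then adding up the differences to get a similarity score (a larger number means less similar). That is my best idea so far, but I am open to more accurate or more efficient approaches.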

Answer 1

One way to do this (using the absolute difference between the column values) is as follows:

with all_compared as (
  select a.id as ID,
         b.id as CompID,
         abs(a.col1 - b.col1) + abs(a.col2 - b.col2) + abs(a.col3 - b.col3) + abs(a.col4 - b.col4) as TotalDiff
    from stuff a,
         stuff b
   where a.id != b.id
  ),
  ranked_data as (
  select ID, 
         CompID,
         TotalDiff,
         rank() over (partition by ID order by TotalDiff) Rnk 
    from all_compared
  )
select *
  from ranked_data
 where rnk = 1;

I made a SQL Fiddle that demonstrates how I worked through this step by step: http://sqlfiddle.com/#!4/fef06/14

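You then need to decide how to handle ties, because rank() gives the same rank to rows with equal totals, so the output can contain more than one match for an ID.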

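This uses a Cartesian product (every row of one table joined to every row of the other) in the form of a self-join, so that each row is compared with every other row, and it sums the absolute differences between col1, col2 and so on. We then rank the pairs by this total difference, smallest first, and select the top rank.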
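To see what the query computes without a database, the same pairwise comparison can be sketched in Python. This is only an illustration on the sample table from the question; the names `rows`, `total_diff`, `all_compared` and `closest` are made up for this sketch and are not part of the original answer.

```python
from itertools import permutations

# Sample rows from the question: ID -> (COL1, COL2, COL3, COL4)
rows = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "C": (1, -2, -3, 1),
    "D": (-1, -1, 5, 0),
    "E": (2, -3, 1, -2),
    "F": (-3, 1, -2, -1),
    "G": (-4, -1, -1, -3),
}

def total_diff(x, y):
    """Sum of absolute column differences, like TotalDiff in the query."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Every ordered pair of distinct IDs, like the self-join with a.id != b.id
all_compared = {
    (i, j): total_diff(rows[i], rows[j]) for i, j in permutations(rows, 2)
}

def closest(id_):
    """All CompIDs sharing the smallest TotalDiff for id_ (rank 1, ties kept)."""
    scores = {j: d for (i, j), d in all_compared.items() if i == id_}
    best = min(scores.values())
    return sorted(j for j, d in scores.items() if d == best), best

for id_ in rows:
    print(id_, *closest(id_))
```

On this data A and G pick each other with a total of 4, while B ties between C and F at 8 and E ties between A and C at 9, which is exactly the situation the ranking has to deal with.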

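Another approach is to use the squared distance instead of the absolute difference. This amplifies larger differences, so you need to consider whether that is what you want.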

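For example, a pair of rows with values 1 and 1 in one column and 0 and 5 in another scores 25, because (0-5)^2 is 25. That counts as less similar than a pair with values 0 and 3 in one column and -4 and -1 in the other, which scores 18 (3^2 + 3^2). With the absolute difference, the first pair would count as more similar (5 versus 6), because all differences are given the same weight.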
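The arithmetic in that example can be checked with a few lines of Python. This is a small sketch; `pair1`, `pair2`, `abs_total` and `sq_total` are names chosen here for illustration only.

```python
# Each tuple holds the per-column differences between two rows.
# Pair 1: values 1 vs 1 in one column, 0 vs 5 in the other.
# Pair 2: values 0 vs 3 in one column, -4 vs -1 in the other.
pair1 = (1 - 1, 0 - 5)
pair2 = (0 - 3, -4 - (-1))

def abs_total(diffs):
    """Sum of absolute differences: every unit of difference weighs the same."""
    return sum(abs(d) for d in diffs)

def sq_total(diffs):
    """Sum of squared differences: large differences are amplified."""
    return sum(d * d for d in diffs)

print(abs_total(pair1), abs_total(pair2))  # 5 6   -> pair 1 is closer
print(sq_total(pair1), sq_total(pair2))    # 25 18 -> pair 2 is closer
```

The two measures disagree on which pair is closer, so the choice depends on whether one large gap should count for more than several small ones.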

The squared-distance version is:

with all_compared as (
  select a.id as ID,
         b.id as CompID,
         power(a.col1 - b.col1, 2) + 
           power(a.col2 - b.col2, 2) + 
           power(a.col3 - b.col3, 2) + 
           power(a.col4 - b.col4, 2) as SqDist
    from stuff a,
         stuff b
   where a.id != b.id
  ),
  ranked_data as (
  select ID, 
         CompID,
         SqDist,
         rank() over (partition by ID order by SqDist) Rnk 
    from all_compared
  )
select *
  from ranked_data
 where rnk = 1;


Alternatively, you can use both, and use the squared distance only to break ties:

with all_compared as (
  select a.id as ID,
         b.id as CompID,
         abs(a.col1 - b.col1) + abs(a.col2 - b.col2) + abs(a.col3 - b.col3) + abs(a.col4 - b.col4) as TotalDiff,
         power(a.col1 - b.col1, 2) + 
           power(a.col2 - b.col2, 2) + 
           power(a.col3 - b.col3, 2) + 
           power(a.col4 - b.col4, 2) as SqDist
    from stuff a,
         stuff b
   where a.id != b.id
  ),
  ranked_data as (
  select ID, 
         CompID,
         TotalDiff,
         SqDist,
         rank() over (partition by ID order by TotalDiff, SqDist) Rnk 
    from all_compared
  )
select *
  from ranked_data
 where rnk = 1;
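As a rough cross-check of the tie-breaking idea, the ordering by TotalDiff and then SqDist can be mimicked in Python by comparing tuples. This is a sketch on the sample table from the question; `rows`, `scores` and `best_match` are hypothetical names. Unlike rank(), `min` returns a single ID even if two candidates agree on both measures.

```python
# Sample rows from the question: ID -> (COL1, COL2, COL3, COL4)
rows = {
    "A": (-5, -2, 0, -2),
    "B": (0, 1, -1, 3),
    "C": (1, -2, -3, 1),
    "D": (-1, -1, 5, 0),
    "E": (2, -3, 1, -2),
    "F": (-3, 1, -2, -1),
    "G": (-4, -1, -1, -3),
}

def scores(x, y):
    """Return (TotalDiff, SqDist); tuples compare by TotalDiff first, then SqDist."""
    diffs = [a - b for a, b in zip(x, y)]
    return sum(abs(d) for d in diffs), sum(d * d for d in diffs)

def best_match(id_):
    """CompID with the smallest (TotalDiff, SqDist) for id_."""
    others = (j for j in rows if j != id_)
    return min(others, key=lambda j: scores(rows[id_], rows[j]))

for id_ in rows:
    match = best_match(id_)
    print(id_, match, *scores(rows[id_], rows[match]))
```

With the sample data, B is tied between C and F on the absolute total (8 each), and the squared distance settles it in favour of C (18 versus 26). Likewise E resolves to C rather than A (27 versus 51).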


Compare multiple numeric columns to determine record similarity
