Question

Given a table:

$ cat data.csv

ID,State,City,Price,Flag
1,CA,A,95,0
2,CA,A,96,1
3,CA,A,195,1
4,NY,B,124,0
5,NY,B,128,1
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1

Expected result:

ID0,ID1
1,2
4,5
6,7
8,7

For each ID above with Flag = 0, we want to find another ID with Flag = 1 that has the same State and City and the closest Price.

I have two rough, naive ideas:

Approach 1.

Use a left outer join of the table with itself on
    (a.State = b.State and a.City = b.City and a.Flag = 0 and b.Flag = 1),
    where a.Flag = 0 and b.Flag = 1,

    and then use RANK() over (partition by a.State, a.City order by a.Price - b.Price) as rank
    where rank = 1

Approach 2.

Use a left outer join of the table with itself on
    (a.State = b.State and a.City = b.City and a.Flag = 0 and b.Flag = 1),
    where a.Flag = 0 and b.Flag = 1,

    and then use DISTRIBUTE BY a.State, a.City SORT BY Price_Diff ASC LIMIT 1

What is the best way to find the nearest neighbor in Hive? Any helpful tips would be greatly appreciated!

Answer 1

select a.id, b.id, min(abs(b.price-a.price)) as delta 
from data as a 
     inner join data as b 
            on a.state=b.state and 
               a.flag=0 and b.flag=1 and 
               a.city=b.city 
group by a.id, b.id  
order by delta asc;

This returns:

1   2   1  <---
8   7   2  <---
6   7   3  <--- 
4   5   4  <--- 
8   9   10
6   9   15
1   3   100

The problem is that the last 3 rows reuse IDs that already appear in the first 4 rows.
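One way to keep only the closest match per ID0 without window functions is to compute the minimum delta per a.id and join it back to the candidate pairs. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Hive and the column names from data.csv. The function name and the in-memory table setup are illustrative and not part of the original answer; the query itself uses only plain joins and GROUP BY, so it should carry over to Hive with little change, but that is untested here.

```python
import sqlite3

# Rows copied from data.csv: (id, state, city, price, flag)
ROWS = [
    (1, "CA", "A", 95, 0), (2, "CA", "A", 96, 1), (3, "CA", "A", 195, 1),
    (4, "NY", "B", 124, 0), (5, "NY", "B", 128, 1),
    (6, "NY", "C", 24, 0), (7, "NY", "C", 27, 1),
    (8, "NY", "C", 29, 0), (9, "NY", "C", 39, 1),
]

def nearest_by_min_delta(rows):
    """Return (id0, id1) pairs: closest flag=1 row for every flag=0 row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table data (id int, state text, city text, price int, flag int)"
    )
    conn.executemany("insert into data values (?, ?, ?, ?, ?)", rows)
    query = """
        select p.id0, p.id1
        from (
            -- every candidate pair within the same state and city
            select a.id as id0, b.id as id1, abs(b.price - a.price) as delta
            from data as a
                 inner join data as b
                       on a.state = b.state and a.city = b.city
            where a.flag = 0 and b.flag = 1
        ) as p
        inner join (
            -- smallest price difference for each flag=0 id
            select a.id as id0, min(abs(b.price - a.price)) as delta
            from data as a
                 inner join data as b
                       on a.state = b.state and a.city = b.city
            where a.flag = 0 and b.flag = 1
            group by a.id
        ) as m
              on p.id0 = m.id0 and p.delta = m.delta
        order by p.id0
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Calling nearest_by_min_delta(ROWS) on the sample data gives the four expected pairs. If two candidates are equally close to the same ID0, this version returns both rows, so a tie-breaking rule would have to be added.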

select a.id as id0, b.id as id1, abs(b.price-a.price) as delta, 
       rank() over ( partition by a.state, a.city order by abs(b.price-a.price) ) 
from data as a 
      inner join data as b 
            on a.state=b.state and 
            a.flag=0 and b.flag=1 and 
            a.city=b.city;

This returns:

    id0 id1 delta rank
    1   2   1     1  <---
    1   3   100   2
    4   5   4     1  <---
    8   7   2     1  <---
    6   7   3     2
    8   9   10    3
    6   9   15    4

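We are missing (6,7), which is correct in a sense: the rank is computed per (state, city) partition, so only one pair per city gets rank 1.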

6,NY,C,24,0 
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1

Among (6,7), (6,9), (8,7), and (8,9), the smallest price difference is at (8,7). (The join is ambiguous.)
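If the goal is one match per Flag = 0 row, as in the expected result from the question, a possible fix is to partition the window by a.id instead of by state and city, so that each ID0 gets its own ranking. Below is a minimal sketch, assuming SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive. The function name, the `rn` alias, and the tie-break on b.id are assumptions added for illustration; row_number() over (partition by ... order by ...) is standard windowing syntax that Hive also supports, but the query has not been run on Hive here.

```python
import sqlite3

# Rows copied from data.csv: (id, state, city, price, flag)
ROWS = [
    (1, "CA", "A", 95, 0), (2, "CA", "A", 96, 1), (3, "CA", "A", 195, 1),
    (4, "NY", "B", 124, 0), (5, "NY", "B", 128, 1),
    (6, "NY", "C", 24, 0), (7, "NY", "C", 27, 1),
    (8, "NY", "C", 29, 0), (9, "NY", "C", 39, 1),
]

def nearest_flagged_neighbors(rows):
    """Return one (id0, id1) pair per flag=0 row, ranked per a.id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table data (id int, state text, city text, price int, flag int)"
    )
    conn.executemany("insert into data values (?, ?, ?, ?, ?)", rows)
    query = """
        select id0, id1
        from (
            select a.id as id0, b.id as id1,
                   -- rank candidates separately for each flag=0 id;
                   -- b.id breaks ties so exactly one row gets rn = 1
                   row_number() over (
                       partition by a.id
                       order by abs(b.price - a.price), b.id
                   ) as rn
            from data as a
                 inner join data as b
                       on a.state = b.state and a.city = b.city
            where a.flag = 0 and b.flag = 1
        ) as ranked
        where rn = 1
        order by id0
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

On the sample data, nearest_flagged_neighbors(ROWS) yields (1,2), (4,5), (6,7), and (8,7). Using row_number() rather than rank() is a design choice: rank() would return several rows for an ID0 whose candidates are tied on price difference.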

I think you will enjoy this video on the topic: Big Data Analytics Using Window Functions
