Given a table:
$ cat data.csv
ID,State,City,Price,Flag
1,CA,A,95,0
2,CA,A,96,1
3,CA,A,195,1
4,NY,B,124,0
5,NY,B,128,1
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
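For anyone who wants to reproduce this, here is a minimal sketch of how the file might be loaded into Hive. The table name `data` matches the queries further down; the column types and the local file path are assumptions based on the sample rows, not something stated in the question.

```sql
-- Hypothetical table definition; column types are guessed from the sample rows.
CREATE TABLE data (
  id    INT,
  state STRING,
  city  STRING,
  price INT,
  flag  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1');  -- skip the CSV header row

-- Assumes data.csv is in the current local working directory.
LOAD DATA LOCAL INPATH 'data.csv' INTO TABLE data;
```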
Expected result:
ID0, ID1
1,2
4,5
6,7
8,7
For each ID above with Flag = 0, we want to find another ID with Flag = 1 that has the same "State" and "City" and the closest Price.
I have two rough, naive ideas:
Method 1.
Use a left outer join with the table itself on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then use RANK() over (partition by a.State,a.City order by a.Price - b.Price) as rank
where rank=1
Method 2.
Use a left outer join with the table itself,
on
(a.State=b.State and a.City=b.city and a.Flag=0 and b.Flag=1),
where a.Flag=0 and b.Flag=1,
and then Use Distribute by a.State,a.City Sort by Price_Diff ASC limit 1
What is the best way to find the nearest neighbor in Hive? Any helpful hints would be greatly appreciated!
Answer 0 (score: 1)
select a.id, b.id, min(abs(b.price-a.price)) as delta
from data as a
inner join data as b
on a.state=b.state and
a.flag=0 and b.flag=1 and
a.city=b.city
group by a.id, b.id
order by delta asc;
returns
1 2 1 <---
8 7 2 <---
6 7 3 <---
4 5 4 <---
8 9 10
6 9 15
1 3 100
The problem is that the last 3 rows reuse ids that already appear in the first 4 rows.
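One possible way to drop those extra rows without a window function is to compute the smallest delta per Flag = 0 id and join back on it. This is only a sketch: it assumes the table is called `data` with the columns from the question, and the CTE names `pairs` and `best` are made up for illustration. If two Flag = 1 rows are equally close, both are returned.

```sql
-- pairs: every (flag-0 row, flag-1 row) combination within the same state and city.
WITH pairs AS (
  SELECT a.id AS id0,
         b.id AS id1,
         abs(b.price - a.price) AS delta
  FROM data a
  JOIN data b
    ON a.state = b.state AND a.city = b.city
  WHERE a.flag = 0 AND b.flag = 1
),
-- best: the smallest price difference found for each flag-0 id.
best AS (
  SELECT id0, min(delta) AS min_delta
  FROM pairs
  GROUP BY id0
)
-- Keep only the pairs whose delta equals that minimum.
SELECT p.id0, p.id1, p.delta
FROM pairs p
JOIN best m
  ON p.id0 = m.id0 AND p.delta = m.min_delta
ORDER BY p.id0;
```

On the sample data this should keep exactly the four rows marked with arrows above: (1,2), (4,5), (6,7) and (8,7).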
select a.id as id0, b.id as id1, abs(b.price-a.price) as delta,
rank() over ( partition by a.state, a.city order by abs(b.price-a.price) )
from data as a
inner join data as b
on a.state=b.state and
a.flag=0 and b.flag=1 and
a.city=b.city;
This returns
id0 id1 delta rank
1 2 1 1 <---
1 3 100 2
4 5 4 1 <---
8 7 2 1 <---
6 7 3 2
8 9 10 3
6 9 15 4
The pair (6,7) does not get rank 1, which is correct in a sense. The NY/C rows are:
6,NY,C,24,0
7,NY,C,27,1
8,NY,C,29,0
9,NY,C,39,1
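Among (6,7), (6,9), (8,7) and (8,9), the smallest price difference is at (8,7), so that is the only pair ranked 1 in the NY/C partition. (The join is ambiguous.)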
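If the goal is one nearest neighbor per Flag = 0 id, as in the expected result of the question, one option is to partition by `a.id` instead of by state and city, so every Flag = 0 row gets its own ranking. A rough sketch, again assuming the table `data` from the question; `rn` and `ranked` are illustrative names, and the `b.id` tie-breaker is my own assumption so that exactly one row per id survives.

```sql
SELECT id0, id1, delta
FROM (
  SELECT a.id AS id0,
         b.id AS id1,
         abs(b.price - a.price) AS delta,
         -- Rank the flag-1 candidates separately for each flag-0 id;
         -- b.id breaks ties between equally close candidates (assumption).
         row_number() OVER (PARTITION BY a.id
                            ORDER BY abs(b.price - a.price), b.id) AS rn
  FROM data a
  JOIN data b
    ON a.state = b.state AND a.city = b.city
  WHERE a.flag = 0 AND b.flag = 1
) ranked
WHERE rn = 1
ORDER BY id0;
```

With the sample rows, id 6 is ranked against 7 and 9 on its own and id 8 likewise, so both end up paired with 7.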
I think you will like this video on the topic: Big Data Analytics Using Window Functions