Question

I need help deduplicating a set of users (20 million+) across several different kinds of ID.

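Here is what the data looks like:
- We have 3 kinds of user ID: ID1, ID2 and ID3.
- At least 2 of them are always sent together: ID1 with ID2, or ID2 with ID3. ID3 is never sent together with ID1.
- A user can have more than one ID1, ID2 or ID3.
- So my table sometimes contains several rows with many different IDs, all of which may describe a single user.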

An example:

All of these IDs identify a single user.

I think I could add a fourth ID (GroupID) and use that one to deduplicate them. Something like this:

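The problem is: I know how to do this on SQL Server with the CURSOR / OPEN / FETCH / NEXT commands, but in my environment I can only use Hive QL, Impala and Python.

Does anyone know what the best approach would be?

Many thanks,

Hugo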

Answer 1

Based on your example, and assuming id2 is always present, you can aggregate the rows, grouping by ID2:

select max(id1) id1,  id2, max(id3) id3 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 group by id2;

OK
A       1       Alpha
A       2       Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)

Now I am trying to implement your logic as you described it:

select --pass2
 id1, id2, id3,
 case when lag(id2) over (order by id2, GroupId) = id2 then lag(GroupId) over (order by id2, GroupId) else GroupId end GroupId2
 from
 (
 select        --pass1
 id1, id2, id3,
 case when 
 lag(id1) over(order by id1, NVL(ID1,ID3)) =id1 then lag(NVL(ID1,ID3))  over(order by id1, NVL(ID1,ID3)) else NVL(ID1,ID3) end GroupId
 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 )s --pass1
;


OK
id1     id2     id3     groupid2
A       1       NULL    A
NULL    1       Alpha   A
A       2       NULL    A
NULL    2       Beta    A
Time taken: 106.944 seconds, Fetched: 4 row(s)
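
Since Python is also available to you, here is a minimal sketch of the same GroupID idea done with a union-find (disjoint-set) structure. This is a different technique from the window-function query above, not a translation of it: it treats every non-null ID as a node, links the IDs that appear on the same row, and gives all rows in one connected set the same group number. The function name `assign_group_ids`, the tuple layout `(id1, id2, id3)` and the extra `'B'` row are assumptions made up for illustration; it also assumes the rows fit in memory, which may or may not hold for 20 million+ users.

```python
def assign_group_ids(rows):
    """Return one group number per row.

    Rows that share any ID, directly or through a chain of other rows,
    receive the same group number. Each row is a tuple (id1, id2, id3)
    where missing IDs are None.
    """
    parent = {}

    def find(x):
        # Walk up to the root of x's set, then compress the path.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    def keys(row):
        # Tag each value with its column so that an ID1 value can never
        # collide with an identical ID2 or ID3 value.
        return [(col, val)
                for col, val in zip(("id1", "id2", "id3"), row)
                if val is not None]

    # Pass 1: link all IDs that appear together on a row.
    for row in rows:
        ks = keys(row)
        for k in ks:
            union(ks[0], k)

    # Pass 2: number the sets in order of first appearance.
    group_of_root = {}
    result = []
    for row in rows:
        ks = keys(row)
        if not ks:
            result.append(None)  # row with no IDs at all
            continue
        root = find(ks[0])
        result.append(group_of_root.setdefault(root, len(group_of_root) + 1))
    return result


if __name__ == "__main__":
    rows = [
        ("A", 1, None),       # dataset from the answer above
        (None, 1, "Alpha"),
        (None, 2, "Beta"),
        ("A", 2, None),
        ("B", 3, None),       # hypothetical second user, added for contrast
    ]
    for row, gid in zip(rows, assign_group_ids(rows)):
        print(row, "-> GroupID", gid)
```

The main design difference is that the union-find follows chains of any length (ID1 to ID2 to ID3 to another ID2, and so on), whereas a fixed number of LAG passes only merges as many hops as there are passes.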

Deduplicating IDs with Hive QL / Impala / Python
