Question

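We have a Hive table with three different ID columns, all of them optional. Every row must provide at least one of the three IDs. If a row provides more than one ID, it establishes an equivalence between those IDs.

We need to assign each row a master ID that is unique per group of equivalent IDs, based on the equivalences established across all rows. For example: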

Line   id1     id2     id3    masterID
--------------------------------------
(1)    A1                     M1
(2)            A2             M1
(3)                    A3     M1
(4)    A1      A2             M1
(5)            A2      A3     M1
(6)    B1      A2             M1
(7)    C1              C3     M2

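Because A1 and A2 both appear on line 4, we know these IDs are equivalent.

Likewise, A2 and A3 both appear on line 5, so we know these IDs are equivalent too.

On line 6 we have both B1 and A2, so they are equivalent as well.

On line 7 we have an equivalence between C1 and C3.

Given the above, A1, A2, A3, and B1 are all equivalent. Every row containing any of these IDs must therefore get the same master ID, so we gave them all "M1". Line 7 receives its own master ID ("M2") because neither of its IDs matches any of the others.

How can we write a Hive query that assigns master IDs this way? And if Hive is not the best tool for this, can you suggest a way to assign master IDs to these rows using other tools in the Hadoop ecosystem?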

Answer 1

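You can solve this by representing the IDs as vertices of a graph and finding its connected components. More on the idea here; see section 3.5. Suppose init_table is your table. First, build a links table: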

create table links as
select distinct id1 as v1, id2 as v2
  from init_table
 where id1 is not null and id2 is not null
union all 
select distinct id1 as v1, id3 as v2
  from init_table
 where id1 is not null and id3 is not null
union all 
select distinct id2 as v1, id3 as v2
  from init_table
 where id2 is not null and id3 is not null
;
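To make the link-building step concrete, here is a minimal Python sketch of the same idea outside Hive. It is only an illustration on the sample rows from the question: the names `ROWS` and `build_links` are hypothetical and not part of the answer's queries.

```python
from itertools import combinations

# Sample rows from the question as (id1, id2, id3); None marks a missing ID.
ROWS = [
    ("A1", None, None),
    (None, "A2", None),
    (None, None, "A3"),
    ("A1", "A2", None),
    (None, "A2", "A3"),
    ("B1", "A2", None),
    ("C1", None, "C3"),
]


def build_links(rows):
    """Return the distinct (v1, v2) pairs of IDs that share a row."""
    links = set()
    for row in rows:
        present = [v for v in row if v is not None]
        # Every pair of IDs present on the same row is an equivalence edge.
        links.update(combinations(present, 2))
    return sorted(links)


print(build_links(ROWS))
```

Rows with a single ID contribute no link, which matches the `is not null` filters in the query above.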

Next, generate a number for each link, for example a row number, and propagate it:

create table links1 as
with temp_table as (
  select v1, v2, row_number() over () as score
    from links
)
, tbl1 as (
  select v1, v2, score
       , max(score) over (partition by v1) as max_1
       , max(score) over (partition by v2) as max_2
    from temp_table
)
select v1, v2, greatest(max_1, max_2) as unique_id
  from tbl1
;
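One caveat worth checking on your own data: the query above performs a single propagation pass, and `partition by v1` and `partition by v2` are evaluated separately. On the sample data this appears to leave the link (A2, A3) with a different unique_id than (A1, A2) and (B1, A2), so the pass may need to be repeated until nothing changes. The Python sketch below illustrates that fixed-point idea under the assumption that links are numbered in list order; `propagate_labels` and `LINKS` are hypothetical names, and this is not a translation of the Hive query.

```python
def propagate_labels(links):
    """Give every ID the largest link score found in its connected component."""
    label = {}
    # Seed each ID with the highest score of any link it takes part in,
    # playing the role of row_number() plus the first max() pass.
    for score, (v1, v2) in enumerate(links, start=1):
        label[v1] = max(label.get(v1, 0), score)
        label[v2] = max(label.get(v2, 0), score)

    # Repeat the pass until no label changes (a fixed point).
    changed = True
    while changed:
        changed = False
        for v1, v2 in links:
            best = max(label[v1], label[v2])
            if label[v1] != best or label[v2] != best:
                label[v1] = label[v2] = best
                changed = True
    return label


LINKS = [("A1", "A2"), ("A2", "A3"), ("B1", "A2"), ("C1", "C3")]
labels = propagate_labels(LINKS)
print(labels)
```

In Hive, the equivalent would be rerunning the propagation step on its own output until the labels stop changing. The number of passes depends on how long the chains of equivalent IDs are.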

Then build a matching table that you can join your IDs against:

create table matching_table as
with temp_table as (
select v1 as id, unique_id
  from links1
union all
select v2 as id, unique_id
  from links1
)
select distinct id, unique_id
  from temp_table
;

If some IDs are not linked to any other ID, it is not hard to find out which ones. Hope this helps.
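For comparison, here is a small Python sketch that computes the expected result for the question's sample, including IDs that never appear together with another ID. It uses a union-find (disjoint-set) structure, which is a different technique from the window-function queries above, and it is a single-machine illustration rather than a Hadoop solution. The names `ROWS` and `assign_master_ids` and the "M1", "M2" numbering scheme are assumptions made for the example.

```python
# Sample rows from the question as (id1, id2, id3); None marks a missing ID.
ROWS = [
    ("A1", None, None),
    (None, "A2", None),
    (None, None, "A3"),
    ("A1", "A2", None),
    (None, "A2", "A3"),
    ("B1", "A2", None),
    ("C1", None, "C3"),
]


def assign_master_ids(rows):
    """Return one master ID per row, shared by all rows with equivalent IDs."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for row in rows:
        present = [v for v in row if v is not None]
        for other in present[1:]:
            union(present[0], other)
        find(present[0])  # also registers an ID that appears alone

    names = {}
    result = []
    for row in rows:
        root = find(next(v for v in row if v is not None))
        # Name components in order of first appearance: M1, M2, ...
        names.setdefault(root, f"M{len(names) + 1}")
        result.append(names[root])
    return result


print(assign_master_ids(ROWS))
```

An ID that is never linked to another one simply forms its own component and gets its own master ID, which covers the unlinked case mentioned above.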
