I have five tables (A, B, C, D, E) in a Hive database, and I need to merge their data according to logic on the column "id".
The conditions are:
select * from A
UNION
select * from B (only ids not in A)
UNION
select * from C (only ids not in A or B)
UNION
select * from D (only ids not in A, B or C)
UNION
select * from E (only ids not in A, B, C or D)
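This data must be inserted into the final table.
One approach is to create the target table (target), append the data for each UNION stage to it, and then join the next stage against this table.
This would be part of my .hql file: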
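-- Stage 1: all of A, plus the rows of B whose id is not in A
INSERT INTO target
(SELECT * FROM A
 UNION
 SELECT B.*
 FROM A
 RIGHT OUTER JOIN B
   ON A.id = B.id
 WHERE ISNULL(A.id));

-- Stage 2: rows of C whose id is not yet in target
INSERT INTO target
SELECT C.*
FROM target
RIGHT OUTER JOIN C
  ON target.id = C.id
WHERE ISNULL(target.id);

-- Stage 3: rows of D whose id is not yet in target
INSERT INTO target
SELECT D.*
FROM target
RIGHT OUTER JOIN D
  ON target.id = D.id
WHERE ISNULL(target.id);

-- Stage 4: rows of E whose id is not yet in target
INSERT INTO target
SELECT E.*
FROM target
RIGHT OUTER JOIN E
  ON target.id = E.id
WHERE ISNULL(target.id);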
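Is there a better way to achieve this? I assume we need multiple joins/lookups either way. I am looking for the best way to do this with
1) Hive on Tez
2) Spark SQL
Many thanks in advance.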
Answer 0: (score: 1)
If id is unique within each table, you can use row_number instead of rank.
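-- Tag each row with its source priority (1 = a ... 5 = e), then keep, per id,
-- only the rows coming from the highest-priority table that contains that id.
select *
from (select *
            ,rank () over
             (
                 partition by id
                 order by src
             ) as rnk
      from (
                     select 1 as src,* from a
           union all select 2 as src,* from b
           union all select 3 as src,* from c
           union all select 4 as src,* from d
           union all select 5 as src,* from e
           ) t
     ) t
where rnk = 1
;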
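If id really is unique within each table, the row_number variant mentioned above might look roughly like the sketch below. It also writes the result into target, since that is what the question asks for. The column names col1 and col2 are placeholders made up for illustration (the question does not give the real column list); the columns are listed explicitly only to keep the helper columns src and rn out of target.

```sql
-- Sketch only. Assumptions: id is unique within each of a..e, all five
-- tables share the same column layout, and col1/col2 are hypothetical
-- stand-ins for the real non-id columns.
INSERT INTO TABLE target
SELECT id, col1, col2              -- leave out the helper columns src and rn
FROM (
    SELECT u.*,
           row_number() OVER (PARTITION BY id ORDER BY src) AS rn
    FROM (
                  SELECT 1 AS src, a.* FROM a
        UNION ALL SELECT 2 AS src, b.* FROM b
        UNION ALL SELECT 3 AS src, c.* FROM c
        UNION ALL SELECT 4 AS src, d.* FROM d
        UNION ALL SELECT 5 AS src, e.* FROM e
    ) u
) ranked
WHERE rn = 1;                      -- row from the highest-priority table per id
```

Note the difference when the uniqueness assumption does not hold: with rank, all rows sharing an id inside the winning table tie at 1 and are all kept, whereas row_number keeps exactly one row per id.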
Answer 1: (score: 0)
I think I would try something like this:
-- For each id, find the highest-priority table (lowest "which") that contains it
with ids as (
      select id, min(which) as which
      from (select id, 1 as which from a union all
            select id, 2 as which from b union all
            select id, 3 as which from c union all
            select id, 4 as which from d union all
            select id, 5 as which from e
           ) x
      group by id
     )
-- Take each id's rows only from the table that "owns" it
select a.*
from a join ids on a.id = ids.id and ids.which = 1
union all
select b.*
from b join ids on b.id = ids.id and ids.which = 2
union all
select c.*
from c join ids on c.id = ids.id and ids.which = 3
union all
select d.*
from d join ids on d.id = ids.id and ids.which = 4
union all
select e.*
from e join ids on e.id = ids.id and ids.which = 5;