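In Hive, I want to distribute a table by one column and use a Python script to transform each distributed part.
For a single value of column D, I can process the matching records like this: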
from
(select *
from raw_table
where D=12345
sort by A)
sb
insert overwrite table u_12345
partition (X,Y)
select transform(cast(A as double),B,C,D,E,F,X,Y)
using 'hello.py'
as A,B,C,D,E,F,X,Y
;
Now I want to do the same for every distinct value of column D, so I wrote the following:
from raw_table
insert overwrite table clean_data
partition (X,Y)
select transform(cast(A as double),B,C,D,E,F,X,Y)
using 'hello.py'
as A,B,C,D,E,F,X,Y
distribute by D
;
But it does not work the way I want.
Answer 0 (score: 0)
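You can do the distribute by in a subquery, so that the rows are already distributed by D when they reach the transform.
I have not tested this: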
from (select A,B,C,D,E,F,X,Y from raw_table distribute by D) sb
insert overwrite table clean_data
partition (X,Y)
select transform(cast(A as double),B,C,D,E,F,X,Y)
using 'hello.py'
as A,B,C,D,E,F,X,Y ;
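For illustration, here is a minimal sketch of what a script like hello.py could look like for the query above. Hive streams rows to a transform script on stdin as tab-separated lines and reads the result from stdout in the same format. Because the subquery distributes by D but does not sort, one script instance may receive rows for several D values interleaved, so this sketch buffers them in a dict first. The function names, the per-group step (ordering by A), and the assumption that A is never NULL are all hypothetical; the script would also normally need to be shipped with add file before it is used.

```python
import sys
from collections import defaultdict

# Position of D in transform(A, B, C, D, E, F, X, Y).
D_INDEX = 3


def group_rows_by_d(lines):
    """Collect tab-separated rows into a dict keyed by the D column."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        groups[fields[D_INDEX]].append(fields)
    return groups


def process_group(rows):
    """Hypothetical per-D step: order the rows by numeric A.

    Assumes A is never NULL (Hive would send NULL as the text \\N).
    """
    return sorted(rows, key=lambda r: float(r[0]))


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Emit every group in the same tab-separated layout Hive expects back.
    for _d_value, rows in group_rows_by_d(stdin).items():
        for row in process_group(rows):
            stdout.write("\t".join(row) + "\n")


if __name__ == "__main__":
    main()
```

Buffering everything in memory is only reasonable when each reducer's share of the data is small; adding sort by D to the subquery would allow a streaming approach instead.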
This pattern works on my cluster:
create table clean_data as
select
transform (key, B,C,D,E,F,G)
USING 'reducer_script.py' as (key, B,C,D,E,F,G_reduced)
from (select key, B,C,D,E,F,G from raw_table distribute by KEY sort by KEY, D ) alias ;
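As a rough sketch of the script side of this pattern: with distribute by KEY sort by KEY, D, all rows for one key should arrive at the same script instance and next to each other, so the script can stream through them with itertools.groupby instead of buffering everything. The body of reducer_script.py is not shown in this answer, so the reduction below (replacing G with a row count per key) is purely a placeholder, and the function names are made up for the example.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each tab-separated input line into its fields."""
    for line in lines:
        yield line.rstrip("\n").split("\t")


def reduce_group(key, rows):
    """Placeholder reduction: keep B..F from the first row, count rows as G_reduced."""
    rows = list(rows)
    first = rows[0]
    # Input layout: key, B, C, D, E, F, G  ->  output: key, B, C, D, E, F, G_reduced
    return [key] + first[1:6] + [str(len(rows))]


def run(lines, out):
    # groupby only merges adjacent rows, which is why the query sorts by key.
    for key, rows in groupby(parse(lines), key=lambda fields: fields[0]):
        out.write("\t".join(reduce_group(key, rows)) + "\n")


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

Without the sort by, rows for the same key could arrive in separate runs and groupby would emit more than one output row per key.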