我知道在hive中没有直接的方法来转置数据。我跟着这个问题:Is there a way to transpose data in Hive?,但由于那里没有最终答案,所以无法完全解决。
这是我的表格:
| ID | Code | Proc1 | Proc2 |
| 1 | A | p | e |
| 2 | B | q | f |
| 3 | B | p | f |
| 3 | B | q | h |
| 3 | B | r | j |
| 3 | C | t | k |
这里Proc1可以有任意数量的值。 ID,Code& Proc1一起形成该表的唯一键。我想透视/转置此表,以便Proc1中的每个唯一值成为新列,并且Proc2中的相应值是该列中相应行的值。在本质上,我试图得到类似的东西:
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |
在新转换的表中,ID和代码是唯一的主键。从我上面提到的票证中,我可以使用to_map UDAF来实现这一点。 (免责声明 - 这可能不是朝着正确方向迈出的一步,但只要提到这里,如果是的话)
| ID | Code | Map_Aggregation |
| 1 | A | {p:e} |
| 2 | B | {q:f} |
| 3 | B | {p:f, q:h, r:j } |
| 3 | C | {t:k} |
但是我不知道如何从这一步到达我想要的转轴/换位表。 任何有关如何进行的帮助都会很棒! 感谢。
答案 0 :(得分:13)
以下是我使用hive的内部UDF函数解决此问题的方法," map":
select
b.id,
b.code,
concat_ws('',b.p) as p,
concat_ws('',b.q) as q,
concat_ws('',b.r) as r,
concat_ws('',b.t) as t
from
(
select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from (
select
id,
code,
map(proc1,proc2) as group_map
from
test_sample
) a
group by
a.id,
a.code
) b;
" CONCAT_WS"和"地图" hive udf和" collect_list"是一个蜂巢udaf。
答案 1 :(得分:5)
以下是我最终使用的解决方案:
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
select
id,
code,
group_map['p'] as p,
group_map['q'] as q,
group_map['r'] as r,
group_map['t'] as t
from ( select
id, code,
collect(proc1,proc2) as group_map
from test_sample
group by id, code
) gm;
to_map UDF用于brickhouse repo:https://github.com/klout/brickhouse
答案 2 :(得分:5)
又一种解决方案。
使用Hivemallto_map
函数数据透视。
SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t
uid c1 c2 c3
101 11 12 13
102 21 22 23
<强>逆透视强>
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
uid key value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23
答案 3 :(得分:1)
我没有编写这段代码,但我认为你可以使用klouts brickhouse提供的一些UDF:https://github.com/klout/brickhouse
具体来说,你可以像在这里提到的那样使用他们的收集:http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
然后使用本文http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra中详述的方法爆炸数组(它们将具有不同的长度)
答案 4 :(得分:1)
create table hive (id Int,Code String, Proc1 String, Proc2 String);
insert into hive values('1','A','p','e');
insert into hive values('2','B','q','f');
insert into hive values('3','B','p','f');
insert into hive values('3','B','q','h');
insert into hive values('3','B','r','j');
insert into hive values('3','C','t','k');
select id,code,
case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
from(
select id, code,
case when proc1 ='p' then proc2 end as p,
case when proc1 ='q' then proc2 end as q,
case when proc1 ='r' then proc2 end as r,
case when proc1 ='t' then proc2 end as t
from hive
) dummy group by id,code;
答案 5 :(得分:0)
如果是数值,您可以使用以下配置单元查询:
示例数据
ID cust_freq Var1 Var2 frequency
220444 1 16443 87128 72.10140547
312554 6 984 7339 0.342452643
220444 3 6201 87128 9.258396518
220444 6 47779 87128 2.831972441
312554 1 6055 7339 82.15209213
312554 3 12868 7339 4.478333954
220444 2 6705 87128 15.80822558
312554 2 37432 7339 13.02712127
select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
( select id,
map(cust_freq,frequency) as group_map
from table
) a group by a.id having id in
( '220444',
'312554');
ID one two three six
220444 72.10140547 15.80822558 9.258396518 2.831972441
312554 82.15209213 13.02712127 4.478333954 0.342452643
In above example I have't used any custom udf. It is only using in-built hive functions.
Note :For string value in key write the vale as sum(a.group_map['1']) as One.
答案 6 :(得分:0)
下面也是Pivot的一种方式
SELECT TM1_Code, Product, Size, State_code, Description
, Promo_date
, Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
, MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
答案 7 :(得分:0)
对于Unpivot,我们可以简单地使用以下逻辑。
SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);
答案 8 :(得分:0)
您可以使用案例语句和collect_set的一些帮助来实现此目的。你可以看一下。您可以在 - http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/
查看详细答案以下是供参考的查询
SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;