Collecting data by id in Hive

Time: 2016-08-07 20:36:45

Tags: hadoop hive apache-pig

I have a table with rows in the following format:

user |  purchase | time_of_purchase|quantity

Sample:

1234 | Bread | Jul 7 20:48| 1
1234 | Shaving Cream | July 10 14:20 | 2
5678 | Milk | July 7 3:48 | 1 
5678 | Bread | July 7 3:49 | 2
5678 | Bread | July 7 15:30 | 1

I want to create a purchase history for each user in the following format:

1234 | {[Bread , Jul 7 20:48,1] ,[ Shaving Cream , July 10 14:20, 2 ]}
5678 | {[Milk, July 7 3:48 , 1 ] , [Bread , July 7 3:49 , 2], [Bread , July 7 15:30 , 1]}

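Is it possible to do this in Hive or in a Pig script? I tried collect_list, but it did not keep the values from the different columns together in a consistent order. I also tried Brickhouse collect, but it behaved like collect_set and I lost some of the information.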

1 Answer:

Answer 0: (score: 0)

Pig script

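-- Load the purchases; PigStorage(',') assumes file.txt is comma-delimited
File = LOAD 'file.txt' USING PigStorage(',') AS (user:int, Purchase:chararray, timeofpurchase:chararray, quantity:int);

-- Group all rows for each user into one bag
GRP_USER = GROUP File BY user;
DUMP GRP_USER;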
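
A GROUP on its own leaves the user id repeated inside every tuple of the bag. As a minimal sketch, one way to get closer to the layout asked for in the question is to project only the purchase fields out of each bag. The aliases purchases, by_user, and history are illustrative names, not part of the original answer, and the comma delimiter is an assumption about file.txt.

```pig
-- Assumption: file.txt is comma-delimited with the four columns from the question.
purchases = LOAD 'file.txt' USING PigStorage(',')
    AS (user:int, purchase:chararray, timeofpurchase:chararray, quantity:int);

-- One row per user, with a bag holding all of that user's purchase tuples.
by_user = GROUP purchases BY user;

-- Keep the user id once and project the three purchase fields out of the bag,
-- so each tuple stays intact as (purchase, timeofpurchase, quantity).
history = FOREACH by_user GENERATE
    group AS user,
    purchases.(purchase, timeofpurchase, quantity) AS history;

DUMP history;
```

Each output row should then hold a user id followed by a bag of (purchase, timeofpurchase, quantity) tuples. Pig does not guarantee the order of tuples inside the bag, and because timeofpurchase is loaded as a chararray, sorting on it would be lexical rather than chronological unless the value is first parsed into a proper timestamp.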

You can refer to http://ybhavesh.blogspot.com/ for a few examples.

Hope it helps.