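I have a large number of files, all with the same structure. The first line of each file is a key (in this example, the key of a movie), followed by records of user ID, rating, and date.
Example file 1: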
1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
Example file 2:
2:
2059652,4,2005-09-05
1666394,3,2005-04-19
1759415,4,2005-04-22
1959936,5,2005-11-21
To process the data with Pig and get the highest and average rating per movie or per year, I need something like this:
1,1488844,3,2005-09-06
1,822109,5,2005-05-13
1,885013,4,2005-10-19
2,2059652,4,2005-09-05
2,1666394,3,2005-04-19
2,1759415,4,2005-04-22
2,1959936,5,2005-11-21
How can I do this? Thanks!
Answer 0 (score: 4)
Try something like this:
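-- PigStorage takes the field delimiter first and the options second;
-- '-tagsource' prepends the source file name to every row as $0.
inputs = LOAD 'input_path/*' USING PigStorage(',', '-tagsource');
-- One group per input file.
grouped = GROUP inputs BY $0;
processed = FOREACH grouped {
    -- The bracketed expressions are placeholders to fill in.
    key_row = FILTER inputs BY [regexp expression for the key row, or some simple string expression];
    without_key_row = FILTER inputs BY [the opposite expression];
    GENERATE
        -- Assumes $1 is the first field after the source tag, i.e. the key on the key row.
        FLATTEN(key_row.$1),
        FLATTEN(without_key_row);
}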
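
To check the result on a small sample, the target transformation can also be sketched outside Pig. The following is a minimal plain-Python illustration, not part of the Pig script above: the function name `prefix_with_key` is hypothetical, and it assumes each file starts with a single `key:` line and that every later non-empty line is a record.

```python
from typing import Iterable, Iterator


def prefix_with_key(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'key,record' for one file whose first non-empty line is 'key:'."""
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if key is None:
            # Assumption: the first non-empty line is the key, e.g. "1:"
            key = line.rstrip(":")
            continue
        yield f"{key},{line}"


# Sample taken from example file 1 in the question.
sample = ["1:", "1488844,3,2005-09-06", "822109,5,2005-05-13", "885013,4,2005-10-19"]
result = list(prefix_with_key(sample))
for row in result:
    print(row)
```

Running the function once per file and concatenating the results should give the flat layout shown in the question, which can then be compared against what the Pig script produces.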