Pig: flatten vs. group on nested bags

Date: 2015-11-10 23:22:26

Tags: apache-pig

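I'm learning Pig and I have a question. I know the answer is probably in a book, but unfortunately I don't have time to research it. I have two pipelines: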

(Option 1):

a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid; 
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
e = foreach d generate flatten(year) as year, event, mpg;
f = group e by year;
g = foreach f generate group, AVG(e.mpg);
x = limit g 10;
dump x;

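I load the two tables, join them on truckid, and take a two-character substring of the date as the year. After that I use flatten on the year to simplify things before grouping, and then compute the average mpg per year.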

(Option 2):

a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid; 
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
f = group d by year;
g = foreach f generate group, AVG(d.mpg);
x = limit g 10;
dump x;

The same thing, but without flatten: I group directly on the year and then compute the average mpg.

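I get the same result from both options, but is there a significant difference between them? The dataset I'm using here isn't large, but I'm curious what would happen if I had a few million records.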
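One way to compare the two options without a large dataset might be to look at the schemas and execution plans Pig produces for each. The following is only a minimal sketch, assuming the same HCatalog tables and field names as above; the aliases f1/g1 and f2/g2 are made up here just to keep the two variants apart. My expectation, not something I have verified, is that flatten on a scalar chararray changes nothing, since flatten is meant to un-nest tuples and bags. In that case DESCRIBE should show the same schema for d and e, and the two plans should differ only by the extra foreach.

```pig
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;

-- Option 1: extra foreach with flatten before grouping
e = foreach d generate flatten(year) as year, event, mpg;
f1 = group e by year;
g1 = foreach f1 generate group, AVG(e.mpg);

-- Option 2: group directly on the output of d
f2 = group d by year;
g2 = foreach f2 generate group, AVG(d.mpg);

-- Compare the schemas before and after flatten
DESCRIBE d;
DESCRIBE e;

-- Compare the logical, physical and MapReduce plans of both variants
EXPLAIN g1;
EXPLAIN g2;
```

If the two plans end up with the same number of MapReduce jobs, any difference at a few million records would presumably come down to the cost of one additional per-record projection.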

Thanks.

0 answers:

No answers