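I'm learning Pig and I have a question. I know the answer is probably in a book, but unfortunately I don't have time to research it. I have two pipelines:
(Option 1):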
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
e = foreach d generate flatten(year) as year, event, mpg;
f = group e by year;
g = foreach f generate group, AVG(e.mpg);
x = limit g 10;
dump x;
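I load the two tables, join them on truckid, extract the last two digits of the date as the year, and then use flatten to simplify things before grouping by year to get the average mpg.
(Option 2):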
a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid;
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
f = group d by year;
g = foreach f generate group, AVG(d.mpg);
x = limit g 10;
dump x;
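This is the same thing, except that I skip the flatten step and group directly before computing the average mpg.
I get the same result from both options, but is there a significant difference between them? The dataset I'm using here isn't large, but I'm curious what would happen if I had a few million records.
Thanks.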