Why does my Pig query return the wrong values?

Time: 2019-04-29 19:20:18

Tags: hadoop apache-pig

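I am trying to work with the following dataset in Pig: https://www.kaggle.com/zynicide/wine-reviews/version/4. My query returns the wrong values. The only cause I can think of is missing data in the dataset, but I don't know whether that is the case or why I am getting the wrong values.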

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted  5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;

DESCRIBE allWinesPricePoints;

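The actual result I get is:
(56203, buttered toast and spice flavors. Should hold for a year or two.")
(61341, sweet tannins. Fresh acidity gives it more of a lift. Give it time. Best 2007-2012.")
(16417, Chardonnay is also known)
(115384, almond and vanilla)
(136804, almond and vanilla)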

I think the output should be:
(56203,23)
(61341,30)
(16417,16)
(115384,250)
(136804,250)

I expect the second value to be a number taken from the price column.

1 answer:

Answer 0: (score: 0)

Proceed as follows:

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

-- Add the FOREACH below to project every column explicitly;
-- the answer's intent is that this helps the data parse correctly.
-- Generate the columns in the same order as they appear in the text file.
allWines= FOREACH allWines GENERATE
id AS id,
country AS country,
description AS description,
designation AS designation,
points AS points,
price AS price, 
province AS province,
region_2 AS region_2,
region_1 AS region_1,
variety AS variety,
winery AS winery;

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted  5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
DESCRIBE allWinesPricePoints;
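
A further possibility, not confirmed in this thread: PigStorage(',') splits on every comma and does not honor quoted fields, so a description that itself contains commas would push later text into the price column, which matches the output in the question. The sketch below loads the file with piggybank's CSVExcelStorage instead. The jar path is hypothetical, the schema is copied from the question, and the 'NO_MULTILINE', 'UNIX' and 'SKIP_INPUT_HEADER' options are assumptions about the file (no embedded newlines, Unix line endings, a header row).

```pig
-- Assumption: piggybank.jar lives at this hypothetical path; adjust for your install.
REGISTER '/path/to/piggybank.jar';

-- CSVExcelStorage honors double-quoted fields, so commas inside the
-- description no longer shift the remaining columns.
-- points and price are declared numeric so that ORDER sorts by value.
allWines = LOAD 'winemag-data_first150k.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (id:chararray, country:chararray, description:chararray, designation:chararray,
        points:int, price:double, province:chararray, region_2:chararray,
        region_1:chararray, variety:chararray, winery:chararray);

-- Drop rows where either numeric field is missing.
allWinesNotNull = FILTER allWines BY price IS NOT NULL AND points IS NOT NULL;
allWinesPriceSorted = ORDER allWinesNotNull BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted 5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
```

Declaring price as double also matters for the sort: with price as chararray, ORDER compares strings, so '100' sorts before '23'.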

I hope this helps. Let me know if you have any questions.