Unable to dump a relation in Pig

Date: 2016-11-06 09:10:02

Tags: hadoop apache-pig bigdata

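I have been stuck on this problem for a long time, and any help would be appreciated. I have a dataset file in the /home/hadoop/pig directory. I can view the file, so this is not a permissions problem. The dataset has 4 columns, with "::" as the delimiter. I am running Pig in local mode from the /home/hadoop/pig directory.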

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

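The script above fails. I can successfully dump the 'ratingsData' and 'ratings' relations, but not grouped_mid. Here is the strange part: the following script runs successfully.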

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

STORE ratings INTO 'ratingInfo.txt';

X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP X BY mid;

dump grouped_mid;

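Obviously, the second script has a redundant step: it just stores the relation and loads it back. I would like to avoid that. Any clarification or explanation would be much appreciated.

Thanks a lot.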

1 Answer:

Answer 0 (score: 0)

See this related question: pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

You can modify your script as follows:

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

Tested.
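A likely explanation, stated here as an assumption rather than something confirmed in this thread: REGEX_EXTRACT_ALL returns its captured groups as strings, and the AS clause in the original FOREACH only declares the field types without converting the values. GROUP then finds a String where it expects an Integer, which would match the ClassCastException in the linked question. The STORE/LOAD round trip avoids this because the loader parses the text back into the declared types. The sketch below reuses the explicit tuple cast from the script above and adds a small sanity check; the aliases sample_rows and grouped_sample are hypothetical names chosen for illustration.

```pig
-- Same LOAD and explicit tuple cast as in the script above.
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE
    FLATTEN((tuple(int, int, int, long))
        REGEX_EXTRACT_ALL(line, '(.*?)::(.*?)::(.*?)::(.*?)'))
    AS (uid:int, mid:int, rating:int, timestamp:long);

-- Print the declared schema of the parsed relation.
DESCRIBE ratings;

-- Hypothetical sanity check: group only a few rows
-- before running the GROUP on the full file.
sample_rows = LIMIT ratings 10;
grouped_sample = GROUP sample_rows BY mid;
DUMP grouped_sample;
```

If the small grouped dump succeeds, the same GROUP on the full ratings relation should no longer need the intermediate STORE and LOAD.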