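I have been stuck on this problem for a long time. Any help would be greatly appreciated. I have a dataset file in the /home/hadoop/pig directory. I can view the file, so this is not a permissions problem. The dataset has 4 columns, with "::" as the delimiter. I am running Pig in local mode from the /home/hadoop/pig directory.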
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
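The script above fails. I can successfully dump the 'ratingsData' and 'ratings' relations, but not grouped_mid. Here is the strange part: the following script runs successfully.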
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
STORE ratings INTO 'ratingInfo.txt';
X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP X BY mid;
dump grouped_mid;
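Obviously, the second script has a redundant step: I just store the relation and load it again. I would like to avoid that. Any clarification or explanation would be greatly appreciated.
Thanks a lot.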
Answer 0 (score: 0)
Just see: pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
You can change your script to:
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
Tested.
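An alternative sketch, not taken from the answer above and not tested here: it assumes that the values produced by REGEX_EXTRACT_ALL are strings at runtime (as the ClassCastException in the referenced question suggests), so it declares them as chararray first and converts them with explicit casts in a second FOREACH. The relation name ratings_raw is a hypothetical name introduced only for this illustration.

```pig
ratingsData = LOAD 'ratings.dat' AS (line:chararray);

-- Assumption: the regex groups come back as strings, so declare them as chararray.
ratings_raw = FOREACH ratingsData GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '(.*?)::(.*?)::(.*?)::(.*?)'))
    AS (uid:chararray, mid:chararray, rating:chararray, timestamp:chararray);

-- Convert each field explicitly, so the declared types match the actual values.
ratings = FOREACH ratings_raw GENERATE
    (int)uid AS uid,
    (int)mid AS mid,
    (int)rating AS rating,
    (long)timestamp AS timestamp;

grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
```

If this works in your Pig version, it avoids the STORE/LOAD round trip just as the tuple cast does; the difference is only that the conversion is spelled out field by field.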