这是我尝试运行的代码。步骤进行:
代码如下:
x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}
distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}
grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;
当我进行分组时,会出现以下错误:
ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
关键字是一个chararray,Pig应该能够在chararray上进行分组。有什么想法吗?
编辑: 输入文件:
0000010000014743 call for midwife 23 1425761139
0000010000062069 naruto 1 56 1425780386
0000010000079919 the following 98 1425788874
0000010000081650 planes 2 76 1425721945
0000010000118785 law and order 21 1425763899
0000010000136965 family guy 12 1425766338
0000010000136100 american dad 19 1425766702
.pig_schema文件
{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
答案 0 :(得分:1)
Pig无法将关键字的值识别为chararray。最好在初始加载期间进行字段命名,这样我们就明确地说明了字段类型。
x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
更新:
尝试使用更新的.pig_schema下面的片段来引入分数,使用'\ t'作为分隔符,并尝试以下步骤进行共享输入。
x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;
建议使用唯一的别名,以提高可读性和可维护性。
输出
(naruto 1,{(naruto 1,0000010000062069)})
(planes 2,{(planes 2,0000010000081650)})
(family guy,{(family guy,0000010000136965)})
(american dad,{(american dad,0000010000136100)})
(law and order,{(law and order,0000010000118785)})
(the following,{(the following,0000010000079919)})
(call for midwife,{(call for midwife,0000010000014743)})