Question

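This is the code I am trying to run. The steps are:

Load the input (the input folder contains a .pig_schema file)
Take only two fields (chararray) from it and remove duplicates
Group on one of those fields

The code is as follows: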

x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}

distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}

grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;

When I do the grouping, I get the following error:

ERROR org.apache.pig.tools.pigstats.SimplePigStats  - 
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String

keywords is a chararray, and Pig should be able to group on a chararray. Any ideas?

EDIT: Input file:

0000010000014743       call for midwife    23      1425761139
0000010000062069       naruto 1    56      1425780386
0000010000079919       the following    98     1425788874
0000010000081650       planes 2    76      1425721945
0000010000118785       law and order    21     1425763899
0000010000136965       family guy    12    1425766338
0000010000136100       american dad    19      1425766702

The .pig_schema file:

{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}

Answer 1

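Pig is not recognizing the values of keywords as chararray. It is better to name the fields during the initial LOAD, so that the field types are stated explicitly.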

x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
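
For reference, here is a minimal sketch of the whole pipeline from the question with the explicit schema applied. The alias names `pairs` and `unique_pairs` are only illustrative (they are not from the original script), and I am assuming the same tab-separated input and `$input` parameter as in the question.

```pig
-- Declare field names and types explicitly at load time
x = LOAD '$input' USING PigStorage('\t')
    AS (id:chararray, keywords:chararray, score:chararray, time:long);

-- Keep only the two fields of interest (alias name is illustrative)
pairs = FOREACH x GENERATE keywords, id;

-- Remove duplicate (keywords, id) pairs
unique_pairs = DISTINCT pairs;

-- Group the de-duplicated pairs by keywords
grouped = GROUP unique_pairs BY keywords;
DUMP grouped;
```

Each relation gets its own alias here instead of being reassigned, which makes it easier to DESCRIBE intermediate steps when tracking down a type problem.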

Update:

With the .pig_schema updated to include score and '\t' as the delimiter, the following steps work on the shared input.

x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;

Using unique aliases is recommended, for better readability and maintainability.

Output

    (naruto 1,{(naruto 1,0000010000062069)})
    (planes 2,{(planes 2,0000010000081650)})
    (family guy,{(family guy,0000010000136965)})
    (american dad,{(american dad,0000010000136100)})
    (law and order,{(law and order,0000010000118785)})
    (the following,{(the following,0000010000079919)})
    (call for midwife,{(call for midwife,0000010000014743)})
