Pig: Error when grouping data

Time: 2015-05-27 20:13:59

Tags: apache-pig

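This is the code I am trying to run. The steps are:

  1. Load the input (the input folder contains a .pig_schema file)
  2. Keep only two of its fields (both chararray) and remove duplicates
  3. Group on one of those fields

The code is as follows: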

    x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
    x = LIMIT x 25;
    DESCRIBE x;
    -- Output of DESCRIBE x:
    -- x: {id: chararray,keywords: chararray,score: chararray,time: long}
    
    distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
    distinctCounts = DISTINCT distinctCounts; -- remove duplicates
    DESCRIBE distinctCounts;
    -- Output of DESCRIBE distinctCounts;
    -- distinctCounts: {keywords: chararray,id: chararray}
    
    grouped = GROUP distinctCounts BY keywords; --group by keywords
    DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
    DUMP grouped;
    

When I do the grouping, I get the following error:

    ERROR org.apache.pig.tools.pigstats.SimplePigStats  - 
    ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
    

keywords is a chararray, and Pig should be able to group on a chararray. Any ideas?

Edit: the input file:

    0000010000014743       call for midwife    23      1425761139
    0000010000062069       naruto 1    56      1425780386
    0000010000079919       the following    98     1425788874
    0000010000081650       planes 2    76      1425721945
    0000010000118785       law and order    21     1425763899
    0000010000136965       family guy    12    1425766338
    0000010000136100       american dad    19      1425766702
    

The .pig_schema file:

    {"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
    

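1 Answer:

Answer 0: (score: 1)

Pig is not recognizing the value of keywords as a chararray. It is better to name and type the fields in the initial LOAD, so that the field types are stated explicitly.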

    x = LOAD '$input' USING PigStorage('\t') AS (id:chararray, keywords:chararray, score:chararray, time:long);
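
If declaring the schema in the LOAD is not convenient, another option is to ignore the stored .pig_schema and cast the fields by position. The following is only a sketch: it assumes the column order shown in the question (id first, keywords second), the aliases raw, pairs and unique_pairs are illustrative names, and it has not been verified against the Pig version that produced the error.

```pig
-- Assumption: '-noschema' makes PigStorage skip the stored .pig_schema,
-- so every field is loaded as a bytearray and must be referenced by position.
raw = LOAD '$input' USING PigStorage('\t', '-noschema');

-- Cast explicitly to chararray and name the fields ($0 = id, $1 = keywords).
pairs = FOREACH raw GENERATE (chararray)$1 AS keywords, (chararray)$0 AS id;

-- Remove duplicate (keywords, id) pairs, then group on keywords.
unique_pairs = DISTINCT pairs;
grouped = GROUP unique_pairs BY keywords;
DESCRIBE grouped;
```

With this approach the types come from the casts in the script instead of the schema file, at the cost of depending on the column order.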

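Update:

I tried the snippet below on the shared input, with the .pig_schema updated to include score and '\t' as the delimiter:

    x = LOAD 'a.csv' USING PigStorage('\t');
    distinctCounts = FOREACH x GENERATE keywords, id;
    distinctCounts = DISTINCT distinctCounts;
    grouped = GROUP distinctCounts BY keywords;
    DUMP grouped;

Using a unique alias for each relation is recommended, for better readability and maintainability.

Output: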

    (naruto 1,{(naruto 1,0000010000062069)})
    (planes 2,{(planes 2,0000010000081650)})
    (family guy,{(family guy,0000010000136965)})
    (american dad,{(american dad,0000010000136100)})
    (law and order,{(law and order,0000010000118785)})
    (the following,{(the following,0000010000079919)})
    (call for midwife,{(call for midwife,0000010000014743)})