Pig: Unable to load data using PigStorage

Time: 2015-02-02 02:01:19

Tags: hadoop mapreduce apache-pig bigdata

I have this sample dataset in a txt file (format: firstname,lastname,age,sex):

(Eric,Ack,27,M),(Jeremy,Ross,29,F)
(Jenny,Dicken,27,F),(Vijay,Sampath,40,M)
(Angs,Dicken,28,M),(Venu,Rao,28,M)
(Mahima,Mohanty,29,F),(Kenny,Oath,28,M)

I am trying to load the data like this:

tuple_record = LOAD '~/Documents/Pig_Tuple.txt' USING PigStorage(',') AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray));

But this does not work:

DUMP tuple_record;

When I run this command I get the following (that is, it returns only empty tuples):

()
()
()
()

Please suggest how to load this dataset.

2 answers:

Answer 0: (score: 2)

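The reason is that the tuples, and the fields inside each tuple, use the same delimiter (','). PigStorage(',') splits each line on every comma, so the first field is only a fragment of a tuple, such as "(Eric", and Pig fails when converting it to the tuple type declared in the schema.

You can see the following log in the console: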

"Unable to interpret the value in field being converted to type tuple, caught ParseException <Unexpect end of tuple> field discarded"
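
The effect can be reproduced outside Pig. The sketch below uses awk as a stand-in for a plain split on ',' (an approximation of what a comma delimiter does to the line before any schema is applied; it is not Pig itself) on one line of the sample data. The variable names are illustrative only.

```shell
# One line of the sample input from the question.
line='(Eric,Ack,27,M),(Jeremy,Ross,29,F)'

# Split on every comma, ignoring parentheses, and record
# how many fields result and what the first one looks like.
field_count=$(printf '%s\n' "$line" | awk -F',' '{ print NF }')
first_field=$(printf '%s\n' "$line" | awk -F',' '{ print $1 }')

echo "fields: $field_count"
echo "first field: $first_field"
```

The first field comes out as `(Eric`, an opening parenthesis with no matching close, which is consistent with the "Unexpect end of tuple" message above.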

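To fix this:

  1. Change the delimiter between the tuples from ',' to something else. In the example below I use '#' instead of ','. You can use any delimiter other than ','.
  2. Your input file has two tuples per line, but your load schema defines only one, so you also need to define the second tuple.

Example:

    Input: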

    (Eric,Ack,27,M)#(Jeremy,Ross,29,F)
    (Jenny,Dicken,27,F)#(Vijay,Sampath,40,M)
    (Angs,Dicken,28,M)#(Venu,Rao,28,M)
    (Mahima,Mohanty,29,F)#(Kenny,Oath,28,M)
    

    Pig script:

    tuple_record = LOAD '~/Documents/Pig_Tuple.txt' USING PigStorage('#') AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray), details1:tuple(firstname1:chararray,lastname1:chararray,age1:int,sex1:chararray));
    DUMP tuple_record;
    

    Output:

    ((Eric,Ack,27,M),(Jeremy,Ross,29,F))
    ((Jenny,Dicken,27,F),(Vijay,Sampath,40,M))
    ((Angs,Dicken,28,M),(Venu,Rao,28,M))
    ((Mahima,Mohanty,29,F),(Kenny,Oath,28,M))
    

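    Update
    How to change the delimiter ',' to something else

    Option 1: Using sed
    This is the simplest option: use sed to replace the pattern '),(' with ')#(' so that the delimiter between the tuples in the input file changes from ',' to '#'. (Note: -i edits the file in place, so back up the input file before running this sed command.)

    sed -i -- 's/),(/)#(/g' inputFile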
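
Before editing the real file in place, the substitution can be previewed on a scratch copy. This is a minimal sketch: the temporary file created by `mktemp` is an assumption standing in for your input file, and sed is run without `-i` so nothing is modified.

```shell
# Scratch copy holding two lines of the sample data (stand-in for inputFile).
sample=$(mktemp)
printf '%s\n' '(Eric,Ack,27,M),(Jeremy,Ross,29,F)' \
              '(Jenny,Dicken,27,F),(Vijay,Sampath,40,M)' > "$sample"

# Without -i, sed writes the result to stdout and leaves the file untouched.
converted=$(sed 's/),(/)#(/g' "$sample")
printf '%s\n' "$converted"

# Clean up the scratch file.
rm -f "$sample"
```

Only the comma between the two tuples matches '),(' so the commas inside each tuple are left alone.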
    

    Option 2: Keep the delimiter and adjust the script slightly
    Pig script:

    --Read each input line as chararray
    A = LOAD 'inputFile' AS (line:chararray);
    
    --Remove the characters '(' and ')' from the input
    B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[)(]+','')) AS (newline:chararray);
    
    --Split the input using ',' as delimiter; 8 is the total number of fields
    C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',8)) AS (firstname1:chararray,lastname1:chararray,age1:int,sex1:chararray,firstname2:chararray,lastname2:chararray,age2:int,sex2:chararray);
    
    --Group the fields and form tuples 
    D = FOREACH C GENERATE TOTUPLE(firstname1,lastname1,age1,sex1) AS details1,TOTUPLE(firstname2,lastname2,age2,sex2) AS details2;
    
    --Now you can do whatever you want.
    E = FOREACH D GENERATE details1.firstname1,details2.firstname2;
    DUMP E;
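
Option 2 relies on every line yielding exactly 8 fields once the parentheses are removed. That assumption can be checked from the shell before running the script. The sketch below uses sed and awk (not Pig) on two sample lines, and `check_fields` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: strip '(' and ')' from stdin, then report any line
# whose comma-separated field count is not 8. Exits non-zero if one is found.
check_fields() {
    sed 's/[()]//g' |
        awk -F',' 'NF != 8 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
                   END { exit (bad ? 1 : 0) }'
}

printf '%s\n' '(Eric,Ack,27,M),(Jeremy,Ross,29,F)' \
              '(Jenny,Dicken,27,F),(Vijay,Sampath,40,M)' | check_fields \
    && echo "all lines have 8 fields"
```

To check a real file, feed it to the helper on stdin, for example `check_fields < inputFile`.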
    

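Answer 1: (score: 1)

Please see the complex schemas section of the Pig Documentation. In its example the two tuples on each line appear to be separated by a tab, which is PigStorage's default delimiter, so no USING clause is needed:
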
cat data;
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)

A = LOAD 'data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));

DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}

DUMP A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))