Pig: incorrect data handling with piggybank

Time: 2017-05-10 07:37:00

Tags: hadoop apache-pig

I have a file with the structure shown below:

ID,Name,Address

1,"Amrit,kumar",India   
2,"Vaibhav,arora",USA   
3,"Deepika,kumar",Germany

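Obviously, if I use PigStorage(','), the three fields are split into 4 and the data overflows. Alternatives I have tried:

  1. I tried piggybank, but the problem persists and the data still overflows. Please find the script below:

    A11 = LOAD 'File.csv.gz' USING org.apache.pig.piggybank.storage.CSVLoader() as (column:type)

  2. I tried the REPLACE function, but across my 35k rows not every row was changed, so the data still overflows and column values shift into the next column. The question I followed is linked below:

    how can i ignore " (double quotes) while loading file in PIG?

  3. I also tried CSVExcelStorage and CSVLoader.

  4. Please tell me what can be done here. I want the name value in a single column.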
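
To make the overflow concrete, here is a minimal sketch in Python rather than Pig, assuming the three sample rows above with the trailing spaces removed. The names `SAMPLE`, `naive_split` and `quote_aware_split` are hypothetical and exist only for this illustration. It contrasts a plain comma split, which is what PigStorage(',') does, with a quote-aware CSV parse.

```python
import csv
import io

# The three sample rows from the question (trailing spaces removed).
SAMPLE = '''1,"Amrit,kumar",India
2,"Vaibhav,arora",USA
3,"Deepika,kumar",Germany
'''


def naive_split(text):
    """Split every line on each comma, ignoring quotes."""
    return [line.split(",") for line in text.splitlines() if line]


def quote_aware_split(text):
    """Parse the text as CSV, so commas inside double quotes are kept."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


if __name__ == "__main__":
    print(naive_split(SAMPLE)[0])        # 4 fields: the name is cut in two
    print(quote_aware_split(SAMPLE)[0])  # 3 fields: the name stays whole
```

The plain split yields four fields per row because the comma inside the quoted name is treated as a delimiter, while the quote-aware parse yields the intended three.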

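2 answers:

Answer 0: (score: 0)

Load the data into 4 fields, remove the quotes, append a space to the 2nd field, and finally concatenate the 2nd and 3rd fields to get the full name in a single field/column. No external jar is needed.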

A = LOAD 'File.csv.gz' USING PigStorage(',') AS (f1:int,f2:chararray,f3:chararray,f4:chararray);
B = FOREACH A GENERATE 
            f1,
            CONCAT(REPLACE(f2,'\\"',''),' ') as f2, -- replace beginning quote and add space at end
            REPLACE(f3,'\\"','') as f3,             -- replace ending quote
            f4;
C = FOREACH B GENERATE 
            f1 as id,
            CONCAT(f2,f3) as name,
            f4 as country;
DUMP C;

Answer 1: (score: 0)

Test this script with your data:

-- load as four fields
a = LOAD 'data.txt' using PigStorage(',');

-- removes double quotes from second and third fields; the address is the fourth field ($3)
b = foreach a generate $0 as id, REPLACE($1, '"', '') as firstname, REPLACE($2, '"', '') as lastname, $3 as address;

-- combines second and third field with a ',' in between
c = foreach b generate id,  CONCAT(firstname, ',', lastname) as name, address;

Now, test the result:

test = foreach c generate name;
dump test;
(Amrit,kumar)
(Vaibhav,arora)
(Deepika,kumar)
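
The split-and-rejoin idea used in both answers can be sketched outside Pig as follows. This is a Python illustration under the assumption that every row has exactly one comma inside the quoted name. The helper `rejoin_name` is hypothetical and is not part of either answer.

```python
def rejoin_name(line):
    """Split on every comma, strip the quotes, then glue fields 2 and 3 back together.

    Assumes exactly four comma-separated pieces per line; a row whose name has
    no comma (or more than one) raises ValueError on unpacking.
    """
    f1, f2, f3, f4 = line.strip().split(",")
    name = f2.replace('"', '') + "," + f3.replace('"', '')
    return (f1, name, f4)


if __name__ == "__main__":
    print(rejoin_name('1,"Amrit,kumar",India   '))
```

The unpacking step is the weak point: if some of the 35k rows carry a name without an embedded comma, they split into three pieces instead of four, and a fixed four-field schema would shift the remaining columns for those rows.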