I have a file with the following structure:
ID,Name,Address
1,"Amrit,kumar",India
2,"Vaibhav,arora",USA
3,"Deepika,kumar",Germany
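Obviously, if I use PigStorage(','), the three fields are split into four and the data spills into the wrong columns. Here is what I have tried so far:
I tried Piggybank, but the problem persists and the data still spills over. The script is below: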
A11 = LOAD 'File.csv.gz' USING org.apache.pig.piggybank.storage.CSVLoader() as (column:type)
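I also tried the REPLACE function, but none of my 35k rows were changed. The data still spills over, with column values shifting into the next column. The question I used as a reference is linked below: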
how can i ignore " (double quotes) while loading file in PIG?
I have also tried CSVExcelStorage and CSVLoader.
Please tell me what can be done here. I want the name value to stay in a single column.
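Answer 0 (score: 0)
Load the data into 4 fields, strip the quotes, append a space to the 2nd field, and finally concatenate the 2nd and 3rd fields to get the full name in a single field/column. No external jar is needed.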
A = LOAD 'File.csv.gz' USING PigStorage(',') AS (f1:int,f2:chararray,f3:chararray,f4:chararray);
B = FOREACH A GENERATE
f1,
CONCAT(REPLACE(f2,'\\"',''),' ') as f2, -- replace beginning quote and add space at end
REPLACE(f3,'\\"','') as f3, -- replace ending quote
f4;
C = FOREACH B GENERATE
f1 as id,
CONCAT(f2,f3) as name,
f4 as country;
DUMP C;
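A variation on the same idea, in case it helps: instead of splitting on every comma and stitching the name back together, each line can be loaded whole and split with a regular expression that treats the quoted part as one field. This is only a sketch, not something tested against your file. It assumes every data row has exactly the shape number,"quoted name",country, and the relation names (raw, rows, parsed) are made up for illustration. Like the script above, it uses only built-in functions.

```pig
-- Load each line as a single chararray; no field splitting happens here
raw = LOAD 'File.csv.gz' USING TextLoader() AS (line:chararray);

-- Keep only lines that start with a numeric ID, which also drops a header row if one is present
rows = FILTER raw BY line MATCHES '\\d+,.*';

-- Capture three groups: the ID, the text between the double quotes, and the rest of the line
parsed = FOREACH rows GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '^(\\d+),"([^"]*)",(.*)$'))
    AS (id:chararray, name:chararray, country:chararray);

DUMP parsed;
```

The comma inside the quotes is kept as part of name here. Rows that do not match the assumed shape would come back as nulls, so it is worth checking for those before relying on the result.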
Answer 1 (score: 0)
I tested this script with your data:
-- load as four fields
a = LOAD 'data.txt' using PigStorage(',');
-- removes double quotes from the second and third fields; the address is the fourth field ($3)
b = foreach a generate $0 as id, REPLACE($1, '"', '') as firstname, REPLACE($2, '"', '') as lastname, $3 as address;
-- combines second and third field with a ',' in between
c = foreach b generate id, CONCAT(firstname, ',', lastname) as name, address;
Now, the test results:
test = foreach c generate name;
dump test;
(Amrit,kumar)
(Vaibhav,arora)
(Deepika,kumar)