Removing commas only inside quotes in Pig

Time: 2017-07-20 09:10:32

Tags: apache-pig

I have data like this:

1,234,"john, lee", john@xyz.com

I want to replace the comma inside the "" with a space using a Pig script, so that my data looks like:

1,234,john lee, john@xyz.com

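I tried loading this data with CSVExcelStorage, but I need the '-tagFile' option, which CSVExcelStorage does not support. So I plan to use plain PigStorage and then replace any comma (,) that falls inside quotes. I am stuck on this. Any help is much appreciated. Thanks.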

3 answers:

Answer 0 (score: 1)

The following commands will help:

csvFile = load '/path/to/file' using PigStorage(',');
result = foreach csvFile generate $0 as (field1:chararray),$1 as (field2:chararray),CONCAT(REPLACE($2, '\\"', '') , REPLACE($3, '\\"', '')) as field3,$4 as (field4:chararray);

Output:

(1,234,john lee,john@xyz.com)
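
To see why this answer concatenates $2 and $3: splitting on every comma, as PigStorage(',') does, cuts the quoted name into two fields. Below is a minimal Python sketch of the same string logic (it is not Pig code, and the function name merge_quoted_name is made up for illustration). It assumes, as the answer does, that the quoted value is the third column and contains exactly one comma.

```python
def merge_quoted_name(line: str) -> list:
    """Mimic the answer's steps on one input line (illustrative only)."""
    # Like PigStorage(','): split on every comma, quoted or not.
    fields = line.split(",")
    # fields[2] and fields[3] are the two halves of the quoted name;
    # strip the quotes from each half and join them again.
    name = fields[2].replace('"', "") + fields[3].replace('"', "")
    return [fields[0], fields[1], name, fields[4]]


print(merge_quoted_name('1,234,"john, lee", john@xyz.com'))
```

If a quoted value held two or more commas, the field positions would shift and this positional approach would no longer line up.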

Answer 1 (score: 0)

Load each line into a single field, then use STRSPLIT and REPLACE:

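A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
-- Split on the double quote into at most 3 parts and flatten the tuple into named fields
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\"',3)) AS (pre:chararray, quoted:chararray, post:chararray);
-- Remove the comma from the quoted part only; pass the other two parts through unchanged
C = FOREACH B GENERATE pre, REPLACE(quoted,',','') AS quoted, post;
D = FOREACH C GENERATE CONCAT(CONCAT(pre,quoted),post) AS line; -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE STRSPLIT(line,',',4);
DUMP E;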

A

1,234,"john, lee", john@xyz.com

B

(1,234,)(john, lee)(, john@xyz.com)

C

(1,234,)(john lee)(, john@xyz.com)

D

(1,234,john lee, john@xyz.com)

E

(1),(234),(john lee),(john@xyz.com)
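
The stages A to E above can be traced outside Pig with a short Python sketch. This is only an illustration of the string handling, not Pig code; the function name remove_comma_in_quotes is hypothetical, and it assumes each line contains exactly one quoted field.

```python
def remove_comma_in_quotes(line: str) -> dict:
    """Trace the answer's relations B..E for a single input line."""
    # B: split on the double quote into at most three parts
    #    (maxsplit=2 here plays the role of the limit 3 in STRSPLIT).
    before, inside, after = line.split('"', 2)
    # C: drop the comma from the quoted part only.
    inside = inside.replace(",", "")
    # D: glue the three parts back together.
    joined = before + inside + after
    # E: split the cleaned line into four fields.
    return {"D": joined, "E": joined.split(",", 3)}


stages = remove_comma_in_quotes('1,234,"john, lee", john@xyz.com')
print(stages["D"])
print(stages["E"])
```

A line without a pair of quotes would make the three-way unpacking fail, which mirrors the assumption that the quoted field is always present.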

Answer 2 (score: 0)

I found a good way to do this. A fairly generic solution is as follows:

-- Note: with ',' as the delimiter PigStorage splits each line on every comma, so for `record` to hold
-- a whole line the delimiter should be a character that never occurs in the data.
data = LOAD 'data.csv' using PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);

/* Remove a comma (,) when it appears inside quoted column content */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;

/* Remove the quotes (") that CSV places around a column whose content contains a comma */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record,'"','') AS record;

A detailed use case is available on my blog.
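
The regular expression in this answer can be checked outside Pig. Pig's REPLACE takes a Java regular expression, and Python's re module handles this particular pattern the same way, so the sketch below uses Python purely as a quick test bench. The names COMMA_IN_QUOTES and clean_record are made up for illustration, and the pattern is written without Pig's string escaping.

```python
import re

# A comma matches only when the rest of the line does NOT contain an even
# number of double quotes, i.e. the comma sits inside an open quoted field.
COMMA_IN_QUOTES = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')


def clean_record(record: str) -> str:
    """Remove commas inside quoted fields, then remove the quotes."""
    without_commas = COMMA_IN_QUOTES.sub("", record)
    return without_commas.replace('"', "")


print(clean_record('1,234,"john, lee", john@xyz.com'))
print(clean_record('a,"x, y, z",b,"p,q"'))
```

Unlike the positional approaches, this handles any number of quoted fields and any number of commas inside them, provided the quotes in each line are balanced.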