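Apache Pig: how to replace all commas in a chararray

Asked: 2016-10-26 18:49:07

Tags: apache-pig data-cleansing

I am trying to replace all the commas inside one field with another character, like this:

Example input line: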

1,compras com cartão, comprei (cp1,cp2,cp3), 206-01-01 00:00:00

Example output:

1,compras com cartão, comprei (cp1 cp2 cp3), 206-01-01 00:00:00

I am using this approach:

raw_data = LOAD 's3://datalake/example' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE') AS (id:int, transaction:chararray, transaction_name:chararray, date:chararray);

apply_cleanness = FOREACH raw_data GENERATE id, transaction, REPLACE(transaction_name,',','') AS transaction_name, date;

But this command only removes the first occurrence of the comma. The result is:

1,compras com cartão, comprei (cp1 cp2, cp3), 206-01-01 00:00:00

What am I doing wrong?

Thanks,

1 answer:

Answer 0 (score: 1)

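The third field is not explicitly delimited, so the commas inside it are treated as field separators. You have two options. The first is to enclose the third field in quotes and then use your script: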

1,compras com cartão, "comprei (cp1,cp2,cp3)", 206-01-01 00:00:00

raw_data = LOAD 's3://datalake/example' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE') AS (id:int, transaction:chararray, transaction_name:chararray, date:chararray);
apply_cleanness = FOREACH raw_data GENERATE id, transaction, REPLACE(transaction_name,',','') AS transaction_name, date;
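
To see why the quoting matters, the splitting can be modeled outside Pig. The following is a minimal Python sketch, not Pig code: it uses Python's csv module as a stand-in for a quote-aware CSV loader, and the names RAW, QUOTED, and parse are made up for illustration. It assumes the loader accepts the space before the opening quote (here that is enabled with skipinitialspace=True); whether CSVExcelStorage does the same is not verified here.

```python
import csv

# The original line and the variant with the third field quoted.
RAW = "1,compras com cartão, comprei (cp1,cp2,cp3), 206-01-01 00:00:00"
QUOTED = '1,compras com cartão, "comprei (cp1,cp2,cp3)", 206-01-01 00:00:00'


def parse(line):
    """Parse one CSV line, honoring double quotes around a field."""
    # skipinitialspace=True drops the space after each delimiter, so the
    # quote that follows ", " is still recognized as an opening quote.
    return next(csv.reader([line], skipinitialspace=True))


raw_fields = parse(RAW)        # commas inside the parentheses split the field
quoted_fields = parse(QUOTED)  # the quoted field stays in one piece

print(len(raw_fields), raw_fields)
print(len(quoted_fields), quoted_fields)

# Only once the field arrives whole can every comma in it be replaced.
cleaned = quoted_fields[2].replace(",", " ")
print(cleaned)
```

In the unquoted case the text that ends up in the third column contains no comma at all, which is consistent with the replacement appearing to do less than expected.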

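Alternatively, load the fields using the comma as the delimiter, then build the third field by combining fields 3, 4, and 5 from the load, as follows: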

A = LOAD 'test16.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
B = FOREACH A GENERATE $0 as id:int,$1 as transaction:chararray,CONCAT(CONCAT(CONCAT(CONCAT($2,' '),$3),' '),$4) as transaction_name:chararray,$5 as date:chararray; 
DUMP B;
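
The nested CONCAT above only fits rows where the third field contains exactly two commas. As a rough illustration of the same idea for a variable number of commas, here is a small Python sketch, again a model of the logic and not Pig code. It assumes the id and transaction columns never contain commas and that the date is always the last column; rebuild_row is a hypothetical helper name. Unlike CONCAT, it also strips the leading space from each piece.

```python
def rebuild_row(line):
    """Split on every comma, then glue the middle pieces back together."""
    fields = line.split(",")
    # Everything between the second column and the last column belongs to
    # transaction_name; join the pieces with spaces instead of commas.
    name = " ".join(part.strip() for part in fields[2:-1])
    return [fields[0], fields[1], name, fields[-1].strip()]


row = rebuild_row(
    "1,compras com cartão, comprei (cp1,cp2,cp3), 206-01-01 00:00:00"
)
print(row)
```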
