Apache Pig - How do I read data from a CSV file in which values are optionally enclosed in double quotes?
Sample data:
"Traditional",0.03,"Department, of Housing and Urban Development (HUD)",0.01
Expected output:
Traditional 0.03 Department, of Housing and Urban Development (HUD) 0.01
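The example above has 4 columns. Two are enclosed in double quotes; the other two are unquoted floating-point values. In addition, the third column contains a comma within the data itself.
Please point me to a Pig-related API (with sample code) that splits this data correctly and lets me process the fields with positional notation such as $0, $1, $2, $3.
I have already explored CSVExcelStorage and CSVLoader from PiggyBank, but I could not get the data to split correctly.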
Answer 0 (score: 1)
Option 1 - Use CSVLoader or CSVExcelStorage
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
-- CSVLoader treats a double-quoted field as a single value, so the comma inside field3 is preserved.
a = LOAD 'data' USING CSVLoader() AS (field1:chararray, field2:double,
    field3:chararray, field4:double);
b = FOREACH a GENERATE $0, $1, $2, $3;
DUMP b;
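Option 1 names CSVExcelStorage without showing it. A minimal sketch of that variant, assuming piggybank.jar is available and the input sits at the placeholder path 'data'; the field names are illustrative, and passing only the delimiter leaves the loader's other settings (multiline, line-ending, and header handling) at their defaults:

```pig
REGISTER piggybank.jar;
-- The fully qualified class name is used so the delimiter argument is passed to the loader itself.
a = LOAD 'data' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (field1:chararray, field2:double, field3:chararray, field4:double);
-- Positional references work the same way as with any other loader.
b = FOREACH a GENERATE $0, $1, $2, $3;
DUMP b;
```

CSVExcelStorage can also be used with STORE, which may be convenient if the result has to be written back out as quoted CSV.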
Option 2 - TextLoader + STRSPLIT + REPLACE
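A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);
-- Strip the double quotes; name the result so the next statement can reference it.
B = FOREACH A GENERATE REPLACE(line, '"', '') AS line;
-- Caution: this splits on every comma, including commas inside originally quoted values.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line, ','));
DUMP C;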
Source: http://www.crackinghadoop.com/hadoop-pig-loading-files-with-quotes-and-comma-delimiters/
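Because Option 2 splits on every comma, the comma inside the third column of the sample row would still yield an extra field. One possible variant, not taken from the linked source, splits first with a regular expression that matches only commas outside double quotes and strips the quotes afterwards. The regex, the relation names, and the field names f1 to f4 are assumptions for illustration, and the pattern assumes balanced quotes with no escaped quotes inside values:

```pig
A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);
-- Split only on commas that are followed by an even number of double quotes,
-- i.e. commas that are not inside a quoted value.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ',(?=(?:[^"]*"[^"]*")*[^"]*$)'))
    AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
-- Remove the surrounding quotes and cast the numeric columns.
C = FOREACH B GENERATE REPLACE(f1, '"', '') AS f1, (double)f2 AS f2,
                       REPLACE(f3, '"', '') AS f3, (double)f4 AS f4;
DUMP C;
```

For anything beyond simple rows like the sample, a dedicated loader such as the ones in Option 1 is likely the more robust choice.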
Answer 1 (score: 1)
-- Note: PigStorage splits on every comma and does not treat double quotes specially.
a = LOAD 'filename.csv' USING PigStorage(',') AS (fieldname:chararray, fieldname2:float);
DUMP a;