Apache Pig - How to read data from a CSV file

Date: 2015-11-19 10:46:50

Tags: apache-pig

Apache Pig - How can I read data from a CSV file whose fields are optionally enclosed in double quotes?

Sample data:

"Traditional",0.03,"Department, of Housing and Urban Development (HUD)",0.01 

Expected output:

Traditional  0.03  Department, of Housing and Urban Development (HUD)  0.01

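The example above has 4 columns. Two are enclosed in double quotes and two are not; the unquoted ones are of a floating-point type. In addition, the third column contains a comma within the data itself.

Please point me to a Pig API (with sample code) that splits this data correctly so the fields can be processed using positional notation, e.g. $0, $1, $2, $3.

I have explored CSVExcelStorage and CSVLoader from PiggyBank, but I could not get the data to split correctly.

2 Answers:

Answer 0: (score: 1)

Option 1 - Use CSVLoader or CSVExcelStorage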

 REGISTER piggybank.jar;
 DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

 -- CSVLoader splits on commas and keeps a double-quoted field as one value.
 a = LOAD 'data' USING CSVLoader() AS (field1:chararray,field2:double,
                                       field3:chararray,field4:double);

 b = FOREACH a GENERATE $0,$1,$2,$3;

 DUMP b;
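
Option 1 names CSVExcelStorage but only shows CSVLoader. A minimal sketch of the CSVExcelStorage variant follows, assuming the same piggybank.jar and the same 'data' path as above; the field names are illustrative, and the fourth column is declared double on the assumption that it holds a number, as the question describes.

```pig
REGISTER piggybank.jar;

-- CSVExcelStorage reads a double-quoted field as a single value,
-- so the comma inside the third column should not start a new field.
a = LOAD 'data' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (field1:chararray, field2:double, field3:chararray, field4:double);

-- Positional references work the same way as with any other loader.
b = FOREACH a GENERATE $0, $1, $2, $3;

DUMP b;
```

The fully qualified class name is used here instead of a DEFINE alias only to keep the sketch short; either form should work.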

Option 2 - TextLoader + STRSPLIT + REPLACE

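 A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);

 -- Strip the double quotes; name the result so the next step can refer to it.
 B = FOREACH A GENERATE REPLACE(line,'"','') AS line;

 -- Caveat: this splits on every comma, including one inside a formerly quoted field.
 C = FOREACH B GENERATE FLATTEN(STRSPLIT(line, ','));

 DUMP C;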
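
Because STRSPLIT(line, ',') splits on every comma, the sample row would come out with five fields instead of four. One possible workaround, sketched here rather than taken from the answer, is to capture the fields with the built-in REGEX_EXTRACT_ALL. The pattern below assumes every line has exactly the layout of the sample row (quoted text, number, quoted text, number); the names f0 to f3 are illustrative.

```pig
A = LOAD '/path/to/files/' USING TextLoader() AS (line:chararray);

-- Each capture group becomes one field of the returned tuple.
-- Quoted groups may contain commas; unquoted groups may not.
B = FOREACH A GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(line, '"([^"]*)",([^,]*),"([^"]*)",([^,]*)')
    ) AS (f0:chararray, f1:chararray, f2:chararray, f3:chararray);

-- Cast the numeric columns after splitting; TRIM drops stray whitespace.
C = FOREACH B GENERATE f0, (double)TRIM(f1) AS f1, f2, (double)TRIM(f3) AS f3;

DUMP C;
```

The pattern has to match the whole line, so rows with a different layout will not be split this way; for files where quoting varies from row to row, a CSV-aware loader is likely the better fit.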

Source: http://www.crackinghadoop.com/hadoop-pig-loading-files-with-quotes-and-comma-delimiters/

Answer 1: (score: 1)

a = LOAD 'filename.csv' USING PigStorage (',') AS (fieldname:chararray, fieldname2:float);

DUMP a;