Pig CSVExcelStorage DoubleQuoted Commas

时间:2016-07-11 20:20:53

标签: csv hadoop apache-pig delimiter

I am receiving csv formatted files (fields are comma delimited and double-quoted) into HDFS and have developed a pig script which removes the header rows and strips the double quotes before I insert the data into Hive using an HQL script.

This process has been working fine; however, today I discovered a data issue with one of the tables. The files for this table in particular have a string field which can contain multiple commas within the double quotes. This is resulting in the data being incorrectly loaded into the wrong columns in Hive for some of the records.

I cannot change the format of the files at the source.

Currently I am using the PiggyBank CSVExcelStorage to process the csv formatting as follows. Can this be modified to produce the correct result? What other options do I have? I noticed that there is also a CSVLoader now but havent found any examples showing how to use/implement it. Pig CSVLoader

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')

Editing to add additional sample data and results of testing:

Sample Input File Data:

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

Using CSVExcelLoader with formatting provided above:

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

Using CSVLoader as CSVLoader(): Notice - Didn't see any options for parameters to be provided to the constructor

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

The only real difference I see is that CSVLoader isn't removing the header row as I saw no option to select this and instead its removing some of the header names.

Am I doing something incorrectly? A working solution will be appreciated.

1 个答案:

答案 0 :(得分:2)

要解决字段中的逗号问题,您可以尝试这项工作。

将数据加载为一行 对待","作为分隔符并用管道字符替换它,即' |' 替换开头和结尾的引用"用空字符串。
使用' |'将线路加载到配置单元中作为分隔符。

A = LOAD 'test1.csv' AS (lines:chararray);
ranked = rank A;
B = FILTER ranked BY (rank_A > 1);
C = FOREACH B GENERATE REPLACE($1,'","','|');
D = FOREACH C GENERATE REPLACE($0,'"','');
DUMP D;

A = LOAD' test1.csv' AS(行:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

排名=排名A;

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

B = FILTER排名BY(rank_A> 1);

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

C = FOR BACH B GENERATE REPLACE($ 1,'","',' |');

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q")

D = FOR CACH GENERATE REPLACE($ 0,'"''');

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q)

您现在可以使用' |'将此数据加载到配置单元。作为分隔符。

enter image description here