How can I ignore " (double quotes) when loading a file in PIG?

Time: 2014-07-03 13:11:53

Tags: apache-pig hdfs

I have the following data in a file:

"a","b","1","2"
"a","b","4","3"
"a","b","3","1"

I am using the following command to read this file:

File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int);

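But the data in the 3rd and 4th columns is ignored here.

How can I read this file correctly, or otherwise make PIG skip the '"'?

Additional information: I am using Apache Pig version 0.10.0.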

4 Answers:

Answer 0: (score: 4)

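You can use the REPLACE function (although this will not happen in a single pass). REPLACE returns a chararray, so the result is cast to int explicitly:

file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate (chararray)$0 as f1, (chararray)$1 as f2, (int)REPLACE((chararray)$2, '\\"', '') as f3, (int)REPLACE((chararray)$3, '\\"', '') as f4;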

You can also use a regular expression with REGEX_EXTRACT:

file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);

Of course, you can strip the " from f1 and f2 in the same way.
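To see what the two expressions do to a single quoted field, here is a minimal Python sketch. It only mimics the behaviour with the standard `re` module and is not Pig code; the helper names `pig_replace` and `pig_regex_extract` are made up for this illustration.

```python
import re

def pig_replace(field, regex, replacement):
    """Rough analogue of Pig's REPLACE: substitute every regex match."""
    return re.sub(regex, replacement, field)

def pig_regex_extract(field, regex, group):
    """Rough analogue of Pig's REGEX_EXTRACT: one capture group, or None if nothing matches."""
    match = re.search(regex, field)
    return match.group(group) if match else None

raw = '"1"'                                    # third field of the first sample row
replaced = pig_replace(raw, r'\"', '')         # quotes removed, still a string
extracted = pig_regex_extract(raw, r'([0-9]+)', 1)  # digits only, still a string
```

Both results are still strings, which is why a cast to int is needed afterwards.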

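Answer 1: (score: 1)

Try the following (no need to escape or replace the double quotes):

REGISTER piggybank.jar;  -- assumed location; point this at your own piggybank jar
File1 = LOAD '/path' using org.apache.pig.piggybank.storage.CSVExcelStorage() as (f1:chararray,f2:chararray,f3:int,f4:int);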
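As an analogy for what a quote-aware CSV loader does differently from a plain comma split, here is a minimal sketch using Python's standard `csv` module on the sample rows from the question. It illustrates the idea only and says nothing about how CSVExcelStorage is implemented; the helper name `parse_quoted_rows` is hypothetical.

```python
import csv
import io

# The three sample rows from the question.
SAMPLE = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

def parse_quoted_rows(text):
    """Parse CSV text, dropping enclosing quotes and casting columns 3 and 4 to int."""
    rows = []
    for f1, f2, f3, f4 in csv.reader(io.StringIO(text)):
        rows.append((f1, f2, int(f3), int(f4)))
    return rows

# A plain split keeps the quotes, so the numeric fields cannot be cast directly.
naive = SAMPLE.splitlines()[0].split(',')
parsed = parse_quoted_rows(SAMPLE)
```

With the quotes handled by the parser, the int cast on the last two columns succeeds without any REPLACE step.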

Answer 2: (score: 0)

If you have Jython installed, you can deploy a simple UDF to do the job.

Python UDF

#!/usr/bin/env python

'''
udf.py
'''

# outputSchema is provided by Pig when the script is registered with Jython.
@outputSchema("out:chararray")
def formatter(item):
    # Pig passes None for null fields; hand them back unchanged.
    if item is None:
        return None
    # Strip the enclosing double quotes. The result stays a chararray,
    # matching the declared schema; cast to int in the Pig script if needed.
    return item.strip('"')

Pig script

REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
       col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;

(a,1)
(a,4)
(a,3)
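The UDF logic can be checked in plain Python before it is registered with Pig. The sketch below is a minimal example under two assumptions: `outputSchema` is replaced by a hypothetical no-op stand-in (Pig supplies the real decorator under Jython), and `formatter` is a simplified variant that only strips the quotes.

```python
def outputSchema(schema):
    """Hypothetical stand-in for the decorator Pig injects under Jython."""
    def wrap(func):
        func.output_schema = schema  # remember the schema string for inspection
        return func
    return wrap

@outputSchema("out:chararray")
def formatter(item):
    # Null fields arrive as None; return them unchanged.
    if item is None:
        return None
    # Remove the enclosing double quotes and keep the value as a string.
    return item.strip('"')

# Fields as a plain comma split would deliver them, quotes included.
rows = [
    ['"a"', '"b"', '"1"', '"2"'],
    ['"a"', '"b"', '"4"', '"3"'],
    ['"a"', '"b"', '"3"', '"1"'],
]
out = [(formatter(r[0]), formatter(r[2])) for r in rows]
```

The values in `out` line up with the first and third columns shown in the dump above, as strings.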

Answer 3: (score: 0)

How about using REPLACE, if the case is this simple?

data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;

new_data = foreach data generate 
   REPLACE(a, '"', '') AS a,
   REPLACE(b, '"', '') AS b, 
   (int)REPLACE(c, '"', '') AS c:int, 
   (int)REPLACE(d, '"', '') AS d:int;

One more tip: if you are loading a CSV file, setting the correct number format in a tool such as Excel may also help.