I have the following data in a file:
"a","b","1","2"
"a","b","4","3"
"a","b","3","1"
I am reading this file with the following command:
File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int)
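But the data in the third and fourth columns is ignored.
How can I read this file correctly, or otherwise make Pig skip the " characters?
Additional information: I am using Apache Pig version 0.10.0.
Answer 0 (score: 4)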
You can use the REPLACE function (although it will not do the job in a single pass):
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0 as (f1:chararray), $1 as (f2:chararray), REPLACE($2, '\\"', '') as (f3:int), REPLACE($3, '\\"', '') as (f4:int);
You can also use REGEX_EXTRACT with a regular expression:
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);
Of course, you can strip the " from f1 and f2 in the same way.
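To see what the two expressions above do to a single row, here is a rough Python sketch. It is only an illustration and not Pig code: `strip_quotes` and `first_digits` are hypothetical helper names that imitate REPLACE and REGEX_EXTRACT on one sample line, assuming a plain comma split leaves the quotes attached to every field.

```python
import re

def strip_quotes(field):
    # Imitates REPLACE(field, '\\"', ''): delete every double quote.
    return re.sub(r'"', '', field)

def first_digits(field):
    # Imitates REGEX_EXTRACT(field, '([0-9]+)', 1): the first run of
    # digits, or None when the field contains no digits at all.
    match = re.search(r'([0-9]+)', field)
    return match.group(1) if match else None

line = '"a","b","1","2"'

# A plain split on ',' keeps the quotes, so the third field is '"1"'
# and cannot be read as an integer directly.
fields = line.split(',')

# REPLACE-style cleanup of every field, then an explicit integer cast.
replaced = [strip_quotes(f) for f in fields]
f3, f4 = int(replaced[2]), int(replaced[3])

# REGEX_EXTRACT-style cleanup of the two numeric fields only.
extracted = [first_digits(f) for f in fields[2:]]

print(fields, replaced, extracted, f3, f4)
```

Either way the cleaned value is still a string until it is cast, which is why an explicit conversion to int follows the cleanup step.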
Answer 1 (score: 1)
Try the following (no need to escape or replace the double quotes):
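-- CSVExcelStorage lives in piggybank, so piggybank.jar has to be registered first
File1 = LOAD '/path' using org.apache.pig.piggybank.storage.CSVExcelStorage() as (f1:chararray,f2:chararray,f3:int,f4:int);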
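As a loose analogy for why a quote-aware loader helps, the following Python sketch contrasts a plain comma split with the standard `csv` module on the sample rows from the question. It is not Pig and says nothing about how CSVExcelStorage is implemented; it only shows the difference between delimiter-only splitting and parsing that understands quoted fields.

```python
import csv
import io

raw = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

# Delimiter-only splitting: the quotes stay attached to every field.
naive = [line.split(',') for line in raw.splitlines()]

# Quote-aware parsing: the surrounding quotes are consumed by the parser.
parsed = list(csv.reader(io.StringIO(raw)))

# With the quotes gone, the numeric columns convert cleanly.
typed = [(r[0], r[1], int(r[2]), int(r[3])) for r in parsed]

print(naive[0])
print(typed)
```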
Answer 2 (score: 0)
If you have Jython installed, you can deploy a simple UDF to do the job.
Python UDF
#!/usr/bin/env python
'''
udf.py
'''
@outputSchema("out:chararray")
def formatter(item):
    # Pig passes a missing field as None; hand it back unchanged.
    if item is None:
        return None
    # Drop the surrounding double quotes: '"a"' -> 'a', '"1"' -> '1'.
    # The declared schema is chararray, so numeric fields are returned as
    # strings and can be cast with (int) in the Pig script.
    return item.strip('"')
Pig script
REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;
(a,1)
(a,4)
(a,3)
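To try the quote-stripping logic without a Pig installation, a small local harness along these lines may help. It is a sketch under assumptions: the `outputSchema` function below is a hypothetical no-op stand-in for the decorator Pig supplies when it loads a Jython script, and `formatter` here is a simplified version that always returns a string.

```python
def outputSchema(schema):
    # Stand-in for the decorator Pig provides to Jython UDFs; outside Pig
    # the name is undefined, so this version only records the schema string.
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

@outputSchema("out:chararray")
def formatter(item):
    # Return the field without its surrounding double quotes.
    if item is None:
        return None
    return item.strip('"')

rows = ['"a","b","1","2"', '"a","b","4","3"', '"a","b","3","1"']

# Same projection as the Pig script: first and third columns of each row.
result = []
for row in rows:
    cols = row.split(',')
    result.append((formatter(cols[0]), formatter(cols[2])))

print(result)
```

The values in the second position are still strings at this point; in Pig they would need an (int) cast before being used as numbers.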
Answer 3 (score: 0)
How about using REPLACE? It should be enough if the case is this simple.
data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;
new_data = foreach data generate
    REPLACE(a, '"', '') AS a,
    REPLACE(b, '"', '') AS b,
    (int)REPLACE(c, '"', '') AS c:int,
    (int)REPLACE(d, '"', '') AS d:int;
One more tip: if you are going to load a CSV file, setting the correct number format in a tool such as Excel before exporting may also help.