Pig: custom function to load a multi-character ^^ (double caret) delimiter

Time: 2014-10-23 18:40:51

Tags: hadoop load apache-pig

I am new to Pig. Can someone help me load a file that uses multiple characters ('^^' in my case) as the column delimiter?

For example, I have a file with the following columns:
aisforapple ^^ bisforball ^^ cisforcat ^^ disfordoll ^^ andeisforelephant
fisforfish ^^ gisforgreen ^^ hisforhat ^^ iisforicecreem ^^ andjisforjar
kisforking ^^ lisforlion ^^ misformango ^^ nisfornose ^^ andoisfororange

Regards

1 answer:

Answer 0 (score: 2)

A regular expression works best for a multi-character delimiter like this.

input.txt
aisforapple^^bisforball^^cisforcat^^disfordoll^^andeisforelephant
fisforfish^^gisforgreen^^hisforhat^^iisforicecreem^^andjisforjar
kisforking^^lisforlion^^misformango^^nisfornose^^andoisfororange

PigScript
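-- Read each row as a single chararray field (the sample input contains no tabs).
A = LOAD 'input.txt' AS (line:chararray);
-- Capture the five fields that surround the four literal ^^ delimiters.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)\\^\\^(.*)\\^\\^(.*)\\^\\^(.*)\\^\\^(.*)')) AS (f1,f2,f3,f4,f5);
DUMP B;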

Output:
(aisforapple,bisforball,cisforcat,disfordoll,andeisforelephant)
(fisforfish,gisforgreen,hisforhat,iisforicecreem,andjisforjar)
(kisforking,lisforlion,misformango,nisfornose,andoisfororange)

Explanation

For clarity, here is the regex broken into one piece per line:
(.*)\\^\\^ -> matches any characters up to a ^^ and stores them in f1 (^ is a regex special character, so it is escaped with a backslash, and the backslash is doubled inside the quoted string)
(.*)\\^\\^ -> matches any characters up to the next ^^ and stores them in f2
(.*)\\^\\^ -> matches any characters up to the next ^^ and stores them in f3
(.*)\\^\\^ -> matches any characters up to the next ^^ and stores them in f4
(.*)       -> matches any characters up to the end of the string and stores them in f5
Each (.*) is greedy, so the fields line up as described only when a row contains exactly four ^^ delimiters.
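To check the pattern outside Pig, here is a minimal Java sketch. It assumes that Pig evaluates the pattern with Java's regex engine and requires the whole line to match. The class name CaretSplitDemo and the method extract are made up for illustration and are not part of Pig.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretSplitDemo {
    // Same pattern as in the Pig script. In a Java string literal "\\^"
    // yields the regex \^, which matches a literal caret.
    static final Pattern FIVE_FIELDS =
        Pattern.compile("(.*)\\^\\^(.*)\\^\\^(.*)\\^\\^(.*)\\^\\^(.*)");

    // Returns the five captured fields, or an empty list when the whole
    // line does not match (the empty list is this sketch's own convention).
    static List<String> extract(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIVE_FIELDS.matcher(line);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                fields.add(m.group(i));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // A row from input.txt: exactly four ^^ delimiters.
        System.out.println(extract(
            "aisforapple^^bisforball^^cisforcat^^disfordoll^^andeisforelephant"));
        // Too few delimiters: the pattern does not match.
        System.out.println(extract("a^^b^^c"));
        // One delimiter too many: the greedy first group absorbs it.
        System.out.println(extract("a^^b^^c^^d^^e^^f"));
    }
}
```

With six fields the first group captures "a^^b", so rows with a varying number of columns need a different pattern or a split-based approach.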