How to extract strings from double square brackets in Pig

Time: 2018-12-22 13:01:36

Tags: regex apache-pig

I am trying to extract strings from input data that looks like this: I love [[cricket]]. Let's play it at [[16:00]].

I want the output to be: cricket, 16:00

I have tried several regular expressions, for example:

'(?<=\[\[).*?(?=\]\])',  
'\\[\\[(.*?)\\]\\]',  
'\[\[(.*?)\]\]',  
'[[([^>]*?)]]'.

grunt> Register '/usr/local/pig/lib/piggybank.jar';
grunt> Define Xpath org.apache.pig.piggybank.evaluation.xml.XPath();
grunt> page = Load 'hdfs://master:9000/IP/Wikipedia-20181215070630.xml' using org.apache.pig.piggybank.storage.XMLLoader('page') as (x : chararray);
link = Foreach page Generate Flatten(REGEX_EXTRACT_ALL(x, '(?<=\[\[).*?(?=\]\])'));

Whenever my regular expression contains '[', Pig throws the error message below:

<line 1, column 64>  Unexpected character '['
2018-12-22 04:41:24,535 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 1, column 64>  Unexpected character '['.

When I try the other regular expressions listed above, I get blank output.
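
For reference, Pig's regex functions use Java regular expression syntax, so the pattern itself can be checked outside Pig. The sketch below is plain Java, not a Pig script, and the class name BracketExtractor and its two methods are hypothetical names made up for this illustration. It suggests that \[\[(.*?)\]\] does capture both values when the matcher scans the text with find(), but does not match when the whole input has to match the pattern. If REGEX_EXTRACT_ALL requires a whole-input match (an assumption worth verifying against the Pig documentation for the installed version), that could be one reason for the blank output. Separately, the "Unexpected character" error is probably raised by Pig's script parser rather than by the regex engine, so the backslashes likely need to be doubled inside the quoted Pig string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Pig or piggybank.
final class BracketExtractor {
    // In Java source the backslashes are doubled; the regex itself is \[\[(.*?)\]\]
    private static final Pattern LINK = Pattern.compile("\\[\\[(.*?)\\]\\]");

    // Collects the text inside every [[...]] pair by scanning with find().
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // True only if the ENTIRE input is a single [[...]] token.
    static boolean matchesWholeInput(String text) {
        return LINK.matcher(text).matches();
    }

    public static void main(String[] args) {
        String sample = "I love [[cricket]]. Let's play it at [[16:00]].";
        System.out.println(String.join(", ", extract(sample))); // prints "cricket, 16:00"
        System.out.println(matchesWholeInput(sample));          // prints "false"
    }
}
```

If the whole-input assumption holds, one direction to explore is splitting the text into tokens first and applying the pattern per token, or wrapping the pattern so that it can cover the full line.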

0 answers:

No answers