Extracting multiple regex matches from the same line in Pig

Date: 2014-01-30 23:20:00

Tags: regex apache-pig

This is what I want to do:

INPUT 

    1,code=1a_asdfasdf_code=1b,asdf
    2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
    3,code=3a_,sdoliclwmd

Intermediate 

    1,{1a,1b}
    2,{2a,2b,2c}
    3,{3a}


Finally

    1,1a
    1,1b
    2,2a
    2,2b
    2,2c
    3,3a

I know about REGEX_EXTRACT and REGEX_EXTRACT_ALL, but neither of them returns multiple matches for the same regular expression.

This only gives me the first match:

    A = LOAD '/data/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
    B = foreach A generate c1, REGEX_EXTRACT_ALL(c2,'.*code=([^_]+)_.*') as m1;
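A rough way to see the difference outside Pig is to run the same pattern in Python. This is only a sketch: Python's `re` module stands in for the Java regex engine that Pig uses, and `row2` is simply the second field of the second sample row. A whole-string pattern with one capture group can return only one value per row, while a scanning search returns every occurrence.

```python
import re

# Second field of sample row 2 (copied from the input above).
row2 = "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"

# Whole-string match with a single capture group: at most one value comes back.
whole = re.fullmatch(r".*code=([^_]+)_.*", row2)
single = whole.groups() if whole else ()

# Scanning search: one result per non-overlapping occurrence.
every = re.findall(r"code=([^_]+)", row2)

print(single)
print(every)
```

In this Python stand-in the greedy leading `.*` decides which occurrence the single group lands on; either way only one value is captured, which is why a scanning approach is needed.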

3 Answers:

Answer 0 (score: 3)

FYI, this question is about Pig Latin.

I ended up writing a Python UDF:

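    #!/usr/bin/python
    import re

    # outputSchema is made available by Pig's Jython script engine at REGISTER time.
    @outputSchema("bag1:bag{tuple1:tuple(match:chararray)}")
    def findallregex(pattern, text):
        # A null field would make re.findall raise, so return an empty bag instead.
        if text is None:
            return []
        # One single-field tuple per match; the list becomes a Pig bag.
        return [(m,) for m in re.findall(pattern, text)]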

Followed by this Pig Latin code:

    REGISTER '/findall.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
    A = LOAD '/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
    B = foreach A generate c1, myfuncs.findallregex('code=([^_]+)',c2) as bag1;
    C = foreach B generate c1, flatten(bag1);
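To check the logic without a Pig installation, the same two steps (collect all matches into a bag, then flatten) can be imitated in plain Python. This is a sketch only: `find_all` and `run` are hypothetical helper names, `SAMPLE` is the input from the question, and a list comprehension stands in for Pig's FLATTEN.

```python
import re

SAMPLE = """1,code=1a_asdfasdf_code=1b,asdf
2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
3,code=3a_,sdoliclwmd"""

def find_all(pattern, text):
    """Return one single-field tuple per regex match, like the UDF's bag."""
    return [(m,) for m in re.findall(pattern, text)]

def run(lines):
    # Relation B: (c1, bag of matches) for each input row.
    b = []
    for line in lines:
        c1, c2, _c3 = line.split(",")
        b.append((c1, find_all(r"code=([^_]+)", c2)))
    # Relation C: flattening emits one output row per tuple in the bag.
    c = [(c1, match) for c1, bag in b for (match,) in bag]
    return b, c

intermediate, final = run(SAMPLE.splitlines())
for c1, match in final:
    print(f"{c1},{match}")
```

Here `intermediate` corresponds to the "Intermediate" listing in the question and `final` to the "Finally" listing.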

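Answer 1 (score: 0)

You have to use capture groups. I don't know whether you need a lot of processing, but you can pull out the leading number and run a pattern over the rest of the string.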

input
    1,code=1a_asdfasdf_code=1b,asdf
    2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
    3,code=3a_,sdoliclwmd 

output

    1,1a
    1,1b
    2,2a
    2,2b
    2,2c
    3,3a

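    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static void lineProcess(String text) {
        // The row id is everything before the first comma.
        String id = text.substring(0, text.indexOf(','));
        // Capture the two word characters that follow each "code=".
        Pattern p = Pattern.compile("code=(\\w\\w)");
        Matcher m = p.matcher(text);
        // Each find() call advances to the next occurrence in the line.
        while (m.find()) {
            System.out.println(id + "," + m.group(1));
        }
    }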

Answer 2 (score: 0)

This can be achieved with simple string manipulation.

    A = LOAD 'Data.txt' Using PigStorage(',') AS (a1:int,a2:chararray,a3:chararray);
    B = foreach A generate a1, REPLACE(a2,'asdfasdf_','') AS a2;
    C = FOREACH B GENERATE a1, FLATTEN(TOKENIZE(a2, '_')) AS parameter;
    D = FILTER C BY INDEXOF(parameter, 'code=') != -1;
    E = FOREACH D GENERATE a1, SUBSTRING(parameter, 5, 7) AS number;
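The relations B through E above can be traced step by step in Python to see what each one contributes. This is a sketch under assumptions: `extract_codes` is a hypothetical helper, `str.split` stands in for TOKENIZE (empty tokens are dropped explicitly here), and the rows are the second fields of the sample input.

```python
def extract_codes(a1, a2):
    # B: REPLACE(a2, 'asdfasdf_', '') removes the known filler text.
    cleaned = a2.replace("asdfasdf_", "")
    # C: TOKENIZE on '_' followed by FLATTEN; empty tokens are skipped here.
    tokens = [t for t in cleaned.split("_") if t]
    # D: keep only tokens that contain 'code='.
    params = [t for t in tokens if t.find("code=") != -1]
    # E: SUBSTRING(parameter, 5, 7) keeps the characters at index 5 and 6.
    return [(a1, t[5:7]) for t in params]

rows = [
    (1, "code=1a_asdfasdf_code=1b"),
    (2, "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"),
    (3, "code=3a_"),
]
result = [pair for a1, a2 in rows for pair in extract_codes(a1, a2)]
for a1, code in result:
    print(f"{a1},{code}")
```

Note that the fixed offsets work only because every kept token starts with `code=` and every code is exactly two characters long; longer codes or different filler text would need the regex approach.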