This is what I am trying to do:
INPUT
1,code=1a_asdfasdf_code=1b,asdf
2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
3,code=3a_,sdoliclwmd
Intermediate
1,{1a,1b}
2,{2a,2b,2c}
3,{3a}
Finally
1,1a
1,1b
2,2a
2,2b
2,2c
3,3a
I know about REGEX_EXTRACT and REGEX_EXTRACT_ALL, but neither of them returns multiple matches for the same regular expression.
This only gives me the first match:
A = LOAD '/data/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
B = foreach A generate c1,REGEX_EXTRACT_ALL(c2,'.*code=([^_]+)_.*') as m1;
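One possible explanation for the single result, sketched with Python's re module standing in for the Java regex engine that Pig uses (an assumption; as far as I know REGEX_EXTRACT_ALL requires the pattern to match the whole field, and the two engines agree on this pattern). The pattern has a single capture group, so each row can yield only one value. Because the leading .* is greedy, that value is not necessarily the first code in the row. The names PATTERN and single_group_match are illustrative only.

```python
import re

# The pattern from the question: one capture group, whole-field match.
PATTERN = r'.*code=([^_]+)_.*'

rows = [
    'code=1a_asdfasdf_code=1b',
    'code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf',
    'code=3a_',
]

def single_group_match(field):
    """Return the captured groups of a whole-string match, or None."""
    m = re.fullmatch(PATTERN, field)
    return m.groups() if m else None

results = [single_group_match(r) for r in rows]
for field, groups in zip(rows, results):
    print(field, '->', groups)
```

Each row produces a one-element tuple, which is why a "find all matches" function (rather than more capture groups) is needed here.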
Answer 0 (score: 3)
FYI, this question is about Pig Latin.
I ended up writing a Python UDF:
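#!/usr/bin/python
import re

# Pig's Jython script engine supplies the outputSchema decorator when the script is registered.
@outputSchema("bag1:bag{tuple1:tuple(match:chararray)}")
def findallregex(pattern, text):
    # Wrap every match in a 1-tuple so Pig receives a bag of tuples.
    outbag = []
    for m in re.findall(pattern, text):
        outbag.append((m,))
    return outbag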
followed by this Pig Latin code:
REGISTER '/findall.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
A = LOAD '/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
B = foreach A generate c1, myfuncs.findallregex('code=([^_]+)',c2) as bag1;
C = foreach B generate c1, flatten(bag1);
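To see what the UDF and FLATTEN do together, here is a minimal plain-Python sketch that can be run outside Pig. The helpers find_all_codes and flatten_rows are hypothetical names, not part of Pig, and the rows are the sample input from the question.

```python
import re

def find_all_codes(pattern, text):
    """Mimic the UDF: return a 'bag' (list) of 1-tuples, one per match."""
    return [(m,) for m in re.findall(pattern, text)]

def flatten_rows(rows, pattern):
    """Mimic FLATTEN: emit one (id, match) pair per tuple in the bag."""
    out = []
    for c1, c2, _c3 in rows:
        for (match,) in find_all_codes(pattern, c2):
            out.append((c1, match))
    return out

rows = [
    ('1', 'code=1a_asdfasdf_code=1b', 'asdf'),
    ('2', 'code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf', 'asdf'),
    ('3', 'code=3a_', 'sdoliclwmd'),
]

pairs = flatten_rows(rows, r'code=([^_]+)')
for c1, match in pairs:
    print('%s,%s' % (c1, match))
```

With a single capture group, re.findall returns just the captured text for each match, so the pattern no longer needs the surrounding .* or the trailing underscore.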
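Answer 1 (score: 0)
You have to use groups. I don't know whether you need a lot of processing, but you can pull out the leading number and then match your pattern against the string.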
input
1,code=1a_asdfasdf_code=1b,asdf
2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
3,code=3a_,sdoliclwmd
output
1,1a
1,1b
2,2a
2,2b
2,2c
3,3a
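import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Prints "<first character of the line>,<code>" for every code=XX found in the line.
private static void lineProcess(String text) {
    Pattern p = Pattern.compile("code=(\\w\\w)", Pattern.DOTALL);
    Matcher m = p.matcher(text);
    while (m.find()) {
        System.out.println(text.substring(0, 1) + "," + m.group(1));
    }
}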
Answer 2 (score: 0)
This can be done with simple string manipulation.
A = LOAD 'Data.txt' Using PigStorage(',') AS (a1:int,a2:chararray,a3:chararray);
B = foreach A generate a1, REPLACE(a2,'asdfasdf_','') AS a2;
C = FOREACH B GENERATE a1, FLATTEN(TOKENIZE(a2, '_')) AS parameter;
D = FILTER C BY INDEXOF(parameter, 'code=') != -1;
E = FOREACH D GENERATE a1, SUBSTRING(parameter, 5, 7) AS number;
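The same idea can be tried outside Pig. The sketch below is a rough Python equivalent of steps C through E (split on '_', keep the tokens that contain 'code=', take a fixed-width substring). The name split_codes and the width parameter are hypothetical, and like SUBSTRING(parameter, 5, 7) it assumes every code is exactly two characters long and that each kept token starts with 'code='. The REPLACE step is left out because the filter drops the 'asdfasdf' token anyway.

```python
def split_codes(row_id, field, width=2):
    """Split on '_', keep 'code=' tokens, and slice out the code."""
    prefix = 'code='
    out = []
    for token in field.split('_'):
        # Same test as INDEXOF(parameter, 'code=') != -1
        if token.find(prefix) != -1:
            start = len(prefix)  # 5, as in SUBSTRING(parameter, 5, 7)
            out.append((row_id, token[start:start + width]))
    return out

rows = [
    (1, 'code=1a_asdfasdf_code=1b'),
    (2, 'code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf'),
    (3, 'code=3a_'),
]

result = []
for row_id, field in rows:
    result.extend(split_codes(row_id, field))

for row_id, code in result:
    print('%d,%s' % (row_id, code))
```

The fixed width is the fragile part of this approach: a code longer than two characters would be truncated, whereas the regex in the first answer captures everything up to the next underscore.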