I'm currently working with Pig, and I'm trying to check whether the value of one field (a chararray) occurs inside another field (also a chararray).
Here is an example.
File t.txt:
1;This is a banana which is yellow.;Fruit;Banana
2;This is not about fruit but about Apple Inc.;Company;Apple
In the example above, I want to check whether the last field (i.e. Banana and Apple) occurs in the second field (the sentence). This is my Pig script so far:
a = LOAD 't.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);
b = FOREACH a GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;
c = FILTER b BY sent MATCHES '.* srch .*';
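What I want to achieve is to get the bigrams surrounding the search term. As a concrete example, this is what I'm looking for (or something in a similar form):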
(1,Fruit,{(a, banana),(banana, which)})
(2,Company,{(about, apple),(apple, inc.)})
So my question is: how can I use the schema field search inside a pattern that is matched against the schema field sentence?
Answer 0 (score: 2)
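Use a UDF. Pass the sentence and the search term to the UDF. Inside the UDF, split the sentence into words and iterate over them. When a word matches the search term, take the words immediately before and after it.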
PigScript
REGISTER GetSurroundingWords.jar;
DEFINE GetSurroundingWords com.mypackages.GetSurroundingWords();
A = LOAD 'test11.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);
B = FOREACH A GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;
C = FOREACH B GENERATE id,kind,GetSurroundingWords(sent,srch);
DUMP C;
Output
Java UDF
package com.mypackages;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Returns the bigrams around the first occurrence of the search term,
// e.g. "(a,banana),(banana,which)". Returns "" when the term is not found.
public class GetSurroundingWords extends EvalFunc<String>
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        if (input == null || input.size() < 2)
        {
            return null;
        }
        // Read the two fields directly. Splitting input.toString() on ","
        // would break as soon as the sentence itself contains a comma.
        Object oSentence = input.get(0);
        Object oSearchItem = input.get(1);
        if (oSentence == null || oSearchItem == null)
        {
            return null;
        }
        String sSentence = oSentence.toString().trim();
        String sSearchItem = oSearchItem.toString().trim();
        if (sSentence.isEmpty())
        {
            return null;
        }
        String[] sWords = sSentence.split("\\s+");
        for (int iIndex = 0; iIndex < sWords.length; iIndex++)
        {
            if (sWords[iIndex].equals(sSearchItem))
            {
                // Pair with the previous word, or emit the term alone
                // when it is the first word of the sentence.
                String sOutputString = iIndex > 0
                    ? "(" + sWords[iIndex - 1] + "," + sSearchItem + ")"
                    : "(" + sSearchItem + ")";
                // Pair with the next word, or emit the term alone
                // when it is the last word of the sentence.
                sOutputString += iIndex + 1 < sWords.length
                    ? ",(" + sSearchItem + "," + sWords[iIndex + 1] + ")"
                    : ",(" + sSearchItem + ")";
                return sOutputString;
            }
        }
        return "";
    }
}
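
The word-matching step can also be tried outside Pig before packaging the jar. The following is only a minimal sketch of that step in plain Java, under a few assumptions: the class name SurroundingBigrams and the method find are hypothetical and not part of the UDF above, words are separated by whitespace, and every occurrence of the term is reported (the UDF stops at the first one).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for local experiments; not part of the Pig UDF.
final class SurroundingBigrams {

    private SurroundingBigrams() {}

    // Returns "(previous,search)" and "(search,next)" for every
    // whitespace-separated word that equals `search` exactly.
    // A missing neighbour (term is first or last word) is skipped.
    static List<String> find(String sentence, String search) {
        List<String> bigrams = new ArrayList<>();
        if (sentence == null || search == null) {
            return bigrams;
        }
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return bigrams;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equals(search)) {
                continue;
            }
            if (i > 0) {
                bigrams.add("(" + words[i - 1] + "," + search + ")");
            }
            if (i + 1 < words.length) {
                bigrams.add("(" + search + "," + words[i + 1] + ")");
            }
        }
        return bigrams;
    }
}
```

Calling SurroundingBigrams.find with the lowercased first sample sentence and "banana" yields the two pairs (a,banana) and (banana,which). Because the comparison is an exact word match, a term followed directly by punctuation (for example "apple,") would not be found; stripping punctuation first is left as a design choice.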