Apache Pig - Substring within another string

Date: 2016-05-13 13:04:46

Tags: regex apache-pig

I am currently working in Pig and trying to check whether the value of one field (a chararray) occurs inside another field (also a chararray). Here is an example.

File t.txt

1;This is a banana which is yellow.;Fruit;Banana
2;This is not about fruit but about Apple Inc.;Company;Apple

In the example above, I want to check whether the last field (i.e. Banana, Apple) occurs in the second field (the sentence). This is my Pig script so far:

a = LOAD 't.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);

b = FOREACH a GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;

c = FILTER b BY sent MATCHES '.* srch .*';

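What I want to achieve is to get the bigrams around the search term. As a concrete example, this is what I am looking for (or something in a similar form):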

(1,Fruit,{(a, banana),(banana, which)})
(2,Company,{(about, apple),(apple, inc.)})

So my question is: how can I use the field search from the schema inside a pattern that is matched against the field sentence?

1 answer:

Answer 0: (score: 2)

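Use a UDF. Pass the sentence and the search term to the UDF. Inside the UDF, split the sentence into words and iterate over them. When a word matches, take the words immediately before and after the search term.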
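The word-by-word approach described here can be sketched in plain Java, independent of Pig, so the logic can be tried without a cluster. The class name BigramSketch and the method surroundingBigrams are hypothetical names chosen for illustration; they are not part of Pig or of the UDF in this answer. The sketch assumes whitespace-separated words and an exact, case-sensitive comparison, and it reports only the first occurrence of the search term.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Hypothetical helper: returns the bigrams formed by the first occurrence
    // of `search` and its neighbouring words; empty if the word is not found.
    static List<String> surroundingBigrams(String sentence, String search) {
        List<String> bigrams = new ArrayList<>();
        String[] words = sentence.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equals(search)) {
                continue;
            }
            if (i > 0) {
                // bigram with the preceding word
                bigrams.add("(" + words[i - 1] + "," + search + ")");
            }
            if (i + 1 < words.length) {
                // bigram with the following word
                bigrams.add("(" + search + "," + words[i + 1] + ")");
            }
            break; // only the first match is reported
        }
        return bigrams;
    }

    public static void main(String[] args) {
        // prints [(a,banana), (banana,which)]
        System.out.println(surroundingBigrams("this is a banana which is yellow.", "banana"));
        // prints [(about,apple), (apple,inc.)]
        System.out.println(surroundingBigrams("this is not about fruit but about apple inc.", "apple"));
    }
}
```

Because the comparison is on whole words, punctuation attached to a word (such as "inc.") stays part of that word, which matches the output format asked for in the question.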

PigScript

REGISTER GetSurroundingWords.jar;
DEFINE GetSurroundingWords com.mypackages.GetSurroundingWords();

A = LOAD 'test11.txt' using PigStorage(';') AS (id:chararray, sentence:chararray, kind:chararray, search:chararray);
B = FOREACH A GENERATE id, LOWER(sentence) as sent:chararray, kind, LOWER(search) as srch:chararray;
C = FOREACH B GENERATE id,kind,GetSurroundingWords(sent,srch);
DUMP C;

Output

[Image: output of DUMP C]

Java UDF

package com.mypackages;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GetSurroundingWords extends EvalFunc<String> 
{
    @Override
    public String exec(Tuple input) throws IOException
    {
        // Expect at least two fields: the sentence and the search term.
        if(input == null || input.size() < 2)
        {
            return null;
        }
        // Read the fields directly; splitting input.toString() on ","
        // breaks as soon as the sentence itself contains a comma.
        Object oSentence = input.get(0);
        Object oSearchItem = input.get(1);
        if(oSentence == null || oSearchItem == null)
        {
            return null;
        }
        String [] sWords = oSentence.toString().trim().split("\\s+");
        String sSearchItem = oSearchItem.toString().trim();

        for(int iIndex = 0; iIndex < sWords.length; iIndex++)
        {
            if(sWords[iIndex].equals(sSearchItem))
            {
                // Bigram with the preceding word, or the bare term at the start of the sentence.
                String sBefore = (iIndex > 0)
                    ? "(" + sWords[iIndex - 1] + "," + sSearchItem + ")"
                    : "(" + sSearchItem + ")";
                // Bigram with the following word, or the bare term at the end of the sentence.
                String sAfter = (iIndex + 1 < sWords.length)
                    ? "(" + sSearchItem + "," + sWords[iIndex + 1] + ")"
                    : "(" + sSearchItem + ")";
                return sBefore + "," + sAfter;
            }
        }
        // No match: return an empty string, as the original version did.
        return "";
    }
}
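
A side note on the FILTER in the question: the pattern '.* srch .*' is a string literal, so it looks for the literal text "srch" rather than the value of the srch field. To my understanding, Pig's MATCHES follows Java regular-expression rules and has to match the whole string, so the plain-Java sketch below illustrates the difference between the literal pattern and one built from the search value. MatchSketch and containsWord are hypothetical names used only for this illustration; Pattern.quote is there so that regex metacharacters in the search term are treated literally.

```java
import java.util.regex.Pattern;

public class MatchSketch {
    // Hypothetical helper: true if `search` occurs in `sentence`
    // as a whitespace-delimited word.
    static boolean containsWord(String sentence, String search) {
        String regex = "(.*\\s)?" + Pattern.quote(search) + "(\\s.*)?";
        return sentence.matches(regex); // matches() requires a full-string match
    }

    public static void main(String[] args) {
        String sent = "this is a banana which is yellow.";
        System.out.println(sent.matches(".* srch .*"));   // false: looks for the literal text "srch"
        System.out.println(containsWord(sent, "banana")); // true
        System.out.println(containsWord(sent, "apple"));  // false
    }
}
```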