Find the shortest part of a sentence that contains the given words

Asked: 2012-07-01 18:26:25

Tags: java string algorithm

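Example: suppose we have the sentence
My name is not eugene. my pet name is not eugene.
and we have to find the smallest part of the sentence that contains the given words my and eugene. The answer is then: eugene. my
There is no need to consider upper or lower case, special characters, or digits.
I have pasted my code below, but it gives the wrong answer for some test cases.

Can anyone tell me what is wrong with the code? I do not have the failing test cases.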

import java.io.*;
import java.util.*;
public class ShortestSegment 
{
static String[] pas;
static String[] words;
static int k,st,en,fst,fen,match,d;
static boolean found=false;
static int[] loc;
static boolean[] matches ;
public static void main(String s[]) throws IOException
{
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    pas = in.readLine().replaceAll("[^A-Za-z ]", "").split(" ");
    k = Integer.parseInt(in.readLine());
    words = new String[k];
    matches = new boolean[k];
    loc = new int[k];
    for(int i=0;i<k;i++)
    {
        words[i] = in.readLine();
    }
    en = fen = pas.length;
    find(0);
    if(found==false)
    System.out.println("NO SUBSEGMENT FOUND");
    else
    {
        for(int j=fst;j<=fen;j++)
            System.out.print(pas[j]+" ");
    }

}
private static void find(int min)
{
    if(min==pas.length)
        return;
    for(int i=0;i<k;i++)
    {
        if(pas[min].equalsIgnoreCase(words[i]))
        {
            if(matches[i]==false)
            {
                loc[i]=min;
                matches[i] =true;
                match++;
            }
            else
            {
                    loc[i]=min;
            }
            if(match==k)
            {
                en=min;
                st = min();
                found=true;
                if((fen-fst)>(en-st))
                {
                    fen=en;
                    fst=st;
                }
                match--;
                matches[getIdx()]=false;
            }
        }
    }
    find(min+1);
}
private static int getIdx()
{
    for(int i=0;i<k;i++)
    {
        if(words[i].equalsIgnoreCase(pas[st]))
            return i;
    }
    return -1;
}
private static int min()
{
    int min=loc[0];
    for(int i=1;i<loc.length;i++)
        if(min>loc[i])
            min=loc[i];
    return min;
}


}

4 answers:

Answer 0 (score: 0)

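The code you posted produces the wrong output for the input below. I think that when you want to "find the shortest part of a sentence containing the given words", the length of the words matters too.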

  

String: 'My firstname is eugene. My fn is eugene.'
  Number of search strings: 2
  string1: 'my'   string2: 'is'
  Your solution gives: 'My firstname is'   Correct answer: 'My fn is'

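The problem in the code is that it treats 'firstname' and 'fn' as if they had the same length. In the comparison (fen-fst)>(en-st) you only check whether the number of words has been minimized, not whether the length in characters has become shorter.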
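One way to take word length into account is to compare candidate segments by their character span in the original text instead of by word count. The following is only a minimal sketch under assumptions that are not in the question: the class name `ShortestSpan` and the method `shortest` are made up for illustration, tokens are runs of ASCII letters, and matching ignores case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the code in the question.
final class ShortestSpan {
    /** Returns the shortest substring (measured in characters) of text that contains every word, or null. */
    static String shortest(String text, List<String> words) {
        // Give each distinct required word (lower-cased) its own id.
        Map<String, Integer> need = new HashMap<>();
        for (String w : words) {
            need.putIfAbsent(w.toLowerCase(Locale.ROOT), need.size());
        }
        // Record {start, end (exclusive), wordId} for every token that is a required word.
        List<int[]> hits = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z]+").matcher(text);
        while (m.find()) {
            Integer id = need.get(m.group().toLowerCase(Locale.ROOT));
            if (id != null) {
                hits.add(new int[] { m.start(), m.end(), id });
            }
        }
        int[] count = new int[need.size()];
        int covered = 0, left = 0, bestStart = -1, bestEnd = -1;
        for (int right = 0; right < hits.size(); right++) {
            if (count[hits.get(right)[2]]++ == 0) {
                covered++;
            }
            // Shrink the window from the left while it still covers every word.
            while (covered == need.size()) {
                int start = hits.get(left)[0];
                int end = hits.get(right)[1];
                if (bestStart < 0 || end - start < bestEnd - bestStart) {
                    bestStart = start;
                    bestEnd = end;
                }
                if (--count[hits.get(left)[2]] == 0) {
                    covered--;
                }
                left++;
            }
        }
        return bestStart < 0 ? null : text.substring(bestStart, bestEnd);
    }
}
```

Because the window is scored by `end - start` in characters, a segment with a long filler word loses to one with a short filler word even when both contain the same number of words.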

Answer 1 (score: 0)

The following code (JUnit):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class ShortestSegmentTest {

    @Test
    public void testIt() {
        final String s = "My name is not eugene. my pet name is not eugene.";
        // every non-letter becomes a blank, so offsets in tmp line up with offsets in s
        final String tmp = s.toLowerCase().replaceAll("[^a-z]", " ");
        final String w1 = "my "; // trailing blank avoids matching words such as "myself" or "myth"
        final String w2 = "eugene "; // same as above
        final List<Integer> l1 = getList(tmp, w1); // start indexes of w1
        final List<Integer> l2 = getList(tmp, w2); // start indexes of w2
        int min = Integer.MAX_VALUE;
        final int[] idx = new int[] { 0, 0 };

        // try every pair of occurrences and keep the closest one
        for (final int i : l1) {
            for (final int j : l2) {
                if (Math.abs(j - i) < min) {
                    min = Math.abs(j - i);
                    idx[0] = j - i > 0 ? i : j;
                    // end (exclusive) of the later word, without its trailing blank
                    idx[1] = j - i > 0 ? j + w2.length() - 1 : i + w1.length() - 1;
                }
            }
        }

        System.out.println("indexes: " + Arrays.toString(idx));
        System.out.println("result: " + s.substring(idx[0], idx[1]));
    }

    // returns the start index (within input) of every occurrence of search
    private List<Integer> getList(final String input, final String search) {
        final List<Integer> list = new ArrayList<Integer>();
        int from = 0;
        while (true) {
            final int x = input.indexOf(search, from);
            if (x < 0) {
                break;
            }
            list.add(x);
            from = x + search.length(); // continue after this occurrence
        }
        return list;
    }
}

prints:

indexes: [15, 25]
result: eugene. my

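I think the code, with its inline comments, is easy to follow. Basically, it plays with index + word length.

Notes

  • The "not found" case is not handled.
  • The code only shows the idea; it can be optimized. For example, at least one abs() call can be saved, and so on.
Hope it helps.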
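The trailing-blank trick used above still matches any word that merely ends with the search term (for example "tommy " when searching for "my "). A possible alternative, shown here only as a sketch and not as part of the answer, is a regular expression with `\b` word boundaries. The class name `WordOffsets` and the method `starts` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answer's code.
final class WordOffsets {
    /** Start offsets of every whole-word, case-insensitive occurrence of word in text. */
    static List<Integer> starts(String text, String word) {
        // Pattern.quote keeps any regex metacharacters in the word literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        List<Integer> result = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }
}
```

An empty list for either word is then a simple way to detect the "not found" case before pairing up the indexes.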

Answer 2 (score: 0)

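I think it can be handled another way: first find a match, then narrow the bounds of the current result by searching for a match again inside it. It can be coded as follows: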

/**This method intends to check the shortest interval between two words
 * @param s : the string to be processed at
 * @param first : one of the words
 * @param second : one of the words
 */
public static void getShortestInterval(String s , String first , String second)
{
    String situationOne = first + "(.*?)" + second;
    String situationTwo = second + "(.*?)" + first;

    Pattern patternOne = Pattern.compile(situationOne,Pattern.DOTALL|Pattern.CASE_INSENSITIVE);
    Pattern patternTwo = Pattern.compile(situationTwo,Pattern.DOTALL|Pattern.CASE_INSENSITIVE);

    List<Integer> result = new ArrayList<Integer>(Arrays.asList(Integer.MAX_VALUE,-1,-1));
    /**first , test the first choice*/
    Matcher matcherOne = patternOne.matcher(s);
    findTheMax(first.length(),matcherOne, result);
    /**then , test the second choice*/
    Matcher matcherTwo = patternTwo.matcher(s);
    findTheMax(second.length(),matcherTwo,result);

    if(result.get(0)!=Integer.MAX_VALUE)
    {
        System.out.println("The shortest length is " + result.get(0));
        System.out.println("Which start @ " + result.get(1));
        System.out.println("And end @ " + result.get(2));
    }else
        System.out.println("No matching result is found!");
}

private static void findTheMax(int headLength , Matcher matcher , List<Integer> result) 
{
    int length = result.get(0);
    int startIndex = result.get(1);
    int endIndex = result.get(2);

    while(matcher.find())
    {
        int temp = matcher.group(1).length();
        int start = matcher.start();
        List<Integer> minimize = new ArrayList<Integer>(Arrays.asList(Integer.MAX_VALUE,-1,-1));
        System.out.println(matcher.group().substring(headLength));
        findTheMax(headLength, matcher.pattern().matcher(matcher.group().substring(headLength)), minimize);
        if(minimize.get(0) != Integer.MAX_VALUE)
        {
            start = start + minimize.get(1) + headLength;
            temp = minimize.get(0);
        }

        if(temp<length)
        {
            length = temp;
            startIndex = start;
            endIndex = matcher.end();
        }
    }

    result.set(0, length);
    result.set(1, startIndex);
    result.set(2, endIndex);
}

Note that this handles both cases, whichever order the two words appear in!
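Why search again inside a match? A reluctant quantifier such as `(.*?)` gives the shortest match from a given starting position, but the regex engine still prefers the leftmost starting position, so the first match can contain a shorter one. The sketch below only illustrates that behaviour; the class name `ReluctantDemo` and the method `firstMatch` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the answer's code.
final class ReluctantDemo {
    /** Returns the first match of first(.*?)second in s, or null if there is none. */
    static String firstMatch(String s, String first, String second) {
        Pattern p = Pattern.compile(
                Pattern.quote(first) + "(.*?)" + Pattern.quote(second),
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }
}
```

For the input "my pet, my eugene" the first match starts at the first "my" and covers the whole string. Only a second search on the text after that leading "my" finds the shorter "my eugene".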

Answer 3 (score: 0)

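You can use the Knuth-Morris-Pratt (KMP) algorithm to find the indexes of all occurrences of each given word in the text. Suppose you have a text of length N and M words (w1 ... wM). Using KMP you can build the following array: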

occur = string[N];
occur[i] = 1, if w1 starts at position i
...
occur[i] = M, if wM starts at position i
occur[i] = 0, if no word from w1...wM starts at position i

Loop over this array and, from each non-zero position, search forward for the other M-1 words.

This is approximate pseudocode, just to convey the idea. If you simply re-code it in Java it will certainly not work:

for i = 0 to N-1 {
 if occur[i] != 0 {
  foundWords = { occur[i] };
  lastWordInd = i;
  for j = i + w[occur[i] - 1].length to N-1 { // searching forward
   if foundWords.containAllWords() break;
   if occur[j] != 0 and !foundWords.contains(occur[j]) {
    foundWords.add(occur[j]);
    lastWordInd = j;
   }
  }
  if foundWords.containAllWords() {
   foundTextPieceLen = lastWordInd + w[occur[lastWordInd] - 1].length - i;
   if foundTextPieceLen < minTextPieceLen {
    minTextPieceLen = foundTextPieceLen;
    // also remember start and end indexes of the text piece
   }
  }
 }
}
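For illustration, here is a rough Java sketch of the same idea. It is not the answer's implementation: it uses plain `String.indexOf` as a stand-in for KMP, and the class name `OccurScan` and the method `shortestPiece` are hypothetical. It also assumes the words are distinct, non-empty ASCII strings, and like the `occur` array above it matches substrings (so "my" also matches inside "myth") and stores at most one word per position.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the occur-array idea; String.indexOf stands in for KMP.
final class OccurScan {
    /** Returns {start, end (exclusive)} of the shortest piece of text containing all words, or null. */
    static int[] shortestPiece(String text, String[] words) {
        String lower = text.toLowerCase(Locale.ROOT);
        int n = lower.length();
        // occur[i] = k + 1 if words[k] starts at position i, 0 if no word starts there.
        int[] occur = new int[n];
        for (int k = 0; k < words.length; k++) {
            String w = words[k].toLowerCase(Locale.ROOT);
            for (int i = lower.indexOf(w); i >= 0; i = lower.indexOf(w, i + 1)) {
                occur[i] = k + 1;
            }
        }
        int[] best = null;
        for (int i = 0; i < n; i++) {
            if (occur[i] == 0) {
                continue;
            }
            // Scan forward from i until every word has been seen once.
            Set<Integer> found = new HashSet<>();
            int end = 0;
            for (int j = i; j < n && found.size() < words.length; j++) {
                if (occur[j] != 0 && found.add(occur[j])) {
                    end = Math.max(end, j + words[occur[j] - 1].length());
                }
            }
            if (found.size() == words.length && (best == null || end - i < best[1] - best[0])) {
                best = new int[] { i, end };
            }
        }
        return best;
    }
}
```

With the sentence from the question and the words "my" and "eugene", `text.substring(best[0], best[1])` yields the piece between the first "eugene" and the second "my". The forward scan makes this quadratic in the worst case, which is fine for a single sentence.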