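Regex dictionary search with single-use letters

Time: 2013-02-25 19:37:07

Tags: java regex

I'm working on a Java program that searches a dictionary for words made up of a particular set of letters. I'd like to know whether it is possible to set up a regex that only lets you use the characters present in the string, each at most as often as it appears there. For example, take the letters SHARE: hear, hare, sea, and so on would all be valid, but see or sarah would not, because you only have one e and one a.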

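5 answers:

Answer 0 (score: 1)

Regular expressions are about pattern matching. Finding a simple pattern for this may not be possible.

If you really, really want a regex, these functions will generate one: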

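// Builds a regex with one alternative per ordering of the letters,
// every letter optional, e.g. "s?h?a?r?e?|s?h?a?e?r?|...".
public static String permutation(String str) {
    // Each ordering is emitted as "|<letters>"; the leading "|" becomes "(".
    return "^" + permutation("", str).replaceFirst("\\|", "(") + ")$";
}

private static String permutation(String prefix, String str) {
    int n = str.length();
    if (n == 0) {
        return "|" + prefix; // one complete ordering
    }
    StringBuilder s = new StringBuilder();
    for (int i = 0; i < n; i++) {
        // Use letter i next, make it optional, and permute the remaining letters.
        s.append(permutation(prefix + str.charAt(i) + "?",
                             str.substring(0, i) + str.substring(i + 1, n)));
    }
    return s.toString();
}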

For "share", it returns:

^(s?h?a?r?e?|s?h?a?e?r?|s?h?r?a?e?|s?h?r?e?a?|s?h?e?a?r?|s?h?e?r?a?|s?a?h?r?e?|s?a?h?e?r?|s?a?r?h?e?|s?a?r?e?h?|s?a?e?h?r?|s?a?e?r?h?|s?r?h?a?e?|s?r?h?e?a?|s?r?a?h?e?|s?r?a?e?h?|s?r?e?h?a?|s?r?e?a?h?|s?e?h?a?r?|s?e?h?r?a?|s?e?a?h?r?|s?e?a?r?h?|s?e?r?h?a?|s?e?r?a?h?|h?s?a?r?e?|h?s?a?e?r?|h?s?r?a?e?|h?s?r?e?a?|h?s?e?a?r?|h?s?e?r?a?|h?a?s?r?e?|h?a?s?e?r?|h?a?r?s?e?|h?a?r?e?s?|h?a?e?s?r?|h?a?e?r?s?|h?r?s?a?e?|h?r?s?e?a?|h?r?a?s?e?|h?r?a?e?s?|h?r?e?s?a?|h?r?e?a?s?|h?e?s?a?r?|h?e?s?r?a?|h?e?a?s?r?|h?e?a?r?s?|h?e?r?s?a?|h?e?r?a?s?|a?s?h?r?e?|a?s?h?e?r?|a?s?r?h?e?|a?s?r?e?h?|a?s?e?h?r?|a?s?e?r?h?|a?h?s?r?e?|a?h?s?e?r?|a?h?r?s?e?|a?h?r?e?s?|a?h?e?s?r?|a?h?e?r?s?|a?r?s?h?e?|a?r?s?e?h?|a?r?h?s?e?|a?r?h?e?s?|a?r?e?s?h?|a?r?e?h?s?|a?e?s?h?r?|a?e?s?r?h?|a?e?h?s?r?|a?e?h?r?s?|a?e?r?s?h?|a?e?r?h?s?|r?s?h?a?e?|r?s?h?e?a?|r?s?a?h?e?|r?s?a?e?h?|r?s?e?h?a?|r?s?e?a?h?|r?h?s?a?e?|r?h?s?e?a?|r?h?a?s?e?|r?h?a?e?s?|r?h?e?s?a?|r?h?e?a?s?|r?a?s?h?e?|r?a?s?e?h?|r?a?h?s?e?|r?a?h?e?s?|r?a?e?s?h?|r?a?e?h?s?|r?e?s?h?a?|r?e?s?a?h?|r?e?h?s?a?|r?e?h?a?s?|r?e?a?s?h?|r?e?a?h?s?|e?s?h?a?r?|e?s?h?r?a?|e?s?a?h?r?|e?s?a?r?h?|e?s?r?h?a?|e?s?r?a?h?|e?h?s?a?r?|e?h?s?r?a?|e?h?a?s?r?|e?h?a?r?s?|e?h?r?s?a?|e?h?r?a?s?|e?a?s?h?r?|e?a?s?r?h?|e?a?h?s?r?|e?a?h?r?s?|e?a?r?s?h?|e?a?r?h?s?|e?r?s?h?a?|e?r?s?a?h?|e?r?h?s?a?|e?r?h?a?s?|e?r?a?s?h?|e?r?a?h?s?)$

Obviously this could be simplified and optimized, but it is still not a good idea.

Edit: a function that shortens the output:

public static String permutation(String str) {
    return "^(" + permutation("", str) + ")$";
}

private static String permutation(String prefix, String str) {
    int n = str.length();
    if (n == 0) {
        return prefix;
    }
    StringBuilder s = new StringBuilder();
    for (int i = 0; i < n; i++) {
        // Letters left over once letter i has been used.
        String rest = str.substring(0, i) + str.substring(i + 1, n);
        s.append(prefix).append(str.charAt(i)).append("?");
        if (i != n - 1) {
            // Group the orderings of the remaining letters, then start the next branch.
            s.append("(").append(permutation("", rest)).append(")|");
        } else {
            // The last branch is appended without a surrounding group.
            s.append(permutation("", rest));
        }
    }
    return s.toString();
}

Prints:

^(s?(h?(a?(r?(e?)|e?r?)|r?(a?(e?)|e?a?)|e?a?(r?)|r?a?)|a?(h?(r?(e?)|e?r?)|r?(h?(e?)|e?h?)|e?h?(r?)|r?h?)|r?(h?(a?(e?)|e?a?)|a?(h?(e?)|e?h?)|e?h?(a?)|a?h?)|e?h?(a?(r?)|r?a?)|a?(h?(r?)|r?h?)|r?h?(a?)|a?h?)|h?(s?(a?(r?(e?)|e?r?)|r?(a?(e?)|e?a?)|e?a?(r?)|r?a?)|a?(s?(r?(e?)|e?r?)|r?(s?(e?)|e?s?)|e?s?(r?)|r?s?)|r?(s?(a?(e?)|e?a?)|a?(s?(e?)|e?s?)|e?s?(a?)|a?s?)|e?s?(a?(r?)|r?a?)|a?(s?(r?)|r?s?)|r?s?(a?)|a?s?)|a?(s?(h?(r?(e?)|e?r?)|r?(h?(e?)|e?h?)|e?h?(r?)|r?h?)|h?(s?(r?(e?)|e?r?)|r?(s?(e?)|e?s?)|e?s?(r?)|r?s?)|r?(s?(h?(e?)|e?h?)|h?(s?(e?)|e?s?)|e?s?(h?)|h?s?)|e?s?(h?(r?)|r?h?)|h?(s?(r?)|r?s?)|r?s?(h?)|h?s?)|r?(s?(h?(a?(e?)|e?a?)|a?(h?(e?)|e?h?)|e?h?(a?)|a?h?)|h?(s?(a?(e?)|e?a?)|a?(s?(e?)|e?s?)|e?s?(a?)|a?s?)|a?(s?(h?(e?)|e?h?)|h?(s?(e?)|e?s?)|e?s?(h?)|h?s?)|e?s?(h?(a?)|a?h?)|h?(s?(a?)|a?s?)|a?s?(h?)|h?s?)|e?s?(h?(a?(r?)|r?a?)|a?(h?(r?)|r?h?)|r?h?(a?)|a?h?)|h?(s?(a?(r?)|r?a?)|a?(s?(r?)|r?s?)|r?s?(a?)|a?s?)|a?(s?(h?(r?)|r?h?)|h?(s?(r?)|r?s?)|r?s?(h?)|h?s?)|r?s?(h?(a?)|a?h?)|h?(s?(a?)|a?s?)|a?s?(h?)|h?s?)$
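One detail worth checking in the shortened generator: the last branch at each level is emitted without parentheses, so its nested alternation is not bound to that branch's letter (for the letters "are", for instance, the pattern it produces does not accept "era"). The sketch below is a variant, not the answerer's code, that wraps the remainder of every branch in a group. The class name `GroupedPermutationRegex` and its method names are made up for illustration, and the input is assumed to contain only plain letters with no regex metacharacters.

```java
import java.util.regex.Pattern;

final class GroupedPermutationRegex {
    // Wraps the nested alternatives in ^( ... )$ so the whole word must match.
    static String regex(String letters) {
        return "^(" + alternatives(letters) + ")$";
    }

    // For each letter: "<letter>?" followed by a group holding every way
    // to continue with the remaining letters.
    static String alternatives(String letters) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < letters.length(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(letters.charAt(i)).append('?');
            String rest = letters.substring(0, i) + letters.substring(i + 1);
            if (!rest.isEmpty()) {
                sb.append('(').append(alternatives(rest)).append(')');
            }
        }
        return sb.toString();
    }

    static boolean matches(String letters, String word) {
        return Pattern.matches(regex(letters), word);
    }
}
```

The pattern still grows factorially with the number of letters, so this only illustrates the grouping; it does not make the regex approach any more practical.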

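Answer 1 (score: 0)

Here is one approach:

  1. Iterate over your array of strings to build a Multimap<String, String> (if you are using the Guava library) or a HashMap<String, List<String>> (if you are using java.util), where the key is a word with its letters sorted and the values are the legal words that sort to that string. This is your preprocessing step, so you only do it once. Subsequent searches will be relatively fast because the hash map already exists; looping over the dictionary on every search to match some regex would be much slower.
  2. Sort the search string and find all subsets of the sorted string (subsequences that keep the sorted order).
  3. Iterate over the sorted subsets and look each one up in the HashMap or Multimap. Collect all the values and you have your answer.

I think the problem here is that a regex does not fit what you describe, because you would still have to loop over the whole dictionary (which you have stored as an array) for every search. If you build the hash map instead (a relatively expensive step), each search only loops over the list of sorted subsets (which is cheap).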
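The steps above can be sketched with java.util alone. This is a minimal illustration under stated assumptions rather than a definitive implementation: the class name `SortedLetterIndex` and its methods are hypothetical, words are lowercased when keys are built, and the search string is assumed to be short (fewer than 31 letters), because every subset is enumerated with a bit mask.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SortedLetterIndex {
    // Key: the letters of a word in sorted order. Value: all words with those letters.
    private final Map<String, List<String>> index = new HashMap<>();

    // Step 1: preprocess the dictionary once.
    SortedLetterIndex(Collection<String> dictionary) {
        for (String word : dictionary) {
            index.computeIfAbsent(sortLetters(word), k -> new ArrayList<>()).add(word);
        }
    }

    static String sortLetters(String s) {
        char[] chars = s.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Steps 2 and 3: sort the search letters, enumerate every non-empty subset
    // (each stays sorted because letters are taken in index order), and look it up.
    Set<String> search(String letters) {
        String sorted = sortLetters(letters);
        int n = sorted.length(); // assumed to be below 31
        Set<String> triedKeys = new HashSet<>();
        Set<String> found = new TreeSet<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    key.append(sorted.charAt(i));
                }
            }
            // Repeated letters can produce the same key twice; look it up only once.
            if (triedKeys.add(key.toString())) {
                found.addAll(index.getOrDefault(key.toString(), Collections.emptyList()));
            }
        }
        return found;
    }
}
```

With a dictionary containing hear, hare, sea, see, sarah and share, `new SortedLetterIndex(dictionary).search("share")` returns hare, hear, sea and share. The number of lookups is 2^n - 1 for n letters, which stays small for typical word lengths.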

Answer 2 (score: 0)

If no letter occurs twice in the word, as is the case in share, you can use

^(?!.*([share]).*\\1)[share]+$

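This will match any word made up of some or all of the letters in share.

The negative lookahead (?!...) contains a backreference \\1 to whatever the parentheses matched; it prevents a match if a letter occurs more than once.

You can extend this principle to handle words in which a letter occurs more than once.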
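A small sketch of how a pattern of this kind can be checked in Java. It assumes the lookahead is written with a leading `.*` so that it scans the whole word for a repeated letter, not just the first character; the class name `UniqueLetterRegex` is made up for illustration.

```java
import java.util.regex.Pattern;

final class UniqueLetterRegex {
    // The backslash is doubled because this is a Java string literal;
    // the regex engine sees the backreference \1.
    static final Pattern SHARE =
            Pattern.compile("^(?!.*([share]).*\\1)[share]+$");

    // True if the word uses only letters of "share", each at most once.
    static boolean matches(String word) {
        return SHARE.matcher(word).matches();
    }
}
```

Under these assumptions `matches("hear")`, `matches("hare")` and `matches("sea")` are true, while `matches("see")` and `matches("sarah")` are false because a letter repeats.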

Answer 3 (score: 0)

OK, here is an example of how to do it. However, you should read these articles on catastrophic backtracking first:

Runaway Regular Expressions: Catastrophic Backtracking

Regex Performance

^(?!.*s.*s)(?!.*h.*h)(?!.*a.*a)(?!.*r.*r)(?!.*e.*e)(?!.*[^share]).*$

If you want to allow a letter twice, as in "shares", so that a word such as ashes is accepted, you can do this:

^(?!.*s.*s.*s)(?!.*h.*h)(?!.*a.*a)(?!.*r.*r)(?!.*e.*e)(?!.*[^share]).*$

The idea is that a word with fewer than three "s" is fine...
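Patterns of this shape can be generated from the available letters instead of being written by hand. The following is a sketch under stated assumptions, not the answerer's code: the class name `LookaheadRegexBuilder` and the method `build` are hypothetical, the letters are assumed to be plain a-z characters (no regex metacharacters), and the check for disallowed characters is written as `(?!.*[^...])` so that it scans the whole word.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class LookaheadRegexBuilder {
    // For a letter available k times, emit a negative lookahead that rejects
    // k + 1 occurrences; then reject any character outside the letter set.
    static String build(String letters) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : letters.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        StringBuilder regex = new StringBuilder("^");
        StringBuilder allowed = new StringBuilder();
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            regex.append("(?!");
            for (int i = 0; i <= e.getValue(); i++) {
                regex.append(".*").append(e.getKey());
            }
            regex.append(")");
            allowed.append(e.getKey());
        }
        regex.append("(?!.*[^").append(allowed).append("]).*$");
        return regex.toString();
    }

    static boolean matches(String letters, String word) {
        return Pattern.matches(build(letters), word);
    }
}
```

For the letters "shares" this produces `^(?!.*a.*a)(?!.*e.*e)(?!.*h.*h)(?!.*r.*r)(?!.*s.*s.*s)(?!.*[^aehrs]).*$`, which accepts ashes (two s) and rejects sashes (three s). The lookaheads appear in alphabetical order because a TreeMap is used for the counts.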

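Answer 4 (score: 0)

One approach that does not use pattern matching, but gets to the root of the problem, is to create an array holding the count of each character in the target word: "deaf" would be the array (1,0,0,1,1,1,0,0,...).

Then, as you iterate over your dictionary, you build the same array for each word and subtract it from the target word's array. If the difference array contains any negative value, the word cannot be formed from the letters of the target word.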
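A minimal sketch of this counting idea, assuming lowercase a-z words. The class name `LetterCounts` and its methods are illustrative and not part of the answer; instead of building a second array and subtracting, the sketch decrements the target's counts directly, which gives the same result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LetterCounts {
    // counts[0] is the number of 'a', counts[1] the number of 'b', and so on.
    // Characters outside a-z are ignored.
    static int[] counts(String word) {
        int[] counts = new int[26];
        for (char ch : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++;
            }
        }
        return counts;
    }

    // True if the word can be spelled with the target's letters, each used
    // at most as often as it occurs in the target.
    static boolean canForm(String target, String word) {
        int[] remaining = counts(target);
        for (char ch : word.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch < 'a' || ch > 'z') {
                return false; // not a letter the array can represent
            }
            if (--remaining[ch - 'a'] < 0) {
                return false; // the difference went negative
            }
        }
        return !word.isEmpty();
    }

    // Filters a dictionary down to the words that can be formed.
    static List<String> search(String target, List<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (String word : dictionary) {
            if (canForm(target, word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Each dictionary word is checked in time proportional to its length, and no regex is involved, so there is no backtracking cost to worry about.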