Question

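I have working code, but it is extremely slow. It checks whether a string contains a keyword. I need to make it efficient for hundreds of keywords, which I will search for in thousands of documents.

How can I find keywords efficiently, without falsely matching words that merely contain the keyword?

For example: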

String keyword = "ac";
String document = "...";  // a file a few pages long

If I use:

if(document.contains(keyword) ){
//do something
}

it also returns true when the document contains a word like "account".

So I tried a regular expression instead:

String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
   //do something
}

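Takeaways:

Here is a summary, in the hope that it is useful to others:

My regular expression works, but it is impractical on big data. (It did not terminate.)
@anubhava perfected the regular expression. It is easy to understand and implement. It manages to terminate, which is a big deal, but it is still a bit slow. (About 240 seconds.)
@Tomalak's solution is harder to implement and understand, but it is the fastest. So hats off, mate. (18 seconds.)

So @Tomalak's solution is ~15 times faster than @anubhava's.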

Answer 1

I don't think your regular expression needs the .* parts.

Try this regex:

String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";

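Here \\b matches a word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.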

You must use Pattern.quote if your keyword contains special regex characters.
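
As a rough illustration of the two points above, here is a minimal sketch. The class name WordBoundaryDemo and the helpers wholeWord and containsWord are made up for this example. It compiles the pattern with Pattern.quote and shows that a keyword ending in a non-word character, such as "c++", is not found with \\b.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Build the whole-word pattern for one keyword.
    // In a real search, compile once per keyword and reuse it for every document.
    static Pattern wholeWord(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    static boolean containsWord(String document, String keyword) {
        return wholeWord(keyword).matcher(document).find();
    }

    public static void main(String[] args) {
        // "ac" inside "account" is not a whole word.
        System.out.println(containsWord("open an account", "ac"));
        // "ac" surrounded by spaces is a whole word.
        System.out.println(containsWord("the ac unit", "ac"));
        // Pattern.quote makes the dot literal, so "abc" does not match "a.c".
        System.out.println(containsWord("version abc here", "a.c"));
        // No word boundary exists between '+' and ' ', so this is not found.
        System.out.println(containsWord("I like c++ a lot", "c++"));
    }
}
```

The last call is the case the answer warns about: both '+' and the following space are non-word characters, so \\b cannot match between them.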

EDIT: You can use this regex if your keywords are separated by whitespace.

String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
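
A small sketch of how the whitespace-delimited variant behaves, again with hypothetical names (WhitespaceDelimitedDemo, spaceDelimited, containsToken). Because it tests for whitespace or the start/end of the input instead of word boundaries, a keyword such as "c++" can be found, while punctuation directly after the keyword prevents a match.

```java
import java.util.regex.Pattern;

class WhitespaceDelimitedDemo {
    // Keyword must be preceded by whitespace or start of input,
    // and followed by whitespace or end of input.
    static Pattern spaceDelimited(String keyword) {
        return Pattern.compile("(?<=\\s|^)" + Pattern.quote(keyword) + "(?=\\s|$)");
    }

    static boolean containsToken(String document, String keyword) {
        return spaceDelimited(keyword).matcher(document).find();
    }

    public static void main(String[] args) {
        // Surrounded by spaces: found.
        System.out.println(containsToken("I like c++ a lot", "c++"));
        // At the very end of the input: found.
        System.out.println(containsToken("written in c++", "c++"));
        // Followed by a comma rather than whitespace: not found.
        System.out.println(containsToken("c++, then java", "c++"));
    }
}
```

Whether punctuation should count as a delimiter depends on the documents being searched; this variant treats only whitespace and the ends of the input as delimiters.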

Answer 2

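The fastest way to find a substring in Java is String.indexOf().

To achieve "whole word" matching, you need to add a little logic that checks the characters before and after a possible match to make sure they are non-word characters: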

public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);

        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

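    // Returns the index of the first whole-word occurrence of word in input, or -1.
    public static int indexOfWord(String input, String word) {
        // Matches an empty string or a single non-word character.
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i = 0;

        while (true) {
            // Resume the search just past the previous candidate.
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;

            before_i = index - 1;
            after_i = index + word.length();
            // Neighbouring characters, or "" at either end of the input.
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");

            // A hit counts only if both neighbours are non-word characters.
            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }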
}

This prints:

Hit for "long" at position 44.

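This should perform better than a pure regex approach.

Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regex here is a compromise, and it may cost performance if your input string contains many "almost" matches.

For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (such as isAlphabetic) for before and after.

I have created a Gist with an alternative implementation that does that.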
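
The Gist itself is not reproduced here. As a minimal sketch of the same idea, and not necessarily what the Gist does, the version below replaces the regex check with Character tests. The class name IndexOfWordNoRegex and the helper isWordChar are hypothetical, and the definition of a word character (letter, digit, or underscore) is an assumption; note that Character.isLetterOrDigit is Unicode-aware, so it is not identical to \W.

```java
class IndexOfWordNoRegex {
    // Assumption: a "word character" is a letter, a digit, or an underscore.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Returns the index of the first whole-word occurrence of word, or -1.
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) return -1;
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) return -1;

            int end = index + word.length();
            // The start or end of the input counts as a valid delimiter.
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) return index;

            // Not a whole word here; look for the next candidate.
            from = index + 1;
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long")); // prints 44
    }
}
```

No String or Matcher objects are allocated per candidate, which is where the saving over the regex-based check would be expected to come from.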

Efficient regex for big data: checking whether a String contains a word
