Question

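I have a large number of text files, ranging in size from 500KB to 500MB. I have a list of keywords that I need to find in the file content. The number of keywords can be up to 400,000. Currently I use the following code to find the keywords in the file content: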

public static void main(String[] args) throws IOException {
    StringBuilder fileContent = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new FileReader("C:\\Users\\harshita.sethi\\Desktop\\merge\\MNT.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            fileContent.append(line).append("\n");
        }
    }

    String content = fileContent.toString();
    Set<List<String>> keywords = getDbQuery(); // size can be up to 4*10^5

    for (List<String> key : keywords) {
        if (checkOccurence(content, key.get(0))) {
            // Do something
        }
    }
}

private static boolean checkOccurence(String content, String keyword) {
    boolean flag = false;
    try {
        // Match the keyword as a whole word, ignoring case.
        Pattern p = Pattern.compile("\\b" + keyword + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(content);
        flag = m.find();
    } catch (PatternSyntaxException ex) {
        System.out.println("cannot report occurrence of " + keyword);
    }
    return flag;
}

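The problem is that the files are very large, so scanning them takes a lot of time. I have run various tests and concluded that Pattern.compile is what makes the code slow. From what I have read online, Pattern.compile compiles the regular expression every time the function is called, which is why it takes so long.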

Can anyone suggest how to improve the performance of this code so that the string search is faster?

PS: I am restricted to Java 6.

Edit

I tried compiling all the keywords before the for loop, as a few people suggested. I do not see much difference in the execution time.

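However, I noticed that removing the boundary regex changes the performance drastically. The whole run now completes in a few seconds, compared with 8-10 minutes before. But without the boundary regex I do not get the desired output.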

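Question - Is there a way to fine-tune the performance while still using boundaries? And why did the performance change so drastically?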

My goal, for example: when searching for abc, the result should be true if abc. or abc, or abc etc. is found, and false for abcd.
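To make the goal concrete, here is a minimal sketch of the whole-word check described above. The class name BoundaryDemo and the sample strings are made up for illustration. It also assumes the keywords are literal text and therefore wraps them in Pattern.quote, which the code in the question does not do.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample strings are illustrative only.
final class BoundaryDemo {

    // Returns true when keyword occurs in text as a whole word, ignoring case.
    static boolean containsWord(String text, String keyword) {
        // Assumption: keywords are literal text, so Pattern.quote escapes any metacharacters.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("see abc.", "abc"));      // true
        System.out.println(containsWord("see abc, then", "abc")); // true
        System.out.println(containsWord("see abcd", "abc"));      // false
    }
}
```

The \b on each side is what rejects abcd: there is no word boundary between c and d, so the pattern cannot match there.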

Answer 1

I would load the keywords and compile all the patterns before the search process starts.
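A rough sketch of what precompiling could look like, written to stay within Java 6 syntax. The class and method names (PrecompiledKeywords, compileAll, findMatches) are hypothetical, and Pattern.quote is an added assumption that the keywords are literal text. Note that the edit in the question reports little gain from precompiling alone, so treat this as a starting point rather than a full fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class PrecompiledKeywords {

    // Compiles one whole-word, case-insensitive pattern per keyword, exactly once.
    static Map<String, Pattern> compileAll(List<String> keywords) {
        Map<String, Pattern> patterns = new LinkedHashMap<String, Pattern>();
        for (String keyword : keywords) {
            // Assumption: keywords are literal text, hence Pattern.quote.
            patterns.put(keyword, Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                    Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Returns the keywords found in content, reusing the precompiled patterns.
    static List<String> findMatches(String content, Map<String, Pattern> patterns) {
        List<String> found = new ArrayList<String>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(content).find()) {
                found.add(entry.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, Pattern> patterns = compileAll(Arrays.asList("abc", "xyz"));
        System.out.println(findMatches("ABC, abcd", patterns)); // prints [abc]
    }
}
```

The map returned by compileAll can be built once and reused for every file, so the compilation cost is paid per keyword rather than per keyword per file.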

The next step to improve performance is to use the Java 8 Stream API, which lets you parallelize the compilation and search process.
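One way this might look with a parallel stream is sketched below. It needs Java 8 or later, whereas the question states a Java 6 restriction, so it only applies if that constraint can be lifted. The class name ParallelKeywordSearch and the sample data are invented for the example, and Pattern.quote again assumes literal keywords.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch; requires Java 8+. Names are illustrative only.
final class ParallelKeywordSearch {

    // Compiles and matches each keyword on the common fork-join pool.
    static List<String> findMatches(String content, List<String> keywords) {
        return keywords.parallelStream()
                // Each task builds its own Pattern and Matcher; content is an
                // immutable String, so sharing it across threads is safe.
                .filter(k -> Pattern.compile("\\b" + Pattern.quote(k) + "\\b",
                        Pattern.CASE_INSENSITIVE).matcher(content).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("abc", "xyz", "foo");
        System.out.println(findMatches("ABC. foo, abcd", keywords)); // prints [abc, foo]
    }
}
```

Parallelism spreads the work over the available cores but does not reduce the total work: every keyword still triggers a scan of the content.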

I think this could help.

Improving the performance of string search using Pattern.compile in large files
