Question

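My actual requirement is to list every file in a given directory that contains the search phrase textToMatch, in minimal time (around 4-5 seconds), where the number of files can be {{1} or more.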

I don't want code; I just want the best algorithm.

Answer 1

Since you have to open every file anyway, you might as well use a tool built for this specific task. Use grep:

We have 100000 files to look at.

% ls -l *.txt | wc -l          
100000

They contain the word Vestibulum.

% grep Vestibulum 1.txt        
Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt.
euismod ultrices facilisis. Vestibulum porta sapien adipiscing augue congue id pretium lectus

Count the files that contain Vestibulum, and time the run.

% time grep -l Vestibulum *.txt | wc -l
100000
grep --color=auto -l Vestibulum *.txt  0,28s user 0,25s system 99% cpu 0,537 total
wc -l  0,00s user 0,01s system 1% cpu 0,537 total

As you can see, this takes well under a second on my machine (0.537 s total).

Answer 2

Your program has to deal with two problems:

1. finding every file in every subdirectory, and
2. searching each file for the phrase you need.

For 1: you can search the given directory for files iteratively or recursively, or let Java 7 or 8 do the work for you with FileVisitor, or use Apache Commons IO.
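As a rough illustration of the FileVisitor route, here is a minimal sketch that collects every regular file under a root directory. The class name FileCollector and the method collectFiles are made up for this example, and skipping unreadable entries is an assumption about the desired behaviour, not something the question specifies.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this sketch only.
final class FileCollector {

    // Returns every regular file below root, descending into subdirectories.
    static List<Path> collectFiles(Path root) throws IOException {
        final List<Path> files = new ArrayList<Path>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    files.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Assumption: an unreadable entry is skipped rather than aborting the walk.
                return FileVisitResult.CONTINUE;
            }
        });
        return files;
    }
}
```

Calling FileCollector.collectFiles(Paths.get("/some/dir")) would then give you the list of files to hand to the search step.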

For 2: you can use Java's Scanner, or implement your own very fast algorithm for searching inside files, known as the Boyer-Moore algorithm.
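To give an idea of what that looks like, here is a sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only the bad-character shift table. It is not the full algorithm (the good-suffix rule is left out), and the class name HorspoolSearch and the byte-array interface are choices made for this example.

```java
import java.util.Arrays;

// Hypothetical class for this sketch: Boyer-Moore-Horspool over raw bytes.
final class HorspoolSearch {

    // Returns the index of the first occurrence of needle in haystack, or -1 if absent.
    static int indexOf(byte[] haystack, byte[] needle) {
        int n = haystack.length;
        int m = needle.length;
        if (m == 0) {
            return 0;
        }
        if (m > n) {
            return -1;
        }

        // Shift table: how far the window may move, keyed by the byte
        // currently aligned with the last position of the needle.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[needle[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left inside the current window.
            int j = m - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            pos += shift[haystack[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

One way to use it is to read a file with Files.readAllBytes, encode textToMatch with the file's charset, and treat a result other than -1 as a match. Reading whole files into memory is an assumption that only holds for reasonably small files.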

How do I search for a word across multiple documents in Java?
