Question

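I have around 5000 files, and I need to find which words from a list of about 10000 words occur in each of them. My current code uses a (very) long regular expression to do this, but it is slow.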

import re

wordlist = [...list of around 10000 english words...]
filelist = [...list of around 5000 filenames...]
# One big alternation of every word, matched case-insensitively.
wordlistre = re.compile('|'.join(wordlist), re.IGNORECASE)
discovered = []

for x in filelist:
    with open(x, 'r') as f:
        found = wordlistre.findall(f.read())
    if found:
        # Record the filename together with every match found in it.
        discovered.append([x, found])

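This checks files at a rate of about 5 files per second, which is a lot faster than doing it manually, but it is still slow. Is there a better way to do this?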

Answer 1

If you have access to grep on a command line, you can try the following:

grep -i -f wordlist.txt -r DIRECTORY_OF_FILES

You will need to create a file wordlist.txt containing all the words (one word per line).

Any line in any of your files that matches any of the words will be printed to STDOUT in the following format:

<path/to/file>:<matching line>
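
If the results need to end up back in Python, the output format above can be grouped per file. This is only a minimal sketch: parse_grep_output is a hypothetical helper (not part of grep or any library), and it assumes that no file path contains a colon.

```python
from collections import defaultdict

def parse_grep_output(output):
    """Group grep's '<path>:<matching line>' output by file path."""
    matches = defaultdict(list)
    for line in output.splitlines():
        # Split on the first colon only; the matching line may contain more.
        path, sep, text = line.partition(":")
        if sep:  # skip lines that do not follow the expected format
            matches[path].append(text)
    return dict(matches)

# Made-up sample output, for illustration only.
sample = (
    "docs/a.txt:the quick brown fox\n"
    "docs/a.txt:lazy dog: asleep\n"
    "docs/b.txt:hello world\n"
)
result = parse_grep_output(sample)
```

The text to parse could come from something like subprocess.run([...], capture_output=True, text=True).stdout, with the grep command from above as the argument list.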

Answer 2

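Without more information about your data, a couple of ideas are to use dictionaries instead of lists, and to reduce the amount of data that has to be searched or sorted. Also consider using re.split if your delimiters are not as clean as in the example below: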

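wordlist = 'this|is|it|what|is|it'.split('|')
d_wordlist = {}

# Index the words by first letter so each lookup only touches a small set.
for word in wordlist:
    first_letter = word[0]
    d_wordlist.setdefault(first_letter, set()).add(word)

filelist = [...list of around 5000 filenames...]
discovered = {}

for x in filelist:
    with open(x, 'r') as f:
        # Split the text on whitespace so the loop sees words, not characters.
        for word in f.read().split():
            first_letter = word[0]
            # .get() returns an empty tuple for letters that start no listed word.
            if word in d_wordlist.get(first_letter, ()):
                # setdefault stores the per-file set in discovered before adding to it.
                discovered.setdefault(x, set()).add(word)

# discovered now maps each filename to the set of listed words found in it.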
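
The same idea can be taken one step further by tokenizing each file with a regular expression and intersecting the tokens with a single set. This is a sketch under a few assumptions: find_words and TOKEN_RE are names made up for illustration, a "word" is taken to be a run of ASCII letters and apostrophes, and matching is on whole words only (unlike the regular expression in the question, which also matches inside longer words).

```python
import re

# Assumption: a word is a run of lowercase ASCII letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")

def find_words(text, wordset):
    """Return the words from wordset that occur as whole tokens in text."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    # Set intersection costs roughly the size of the smaller set.
    return tokens & wordset

wordset = {w.lower() for w in ["this", "is", "it", "what"]}
found = find_words("What is THIS thing? It's here.", wordset)
```

In this example "It's" is kept as the single token "it's", so "it" is not reported. Whether that is the desired behaviour depends on the data.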

Answer 3

The Aho-Corasick algorithm was designed for exactly this kind of usage, and is implemented as fgrep in Unix. With POSIX, the command grep -F is defined to perform this function.

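It differs from regular grep in that it only uses fixed strings (not regular expressions) and is optimized for searching for a large number of strings.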
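
To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is an illustration of the algorithm, not the code that fgrep or grep actually runs. The names build_automaton and search are made up for this example. Like grep -F, it reports fixed strings wherever they occur, including inside longer words.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links and output sets for the words."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix in the trie
    out = [set()]  # out[state] -> words that end at this state
    for word in words:
        if not word:
            continue  # an empty pattern would match everywhere; skip it
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the failure state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    """Return the set of words that occur anywhere in text, in one pass."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

automaton = build_automaton(["he", "she", "his", "hers"])
hits = search("ushers", automaton)
```

The automaton is built once from the word list and then reused for every file, so each file is scanned a single time regardless of how many words are in the list. For case-insensitive matching, both the words and the text could be lowercased first.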

To run it on a large number of files, either specify the exact files on the command line or pass them through xargs:

xargs -a filelist.txt grep -F -f wordlist.txt

What xargs does is fill up the command line with as many files as it can, and run grep as many times as necessary:

grep -F -f wordlist.txt (files 1 through 2,500 maybe)
grep -F -f wordlist.txt (files 2,501 through 5,000)

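The exact number of files per invocation depends on the lengths of the individual file names and on the size of the ARG_MAX constant on your system.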
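
The batching can be pictured with a short sketch. This is a simplification, not how xargs is actually implemented: real xargs also accounts for the environment, the command itself and other overhead. batch_by_length is a hypothetical helper, and max_bytes stands in for whatever budget is derived from ARG_MAX.

```python
def batch_by_length(filenames, max_bytes):
    """Yield lists of filenames whose combined encoded length fits in max_bytes."""
    batch, used = [], 0
    for name in filenames:
        size = len(name.encode()) + 1  # +1 for the NUL byte that ends each argument
        if batch and used + size > max_bytes:
            # The next name would overflow the budget: emit the current batch.
            yield batch
            batch, used = [], 0
        batch.append(name)
        used += size
    if batch:
        yield batch

# Tiny budget chosen only to make the split visible.
batches = list(batch_by_length(["a.txt", "bb.txt", "ccc.txt", "d.txt"], max_bytes=14))
```

Each batch would then correspond to one grep invocation. On Unix-like systems the limit itself can usually be read with os.sysconf("SC_ARG_MAX"), although the usable portion is smaller than that value.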

What is the best way to search for a large number of words in a large number of files?
