Question

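I am searching text files for words: I need to check whether a word appears in a large number of files. My program works with a single word, and I want to extend it to a list of words, but I can't get that to work.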

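import errno
import ntpath

# files, output_file and pattern_to_find are defined elsewhere in the program.
for name in files:
    try:
        with open(name, errors='ignore') as f:
            found = "FALSE"
            pos = 0
            # Stop at the first line that contains one of the words.
            for line in f:
                pos = pos + 1
                if pattern_finder(line):
                    found = "TRUE"
                    break
            output_file.write(ntpath.basename(f.name) + ';' + found + ';' + str(pos) + ';' + line)
    except IOError as exc:
        # Directories are skipped silently; any other I/O error is re-raised.
        if exc.errno != errno.EISDIR:
            print("No Files Found")
            raise
output_file.close()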



def pattern_finder(file_line):
    # Return True as soon as any word from the list occurs in the line.
    for word in pattern_to_find:
        if word in file_line:
            return True
    return False

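The word is never found in the line. If I avoid the list and hard-code the value, i.e. word = "WORD_IM_LOOKING", it works fine. I believe I have a conceptual problem with how to take a value from the list and use it to check whether it occurs in the line (something like list.index?). Can anyone advise?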

Answer 1

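For this you want a regular expression. You can build an alternation pattern by joining the list of words with the pipe character using str.join, and then compile it. Example: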

import re
from pathlib import Path

def main():
    search_words = ['words', 'one', 'two']
    # Join the words with '|' so the pattern matches any one of them.
    p = re.compile(r'|'.join(search_words), re.IGNORECASE|re.MULTILINE)
    files_with_words = []
    # Check every .txt file in the current directory.
    for file in Path().glob('*.txt'):
        if p.search(file.read_text()):
            files_with_words.append(file.name)
    print(files_with_words)


if __name__ == '__main__':
    main()
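
One caveat with joining the words directly: a word that contains regex metacharacters (for example "c++" or "a.b") is interpreted as pattern syntax rather than literal text. A minimal sketch of one way around that, assuming the words should be matched literally, is to pass each word through re.escape first. The helper name build_word_pattern is made up for this illustration, and stripping whitespace from the words is an extra assumption (useful if the list was read from a file).

```python
import re


def build_word_pattern(words):
    """Compile a case-insensitive pattern that matches any of the words literally."""
    # Strip stray whitespace and drop empty entries; an empty alternative
    # would match at every position.
    cleaned = [w.strip() for w in words if w.strip()]
    if not cleaned:
        raise ValueError("no search words given")
    # re.escape turns metacharacters such as '+' and '.' into literals.
    return re.compile("|".join(re.escape(w) for w in cleaned), re.IGNORECASE)


pattern = build_word_pattern(["c++", "a.b", "two\n"])
print(bool(pattern.search("I like C++")))  # True
print(bool(pattern.search("axb")))         # False: the dot is literal here
```

The compiled pattern can be used in place of p in the example above.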

Edit: updated to show the line number, position, and word that was found.

import re
from pathlib import Path


def main():
    search_words = ['words', 'one', 'two']
    # \b restricts matches to whole words, e.g. 'one' but not 'stone'.
    p = re.compile(fr"\b({'|'.join(search_words)})\b", re.IGNORECASE)
    for file in Path().glob('*.txt'):
        # Open the Path itself rather than file.name so the full path is used.
        with open(file) as f:
            # Count lines from 1, as a text editor would.
            for i, line in enumerate(f, start=1):
                re_search_obj = p.search(line)
                if re_search_obj:
                    print("file={}, line={}, pos={}, word={}".format(
                        file.name, i, re_search_obj.span(), re_search_obj.group()
                    ))


if __name__ == '__main__':
    main()
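
If you would rather keep the plain "word in line" test from the question, the same idea can be sketched without regular expressions. This is only an illustration under a few assumptions: the names first_match and report_row are made up, the match is case-insensitive (like the regex above, unlike the original loop), and the words are stripped of surrounding whitespace in case the list was read from a file with one word per line. When nothing matches, the sketch reports line 0 and an empty line, which differs from the original loop.

```python
def first_match(lines, words):
    """Return (found, line_number, line) for the first line containing any word."""
    # Assumption: entries may carry trailing newlines or spaces, so strip them.
    needles = [w.strip().lower() for w in words if w.strip()]
    for line_number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(needle in lowered for needle in needles):
            return True, line_number, line.rstrip("\n")
    return False, 0, ""


def report_row(file_name, lines, words):
    """Format one result in the question's name;found;line_number;line layout."""
    found, line_number, line = first_match(lines, words)
    return ";".join([file_name, "TRUE" if found else "FALSE", str(line_number), line])


sample = ["alpha\n", "beta Two gamma\n"]
print(report_row("sample.txt", sample, ["one", "two\n"]))  # sample.txt;TRUE;2;beta Two gamma
```

Because first_match only iterates over lines, an open file object can be passed in directly, for example inside the question's "with open(name, errors='ignore') as f:" block.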
