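Split files into words and check whether they contain any word from another file

Time: 2018-10-25 21:00:40

Tags: python regex python-3.x python-2.7

I am trying to match one file against others, to see whether any word from the first file (set1) appears in any document in a directory.

Code: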

import glob
import re
from nltk.corpus import PlaintextCorpusReader
import nltk


folder_path = "/home/#"
file_pattern = "/*.txt"


corpus_root = "/home/#" 
wordlists = PlaintextCorpusReader(corpus_root, '.*') 
wordlists.fileids()
set1=set(wordlists.words('locations.txt'))
set2=set(wordlists.words('names.txt'))


match_list = []

folder_contents = glob.glob(folder_path + file_pattern)

for file in folder_contents:
    read_file = open(file, 'rt').read()
    if set1 in read_file:
        match_list.append(file)
        print(file)

Output:

TypeError                                 Traceback (most recent call last)
<ipython-input-44-c63210fee01a> in <module>()
     23     read_file = open(file, 'rt').read()
     24     words=read_file.split()
---> 25     if set1 in read_file:
     26         match_list.append(file)
     27         print(file)

TypeError: 'in <string>' requires string as left operand, not set

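Is there any way to check whether any word in set1 appears in any of the files in the directory?

1 answer:

Answer 0 (score: 2)

The `in` operator with a string on the right-hand side only accepts a string on the left, so it cannot test a whole set at once. Load the contents of read_file into a set of words instead, then use set.intersection() to check whether it shares any word with set1: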

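for file in folder_contents:
    # Close the file promptly instead of leaking the handle.
    with open(file, 'rt') as f:
        read_file = f.read()
    # split() with no argument splits on any whitespace, including newlines.
    if set1.intersection(read_file.split()):
        match_list.append(file)
        print(file)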
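
For anyone who wants to try the idea without the NLTK corpus from the question, here is a minimal self-contained sketch. The names `tokenize`, `files_with_any_word` and `demo` are hypothetical, and the `\w+` regex tokenizer is my own assumption, swapped in for NLTK's tokenization. It is there because plain whitespace splitting leaves punctuation attached ("Paris," would not match "Paris"), whereas the words in set1 come from NLTK, which as far as I know separates punctuation from words by default. Matching is case-sensitive.

```python
import glob
import os
import re
import tempfile


def tokenize(text):
    """Return the set of word-like tokens (letters, digits, underscore) in text."""
    return set(re.findall(r"\w+", text))


def files_with_any_word(words, folder, pattern="*.txt"):
    """Return sorted paths of files in folder that contain at least one of words."""
    words = set(words)
    matches = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, "rt") as handle:
            tokens = tokenize(handle.read())
        # isdisjoint() is False when the two sets share at least one element.
        if not words.isdisjoint(tokens):
            matches.append(path)
    return matches


def demo():
    """Create a throwaway folder with two files and return the matching file names."""
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "a.txt"), "w") as handle:
            handle.write("We flew to Paris,\nthen drove north.")
        with open(os.path.join(folder, "b.txt"), "w") as handle:
            handle.write("Nothing relevant here.")
        found = files_with_any_word({"Paris", "London"}, folder)
        return [os.path.basename(p) for p in found]


if __name__ == "__main__":
    print(demo())
```

`isdisjoint()` is used rather than `intersection()` only because it can stop at the first shared word; if you also want to know which words matched, `words & tokens` gives you that set.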