I am trying to match one file against others, to see whether any word from the first file (set1) appears in any document in a directory.
import glob
import re
from nltk.corpus import PlaintextCorpusReader
import nltk
folder_path = "/home/#"
file_pattern = "/*.txt"
corpus_root = "/home/#"
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
set1=set(wordlists.words('locations.txt'))
set2=set(wordlists.words('names.txt'))
match_list = []
folder_contents = glob.glob(folder_path + file_pattern)
for file in folder_contents:
    read_file = open(file, 'rt').read()
    if set1 in read_file:
        match_list.append(file)
        print(file)
TypeError                                 Traceback (most recent call last)
<ipython-input-44-c63210fee01a> in <module>()
     23     read_file = open(file, 'rt').read()
     24     words=read_file.split()
---> 25     if set1 in read_file:
     26         match_list.append(file)
     27         print(file)
TypeError: 'in <string>' requires string as left operand, not set
Is there any way to check whether any word in set1 appears in any of the files in the directory?
Answer 0 (score: 2)
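Load the contents of read_file into a set of words, then use set.intersection(). An empty intersection is falsy, so the file is recorded only when at least one word is shared: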
for file in folder_contents:
    with open(file, 'rt') as f:
        read_file = f.read()
    # split() with no argument splits on any whitespace, including newlines
    if set1.intersection(set(read_file.split())):
        match_list.append(file)
        print(file)
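For what it's worth, splitting on whitespace alone leaves punctuation attached to words, so "Paris," would not match "Paris". Below is a rough, self-contained sketch of the same idea that tokenizes with a regex instead. The function name files_containing_any, the WORD_RE pattern, and the sample file contents are all made up for illustration and are not from the question; it also assumes the files are UTF-8 text. As far as I know, the words in set1 come from NLTK's own tokenizer, so a simple \w+ pattern may not line up with them in every case.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical tokenizer: runs of letters, digits and underscores count as words.
WORD_RE = re.compile(r"\w+")


def files_containing_any(words, folder, pattern="*.txt"):
    """Return the paths of files in `folder` that contain at least one of `words`."""
    wanted = set(words)
    matches = []
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: files are UTF-8; undecodable bytes are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        tokens = set(WORD_RE.findall(text))
        # isdisjoint() is False as soon as the two sets share one element.
        if not wanted.isdisjoint(tokens):
            matches.append(str(path))
    return matches


# Small demonstration on throwaway files (contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("I flew to Paris,\nthen home.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Nothing relevant here.", encoding="utf-8")
    for match in files_containing_any({"Paris", "Berlin"}, tmp):
        print(match)
```

In the question's setup this would be called as files_containing_any(set1, folder_path). Matching is case-sensitive here; lowercasing both sides before comparing is one option if that matters.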