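I need to be able to compare two text files. File 1 is a chat log and file 2 is a word list of keywords. I am struggling to get the output I want. Ideally, I would get output every time a keyword from file 2 appears in the chat log in file 1. Any ideas on how to achieve this output?
Edit:
This is the code I am currently trying to use, but the output I get is that it prints both files to the text box in the GUI. The output needs to show which lines of file 1 the words from file 2 appear on. Some of the code comes from a keyword search function that I already have working.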
def wordlistsearch():
    filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt") ,))  # file1
    mtxt = filename.readline()
    i = 0
    filename2 = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt") ,))  # file2
    while i < 10000:
        keystring = filename2.readline()
        print keystring
        participant = mtxt.split("(")[0]
        temppart2 = mtxt.split("(")[-1]
        keyword = temppart2.split(")")[0]
        if mtxt.find(str(keystring)) != -1:
            print i, ": ", mtxt
        i = i + 1
        mtxt = filename.readline()
Answer 0 (score: 1)
If you want to find all the words in file 1 that also appear in file 2, you can use:
# Collect every whitespace-delimited word from each file into a set
keywords = {word for line in open("keyword_file", "r") for word in line.split()}
words = {word for line in open("log_file", "r") for word in line.split()}
# Words that occur in both files
common = words.intersection(keywords)
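One caveat with a plain split(): tokens keep their case and any attached punctuation, so "Hello," would not match the keyword "hello". Below is a minimal sketch of a normalized intersection, assuming Python 3 and that case-insensitive matching is wanted. The helper names normalize and common_words and the sample strings are hypothetical, not part of the original answer.

```python
import string

def normalize(word):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return word.strip(string.punctuation).lower()

def common_words(keyword_lines, log_lines):
    """Return the set of normalized words present in both inputs."""
    keywords = {normalize(w) for line in keyword_lines for w in line.split()}
    words = {normalize(w) for line in log_lines for w in line.split()}
    # Discard the empty string left over from punctuation-only tokens.
    return (keywords & words) - {""}

# Any iterable of lines works here, including an open file object.
common = common_words(
    ["hello", "pizza"],
    ["Alice (10:01): Hello, Bob!", "Bob (10:02): bye"],
)
```

Whether to normalize at all depends on the keyword list; for case-sensitive keywords the plain version above may be the better fit.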
To find the matches while reading file 1:
keywords = set([word for line in open("keyword_file","r") for word in line.split()])
for line in open("log_file","r"):
    # split() yields words; iterating over the line itself would yield single characters
    for word in line.split():
        if word in keywords:
            print("found {0} in line {1}".format(word, line))
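Since the question asks which lines of file 1 contain the keywords, the same loop can be extended with enumerate to report line numbers. This is a minimal sketch, assuming Python 3 and exact whitespace-delimited word matches. The function name find_keyword_lines and the sample data are hypothetical.

```python
def find_keyword_lines(keyword_lines, log_lines):
    """Return (line_number, keyword, line_text) for every keyword hit."""
    keywords = {word for line in keyword_lines for word in line.split()}
    hits = []
    # Number lines from 1, the way a text editor would.
    for line_number, line in enumerate(log_lines, start=1):
        for word in line.split():
            if word in keywords:
                hits.append((line_number, word, line.rstrip("\n")))
    return hits

# In-memory sample data; open file objects can be passed the same way.
hits = find_keyword_lines(
    ["pizza beer"],
    ["Ann (1): hi\n", "Bob (2): pizza and beer\n"],
)
for line_number, word, text in hits:
    print("{0}: found {1} in: {2}".format(line_number, word, text))
```

With real files, the two arguments would be the open keyword file and the open chat log, since iterating over a file object yields its lines.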
Answer 1 (score: 0)
This is a very good question. Personally, I think you could do it like this:
# I assume the keywords file contains non-repeated words separated by whitespace
keywords_file = open('path_to_file_keywords')
keywords_dict = {word: 0 for word in keywords_file.read().split()}  # split() with no arguments also drops '\n' characters
keywords_file.close()
# Then read the chat log
chat_log_file = open('path_to_file_chat_log')
chat_log_words_generator = (word for word in chat_log_file.read().split())  # Create a generator with the words from the chat log
for word in chat_log_words_generator:
    try:
        word_count = keywords_dict[word]
    except KeyError:
        continue  # The word is not a keyword
    word_count += 1  # increment the total
    keywords_dict[word] = word_count  # store the updated count in the dict
chat_log_file.close()
In the end, keywords_dict should contain the number of occurrences of each keyword.
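The same tally can probably be written more compactly with collections.Counter from the standard library. This is a minimal sketch, assuming Python 3. The function name count_keywords and the sample text are hypothetical, not taken from the answer.

```python
from collections import Counter

def count_keywords(keywords, chat_text):
    """Count how often each keyword occurs as a whitespace-delimited word."""
    keyword_set = set(keywords)
    counts = Counter(word for word in chat_text.split() if word in keyword_set)
    # Include keywords that never occur, with a count of 0.
    return {keyword: counts[keyword] for keyword in keyword_set}

result = count_keywords(["pizza", "beer"], "pizza and more pizza please")
```

Like the loop above, this counts exact matches only, so "Pizza" or "pizza," would not be counted as "pizza".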