Comparing one text file against another text file in Python?

Time: 2017-03-29 15:44:12

Tags: python

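I need to be able to compare two text files. File 1 is a chat log and file 2 is a word list of keywords. I am struggling to get the output I want, which ideally would report every time one of the keywords from file 2 appears in the chat log in file 1. Any ideas on how to achieve this output?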

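Edit:

This is the code I am currently trying to use, but the output I get is that it prints both files to the text box in the GUI. The output needs to show on which lines of file 1 the words from file 2 appear. Some of the code comes from a keyword search function that I already have working.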

import tkFileDialog  # Python 2 module name

def wordlistsearch():
    filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))  # file 1: chat log
    mtxt = filename.readline()
    i = 0
    filename2 = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))  # file 2: keyword list

    while i < 10000:
        keystring = filename2.readline()
        print keystring
        participant = mtxt.split("(")[0]
        temppart2 = mtxt.split("(")[-1]
        keyword = temppart2.split(")")[0]
        if mtxt.find(str(keystring)) != -1:
            print i, ": ", mtxt
        i = i + 1
        mtxt = filename.readline()

2 answers:

Answer 0 (score: 1)

If you want to find all the words in file 1 that are also in file 2, you can use:

keywords = set([word for line in open("keyword_file","r") for word in line.split()])

words = set([word for line in open("log_file","r") for word in line.split()])

common = words.intersection(keywords)

To find the matches while reading file 1:

keywords = set([word for line in open("keyword_file","r") for word in line.split()])

for line in open("log_file","r"):
    for word in line.split():  # split into words; iterating the string directly yields single characters
        if word in keywords:
            print("found {0} in line {1}".format(word, line))
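Since the question asks for the lines of file 1 on which the keywords appear, the loop above can be extended to report line numbers as well. The following is only a minimal sketch under a few assumptions that are not in the original posts: the helper name find_keyword_lines is hypothetical, matching is made case-insensitive, surrounding punctuation is stripped from each word, and io.StringIO objects with made-up sample text stand in for the two real files.

```python
import io

def find_keyword_lines(log_lines, keywords):
    """Yield (line_number, keyword, line) for each keyword found on each line."""
    for line_number, line in enumerate(log_lines, start=1):
        # Strip surrounding punctuation and lowercase, so "Hello," matches "hello"
        words = set(w.strip(".,!?;:()\"'").lower() for w in line.split())
        for keyword in sorted(words & keywords):
            yield line_number, keyword, line.rstrip("\n")

# In-memory stand-ins for the two files (hypothetical sample data)
keyword_file = io.StringIO("hello\nmeeting\n")
log_file = io.StringIO(
    "Alice (10:01): Hello, Bob!\n"
    "Bob (10:02): hi\n"
    "Alice (10:03): Meeting at noon?\n"
)

keywords = set(word.lower() for line in keyword_file for word in line.split())
matches = list(find_keyword_lines(log_file, keywords))
for line_number, keyword, line in matches:
    print("line {0}: found '{1}' in: {2}".format(line_number, keyword, line))
```

With real files, the two StringIO objects would be replaced by open("keyword_file") and open("log_file"). Whether to lowercase and strip punctuation depends on how the chat log is formatted, so treat those two steps as optional.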

Answer 1 (score: 0)

This is a very good question. Personally, I think you could do it like this:

# Assumes the keywords file holds non-repeated words separated by whitespace
keywords_file = open('path_to_file_keywords')
# read() returns the whole file as one string; split() breaks it on any whitespace, which also drops '\n'
keywords_dict = {word: 0 for word in keywords_file.read().split()}

# Then read the chat log
chat_log_file = open('path_to_file_chat_log')
# Generator that yields the words of the chat log one line at a time
chat_log_words_generator = (word for line in chat_log_file for word in line.split())


for word in chat_log_words_generator:
    try:
        word_count = keywords_dict[word]
    except KeyError:
        continue  # The word is not a keyword
    word_count += 1  # increment the total
    keywords_dict[word] = word_count  # store the updated count back in the dict

In the end, keywords_dict should contain the number of occurrences of every keyword.
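The same tally could also be sketched with collections.Counter from the standard library, which avoids the try/except bookkeeping. This is an alternative illustration, not the approach posted in the answer: the function name count_keywords and the sample words are hypothetical, and the inputs are plain Python collections rather than files.

```python
from collections import Counter

def count_keywords(chat_words, keywords):
    """Return a Counter of how often each keyword occurs in chat_words."""
    counts = Counter(word for word in chat_words if word in keywords)
    # Also list keywords that never occur, with a count of 0
    for keyword in keywords:
        counts.setdefault(keyword, 0)
    return counts

# Hypothetical sample data standing in for the contents of the two files
keywords = {"hello", "meeting", "lunch"}
chat_words = "hello bob hello alice meeting at noon".split()

counts = count_keywords(chat_words, keywords)
print(dict(counts))
```

Keeping the zero-count keywords in the result mirrors the dict in the answer above, where every keyword starts at 0.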