How do I compare word frequencies in two text files?

Time: 2018-11-07 20:25:01

Tags: python python-3.x dictionary frequency word-frequency

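How can I compare word frequencies in two text files in Python? For example, if a word appears in both file1 and file2, it should be written only once, and its frequencies should not be added together when comparing, so the result should be {'The':3,5}. Here 3 is the frequency in file 1 and 5 is the frequency in file 2. If a word exists in only one of the files, its frequency for the other file should be 0. Please help. This is what I have done so far: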

import operator
f1=open('file1.txt','r') #file 1
f2=open('file2.txt','r') #file 2

wordlist=[]
wordlist2=[]
for line in f1:
    for word in line.split():
        wordlist.append(word)

for line in f2:
    for word in line.split():
        wordlist2.append(word)

worddictionary = {}
for word in wordlist:
    if word in worddictionary:
        worddictionary[word] += 1
    else:
        worddictionary[word] = 1

worddictionary2 = {}
for word in wordlist2:
    if word in worddictionary2:
        worddictionary2[word] += 1
    else:
        worddictionary2[word] = 1

print(worddictionary)
print(worddictionary2)

3 answers:

Answer 0 (score: 2)

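Keeping the code you have written, you can create the combined dictionary from the two dictionaries it already builds.

Edit: here is a more general way to do this for any list of files (explanations are in the comments):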

f1 = open('file1.txt', 'r')  # file 1
f2 = open('file2.txt', 'r')  # file 2

file_list = [f1, f2]  # This would hold all your open files
num_files = len(file_list)

frequencies = {}  # We'll just make one dictionary to hold the frequencies

for i, f in enumerate(file_list):  # Loop over the files, keeping an index i
    for line in f:  # Get the lines of that file
        for word in line.split():  # Get the words of that line
            if word not in frequencies:
                # Make a list of 0's for any word you haven't seen yet -- one 0 for each file
                frequencies[word] = [0] * num_files

            frequencies[word][i] += 1  # Increment the frequency count for that word and file

for f in file_list:  # Close the files once they have been read
    f.close()

print(frequencies)
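If you would rather keep the two dictionaries that the question's code already builds, they can be merged afterwards. The following is only a minimal sketch: the helper name `combine` and the sample counts are made up for illustration, and the two input dictionaries stand in for `worddictionary` and `worddictionary2` from the question.

```python
def combine(first, second):
    """Merge two {word: count} dicts into {word: [count_in_first, count_in_second]}."""
    combined = {}
    # The union of the key sets gives every word that occurs in either file
    for word in set(first) | set(second):
        # dict.get returns 0 when the word is missing from that file's dictionary
        combined[word] = [first.get(word, 0), second.get(word, 0)]
    return combined


# Hypothetical sample data standing in for the dictionaries built in the question
worddictionary = {'The': 3, 'cat': 1}
worddictionary2 = {'The': 5, 'dog': 2}

combined = combine(worddictionary, worddictionary2)
print(combined['The'])  # [3, 5]
```

A word that occurs in only one file gets 0 for the other, which is what the question asks for.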

Answer 1 (score: 0)

Edit: I misunderstood the question; this code should now solve your problem.

f1 = open('file1.txt', 'r')  # file 1
f2 = open('file2.txt', 'r')  # file 2

wordList = {}

for line in f1.readlines():  # for each line in lines (file.readlines() returns a list)
    for word in line.split():  # for each word in each line
        if word not in wordList:  # if the word is not already in our dictionary
            wordList[word] = 0  # add the word to the dictionary

for line in f2.readlines():  # for each line in lines (file.readlines() returns a list)
    for word in line.split():  # for each word in each line
        if word in wordList:  # if the word is already in our dictionary
            wordList[word] = wordList[word] + 1  # add one to its value

f1.close()  # close files
f2.close()

f1 = open('file1.txt', 'r')  # Have to re-open because we are at the end of the file.
# There might be an easier way of doing this

for line in f1.readlines():  # Removing keys whose values are 0
    for word in line.split():  # for each word in each line
        try:
            if wordList[word] == 0:  # if its value is 0
                del wordList[word]  # remove it from the dictionary
            else:
                wordList[word] = wordList[word] + 1  # if its value is not 0, add one to it for each occurrence in file1
        except KeyError:
            pass  # the word was already removed from wordList
f1.close()

print(wordList)

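Add the words from the first file; if a word also appears in the second file, add one to its value. After that, check each word and remove it if its value is 0.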

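This cannot be done by iterating over the dictionary directly, because deleting keys changes its size while it is being iterated over.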
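Deleting from a dictionary while looping over it directly does raise a `RuntimeError`, but looping over a snapshot of the keys avoids the problem without re-reading the file. A minimal sketch, assuming made-up sample counts in a hypothetical `word_counts` dictionary:

```python
# Hypothetical sample data: words mapped to counts, some of them 0
word_counts = {'the': 0, 'cat': 2, 'dog': 0, 'sat': 1}

# list(word_counts) copies the keys first, so deleting entries inside the
# loop does not change the size of the object being iterated over
for word in list(word_counts):
    if word_counts[word] == 0:
        del word_counts[word]

print(word_counts)  # {'cat': 2, 'sat': 1}

# Alternative: build a new dictionary that keeps only the non-zero entries
nonzero = {word: count for word, count in word_counts.items() if count != 0}
```

Either form leaves only the words whose value is not 0.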

Here is how you would implement it for multiple files (more complex):

fileList = ["file1.txt", "file2.txt"]
openList = []
for fileName in fileList:  # open every file in the list
    openList.append(open(fileName, 'r'))

fileWords = []

for i, file in enumerate(openList):  # for each file
    fileWords.append({})  # add a dictionary to our list
    for line in file:  # for each line in each file
        for word in line.split():  # for each word in each line
            if word in fileWords[i]:  # if the word is already in our dictionary
                fileWords[i][word] += 1  # add one to it
            else:
                fileWords[i][word] = 1  # add it to our dictionary with value 1

for file in openList:
    file.close()

for i, wL in enumerate(fileWords):
    print(f"File: {fileList[i]}")
    for item in wL.items():
        print(item)
    #print(f"File {i}\n{wL}")
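The code above prints one dictionary per file. To compare the files side by side, the per-file dictionaries can be merged into one table. This is a sketch only: `merge_counts` is a hypothetical helper, and the sample list stands in for the `fileWords` list built above.

```python
def merge_counts(per_file_counts):
    """Turn a list of {word: count} dicts into {word: [count per file]}."""
    all_words = set()
    for counts in per_file_counts:
        all_words.update(counts)  # collect every word seen in any file
    # For each word, look it up in every file's dictionary, using 0 if absent
    return {
        word: [counts.get(word, 0) for counts in per_file_counts]
        for word in sorted(all_words)
    }


# Hypothetical sample data in the same shape as fileWords
file_words = [{'the': 3, 'cat': 1}, {'the': 5, 'dog': 2}]
merged = merge_counts(file_words)
print(merged['the'])  # [3, 5]
```

The position of each count in the list matches the position of the file in the input list.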

Answer 2 (score: 0)

You may find the following demo program a good starting point for getting the word frequencies of a file:

#! /usr/bin/env python3
import collections
import pathlib
import pprint
import re
import sys


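def main():
    # sys.argv[0] is the path of this script, so the demo counts its own words
    freq = get_freq(sys.argv[0])
    pprint.pprint(freq)


def get_freq(path):
    if isinstance(path, str):
        path = pathlib.Path(path)
    # read_text() opens the file, reads it, and closes it again
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', path.read_text())
    )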


if __name__ == '__main__':
    main()

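In particular, you will want to use the get_freq function to get a Counter object that tells you how often each word occurs. Your program can call get_freq several times with different file names, and you should find that Counter objects behave much like the dictionaries you were using before.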
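To illustrate how two Counter objects could be compared, here is a small sketch. It assumes a hypothetical `count_words` helper that applies the same regular expression as get_freq to a string instead of a file, so that the example runs without any files on disk; the sample sentences are made up.

```python
import collections
import re


def count_words(text):
    """Count words in a string using the same pattern as get_freq."""
    return collections.Counter(
        match.group() for match in re.finditer(r'\b\w+\b', text)
    )


# Hypothetical sample texts standing in for the contents of two files
first = count_words("The cat and the hat")
second = count_words("The dog")

# Indexing a Counter with a missing word returns 0 instead of raising KeyError,
# so no special handling is needed for words found in only one text
comparison = {
    word: (first[word], second[word])
    for word in first.keys() | second.keys()
}
print(comparison['cat'])  # (1, 0)
```

Note that the counting is case-sensitive, so 'The' and 'the' are treated as different words; lowercasing the text first would merge them.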