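How can I use collections.Counter to get word frequencies, even when the count is zero?

Asked: 2016-06-03 14:34:24

Tags: python counter

I am trying to count how often certain words occur across multiple files in a directory, and thanks to this answer I can get results when a word does occur. However, I can't figure out how to show a result when a word occurs 0 times.

For example, this is the result I want, so that I always get a result for every specified word, with the specified words on the first line and the counts below.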

21, 23, 60
4, 0, 8

Here is my current code:

import csv
import copy
import os
import sys
import glob
import string
import fileinput
from collections import Counter

def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word
    ct = Counter(dict((w, 0) for w in words))
    file_words = (word for line in fileobj for word in line.split())
    filtered_words = (word for word in file_words if word in words)
    return Counter(filtered_words)


def count_words_in_dir(dirpath, words, action):
    """For each .txt_out file in a dir, count the specified words"""
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt_out')):
        with open(filepath) as f:
            ct = word_frequency(f, words)
            action(filepath, ct)


def final_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    with open('new.csv', 'a') as f:
        f.write('{0},{1}\n,{2}\n'.format(
            filepath,
            ', '.join(words),
            ', '.join(counts)))


words = set(['21','23','60','75','79','86','107','121','147','193','194','197','198','199','200','201','229','241','263','267','309','328'])
count_words_in_dir(r'C:\Users\jllevent\Documents\PE Submsissions\Post-CLI', words, action=final_summary)

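2 Answers:

Answer 0 (score: 1)

You never use the ct counter that you build in word_frequency. Instead, you return a new Counter that contains only the words that were actually found. You need to update and return the ct you built, for example: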

...
for word in file_words:
    if word in words:
        ct[word] += 1
return ct

Or, as @ShadowRanger pointed out:

ct.update(word for word in file_words if word in words)
return ct
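
Putting the pieces together, here is a minimal sketch of what the whole function might look like with the update() variant. The sample text and the three-word set are made up for illustration, and io.StringIO stands in for one of the real files; sorting with key=int is an assumption that every word is numeric, as in the question.

```python
import io
from collections import Counter


def word_frequency(fileobj, words):
    """Count only the specified words, keeping zero entries for absent ones."""
    # Seed every requested word with 0 so it stays in the result
    ct = Counter(dict.fromkeys(words, 0))
    file_words = (word for line in fileobj for word in line.split())
    # update() adds to the seeded counts instead of building a new Counter
    ct.update(word for word in file_words if word in words)
    return ct


# Hypothetical sample input standing in for one of the .txt_out files
sample = io.StringIO("21 60 21 99\n60 21 21\n")
ct = word_frequency(sample, {"21", "23", "60"})

# Assumes the words are numeric strings, so sort them by integer value
ordered = sorted(ct, key=int)
header_row = ", ".join(ordered)
count_row = ", ".join(str(ct[w]) for w in ordered)
print(header_row)
print(count_row)
```

With this input the word 23 never occurs, yet it still shows up in both rows because it was seeded with 0 before counting.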

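Answer 1 (score: -1)

If the word doesn't occur, it looks like it returns NULL. Add a conditional return statement: if the value it returns is not an int > 0, return 0.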
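
For reference, a small sketch of how a Counter treats words it has never seen, using made-up input. As far as the standard library behaves, looking up an absent key gives 0 rather than None, but the key is not stored, so it is missing from keys() unless the Counter was seeded beforehand.

```python
from collections import Counter

# Hypothetical word list; "23" never occurs in it
ct = Counter(["21", "60", "21"])
missing = ct["23"]       # lookup of an absent key yields 0
listed = "23" in ct      # the lookup does not store the key

# Seeding with zeros first keeps the absent word in the result
seeded = Counter(dict.fromkeys(["21", "23", "60"], 0))
seeded.update(["21", "60", "21"])
seeded_items = sorted(seeded.items())
print(missing, listed, seeded_items)
```

So a summary that iterates over ct.keys() only lists zero-count words when they were put into the Counter explicitly.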