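I need to count the frequency of every key of the dictionary "d" across all the files in the folder "individual-articles". The folder "individual-articles" contains about 20000 txt files, named 1, 2, 3, 4 and so on. For example: suppose d[Britain] = [5, 76, 289]. The result must give the number of times Britain appears in the files 5.txt, 76.txt and 289.txt, and I also need its frequency across all files in the same folder.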
import collections
import sys
import os
import re
sys.stdout=open('dictionary.txt','w')
from collections import Counter
from glob import glob

folderpath='d:/individual-articles'
counter=Counter()

filepaths = glob(os.path.join(folderpath,'*.txt'))

def words_generator(fileobj):
    for line in fileobj:
        for word in line.split():
            yield word

word_count_dict = {}
for file in filepaths:
    f = open(file,"r")
    words = words_generator(f)
    for word in words:
        if word not in word_count_dict:
            word_count_dict[word] = {"total":0}
        if file not in word_count_dict[word]:
            word_count_dict[word][file] = 0
            word_count_dict[word][file] += 1
        word_count_dict[word]["total"] += 1

for k in word_count_dict.keys():
    for filename in word_count_dict[k]:
        if filename == 'total': continue
        counter.update(filename)

for k in word_count_dict.keys():
    for count in counter.most_common():
        print('{} {}'.format(word_count_dict[k],count))
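How can I find the frequency of Britain in only those files that appear in the list stored under that key?
I need to store these values in another dictionary, d2. For the same example, d2 must contain
(Britain, 26, 1200) (Spain, 52, 6795) (France, 45568)
where 26 is the frequency of the word Britain in the files 5.txt, 76.txt and 289.txt, and 1200 is the frequency of the word Britain across all files. The same goes for Spain and France.
I am using Counter here, and I think that is where the flaw is, because everything works so far except my last loop!
I am a Python newbie and I have tried my best! Please help!!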
Answer 0 (score: 0)
word_count_dict["Britain"] is a regular dictionary. Just loop over it:
for filename in word_count_dict["Britain"]:
    if filename == 'total': continue
    print("Britain appears in {} {} times".format(filename, word_count_dict["Britain"][filename]))
Or retrieve all the keys with:
word_count_dict["Britain"].keys()
Note that this dictionary also contains one special key, 'total', which is not a file name.
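It may just be that your indentation is off in the post, but it looks like you are not counting the per-file entries correctly: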
if file not in word_count_dict[word]:
    word_count_dict[word][file] = 0
    word_count_dict[word][file] += 1
word_count_dict[word]["total"] += 1
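This only counts the word (+= 1) when file has not been seen before in that word's dictionary, so every per-file count stays at 1. Corrected to: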
if file not in word_count_dict[word]:
    word_count_dict[word][file] = 0
# Outside the if, so it runs for every occurrence of the word in this file.
word_count_dict[word][file] += 1
word_count_dict[word]["total"] += 1
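As an alternative to the manual "not in" checks, here is a minimal sketch that uses collections.defaultdict and collections.Counter so that missing entries start at zero automatically. The function name count_words and the sample data are hypothetical, and the input is a mapping from file name to text rather than real files on disk, purely to keep the sketch self-contained. Note that it does not store a 'total' key; the total is simply the sum of the per-file counts.

```python
from collections import Counter, defaultdict


def count_words(texts):
    """Return {word: Counter({filename: count})} for a {filename: text} mapping."""
    per_word = defaultdict(Counter)
    for filename, text in texts.items():
        for word in text.split():
            # Counter entries default to 0, so no existence check is needed.
            per_word[word][filename] += 1
    return per_word


# Hypothetical sample data standing in for the .txt files.
sample = {
    "5.txt": "Britain and Spain and Britain",
    "76.txt": "Britain alone",
}
counts = count_words(sample)
print(counts["Britain"]["5.txt"])       # count in a single file
print(sum(counts["Britain"].values()))  # total across all files
```

With this layout there is no special key to skip when looping over the file names of a word.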
To extend this to arbitrary words, loop over the outer word_count_dict:
# .items() works on both Python 2 and 3 (iteritems() exists only on Python 2).
for word, counts in word_count_dict.items():
    print('Total counts for word {}: {}'.format(word, counts['total']))
    for filename, count in counts.items():
        if filename == 'total': continue
        print("{} appears in {} {} times".format(word, filename, count))
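To get the d2 structure asked for in the question, the per-file counts can be summed over just the files listed in d. The following is only a sketch under some assumptions: build_d2 is a hypothetical helper, d maps each word to a list of file numbers as in the question, and the keys of word_count_dict are paths built as os.path.join(folderpath, "<number>.txt"), which is what glob returns for the pattern used in the question. The sample numbers are made up.

```python
import os


def build_d2(word_count_dict, d, folderpath):
    """Return {word: (count in the files listed in d[word], count in all files)}."""
    d2 = {}
    for word, file_numbers in d.items():
        # A word that never occurred has no entry; treat it as all zeros.
        counts = word_count_dict.get(word, {"total": 0})
        # Rebuild each key the same way the counting loop produced it.
        selected = sum(
            counts.get(os.path.join(folderpath, "{}.txt".format(n)), 0)
            for n in file_numbers
        )
        d2[word] = (selected, counts["total"])
    return d2


# Hypothetical sample data; "articles" stands in for the real folder.
folder = "articles"
sample_counts = {
    "Britain": {
        "total": 10,
        os.path.join(folder, "5.txt"): 3,
        os.path.join(folder, "76.txt"): 2,
        os.path.join(folder, "9.txt"): 5,
    }
}
sample_d = {"Britain": [5, 76, 289]}
print(build_d2(sample_counts, sample_d, folder))
```

Files listed in d that do not contain the word (289.txt in the sample) simply contribute 0 to the first number.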