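I have about 20,000 text files, named 5.txt, 10.txt, and so on.
I store the paths of these files in a list I created called "list2".
I also have a text file, "temp.txt", that contains a list of 500 words:
vs
mln
money
and so on...
I store these words in another list I created called "list".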
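Next I build a nested dictionary in which d2[file][word] is the frequency count of "word" in "file".
Now I need to iterate over these words for every text file, and I want the following output:
filename.txt- sum(d[filename][word]*log(prob))
Here filename.txt has the form 5.txt, 10.txt, and so on, and "prob" is a value I have already obtained.
Basically, for each outer key (file) I need the sum over its inner keys (words) of the values (the word frequencies), each weighted by log(prob).
For example:
d['5.txt']['the']=6
Here "the" is my word and "5.txt" is the file; 6 is the number of times "the" occurs in "5.txt".
Similarly:
d['5.txt']['as']=2
I need to find the sum over the dictionary values.
So for 5.txt the answer I need is:
6*log(prob('the'))+2*log(prob('as'))+... (for all the words in list)
I need to do this for all the files.
My problem lies in the part where I have to iterate over the nested dictionary.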
import collections, sys, os, re
sys.stdout = open('4.txt', 'w')
from collections import Counter, defaultdict
from glob import glob

folderpath = 'd:/individual-articles'
folderpaths = 'd:/individual-articles/'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))

# test.txt contains: d:/individual-articles/5.txt, d:/individual-articles/10.txt, d:/individual-articles/15.txt and so on...
with open('test.txt', 'r') as fi:
    list2 = [line.strip() for line in fi]

# temp.txt contains the list of words
with open('temp.txt', 'r') as fi:
    list = [line.strip() for line in fi]

# the dictionary that contains d2[file][word]
d2 = defaultdict(dict)
for fil in list2:
    with open(fil) as f:
        path, name = os.path.split(fil)
        words_c = Counter([word for line in f for word in line.split()])
        for word in list:
            d2[name][word] = words_c[word]

# this portion generates the dictionary "prob" from the file 2.txt; it can be skipped
# assumption: answer_ca maps each key to a list of values (its definition is not shown in the post)
answer_ca = collections.defaultdict(list)
with open('2.txt', 'r+') as istream:
    for line in istream.readlines():
        try:
            k, r = line.strip().split(':')
            answer_ca[k.strip()].append(r.strip())
        except ValueError:
            print('Ignoring: malformed line: "{}"'.format(line))

# my problem lies here
items = d2.items()
small_d2 = dict(next(items) for _ in range(10))
for fil in list2:
    total = 0
    for k, v in small_d2[fil].items():
        total = total + (v * answer_ca[k])
    print("Total of {} is {}".format(fil, total))
Answer 0 (score: 0)
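with open(f) as fil
binds fil to the file object opened from f. When you later access the entries in your dictionary with
total=sum(math.log(prob)*d2[fil][word].values())
I believe you mean
total = sum(math.log(prob)*d2[f][word])
However, this does not seem to quite match the order you were expecting, so I would suggest something more like this: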
import math

word_list = []   # list of words
file_list = []   # list of files
dictionary = {}  # your dictionary, indexed here as dictionary[word][file_name]
summation = lambda file_name, prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
return_value = []
for file_name in file_list:
    prob = ...  # something
    return_value.append(summation(file_name, prob))
The summation line defines an anonymous function in Python; these are called lambda functions. Basically, what this line means is:
summation = lambda file_name,prob:
is almost the same as:
def summation(file_name, prob):
and then
sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
is almost the same as:
result = []
for word in word_list:
    result.append(math.log(prob)*dictionary[word][file_name])
return sum(result)
In summary, you have:
summation = lambda file_name,prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
instead of:
def summation(file_name, prob):
    result = []
    for word in word_list:
        result.append(math.log(prob)*dictionary[word][file_name])
    return sum(result)
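The list-comprehension version is usually somewhat faster than the explicit for loop with append, and it is more compact; the lambda itself is no faster than an equivalent def. In Python there are few cases where a for loop is preferable to a list comprehension for building a list, but they certainly exist.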
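Note that the snippet above passes a single prob per file, whereas the question asks for log(prob('the')), log(prob('as')), and so on, with one probability per word. The following is only a minimal sketch of that variant. It assumes that prob is a dict mapping each word to its probability and that the counts are laid out as d2[file][word] as in the question; the name score_file and the probability values are made up for illustration.

```python
import math


def score_file(file_name, counts, prob, word_list):
    """Sum count * log(probability) over the given words for one file."""
    return sum(
        counts[file_name][word] * math.log(prob[word])
        for word in word_list
    )


# Toy data: the counts come from the question, the probabilities are invented.
d2 = {'5.txt': {'the': 6, 'as': 2}}
prob = {'the': 0.5, 'as': 0.25}
words = ['the', 'as']

# One score per file, keyed by file name.
scores = {name: score_file(name, d2, prob, words) for name in d2}
print(scores)
```

A generator expression is passed to sum() here, so no intermediate list is built; with 20,000 files and 500 words either form should be fine.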
Answer 1 (score: 0)
from math import log

for fil in list2:  # list2 contains the file names
    total = 0
    for k, v in d[fil].items():
        total += v * log(prob[k])  # prob is a dict mapping each word to its probability
    print("Total of {} is {}".format(fil, total))
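Two details of the question's code may still get in the way of this loop. First, list2 holds full paths, while d2 is keyed by the bare file name returned by os.path.split, so looking up d2 with a full path raises a KeyError. Second, d2.items() is not an iterator in Python 3, so next(items) fails; itertools.islice is one way to take only the first few entries. The sketch below is a rough illustration under those assumptions. The function name totals_by_file, its limit parameter, and the probability values are hypothetical, and skipping words with a zero count or a missing or zero probability is a choice made here, not something stated in the question.

```python
import math
import os
from itertools import islice


def totals_by_file(file_paths, counts, prob, limit=None):
    """Return {file name: sum(count * log(prob[word]))}.

    counts is keyed by base file name, as d2 is in the question.
    limit, if given, restricts the run to the first `limit` paths.
    """
    paths = file_paths if limit is None else islice(file_paths, limit)
    totals = {}
    for path in paths:
        name = os.path.basename(path)  # 'd:/individual-articles/5.txt' -> '5.txt'
        totals[name] = sum(
            count * math.log(prob[word])
            for word, count in counts[name].items()
            # Skip words that never occur and words without a positive probability.
            if count and prob.get(word, 0) > 0
        )
    return totals


# Toy data; the probabilities are made up for illustration.
list2 = ['d:/individual-articles/5.txt', 'd:/individual-articles/10.txt']
d2 = {'5.txt': {'the': 6, 'as': 2}, '10.txt': {'the': 1, 'as': 0}}
prob = {'the': 0.5, 'as': 0.25}

for name, total in totals_by_file(list2, d2, prob).items():
    print("Total of {} is {}".format(name, total))
```

Passing limit=10 would mimic the small_d2 test run from the question without building a second dictionary.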