I have a text file, topics.txt:
1~cocoa
2~
3~
4~
5~grain~wheat~corn~barley~oat~sorghum
6~veg-oil~linseed~lin-oil~soy-oil~sun-oil~soybean~oilseed~corn~sunseed~grain~sorghum~wheat
7~
8~
9~earn
10~acq
11~earn
12~earn~acq
13~earn
14~earn
...
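where the number at the start of each line is the file name.
I have about 20,000 files to classify. So far I have managed to build a dictionary keyed by word, where each value is the list of corresponding file names, for example: 'earn', ['9','11','12','13','14','18','23','24','27','36','37','38'..etc]. Now I need to count the total number of words for earn, that is, across all the files that belong to earn. These files are in the directory d:/individual-words
I need the output in the form: word, total number of words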
'acq',30000
'grain',40000
import collections
import os
import re
import sys
from collections import Counter
from glob import glob

# Redirect everything that is printed to f1.txt
sys.stdout = open('f1.txt', 'w')

def removegarbage(text):
    # Replace runs of non-word characters with a space, then lowercase
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

folderpath = 'd:/individual-articles'
counter = Counter()
filepaths = glob(os.path.join(folderpath, '*.txt'))

# Build the word -> [file names] dictionary from topics.txt
with open('topics.txt', 'r') as filehandle:
    d = collections.defaultdict(list)
    for line in filehandle:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            d[key].append(value)

# Attempt to count the words for each entry in d
for i in d.items():
    for filepath in filepaths:
        with open(filepath, 'r') as filehandle:
            lines = filehandle.read()
            words = removegarbage(lines).split()
            counter.update(words)
print(counter)
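So far my program works up to building the file lists, but how do I get the total number of words across the list of files for each word? The code above does not work!
Answer 0: (score: 1)
How do you count the number of words in a given list of files?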
def count_words(files):
    path = './'  # check that this path is correct
    return sum(len(open(path + str(f) + '.txt').read().split()) for f in files)
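A slightly more defensive variant of the same idea, offered only as a sketch: it closes each file with a `with` block and builds the path with `os.path.join`. The function name `count_words_in` and the `folder` parameter are assumptions made for illustration, and the `.txt` extension is taken from the glob pattern in the question.

```python
import os

def count_words_in(files, folder='.'):
    """Return the total number of whitespace-separated words in the named files."""
    total = 0
    for name in files:
        # File names in topics.txt are bare numbers, so the extension is appended here.
        filepath = os.path.join(folder, str(name) + '.txt')
        with open(filepath, 'r') as fh:
            total += len(fh.read().split())
    return total
```

Called as `count_words_in(d['earn'], folderpath)`, this should give the word total for a single entry of `d`, provided the files really are named `<number>.txt` inside `folderpath`.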
Then how do you sum over every entry in d?
total = sum(count_words(d[k]) for k in d)
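Note that this expression yields a single grand total, whereas the question asks for one total per word. A possible per-word sketch follows. The names `words_per_topic` and `format_totals` and the `folder` parameter are hypothetical, and words are counted by plain whitespace splitting rather than with the question's `removegarbage`, which is an assumption about what "word" should mean here.

```python
import os

def words_per_topic(d, folder='.'):
    """Map each key of d to the total word count of the files listed under it."""
    cache = {}  # per-file counts, so a file shared by several keys is read once

    def file_words(name):
        if name not in cache:
            with open(os.path.join(folder, str(name) + '.txt'), 'r') as fh:
                cache[name] = len(fh.read().split())
        return cache[name]

    return {topic: sum(file_words(n) for n in names) for topic, names in d.items()}

def format_totals(totals):
    """Render one line per key in the form 'word', total (sorted by key)."""
    return ["'%s', %d" % (topic, count) for topic, count in sorted(totals.items())]
```

With the dictionary from the question, `print('\n'.join(format_totals(words_per_topic(d, folderpath))))` would print one line per word. The cache matters because a file such as 12 appears under both earn and acq.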