Counting occurrences of a list of words across multiple text files

Asked: 2021-06-05 16:46:31

Tags: python

I have a list of words and a list of files:

words = ["hello","my","name"]
files = ["file1.txt","file2.txt"]

I want to count how many times each word in the list occurs across all of the text files.

What I have so far:

import re
occ = []
for file in files:
    try:
        fichier = open(file, encoding="utf-8")
    except:
        pass
    data = fichier.read()
    for wrd in words:
        count = sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(wrd), data))
        occ.append(wrd + " : " + str(count))
    texto = open("occurence.txt", "w+b")
for ww in occ:
    texto.write(ww.encode("utf-8")+"\n".encode("utf-8"))

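This code works fine for a single file, but when I try it with a list of files, it only gives me the result for the last file.

2 Answers:

Answer 0 (score: 1)

Use JSON to store the counts, so the totals persist from one file to the next.

For example: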

import json
import re

# Load the running totals saved so far
with open('data_store.json') as jfile:
    counts = json.load(jfile)

# `words` is the word list and `data` is the text of the current file, as in the question
for wrd in words:
    count = sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(wrd), data))
    if wrd not in counts:
        counts[wrd] = 0
    counts[wrd] += count   # Increment the stored count

# Write the updated totals back to JSON
with open('data_store.json', "w") as jfile:
    json.dump(counts, jfile)
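
The JSON approach above can be sketched as a self-contained helper. This is only an illustration under a few assumptions: the function name `update_counts` and its parameters are hypothetical (they do not appear in the answer), the store is treated as empty when the JSON file does not exist yet, and the text of each file is passed in by the caller.

```python
import json
import re
from pathlib import Path


def update_counts(store_path, words, text):
    """Add the occurrences of each word in `text` to the totals saved at `store_path`.

    Hypothetical helper for illustration; not part of the original answer.
    """
    store = Path(store_path)
    # Assumption: a missing store file means no counts have been saved yet
    counts = json.loads(store.read_text(encoding="utf-8")) if store.exists() else {}
    for wrd in words:
        # Whole-word matches only, same pattern as in the question
        found = len(re.findall(r"\b%s\b" % re.escape(wrd), text))
        counts[wrd] = counts.get(wrd, 0) + found
    store.write_text(json.dumps(counts), encoding="utf-8")
    return counts
```

Calling it once per file, with that file's contents as `text`, accumulates the totals in the JSON file. Note that the totals also survive between runs, so the store probably needs to be deleted or reset before a fresh count.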

Answer 1 (score: 1)

Use a dictionary instead of a list:

import re

occ = {}  # Create an empty dictionary
words = ["hello", "my", "name"]
files = ["f1.txt", "f2.txt", "f3.txt"]
for file in files:
    try:
        fichier = open(file, encoding="utf-8")
    except OSError:
        pass  # Skip files that cannot be opened
    else:
        with fichier:  # Close the file once it has been read
            data = fichier.read()
        for wrd in words:
            count = sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(wrd), data))
            if wrd in occ:
                occ[wrd] += count  # If wrd is already in the dictionary, increment its count
            else:
                occ[wrd] = count  # Otherwise add wrd to the dictionary with its count

print(occ)

If you want the result as a list of strings, as in your question:

occ_list = [ f"{key} : {value}" for key, value in occ.items() ]
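
As a variation on the dictionary approach, the same accumulation could be written with `collections.Counter`, together with a step that writes the `word : count` lines to a file opened once, after all input files have been read. This is a sketch rather than the answerer's code: the function names `count_words` and `write_report` are hypothetical, and it assumes that unreadable files should simply be skipped.

```python
import re
from collections import Counter


def count_words(words, files):
    """Return a Counter with the total occurrences of each word across `files`."""
    # Start every word at 0 so words that never occur still appear in the result
    totals = Counter({wrd: 0 for wrd in words})
    # One compiled whole-word pattern per word, reused for every file
    patterns = {wrd: re.compile(r"\b%s\b" % re.escape(wrd)) for wrd in words}
    for name in files:
        try:
            with open(name, encoding="utf-8") as fh:
                data = fh.read()
        except OSError:
            continue  # Assumption: skip files that cannot be opened
        for wrd, pattern in patterns.items():
            totals[wrd] += len(pattern.findall(data))
    return totals


def write_report(totals, out_path="occurence.txt"):
    """Write one "word : count" line per word, opening the output file only once."""
    with open(out_path, "w", encoding="utf-8") as out:
        for wrd, count in totals.items():
            out.write(f"{wrd} : {count}\n")
```

A typical call would be `write_report(count_words(words, files))`. Opening the output in text mode with an explicit encoding avoids encoding each line to bytes by hand.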