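I have a folder containing 10 txt files. I am trying to compute the IDF of a given term, but my output differs from what I expect. Here is my idf code.
Here, s is a set containing the union of all words across the 10 files.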
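import math
import os

# s, tokenizer, and root_of_my_corpus are defined elsewhere in my script
def idf(term):
    doc_counts = 0
    totaldocs = 10
    # only terms in the vocabulary set s can occur in a document
    if term in s:
        for filename in os.listdir(root_of_my_corpus):
            # read one document, lowercase it, and tokenize it
            with open(os.path.join(root_of_my_corpus, filename), "r", encoding='UTF-8') as file:
                idfdoc = file.read()
            idfdoc = idfdoc.lower()
            tokenidf = tokenizer.tokenize(idfdoc)
            # count each document at most once
            if term in tokenidf:
                doc_counts += 1
        return math.log(totaldocs / doc_counts)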
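Answer 0 (score: 0)
I just wrote a small demo of how to compute idf. The toy data I use consists of four txt files.
The code basically loads the content of every txt file into a dictionary and then computes the idf of each word. Here is the code: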
import os
import math

def idf_calc(path):
    # load data: map each file name (without extension) to its text
    contents = {}
    for item in os.listdir(path):
        file_name = item.split(".")[0]
        raw = ""
        with open(os.path.join(path, item), "r") as fp:
            data = fp.readlines()
            # only the first line of each file is used
            if len(data) > 0:
                raw = data[0].strip()
        contents[file_name] = raw
    # idf calculate
    result = {}
    total_cnt = len(contents)
    # split each document once so membership tests match whole words,
    # not substrings
    tokens = {name: set(text.split()) for name, text in contents.items()}
    words = set(word for doc in tokens.values() for word in doc)
    for word in words:
        # number of documents that contain the word
        cnt = sum(1 for doc in tokens.values() if word in doc)
        idf = math.log(total_cnt / cnt)
        result[word] = "%.3f" % idf
    print(result)

idf_calc("../data/txt/")
Result
{'1': '1.386', '3': '1.386', '2': '1.386', '4': '1.386', 'world': '0.000', 'Hello': '0.000'}
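To see where these two values come from: with four documents, a word found in only one of them gets ln(4/1), and a word found in all four gets ln(4/4). The snippet below is only a quick arithmetic check; the helper name `idf_from_counts` is made up for illustration and is not part of the demo above.

```python
import math

total_docs = 4  # the four toy txt files

def idf_from_counts(total, doc_freq):
    """Hypothetical helper: idf = ln(total / doc_freq), formatted to 3 decimals."""
    return "%.3f" % math.log(total / doc_freq)

# a word that occurs in exactly one of the four files
print(idf_from_counts(total_docs, 1))  # 1.386
# a word that occurs in all four files
print(idf_from_counts(total_docs, 4))  # 0.000
```

A word that appears in every document carries no discriminating information, which is why its idf is zero.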
Hope it helps.
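If you want a per-term function like the one in the question, the same idea could be sketched as follows. This is a minimal sketch under several assumptions: it splits on whitespace instead of using the question's tokenizer, it takes the documents as an in-memory list of strings, and the sample strings in `docs` are invented for illustration. It also lowercases the term as well as the documents and lets you pick the logarithm, since a case mismatch or a different log base (natural log versus base 10) can make the value differ from what you expect.

```python
import math

def idf(term, documents, log=math.log):
    """Return log(N / df) for term, or None if no document contains it.

    documents: list of strings, one per file (hypothetical in-memory corpus).
    log: logarithm function; math.log is natural log, math.log10 is base 10.
    """
    term = term.lower()
    # lowercase and split each document into a set of whole-word tokens
    token_sets = [set(doc.lower().split()) for doc in documents]
    # number of documents that contain the term
    doc_count = sum(1 for tokens in token_sets if term in tokens)
    if doc_count == 0:
        return None  # avoid dividing by zero for unseen terms
    return log(len(documents) / doc_count)

# made-up sample documents, not the actual toy files
docs = ["Hello world 1", "Hello world 2", "Hello world 3", "Hello world 4"]
print(idf("Hello", docs))
print(idf("1", docs))
print(idf("1", docs, log=math.log10))
```

Returning None for a term that occurs nowhere is just one possible choice; raising an error or using a smoothed formula would work as well.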