Inverse document frequency for a corpus

Time: 2016-03-02 00:30:27

Tags: python

I have a folder containing 10 txt files. I am trying to compute the IDF of a given term, but my output differs from what I expect. Here is my idf code.

Here, s is a set containing the union of all the words in the 10 files.

def idf(term):
    i = 0
    doc_counts = 0
    totaldocs = 10
    if term in s:
        for filename in os.listdir(root_of_my_corpus):
            file = open(os.path.join(root_of_my_corpus, filename), "r", encoding='UTF-8')
            idfdoc = file.read()
            file.close() 
            idfdoc = idfdoc.lower()
            tokenidf = tokenizer.tokenize(idfdoc)
            if term in tokenidf:
                doc_counts+=1
    return(math.log(totaldocs/doc_counts))

1 answer:

Answer 0 (score: 0)

I just wrote a small demo of how to calculate IDF. The toy data I use is four txt files, as shown below:

  • 1.txt contents: "Hello world 1"
  • 2.txt contents: "Hello world 2"
  • 3.txt contents: "Hello world 3"
  • 4.txt contents: "Hello world 4"
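
For anyone who wants to reproduce this toy corpus, the four files could be generated with a few lines like the following. This is only a sketch: the helper name `make_toy_corpus` and the use of a temporary directory are assumptions made for illustration, not part of the original setup.

```python
import os
import tempfile


def make_toy_corpus(directory):
    """Write 1.txt .. 4.txt, each holding "Hello world <n>" (hypothetical helper)."""
    for n in range(1, 5):
        file_path = os.path.join(directory, "%d.txt" % n)
        with open(file_path, "w", encoding="utf-8") as fp:
            fp.write("Hello world %d" % n)
    # Return the file names so the caller can confirm what was written.
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp_dir:
    created = make_toy_corpus(tmp_dir)
    print(created)
```

In practice you would point `make_toy_corpus` at a real folder instead of a temporary one, since the temporary directory is deleted when the `with` block ends.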

The code basically loads the contents of all the txt files into a dictionary and then computes the IDF of every word. Here is the code:

import math
import os


def idf_calc(path):
    # Load data: map each file name (without extension) to the set of its tokens.
    contents = {}
    for item in os.listdir(path):
        file_name = os.path.splitext(item)[0]
        with open(os.path.join(path, item), "r", encoding="utf-8") as fp:
            # Split on whitespace. A set gives exact-token membership tests,
            # whereas `word in text` would also match substrings.
            contents[file_name] = set(fp.read().split())

    # Compute the IDF of every word that occurs in the corpus.
    result = {}
    total_cnt = len(contents)
    words = set().union(*contents.values())

    for word in words:
        # Number of documents containing the word (always >= 1 here).
        cnt = sum(1 for tokens in contents.values() if word in tokens)
        idf = math.log(total_cnt / cnt)
        result[word] = "%.3f" % idf

    print(result)


idf_calc("../data/txt/")

Result (key order may vary):

{'1': '1.386', '3': '1.386', '2': '1.386', '4': '1.386', 'world': '0.000', 'Hello': '0.000'}
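
These numbers follow from idf(t) = log(N / df(t)) with the natural logarithm: each digit appears in 1 of the 4 files, so log(4/1) ≈ 1.386, while "Hello" and "world" appear in all 4, so log(4/4) = 0. Applied to the single-term function in the question, the same idea might look like the sketch below. It is not the asker's exact code: the function name `idf_for_term`, the in-memory `docs` dictionary, and the choice to return 0.0 for a term found in no document are all assumptions, and plain whitespace splitting stands in for the question's `tokenizer`.

```python
import math


def idf_for_term(term, docs):
    """Return log(N / df) for `term`, where `docs` maps a name to raw text.

    Hypothetical helper: returns 0.0 when the term occurs in no document,
    which avoids dividing by zero (a design choice, not a standard).
    """
    term = term.lower()
    # Tokenize each document once; sets give exact-token membership tests.
    token_sets = [set(text.lower().split()) for text in docs.values()]
    doc_count = sum(1 for tokens in token_sets if term in tokens)
    if doc_count == 0:
        return 0.0
    return math.log(len(token_sets) / doc_count)


toy_docs = {
    "1": "Hello world 1",
    "2": "Hello world 2",
    "3": "Hello world 3",
    "4": "Hello world 4",
}

print("%.3f" % idf_for_term("1", toy_docs))       # 1.386
print("%.3f" % idf_for_term("hello", toy_docs))   # 0.000
print("%.3f" % idf_for_term("absent", toy_docs))  # 0.000 (guarded case)
```

In the question's code, a term that is found in no file leaves doc_counts at 0, so the final division raises ZeroDivisionError; the guard above is one way around that.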

Hope it helps.