Question

I am trying to build something I read about in a paper. The paper introduces "Synset Frequency - Inverse Document Frequency (SF-IDF)".

According to the paper:

SF-IDF works in the same way as TF-IDF does, with the difference that t is now replaced by s, where s is not a word but a synset instead. This means that we consider two words with the same meaning as one and the same synset. The SFIDF formula for similarity is: sf − idf(s, d) = tf(s, d) × idf(s, d).

I understand the main idea of how it works, but I am not sure how the synset frequency should be implemented.

I have a function that checks whether two words share a synset:

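from nltk.corpus import wordnet as wn

def are_same_synset(word1, word2):
    # Collect every synset each word belongs to
    word1_set = set(wn.synsets(word1))
    word2_set = set(wn.synsets(word2))
    # Shared synsets; an empty set means the words have no sense in common
    common_synset = word1_set.intersection(word2_set)

    return common_synset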

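I would like to know how s should be determined in this case. Right now we first count the frequency of every word in the document. It seems we would have to group words into synsets before counting, but I am not sure whether the paper actually implies that.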
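To make the question concrete, here is a rough sketch of what I imagine the counting could look like. It rests on assumptions that are mine and not from the paper: each word is mapped to the first synset its lookup returns (a stand-in for real word sense disambiguation), sf is the synset count divided by the number of mapped words in the document, and idf is log(N / df). The names `synset_frequencies`, `sf_idf`, `TOY_SYNSETS` and `toy_lookup` are hypothetical, and the toy table is hand-written so the snippet runs without WordNet.

```python
import math
from collections import Counter

def synset_frequencies(tokens, lookup):
    """Count synsets in one tokenized document; words with no synset are skipped."""
    counts = Counter()
    for tok in tokens:
        senses = lookup(tok)
        if senses:
            # Assumption: take the first listed sense instead of disambiguating.
            counts[senses[0]] += 1
    return counts

def sf_idf(docs, lookup):
    """Return one {synset: score} dict per document, with score = sf * idf."""
    sf_per_doc = [synset_frequencies(d, lookup) for d in docs]
    # Document frequency: in how many documents each synset occurs.
    doc_freq = Counter()
    for sf in sf_per_doc:
        doc_freq.update(sf.keys())
    n_docs = len(docs)
    scores = []
    for sf in sf_per_doc:
        total = sum(sf.values())
        scores.append({
            s: (c / total) * math.log(n_docs / doc_freq[s])
            for s, c in sf.items()
        })
    return scores

# Hand-written toy table (hypothetical data). With NLTK one could pass
# lambda w: [s.name() for s in wn.synsets(w)] as the lookup instead.
TOY_SYNSETS = {
    "car": ["car.n.01"],
    "automobile": ["car.n.01"],
    "dog": ["dog.n.01"],
    "road": ["road.n.01"],
}

def toy_lookup(word):
    return TOY_SYNSETS.get(word, [])

docs = [["car", "automobile", "road"], ["dog", "road"]]
scores = sf_idf(docs, toy_lookup)
```

In this toy run "car" and "automobile" are counted as the same s, so the first document has a count of 2 for that synset, while "road" scores 0 because it occurs in every document. Whether picking the first sense is acceptable, or whether the paper expects a disambiguation step first, is exactly what I am unsure about.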

Any help or ideas?

Building Synset Frequency-Inverse Document Frequency (SF-IDF)

0 answers: