Building Synset Frequency - Inverse Document Frequency (SF-IDF)

Time: 2019-06-19 23:03:58

Tags: python wordnet recommender-systems

I'm trying to build something I read about in a paper. The paper introduces "Synset Frequency - Inverse Document Frequency (SF-IDF)".

According to the paper:

SF-IDF works in the same way as TF-IDF does, with the difference that t is now replaced by s, where s is not a word but a synset instead. This means that we consider two words with the same meaning as one and the same synset. The SFIDF formula for similarity is: sf − idf(s, d) = tf(s, d) × idf(s, d).

I understand the main idea of how it works, but I'm not sure how the synset frequency should be implemented.

I have a function that checks whether two words share a synset:

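from nltk.corpus import wordnet as wn

def are_same_synset(word1, word2):
    # Collect every WordNet synset each word belongs to.
    word1_set = set(wn.synsets(word1))
    word2_set = set(wn.synsets(word2))
    # Synsets shared by both words; an empty set means no shared sense.
    common_synset = word1_set.intersection(word2_set)

    # The returned set is truthy exactly when the words share a synset.
    return common_synset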

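What I'd like to know is how s should be determined in this case. Right now, we first count how often each word occurs in the document. It looks as though the words then have to be grouped into synsets, but I'm not sure whether the paper actually says that.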
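To make the question concrete, here is a minimal sketch of one possible reading: map every token to a single synset, count synsets instead of words, and then apply the usual TF-IDF arithmetic. Everything in it is an assumption rather than something taken from the paper: the function names (`wordnet_first_synset`, `synset_counts`, `sf_idf`) are made up for illustration, the frequency is normalised by document length, the IDF is the plain `log(N / df)` variant, and choosing the first WordNet synset for each word is a naive stand-in for a proper word sense disambiguation step.

```python
import math
from collections import Counter


def wordnet_first_synset(word):
    """Hypothetical helper: name of the word's first WordNet synset, or None.

    Assumption: the first synset is used as the word's sense. This needs
    NLTK with the WordNet corpus downloaded.
    """
    from nltk.corpus import wordnet as wn  # imported lazily on first use
    synsets = wn.synsets(word)
    return synsets[0].name() if synsets else None


def synset_counts(tokens, to_synset):
    """Count how often each synset occurs in one tokenized document."""
    counts = Counter()
    for token in tokens:
        synset = to_synset(token)
        if synset is not None:  # skip words with no known synset
            counts[synset] += 1
    return counts


def sf_idf(documents, to_synset=wordnet_first_synset):
    """Return one {synset: score} dict per tokenized document."""
    per_doc = [synset_counts(doc, to_synset) for doc in documents]
    n_docs = len(per_doc)

    # Document frequency: in how many documents does each synset appear?
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())

    scores = []
    for counts in per_doc:
        total = sum(counts.values())
        scores.append({
            s: (c / total) * math.log(n_docs / doc_freq[s])
            for s, c in counts.items()
        })
    return scores
```

With a toy mapping such as `{"car": "car.n.01", "automobile": "car.n.01", "dog": "dog.n.01"}` passed as `to_synset=mapping.get`, "car" and "automobile" are counted as the same s, which seems to be the point of the quoted passage. Whether the paper intends one synset per word or several is exactly what I'm unsure about.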

Any help or ideas?

0 answers