I'm trying to build something I read about in a paper. The paper introduces "Synset Frequency - Inverse Document Frequency (SF-IDF)".
According to the paper:
SF-IDF works in the same way as TF-IDF does, with the
difference that t is now replaced by s, where s is not a word
but a synset instead. This means that we consider two words
with the same meaning as one and the same synset. The SF-IDF formula for similarity is:
sf-idf(s, d) = tf(s, d) × idf(s, d).
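I understand the main idea, but I'm not sure how the synset frequency should be implemented.
I have a function that finds the synsets two words have in common: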
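from nltk.corpus import wordnet as wn

def are_same_synset(word1, word2):
    # Return the synsets shared by both words (an empty, falsy set if none).
    word1_set = set(wn.synsets(word1))
    word2_set = set(wn.synsets(word2))
    common_synset = word1_set.intersection(word2_set)
    return common_synset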
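What I'd like to know is how s should be determined in this case. Right now we start by counting the frequency of every word in the document. It seems the words would then have to be clustered into synsets, but I'm not sure whether the paper actually says that.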
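To make the question concrete, here is a minimal sketch of what I imagine the computation could look like. It is only my guess, not necessarily what the paper does. The assumptions are mine: each word is mapped to exactly one synset by a caller-supplied function `synset_of` (a hypothetical name; with NLTK it could be something like taking the first result of `wn.synsets(word)`, which ignores word-sense disambiguation), the synset frequency is a raw count, and idf is log(N / df). `TOY_SYNSETS` is a made-up mapping used only so the example runs without WordNet.

```python
import math
from collections import Counter

def synset_frequencies(tokens, synset_of):
    """Count how often each synset occurs in one tokenized document."""
    counts = Counter()
    for token in tokens:
        s = synset_of(token)  # assumption: one synset per word, or None
        if s is not None:
            counts[s] += 1
    return counts

def sf_idf(docs, synset_of):
    """Return one {synset: score} dict per document (raw count * log(N / df))."""
    sf_per_doc = [synset_frequencies(doc, synset_of) for doc in docs]
    # Document frequency: in how many documents each synset appears.
    df = Counter()
    for sf in sf_per_doc:
        df.update(sf.keys())
    n_docs = len(docs)
    return [
        {s: count * math.log(n_docs / df[s]) for s, count in sf.items()}
        for sf in sf_per_doc
    ]

# Hypothetical toy mapping standing in for a WordNet lookup.
TOY_SYNSETS = {"car": "car.n.01", "automobile": "car.n.01", "dog": "dog.n.01"}

docs = [["car", "automobile", "dog"], ["dog", "dog"]]
scores = sf_idf(docs, TOY_SYNSETS.get)
```

With this, "car" and "automobile" are counted as the same term in the first document, which is how I read the quoted passage. What I don't know is how to pick the synset for a word that belongs to several.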
Any help or ideas?