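I'm working on document similarity using WordNet, but I don't know how to apply IDF weighting in my code. I'm sure this weighting is one of the simplest approaches around, yet everything I find online seems rather confusing. I'm trying to get to the point where I can use cosine similarity, but I'm lost right now and would appreciate any help. I have put the 2 corpora into separate bags of words and counted the term frequencies, so is the TF part complete?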
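# word_tokenize is assumed to be NLTK's tokenizer. normalise, stem,
# filter_stopwords, WSJCorpusReader and ReutersCorpusReader are assumed
# to be defined or imported elsewhere.
from nltk.tokenize import word_tokenize

def make_bow(somestring):
    # Tokenise, normalise, stem and remove stopwords, then count each token.
    rep = word_tokenize(somestring)
    rep = normalise(rep)
    rep = stem(rep)
    rep = filter_stopwords(rep)
    dict_rep = {}
    for token in rep:
        dict_rep[token] = dict_rep.get(token, 0) + 1
    return dict_rep

wsj = WSJCorpusReader()
rcr = ReutersCorpusReader()
collectionsize = 50
collections = {"wsj": [], "rcr": []}
for key in collections.keys():
    if key == "wsj":
        generator = wsj.raw()
    else:
        generator = rcr.raw()
    # Take the first `collectionsize` raw documents from each corpus.
    while len(collections[key]) < collectionsize:
        collections[key].append(next(generator))

# One bag of words (token -> count) per document, grouped by corpus.
bow_collections = {key: [make_bow(doc) for doc in collection] for key, collection in collections.items()}
print(bow_collections)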
This eventually prints:
{'wsj': [{'pierre': 1, 'vinken': 1, 'NUM': 2, 'year': 1, 'old': 1, 'join':
1, 'board': 1, 'nonexecutive': 1, 'director': 1}, {'vinken': 1, 'chairman':
1, 'elsevier': 1, 'dutch': 1, 'publishing': 1, 'group': 1}, {'rudolph': 1,
'agnew': 1, 'NUM': 1, 'year': 1, 'old': 1, 'former': 1, 'chairman': 1,
'consolidated': 1, 'gold': 1, 'field': 1, 'plc': 1, 'wa': 1, 'named': 1,
'nonexecutive': 1, 'director': 1, 'british': 1, ......
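
For reference, here is a minimal sketch of how the remaining steps might look, assuming each bag is a plain dict of raw counts like the ones printed above. In that reading the counts already serve as the TF part, and IDF is computed once over the whole collection as log(N / df). The helper names idf_weights, tfidf and cosine are hypothetical, the toy bags are made up for illustration, and this is only one common TF-IDF variant (no smoothing, natural log), not the only way to do it.

```python
import math

def idf_weights(bows):
    """Map each token to log(N / df), where df is the number of bags containing it."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for token in bow:  # iterating a dict visits each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}

def tfidf(bow, idf):
    """Scale raw counts (TF) by the collection-wide IDF weights."""
    return {token: count * idf.get(token, 0.0) for token, count in bow.items()}

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(token, 0.0) for token, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Toy bags standing in for the real documents (illustrative data only).
docs = [
    {"vinken": 1, "chairman": 1, "director": 1},
    {"agnew": 1, "chairman": 1, "director": 1},
    {"gold": 2, "field": 1},
]
idf = idf_weights(docs)
vectors = [tfidf(bow, idf) for bow in docs]
print(cosine(vectors[0], vectors[1]))  # the two documents share "chairman" and "director"
print(cosine(vectors[0], vectors[2]))  # no tokens in common
```

With the real data, one option would be to compute the IDF over all documents from both corpora, for example over bow_collections["wsj"] + bow_collections["rcr"], so that both collections are weighted on the same scale. Note that a token appearing in every document gets an IDF of log(1) = 0 under this variant and therefore drops out of the similarity.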