How do I find and print the non-matching/dissimilar words from documents (a dataset)?

Time: 2019-02-05 15:09:35

Tags: python dictionary nltk gensim nltk-trainer

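I am trying to rewrite an algorithm that takes an input text file, compares it against a set of documents, and reports how similar it is to each of them.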

Now I want to print the words that do not match, and also write those unmatched words to a new text file.

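In the code below, "hello force" is the input. It is checked against raw_documents, and a similarity score between 0 and 1 is printed for each document. The word "force" matches the second document, so that document gets a higher score, but "hello" does not appear in any of the raw_documents. What I want to print is the input words, such as "hello", that do not match any raw_document.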

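import gensim
import nltk

# word_tokenize needs the NLTK "punkt" tokenizer data to be installed
from nltk.tokenize import word_tokenize

raw_documents = ["I'm taking the show on the road",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]

# Lowercase and tokenize every document
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_documents]

# Map each distinct token to an integer id
dictionary = gensim.corpora.Dictionary(gen_docs)

# Bag-of-words vector for each document
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]

tf_idf = gensim.models.TfidfModel(corpus)

# Total number of (token id, count) entries in the corpus; not used below
s = 0
for i in corpus:
    s += len(i)

# Similarity index over the TF-IDF vectors, stored under the given path prefix
sims = gensim.similarities.Similarity('/usr/workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))

query_doc = [w.lower() for w in word_tokenize("hello force")]

# Tokens that are not in the dictionary (here "hello") are silently dropped
query_doc_bow = dictionary.doc2bow(query_doc)

query_doc_tf_idf = tf_idf[query_doc_bow]
result = sims[query_doc_tf_idf]  # one similarity score per raw document
print(result)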
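
One possible way to get the unmatched words, sketched here with the standard library only: build the set of all tokens that occur in raw_documents and report every query token that is not in that set. The `tokenize` helper is a simplified stand-in for `word_tokenize`, and the function name `unmatched_words` and the output file name `unmatched_words.txt` are assumptions made for illustration, not part of the original code.

```python
import re

raw_documents = [
    "I'm taking the show on the road",
    "My socks are a force multiplier.",
    "I am the barber who cuts everyone's hair who doesn't cut their own.",
    "Legend has it that the mind is a mad monkey.",
    "I make my own fun.",
]


def tokenize(text):
    # Simplified stand-in for nltk's word_tokenize (assumption):
    # lowercase the text and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def unmatched_words(query, documents):
    # Vocabulary = every token that occurs in at least one document.
    vocabulary = {token for doc in documents for token in tokenize(doc)}
    seen = set()
    missing = []
    for token in tokenize(query):
        # Keep each unmatched token once, in order of first appearance.
        if token not in vocabulary and token not in seen:
            seen.add(token)
            missing.append(token)
    return missing


missing = unmatched_words("hello force", raw_documents)
print(missing)  # ['hello']

# Write the unmatched words to a new text file, one word per line
# (the file name is hypothetical).
with open("unmatched_words.txt", "w", encoding="utf-8") as fh:
    fh.write("\n".join(missing))
```

With the gensim objects from the question, the same membership test could probably be done against `dictionary.token2id`, or by calling `dictionary.doc2bow(query_doc, return_missing=True)`, which returns the missing tokens alongside the bag-of-words vector.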

0 answers:

No answers yet