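Everything seems to give the correct results until the very last step. My results array keeps coming up empty.
I'm trying to compare 6 sets of notes by following this tutorial: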
https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python
So far, I have this:
import gensim
from nltk.tokenize import word_tokenize

# tokenize every document, lowercasing each token
raw_docs = [Notes_0, Notes_1, Notes_2, Notes_3, Notes_4, Notes_5]
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in raw_docs]
# create dictionary (maps each unique token to an integer id)
dictionary_interactions = gensim.corpora.Dictionary(gen_docs)
print("Number of words in dictionary: ", len(dictionary_interactions))
# create a corpus (one bag-of-words vector per document)
corpus_interactions = [dictionary_interactions.doc2bow(doc) for doc in gen_docs]
len(corpus_interactions)
# convert to tf-idf model
tf_idf_interactions = gensim.models.TfidfModel(corpus_interactions)
# build the similarity index over the tf-idf weighted corpus
sims_interactions = gensim.similarities.Similarity('C:/Users/JNproject',
                                                   tf_idf_interactions[corpus_interactions],
                                                   num_features=len(dictionary_interactions))
print(sims_interactions)
print(type(sims_interactions))
Output:
Number of words in dictionary: 46364
Similarity index with 6 documents in 0 shards (stored under C:/Users/Jeremy Bice/JNprojects/Company/Interactions/sim_interactions)
<class 'gensim.similarities.docsim.Similarity'>
That seems correct, so I continue with this:
query_doc = [w.lower() for w in word_tokenize("client is")]
print(query_doc)
query_doc_bow = dictionary_interactions.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf_interactions[query_doc_bow]
print(query_doc_tf_idf)
#check for similarities between docs
sims_interactions[query_doc_tf_idf]
My output looks like this:
['client', 'is']
[(335, 1), (757, 1)]
[]
array([ 0., 0., 0., 0., 0., 0.], dtype=float32)
How can I get non-empty output here?
Answer 0 (score: 1)
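Depending on the contents of raw_docs, this may be the correct behavior.
Your code returns an empty tf_idf even though your query terms appear in the original documents and in the dictionary. tf_idf is computed as term_frequency * inverse_document_frequency. inverse_document_frequency is computed as log(N/d), where N is the total number of documents and d is the number of documents in which a given term occurs.
My guess is that your query terms ['client', 'is'] occur in every one of your documents, which gives an inverse_document_frequency of log(N/N) = 0 and therefore an empty tf_idf list, since gensim leaves zero-weight terms out of the result. You can check this behavior with the documents from the tutorial you mentioned, modified as follows: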
# original: commented out
# added arbitrary words 'now' and 'the' where missing, so they occur in each document
#raw_documents = ["I'm taking the show on the road.",
raw_documents = ["I'm taking the show on the road now.",
# "My socks are a force multiplier.",
"My socks are the force multiplier now.",
# "I am the barber who cuts everyone's hair who doesn't cut their own.",
"I am the barber who cuts everyone's hair who doesn't cut their own now.",
# "Legend has it that the mind is a mad monkey.",
"Legend has it that the mind is a mad monkey now.",
# "I make my own fun."]
"I make my own the fun now."]
If you then query
query_doc = [w.lower() for w in word_tokenize("the now")]
you get
['the', 'now']
[(3, 1), (8, 1)]
[]
[0. 0. 0. 0. 0.]
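The arithmetic behind this can be sketched in plain Python, without gensim. This is only an illustration under stated assumptions: the toy corpus and the helper names idf and tf_idf are made up for the example, the logarithm is taken base 2, and the vector normalization that gensim applies by default is skipped because it does not affect which weights are zero.

```python
import math

# Toy tokenized corpus (hypothetical data, not the asker's notes).
# 'the', 'client' and 'is' occur in every document; the other words do not.
docs = [
    ["the", "client", "is", "happy"],
    ["the", "client", "is", "late"],
    ["the", "client", "is", "new", "here"],
]


def idf(term, docs):
    """Inverse document frequency log2(N / d) for a term known to the corpus."""
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log2(n_docs / doc_freq)


def tf_idf(query, docs):
    """Return (term, weight) pairs for the query, leaving out zero weights."""
    weights = []
    for term in sorted(set(query)):
        if not any(term in doc for doc in docs):
            continue  # term is not in the corpus vocabulary at all
        weight = query.count(term) * idf(term, docs)
        if weight > 0:
            weights.append((term, weight))
    return weights


# Both terms occur in all 3 documents, so idf = log2(3/3) = 0 for each.
print(tf_idf(["client", "is"], docs))

# 'happy' occurs in only 1 of 3 documents, so it keeps a positive weight.
print(tf_idf(["happy", "client"], docs))
```

The first query yields an empty list, mirroring the empty tf_idf in the question, while the second keeps only 'happy'. In other words, a query needs at least one term that is absent from some documents before any similarity score can be non-zero.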