Assigning custom string IDs to documents in Gensim

Time: 2019-07-18 10:00:43

Tags: python-3.x machine-learning gensim

I am doing topic modeling in Gensim. I have managed to get the document_id and similarity_percentage for each document.

This is what I am trying:

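from gensim import corpora, models, similarities

documents = ["Say to other what you feel",
             "Speak truth from your heart and tell people",
             "what this book say and tell about lying"]

# remove common words and tokenize (this stoplist is only illustrative)
stoplist = set("for a of the and to in".split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]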

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]

index = similarities.MatrixSimilarity(lsi[corpus])

doc = "Always tell people what in your heart"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]

sims = index[vec_lsi]
print(list(enumerate(sims)))

Output:

[(0, 0.74419993), (1, 0.99159265), (2, 0.35600105)]

In each pair, the first value is the document's index number in the documents array and the second value is the similarity percentage.

I would like a result like this instead:

[(myid_123, 0.74419993), (abc_1, 0.99159265), (id_3, 0.35600105)]

Here the first value in each pair is my own string ID for the document, and the second value is still the similarity percentage.

I tried something like the following, but it did not work:

documents = {"myid_123": "Say to other what you feel",
             "abc_1": "Speak truth from your heart and tell people",
             "id_3": "what this book say and tell about lying"}

How can I assign my own IDs to the documents? Is this possible in Gensim? If so, do you have an example?
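One possible workaround, as a minimal sketch rather than a built-in Gensim feature: keep the string IDs in a list that runs parallel to the corpus and attach them to the scores afterwards. This relies on the similarity index returning one score per indexed document, in the same order the documents were added. The helper `label_sims` and the names `doc_ids` and `raw_texts` are hypothetical, and the scores below are simply copied from the output shown above so that the snippet runs without Gensim.

```python
# Hypothetical helper (not part of Gensim): pair each similarity score
# with the caller's own string ID, relying purely on positional order.
documents = {
    "myid_123": "Say to other what you feel",
    "abc_1": "Speak truth from your heart and tell people",
    "id_3": "what this book say and tell about lying",
}

# Dicts preserve insertion order in Python 3.7+, so these two lists stay aligned.
doc_ids = list(documents.keys())
raw_texts = list(documents.values())  # pass these through tokenizing and doc2bow


def label_sims(doc_ids, sims):
    """Return (string_id, score) pairs; sims[i] must refer to doc_ids[i]."""
    if len(doc_ids) != len(sims):
        raise ValueError("doc_ids and sims must have the same length")
    return [(doc_id, float(score)) for doc_id, score in zip(doc_ids, sims)]


# Scores copied from the question's output; in the real pipeline this
# would be the array produced by `sims = index[vec_lsi]`.
sims = [0.74419993, 0.99159265, 0.35600105]

labeled = label_sims(doc_ids, sims)
ranked = sorted(labeled, key=lambda pair: pair[1], reverse=True)  # best match first
print(labeled)
print(ranked)
```

In the pipeline above, the corpus would be built from `raw_texts` instead of the original list, and `label_sims(doc_ids, index[vec_lsi])` would then yield the string-keyed pairs. The length check is there because a mismatch between the ID list and the index would otherwise silently mislabel documents.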

0 answers:

No answers