Question

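I want to test whether a set of documents share some particular similarity by looking at a plot built from each document, displayed together with a text dataset of other documents. My guess is that they will end up close together in the visualization.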

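Is the solution to use doc2vec to compute a vector for each document and plot it? Can this be done in an unsupervised way? Which Python library should I use to get those nice 2D and 3D representations of Word2vec?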

Answer 1

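I'm not sure exactly what you are asking, but if you want a way to check whether vectors belong to the same group, you can use K-Means. K-Means partitions a list of vectors into K clusters, so it can work if you choose a good K: not too low, so that it still finds some structure, and not too high, so that it does not split the data too finely.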

K-Means works like this:

init_center(K)  # randomly pick K vectors to serve as the initial cluster centers
while not converge():  # there are many ways to check for convergence; the easiest is to check whether any center has moved since the last iteration
    associate_vector()  # assign every vector to its closest center
    re_calculate_center()  # move each center to the mean of all the vectors assigned to its cluster
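The steps above can be sketched in plain Python. This is only a minimal illustration, not a tuned implementation: the names `k_means` and `squared_distance`, the `max_iterations` and `seed` parameters, and the toy 2D points are all assumptions made for the example. Real document vectors (for instance from doc2vec) would take the place of `points`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, max_iterations=100, seed=0):
    """Hypothetical helper: returns (centers, assignments) for the given vectors."""
    rng = random.Random(seed)
    # init_center(K): use K distinct input vectors as the starting centers
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignments = []
    for _ in range(max_iterations):
        # associate_vector(): assign each vector to its closest center
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # re_calculate_center(): move each center to the mean of its members
        new_centers = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                new_centers.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its old center
        # converge(): stop as soon as no center has moved
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignments


# Toy data (an assumption for illustration): two well-separated groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, k=2)
print(labels)
```

Vectors that receive the same label were assigned to the same cluster. Which cluster gets label 0 depends on the random initialization, so only the grouping is meaningful, not the label values themselves.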

This GIF may make it clearer than I can:

This article (where the GIF comes from) also explains it better than I do, even though it discusses Java: https://picoledelimao.github.io/blog/2016/03/12/multithreaded-k-means-in-java/

Document clustering and visualization
