Question

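I have spent the past few hours going through the nlp tag on SO, and I am fairly confident I have not missed anything. If I have, please point me to the relevant question.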

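In the meantime, here is what I am trying to do. A common notion I have seen in many posts is that semantic similarity is hard. For instance, in this post, the accepted answer says the following: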

First of all, neither from the perspective of computational 
linguistics nor of theoretical linguistics is it clear what 
the term 'semantic similarity' means exactly. .... 
Consider these examples:

Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
Pete and Rob both like programming a lot.
Patricia found a dog near the station.
It was a dog who found Pete and Rob under the snow.

Which of the sentences 2-4 are similar to 1? 2 is the exact 
opposite of 1, still it is about Pete and Rob (not) finding a 
dog.

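My high-level requirement is to use k-means clustering to categorize text by semantic similarity, so all I need to know is whether two sentences are an approximate match. In the example above, I am fine with putting 1, 2, 4, and 5 into one category and 3 into another (of course, 3 would be backed up by some more sentences similar to it). It is a bit like finding related articles, where they do not have to be 100% related.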

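I think I ultimately need to build a vector representation of each sentence, something like its fingerprint, but exactly what this vector should contain is still an open question for me. Is it n-grams, something from WordNet, just the individual stemmed words, or something else altogether?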
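To make the "fingerprint" idea concrete, here is a minimal sketch of the simplest candidate: a bag-of-words count vector per sentence, compared with cosine similarity. The tokenizer (lowercased alphabetic runs, no stemming) and the helper names `bag_of_words` and `cosine` are assumptions made for illustration only, not a recommendation.

```python
import math
import re
from collections import Counter

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]


def bag_of_words(sentence):
    """Lowercase the sentence and count its word tokens (no stemming)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = [bag_of_words(s) for s in SENTENCES]

# Compare every sentence against sentence 1.
similarity_to_first = [cosine(vectors[0], v) for v in vectors]
for number, score in enumerate(similarity_to_first, start=1):
    print(f"sentence {number}: {score:.2f}")
```

With these five sentences, sentence 3 gets the lowest score against sentence 1, and the negated sentence 2 scores almost as high as sentence 1 itself. That is acceptable for an "approximate match" but shows that plain word counts ignore negation and word order.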

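This thread does a fantastic job of enumerating all the related techniques, but unfortunately it stops just when the post gets to what I want. Any suggestions on the current state of the art in this area?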

Answer 1

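Latent Semantic Modeling could be useful. It is basically just another application of the Singular Value Decomposition. SVDLIBC is a pretty nice C implementation of the approach; it is an oldie but a goodie, and there is even a Python binding in the form of sparsesvd.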
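As a rough sketch of the idea, assuming NumPy's dense `np.linalg.svd` in place of SVDLIBC or sparsesvd (fine for a toy corpus, not for a large sparse one): build a term-by-sentence count matrix, keep the top `k` singular values, and compare sentences in the reduced space. The choice `k = 2` and the simple tokenizer are illustrative assumptions.

```python
import re

import numpy as np

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

tokenized = [re.findall(r"[a-z]+", s.lower()) for s in SENTENCES]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
index = {word: i for i, word in enumerate(vocabulary)}

# Term-by-sentence count matrix: rows are words, columns are sentences.
counts = np.zeros((len(vocabulary), len(SENTENCES)))
for column, tokens in enumerate(tokenized):
    for word in tokens:
        counts[index[word], column] += 1

# Truncated SVD: keep only the k largest singular values.
k = 2  # assumed number of latent dimensions
U, singular_values, Vt = np.linalg.svd(counts, full_matrices=False)
sentence_vectors = (np.diag(singular_values[:k]) @ Vt[:k]).T  # shape (5, k)

# Cosine similarity between sentences in the reduced space.
unit = sentence_vectors / np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T
print(np.round(similarity, 2))
```

The rows of `sentence_vectors` are the low-dimensional "fingerprints"; they could be fed to k-means directly instead of being compared pairwise.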

Answer 2

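I suggest you try a topic modelling framework such as Latent Dirichlet Allocation (LDA). The idea is that documents (in your case sentences, which might prove to be a problem) are generated from a set of latent (hidden) topics; LDA retrieves those topics and represents them as clusters of words.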

An implementation of LDA in Python is available as part of the free Gensim package. You could try applying it to your sentences and then running k-means on its output.
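A minimal sketch of that pipeline follows. Note the substitution: it uses scikit-learn's `LatentDirichletAllocation` rather than Gensim, only so that LDA and k-means come from one library. Two topics and two clusters are assumptions chosen for this toy example, and with sentences this short the topic estimates may well be unstable.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

SENTENCES = [
    "Pete and Rob have found a dog near the station.",
    "Pete and Rob have never found a dog near the station.",
    "Pete and Rob both like programming a lot.",
    "Patricia found a dog near the station.",
    "It was a dog who found Pete and Rob under the snow.",
]

# Bag-of-words counts, one row per sentence.
counts = CountVectorizer().fit_transform(SENTENCES)

# Fit LDA and get each sentence's mixture over the latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mixtures = lda.fit_transform(counts)  # shape (5, 2)

# Cluster the sentences by their topic mixtures.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(topic_mixtures)

for sentence, label in zip(SENTENCES, labels):
    print(label, sentence)
```

Which sentences end up together depends on the fitted topics, so treat the printed labels as something to inspect rather than a guaranteed grouping.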

What is a good way to estimate "approximate" semantic similarity between sentences?
