Question

I need to measure the similarity between a reference document and a set of documents in a repository.

Method:

1. I build the term-document matrix for all the documents, including the reference document.
2. I compute the SVD of this matrix.
3. I take the V matrix (the third result).
4. I transpose this matrix so that each row represents a document.
5. The first row represents the reference document.
6. I compute the cosine similarity between this row and each of the other rows.
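
Steps 5 and 6 can be sketched in plain Java as follows. This is only a minimal illustration, not the asker's actual code: the class name `ReferenceSimilarity`, the method names, and the sample vectors are hypothetical, and the rows are assumed to be document vectors that have already been extracted from the SVD result (for example, from the matrix returned by Jama's `SingularValueDecomposition.getV()`).

```java
import java.util.Arrays;

class ReferenceSimilarity {

    // Cosine similarity of two equal-length vectors.
    // Returns 0.0 if either vector has zero norm, to avoid dividing by zero.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Row 0 is the reference document.
    // result[i - 1] holds the similarity between row 0 and row i.
    static double[] similaritiesToReference(double[][] docRows) {
        double[] result = new double[docRows.length - 1];
        for (int i = 1; i < docRows.length; i++) {
            result[i - 1] = cosine(docRows[0], docRows[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document vectors, for illustration only.
        double[][] docRows = {
            {1.0, 0.0},   // reference document
            {2.0, 0.0},
            {0.0, 3.0},
            {1.0, 1.0}
        };
        System.out.println(Arrays.toString(similaritiesToReference(docRows)));
    }
}
```

The repository documents can then be ranked by sorting these values in descending order.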

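My doubts:

Since I have about 7 documents in my database, I get only an 8 * 8 V array (the document matrix). If I compute the cosine similarity on just these 8 values, will I get the right result?
Is this the approach that is generally followed?

I am writing the code in Java. I computed the SVD using the Jama package.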

Answer 1

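I tried this in Matlab using the TMG toolbox. It works fine.
For better (more accurate) results, use a larger dataset.
In LSA, the SVD is only one part of the process (it is used for dimensionality reduction). To compute your cosine similarity, you need the final matrix that you get after the computation A = U * S * V^t.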
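
To make the dimensionality-reduction step concrete, here is a rough Java sketch of one common way to get document vectors from the truncated SVD. It rests on several assumptions that are not stated in the thread: A is terms x documents, so row j of V corresponds to document j; the singular values in `s` are sorted in descending order; and `k` (the number of dimensions kept) is a hypothetical parameter you would have to choose. Because the columns of U are orthonormal, the cosine between two columns of the rank-k matrix U_k * S_k * V_k^t equals the cosine between the corresponding rows of V_k * S_k, so the sketch works on those rows directly. Note also that if V is square and orthogonal and all of its columns are kept without scaling, its rows are orthonormal, so the cosine between any two different documents is 0. The class name and sample values are made up for illustration.

```java
class ReducedDocVectors {

    // Keeps the first k columns of V and scales column c by the singular value s[c].
    // Row j of the result is document j in the k-dimensional space.
    static double[][] reduce(double[][] v, double[] s, int k) {
        double[][] out = new double[v.length][k];
        for (int j = 0; j < v.length; j++) {
            for (int c = 0; c < k; c++) {
                out[j][c] = s[c] * v[j][c];
            }
        }
        return out;
    }

    // Cosine similarity; returns 0.0 if either vector has zero norm.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Hypothetical 2 x 2 orthogonal V and singular values, for illustration only.
        double[][] v = {{0.6, 0.8}, {0.8, -0.6}};
        double[] s = {2.0, 1.0};

        // Unscaled rows of the full V are orthonormal, so their cosine is 0.
        System.out.println(cosine(v[0], v[1]));

        // Scaled rows, keeping all dimensions (k = 2) and only the largest one (k = 1).
        double[][] full = reduce(v, s, 2);
        double[][] reduced = reduce(v, s, 1);
        System.out.println(cosine(full[0], full[1]));
        System.out.println(cosine(reduced[0], reduced[1]));
    }
}
```

With Jama, `v` and `s` would presumably come from `getV().getArray()` and `getSingularValues()` on the `SingularValueDecomposition` object.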

You can read an example of LSA here.

Doubts about LSA
