Plotting word vectors using SVD to measure similarity

Time: 2015-07-17 18:51:30

Tags: python matplotlib nlp svd

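This is the code I am using to compute a word co-occurrence matrix of immediate-neighbor counts. I found the following code online, and it uses SVD.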

import numpy as np
import matplotlib.pyplot as plt

la = np.linalg
words = ['I', 'like', 'enjoying', 'deep', 'learning', 'NLP', 'flying', '.']
# A co-occurrence matrix that counts how many times each word appears directly
# before or after a given word (e.g. "like" appears next to "I" 2 times)
arr = np.array([[0,2,1,0,0,0,0,0],[2,0,0,1,0,1,0,0],[1,0,0,0,0,0,1,0],[0,0,0,1,0,0,0,1],[0,1,0,0,0,0,0,1],[0,0,1,0,0,0,0,8],[0,2,1,0,0,0,0,0],[0,0,1,1,1,0,0,0]])
u, s, v = la.svd(arr, full_matrices=False)
for i in range(len(words)):
    plt.text(u[i,2], u[i,3], words[i])

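In the last line of the code, the first element of U is used as the x-coordinate and the second element of U as the y-coordinate to project the words and see their similarity. What is the intuition behind this approach? Why do they take the 1st and 2nd elements of each row (each row represents one word) as x and y to represent the word? Please help.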
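For readers who want to see where such a matrix comes from, here is a minimal sketch of counting immediate neighbors with a window of 1. The three-sentence corpus is an assumption (the question does not state its corpus), and `cooccurrence` is a hypothetical helper name. With this assumed corpus the counts come out as the symmetric matrix `X` used in the first answer below.

```python
import numpy as np

# Assumed toy corpus (not given in the question); sentences are pre-tokenized.
corpus = [
    ["I", "like", "deep", "learning", "."],
    ["I", "like", "NLP", "."],
    ["I", "enjoy", "flying", "."],
]
words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {w: i for i, w in enumerate(words)}


def cooccurrence(sentences, index, window=1):
    """Count how often each word pair occurs within `window` positions."""
    n = len(index)
    counts = np.zeros((n, n), dtype=int)
    for sent in sentences:
        for pos, word in enumerate(sent):
            lo = max(0, pos - window)
            hi = min(len(sent), pos + window + 1)
            for other in range(lo, hi):
                if other != pos:  # a word is not its own neighbor
                    counts[index[word], index[sent[other]]] += 1
    return counts


X = cooccurrence(corpus, index)
print(X)
```

Because every neighbor pair is counted from both sides, the result is symmetric: row i, column j always equals row j, column i.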

2 answers:

Answer 0: (score: 3)

import numpy as np
import matplotlib.pyplot as plt
la = np.linalg
words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0,2,1,0,0,0,0,0], [2,0,0,1,0,1,0,0], [1,0,0,0,0,0,1,0], [0,1,0,0,1,0,0,0], [0,0,0,1,0,0,0,1], [0,1,0,0,0,0,0,1], [0,0,1,0,0,0,0,1], [0,0,0,0,1,1,1,0]])
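# Thin SVD: each row of U is the vector for one word, s holds the singular values
U, s, Vh = la.svd(X, full_matrices=False)

# Place each word at its first two SVD coordinates
for i in range(len(words)):
    plt.text(U[i, 0], U[i, 1], words[i])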
plt.show()

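In the plot window, pan the axes to the left and you will see all the words. (plt.text does not rescale the axes, so the default view only covers 0 to 1 on each axis, and coordinates taken from U may fall outside that range.)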
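As an alternative to panning by hand, the axis limits can be set from the data. This is only a sketch of one way to do it: the padding value and the output file name `word_vectors.png` are arbitrary choices, and the `Agg` backend is selected just so the example runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; remove for an interactive window
import matplotlib.pyplot as plt

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0,2,1,0,0,0,0,0], [2,0,0,1,0,1,0,0], [1,0,0,0,0,0,1,0],
              [0,1,0,0,1,0,0,0], [0,0,0,1,0,0,0,1], [0,1,0,0,0,0,0,1],
              [0,0,1,0,0,0,0,1], [0,0,0,0,1,1,1,0]])
U, s, Vh = np.linalg.svd(X, full_matrices=False)

fig, ax = plt.subplots()
for i, word in enumerate(words):
    ax.text(U[i, 0], U[i, 1], word)

# Text artists do not trigger autoscaling, so derive the limits from U directly.
pad = 0.1  # arbitrary margin around the outermost words
ax.set_xlim(U[:, 0].min() - pad, U[:, 0].max() + pad)
ax.set_ylim(U[:, 1].min() - pad, U[:, 1].max() + pad)

fig.savefig("word_vectors.png")  # hypothetical output file name
```

For interactive use, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving.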

Answer 1: (score: 2)

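By the definition of SVD, the s you get back from la.svd holds the singular values in descending order (these are the diagonal entries of the Σ matrix, which NumPy returns as a 1-D array). Picking the first two columns of u therefore ensures that you select the most important components of the original matrix.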
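The ordering claim is easy to check numerically. The sketch below reuses the matrix from Answer 0 and compares the full reconstruction with a rank-2 one; the names `X_k`, `energy`, and `err` are introduced here for illustration only.

```python
import numpy as np

X = np.array([[0,2,1,0,0,0,0,0], [2,0,0,1,0,1,0,0], [1,0,0,0,0,0,1,0],
              [0,1,0,0,1,0,0,0], [0,0,0,1,0,0,0,1], [0,1,0,0,0,0,0,1],
              [0,0,1,0,0,0,0,1], [0,0,0,0,1,1,1,0]])
U, s, Vh = np.linalg.svd(X, full_matrices=False)

# s is a 1-D array, sorted from largest to smallest singular value.
print(s)
print("descending:", bool(np.all(np.diff(s) <= 0)))

# Using every component reproduces X (up to floating-point error).
print("full reconstruction ok:", np.allclose(U @ np.diag(s) @ Vh, X))

# Keeping only the first k components gives the best rank-k approximation
# in the Frobenius norm (Eckart-Young theorem).
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
energy = (s[:k] ** 2).sum() / (s ** 2).sum()  # share of squared singular values kept
err = np.linalg.norm(X - X_k)                 # Frobenius norm of what was dropped
print("share kept by first 2 components:", energy)
print("approximation error:", err)
```

The error equals the square root of the sum of the squared discarded singular values, which is why dropping the smallest ones loses the least information.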

This process is also known as dimensionality reduction. Please read here (Section 11.3.3) and here.
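Once the words are reduced to two dimensions, similarity can be measured numerically rather than only read off a plot. The following is a minimal sketch that uses cosine similarity on the first two columns of U. Cosine similarity is an assumption here (the answers do not name a metric), and `similarity` is a hypothetical helper.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0,2,1,0,0,0,0,0], [2,0,0,1,0,1,0,0], [1,0,0,0,0,0,1,0],
              [0,1,0,0,1,0,0,0], [0,0,0,1,0,0,0,1], [0,1,0,0,0,0,0,1],
              [0,0,1,0,0,0,0,1], [0,0,0,0,1,1,1,0]])
U, s, Vh = np.linalg.svd(X, full_matrices=False)

k = 2
vectors = U[:, :k]  # one k-dimensional vector per word
index = {w: i for i, w in enumerate(words)}


def similarity(w1, w2):
    """Cosine of the angle between the reduced vectors of two words."""
    a, b = vectors[index[w1]], vectors[index[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


for w in words:
    print(w, round(similarity("like", w), 3))
```

The result does not depend on the arbitrary sign that the SVD routine picks for each column of U, because flipping a column flips the same coordinate in both vectors. Scaling the columns by `s[:k]` before comparing is a common variant and would change the values.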