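I wrote the following code to compute the cosine similarity between a number of preprocessed documents (stop-word removal, stemming, and TF-IDF weighting).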
print(X.shape)
similarity = []
for each in X:
    similarity.append(cosine_similarity(X[i:1], X))
    print(cosine_similarity(X[i:1], X))
    i = i+1
However, when I run it, I get:
(2235, 7791)
[[ 1. 0.01490594 0.11752643 ..., 0.00941571 0.03652551
0.01239277]]
Traceback (most recent call last):
File "...", line 83, in <module>
similarity.append(cosine_similarity(X[i:1], X))
File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 881, in cosine_similarity
X, Y = check_pairwise_arrays(X, Y)
File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 96, in check_pairwise_arrays
X = check_array(X, accept_sparse='csr', dtype=dtype)
File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/utils/validation.py", line 407, in check_array
context))
ValueError: Found array with 0 sample(s) (shape=(0, 7791)) while a minimum of 1 is required.
[Finished in 56.466s]
Answer 0 (score: 0)
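It isn't clear what you're trying to achieve. You are taking the cosine similarity between a slice of the matrix X and the whole matrix. The slice X[i:1] runs from row i up to, but not including, row 1, so it is empty unless i == 0. That is why the first iteration prints a result and the second one fails with "Found array with 0 sample(s) (shape=(0, 7791))". Also, the for statement iterates over the matrix, but you never use the iteration variable each.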
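To make the slicing problem concrete, here is a minimal sketch. It uses a plain Python list of rows as a stand-in for X (the toy data and the names rows, buggy_lengths and fixed_lengths are made up for illustration); slicing along the first axis of a NumPy array or SciPy sparse matrix follows the same start:stop rule.

```python
# Hypothetical toy data: three "documents" with two features each.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# rows[i:1] stops before index 1, so it only contains a row when i == 0.
buggy_lengths = [len(rows[i:1]) for i in range(len(rows))]

# rows[i:i + 1] always selects exactly one row (row i).
fixed_lengths = [len(rows[i:i + 1]) for i in range(len(rows))]

print(buggy_lengths)  # [1, 0, 0]
print(fixed_lengths)  # [1, 1, 1]
```

So if the intent was "row i against every row", the slice would need to be X[i:i+1] rather than X[i:1].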
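Cosine similarity is an operation between two vectors of equal length. For example, you can compute the similarity between row i and row j with cosine_similarity(X[i], X[j]). Note that scikit-learn's cosine_similarity expects 2-D inputs, so if X is a dense NumPy array rather than a sparse matrix, pass X[i:i+1] and X[j:j+1] instead.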
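For reference, this is roughly what the operation computes for a single pair of vectors: the dot product divided by the product of the two Euclidean norms. The helper below is a hand-written sketch in plain Python, not scikit-learn's implementation; the name cosine, the toy vectors, and the choice to return 0.0 for a zero vector are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

# Hypothetical rows i and j of a small term-weight matrix.
row_i = [1.0, 0.0, 1.0]
row_j = [1.0, 1.0, 0.0]

# dot = 1, both norms are sqrt(2), so the result is approximately 0.5.
sim = cosine(row_i, row_j)
print(sim)
```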
If you want the similarities of every row against every other row in a list, use a list comprehension:
similarity = [cosine_similarity(a, b) for a in X for b in X]
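As an alternative to calling the function once per pair, cosine_similarity can be given the whole matrix, in which case it returns every row-to-row similarity in one call. The sketch below uses a small made-up dense array in place of the real (2235, 7791) matrix. The second part keeps the one-entry-per-row list from the question, which is only a guess at what was intended.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the real TF-IDF matrix.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0]])

# With a single argument, each row is compared against every row of X.
# sim_matrix[i, j] is the similarity between row i and row j.
sim_matrix = cosine_similarity(X)
print(sim_matrix.shape)  # (3, 3)

# One (1, n_rows) result per row, using a slice that always holds one row.
similarity = [cosine_similarity(X[i:i + 1], X) for i in range(X.shape[0])]
```

For 2235 rows the full result is a 2235 x 2235 array, which is about 40 MB if it is stored as float64, so it should fit in memory comfortably.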
Does that get you moving again?