I am trying to compute the cosine similarity of the results of a K-Means clustering.
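from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(tokens_list)
svd = TruncatedSVD(100)
lsa = make_pipeline(svd, Normalizer(copy=False))
lsa_tf = lsa.fit_transform(tf)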
Above I build my two feature matrices (tf and lsa_tf), and I want to compute cosine_similarity for each of them.
from sklearn.cluster import KMeans

nb_cluster = 5
km = KMeans(n_clusters=nb_cluster, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(lsa_tf)
Above I apply K-Means.
cluster_matrix = []
for cluster_number in range(0, nb_cluster):
    index = 0
    for label in km.labels_:
        if label == cluster_number:
            cluster_matrix.append(lsa_tf[index])
        index += 1
Here I build a matrix in which all my vectors are grouped by the label they received from the clustering.
downsample_matrix = []
downsample_coefficient = 0
for vector in cluster_matrix:
    downsample_coefficient += 1
    if downsample_coefficient == 5:
        downsample_matrix.append(vector)
        downsample_coefficient = 0
Above I simply downsample my matrix by keeping every fifth vector, because otherwise it is too large to display.
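The grouping and downsampling loops above could probably also be expressed with index arrays. This is only a minimal sketch: labels and features are hypothetical stand-ins for km.labels_ and the feature matrix, and the names order, keep and downsampled are made up for the illustration.

```python
import numpy as np

# Hypothetical stand-ins for km.labels_ and the feature matrix.
labels = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
features = np.arange(20, dtype=float).reshape(10, 2)

# A stable sort on the labels lists the row indices cluster by cluster,
# keeping the original row order inside each cluster.
order = np.argsort(labels, kind="stable")

# Keep the 5th, 10th, ... row of that grouped order, as the loop does.
keep = order[4::5]

# Row selection with an index array returns one 2-D matrix, not a list of rows.
downsampled = features[keep]
```

Selecting rows with an index array is supported by NumPy arrays and by SciPy CSR matrices alike, so the result stays a single 2-D matrix in both cases.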
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(downsample_matrix)
plt.matshow(similarity_matrix)
plt.show()
Finally, I call cosine_similarity here and display the resulting matrix.
This code works for lsa_tf. But when I run the same code on tf, cosine_similarity raises the following error:
ValueError Traceback (most recent call last)
<ipython-input-27-5997ca6abb2d> in <module>()
19 downsample_matrix.append(vector)
20 downsample_coefficient = 0
---> 21 similarity_matrix = cosine_similarity(downsample_matrix)
22 plt.matshow(similarity_matrix)
23 plt.show()
/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
908 # to avoid recursive import
909
--> 910 X, Y = check_pairwise_arrays(X, Y)
911
912 X_normalized = normalize(X, copy=True)
/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
104 if Y is X or Y is None:
105 X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 106 warn_on_dtype=warn_on_dtype, estimator=estimator)
107 else:
108 X = check_array(X, accept_sparse='csr', dtype=dtype,
/home/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: setting an array element with a sequence.
What is the difference between my tf and lsa_tf? How can I apply cosine_similarity to both of them?
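One difference that seems relevant: CountVectorizer.fit_transform returns a SciPy sparse matrix, while the TruncatedSVD and Normalizer pipeline returns a dense NumPy array. The sketch below uses toy matrices (tf_demo and lsa_demo are assumptions standing in for tf and lsa_tf, not my real data) to show how row indexing behaves on each, and one possible way to get a matrix that cosine_similarity accepts in both cases.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins (assumptions): tf_demo mimics the sparse output of
# CountVectorizer, lsa_demo mimics the dense output of the LSA pipeline.
tf_demo = sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 3, 0],
                                      [4, 0, 0],
                                      [0, 1, 1]], dtype=float))
lsa_demo = tf_demo.toarray()

# Indexing a dense array yields a 1-D row. Indexing a sparse matrix yields
# a 1 x n sparse matrix, and a list of those is not something NumPy can
# turn into a 2-D numeric array.
dense_rows = [lsa_demo[i] for i in (0, 2)]
sparse_rows = [tf_demo[i] for i in (0, 2)]

# A list of dense rows can be passed directly; sparse rows are stacked
# into one sparse matrix with scipy.sparse.vstack first.
sim_dense = cosine_similarity(dense_rows)
sim_sparse = cosine_similarity(sparse.vstack(sparse_rows))
```

With the same rows selected, both calls should give the same similarity matrix. Selecting the rows in one step, for example tf_demo[[0, 2]], would avoid building the list at all.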