Setting an array element with a sequence: scikit-learn cosine_similarity

Asked: 2017-06-26 17:20:28

Tags: python scikit-learn similarity svd

I am trying to compute the cosine similarity of the results of a K-Means clustering.

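from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(tokens_list)
svd = TruncatedSVD(100)
lsa = make_pipeline(svd, Normalizer(copy=False))
lsa_tf = lsa.fit_transform(tf)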

Above I build my two feature matrices (tf and lsa_tf), and I want to compute cosine_similarity for each of them.
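For context, here is a minimal sketch of what these two objects look like. It assumes a toy corpus (`docs` is hypothetical and stands in for tokens_list) and uses 2 SVD components instead of 100 because the toy vocabulary is tiny. CountVectorizer.fit_transform returns a SciPy sparse matrix, whereas the pipeline ending in Normalizer returns a dense NumPy array.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus standing in for tokens_list.
docs = [
    "apple banana cherry",
    "banana cherry date",
    "cherry date elder",
    "date elder fig",
    "elder fig grape",
    "fig grape apple",
]

# Term-frequency matrix: stored as a SciPy sparse matrix.
tf = CountVectorizer().fit_transform(docs)

# LSA projection: 2 components only because the toy vocabulary is tiny.
lsa = make_pipeline(TruncatedSVD(2, random_state=0), Normalizer(copy=False))
lsa_tf = lsa.fit_transform(tf)

print(type(tf), tf.shape)
print(type(lsa_tf), lsa_tf.shape)
```

The two matrices hold the same documents row by row, but they are different container types, which matters for any code that pulls rows out one at a time.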

from sklearn.cluster import KMeans

nb_cluster = 5
km = KMeans(n_clusters=nb_cluster, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(lsa_tf)

Above I apply K-Means.

cluster_matrix = []
for cluster_number in range(nb_cluster):
    # Collect the rows of each cluster in turn, so rows end up grouped by label.
    for index, label in enumerate(km.labels_):
        if label == cluster_number:
            cluster_matrix.append(lsa_tf[index])

Here I build a matrix in which all my vectors are grouped by their label from the clustering result.
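The same grouping can be sketched without the nested loops, assuming labels and rows shaped like km.labels_ and lsa_tf (the arrays below are made-up stand-ins). A stable argsort of the labels gives the row indices ordered by cluster, and indexing with them keeps the result a single 2-D matrix instead of a Python list of rows.

```python
import numpy as np

# Made-up stand-ins for km.labels_ and lsa_tf.
labels = np.array([2, 0, 1, 0, 2, 1])
features = np.arange(12, dtype=float).reshape(6, 2)

# A stable sort keeps the original row order inside each cluster,
# matching the order produced by the nested loops.
order = np.argsort(labels, kind="stable")
cluster_matrix = features[order]

print(order)
print(cluster_matrix)
```

Indexing rows with an integer array like this is also supported by SciPy CSR matrices, so the same two lines should apply to a sparse feature matrix as well.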

downsample_matrix = []
downsample_coefficient = 0
for vector in cluster_matrix:
    downsample_coefficient += 1
    if downsample_coefficient == 5:
        downsample_matrix.append(vector)
        downsample_coefficient = 0

Above I simply downsample my matrix; otherwise it would be too large to display.
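Assuming the grouped rows are held in one 2-D array rather than a Python list, the counter loop is equivalent to a slice that starts at the fifth row and takes every fifth row after it. The array here is a made-up stand-in.

```python
import numpy as np

# Made-up stand-in for the grouped matrix: 12 rows, 3 columns.
cluster_matrix = np.arange(36).reshape(12, 3)

# Rows 4 and 9 (the 5th and 10th), the same rows the counter loop keeps.
downsample_matrix = cluster_matrix[4::5]

print(downsample_matrix.shape)
```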

import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(downsample_matrix)
plt.matshow(similarity_matrix)
plt.show()

Finally, I apply cosine_similarity here and display the resulting matrix.

This code works for lsa_tf:

[Image: similarity matrix]

But when I try the same computation with tf, cosine_similarity raises the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-27-5997ca6abb2d> in <module>()
     19         downsample_matrix.append(vector)
     20         downsample_coefficient = 0
---> 21 similarity_matrix = cosine_similarity(downsample_matrix)
     22 plt.matshow(similarity_matrix)
     23 plt.show()

/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
    908     # to avoid recursive import
    909 
--> 910     X, Y = check_pairwise_arrays(X, Y)
    911 
    912     X_normalized = normalize(X, copy=True)

/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
    104     if Y is X or Y is None:
    105         X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 106                             warn_on_dtype=warn_on_dtype, estimator=estimator)
    107     else:
    108         X = check_array(X, accept_sparse='csr', dtype=dtype,

/home/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: setting an array element with a sequence.

What is the difference between my tf and lsa_tf? How can I apply cosine_similarity to both of them?
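One difference that can be checked directly, offered as a sketch on made-up data rather than a confirmed diagnosis: a row taken from a dense array is a 1-D array, while a row taken from a SciPy csr_matrix is still a 1 x n sparse matrix. A Python list of such sparse rows is not something NumPy can turn into a regular 2-D array, which is the conversion the traceback ends in. Stacking the rows with scipy.sparse.vstack gives a single sparse matrix that cosine_similarity accepts.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Made-up data: a dense array and a sparse copy of the same values.
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [4.0, 0.0, 5.0]])
sparse_tf = sparse.csr_matrix(dense)

# A dense row is a 1-D array; a csr_matrix row is still a 1 x n sparse matrix.
print(dense[0].shape)
print(sparse_tf[0].shape)

# Sparse rows collected in a list can be stacked back into one sparse matrix,
# which cosine_similarity accepts directly.
rows = [sparse_tf[i] for i in (0, 2)]
stacked = sparse.vstack(rows).tocsr()
similarity = cosine_similarity(stacked)
print(similarity.shape)
```

Under that assumption, the list-building loops could stay as they are for tf, with a vstack call added before cosine_similarity.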

0 answers:

No answers