I get the following error when calling cosine_similarity:
numerator = sum(a*b for a,b in zip(x,y))
TypeError: only integer arrays with one element can be converted to an index
I am trying to derive a keyword-keyword co-occurrence matrix from the document-keyword matrix returned by CountVectorizer.
I suspect cosine_similarity does not accept the data type I am passing in, but I am not sure exactly what the problem is. Here n is of type scipy.sparse.csc.csc_matrix and y is of type scipy.sparse.csr.csr_matrix.
documents = (
"The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
)
countvectorizer = CountVectorizer()
y = countvectorizer.fit_transform(documents)
n = y.T.dot(y)
x = n.tocsr()
x = x.toarray()
numpy.fill_diagonal(x, 0)
result = cosine_similarity(x, "None")
Answer 0 (score: 1)
Use sklearn's cosine_similarity. This snippet runs and returns a sensible result.
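import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = (
"The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
)
countvectorizer = CountVectorizer()
# Sparse document-keyword count matrix (documents x keywords)
y = countvectorizer.fit_transform(documents)
# Keyword-keyword co-occurrence matrix (keywords x keywords)
n = y.T.dot(y)
# Convert to a dense array so the diagonal can be zeroed in place
x = n.toarray()
np.fill_diagonal(x, 0)
# Pairwise cosine similarity between the rows of x.
# Note: distance_metrics()['cosine'] is cosine_distances (1 - similarity),
# so cosine_similarity is imported directly instead.
result = cosine_similarity(x)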
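To make clear what the pairwise call computes, here is a minimal dependency-free sketch of the same pipeline. It reuses the `numerator = sum(a*b for a,b in zip(x,y))` formula from the traceback, applied to one pair of 1-D rows at a time rather than to whole matrices. The names `tokenize` and `cosine_similarity_1d` are hypothetical helpers, the regular expression is only meant to approximate CountVectorizer's default tokenization, and returning 0.0 for an all-zero row is an assumption of this sketch.

```python
import math
import re

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (an approximation of CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())


# Sorted vocabulary and a keyword -> column index lookup.
vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocab)}

# Document-keyword count matrix as a list of rows.
counts = [[0] * len(vocab) for _ in documents]
for row, doc in zip(counts, documents):
    for tok in tokenize(doc):
        row[index[tok]] += 1

# Keyword-keyword co-occurrence (counts^T . counts) with a zeroed diagonal.
size = len(vocab)
cooc = [
    [0 if i == j else sum(r[i] * r[j] for r in counts) for j in range(size)]
    for i in range(size)
]


def cosine_similarity_1d(x, y):
    # Cosine similarity of two equal-length 1-D sequences of numbers.
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Assumption of this sketch: an all-zero row is treated as similarity 0.
    return numerator / denominator if denominator else 0.0


# Pairwise similarity between every pair of keyword rows.
result = [[cosine_similarity_1d(r, s) for s in cooc] for r in cooc]
```

Here `result[index["sun"]][index["bright"]]` is the similarity between the co-occurrence profiles of "sun" and "bright". The formula only works when both arguments are flat sequences of numbers, which is why a hand-written version of it cannot be handed a sparse matrix as-is.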