KNN with TF-IDF throws "Reshape your data" warning with cosine similarity as the distance metric

Time: 2016-07-02 09:25:40

Tags: python scikit-learn knn cosine-similarity

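I am trying to use cosine similarity in scikit-learn to do KNN, but it keeps throwing these warnings. Can someone explain what they mean, and why they appear only when I use a KNN model with cosine similarity and not with any other distance metric?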

Code:

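import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# X is assumed to be a list of raw text documents
t0 = time.time()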
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

vectorizer = TfidfVectorizer()
vec_fit = vectorizer.fit_transform(X)

t1 = time.time()
total = t1-t0
print "TF-IDF built:", total

#######################------------------------############################

t0 = time.time()
nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto', metric=cosine_similarity)
nbrs.fit(X_train_tfidf.toarray())#,Y)
#KD_TREE won't work here because it doesn't work with a sparse matrix; given a dense matrix, it throws a memory error

t1 = time.time()
total = t1-t0
print "KNN Built:", total

Repeated warning message:

C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is depreca
ted in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single
feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

Trying this, as the warning suggests:

nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto', metric=cosine_similarity)
nbrs.fit(numpy.array(X_train_tfidf).reshape(1, -1))

throws the following error:

Traceback (most recent call last):
  File ".\tf-idf.py", line 54, in <module>
    nbrs.fit(numpy.array(X_train_tfidf).reshape(1, -1))
  File "C:\Miniconda2\lib\site-packages\sklearn\neighbors\base.py", line 816, in fit
    return self._fit(X)
  File "C:\Miniconda2\lib\site-packages\sklearn\neighbors\base.py", line 221, in _fit
    X = check_array(X, accept_sparse='csr')
  File "C:\Miniconda2\lib\site-packages\sklearn\utils\validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

1 answer:

Answer 0: (score: 0)

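To me it makes no sense that no other metric (such as linear_kernel) shows this. I guess it is something they forgot (?) to update, since both (linear_kernel and cosine_similarity) are kernel operations.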

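For the problem at hand, you get this warning because cosine_similarity expects a two-dimensional array but is being given a one-dimensional one. For example, X_train_tfidf=np.array([1,2,3,4.234,213.2]) triggers the warning because it has shape (5,). By contrast, X_train_tfidf=np.array([[1,2,3,4.234,213.2]]) does not, because it has shape (1, 5) and is therefore two-dimensional.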

The warning message suggests converting your one-dimensional array to two dimensions, as in X_train_tfidf=np.array([[1,2,3,4.234,213.2]]), which is equivalent to X_train_tfidf=np.array([1,2,3,4.234,213.2]).reshape(1, -1)
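To make the shape difference concrete, here is a small NumPy-only sketch using the same numbers as above. The variable names (flat, nested, as_row, as_column) are made up for illustration.

```python
import numpy as np

# A flat list gives a 1-D array: a single axis of length 5.
flat = np.array([1, 2, 3, 4.234, 213.2])
print(flat.shape)  # (5,)

# Nested brackets give a 2-D array: one sample with five features.
nested = np.array([[1, 2, 3, 4.234, 213.2]])
print(nested.shape)  # (1, 5)

# reshape(1, -1) turns the 1-D array into that same single-row 2-D array.
as_row = flat.reshape(1, -1)
print(as_row.shape)  # (1, 5)
print(np.array_equal(as_row, nested))  # True

# reshape(-1, 1) is the other option named in the warning:
# five samples with one feature each.
as_column = flat.reshape(-1, 1)
print(as_column.shape)  # (5, 1)
```

Which reshape is appropriate depends on whether the values are one sample with several features or several samples with a single feature.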

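Kernel matrices come straight out of linear algebra and are built from matrix operations, which work on two-dimensional arrays by default.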
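As a rough illustration of that point, the sketch below calls both kernel functions from sklearn.metrics.pairwise on a small 2-D array. The matrix X here is made-up data, not the TF-IDF matrix from the question.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three samples with two features each (made-up values).
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])

# Both functions take 2-D input (n_samples, n_features) and return
# a pairwise matrix of shape (n_samples, n_samples).
sim = cosine_similarity(X)
lin = linear_kernel(X)
print(sim.shape)  # (3, 3)
print(lin.shape)  # (3, 3)

# A single sample has to be passed as one row of a 2-D array.
one_vs_all = cosine_similarity(X[0].reshape(1, -1), X)
print(one_vs_all.shape)  # (1, 3)
```

The linear kernel is the plain matrix product of X with its transpose, and cosine similarity is the same product computed on rows scaled to unit length.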

Hope that makes sense; if not, give a shout.
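A side note that goes beyond the answer above: when NearestNeighbors is given a callable metric, scikit-learn calls it on pairs of individual rows, each a 1-D array, which may be where the 1-D inputs come from. One possible way to avoid that path is the built-in metric name 'cosine' with algorithm='brute', which also accepts the sparse TF-IDF matrix directly. The sketch below assumes a tiny made-up corpus (docs is hypothetical) and is an alternative approach, not the fix proposed in the answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus standing in for X from the question.
docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock markets fell sharply today",
]

# TfidfVectorizer returns a sparse matrix of shape (n_docs, n_terms).
tfidf = TfidfVectorizer().fit_transform(docs)

# metric='cosine' is cosine distance, i.e. 1 - cosine similarity.
# Brute-force search works on sparse input, so no .toarray() is needed.
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine')
nbrs.fit(tfidf)

# Query with the training documents themselves: each document's nearest
# neighbour is itself, followed by the most similar other document.
distances, indices = nbrs.kneighbors(tfidf)
print(indices)
print(distances)
```

Because this is a distance rather than a similarity, smaller values mean more similar documents.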