I have some sparse data that I convert into CSR sparse vectors:
from scipy.sparse import coo_matrix
num_news = indexed.agg(max(indexed["newsIndex"])).take(1)[0][0] + 1 # number of news items (maximum news index in the data + 1)
def get_matrix(news):
    row = [0 for i in news]
    data = [1 for i in news]
    return coo_matrix((data, (row,news)), shape=(1, num_news)).tocsr()
d['feature'] = d['newsArr'].apply(get_matrix)
Then I display it with pd.head:
uuid newsArr feature
0 014324000050581 [300.0, 274.0] (0, 274)\t1\n (0, 300)\t1
1 014379002854034 [3539.0, 1720.0, 402.0, 1787.0, 2854.0, 2500.0... (0, 402)\t1\n (0, 492)\t1\n (0, 493)\t1\n ...
2 014379004874618 [346.0] (0, 346)\t1
3 014379004904357 [592.0, 1586.0, 20.0, 4165.0, 19.0, 165.0, 12.0] (0, 12)\t1\n (0, 19)\t1\n (0, 20)\t1\n (0...
4 014379004920072 [1658.0, 283.0, 7.0, 492.0] (0, 7)\t1\n (0, 283)\t1\n (0, 492)\t1\n (...
The output of d['feature'][:1].tolist() is:
[<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>]
Then I want to run DBSCAN on it:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit_predict(d['feature'])
However, I get the following error:
ValueError: setting an array element with a sequence.
This does not make sense to me, because each of my vectors is 1*num_news. I then tried using tolist():
db = DBSCAN(eps=0.3, min_samples=10).fit_predict(d['feature'].tolist())
That raises the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[ <1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
...,
<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 15 stored elements in Compressed Sparse Row format>].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I know that sklearn can take a CSR sparse matrix as input. How should I do this?
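Both errors above seem consistent with DBSCAN receiving a 1D container of separate 1 x num_news matrices rather than a single 2D matrix. A minimal sketch of one possible direction, assuming the per-row matrices can be stacked with scipy.sparse.vstack into one (n_samples, num_news) CSR matrix: the toy `news_arrays` list, the small `num_news`, and `min_samples=2` are hypothetical stand-ins for the real data and settings, and the indices are cast to int because `newsArr` holds floats.

```python
from scipy.sparse import coo_matrix, vstack
from sklearn.cluster import DBSCAN

num_news = 10  # hypothetical; the real value in the output above is 93315

def get_matrix(news):
    # Build one 1 x num_news CSR row with a 1 at every news index.
    cols = [int(i) for i in news]  # newsArr holds floats, so cast to int
    rows = [0] * len(cols)
    data = [1] * len(cols)
    return coo_matrix((data, (rows, cols)), shape=(1, num_news)).tocsr()

# Toy stand-in for d['newsArr']; d['feature'].tolist() would give a
# list of the same kind as `features` below.
news_arrays = [[3.0, 2.0], [3.0, 2.0], [7.0], [7.0]]
features = [get_matrix(news) for news in news_arrays]

# Stack the per-row matrices into a single 2D CSR matrix, one row per sample.
X = vstack(features, format="csr", dtype=float)

# eps is taken from the question; min_samples=2 only suits the toy data.
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)
print(X.shape, labels)
```

With the real data, `features` would be replaced by `d['feature'].tolist()`, and `eps` and `min_samples` would still need to be chosen for that data.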