Question

I have some sparse data that I convert into CSR sparse vectors:

from scipy.sparse import coo_matrix

num_news = indexed.agg(max(indexed["newsIndex"])).take(1)[0][0] + 1  # maximum index of the news in the data

def get_matrix(news):
    row = [0 for i in news]
    data = [1 for i in news]
    return coo_matrix((data, (row,news)), shape=(1, num_news)).tocsr()

d['feature'] = d['newsArr'].apply(get_matrix)

Then I display it with the pandas head() method:

uuid    newsArr     feature
0   014324000050581     [300.0, 274.0]  (0, 274)\t1\n (0, 300)\t1
1   014379002854034     [3539.0, 1720.0, 402.0, 1787.0, 2854.0, 2500.0...   (0, 402)\t1\n (0, 492)\t1\n (0, 493)\t1\n ...
2   014379004874618     [346.0]     (0, 346)\t1
3   014379004904357     [592.0, 1586.0, 20.0, 4165.0, 19.0, 165.0, 12.0]    (0, 12)\t1\n (0, 19)\t1\n (0, 20)\t1\n (0...
4   014379004920072     [1658.0, 283.0, 7.0, 492.0]     (0, 7)\t1\n (0, 283)\t1\n (0, 492)\t1\n (...

The output of d['feature'][:1].tolist() is:

[<1x93315 sparse matrix of type '<type 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>]

Then I want to run DBSCAN on it:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit_predict(d['feature'])

However, I get the following error:

ValueError: setting an array element with a sequence.

This does not make sense to me, because each of my vectors has shape 1 x num_news. I then tried passing tolist():

db = DBSCAN(eps=0.3, min_samples=10).fit_predict(d['feature'].tolist())

That raises the following error:

ValueError: Expected 2D array, got 1D array instead:
array=[ <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>
 ...,
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
 <1x93315 sparse matrix of type '<type 'numpy.int64'>'
    with 15 stored elements in Compressed Sparse Row format>].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I know that sklearn can take a CSR sparse matrix as input. How can I make this work?
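
Both errors suggest that DBSCAN is receiving a sequence of separate 1 x num_news matrices rather than one 2D matrix. A minimal sketch of one possible approach is to stack the rows into a single CSR matrix with scipy.sparse.vstack before fitting. The names news_lists, rows and X, the small num_news, and min_samples=2 are assumptions made for this toy example; with the real data, d['feature'].tolist() would take the place of rows.

```python
from scipy.sparse import csr_matrix, vstack
from sklearn.cluster import DBSCAN

num_news = 10  # hypothetical small value, only for this demo


def get_matrix(news):
    # Build one 1 x num_news CSR row with a 1 at every news index.
    cols = [int(i) for i in news]
    data = [1] * len(cols)
    row = [0] * len(cols)
    return csr_matrix((data, (row, cols)), shape=(1, num_news))


# Hypothetical stand-in for the values of d['newsArr'].
news_lists = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [7.0]]
rows = [get_matrix(news) for news in news_lists]

# Stack the separate 1 x num_news rows into one
# (n_samples x num_news) CSR matrix; cast to float for the distances.
X = vstack(rows, format="csr").astype(float)

# min_samples is lowered to 2 here because the toy data has only 4 rows.
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)
print(X.shape, labels)
```

The stacked matrix stays sparse, so the 93315 columns are never materialised as a dense array.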

Sklearn DBSCAN cannot fit CSR sparse data

0 answers: