Question

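SVC appears to treat kernels that can take sparse matrices differently from those that cannot. However, if a user-provided kernel is written to take sparse matrices, and a sparse matrix is provided during fitting, SVC still converts the sparse matrix to dense and treats the kernel as dense, because the kernel is not one of the sparse kernels pre-defined in scikit-learn.

Is there a way to force SVC to recognize the kernel as sparse, so that it does not convert the sparse matrix to dense before passing it to the kernel?

Edit 1: minimal working example

As an example, if SVC is passed the string "linear" for the kernel at creation, the built-in linear kernel is used, sparse matrices are passed directly to it, and the support vectors are stored as sparse matrices when a sparse matrix is provided at fit time. However, if the linear_kernel function itself is passed to SVC, sparse matrices are converted to ndarray before being passed to the kernel, and the support vectors are stored as ndarray.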

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import linear_kernel
from sklearn.svm import SVC


def make_random_sparsemat(m, n=1024, p=.94):
    """Make mxn sparse matrix with 1-p probability of 1."""
    return csr_matrix(np.random.uniform(size=(m, n)) > p, dtype=np.float64)


X = make_random_sparsemat(100)
Y = np.asarray(np.random.uniform(size=(100)) > .5, dtype=np.float64)
model1 = SVC(kernel="linear")
model1.fit(X, Y)
print("Built-in kernel:")
print("Kernel treated as sparse: {}".format(model1._sparse))
print("Type of dual coefficients: {}".format(type(model1.dual_coef_)))
print("Type of support vectors: {}".format(type(model1.support_vectors_)))

model2 = SVC(kernel=linear_kernel)
model2.fit(X, Y)
print("User-provided kernel:")
print("Kernel treated as sparse: {}".format(model2._sparse))
print("Type of dual coefficients: {}".format(type(model2.dual_coef_)))
print("Type of support vectors: {}".format(type(model2.support_vectors_)))

Output:

Built-in kernel:
Kernel treated as sparse: True
Type of dual coefficients: <class 'scipy.sparse.csr.csr_matrix'>
Type of support vectors: <class 'scipy.sparse.csr.csr_matrix'>
User-provided kernel:
Kernel treated as sparse: False
Type of dual coefficients: <type 'numpy.ndarray'>
Type of support vectors: <type 'numpy.ndarray'>

Answer 1

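I'm fishing in the dark here, working mainly from the scikit-learn code I found on GitHub.

A lot of the SVC linear code appears to live in a C library. There is talk of its internal representation being sparse.

Your linear_kernel function is just: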

X, Y = check_pairwise_arrays(X, Y)
return safe_sparse_dot(X, Y.T, dense_output=True)

If I make your X and Y

In [119]: X
Out[119]: 
<100x1024 sparse matrix of type '<class 'numpy.float64'>'
    with 6108 stored elements in Compressed Sparse Row format>
In [120]: Y = np.asarray(np.random.uniform(size=(100)) > .5, dtype=np.float64)

and recreate the safe_sparse_dot call

In [122]: safe_sparse_dot(Y,X,dense_output=True)
Out[122]: array([ 3.,  5.,  3., ...,  4.,  2.,  4.])

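So applying that to your Y and X (in the only order that makes sense), I get a dense array. Changing the dense_output parameter makes no difference. Basically, Y*X, a product that mixes a dense array with a sparse matrix, returns dense.

If I make Y sparse, then I can get a sparse product: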

In [125]: Ym=sparse.csr_matrix(Y)
In [126]: Ym*X
Out[126]: 
<1x1024 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>
In [127]: safe_sparse_dot(Ym,X,dense_output=False)
Out[127]: 
<1x1024 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>
In [128]: safe_sparse_dot(Ym,X,dense_output=True)
Out[128]: array([[ 3.,  5.,  3., ...,  4.,  2.,  4.]])
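To reproduce these checks outside an interactive session, something like the following self-contained sketch should do. The seed and the variable names are my own choices, not part of the session above, so the actual values will differ from the ones shown there; only the return types are the point.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.extmath import safe_sparse_dot

# Assumption: a fixed seed, chosen only to make the sketch repeatable.
rng = np.random.RandomState(0)
X = sparse.csr_matrix(rng.uniform(size=(100, 1024)) > 0.94, dtype=np.float64)
Y = (rng.uniform(size=100) > 0.5).astype(np.float64)

# Dense array times sparse matrix: the result is a dense ndarray,
# even though dense_output=False was requested.
dense_times_sparse = safe_sparse_dot(Y, X, dense_output=False)

# Sparse times sparse: the result stays sparse unless dense_output=True.
Ym = sparse.csr_matrix(Y)
sparse_times_sparse = safe_sparse_dot(Ym, X, dense_output=False)
densified = safe_sparse_dot(Ym, X, dense_output=True)

print(type(dense_times_sparse))
print(type(sparse_times_sparse))
print(type(densified))
```

So dense_output only matters when both operands are sparse; as soon as one operand is dense, the product is dense either way.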

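I don't know how SVC and fit work, but just from working with sparse matrices I know you have to be careful when mixing sparse and dense matrices. It is easy to end up with a dense result whether you want one or not.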
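One possible way to sidestep the conversion, which is not something I tested in the session above and does not make SVC treat a callable kernel as sparse: compute the Gram matrix yourself from the sparse data and fit with kernel="precomputed". This is only a sketch, assuming a dense n_samples x n_samples Gram matrix fits in memory; the seed, X_new and the other variable names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import SVC

# Assumption: a fixed seed and toy data shaped like the question's example.
rng = np.random.RandomState(0)
X = csr_matrix(rng.uniform(size=(100, 1024)) > 0.94, dtype=np.float64)
y = (rng.uniform(size=100) > 0.5).astype(np.float64)

# Linear-kernel Gram matrix, computed with a sparse product.
# Only this 100 x 100 result is densified; X itself stays in CSR format.
K_train = (X @ X.T).toarray()

model = SVC(kernel="precomputed")
model.fit(K_train, y)

# At prediction time the kernel is evaluated between the new rows and
# the training rows, giving shape (n_new, n_train).
X_new = csr_matrix(rng.uniform(size=(5, 1024)) > 0.94, dtype=np.float64)
K_new = (X_new @ X.T).toarray()
pred = model.predict(K_new)

# The support vectors can be recovered as sparse rows via their indices.
support_rows = X[model.support_]
print(pred.shape, type(support_rows))
```

The trade-off is that you carry the Gram matrix yourself and have to recompute the kernel against the training rows for every prediction, but the feature matrix never has to be turned into an ndarray.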

How to force SVC to treat a user-provided kernel as sparse
