Question

I am using sklearn to train a classification model. The data shape and the training pipeline are:

clf = Pipeline([
    ("imputer", Imputer(missing_values='NaN', strategy="mean", axis=0)),
    ('feature_selection', VarianceThreshold(threshold=(.97 * (1 - .97)))),
    ('scaler', StandardScaler()),
    ('classification', svm.SVC(kernel='linear', C=1))])

print X.shape, y.shape
(59381, 895) (59381,)

I have checked that feature_selection reduces the feature vector size from 895 to 124:

feature_selection = Pipeline([
    ("imputer", Imputer(missing_values='NaN', strategy="mean", axis=0)),
    ('feature_selection', VarianceThreshold(threshold=(.97 * (1 - .97))))
    ])

feature_selection.fit_transform(X).shape
(59381, 124)

Then I try to get the accuracy as follows:

scores = cross_validation.cross_val_score(clf, X, y)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

But the training process is very slow. Is there a way to speed it up in this case? Or is a feature vector of size 124 still too large for an SVM model?

Answer 1

Try using sklearn.svm.LinearSVC.

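It is supposed to give results very similar to svm.SVC(kernel='linear'), but training will be much faster (at least when d < m, where d is the feature dimension and m is the number of training samples).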
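A minimal sketch of that swap, under a few assumptions that are not in the question: a recent scikit-learn (where SimpleImputer replaces the removed Imputer class and cross_val_score lives in sklearn.model_selection), and a small synthetic dataset standing in for the real 59381 x 895 data. The dual=False setting is also an addition here; it is generally preferred when there are more samples than features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical stand-in data; replace with the real X and y.
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=10, random_state=0)

clf = Pipeline([
    # SimpleImputer is the modern replacement for Imputer (assumption:
    # scikit-learn >= 0.20).
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("feature_selection", VarianceThreshold(threshold=(.97 * (1 - .97)))),
    ("scaler", StandardScaler()),
    # LinearSVC instead of svm.SVC(kernel='linear'); dual=False solves the
    # primal problem, which suits n_samples > n_features.
    ("classification", LinearSVC(C=1, dual=False)),
])

scores = cross_val_score(clf, X, y, cv=3)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Note that LinearSVC defaults to a squared hinge loss and penalizes the intercept, so its scores can differ slightly from SVC(kernel='linear').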

If you want to use another kernel, such as rbf, you cannot use LinearSVC.

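However, you can increase the kernel cache size: the size of the kernel cache has a strong impact on run time for larger problems. If you have enough RAM available, it is recommended to set cache_size to a value higher than the default of 200 (MB), such as 500 (MB) or 1000 (MB).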
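As a rough illustration of the cache setting, here is a sketch with an rbf kernel and cache_size raised to 1000 MB. The synthetic data and the specific values of C and cache_size are assumptions for the example, not recommendations tuned to the question's data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical stand-in data; replace with the real X and y.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# cache_size is given in MB; the default is 200.
model = SVC(kernel="rbf", C=1, cache_size=1000)
model.fit(X, y)

print("Training accuracy: %0.2f" % model.score(X, y))
```

On a dataset this small the cache makes no measurable difference; the effect only shows up when the kernel matrix is too large to recompute cheaply.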
