Making SVM run faster in Python

Date: 2015-07-28 15:53:14

Tags: python scikit-learn svm

I am using the code below for an SVM in Python:

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
proba = clf.predict_proba(X)

But it is taking a huge amount of time.

Actual Data Dimensions:

train-set: (1422392, 29)
test-set: (233081, 29)

How can I speed it up (parallelization or some other way)? Please help. I have already tried PCA and downsampling.

I have 6 classes. Edit: I found http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html, but I want probability estimates, and it does not seem to provide them for SVMs.
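
For reference, SGDClassifier does expose predict_proba, but only when trained with loss='log' or loss='modified_huber', not with the default hinge loss (the linear-SVM setting). A minimal sketch, reusing the iris stand-in data:

from sklearn import datasets
from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# loss='modified_huber' (or loss='log') enables predict_proba;
# the default loss='hinge' (a linear SVM) does not.
clf = SGDClassifier(loss='modified_huber')
clf.fit(X, y)
proba = clf.predict_proba(X)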

Edit:

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing
import numpy as np
import math

def new_func(a):                                  # logistic sigmoid: maps x to 1/(1 + e^(-x))
    a = 1 / (1 + math.exp(-a))
    return a

if __name__ == '__main__':
    iris = datasets.load_iris()
    cores = multiprocessing.cpu_count() - 2
    X, y = iris.data, iris.target                 # load dataset

    C_range = 10.0 ** np.arange(-4, 4)            # C value range
    param_grid = dict(estimator__C=C_range.tolist())

    svr = OneVsRestClassifier(LinearSVC(class_weight='auto'), n_jobs=cores)   # LinearSVC: much faster
    #svr = OneVsRestClassifier(SVC(kernel='linear', probability=True,        # SVC: slow
    #    class_weight='auto'), n_jobs=cores)

    clf = grid_search.GridSearchCV(svr, param_grid, n_jobs=cores, verbose=2) # grid search
    clf.fit(X, y)                                 # train the SVM model

    decisions = clf.decision_function(X)          # decision-function values
    #prob = clf.predict_proba(X)                  # only SVC outputs probabilities
    print(decisions[:5, :])
    vecfunc = np.vectorize(new_func)
    prob = vecfunc(decisions)                     # map decisions through 1/(1 + e^(-x))
    print(prob[:5, :])
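
Since this transformation is elementwise, scipy.special.expit (SciPy's vectorized logistic sigmoid, assuming SciPy is available) computes the same thing without np.vectorize:

from scipy.special import expit

prob = expit(decisions)    # elementwise 1/(1 + e^(-x)), same result as vecfunc above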

Edit 2: The answer by user3914041 yields very poor probability estimates.

5 Answers:

Answer 0 (score: 14):

SVM classifiers don't scale easily. From the docs, on the complexity of sklearn.svm.SVC:

The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

In scikit-learn you have svm.LinearSVC, which scales better. Apparently it should be able to handle your data.

Alternatively you could just go with another classifier. If you want probability estimates, I'd suggest logistic regression. Logistic regression also has the advantage of not needing probability calibration to output 'proper' probabilities.
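
A minimal sketch of that suggestion, reusing the iris stand-in data from the question (class_weight='balanced' assumes sklearn >= 0.17; see the last answer below):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Logistic regression handles multiclass natively (one-vs-rest), so no
# OneVsRestClassifier wrapper is needed, and predict_proba works directly.
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
proba = clf.predict_proba(X)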

Edit:

I did not know about LinearSVC's complexity; I finally found the information in the user guide:

Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.

To get probabilities out of a LinearSVC, check out this link. It is just a couple of links away from the probability calibration guide I linked above, and it contains a way to estimate probabilities. Namely:

    prob_pos = clf.decision_function(X_test)
    prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())

Note the estimates will probably be poor without calibration, as illustrated in the link.
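
Alternatively, sklearn's own sklearn.calibration.CalibratedClassifierCV (available since 0.16) can wrap LinearSVC and turn its decision function into calibrated probabilities; a sketch:

from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Platt-style sigmoid calibration of LinearSVC's decision values,
# fitted with 3-fold cross-validation.
clf = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X)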

Answer 1 (score: 7):

This was briefly mentioned in the top answer; here is the code. The fastest way is to parallelize via the n_jobs parameter: replace the line

clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))

with

clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), n_jobs=-1)

This will use all available CPUs on your computer, while still performing the same computation as before.

Answer 2 (score: 7):

You can use the kernel_approximation module to scale SVMs to a large number of samples like this.
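
A sketch of that idea (the Nystroem settings here are illustrative, not tuned): approximate the kernel map explicitly, then fit a fast linear model on the transformed features:

from sklearn import datasets
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Low-rank approximation of an RBF kernel feature map, followed by a
# linear SVM; this scales far better than kernelized SVC.
clf = make_pipeline(Nystroem(kernel='rbf', n_components=100), LinearSVC())
clf.fit(X, y)
print(clf.score(X, y))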

Answer 3 (score: 1):

For large datasets, consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer (see the sketch in the previous answer).

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Answer 4 (score: 0):

A few of the answers mention using class_weight='auto'. For sklearn versions above 0.17, use class_weight='balanced' instead: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html