I am using the code below for SVM in Python:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target  # example dataset; my real data is much larger
# one SVC per class, with Platt scaling enabled so predict_proba works
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
proba = clf.predict_proba(X)
But it is taking a huge amount of time.
Actual Data Dimensions:
train-set (1422392,29)
test-set (233081,29)
How can I speed it up (in parallel or some other way)? Please help. I have already tried PCA and downsampling.
I have 6 classes. Edit: I found http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html, but I want probability estimates and it does not seem to provide them for SVM.
Edit:

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing
import numpy as np
import math

def new_func(a):  # elementwise logistic sigmoid: 1 / (1 + exp(-a))
    return 1 / (1 + math.exp(-a))

if __name__ == '__main__':
    iris = datasets.load_iris()
    cores = multiprocessing.cpu_count() - 2
    X, y = iris.data, iris.target  # load dataset
    C_range = 10.0 ** np.arange(-4, 4)  # range of C values to search
    param_grid = dict(estimator__C=C_range.tolist())
    svr = OneVsRestClassifier(LinearSVC(class_weight='auto'), n_jobs=cores)  # LinearSVC: fast
    # svr = OneVsRestClassifier(SVC(kernel='linear', probability=True,      # SVC: slow
    #                               class_weight='auto'), n_jobs=cores)
    clf = grid_search.GridSearchCV(svr, param_grid, n_jobs=cores, verbose=2)  # grid search over C
    clf.fit(X, y)  # train the model
    decisions = clf.decision_function(X)  # per-class decision-function values
    # prob = clf.predict_proba(X)  # only SVC (probability=True) outputs probabilities
    print(decisions[:5, :])
    vecfunc = np.vectorize(new_func)
    prob = vecfunc(decisions)  # map decision values through 1 / (1 + exp(-x))
    print(prob[:5, :])
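A note on the sigmoid step above: np.vectorize only wraps a Python-level loop, so on large arrays a vectorized equivalent is much faster. A minimal sketch using SciPy's logistic sigmoid (expit) on a made-up array:

import numpy as np
from scipy.special import expit

decisions = np.array([[-1.2, 0.3, 2.5]])  # placeholder decision values
prob = expit(decisions)                   # elementwise 1 / (1 + exp(-x)), vectorized
prob_np = 1 / (1 + np.exp(-decisions))    # same result with plain NumPy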
Edit 2: The answer by user3914041 yields very poor probability estimates.
Answer 0 (score: 14)
SVM classifiers don't scale easily. From the docs, on the complexity of sklearn.svm.SVC:

The fit time complexity is more than quadratic with the number of samples, which makes it hard to scale to datasets with more than a couple of 10000 samples.

In scikit-learn you have svm.LinearSVC, which scales better. Apparently it should be able to handle your data.
Alternatively, you could just go with another classifier. If you want probability estimates, I'd suggest logistic regression. Logistic regression also has the advantage of not needing probability calibration to output 'proper' probabilities.
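For illustration, a minimal sketch on the iris data from the question (max_iter and class_weight here are my own illustrative choices, not something the question used):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)

# Logistic regression handles multiclass directly and is trained on log
# loss, so predict_proba returns usable probabilities without calibration.
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X, y)
proba = clf.predict_proba(X)  # shape (n_samples, n_classes), rows sum to 1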
Edit:
I did not know about LinearSVC's complexity; I finally found the information in the user guide:
Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.
To get probabilities out of a LinearSVC, check out this link. It is just a couple of links away from the probability calibration guide I linked above and contains a way to estimate probabilities. Namely:
prob_pos = clf.decision_function(X_test)
prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())  # min-max scale into [0, 1]
Note that the estimates will probably be poor without calibration, as illustrated in the link.
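A more robust route is scikit-learn's built-in calibration wrapper; a minimal sketch, assuming a recent scikit-learn (class_weight='balanced' is the modern spelling of 'auto'):

from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

X, y = datasets.load_iris(return_X_y=True)

# Platt-style sigmoid calibration with 3-fold CV: LinearSVC is trained on
# each training fold and a sigmoid maps its decision values to probabilities.
clf = CalibratedClassifierCV(LinearSVC(class_weight='balanced'), method='sigmoid', cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X)  # proper probability estimates, rows sum to 1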
Answer 1 (score: 7)
Mentioned briefly in the top answer; here is the code. The fastest way to do this is via the n_jobs parameter: replace the line
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
with
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), n_jobs=-1)
This will use all available CPUs on your computer, while still performing the same computation as before.
Answer 2 (score: 7)
You can use the kernel_approximation module to scale SVMs to a large number of samples like this.
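A minimal sketch of the idea (the RBF kernel, n_components, and the linear classifier on top are illustrative assumptions): build an explicit Nystroem feature map approximating the kernel, then fit a fast linear SVM on the mapped features.

from sklearn import datasets
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = datasets.load_iris(return_X_y=True)

# Nystroem approximates the RBF kernel with an explicit feature map built
# from n_components sampled training points; a linear classifier on these
# features approximates a kernel SVM at a fraction of the fit cost.
clf = make_pipeline(Nystroem(kernel='rbf', n_components=100), LinearSVC())
clf.fit(X, y)
print(clf.score(X, y))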
Answer 3 (score: 1)
For large datasets, consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer.
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
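On the probability estimates the question asks for: SGDClassifier does support predict_proba when trained with log loss (it is then logistic regression fitted by SGD), just not with the default hinge (SVM) loss. A minimal sketch, assuming a recent scikit-learn where the option is spelled loss='log_loss' (older releases used loss='log'):

from sklearn import datasets
from sklearn.linear_model import SGDClassifier

X, y = datasets.load_iris(return_X_y=True)

# Log loss turns SGDClassifier into logistic regression trained by
# stochastic gradient descent, which scales to millions of samples
# and exposes predict_proba, unlike the default hinge loss.
clf = SGDClassifier(loss='log_loss', class_weight='balanced', n_jobs=-1)
clf.fit(X, y)
proba = clf.predict_proba(X)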
Answer 4 (score: 0)
A few answers mention using class_weight='auto'. For sklearn versions above 0.17, use class_weight='balanced' instead:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
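For example (LogisticRegression stands in here for any estimator that accepts class_weight):

from sklearn.linear_model import LogisticRegression

# 'balanced' weights classes inversely proportional to their frequencies,
# replacing the deprecated 'auto' heuristic.
clf = LogisticRegression(class_weight='balanced')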