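I am trying to train and run multi-class classifiers with Random Forest and Logistic Regression. Right now, on my machine (8 GB RAM, i5 processor), it takes quite a while to run even though the dataset has barely 34K records. Is there any way to speed up the current run time by tweaking some parameters?
Below I only show the Logistic Regression randomized search as an example.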
X.shape
Out[9]: (34857, 18)
Y.shape
Out[10]: (34857,)
Y.unique()
Out[11]: array([7, 3, 8, 6, 1, 5, 9, 2, 4], dtype=int64)
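from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# C must be strictly positive, so 0 is dropped from the candidates
params_logreg = {'C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                 # liblinear is dropped: it does not support multi_class='multinomial'
                 'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
                 'penalty': ['l2'],
                 'max_iter': [100, 200, 300, 400, 500],
                 'multi_class': ['multinomial']}
folds = 2
n_iter = 2
scoring = 'accuracy'
n_jobs = 1
model_logregression = LogisticRegression()
# X and Y go to fit(), not to the constructor; verbose=3 is assumed, to match the log below
model_logregression = RandomizedSearchCV(model_logregression, params_logreg, n_iter=n_iter,
                                         scoring=scoring, n_jobs=n_jobs, cv=folds, verbose=3)
model_logregression.fit(X, Y)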
[CV] solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9, score=0.5663798049340218, total= 2.7min
[CV] solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 2.7min remaining: 0.0s
[CV] solver=newton-cg, penalty=l2, multi_class=multinomial, max_iter=100, C=0.9, score=0.5663625408848338, total= 4.2min
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 7.0min remaining: 0.0s
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8, score=0.5663798049340218, total= 33.9s
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=400, C=0.8, score=0.5664773053308085, total= 26.6s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 8.0min finished
It's taking about 8 mins to run for Logistic Regression. In contrast RandomForestClassifier takes only about 52 seconds.
Is there any way in which I can make this run faster by tweaking the parameters?
Answer 0 (score: 0)
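Try standardizing the data for your logistic regression model. Normalized data helps the model converge faster. Scikit-learn has several methods for this, so check its preprocessing section for more information.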
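As a minimal sketch of that suggestion, the snippet below puts a `StandardScaler` in front of `LogisticRegression` using a pipeline. The data is synthetic (`X_demo`, `y_demo` are stand-ins, not the asker's `X` and `Y`), and the solver and `max_iter` values are only illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 18 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=18, n_informative=10,
    n_classes=4, random_state=0,
)
# Put the columns on very different scales, which tends to slow convergence.
X_demo = X_demo * np.logspace(0, 4, 18)

# The scaler is fitted as part of the pipeline, so inside cross-validation
# it is only ever fitted on the training folds.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=200),
)
scaled_logreg.fit(X_demo, y_demo)

# Inspect the standardized features and the iterations the solver used.
X_scaled = scaled_logreg.named_steps["standardscaler"].transform(X_demo)
n_iter_used = scaled_logreg.named_steps["logisticregression"].n_iter_
```

If a pipeline like this is passed to `RandomizedSearchCV`, the parameter names need the step prefix, for example `logisticregression__C` instead of `C`.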
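Also, you are using RandomizedSearchCV for the logistic regression. This takes time because multiple models are created, fitted, and compared to find the best parameters.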
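To make the cost concrete, here is a small sketch that counts how many models a randomized search fits. It uses synthetic data and a reduced, hypothetical parameter grid (`X_demo`, `y_demo`, and `param_distributions` are assumptions for illustration), with the same `n_iter = 2` and `folds = 2` as in the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=18, n_informative=10,
    n_classes=3, random_state=0,
)

# Reduced, illustrative search space.
param_distributions = {"C": [0.1, 0.5, 1.0], "max_iter": [100, 200]}
n_iter, folds = 2, 2

search = RandomizedSearchCV(
    LogisticRegression(solver="lbfgs"),
    param_distributions,
    n_iter=n_iter, cv=folds, scoring="accuracy",
    n_jobs=1, random_state=0,
)
search.fit(X_demo, y_demo)

# Each sampled candidate is fitted once per fold, and with the default
# refit=True the best candidate is fitted once more on all the data.
n_candidates = len(search.cv_results_["params"])
total_fits = n_candidates * folds + 1
```

With 2 candidates and 2 folds this comes to 4 cross-validation fits, which matches the "Done 4 out of 4" line in the question's log, plus the final refit. Setting `n_jobs=-1` lets scikit-learn spread those fits across all available cores, which may shorten the wall-clock time on a multi-core machine.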