I am building a logistic regression model for the MNIST database using the scikit-learn package. I noticed the results were very poor with the default parameters, and after finding this tutorial I changed the sklearn.linear_model.LogisticRegression solver to 'lbfgs'. Fortunately it worked well: training the model on all 60000 elements of the training set took less than 2 minutes.
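Roughly, the only change was the solver argument. Here is a minimal standalone sketch of that change, using sklearn's built-in digits dataset purely for illustration (it is not the MNIST code shown below):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Tiny built-in dataset, only to show the solver argument
X, y = load_digits(return_X_y=True)
clf = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X, y)
print(clf.score(X, y))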
I am using Google Compute Engine, so I wanted to use multiple cores and try to train the model faster. I set up an instance with 2 cores and passed n_jobs = 2 to LogisticRegression. However, the algorithm performed worse with n_jobs = 2 than with n_jobs = 1.
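For context, this is how I confirm the core count the instance actually exposes (os.cpu_count is standard library and joblib ships with scikit-learn; this check is not part of the original snippets):

import os
import joblib

# Both should report 2 on the 2-core instance
print(os.cpu_count())
print(joblib.cpu_count())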
Here are the snippets. Importing the data and converting it to np.ndarray objects:
import numpy as np
import matplotlib.pyplot as plt
from mnist import MNIST

# Load the raw MNIST data and convert it to numpy arrays
mndata = MNIST('./data')
images_train, labels_train = mndata.load_training()
images_test, labels_test = mndata.load_testing()
labels_train = labels_train.tolist()
labels_test = labels_test.tolist()
X_train_all = np.array(images_train)
y_train_all = np.array(labels_train)
X_test_all = np.array(images_test)
y_test_all = np.array(labels_test)
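As a sanity check, these are the shapes I expect after the conversion (the sizes in the comments are the standard MNIST split, not output pasted from my run):

print(X_train_all.shape, y_train_all.shape)  # expected: (60000, 784) (60000,)
print(X_test_all.shape, y_test_all.shape)    # expected: (10000, 784) (10000,)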
The main function:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import time

def log_test(train_size, c, cores):
    # Train on the first train_size examples with the given C and n_jobs,
    # then print the scores and the elapsed time
    X_train = X_train_all[:train_size]
    y_train = y_train_all[:train_size]
    start_time = time.time()
    logreg = LogisticRegression(C=c, solver='lbfgs', n_jobs=cores).fit(X_train, y_train)
    print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
    print("Test set score: {:.3f}".format(logreg.score(X_test_all, y_test_all)))
    elapsed_time = time.time() - start_time
    print(elapsed_time)
Results with n_jobs = 1 versus n_jobs = 2:
log_test(2000, 100, 1) - 2s
log_test(2000, 100, 2) - 9s
log_test(5000, 100, 1) - 8s
log_test(5000, 100, 2) - 27s
log_test(7000, 100, 1) - 13s
log_test(7000, 100, 2) - 55s
log_test(15000, 100, 1) - 27s
log_test(15000, 100, 2) - 115s

The question: how can I use multiple cores to make the algorithm train faster?