I have code that uses sklearn.linear_model.LogisticRegression and sklearn.ensemble.RandomForestClassifier. With everything else in the code unchanged, running it through a multiprocessing pool spawns hundreds of threads on the logistic regression path, which completely kills performance. htop screenshots of the 36 processors:
Idle:
Forest (one processor stays idle, as expected):
Logistic (every processor at 100% usage):
Is logistic regression spawning background threads (it appears so), and if so, is there a way to prevent it?
$ python3.6
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
>>> import sklearn
>>> sklearn.__version__
'0.20.1'
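For context, here is a minimal sketch of the kind of setup described above; the toy data, pool size, and fit_one helper are hypothetical stand-ins, not the original code:
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# toy data standing in for the real dataset
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

def fit_one(model_cls):
    # each pool worker fits one estimator on the shared data
    return model_cls().fit(X, y)

if __name__ == '__main__':
    # on the machine described above this would be a pool over the 36 cores
    with Pool(processes=4) as pool:
        fitted = pool.map(fit_one, [RandomForestClassifier, LogisticRegression])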
Answer 0 (score: 1)
You can always pass the number of threads to use via n_jobs=N when instantiating sklearn.linear_model.LogisticRegression, where N is the desired number of threads. I would check whether running it with n_jobs=1 helps. Otherwise, Python may be misreading the number of threads available in your environment. To make sure, I would check:
import multiprocessing
print(multiprocessing.cpu_count())
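As a concrete illustration of the n_jobs=1 suggestion above, a minimal sketch (X and y are hypothetical placeholders for your training data):
from sklearn.linear_model import LogisticRegression

# explicitly ask the estimator for a single joblib job; whether this also stops
# the extra threads in your setup is exactly what is worth checking
clf = LogisticRegression(n_jobs=1)
# clf.fit(X, y)   # X, y: your training data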
Under the hood, LogisticRegression uses sklearn.externals.joblib.Parallel for its threading. Its logic is fairly involved, so without a full picture of your environment setup it is hard to say exactly what it will do.
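To make that Parallel machinery more concrete, here is a small toy example, not scikit-learn's actual call, assuming the joblib vendored with 0.20.x accepts the prefer argument:
from sklearn.externals.joblib import Parallel, delayed

def square(i):
    return i * i

# prefer='threads' requests a thread-based backend, prefer='processes' a
# process-based one; LogisticRegression.fit chooses between them per solver
results = Parallel(n_jobs=2, prefer='threads')(delayed(square)(i) for i in range(8))
print(results)   # [0, 1, 4, 9, 16, 25, 36, 49]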
Answer 1 (score: 1)
Assuming this happens while fitting the model, take a look at this part of the source code of the model's fit() method (link):
# The SAG solver releases the GIL so it's more efficient to use
# threads for this solver.
if solver in ['sag', 'saga']:
    prefer = 'threads'
else:
    prefer = 'processes'
fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                       **_joblib_parallel_args(prefer=prefer))(
    path_func(X, y, pos_class=class_, Cs=[self.C],
              fit_intercept=self.fit_intercept, tol=self.tol,
              verbose=self.verbose, solver=solver,
              multi_class=multi_class, max_iter=self.max_iter,
              class_weight=self.class_weight, check_input=False,
              random_state=self.random_state, coef=warm_start_coef_,
              penalty=self.penalty,
              max_squared_sum=max_squared_sum,
              sample_weight=sample_weight)
    for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
Note the prefer = 'threads' case that gets passed in through **_joblib_parallel_args(prefer=prefer): if you are using the sag or saga solver, you may run into the threading issue. The default solver, however, is liblinear.
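So one possible workaround, if you do not specifically need sag or saga, is to pin the solver explicitly; a minimal sketch (X and y stand in for your data):
from sklearn.linear_model import LogisticRegression

# explicitly request the default liblinear solver so the 'threads'-preferring
# sag/saga branch shown above is never taken
model = LogisticRegression(solver='liblinear')
# model.fit(X, y)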
Also, judging from the source of the Parallel() used above (link), sklearn points to this possible way of dealing with threading issues:
'threading' is a low-overhead alternative that is most efficient for functions that release the Global Interpreter Lock: e.g. I/O-bound code or CPU-bound code in a few calls to native code that explicitly releases the GIL. In addition, if the `dask` and `distributed` Python packages are installed, it is possible to use the 'dask' backend for better scheduling of nested parallel calls without over-subscription and potentially distribute parallel calls over a networked cluster of several hosts.
As I understand it, something like the following could reduce the threading:
from dask.distributed import Client
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
...
# create local cluster
client = Client(processes=False)
model = LogisticRegression()
with joblib.parallel_backend('dask'):
    model.fit(...)
...
That is, using Dask with joblib as suggested.