I have code that uses sklearn.linear_model.LogisticRegression and sklearn.ensemble.RandomForestClassifier. With everything else in the code unchanged, running it through a multiprocessing pool spawns hundreds of threads on the logistic regression path, which completely kills performance. htop screenshots of the 36 processors:
Idle:
Forest (one processor stays idle, as expected):
Logistic (every processor at 100% usage):
Is logistic regression spawning background threads (it appears so), and if so, is there a way to prevent it?
$ python3.6
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
>>> import sklearn
>>> sklearn.__version__
'0.20.1'
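For context, here is a minimal sketch of the kind of setup described above; the toy data, pool size, and fit_one helper are hypothetical stand-ins, not the original code:
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# toy data standing in for the real dataset
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

def fit_one(model_cls):
    # each pool worker fits one estimator on the shared data
    return model_cls().fit(X, y)

if __name__ == '__main__':
    # on the machine described above this would be a pool over the 36 cores
    with Pool(processes=4) as pool:
        fitted = pool.map(fit_one, [RandomForestClassifier, LogisticRegression])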
Answer 0 (score: 1)
You can always pass the number of threads to use via n_jobs=N when instantiating sklearn.linear_model.LogisticRegression, where N is the desired number of threads. I would check whether running it with n_jobs=1 helps. Otherwise, Python may be misreading the number of threads available in your environment. To make sure, I would check:
import multiprocessing
print(multiprocessing.cpu_count())
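As a concrete illustration of the n_jobs=1 suggestion above, a minimal sketch (X and y are hypothetical placeholders for your training data):
from sklearn.linear_model import LogisticRegression

# explicitly ask the estimator for a single joblib job; whether this also stops
# the extra threads in your setup is exactly what is worth checking
clf = LogisticRegression(n_jobs=1)
# clf.fit(X, y)   # X, y: your training data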
Under the hood, LogisticRegression uses sklearn.externals.joblib.Parallel for its threading. Its logic is fairly involved, so without a full picture of your environment setup it is hard to say exactly what it will do.
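To make that Parallel machinery more concrete, here is a small toy example, not scikit-learn's actual call, assuming the joblib vendored with 0.20.x accepts the prefer argument:
from sklearn.externals.joblib import Parallel, delayed

def square(i):
    return i * i

# prefer='threads' requests a thread-based backend, prefer='processes' a
# process-based one; LogisticRegression.fit chooses between them per solver
results = Parallel(n_jobs=2, prefer='threads')(delayed(square)(i) for i in range(8))
print(results)   # [0, 1, 4, 9, 16, 25, 36, 49]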
Answer 1 (score: 1)
Assuming this happens while fitting the model, take a look at this part of the source code of the model's fit() method (link):
# The SAG solver releases the GIL so it's more efficient to use
# threads for this solver.
if solver in ['sag', 'saga']:
    prefer = 'threads'
else:
    prefer = 'processes'
fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                       **_joblib_parallel_args(prefer=prefer))(
    path_func(X, y, pos_class=class_, Cs=[self.C],
              fit_intercept=self.fit_intercept, tol=self.tol,
              verbose=self.verbose, solver=solver,
              multi_class=multi_class, max_iter=self.max_iter,
              class_weight=self.class_weight, check_input=False,
              random_state=self.random_state, coef=warm_start_coef_,
              penalty=self.penalty,
              max_squared_sum=max_squared_sum,
              sample_weight=sample_weight)
    for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
Note the prefer = 'threads' case that gets passed in through **_joblib_parallel_args(prefer=prefer): if you are using the sag or saga solver, you may run into the threading issue. The default solver, however, is liblinear.
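So one possible workaround, if you do not specifically need sag or saga, is to pin the solver explicitly; a minimal sketch (X and y stand in for your data):
from sklearn.linear_model import LogisticRegression

# explicitly request the default liblinear solver so the 'threads'-preferring
# sag/saga branch shown above is never taken
model = LogisticRegression(solver='liblinear')
# model.fit(X, y)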
Also, judging from the source of the Parallel() used above (link), sklearn points to this possible way of dealing with threading issues:
'threading' is a low-overhead alternative that is most efficient for functions that release the Global Interpreter Lock: e.g. I/O-bound code or CPU-bound code in a few calls to native code that explicitly releases the GIL. In addition, if the `dask` and `distributed` Python packages are installed, it is possible to use the 'dask' backend for better scheduling of nested parallel calls without over-subscription and potentially distribute parallel calls over a networked cluster of several hosts.
As I understand it, something like the following could reduce the threading:
from dask.distributed import Client
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
...
# create local cluster
client = Client(processes=False)
model = LogisticRegression()
with joblib.parallel_backend('dask'):
    model.fit(...)
...
That is, using Dask with joblib as suggested.