For some reason, a call to RandomForestClassifier.fit uses only 2.5GB of RAM on my local machine, but a full 7GB on my server, with exactly the same training set.

The code (apart from the import from sklearn.ensemble) is essentially just this:

y_train = data_train['train_column']
x_train = data_train.drop('train_column', axis=1)
# The difference in memory consumption starts here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(x_train, y_train)
preds = clf.predict(data_test)

My local machine is a MacBook Pro with 16GB of RAM and a 4-core CPU. The server is an Ubuntu instance on the DigitalOcean cloud with 8GB of RAM, also with a 4-core CPU.

scikit-learn is version 0.18 and Python is 3.5.2.

I can't even imagine what the cause might be; any help would be much appreciated.

Update

The memory error appears inside the fit method, in this code:
# Parallel loop: we use the threading backend as the Cython code
# for fitting the trees is internally releasing the Python GIL
# making threading always more efficient than multiprocessing in
# that case.
trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
backend="threading")(
delayed(_parallel_build_trees)(
t, self, X, y, sample_weight, i, len(trees),
verbose=self.verbose, class_weight=self.class_weight)
for i, t in enumerate(trees))
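Not part of the original question: one way to confirm where the spike happens is to read the process's peak resident set size before and after the fit call, using only the standard library. Note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS, which is exactly the kind of platform difference being compared here. The list-building workload below is only a stand-in for the actual clf.fit(x_train, y_train) call.

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    # ru_maxrss is kilobytes on Linux but bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

before = peak_rss_mb()
data = [list(range(1000)) for _ in range(1000)]  # stand-in for clf.fit(...)
after = peak_rss_mb()
print("peak RSS grew by about %.1f MiB" % (after - before))
```

Running this around the real fit call on both machines would show whether the extra 4.5GB is allocated during tree building or earlier, when the DataFrame is split.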
Update 2

Information about my systems:
# local
Darwin-16.1.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 05:05:28)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
# server
Linux-3.13.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
My numpy configs:

# server
>>> np.__config__.show()
blas_opt_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
openblas_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
lapack_opt_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
blas_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
# local
>>> np.__config__.show()
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blas_mkl_info:
NOT AVAILABLE
atlas_threads_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
atlas_3_10_blas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
extra_compile_args = ['-msse3']
openblas_info:
NOT AVAILABLE
atlas_3_10_blas_threads_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
The repr of the clf object is identical on both machines.
Answer 0 (score: 1)

One possible explanation is that your server is running an older scikit-learn. A known earlier issue was that sklearn's random forests were very memory-hungry; if I remember correctly, that was fixed in 0.17.
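A quick way to rule this explanation in or out (my sketch, not part of the answer) is to print the relevant library versions on both machines and compare:

```python
import sys

import numpy
import scipy
import sklearn

# Compare these lines between the two machines; a mismatch in
# scikit-learn (or in numpy/scipy, which it delegates to) would
# support the "older version" explanation.
print("Python      ", sys.version.split()[0])
print("NumPy       ", numpy.__version__)
print("SciPy       ", scipy.__version__)
print("scikit-learn", sklearn.__version__)
```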
Answer 1 (score: 1)

Well, the problem magically disappeared after I updated the kernel from 3.13.0-57 to 4.4.0-28. Now it even uses less memory than my local Mac laptop.
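For anyone comparing machines the same way, the running kernel release can be read from Python (this check is mine, not the answerer's); it is the value that changed between the two runs in this answer:

```python
import platform

# platform.release() returns the kernel release string on Linux,
# e.g. "3.13.0-57-generic" vs "4.4.0-28-generic"; platform.platform()
# gives the fuller string shown in the question's system info.
print(platform.platform())
print(platform.release())
```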
Answer 2 (score: 0)

I'm not sure this is the cause, but OS X has memory compression enabled by default, whereas on Linux zRam/zswap/zcache are optional rather than the default (see https://en.wikipedia.org/wiki/Virtual_memory_compression).
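To see whether any of these are actually active on a given Linux box (my sketch, not part of the answer): zswap exposes its state in sysfs on kernels that ship it, and zram shows up as /dev/zram* block devices when configured.

```python
from pathlib import Path

# zswap reports "Y" or "N" here on kernels that include it.
zswap = Path("/sys/module/zswap/parameters/enabled")
state = zswap.read_text().strip() if zswap.exists() else "not available"
print("zswap enabled:", state)

# zram appears as block devices only when it has been set up.
zram = sorted(str(p) for p in Path("/dev").glob("zram*"))
print("zram devices:", zram or "none")
```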