For some reason, a call to RandomForestClassifier.fit uses only 2.5GB of RAM on my local machine, but a full 7GB on my server, with exactly the same training set.

The code (apart from the import from sklearn.ensemble) is essentially just this:

y_train = data_train['train_column']
x_train = data_train.drop('train_column', axis=1)
# The difference in memory consumption starts here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(x_train, y_train)
preds = clf.predict(data_test)

My local machine is a MacBook Pro with 16GB of RAM and a 4-core CPU. The server is an Ubuntu instance on the DigitalOcean cloud with 8GB of RAM, also with a 4-core CPU.

scikit-learn is version 0.18 and Python is 3.5.2.

I can't even imagine what the cause might be; any help would be much appreciated.

Update

The memory error appears inside the fit method, in this code:
# Parallel loop: we use the threading backend as the Cython code
# for fitting the trees is internally releasing the Python GIL
# making threading always more efficient than multiprocessing in
# that case.
trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
backend="threading")(
delayed(_parallel_build_trees)(
t, self, X, y, sample_weight, i, len(trees),
verbose=self.verbose, class_weight=self.class_weight)
for i, t in enumerate(trees))
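Not part of the original question: one way to confirm where the spike happens is to read the process's peak resident set size before and after the fit call, using only the standard library. Note that ru_maxrss is reported in kilobytes on Linux but in bytes on macOS, which is exactly the kind of platform difference being compared here. The list-building workload below is only a stand-in for the actual clf.fit(x_train, y_train) call.

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    # ru_maxrss is kilobytes on Linux but bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

before = peak_rss_mb()
data = [list(range(1000)) for _ in range(1000)]  # stand-in for clf.fit(...)
after = peak_rss_mb()
print("peak RSS grew by about %.1f MiB" % (after - before))
```

Running this around the real fit call on both machines would show whether the extra 4.5GB is allocated during tree building or earlier, when the DataFrame is split.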
Update 2

Information about my systems:
# local
Darwin-16.1.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 05:05:28)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
# server
Linux-3.13.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
My numpy configs:

# server
>>> np.__config__.show()
blas_opt_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
openblas_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
lapack_opt_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
blas_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
# local
>>> np.__config__.show()
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blas_mkl_info:
NOT AVAILABLE
atlas_threads_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
atlas_3_10_blas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
extra_compile_args = ['-msse3']
openblas_info:
NOT AVAILABLE
atlas_3_10_blas_threads_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
The repr of the clf object is identical on both machines.
Answer 0 (score: 1)

One possible explanation is that your server is running an older scikit-learn. A known earlier issue was that sklearn's random forests were very memory-hungry; if I remember correctly, that was fixed in 0.17.
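A quick way to rule this explanation in or out (my sketch, not part of the answer) is to print the relevant library versions on both machines and compare:

```python
import sys

import numpy
import scipy
import sklearn

# Compare these lines between the two machines; a mismatch in
# scikit-learn (or in numpy/scipy, which it delegates to) would
# support the "older version" explanation.
print("Python      ", sys.version.split()[0])
print("NumPy       ", numpy.__version__)
print("SciPy       ", scipy.__version__)
print("scikit-learn", sklearn.__version__)
```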
Answer 1 (score: 1)

Well, the problem magically disappeared after I updated the kernel from 3.13.0-57 to 4.4.0-28. Now it even uses less memory than my local Mac laptop.
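For anyone comparing machines the same way, the running kernel release can be read from Python (this check is mine, not the answerer's); it is the value that changed between the two runs in this answer:

```python
import platform

# platform.release() returns the kernel release string on Linux,
# e.g. "3.13.0-57-generic" vs "4.4.0-28-generic"; platform.platform()
# gives the fuller string shown in the question's system info.
print(platform.platform())
print(platform.release())
```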
Answer 2 (score: 0)

I'm not sure this is the cause, but OS X has memory compression enabled by default, whereas on Linux zRam/zswap/zcache are optional rather than the default (see https://en.wikipedia.org/wiki/Virtual_memory_compression).
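To see whether any of these are actually active on a given Linux box (my sketch, not part of the answer): zswap exposes its state in sysfs on kernels that ship it, and zram shows up as /dev/zram* block devices when configured.

```python
from pathlib import Path

# zswap reports "Y" or "N" here on kernels that include it.
zswap = Path("/sys/module/zswap/parameters/enabled")
state = zswap.read_text().strip() if zswap.exists() else "not available"
print("zswap enabled:", state)

# zram appears as block devices only when it has been set up.
zram = sorted(str(p) for p in Path("/dev").glob("zram*"))
print("zram devices:", zram or "none")
```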