Question

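I'm using sklearn's RandomForestClassifier for a classification problem. I would like to train the trees of the forest individually, because I draw a subset of a (very) large data set for each tree. However, when I fit the trees manually, memory consumption bloats.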

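Here is a line-by-line memory profile, produced with memory_profiler, of a custom fit versus RandomForestClassifier's own fit function. As far as I can tell, the library's fit function performs the same steps as the custom fit. So where does all the extra memory come from?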

Normal fit:

Line #    Mem usage    Increment   Line Contents
================================================
17   28.004 MiB    0.000 MiB   @profile
18                             def normal_fit():
19   28.777 MiB    0.773 MiB    X = random.random((1000,100))
20   28.781 MiB    0.004 MiB    Y = random.random(1000) < 0.5
21   28.785 MiB    0.004 MiB    rfc = RFC(n_estimators=100,n_jobs=1)
22   28.785 MiB    0.000 MiB    rfc.n_classes_ = 2
23   28.785 MiB    0.000 MiB    rfc.classes_ = array([False, True],dtype=bool)
24   28.785 MiB    0.000 MiB    rfc.n_outputs_ = 1
25   28.785 MiB    0.000 MiB    rfc.n_features_ = 100
26   28.785 MiB    0.000 MiB    rfc.bootstrap = False
27   37.668 MiB    8.883 MiB    rfc.fit(X,Y)

Custom fit:

Line #    Mem usage    Increment   Line Contents
================================================
 4   28.004 MiB    0.000 MiB   @profile
 5                             def custom_fit():
 6   28.777 MiB    0.773 MiB    X = random.random((1000,100))
 7   28.781 MiB    0.004 MiB    Y = random.random(1000) < 0.5
 8   28.785 MiB    0.004 MiB    rfc = RFC(n_estimators=100,n_jobs=1)
 9   28.785 MiB    0.000 MiB    rfc.n_classes_ = 2
10   28.785 MiB    0.000 MiB    rfc.classes_ = array([False, True],dtype=bool)
11   28.785 MiB    0.000 MiB    rfc.n_outputs_ = 1
12   28.785 MiB    0.000 MiB    rfc.n_features_ = 100
13   73.266 MiB   44.480 MiB    for i in range(rfc.n_estimators):
14   72.820 MiB   -0.445 MiB        rfc._make_estimator()
15   73.262 MiB    0.441 MiB        rfc.estimators_[-1].fit(X,Y,check_input=False)

Answer 1

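Follow-up:

I ended up writing a Python script that builds a single tree and dumps it via pickle. I then glue everything together with some shell scripting and a final Python script that creates and dumps the RF model. This way, memory is returned after each tree is built, since each tree has its own thread of execution (a separate run of the script).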
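The original scripts were not posted, so the following is only a minimal sketch of how such a workaround might look. The function names `build_and_dump_tree`, `load_trees`, and `forest_predict` are hypothetical, as is the choice of `max_features="sqrt"`. Instead of assembling a `RandomForestClassifier` object, the sketch averages the trees' `predict_proba` outputs by hand, and it assumes every training subset contains all classes so that each tree's `classes_` agree.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def build_and_dump_tree(X, y, path, seed):
    """Fit a single tree and pickle it to `path`.

    Intended to be called once per script run, so the operating system
    reclaims all memory when that process exits.
    """
    # max_features="sqrt" is an assumption, chosen to resemble a forest tree.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X, y)
    with open(path, "wb") as fh:
        pickle.dump(tree, fh)


def load_trees(paths):
    """Load previously pickled trees (the 'final script' step)."""
    trees = []
    for path in paths:
        with open(path, "rb") as fh:
            trees.append(pickle.load(fh))
    return trees


def forest_predict(trees, X):
    """Average per-tree class probabilities and pick the most likely class.

    Assumes all trees saw the same set of classes during training.
    """
    proba = np.mean([tree.predict_proba(X) for tree in trees], axis=0)
    return trees[0].classes_[np.argmax(proba, axis=1)]
```

In the setup described above, each call to `build_and_dump_tree` would live in its own script invocation launched from the shell, with a different data subset per call; only `load_trees` and `forest_predict` would run in the final process.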

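The sklearn implementation gets around the memory issue in a way that I believe has something to do with the _parallel_build_tree method, since that is the only respect in which the custom implementation differs. I'm posting my workaround as an answer, but if someone in the future can enlighten me about the underlying cause, I'd appreciate it.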

Manual tree fitting memory consumption in sklearn