Loading data bigger than the memory size in h2o

Asked: 2015-12-04 07:10:30

Tags: java r garbage-collection out-of-memory h2o

I am trying to load data bigger than the memory size in h2o.

The H2O blog mentions: A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., you’re using more Big Data than physical DRAM. We won’t die with a GC death-spiral, but we will degrade to out-of-core speeds. We’ll go as fast as the disk will allow. I’ve personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.

Here is the R code used to connect h2o 3.6.0.8 to R:

h2o.init(max_mem_size = '60m') # allotting 60mb for h2o, R is running on an 8GB RAM machine

which gives:

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

.Successfully connected to http://127.0.0.1:54321/ 

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 561 milliseconds 
    H2O cluster version:        3.6.0.8 
    H2O cluster name:           H2O_started_from_R_RILITS-HWLTP_tkn816 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   0.06 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 

Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)

IP Address: 127.0.0.1 
Port      : 54321 
Session ID: _sid_b2e0af0f0c62cd64a8fcdee65b244d75 
Key Count : 3

I tried to load a 169 MB csv into h2o:

dat.hex <- h2o.importFile('dat.csv')

which threw an error indicating that H2O had run out of memory.
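For context, how much heap the cluster actually has can be double-checked from R before an import; a minimal sketch, assuming the instance above is still listening on the default 127.0.0.1:54321:

library(h2o)
h2o.init()          # with no arguments, connects to the running local instance
h2o.clusterInfo()   # prints the "H2O cluster total memory" line shown above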

Question: If H2O promises to load data sets bigger than its memory capacity (via the swap-to-disk mechanism described in the blog quoted above), is this the right way to load the data?

1 Answer:

Answer 0 (score: 5):

Swap-to-disk is disabled by default, because the performance is so bad. The bleeding edge (not the latest stable) has a flag to enable it: "-cleaner" (for "memory cleaner").

Note that your cluster has an extremely tiny memory: H2O cluster total memory: 0.06 GB. That's 60 MB! Barely enough to start a JVM with, much less run H2O. I would be surprised if H2O could come up properly there at all, never mind swap-to-disk. Swapping is limited to swapping the data alone.

If you are trying to do a swap test, upgrade the JVM to 1 or 2 gigs of RAM, and then load data sets whose sum is more than that.
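A minimal sketch of following the answer's advice from R (an assumption here: the "-cleaner" flag mentioned above applies to launching a standalone bleeding-edge h2o.jar and is not exposed through h2o.init(), so this sketch only restarts with a realistic heap):

library(h2o)
h2o.shutdown(prompt = FALSE)          # stop the 60 MB instance from the question
h2o.init(max_mem_size = '2g',         # a 1-2 GB heap, as the answer suggests
         nthreads = -1)               # use all CPUs, per the startup note above
dat.hex <- h2o.importFile('dat.csv')  # the 169 MB csv now fits in memory

With roughly 2 GB of heap the 169 MB csv loads without any swapping; to actually exercise swap-to-disk, the answer's point is that the data sets loaded must sum to more than the heap.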