I am trying to train a machine learning model with H2O (3.14). My dataset is 4 GB, and my computer has 2 GB of RAM, 2 GB of swap, and JDK 1.8. According to this article, H2O can process a large dataset with 2 GB of RAM:
- A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., you're using more Big Data than physical DRAM. We won't die with a GC death-spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.
A few questions around this problem:
1. I configured the Java heap with the option java -Xmx10g -jar h2o.jar. H2O starts and begins loading the dataset, but the JVM then consumes all of the RAM and swap, and the OS kills the java h2o process.
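For reference, asking for a 10 GB heap on a machine with 2 GB of physical RAM forces the OS itself to swap the JVM, which is what invites the OOM killer. A minimal sketch of capping the heap from the Python h2o client instead (the file path and sizes are assumptions; a 4 GB dataset still will not fit in a 1 GB heap, so this alone is not a fix):

    import h2o

    # Keep the H2O heap below physical RAM so the OS never has to swap the JVM.
    h2o.init(max_mem_size="1G", nthreads=2)  # ~half of the 2 GB of physical RAM
    frame = h2o.import_file("dataset.csv")   # assumed path; still OOMs if data > heap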
2. I installed H2O on Spark (Sparkling Water). I can load the dataset, but Spark hangs with the following log, with the swap memory full:
09-01 02:01:12.377 192.168.233.133:54321 6965 Thread-47 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.377 192.168.233.133:54321 6965 Thread-48 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.381 192.168.233.133:54321 6965 Thread-45 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.3 MB + FREE:426.7 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
09-01 02:01:12.382 192.168.233.133:54321 6965 Thread-46 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=840.9 MB OOM!
09-01 02:01:12.384 192.168.233.133:54321 6965 #e Thread WARN: Swapping! GC CALLBACK, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=802.7 MB OOM!
09-01 02:01:12.867 192.168.233.133:54321 6965 FJ-3-1 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=1.03 GB OOM!
09-01 02:01:13.376 192.168.233.133:54321 6965 Thread-46 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
09-01 02:01:13.934 192.168.233.133:54321 6965 Thread-45 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.867 192.168.233.133:54321 6965 #e Thread WARN: Swapping! GC CALLBACK, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
In this case, I think the GC is waiting for some unused memory in swap to be cleared.
How can I handle a large dataset with this limited amount of RAM?
Answer 0 (score: 1)
If this is in any way commercial, buy more RAM, or pay a few dollars to rent a few hours on a cloud server.
This is because the extra time and effort to do machine learning on a machine that is too small is just not worth it.
If it is a learning project, with no budget at all: cut the data set into 8 equal-sized parts (*), and just use the first part to make and tune your models. (If the data is not randomly ordered, cut it in 32 equal parts, and then concatenate parts 1, 9, 17 and 25; or something like that.)
If you really, really, really must build a model using the whole data set, then still do the above. Then save the model and move on to the 2nd of your 8 data sets. You will already have tuned hyperparameters by this point, so you are just generating a model, and it will be quick. Repeat for parts 3 to 8. Now you have 8 models, and can use them in an ensemble (a sketch of this workflow follows the footnote below).
*: I chose 8, which gives you a 0.5GB data set, which is a quarter of available memory. For the early experiments I'd actually recommend going even smaller, e.g. 50MB, as it will make the iterations so much quicker.
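A minimal sketch of that split-and-ensemble workflow with the Python h2o client (file names, the target column, and the choice of GBM are assumptions; round-robin row assignment covers the non-randomly-ordered case without needing the 32-part trick):

    # Step 1: stream the 4 GB CSV into 8 equal parts without loading it into RAM.
    N_PARTS = 8
    with open("dataset.csv") as src:                        # assumed file name
        header = src.readline()
        parts = [open(f"part_{i}.csv", "w") for i in range(N_PARTS)]
        for p in parts:
            p.write(header)                                 # repeat header in every part
        for i, row in enumerate(src):
            parts[i % N_PARTS].write(row)                   # round-robin row assignment
        for p in parts:
            p.close()

    # Step 2: tune hyperparameters on part 0 only, then fit one model per part
    # and average the predictions as a simple ensemble.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator  # assumed algorithm

    h2o.init(max_mem_size="1G")
    models = []
    for i in range(N_PARTS):
        frame = h2o.import_file(f"part_{i}.csv")
        gbm = H2OGradientBoostingEstimator()        # use hyperparameters tuned on part 0
        gbm.train(y="target", training_frame=frame) # assumed target column
        models.append(gbm)
        h2o.remove(frame)                           # free cluster memory before the next part

    # Average the predictions (regression case; for classification, average
    # the per-class probability columns instead).
    test = h2o.import_file("test.csv")              # assumed file name
    preds = models[0].predict(test)
    for m in models[1:]:
        preds = preds + m.predict(test)
    preds = preds / len(models)

Training on one part at a time and removing each frame afterwards keeps only a single 0.5 GB part in the cluster at any moment.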
Answer 1 (score: 1)
The cited article is from 2014 and is years out of date; it refers to H2O-2. Back then, H2O's user-mode swap-to-disk concept was experimental.
It was never supported in H2O-3 (which became the main H2O code base in early 2015) because, as the referenced StackOverflow post explains, the performance was bad.