I first import the dataset from a CSV file into Spark, perform some transformations in Spark, and then try to convert it to an H2O frame. Here is my code:
library(rsparkling)
library(h2o)
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
data <- spark_read_csv(sc, "some_data", paste(path, file_name, sep = ""),
                       memory = TRUE, infer_schema = TRUE)
data_h2o <- as_h2o_frame(sc, data)
The CSV file is about 750 MB. The last line runs for a long time and then fails with the following message:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task
3 in stage 10.0 failed 1 times, most recent failure: Lost task 3.0 in stage 10.0
(TID 44, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
I have 16 GB of RAM, and I can read the dataset directly into H2O without any problem.
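For comparison, this is roughly what the direct H2O import looks like; a minimal sketch, assuming the same path and file_name variables as above (direct_frame is just an illustrative name):

library(h2o)
h2o.init()                                                         # start (or attach to) a local H2O cluster
direct_frame <- h2o.importFile(paste(path, file_name, sep = ""))   # parse the CSV straight into an H2O frame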
Here is part of the log file:
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_2 in memory! (computed 32.7 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
18/11/06 09:46:45 INFO CodeGenerator: Code generated in 92.700007 ms
18/11/06 09:46:45 INFO MemoryStore: Will not store rdd_16_0
18/11/06 09:46:45 INFO BlockManager: Found block rdd_16_2 locally
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_3 in memory! (computed 32.8 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
18/11/06 09:46:45 INFO BlockManager: Found block rdd_16_3 locally
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_0 in memory! (computed 32.6 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
18/11/06 09:46:45 INFO BlockManager: Found block rdd_16_0 locally
18/11/06 09:46:45 INFO CodeGenerator: Code generated in 21.519354 ms
18/11/06 09:46:45 INFO MemoryStore: Will not store rdd_16_5
18/11/06 09:46:45 WARN MemoryStore: Not enough space to cache rdd_16_5 in memory! (computed 63.6 MB so far)
18/11/06 09:46:45 INFO MemoryStore: Memory use = 272.0 MB (blocks) + 57.9 MB (scratch space shared across 4 tasks(s)) = 329.8 MB. Storage limit = 366.3 MB.
Answer 0 (score: 0)
I just found the solution to this problem: simply increase the memory Spark is allowed to use. The "Storage limit = 366.3 MB" lines in the log suggest the driver was still running with Spark's default heap, which is far too small for a 750 MB CSV plus the conversion to an H2O frame.
conf <- spark_config()                        # start from the default sparklyr configuration
conf$`sparklyr.cores.local` <- 4              # number of local cores to use
conf$`sparklyr.shell.driver-memory` <- "16G"  # heap size for the driver JVM
conf$spark.memory.fraction <- 0.9             # share of the heap Spark may use for execution and storage
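Note that these settings only take effect when a new connection is created with them, so pass config = conf to spark_connect before reading the data again. A minimal sketch of the full sequence, reusing the names from the question:

sc <- spark_connect(master = "local", config = conf)  # reconnect so the 16G driver heap is actually used
data <- spark_read_csv(sc, "some_data", paste(path, file_name, sep = ""),
                       memory = TRUE, infer_schema = TRUE)
data_h2o <- as_h2o_frame(sc, data)                    # the conversion that previously ran out of heap

spark.memory.fraction raises the share of the JVM heap that Spark's unified memory manager may use for execution and storage (the default is 0.6).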