After calling logistic regression with SparklyR and Spark 2.0.2, I run into the following error in Spark.
ml_logistic_regression(Data, ml_formula)
The dataset I read into Spark is relatively large (2.2 GB). Here is the error message:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task
13 in stage 64.0 failed 1 times, most recent failure:
Lost task 13.0 in stage 64.0 (TID 1132, localhost):
java.util.concurrent.ExecutionException:
java.lang.Exception:
failed to compile: org.codehaus.janino.JaninoRuntimeException:
Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z"
of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate"
grows beyond 64 KB
Others have run into a similar problem: https://github.com/rstudio/sparklyr/issues/298, but I have not been able to find a solution. Any ideas?
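For context, here is roughly how the call is made (a minimal sketch with simplified names; `sc`, `Data`, and `ml_formula` are stand-ins for my actual objects):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")                          # Spark 2.0.2 connection
Data <- spark_read_csv(sc, name = "Data", path = "data.csv")   # the ~2.2 GB dataset
ml_formula <- formula(response ~ .)                            # response against all predictors
ml_logistic_regression(Data, ml_formula)                       # this is the call that fails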
Answer 0 (score: 1)
What happens when you subset the data and try to run the model (a quick sketch of that follows the configuration code below)? You may need to change your configuration settings to handle the size of the data:
library(dplyr)
library(sparklyr)

# configure the Spark session and connect
config <- spark_config()
config$`sparklyr.shell.driver-memory`   <- "XXG"  # change depending on the size of the data
config$`sparklyr.shell.executor-memory` <- "XXG"
sc <- spark_connect(master = 'yarn-client', spark_home = '/XXXX/XXXX/XXXX', config = config)
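As for subsetting, a quick sketch using sdf_sample (the fraction and seed are arbitrary; `Data` and `ml_formula` are the objects from the question):

# try the model on a random 10% sample first to see whether data size is the problem
Data_sample <- sdf_sample(Data, fraction = 0.1, replacement = FALSE, seed = 42)
ml_logistic_regression(Data_sample, ml_formula)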
There are other settings in spark_config() that you can change to address performance; these are just a couple of examples.
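For instance, a few other commonly adjusted properties (illustrative values only, so tune them to your cluster; as far as I know, `spark.`-prefixed entries in the config list are passed through to Spark as runtime properties):

# illustrative extras -- set before calling spark_connect()
config$spark.executor.cores       <- 4     # cores per executor
config$spark.executor.instances   <- 8     # number of executors when running on YARN
config$spark.driver.maxResultSize <- "2g"  # cap on results collected back to the driver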