I am new to SparkR (and to parallelization in general). I am running SparkR locally (I know that is not the proper way to use Spark, but I am just getting started) and I have tried to rewrite part of my code with SparkR. collect gives the following error once I increase the number of samples (there is no error for a small number of samples):
Error in unserialize(obj) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: assetForecast ... convertJListToRList -> lapply -> lapply -> FUN -> unserialize
Execution halted
And another error, probably because I am running out of memory, is:
a heap memory error (increasing the JVM memory and the driver memory did not help)
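For reference, this is roughly how I tried to raise the memory settings (assuming sparkR.init in my SparkR build accepts a sparkEnvir list of Spark properties; it did not make the error go away):
sc <- sparkR.init(master = "local",
                  sparkEnvir = list(spark.driver.memory = "4g",
                                    spark.executor.memory = "4g"))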
I would appreciate any help regarding the FIRST error (I posted the second one since I think they might be related somehow, even though I get them by setting different values for numSlices in parallelize). I suspect the first one is a version incompatibility between Spark, SparkR and R that causes this serialization problem. I tried installing different versions, but quickly ran into dependency problems.
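As a quick sanity check on the version theory, I can at least compare the R used by the driver with the Rscript the worker processes would launch (assuming the workers simply pick up whatever Rscript is first on the PATH):
R.version.string          # R version used by the driver
Sys.which("Rscript")      # Rscript that worker processes would start
packageVersion("SparkR")  # installed SparkR package version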
Here is a sample script that simulates what I am doing in SparkR (it produces the error for input.len > 950):
library(SparkR) # load sparkR library
sc <- sparkR.init() ## initialize the sparkR
input.len <- 8000 # size of the input
num.slice <- 2 # number of slices for parallelize function
## Define a few functions to simulate actual calculations
latemail <- function(N, st="2012/01/01", et="2015/12/31") {
  ## create random dates of length N
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et, st, unit="sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}
encode <- function(ele1, ele2) {
  ## concatenate ele1 and ele2, separated by %
  return (paste(toString(ele1), toString(ele2), sep = "%"))
}
decode <- function(coded) {
  ## separate the input string at %
  idx <- regexpr("%", coded)[1]
  ele1 <- as.numeric(substr(coded, 1, idx - 1))
  ele2 <- substr(coded, idx + 1, nchar(coded))
  return (list(ele1, ele2))
}
fakeFun <- function(asset.age, asset.year) {
  ## fake function to simulate my actual function
  return (as.list(rep(asset.age, 10)))
}
wrapperFun <- function(x) {
  asset.age <- decode(x)[[1]]
  asset.y <- decode(x)[[2]]  # second element of decode holds the date part
  df <- fakeFun(asset.age, asset.y)
  return (df)
}
## Start of calculations with SparkR
calc.ts <- latemail(input.len) ## create fake years
asset.ages <- runif(input.len) * 10 ## create fake ages
paired <- list()
for (i in 1:length(asset.ages)) {
  ## keep information of both years and ages in one vector
  ## using the encode function
  paired[[length(paired) + 1]] <- encode(asset.ages[[i]], calc.ts[[i]])
}
rdd.paired <- parallelize(sc, paired, numSlices = num.slice)
rdd.df <- lapply(rdd.paired, wrapperFun)
rdd.list <- collect(rdd.df)
print(rdd.list)
sparkR.stop()
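For what it is worth, the non-Spark equivalent of the last three calls is a plain lapply over the same list, which runs fine here for any input.len, so the helper functions themselves do not seem to be the issue:
plain.list <- lapply(paired, wrapperFun)  # same computation, no SparkR involved
print(length(plain.list))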
Here is the full report of the errors:
For numSlices = 5 in the parallelize function:
> rdd.list <- collect(rdd.df)
15/07/22 17:20:40 INFO RRDD: Times: boot = 0.434 s, init = 0.015 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.200 s, write-output = 0.004 s, total = 0.656 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.017 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.193 s, write-output = 0.004 s, total = 0.227 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.013 s, broadcast = 0.001 s, read-input = 0.002 s, compute = 0.191 s, write-output = 0.003 s, total = 0.220 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.010 s, init = 0.011 s, broadcast = 0.000 s, read-input = 0.002 s, compute = 0.191 s, write-output = 0.004 s, total = 0.218 s
15/07/22 17:20:41 INFO RRDD: Times: boot = 0.014 s, init = 0.015 s, broadcast = 0.000 s, read-input = 0.003 s, compute = 0.213 s, write-output = 0.004 s, total = 0.249 s
Error in unserialize(obj) :
ReadItem: unknown type 0, perhaps written by later version of R
Calls: collect ... convertJListToRList -> lapply -> lapply -> FUN -> unserialize
Execution halted
For numSlices = 6 in the parallelize function:
15/07/22 17:18:52 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): java.lang.OutOfMemoryError: Java heap space
edu.berkeley.cs.amplab.sparkr.RRDD.readData(RRDD.scala:258)
edu.berkeley.cs.amplab.sparkr.RRDD.readData(RRDD.scala:243)
edu.berkeley.cs.amplab.sparkr.BaseRRDD.read(RRDD.scala:200)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.next(RRDD.scala:70)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.foreach(RRDD.scala:66)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.to(RRDD.scala:66)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.toBuffer(RRDD.scala:66)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
edu.berkeley.cs.amplab.sparkr.BaseRRDD$$anon$1.toArray(RRDD.scala:66)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/07/22 17:18:52 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
Error in readTypedObject(con, type) :
Unsupported type for deserialization
Calls: collect ... callJMethod -> invokeJava -> readObject -> readTypedObject
Execution halted
Is there really a problem with my SparkR installation? If so, how come it works for a small number of samples?
Thanks a lot
Answer 0 (score: 0)
The following is how it works (or should work) in Spark-1.4.0. First initialize an sqlContext:
sqlContext <- sparkRSQL.init(sc)
Then change your code, starting from

paired <- list()

into:
# Create a vector instead of a list
paired <- c()
for (i in 1:length(asset.ages)) {
  ## keep information of both years and ages in one vector
  ## using the encode function
  paired[length(paired) + 1] <- encode(asset.ages[[i]], calc.ts[[i]])
}
# What you actually need is a data.frame or SparkR DataFrame
paired.data.frame <- data.frame(paired=paired)
paired.DataFrame <- createDataFrame(sqlContext, paired.data.frame)
# Map function returns an RDD which you can not collect yet
# Therefore convert it to a DataFrame again
paired.df <- createDataFrame(sqlContext, map(paired.DataFrame,wrapperFun))
# This DataFrame you can collect
paired.result <- collect(paired.df)
Why did I say "should work" in my first sentence? It worked when I ran it on my laptop, but I changed the SparkR source code to make map available. I don't know how to solve this in SparkR 1.2, but since SparkR has been integrated into Spark, the advice would be to switch to Spark-1.4.0 anyway.
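If you end up on Spark-1.4.0 and still want the original parallelize/lapply/collect pipeline instead of DataFrames, a minimal sketch (untested here, and assuming the RDD functions are still shipped as non-exported functions in the SparkR namespace; they are not part of the public API and may change between releases) would be:

rdd.paired <- SparkR:::parallelize(sc, paired, numSlices = num.slice)  # private RDD API
rdd.df <- SparkR:::lapply(rdd.paired, wrapperFun)
rdd.list <- SparkR:::collect(rdd.df)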