I am reading records that are strings of more than 100k bytes each and splitting each record into columns by fixed widths, which gives me close to 16K columns per record.
To write the result out as Parquet I am using the following code:
import spark.implicits._

val rdd1 = spark.sparkContext.textFile("file1")

// Split one fixed-width record into its column values
def substrString(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  for (k <- 0 until colLength.length) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// ColLengthSeq holds the column widths, read from a separate schema file
val StringArray = rdd1.map(substrString(_, ColLengthSeq))

StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j).as(column_seq(j))): _*)
  .write.mode("overwrite").parquet("c:/home/")
Here ColCount = 16000 and column_seq is a Seq[String] holding the 16K column names.
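For reference, this is roughly how column_seq, ColLengthSeq and ColCount are built from the schema file; a minimal sketch, assuming one "name,width" pair per line and the placeholder file name schema_file.csv (the real schema file may look different):

import scala.io.Source

// Assumed layout: each line of the schema file is "columnName,width"
val schemaLines = Source.fromFile("schema_file.csv").getLines().toSeq
val column_seq: Seq[String] = schemaLines.map(_.split(",")(0).trim)       // ~16K column names
val ColLengthSeq: Seq[Int] = schemaLines.map(_.split(",")(1).trim.toInt)  // matching column widths
val ColCount = column_seq.length                                          // 16000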
I am running this on YARN with 20 executors and 16GB of executor memory each.
The input file is 4GB.
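I submit the job roughly like this (the executor count and memory match the setup above; deploy mode, the class name and the jar name below are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 20 \
  --executor-memory 16G \
  --class FixedWidthToParquet \
  fixed-width-to-parquet.jar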
I am getting the following error:
Lost task 113.0 in stage 0.0 (TID 461, gsta32512.foo.com): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason:
Container marked as failed:
container_e05_1472185459203_255575_01_000183 on host: gsta32512.foo.com. Exit status: 143. Diagnostics:
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
When I check the task status in the Spark UI, it shows

java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: GC overhead limit exceeded
Please advise on performance tuning for the above code and on optimizing the spark-submit parameters.