When I use spark-submit to run a SparkR job on a remote Spark cluster that uses YARN, the job appears to succeed, but the logs show that tasks fail while attempting to write the result DataFrame to a JSON file on HDFS. If the SparkR script is run directly from a gateway machine in the cluster, the job succeeds and the data is written to the file.
The relevant warning is:
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 8.0 (TID 16, hadoop5.lavastorm.com): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:251)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: scala.MatchError: [1,null,null,[2.62]] (of class org.apache.spark.sql.catalyst.expressions.GenericMutableRow)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:194)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:103)
at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126)
at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89)
at org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133)
at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:185)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:243)
... 8 more
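For what it's worth, the MatchError above originates in VectorUDT.serialize, which suggests that the predicted DataFrame (produced by the script below) carries an MLlib vector column that the JSON writer then has to serialize. A minimal sketch of a workaround along those lines, keeping only scalar columns so the writer never touches a vector value, would be the following; the column names are assumptions based on the mtcars data and the glm formula, not something confirmed from the logs:

# Sketch only: select the scalar columns before writing to JSON so the
# writer never serializes a VectorUDT value. Column names are assumed.
scalarDF <- select(predicted, "mpg", "wt", "prediction")
write.df(scalarDF, targetFP, "json", mode = "overwrite")

I have not verified this, and it would not explain why the same write succeeds when the script is run from the gateway machine.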
There is one difference between the two cases earlier in the logs: when the job is submitted remotely, two users appear in the ACLs, whereas when the job is submitted locally the yarn user is not in the ACLs:
INFO spark.SecurityManager: Changing view acls to: yarn,lavastorm
INFO spark.SecurityManager: Changing modify acls to: yarn,lavastorm
Thanks in advance for any pointers on the root cause.
// Update: the following additional log text shows that the job ultimately fails:
ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 4 times, most recent failure: Lost task 1.3 in stage 8.0 (TID 23, hadoop5.lavastorm.com): org.apache.spark.SparkException: Task failed while writing rows.
The directory for the JSON result file is created, but no 'part-r-0000..' files (or a '_SUCCESS' file) are created inside it.
The SparkR script being submitted is:
# Point SparkR at the cluster's Hadoop/YARN configuration and JVM
Sys.setenv(HADOOP_CONF_DIR='/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR='/etc/hadoop/conf')
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-7-oracle-cloudera")
.libPaths(c("/usr/lib/spark/R/lib", .libPaths()))
library(SparkR)
# Initialise the Spark context
sc <- sparkR.init(appName = 'SparkR-mtcars-Predict-LinReg',
                  sparkEnvir = list(spark.driver.memory = "2g",
                                    spark.executor.cores = '2',
                                    spark.executor.instances = '2'))
sqlContext <- sparkRSQL.init(sc)
# Training data and unseen (to-be-scored) data, both JSON files on HDFS
trainFP <- "hdfs://hadoop5.lavastorm.com:8020/user/lavastorm/mtcars.json"
UnseenFP <- "hdfs://hadoop5.lavastorm.com:8020/user/lavastorm/mtcars_wt.json"
trainDF <- jsonFile(sqlContext, trainFP)
UnseenDF <- jsonFile(sqlContext, UnseenFP)
# Fit a linear model of mpg on wt, then score the unseen data
model <- glm(mpg ~ wt, family = "gaussian", trainDF)
predicted <- predict(model, UnseenDF)
# Write the predictions to HDFS as JSON (this is the step that fails)
targetFP <- "hdfs://hadoop5.lavastorm.com:8020/user/lavastorm/mtcars_predictions.json"
write.df(predicted, targetFP, "json", mode = "overwrite")
sparkR.stop()
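Since the trace implicates vector serialization, one check that can be run from the gateway machine (where the job succeeds) is to print the schema of the predicted DataFrame and see whether predict() has added a vector-typed column; the ML pipeline's default name for it would be "features", though that is an assumption here:

# Inspect the schema of the predicted DataFrame; a VectorUDT column here
# would have to be serialized by the JSON writer at the failing
# write.df() step above.
printSchema(predicted)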