我在Spark 1.6上有多个线程写入同一配置单元表(使用木地板文件),当它们尝试同时写入时,在将写入文件重命名为HDFS的部分期间,它提示错误。我正在寻找解决方案来绕过这个已知的Spark问题。
class MyThread extends Runnable {
def run {
//some code
myTable.write.format("parquet").mode("append")
.saveAsTable("hdfstable")
//some code
}
}
Executors.defaultThreadFactory().newThread(new MyThread).start()
我收到此错误:
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:189)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:239)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:221)
at fr.neolink.spark.streaming.StreamingNeo$.algo(StreamingNeo.scala:837)
at fr.neolink.spark.streaming.StreamingNeo$$anonfun$main$3$$anonfun$apply$18$MyThread$1.run(StreamingNeo.scala:374)
at java.lang.Thread.run(Thread.java:748)
由引起:
java.io.IOException: Failed to rename
FileStatus{path=hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/_temporary/0/task_201812281010_1770_m_000000/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet;
isDirectory=false; length=4608; replication=3; blocksize=134217728; modification_time=1545988247575;
access_time=1545988247494; owner=owner; group=hive; permission=rw-r--r--; isSymlink=false}
to hdfs://my_hdfs_master/user/hive/warehouse/MYDB.db/hdfstable/part-r-00000-9a70cbea-d105-4f50-ba1b-372f555906ce.gz.parquet
我在jira上发现了此问题:https://issues.apache.org/jira/browse/SPARK-18626
有没有办法使书写部分的线程安全?使一个接一个地执行?
谢谢。
答案 0 :(得分:0)
解决方案
像下面一样使用this.synchronized{}
class MyThread extends Runnable{
def run{
//some code
this.synchronized{
myTable.write.format("parquet").mode("append")
.saveAsTable("hdfstable")
}
//some code
}
}
Executors.defaultThreadFactory().newThread(new MyThread).start()