Spark streaming in Azure - error "_spark_metadata/0 doesn't exist when compacting batch 9" when writing to a Parquet sink

Asked: 2019-08-14 12:24:06

Tags: azure-storage azure-data-lake spark-structured-streaming azure-databricks spark-checkpoint

We are running a Spark Structured Streaming job on Azure Databricks that continuously writes data as Parquet into Azure Data Lake gen2 storage. We also use Data Lake gen2 for the checkpoints.

However, when we run the following query

df_exponea
    .writeStream
    .format("parquet")
    .partitionBy("date", "hour")
    .outputMode("append") 
    .option("checkpointLocation", "abfss://filesystem@azureacc.dfs.core.windows.net/Data/StreamCheckpoints/query_dtl")
    .option("path", "abfss://filesystem@azureacc.dfs.core.windows.net/Data/Export/query_dtl")
    .start()

we sometimes get the following error while it is running:

at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:131)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:568)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:566)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:565)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:207)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:296)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
Caused by: java.lang.IllegalStateException: abfss://filesystem@azureacc.dfs.core.windows.net/Data/Export/query_dtl/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$3$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:182)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$3$$anonfun$apply$1.apply(CompactibleFileStreamLog.scala:182)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$3.apply(CompactibleFileStreamLog.scala:181)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog$$anonfun$3.apply(CompactibleFileStreamLog.scala:180)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.compact(CompactibleFileStreamLog.scala:180)
    at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.add(CompactibleFileStreamLog.scala:164)
    at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:64)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)

My question: how can I keep this job running without the metadata failure?

There is a similar question here: Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist, but there the underlying storage is S3 rather than Azure Data Lake gen2, so some of the suggestions/answers are AWS-specific.

EDIT - we manually added _spark_metadata/0 to the target location and the job has been running fine so far, but I am a bit worried that this is an error-prone hack.
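For illustration, a minimal sketch of what such a manual check/recreation could look like from a notebook. The paths are the ones from the query above; writing a placeholder file containing only the "v1" version header is an assumption about the v1 file-sink log layout (version line followed by JSON entries), and it is exactly the kind of error-prone hack described above, not a supported fix.

// Minimal sketch, assuming a Databricks/spark-shell session where `spark` is defined.
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.Path

val sinkMetadataDir = new Path(
  "abfss://filesystem@azureacc.dfs.core.windows.net/Data/Export/query_dtl/_spark_metadata")
val fs = sinkMetadataDir.getFileSystem(spark.sessionState.newHadoopConf())

// See which batch files (0, 1, ..., 9.compact, ...) the sink log actually contains.
if (fs.exists(sinkMetadataDir))
  fs.listStatus(sinkMetadataDir).foreach(s => println(s.getPath.getName))

// Recreate a missing batch-0 file as an empty log entry (version header only).
// The "v1"-only content is an assumption about the sink-log format.
val batch0 = new Path(sinkMetadataDir, "0")
if (!fs.exists(batch0)) {
  val out = fs.create(batch0, false)   // overwrite = false
  try out.write("v1".getBytes(StandardCharsets.UTF_8))
  finally out.close()
}

This is only a stop-gap: whether an empty batch-0 entry is semantically correct depends on what batch 0 originally committed.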

0 Answers:

No answers yet.