My Structured Streaming application writes to Parquet, and I want to get rid of the _spark_metadata folder it creates. I used the property below, and it seemed to work:
--conf "spark.hadoop.parquet.enable.summary-metadata=false"
When the application starts, the _spark_metadata folder is not generated. But once it goes into the RUNNING state and begins processing messages, it fails with the error below, complaining that the _spark_metadata folder does not exist. It seems Structured Streaming depends on this folder and cannot run without it. I am just wondering whether it makes sense to disable the metadata property in this scenario. Is this a bug where the stream does not honor the conf?
Caused by: java.io.FileNotFoundException: File /_spark_metadata does not exist.
at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:261)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1765)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1761)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1761)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1685)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.list(HDFSMetadataLog.scala:370)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:231)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:99)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
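For context, the sink side of the job looks roughly like the sketch below (the Kafka source, broker, topic, and paths are placeholder assumptions, not the exact job). Note that the file sink's _spark_metadata log is written by FileStreamSink/HDFSMetadataLog, the same classes that appear in the stack trace above, and is separate from Parquet's summary metadata, which is what parquet.enable.summary-metadata controls.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("parquet-sink").getOrCreate()

// Hypothetical Kafka source; broker address and topic are placeholders.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// File-based sinks (parquet included) maintain a transaction log under
// <path>/_spark_metadata regardless of the Parquet summary-metadata setting.
val query = input.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/output/parquet")
  .option("checkpointLocation", "/checkpoints/my-stream")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()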
Answer 0 (score: 0)
This happened because the Kafka checkpoint folder was not cleaned up. The files inside the Kafka checkpoint cross-referenced the Spark metadata files, which caused the failure. Once I deleted both of them, it started working.
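A minimal sketch of that cleanup, assuming hypothetical checkpoint and output locations (substitute your own paths). Be aware that deleting the checkpoint discards the stream's progress, so the job will reprocess from the source's configured starting offsets.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cleanup").getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Hypothetical locations -- replace with your actual paths.
fs.delete(new Path("/checkpoints/my-stream"), true)          // streaming checkpoint (offsets, commits)
fs.delete(new Path("/output/parquet/_spark_metadata"), true) // file sink's metadata log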