Question

We are building a streaming platform where running SQL against each micro-batch is essential.

val query = streamingDataSet.writeStream.option("checkpointLocation", checkPointLocation).foreachBatch { (df, batchId) => {

      df.createOrReplaceTempView("events")

      val df1 = ExecutionContext.getSparkSession.sql("select * from events")

      df1.limit(5).show()
      // More complex processing on dataframes

    }}.trigger(trigger).outputMode(outputMode).start()

query.awaitTermination()

The error thrown is:

org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';

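The streaming source is Kafka with watermarking. Without Spark SQL, we are able to run DataFrame transformations just fine. The Spark version is 2.4.0 and Scala is 2.11.7. The trigger is processing time every 1 minute, and the output mode is append.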

Is there any other way to make Spark SQL usable inside foreachBatch? Would it work with a newer version of Spark? If so, which version should we upgrade to? Please help. Thanks.

Answer 1

tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.

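The reason for the StreamingQueryException is that the streaming query tries to look up the events temporary view in a SparkSession (the one returned by ExecutionContext.getSparkSession) that knows nothing about it.

The only SparkSession in which the events temporary view is registered is the one the df DataFrame was created in, i.e. df.sparkSession.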
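As a minimal sketch of that fix, the handler below registers the view and runs the SQL through df.sparkSession, so both happen in the same session. Several things here are assumptions for illustration and not from the question: the object name ForeachBatchSqlSketch, the helper names queryBatch and showBatch, the built-in rate source standing in for Kafka, the local master, the 10-second trigger, and the checkpoint path.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object ForeachBatchSqlSketch {

  // Registers the micro-batch as the "events" temp view and queries it through
  // the session that owns the batch DataFrame (df.sparkSession), which is the
  // session where the view is actually visible.
  def queryBatch(df: DataFrame, batchId: Long): DataFrame = {
    df.createOrReplaceTempView("events")
    df.sparkSession.sql("select * from events")
  }

  // Unit-returning handler passed to foreachBatch.
  def showBatch(df: DataFrame, batchId: Long): Unit =
    queryBatch(df, batchId).limit(5).show()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")                  // assumption: local run for the demo
      .appName("foreachBatch-sql-sketch")
      .getOrCreate()

    // Assumption: the built-in "rate" source replaces the Kafka source here.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = stream.writeStream
      .option("checkpointLocation", "/tmp/foreachBatch-sql-sketch") // hypothetical path
      .foreachBatch(showBatch _)
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Passing a named method (showBatch _) instead of an inline lambda is a design choice here, not a requirement of the fix: it keeps the handler testable against an ordinary batch DataFrame.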

How to use temporary tables in foreachBatch?
