Unable to process data with Spark continuous streaming

Date: 2020-09-29 10:41:29

Tags: apache-spark pyspark apache-kafka spark-structured-streaming

I am developing a real-time streaming application that polls data from a Kafka broker, and I am adapting code that previously used Spark Structured Streaming with the default micro-batch processing. However, I can't figure out how to get similar behavior using continuous streaming instead of micro-batch streaming. Here is the code that works:

query = df.writeStream \
        .foreachBatch(foreach_batch_func) \
        .start()
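For context, the question does not show `foreach_batch_func`; a minimal sketch of what such a micro-batch callback typically looks like (the output path is an assumption):

```python
# Hypothetical sketch of the micro-batch callback referenced above;
# the question does not show its body, so this is an assumption.
def foreach_batch_func(batch_df, batch_id):
    # Each micro-batch arrives as a regular DataFrame, so any batch
    # operation is allowed here, including aggregations and batch writes.
    batch_df.write \
        .format("parquet") \
        .mode("append") \
        .save("/tmp/stream-output")  # hypothetical output path
```

This flexibility is exactly what continuous mode gives up: `foreachBatch` (and `foreach`) sinks rely on the micro-batch execution model.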

And here is what I have tried so far for continuous streaming:

query = df \
        .writeStream \
        .foreach(example_func) \
        .trigger(continuous = '1 second') \
        .start()

The application raises the following error:

Continuous execution does not support task retry
        at org.apache.spark.sql.execution.streaming.continuous.ContinuousDataSourceRDD.compute(ContinuousDataSourceRDD.scala:76)

I am using Spark (PySpark) 3.0.1 with Scala 2.12, and Kafka 2.6.0.

When submitting the application, I add the jar org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1.

1 answer:

Answer 0 (score: 0)

The continuous processing mode in Spark Structured Streaming only works for certain query types.

According to the documentation on Continuous Processing, the following queries are supported, and yours does not appear to be among them:

As of Spark 2.4, only the following types of queries are supported in the continuous processing mode.

Operations: Only map-like Dataset/DataFrame operations are supported in continuous mode, that is, only projections (select, map, flatMap, mapPartitions, etc.) and selections (where, filter, etc.).
   All SQL functions are supported except aggregation functions (since aggregations are not yet supported), current_timestamp() and current_date() (deterministic computations using time are challenging).
Sources:
   Kafka source: All options are supported.
   Rate source: Good for testing. The only options supported in continuous mode are numPartitions and rowsPerSecond.
Sinks:
   Kafka sink: All options are supported.
   Memory sink: Good for debugging.
   Console sink: Good for debugging. All options are supported. Note that the console will print every checkpoint interval that you have specified in the continuous trigger.
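Based on the list above, a query that stays within continuous-mode limits uses only map-like operations (projections/selections) and a supported sink such as Kafka, instead of foreach(). A minimal sketch, assuming a local broker at localhost:9092 and the topic names input-topic/output-topic (not from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("continuous-example").getOrCreate()

# Kafka source: all options are supported in continuous mode.
# Broker address and topic names below are assumptions for illustration.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "input-topic") \
    .load()

# Only projections and selections are allowed: no aggregations,
# and no foreach/foreachBatch sinks.
transformed = df.select(col("key"), col("value")) \
    .where(col("value").isNotNull())

# Kafka sink with a continuous trigger (checkpoint path is assumed).
query = transformed.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "output-topic") \
    .option("checkpointLocation", "/tmp/continuous-checkpoint") \
    .trigger(continuous="1 second") \
    .start()
```

Running this sketch requires a live Kafka broker and the spark-sql-kafka package already mentioned in the question; the point is the shape of the query, not the exact endpoints.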