`continuous` trigger not found in Structured Streaming

Date: 2018-06-20 15:36:28

Tags: apache-spark spark-structured-streaming

Runtime: Spark 2.3.0, Scala 2.11 (Databricks 4.1 ML beta)

 
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

// Kafka settings and df definition go here

val query = df.writeStream.format("parquet")
  .option("path", ...)
  .option("checkpointLocation", ...)
  .trigger(continuous(30000))
  .outputMode(OutputMode.Append)
  .start

This throws the error: not found: value continuous

Other attempts that failed:

.trigger(continuous = "30 seconds") //as per Databricks blog
// throws same error as above

.trigger(Trigger.Continuous("1 second")) //as per Spark docs
// throws java.lang.IllegalStateException: Unknown type of trigger: ContinuousTrigger(1000)

References:

(Databricks blog) https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html

(Spark guide) http://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing

(Scaladoc) https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.streaming.package

4 Answers:

Answer 0 (score: 3)

Spark 2.3.0 does not support the parquet sink in continuous processing mode; you have to use a Kafka, console, or memory-based sink.

Quoting the continuous processing mode in structured streaming blog post:


You can set the optional Continuous Trigger in queries that satisfy the following conditions: read from supported sources like Kafka and write to supported sinks like Kafka, memory, console.
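A minimal sketch of a continuous query that stays within those constraints, using a Kafka source and the console sink (the bootstrap server and topic name are placeholders for your own setup):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

// Kafka is a supported source for continuous processing.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "events")                       // placeholder topic
  .load()

// Console is one of the sinks that supports continuous mode in 2.3.0.
// Note: the argument is the checkpoint interval, not a batch interval.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("30 seconds"))
  .start()
```

Swapping `"console"` for `"parquet"` here reproduces the asker's `Unknown type of trigger` exception, since the file sink does not implement the required interface.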

Answer 1 (score: 0)

Try using trigger(Trigger.ProcessingTime("1 second"))

This will work; I ran into the same problem and resolved it with this approach.
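This works because `ProcessingTime` keeps the query in micro-batch mode, which does support the parquet (file) sink. A sketch of the asker's query with that trigger (paths are placeholders):

```scala
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// Micro-batch execution supports the parquet sink, unlike continuous mode.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/stream-out")                 // placeholder path
  .option("checkpointLocation", "/tmp/checkpoints")  // placeholder path
  .trigger(Trigger.ProcessingTime("1 second"))
  .outputMode(OutputMode.Append)
  .start()
```

The trade-off is latency: micro-batching delivers results per batch rather than with the millisecond-level latency that continuous mode targets.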

Answer 2 (score: 0)

According to the Spark code below, only sinks that implement the StreamWriteSupport interface can be used with a ContinuousTrigger:

    (sink, trigger) match {
      case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
        UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
        new StreamingQueryWrapper(new ContinuousExecution(
          sparkSession,
          userSpecifiedName.orNull,
          checkpointLocation,
          analyzedPlan,
          v2Sink,
          trigger,
          triggerClock,
          outputMode,
          extraOptions,
          deleteCheckpointOnStop))
      case _ =>
        new StreamingQueryWrapper(new MicroBatchExecution(
          sparkSession,
          userSpecifiedName.orNull,
          checkpointLocation,
          analyzedPlan,
          sink,
          trigger,
          triggerClock,
          outputMode,
          extraOptions,
          deleteCheckpointOnStop))
    }

Only three sinks implement this interface: ConsoleSinkProvider, KafkaSourceProvider, and MemorySinkV2.

Answer 3 (score: 0)

In Spark 3.0.1, continuous processing mode is still experimental and supports only particular query types, depending on the source and sink.

According to the documentation on Continuous Processing, the following queries are supported, and writing parquet does not appear to be among them:

从Spark 2.4开始,连续处理模式仅支持以下类型的查询。

Operations: Only map-like Dataset/DataFrame operations are supported in continuous mode, that is, only projections (select, map, flatMap, mapPartitions, etc.) and selections (where, filter, etc.).
   All SQL functions are supported except aggregation functions (since aggregations are not yet supported), current_timestamp() and current_date() (deterministic computations using time is challenging).
Sources:
   Kafka source: All options are supported.
   Rate source: Good for testing. Only options that are supported in the continuous mode are numPartitions and rowsPerSecond.
Sinks:
   Kafka sink: All options are supported.
   Memory sink: Good for debugging.
   Console sink: Good for debugging. All options are supported. Note that the console will print every checkpoint interval that you have specified in the continuous trigger.
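The supported shape above can be sketched as a Kafka-to-Kafka query using only map-like operations; this is a hedged example, with bootstrap servers, topic names, and the checkpoint path as placeholders:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

// Continuous mode accepts only projections and selections:
// no aggregations, joins, or time-based functions.
val out = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder
  .option("subscribe", "in-topic")                            // placeholder
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") // projection: allowed
  .filter(col("key").isNotNull)                               // selection: allowed
  .writeStream
  .format("kafka")                                            // supported sink
  .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder
  .option("topic", "out-topic")                               // placeholder
  .option("checkpointLocation", "/tmp/continuous-cp")         // placeholder
  .trigger(Trigger.Continuous("1 second"))
  .start()
```

Adding an aggregation (e.g. `groupBy(...).count()`) to the middle of this pipeline would make the query fail analysis in continuous mode.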