Question

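I have a Spark Structured Streaming job that reads from S3, transforms the data, and then stores it in an S3 sink and an Elasticsearch sink.

Currently, I call readStream once and then writeStream.format("").start() twice. When I do this, Spark appears to read the data from the S3 source twice, once per sink.

Is there a more efficient way to write to multiple sinks in the same pipeline?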

Answer 1

Currently, I call readStream once and then writeStream.format("").start() twice.

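You are actually creating two separate streaming queries. The readStream part only describes the first (and only) streaming source. Nothing is executed at that point.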

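When I do this, Spark appears to read the data from the S3 source twice, once per sink.

That is the most accurate way to describe how Spark Structured Streaming queries work. The number of sinks corresponds to the number of queries, because a streaming query can have exactly one streaming sink (see StreamExecution, which sits behind every streaming query).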

You can also check the number of threads (using jconsole or similar), because Structured Streaming uses one microBatchThread thread per streaming query (see StreamExecution).
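The one-query-per-sink behavior can be observed with a small sketch. It assumes a local SparkSession and uses the built-in rate source and console sink as stand-ins for S3 and Elasticsearch; the object name, method name, and query names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object TwoQueriesSketch {

  // Defines one source and starts two sinks on it.
  // Each start() call launches its own independent streaming query.
  def startTwoQueries(spark: SparkSession): Seq[StreamingQuery] = {
    // Only a description of the source; nothing is read at this point.
    val source = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "1")
      .load()

    val first = source.writeStream.format("console").queryName("sink_a").start()
    val second = source.writeStream.format("console").queryName("sink_b").start()
    Seq(first, second)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("two-queries-sketch")
      .getOrCreate()

    val queries = startTwoQueries(spark)

    // Lists every active query with its name and unique id.
    spark.streams.active.foreach(q => println(s"${q.name} -> ${q.id}"))

    queries.foreach(_.stop())
    spark.stop()
  }
}
```

Each entry printed from spark.streams.active is a separate query with its own id, and each one pulls from the source on its own schedule.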

Is there a more efficient way to write to multiple sinks in the same pipeline?

Not in the current design of Spark Structured Streaming.

Answer 2

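What you want to do is cache() the data after reading it once and then use it multiple times. I do not believe Spark Structured Streaming currently supports caching (see here), but you can use Spark Streaming. It is a lower-level API than Structured Streaming (it works with the underlying RDDs rather than DataFrame / Dataset). From the Spark Streaming documentation:

Similar to RDDs, DStreams also allow developers to persist the stream's data in memory. That is, using the persist() method on a DStream will automatically persist every RDD of that DStream in memory. This is useful if the data in the DStream will be computed multiple times (e.g., multiple operations on the same data).

With the Spark Streaming API, you can call DStream.cache() on the data. This marks the underlying RDDs as cached, which should prevent a second read. Spark Streaming automatically unpersists the RDDs after a timeout, and you can control this behavior with the spark.cleaner.ttl setting. Note that the default value is infinite, which I would not recommend in a production setting.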

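Besides DStream.cache(), which requires waiting for the spark.cleaner.ttl timeout, there is another way to cache the data. You can use foreachRDD to access the underlying RDDs directly. Here the RDDs can be unpersisted directly after use.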

dstream.foreachRDD{rdd =>
  rdd.cache()
  // perform any transformations, etc.
  rdd.saveAs(...)
  rdd.unpersist(true)
}

Answer 3

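I was also looking for a solution to this problem. I wanted to write some records of the DataFrame to sink 1 and other records to sink 2, depending on some condition, without reading the same data twice in two streaming queries. With the current implementation this does not seem possible (the createSink() method in DataSource.scala supports only a single sink).

However, Spark 2.4.0 introduces a new API, foreachBatch(), which gives you a handle to each micro-batch as a DataFrame. You can use it to cache the DataFrame, write it to different sinks, or process it multiple times before uncaching it again. Something like this: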

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.cache()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()  // DataFrame has no uncache(); unpersist() releases the cache
}.start()  // the query does not run until start() is called

This feature is now available in the Databricks Runtime: https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#reuse-existing-batch-data-sources-with-foreachbatch
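The conditional routing described at the start of this answer could look roughly like the sketch below. It is only an illustration under stated assumptions: the rate source stands in for the real input, local Parquet directories stand in for the two sinks, and the object name, method name, paths, and even/odd condition on the value column are all hypothetical. The batch function is returned as a typed (DataFrame, Long) => Unit value, which also avoids overload ambiguity on foreachBatch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ForeachBatchRouting {

  // Builds the per-batch writer. Each micro-batch is cached once, split on a
  // condition, written to two locations, and then released from the cache.
  def routeBatch(evenPath: String, oddPath: String): (DataFrame, Long) => Unit =
    (batchDF: DataFrame, batchId: Long) => {
      batchDF.persist()
      try {
        // Hypothetical condition: even values go to one sink, odd to the other.
        batchDF.filter(col("value") % 2 === 0).write.mode("append").parquet(evenPath)
        batchDF.filter(col("value") % 2 =!= 0).write.mode("append").parquet(oddPath)
      } finally {
        batchDF.unpersist()
      }
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("foreach-batch-routing")
      .getOrCreate()

    // The rate source emits rows with `timestamp` and `value` columns.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // A single streaming query; the source is read once per micro-batch.
    val query = stream.writeStream
      .foreachBatch(routeBatch("/tmp/sink-even", "/tmp/sink-odd"))
      .option("checkpointLocation", "/tmp/routing-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```

Because the writer is an ordinary function over a DataFrame, it can also be exercised against a static DataFrame without starting a stream.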

Answer 4

This worked for me. The code below is written in Scala 2.13.3.

package com.spark.structured.stream.multisink

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, window}
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.types.{StructType, TimestampType}

object MultipleStreamingSink extends App {

val spark = SparkSession
.builder()
.master("local[*]")
.getOrCreate()

import spark.implicits._

val csvSchema = new StructType()
.add("name", "string").add("age", "integer").add("num","integer").add("date", "string")

val sample = spark.readStream
.schema(csvSchema)
.format("csv")
.options(Map("inferSchema" ->"true", "delimiter"->",", "header"->"true"))
.load("path/to/input/dir")


val sample1 = sample.withColumn("datetime",col("date").cast(TimestampType)).drop("date")

val sampleAgg1 = sample1.withWatermark("datetime", "10 minutes")
.groupBy(window($"datetime", "5 minutes", "5 minutes"), col("name"))
.agg(count("age").alias("age_count"))

val sampleAgg2 = sample1.withWatermark("datetime", "10 minutes")
.groupBy(window($"datetime", "5 minutes", "5 minutes"), col("age"))
.agg(count("name").alias("name_count"))


// I have used console to stream the output, use your sinks accordingly 
val sink1 = sampleAgg1
.withColumn("window_start_time", col("window.start"))
.withColumn("window_end_time", col("window.end"))
.drop("window")
.writeStream
.queryName("count by name")
.option("checkpointLocation", "/tmp/1")
.outputMode(OutputMode.Update())
.trigger(Trigger.ProcessingTime("60 seconds"))
.format("console")
.option("numRows", 100)
.option("truncate", false)
.start()

val sink2 = sampleAgg2
.withColumn("window_start_time", col("window.start"))
.withColumn("window_end_time", col("window.end"))
.drop("window")
.writeStream
.option("checkpointLocation", "/tmp/2")
.queryName("count by age")
.outputMode(OutputMode.Update())
.trigger(Trigger.ProcessingTime("60 seconds"))
.format("console")
.option("numRows", 100)
.option("truncate", false)
.start()

sink1.awaitTermination()
sink2.awaitTermination()

}

Here are the contents of my sample CSV file:

name,age,num,date
abc,28,123,2021-06-01T07:15:00
def,27,124,2021-06-01T08:16:00
abc,28,125,2021-06-01T07:15:00
ghi,28,126,2021-06-01T07:17:00

How do I read a streaming Dataset once and output to multiple sinks?
