Performance of stream-stream joins in Spark Structured Streaming

Time: 2020-07-24 05:11:41

Tags: apache-spark

I am trying out the stream-stream join feature of Spark Structured Streaming with Spark 2.4.0.

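I am joining two simple datasets just to observe how stream-stream joins perform. For now I am running this on my local machine with only a few input records. I observed that joining the data from the two streams and writing the output to Kafka takes more than a few minutes.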

Here is what I have been trying:

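// Assumes `spark` (a SparkSession), `config`, `kafkaHost` and `kafkaPort` are defined elsewhere.
import org.apache.spark.sql.functions.expr
import spark.implicits._

val in1Df = spark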
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic1"))
  .load()
  .select($"timestamp" as "timestamp1",$"value" cast "string" as "value1")

val in2Df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic2"))
  .load()
  .select($"timestamp"  as "timestamp2", $"value" cast "string" as "value2")

// Tolerate events on the first stream arriving up to 10 seconds late.
val in1DfWithWatermark = in1Df
  .withWatermark("timestamp1", "10 seconds")

// Tolerate events on the second stream arriving up to 20 seconds late.
val in2DfWithWatermark = in2Df
  .withWatermark("timestamp2", "20 seconds")

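// Inner join on equal values, keeping pairs where the second stream's event
// falls within one minute after the first stream's event.
val joinedDf = in1DfWithWatermark.join(
  in2DfWithWatermark,
  expr("""value1 = value2 AND
          timestamp2 >= timestamp1 AND
          timestamp2 <= timestamp1 + interval 1 minutes"""))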

joinedDf.select(($"value1").alias("value"))
  .writeStream
  .format("kafka")
  .option("topic", config.getString("SparkStrucStreamingPoc.outTopic"))
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("checkpointLocation", config.getString("SparkStrucStreamingPoc.checkpoint"))
  .start()
  .awaitTermination()
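
To see where the time goes in each micro-batch, per-batch progress can be logged with a `StreamingQueryListener`. The following is only a minimal sketch: the helper name `logBatchTimings` is hypothetical and not part of the code above, and it assumes the listener is registered on the same `spark` session before the query is started.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent
}

// Hypothetical helper (an assumption, not from the original code):
// prints the batch id, input row count and timing breakdown of every micro-batch.
def logBatchTimings(spark: SparkSession): Unit = {
  spark.streams.addListener(new StreamingQueryListener {
    override def onQueryStarted(event: QueryStartedEvent): Unit = ()

    override def onQueryProgress(event: QueryProgressEvent): Unit = {
      val p = event.progress
      // durationMs maps each phase of the batch to the milliseconds it took.
      println(s"batch=${p.batchId} inputRows=${p.numInputRows} durationMs=${p.durationMs}")
    }

    override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  })
}
```

Calling `logBatchTimings(spark)` before `.start()` would print one line per completed micro-batch, which should show whether the delay comes from the batches themselves or from waiting between them.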

Has anyone else observed this behavior? Does joining two streams usually take this long?

0 answers:

No answers