Spark Structured Streaming - 2 ReadStreams in one application

Asked: 2018-04-08 12:43:36

Tags: apache-spark apache-kafka spark-structured-streaming

Is it possible to have two separate readStreams in one application? I am trying to listen to two separate Kafka topics and do calculations based on both DataFrames.

2 Answers:

Answer 0 (score: 1)

You can simply subscribe to multiple topics in a single stream:

// Subscribe to multiple topics
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()

// needed for the .as[(String, String)] encoder below
import spark.implicits._

val ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
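With a multi-topic subscription, the Kafka source also exposes a `topic` column on every row, so the two topics can still be told apart inside the one stream. A minimal sketch, assuming the `df` and topic names from the snippet above:

```scala
import spark.implicits._

// Keep the topic column alongside the decoded key/value
val byTopic = df.selectExpr(
  "topic",
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value")

// Split per topic for topic-specific calculations
val topic1Rows = byTopic.filter($"topic" === "topic1")
val topic2Rows = byTopic.filter($"topic" === "topic2")
```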

Alternatively, if you specifically want two independent readStream definitions in one application:

// read stream A
val dfA = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
val dsA = dfA.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// read stream B
val dfB = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic2")
  .load()
val dsB = dfB.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
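Each independent readStream needs its own streaming query started on it before anything flows. A minimal sketch, assuming the two streams above are bound to Datasets named `dsA` and `dsB` (the console sink and output mode here are illustrative choices, not part of the original answer):

```scala
val queryA = dsA.writeStream
  .format("console")
  .outputMode("append")
  .start()

val queryB = dsB.writeStream
  .format("console")
  .outputMode("append")
  .start()

// Block the driver while both queries run; returns when either terminates
spark.streams.awaitAnyTermination()
```

Note that a plain `queryA.awaitTermination()` would block forever on the first query; `spark.streams.awaitAnyTermination()` is the usual way to keep multiple queries alive in one application.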

Answer 1 (score: 0)

You should be able to achieve this by using join() in Spark 2.3.0:

val stream1 = spark.readStream. ...
val stream2 = spark.readStream. ...

val joinedDf = stream1.join(stream2, "join_column_id")
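Stream-stream joins were introduced in Spark 2.3.0. An inner equi-join like the one above works without watermarks, but the join state then grows without bound; bounding it requires watermarks plus an event-time constraint in the join condition. A hedged sketch, assuming each stream carries an event-time column (the names `eventTime1`, `eventTime2`, and the interval bounds are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.functions.expr

// Late-data thresholds for state cleanup
val s1 = stream1.withWatermark("eventTime1", "1 hour")
val s2 = stream2.withWatermark("eventTime2", "2 hours")

// Equi-join plus a time-range constraint so old state can be dropped
val joinedDf = s1.join(
  s2,
  expr("""
    join_column_id = join_column_id AND
    eventTime2 >= eventTime1 AND
    eventTime2 <= eventTime1 + interval 1 hour
  """))
```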