Is it allowed to join two streams derived from the same input streaming Dataset in Spark Structured Streaming (2.3)?
For example, the query below joins two such streams. I get an IllegalStateException from the Azure Event Hubs Spark client.
Should this work?
eventhubs = spark.readStream ... .createOrReplaceTempView("Input")

spark.sql("SELECT temperature, time, device, category FROM Input").createOrReplaceTempView("devices1")
spark.sql("SELECT temperature, time, device, category FROM Input").createOrReplaceTempView("devices2")

val d1 = spark.sql("SELECT * FROM devices1 WHERE device=0")
val d2 = spark.sql("SELECT * FROM devices2 WHERE device=1")

val output = d1.join(
  d2,
  expr("""
    devices1.category = devices2.category AND
    devices1.time >= devices2.time AND
    devices1.time <= devices2.time + interval 1 seconds
  """),
  joinType = "inner"
)

display(output)
Answer (score: 1)

As far as I know, a self-join on a Spark Structured Streaming source is allowed, but only with the append output mode. Here is an example:
class ExampleTest extends SparkBaseSpec {

  import org.apache.spark.sql.DataFrame
  import spark.implicits._

  // Seed the directory with an initial batch of data.
  private val data: DataFrame = spark.range(1, 5).toDF
  data.write.parquet("/tmp/streaming/")

  // Read that directory back as a single streaming source ...
  val readStr = spark.readStream.schema(data.schema).parquet("/tmp/streaming/")

  // ... and self-join two projections of it.
  val df = readStr
    .select($"id".as("id1"))
    .where("id1 < 50")
    .join(readStr.select($"id".as("id2")).where("id2 < 50"), $"id1" === $"id2")

  df.printSchema()

  // Stream-stream self-joins are only supported with append output mode.
  val stream = df.writeStream
    .option("checkpointLocation", "/tmp/spark-streaming-checkpoint")
    .format("console")
    .outputMode("append")
    .start

  // Append a second batch of files while the query is running.
  spark.range(20, 25).toDF.write.mode("append").parquet("/tmp/streaming/")

  stream.awaitTermination(30000)
}
root
|-- id1: long (nullable = false)
|-- id2: long (nullable = false)
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+
|id1|id2|
+---+---+
| 1| 1|
| 3| 3|
| 2| 2|
| 4| 4|
+---+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+---+---+
|id1|id2|
+---+---+
| 22| 22|
| 21| 21|
| 23| 23|
| 20| 20|
| 24| 24|
+---+---+
By the way, you do not need to create two temporary views; one is enough, for example:
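A minimal sketch of your query rewritten against a single view, using DataFrame aliases so the join condition can still reference the two sides unambiguously (column names are taken from your query; import org.apache.spark.sql.functions.expr is assumed to be in scope):

// One temp view instead of two.
spark.sql("SELECT temperature, time, device, category FROM Input").createOrReplaceTempView("devices")

// Alias each filtered side so expr() can qualify its columns.
val d1 = spark.sql("SELECT * FROM devices WHERE device=0").as("d1")
val d2 = spark.sql("SELECT * FROM devices WHERE device=1").as("d2")

val output = d1.join(
  d2,
  expr("""
    d1.category = d2.category AND
    d1.time >= d2.time AND
    d1.time <= d2.time + interval 1 seconds
  """),
  joinType = "inner"
)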
Hope this helps!