Is it allowed to join two streams derived from the same input streaming Dataset in Spark Structured Streaming (2.3)?
For example, the query below joins two such streams. I get an IllegalStateException from the Azure Event Hubs Spark client.
Should this work?
eventhubs = spark.readStream ... .createOrReplaceTempView("Input")

spark.sql("SELECT temperature, time, device, category FROM Input").createOrReplaceTempView("devices1")
spark.sql("SELECT temperature, time, device, category FROM Input").createOrReplaceTempView("devices2")

val d1 = spark.sql("SELECT * FROM devices1 WHERE device=0")
val d2 = spark.sql("SELECT * FROM devices2 WHERE device=1")

val output = d1.join(
  d2,
  expr("""
    devices1.category = devices2.category AND
    devices1.time >= devices2.time AND
    devices1.time <= devices2.time + interval 1 seconds
  """),
  joinType = "inner"
)

display(output)
Answer (score: 1)

As far as I know, a self-join on a Spark Structured Streaming source is allowed, but only with the append output mode. Here is an example:
class ExampleTest extends SparkBaseSpec {

  import org.apache.spark.sql.DataFrame
  import spark.implicits._

  // Seed the directory with an initial batch of data.
  private val data: DataFrame = spark.range(1, 5).toDF
  data.write.parquet("/tmp/streaming/")

  // Read that directory back as a single streaming source ...
  val readStr = spark.readStream.schema(data.schema).parquet("/tmp/streaming/")

  // ... and self-join two projections of it.
  val df = readStr
    .select($"id".as("id1"))
    .where("id1 < 50")
    .join(readStr.select($"id".as("id2")).where("id2 < 50"), $"id1" === $"id2")

  df.printSchema()

  // Stream-stream self-joins are only supported with append output mode.
  val stream = df.writeStream
    .option("checkpointLocation", "/tmp/spark-streaming-checkpoint")
    .format("console")
    .outputMode("append")
    .start

  // Append a second batch of files while the query is running.
  spark.range(20, 25).toDF.write.mode("append").parquet("/tmp/streaming/")

  stream.awaitTermination(30000)
}
root
|-- id1: long (nullable = false)
|-- id2: long (nullable = false)
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+
|id1|id2|
+---+---+
| 1| 1|
| 3| 3|
| 2| 2|
| 4| 4|
+---+---+
-------------------------------------------
Batch: 1
-------------------------------------------
+---+---+
|id1|id2|
+---+---+
| 22| 22|
| 21| 21|
| 23| 23|
| 20| 20|
| 24| 24|
+---+---+
By the way, you do not need to create two temporary views; one is enough, for example:
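A minimal sketch of your query rewritten against a single view, using DataFrame aliases so the join condition can still reference the two sides unambiguously (column names are taken from your query; import org.apache.spark.sql.functions.expr is assumed to be in scope):

// One temp view instead of two.
spark.sql("SELECT temperature, time, device, category FROM Input").createOrReplaceTempView("devices")

// Alias each filtered side so expr() can qualify its columns.
val d1 = spark.sql("SELECT * FROM devices WHERE device=0").as("d1")
val d2 = spark.sql("SELECT * FROM devices WHERE device=1").as("d2")

val output = d1.join(
  d2,
  expr("""
    d1.category = d2.category AND
    d1.time >= d2.time AND
    d1.time <= d2.time + interval 1 seconds
  """),
  joinType = "inner"
)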
Hope this helps!