Why is the result empty after joining two streams in Spark 2.4?

Asked: 2018-11-24 21:14:18

Tags: python apache-spark pyspark spark-structured-streaming

I am using Spark 2.4 to join two streams. The problem is that the joined result is empty.

I load the streaming data from two folders:

data/1

[
  {"id1": 77,"name1": "aaa","timestamp": 1532609003},
  {"id1": 77,"name1": "xxx","timestamp": 1532609005},
  {"id1": 78,"name1": "xxx","timestamp": 1532609005}
]

data/2

[
  {"id2": 77,"name2": "yyy", "timestamp2": 1532609000}
]

My code:

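from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, TimestampType
)

# Assumption: the original snippet relied on an already existing `spark` session.
spark = SparkSession.builder.getOrCreate()

# Schema for the files in data/1
schema1 = StructType([
    StructField("id1", IntegerType()),
    StructField("name1", StringType()),
    StructField("timestamp1", TimestampType())
])

# Schema for the files in data/2
schema2 = StructType([
    StructField("id2", IntegerType()),
    StructField("name2", StringType()),
    StructField("timestamp2", TimestampType())
])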

ds1 = spark \
    .readStream \
    .format("json") \
    .schema(schema1) \
    .load("data/1") \
    .withWatermark("timestamp1", "2 minutes")

ds2 = spark \
    .readStream \
    .format("json") \
    .schema(schema2) \
    .load("data/2") \
    .withWatermark("timestamp2", "2 minutes")

ds_joined = ds1.join(
    ds2,
    func.expr("""
    id1 = id2 AND
    timestamp1 >= timestamp2 AND
    timestamp1 <= timestamp2 + interval 2 minutes
    """),
    "leftOuter"
).fillna(0)

query = ds_joined \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()

As you can see, I use a 2-minute watermark on both streams, so I don't understand why I get an empty joined dataset.
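One check that does not need Spark is whether the field names in the sample files match the declared schemas. The sketch below is only a diagnostic aid, not a confirmed explanation: it copies the sample records and schema field names exactly as they appear in this post, and `missing_fields` is a hypothetical helper. If the real files do use `timestamp` rather than `timestamp1`, Spark's JSON reader in its default permissive mode would be expected to fill `timestamp1` with null, so the time condition in the join could never match.

```python
import json

# Sample records copied verbatim from the post (data/1 and data/2).
data1 = json.loads("""[
  {"id1": 77, "name1": "aaa", "timestamp": 1532609003},
  {"id1": 77, "name1": "xxx", "timestamp": 1532609005},
  {"id1": 78, "name1": "xxx", "timestamp": 1532609005}
]""")
data2 = json.loads("""[
  {"id2": 77, "name2": "yyy", "timestamp2": 1532609000}
]""")

# Field names declared in schema1 and schema2.
schema1_fields = ["id1", "name1", "timestamp1"]
schema2_fields = ["id2", "name2", "timestamp2"]


def missing_fields(records, schema_fields):
    """Hypothetical helper: schema fields that appear in none of the records."""
    present = set()
    for record in records:
        present.update(record)  # collects the keys of each record
    return [name for name in schema_fields if name not in present]


missing1 = missing_fields(data1, schema1_fields)
missing2 = missing_fields(data2, schema2_fields)
print("data/1 is missing:", missing1)
print("data/2 is missing:", missing2)
```

If the mismatch is only a typo in the post and the real files contain `timestamp1`, this check reports nothing missing and the cause has to be looked for elsewhere.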

Expected output:

id1  id2  name1  name2  timestamp1  timestamp2
77   77   aaa    yyy    1532609003  1532609000
77   77   xxx    yyy    1532609005  1532609000
78   0    xxx    0      1532609005  0     
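To make the expected output concrete, here is a minimal pure-Python sketch of the join condition applied to the sample rows. It only illustrates the intended logic and is not how Spark evaluates a stream-stream join. It assumes the event-time field in data/1 is named `timestamp1`, treats the timestamps as epoch seconds, and fills unmatched right-side columns with 0 to mirror the table above; `matches` and `left_outer_join` are hypothetical helper names.

```python
# Rows from the post; the data/1 time field is assumed to be `timestamp1`.
left = [
    {"id1": 77, "name1": "aaa", "timestamp1": 1532609003},
    {"id1": 77, "name1": "xxx", "timestamp1": 1532609005},
    {"id1": 78, "name1": "xxx", "timestamp1": 1532609005},
]
right = [
    {"id2": 77, "name2": "yyy", "timestamp2": 1532609000},
]

WINDOW_SECONDS = 120  # "interval 2 minutes" in the join expression


def matches(l, r):
    """id1 = id2 AND timestamp2 <= timestamp1 <= timestamp2 + 2 minutes."""
    return (
        l["id1"] == r["id2"]
        and r["timestamp2"] <= l["timestamp1"] <= r["timestamp2"] + WINDOW_SECONDS
    )


def left_outer_join(left_rows, right_rows):
    """Keep every left row; unmatched rows get zeros for the right-side columns."""
    joined = []
    for l in left_rows:
        hits = [r for r in right_rows if matches(l, r)]
        if hits:
            joined.extend({**l, **r} for r in hits)
        else:
            # Zeros mirror the expected-output table in the post.
            joined.append({**l, "id2": 0, "name2": 0, "timestamp2": 0})
    return joined


rows = left_outer_join(left, right)
for row in rows:
    print(row)
```

Under these assumptions the two rows with `id1` 77 pair with the single right-side row, and the row with `id1` 78 is kept with zero placeholders, which is the shape of the expected table.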

0 answers:

No answers yet.