Flink SQL: Outer Join with Group By provides unexpected output

Time: 2019-12-13 10:13:46

Tags: sql apache-flink flink-sql

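I have two Flink dynamic tables, Event and Configuration.

Event has the following structure: [id, myTimestamp]
Configuration has the following structure: [id, myValue, myTimestamp]

I am trying to write a Flink SQL query that returns Event.id, Configuration.myValue, or Event.id, null if the Event id does not match any id in Configuration.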

Example of the expected behavior (Configuration and Event start empty):

[myId-1, 10]            => EventTable           : [(true, myId-1, null)]
[myId-1, myValue-A, 15] => ConfigurationTable   : [(false, myId-1, null), (true, myId-1, myValue-A)]
[myId-1, myValue-A, 20] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, myValue-A)]
[myId-1, myValue-B, 25] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, myValue-B)]
[myId-1, 30]            => EventTable           : [(false, myId-1, null), (true, myId-1, myValue-B)]

The example must be read as:

[DATA_RECEIVED] => TARGET_TABLE : EXPECTED_OUTPUT

Since the SQL query involves a join, its result is inserted into an UpsertSink (the first value of each output tuple is the upsert boolean).

So I wrote this query:

SELECT Event.id, Configuration.myValue
FROM (
  SELECT id, MAX(myTimestamp) as myTimestamp
  FROM Event
  GROUP BY id
) as Event
LEFT JOIN (
  SELECT id, LATEST_VAL(myValue, myTimestamp) as myValue, MAX(myTimestamp) as myTimestamp
  FROM Configuration
  GROUP BY id, myValue
) as Configuration
ON Event.id = Configuration.id
GROUP BY Event.id, Configuration.myValue

LATEST_VAL is a UDF that returns the myValue associated with MAX(myTimestamp).

However, I observe behavior that I do not understand. Here is the observed result:

[myId-1, 10]            => EventTable           : [(true, myId-1, null)] // OK
[myId-1, myValue-A, 15] => ConfigurationTable   : [(false, myId-1, null), (true, myId-1, myValue-A)] // OK
[myId-1, myValue-A, 20] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, null), (false, myId-1, null), (true, myId-1, myValue-A)] // NOT OK
[myId-1, myValue-B, 25] => ConfigurationTable   : [(false, myId-1, myValue-A), (true, myId-1, null), (false, myId-1, null), (true, myId-1, myValue-B)] // NOT OK
[myId-1, 30]            => EventTable           : [(false, myId-1, null), (true, myId-1, myValue-B)] // OK

How do you explain the difference between the expected and the observed behavior? Why is there extra output?

Is it possible to adapt the SQL query to get the desired behavior?

Note:

  • I am using Flink 1.8

1 Answer:

Answer 0: (score: 1)

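I think the point you missed is that you are actually joining two retract streams. Even though your input streams are append-only, you aggregate them in the subqueries, and those aggregations produce retractions.

Let us first analyze the results of the subqueries:

Subquery 1: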

Query: SELECT id, MAX(myTimestamp) as myTimestamp FROM Event GROUP BY id

Resulting stream:
 (true, myId-1, 10L)
 (false, myId-1, 10L)
 (true, myId-1, 30L)
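
To make the retractions concrete, here is a minimal sketch in plain Python (not Flink code) of how a grouped MAX over an append-only input can turn into a retract stream. The function name `grouped_max_changelog` and the tuple layout are assumptions made for illustration only; the sketch models subquery 1 and nothing else.

```python
from typing import Dict, Iterable, List, Tuple

# A change is (flag, row): flag True adds the row, flag False retracts it.
Change = Tuple[bool, Tuple[str, int]]


def grouped_max_changelog(rows: Iterable[Tuple[str, int]]) -> List[Change]:
    """Model SELECT id, MAX(ts) ... GROUP BY id over append-only (id, ts) rows."""
    current_max: Dict[str, int] = {}
    changes: List[Change] = []
    for row_id, ts in rows:
        old = current_max.get(row_id)
        if old is None:
            # First row for this id: only an add message is needed.
            changes.append((True, (row_id, ts)))
        elif ts > old:
            # The aggregate changed: retract the old result, then add the new one.
            changes.append((False, (row_id, old)))
            changes.append((True, (row_id, ts)))
        else:
            # The maximum is unchanged, so nothing is emitted.
            continue
        current_max[row_id] = ts
    return changes


# The two Event rows from the question.
event_changes = grouped_max_changelog([("myId-1", 10), ("myId-1", 30)])
for change in event_changes:
    print(change)
```

Under these assumptions, the two appended Event rows produce three messages: an add, a retraction of that add, and a new add, which is the shape of the stream listed above.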

Subquery 2:

Query: SELECT id, LATEST_VAL(myValue, myTimestamp) as myValue, MAX(myTimestamp) as myTimestamp FROM Configuration GROUP BY id, myValue

Resulting stream:
 (true, "myId-1", "myValue-A", 15L)
 (false, "myId-1", "myValue-A", 15L)
 (true, "myId-1", "myValue-A", 20L)
 (false, "myId-1", "myValue-A", 20L)
 (true, "myId-1", "myValue-B", 25L)

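The join and the outer GROUP BY are then performed on top of these two retract streams. With that in mind, what is actually joined and grouped in your example is: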

[true, myId-1, 10]             : [(true, myId-1, null)]
[true, myId-1, myValue-A, 15]  : [(false, myId-1, null), (true, myId-1, myValue-A)]
[false, myId-1, myValue-A, 15] : [(false, myId-1, myValue-A), (true, myId-1, null)]
[true, myId-1, myValue-A, 20]  : [(false, myId-1, null), (true, myId-1, myValue-A)]
[false, myId-1, myValue-A, 20] : [(false, myId-1, myValue-A), (true, myId-1, null)]
[true, myId-1, myValue-B, 25]  : [(false, myId-1, null), (true, myId-1, myValue-B)]
...

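As far as I can tell, this yields correct results overall. For every input row, the last emitted row represents the latest value for the given id.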
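
One way to check that claim is to fold the emitted messages into a materialized view, the way a downstream consumer of a retract stream would. The following is a rough sketch in plain Python, not Flink code; `materialize` is a hypothetical helper, and the `changes` list simply flattens the six output lists shown above (the elided `...` rows are not included).

```python
def materialize(changes):
    """Apply (flag, row) messages to a multiset; True adds a row, False retracts it."""
    state = {}
    for is_add, row in changes:
        count = state.get(row, 0) + (1 if is_add else -1)
        if count == 0:
            # The row was fully retracted, so drop it from the view.
            state.pop(row, None)
        else:
            state[row] = count
    return state


null_row = ("myId-1", None)
a_row = ("myId-1", "myValue-A")
b_row = ("myId-1", "myValue-B")

# Output messages from the join walkthrough above, in emission order.
changes = [
    (True, null_row),
    (False, null_row), (True, a_row),
    (False, a_row), (True, null_row),
    (False, null_row), (True, a_row),
    (False, a_row), (True, null_row),
    (False, null_row), (True, b_row),
]

final_state = materialize(changes)
print(final_state)
```

The intermediate (true, myId-1, null) messages cancel out against their retractions, so after the last message the view holds a single row for myId-1 carrying myValue-B.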