Why does this KStream / KTable topology forward records that fail the filter?

Asked: 2018-08-29 22:04:44

Tags: apache-kafka apache-kafka-streams

I have the following topology:

  1. Create a state store
  2. Filter records based on SOME_CONDITION, map their values to a new entity, and finally publish those records to another topic, STATIONS_LOW_CAPACITY_TOPIC

However, this is what I see on STATIONS_LOW_CAPACITY_TOPIC:

�   null
�   null
�   null
�   {"id":140,"latitude":"40.4592351","longitude":"-3.6915330",...}
�   {"id":137,"latitude":"40.4591366","longitude":"-3.6894151",...}
�   null

That is, it looks as if records that did not pass the filter are also being published to the STATIONS_LOW_CAPACITY_TOPIC topic. How is this possible? And how can I prevent them from being published?

Here is the Kafka Streams code:

kStream.groupByKey().reduce({ _, newValue -> newValue },
                Materialized.`as`<Int, Station, KeyValueStore<Bytes, ByteArray>>(STATIONS_STORE)
                        .withKeySerde(Serdes.Integer())
                        .withValueSerde(stationSerde))
                .filter { _, value -> SOME_CONDITION }
                .mapValues { station ->
                    Stats(XXX)
                }
                .toStream().to(STATIONS_LOW_CAPACITY_TOPIC, Produced.with(Serdes.Integer(), stationStatsSerde))

Update: I simplified the topology and printed the resulting table. For some reason, the final KTable also contains null-value records corresponding to the upstream records that did not pass the filter:

kStream.groupByKey().reduce({ _, newValue -> newValue },
                Materialized.`as`<Int, BiciMadStation, KeyValueStore<Bytes, ByteArray>>(STATIONS_STORE)
                        .withKeySerde(Serdes.Integer())
                        .withValueSerde(stationSerde))
                .filter { _, value ->
                    val conditionResult = (SOME_CONDITION)
                    println(conditionResult)
                    conditionResult
                }
                .print()

Log output:

false
[KTABLE-FILTER-0000000002]: 1, (null<-null)
false
[KTABLE-FILTER-0000000002]: 2, (null<-null)
false
[KTABLE-FILTER-0000000002]: 3, (null<-null)
false
[KTABLE-FILTER-0000000002]: 4, (null<-null)
true
[KTABLE-FILTER-0000000002]: 5, (Station(id=5, latitude=40.4285524, longitude=-3.7025875, ...)<-null)

1 answer:

Answer 0 (score: 3):

The answer is in the javadoc of KTable.filter(...):

  Note that filter for a changelog stream works differently than a record
  stream filter, because records with null value (so-called tombstone records)
  have delete semantics. Thus, for tombstones the provided filter predicate is
  not evaluated but the tombstone record is forwarded directly if required
  (i.e., if there is anything to be deleted). Furthermore, for each record that
  gets dropped (i.e., does not satisfy the given predicate) a tombstone record
  is forwarded.

This explains why I was seeing null-value (tombstone) records being forwarded downstream.
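The tombstone behavior described in the javadoc can be illustrated without a Kafka cluster. The following is a plain-Kotlin simulation of changelog-stream filter semantics (the `Record` class and `changelogFilter` function are illustrative names, not part of the Kafka Streams API): records that fail the predicate are replaced by tombstones, and incoming tombstones are forwarded without evaluating the predicate.

```kotlin
// Simulation of KTable.filter semantics: a changelog filter never silently
// drops a key; it emits a tombstone (null value) for every record that fails
// the predicate, and forwards existing tombstones as-is.
data class Record(val key: Int, val value: Int?)

fun changelogFilter(input: List<Record>, predicate: (Int, Int) -> Boolean): List<Record> =
    input.map { (key, value) ->
        when {
            value == null -> Record(key, null)          // tombstone: forwarded, predicate not evaluated
            predicate(key, value) -> Record(key, value) // passes the filter: forwarded unchanged
            else -> Record(key, null)                   // fails the filter: replaced by a tombstone
        }
    }

fun main() {
    val updates = listOf(Record(1, 3), Record(2, 12), Record(3, null))
    // Key 1 fails the predicate and becomes a tombstone; key 2 passes;
    // key 3 was already a tombstone and is forwarded as-is.
    println(changelogFilter(updates) { _, v -> v > 10 })
}
```

This mirrors the topic dump in the question: one tombstone per filtered-out key, interleaved with the records that passed.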

To avoid this, I converted the KTable to a KStream and then applied the filter:

kStream.groupByKey().reduce({ _, newValue -> newValue },
                Materialized.`as`<Int, Stations, KeyValueStore<Bytes, ByteArray>>(STATIONS_STORE)
                        .withKeySerde(Serdes.Integer())
                        .withValueSerde(stationSerde))
                .toStream()
                .filter { _, value -> SOME_CONDITION }
                .mapValues { station ->
                    StationStats(station.id, station.latitude, station.longitude, ...)
                }
                .to(STATIONS_LOW_CAPACITY_TOPIC, Produced.with(Serdes.Integer(), stationStatsSerde))

Result:

4   {"id":4,"latitude":"40.4302937","longitude":"-3.7069171",...}
5   {"id":5,"latitude":"40.4285524","longitude":"-3.7025875",...}
...
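The fix works because a record-stream (KStream) filter, unlike a changelog-stream filter, simply drops non-matching records rather than emitting tombstones for them. A plain-Kotlin sketch of that contrast (illustrative names, not the Kafka Streams API):

```kotlin
// Simulation of KStream.filter semantics: records that fail the predicate are
// dropped outright, so no null values ever reach the output topic.
data class Rec(val key: Int, val value: Int)

fun streamFilter(input: List<Rec>, predicate: (Int, Int) -> Boolean): List<Rec> =
    input.filter { (k, v) -> predicate(k, v) }  // non-matching records simply vanish

fun main() {
    val updates = listOf(Rec(1, 3), Rec(2, 12), Rec(3, 25))
    // Only keys 2 and 3 survive; key 1 produces no output at all.
    println(streamFilter(updates) { _, v -> v > 10 })
}
```

Note the trade-off: after `toStream()`, downstream consumers no longer receive delete notifications for keys that stop satisfying the condition, which is acceptable here because the output topic is treated as a plain record stream.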