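I'd like to outer join several (typically 2-10) Kafka topics by key, ideally using the streaming API. All topics have the same key and partitioning. One way to do this join is to create a KStream for each topic and chain calls to KStream.outerJoin: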
stream1
.outerJoin(stream2, ...)
.outerJoin(stream3, ...)
.outerJoin(stream4, ...)
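However, the documentation of KStream.outerJoin suggests that each call to outerJoin materializes its two input streams, so the example above would materialize not just streams 1 to 4 but also stream1.outerJoin(stream2, ...) and stream1.outerJoin(stream2, ...).outerJoin(stream3, ...). Compared to joining the 4 streams directly, that means a lot of unnecessary serialization, deserialization, and I/O.
Another problem with the approach above is that the JoinWindow is not consistent across all 4 input streams: one JoinWindow is used to join streams 1 and 2, but then a separate join window is used to join that result with stream 3, and so on. For example, if I specify a 10-second join window for each join, and entries with a certain key appear in stream 1 at 0 seconds, stream 2 at 6 seconds, stream 3 at 12 seconds, and stream 4 at 18 seconds, the joined item is output after 18 seconds, which is an excessive delay. The results depend on the order of the joins, which seems unnatural.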
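To make the timing concrete, here is a minimal sketch in plain Java (it does not use Kafka Streams) that replays the arrival times from the example. The class and method names are hypothetical. It assumes times are in seconds and that a joined record carries the larger of its two input timestamps. It only models when the complete 4-way result would appear, not the partial results an outer join also emits.

```java
import java.util.OptionalLong;

public class ChainedWindowSketch {

    // Chained pairwise joins: each step compares the next arrival with the
    // timestamp of the intermediate result. Assumption: a joined record
    // carries the larger of its two input timestamps.
    static OptionalLong chainedJoinTime(long[] arrivals, long window) {
        long joinedTs = arrivals[0];
        for (int i = 1; i < arrivals.length; i++) {
            if (Math.abs(arrivals[i] - joinedTs) > window) {
                return OptionalLong.empty(); // this step misses its window
            }
            joinedTs = Math.max(joinedTs, arrivals[i]);
        }
        return OptionalLong.of(joinedTs);
    }

    // One window for the whole join: every arrival must fall within
    // `window` of the earliest one.
    static OptionalLong singleWindowJoinTime(long[] arrivals, long window) {
        long first = arrivals[0];
        long last = arrivals[0];
        for (long t : arrivals) {
            first = Math.min(first, t);
            last = Math.max(last, t);
        }
        return last - first <= window ? OptionalLong.of(last) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        long[] arrivals = {0, 6, 12, 18};
        System.out.println(chainedJoinTime(arrivals, 10));      // OptionalLong[18]
        System.out.println(singleWindowJoinTime(arrivals, 10)); // OptionalLong.empty
    }
}
```

Under these assumptions the chained joins still produce a complete result at 18 seconds, even though the first and last entries are 18 seconds apart, whereas a single 10-second window over all four streams would never complete.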
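Is there a better approach to multi-way joins using Kafka?
Answer 0 (score: 1)
I'm not aware of a better approach in Kafka Streams at the moment, but one is in the works: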
https://cwiki.apache.org/confluence/display/KAFKA/KIP-150+-+Kafka-Streams+Cogroup
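Answer 1 (score: 0)
Ultimately I decided to create a custom lightweight joiner that avoids materialization and strictly honors the expiry time. It should be O(1) on average. It fits better with the Consumer API than with the Streams API: for each consumer, repeatedly poll and update the joiner with any received data; if the joiner returns a complete set of values, forward it on. Here's the code: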
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
/**
 * Inner joins multiple streams of data by key into one stream. It is assumed
 * that a key will appear in a stream exactly once. The values associated with
 * each key are collected and if all values are received within a certain
 * maximum wait time, the joiner returns all values corresponding to that key.
 * If not all values are received in time, the joiner never returns any values
 * corresponding to that key.
 * <p>
 * This class is not thread safe: all calls to
 * {@link #update(Object, Object, long)} must be synchronized.
 * @param <K> The type of key.
 * @param <V> The type of value.
 */
class StreamInnerJoiner<K, V> {

    private final Map<K, Vals<V>> idToVals = new LinkedHashMap<>();
    private final int joinCount;
    private final long maxWait;

    /**
     * Creates a stream inner joiner.
     * @param joinCount The number of streams being joined.
     * @param maxWait The maximum amount of time after an item has been seen in
     * one stream to wait for it to be seen in the remaining streams.
     */
    StreamInnerJoiner(final int joinCount, final long maxWait) {
        this.joinCount = joinCount;
        this.maxWait = maxWait;
    }

    private static class Vals<A> {
        final long firstSeen;
        final Collection<A> vals = new ArrayList<>();

        private Vals(final long firstSeen) {
            this.firstSeen = firstSeen;
        }
    }

    /**
     * Updates this joiner with a value corresponding to a key.
     * @param key The key.
     * @param val The value.
     * @param now The current time.
     * @return If all values for the specified key have been received, the
     * complete collection of values for that key; otherwise
     * {@link Optional#empty()}.
     */
    Optional<Collection<V>> update(final K key, final V val, final long now) {
        expireOld(now - maxWait);
        final Vals<V> curVals = getOrCreate(key, now);
        curVals.vals.add(val);
        return expireAndGetIffFull(key, curVals);
    }

    private Vals<V> getOrCreate(final K key, final long now) {
        final Vals<V> existingVals = idToVals.get(key);
        if (existingVals != null)
            return existingVals;
        else {
            /*
            Note: we assume that the item with the specified ID has not already
            been seen and timed out, and therefore that its first seen time is
            now. If the item has in fact already timed out, it is doomed and
            will time out again with no ill effect.
             */
            final Vals<V> curVals = new Vals<>(now);
            idToVals.put(key, curVals);
            return curVals;
        }
    }

    // Entries are visited in insertion order, which equals first-seen order
    // as long as "now" never decreases between calls to update.
    private void expireOld(final long expireBefore) {
        final Iterator<Vals<V>> i = idToVals.values().iterator();
        while (i.hasNext() && i.next().firstSeen < expireBefore)
            i.remove();
    }

    private Optional<Collection<V>> expireAndGetIffFull(final K key, final Vals<V> vals) {
        if (vals.vals.size() == joinCount) {
            // as all expired entries were already removed, this entry is valid
            idToVals.remove(key);
            return Optional.of(vals.vals);
        } else
            return Optional.empty();
    }
}
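As a rough illustration of how the joiner might be driven, the sketch below feeds it a hand-written sequence of events instead of records polled from Kafka consumers. Everything in it is hypothetical: `JoinerUsageSketch`, `Event`, and `run` are made-up names, and `MiniJoiner` is a condensed stand-in for `StreamInnerJoiner` that is repeated here only so the snippet compiles on its own. Times are assumed to be in milliseconds and non-decreasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class JoinerUsageSketch {

    /** Condensed stand-in for StreamInnerJoiner (same expiry and join logic). */
    static class MiniJoiner<K, V> {
        private static class Vals<A> {
            final long firstSeen;
            final Collection<A> vals = new ArrayList<>();
            Vals(long firstSeen) { this.firstSeen = firstSeen; }
        }

        private final Map<K, Vals<V>> idToVals = new LinkedHashMap<>();
        private final int joinCount;
        private final long maxWait;

        MiniJoiner(int joinCount, long maxWait) {
            this.joinCount = joinCount;
            this.maxWait = maxWait;
        }

        Optional<Collection<V>> update(K key, V val, long now) {
            // Drop entries first seen before (now - maxWait), oldest first.
            Iterator<Vals<V>> i = idToVals.values().iterator();
            while (i.hasNext() && i.next().firstSeen < now - maxWait) i.remove();
            Vals<V> cur = idToVals.computeIfAbsent(key, k -> new Vals<V>(now));
            cur.vals.add(val);
            if (cur.vals.size() < joinCount) return Optional.empty();
            idToVals.remove(key);
            return Optional.of(cur.vals);
        }
    }

    /** Hypothetical record as it might arrive from one of the joined topics. */
    static final class Event {
        final String key;
        final String value;
        final long time;
        Event(String key, String value, long time) {
            this.key = key;
            this.value = value;
            this.time = time;
        }
    }

    /** Feeds events to the joiner in order and collects every completed join. */
    static List<Collection<String>> run(List<Event> events, int joinCount, long maxWait) {
        MiniJoiner<String, String> joiner = new MiniJoiner<>(joinCount, maxWait);
        List<Collection<String>> out = new ArrayList<>();
        for (Event e : events) {
            joiner.update(e.key, e.value, e.time).ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a", "a1", 0),       // key "a" first seen at 0 ms
            new Event("b", "b1", 1_000),   // key "b" first seen at 1 s
            new Event("a", "a2", 4_000),
            new Event("a", "a3", 9_000),   // "a" completes within 10 s
            new Event("b", "b2", 12_000),  // "b" expired at 11 s, starts over
            new Event("b", "b3", 13_000)); // only 2 of 3 values, no output
        System.out.println(run(events, 3, 10_000)); // [[a1, a2, a3]]
    }
}
```

In a real deployment the loop in `run` would presumably be replaced by the poll loop of each consumer, with calls to `update` synchronized as the class comment requires.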