I am trying to compute statistics over cycles of Kafka messages, grouped by a feature of the message. Whenever I receive a new message for the same cycle, I want to recompute the kurtosis of this feature.
So far I am able to perform simple aggregations (sum, count) on each message of the stream:
// set the message count as the new key (instead of String null)
val newStream: KStream[Int, Message] = builder
  .stream[String, Message]("queueing.sensors.data")(consumed)
  .map((_, v) => (v.msg_count, v))

// stream -> KTable
newStream.to("Dummy-ReduceInputTopic")(produced2)
val cycleTable: KTable[Int, Message] = builder.table("Dummy-ReduceInputTopic")

// aggregate values per cycle
val cycleTable2: KTable[Int, Seq[Message]] = cycleTable
  .groupBy((k, v) => (v.cycle, v))(serializedFromSerde(intSerde, messageSerde))
  .aggregate[Seq[Message]](Seq[Message]())(
    (aggkey, newvalue, aggvalue) => aggvalue :+ newvalue, // adder
    (aggkey, newvalue, aggvalue) => aggvalue              // subtractor
  )(materializedFromSerde(intSerde, seqmesageSerde))

// create MessageList objects => apply predictions
val cycleTable3: KStream[Int, Double] =
  cycleTable2.toStream.map((k, v) => (k, MessageList(v.toSeq).skewness_ps1))
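For reference, once the `Seq[Message]` for a cycle has been collected, the kurtosis itself is a straightforward computation over the feature values. A minimal plain-Scala sketch (the `kurtosis` helper below is hypothetical and not part of the question's `MessageList` class; it assumes the feature values are `Double`):

```scala
// Sample excess kurtosis of a sequence of values: m4 / m2^2 - 3,
// where mk is the k-th central moment. A hypothetical helper that a
// MessageList-style class could expose alongside skewness_ps1.
def kurtosis(xs: Seq[Double]): Double = {
  require(xs.size >= 2, "need at least two values")
  val n    = xs.size.toDouble
  val mean = xs.sum / n
  val m2   = xs.map(x => math.pow(x - mean, 2)).sum / n
  val m4   = xs.map(x => math.pow(x - mean, 4)).sum / n
  m4 / (m2 * m2) - 3.0 // excess kurtosis: 0 for a normal distribution
}
```

Each new message for a cycle triggers a full recomputation over the accumulated sequence, which matches the `aggvalue :+ newvalue` accumulation above.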
Is there an equivalent in Kafka Streams to Spark Streaming's sliding windows?
Should I abandon Kafka Streams for this use case in favor of Spark Streaming?
Thanks in advance for your attention.
Answer 0 (score: 3)
Kafka Streams also has a windowing concept: https://docs.confluent.io/current/streams/concepts.html#windowing
Example in Kafka Streams:
KTable<Windowed<Key>, Value> fifteenMinuteWindowed =
    fiveMinuteWindowed
        .groupBy((windowedKey, value) ->
            new KeyValue<>(
                new Windowed<>(
                    windowedKey.key(),
                    // TimeWindow is the concrete subclass of the abstract Window class
                    new TimeWindow(
                        windowedKey.window().start() / 1000 / 60 / 15 * 1000 * 60 * 15,
                        windowedKey.window().start() / 1000 / 60 / 15 * 1000 * 60 * 15 + 1000 * 60 * 15
                        // the above rounds the start time down to a timestamp divisible by 15 minutes
                    )
                ),
                value
            ),
            /* your key serde */,
            /* your value serde */
        )
        .reduce(/* your adder */, /* your subtractor */, "store15m");
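The integer arithmetic in the example relies on truncating division: dividing a millisecond timestamp by `1000 * 60 * 15` and multiplying back floors it to the preceding 15-minute boundary. A plain-Scala sketch of just that trick:

```scala
// Integer division followed by multiplication floors a millisecond
// timestamp to the nearest 15-minute boundary at or below it,
// exactly as in the Java snippet above.
def floorTo15Min(tsMillis: Long): Long =
  tsMillis / 1000 / 60 / 15 * 1000 * 60 * 15
```

For instance, a timestamp at 16 minutes 3 seconds (963000 ms) floors to the 15-minute mark (900000 ms).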
You could also consider KSQL, which has the following window concepts:
- Hopping window: time-based, fixed duration, overlapping windows
- Tumbling window: time-based, fixed duration, non-overlapping, gap-less windows
- Session window: session-based, dynamically sized, non-overlapping, data-driven windows
Example in KSQL:
SELECT regionid, COUNT(*) FROM pageviews
WINDOW HOPPING (SIZE 30 SECONDS, ADVANCE BY 10 SECONDS)
WHERE UCASE(gender)='FEMALE' AND LCASE(regionid) LIKE '%_6'
GROUP BY regionid;
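In a hopping window like the one above (size 30 s, advance 10 s), each record falls into `size / advance` overlapping windows. A plain-Scala sketch of that assignment logic (an illustration of the semantics, not KSQL's actual implementation):

```scala
// Return the start timestamps (ms) of all hopping windows that
// contain the given timestamp. Window starts are multiples of
// advanceMs, and a window [s, s + sizeMs) contains ts iff
// s <= ts < s + sizeMs.
def hoppingWindowStarts(ts: Long, sizeMs: Long, advanceMs: Long): Seq[Long] = {
  val highest = ts / advanceMs * advanceMs            // latest window containing ts
  val lowest  = math.max(0L, highest - sizeMs)        // earliest candidate start
  (lowest to highest by advanceMs).filter(s => s <= ts && ts < s + sizeMs)
}
```

A record at t = 25 s belongs to the windows starting at 0 s, 10 s, and 20 s, which is why each row can contribute to several window results in the query above.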