KSQL: Get the most used words in the last hour

Time: 2019-03-20 17:23:37

Tags: apache-kafka ksql

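I have a Kafka topic that receives events of the following shape: {timestamp, word, channel_id}

I need to write a KSQL query that returns the top K words said in a given channel during the last half hour.

What I have done so far:

1 - Create a stream for the topic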

CREATE STREAM WORDEVENTS WITH (KAFKA_TOPIC='words',VALUE_FORMAT='AVRO');

2 - Filter for the channel I want

CREATE STREAM FILTERED_WORDEVENTS WITH (KAFKA_TOPIC='words_in_mail', VALUE_FORMAT='AVRO') AS SELECT WORD FROM WORDEVENTS WHERE CHANNEL_ID LIKE 'mail';
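
As a side note on the filter above: the pattern 'mail' contains no % wildcard, so LIKE should match only the exact channel id here. A minimal sketch of the same step written with plain equality, assuming exact matching is what is wanted (this is an alternative to the statement above, not something to run in addition to it):

```sql
-- Alternative form of step 2, assuming an exact match on the channel id is intended.
-- Stream, topic and column names are the ones used in the statement above.
CREATE STREAM FILTERED_WORDEVENTS
  WITH (KAFKA_TOPIC='words_in_mail', VALUE_FORMAT='AVRO') AS
  SELECT WORD
  FROM WORDEVENTS
  WHERE CHANNEL_ID = 'mail';
```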

This is where I am not sure how to continue. I can do this:

SELECT WORD, COUNT(*) AS COUNT_TOTAL FROM FILTERED_WORDEVENTS WINDOW HOPPING (SIZE 30 MINUTES, ADVANCE BY 5 SECONDS) GROUP BY WORD;

This works fine, but if I try to apply the TOPK function on top of it, it does not work:

SELECT WORD, topk(COUNT(*), 2) AS COUNT_TOTAL FROM FILTERED_WORDEVENTS WINDOW HOPPING (SIZE 30 MINUTES, ADVANCE BY 5 SECONDS) GROUP BY WORD;

It fails with:

Caused by: Can't find any functions with the name 'COUNT'

I also tried creating a stream/table from the GROUP BY query first, and then applying TOPK over the resulting count:

 CREATE TABLE COUNT_WORDS_LAST_HOUR AS SELECT WORD, COUNT(*) AS COUNT_TOTAL FROM FILTERED_WORDEVENTS WINDOW HOPPING (SIZE 30 MINUTES, ADVANCE BY 5 SECONDS) GROUP BY WORD;

But then it complains that TOPK cannot be applied to a table.

How can I solve this use case?
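
One direction I have not verified, written down only as a sketch: since TOPK seems to be accepted on streams but not on tables, the topic behind the counts table could be registered as a stream again and TOPK applied to that. The names COUNT_WORDS_STREAM, COUNT_WORDS_KEYED and GRP are made up for illustration, and I am assuming the table writes to a topic named after it (COUNT_WORDS_LAST_HOUR) in AVRO. Two limitations I would expect: TOPK returns only the highest count values, not the words they belong to, and the topic carries every update of every hopping window, so the values are not one final count per word.

```sql
-- Assumption: the CTAS above writes to a topic called COUNT_WORDS_LAST_HOUR in AVRO.
-- Re-register that topic as a stream (hypothetical name).
CREATE STREAM COUNT_WORDS_STREAM
  WITH (KAFKA_TOPIC='COUNT_WORDS_LAST_HOUR', VALUE_FORMAT='AVRO');

-- Add a constant column so that all words land in a single group (hypothetical names).
CREATE STREAM COUNT_WORDS_KEYED AS
  SELECT 'all' AS GRP, WORD, COUNT_TOTAL
  FROM COUNT_WORDS_STREAM;

-- TOPK over a plain column of a stream; this yields the 2 largest counts seen per window,
-- without the corresponding words.
SELECT GRP, TOPK(COUNT_TOTAL, 2) AS TOP_COUNTS
  FROM COUNT_WORDS_KEYED
  WINDOW TUMBLING (SIZE 30 MINUTES)
  GROUP BY GRP;
```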

0 answers:

No answers yet