I have created a stream in KSQL as shown below.
create stream incident_1 (fruitName VARCHAR) WITH (KAFKA_TOPIC='test_incident',VALUE_FORMAT='JSON');
Say the underlying topic has the following records in the stream:
fruitName
---------
apple
orange
banana
apple
orange
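For reference, a sketch of how this sample data could be produced straight from the ksqlDB CLI (assuming a reasonably recent version that supports INSERT INTO ... VALUES):
-- Hypothetical statements to reproduce the sample records above:
INSERT INTO incident_1 (fruitName) VALUES ('apple');
INSERT INTO incident_1 (fruitName) VALUES ('orange');
INSERT INTO incident_1 (fruitName) VALUES ('banana');
INSERT INTO incident_1 (fruitName) VALUES ('apple');
INSERT INTO incident_1 (fruitName) VALUES ('orange');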
I am trying to get the count of each record by creating a table in KSQL; say the output is:
select fruitName, count(*) from incident_1 group by fruitName;
fruitName count
--------- --------
apple 2
orange 2
banana 1
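Note that depending on the KSQL/ksqlDB version, a continuous aggregation like this may need to be written as a push query with EMIT CHANGES; a minimal sketch of that form:
select fruitName, count(*) as fruit_count
from incident_1
group by fruitName
emit changes;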
I also tried writing the logic in Java in a streams application instead of KSQL, but that only helps while the data volume is small. In the future we will get more than 100,000 records, and by then all these iterations will take a lot of time and slow the code down, so I would rather not use this approach. Here is the code:
static HashSet<String> hash_incident = new HashSet<String>();

// Adding elements into the HashSet using add(); duplicates are ignored
hash_incident.add(new_key);
System.out.println("incident_count " + hash_incident.size());

count_unique_notification += 1;
System.out.println("key " + new_key + " unique_notification_count " + count_unique_notification);
However, what I actually want is the total number of distinct records:
total_distinct_fruits_count
-----------------------------
3
So, is there another way to do this in KSQL?
Answer 0 (score: 0):
I'm not sure which version of ksqlDB you are running, but recent versions have a COUNT_DISTINCT function, which looks like a perfect fit for what you're trying to achieve.
-- Your source stream:
create stream incident_1 (
    fruitName VARCHAR
  ) WITH (
    KAFKA_TOPIC='test_incident',
    VALUE_FORMAT='JSON'
  );
-- `COUNT_DISTINCT` works per topic-partition.
-- So if you want a _global_ count, then you must ensure you have only a single partition.
-- This step can be avoided if topic `test_incident` only has a single partition
CREATE STREAM single_partition
  WITH (PARTITIONS = 1) AS
  SELECT * FROM incident_1;
-- Now that we have a single source partition, we can create a table with the counts:
-- Set partitions to 1 as all output is on a single key, so only need 1 partition.
-- And GROUP BY a constant/literal so that all results end up on the same key - so you get a _global_ count.
-- And use `COUNT_DISTINCT` to count the distinct fruit:
CREATE TABLE DISTINCT_COUNTS
  WITH (PARTITIONS = 1) AS
  SELECT
    1 AS K,
    COUNT_DISTINCT(fruitName) AS NUM_DISTINCT_FRUITS
  FROM single_partition
  GROUP BY 1;
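Once the statements above are running, the result can be checked with a push query against the new table (a sketch; pull-query support for this table depends on the ksqlDB version):
-- Emits the running global distinct count as new records arrive:
SELECT * FROM DISTINCT_COUNTS EMIT CHANGES;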