I have created a stream in KSQL as shown below.
create stream incident_1 (fruitName VARCHAR) WITH (KAFKA_TOPIC='test_incident',VALUE_FORMAT='JSON');
Say the underlying topic has the following records in the stream:
fruitName
---------
apple
orange
banana
apple
orange
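For reference, a sketch of how this sample data could be produced straight from the ksqlDB CLI (assuming a reasonably recent version that supports INSERT INTO ... VALUES):
-- Hypothetical statements to reproduce the sample records above:
INSERT INTO incident_1 (fruitName) VALUES ('apple');
INSERT INTO incident_1 (fruitName) VALUES ('orange');
INSERT INTO incident_1 (fruitName) VALUES ('banana');
INSERT INTO incident_1 (fruitName) VALUES ('apple');
INSERT INTO incident_1 (fruitName) VALUES ('orange');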
I am trying to get the count of each record by creating a table in KSQL; say the output is:
select fruitName, count(*) from incident_1 group by fruitName;
fruitName count
--------- --------
apple 2
orange 2
banana 1
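Note that depending on the KSQL/ksqlDB version, a continuous aggregation like this may need to be written as a push query with EMIT CHANGES; a minimal sketch of that form:
select fruitName, count(*) as fruit_count
from incident_1
group by fruitName
emit changes;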
I also tried writing the logic in Java in a streams application instead of KSQL, but that only helps while the data volume is small. In the future we will get more than 100,000 records, and by then all these iterations will take a lot of time and slow the code down, so I would rather not use this approach. Here is the code:
static HashSet<String> hash_incident = new HashSet<String>();

// Adding elements into the HashSet using add(); duplicates are ignored
hash_incident.add(new_key);
System.out.println("incident_count " + hash_incident.size());

count_unique_notification += 1;
System.out.println("key " + new_key + " unique_notification_count " + count_unique_notification);
However, what I actually want is the total number of distinct records:
total_distinct_fruits_count
-----------------------------
3
So, is there another way to do this in KSQL?
Answer 0 (score: 0):
I'm not sure which version of ksqlDB you are running, but recent versions have a COUNT_DISTINCT function, which looks like a perfect fit for what you're trying to achieve.
-- Your source stream:
create stream incident_1 (
    fruitName VARCHAR
  ) WITH (
    KAFKA_TOPIC='test_incident',
    VALUE_FORMAT='JSON'
  );
-- `COUNT_DISTINCT` works per topic-partition.
-- So if you want a _global_ count, then you must ensure you have only a single partition.
-- This step can be avoided if topic `test_incident` only has a single partition
CREATE STREAM single_partition
  WITH (PARTITIONS = 1) AS
  SELECT * FROM incident_1;
-- Now that we have a single source partition, we can create a table with the counts:
-- Set partitions to 1 as all output is on a single key, so only need 1 partition.
-- And GROUP BY a constant/literal so that all results end up on the same key - so you get a _global_ count.
-- And use `COUNT_DISTINCT` to count the distinct fruit:
CREATE TABLE DISTINCT_COUNTS
  WITH (PARTITIONS = 1) AS
  SELECT
    1 AS K,
    COUNT_DISTINCT(fruitName) AS NUM_DISTINCT_FRUITS
  FROM single_partition
  GROUP BY 1;
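Once the statements above are running, the result can be checked with a push query against the new table (a sketch; pull-query support for this table depends on the ksqlDB version):
-- Emits the running global distinct count as new records arrive:
SELECT * FROM DISTINCT_COUNTS EMIT CHANGES;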