Aggregating over multiple fields and mapping to a single result

Date: 2019-04-14 22:48:57

Tags: apache-kafka ksql

For data streamed from a ticketing system, we are trying to achieve the following:

Get the number of open tickets grouped by status and customer. The simplified schema is as follows:


 Field               | Type                      
-------------------------------------------------
 ROWTIME             | BIGINT           (system) 
 ROWKEY              | VARCHAR(STRING)  (system) 
 ID                  | BIGINT                    
 TICKET_ID           | BIGINT                    
 STATUS              | VARCHAR(STRING)           
 TICKETCATEGORY_ID   | BIGINT                    
 SUBJECT             | VARCHAR(STRING)           
 PRIORITY            | VARCHAR(STRING)           
 STARTTIME           | BIGINT                    
 ENDTIME             | BIGINT                    
 CHANGETIME          | BIGINT                    
 REMINDTIME          | BIGINT                    
 DEADLINE            | INTEGER                   
 CONTACT_ID          | BIGINT           

We want to use that data to get, per customer, the number of tickets in each status (open, waiting, in progress, ...). This data must be delivered as a single message in another topic; the schema might look like this:

 Field               | Type                      
-------------------------------------------------
 ROWTIME             | BIGINT           (system) 
 ROWKEY              | VARCHAR(STRING)  (system) 
 CONTACT_ID          | BIGINT                    
 COUNT_OPEN          | BIGINT                    
 COUNT_WAITING       | BIGINT                    
 COUNT_CLOSED        | BIGINT                    

We plan to use this data, together with other data, to enrich customer information and publish the enriched data set to an external system (e.g. Elasticsearch).

Getting the first part is quite easy: group the tickets by customer and status.

select contact_id, status, count(*) cnt from tickets group by contact_id, status;

But now we are stuck: we get multiple rows/messages per customer, and we just don't know how to transform them into a single message keyed by contact_id.

We tried joins, but all our attempts were unsuccessful.

Example:

Create a table of all tickets with status 'waiting', grouped by customer:

create table waiting_tickets_by_cust with (partitions=12,value_format='AVRO')
as select contact_id, count(*) cnt from tickets where status='waiting' group by contact_id;

A rekeyed table for the join:

CREATE TABLE T_WAITING_REKEYED WITH (KAFKA_TOPIC='WAITING_TICKETS_BY_CUST',
       VALUE_FORMAT='AVRO',
       KEY='contact_id');

Left (outer) joining that table with the customers table gives us all customers that have waiting tickets.

select c.id,w.cnt wcnt from T_WAITING_REKEYED w left join CRM_CONTACTS c on w.contact_id=c.id;

However, to use that result in a further join against the PROCESSING ticket status, we would need all customers, with NULL for the waiting count. Since the table only contains customers with waiting tickets, the join only returns customers that have tickets in both statuses.

ksql> select c.*,t.cnt from T_PROCESSING_REKEYED t left join cust_ticket_tmp1 c on t.contact_id=c.id;
null | null | null | null | 1
1555261086669 | 1472 | 1472 | 0 | 1
1555261086669 | 1472 | 1472 | 0 | 1
null | null | null | null | 1
1555064371937 | 1474 | 1474 | 1 | 1
null | null | null | null | 1
1555064371937 | 1474 | 1474 | 1 | 1
null | null | null | null | 1
null | null | null | null | 1
null | null | null | null | 1
1555064372018 | 3 | 3 | 5 | 6
1555064372018 | 3 | 3 | 5 | 6

So what is the right way to do this?

This is KSQL 5.2.1.

Thanks

Edit:

Here is some sample data.

Created a topic that limits the data to a test account:

CREATE STREAM tickets_filtered
  WITH (
        PARTITIONS=12,
        VALUE_FORMAT='JSON') AS
  SELECT id,
         contact_id,
         subject,
         status,
         TIMESTAMPTOSTRING(changetime, 'yyyy-MM-dd HH:mm:ss.SSS') AS timestring
  FROM tickets where contact_id=1472
  PARTITION BY contact_id;

00:06:44 1 $ kafkacat-dev -C -o beginning -t TICKETS_FILTERED
{"ID":2216,"CONTACT_ID":1472,"SUBJECT":"Test Bodenbach","STATUS":"closed","TIMESTRING":"2012-11-08 10:34:30.000"}
{"ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"waiting","TIMESTRING":"2019-04-16 23:07:01.000"}
{"ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"processing","TIMESTRING":"2019-04-16 23:52:08.000"}
Changing and adding something in the ticketing-system...
{"ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"waiting","TIMESTRING":"2019-04-17 00:10:38.000"}
{"ID":8952,"CONTACT_ID":1472,"SUBJECT":"another sync ticket","STATUS":"new","TIMESTRING":"2019-04-17 00:11:23.000"}
{"ID":8952,"CONTACT_ID":1472,"SUBJECT":"another sync ticket","STATUS":"close-request","TIMESTRING":"2019-04-17 00:12:04.000"}

From this data we want to create a topic in which the messages look like this:

{"CONTACT_ID":1472,"TICKETS_CLOSED":1,"TICKET_WAITING":1,"TICKET_CLOSEREQUEST":1,"TICKET_PROCESSING":0}
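The intended transformation can be prototyped outside Kafka. Here is a minimal Python sketch of the logic we are after (a dict stands in for the topic, and the output field names are illustrative): keep only the latest state per ticket, then pivot the per-status counts into one record per contact.

```python
from collections import Counter

# Sample events, copied from the TICKETS_FILTERED output above
events = [
    {"ID": 2216, "CONTACT_ID": 1472, "STATUS": "closed"},
    {"ID": 8945, "CONTACT_ID": 1472, "STATUS": "waiting"},
    {"ID": 8945, "CONTACT_ID": 1472, "STATUS": "processing"},
    {"ID": 8945, "CONTACT_ID": 1472, "STATUS": "waiting"},
    {"ID": 8952, "CONTACT_ID": 1472, "STATUS": "new"},
    {"ID": 8952, "CONTACT_ID": 1472, "STATUS": "close-request"},
]

# Step 1: keep only the latest event per ticket (table semantics --
# later events overwrite earlier ones for the same key)
latest = {}
for e in events:
    latest[e["ID"]] = e

# Step 2: count (contact, status) pairs over the latest states only
counts = Counter((e["CONTACT_ID"], e["STATUS"]) for e in latest.values())

# Step 3: collapse the counts into a single message per contact
message = {"CONTACT_ID": 1472}
for status in ("closed", "waiting", "close-request", "processing"):
    key = "TICKETS_" + status.upper().replace("-", "")
    message[key] = counts[(1472, status)]

print(message)
# {'CONTACT_ID': 1472, 'TICKETS_CLOSED': 1, 'TICKETS_WAITING': 1,
#  'TICKETS_CLOSEREQUEST': 1, 'TICKETS_PROCESSING': 0}
```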

1 Answer:

Answer 0 (score: 0)

written up here too

You can do this by building a table (for the state) and then aggregating over that table.

  1. Set up the test data

    kafkacat -b localhost -t tickets -P <<EOF
    {"ID":2216,"CONTACT_ID":1472,"SUBJECT":"Test Bodenbach","STATUS":"closed","TIMESTRING":"2012-11-08 10:34:30.000"}
    {"ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"waiting","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"processing","TIMESTRING":"2019-04-16 23:52:08.000"}
    {"ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"waiting","TIMESTRING":"2019-04-17 00:10:38.000"}
    {"ID":8952,"CONTACT_ID":1472,"SUBJECT":"another sync ticket","STATUS":"new","TIMESTRING":"2019-04-17 00:11:23.000"}
    {"ID":8952,"CONTACT_ID":1472,"SUBJECT":"another sync ticket","STATUS":"close-request","TIMESTRING":"2019-04-17 00:12:04.000"}
    EOF
    
  2. Preview the topic data

    ksql> PRINT 'tickets' FROM BEGINNING;
    Format:JSON
    {"ROWTIME":1555511270573,"ROWKEY":"null","ID":2216,"CONTACT_ID":1472,"SUBJECT":"Test Bodenbach","STATUS":"closed","TIMESTRING":"2012-11-08 10:34:30.000"}
    {"ROWTIME":1555511270573,"ROWKEY":"null","ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"waiting","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ROWTIME":1555511270573,"ROWKEY":"null","ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"processing","TIMESTRING":"2019-04-16 23:52:08.000"}
    {"ROWTIME":1555511270573,"ROWKEY":"null","ID":8945,"CONTACT_ID":1472,"SUBJECT":"sync-test","STATUS":"waiting","TIMESTRING":"2019-04-17 00:10:38.000"}
    {"ROWTIME":1555511270573,"ROWKEY":"null","ID":8952,"CONTACT_ID":1472,"SUBJECT":"another sync ticket","STATUS":"new","TIMESTRING":"2019-04-17 00:11:23.000"}
    {"ROWTIME":1555511270573,"ROWKEY":"null","ID":8952,"CONTACT_ID":1472,"SUBJECT":"another sync ticket","STATUS":"close-request","TIMESTRING":"2019-04-17 00:12:04.000"}
    
  3. Register the stream

    CREATE STREAM TICKETS (ID INT, 
                          CONTACT_ID VARCHAR, 
                          SUBJECT VARCHAR, 
                          STATUS VARCHAR, 
                          TIMESTRING VARCHAR) 
            WITH (KAFKA_TOPIC='tickets', 
            VALUE_FORMAT='JSON');
    
  4. Query the data

    ksql> SET 'auto.offset.reset' = 'earliest';
    ksql> SELECT * FROM TICKETS;
    1555502643806 | null | 2216 | 1472 | Test Bodenbach | closed | 2012-11-08 10:34:30.000
    1555502643806 | null | 8945 | 1472 | sync-test | waiting | 2019-04-16 23:07:01.000
    1555502643806 | null | 8945 | 1472 | sync-test | processing | 2019-04-16 23:52:08.000
    1555502643806 | null | 8945 | 1472 | sync-test | waiting | 2019-04-17 00:10:38.000
    1555502643806 | null | 8952 | 1472 | another sync ticket | new | 2019-04-17 00:11:23.000
    1555502643806 | null | 8952 | 1472 | another sync ticket | close-request | 2019-04-17 00:12:04.000
    
  5. At this point, we can pivot the aggregate using CASE:

    SELECT CONTACT_ID, 
          SUM(CASE WHEN STATUS='new' THEN 1 ELSE 0 END) AS TICKETS_NEW, 
          SUM(CASE WHEN STATUS='processing' THEN 1 ELSE 0 END) AS TICKETS_PROCESSING, 
          SUM(CASE WHEN STATUS='waiting' THEN 1 ELSE 0 END) AS TICKETS_WAITING, 
          SUM(CASE WHEN STATUS='close-request' THEN 1 ELSE 0 END) AS TICKETS_CLOSEREQUEST ,
          SUM(CASE WHEN STATUS='closed' THEN 1 ELSE 0 END) AS TICKETS_CLOSED
      FROM TICKETS 
      GROUP BY CONTACT_ID;
    
      1472 | 1 | 1 | 2 | 1 | 1
    

    However, you'll notice that the answer is not what we expect. That's because we are counting all six input events.

    Look at a single ticket, ID 8945: it went through three state changes (waiting -> processing -> waiting), and each one is included in the aggregate. We can verify this with a simple predicate:

    SELECT CONTACT_ID, 
          SUM(CASE WHEN STATUS='new' THEN 1 ELSE 0 END) AS TICKETS_NEW, 
          SUM(CASE WHEN STATUS='processing' THEN 1 ELSE 0 END) AS TICKETS_PROCESSING, 
          SUM(CASE WHEN STATUS='waiting' THEN 1 ELSE 0 END) AS TICKETS_WAITING, 
          SUM(CASE WHEN STATUS='close-request' THEN 1 ELSE 0 END) AS TICKETS_CLOSEREQUEST ,
          SUM(CASE WHEN STATUS='closed' THEN 1 ELSE 0 END) AS TICKETS_CLOSED
      FROM TICKETS 
      WHERE ID=8945
      GROUP BY CONTACT_ID;
    
    1472 | 0 | 1 | 2 | 0 | 0
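
The over-counting is easy to reproduce outside KSQL. A small Python sketch, using just ticket 8945's state changes, shows the difference between counting every event and looking at the latest state only:

```python
from collections import Counter

# The three state-change events for ticket 8945, in arrival order
changes = ["waiting", "processing", "waiting"]

# SUM(CASE ...) over the raw stream counts every event ...
per_event = Counter(changes)
print(per_event["waiting"], per_event["processing"])   # 2 1, as in the query output

# ... but the ticket is only ever in one state: its latest
print(changes[-1])   # waiting
```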
    
  6. What we actually want is the current state of each ticket, so repartition the data on the ticket ID:

    CREATE STREAM TICKETS_BY_ID AS SELECT * FROM TICKETS PARTITION BY ID;
    
    CREATE TABLE TICKETS_TABLE (ID INT, 
                          CONTACT_ID INT, 
                          SUBJECT VARCHAR, 
                          STATUS VARCHAR, 
                          TIMESTRING VARCHAR) 
            WITH (KAFKA_TOPIC='TICKETS_BY_ID', 
            VALUE_FORMAT='JSON',
            KEY='ID');
    
  7. Compare the event stream with the current state:

    • Event stream (KSQL stream)

      ksql> SELECT ID, TIMESTRING, STATUS FROM TICKETS;
      2216 | 2012-11-08 10:34:30.000 | closed
      8945 | 2019-04-16 23:07:01.000 | waiting
      8945 | 2019-04-16 23:52:08.000 | processing
      8945 | 2019-04-17 00:10:38.000 | waiting
      8952 | 2019-04-17 00:11:23.000 | new
      8952 | 2019-04-17 00:12:04.000 | close-request
      
    • Current state (KSQL table)

      ksql> SELECT ID, TIMESTRING, STATUS FROM TICKETS_TABLE;
      2216 | 2012-11-08 10:34:30.000 | closed
      8945 | 2019-04-17 00:10:38.000 | waiting
      8952 | 2019-04-17 00:12:04.000 | close-request
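
The stream/table distinction comes down to "every event" versus "latest value per key". A KSQL table keyed by ID behaves much like a Python dict, as this sketch shows:

```python
# Each (ID, STATUS) event from the stream above, in arrival order
events = [
    (2216, "closed"),
    (8945, "waiting"),
    (8945, "processing"),
    (8945, "waiting"),
    (8952, "new"),
    (8952, "close-request"),
]

# A table keyed by ID: each new event for a key overwrites the old state
table = {}
for ticket_id, status in events:
    table[ticket_id] = status

print(table)
# {2216: 'closed', 8945: 'waiting', 8952: 'close-request'}
```

This matches the three-row table output above, whereas the stream retains all six events.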
      
  8. Aggregating over the table, we can run the same SUM(CASE…)…GROUP BY trick as above, but now based on each ticket's current state rather than on every event:

      SELECT CONTACT_ID, 
          SUM(CASE WHEN STATUS='new' THEN 1 ELSE 0 END) AS TICKETS_NEW, 
          SUM(CASE WHEN STATUS='processing' THEN 1 ELSE 0 END) AS TICKETS_PROCESSING, 
          SUM(CASE WHEN STATUS='waiting' THEN 1 ELSE 0 END) AS TICKETS_WAITING, 
          SUM(CASE WHEN STATUS='close-request' THEN 1 ELSE 0 END) AS TICKETS_CLOSEREQUEST ,
          SUM(CASE WHEN STATUS='closed' THEN 1 ELSE 0 END) AS TICKETS_CLOSED
      FROM TICKETS_TABLE 
      GROUP BY CONTACT_ID;
    

    This gives us what we want:

      1472 | 0 | 0 | 1 | 1 | 1
    
  9. Let's feed events for another ticket into the topic and watch the table's state change. A row in the table is re-emitted whenever its state changes; you can also cancel the SELECT and re-run it to see just the current state.

    Sample data to try it yourself:

    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"new","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"processing","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"waiting","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"processing","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"waiting","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"closed","TIMESTRING":"2019-04-16 23:07:01.000"}
    {"ID":8946,"CONTACT_ID":42,"SUBJECT":"","STATUS":"close-request","TIMESTRING":"2019-04-16 23:07:01.000"}
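
Replaying those ticket-8946 events against the same latest-state model shows what the table emits at each step (a sketch only; the list records the state after each event):

```python
# Ticket 8946's statuses, in the order the sample events arrive
states = ["new", "processing", "waiting", "processing",
          "waiting", "closed", "close-request"]

emitted = []
current = {}
for status in states:
    current[8946] = status          # the event overwrites the old state
    emitted.append(current[8946])   # the table re-emits the row on change

print(emitted[-1])
# close-request
```

Only the final state survives in the table, so contact 42's aggregate ends up counting a single close-request ticket.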
    

If you want to experiment further, here's an additional stream of dummy data generated with Mockaroo, piped through awk to slow it down, so that you can see the effect on the resulting aggregate as each message arrives:

while [ 1 -eq 1 ]
  do curl -s "https://api.mockaroo.com/api/f2d6c8a0?count=1000&key=ff7856d0" | \
      awk '{print $0;system("sleep 2");}' | \
      kafkacat -b localhost -t tickets -P
  done