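Incorrect aggregation results from a ksql stream to a ksql table

Time: 2020-02-03 09:59:40

Tags: ksqldb

I have a stream in ksql called turnstile_stream. When I query all entries for one value of the station_id column in that stream, I get the following result: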

ksql> select * from turnstile_stream where station_id = 40820 emit changes;
+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|ROWTIME                                             |ROWKEY                                              |STATION_ID                                          |STATION_NAME                                        |LINE                                                |
+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|1580720442456                                       |�Ը�
                                                                                                    |40820                                               |Rosemont                                            |blue                                                |
|1580720442456                                       |�Ը�
                                                                                                    |40820                                               |Rosemont                                            |blue                                                |

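This means the stream contains only two entries for this station_id. That is correct, because I pushed only two events to the topic that the stream is created from. I also have a table, created with the query shown in the output below. The query groups by station_id and counts the events in the turnstile_stream stream.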

ksql> describe extended turnstile_summary;

Name                 : TURNSTILE_SUMMARY
Type                 : TABLE
Key field            : STATION_ID
Key format           : STRING
Timestamp field      : Not set - using <ROWTIME>
Value format         : AVRO
Kafka topic          : turnstile_summary_1 (partitions: 2, replication: 1)

 Field      | Type
----------------------------------------
 ROWTIME    | BIGINT           (system)
 ROWKEY     | VARCHAR(STRING)  (system)
 STATION_ID | INTEGER
 COUNT      | BIGINT
----------------------------------------

Queries that write from this TABLE
-----------------------------------
CTAS_TURNSTILE_SUMMARY_6 : CREATE TABLE TURNSTILE_SUMMARY WITH (KAFKA_TOPIC='turnstile_summary_1', PARTITIONS=2, REPLICAS=1, VALUE_FORMAT='AVRO') AS SELECT
  TURNSTILE_STREAM.STATION_ID "STATION_ID",
  COUNT(*) "COUNT"
FROM TURNSTILE_STREAM TURNSTILE_STREAM
GROUP BY TURNSTILE_STREAM.STATION_ID
EMIT CHANGES;

The problem is that when I query this turnstile_summary table, I get the following result, which makes no sense.

ksql> select * from turnstile_summary where station_id = 40820 emit changes;
+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+
|ROWTIME                                                           |ROWKEY                                                            |STATION_ID                                                        |COUNT                                                             |
+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+
|1580720442562                                                     |�Ը�
                                                                                                                                |40820                                                             |9                                                                 |
|1580720442562                                                     |�Ը�
                                                                                                                                |40820                                                             |10                                                                |

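As you can see, the count is 9 and then 10, which should be impossible because the stream has only two rows for this station_id. I have been scratching my head over this with no luck. Any help is much appreciated.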

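1 answer:

Answer 0 (score: 0)

I made two changes to get this working.

First, the strange characters in the ROWKEY column of both the stream and the table were caused by the long type in the key's Avro schema. I changed the key schema from the first definition below to the second, in which the timestamp field is a string: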

{
  "type": "record",
  "name": "arrival.key",
  "fields": [
    {
      "name": "timestamp",
      "type": "long"
    }
  ]
}

{
  "namespace": "com.udacity",
  "type": "record",
  "name": "arrival.key",
  "fields": [
    {
      "name": "timestamp",
      "type": "string"     <<-----------
    }
  ]
}
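
To see why a long key can show up as unreadable characters while a string key stays mostly legible, here is a minimal Python sketch of how Avro's binary encoding represents the two types (zigzag plus variable-length bytes for long, a length prefix plus UTF-8 bytes for string). The function names and the sample timestamp are illustrative assumptions, not taken from the original producer code, and the sketch leaves out the extra header bytes that the Confluent Avro serializer prepends to each message.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as Avro encodes int/long values."""
    z = (n << 1) ^ (n >> 63)  # zigzag: maps signed values to non-negative ones
    out = bytearray()
    while True:
        byte = z & 0x7F       # take the low 7 bits
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(byte)
            return bytes(out)


def avro_long(n: int) -> bytes:
    """Avro binary form of a long: just the zigzag varint."""
    return zigzag_varint(n)


def avro_string(s: str) -> bytes:
    """Avro binary form of a string: byte length as a long, then UTF-8 data."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


# Hypothetical key value; the real key timestamps are not shown in the post.
ts = 1580720442456
as_long = avro_long(ts)
as_string = avro_string(str(ts))

# Reading both byte sequences as text, the way a STRING key format would.
print(repr(as_long.decode("utf-8", errors="replace")))
print(repr(as_string.decode("utf-8", errors="replace")))
```

The long is packed into a handful of raw bytes that are not meaningful as text, whereas the string form is one length byte followed by the digits themselves.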

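Second, when I declared the stream I gave it a KEY property, which I should not have done. So I changed the stream definition from the first statement below to the second, which omits KEY: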

CREATE STREAM turnstile_stream (
    station_id INT,
    station_name VARCHAR,
    line VARCHAR
) WITH (
    KAFKA_TOPIC='app.entity.turnstile',
    VALUE_FORMAT='AVRO',
    KEY='station_id'
);

CREATE STREAM turnstile_stream (
    station_id INT,
    station_name VARCHAR,
    line VARCHAR
) WITH (
    KAFKA_TOPIC='app.entity.turnstile',
    VALUE_FORMAT='AVRO'
);
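
One possible way to picture how a wrong key declaration could inflate a count is sketched below in Python. It rests on an assumption the answer does not confirm: that declaring KEY='station_id' led the GROUP BY to trust the existing message key instead of re-keying the data by the station_id value column. The events, the key "k1", and the station ids other than 40820 are made up for illustration.

```python
from collections import defaultdict

# Each event is (message_key, station_id). All values are hypothetical.
events = [
    ("k1", 40010),
    ("k1", 40020),
    ("k1", 40820),
    ("k1", 40820),
]


def running_counts(events, group_by):
    """Return one (group, station_id, count) update per event, like a changelog."""
    counts = defaultdict(int)
    updates = []
    for event in events:
        group = group_by(event)
        counts[group] += 1
        updates.append((group, event[1], counts[group]))
    return updates


# Grouping by the message key counts every event that shares that key.
by_key = running_counts(events, lambda e: e[0])
# Grouping by the station_id value counts only that station's events.
by_station = running_counts(events, lambda e: e[1])

print([count for _, sid, count in by_key if sid == 40820])      # [3, 4]
print([count for _, sid, count in by_station if sid == 40820])  # [1, 2]
```

In this toy data, the rows for station 40820 carry counts of 3 and 4 when grouped by message key, but 1 and 2 when grouped by station_id, which resembles the gap between the 9 and 10 in the question and the two events actually sent.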

With these changes in place, my aggregation works correctly.