I have a stream in ksql called turnstile_stream. When I query all entries for one value of its station_id column, I get the following result:
ksql> select * from turnstile_stream where station_id = 40820 emit changes;
+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|ROWTIME |ROWKEY |STATION_ID |STATION_NAME |LINE |
+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|1580720442456 |�Ը�
|40820 |Rosemont |blue |
|1580720442456 |�Ը�
|40820 |Rosemont |blue |
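This means the stream contains only two entries for that station_id. That is correct, because I pushed only two events to the topic that backs the stream. I also have a table, created with the query shown in the output below, which groups by station_id and counts the events in turnstile_stream.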
ksql> describe extended turnstile_summary;
Name : TURNSTILE_SUMMARY
Type : TABLE
Key field : STATION_ID
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : AVRO
Kafka topic : turnstile_summary_1 (partitions: 2, replication: 1)
Field | Type
----------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
STATION_ID | INTEGER
COUNT | BIGINT
----------------------------------------
Queries that write from this TABLE
-----------------------------------
CTAS_TURNSTILE_SUMMARY_6 : CREATE TABLE TURNSTILE_SUMMARY WITH (KAFKA_TOPIC='turnstile_summary_1', PARTITIONS=2, REPLICAS=1, VALUE_FORMAT='AVRO') AS SELECT
TURNSTILE_STREAM.STATION_ID "STATION_ID",
COUNT(*) "COUNT"
FROM TURNSTILE_STREAM TURNSTILE_STREAM
GROUP BY TURNSTILE_STREAM.STATION_ID
EMIT CHANGES;
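For reference, here is a minimal Python sketch of what a GROUP BY station_id with COUNT(*) is expected to compute over the two events. It is a plain in-memory model, not ksql itself. The function name count_changelog and the event dictionaries are hypothetical and only mirror the columns of turnstile_stream. It also assumes every update is emitted, whereas ksql may collapse intermediate updates.

```python
from collections import defaultdict


def count_changelog(events):
    """Return the (station_id, running_count) updates a count-per-key table would see."""
    counts = defaultdict(int)
    changelog = []
    for event in events:
        key = event["station_id"]
        counts[key] += 1                      # one more event for this key
        changelog.append((key, counts[key]))  # the table row is updated in place
    return changelog


# Two events for the same station, mirroring the rows in the stream.
events = [
    {"station_id": 40820, "station_name": "Rosemont", "line": "blue"},
    {"station_id": 40820, "station_name": "Rosemont", "line": "blue"},
]

print(count_changelog(events))
```

Under this model the count for station 40820 should never exceed 2.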
The problem is that when I query this turnstile_summary table, I get the following result, which makes no sense:
ksql> select * from turnstile_summary where station_id = 40820 emit changes;
+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+
|ROWTIME |ROWKEY |STATION_ID |COUNT |
+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+
|1580720442562 |�Ը�
|40820 |9 |
|1580720442562 |�Ը�
|40820 |10 |
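As you can see, the counts are 9 and 10, which should be impossible because the stream has only two rows for that station_id. I have been scratching my head over this without success. Any help is much appreciated.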
Answer 0 (score: 0)
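I made two changes to get this working.
First, the strange characters in the ROWKEY column of both the stream and the table were caused by the long type in the key Avro schema. I changed the key schema from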
{
"type": "record",
"name": "arrival.key",
"fields": [
{
"name": "timestamp",
"type": "long"
}
]
}
to
{
"namespace": "com.udacity",
"type": "record",
"name": "arrival.key",
"fields": [
{
"name": "timestamp",
"type": "string" <<-----------
}
]
}
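One way to see why a long-typed key can show up as unreadable characters: Avro writes a long as a zigzag-encoded variable-length integer, and ksql here treats the key as a STRING (see "Key format" in the describe output). The Python sketch below only illustrates that mismatch. The helper name zigzag_varint is hypothetical, the timestamp is borrowed from the ROWTIME shown in the question, and the real key bytes in the topic will differ because they also carry whatever framing the serializer adds.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes a `long`."""
    z = (n << 1) ^ (n >> 63)  # zigzag: map signed values onto unsigned ones
    out = bytearray()
    while True:
        byte = z & 0x7F       # take the low 7 bits
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)
            return bytes(out)


# Assumed example value: an epoch-millisecond timestamp used as the key.
key_bytes = zigzag_varint(1580720442456)

# Reading those bytes as UTF-8 text, as a STRING key would be read.
as_text = key_bytes.decode("utf-8", errors="replace")

print(key_bytes.hex())
print(repr(as_text))
```

Most of the encoded bytes are not valid UTF-8, so they decode to the replacement character, which resembles the symbols in the ROWKEY column.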
Second, when I declared the stream I gave it a KEY property, which I should not have done. So I changed the stream definition from
CREATE STREAM turnstile_stream (
station_id INT,
station_name VARCHAR,
line VARCHAR
) WITH (
KAFKA_TOPIC='app.entity.turnstile',
VALUE_FORMAT='AVRO',
KEY='station_id'
);
to
CREATE STREAM turnstile_stream (
station_id INT,
station_name VARCHAR,
line VARCHAR
) WITH (
KAFKA_TOPIC='app.entity.turnstile',
VALUE_FORMAT='AVRO'
);
After these changes, my aggregation works correctly.