KTABLE does not detect identical keys (records are inserted instead of updated)

Time: 2018-12-22 01:36:46

Tags: apache-kafka confluent-schema-registry ksql

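Use case

The goal is to identify each incoming event/row and check whether it is a new row or an update. New rows go to one topic and updated rows go to another.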

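Approach: keep a lookup table (KTABLE) and perform two join operations:

  1. An inner join to detect updates.
  2. A left join, filtered to rows where the right-hand table key is null, to detect inserts/new rows.

Create two streams from the results of these two operations. Run INSERT INTO queries on those streams, which write the records back into the lookup table.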
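As a minimal sketch of the intended classification, assume the lookup table can be modeled as an in-memory dict keyed by SENDER_CODE. The `classify` function and the variable names are hypothetical and are not part of KSQL; they only illustrate what the two joins are meant to decide.

```python
def classify(events, lookup):
    """Split events into inserts and updates against a lookup table.

    events: iterable of (key, value) pairs, in arrival order.
    lookup: dict modeling the KTABLE; it is mutated as events are applied.
    """
    inserts, updates = [], []
    for key, value in events:
        if key in lookup:
            # Inner join matched: the key already exists, so this is an update.
            updates.append((key, value))
        else:
            # Left join with a null right side: the key is new.
            inserts.append((key, value))
        # Feed the row back into the lookup table, as the INSERT INTO queries do.
        lookup[key] = value
    return inserts, updates


lookup = {"SVI6FQ": "SENDER SAM II"}
events = [("SVI6FW", "SENDER SAM II"), ("SVI6FW", "SENDER SAM III")]
inserts, updates = classify(events, lookup)
print(inserts)  # [('SVI6FW', 'SENDER SAM II')]
print(updates)  # [('SVI6FW', 'SENDER SAM III')]
```

In this sketch the write-back is applied synchronously before the next event is read, which is the behaviour the design relies on.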

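Here are the steps to reproduce the problem (takes about 7 minutes):

Step 1. docker-compose up (latest Confluent Platform 5.1.0 with Docker)

Step 2. docker ps

Note: make sure the broker is up. The broker frequently goes down on my local machine.

Step 3

In a new terminal, open a bash shell in the Schema Registry container. (It is easier to monitor things if this terminal stays open.)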

docker run -it --net=cp-all-in-one_default --rm confluentinc/cp-schema-registry:5.1.0 bash

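Step 4

Create a lookup table. The topic is LOAD.TEST.LOCAL.LOOKUP.TABLE. The key type in my schema is string. Below are three sample records. First populate the lookup table with these 3 initial dummy records.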

kafka-avro-console-producer --broker-list broker:9092 --topic LOAD.TEST.LOCAL.LOOKUP.TABLE \
    --property schema.registry.url=http://schema-registry:8081 \
    --property parse.key=true \
    --property key.separator=, \
    --property key.schema='{"type":"string"}' \
    --property value.schema='{"name":"LOAD.TEST.LOCAL.LOOKUP.TABLE","type":"record","namespace":"example.sender.batch","fields":[{"name":"SENDER_CODE","type":"string"},{"name":"SENDER_NAME","type":"string"},{"name":"SENDER_CATEGORY_CODE","type":"string"},{"name":"SENDER_AGENCY_CODE","type":"string"},{"name":"SENDER_SUB_AGENCY_CODE","type":"string"},{"name":"SENDER_FOREIGN_IND","type":"string"},{"name":"SENDER_FOREIGN_COUNTRY","type":"string"},{"name":"SENDER_NAME_ALTERNATE","type":"string"},{"name":"PARENT_SENDER_CODE","type":"string"},{"name":"CHANGE_DATE","type":"string"},{"name":"REQUESTING_LOCATION","type":"string"},{"name":"REQUEST_DATE","type":"string"},{"name":"REPLACEMENT_SENDER_CODE","type":"string"},{"name":"SENDER_STATUS","type":"string"},{"name":"SENDER_DUNS","type":"string"},{"name":"ADDRESSLINE1","type":"string"},{"name":"ADDRESSLINE2","type":"string"},{"name":"ADDRESSLINE3","type":"string"},{"name":"ADDRESS4","type":"string"},{"name":"CITY","type":"string"},{"name":"STATE","type":"string"},{"name":"POSTAL_CODE","type":"string"},{"name":"URL","type":"string"},{"name":"SENDER_ACRONYM","type":"string"},{"name":"DEACTIVATED_DATE","type":"string"},{"name":"Kafka_TimeEvent","type":"string"}]}'

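Now you can insert the records below. Just paste the 3 records. If you press Return more than once and get an exception, rerun the same command above and paste the records again, pressing Return only once.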

"SVI6FQ",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FR",{"SENDER_CODE":"SVI6FR","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"374 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 973","ADDRESSLINE3":"MAILBOXC","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FN",{"SENDER_CODE":"SVI6FN","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"372 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXA","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"} 

Press Ctrl+C to exit.

Step 5. In another terminal, open the KSQL CLI.

docker run --network cp-all-in-one_default --interactive --tty --rm confluentinc/cp-ksql-cli:latest http://ksql-server:8088

Step 6. Create a KTABLE.

create table load_test_local_lookup_table with (KAFKA_TOPIC='LOAD.TEST.LOCAL.LOOKUP.TABLE',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');

Step 7. Make sure to set the following property so that you see results from the earliest offset. Run it in KSQL.

ksql> SET 'auto.offset.reset'='earliest';

You will see the following message: Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'

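Step 8. Now create a topic that your events will stream into. Use the Schema Registry bash shell from Step 4. Populate the underlying topic with the same records.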

 kafka-avro-console-producer --broker-list broker:9092 --topic LOAD.TEST.LOCAL.EVENT.STREAM \
    --property schema.registry.url=http://schema-registry:8081 \
    --property parse.key=true \
    --property key.separator=, \
    --property key.schema='{"type":"string"}' \
    --property value.schema='{"name":"LOAD.TEST.LOCAL.EVENT.STREAM","type":"record","namespace":"example.sender.batch","fields":[{"name":"SENDER_CODE","type":"string"},{"name":"SENDER_NAME","type":"string"},{"name":"SENDER_CATEGORY_CODE","type":"string"},{"name":"SENDER_AGENCY_CODE","type":"string"},{"name":"SENDER_SUB_AGENCY_CODE","type":"string"},{"name":"SENDER_FOREIGN_IND","type":"string"},{"name":"SENDER_FOREIGN_COUNTRY","type":"string"},{"name":"SENDER_NAME_ALTERNATE","type":"string"},{"name":"PARENT_SENDER_CODE","type":"string"},{"name":"CHANGE_DATE","type":"string"},{"name":"REQUESTING_LOCATION","type":"string"},{"name":"REQUEST_DATE","type":"string"},{"name":"REPLACEMENT_SENDER_CODE","type":"string"},{"name":"SENDER_STATUS","type":"string"},{"name":"SENDER_DUNS","type":"string"},{"name":"ADDRESSLINE1","type":"string"},{"name":"ADDRESSLINE2","type":"string"},{"name":"ADDRESSLINE3","type":"string"},{"name":"ADDRESS4","type":"string"},{"name":"CITY","type":"string"},{"name":"STATE","type":"string"},{"name":"POSTAL_CODE","type":"string"},{"name":"URL","type":"string"},{"name":"SENDER_ACRONYM","type":"string"},{"name":"DEACTIVATED_DATE","type":"string"},{"name":"Kafka_TimeEvent","type":"string"}]}'

"SVI6FQ",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FR",{"SENDER_CODE":"SVI6FR","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"374 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 973","ADDRESSLINE3":"MAILBOXC","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}
"SVI6FN",{"SENDER_CODE":"SVI6FN","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"372 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXA","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}

Step 9

Create a stream on this event topic.

create stream load_test_local_event_stream with (KAFKA_TOPIC='LOAD.TEST.LOCAL.EVENT.STREAM',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');

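Step 10

Derive a stream from a stream-table join that detects rows which already exist. We call it update_stream. The topic created for this stream contains only updates. This is one of my use cases: I have to filter out the update messages.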

create stream load_test_update_stream as select event.*  FROM load_test_local_event_stream event JOIN  load_test_local_lookup_table lookup ON event.sender_code = lookup.sender_Code;

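Step 11

Create a stream over the lookup table's topic. That way, if you want to update the lookup table, you can insert into this stream. (If I am not mistaken, you cannot insert into a KTABLE directly from a stream.) Hence this workaround.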

create stream load_test_lookup_feed_stream with (KAFKA_TOPIC='LOAD.TEST.LOCAL.LOOKUP.TABLE',VALUE_FORMAT='AVRO',KEY='SENDER_CODE');

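Step 12. Run the insert query. This query inserts into the lookup feed stream, so that whenever there is a message in update_stream, the lookup table is updated.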

Insert into load_test_lookup_feed_stream  select EVENT_SENDER_CODE AS SENDER_CODE, EVENT_SENDER_NAME AS SENDER_NAME, EVENT_SENDER_CATEGORY_CODE AS SENDER_CATEGORY_CODE , EVENT_SENDER_AGENCY_CODE AS SENDER_AGENCY_CODE , EVENT_SENDER_SUB_AGENCY_CODE AS SENDER_SUB_AGENCY_CODE, EVENT_SENDER_FOREIGN_IND AS SENDER_FOREIGN_IND, EVENT_SENDER_FOREIGN_COUNTRY AS SENDER_FOREIGN_COUNTRY  , EVENT_SENDER_NAME_ALTERNATE AS SENDER_NAME_ALTERNATE, EVENT_PARENT_SENDER_CODE AS PARENT_SENDER_CODE ,EVENT_CHANGE_DATE AS CHANGE_DATE, EVENT_REQUESTING_LOCATION AS REQUESTING_LOCATION , EVENT_REQUEST_DATE AS REQUEST_DATE, EVENT_REPLACEMENT_SENDER_CODE AS REPLACEMENT_SENDER_CODE  , EVENT_SENDER_STATUS AS SENDER_STATUS, EVENT_SENDER_DUNS AS SENDER_DUNS , EVENT_ADDRESSLINE1 AS ADDRESSLINE1 , EVENT_ADDRESSLINE2 AS ADDRESSLINE2, EVENT_ADDRESSLINE3 AS ADDRESSLINE3 , EVENT_ADDRESS4 AS ADDRESS4  , EVENT_CITY AS CITY , EVENT_STATE AS STATE, EVENT_POSTAL_CODE AS POSTAL_CODE, EVENT_URL AS URL, EVENT_SENDER_ACRONYM AS SENDER_ACRONYM , EVENT_DEACTIVATED_DATE AS DEACTIVATED_DATE, EVENT_KAFKA_TIMEEVENT AS KAFKA_TIMEEVENT from load_test_update_stream partition by SENDER_CODE ;

Problem: although this does change my lookup table, the record is applied as a new record, not as an update. To reproduce this issue, follow Step 15A.

Step 13. Much like the update case, create a stream that detects new records among the events.

create stream load_test_insert_stream as select event.*  FROM load_test_local_event_stream event left JOIN  load_test_local_lookup_table lookup ON event.sender_code = lookup.sender_Code where lookup.sender_Code is null ;

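Verification: you can optionally run just the SELECT query to see what is happening. If your Schema Registry bash shell is still open, insert a new record with a key of your own (try inserting a new record as in Step 15A). The new message will be available in this stream.

Step 14. As before, create an INSERT INTO query. It inserts back into the lookup table, so your lookup table is now populated with the new message.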

Insert into load_test_lookup_feed_stream select EVENT_SENDER_CODE AS SENDER_CODE, EVENT_SENDER_NAME AS SENDER_NAME, EVENT_SENDER_CATEGORY_CODE AS SENDER_CATEGORY_CODE , EVENT_SENDER_AGENCY_CODE AS SENDER_AGENCY_CODE , EVENT_SENDER_SUB_AGENCY_CODE AS SENDER_SUB_AGENCY_CODE, EVENT_SENDER_FOREIGN_IND AS SENDER_FOREIGN_IND, EVENT_SENDER_FOREIGN_COUNTRY AS SENDER_FOREIGN_COUNTRY  , EVENT_SENDER_NAME_ALTERNATE AS SENDER_NAME_ALTERNATE, EVENT_PARENT_SENDER_CODE AS PARENT_SENDER_CODE ,EVENT_CHANGE_DATE AS CHANGE_DATE, EVENT_REQUESTING_LOCATION AS REQUESTING_LOCATION , EVENT_REQUEST_DATE AS REQUEST_DATE, EVENT_REPLACEMENT_SENDER_CODE AS REPLACEMENT_SENDER_CODE  , EVENT_SENDER_STATUS AS SENDER_STATUS, EVENT_SENDER_DUNS AS SENDER_DUNS , EVENT_ADDRESSLINE1 AS ADDRESSLINE1 , EVENT_ADDRESSLINE2 AS ADDRESSLINE2, EVENT_ADDRESSLINE3 AS ADDRESSLINE3 , EVENT_ADDRESS4 AS ADDRESS4  , EVENT_CITY AS CITY , EVENT_STATE AS STATE, EVENT_POSTAL_CODE AS POSTAL_CODE, EVENT_URL AS URL, EVENT_SENDER_ACRONYM AS SENDER_ACRONYM , EVENT_DEACTIVATED_DATE AS DEACTIVATED_DATE, EVENT_KAFKA_TIMEEVENT AS KAFKA_TIMEEVENT from load_test_insert_stream partition by SENDER_CODE ;

Step 15

What the problem is and how to reproduce it.

Step 15A. How to insert a new sample record

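Run the command from Step 8 (the one with the schema). Insert/paste a new record like the one below. Note that I changed both the message key and the sender code. The message key and the row key should always match. For example: "SVI6FW" and "SENDER_CODE":"SVI6FW"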
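To check pasted lines before sending them, a small sketch like the following could be used. It assumes each line has the `"KEY",{json}` shape used above, with `,` as the key separator and no comma inside the key itself. `key_matches_row` is a hypothetical helper, not part of any Kafka tool.

```python
import json


def key_matches_row(line, key_field="SENDER_CODE"):
    """Return True if the message key equals the row's key field.

    Expects a line of the form: "KEY",{...json object...}
    """
    # Split at the first comma only; the JSON value contains commas too.
    raw_key, raw_value = line.split(",", 1)
    key = json.loads(raw_key)      # the key is a JSON string, e.g. "SVI6FW"
    value = json.loads(raw_value)  # the value is a JSON object
    return key == value.get(key_field)


good = '"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM II"}'
bad = '"SVI6FW",{"SENDER_CODE":"SVI6FQ","SENDER_NAME":"SENDER SAM II"}'
print(key_matches_row(good))  # True
print(key_matches_row(bad))   # False
```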

"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM II","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}

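Step 15B. How to update the sample record

Much like the new-record insert in 15A, but use the same message key and change only the name or some other value. For example, 'SAM II' becomes 'SAM III'.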

"SVI6FW",{"SENDER_CODE":"SVI6FW","SENDER_NAME":"SENDER SAM III","SENDER_CATEGORY_CODE":"5","SENDER_AGENCY_CODE":"","SENDER_SUB_AGENCY_CODE":"","SENDER_FOREIGN_IND":"","SENDER_FOREIGN_COUNTRY":"","SENDER_NAME_ALTERNATE":"","PARENT_SENDER_CODE":"5","CHANGE_DATE":"2018-09-27","REQUESTING_LOCATION":"","REQUEST_DATE":"","REPLACEMENT_SENDER_CODE":"","SENDER_STATUS":"","SENDER_DUNS":"","ADDRESSLINE1":"373 ELAN VILLAGE LANE","ADDRESSLINE2":"APPARTMENT 972","ADDRESSLINE3":"MAILBOXB","ADDRESS4":"","CITY":"SAN JOSE","STATE":"CA","POSTAL_CODE":"95134","URL":"","SENDER_ACRONYM":"","DEACTIVATED_DATE":"","Kafka_TimeEvent":"2018-09-27"}

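Problem: you will see that my lookup table is not updated. Even when a message is sent with the same key, it is treated as a new message. As a result I cannot detect updates; every message is a new message.

You can test this as follows.

  1. Send a new message with a key of your own (15A). It will be available in load_test_insert_stream.
  2. Send an updated message with the same key (as in 15B). It should be available in load_test_update_stream, but it ends up in load_test_insert_stream instead, and the lookup table treats it as a new message.

Any new approaches or suggestions are welcome!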

1 answer:

Answer 0: (score: 0)

I assume your design is essentially along these lines:

-- stream of inputs:
CREATE STREAM INPUT (ID INT KEY, V0 INT) WITH (kafka_topic='test_topic', value_format='JSON', PARTITIONS=1);

-- table built from the stream of inserts:
CREATE TABLE EXISTING (ID INT PRIMARY KEY, IGNORED INT) WITH (kafka_topic='INSERTS', value_format='JSON', PARTITIONS=1);

-- stream of inserts:
CREATE STREAM INSERTS AS SELECT INPUT.ID, INPUT.V0 AS V0 FROM INPUT LEFT JOIN EXISTING ON INPUT.ID = EXISTING.ID WHERE EXISTING.ID IS NULL;

-- stream of updates:
CREATE STREAM UPDATES AS SELECT INPUT.ID, INPUT.V0 AS V0 FROM INPUT JOIN EXISTING ON INPUT.ID = EXISTING.ID;

You then insert some records:

INSERT INTO INPUT VALUES (1, 3);
INSERT INTO INPUT VALUES (2, 4);
INSERT INTO INPUT VALUES (1, 5);

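and expect the first two rows to be output to the INSERTS stream and the last row to the UPDATES stream.

I have tested the above on version 0.11 of ksqlDB, and it does work...

If you insert the records one at a time from the CLI, the output is as you expect. However, if you insert all three rows together, for example by running them all on the same line of the CLI: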

INSERT INTO INPUT VALUES (1, 3);INSERT INTO INPUT VALUES (2, 4);INSERT INTO INPUT VALUES (1, 5);

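Then all three rows end up in the INSERTS stream. Why, you may ask?

tl;dr: the solution is fragile. It will not work correctly if an update arrives close to its insert.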

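There is a race condition in the design. If the join processes the second input row before the output for the first row has been produced to the INSERTS topic and the join has polled that topic to read it back, then the EXISTING table will not yet contain the row, so the second row is incorrectly sent to INSERTS rather than UPDATES.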
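The race can be illustrated with a toy model. This is only a sketch: `simulate` and its `lag` parameter are hypothetical and stand in for the delay between the join emitting a row to INSERTS and the EXISTING table reflecting it. It is not how ksqlDB is implemented.

```python
from collections import deque


def simulate(input_ids, lag):
    """Classify ids as inserts/updates when table writes become visible late.

    lag: number of further input rows the join processes before a row
         written to INSERTS becomes visible in the EXISTING table.
    """
    existing = set()     # what the join can currently see
    in_flight = deque()  # (visible_at_step, id) rows not yet visible
    inserts, updates = [], []
    for step, row_id in enumerate(input_ids):
        # Make visible any writes whose delay has elapsed.
        while in_flight and in_flight[0][0] <= step:
            existing.add(in_flight.popleft()[1])
        if row_id in existing:
            updates.append(row_id)
        else:
            inserts.append(row_id)
            in_flight.append((step + 1 + lag, row_id))
    return inserts, updates


print(simulate([1, 2, 1], lag=0))  # ([1, 2], [1])
print(simulate([1, 2, 1], lag=5))  # ([1, 2, 1], [])
```

With no lag the third row is classified as an update. With a lag longer than the gap between the rows, all three land in the inserts list, matching the behaviour seen when the three statements are run on one line.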

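There are a few configuration settings you can experiment with to see whether you can make this work for your use case.

  • Setting max.task.idle.ms to a higher value means the join waits longer for data to appear on the table side of the join. However, this will not help if the update and the insert happen within the same millisecond, and increasing it hurts throughput and latency.
  • Setting cache.max.bytes.buffering to zero turns off buffering in the Streams library, which may help.
  • Setting linger.ms to zero means the Kafka producer does not delay sending messages.

Even with all of this, the system is asynchronous and your results may vary. If updates never happen close to their inserts, the system will work fine. But if an update can closely follow its insert, you may find such updates misclassified as inserts.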