What is the best way to join two (or more) Kafka topics in KSQL so that changes from all topics are emitted?

Time: 2020-09-23 14:10:15

Tags: apache-kafka ksqldb debezium

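We have a "microservices" platform, and we are using debezium to capture change data from the databases on that platform, which works well.

Now we would like to make it easy to join these topics and stream the results into a new topic that multiple services can consume.

Disclaimer: this assumes ksqlDB v0.11 and its CLI (many of these features may not work in older versions).

An example of two tables from two database instances that are streamed into Kafka topics: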

-- source identity microservice (postgres)
CREATE TABLE public.user_entity (
    id varchar(36) NOT NULL,
    first_name varchar(255) NULL,
    PRIMARY KEY (id)
);
-- ksql stream 
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');

-- source organization microservice (postgres)
CREATE TABLE public.user_info (
    id varchar(36) NOT NULL,
    user_entity_id varchar(36) NOT NULL,
    business_unit varchar(255) NOT NULL,
    cost_center varchar(255) NOT NULL,
    PRIMARY KEY (id)
);
-- ksql stream 
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');

Option 1: streams

CREATE STREAM stream_user_info_by_user_entity_id
AS SELECT * FROM stream_user_info
PARTITION BY user_entity_id
EMIT CHANGES;

SELECT 
    user_entity_id,
    first_name,
    business_unit,
    cost_center
FROM stream_user_entity ue
LEFT JOIN stream_user_info_by_user_entity_id ui WITHIN 365 DAYS ON ue.id = ui.user_entity_id 
EMIT CHANGES;

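Note the WITHIN 365 DAYS. Conceptually, these tables can go a very long time without being changed, so the window would technically have to be infinitely large. That looks suspicious and seems to suggest this is not a good approach.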
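The concern about the window can be illustrated with a rough Python sketch. This is not ksqlDB code: `within_window` is a hypothetical helper and the dates are made up, and it only assumes that a windowed stream-stream join pairs two records when their timestamps differ by no more than the window.

```python
from datetime import datetime, timedelta


def within_window(left_ts, right_ts, window):
    """Return True when two event timestamps are close enough to be joined."""
    return abs(left_ts - right_ts) <= window


# Hypothetical timestamps, chosen only for illustration.
window = timedelta(days=365)
name_change = datetime(2019, 1, 1)
info_change_soon = datetime(2019, 6, 1)   # a few months later
info_change_late = datetime(2020, 9, 1)   # more than a year later

joined_soon = within_window(name_change, info_change_soon, window)
joined_late = within_window(name_change, info_change_late, window)
```

Under that assumption the second pair is never joined, which is why any finite window is uncomfortable for rows that can stay unchanged for years.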

Option 2: table

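-- table over the re-keyed user_info topic (first_name is not part of that topic)
CREATE TABLE ktable_user_info_by_user_entity_id (
    user_entity_id VARCHAR PRIMARY KEY,
    business_unit VARCHAR,
    cost_center VARCHAR
)
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');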

SELECT 
    user_entity_id,
    first_name,
    business_unit,
    cost_center
FROM stream_user_entity ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id 
EMIT CHANGES;

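We no longer need the WITHIN 365 DAYS window, so this feels more correct. However, changes are emitted only when a message arrives on the stream, not when one arrives on the table.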

In this example: user updates first_name -> change emitted; user updates business_unit -> no change emitted.
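The behaviour described above can be mimicked with a small Python sketch. It is a simplified model and not ksqlDB itself: `stream_table_join` and the sample events are hypothetical, and the only assumption is that a stream-table join emits output for stream-side records while table-side records just update state.

```python
def stream_table_join(events):
    """Replay (source, key, value) events; emit a row only for stream-side events."""
    table = {}    # latest business_unit per user_entity_id (table side)
    emitted = []
    for source, key, value in events:
        if source == "table":
            table[key] = value  # state is updated silently, nothing is emitted
        else:
            # LEFT JOIN: the table side is None when no row exists yet
            emitted.append((key, value, table.get(key)))
    return emitted


events = [
    ("table", "u1", "sales"),       # user_info row arrives
    ("stream", "u1", "Ann"),        # first_name update -> output
    ("table", "u1", "marketing"),   # business_unit update -> no output
    ("stream", "u1", "Anna"),       # first_name update -> output, sees "marketing"
]
rows = stream_table_join(events)
```

The business_unit change only becomes visible the next time the same user produces a stream-side event.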

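Maybe there is a way to create a merged stream partitioned by user_entity_id and then join the child tables, which would hold the current state. This leads me to...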

Option 3: merged stream and tables

-- "master" change stream with merged stream output
CREATE STREAM stream_user_changes (user_entity_id VARCHAR) 
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;

CREATE STREAM stream_user_entity_by_id
AS SELECT * FROM stream_user_entity
PARTITION BY id
EMIT CHANGES;

CREATE TABLE ktable_user_entity_by_id (
    id VARCHAR PRIMARY KEY,
    first_name VARCHAR
) with (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');

SELECT 
    uec.user_entity_id,
    ue.first_name,
    ui.business_unit,
    ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id 
EMIT CHANGES;

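This one looks best, but there seem to be a lot of moving parts: for each table we have 2 streams, 1 insert query, and 1 ktable. Another potential problem is a hidden race condition, where the stream emits a change before the table has been updated in the background.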
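To make the idea concrete, here is a rough Python model of the merged change stream joined against both tables. It is a sketch under assumptions, not ksqlDB code: `merged_change_join` and the sample events are hypothetical, and it assumes each table is updated before the corresponding change key is processed, which is exactly the ordering the race-condition concern is about.

```python
def merged_change_join(events):
    """Replay (topic, user_entity_id, value) events from both topics.

    Every event feeds the merged change stream, so every event emits a row
    built from the latest state of both tables.
    """
    user_entity = {}  # id -> first_name
    user_info = {}    # user_entity_id -> business_unit
    emitted = []
    for topic, key, value in events:
        # Assumption: the table update is visible before the change key is joined.
        if topic == "user_entity":
            user_entity[key] = value
        elif topic == "user_info":
            user_info[key] = value
        emitted.append((key, user_entity.get(key), user_info.get(key)))
    return emitted


events = [
    ("user_entity", "u1", "Ann"),
    ("user_info", "u1", "sales"),
    ("user_info", "u1", "marketing"),  # now also produces an output row
]
rows = merged_change_join(events)
```

In this model a business_unit update does produce an output row, unlike the plain stream-table join in option 2.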

Option 4: more merged tables and streams

CREATE STREAM stream_user_entity_changes_enriched
AS SELECT 
    ue.id AS user_entity_id,
    ue.first_name,
    ui.business_unit,
    ui.cost_center
FROM stream_user_entity_by_id ue
LEFT JOIN ktable_user_info_by_user_entity_id ui ON ue.id = ui.user_entity_id
EMIT CHANGES;

CREATE STREAM stream_user_info_changes_enriched
AS SELECT 
    ui.user_entity_id,
    ue.first_name,
    ui.business_unit,
    ui.cost_center
FROM stream_user_info_by_user_entity_id ui
LEFT JOIN ktable_user_entity_by_id ue ON ui.user_entity_id = ue.id
EMIT CHANGES;


CREATE STREAM stream_user_changes_enriched (user_entity_id VARCHAR, first_name VARCHAR, business_unit VARCHAR, cost_center VARCHAR) 
WITH (KAFKA_TOPIC='stream_user_changes_enriched', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_entity_changes_enriched;
INSERT INTO stream_user_changes_enriched SELECT * FROM stream_user_info_changes_enriched;

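Conceptually this is the same as the previous option, but the "merge" happens after the joins. This might conceivably eliminate any potential race condition, because we are primarily selecting from the streams rather than the tables.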

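The downside is that the complexity is even worse than in option 3, and writing and tracking all of these streams for any join with more than two tables would be mind-numbing...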

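Question: Which approach is best for this use case, and/or are we trying to do something that ksql should not be used for? Would we be better off offloading this to a traditional RDBMS or to a Spark alternative?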

1 answer:

Answer 0 (score: 0)

I will try to answer my own question, and will only accept the answer if it gets upvoted.

The answer is: option 3

Here are the reasons why it is the best fit for this use case, although this may be subjective:

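  • Streams partitioned by primary and foreign keys are common and simple.
  • Tables based on those streams are common and simple.
  • Tables used this way will not cause a race condition.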

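All of the options have their merits, for example if you do not care about emitting every change, or if the data behaves like a stream (logs or events) rather than a slowly changing dimension (SQL tables).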

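As for the "race condition", the word "table" misleads you into thinking that you are actually processing and persisting data. In reality they are not physical tables; they behave more like subqueries over a stream. Note: aggregate tables, which actually produce topics, may be an exception (I would suggest that this is a different topic, but I would like to see comments).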

Finally (the syntax may contain some minor errors):

---------------------------------------------------------
-- shared objects (likely to be used by multiple queries)
---------------------------------------------------------

-- shared streams wrapping topics
CREATE STREAM stream_user_entity WITH (KAFKA_TOPIC='cdc.identity.public.user_entity', value_format='avro');
CREATE STREAM stream_user_info WITH (KAFKA_TOPIC='cdc.application.public.user_info', value_format='avro');

-- shared keyed streams (i like to think of them as "indexes")
CREATE STREAM stream_user_entity_by_id AS 
SELECT * FROM stream_user_entity PARTITION BY id
EMIT CHANGES;
CREATE STREAM stream_user_info_by_user_entity_id AS 
SELECT * FROM stream_user_info PARTITION BY user_entity_id
EMIT CHANGES;

-- shared keyed tables (inferring columns with schema registry)
CREATE TABLE ktable_user_entity_by_id (id VARCHAR PRIMARY KEY) 
WITH (KAFKA_TOPIC='stream_user_entity_by_id', value_format='avro');
CREATE TABLE ktable_user_info_by_user_entity_id (user_entity_id VARCHAR PRIMARY KEY) 
WITH (KAFKA_TOPIC='stream_user_info_by_user_entity_id', value_format='avro');


---------------------------------------------------------
-- query objects (specific to the produced data)
---------------------------------------------------------
-- "master" change stream (include all tables in join)
CREATE STREAM stream_user_changes (user_entity_id VARCHAR) 
WITH (KAFKA_TOPIC='stream_user_changes', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='avro');
INSERT INTO stream_user_changes SELECT id as user_entity_id FROM stream_user_entity;
INSERT INTO stream_user_changes SELECT user_entity_id FROM stream_user_info;

-- pretty simple looking query
SELECT 
    uec.user_entity_id,
    ue.first_name,
    ui.business_unit,
    ui.cost_center
FROM stream_user_changes uec
LEFT JOIN ktable_user_entity_by_id ue ON uec.user_entity_id = ue.id
LEFT JOIN ktable_user_info_by_user_entity_id ui ON uec.user_entity_id = ui.user_entity_id 
EMIT CHANGES;

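The "shared" objects are basically the streaming schema (the intent is to create them for all of our topics, but that is another question), and the second part is like the query schema. In the end, it is a functional, clean, and repeatable pattern.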