I've been looking for an example using Kafka Streams that shows how to do this kind of thing: join a customers table with an addresses table and sink the data to ES:
+------+------------+----------------+-----------------------+
| id | first_name | last_name | email |
+------+------------+----------------+-----------------------+
| 1001 | Sally | Thomas | sally.thomas@acme.com |
| 1002 | George | Bailey | gbailey@foobar.com |
| 1003 | Edward | Davidson | ed@walker.com |
| 1004 | Anne | Kim | annek@noanswer.org |
+------+------------+----------------+-----------------------+
+----+-------------+---------------------------+------------+--------------+-------+----------+
| id | customer_id | street | city | state | zip | type |
+----+-------------+---------------------------+------------+--------------+-------+----------+
| 10 | 1001 | 3183 Moore Avenue | Euless | Texas | 76036 | SHIPPING |
| 11 | 1001 | 2389 Hidden Valley Road | Harrisburg | Pennsylvania | 17116 | BILLING |
| 12 | 1002 | 281 Riverside Drive | Augusta | Georgia | 30901 | BILLING |
| 13 | 1003 | 3787 Brownton Road | Columbus | Mississippi | 39701 | SHIPPING |
| 14 | 1003 | 2458 Lost Creek Road | Bethlehem | Pennsylvania | 18018 | SHIPPING |
| 15 | 1003 | 4800 Simpson Square | Hillsdale | Oklahoma | 73743 | BILLING |
| 16 | 1004 | 1289 University Hill Road | Canehill | Arkansas | 72717 | LIVING |
+----+-------------+---------------------------+------------+--------------+-------+----------+
"hits": [
{
"_index": "customers_with_addresses",
"_type": "_doc",
"_id": "1",
"_score": 1.3278645,
"_source": {
"first_name": "Sally",
"last_name": "Thomas",
"email": "sally.thomas@acme.com",
"addresses": [{
"street": "3183 Moore Avenue",
"city": "Euless",
"state": "Texas",
"zip": "76036",
"type": "SHIPPING"
}, {
"street": "2389 Hidden Valley Road",
"city": "Harrisburg",
"state": "Pennsylvania",
"zip": "17116",
"type": "BILLING"
            }]
}
}, ….
The table data comes from Debezium topics. Am I right in thinking that I need some intermediate Java to join the streams, output the result to a new topic, and then sink that into ES?
Would anyone have any example code for this?
Thanks.
Answer 0 (score: 4)
Depending on how strict your requirement is for nesting multiple addresses in a single customer node, you can do this in KSQL (which is built on top of Kafka Streams).
First, populate some test data into Kafka (which in your case is already done via Debezium):
$ curl -s "https://api.mockaroo.com/api/ffa9ff20?count=10&key=ff7856d0" | kafkacat -b localhost:9092 -t addresses -P
$ curl -s "https://api.mockaroo.com/api/9b868890?count=4&key=ff7856d0" | kafkacat -b localhost:9092 -t customers -P
Fire up KSQL and inspect the data first:
ksql> PRINT 'addresses' FROM BEGINNING ;
Format:JSON
{"ROWTIME":1558519823351,"ROWKEY":"null","id":1,"customer_id":1004,"street":"8 Moulton Center","city":"Bronx","state":"New York","zip":"10474","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":2,"customer_id":1001,"street":"5 Hollow Ridge Alley","city":"Washington","state":"District of Columbia","zip":"20016","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":3,"customer_id":1000,"street":"58 Maryland Point","city":"Greensboro","state":"North Carolina","zip":"27404","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":4,"customer_id":1002,"street":"55795 Derek Avenue","city":"Temple","state":"Texas","zip":"76505","type":"LIVING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":5,"customer_id":1002,"street":"164 Continental Plaza","city":"Modesto","state":"California","zip":"95354","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":6,"customer_id":1004,"street":"6 Miller Road","city":"Louisville","state":"Kentucky","zip":"40205","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":7,"customer_id":1003,"street":"97 Shasta Place","city":"Pittsburgh","state":"Pennsylvania","zip":"15286","type":"BILLING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":8,"customer_id":1000,"street":"36 Warbler Circle","city":"Memphis","state":"Tennessee","zip":"38109","type":"SHIPPING"}
{"ROWTIME":1558519823351,"ROWKEY":"null","id":9,"customer_id":1001,"street":"890 Eagan Circle","city":"Saint Paul","state":"Minnesota","zip":"55103","type":"SHIPPING"}
{"ROWTIME":1558519823354,"ROWKEY":"null","id":10,"customer_id":1000,"street":"8 Judy Terrace","city":"Washington","state":"District of Columbia","zip":"20456","type":"SHIPPING"}
^C
Topic printing ceased
ksql>
ksql> PRINT 'customers' FROM BEGINNING;
Format:JSON
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1001,"first_name":"Jolee","last_name":"Handasyde","email":"jhandasyde0@nhs.uk"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1002,"first_name":"Rebeca","last_name":"Kerrod","email":"rkerrod1@sourceforge.net"}
{"ROWTIME":1558519852363,"ROWKEY":"null","id":1003,"first_name":"Bobette","last_name":"Brumble","email":"bbrumble2@cdc.gov"}
{"ROWTIME":1558519852368,"ROWKEY":"null","id":1004,"first_name":"Royal","last_name":"De Biaggi","email":"rdebiaggi3@opera.com"}
Now we declare a STREAM (Kafka topic + schema) over the data so that we can manipulate it further:
ksql> CREATE STREAM addresses_RAW (ID INT, CUSTOMER_ID INT, STREET VARCHAR, CITY VARCHAR, STATE VARCHAR, ZIP VARCHAR, TYPE VARCHAR) WITH (KAFKA_TOPIC='addresses', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
ksql> CREATE STREAM customers_RAW (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON');
Message
----------------
Stream created
----------------
We're going to model customers as a TABLE, and for that the Kafka messages need to be keyed correctly (at the moment they have null keys, as can be seen from the "ROWKEY":"null" in the PRINT output above). You can configure Debezium to set the message key itself (a config sketch of that follows below), in which case this step may not be necessary in KSQL:
ksql> CREATE STREAM CUSTOMERS_KEYED WITH (PARTITIONS=1) AS SELECT * FROM CUSTOMERS_RAW PARTITION BY ID;
Message
----------------------------
Stream created and running
----------------------------
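(As an aside, here is roughly what that Debezium configuration could look like. This is a hedged sketch of a config fragment, not a complete connector definition: message.key.columns is the Debezium property that specifies which columns make up the message key, but the connector name and the inventory.customers table are assumptions for illustration.)
{
  "name": "source-mysql-customers",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "message.key.columns": "inventory.customers:id"
  }
}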
Now we declare a TABLE (state for a given key, instantiated from a Kafka topic + schema):
ksql> CREATE TABLE CUSTOMER (ID INT, FIRST_NAME VARCHAR, LAST_NAME VARCHAR, EMAIL VARCHAR) WITH (KAFKA_TOPIC='CUSTOMERS_KEYED', VALUE_FORMAT='JSON', KEY='ID');
Message
---------------
Table created
---------------
Now we can join the data:
ksql> CREATE STREAM customers_with_addresses AS
SELECT CUSTOMER_ID,
FIRST_NAME + ' ' + LAST_NAME AS FULL_NAME,
FIRST_NAME,
LAST_NAME,
TYPE AS ADDRESS_TYPE,
STREET,
CITY,
STATE,
ZIP
FROM ADDRESSES_RAW A
INNER JOIN CUSTOMER C
ON A.CUSTOMER_ID = C.ID;
Message
----------------------------
Stream created and running
----------------------------
This creates a new KSQL STREAM, which in turn populates a new Kafka topic.
ksql> SHOW STREAMS;
Stream Name | Kafka Topic | Format
------------------------------------------------------------------------------------------
CUSTOMERS_KEYED | CUSTOMERS_KEYED | JSON
ADDRESSES_RAW | addresses | JSON
CUSTOMERS_RAW | customers | JSON
CUSTOMERS_WITH_ADDRESSES | CUSTOMERS_WITH_ADDRESSES | JSON
The stream has a schema:
ksql> DESCRIBE CUSTOMERS_WITH_ADDRESSES;
Name : CUSTOMERS_WITH_ADDRESSES
Field | Type
------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
CUSTOMER_ID | INTEGER (key)
FULL_NAME | VARCHAR(STRING)
FIRST_NAME | VARCHAR(STRING)
ADDRESS_TYPE | VARCHAR(STRING)
LAST_NAME | VARCHAR(STRING)
STREET | VARCHAR(STRING)
CITY | VARCHAR(STRING)
STATE | VARCHAR(STRING)
ZIP | VARCHAR(STRING)
------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
We can query the stream:
ksql> SELECT * FROM CUSTOMERS_WITH_ADDRESSES WHERE CUSTOMER_ID=1002;
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | LIVING | Kerrod | 55795 Derek Avenue | Temple | Texas | 76505
1558519823351 | 1002 | 1002 | Rebeca Kerrod | Rebeca | SHIPPING | Kerrod | 164 Continental Plaza | Modesto | California | 95354
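Note that the joined stream is flat — one row per customer/address pair — rather than nested like your target document. If your KSQL version includes the COLLECT_LIST UDAF, you could aggregate per customer before sinking; a hedged sketch (COLLECT_LIST handles primitive columns, so this collects just the street values):
ksql> CREATE TABLE CUSTOMERS_NESTED AS
        SELECT CUSTOMER_ID,
               COLLECT_LIST(STREET) AS STREETS
        FROM CUSTOMERS_WITH_ADDRESSES
        GROUP BY CUSTOMER_ID;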
We can also stream it to Elasticsearch with Kafka Connect:
curl -i -X POST -H "Accept:application/json" \
-H "Content-Type:application/json" http://localhost:8083/connectors/ \
-d '{
"name": "sink-elastic-customers_with_addresses-00",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"topics": "CUSTOMERS_WITH_ADDRESSES",
"connection.url": "http://elasticsearch:9200",
"type.name": "type.name=kafkaconnect",
"key.ignore": "true",
"schema.ignore": "true",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"
}
}'
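Once the connector is created, its state can be checked with the standard Kafka Connect status endpoint:
$ curl -s http://localhost:8083/connectors/sink-elastic-customers_with_addresses-00/status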
The result:
$ curl -s http://localhost:9200/customers_with_addresses/_search | jq '.hits.hits[0]'
{
"_index": "customers_with_addresses",
"_type": "type.name=kafkaconnect",
"_id": "CUSTOMERS_WITH_ADDRESSES+0+2",
"_score": 1,
"_source": {
"ZIP": "76505",
"CITY": "Temple",
"ADDRESS_TYPE": "LIVING",
"CUSTOMER_ID": 1002,
"FULL_NAME": "Rebeca Kerrod",
"STATE": "Texas",
"STREET": "55795 Derek Avenue",
"LAST_NAME": "Kerrod",
"FIRST_NAME": "Rebeca"
}
}
Answer 1 (score: 2)
Yes, you can implement this solution with the Kafka Streams API in Java, as follows.
Below is an example (assuming the data is consumed in JSON format):
// Assumes a StreamsBuilder named builder, plus String and JsonNode serdes, e.g.:
//   Serde<String> stringSerde = Serdes.String();
//   Serde<JsonNode> jsonNodeSerde =
//       Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer()); // org.apache.kafka.connect.json
KStream<String, JsonNode> customers = builder.stream("customer", Consumed.with(stringSerde, jsonNodeSerde));
KStream<String, JsonNode> addresses = builder.stream("address", Consumed.with(stringSerde, jsonNodeSerde));

ObjectMapper mapper = new ObjectMapper();

// Re-key the customer stream by customer ID in order to join with the addresses.
KStream<String, JsonNode> customerRekeyed = customers.selectKey((key, value) -> value.get("id").asText());

// Re-key the addresses by customer_id and aggregate each customer's addresses into a JSON array.
KTable<String, JsonNode> addressTable = addresses
        .selectKey((key, value) -> value.get("customer_id").asText())
        .groupByKey(Grouped.with(stringSerde, jsonNodeSerde))
        .aggregate(() -> mapper.createArrayNode(),                                // initializer: empty array
                   (key, value, aggregate) -> ((ArrayNode) aggregate).add(value), // adder
                   Materialized.with(stringSerde, jsonNodeSerde));

// Join the customer stream with the address table. With a left join, right is
// null for a customer that has no addresses yet.
KStream<String, JsonNode> customerAddressStream = customerRekeyed.leftJoin(addressTable,
        (left, right) -> {
            List<JsonNode> addressList = new ArrayList<>();
            if (right != null) {
                ((ArrayNode) right).elements().forEachRemaining(addressList::add);
            }
            ((ObjectNode) left).putArray("addresses").addAll(addressList);
            return left;
        },
        Joined.keySerde(stringSerde).withValueSerde(jsonNodeSerde));
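To complete the picture, the joined stream would be written out to a new topic and the topology started; a minimal sketch, assuming an output topic name and standard Streams properties:
// Sink the joined stream to a new topic, which the Elasticsearch sink
// connector can then pick up.
customerAddressStream.to("customers_with_addresses", Produced.with(stringSerde, jsonNodeSerde));

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-address-join");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();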
You can find details on all the join types here:
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#joining
Answer 2 (score: 1)
A while ago, we built a demo and blog post on the Debezium blog for this very use case (streaming aggregates to Elasticsearch).
One issue to keep in mind is that this solution (based on Kafka Streams, but I reckon the same holds for KSQL) is prone to exposing intermediate join results. E.g., say you insert a customer and 10 addresses in one transaction. The stream-join approach might first produce an aggregate of the customer and their first five addresses, and shortly thereafter the complete aggregate with all ten addresses. That may or may not be desirable for your specific use case. I also remember that handling deletions isn't trivial (e.g., if you delete one of the 10 addresses, you'd have to produce the aggregate again with the remaining 9 addresses, which themselves may not have been touched at all).
An alternative to consider is the outbox pattern, where you essentially produce an explicit event with a pre-computed aggregate from within your application itself. I.e., it requires a little help from the application, but it avoids the subtleties of producing join results after the fact.
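For illustration, a minimal sketch of the outbox write with plain JDBC, assuming a hypothetical outbox table that Debezium captures (the table and column names are made up for the example):
// Within the same transaction as the customer/address writes, insert the
// pre-computed aggregate as an event into the outbox table. Debezium streams
// that table, so consumers see the complete document in one event.
try (Connection conn = dataSource.getConnection()) {
    conn.setAutoCommit(false);
    insertCustomerAndAddresses(conn);  // hypothetical business writes
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, payload) VALUES (?, ?, ?, ?)")) {
        ps.setString(1, UUID.randomUUID().toString());
        ps.setString(2, "customer-with-addresses");
        ps.setString(3, customerId);
        ps.setString(4, aggregateJson);  // the pre-computed nested JSON document
        ps.executeUpdate();
    }
    conn.commit();
}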