Question

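I noticed that documents indexed into Elasticsearch by the Kafka Elasticsearch connector get IDs in the format topic+partition+offset.

I would prefer to use IDs generated by Elasticsearch. It seems that topic+partition+offset is not always unique, so I am losing data.

How can I change this?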

Answer 1

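As Phil says in the comments, topic-partition-offset should be unique, so I don't see how this would cause data loss.

Regardless, you can either let the connector generate the key (as you are doing now) or define the key yourself (key.ignore=false). There are no other options.

You can use Single Message Transforms with Kafka Connect to derive the key from a field in your data. Judging by your message on the Elasticsearch forum, your data appears to contain an id field. If it is unique, you can set it as the message key, and therefore as the Elasticsearch document ID. Here is an example of defining the key with SMTs: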

# Add the `id` field as the key using Single Message Transforms
transforms=InsertKey,ExtractId

# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id

# `ExtractField`: convert key from an object to a plain field
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id

(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)
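To make the effect of the two transforms concrete, here is a rough Python model of what happens to a single record. This is only a sketch: plain dicts stand in for Kafka Connect records, the function names and the sample record are made up for illustration, and none of this is the actual Connect API.

```python
# Illustrative model only: plain dicts stand in for Kafka Connect records.

def value_to_key(record, fields):
    """Mimic ValueToKey: build the key as a struct of the chosen value fields."""
    key = {name: record["value"][name] for name in fields}
    return {**record, "key": key}


def extract_field_key(record, field):
    """Mimic ExtractField$Key: replace the struct key with one of its fields."""
    return {**record, "key": record["key"][field]}


# Hypothetical record with no key and an `id` field in the value.
record = {"key": None, "value": {"id": 42, "name": "widget"}}

after_insert_key = value_to_key(record, ["id"])               # key is a struct
after_extract_id = extract_field_key(after_insert_key, "id")  # key is a plain value

print(after_insert_key["key"])
print(after_extract_id["key"])
```

With key.ignore=false, that final plain key is what would be used as the document ID, so it only helps if id really is unique.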

Answer 2

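@Robin Moffatt, as I see it, topic-partition-offset can produce duplicates when you upgrade your Kafka cluster, not through a rolling upgrade but by replacing the whole cluster with a new one (which is sometimes easier). In that case you will lose data, because existing documents get overwritten.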
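A small simulation of that scenario, as I understand it. The assumptions here are mine: a dict stands in for the Elasticsearch index, the topic name `orders` is invented, and the replacement cluster is assumed to start the same topic and partition again from offset 0.

```python
def document_id(topic, partition, offset):
    # ID format as described in the question: topic+partition+offset
    return f"{topic}+{partition}+{offset}"


index = {}  # stand-in for an Elasticsearch index: document ID -> document

# Records consumed from the old cluster, offsets 0..2.
for offset, payload in enumerate(["a", "b", "c"]):
    index[document_id("orders", 0, offset)] = {"cluster": "old", "payload": payload}

# Cluster is replaced; the same topic/partition starts again at offset 0.
for offset, payload in enumerate(["x", "y"]):
    index[document_id("orders", 0, offset)] = {"cluster": "new", "payload": payload}

surviving_old = [doc for doc in index.values() if doc["cluster"] == "old"]
print(len(index), len(surviving_old))
```

Five records were written, but the index ends up with three documents, and two of the old ones were silently replaced.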

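Your excellent example may be the solution in many cases, but I would add another option. Perhaps you could append an epoch timestamp element to topic-partition-offset, so the ID would look like topic-partition-offset-current_timestamp.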

What do you think?
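Roughly, the ID I have in mind would be built like this. It is only a sketch of the idea: `timestamped_document_id` is a hypothetical helper, not an existing connector setting, and something like a custom transform would have to produce it.

```python
import time


def timestamped_document_id(topic, partition, offset, timestamp_ms=None):
    """Hypothetical helper: topic-partition-offset plus an epoch timestamp."""
    if timestamp_ms is None:
        # Default to the current time in milliseconds since the epoch.
        timestamp_ms = int(time.time() * 1000)
    return f"{topic}-{partition}-{offset}-{timestamp_ms}"


# The same topic/partition/offset seen on the old and on the replacement
# cluster no longer maps to the same ID (timestamps are sample values).
old_id = timestamped_document_id("orders", 0, 0, timestamp_ms=1_500_000_000_000)
new_id = timestamped_document_id("orders", 0, 0, timestamp_ms=1_600_000_000_000)

print(old_id)
print(new_id)
```

One trade-off to keep in mind: if the timestamp is taken at write time, a record that is delivered again after a retry gets a different ID, so it would be indexed twice instead of overwriting itself.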

Using Elasticsearch-generated IDs in the Kafka Elasticsearch connector
