Joining Kafka streams with a specific key as input

Asked: 2017-01-23 16:21:30

Tags: apache-kafka apache-kafka-streams kafka-join

I have 3 different topics and 3 Avro files in the schema registry. I want to stream those topics, join them together, and write the result to one topic. The problem is that the key I want to join on is different from the key I use when writing the data to each topic.

Suppose we have these 3 Avro files:

Alarm:

{
  "type" : "record",
  "name" : "Alarm",
  "namespace" : "com.kafkastream.schema.avro",
  "fields" : [ {
    "name" : "alarm_id",
    "type" : "string",
    "doc" : "Unique identifier of the alarm."
  }, {
    "name" : "ne_id",
    "type" : "string",
    "doc" : "Unique identifier of the network element ID that produces the alarm."
  }, {
    "name" : "start_time",
    "type" : "long",
    "doc" : "is the timestamp when the alarm was generated."
  }, {
    "name" : "severity",
    "type" : [ "null", "string" ],
    "doc" : "The severity field is the default severity associated to the alarm ",
    "default" : null
  }]
}

Incident:

{
  "type" : "record",
  "name" : "Incident",
  "namespace" : "com.kafkastream.schema.avro",
  "fields" : [ {
    "name" : "incident_id",
    "type" : "string",
    "doc" : "Unique identifier of the incident."
  }, {
    "name" : "incident_type",
    "type" : [ "null", "string" ],
    "doc" : "Categorization of the incident e.g. Network fault, network at risk, customer impact, etc",
    "default" : null
  }, {
    "name" : "alarm_source_id",
    "type" : "string",
    "doc" : "Respective Alarm"
  }, {
    "name" : "start_time",
    "type" : "long",
    "doc" : "is the timestamp when the incident was generated on the node."
  }, {
    "name" : "ne_id",
    "type" : "string",
    "doc" : "ID of specific network element."
  }]
}

Maintenance:

{
  "type" : "record",
  "name" : "Maintenance",
  "namespace" : "com.kafkastream.schema.avro",
  "fields" : [ {
    "name" : "maintenance_id",
    "type" : "string",
    "doc" : "The message number is the unique ID for every maintenance"
  }, {
    "name" : "ne_id",
    "type" : "string",
    "doc" : "The NE ID is the network element ID on which the maintenance is done."
  }, {
    "name" : "start_time",
    "type" : "long",
    "doc" : "The timestamp when the maintenance start."
  }, {
    "name" : "end_time",
    "type" : "long",
    "doc" : "The timestamp when the maintenance ends."
  }]
}

I have 3 topics in Kafka, one for each of these Avro types (let's say alarm_raw, incident_raw, maintenance_raw), and whenever I write to these topics I use ne_id as the key (so each topic is partitioned by ne_id). Now I want to join these 3 topics, produce new records, and write them to a new topic. The problem is that I want to join Alarm and Incident based on alarm_id and alarm_source_id, and join Alarm and Maintenance based on ne_id. I want to avoid creating new topics and assigning new keys. Is there any way to specify the key at join time?

2 Answers:

Answer 0 (score: 5):

It depends on which kind of join you want to use (cf. https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics).

For a KStream-KStream join, there is currently (v0.10.2 and earlier) no other way than to set a new key (e.g., using selectKey()) and repartition.
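A minimal sketch of that selectKey() approach for the Alarm/Incident join from the question (assumptions: the Avro-generated classes expose getters like getAlarmId()/getAlarmSourceId(), default serdes are configured, and the join window and output topic name are made up for illustration):

```java
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

KStreamBuilder builder = new KStreamBuilder();

// Both input topics are keyed by ne_id.
KStream<String, Alarm> alarms = builder.stream("alarm_raw");
KStream<String, Incident> incidents = builder.stream("incident_raw");

// Re-key alarms by alarm_id and incidents by alarm_source_id.
// selectKey() marks the streams for repartitioning; Streams creates
// internal repartition topics automatically before the join.
KStream<String, Alarm> alarmsByAlarmId =
        alarms.selectKey((neId, alarm) -> alarm.getAlarmId());
KStream<String, Incident> incidentsByAlarmId =
        incidents.selectKey((neId, incident) -> incident.getAlarmSourceId());

// Windowed join on the new common key.
KStream<String, String> joined = alarmsByAlarmId.join(
        incidentsByAlarmId,
        (alarm, incident) -> alarm.getAlarmId() + ":" + incident.getIncidentId(),
        JoinWindows.of(60 * 1000L));

joined.to("alarm_incident_joined");
```

Note that even though no topic is created by you, the repartitioning still happens under the hood via internal topics, which is exactly the cost the answer describes.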

For a KStream-KTable join, Kafka 0.10.2 (to be released in the next few weeks) contains a new feature called GlobalKTables (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-99%3A+Add+Global+Tables+to+Kafka+Streams). This allows you to do a non-key join against the table (i.e., a KStream-GlobalKTable join), so you do not need to repartition the data in the GlobalKTable.


Note: a KStream-GlobalKTable join has different semantics than a KStream-KTable join. In contrast to the latter, it is not time-synchronized, and thus the join is non-deterministic by design with regard to GlobalKTable updates; i.e., there is no guarantee whether a KStream record will "see" a GlobalKTable update first and thus join against the updated GlobalKTable record.

There are also plans to add a KTable-GlobalKTable join. This might become available in 0.10.3. There are no plans to add "global" KStream-KStream joins.
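Once 0.10.2 is available, the GlobalKTable route might look like the following sketch for the Alarm/Maintenance part of the question (assumptions: topic and store names are invented, and the Avro getters are guessed from the schemas above; since both sides are already keyed by ne_id, the key-extraction mapper is trivial here, but it could extract any field from the alarm):

```java
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

KStreamBuilder builder = new KStreamBuilder();

KStream<String, Alarm> alarms = builder.stream("alarm_raw");  // keyed by ne_id
GlobalKTable<String, Maintenance> maintenance =
        builder.globalTable("maintenance_raw", "maintenance-store");

// The KeyValueMapper (second argument) derives the table lookup key from
// each stream record, so no repartitioning of either side is required.
KStream<String, String> enriched = alarms.join(
        maintenance,
        (neId, alarm) -> neId,  // map stream record -> GlobalKTable key
        (alarm, maint) -> alarm.getAlarmId() + "@" + maint.getMaintenanceId());

enriched.to("alarm_maintenance_joined");
```

Keep the non-deterministic update semantics from the note above in mind when deciding whether this fits the use case.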

Answer 1 (score: 0):

You can arrive at the same key by modifying it. You can use a KeyValueMapper to modify both the key and the value. You should use it as follows:

val modifiedStream = kStream.map[String, String](
  new KeyValueMapper[String, String, KeyValue[String, String]] {
    override def apply(key: String, value: String): KeyValue[String, String] =
      new KeyValue("modifiedKey", value)
  }
)

You can apply the above logic on multiple KStream objects so that they share a single key for joining the KStreams.
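Applied to the question's streams, the same map() idea might look like this Java sketch (the getter names are assumptions derived from the Avro schemas; alarmStream and incidentStream are assumed to be KStream instances read from alarm_raw and incident_raw):

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;

// Re-key both streams to the alarm ID so they can be joined directly.
KStream<String, Alarm> alarmsById = alarmStream.map(
        (neId, alarm) -> KeyValue.pair(alarm.getAlarmId(), alarm));
KStream<String, Incident> incidentsById = incidentStream.map(
        (neId, incident) -> KeyValue.pair(incident.getAlarmSourceId(), incident));
```

Note that map(), like selectKey() in the accepted answer, marks the stream for repartitioning, so this does not actually avoid the internal repartition topics; it only avoids you having to manage them yourself.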