Kafka (Re-)joining group stuck with more than 2 topics

Posted: 2017-09-12 10:20:56

Tags: scala apache-kafka spark-streaming

I am developing a system that uses Kafka as a message publish/subscribe tool.

The data is generated by a Scala script:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val kafkaParams = new Properties()
kafkaParams.put("bootstrap.servers", "localhost:9092")
kafkaParams.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
kafkaParams.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
kafkaParams.put("group.id", "test_luca") //group.id is a consumer setting and is ignored by the producer

//kafka producer
val producer = new KafkaProducer[String, String](kafkaParams)

//source 1: one record on "topic_s1" every second
val s1 = new java.util.Timer()
val tasks1 = new java.util.TimerTask {
    def run() = {
        val date = new java.util.Date
        val date2 = date.getTime()
        val send = "" + date2 + ", 45.1234, 12.5432, 4.5, 3.0"
        val data = new ProducerRecord[String, String]("topic_s1", send)
        producer.send(data)
    }
}
s1.schedule(tasks1, 1000L, 1000L)

//source 2: one record on "topic_s2" every two seconds
val s2 = new java.util.Timer()
val tasks2 = new java.util.TimerTask {
    def run() = {
        val date = new java.util.Date
        val date2 = date.getTime()
        val send = "" + date2 + ", 1.111, 9.999, 10.4, 10.0"
        val data = new ProducerRecord[String, String]("topic_s2", send)
        producer.send(data)
    }
}
s2.schedule(tasks2, 2000L, 2000L)

I need to test Kafka's performance in some specific scenarios. In one of them, I have another script that consumes the topics "topic_s1" and "topic_s2", processes them, and then produces new data under different topics (topic_s1b and topic_s2b). This processed data is then consumed by an Apache Spark Streaming script.
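That "middleware" script is not shown in the question; a minimal sketch of what it could look like is below. The pass-through forwarding, the poll loop, and the reuse of the group.id test_luca are assumptions, not the actual script.

import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

//consumer reading the raw topics
val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "localhost:9092")
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.put("group.id", "test_luca") //assumption: same group.id as the other scripts

//producer writing the elaborated topics
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "localhost:9092")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val consumer = new KafkaConsumer[String, String](consumerProps)
val producer = new KafkaProducer[String, String](producerProps)
consumer.subscribe(Arrays.asList("topic_s1", "topic_s2"))

while (true) {
  val records = consumer.poll(1000L)
  for (record <- records.asScala) {
    //"elaborate" the record; here it is simply forwarded unchanged
    val outTopic = if (record.topic == "topic_s1") "topic_s1b" else "topic_s2b"
    producer.send(new ProducerRecord[String, String](outTopic, record.value))
  }
}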

If I leave out the consumer/producer ("middleware") script (so I only have 1 Kafka producer with 2 topics and the Spark script), everything works fine.

If I use the full configuration (1 Kafka producer with 2 topics, the "middleware" script that consumes the data from the Kafka producer, processes it and produces new data under the new topics, and 1 Spark script that consumes the data from the new topics), the Spark Streaming script gets stuck at INFO AbstractCoordinator: (Re-)joining group test_luca.

I am running everything locally and I have not modified the Kafka or ZooKeeper configuration.

Any suggestions?

UPDATE: the Spark script:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse //json4s; the native backend works as well

val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").setMaster("local[2]")
val sc = new SparkContext(sparkConf)

val ssc = new StreamingContext(sc, Seconds(4))

case class Thema(name: String, metadata: JObject)
case class Tempo(unit: String, count: Int, metadata: JObject)
case class Spatio(unit: String, metadata: JObject)
case class Stt(spatial: Spatio, temporal: Tempo, thematic: Thema)
case class Location(latitude: Double, longitude: Double, name: String)

case class Data(location: Location, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Data, stt: Stt)

case class Datas(location: Location, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor2(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Datas, stt: Stt)

val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
    "value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
    "group.id" -> "test_luca",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics1 = Array("topics1")
val topics2 = Array("topics2")

val stream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams))
val stream2 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics2, kafkaParams))

//parse each JSON record into the corresponding case class
val s1 = stream.map { record =>
  implicit val formats = DefaultFormats
  parse(record.value).extract[Sensor]
}
val s2 = stream2.map { record =>
  implicit val formats = DefaultFormats
  parse(record.value).extract[Sensor2]
}

val f1 = s1.map { x => x.sensor_name }
f1.print()
val f2 = s2.map { x => x.sensor_name }
f2.print()

//start the streaming job (truncated in the original post)
ssc.start()
ssc.awaitTermination()

Thanks, Luca

1 Answer:

Answer 0 (score: 2):

Maybe you should change the group.id for your Spark Streaming script. I suspect your "middleware" script's consumer has the same group.id as your Spark Streaming script's consumer, and then terrible things happen.

In Kafka, the consumer group is the real subscriber of a topic; a consumer within a group is just a worker that shares the partition load. So in your case you should use different group.id values for the middleware script's consumer and the Spark Streaming script's consumer.
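For example (a minimal sketch; the group names test_luca_middleware and test_luca_spark are only illustrative), the two consumers could be configured with distinct groups:

import org.apache.kafka.common.serialization.StringDeserializer

//"middleware" script consumer: its own group
val middlewareProps = new java.util.Properties()
middlewareProps.put("bootstrap.servers", "localhost:9092")
middlewareProps.put("key.deserializer", classOf[StringDeserializer].getCanonicalName)
middlewareProps.put("value.deserializer", classOf[StringDeserializer].getCanonicalName)
middlewareProps.put("group.id", "test_luca_middleware")

//Spark Streaming script consumer: a different group
val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
    "value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
    "group.id" -> "test_luca_spark",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
)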

Your first attempt, without the middleware script, worked precisely because of this.