KafkaConsumer is not safe for multi-threaded access

Posted: 2019-05-25 13:19:12

Tags: apache-spark apache-kafka spark-streaming-kafka

Kafka is the data source for a Spark Streaming job, and the sink is Redis. However, the job fails with "KafkaConsumer is not safe for multi-threaded access". How can this be resolved?

// Builds the Kafka parameter map for the direct stream
// (needs import org.apache.kafka.common.serialization.StringDeserializer).
def initKafkaParams(bootstrap_servers: String, groupId: String, duration: String = "5000"): Map[String, Object] = {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> bootstrap_servers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> groupId,
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (true: java.lang.Boolean),
    "auto.commit.interval.ms" -> duration
  )
  kafkaParams
}

val kafkaParams = KafkaUtil.initKafkaParams(Configuration.bootstrap_servers_log, groupId, duration)
val topics = Array(topic)
val stream = KafkaUtils.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
val cachedStream = stream.cache()
val closed_uids = cachedStream.map(record => parseJson(record.value)).filter(record => record != null)

closed_uids.foreachRDD(rdd =>
  rdd.foreachPartition(rows => {
    // One Redis connection per partition; pipeline the writes and always close the connection.
    val rc = RedisClient("recall")
    val redis = rc.getRedisClient()
    try {
      val pipe = redis.pipelined()
      val redisKey = "zyf_radio"
      rows.foreach(r => pipe.sadd(redisKey, r))
      pipe.sync()
    } catch {
      case e: Exception => println("redis error: " + e.getMessage)
    } finally {
      redis.close()
    }
  })
)
  

19/05/25 20:58:20 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-radio_sound_room_close_test aliyun_kfk_applog_common 58 2431327169
19/05/25 20:58:20 WARN storage.BlockManager: Putting block rdd_24_36 failed due to an exception
19/05/25 20:58:20 WARN storage.BlockManager: Block rdd_24_36 could not be removed as it was not found on disk or in memory
19/05/25 20:58:20 ERROR executor.Executor: Exception in task 36.0 in stage 8.0 (TID 676)
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
    at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
    at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1007)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:947)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1007)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:711)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

1 Answer:

Answer 0 (score: 0)

This is related to the bug SPARK-19185.

If you are using Spark 2.2.0, set spark.streaming.kafka.consumer.cache.enabled to false, as answered here.
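
For reference, a minimal sketch of where that flag goes, assuming Spark 2.2.0 with spark-streaming-kafka-0-10 (the application name and batch interval are illustrative; only the config key comes from the answer above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Disable the per-executor cached Kafka consumer as a workaround for SPARK-19185.
// Trade-off: a new consumer is created for each partition read instead of reusing a cached one.
val conf = new SparkConf()
  .setAppName("kafka-to-redis") // illustrative name
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")

val ssc = new StreamingContext(conf, Seconds(5)) // illustrative batch interval
// Build the direct stream with kafkaParams and topics exactly as in the question.

The same setting can also be passed at submit time, e.g. spark-submit --conf spark.streaming.kafka.consumer.cache.enabled=false.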