Kafka is the data source of a Spark Streaming job and Redis is the sink. But the job fails with "KafkaConsumer is not safe for multi-threaded access". How can I solve this?
import org.apache.kafka.common.serialization.StringDeserializer

def initKafkaParams(bootstrap_servers: String, groupId: String, duration: String = "5000"): Map[String, Object] = {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> bootstrap_servers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> groupId,
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (true: java.lang.Boolean),
    "auto.commit.interval.ms" -> duration
  )
  kafkaParams
}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = KafkaUtil.initKafkaParams(Configuration.bootstrap_servers_log, groupId, duration)
val topics = Array(topic)
val stream = KafkaUtils.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
val cachedStream = stream.cache()
val closed_uids = cachedStream.map(record => parseJson(record.value)).filter(record => record != null)

closed_uids.foreachRDD(rdd =>
  rdd.foreachPartition { rows =>
    // One Redis connection per partition; records are written through a pipeline.
    val rc = RedisClient("recall")
    try {
      val redis = rc.getRedisClient()
      val pipe = redis.pipelined()
      val redisKey = "zyf_radio"
      rows.foreach(r => pipe.sadd(redisKey, r))
      pipe.sync()
      redis.close()
    } catch {
      case e: Exception => println("redis error!")
    }
  }
)
19/05/25 20:58:20 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-radio_sound_room_close_test aliyun_kfk_applog_common 58 2431327169
19/05/25 20:58:20 WARN storage.BlockManager: Putting block rdd_24_36 failed due to an exception
19/05/25 20:58:20 WARN storage.BlockManager: Block rdd_24_36 could not be removed as it was not found on disk or in memory
19/05/25 20:58:20 ERROR executor.Executor: Exception in task 36.0 in stage 8.0 (TID 676)
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
    at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
    at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1007)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:947)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1007)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:711)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Answer 0 (score: 0)
This is related to the bug SPARK-19185. The stack trace shows why it bites here: stream.cache() makes the block manager iterate the Kafka RDD in order to store it, so a second thread reads through the same cached KafkaConsumer as the task thread, which raises the ConcurrentModificationException. If you are using Spark 2.2.0, set spark.streaming.kafka.consumer.cache.enabled to false, as answered here.
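
For illustration, here is a minimal sketch of applying that setting when building the streaming context. The app name and batch interval are placeholders rather than values from the question; the configuration key itself is the one referenced by SPARK-19185.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-redis") // placeholder name
  // Stop sharing cached KafkaConsumer instances across threads; each task
  // then uses its own consumer, avoiding the ConcurrentModificationException.
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")

val ssc = new StreamingContext(conf, Seconds(5)) // placeholder batch interval

Alternatively, since the trace implicates the stream.cache() call (the block manager thread re-reads the Kafka RDD while persisting it), removing that cache, or persisting only after a transformation that shuffles the data away from the Kafka RDD, should also avoid the code path shown above.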