I have a bunch of messages in Kafka and I am processing them with Spark Streaming.
I am trying to catch the cases where my code fails to insert into my database, so that I can take those messages and push them back into Kafka to be processed later.
To handle this, I create a variable called success inside my foreachRDD function. Then, when I attempt the database update, I return a Boolean indicating whether the insert succeeded. What I noticed during testing is that this does not seem to work correctly when the insert happens inside foreachPartition: as soon as execution leaves the foreachPartition function, the value of success appears to "reset".
stream: DStream[String]

stream
  .foreachRDD(rdd => {
    if (!rdd.isEmpty()) {
      var success = true
      rdd.foreachPartition(partitionOfRecords => {
        if (partitionOfRecords.nonEmpty) {
          val listOfRecords = partitionOfRecords.toList
          val successfulInsert: Boolean = insertRecordsToDB(listOfRecords)
          logger.info("Insert was successful: " + successfulInsert)
          if (!successfulInsert) {
            logger.info("logging successful as false. Currently its set to: " + success)
            success = false
            logger.info("logged successful as false. Currently its set to: " + success)
          }
        }
      })
      logger.info("Insert into database successful from all partition: " + success)
      if (!success) {
        // send data to Kafka topic
      }
    }
  })
My log output then shows this:

2019-06-24 20:26:37 [INFO] Insert was successful: false
2019-06-24 20:26:37 [INFO] logging successful as false. Currently its set to: true
2019-06-24 20:26:37 [INFO] logged successful as false. Currently its set to: false
2019-06-24 20:26:37 [INFO] Insert into database successful from all partition: true

Even though the third log line says success is currently set to false, as soon as I am outside foreachPartition and log it again, it is back to true.
Can anyone explain why? Or suggest another approach?
Answer 0 (score: 0)
I was able to make it work using an accumulator. The reason the original code fails is that the closure passed to foreachPartition is serialized and shipped to the executors, so each executor mutates its own copy of success; the driver's copy, which the final log statement and the Kafka check read, is never updated. An accumulator is Spark's mechanism for aggregating values from the executors back to the driver:
stream: DStream[String]

val dbInsertACC = sparkSession.sparkContext.longAccumulator("insertSuccess")

stream
  .foreachRDD(rdd => {
    if (!rdd.isEmpty()) {
      // reset the accumulator at the start of every batch; otherwise one
      // failed batch would keep the count non-zero for all later batches
      dbInsertACC.reset()
      rdd.foreachPartition(partitionOfRecords => {
        if (partitionOfRecords.nonEmpty) {
          val listOfRecords = partitionOfRecords.toList
          val successfulInsert: Boolean = insertRecordsToDB(listOfRecords)
          logger.info("Insert was successful: " + successfulInsert)
          if (!successfulInsert) dbInsertACC.add(1)
        }
      })
      logger.info("Partitions that failed to insert this batch: " + dbInsertACC.value)
      if (!dbInsertACC.isZero) {
        // send data to Kafka topic
      }
    }
  })
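
If you would rather avoid a shared accumulator, another option is to compute one Boolean per partition on the executors and collect those flags back to the driver. This is only a minimal sketch, reusing the stream, insertRecordsToDB, and logger names from the question:

stream
  .foreachRDD(rdd => {
    if (!rdd.isEmpty()) {
      // mapPartitions runs on the executors; collect() ships just one
      // Boolean per partition back to the driver, so the result is tiny
      val partitionResults: Array[Boolean] = rdd
        .mapPartitions(partitionOfRecords => {
          if (partitionOfRecords.nonEmpty) {
            val listOfRecords = partitionOfRecords.toList
            Iterator.single(insertRecordsToDB(listOfRecords))
          } else {
            Iterator.single(true) // an empty partition has nothing to fail
          }
        })
        .collect()

      logger.info("Partition insert results: " + partitionResults.mkString(", "))
      if (partitionResults.contains(false)) {
        // send data to Kafka topic
      }
    }
  })

Note that mapPartitions is lazy: it is the collect() call that actually triggers the inserts. As with the foreachPartition version, a failed-and-retried task will call insertRecordsToDB again, so the insert should ideally be idempotent.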