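I am building a utility that monitors and persists the progress of a file being processed in a larger system. The file is a large "text" file: .csv, .xls, .txt, etc. The processing could be streaming data from Kafka and writing it to Avro, or batch writing to a SQL DB. I am trying to build a "catch-all" utility that records the number of lines processed and persists the progress to a DB using a RESTful API call.
The processing is always done in an Akka actor, regardless of the type of processing. I am trying to do the progress logging asynchronously so that it does not block the processing, which runs very fast. Most of it happens in a batch-style format like the one below, although sometimes lines are processed one at a time. Here is a basic representation of what happens during processing, for demonstration only: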
// inside my processing actor
var fileIsProcessing = true
val allLines = KafkaUtil.getConnect(fileKey)
val totalLines = KafkaUtil.getSize
val batchSize = 500
val dBUtil = new DBUtil(totalLines)
while (fileIsProcessing) {
  // consumes @ 500 lines at a time to process, returns empty if done consuming
  val batch: List[Pollable] = allLines.poll
  // for batch identification purposes
  val myMax = batch.map(_.toInt).max
  println("Starting new batch with max line: " + myMax)
  // processing work happens here
  batch.map(processSync)
  println("Finished processing batch with max line: " + myMax)
  // send a progress update to be persisted to the DB
  val progressCall = Future[Unit] { dBUtil.incrementProgress(batch.size) }
  progressCall.onComplete {
    case Success(s) => // don't care
    case Failure(e) => logger.error("Unable to persist progress from actor ")
  }
  if (batch.isEmpty) fileIsProcessing = false // this is horribly non-functional.
}
And a simple representation of my DBUtil, the class that tracks and persists the progress:
class DBUtil(totalLines: Int) {
  // store both the number processed and the total to process in db, even if there is currently a percentage
  var rate = 0 // lines per second
  var totalFinished = 0
  var percentageFin: Double = 0
  var lastUpdate = DateTime.now()
  def incrementProgress(totalProcessed: Int, currentTime: DateTime): Unit = {
    // simulate write the data and calculated progress percentage to db
    rate = totalProcessed / ((currentTime.getMillis() - lastUpdate.getMillis()) / 1000)
    totalFinished += totalProcessed
    percentageFin = (totalFinished.toDouble / totalLines.toDouble) * 100
    println(s"Simulating DB persist of total processed:$totalFinished lines at $percentageFin% from my total lines: $totalLines at rate:$rate")
  }
}
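Now, the really weird thing is that in production the processing happens so fast that the line Future[Unit] { dBUtil.incrementProgress(batch.size) } is not reliably called every time. The while loop finishes, but in my DB I see the progress hang at 50% or 80%. The only way it works is if I bog the system down with logger or println statements to slow it down.
Why is my Future call not reliably executed every time?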
Answer 0 (score: 1)
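Hmm... so there are a few problems with your code.
You are just starting futures inside the while loop, and the loop then moves on to the next iteration without waiting for the future to complete. This means your program may finish before the executor has actually run the futures.
Also, your loop creates more and more futures that call dBUtil.incrementProgress(batch.size), so multiple threads will execute the same function at the same time. Since that function updates mutable state (the vars in DBUtil), this leads to race conditions.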
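Both problems can be reproduced and avoided in isolation. The following is a minimal sketch, not the asker's actual code: ProgressCounter and runBatches are hypothetical names, the global execution context is assumed, and the var is replaced with an AtomicInteger, which is one way (not the only one) to make concurrent increments safe. The sketch keeps every Future it starts and waits for all of them before reading the total.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for DBUtil: safe to call from many threads
// because the counter is atomic rather than a plain var.
final class ProgressCounter(totalLines: Int) {
  private val finished = new AtomicInteger(0)

  // Adds this batch and returns the running total.
  def incrementProgress(processed: Int): Int = finished.addAndGet(processed)

  def totalFinished: Int = finished.get()
  def percentage: Double = totalFinished.toDouble / totalLines * 100
}

object ProgressDemo {
  // Starts one progress update per batch and returns a Future that
  // completes only after every update has run.
  def runBatches(counter: ProgressCounter, batches: Int, batchSize: Int): Future[Unit] = {
    val updates = (1 to batches).map(_ => Future(counter.incrementProgress(batchSize)))
    Future.sequence(updates).map(_ => ())
  }

  def main(args: Array[String]): Unit = {
    val counter = new ProgressCounter(totalLines = 50000)
    // Blocking here is only for the demo; inside an actor you would
    // normally react to completion instead of calling Await.
    Await.result(runBatches(counter, batches = 100, batchSize = 500), 10.seconds)
    println(s"${counter.totalFinished} lines, ${counter.percentage}%")
  }
}
```

An atomic counter fixes lost updates but does not order the writes to the database; if the order of progress updates matters, the updates have to be chained one after another instead.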
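// Recursive version: the next batch is started only after the previous
// progress update has completed, so updates never run concurrently.
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

def processFileWithIncrementalUpdates(
  allLines: ????, // whatever type KafkaUtil.getConnect returns
  totalLines: Int,
  batchSize: Int,
  dBUtil: DBUtil
): Future[Unit] = {
  val promise = Promise[Unit]()
  Future {
    val batch: List[Pollable] = allLines.poll
    if (batch.isEmpty) {
      // nothing left to consume: complete the overall future
      promise.success(())
    }
    else {
      val myMax = batch.map(_.toInt).max
      println("Starting new batch with max line: " + myMax)
      // processing work happens here
      batch.map(processSync)
      println("Finished processing batch with max line: " + myMax)
      // send a progress update to be persisted to the DB
      val progressCall = Future[Unit] { dBUtil.incrementProgress(batch.size) }
      progressCall.onComplete {
        case Success(s) => // don't care
        case Failure(e) => logger.error("Unable to persist progress from actor ")
      }
      // whether the update succeeded or failed, continue with the next batch
      progressCall.onComplete { _ =>
        promise.completeWith(processFileWithIncrementalUpdates(allLines, totalLines, batchSize, dBUtil))
      }
    }
  }.failed.foreach(promise.tryFailure) // if processing itself throws, fail the overall future
  // return the promise's future from the method, not from inside the Future block
  promise.future
}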
val allLines = KafkaUtil.getConnect(fileKey)
val totalLines = KafkaUtil.getSize
val batchSize = 500
val dBUtil = new DBUtil(totalLines)
val processingFuture = processFileWithIncrementalUpdates(allLines, totalLines, batchSize, dBUtil)
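The same idea can also be written without an explicit Promise by chaining with flatMap. The sketch below is self-contained so that it can be run as is, under some loud assumptions: an in-memory Iterator of batches stands in for allLines.poll, and ChainedProgress, Progress and processAll are hypothetical names that do not appear in the question. Each step starts only after the previous progress update has completed, and the returned Future completes after the last one, so the caller has something to wait on or attach a callback to.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ChainedProgress {
  // Hypothetical progress tracker. A plain var is acceptable here only
  // because processAll never runs two updates at the same time.
  final class Progress(val totalLines: Int) {
    @volatile private var finished = 0
    def increment(n: Int): Unit = finished += n
    def totalFinished: Int = finished
  }

  // Processes one batch, records its progress, then recurses.
  // The returned Future completes after the final progress update.
  def processAll(batches: Iterator[List[String]], progress: Progress): Future[Unit] =
    if (!batches.hasNext) Future.successful(())
    else {
      val batch = batches.next()
      Future {
        batch.foreach(_.trim)          // placeholder for the real per-line work
        progress.increment(batch.size) // placeholder for the REST/DB call
      }.flatMap(_ => processAll(batches, progress))
    }

  def main(args: Array[String]): Unit = {
    val lines = List.tabulate(2000)(i => s"line $i")
    val progress = new Progress(lines.size)
    // Await is used only so the demo can print the final state.
    Await.result(processAll(lines.grouped(500), progress), 10.seconds)
    println(s"${progress.totalFinished} of ${progress.totalLines} lines recorded")
  }
}
```

Compared with the Promise-based version, a failure in any step propagates through flatMap automatically, at the cost of stopping the chain at the first failed batch.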