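I'm processing a stream of data from a file that is grouped by a key. I've created a class with an apply method, called KeyChanges[T, K], that can be used to split the stream by key. Before the first item of a substream is processed, I need to retrieve some data from a database. Once each substream completes, I need to send a message to a queue. With a standard Scala sequence I would do something like this: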
val groups: Map[Key, Seq[Value]] = stream.groupBy(v => v.k)
val groupSummaryF = Future.sequence(groups.map { case (k, group) =>
  retrieveMyData(k).flatMap { data =>
    Future.sequence(group.map(v => process(data, v))).map(
      k -> _.foldLeft(0) { (a, t) =>
        t match {
          case Success(v) => a + 1
          case Failure(ex) =>
            println(s"failure: $ex")
            a
        }
      }
    ).andThen {
      case Success((key, count)) =>
        sendMessage(count, key)
    }
  }
})
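I'd like to do something similar with Akka Streams. For the data retrieval, I could just cache the data and call the retrieval function for every element, but for the queue message I really do need to know when a substream has completed. So far I haven't found a way to do that. Any ideas?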
Answer 0 (score: 1)
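You can simply run the stream and perform the action from the Sink. Because fold emits its single result only when its substream completes, every stage after it runs once per key, after that key's substream has finished.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.Future
import scala.util.Random
// Akka 2.6+: the implicit ActorSystem also supplies the materializer
// (older versions additionally need an implicit ActorMaterializer())
implicit val system: ActorSystem = ActorSystem("example")
import system.dispatcher // ExecutionContext for Future.map
val categories = Array("DEBUG", "INFO", "WARN", "ERROR")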
// assume we have a stream from file which produces categoryId -> message
val lines = (1 to 100).map(x => (Random.nextInt(categories.length), s"message $x"))
def loadDataFromDatabase(categoryId: Int): Future[String] =
  Future.successful(categories(categoryId))
// assume this emits message to the queue
def emitToQueue(x: (String, Int)): Unit =
  println(s"${x._2} messages from category ${x._1}")
val flow =
  Flow[(Int, String)].
    groupBy(4, _._1).
    fold((0, List.empty[String])) { case ((_, acc), (catId, elem)) =>
      (catId, elem :: acc)
    }.
    mapAsync(1) { case (catId, messages) =>
      // here you load your stuff from the database
      loadDataFromDatabase(catId).map(cat => (cat, messages))
    }. // here you may want to do some more processing
    map(x => (x._1, x._2.size)).
    mergeSubstreams
// assume the source is a file; runWith keeps the sink's Future[Done],
// which completes once every summary has been emitted
Source.fromIterator(() => lines.iterator).
  via(flow).
  runWith(Sink.foreach(emitToQueue))
If you want to run it for several files, for example to report the totals once, you can do it like this.
val futures = (1 to 4).map { x =>
  Source.fromIterator(() => lines.iterator).
    via(flow).
    toMat(Sink.seq[(String, Int)])(Keep.right).
    run()
}
Future.sequence(futures).map { results =>
  results.flatten.groupBy(_._1).foreach { case (cat, xs) =>
    val total = xs.map(_._2).sum
    println(s"$total messages from category $cat")
  }
}
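As you can see, when you run the stream and keep the sink's materialized value (as with toMat(...)(Keep.right) above), you get back a Future. It completes with the result of the stream, and once it has completed you can do whatever you like with it.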
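To make the per-key completion concrete, here is a minimal sketch with a small fixed input, assuming Akka 2.6 or later (where the implicit ActorSystem also supplies the materializer). The names SubstreamCompletionSketch, retrieveMyData and process are hypothetical stand-ins for the question's helpers, not part of Akka. As in the code above, the data is loaded after the fold rather than before the first element.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object SubstreamCompletionSketch {
  // Hypothetical stand-in for the database lookup: the "data" is the key's length.
  def retrieveMyData(key: String): Future[Int] = Future.successful(key.length)

  // Hypothetical stand-in for per-element processing.
  def process(data: Int, value: Int): Int = data * value

  def run()(implicit system: ActorSystem): Future[Map[String, Int]] = {
    import system.dispatcher // ExecutionContext for the Future.map calls below

    val records = List("a" -> 1, "bb" -> 2, "a" -> 3, "bb" -> 4, "a" -> 5)

    Source(records)
      .groupBy(2, _._1)
      // fold emits exactly one element per substream, when that substream completes
      .fold(("", List.empty[Int])) { case ((_, acc), (key, value)) =>
        (key, value :: acc)
      }
      // this stage therefore runs once per key, after all of its elements were seen
      .mapAsync(1) { case (key, values) =>
        retrieveMyData(key).map(data => key -> values.map(process(data, _)).sum)
      }
      .mergeSubstreams
      // keep the sink's materialized Future so the caller can react to completion
      .runWith(Sink.seq)
      .map(_.toMap)
  }
}
```

The returned Future completes only after every substream has finished, so a queue message could be sent either per key in a stage after mapAsync or once for the whole run from a callback on that Future. Note that groupBy substreams complete when the upstream source completes, not when the key changes in the input.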