Question

So I currently have an Akka stream that reads a list of files, and a sink that concatenates them, and this works just fine:

{{1}}

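While this works fine for a simple case, the files are rather large (on the order of GBs) and cannot fit into the memory of the machine I'm running this application on. So I'd like to chunk the output once the byte string reaches a certain size. One option is Source.grouped(N), but that groups by number of elements, and the files vary greatly in size (from 1 KB to 2 GB), so there is no guarantee on the size of the resulting files.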

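My question is whether there is a way to chunk the writing of files by the size of the ByteString. The Akka Streams documentation is quite overwhelming and I'm having trouble finding my way around the library. Any help would be greatly appreciated. Thanks!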

Answer 1

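The FileIO module of Akka Streams gives you a streaming IO Sink for writing to files, and Akka Streams also provides utility methods for chunking a stream of ByteString. Your example would become something along the lines of: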

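import java.nio.file.Paths
import akka.stream.IOResult
import akka.stream.scaladsl.{FileIO, Framing, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Future

// Assumes an implicit ActorSystem (or Materializer) is in scope
val files = List("a.txt", "b.txt", "c.txt") // and so on;
val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
// Frames are emitted without the "\n" delimiter; a line over maximumFrameLength fails the stream
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)
val sink: Sink[ByteString, Future[IOResult]] = FileIO.toPath(Paths.get("an-output-file.txt"))

source.via(chunking).runWith(sink)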

Using the FileIO.toPath sink avoids holding the whole folded ByteString in memory (therefore allowing proper streaming).

For more details on this Akka module, see the docs.

Answer 2

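I think @Stefano Bonetti has already offered a great solution. Just wanted to add that one could also consider building a custom GraphStage to address specific chunking needs. In essence, create a chunk-emitting method like the one below for the In/Out handlers, as described in this Akka Stream link: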

private def emitChunk(): Unit = {
  if (buffer.isEmpty) {
    if (isClosed(in)) completeStage()
    else pull(in)
  } else {
    val (chunk, nextBuffer) = buffer.splitAt(chunkSize)
    buffer = nextBuffer
    push(out, chunk)
  }
}
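To make the buffer/splitAt idea concrete outside of a GraphStage, here is a minimal sketch that uses only the Scala standard library. It is not the GraphStage from the linked docs: `RechunkSketch` and `rechunk` are hypothetical names, plain `Array[Byte]` stands in for `ByteString`, and an `Iterator` stands in for the stream. It is also a slight variant, in that it waits until a full chunk is buffered before emitting, so only the last chunk can be short.

```scala
object RechunkSketch {
  // Re-chunks byte arrays of arbitrary sizes into pieces of at most chunkSize bytes,
  // using the same "append to buffer, then splitAt(chunkSize)" pattern as emitChunk.
  def rechunk(incoming: Iterator[Array[Byte]], chunkSize: Int): Iterator[Array[Byte]] = {
    require(chunkSize > 0, "chunkSize must be positive")
    new Iterator[Array[Byte]] {
      private var buffer: Array[Byte] = Array.emptyByteArray

      // Pull from upstream until a full chunk is buffered or upstream is exhausted
      private def fill(): Unit =
        while (buffer.length < chunkSize && incoming.hasNext)
          buffer = buffer ++ incoming.next()

      def hasNext: Boolean = { fill(); buffer.nonEmpty }

      def next(): Array[Byte] = {
        fill()
        if (buffer.isEmpty) throw new NoSuchElementException("no more chunks")
        val (chunk, rest) = buffer.splitAt(chunkSize)
        buffer = rest // keep the remainder for the next call
        chunk
      }
    }
  }
}
```

For example, feeding inputs of 3, 1 and 5 bytes with `chunkSize = 4` yields chunks of 4, 4 and 1 bytes, with the byte order preserved.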

Answer 3

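After a week of tinkering with the Akka Streams libraries, the solution I ended up with was a combination of Stefano's answer and the solution provided here. I read the source of files line by line via the Framing.delimiter function, and then simply use the LogRotatorSink provided by Alpakka. The meat of determining the log rotation is here: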

val fileSizeRotationFunction = () => {
  val max = 10 * 1024 * 1024 // 10 MB, but whatever you really want; I had it at our HDFS block size
  var size: Long = max
  (element: ByteString) =>
    {
      if (size + element.size > max) {
        val path = Files.createTempFile("out-", ".log")
        size = element.size
        Some(path)
      } else {
        size += element.size
        None
      }
    }
}

val sizeRotatorSink: Sink[ByteString, Future[Done]] =
  LogRotatorSink(fileSizeRotationFunction)

val source = Source(files).flatMapConcat(f => FileIO.fromPath(Paths.get(f)))
val chunking = Framing.delimiter(ByteString("\n"), maximumFrameLength = 256, allowTruncation = true)

source.via(chunking).runWith(sizeRotatorSink)

That's it. Hope this helps others.
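To see when the rotation trigger above fires, the same counter logic can be exercised without Akka or Alpakka. This is only a sketch: `RotationTriggerSketch` and `sizeTrigger` are hypothetical names, element sizes are passed as plain `Long` values instead of `ByteString`, and the function returns `true` where the original returns `Some(path)`.

```scala
object RotationTriggerSketch {
  // Same counter logic as fileSizeRotationFunction, but over plain element sizes.
  // Returns true when a new file would be started, false otherwise.
  def sizeTrigger(max: Long): Long => Boolean = {
    var size: Long = max // start "full" so the first non-empty element opens a file
    (elementSize: Long) =>
      if (size + elementSize > max) {
        size = elementSize // the new file begins with this element
        true
      } else {
        size += elementSize // element still fits in the current file
        false
      }
  }
}
```

With `max = 10` and element sizes 4, 4, 4, 10, 1, the trigger fires on the first, third, fourth and fifth elements. Rotation happens before the element that would push the file over the limit, so each file stays at or under `max` unless a single element is itself larger than `max`.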

Akka Streams: How to group a list of files in a source by size?
