Akka Streams - Split a ByteString stream into multiple files

Date: 2016-12-30 20:46:05

Tags: scala stream akka akka-stream

I'm trying to split an incoming Akka stream of bytes (from the body of an http request, but it could also be from a file) into multiple files of a defined size.

For example, if I upload a 10Gb file, it would create 10 files of 1Gb. The files would have randomly generated names. My problem is that I don't really know where to start, because all the responses and examples I've read either store the whole chunk into memory or use a delimiter based on a string. Except I can't really have "chunks" of 1Gb and then write them to disk..

Is there any easy way to perform that kind of operation? My only idea would be to use something like http://doc.akka.io/docs/akka/2.4/scala/stream/stream-cookbook.html#Chunking_up_a_stream_of_ByteStrings_into_limited_size_ByteStrings but transformed into something like FlowShape[ByteString, File], writing the chunks into a file myself until the max file size is reached, then creating a new file, etc., and streaming back the created files. Which looks like an atrocious idea of not using Akka properly..

Thanks in advance

3 Answers:

Answer 0: (score 6)

I oftentimes revert to purely functional, non-akka techniques for problems such as this, and then "lift" those functions into akka constructs. By this I mean I try to use only scala "stuff" and then try to wrap that stuff up inside of akka later on...

File Creation

Starting with the FileOutputStream creation based on the "randomly generated names":

def randomFileNameGenerator : String = ??? //not specified in question

import java.io.FileOutputStream

val randomFileOutGenerator : () => FileOutputStream = 
  () => new FileOutputStream(randomFileNameGenerator)
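
If you want something concrete to fill in that ???, a hypothetical UUID-based generator could work (this is purely an assumption on my part, since the question leaves naming open):

import java.util.UUID

//hypothetical: UUID-based file names in the working directory
def randomFileNameGenerator : String = s"chunk-${UUID.randomUUID()}.bin"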

State

Some way of storing the "state" of the current file is needed, e.g. the number of bytes already written:

case class FileState(byteCount : Int = 0, 
                     fileOut : FileOutputStream = randomFileOutGenerator())

File Writing

First we determine whether the given ByteString would breach the maximum file size threshold:

import akka.util.ByteString

val isEndOfChunk : (FileState, ByteString, Int) => Boolean =
  (state, byteString, maxBytes) =>
    state.byteCount + byteString.length > maxBytes

Then we need the function that creates a new FileState if we've maxed out the capacity of the current one, or returns the current state if it is still below capacity:

val closeFileInState : FileState => Unit = 
  (_ : FileState).fileOut.close()

val getCurrentFileState : (FileState, ByteString, Int) => FileState = 
  (state, byteString, maxBytes) =>
    if(isEndOfChunk(state, byteString, maxBytes)) {
      closeFileInState(state)
      FileState()
    }
    else
      state

The only thing left is writing to the FileOutputStream:

val writeToFileAndReturn : (FileState, ByteString) => FileState = 
  (fileState, byteString) => {
    fileState.fileOut write byteString.toArray
    fileState copy (byteCount = fileState.byteCount + byteString.size)
  }

//the signature ordering will become useful
def writeToChunkedFile(maxBytes : Int)(fileState : FileState, byteString : ByteString) : FileState =    
  writeToFileAndReturn(getCurrentFileState(fileState, byteString, maxBytes), byteString)

Folding Any GenTraversableOnce

In scala, a GenTraversableOnce is any collection, parallel or not, that has the fold operators. These include Iterator, Vector, Array, Seq, scala stream, ... The curried writeToChunkedFile function matches the signature of GenTraversableOnce#foldLeft exactly:

val maxBytes : Int = 1024 * 1024 * 1024 //e.g. 1Gb parts, per the question

val anyIterable : Iterable[ByteString] = ???

val finalFileState = anyIterable.foldLeft(FileState())(writeToChunkedFile(maxBytes))

One loose end remains; the last FileOutputStream also needs to be closed. Since the fold only emits the final FileState, we can close that one:

closeFileInState(finalFileState)

Akka Streams

An akka Flow gets its fold from FlowOps#fold, which happens to match the foldLeft signature we just used. We can therefore "lift" our regular functions into stream values, similar to the way we folded over the Iterable:

import akka.stream.scaladsl.Flow

def chunkerFlow(maxBytes : Int) : Flow[ByteString, FileState, _] = 
  Flow[ByteString].fold(FileState())(writeToChunkedFile(maxBytes))

The nice part about handling the problem with regular functions is that they can be used within other asynchronous frameworks beyond streams, e.g. Futures or Actors. You also don't need an akka ActorSystem in unit testing, just regular language data structures.
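
As a quick sketch of that point (the maxBytes value here is arbitrary), the same function can be driven by a plain Vector in a test, with no ActorSystem involved:

//no akka required: fold the chunking function over an in-memory collection
val testState = Vector(ByteString("hello"), ByteString("world"))
  .foldLeft(FileState())(writeToChunkedFile(maxBytes = 1024))

closeFileInState(testState)

Back in akka, the chunkerFlow can then be drained into a Sink that closes the emitted FileState: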

import akka.stream.scaladsl.Sink

def byteStringSink(maxBytes : Int) : Sink[ByteString, _] = 
  chunkerFlow(maxBytes) to (Sink foreach closeFileInState)

This Sink can then be used to drain the ByteString data coming from an HttpEntity.
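
For example, a minimal sketch of that wiring (the request value and the 1Gb part size are assumptions):

import akka.actor.ActorSystem
import akka.http.scaladsl.model.HttpRequest
import akka.stream.ActorMaterializer

implicit val system : ActorSystem = ActorSystem()
implicit val materializer : ActorMaterializer = ActorMaterializer()

//drain the request body into randomly named 1Gb chunk files
def drainRequest(request : HttpRequest) : Unit = 
  request.entity.dataBytes.runWith(byteStringSink(maxBytes = 1024 * 1024 * 1024))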

Answer 1: (score 1)

You can write a custom graph stage. Your issue is similar to one faced in alpakka during upload to amazon S3. (Google "alpakka s3 connector".. they won't let me post more than 2 links.)

For some reason the s3 connector's DiskBuffer writes the entire incoming source of bytestrings to a file before emitting the chunk for further stream processing..

What we want is something similar to limit a source of byte strings to specific length. In that example they limit the incoming Source[ByteString, _] to a source of fixed-size byteStrings by maintaining a memory buffer. I adapted it to work with files. The advantage of this is that you can use a dedicated thread pool for this stage to do blocking IO; for a good reactive stream you want to keep blocking IO on a separate thread pool in the actor system.

PS: this does not try to make files of exact size, so if we read 2KB extra into a 100MB file, we write those extra bytes to the current file rather than trying to achieve the exact size.

import java.io.{FileOutputStream, RandomAccessFile}
import java.nio.channels.FileChannel
import java.nio.file.Path

import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import akka.stream._
import akka.util.ByteString

case class MultipartUploadChunk(path: Path, size: Int, partNumber: Int)
//Starts writing the byteStrings received from upstream to a file. Emits a path after writing partSize bytes. Does not attempt to write an exact number of bytes.
class FileChunker(maxSize: Int, tempDir: Path, partSize: Int)
    extends GraphStage[FlowShape[ByteString, MultipartUploadChunk]] {

  assert(maxSize > partSize, "Max size should be larger than part size. ")

  val in: Inlet[ByteString] = Inlet[ByteString]("PartsMaker.in")
  val out: Outlet[MultipartUploadChunk] = Outlet[MultipartUploadChunk]("PartsMaker.out")

  override val shape: FlowShape[ByteString, MultipartUploadChunk] = FlowShape.of(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler with InHandler {

      var partNumber: Int = 0
      var length: Int = 0
      var currentBuffer: Option[PartBuffer] = None

      override def onPull(): Unit =
        if (isClosed(in)) {
          emitPart(currentBuffer, length)
        } else {
          pull(in)
        }

      override def onPush(): Unit = {
        val elem = grab(in)
        length += elem.size
        val currentPart: PartBuffer = currentBuffer match {
          case Some(part) => part
          case None =>
            val newPart = createPart(partNumber)
            currentBuffer = Some(newPart)
            newPart
        }
        currentPart.fileChannel.write(elem.asByteBuffer)
        if (length > partSize) {
          emitPart(currentBuffer, length)
          //3. increment part number, reset length.
          partNumber += 1
          length = 0
        } else {
          pull(in)
        }
      }

      override def onUpstreamFinish(): Unit =
        if (length > 0) emitPart(currentBuffer, length) // emit part only if something is still left in current buffer.

      private def emitPart(maybePart: Option[PartBuffer], size: Int): Unit = maybePart match {
        case Some(part) =>
          //1. flush the part buffer and truncate the file.
          part.fileChannel.force(false)
          //not sure why we do this truncate.. but it was being done in alpakka, and is probably safe to do:
          //val ch = new FileOutputStream(part.path.toFile).getChannel
          //try {
          //  println(s"truncating to size $size")
          //  ch.truncate(size)
          //} finally {
          //  ch.close()
          //}
          //2. emit the part
          val chunk = MultipartUploadChunk(path = part.path, size = size, partNumber = partNumber)
          push(out, chunk)
          part.fileChannel.close() // TODO: probably close elsewhere.
          currentBuffer = None
          //complete stage if in is closed.
          if (isClosed(in)) completeStage()
        case None => if (isClosed(in)) completeStage()
      }

      private def createPart(partNum: Int): PartBuffer = {
        val path: Path = partFile(partNum)
        //currentPart.deleteOnExit() //TODO: Enable in prod. requests that the file be deleted when VM dies.
        PartBuffer(path, new RandomAccessFile(path.toFile, "rw").getChannel)
      }

      /**
       * Creates a file in the temp directory with name bmcs-buffer-part-$partNumber
       * @param partNumber the part number in multipart upload.
       * @return
       * TODO:add unique id to the file name. for multiple
       */
      private def partFile(partNumber: Int): Path =
        tempDir.resolve(s"bmcs-buffer-part-$partNumber.bin")
      setHandlers(in, out, this)
    }

  case class PartBuffer(path: Path, fileChannel: FileChannel) //TODO:  see if you need mapped byte buffer. might be ok with just output stream / channel.

}
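
A sketch of wiring the stage into a stream (the source, the sizes, and the dispatcher name "my-blocking-dispatcher" are all assumptions; the dispatcher would be defined in application.conf, per the blocking-IO point above):

import java.nio.file.Files
import akka.actor.ActorSystem
import akka.stream.{ActorAttributes, ActorMaterializer}
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString

implicit val system : ActorSystem = ActorSystem()
implicit val materializer : ActorMaterializer = ActorMaterializer()

val bytes : Source[ByteString, _] = ??? //e.g. an HttpEntity's dataBytes

bytes
  .via(new FileChunker(maxSize = 1024 * 1024 * 1024, tempDir = Files.createTempDirectory("chunks"), partSize = 100 * 1024 * 1024))
  .withAttributes(ActorAttributes.dispatcher("my-blocking-dispatcher")) //keep the blocking IO off the default pool
  .runWith(Sink.foreach(chunk => println(s"part ${chunk.partNumber}: ${chunk.path} (${chunk.size} bytes)")))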

Answer 2: (score 1)

The idiomatic way to split a ByteString stream into multiple files is to use Alpakka's LogRotatorSink. From the documentation:

This sink takes as a parameter a function that returns a ByteString => Option[Path] function. If the generated function returns a path, the sink rotates the file output to this new path, and the actual ByteString is written to this new file too. With this approach the user can define a custom stateful file generation implementation.

The following fileSizeRotationFunction is also from the documentation:

import java.nio.file.Files
import akka.util.ByteString

val fileSizeRotationFunction = () => {
  val max = 10 * 1024 * 1024
  var size: Long = max
  (element: ByteString) =>
    {
      if (size + element.size > max) {
        val path = Files.createTempFile("out-", ".log")
        size = element.size
        Some(path)
      } else {
        size += element.size
        None
      }
    }
}

An example of using it:

import akka.stream.alpakka.file.scaladsl.LogRotatorSink
import akka.stream.scaladsl.Source

val source: Source[ByteString, _] = ???
source.runWith(LogRotatorSink(fileSizeRotationFunction))
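
As a fuller sketch (the file name is hypothetical, and the akka-stream-alpakka-file dependency is assumed to be on the classpath), rotating an upload already sitting on disk could look like:

import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.FileIO
import akka.stream.alpakka.file.scaladsl.LogRotatorSink

implicit val system : ActorSystem = ActorSystem()
implicit val materializer : ActorMaterializer = ActorMaterializer()

//stream the file's bytes into rotating 10MB temp files
FileIO.fromPath(Paths.get("huge-upload.bin"))
  .runWith(LogRotatorSink(fileSizeRotationFunction))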