Akka Streams: reading multiple files

Time: 2016-06-13 22:08:10

Tags: scala akka akka-stream

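I have a list of files. I want:

  1. To read all of them as a single Source.
  2. Files to be read sequentially, in order (no round-robin).
  3. No file to ever be held entirely in memory.
  4. An error reading from a file to collapse the stream.

It felt like this should work (Scala, akka-streams v2.4.7):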

    val sources = Seq("file1", "file2").map(new File(_)).map(f => FileIO.fromPath(f.toPath)
        .via(Framing.delimiter(ByteString(System.lineSeparator), 10000, allowTruncation = true))
        .map(bs => bs.utf8String)
      )
    val source = sources.reduce( (a, b) => Source.combine(a, b)(MergePreferred(_)) )
    source.map(_ => 1).runWith(Sink.reduce[Int](_ + _)) // counting lines
    

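But that results in a compile error, because FileIO has a materialized value associated with it, which Source.combine does not support.

Mapping the materialized value away makes me wonder how file-read errors would be handled, but it does compile: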

    val sources = Seq("file1", "file2").map(new File(_)).map(f => FileIO.fromPath(f.toPath)
        .via(Framing.delimiter(ByteString(System.lineSeparator), 10000, allowTruncation = true))
        .map(bs => bs.utf8String)
        .mapMaterializedValue(f => NotUsed.getInstance())
      )
    val source = sources.reduce( (a, b) => Source.combine(a, b)(MergePreferred(_)) )
    source.map(_ => 1).runWith(Sink.reduce[Int](_ + _))  // counting lines
    

But it throws an IllegalArgumentException at runtime:

    java.lang.IllegalArgumentException: requirement failed: The inlets [] and outlets [MergePreferred.out] must correspond to the inlets [MergePreferred.preferred] and outlets [MergePreferred.out]
    

3 Answers:

Answer 0 (score: 9)

The code below is not as terse as it could be, so that the different concerns are clearly modularized.

import java.io.File
import java.nio.file.Path
import akka.stream.scaladsl.{FileIO, Flow, Framing, Keep, Sink, Source}
import akka.util.ByteString

// Assumes an implicit ActorSystem, Materializer and ExecutionContext are in scope.

// Given a stream of ByteStrings delimited by the system line separator, we can get lines represented as Strings
val lines = Framing.delimiter(ByteString(System.lineSeparator), 10000, allowTruncation = true).map(bs => bs.utf8String)

// Given a stream of Paths, we read those files one after another and count the number of lines
val lineCounter = Flow[Path].flatMapConcat(path => FileIO.fromPath(path).via(lines)).fold(0L)((count, line) => count + 1).toMat(Sink.head)(Keep.right)

// Here's our test data source (replace paths with real paths)
val testFiles = Source(List("somePathToFile1", "somePathToFile2").map(new File(_).toPath))

// Runs the line counter over the test files; returns a Future containing the number of lines, which we print to the console when it completes
testFiles runWith lineCounter foreach println

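Answer 1 (score: 2)

Update: oh, I didn't see the accepted answer because I hadn't refreshed the page >_<. I'll leave this here anyway, since I have also added some notes about error handling.

I believe the following program does what you want: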

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, IOResult}
import akka.stream.scaladsl.{FileIO, Flow, Framing, Keep, Sink, Source}
import akka.util.ByteString
import scala.concurrent.{Await, Future}
import scala.util.{Failure, Success}
import scala.util.control.NonFatal
import java.nio.file.Paths
import scala.concurrent.duration._

object TestMain extends App {
  implicit val actorSystem = ActorSystem("test")
  implicit val materializer = ActorMaterializer()
  implicit def ec = actorSystem.dispatcher

  val sources = Vector("build.sbt", ".gitignore")
    .map(Paths.get(_))
    .map(p =>
      FileIO.fromPath(p)
        .viaMat(Framing.delimiter(ByteString(System.lineSeparator()), Int.MaxValue, allowTruncation = true))(Keep.left)
        .mapMaterializedValue { f =>
          f.onComplete {
            case Success(r) if r.wasSuccessful => println(s"Read ${r.count} bytes from $p")
            case Success(r) => println(s"Something went wrong when reading $p: ${r.getError}")
            case Failure(NonFatal(e)) => println(s"Something went wrong when reading $p: $e")
          }
          NotUsed
        }
    )
  val finalSource = Source(sources).flatMapConcat(identity)

  val result = finalSource.map(_ => 1).runWith(Sink.reduce[Int](_ + _))
  result.onComplete {
    case Success(n) => println(s"Read $n lines total")
    case Failure(e) => println(s"Reading failed: $e")
  }
  Await.ready(result, 10.seconds)

  actorSystem.terminate()
}

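The key here is the flatMapConcat() method: it transforms each element of the stream into a source and returns a stream of the elements those sources produce, with the sources run sequentially, one after another.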
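To make the ordering concrete, here is a minimal sketch that uses in-memory lists in place of files. The names FlatMapConcatSketch and concatInOrder are hypothetical and exist only for illustration; the sketch assumes the same Akka 2.4-style setup (ActorSystem plus ActorMaterializer) as the program above.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object FlatMapConcatSketch {
  // Hypothetical helper: each inner list stands in for the lines of one file.
  def concatInOrder(chunks: List[List[String]]): Seq[String] = {
    implicit val system = ActorSystem("flatMapConcat-sketch")
    implicit val materializer = ActorMaterializer()
    try {
      val result = Source(chunks)
        // Each element becomes a source; the next one starts only after the previous one completes.
        .flatMapConcat(lines => Source(lines))
        .runWith(Sink.seq)
      Await.result(result, 10.seconds)
    } finally {
      system.terminate()
    }
  }
}
```

Calling FlatMapConcatSketch.concatInOrder(List(List("a1", "a2"), List("b1"))) should yield all elements of the first list before any element of the second, which is the "no round-robin" behaviour the question asks for.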

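As for handling errors, you can either add a handler to the future inside the mapMaterializedValue argument, or handle the final error of the running stream by putting a handler on the future materialized by the sink (Sink.reduce in the example above). I did both in the example above, and if you test it on, say, a nonexistent file, you will see the same error message printed twice. Unfortunately, flatMapConcat() does not collect materialized values, and frankly I can't see how it could do so sanely, so you have to handle them separately if necessary.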
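The second approach, watching the future materialized by the sink, can be sketched as follows. This is only an illustration under stated assumptions: in-memory sources stand in for files, Source.failed simulates a read error, and the names FailurePropagationSketch, runAll, good and broken are hypothetical.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try

object FailurePropagationSketch {
  // Stand-ins for file sources: one that emits lines, one that fails immediately.
  val good: Source[String, NotUsed] = Source(List("line 1", "line 2"))
  val broken: Source[String, NotUsed] = Source.failed(new RuntimeException("simulated read error"))

  // Hypothetical helper: runs the concatenated sources and captures success or failure.
  def runAll(sources: List[Source[String, NotUsed]]): Try[Seq[String]] = {
    implicit val system = ActorSystem("failure-sketch")
    implicit val materializer = ActorMaterializer()
    try {
      val result = Source(sources).flatMapConcat(identity).runWith(Sink.seq)
      // A failure in any inner source fails the whole stream, and with it this future.
      Try(Await.result(result, 10.seconds))
    } finally {
      system.terminate()
    }
  }
}
```

With these stand-ins, runAll(List(good, broken, good)) should come back as a Failure, while runAll(List(good, good)) should succeed with four lines. That matches the requirement that a read error collapses the stream.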

Answer 2 (score: -1)

I do have one answer: don't use akka's FileIO. This seems to work fine, for example:

val sources = Seq("sample.txt", "sample2.txt").map(io.Source.fromFile(_).getLines()).reduce(_ ++ _)
val source = Source.fromIterator[String](() => sources)
val lineCount = source.map(_ => 1).runWith(Sink.reduce[Int](_ + _))

I would still like to know whether there is a better solution.
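One caveat with the snippet above: Seq.map is strict, so every file is opened as soon as sources is built, and none of them is ever closed. A possible refinement, using only the Scala standard library, is sketched below. The names LazyLinesSketch and lazyLines are hypothetical, and UTF-8 encoding is an assumption.

```scala
import java.nio.file.Path
import scala.io.Source

object LazyLinesSketch {
  // Hypothetical helper: opens each file only when iteration reaches it,
  // and closes it once its last line has been read.
  def lazyLines(paths: Seq[Path]): Iterator[String] =
    paths.iterator.flatMap { path =>
      val src = Source.fromFile(path.toFile, "UTF-8") // encoding is an assumption
      new Iterator[String] {
        private val underlying = src.getLines()
        private var open = true
        def hasNext: Boolean = open && {
          if (underlying.hasNext) true
          else { src.close(); open = false; false } // end of file: release the handle
        }
        def next(): String = underlying.next()
      }
    }
}
```

The resulting iterator could then be handed to Akka's Source.fromIterator(() => LazyLinesSketch.lazyLines(paths)), so that each materialization gets a fresh iterator. Note that a file stays open if the stream stops before reaching its end, so this remains a sketch and is no substitute for FileIO.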