Question

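I have a graph that reads lines from multiple gzipped files and writes those lines to another set of gzipped files, routed according to a value in each line.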

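It works correctly on small data sets but fails to terminate on larger ones. (The data size may not be the cause: each run takes a while, so I have not run it enough times to be sure.)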

def files: Source[File, NotUsed] =
  Source.fromIterator(
    () =>
      Files
        .fileTraverser()
        .breadthFirst(inDir)
        .asScala
        .filter(_.getName.endsWith(".gz"))
        .toIterator)

def extract =
  Flow[File]
    .mapConcat[String](unzip)
    .mapConcat(s =>
      (JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
    .groupBy(1 << 16, _._1)
    .groupedWithin(1000, 1.second)
    .map { lines =>
      val w = writer(lines.head._1)
      w.println(lines.map(_._2).mkString("\n"))
      w.close()
      Done
    }
    .mergeSubstreams

def unzip(f: File) = {
  scala.io.Source
    .fromInputStream(new GZIPInputStream(new FileInputStream(f)))
    .getLines
    .toIterable
    .to[collection.immutable.Iterable]
}

def writer(tk: String): PrintWriter =
  new PrintWriter(
    new OutputStreamWriter(
      new GZIPOutputStream(
        new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)
      ))
  )

val process = files.via(extract).toMat(Sink.ignore)(Keep.right).run()

Await.result(process, Duration.Inf)

A thread dump shows the process WAITING at Await.result(process, Duration.Inf), with nothing else happening.

OpenJDK v11 with Akka v2.5.15

Answer 1

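It is most likely stuck in groupBy because it ran out of available threads in the dispatcher while gathering items into up to 2^16 groups for all sources.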

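If I were you, I would probably implement the grouping in extract semi-manually, using statefulMapConcat with a mutable Map[KeyType, List[String]]. Alternatively, buffer the lines with groupedWithin first, then split each batch into groups and write them to the different files in Sink.foreach.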
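The second suggestion (batch with groupedWithin, then group each batch in memory) could be sketched roughly as follows. This is only an illustration under several assumptions: it targets the Akka 2.5.x API (ActorMaterializer), the names Batching, groupBatch, and appendLines are made up for this example, the output goes to a temporary directory, and a small hard-coded Source stands in for the (tk, line) pairs that the question's unzip and parse stages would produce. It uses runForeach, which is shorthand for running the stream with Sink.foreach.

```scala
import java.io.{File, FileOutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import java.util.zip.GZIPOutputStream

import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

// Hypothetical helper: pure in-memory grouping of one batch of (tk, line) pairs.
object Batching {
  def groupBatch(batch: Seq[(String, String)]): Map[String, Seq[String]] =
    batch.groupBy(_._1).map { case (tk, pairs) => tk -> pairs.map(_._2) }
}

object BatchedGrouping extends App {
  implicit val system: ActorSystem = ActorSystem("batched-grouping")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Assumption: a temporary directory stands in for the question's outDir.
  val outDir: File = Files.createTempDirectory("grouped-out").toFile

  // Append the lines for one key to its gzip file, always closing the writer.
  def appendLines(tk: String, lines: Seq[String]): Unit = {
    val w = new PrintWriter(
      new OutputStreamWriter(
        new GZIPOutputStream(
          new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)),
        StandardCharsets.UTF_8))
    try lines.foreach(w.println) finally w.close()
  }

  // Assumption: sample data in place of the unzip + JSON parsing stages.
  val pairs: Source[(String, String), NotUsed] =
    Source(List(
      "a" -> """{"tk":["a"],"n":1}""",
      "b" -> """{"tk":["b"],"n":2}""",
      "a" -> """{"tk":["a"],"n":3}"""))

  // No groupBy substreams: batch first, then group each batch in memory.
  val done: Future[Done] =
    pairs
      .groupedWithin(1000, 1.second)
      .runForeach { batch =>
        Batching.groupBatch(batch).foreach { case (tk, lines) =>
          appendLines(tk, lines)
        }
      }

  Await.result(done, Duration.Inf)
  system.terminate()
}
```

Because every batch is handled in a single stage, there are never thousands of open substreams. The trade-off is that each batch opens and closes one file per distinct key it contains, in append mode just like the question's writer.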
