Question

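I'm playing around with Scala's lazy iterators, and I've run into an issue. What I'm trying to do is read in a large file, do a transformation, and then write out the result: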

import java.io.PrintWriter
import scala.io.Source

object FileProcessor {
  def main(args: Array[String]): Unit = {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")

    try {
      // this "basic" lazy iterator works fine
      // val iterator = inSource.getLines()

      // ...but this one, which incorporates my process method,
      // throws OutOfMemoryError
      val iterator = process(inSource.getLines().toSeq).iterator

      while (iterator.hasNext) outSource.println(iterator.next())

    } finally {
      inSource.close()
      outSource.close()
    }
  }

  // processing in this case just means upper-casing every line
  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}

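So I'm getting an OutOfMemoryError on large files. I know you can run afoul of Scala's lazy Streams if you keep a reference to the head of the Stream. So in this case I'm careful to convert the result of process() to an iterator and throw away the Seq it initially returns.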

Does anyone know why this still causes O(n) memory consumption? Thanks!

Update

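In response to fge and huynhjl: it seems the Seq might be the culprit, but I don't know why. As an example, the following code works fine (and I'm using Seq all over the place). This code does not produce an OutOfMemoryError: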

import java.io.PrintWriter
import scala.io.Source

object FileReader {
  def main(args: Array[String]): Unit = {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      writeToFile(outSource, process(inSource.getLines().toSeq))
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  // writes the head of the sequence, then recurses on the tail
  @scala.annotation.tailrec
  private def writeToFile(outSource: PrintWriter, contents: Seq[String]): Unit = {
    if (contents.nonEmpty) {
      outSource.println(contents.head)
      writeToFile(outSource, contents.tail)
    }
  }

  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}

Answer 1

As hinted by fge, modify process to take an iterator and remove the .toSeq. inSource.getLines is already an iterator.

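Converting to a Seq causes the items to be remembered. I think it converts the iterator into a Stream, which then holds on to every item it has produced.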
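A minimal sketch of that change, assuming the same in.txt and out.txt file names as in the question. The object name IteratorFileProcessor is made up for illustration, and process is left public here only so it can be called on its own. Because Iterator.map is lazy, each line is upper-cased only when the writing loop pulls it, so no intermediate Seq is built:

```scala
import java.io.PrintWriter
import scala.io.Source

// Hypothetical name; the point is only that process works on Iterator, not Seq.
object IteratorFileProcessor {
  // Iterator.map is lazy: a line is transformed only when it is requested.
  def process(lines: Iterator[String]): Iterator[String] =
    lines.map(_.toUpperCase)

  def main(args: Array[String]): Unit = {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      // No .toSeq anywhere: lines flow from the reader to the writer one at a time.
      process(inSource.getLines()).foreach(line => outSource.println(line))
    } finally {
      inSource.close()
      outSource.close()
    }
  }
}
```

With this shape nothing in the program holds a reference to lines that have already been written, so they should become eligible for garbage collection as the loop advances.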

Edit: OK, it's more subtle. By calling iterator on the result of process, you are doing the equivalent of Iterator.toSeq.iterator. This can cause an out-of-memory error.

scala> Iterator.continually(1).toSeq.iterator.take(300*1024*1024).size
java.lang.OutOfMemoryError: Java heap space

This may be the same issue as the one reported here: https://issues.scala-lang.org/browse/SI-4835. Note my comment at the end of the bug; it comes from personal experience.
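For contrast with the REPL line above, here is a small sketch of the same count with the .toSeq.iterator round trip removed. The names IteratorOnly and countElements are hypothetical. It relies only on Iterator.continually, take and size, which are expected to consume the elements one by one without retaining them:

```scala
// Hypothetical helper: counts n elements drawn from an infinite iterator.
object IteratorOnly {
  // Each element is produced, counted and dropped; nothing is materialized.
  def countElements(n: Int): Int =
    Iterator.continually(1).take(n).size
}
```

Calling IteratorOnly.countElements(300 * 1024 * 1024) walks the same number of elements as the failing expression, but no collection of them is ever kept.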
