Question

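I have some large text files (say 200 MiB to 2 GiB) filled with huge numbers of duplicated records. Each line can have about 100 or even more exact duplicates spread across the file. The task is to remove all the repetitions, leaving one unique instance of every record.

I've implemented it as follows: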


object CleanFile {
  def apply(s: String, t: String): Unit = {
    import java.io.{PrintWriter, FileWriter, BufferedReader, FileReader}

    println("Reading " + s + "...")

    var linesRead = 0

    val lines = new scala.collection.mutable.ArrayBuffer[String]()

    val fr = new FileReader(s)
    val br = new BufferedReader(fr)

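    // Read the first line up front so the end-of-file null is never stored or counted
    var rl = br.readLine()

    while (rl != null) {
      // contains() scans the whole buffer for every line read
      if (!lines.contains(rl))
        lines += rl

      linesRead += 1

      if (linesRead > 0 && linesRead % 100000 == 0)
        println(linesRead + " lines read, " + lines.length + " unique found.")

      rl = br.readLine()
    }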

    br.close()
    fr.close()

    println(linesRead + " lines read, " + lines.length + " unique found.")
    println("Writing " + t + "...")

    val fw = new FileWriter(t);
    val pw = new PrintWriter(fw);

    lines.foreach(line => pw.println(line))

    pw.close()
    fw.close()
  }
}

It takes about 15 minutes (on my Core 2 Duo with 4 GB RAM) to process a 92 MiB file. Whereas the following command:

awk '!seen[$0]++' filename

takes about a minute to process a 1.1 GiB file (which would take a very long time with my code above).

What is wrong with my code?

Answer 1

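What's wrong is that you're using an array to store the lines. A lookup (lines.contains) takes O(n) in an array, so the whole thing runs in O(n²) time. By contrast, the Awk solution uses a hash table, which means O(1) lookup and a total running time of O(n).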

Try using a mutable.HashSet instead.
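Something along these lines might work as a starting point. It is only a sketch: the object name CleanFileHashed and the returned pair of counts are made up for illustration and are not from the question. The set is used purely for membership tests, and each line is written out the first time it is seen, so the output keeps first-occurrence order the same way the awk one-liner does.

```scala
import java.io.{BufferedReader, FileReader, PrintWriter}
import scala.collection.mutable

// Hypothetical name; a streaming variant of the question's CleanFile.
object CleanFileHashed {
  // Copies each distinct line of `source` to `target`, first occurrences in order.
  // Returns (lines read, unique lines written).
  def apply(source: String, target: String): (Long, Long) = {
    val seen = mutable.HashSet[String]()
    val reader = new BufferedReader(new FileReader(source))
    val writer = new PrintWriter(target)
    var linesRead = 0L
    try {
      var line = reader.readLine()
      while (line != null) {
        linesRead += 1
        // add() returns true only when the line was not already in the set,
        // so the membership test and the insertion are a single hash lookup.
        if (seen.add(line)) writer.println(line)
        line = reader.readLine()
      }
    } finally {
      reader.close()
      writer.close()
    }
    (linesRead, seen.size.toLong)
  }
}
```

Note that every unique line still has to fit in memory, so this helps with the running time but not with files whose distinct content exceeds the heap.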

Answer 2

You could also read all the lines and call .distinct on them. I don't know how distinct is implemented, but I'd bet it uses a HashSet to do it.
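A rough sketch of that idea, assuming the whole file fits in memory (the object name CleanFileDistinct and the returned count are invented for this example):

```scala
import java.io.PrintWriter
import scala.io.Source

// Hypothetical name; loads every line into memory before de-duplicating.
object CleanFileDistinct {
  // Writes the distinct lines of `source` to `target` and returns how many there were.
  def apply(source: String, target: String): Int = {
    val in = Source.fromFile(source)
    // distinct keeps the first occurrence of each line, in original order
    val unique = try in.getLines().toVector.distinct finally in.close()
    val out = new PrintWriter(target)
    try unique.foreach(line => out.println(line)) finally out.close()
    unique.length
  }
}
```

Compared with checking a set while streaming, this holds all lines (duplicates included) in memory at once, which may matter for files in the gigabyte range.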

Why is my line de-duplication application written in Scala so slow?