Question

I'm using Scala. I need to read a large gzip file and convert it to a String, dropping the first line. This is how I read the file:

val fis = new FileInputStream(filename)
val gz  = new GZIPInputStream(fis)

Then I tried Source.fromInputStream(gz).getLines.drop(1).mkString(""), but it causes an out-of-memory error.

So I thought of reading the file line by line and putting it into a byte array, which I could then convert to a single String at the end.

But I don't know how to do that. Any suggestions? Better approaches are also welcome.

Answer 1

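If your gzipped file is large, you can use a BufferedReader. Here is an example. It copies all characters from the gzipped file to an uncompressed file, but skips the first line.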

import java.util.zip.GZIPInputStream
import java.io._
import java.nio.charset.StandardCharsets

import scala.annotation.tailrec
import scala.util.Try

val bufferSize = 4096
val pathToGzFile = "/tmp/text.txt.gz"
val pathToOutputFile = "/tmp/text_without_first_line.txt"
val charset = StandardCharsets.UTF_8

val inStream = new FileInputStream(pathToGzFile)
val outStream = new FileOutputStream(pathToOutputFile)

try {
  val inGzipStream = new GZIPInputStream(inStream)
  val inReader = new InputStreamReader(inGzipStream, charset)
  val outWriter = new OutputStreamWriter(outStream, charset)
  val bufferedReader = new BufferedReader(inReader)

  // Everything opened inside this block; closed together once the copy is done
  val closeables = Array[Closeable](inGzipStream, inReader,
    outWriter, bufferedReader)
  // Read the first line here so that copy() never sees it, which skips it in the output
  val firstLine = bufferedReader.readLine()
  println(s"First line: $firstLine")

  @tailrec
  def copy(in: Reader, out: Writer, buffer: Array[Char]): Unit = {
    // Copy while it's not end of file
    val readChars = in.read(buffer, 0, buffer.length)
    if (readChars > 0) {
      out.write(buffer, 0, readChars)
      copy(in, out, buffer)
    }
  }

  // Copy chars from bufferedReader to outWriter using buffer
  copy(bufferedReader, outWriter, Array.ofDim[Char](bufferSize))

  // Close all closeables (closing outWriter also flushes any buffered output)
  closeables.foreach(c => Try(c.close()))
}
finally {
  Try(inStream.close())
  Try(outStream.close())
}
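
If you really do need the result as a single String rather than a file, the same buffered loop can append into a StringBuilder instead of a Writer. The following is only a minimal sketch under some assumptions: the object name GzipToString and the method readSkippingFirstLine are made up for illustration, the file is assumed to be UTF-8 unless you pass another charset, and the decompressed text still has to fit in the JVM heap (and within the length limit of a String). If it does not, you will need a larger heap (-Xmx) or to process the data as a stream, as in the file-copy example above.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.{Charset, StandardCharsets}
import java.util.zip.GZIPInputStream

object GzipToString {
  // Hypothetical helper (not part of any library): decompresses `in`,
  // discards the first line and returns the remaining text as one String.
  // Closing the reader also closes the underlying `in` stream.
  def readSkippingFirstLine(
      in: InputStream,
      charset: Charset = StandardCharsets.UTF_8,
      bufferSize: Int = 4096): String = {
    val reader = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(in), charset))
    try {
      // Read and discard the first line (readLine returns null on empty input)
      reader.readLine()

      val sb = new java.lang.StringBuilder
      val buffer = new Array[Char](bufferSize)
      // read returns -1 at end of stream
      var readChars = reader.read(buffer, 0, buffer.length)
      while (readChars != -1) {
        sb.append(buffer, 0, readChars)
        readChars = reader.read(buffer, 0, buffer.length)
      }
      sb.toString
    } finally {
      reader.close()
    }
  }
}
```

You would call it as GzipToString.readSkippingFirstLine(new FileInputStream(pathToGzFile)). Note that, unlike getLines followed by mkString(""), this keeps the line terminators of the remaining lines.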
