How to decompress a column in a Spark DataFrame using Spark Scala

Asked: 2019-04-16 06:07:10

Tags: scala apache-spark gzip

I have Parquet files containing a column that holds gzip-compressed content. Currently, my Spark job (written in Scala) uses a chain of java.io readers to turn that content into a String:

import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// Decompress a gzip'd byte array into a String, reading line by line
def stringify(x: Array[Byte]): String = {
  val output: StringBuilder = new StringBuilder
  val byteArrayInputStream: ByteArrayInputStream = new ByteArrayInputStream(x)
  try {
    val gzipInputStream: GZIPInputStream = new GZIPInputStream(byteArrayInputStream)
    try {
      val inputStreamReader: InputStreamReader = new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8)
      try {
        val bufferedReader: BufferedReader = new BufferedReader(inputStreamReader)
        try {
          // Accumulate every decompressed line into the output buffer
          var line: String = null
          do {
            line = bufferedReader.readLine()
            if (line != null)
              output.append(line)
          } while (line != null)
        } finally {
          if (bufferedReader != null) {
            bufferedReader.close()
          }
        }
      } finally {
        if (inputStreamReader != null) {
          inputStreamReader.close()
        }
      }
    } finally {
      if (gzipInputStream != null) {
        gzipInputStream.close()
      }
    }
  } finally {
    if (byteArrayInputStream != null) {
      byteArrayInputStream.close()
    }
  }
  output.toString
}

But this causes a java.lang.OutOfMemoryError: GC overhead limit exceeded exception on the Hadoop cluster.

Is there a better way to decompress the content?

1 Answer:

Answer 0 (score: 0)

You can define a Spark UDF (user-defined function) to decompress the gzip byte array:

  1. Define a UDF that takes a byte array and returns a String
    static UDF1 unzip = (UDF1<byte[], String>) YourClass::gzipDecompress;
  2. Register the UDF
    spark.sqlContext().udf().register("unzip", unzip, DataTypes.StringType);
  3. Ask Spark to compute a new column with the UDF
    df.withColumn("unzipped_column", callUDF("unzip", col("your_original_column_with_gzip_data")))

You may also benefit from this other, similar gzip decompression implementation in Scala (failure cases left aside):

import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

// Decompress a gzip'd byte array into a String in one pass
def decompress(compressed: Array[Byte]): String = {
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
  scala.io.Source.fromInputStream(inputStream).mkString
}

Source: https://github.com/rest-assured/rest-assured/blob/master/examples/scalatra-example/src/main/scala/io/restassured/scalatra/support/Gzip.scala
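
As a quick sanity check (a sketch, not part of the original answer), you can round-trip a small string through java.util.zip.GZIPOutputStream and the decompress helper above:

import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

// Gzip a sample string in memory, then feed the bytes to decompress
val buffer = new ByteArrayOutputStream()
val gzipOut = new GZIPOutputStream(buffer)
gzipOut.write("hello gzip".getBytes(StandardCharsets.UTF_8))
gzipOut.close() // close() writes the gzip trailer; the bytes are incomplete without it
assert(decompress(buffer.toByteArray) == "hello gzip")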

Note: the UDF example is written in Java, but it should look very similar in Scala. See https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#callUDF-java.lang.String-org.apache.spark.sql.Column...-
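
For reference, a minimal Scala sketch of those three steps might look like the following (assuming a SparkSession named spark, a DataFrame df, and the decompress helper above; the column names are placeholders carried over from the Java example):

import org.apache.spark.sql.functions.{callUDF, col}

// Register the Scala decompress helper under the name "unzip"
spark.udf.register("unzip", (bytes: Array[Byte]) => decompress(bytes))

// Ask Spark to compute the decompressed column
val unzipped = df.withColumn(
  "unzipped_column",
  callUDF("unzip", col("your_original_column_with_gzip_data"))
)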