I have Parquet files containing a column with compressed content.
Currently, my Spark job (written in Scala) stringifies that content with a chain of
java.io readers:
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// x holds the compressed bytes read from the Parquet column
val output: StringBuilder = new StringBuilder
val byteArrayInputStream: ByteArrayInputStream = new ByteArrayInputStream(x)
try {
  val gzipInputStream: GZIPInputStream = new GZIPInputStream(byteArrayInputStream)
  try {
    val inputStreamReader: InputStreamReader = new InputStreamReader(gzipInputStream, StandardCharsets.UTF_8)
    try {
      val bufferedReader: BufferedReader = new BufferedReader(inputStreamReader)
      try {
        var line: String = null
        do {
          line = bufferedReader.readLine()
          if (line != null)
            output.append(line)
        } while (line != null)
      } finally {
        if (bufferedReader != null) {
          bufferedReader.close()
        }
      }
    } finally {
      if (inputStreamReader != null) {
        inputStreamReader.close()
      }
    }
  } finally {
    if (gzipInputStream != null) {
      gzipInputStream.close()
    }
  }
} finally {
  if (byteArrayInputStream != null) {
    byteArrayInputStream.close()
  }
}
val out = output.toString
return out
However, this causes a java.lang.OutOfMemoryError: GC overhead limit exceeded
exception on the Hadoop cluster.
Is there a better way to decompress the content?
Answer 0 (score: 0)
You can define a Spark UDF (user-defined function) to decompress the gzipped byte array:
static UDF1<byte[], String> unzip = YourClass::gzipDecompress;
spark.sqlContext().udf().register("unzip", unzip, DataTypes.StringType);
// withColumn returns a new Dataset, so keep the result
df = df.withColumn("unzipped_column", callUDF("unzip", col("your_original_column_with_gzip_data")));
Leaving aside the cause of the failure, you may also benefit from a similar gzip decompression implementation in Scala:
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

def decompress(compressed: Array[Byte]): String = {
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
  scala.io.Source.fromInputStream(inputStream).mkString
}
Note: the UDF example above is written in Java, but the Scala version should look very similar; see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#callUDF-java.lang.String-org.apache.spark.sql.Column...-
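For example, here is a minimal Scala sketch of the same registration, assuming the decompress helper above (the DataFrame df and the column names are placeholders):

import org.apache.spark.sql.functions.{col, udf}

// Wrap the Scala decompress helper from the previous snippet as a Spark UDF.
val unzipUdf = udf((bytes: Array[Byte]) => decompress(bytes))

// Apply it to the compressed column and keep the returned DataFrame.
val unzipped = df.withColumn("unzipped_column", unzipUdf(col("your_original_column_with_gzip_data")))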