Reading Zstandard-compressed files in Spark 2.3.0

Date: 2018-06-15 02:16:39

Tags: apache-spark hadoop2 amazon-emr zstandard

As of Spark 2.3.0 (https://issues.apache.org/jira/browse/SPARK-19112), Apache Spark supposedly supports Facebook's Zstandard compression algorithm, but I cannot actually read a Zstandard-compressed file:

$ spark-shell

...

// Short name throws an exception
scala> val events = spark.read.option("compression", "zstd").json("data.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.

// Codec class can be imported
scala> import org.apache.spark.io.ZStdCompressionCodec
import org.apache.spark.io.ZStdCompressionCodec

// Fully-qualified codec class bypasses the error, but yields corrupt records
scala> spark.read.option("compression", "org.apache.spark.io.ZStdCompressionCodec").json("data.zst")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
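
As a sanity check, one can also ask Hadoop directly which codec, if any, it associates with the .zst extension (a sketch using Hadoop's CompressionCodecFactory; getCodec returns null when nothing is registered for the extension):

scala> import org.apache.hadoop.fs.Path
scala> import org.apache.hadoop.io.compress.CompressionCodecFactory

// Build the factory from the same Hadoop configuration Spark uses
scala> val factory = new CompressionCodecFactory(spark.sparkContext.hadoopConfiguration)

// Expected to return null when no codec is registered for .zst
scala> factory.getCodec(new Path("data.zst"))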

What do I need to do to read such a file?

The environment is AWS EMR 5.14.0.

1 Answer:

Answer 0 (score: 1)

Per this comment, Zstandard support in Spark 2.3.0 is limited to internal use, such as shuffle output.
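
For example, the zstd codec that SPARK-19112 added can be enabled for Spark's internal block compression (shuffle files, broadcast variables, RDD spills) via spark.io.compression.codec, but that setting has no bearing on reading input files (a sketch):

$ spark-shell --conf spark.io.compression.codec=zstd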

Reading or writing Zstandard files relies on Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which was introduced in Hadoop 2.9.0 (EMR 5.14.0 includes Hadoop 2.8.3).
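
On a cluster running Hadoop 2.9.0 or later (with ZStandardCodec registered and native zstd support compiled into the Hadoop libraries), the codec should be picked up by the .zst file extension, so a plain read is expected to work; a sketch under those assumptions, not verified on EMR:

// Codec selection for text-based sources is driven by the file
// extension, so no "compression" read option should be needed
scala> val events = spark.read.json("data.zst")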