Zip support in Apache Spark

Question

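I have read about Spark's support for gzip-kind input files here, and I wonder whether the same support exists for other kinds of compressed files, such as .zip files. So far I have tried processing a file compressed inside a zip archive, but Spark seems unable to read its contents.

I have looked at Hadoop's newAPIHadoopFile and newAPIHadoopRDD, but so far I have not been able to get anything working.

In addition, Spark supports creating a partition for every file under a specified folder, as in the example below: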

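import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
                                  .setMaster("local[4]");

JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);

// Backslashes must be escaped inside a Java string literal
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();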

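C:\input\ points to a directory containing multiple files.

If zipped files can be processed, would it also be possible to pack every file into a single compressed file and follow the same pattern of one partition per file?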

Answer 1

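Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see whether there is something that works.

This site gives us an idea of how to do it (namely, we can use ZipFileInputFormat). That being said, since zip files are not splittable (see this), your request to have a single compressed file is not really supported. Instead, if possible, it would be better to have a directory containing many separate zip files.

This question is similar to this other question, but it adds one more question: whether it is possible to have a single zip file (which, since zip is not a splittable format, is not a good idea).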
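To make "not splittable" more concrete, here is a minimal Python sketch, offered as an illustration only. It is not taken from the linked pages and it is not how Hadoop itself decides splittability. It builds a small archive in memory and checks whether the first half of its bytes can be opened on its own. The helper names build_archive and can_open_as_zip and the sample data are hypothetical.

```python
import io
import zipfile


def build_archive(entries):
    """Return the bytes of a zip archive holding the given {name: text} entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in entries.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def can_open_as_zip(raw_bytes):
    """Report whether this byte range can be opened as a zip archive on its own."""
    try:
        with zipfile.ZipFile(io.BytesIO(raw_bytes)):
            return True
    except zipfile.BadZipFile:
        return False


# Hypothetical sample data, only for illustration.
archive_bytes = build_archive({
    "a.txt": "first file\n" * 200,
    "b.txt": "second file\n" * 200,
})
first_half = archive_bytes[: len(archive_bytes) // 2]

whole_ok = can_open_as_zip(archive_bytes)  # the complete archive
half_ok = can_open_as_zip(first_half)      # missing the index stored at the end
print(whole_ok, half_ok)
```

The index of a zip archive is stored at the end of the file, so a reader handed only part of the bytes has nothing to start from. That is roughly why one zip file ends up being processed as a single unit.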

Answer 2

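Spark supports compressed files by default

According to the Spark Programming Guide:

All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

This can be expanded with information about which compression formats Hadoop supports, which can basically be checked by finding all classes that extend CompressionCodec (docs):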

name    | ext      | codec class
-------------------------------------------------------------
bzip2   | .bz2     | org.apache.hadoop.io.compress.BZip2Codec 
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec 
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec 
gzip    | .gz      | org.apache.hadoop.io.compress.GzipCodec 
lz4     | .lz4     | org.apache.hadoop.io.compress.Lz4Codec 
snappy  | .snappy  | org.apache.hadoop.io.compress.SnappyCodec

Source: List the available hadoop codecs

So the formats above, and many more possibilities, can be read simply by calling:

sc.textFile(path)

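Reading zip files in Spark

Unfortunately, zip is not on the list of formats supported by default.

I found a great article, Hadoop: Processing ZIP files in Map/Reduce, and some answers (example) explaining how to use an imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. But this did not work for me.

My solution

Without any external dependencies, you can load the file with sc.binaryFiles and then decompress the PortableDataStream while reading its content. This is the approach I chose.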
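As a rough way to check this, the codec table above can be turned into a small lookup. The Python sketch below only mirrors that table. It does not query Hadoop, and the codecs actually available depend on the installation. The names CODECS_BY_EXTENSION and codecs_for and the sample paths are hypothetical.

```python
import os

# Extension-to-codec mapping copied from the table above.
# ".deflate" is listed twice in the table, so it maps to two class names.
CODECS_BY_EXTENSION = {
    ".bz2": ("org.apache.hadoop.io.compress.BZip2Codec",),
    ".deflate": (
        "org.apache.hadoop.io.compress.DefaultCodec",
        "org.apache.hadoop.io.compress.DeflateCodec",
    ),
    ".gz": ("org.apache.hadoop.io.compress.GzipCodec",),
    ".lz4": ("org.apache.hadoop.io.compress.Lz4Codec",),
    ".snappy": ("org.apache.hadoop.io.compress.SnappyCodec",),
}


def codecs_for(path):
    """Return the codec class names the table lists for this path's extension."""
    extension = os.path.splitext(path)[1]
    return CODECS_BY_EXTENSION.get(extension, ())


gz_codecs = codecs_for("logs/part-0000.gz")  # hypothetical path
zip_codecs = codecs_for("archive.zip")       # hypothetical path
print(gz_codecs, zip_codecs)
```

An empty result for a .zip path matches the point above: nothing in the table handles that extension.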

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          // this solution works only for a single file in the zip:
          // getNextEntry positions the stream at the first entry only
          val entry = zis.getNextEntry
          val br = new BufferedReader(new InputStreamReader(zis))
          Stream.continually(br.readLine()).takeWhile(_ != null)
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}

To use this implicit class, import it and call the readFile method on the SparkContext:

import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)

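The implicit class will load the zip file properly and return an RDD[String], just as before.

Note: this only works for a single file inside the zip archive!
For support of multiple files inside a zip, see this answer: https://stackoverflow.com/a/45958458/1549135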
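For the multi-file case, one possible shape is sketched below in Python, the language another answer in this thread uses. This is not the code from the linked answer, only a minimal sketch under stated assumptions. It assumes each input is a (path, raw bytes) pair, which is how Answer 4 treats the output of sc.binaryFiles in PySpark. The function name lines_from_all_entries and the sample archive are hypothetical.

```python
import io
import zipfile


def lines_from_all_entries(path_and_bytes, encoding="utf-8"):
    """Yield (entry_name, line) for every file entry in one zip archive."""
    _path, raw_bytes = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries hold no data
            with archive.open(info) as entry:
                for raw_line in entry:
                    yield info.filename, raw_line.decode(encoding).rstrip("\r\n")


# Build a small two-entry archive in memory (hypothetical sample data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "one\ntwo\n")
    archive.writestr("b.txt", "three\n")

result = list(lines_from_all_entries(("demo.zip", buffer.getvalue())))
print(result)

# With a live SparkContext this could be applied per archive, for example:
#   rdd = sc.binaryFiles("/some/dir/*.zip").flatMap(lines_from_all_entries)
# The path above is a placeholder.
```

Keeping the entry name next to each line makes it possible to tell afterwards which file inside the archive a line came from.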

Answer 3

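You can use sc.binaryFiles to read the zip as a binary file:

val rdd = sc.binaryFiles(path).map {
    case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
}  //=> RDD[ZipInputStream]

You can then map each ZipInputStream to a list of lines. Do this inside a transformation rather than on the result of rdd.first, because ZipInputStream is not serializable and cannot be sent back to the driver:

val lines = rdd.map { zis =>
    val entry = zis.getNextEntry  // move to the first entry in the archive
    val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
    Stream.continually(br.readLine()).takeWhile(_ != null).toList
}  //=> RDD[List[String]]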

But the problem remains that zip files are not splittable.

Answer 4

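You can use sc.binaryFiles to open the zip file in binary format and then unzip it into text. Unfortunately, zip files are not splittable, so you have to wait for the decompression to finish; after that you can trigger a shuffle to balance the data across partitions.

Here is an example in Python. More information is at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/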

import io
import zipfile

file_RDD = sc.binaryFiles(HDFS_path + data_path)

def Zip_open(binary_stream_string):  # New version, treat a stream as zipped file
    try:
        pseudo_file = io.BytesIO(binary_stream_string)
        zf = zipfile.ZipFile(pseudo_file)
        return zf
    except zipfile.BadZipFile:
        return None

def read_zip_lines(zipfile_object):
    # 'diff.txt' is the name of the entry to read from the archive
    file_iter = zipfile_object.open('diff.txt')
    data = file_iter.readlines()
    return data

My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))
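The example stops before any lines are read. A minimal sketch of how the two steps might be combined into one function follows, so that only plain strings leave it. This is an assumption about how the pieces fit together, not the code from the linked page. The function name read_entry_lines, its entry_name default (taken from the example above), and the sample data are hypothetical.

```python
import io
import zipfile


def read_entry_lines(path_and_bytes, entry_name="diff.txt", encoding="utf-8"):
    """Return the decoded lines of one named entry, or [] if it cannot be read."""
    _path, raw_bytes = path_and_bytes
    try:
        with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
            with archive.open(entry_name) as entry:
                return [line.decode(encoding).rstrip("\r\n") for line in entry]
    except (zipfile.BadZipFile, KeyError):
        # BadZipFile: the bytes are not a valid archive.
        # KeyError: the archive has no entry called entry_name.
        return []


# Hypothetical sample archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("diff.txt", "a\nb\n")

good_lines = read_entry_lines(("good.zip", buffer.getvalue()))
bad_lines = read_entry_lines(("bad.zip", b"not a zip"))
print(good_lines, bad_lines)

# With a live SparkContext, the balancing step mentioned above could look like:
#   lines = sc.binaryFiles(HDFS_path + data_path).flatMap(read_entry_lines)
#   balanced = lines.repartition(8)  # 8 is an arbitrary example value
```

Returning an empty list instead of None means flatMap simply contributes nothing for an unreadable archive, and repartition then spreads the lines that were read across partitions.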

Answer 5

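Below is an example that searches a directory for .zip files and creates an RDD using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the Spark context. It then writes those files to an output directory.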

// allzip, hadoopConf and ProcessFile are not defined in this excerpt;
// see the full source linked below
allzip.foreach { x =>
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable], hadoopConf)

  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}

https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala

The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop
