ZipFileInputFormat

Question

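I have zip files that I would like to open 'through' Spark. I can open .gzip files without any problem because of Hadoop's native codec support, but I am unable to do the same with .zip files.

Is there an easy way to read a zip file in Spark code? I have also searched for zip codec implementations to add to the CompressionCodecFactory, but so far without success.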

Answer 1

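There was no solution using Python code here, and I recently had to read zips in PySpark. I came across this question while searching for how to do that, so hopefully this helps others.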

import io
import zipfile

def zip_extract(x):
    # x is a (path, content) pair from sc.binaryFiles; x[1] holds the raw zip bytes
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # Map each entry name to that entry's contents (bytes)
    return dict(zip(files, [file_obj.open(file).read() for file in files]))


zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()

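In the code above I return a dictionary with the name of each file inside the zip as the key and the data read from that file (as bytes) as the value. You can change it however you like to suit your purposes.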
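As one way of adapting the function above, here is a minimal sketch that decodes each entry to text and skips directory entries. The name zip_extract_text and the encoding parameter are hypothetical and not part of the original answer, and the sketch assumes every entry is a UTF-8 text file.

```python
import io
import zipfile


def zip_extract_text(pair, encoding="utf-8"):
    """Map one (path, raw_bytes) pair to {entry_name: decoded_text}."""
    _path, raw_bytes = pair
    with zipfile.ZipFile(io.BytesIO(raw_bytes), "r") as archive:
        return {
            info.filename: archive.read(info).decode(encoding)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries, they carry no data
        }
```

It should be usable in place of zip_extract, for example zips.map(zip_extract_text). Note that each archive is still loaded fully into memory on the executor.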

Answer 2

Please try the following code:

using API sparkContext.newAPIHadoopRDD(
    hadoopConf,
    InputFormat.class,
    ImmutableBytesWritable.class, Result.class)

Answer 3

I had a similar problem and solved it with the following code:

sparkContext.binaryFiles("/pathToZipFiles/*")
    .flatMap { case (zipFilePath, zipContent) =>
        val zipInputStream = new ZipInputStream(zipContent.open())

        Stream.continually(zipInputStream.getNextEntry)
            .takeWhile(_ != null)
            .flatMap { zipEntry => ??? }
    }

Answer 4

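@user3591785 pointed me in the right direction, so I marked his answer as correct.

For more detail, I searched for ZipFileInputFormat Hadoop and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/

Using ZipFileInputFormat and its helper ZipfileRecordReader class, I was able to get Spark to open and read the zip file without problems.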

    rdd1  = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());

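The result was a map with one element: the file name as the key and the content as the value, so I needed to transform it into a JavaPairRdd. I'm sure you could replace Text with BytesWritable if you wanted to, and replace the ArrayList with something else, but my goal was to get something running first.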

JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {

    @Override
    public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
        List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();

        // Text.getBytes() returns the backing array, which is only valid up to getLength()
        Text content = textTextTuple2._2;
        InputStream is = new ByteArrayInputStream(content.getBytes(), 0, content.getLength());
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));

        String line;

        while ((line = br.readLine()) != null) {
            // Key each line by its first tab-separated field
            Tuple2<String, String> newTuple = new Tuple2<String, String>(line.split("\\t")[0], line);
            newList.add(newTuple);
        }
        return newList;
    }
});
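For readers following the PySpark route, the per-line transformation in the Java code above (key each line by its first tab-separated field) might look roughly like the sketch below. The function name key_lines_by_first_field is hypothetical, and str.splitlines() recognizes a few more line terminators than Java's readLine(), so treat it as an approximation rather than an exact port.

```python
def key_lines_by_first_field(content, sep="\t"):
    """Turn file content into (first_field, full_line) pairs, one per line."""
    return [(line.split(sep)[0], line) for line in content.splitlines()]
```

Applied with flatMap to an RDD of decoded file contents, this would give an RDD of (key, line) pairs similar to rdd2.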

Answer 5

using API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class)

The file name should be passed using conf:

conf=( new Job().getConfiguration())
conf.set(PROPERTY_NAME from your input formatter,"Zip file address")
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)

Please look up PROPERTY_NAME in your input formatter to set the path.

Answer 6

This answer only collects the previous knowledge and adds my own experience.

ZipFileInputFormat

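I tried following the answers from @Tinku and @JeffLL, using the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API, but this did not work for me. I also do not know how I would put the com-cotdp-hadoop lib on my production cluster, since I am not responsible for the setup.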

ZipInputStream

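@Tiago Palma gave good advice, but he did not finish his answer, and I struggled for quite a while to actually get the decompressed output.

By the time I was able to do so, I had to work through all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135

The missing part of the answer mentioned above is reading the ZipEntry: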

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    // Walk the entries in order; reading from zis yields the current entry's data
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
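The same idea (walk every entry, then emit its lines) can be sketched in Python for PySpark users. The function name iter_zip_lines and the encoding parameter are hypothetical, and the sketch assumes all entries are text files. Unlike the Scala version, PySpark hands over the whole archive as bytes, so only the per-entry reading is lazy.

```python
import io
import zipfile


def iter_zip_lines(pair, encoding="utf-8"):
    """Yield every text line from every entry of one (path, raw_bytes) pair."""
    _path, raw_bytes = pair
    with zipfile.ZipFile(io.BytesIO(raw_bytes), "r") as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content
            with archive.open(info) as entry:
                # Decode the entry's byte stream lazily, one line at a time
                for line in io.TextIOWrapper(entry, encoding=encoding):
                    yield line.rstrip("\r\n")
```

A possible use would be sc.binaryFiles(path).flatMap(iter_zip_lines), which yields one RDD element per line across all archives.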

Answer 7

Try:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")

How to open/stream .zip files through Spark?
