Question

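I am trying to read a tar file from an SFTP location, untar it using streams, and then gzip it again into the Hadoop ecosystem. I have tarred a simple file containing the text:

Text in compressed file.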

The archive is named testDataTar.txt.tar and the text file inside it is named testDataTar.txt. The code snippet I use for reading and converting is below:

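// Imports assumed from the calls used below (Commons VFS, Commons Compress, Commons IO, Hadoop).
import java.io.DataOutputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.io.IOUtils
import org.apache.commons.vfs2.FileObject
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.compress.GzipCodec

def copyFilesToHdfs(sourceFile: FileObject, fs: FileSystem, targetPath: Path): String = {
  // testDataTar.txt.tar becomes testDataTar.txt.gz
  val fileName = sourceFile.getName.getBaseName.replaceAll("tar$", "gz")
  val fis = sourceFile.getContent.getInputStream
  // Wrap the raw SFTP stream as a tar archive stream
  val is = new TarArchiveInputStream(fis)
  val targetFile = new Path(targetPath, fileName)

  // Gzip-compressing output stream on HDFS
  val codec = new GzipCodec
  codec.setConf(new Configuration())
  val os = new DataOutputStream(codec.createOutputStream(fs.create(targetFile)))

  IOUtils.copy(is, os)

  is.close()
  os.close()
  fis.close()

  fileName
}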

If I then read the resulting file like this:

val fileContent = Config.sparkContext.textFile(targetFile.toString).first()

the file's content is empty (I also checked this manually), and the call ends with the following exception:

java.lang.UnsupportedOperationException: empty collection
20:22:42   at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1369)
20:22:42   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
20:22:42   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
20:22:42   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
20:22:42   at org.apache.spark.rdd.RDD.first(RDD.scala:1366)

However, if I do the same thing but skip the TarArchiveInputStream line, use fis directly, and feed in the plain text file itself, I get a perfectly gzipped file as output on HDFS.

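So, to sum up, I somehow cannot get the file content across from the .tar stream to the .gz stream. I have also tried converting a .zip (ZipInputStream) to .gz, and the result is the same. Any help is appreciated.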
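One possible explanation, offered as a sketch rather than a confirmed diagnosis: archive input streams such as java.util.zip.ZipInputStream report end of stream until they have been moved to an entry with getNextEntry, and TarArchiveInputStream from Commons Compress has a getNextEntry method that plays the same role. Copying from the archive stream without that call would therefore write an empty gzip file. The example below reproduces this with the JDK zip classes only, so it runs without SFTP or HDFS; the object name ZipToGzipSketch, its helper methods, and the in-memory archive are hypothetical stand-ins and not part of the original code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream, ZipEntry, ZipInputStream, ZipOutputStream}

// Hypothetical demo object; in-memory byte arrays stand in for SFTP and HDFS.
object ZipToGzipSketch {

  // Copy everything readable from `in` to `out`; returns the number of bytes copied.
  def copy(in: InputStream, out: OutputStream): Long = {
    val buf = new Array[Byte](8192)
    var total = 0L
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      total += n
      n = in.read(buf)
    }
    total
  }

  // Build a small in-memory zip holding a single text entry.
  def sampleZip(name: String, text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bytes)
    zos.putNextEntry(new ZipEntry(name))
    zos.write(text.getBytes(UTF_8))
    zos.closeEntry()
    zos.close()
    bytes.toByteArray
  }

  // Re-compress the archive content as gzip. With selectEntry = false the
  // archive stream is read without first moving to an entry, as in the question.
  def zipToGzip(zip: Array[Byte], selectEntry: Boolean): Array[Byte] = {
    val zis = new ZipInputStream(new ByteArrayInputStream(zip))
    val target = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(target)
    try {
      if (selectEntry) {
        var entry = zis.getNextEntry
        while (entry != null) {
          // Reads now return this entry's bytes until the entry ends.
          if (!entry.isDirectory) copy(zis, gz)
          entry = zis.getNextEntry
        }
      } else {
        copy(zis, gz) // no entry selected: read returns -1 immediately
      }
    } finally {
      gz.close()
      zis.close()
    }
    target.toByteArray
  }

  // Decompress gzip bytes back to a string, to inspect the result.
  def gunzip(data: Array[Byte]): String = {
    val out = new ByteArrayOutputStream()
    val in = new GZIPInputStream(new ByteArrayInputStream(data))
    try copy(in, out) finally in.close()
    new String(out.toByteArray, UTF_8)
  }

  def main(args: Array[String]): Unit = {
    val zip = sampleZip("testDataTar.txt", "Text in compressed file.")
    println(gunzip(zipToGzip(zip, selectEntry = false)).isEmpty) // true: the gzip holds no content
    println(gunzip(zipToGzip(zip, selectEntry = true)))          // Text in compressed file.
  }
}
```

If this is indeed the cause, calling getNextEntry on the TarArchiveInputStream before IOUtils.copy in copyFilesToHdfs would be the analogous change. Note that the loop above concatenates all entries into one gzip stream, which is only sensible for a single-file archive like the one described here.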

Reading a tar file from SFTP and converting it on the fly to gzip on HDFS

0 answers: