Question

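I am writing software that reads different kinds of files from an HDFS file system (some are tar archives, some plain text, some binary, some gzip or otherwise compressed). CompressorStreamFactory.detect correctly identifies the compressed files, as expected, but I get an error when I try to create a BufferedReader to read the compressed data line by line.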

I have tried the following:

val hdfsConf: Configuration = new Configuration()
hdfsConf.set("fs.hdfs.impl",classOf[DistributedFileSystem].getName)
hdfsConf.set("fs.file.impl",classOf[LocalFileSystem].getName)
hdfsConf.set(FileSystem.DEFAULT_FS,"hdfs://a-namenode-that-exists:8020")

val fs: FileSystem = FileSystem.get(new URI("hdfs://a-namenode-that-exists:8020"),hdfsConf)

def getFiles(directory: Path): Unit = {
    val iter: RemoteIterator[LocatedFileStatus] = fs.listFiles(directory,true)

    var c: Int = 0

    while(iter.hasNext && c < 3) {
        val fileStatus: LocatedFileStatus = iter.next()
        val path: Path = fileStatus.getPath
        val hfs: FileSystem = path.getFileSystem(hdfsConf)
        val is: FSDataInputStream = hfs.open(path)

        val t: String = CompressorStreamFactory.detect(new BufferedInputStream(is))
        System.out.println(s"|||||||||| $t |=| ${path.toString}")

        val reader: BufferedReader = new BufferedReader(new InputStreamReader(new CompressorStreamFactory().createCompressorInputStream(new BufferedInputStream(is))))

        var cc: Int = 0
        while(cc < 10) {
            val line: String = reader.readLine()
            System.out.println(s"|||||||||| $line")
            cc += 1
        }
        c += 1
    }
}

getFiles(new Path("/some/directory/that/is/definitely/there/"))

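Since CompressorStreamFactory.detect correctly identifies my file as gzip, I expected reading the file to work as well. The class returned by the HDFS open method is FSDataInputStream, which derives from InputStream (and FilterInputStream). I have used the Apache Commons Compress library extensively to read archives and compressed files from a regular Linux file system, where it identifies the compression in use, so I expected it to work on HDFS too... but I get an error: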

Exception in thread "main" org.apache.commons.compress.compressors.CompressorException: No Compressor found for the stream signature.
    at org.apache.commons.compress.compressors.CompressorStreamFactory.detect(CompressorStreamFactory.java:525)
    at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:542)
    at myorg.myproject.scratch.HdfsTest$.getFiles$1(HdfsTest.scala:34)
    at myorg.myproject.scratch.HdfsTest$.main(HdfsTest.scala:47)
    at myorg.myproject.scratch.HdfsTest.main(HdfsTest.scala)

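I really like the Apache Commons library because its factory methods reduce the complexity of my code for reading archives and compressed files (not just tar or gzip). I am hoping there is a simple explanation for why detection works but reading does not... but I have run out of ideas for tracking this down. My only thought is that FSDataInputStream's FilterInputStream ancestry might be messing things up... but I have no idea how to fix that if it really is the problem.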

Answer 1

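After the call to CompressorStreamFactory.detect(new BufferedInputStream(is)), the input stream is has already been partially consumed. detect resets the BufferedInputStream it was given, but that wrapper has already pulled bytes out of is into its own buffer, so a second BufferedInputStream around is starts past the gzip signature. The data therefore looks malformed to CompressorStreamFactory().createCompressorInputStream. Try reopening the file instead of reusing is.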
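An alternative to reopening the file is to keep a single BufferedInputStream and use it for both detection and decompression. The following is a minimal sketch of that idea, not the asker's code: it assumes Commons Compress 1.14 or later (where the static detect method exists), and the object name DetectThenRead and the helpers gzip and firstLine are hypothetical names introduced for illustration. It uses an in-memory gzip stream so that it runs without an HDFS cluster.

```scala
import java.io.{BufferedInputStream, BufferedReader, ByteArrayInputStream,
  ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

import org.apache.commons.compress.compressors.CompressorStreamFactory

object DetectThenRead {

  // Hypothetical helper: gzip a string in memory so the sketch needs no HDFS.
  def gzip(text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bytes)
    gz.write(text.getBytes(StandardCharsets.UTF_8))
    gz.close()
    bytes.toByteArray
  }

  // Hypothetical helper: detect the format and read the first line
  // through the SAME BufferedInputStream, so the bytes that detect()
  // peeked at are still available after its internal reset().
  def firstLine(raw: InputStream): (String, String) = {
    val buffered = new BufferedInputStream(raw)
    val kind = CompressorStreamFactory.detect(buffered)
    val in = new CompressorStreamFactory().createCompressorInputStream(kind, buffered)
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    try (kind, reader.readLine()) finally reader.close()
  }

  def main(args: Array[String]): Unit = {
    val (kind, line) = firstLine(new ByteArrayInputStream(gzip("hello\nworld\n")))
    println(s"$kind: $line")
  }
}
```

In the question's code, the same pattern would mean wrapping hfs.open(path) in one BufferedInputStream and passing that wrapper to both calls. Since FSDataInputStream is seekable, calling is.seek(0) before wrapping it a second time should also work, though that has not been tested here.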
