How to read a GZIP-compressed CSV file stored in Cloud Storage from Apache Beam without extracting it

Time: 2021-07-02 05:32:17

Tags: java google-cloud-platform google-cloud-storage apache-beam apache-beam-io

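GZIP-compressed CSV files are uploaded to my Cloud Storage bucket through a third party. I am using a continuously running Apache Beam streaming pipeline to read the metadata of the compressed files (file name, full path). I also need to read the first and last line of each CSV file. I am using the following code to pick up every compressed file added to the Cloud Storage bucket folder.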

    pipeline.apply("MatchFile(s)", FileIO.match()
            .filepattern(zipFilePath)
            .continuously(Duration.standardMinutes(1), Watch.Growth.never()))
            .apply(Window.<MatchResult.Metadata>into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("Get Compressed File(s)", ParDo.of(new GetCompressedFile()));

    static class GetCompressedFile extends DoFn<MatchResult.Metadata, Void> {
        @ProcessElement
        public void processElement(ProcessContext context) throws ParseException {
            ResourceId inputFile = context.element().resourceId();
            String fileName = inputFile.getFilename();
            String currentDirectoryPath = inputFile.getCurrentDirectory().toString();
            // ... (the rest of the method was not included in the original post)
        }
    }

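I can get the compressed file's name and path, but I cannot read the CSV file without decompressing it first. I tried a few answers I found through Google for reading compressed files, but none of them read from Cloud Storage.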

2 answers:

Answer 0: (score: 2)

I was able to read the compressed file without extracting it by using the following code. Maybe it will help someone.

    pipeline.apply("MatchFile(s)", FileIO.match()
            .filepattern(zipFilePath)
            .continuously(Duration.standardMinutes(1), Watch.Growth.never()))
            .apply(Window.<MatchResult.Metadata>into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply(FileIO.readMatches().withCompression(Compression.GZIP))
            .apply("Read Files",ParDo.of(new ReadFilesGZIP()));

    pipeline.run();
}

    static class ReadFilesGZIP extends DoFn<FileIO.ReadableFile, String> {
        @ProcessElement
        public void processElement(ProcessContext context) throws IOException {
            FileIO.ReadableFile file = context.element();
            // Open the raw file and wrap it in a channel that decompresses on the fly.
            ReadableByteChannel readableByteChannel = file.getCompression()
                    .readDecompressed(FileSystems.open(file.getMetadata().resourceId()));
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(readableByteChannel)))) {
                Stream<String> fileLines = r.lines();
                // ... (processing of fileLines was not included in the original post)
            }
        }
    }
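The question also asks for the first and last line of each CSV file. The following is a minimal sketch of that step using only the JDK, so it can be tried outside a pipeline. The class name `GzipCsvEdges` and both method names are hypothetical, and UTF-8 encoding is an assumption. It streams the decompressed lines and keeps only the first and the most recent one, so the file is never fully held in memory.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Apache Beam.
final class GzipCsvEdges {

    /** Returns {firstLine, lastLine}; both entries are null for an empty stream. */
    static String[] firstAndLastLine(InputStream gzipped) throws IOException {
        // Assumption: the CSV is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String first = reader.readLine();
            String last = first;
            String line;
            // Stream through the file, remembering only the latest line seen.
            while ((line = reader.readLine()) != null) {
                last = line;
            }
            return new String[] {first, last};
        }
    }

    /** Demo helper: gzip a string in memory so the example needs no real file. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("id,name\n1,alice\n2,bob\n");
        String[] edges = firstAndLastLine(new ByteArrayInputStream(data));
        System.out.println(edges[0]); // id,name
        System.out.println(edges[1]); // 2,bob
    }
}
```

Inside a Beam `DoFn`, the channel obtained as in the answer above is already decompressed, so the same loop should work on `Channels.newInputStream(readableByteChannel)` without the extra `GZIPInputStream` wrapper.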

Answer 1: (score: 0)

I don't know whether this is what you are looking for, but you can read the file like this:

    pipeline.apply("MatchFile(s)", FileIO.match()
            .filepattern(zipFilePath)
            .continuously(Duration.standardMinutes(1), Watch.Growth.never()))
            .apply(Window.<MatchResult.Metadata>into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply(FileIO.readMatches().withCompression(Compression.GZIP))
            .apply(ParDo.of(new DoFn<FileIO.ReadableFile, IdLine>() {
                    @ProcessElement
                    public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<IdLine> out) throws IOException {
                        String content = file.readFullyAsUTF8String();
                        System.out.println(content);
                        BufferedReader br = new BufferedReader(Channels.newReader(file.open(), Charset.defaultCharset()));
                        String l;
                        StringBuffer stackLine = new StringBuffer();
                        while ((l = br.readLine()) != null) {
                        ....
                        }
                    }
                }));