FileBasedSource doesn't understand a glob corresponding to several specific files in Google Cloud Storage

Asked: 2017-10-27 14:35:21

Tags: java scala google-cloud-storage google-cloud-dataflow apache-beam

I need to process custom binary files stored in Google Cloud Storage in a Dataflow job.

To do this, I wrote a custom FileBasedSource. As the documentation states, it supports file patterns defined as a Java glob, a single file, or an offset range within a single file.

In my case, I need to use a Java glob with several specific file names, such as /path/{file1,file2,file3}. When I test on the local filesystem it works fine, but if I use it with Google Cloud Storage (gs://bucket/{file1,file2,file3}), no files are found and I get the following stack trace:
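For reference, `{a,b,c}` alternation is part of standard Java glob syntax, which is why the pattern matches on the local filesystem. A minimal sketch using `java.nio.file.PathMatcher` (the paths here are illustrative, not from my job):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // Java NIO globs support {a,b,c} alternation out of the box.
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:/path/{file1,file2,file3}");

        System.out.println(matcher.matches(Paths.get("/path/file2"))); // true
        System.out.println(matcher.matches(Paths.get("/path/file4"))); // false
    }
}
```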

java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:603)
        at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:342)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:217)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:86)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)
        at org.apache.beam.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:207)
        at org.apache.beam.runners.dataflow.internal.CustomSources.serializeToCloudSource(CustomSources.java:78)
        at org.apache.beam.runners.dataflow.ReadTranslator.translateReadHelper(ReadTranslator.java:53)
        at org.apache.beam.runners.dataflow.ReadTranslator.translate(ReadTranslator.java:40)
        at org.apache.beam.runners.dataflow.ReadTranslator.translate(ReadTranslator.java:37)
        at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.visitPrimitiveTransform(DataflowPipelineTranslator.java:439)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:602)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:594)
        at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:276)
        at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:210)
        at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:440)
        at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.translate(DataflowPipelineTranslator.java:383)
        at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:173)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:556)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:167)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
        at com.travelaudience.data.job.rtbtobigquery.Main$.main(Main.scala:74)
        at com.travelaudience.data.job.rtbtobigquery.Main.main(Main.scala)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 400 Bad Request
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:595)
        ... 23 more
Caused by: com.google.api.client.http.HttpResponseException: 400 Bad Request
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
        at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
        at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:588)
        at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:586)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

If I use the exact same glob with gsutil (gsutil ls gs://bucket/{file1,file2,file3}), the 3 files are listed correctly. And from the same code, a glob like gs://bucket/dir/* works.

I'm using Beam version 2.1.0.

Any idea what's wrong here?

Thanks for your help!

1 answer:

Answer 0 (score: 2)

Beam supports only a subset of the glob syntax for matching GCS files. It supports * and ?, but not {}. Our documentation currently doesn't explain this well - it should be documented on FileSystems.match() and linked from the other classes that accept user-specified globs.
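Until that is supported, one workaround is to expand the brace alternation client-side and hand Beam one concrete pattern per file (each expanded pattern then contains only glob characters Beam understands, or none at all). A hypothetical stdlib-only helper sketch; the name `expand` and single-group handling are my own assumptions, not a Beam API:

```java
import java.util.ArrayList;
import java.util.List;

public class BraceExpand {
    // Expand a single {a,b,c} group into one concrete pattern per alternative,
    // e.g. "gs://bucket/{file1,file2}" -> ["gs://bucket/file1", "gs://bucket/file2"].
    static List<String> expand(String pattern) {
        int open = pattern.indexOf('{');
        int close = pattern.indexOf('}', open);
        List<String> out = new ArrayList<>();
        if (open < 0 || close < 0) {        // no brace group: pass the pattern through
            out.add(pattern);
            return out;
        }
        String prefix = pattern.substring(0, open);
        String suffix = pattern.substring(close + 1);
        for (String alt : pattern.substring(open + 1, close).split(",")) {
            out.add(prefix + alt + suffix); // one fully expanded pattern per alternative
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("gs://bucket/{file1,file2,file3}"));
        // -> [gs://bucket/file1, gs://bucket/file2, gs://bucket/file3]
    }
}
```

Each expanded pattern can then be fed to the custom source separately, with the resulting PCollections combined via Flatten in the pipeline.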