I need to process custom binary files stored in Google Cloud Storage in a Dataflow job.
For this I wrote a custom FileBasedSource. As the documentation states, it supports file patterns defined as a Java glob, single files, or offset ranges within a single file.
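Roughly, I hook the source into the pipeline like this (a simplified sketch; MyBinaryFileSource and MyRecord are placeholders standing in for my actual custom source and record classes):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

// MyBinaryFileSource extends FileBasedSource<MyRecord> (placeholder names)
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);
PCollection<MyRecord> records = p.apply("ReadBinaryFiles",
    Read.from(new MyBinaryFileSource("/path/{file1,file2,file3}")));
p.run().waitUntilFinish();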
In my case I need to use a Java glob with a few specific file names, e.g. /path/{file1,file2,file3}. When I test on the local file system this works fine, but when I use it with Google Cloud Storage (gs://bucket/{file1,file2,file3}) it does not find any files and I get the following stack trace:
java.io.IOException: Error executing batch GCS request
at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:603)
at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:342)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:217)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:86)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)
at org.apache.beam.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:207)
at org.apache.beam.runners.dataflow.internal.CustomSources.serializeToCloudSource(CustomSources.java:78)
at org.apache.beam.runners.dataflow.ReadTranslator.translateReadHelper(ReadTranslator.java:53)
at org.apache.beam.runners.dataflow.ReadTranslator.translate(ReadTranslator.java:40)
at org.apache.beam.runners.dataflow.ReadTranslator.translate(ReadTranslator.java:37)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.visitPrimitiveTransform(DataflowPipelineTranslator.java:439)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:602)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:594)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:276)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:210)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:440)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.translate(DataflowPipelineTranslator.java:383)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate(DataflowPipelineTranslator.java:173)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:556)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:167)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
at com.travelaudience.data.job.rtbtobigquery.Main$.main(Main.scala:74)
at com.travelaudience.data.job.rtbtobigquery.Main.main(Main.scala)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 400 Bad Request
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:595)
... 23 more
Caused by: com.google.api.client.http.HttpResponseException: 400 Bad Request
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:588)
at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:586)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
If I use the exact same glob with gsutil:
gsutil ls gs://bucket/{file1,file2,file3}
the 3 files are listed correctly. From the code, a glob like gs://bucket/dir/* does work.
I am using Beam version 2.1.0.
Any idea what is wrong here?
Thanks for your help!
Answer 0 (score: 2)
Beam supports only a subset of the glob syntax when matching GCS files: it supports * and ?, but it does not support {}. Our documentation currently does not explain this well; it should be documented on FileSystems.match() and linked from the other classes that accept user-specified globs.
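Since {} is not supported, one possible workaround is to expand the brace set yourself and flatten the per-file reads. A minimal sketch, assuming p is the Pipeline and MyBinaryFileSource / MyRecord are the custom source and record type from the question:

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Expand the brace set by hand, read each file separately, then flatten the results.
List<String> files = Arrays.asList(
    "gs://bucket/file1", "gs://bucket/file2", "gs://bucket/file3");
PCollectionList<MyRecord> parts = PCollectionList.empty(p);
for (String file : files) {
  parts = parts.and(
      p.apply("Read " + file, Read.from(new MyBinaryFileSource(file))));
}
PCollection<MyRecord> records = parts.apply(Flatten.pCollections());

Each file is still read through the same custom source, so the per-file reading logic does not change.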