I have a simple Dataflow job, used for testing, that runs successfully with apache-beam 2.1.0. The code looks something like this:
public static void main(String[] args) throws Exception {
    DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    dataflowOptions.setProject("MY_PROJECT_ID");
    dataflowOptions.setStagingLocation("gs://MY_STAGING_LOC");
    dataflowOptions.setTempLocation("gs://MY_TEMP_LOC");
    dataflowOptions.setFilesToStage(Collections.singletonList("MY_LOCAL_JAR_FILE.jar"));
    dataflowOptions.setRunner(DataflowRunner.class);
    dataflowOptions.setNetwork("SOME_NETWORK");
    dataflowOptions.setSubnetwork("regions/SOME_REGION/subnetworks/SOME_SUBNETWORK");
    dataflowOptions.setZone("SOME_ZONE");

    Pipeline p = Pipeline.create(dataflowOptions);
    List<String> LINES = Arrays.asList("foobar");
    p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());
    p.run().waitUntilFinish();
}
However, after migrating to apache-beam 2.4.0, I get the following error immediately when I try to submit the Dataflow job from the CLI.
Exception in thread "main" java.lang.RuntimeException: Error while staging packages
at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:396)
at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:273)
at org.apache.beam.runners.dataflow.util.GcsStager.stageFiles(GcsStager.java:76)
at org.apache.beam.runners.dataflow.util.GcsStager.stageDefaultFiles(GcsStager.java:64)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:661)
at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:174)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at com.company.app.App.main(App.java:48)
Caused by: java.io.IOException: Error executing batch GCS request
at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:607)
at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:339)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:216)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:85)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:123)
at org.apache.beam.sdk.io.FileSystems.matchSingleFileSpec(FileSystems.java:188)
at org.apache.beam.runners.dataflow.util.PackageUtil.alreadyStaged(PackageUtil.java:160)
at org.apache.beam.runners.dataflow.util.PackageUtil.stagePackageSynchronously(PackageUtil.java:184)
at org.apache.beam.runners.dataflow.util.PackageUtil.lambda$stagePackage$1(PackageUtil.java:174)
at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:101)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
...
I have not changed any configuration settings.
Debugging the code further, the failing request appears to go to https://www.googleapis.com/null
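One plausible way a URL ending in /null can come about is plain string concatenation with a path that was never set. The sketch below only illustrates that Java behavior; the class BatchUrlSketch and the helper buildBatchUrl are hypothetical names, and it is an assumption, not something confirmed from the library source, that the client builds its batch URL this way.

```java
class BatchUrlSketch {
    // Hypothetical helper: joins a root URL and a batch path by simple concatenation.
    // In Java, concatenating a null String reference yields the literal text "null".
    public static String buildBatchUrl(String rootUrl, String batchPath) {
        return rootUrl + batchPath;
    }

    public static void main(String[] args) {
        // With a path set, the URL is well formed.
        System.out.println(buildBatchUrl("https://www.googleapis.com/", "batch"));
        // With the path left unset (null), the URL ends in "null".
        System.out.println(buildBatchUrl("https://www.googleapis.com/", null));
    }
}
```

If that is what happens here, the 404 would be the expected response, since no endpoint exists at that path.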
Answer 0 (score: 2)
It looks like this is a bug that was fixed on the dev branch on February 13. Hopefully the fix will be released soon:
Original issue: https://github.com/google/google-api-java-client/issues/1073
Flawed fix: https://github.com/google/google-api-java-client/pull/1087
Corrected fix: https://github.com/google/google-api-java-client/pull/1096
Answer 1 (score: 0)
You are running into this issue: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/607
To fix it, force google-api-client to version 1.22.0. With Gradle, add the following:
compile (group: 'com.google.api-client', name: 'google-api-client', version: '1.22.0') {
    force = true
}
Or with Maven:
<dependency>
    <groupId>com.google.api-client</groupId>
    <artifactId>google-api-client</artifactId>
    <version>[1.22.0]</version>
</dependency>
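After pinning the version, it can help to confirm which jar the client classes are actually loaded from at runtime. The following is a minimal sketch using only the standard library; ClasspathOrigin and locate are hypothetical names, and com.google.api.client.googleapis.batch.BatchRequest is used as a representative class from google-api-client.

```java
import java.net.URL;
import java.security.CodeSource;
import java.util.Optional;

class ClasspathOrigin {
    // Returns the location (usually a jar URL) a class was loaded from,
    // or empty if the class is missing or has no recorded code source.
    public static Optional<URL> locate(String className) {
        try {
            // Load without running static initializers.
            Class<?> cls = Class.forName(className, false, ClasspathOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null ? Optional.empty() : Optional.ofNullable(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String name = "com.google.api.client.googleapis.batch.BatchRequest";
        System.out.println(locate(name).map(URL::toString).orElse("not found on the classpath"));
    }
}
```

Run with the same classpath as the job, the printed jar name should contain 1.22.0 if the override took effect.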