DataFlow Runner fails after upgrading to Beam 2.4.0

Asked: 2018-03-29 23:16:36

Tags: google-cloud-dataflow

I have a simple Dataflow job, used for testing, that ran successfully with apache-beam 2.1.0. The code looks roughly like this:

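import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class App {
    public static void main(String[] args) throws Exception {
        // Configure the Dataflow runner programmatically instead of via CLI flags.
        DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        dataflowOptions.setProject("MY_PROJECT_ID");
        dataflowOptions.setStagingLocation("gs://MY_STAGING_LOC");
        dataflowOptions.setTempLocation("gs://MY_TEMP_LOC");
        dataflowOptions.setFilesToStage(Collections.singletonList("MY_LOCAL_JAR_FILE.jar"));
        dataflowOptions.setRunner(DataflowRunner.class);
        dataflowOptions.setNetwork("SOME_NETWORK");
        dataflowOptions.setSubnetwork("regions/SOME_REGION/subnetworks/SOME_SUBNETWORK");
        dataflowOptions.setZone("SOME_ZONE");

        Pipeline p = Pipeline.create(dataflowOptions);

        // A single in-memory element is enough to exercise job submission.
        List<String> lines = Arrays.asList("foobar");
        p.apply(Create.of(lines)).setCoder(StringUtf8Coder.of());

        p.run().waitUntilFinish();
    }
}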

However, after migrating to apache-beam 2.4.0, I get the following error immediately when I try to submit the Dataflow job from the CLI.

Exception in thread "main" java.lang.RuntimeException: Error while staging packages
        at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:396)
        at org.apache.beam.runners.dataflow.util.PackageUtil.stageClasspathElements(PackageUtil.java:273)
        at org.apache.beam.runners.dataflow.util.GcsStager.stageFiles(GcsStager.java:76)
        at org.apache.beam.runners.dataflow.util.GcsStager.stageDefaultFiles(GcsStager.java:64)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:661)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:174)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
        at com.company.app.App.main(App.java:48)
Caused by: java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:607)
        at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:339)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:216)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:85)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:123)
        at org.apache.beam.sdk.io.FileSystems.matchSingleFileSpec(FileSystems.java:188)
        at org.apache.beam.runners.dataflow.util.PackageUtil.alreadyStaged(PackageUtil.java:160)
        at org.apache.beam.runners.dataflow.util.PackageUtil.stagePackageSynchronously(PackageUtil.java:184)
        at org.apache.beam.runners.dataflow.util.PackageUtil.lambda$stagePackage$1(PackageUtil.java:174)
        at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:101)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
...

I have not changed any configuration settings.

Debugging further, the request that fails is a POST to https://www.googleapis.com/null
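The trailing "null" in that URL is the kind of result Java produces when a null string is concatenated onto a root URL. The sketch below only illustrates that mechanism. It is an assumption that the failing batch URL was built this way, and `buildBatchUrl` and its parameters are hypothetical names, not part of Beam or google-api-client.

```java
public class NullBatchPathDemo {
    // Hypothetical helper: joins a root URL and a batch path with plain
    // string concatenation and no null check.
    public static String buildBatchUrl(String rootUrl, String batchPath) {
        return rootUrl + batchPath;
    }

    public static void main(String[] args) {
        // A batch path that was set: the URL is well formed.
        System.out.println(buildBatchUrl("https://www.googleapis.com/", "batch"));
        // A batch path that was never set: Java appends the literal text "null".
        System.out.println(buildBatchUrl("https://www.googleapis.com/", null));
    }
}
```

If this is what happens, the 404 would follow naturally, since no endpoint exists at that path.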

2 answers:

Answer 0 (score: 2)

This looks like a bug that was fixed in the dev branch on February 13. Hopefully the fix will be released soon:

Original issue: https://github.com/google/google-api-java-client/issues/1073

Flawed fix: https://github.com/google/google-api-java-client/pull/1087

Corrected fix: https://github.com/google/google-api-java-client/pull/1096

Answer 1 (score: 0)

You are running into this issue: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/607

To fix it, force the google-api-client version to 1.22.0. With Gradle, add the following:

compile (group: 'com.google.api-client', name: 'google-api-client', version: '1.22.0') {
    force = true
}

Or with Maven:

<dependency>
  <groupId>com.google.api-client</groupId>
  <artifactId>google-api-client</artifactId>
  <version>[1.22.0]</version>
</dependency>
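
After pinning the version, it can help to confirm which jar the client classes are actually loaded from, since a transitive dependency may still pull in a different one. The following is a minimal sketch using only the JDK. The `WhichJar` class and its `locate` method are hypothetical names introduced here, and `com.google.api.client.googleapis.batch.BatchRequest` is used only as an example of a class shipped in google-api-client.

```java
import java.security.CodeSource;

public class WhichJar {
    // Returns the location (usually a jar path) a class was loaded from,
    // or a short explanation when that cannot be determined.
    public static String locate(String className) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(no code source, likely a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // The jar file name in the printed path normally includes the version.
        System.out.println(locate("com.google.api.client.googleapis.batch.BatchRequest"));
    }
}
```

Run it with the same classpath as the pipeline. If the printed path does not end in a 1.22.0 jar, the forced version is probably not taking effect.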