How do you write a Google Cloud Dataflow transform name mapping?

Asked: 2018-04-06 21:15:36

Tags: google-cloud-platform google-cloud-dataflow

I'm upgrading a Google Cloud Dataflow job from Dataflow Java SDK 1.8 to version 2.4, and then trying to update its existing running Dataflow job on Google Cloud using the --update and --transformNameMapping arguments, but I can't figure out how to write the transformNameMapping correctly so that the upgrade succeeds and passes the compatibility check.

My code fails the compatibility check with the error: Workflow failed. Causes: The new job is not compatible with 2018-04-06_13_48_04-12999941762965935736. The original job has not been aborted., The new job is missing steps BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey, PubsubIO.readStrings. If these steps have been renamed or deleted, please specify them with the update command.

The Dataflow transform names of the existing, currently running job are:

  1. PubsubIO.Read

  2. ParDo(ExtractJsonPath) - a custom function we wrote

  3. ParDo(AddMetadata) - another custom function we wrote

  4. BigQueryIO.Write

In my new code using the 2.4 SDK, I changed the 1st and 4th transforms/functions, because some libraries were renamed and some of the old SDK's functions were deprecated in the new version.

You can see the specific transform code below:

The 1.8 SDK version:

     PCollection<String> streamData =
         pipeline
             .apply(PubsubIO.Read
                 .timestampLabel(PUBSUB_TIMESTAMP_LABEL_KEY)
                 //.subscription(options.getPubsubSubscription())
                 .topic(options.getPubsubTopic()));

     streamData
         .apply(ParDo.of(new ExtractJsonPathFn(pathInfos)))
         .apply(ParDo.of(new AddMetadataFn()))
         .apply(BigQueryIO.Write
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
             .to(tableRef));

My rewritten 2.4 SDK version:

     PCollection<String> streamData =
         pipeline
             .apply("PubsubIO.readStrings", PubsubIO.readStrings()
                 .withTimestampAttribute(PUBSUB_TIMESTAMP_LABEL_KEY)
                 //.subscription(options.getPubsubSubscription())
                 .fromTopic(options.getPubsubTopic()));

     streamData
         .apply(ParDo.of(new ExtractJsonPathFn(pathInfos)))
         .apply(ParDo.of(new AddMetadataFn()))
         .apply("BigQueryIO.writeTableRows", BigQueryIO.writeTableRows()
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
             .to(tableRef));
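
(A side note on naming: the two ParDo steps above still use auto-generated names in both versions. The same apply overload can pin them to explicit, stable names for future updates. A minimal sketch - the step names "ExtractJsonPath" and "AddMetadata" are just names I picked, not anything required by the SDK, and renaming them during this update would itself require additional transformNameMapping entries:)

     streamData
         // Hypothetical explicit step names -- any stable string works:
         .apply("ExtractJsonPath", ParDo.of(new ExtractJsonPathFn(pathInfos)))
         .apply("AddMetadata", ParDo.of(new AddMetadataFn()))
         .apply("BigQueryIO.writeTableRows", BigQueryIO.writeTableRows()
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
             .to(tableRef));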

So it seems to me like PubsubIO.Read should map to PubsubIO.readStrings, and BigQueryIO.Write should map to BigQueryIO.writeTableRows. But I may be misunderstanding how this works.

I've been trying all sorts of things. Since the two transforms I failed to remap weren't explicitly named before, I gave them defined names by updating my applies to .apply("PubsubIO.readStrings", ...) and .apply("BigQueryIO.writeTableRows", ...), and then set my transformNameMapping argument to:

    --transformNameMapping={\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
    

or to:

     --transformNameMapping={\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
    

I even tried remapping all the internal transforms inside the composite transform:

    --transformNameMapping={\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup\":\"BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup\",\"BigQueryIO.Write\":\"BigQueryIO.writeTableRows\",\"PubsubIO.Read\":\"PubsubIO.readStrings\"}
    

But no matter what, I seem to get the same error:

    The new job is missing steps BigQueryIO.writeTableRows/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey, PubsubIO.readStrings.
    

I'm wondering if I'm doing something seriously wrong here. Is there anyone who has written a transform mapping before and is willing to share the format they used? I can't find any examples online at all apart from Google's main documentation on updating Dataflow jobs, which doesn't really cover anything beyond the simplest case --transformNameMapping={"oldTransform1":"newTransform1","oldTransform2":"newTransform2",...} and doesn't make the example very concrete.
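
For what it's worth, the --update and --transformNameMapping flags correspond to options on the Dataflow runner's options interface, so the same thing can also be expressed in code. A minimal sketch, assuming the setUpdate and setTransformNameMapping accessors that I believe exist on Beam 2.4's DataflowPipelineOptions (treat those names as my understanding of the API rather than something from the docs):

     import java.util.HashMap;
     import java.util.Map;
     import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
     import org.apache.beam.sdk.options.PipelineOptionsFactory;

     public class UpdateOptionsSketch {
       public static void main(String[] args) {
         // Parses --update, --transformNameMapping, etc. from the command line...
         DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
             .withValidation()
             .as(DataflowPipelineOptions.class);

         // ...or set them explicitly; this mirrors my first attempt above.
         options.setUpdate(true); // equivalent of passing --update
         Map<String, String> mapping = new HashMap<>();
         mapping.put("PubsubIO.Read", "PubsubIO.readStrings");
         mapping.put("BigQueryIO.Write", "BigQueryIO.writeTableRows");
         options.setTransformNameMapping(mapping);

         // Build the pipeline from these options and run() as usual.
       }
     }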

1 Answer:

Answer 0 (score: 1)

It turns out there was additional information in the logs on the Google Cloud web console's Dataflow job details page that I was missing. I needed to adjust the log level from "info" to show "any log level", and then I found several step-fusion messages such as these (though there were many more):

 2018-04-16 (13:56:28) Mapping original step BigQueryIO.Write/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey to write/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey in the new graph.
 2018-04-16 (13:56:28) Mapping original step PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource in the new graph.

Instead of trying to map PubsubIO.Read to PubsubIO.readStrings, I needed to map it to the step mentioned in that additional logging. In this case I got past my errors by mapping PubsubIO.Read to PubsubIO.Read/PubsubUnboundedSource, and BigQueryIO.Write/BigQueryIO.StreamWithDeDup to BigQueryIO.Write/StreamingInserts/StreamingWriteTables. So try mapping your old steps to the steps mentioned in the full logs that precede the job-failure message.
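
Putting those two mappings together, the argument would look roughly like this (a sketch I've reconstructed from the mappings above, with the same shell escaping as my earlier attempts):

     --transformNameMapping={\"PubsubIO.Read\":\"PubsubIO.Read/PubsubUnboundedSource\",\"BigQueryIO.Write/BigQueryIO.StreamWithDeDup\":\"BigQueryIO.Write/StreamingInserts/StreamingWriteTables\"}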

Unfortunately I'm still not past the failing compatibility check, because the coders changed between the old code and the new code, but my missing step errors are solved.