How do I flatten a windowed collection in Apache Beam? [Cloud Dataflow]

Date: 2018-08-20 10:33:42

Tags: google-cloud-platform google-cloud-datastore google-cloud-dataflow google-cloud-pubsub

I am trying to stream data from Pub/Sub to Datastore using Dataflow. I looked through the templates that Google provides. https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/src/main/java/com/google/cloud/teleport/templates

I noticed that PubsubToDatastore does not work, so I tried to debug it. https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToDatastore.java

Here is what I did.

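  • Added an errorTag
  • Added windowing (Pub/Sub produces unbounded data, and Datastore cannot accept unbounded data)
  • Added a Flatten (there is no method for writing windowed data to Datastore, so I assumed the data had to be un-windowed)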

Here is my code.

    package com.google.cloud.teleport.templates;

    import com.google.cloud.teleport.templates.common.DatastoreConverters.DatastoreWriteOptions;
    import com.google.cloud.teleport.templates.common.DatastoreConverters.WriteJsonEntities;
    import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.JavascriptTextTransformerOptions;
    import com.google.cloud.teleport.templates.common.JavascriptTextTransformer.TransformTextViaJavascript;
    import com.google.cloud.teleport.templates.common.PubsubConverters.PubsubReadOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // added for errorTag
    import com.google.cloud.teleport.templates.common.ErrorConverters.ErrorWriteOptions;
    import com.google.cloud.teleport.templates.common.ErrorConverters.LogErrors;
    import org.apache.beam.sdk.values.TupleTag;

    // added for window
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;
    import org.apache.beam.sdk.values.PCollectionTuple;

    import org.joda.time.Duration;

    public class PubsubToDatastore {
      interface PubsubToDatastoreOptions extends
          PipelineOptions,
          PubsubReadOptions,
          JavascriptTextTransformerOptions,
          DatastoreWriteOptions,
          ErrorWriteOptions {} // added

      public static void main(String[] args) {
        PubsubToDatastoreOptions options = PipelineOptionsFactory
            .fromArgs(args)
            .withValidation()
            .as(PubsubToDatastoreOptions.class);

        TupleTag<String> errorTag = new TupleTag<String>("errors"){};

        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("Read Pubsub Events", PubsubIO.readStrings().fromTopic(options.getPubsubReadTopic()))
            .apply("Windowing", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply("Flatten", Flatten.pCollections())
            .apply("Transform text to json", TransformTextViaJavascript.newBuilder()
                .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                .setFunctionName(options.getJavascriptTextTransformFunctionName())
                .build())
            .apply(WriteJsonEntities.newBuilder()
                .setProjectId(options.getDatastoreWriteProjectId())
                .setErrorTag(errorTag)
                .build())
            .apply(LogErrors.newBuilder()
                .setErrorWritePath(options.getErrorWritePath())
                .setErrorTag(errorTag)
                .build());

        pipeline.run();
      }
    } 

When I ran this code, I got the following error.

    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 11.054 s
    [INFO] Finished at: 2018-08-20T17:55:49+09:00
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.2:compile (default-compile) on project google-cloud-teleport-java: Compilation failure
    [ERROR] /Users/shinya.yaginuma/work/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/PubsubToDatastore.java:[80,9] can not find an appropriate method for apply(java.lang.String,org.apache.beam.sdk.transforms.Flatten.PCollections<java.lang.Object>)
    [ERROR]     method org.apache.beam.sdk.values.PCollection.<OutputT>apply(org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<java.lang.String>,OutputT>) can't use
    [ERROR]       (Unable to infer the type variable OutputT
    [ERROR]         (The actual argument list and dummy argument list have different lengths))
    [ERROR]     method org.apache.beam.sdk.values.PCollection.<OutputT>apply(java.lang.String,org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<java.lang.String>,OutputT>) can't use
    [ERROR]       (Since there is no instance of type variable T, org.apache.beam.sdk.transforms.Flatten.PCollections is not fit for  org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<java.lang.String>,OutputT>)

What should I do next? Please give me some advice. Regards.

1 answer:

Answer 0 (score: 2)

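I'm not sure why you want to flatten the collection after windowing. My guess is that the Flatten operation does not do what you think it does.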

Here is what its documentation says:

"Returns a {@link PTransform} that flattens a {@link PCollectionList} into a {@link PCollection} containing all the elements of all the {@link PCollection}s in its input."

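Flatten takes multiple PCollections bundled into a PCollectionList and returns a single PCollection containing all the elements of all the input PCollections. The name "Flatten" suggests taking a list of lists and flattening it into one list.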
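To make the "list of lists" analogy concrete, here is a small plain-Java sketch. It does not use Beam at all; the class name FlattenAnalogy and the method flatten are made up for illustration. One difference to keep in mind: a Java list is ordered, whereas a PCollection makes no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenAnalogy {
  // Hypothetical helper: copies the elements of every inner list into one list.
  public static <T> List<T> flatten(List<List<T>> lists) {
    List<T> out = new ArrayList<>();
    for (List<T> inner : lists) {
      out.addAll(inner);
    }
    return out;
  }

  public static void main(String[] args) {
    // Two "sources", analogous to two PCollections held in a PCollectionList.
    List<List<String>> sources = Arrays.asList(
        Arrays.asList("a1", "a2"),
        Arrays.asList("b1"));
    System.out.println(flatten(sources)); // prints [a1, a2, b1]
  }
}
```

With only one inner list, flatten returns the same elements it was given, which mirrors why the operation adds nothing when there is a single input collection.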

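For example, if you have several PCollections coming from different sources and want to "flatten" them into a single PCollection, Flatten is the tool for the job. In your case you have only one PCollection (there is no PCollectionList, that is, no list of PCollections), so the Flatten operation gains you nothing. Your first step gives you an unbounded PCollection<String> from PubsubIO.readStrings(), and Window.into(...) then assigns the elements of that unbounded PCollection<String> to bounded, fixed-size windows, which yields a windowed PCollection<String>.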
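For contrast, here is a minimal sketch of the situation Flatten is meant for: two separate PCollections merged into one. The class name FlattenSketch, the method mergeSources, and the in-memory Create.of(...) inputs are illustrative assumptions standing in for real sources. Running it assumes the Beam Java SDK and a runner such as the DirectRunner are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class FlattenSketch {
  // Hypothetical helper: builds two small in-memory PCollections and merges them.
  public static PCollection<String> mergeSources(Pipeline pipeline) {
    PCollection<String> sourceA = pipeline.apply("SourceA", Create.of("a1", "a2"));
    PCollection<String> sourceB = pipeline.apply("SourceB", Create.of("b1"));

    // Flatten.pCollections() is applied to a PCollectionList, not to a single PCollection.
    return PCollectionList.of(sourceA).and(sourceB)
        .apply("Flatten", Flatten.pCollections());
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    mergeSources(pipeline);
    pipeline.run().waitUntilFinish();
  }
}
```

Applying the same transform directly to a single PCollection<String>, as the code in the question does, fails to compile because apply expects a transform whose input type is PCollection<String>, which matches the compiler error shown in the question.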

I suggest you simply remove the .apply("Flatten", Flatten.pCollections()) line and run the pipeline again. Otherwise it looks fine.