Dataflow job does not scale when joining bounded and unbounded sources

Asked: 2019-03-19 13:54:33

Tags: google-cloud-platform google-cloud-dataflow apache-beam apache-beam-io

I have recently started working with Apache Beam and Google Cloud Dataflow to develop a big data processing pipeline. I intend to build my processing pipeline around Beam's internal stateful processing model.

Here is what I want to achieve:

  • Read data from an unbounded (streaming) source; on GCP this is PubSub. Transform the data into KV.
  • Read data from a bounded source; on GCP, in my case, this is GCS. Transform the data into KV.
  • Since one of the sources is unbounded, I have to perform windowing and triggering. I chose to use a global window, because I want to join data arriving from the unbounded source to data from the bounded source when it already exists in the internal state management.
  • Join the data from PubSub and GCS by a common key.
  • Add the data to Beam's internal state using Beam's BagState (StateId and StateSpec). Emit the Iterable of the data added to the BagState. Flatten the Iterable, and write the emitted PCollection to GCS.
  • Perform some ParDo(function) on the Iterable of data. Write the emitted PCollection to GCS.

Here is a sample code snippet:

public static PipelineResult run(BeamStateFullProcessingPoC.Options options) {

    // User - Think of it as Java POJO / Avro record / Protobuf message.

    // create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    /**
    *   Step - 1
    *   Read data from non-bounded (streaming) source, for GCP this would be PubSub.  
    *   Transform the data into KV<String, Object>
    */
    final PCollection<PubsubMessage> pubsubEvents = ... 
    final PCollection<KV<String, User>> pubSubUserByUserId = ...

    /**
    *   Step - 2
    *   Read data from bounded source, for GCP my case this would be GCS. 
    *   Transform the data into KV<String, Object>
    */
    final PCollection<User> users = ...
    final PCollection<KV<String, User>> gcsUserByUserId = ...

    List<PCollection<KV<String, User>>> pCollectionList = new ArrayList<>();
    pCollectionList.add(pubSubUserByUserId);
    pCollectionList.add(gcsUserByUserId);

    PCollection<KV<String, User>> allKVData = PCollectionList
                  .of(pCollectionList)
                  .apply("flatten KV ID with User", Flatten.pCollections());


    /**
    *   Step - 3 
    *   Perform Window + Triggering and GroupByKey
    *   As one of the Source is streaming, we need to do Window and trigger, before grouping by key
    */
    final PCollection<KV<String, Iterable<User>>> iterableUserByIdKV = allKVData
            .apply("batch data by window + trigger",
                    Window.<KV<String, User>> into(new GlobalWindows())
                    .triggering(AfterProcessingTime.pastFirstElementInPane())
                    .discardingFiredPanes())
    .apply("GroupByKey per User", GroupByKey.create());

    /**
    *   Step - 4
    *   Add User to Beam's internal state, using Beam's BagState (StateId and StateSpec)
    *   Emit the Iterable<User> added to BagState
    *   Flatten Iterable, and write the emitted PCollection to GCS
    */    
    final PCollection<Iterable<User>> iterableUser = iterableUserByIdKV
            .apply("User added to State by Key", ParDo.of(new CreateInternalStateDoFn()));

    final PCollection<User> userAddedToState = iterableUser
            .apply("flatten userAddedToState", Flatten.iterables());
    userAddedToState.apply("write userAddedToState", AvroIO.write(User.class)
            .to(options.getOutputDirectoryForUserState())
            .withSuffix(".avro")
            .withWindowedWrites()
            .withNumShards(options.getNumShards()));


    /**
    *   Step - 5
    *   Perform some function via ParDo on Iterable<User> 
    *   Write emitted data to GCS
    */
    final PCollection<User> changeGenderUser = iterableUser
            .apply("DetectChangeGenderDoFn", ParDo.of(new DetectChangeGenderDoFn()));

    changeGenderUser.apply("write change gender", AvroIO.write(User.class)
            .to(options.getOutputDirectoryForChangeGender())
            .withSuffix(".avro")
            .withWindowedWrites()
            .withNumShards(options.getNumShards()));

    return pipeline.run();
}
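
CreateInternalStateDoFn is not shown above; roughly, it is a stateful DoFn of the following shape (a simplified sketch, not the exact class; using AvroCoder for the bag is an assumption based on User being an Avro-style record):

import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative sketch: buffers Users per key in a BagState and emits the accumulated bag.
static class CreateInternalStateDoFnSketch extends DoFn<KV<String, Iterable<User>>, Iterable<User>> {

    @StateId("userBag")
    private final StateSpec<BagState<User>> userBagSpec = StateSpecs.bag(AvroCoder.of(User.class));

    @ProcessElement
    public void processElement(ProcessContext context,
                               @StateId("userBag") BagState<User> userBag) {
        // Add every incoming user for this key to the bag.
        for (User user : context.element().getValue()) {
            userBag.add(user);
        }
        // Emit everything accumulated so far for this key.
        context.output(userBag.read());
    }
}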

Below is the JSON payload used to create the Dataflow template job:

{
  "jobName": "poc-beam-state-management",
  "parameters": {
    "personSubscription": "projects/<project-name>/subscriptions/<subscription-name>",
    "locationForUser": "gs://<bucket>/<user-folder>/*.avro",
    "outputDirectoryForChangeGender": "gs://<bucket>/<folder>/",
    "outputDirectoryForUserState": "gs://<bucket>/<folder>/",
    "avroTempDirectory": "gs://<bucket>/<folder>/",
    "numShards": "5",
    "autoscalingAlgorithm": "THROUGHPUT_BASED",
    "numWorkers": "3",
    "maxNumWorkers": "18"    
  },
  "environment": {
    "subnetwork": "<some-subnet>",
    "zone": "<some-zone>",
    "serviceAccountEmail": "<some-service-account>",
  },
  "gcsPath": "gs://<bucket>/<folder>/templates/<TemplateName>"
}
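
The Options interface (BeamStateFullProcessingPoC.Options) is not shown either; a sketch of how the custom JSON parameters above map onto it (an assumption about its shape, using ValueProvider so values can be supplied at template launch time):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.ValueProvider;

// Illustrative sketch of the custom pipeline options behind the JSON "parameters" block above.
public interface Options extends DataflowPipelineOptions {

    @Description("PubSub subscription to read users from")
    ValueProvider<String> getPersonSubscription();
    void setPersonSubscription(ValueProvider<String> value);

    @Description("GCS location of the bounded Avro user files")
    ValueProvider<String> getLocationForUser();
    void setLocationForUser(ValueProvider<String> value);

    @Description("GCS output directory for the change-gender output")
    ValueProvider<String> getOutputDirectoryForChangeGender();
    void setOutputDirectoryForChangeGender(ValueProvider<String> value);

    @Description("GCS output directory for the user-state output")
    ValueProvider<String> getOutputDirectoryForUserState();
    void setOutputDirectoryForUserState(ValueProvider<String> value);

    @Description("GCS directory for Avro temp files")
    ValueProvider<String> getAvroTempDirectory();
    void setAvroTempDirectory(ValueProvider<String> value);

    @Description("Number of output shards")
    @Default.Integer(5)
    Integer getNumShards();
    void setNumShards(Integer value);
}

numWorkers, maxNumWorkers, and autoscalingAlgorithm are standard Dataflow worker-pool options inherited via DataflowPipelineOptions, so they need no custom getters here.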

When my Dataflow job starts, it performs the work on only 1 worker node.
I assumed that Google Cloud Platform Dataflow would automatically scale the work out across worker nodes as needed.

Question: How do I get Dataflow to auto-scale and leverage GCP's capabilities so the work is performed in a distributed fashion?

[Image: Dataflow Job DAG (sub-part 1)] [Image: Dataflow Job DAG (sub-part 2)]

1 answer:

Answer 0 (score: 0)

There are a few flaws in the basic assumptions in the list you provided above. You state: "Since one of the sources is unbounded ... I chose to use a global window, because I want to join data arriving from the unbounded source to data from the bounded source ..."

In Beam, you cannot apply a global window to an infinite stream, because you cannot fit the stream into memory. You need to put the data into fixed windows, which you can read about here. Because of the global window, the job will never be able to complete.
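
As a sketch, the windowing step from your pipeline could be rewritten along these lines (the 5-minute window size, the trigger, and the 10-minute allowed lateness are placeholder choices, not tuned recommendations):

import org.joda.time.Duration;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;

// Sketch: use fixed windows instead of a global window so panes can close
// and the pipeline can make progress. allKVData and User come from the question's code.
PCollection<KV<String, User>> windowedKVData = allKVData
        .apply("window into fixed windows",
                Window.<KV<String, User>>into(FixedWindows.of(Duration.standardMinutes(5)))
                        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                        .withAllowedLateness(Duration.standardMinutes(10))
                        .discardingFiredPanes());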

Second, if you are running a Dataflow streaming job, Google defaults the job to autoscalingAlgorithm=NONE, which prevents the machines from auto-scaling. You will want to specify autoscalingAlgorithm=THROUGHPUT_BASED instead. You can find the details explained here.
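
As a sketch, one way to set this programmatically via the Dataflow runner's worker-pool options (the value 18 simply mirrors your maxNumWorkers above):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Sketch: enable throughput-based autoscaling on the Dataflow pipeline options.
// Equivalent command-line flags: --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=18
DataflowPipelineOptions dataflowOptions = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
dataflowOptions.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
dataflowOptions.setMaxNumWorkers(18);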

Hope this gives you the answer you are looking for.