How do I output a single value from a DoFn and use it as a parameter in another DoFn?

Asked: 2018-08-17 03:20:58

Tags: google-cloud-dataflow apache-beam

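I have a Pub/Sub subscription carrying newline-delimited JSON. Each Pub/Sub message has an attribute whose value is the name of the BigQuery table to write to.

How do I get the table name for each message and pass it to a new pipeline?

Is it possible to create a new PCollection from inside the DoFn itself and apply it?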

1 Answer:

Answer 0 (score: 2)

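You can apply a transform that retrieves the table name inside a DoFn and passes a KV<tableName, record> downstream, then use the dynamic destinations support in BigQueryIO to route each record to the correct destination. Alternatively, you can read the table attribute directly in the functions you pass to BigQueryIO's write transform. The example below does this in the function passed to to(...), while withFormatFunction(...) converts the payload into a row.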
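The first approach above, keying each record by its table name and then routing by that key, can be illustrated without any Beam classes. The following is only a plain-Java sketch of the idea: `Message`, `keyByTable`, and `route` are hypothetical names invented for illustration, with `Message` standing in for a Pub/Sub message and `route` standing in for the dynamic-destination write.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of "key each record by its table name, then route by key". */
class KeyByTableSketch {

  /** Hypothetical stand-in for a Pub/Sub message: a JSON payload plus string attributes. */
  static final class Message {
    final String payload;
    final Map<String, String> attributes;

    Message(String payload, Map<String, String> attributes) {
      this.payload = payload;
      this.attributes = attributes;
    }
  }

  /** Per-element step: pairs the payload with the table name read from the attribute. */
  static Map.Entry<String, String> keyByTable(Message message, String tableNameAttr) {
    String table = message.attributes.get(tableNameAttr);
    if (table == null) {
      throw new IllegalArgumentException("Message has no attribute named " + tableNameAttr);
    }
    return new SimpleImmutableEntry<>(table, message.payload);
  }

  /** Routing step: groups payloads by table name, preserving first-seen table order. */
  static Map<String, List<String>> route(List<Message> messages, String tableNameAttr) {
    Map<String, List<String>> byTable = new LinkedHashMap<>();
    for (Message message : messages) {
      Map.Entry<String, String> keyed = keyByTable(message, tableNameAttr);
      byTable.computeIfAbsent(keyed.getKey(), k -> new ArrayList<>()).add(keyed.getValue());
    }
    return byTable;
  }

  public static void main(String[] args) {
    List<Message> messages = List.of(
        new Message("{\"id\":1}", Map.of("table", "orders")),
        new Message("{\"id\":2}", Map.of("table", "users")),
        new Message("{\"id\":3}", Map.of("table", "orders")));
    // Prints each table name with the payloads routed to it.
    System.out.println(route(messages, "table"));
  }
}
```

In an actual pipeline, the `keyByTable` step would presumably live in a DoFn that emits `KV.of(tableName, payload)`, and BigQueryIO's dynamic destinations would take the place of `route`.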

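In the example pipeline, JSON messages are consumed from Pub/Sub and then routed to the appropriate table destination based on a Pub/Sub message attribute. Similarly, you can change the getTableDestination(..) logic to retrieve the table name from the JSON record instead.

You can view the full example here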

  /**
   * Runs the pipeline with the specified options. This method does not wait until the pipeline is
   * finished before returning. Invoke {@code result.waitUntilFinish()} on the result object to
   * block until the pipeline is finished running if blocking programmatic execution is required.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  public static PipelineResult run(Options options) {

    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Copy option values into local Strings so the lambda below captures them
    // rather than the (non-serializable) options object
    String tableNameAttr = options.getTableNameAttr();
    String outputTableProject = options.getOutputTableProject();
    String outputTableDataset = options.getOutputTableDataset();

    // Build & execute pipeline
    pipeline
        .apply(
            "ReadMessages",
            PubsubIO.readMessagesWithAttributes().fromSubscription(options.getSubscription()))
        .apply(
            "WriteToBigQuery",
            BigQueryIO.<PubsubMessage>write()
                .to(
                    input ->
                        getTableDestination(
                            input,
                            tableNameAttr,
                            outputTableProject,
                            outputTableDataset))
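                .withFormatFunction(
                    // Decode the payload explicitly as UTF-8 (java.nio.charset.StandardCharsets)
                    // rather than relying on the platform default charset.
                    (PubsubMessage msg) ->
                        convertJsonToTableRow(
                            new String(msg.getPayload(), StandardCharsets.UTF_8)))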
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    return pipeline.run();
  }

  /**
   * Retrieves the {@link TableDestination} for the {@link PubsubMessage} by extracting and
   * formatting the value of the {@code tableNameAttr} attribute. If the message is null, a {@link
   * RuntimeException} will be thrown because the message is unable to be routed.
   *
   * @param value The message to extract the table name from.
   * @param tableNameAttr The name of the attribute within the message which contains the table
   *     name.
   * @param outputProject The project in which the table resides.
   * @param outputDataset The dataset in which the table resides.
   * @return The destination to route the input message to.
   */
  @VisibleForTesting
  static TableDestination getTableDestination(
      ValueInSingleWindow<PubsubMessage> value,
      String tableNameAttr,
      String outputProject,
      String outputDataset) {
    PubsubMessage message = value.getValue();

    TableDestination destination;
    if (message != null) {
      destination =
          new TableDestination(
              String.format(
                  "%s:%s.%s",
                  outputProject, outputDataset, message.getAttributeMap().get(tableNameAttr)),
              null);
    } else {
      throw new RuntimeException(
          "Cannot retrieve the dynamic table destination of a null message!");
    }

    return destination;
  }
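
One gap worth noting in getTableDestination as written: when a message lacks the table-name attribute, the map lookup returns null and String.format renders it as the literal text "null", so the destination becomes project:dataset.null. A minimal sketch of a stricter table-spec builder follows. `TableSpecSketch`, `tableSpec`, and the `SAFE_TABLE_NAME` pattern are assumptions made for illustration; in particular, the pattern is a deliberately conservative check, not BigQuery's full naming rule.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Builds a "project:dataset.table" spec, rejecting missing or suspicious table names. */
class TableSpecSketch {

  // Assumption for this sketch: accept only letters, digits, and underscores.
  // This is intentionally stricter than what BigQuery itself allows.
  private static final Pattern SAFE_TABLE_NAME = Pattern.compile("[A-Za-z0-9_]{1,1024}");

  static String tableSpec(
      Map<String, String> attributes, String tableNameAttr, String project, String dataset) {
    // A message without attributes is treated the same as a missing attribute.
    String table = attributes == null ? null : attributes.get(tableNameAttr);
    if (table == null || !SAFE_TABLE_NAME.matcher(table).matches()) {
      throw new IllegalArgumentException(
          "Missing or invalid table name in attribute '" + tableNameAttr + "': " + table);
    }
    return String.format("%s:%s.%s", project, dataset, table);
  }

  public static void main(String[] args) {
    // Valid attribute: prints the fully qualified table spec.
    System.out.println(tableSpec(Map.of("table", "orders"), "table", "my-project", "my_dataset"));

    // Missing attribute: rejected instead of silently producing "...null".
    try {
      tableSpec(Map.of(), "table", "my-project", "my_dataset");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The returned string could then be passed to the TableDestination constructor in place of the inline String.format call. Whether to throw, as here, or to divert such messages to a dead-letter output is a design choice that depends on the pipeline.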