Question

I can read PubSub messages from a topic using PubsubIO, like this:

pipeline.apply("read", PubsubIO.readMessages().fromTopic(options.getPubsubReadTopic()))
.apply( /* rest of the pipeline that works on PubSubMessage records */ )

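The data in our PubSub messages is wrapped in a custom wrapper of ours, which makes it awkward to work with. I would like to create a class CustomPubsubIO and use it in a similar way: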

pipeline.apply("read", CustomPubsubIO.<MyType>readTyped().fromTopic(options.getPubsubReadTopic()))
.apply( /* rest of the pipeline that works on MyType records */ )

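I was able to create a custom CustomCoder<MyType>, but I cannot use it to create a PubsubIO.Read<MyType>. PubsubIO.Read is an abstract class inside PubsubIO and is generated with @AutoValue, so it seems I cannot extend it directly.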

What is the correct way to create a Read<> with a custom coder?

Answer 1

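Do you have a specific reason to create a PubsubIO.Read of a custom type? If not, you can simply use PubsubIO.readMessages() and follow it with a DoFn that converts each output PubsubMessage into whatever you need. The API that supported custom coders and custom parse functions was removed two years ago, because using a DoFn seemed to be a cleaner and semantically equivalent way to produce custom types.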
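A minimal sketch of that suggestion, assuming the unwrapped type is a plain String and the wrapper is just UTF-8 text. The class name UnwrapWithDoFn, the unwrap helper, and the topic path are hypothetical placeholders, not part of the original question:

```java
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class UnwrapWithDoFn {

  /** Hypothetical stand-in for the real unwrapping logic: decodes the payload as UTF-8. */
  static String unwrap(PubsubMessage message) {
    return new String(message.getPayload(), StandardCharsets.UTF_8);
  }

  /** Converts each PubsubMessage into the target type (String here, for illustration). */
  static class UnwrapFn extends DoFn<PubsubMessage, String> {
    @ProcessElement
    public void processElement(@Element PubsubMessage message, OutputReceiver<String> out) {
      out.output(unwrap(message));
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder topic path; replace with a real "projects/<project>/topics/<topic>" value.
    PCollection<String> records =
        pipeline
            .apply("read", PubsubIO.readMessages().fromTopic("projects/my-project/topics/my-topic"))
            .apply("unwrap", ParDo.of(new UnwrapFn()));

    // ... rest of the pipeline that works on the unwrapped records ...

    pipeline.run();
  }
}
```

For a type such as MyType, the DoFn would output MyType instead of String, and the custom coder could be attached to the resulting collection with setCoder(...) if Beam cannot infer one.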

Answer 2

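OK, I was able to do it. I had to put the CustomPubsubIO class (named ExtraPubsubIO in the code below) in package org.apache.beam.sdk.io.gcp.pubsub, because AutoValue_PubsubIO_Read is package-private. So I am not sure this solution will keep working in the future (it looks more like a hack).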

Anyway, here is the simplified code:

package org.apache.beam.sdk.io.gcp.pubsub;

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO.Read;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class ExtraPubsubIO {

  public static <T> Read<T> read() {
    return new AutoValue_PubsubIO_Read.Builder<T>()
        .setPubsubClientFactory(PubsubJsonClient.FACTORY)
        .setCoder(new CustomTypeInPubSubCoder<>())
        .setParseFn(new CustomTypeUnwrapFn<>())
        .setNeedsAttributes(false)
        .build();
  }

  private static class CustomTypeUnwrapFn<T> extends SimpleFunction<PubsubMessage, T> {

    @Override
    public T apply(PubsubMessage input) {
      return CustomTypeUnwrapper.unwrap(input);
    }
  }
}

Then use it in the pipeline like this:

pipeline.apply("Read PubSub messages", ExtraPubsubIO.<String>read().fromTopic(options.getPubsubReadTopic()))
        .apply("Write File(s)", TextIO.write()...
        .run()

Reading with PubsubIO using a custom coder
