Using ValueProvider

Time: 2018-03-29 10:10:14

Tags: google-cloud-dataflow apache-beam google-cloud-pubsub

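I have multiple Cloud PubSub subscriptions that I want to read with Apache Beam, selected by a prefix pattern. I extended the PTransform class and implemented expand() to read from the matching subscriptions and Flatten the resulting PCollectionList (one PCollection per subscription) into a single PCollection. My problem is passing the subscription prefix as a ValueProvider into expand(): expand() is called when the template is created, not when the job is launched, so the ValueProvider has no value yet. With only one subscription, however, I can pass the ValueProvider straight to PubsubIO.readStrings().fromSubscription().

Here is some sample code.

import java.util.ArrayList;
import java.util.List;
import javax.annotation.Nullable;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class MultiPubSubIO extends PTransform<PBegin, PCollection<PubsubMessage>> {

    private ValueProvider<String> prefixPubsub;

    public MultiPubSubIO(@Nullable String name, ValueProvider<String> prefixPubsub) {
        super(name);
        this.prefixPubsub = prefixPubsub;
    }

    @Override
    public PCollection<PubsubMessage> expand(PBegin input) {
        List<String> myList = null;
        try {
            // prefixPubsub.get() will return error
            myList = PubsubHelper.getAllSubscription("projectID", prefixPubsub.get());
        } catch (Exception e) {
            LogHelper.error(String.format("Error getting list of subscription : %s", e.toString()));
        }

        List<PCollection<PubsubMessage>> collectionList = new ArrayList<PCollection<PubsubMessage>>();
        if (myList != null && !myList.isEmpty()) {
            for (String subs : myList) {
                PCollection<PubsubMessage> pCollection = input
                    .apply("ReadPubSub", PubsubIO.readMessagesWithAttributes().fromSubscription(this.prefixPubsub));
                collectionList.add(pCollection);
            }
            PCollection<PubsubMessage> pubsubMessagePCollection = PCollectionList.of(collectionList)
                .apply("FlattenPcollections", Flatten.pCollections());
            return pubsubMessagePCollection;
        } else {
            LogHelper.error(String.format("No subscription with prefix %s found", prefixPubsub));
            return null;
        }
    }

    public static MultiPubSubIO read(ValueProvider<String> prefixPubsub) {
        return new MultiPubSubIO(null, prefixPubsub);
    }
}

So I am wondering how to read multiple subscriptions from a ValueProvider in the same way that PubsubIO.read().fromSubscription() accepts one. Or am I missing something?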


2 answers:

Answer 0 (score: 1)

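Unfortunately, this is currently not possible:

  • The value of a ValueProvider cannot affect transform expansion: at expansion time the value is unknown, and by the time it is known the pipeline shape is already fixed.

  • There is currently no transform like PubsubIO.read() that can accept a PCollection of topic names. Eventually there will be (it is enabled by Splittable DoFn, http://s.apache.org/splittable-do-fn), but that will take a while; nobody is currently working on it.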
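
The first point can be illustrated with a toy model in plain Python. This is only a sketch of the timing problem, not Beam code: DeferredValue, expand_single_read, expand_multi_read and run are hypothetical names invented for the illustration, and the subscription names are made up.

```python
class DeferredValue:
    """Toy stand-in for a runtime-provided option (NOT Beam's ValueProvider)."""

    def __init__(self, name):
        self.name = name
        self._value = None
        self._ready = False

    def set(self, value):
        # Models the moment the job is launched and the option value arrives.
        self._value = value
        self._ready = True

    def get(self):
        if not self._ready:
            raise RuntimeError(f"{self.name} is not available until the job runs")
        return self._value


def expand_single_read(subscription):
    """Graph construction: stores the deferred value without reading it."""
    return [("Read", subscription)]


def expand_multi_read(prefix, all_subscriptions):
    """Graph construction: needs the prefix immediately to decide how many reads to add."""
    matching = [s for s in all_subscriptions if s.startswith(prefix.get())]
    return [("Read", s) for s in matching]


def run(graph):
    """Execution: deferred values are resolved only at this point."""
    return [v.get() if isinstance(v, DeferredValue) else v for _, v in graph]


prefix = DeferredValue("prefixPubsub")
subs = ["orders-a", "orders-b", "billing-a"]  # made-up subscription names

# One read whose target is resolved later: building the graph succeeds.
graph = expand_single_read(prefix)

# A fan-out whose size depends on the value: building the graph fails.
try:
    expand_multi_read(prefix, subs)
    fan_out_error = None
except RuntimeError as exc:
    fan_out_error = str(exc)

# Only after "launch" does the value exist, and the graph shape is already fixed.
prefix.set("orders-sub")
resolved = run(graph)
```

The single read works because the graph always contains exactly one read step and only its target is filled in later; the fan-out fails because the number of steps itself depends on the missing value.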

Answer 1 (score: 0)

You can use MultipleReadFromPubSub from the io module of the Apache Beam Python SDK: https://beam.apache.org/releases/pydoc/2.27.0/_modules/apache_beam/io/gcp/pubsub.html

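from apache_beam.io.gcp.pubsub import MultipleReadFromPubSub, PubSubSourceDescriptor

# Each descriptor names a topic or a subscription; the label and
# timestamp attribute arguments are optional.
topic_1 = PubSubSourceDescriptor('projects/myproject/topics/a_topic')
topic_2 = PubSubSourceDescriptor(
    'projects/myproject2/topics/b_topic',
    'my_label',
    'my_timestamp_attribute')
subscription_1 = PubSubSourceDescriptor(
    'projects/myproject/subscriptions/a_subscription')

# `pipeline` is an existing beam.Pipeline object.
results = pipeline | MultipleReadFromPubSub(
    [topic_1, topic_2, subscription_1])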
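
To tie this back to the prefix use case in the question, the list of descriptors could be built from subscription names that match a prefix. The helper below is a minimal sketch in plain Python: subscription_paths and the sample names are hypothetical, and it assumes the names are already known when the pipeline is constructed, so it does not by itself address a prefix that is only supplied at template launch.

```python
def subscription_paths(project, subscription_names, prefix):
    """Return full subscription paths for the names that start with `prefix`."""
    return [
        f"projects/{project}/subscriptions/{name}"
        for name in sorted(subscription_names)
        if name.startswith(prefix)
    ]


# Hypothetical names; in practice they would come from listing the
# project's subscriptions before the pipeline is built.
names = ["orders-b", "billing-a", "orders-a"]
paths = subscription_paths("myproject", names, "orders-")

# With Beam installed, each path could then be wrapped, for example
#   [PubSubSourceDescriptor(p) for p in paths]
# and the resulting list handed to MultipleReadFromPubSub.
```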