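I have multiple Cloud Pub/Sub subscriptions that I want to read with Apache Beam, selected by a prefix pattern. I extend the PTransform class and implement the expand() method to read from each subscription, then apply a Flatten transform to the resulting PCollectionList (one PCollection per subscription). My problem is passing the subscription prefix as a ValueProvider into expand(): expand() is called at template creation time, not when the job is launched, so the value is not available yet. However, if I use only one subscription, I can pass the ValueProvider directly to PubsubIO.readStrings().fromSubscription().

Here is some sample code.

    import java.util.ArrayList;
    import java.util.List;
    import javax.annotation.Nullable;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.options.ValueProvider;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    // PubsubHelper and LogHelper are my own helper classes (not shown).
    public class MultiPubSubIO extends PTransform<PBegin, PCollection<PubsubMessage>> {

        private ValueProvider<String> prefixPubsub;

        public MultiPubSubIO(@Nullable String name, ValueProvider<String> prefixPubsub) {
            super(name);
            this.prefixPubsub = prefixPubsub;
        }

        @Override
        public PCollection<PubsubMessage> expand(PBegin input) {
            List<String> myList = null;
            try {
                // prefixPubsub.get() will return error
                myList = PubsubHelper.getAllSubscription("projectID", prefixPubsub.get());
            } catch (Exception e) {
                LogHelper.error(String.format("Error getting list of subscription : %s", e.toString()));
            }

            List<PCollection<PubsubMessage>> collectionList = new ArrayList<PCollection<PubsubMessage>>();
            if (myList != null && !myList.isEmpty()) {
                for (String subs : myList) {
                    // Read from the current subscription (assumes the helper returns full
                    // subscription paths) and give every read step a unique name.
                    PCollection<PubsubMessage> pCollection = input
                        .apply("ReadPubSub-" + subs,
                            PubsubIO.readMessagesWithAttributes().fromSubscription(subs));
                    collectionList.add(pCollection);
                }
                PCollection<PubsubMessage> pubsubMessagePCollection = PCollectionList.of(collectionList)
                    .apply("FlattenPcollections", Flatten.pCollections());
                return pubsubMessagePCollection;
            } else {
                LogHelper.error(String.format("No subscription with prefix %s found", prefixPubsub));
                return null;
            }
        }

        public static MultiPubSubIO read(ValueProvider<String> prefixPubsub) {
            return new MultiPubSubIO(null, prefixPubsub);
        }
    }

So I am wondering how to read from a ValueProvider in the same way that PubsubIO.read().fromSubscription() does. Or am I missing something?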
Answer 0 (score: 1)
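Unfortunately, this is currently not possible.

The value of a ValueProvider cannot affect transform expansion: at expansion time it is unknown, and by the time it is known the pipeline shape is already fixed.

There is currently no transform like PubsubIO.read() that accepts a PCollection of topic names. Eventually there will be (Splittable DoFn, http://s.apache.org/splittable-do-fn, makes it possible), but that will take a while; nobody is working on it at the moment.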
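The expansion-time constraint described above can be illustrated with a toy model in plain Java. This is only a sketch: DeferredValue and ToyPipeline are hypothetical stand-ins invented for this example, not Beam's ValueProvider or Pipeline classes, and the subscription names are made up. It shows that a graph with one read step per matching subscription can only be built if the prefix is already known while the graph is being constructed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class ExpansionTimeDemo {
    public static void main(String[] args) {
        // Made-up subscription names, for illustration only.
        List<String> subs = Arrays.asList("orders-eu", "orders-us", "audit");

        // Prefix known at construction time: one read step per matching subscription.
        ToyPipeline ok = ToyPipeline.expand(() -> "orders-", subs);
        System.out.println(ok.steps);

        // Template-style parameter: still unresolved while the graph is being built.
        DeferredValue<String> prefix = new DeferredValue<>();
        try {
            ToyPipeline.expand(prefix, subs);
        } catch (IllegalStateException e) {
            System.out.println("expansion failed: " + e.getMessage());
        }
    }
}

/** Hypothetical stand-in for a runtime parameter; NOT Beam's ValueProvider. */
final class DeferredValue<T> implements Supplier<T> {
    private T value;
    private boolean available;

    /** Called by the (simulated) runner when the job is launched. */
    void resolve(T v) {
        value = v;
        available = true;
    }

    @Override
    public T get() {
        if (!available) {
            throw new IllegalStateException("value only available at run time");
        }
        return value;
    }
}

/** Hypothetical toy pipeline: its list of steps is fixed once expand() returns. */
final class ToyPipeline {
    final List<String> steps = new ArrayList<>();

    /** Construction-time expansion: needs the concrete prefix immediately. */
    static ToyPipeline expand(Supplier<String> prefix, List<String> allSubscriptions) {
        ToyPipeline p = new ToyPipeline();
        String wanted = prefix.get(); // throws if the value is still deferred
        for (String s : allSubscriptions) {
            if (s.startsWith(wanted)) {
                p.steps.add("Read-" + s);
            }
        }
        p.steps.add("Flatten");
        return p;
    }
}
```

In this model the number of read steps depends on the prefix, so the shape of the graph cannot be decided before the prefix is resolved. That is the situation a single fromSubscription(ValueProvider) call avoids: it always adds exactly one read step, whatever the value turns out to be.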
Answer 1 (score: 0)
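You can use MultipleReadFromPubSub from the Apache Beam io module (Python SDK):
https://beam.apache.org/releases/pydoc/2.27.0/_modules/apache_beam/io/gcp/pubsub.html

    from apache_beam.io.gcp.pubsub import MultipleReadFromPubSub, PubSubSourceDescriptor

    # Each descriptor names one topic or subscription to read from.
    topic_1 = PubSubSourceDescriptor('projects/myproject/topics/a_topic')
    # Optional second and third arguments: id label and timestamp attribute.
    topic_2 = PubSubSourceDescriptor(
        'projects/myproject2/topics/b_topic',
        'my_label',
        'my_timestamp_attribute')
    subscription_1 = PubSubSourceDescriptor(
        'projects/myproject/subscriptions/a_subscription')

    # `pipeline` is an existing beam.Pipeline; all sources are read into one PCollection.
    results = pipeline | MultipleReadFromPubSub(
        [topic_1, topic_2, subscription_1])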