I'm running a standalone Spark 2.0 cluster that will run many small jobs. What I'd like to do is run multiple actions on the same DStream (because reserving even one core per job is too expensive). I'm using pySpark, and my code looks roughly like this:
# Stream context reading data from Apache Kafka.
stream_context = KafkaUtils.createDirectStream(.....
# persisting the streaming context.
...
# Iterator to run different jobs.
iterator = {1: {'filter': ['metrics_of_kind_A', 'metrics_of_kind_AA']}, 2: {'filter': ['metrics_of_kind_B']}}
# Multiple actions on the same data set.
for _, value in iterator.iteritems():
    # Bind value['filter'] now; Spark evaluates the lambda lazily,
    # so a plain closure would see only the last value of the loop.
    filtered_context = stream_context.filter(
        lambda x, kinds=value['filter']: x['metric_type'] in kinds)
    filtered_context.pprint()
Is there a way for me to do this?
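For reference, here is a minimal, self-contained sketch of the pipeline above. The broker address, topic name, batch interval, and the assumption that each message is a JSON object carrying a 'metric_type' field are placeholders for illustration only:

# A self-contained sketch, assuming a broker at 'localhost:9092',
# a Kafka topic named 'metrics', and JSON messages with a 'metric_type' key.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="metrics-splitter")
ssc = StreamingContext(sc, 5)  # 5-second batch interval

# Direct Kafka stream of (key, value) string pairs.
raw_stream = KafkaUtils.createDirectStream(
    ssc, ['metrics'], {'metadata.broker.list': 'localhost:9092'})

# Parse the message body and cache it so every filter reuses the same batch.
parsed_stream = raw_stream.map(lambda kv: json.loads(kv[1]))
parsed_stream.cache()

jobs = {1: {'filter': ['metrics_of_kind_A', 'metrics_of_kind_AA']},
        2: {'filter': ['metrics_of_kind_B']}}

# Register one filter + output action per job definition.
for _, value in jobs.items():
    filtered = parsed_stream.filter(
        lambda x, kinds=value['filter']: x['metric_type'] in kinds)
    filtered.pprint()

ssc.start()
ssc.awaitTermination()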