I open a Kafka stream in my PySpark code as follows:
ssc.checkpoint('ckpt')
mystream = KafkaUtils.createStream(ssc, K_HOST, "sample", {"logs_queue": 1})
Now I am trying to split this stream into 2 separate streams based on certain conditions. If I run the following code, I get the 2 streams perfectly:
s1 = mystream.filter(lambda s: s['key'].startswith("11")).map(lambda s: (s['key'], 1)).reduceByKey(lambda a, b: a + b)
s1.pprint()
s2 = mystream.filter(lambda s: s['key'].startswith("12")).map(lambda s: (s['key'], 1)).reduceByKey(lambda a, b: a + b)
s2.pprint()
The output of the statements above is correct:
-------------------------------------------
Time: 2017-02-08 14:09:26
-------------------------------------------
(u'11-59', 201)
(u'11-142', 225)
(u'11-68', 151)
(u'11-64', 161)
(u'11-60', 152)
(u'11-69', 106)
(u'11-65', 196)
(u'11-61', 208)
(u'11-143', 158)
(u'11-140', 112)
...
-------------------------------------------
Time: 2017-02-08 14:09:26
-------------------------------------------
(u'12-14', 62)
(u'12-10', 73)
(u'12-36', 95)
(u'12-32', 106)
(u'12-18', 82)
(u'12-21', 107)
(u'12-25', 68)
(u'12-29', 111)
(u'12-15', 134)
(u'12-28', 59)
...
Since only the filter condition differs between the two statements above, I rewrote the code as a for loop:
f = ["12", "11"]
for i in f:
    fs = mystream.filter(lambda s: s['key'].startswith(i)).map(lambda s: (s['key'], 1)).reduceByKey(lambda a, b: a + b)
    fs.pprint()
I expected the same output as before, but instead both pprints produce the same stream, as shown below:
-------------------------------------------
Time: 2017-02-08 14:05:38
-------------------------------------------
(u'11-59', 102)
(u'11-68', 107)
(u'11-60', 93)
(u'11-142', 145)
(u'11-64', 150)
(u'11-61', 71)
(u'11-143', 155)
(u'11-65', 131)
(u'11-69', 110)
(u'11-140', 71)
...
-------------------------------------------
Time: 2017-02-08 14:05:38
-------------------------------------------
(u'11-59', 102)
(u'11-68', 107)
(u'11-60', 93)
(u'11-142', 145)
(u'11-64', 150)
(u'11-61', 71)
(u'11-143', 155)
(u'11-65', 131)
(u'11-69', 110)
(u'11-140', 71)
...
I believe the variable i is being captured by reference rather than by value, but I need it captured by value. How can I make the for loop pick up both filters correctly?
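For what it's worth, the behaviour can be reproduced without Spark at all: lambdas defined in a loop look up the loop variable when they are *called*, not when they are defined, so all of them see its final value. A common fix (a sketch in plain Python, not my actual Spark code) is to bind the current value as a default argument:

```python
# All lambdas created in the loop share the same variable `i`,
# so by the time they run, every one of them sees i == "11".
filters = []
for i in ["12", "11"]:
    filters.append(lambda s: s.startswith(i))

print([f("12-14") for f in filters])  # both False: every lambda tests "11"

# Binding i as a default argument captures its value at definition time,
# giving each lambda its own prefix.
filters = []
for i in ["12", "11"]:
    filters.append(lambda s, prefix=i: s.startswith(prefix))

print([f("12-14") for f in filters])  # [True, False]
```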
Thanks!