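How do I apply a filter function to an RDD whose elements are dictionaries of key-value pairs, when I want to select every record in which a certain value satisfies a condition?
I am consuming a live Twitter stream from a Kafka topic in PySpark, and the code that builds my RDD of dictionaries looks like this: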
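import sys
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)  # 2-second batch interval
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])  # keep the message value, drop the Kafka key
    # extractTweet (defined elsewhere) turns the decoded JSON into a dict
    parsed_stream = lines.map(lambda tweets: extractTweet(json.loads(tweets.encode('utf-8'))))
    # parsed_stream.get('text')
    values = parsed_stream.flatMap(lambda f: f.items())
    values.filter(lambda a: a[0] > 0).pprint()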
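One thing worth noting about the last two lines: after `flatMap(lambda f: f.items())` each element is a `(key, value)` tuple, so `a[0]` is the key and `a[1]` is the value, and the original dictionary is no longer available as a whole. A minimal sketch of one possible alternative is to filter the dictionaries before flattening them. The predicate below is plain Python so it can be tried without a Spark cluster; the helper name `value_matches`, the field `retweet_count`, and the sample records are all hypothetical and stand in for whatever `extractTweet` actually returns.

```python
def value_matches(record, key, predicate):
    """Return True if `record` contains `key` and its value satisfies `predicate`."""
    return key in record and predicate(record[key])


# Hypothetical records standing in for the dicts produced by extractTweet.
sample_tweets = [
    {"text": "hello spark", "retweet_count": 3},
    {"text": "quiet tweet", "retweet_count": 0},
    {"text": "no count field"},
]

# Keep whole dictionaries whose retweet_count is positive.
popular = [
    tweet for tweet in sample_tweets
    if value_matches(tweet, "retweet_count", lambda v: v > 0)
]
print(popular)

# If (key, value) tuples are really what is wanted, test the value with pair[1].
pairs = [pair for tweet in sample_tweets for pair in tweet.items()]
positive_pairs = [
    pair for pair in pairs
    if isinstance(pair[1], int) and pair[1] > 0
]
print(positive_pairs)
```

On the stream itself, the same predicate could presumably be passed to `filter` ahead of the `flatMap`, for example `parsed_stream.filter(lambda t: value_matches(t, "retweet_count", lambda v: v > 0)).pprint()`, so that every matching tweet is printed with all of its fields.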