Can I get the maximum key of each RDD in a DStream?

Time: 2016-11-20 12:41:15

Tags: python apache-spark pyspark spark-streaming

I need to find the maximum key of each RDD, but when I use reduce() all I can get is the single largest element in the whole DStream. For example, in this stream I want (2,"b"), (2,"d"), (3,"f"), but I only get (3,"f"). How can I get (2,"b"), (2,"d"), (3,"f")?

1 Answer:

Answer 0 (score: 0)

This:

stream = ssc.queueStream([sc.parallelize([(1,"a"), (2,"b"),(1,"c"),(2,"d"),
  (1,"e"),(3,"f")],3)])

creates a stream with only a single batch, where that first (and only) batch has (at least) 3 partitions. I think what you want is:

stream = ssc.queueStream([
    sc.parallelize([(1,"a"), (2,"b")]),
    sc.parallelize([(1,"c"), (2,"d")]), 
    sc.parallelize([(1,"e"), (3,"f")]), 
])

This will give you the expected result:

stream.reduce(max).pprint()
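
For reference, a minimal end-to-end sketch of the suggested fix; the local master, the app name, the 1-second batch interval, and the shutdown calls are assumptions added for illustration, not part of the original answer:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Assumed setup: local master and a 1-second batch interval.
sc = SparkContext("local[2]", "max-key-per-batch")
ssc = StreamingContext(sc, 1)

# Each RDD in the queue becomes its own micro-batch.
stream = ssc.queueStream([
    sc.parallelize([(1, "a"), (2, "b")]),
    sc.parallelize([(1, "c"), (2, "d")]),
    sc.parallelize([(1, "e"), (3, "f")]),
])

# reduce() runs once per batch, so each batch prints its own maximum:
# (2, 'b'), then (2, 'd'), then (3, 'f').
stream.reduce(max).pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(10)   # enough time for the three batches to drain
ssc.stop(stopSparkContext=True, stopGraceFully=False)

Python compares tuples element-wise, so max picks the pair with the largest key (and, on a tied key, the largest value), which is why reduce(max) yields the per-batch maximum key here.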