I have a sequence of events ordered by time: (T1,K1,V1), (T2,K2,V3), (T3,K1,V2), (T4,K2,V4), (T5,K1,V5).
Both the keys and the values are strings.
Using Spark, I am trying to produce the following output:
K1,(V1,V2,V5)
K2,(V3,V4)
This is what I have tried:
import org.apache.spark.{SparkConf, SparkContext}

val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile, 2).cache()
val rdd2 = rdd1.map { line =>
  val fields = line.split(" ")
  val key = fields(1)  // fields(0) is the timestamp
  val v = fields(2)
  (key, v)
}
// TODO : rdd2.reduce to get the output I want
rdd2.saveAsTextFile(outputFile)
Can someone point me to how to write the reduce step to produce the output I want? Thanks in advance.
Answer 0 (score: 2)
You can get the output you want simply by grouping the RDD by key: rdd2.groupByKey
This short spark-shell session illustrates the usage:
val events = List(("t1","k1","v1"), ("t2","k2","v3"), ("t3","k1","v2"), ("t4","k2","v4"), ("t5","k1","v5"))
val rdd = sc.parallelize(events)
val kv = rdd.map{case (t,k,v) => (k,v)}
val grouped = kv.groupByKey
// show the collection ('collect' used here only to show the contents)
grouped.collect
res0: Array[(String, Iterable[String])] = Array((k1,ArrayBuffer(v1, v2, v5)), (k2,ArrayBuffer(v3, v4)))
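The grouping semantics can also be followed without a Spark cluster: plain Scala collections offer `groupBy`, which, like `groupByKey` here, keeps the values in their original relative order within each group. This is a minimal sketch of that logic (the object and method names are made up for illustration):

```scala
object GroupDemo {
  // Turn (time, key, value) events into key -> values-in-time-order,
  // mirroring what map + groupByKey does in the Spark answer above.
  def groupValues(events: List[(String, String, String)]): Map[String, List[String]] =
    events
      .groupBy(_._2)                                   // group full tuples by key
      .map { case (k, evs) => k -> evs.map(_._3) }     // keep only the values

  def main(args: Array[String]): Unit = {
    val events = List(("t1", "k1", "v1"), ("t2", "k2", "v3"), ("t3", "k1", "v2"),
                      ("t4", "k2", "v4"), ("t5", "k1", "v5"))
    println(groupValues(events))
  }
}
```

Note that in newer Spark versions `groupByKey` returns the values as a `CompactBuffer` rather than the `ArrayBuffer` shown in `res0`; the contents are the same.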