Here is my data:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
Now I want the following two kinds of output, but without using combineByKey or aggregateByKey:
1) Array[(String, Int)] = Array((foo,5), (bar,3))
2) Array((foo,Set(B, A)), (bar,Set(C, D)))
Here is my attempt:
scala> val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
| "bar=D", "bar=D")
scala> val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))
sample: Array[(String, String)] = Array((foo,A), (foo,A), (foo,A), (foo,A), (foo,B), (bar,C), (bar,D), (bar,D))
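Now, when I type the variable name followed by Tab to see which methods are available on the mapped RDD, I only get the options below, and none of them meets my requirement: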
scala> sample.
apply asInstanceOf clone isInstanceOf length toString update
So how can I achieve this?
Answer 0 (score: 1)
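Here is a standard approach.
One thing to note: you need to work with an RDD. I think that is where you got stuck. Your sample is a plain Scala Array, not an RDD, which is why tab completion only shows Array methods; wrapping it with sc.parallelize gives you the pair-RDD methods such as reduceByKey and groupByKey.
Here you go: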
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C","bar=D", "bar=D")
val sample=keysWithValuesList.map(_.split("=")).map(p=>(p(0),(p(1))))
val sample2 = sc.parallelize(sample.map(x => (x._1, 1)))
val sample3 = sample2.reduceByKey(_+_)
sample3.collect()
val sample4 = sc.parallelize(sample.map(x => (x._1, x._2))).groupByKey()
sample4.collect()
val sample5 = sample4.map(x => (x._1, x._2.toSet))
sample5.collect()
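As a variation on the steps above, here is a minimal sketch that builds the sets with reduceByKey instead of groupByKey, so it still avoids combineByKey and aggregateByKey. It assumes a local-mode SparkContext created just for illustration (in spark-shell you would use the provided sc), and the object name KeyCountsAndSets and the method run are hypothetical names, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object KeyCountsAndSets {

  // Returns (count per key, set of distinct values per key) as local Maps.
  def run(sc: SparkContext): (Map[String, Int], Map[String, Set[String]]) = {
    val keysWithValuesList =
      Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")

    // Parallelize first, so that pair-RDD methods become available on the result.
    val pairs = sc.parallelize(keysWithValuesList)
      .map(_.split("="))
      .map(p => (p(0), p(1)))

    // 1) Count per key: replace each value with 1, then sum per key.
    val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _)

    // 2) Distinct values per key: wrap each value in a Set, then union per key.
    val sets = pairs.mapValues(v => Set(v)).reduceByKey(_ ++ _)

    (counts.collect().toMap, sets.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val conf = new SparkConf().setAppName("KeyCountsAndSets").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (counts, sets) = run(sc)
      counts.foreach(println)
      sets.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The order of keys in the collected results is not guaranteed. Merging sets with reduceByKey combines values on each partition before the shuffle, which can move less data than groupByKey followed by toSet when keys have many duplicate values.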