I am trying to replace reduceByKey with accumulator logic for a word count.
The input file wc.txt contains:
Hello how are you
Here is what I have so far:
val words = sc.textFile("wc.txt").flatMap(_.split(" "))
val accum = sc.accumulator(0,"myacc")
for (i <- 1 to words.count.toInt)
  foreach(x => accum += x)
.....
How do I proceed from here? Any ideas or suggestions are appreciated.
Answer 0 (score: 0)
As far as I understand, you want to count all the words in the text file using a Spark accumulator; in that case you can use:
words.foreach(_ => accum.add(1)) // add 1 per word; accum is the Int accumulator from the question
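For context, here is a minimal self-contained sketch of that approach. It assumes Spark 2.x running with a local master and the wc.txt path from the question, and it swaps the deprecated sc.accumulator for the built-in longAccumulator:

import org.apache.spark.{SparkConf, SparkContext}

object TotalWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wc").setMaster("local[*]"))
    val words = sc.textFile("wc.txt").flatMap(_.split(" "))
    // a built-in Long accumulator; nothing custom is needed for a plain total
    val accum = sc.longAccumulator("myacc")
    // foreach is an action, so it runs the job and the accumulator gets updated
    words.foreach(_ => accum.add(1L))
    println(s"total words: ${accum.value}")
    sc.stop()
  }
}

Note that this gives the total number of words in the file, not a count per distinct word; for per-word counts see the next answer.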
Answer 1 (score: 0)
Actually, using accumulators for this is quite cumbersome and not recommended, but for completeness, here is how it can be done (at least with Spark versions 1.6 <= V <= 2.1). Note that this relies on a deprecated API that will not be part of the next versions.
You need a Map[String, Long] accumulator, which is not available by default, so you have to create your own AccumulableParam implementation and use it implicitly:
// the deprecated AccumulableParam API lives directly in the org.apache.spark package
import org.apache.spark.AccumulableParam

// some data:
val words = sc.parallelize(Seq("Hello how are are you")).flatMap(_.split(" "))
// aliasing the type, just for convenience
type AggMap = Map[String, Long]
// creating an implicit AccumulableParam that counts by String key
implicit val param: AccumulableParam[AggMap, String] = new AccumulableParam[AggMap, String] {
  // increase the matching value by 1, or create it if missing
  override def addAccumulator(r: AggMap, t: String): AggMap =
    r.updated(t, r.getOrElse(t, 0L) + 1L)
  // merge two maps by summing matching values
  override def addInPlace(r1: AggMap, r2: AggMap): AggMap =
    r1 ++ r2.map { case (k, v) => k -> (v + r1.getOrElse(k, 0L)) }
  // start with an empty map
  override def zero(initialValue: AggMap): AggMap = Map.empty
}
// create the accumulator; This will use the above `param` implicitly
val acc = sc.accumulable[AggMap, String](Map.empty[String, Long])
// add each word to accumulator; the `count()` can be replaced by any Spark action -
// we just need to trigger the calculation of the mapped RDD
words.map(w => { acc.add(w); w }).count()
// after the action, we can read the value of the accumulator
val result: AggMap = acc.value
result.foreach(println)
// (Hello,1)
// (how,1)
// (are,2)
// (you,1)
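Since this relies on the deprecated AccumulableParam API, a rough sketch of the same per-word counting on the AccumulatorV2 API (available from Spark 2.0) could look like the following. The class name WordCountAccumulator is only an illustrative choice, not something provided by Spark, and the snippet reuses sc and words from the code above:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// custom accumulator: takes single words as input, exposes a Map[String, Long] of counts
class WordCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]

  override def isZero: Boolean = counts.isEmpty
  override def copy(): WordCountAccumulator = {
    val c = new WordCountAccumulator
    c.counts ++= counts
    c
  }
  override def reset(): Unit = counts.clear()
  // called on the executors for every word added
  override def add(word: String): Unit =
    counts(word) = counts.getOrElse(word, 0L) + 1L
  // called on the driver to combine per-partition results
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) = counts.getOrElse(k, 0L) + v }
  override def value: Map[String, Long] = counts.toMap
}

val wordCounts = new WordCountAccumulator
sc.register(wordCounts, "wordCounts")
// foreach is an action, so it triggers the job and fills the accumulator
words.foreach(wordCounts.add)
wordCounts.value.foreach(println)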