Spark aggregateByKey with a Map: defining the functions' data types

Date: 2015-09-02 14:14:10

Tags: scala apache-spark

The title is probably confusing enough on its own, so let me explain. I am trying to break this operation out into named functions so that the rest of the team, who will be writing code against this, can better understand how aggregateByKey works. I have the following aggregation:

val firstLetter = stringRDD.aggregateByKey(Map[Char, Int]())(
  (accumCount, value) => accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  },
  (accum1, accum2) => accum1 ++ accum2.map{
    case (k, v) => k -> (v + accum1.getOrElse(k, 0))
  }
).collect()
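
To make the intent concrete, this computes, per key, a map from first letter to the number of values starting with that letter. On a tiny hypothetical dataset (the pairs below are made up purely for illustration):

// Hypothetical input: stringRDD = sc.parallelize(Seq(
//   ("fruit", "apple"), ("fruit", "avocado"), ("fruit", "banana")))
// firstLetter would then be:
//   Array(("fruit", Map('a' -> 2, 'b' -> 1)))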

I have been trying to break it out along the following lines:

val firstLet = Map[Char, Int]()

def fSeq(accumCount: ?, value: ?) = {
  accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  }
}

def fComb(accum1: ?, accum2: ?) = {
  accum1 ++ accum2.map{ case (k, v) => k -> (v + accum1.getOrElse(k, 0)) }
}

Since the initial value is Map[Char, Int], I do not know what data type to declare accumCount as. I have tried various things but nothing has worked. Can someone help me fill in the data types and explain how you determined them?

1 answer:

Answer 0 (score: 1)

  • seqOp takes an accumulator of the same type as the initial value as its first argument, and a value of the same type as the values in the RDD as its second argument.
  • combOp takes two accumulators of the same type as the initial value (see the signature sketch just below).
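
Those two rules are just the shape of aggregateByKey on an RDD[(K, V)]. A simplified sketch of the signature (paraphrased from the PairRDDFunctions scaladoc; the real one also carries a ClassTag context bound on U):

def aggregateByKey[U](zeroValue: U)(
    seqOp: (U, V) => U,   // folds one RDD value into an accumulator
    combOp: (U, U) => U   // merges two accumulators
): RDD[(K, U)]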

Assuming you want to aggregate RDD[(T, U)]:

def fSeq(accumCount: Map[Char, Int], value: U): Map[Char, Int] = ???
def fComb(accum1: Map[Char, Int], accum2: Map[Char, Int]): Map[Char, Int] = ???

I guess in your case U is simply String, so you need to adjust the fSeq signature accordingly.
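
Filling the concrete types into the match-based bodies from the question (this is just the original code with the ? placeholders resolved):

def fSeq(accumCount: Map[Char, Int], value: String): Map[Char, Int] =
  accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  }

def fComb(accum1: Map[Char, Int], accum2: Map[Char, Int]): Map[Char, Int] =
  accum1 ++ accum2.map{ case (k, v) => k -> (v + accum1.getOrElse(k, 0)) }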

By the way, you can use a map with a default value and simplify your functions:

val firstLet = Map[Char, Int]().withDefault(x => 0)

def fSeq(accumCount: Map[Char, Int], value: String): Map[Char, Int] = {
  accumCount + (value.head -> (accumCount(value.head) + 1))
}

def fComb(accum1: Map[Char, Int], accum2: Map[Char, Int]): Map[Char, Int] = {
  val accum = (accum1.keys ++ accum2.keys).map(k => (k, accum1(k) + accum2(k)))
  accum.toMap.withDefault(x => 0)
}
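
The trick here is that withDefault only changes what apply returns for a missing key; it does not add entries, and get is unaffected. A quick REPL-style illustration:

val m = Map('a' -> 1).withDefault(_ => 0)
m('b')                     // 0: the default kicks in on apply
m.get('b')                 // None: get ignores the default
m + ('b' -> (m('b') + 1))  // Map('a' -> 1, 'b' -> 1)

That is also why fComb re-attaches the default after toMap: toMap produces a plain map and drops the default.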

Finally, it can be more efficient to use scala.collection.mutable.Map:

import scala.collection.mutable.{Map => MMap}

def firstLetM = MMap[Char, Int]().withDefault(x => 0)

def fSeqM(accumCount: MMap[Char, Int], value: String): MMap[Char, Int] = {
  // += updates the map in place and returns the map itself
  accumCount += (value.head -> (accumCount(value.head) + 1))
}

def fCombM(accum1: MMap[Char, Int], accum2: MMap[Char, Int]): MMap[Char, Int] = {
  // fold the second accumulator into the first and reuse it
  accum2.foreach{ case (k, v) => accum1 += (k -> (accum1(k) + v)) }
  accum1
}
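
Mutating the accumulators like this is fine here: the aggregateByKey docs note that, to avoid memory allocation, both functions are allowed to modify and return their first argument instead of creating a new object. A quick sanity check that += really does return the same instance rather than a copy:

import scala.collection.mutable.{Map => MMap}

val m = MMap[Char, Int]().withDefault(_ => 0)
val returned = m += ('a' -> (m('a') + 1))
returned eq m  // true: += mutates in place and returns the same map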

Tests:

def randomChar() = (scala.util.Random.nextInt.abs % 58 + 65).toChar

def randomString() = {
  (Seq(randomChar) ++ Iterator.iterate(randomChar)(_ => randomChar)
    .takeWhile(_ => scala.util.Random.nextFloat > 0.1)).mkString
}

val stringRDD = sc.parallelize(
  (1 to 500000).map(_ => (scala.util.Random.nextInt.abs % 60, randomString)))


val firstLetter = stringRDD.aggregateByKey(Map[Char, Int]())(
  (accumCount, value) => accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  },
  (accum1, accum2) => accum1 ++ accum2.map{
     case(k,v) => k -> (v + accum1.getOrElse(k, 0))}
).collectAsMap()

val firstLetter2 = stringRDD
  .aggregateByKey(firstLet)(fSeq, fComb)
  .collectAsMap

val firstLetter3 = stringRDD
  .aggregateByKey(firstLetM)(fSeqM, fCombM)
  .mapValues(_.toMap)
  .collectAsMap


firstLetter == firstLetter2
firstLetter == firstLetter3