groupBy with multiple values

Date: 2015-12-27 00:21:16

Tags: scala apache-spark

I have a list of calls in a CSV file, and I want to count the number of smsIn/smsOut messages for each phone number.

callType represents the type (call, smsIn, smsOut).

A sample of the data is (phoneNumber, callType):

7035076600, 30
5081236732, 31
5024551234, 30
7035076600, 31
7035076600, 30

Ultimately, I want something like: phoneNum, numSMSIn, numSMSOut.

I have implemented something like this:

val smsInByPhoneNum = partitionedCalls.
                      filter{ arry => arry(2) == 31}.
                      groupBy { x => x(1) }.
                      map(f=> (f._1,f._2.iterator.length)).
                      collect()

The above gives the number of smsIn messages per phone number. Similarly,

val smsOutByPhoneNum = partitionedCalls.
                       filter{ arry => arry(2) == 30}.
                       groupBy { x => x(1) }.
                       map(f=> (f._1,f._2.iterator.length)).
                       collect()

The above gives the number of smsOut messages per phone number.

Is there a way I can do this in one iteration instead of two?

3 Answers:

Answer 0: (score: 2)

There are multiple ways you can approach this problem. A naive approach is to aggregate by (number, type) tuples and then group the partial results:

val partitionedCalls = sc.parallelize(Array(
  ("7035076600", "30"), ("5081236732", "31"), ("5024551234", "30"),
  ("7035076600", "31"), ("7035076600", "30")))

val codes = partitionedCalls.values.distinct.sortBy(identity).collect

val aggregated = partitionedCalls.map((_, 1L)).reduceByKey(_ + _)
  .map { case ((number, code), cnt) => (number, (code, cnt)) }
  .groupByKey
  .mapValues(vs => codes.map(vs.toMap.getOrElse(_, 0L)))

You can also map using some structure which can capture all the counts:

case class CallCounter(calls: Long, smsIn: Long, smsOut: Long, other: Long)

partitionedCalls
  .map {
    case (number, "30") => (number, CallCounter(0L, 1L, 0L, 0L))
    case (number, "31") => (number, CallCounter(0L, 0L, 1L, 0L))
    case (number, "32") => (number, CallCounter(1L, 0L, 0L, 0L))
    case (number, _)    => (number, CallCounter(0L, 0L, 0L, 1L))
  }
  .reduceByKey((x, y) => CallCounter(
    x.calls + y.calls, x.smsIn + y.smsIn,
    x.smsOut + y.smsOut, x.other + y.other))

or even combine the map and reduce steps into a single aggregateByKey:

val transformed = partitionedCalls.aggregateByKey(
  scala.collection.mutable.HashMap.empty[String, Long].withDefault(_ => 0L)
)(
  (acc, x) => { acc(x) += 1; acc },
  (acc1, acc2) => { acc2.foreach { case (k, v) => acc1(k) += v }; acc1 }
).mapValues(codes.map(_))

Depending on the context, you should adjust the accumulator class so it better suits your needs. For example, if the number of distinct codes is large, you should consider a linear algebra library such as breeze - see How to sum up every column of a Scala array?
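As a rough illustration of that idea, here is a minimal sketch using breeze, assuming the codes array from the first snippet above (the codeIndex helper and variable names are introduced here purely for illustration):

import breeze.linalg.DenseVector

// One vector slot per callType code; codeIndex maps code -> vector position.
val codeIndex = codes.zipWithIndex.toMap

val vectorCounts = partitionedCalls
  .map { case (number, code) =>
    val v = DenseVector.zeros[Long](codes.length)
    v(codeIndex(code)) = 1L            // mark this record's code
    (number, v)
  }
  .reduceByKey(_ + _)                  // element-wise vector addition per number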

One thing you should definitely avoid is groupBy + map when you really mean reduceByKey: it has to shuffle all the data when all you want is a modified word count.
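To make the difference concrete, here is a small sketch (the slow/fast names are illustrative, not from the original post):

// Anti-pattern: groupBy shuffles every record before counting.
val slow = partitionedCalls.groupBy(identity)
  .map { case (pair, vs) => (pair, vs.size) }

// Preferred: reduceByKey combines partial counts map-side,
// so only per-key sums cross the network.
val fast = partitionedCalls.map((_, 1L)).reduceByKey(_ + _)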

Answer 1: (score: 1)

The resulting code is a bit counter-intuitive, but it does the job.

This code:

object PartitionedCalls {
  def authorVersion(partitionedCalls: Seq[Seq[Long]]) ={
    val smsOutByPhoneNum = partitionedCalls.
      filter{ arry => arry(1) == 30}.
      groupBy { x => x(0) }.
      map(f=> (f._1,f._2.iterator.length))

    val smsInByPhoneNum = partitionedCalls.
      filter{ arry => arry(1) == 31}.
      groupBy { x => x(0) }.
      map(f => (f._1, f._2.iterator.length))

    (smsOutByPhoneNum, smsInByPhoneNum)
  }

  def myVersion(partitionedCalls: Seq[Seq[Long]]) = {
    // Single pass: group by call type first (30 = smsOut, 31 = smsIn),
    // then count records per phone number within each type.
    val smsInOut = partitionedCalls.
      filter{ arry => arry(1) == 30 || arry(1) == 31}.
      groupBy{ _(1) }.
      map { case (num, t) =>
        num -> t.
          groupBy { x => x(0) }.
          map(f=> (f._1,f._2.iterator.length))
      }

    (smsInOut(30), smsInOut(31))
  }
}

passes these tests:

import org.scalatest.FunSuite

class PartitionedCallsTest extends FunSuite {
  val in = Seq(
    Seq(7035076600L, 30L),
    Seq(5081236732L, 31L),
    Seq(5024551234L, 30L),
    Seq(7035076600L, 31L),
    Seq(7035076600L, 30L)
  )

  val out = (Map(7035076600L -> 2L, 5024551234L -> 1L),Map(7035076600L -> 1L, 5081236732L -> 1L))

  test("Author"){
    assert(out == PartitionedCalls.authorVersion(in))
  }

  test("My"){
    assert(out == PartitionedCalls.myVersion(in))
  }
}
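Note that PartitionedCalls above works on plain Scala collections. As a rough sketch, the same single-pass idea on an RDD (assuming a SparkContext sc and the test data in from above) could look like:

// A sketch only: one shuffle, keyed by (number, type).
val counts = sc.parallelize(in)
  .filter(arry => arry(1) == 30 || arry(1) == 31)
  .map(arry => ((arry(0), arry(1)), 1L))
  .reduceByKey(_ + _)
  .collect()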

Answer 2: (score: 1)

Great answer @zero323!

val partitionedCalls = sc.parallelize(Array(
  ("7035076600", "30"), ("5081236732", "31"), ("5024551234", "30"),
  ("7035076600", "31"), ("7035076600", "30")))

// form the pairs ((phoneNumber, code), 1)
val keyPairCounts = partitionedCalls.map((_, 1))

// count per (phoneNumber, code) using reduceByKey, then re-key by phoneNumber
val aggregateCounts = keyPairCounts.reduceByKey(_ + _)
  .map { case ((phNum, inOrOut), cnt) => (phNum, (inOrOut, cnt)) }

// use groupByKey to merge the per-code counts for each phoneNumber
val result = aggregateCounts.groupByKey
  .map(x => (x._1, x._2.toMap.values.toArray))

// collect and print the result
result.map(x => (x._1, x._2.lift(0).getOrElse(0), x._2.lift(1).getOrElse(0)))
  .collect().map(println)

Reference: a good explanation of the difference between groupBy and reduceBy: prefer_reducebykey_over_groupbykey