I have a list of (callType, phoneNumber) in a CSV file, and I want to count the number of smsIn/smsOut for every phone number. callType indicates the type (smsOut, call, smsIn).

An example of the data (phoneNumber, callType) is:

7035076600, 30
5081236732, 31
5024551234, 30
7035076600, 31
7035076600, 30

Ultimately, I want output along the lines of (phoneNum, numSMSIn, numSMSOut).

I have implemented something like this:
val smsOutByPhoneNum = partitionedCalls.
  filter { arry => arry(2) == 30 }.
  groupBy { x => x(1) }.
  map(f => (f._1, f._2.iterator.length)).
  collect()
The above gives the number of smsOut for each phone number. A similar pass, with the filter changed to 31, gives the number of smsIn for each phone number.

Is there a way I can do this in a single pass over the data instead of two?
Answer 0 (score: 2)
There are multiple ways you can solve this problem. A naive approach is to aggregate by (number, type) tuples and then group the partial results:
val partitionedCalls = sc.parallelize(Array(
("7035076600", "30"), ("5081236732", "31"), ("5024551234", "30"),
("7035076600", "31"), ("7035076600", "30")))
val codes = partitionedCalls.values.distinct.sortBy(identity).collect
val aggregated = partitionedCalls.map((_, 1L)).reduceByKey(_ + _)
.map{case ((number, code), cnt) => (number, (code, cnt))}
.groupByKey
.mapValues(vs => {
codes.map(vs.toMap.getOrElse(_, 0))
})
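For reference, collecting aggregated on the sample data gives one count vector per number, in the order of codes. This is just a quick inspection sketch against the definitions above (row order may differ):

// Assumes codes and aggregated from the snippet above, e.g. in spark-shell.
aggregated.collect().foreach { case (number, counts) =>
  println(s"$number -> ${counts.mkString("[", ", ", "]")}")
}
// With codes = Array("30", "31") this prints, in some order:
// 7035076600 -> [2, 1]
// 5081236732 -> [0, 1]
// 5024551234 -> [1, 0]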
You can also map and reduceByKey with some structure that can capture all the counts:
case class CallCounter(calls: Long, smsIn: Long, smsOut: Long, other: Long)
partitionedCalls
  .map {
    case (number, "30") => (number, CallCounter(0L, 1L, 0L, 0L))
    case (number, "31") => (number, CallCounter(0L, 0L, 1L, 0L))
    case (number, "32") => (number, CallCounter(1L, 0L, 0L, 0L))
    case (number, _)    => (number, CallCounter(0L, 0L, 0L, 1L))
  }
  .reduceByKey((x, y) => CallCounter(
    x.calls + y.calls, x.smsIn + y.smsIn,
    x.smsOut + y.smsOut, x.other + y.other))
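With the sample data this reduces to one CallCounter per number. A possible inspection step (not part of the original answer; it assumes the pipeline above has been assigned to a val, here called counted):

// Hypothetical name for illustration: counted is the result of the
// map + reduceByKey pipeline shown above.
counted.collect().foreach { case (number, c) =>
  println(s"$number: calls=${c.calls}, smsIn=${c.smsIn}, smsOut=${c.smsOut}, other=${c.other}")
}
// Expected for the sample data (row order may vary):
// 7035076600: calls=0, smsIn=2, smsOut=1, other=0
// 5081236732: calls=0, smsIn=0, smsOut=1, other=0
// 5024551234: calls=0, smsIn=1, smsOut=0, other=0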
Or even combine the map and reduce steps into a single aggregateByKey:
val transformed = partitionedCalls.aggregateByKey(
  scala.collection.mutable.HashMap.empty[String, Long].withDefault(_ => 0L)
)(
  (acc, x) => { acc(x) += 1; acc },
  (acc1, acc2) => { acc2.foreach { case (k, v) => acc1(k) += v }; acc1 }
).mapValues(codes.map(_))
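One possible last step to get the (phoneNum, numSMSIn, numSMSOut) shape the question asks for (a sketch only; it assumes codes = Array("30", "31") as collected above, so counts(0) and counts(1) are the "30" and "31" counts respectively):

transformed
  .map { case (number, counts) => (number, counts(0), counts(1)) }
  .collect()
  .foreach { case (number, c30, c31) => println(s"$number, $c30, $c31") }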
Depending on the context, you should adjust the accumulator class so it better reflects your needs. For example, if the number of classes is large, you should consider using a linear algebra library like breeze - see How to sum up every column of a Scala array?

One thing you should definitely avoid is groupBy + map when what you really mean is reduceByKey. It has to shuffle all the data when all you want is a modified word count.
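To make that contrast concrete, here is a minimal sketch of the two shapes, using the same partitionedCalls RDD defined above (illustrative only):

// Discouraged: groupBy materialises every record for a key before anything
// is counted, so all rows are shuffled.
val viaGroupBy = partitionedCalls.groupBy(identity).map { case (k, vs) => (k, vs.size) }

// Preferred: reduceByKey combines partial counts map-side, so only the
// per-partition sums are shuffled.
val viaReduceByKey = partitionedCalls.map((_, 1L)).reduceByKey(_ + _)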
Answer 1 (score: 1)
The resulting code is somewhat counter-intuitive, but here it is. This code:
object PartitionedCalls {
  def authorVersion(partitionedCalls: Seq[Seq[Long]]) = {
    val smsOutByPhoneNum = partitionedCalls.
      filter { arry => arry(1) == 30 }.
      groupBy { x => x(0) }.
      map(f => (f._1, f._2.iterator.length))

    val smsInByPhoneNum = partitionedCalls.
      filter { arry => arry(1) == 31 }.
      groupBy { x => x(0) }.
      map(f => (f._1, f._2.iterator.length))

    (smsOutByPhoneNum, smsInByPhoneNum)
  }

  def myVersion(partitionedCalls: Seq[Seq[Long]]) = {
    val smsInOut = partitionedCalls.
      filter { arry => arry(1) == 30 || arry(1) == 31 }.
      groupBy { _(1) }.            // group by call type first (30 or 31)
      map { case (num, t) =>       // num is the call type here, t the rows of that type
        num -> t.
          groupBy { x => x(0) }.   // then group by phone number within each type
          map(f => (f._1, f._2.iterator.length))
      }

    (smsInOut(30), smsInOut(31))
  }
}
passes these tests:
import org.scalatest.FunSuite

class PartitionedCallsTest extends FunSuite {
  val in = Seq(
    Seq(7035076600L, 30L),
    Seq(5081236732L, 31L),
    Seq(5024551234L, 30L),
    Seq(7035076600L, 31L),
    Seq(7035076600L, 30L)
  )

  val out = (
    Map(7035076600L -> 2L, 5024551234L -> 1L),
    Map(7035076600L -> 1L, 5081236732L -> 1L)
  )

  test("Author") {
    assert(out == PartitionedCalls.authorVersion(in))
  }

  test("My") {
    assert(out == PartitionedCalls.myVersion(in))
  }
}
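For a quick check outside a test harness, the same comparison can be run as plain Scala (a sketch; no Spark is needed since both versions operate on Seq):

// Standalone sanity check of myVersion against the sample data.
val sample = Seq(
  Seq(7035076600L, 30L), Seq(5081236732L, 31L), Seq(5024551234L, 30L),
  Seq(7035076600L, 31L), Seq(7035076600L, 30L))
val (by30, by31) = PartitionedCalls.myVersion(sample)
println(by30)  // Map(7035076600 -> 2, 5024551234 -> 1), key order may differ
println(by31)  // Map(7035076600 -> 1, 5081236732 -> 1), key order may differ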
Answer 2 (score: 1)
Great answer @zero323.
val partitionedCalls = sc.parallelize(Array(("7035076600", "30"),
("5081236732", "31"), ("5024551234", "30"),("7035076600", "31"),
("7035076600", "30")))
# count the pairs <(phoneNumber, code), count>
val keyPairCounts = partitionedCalls.map((_,1))
# using reduceByKey
val aggregateCounts = keyPairCounts.reduceByKey(_ + _).map{ case((phNum,
inOrOut), cnt) => (phNum, (inOrOut, cnt)) }
# using groupBy to aggregate and merge similar keys
val result = aggregateCounts.groupByKey.map(x => (x._1,
x._2.toMap.values.toArray))
# collect the result
result.map(x => (x._1, x._2.lift(0).getOrElse(0),
x._2.lift(1).getOrElse(0))).collect().map(println)
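If a fixed column order matters (always number, then the "30" count, then the "31" count), one variation of the last two steps (just a sketch, reusing aggregateCounts from above) is to look each code up explicitly instead of relying on the iteration order of toMap.values:

val fixedOrder = aggregateCounts.groupByKey
  .mapValues(_.toMap)
  .map { case (phNum, byCode) =>
    (phNum, byCode.getOrElse("30", 0), byCode.getOrElse("31", 0))
  }
fixedOrder.collect().foreach(println)
// e.g. (7035076600,2,1), (5081236732,0,1), (5024551234,1,0)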
Reference: a good explanation of the difference between groupByKey and reduceByKey: prefer_reducebykey_over_groupbykey