As shown below:
Step 1: Group the calls using groupBy
//Now group the calls by the s_msisdn for call type 1
//grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, (Array[String], String))])]
val groupedCallsToProcess = callsToProcess.groupBy(_._1)
Step 2: The grouped calls are mapped.
//map to the second element of each pair, which is the list of call objects
//mapOfCalls: org.apache.spark.rdd.RDD[List[(String, (Array[String], String))]]
val mapOfCalls = groupedCallsToProcess.map(f => f._2.toList)
Step 3: Map to Row objects, so that each row holds the call object fields together with its msisdn key.
val listOfMappedCalls = mapOfCalls.map(f => f.map(_._2).map(c =>
  Row(
    c._1(CallCols.call_date_hour),
    c._1(CallCols.sw_id),
    c._1(CallCols.s_imsi),
    f.map(_._1).take(1).mkString
  )
))
The 3rd step shown above seems to take a very long time when the data size is very large. I am wondering if there is any way to make step 3 more efficient. I'd really appreciate any help with this.
Answer 0 (score: 2)
There are several things in your code that are very costly and that you actually don't need:
- groupBy in the first step. groupBy is very costly in Spark.
- toList is very costly, with a lot of GC overhead.
- map is a linear operation, proportional to the cost of the map function.
- f.map(_._1).take(1): you are transforming the whole list but using only 1 (or 5) element(s). Instead do f.take(5).map(_._1), or if you need only 1, f.head._1.
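To make the last point concrete, here is a minimal sketch on a plain Scala list (the data below is made up purely for illustration):

// illustrative data shaped like one grouped value: (msisdn, (callFields, otherField))
val f = List(
  ("9912345", (Array("2017-01-01 10", "SW1", "IMSI1"), "a")),
  ("9912345", (Array("2017-01-01 11", "SW1", "IMSI1"), "b"))
)

// wasteful: maps every element, then keeps only the first result
val k1 = f.map(_._1).take(1).mkString

// cheaper: keep only the first element, then map it
val k2 = f.take(1).map(_._1).mkString

// cheapest when exactly one key is needed
val k3 = f.head._1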
Before discussing how you can write this code without groupBy in a different way, let's first fix this code.
// you had this at the start
val callsToProcess: RDD[(String, (Array[String], String))] = ....

// RDD[(String, Iterable[(String, (Array[String], String))])]
val groupedCallsToProcess = callsToProcess
  .groupBy(_._1)

// skip the second step
val listOfMappedCalls = groupedCallsToProcess
  .map({ case (key, iter) => {
    // this is what you did:
    // val iterHeadString = iter.head._1
    // but the 1st element in each tuple of iter is actually the same as key, so
    val iterHeadString = key
    // or we could remove iterHeadString entirely and just use key
    iter.map({ case (str1, (arr, str2)) => Row(
      arr(CallCols.call_date_hour),
      arr(CallCols.sw_id),
      arr(CallCols.s_imsi),
      iterHeadString
    ) })
  } })
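A small aside (not part of the original answer): this fixed version still produces an RDD[Iterable[Row]], one collection of rows per key. If a flat RDD[Row] is what you ultimately want, you could flatten it afterwards, though the shuffle from groupBy is still paid:

// assumes the listOfMappedCalls from the snippet above: RDD[Iterable[Row]]
val flatRows = listOfMappedCalls.flatMap(identity)

As the rest of the answer shows, the grouping can be skipped entirely when a flat RDD[Row] is all you need.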
But... like I said, groupBy is very costly in Spark, and you already had an RDD[(key, value)] in your callsToProcess, so we can just use aggregateByKey directly. Also, you may notice that your groupBy is not useful for anything other than putting all those rows inside a list instead of directly inside an RDD.
// you had this at the start
val callsToProcess: RDD[(String, (Array[String], String))] = ....

// now let's map it to look like what you needed, because we can
// totally do this without any grouping
// I somehow believe that you needed this RDD[Row] and not RDD[List[Row]]
// RDD[Row]
val mapped = callsToProcess
  .map({ case (key, (arr, str)) => Row(
    arr(CallCols.call_date_hour),
    arr(CallCols.sw_id),
    arr(CallCols.s_imsi),
    key
  ) })

// Though I cannot think of any reason for wanting this...
// but if you really needed that RDD[List[Row]] thing,
// then keep the keys with your rows
// RDD[(String, Row)]
val mappedWithKey = callsToProcess
  .map({ case (key, (arr, str)) => (key, Row(
    arr(CallCols.call_date_hour),
    arr(CallCols.sw_id),
    arr(CallCols.s_imsi),
    key
  )) })

// now aggregate by the key to create your lists
// RDD[(String, List[Row])]
val yourStrangeRDD = mappedWithKey
  .aggregateByKey(List[Row]())(
    (list, row) => row +: list, // prepend, do not append
    (list1, list2) => list1 ++ list2
  )
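For completeness, here is a minimal, self-contained sketch of the two variants above. The sample data, the local SparkContext setup, and the simplified CallCols indices are all assumptions for illustration; the real schema is not shown in the question.

import org.apache.spark.sql.Row
import org.apache.spark.{SparkConf, SparkContext}

object CallsExample {
  // assumed column positions; the real CallCols is not shown in the question
  object CallCols {
    val call_date_hour = 0
    val sw_id = 1
    val s_imsi = 2
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("calls-example").setMaster("local[*]"))

    // made-up sample data shaped like callsToProcess: RDD[(String, (Array[String], String))]
    val callsToProcess = sc.parallelize(Seq(
      ("9912345", (Array("2017-01-01 10", "SW1", "IMSI1"), "other")),
      ("9912345", (Array("2017-01-01 11", "SW1", "IMSI1"), "other")),
      ("9954321", (Array("2017-01-01 12", "SW2", "IMSI2"), "other"))
    ))

    // variant 1: flat RDD[Row], no shuffle at all
    val mapped = callsToProcess.map { case (key, (arr, _)) =>
      Row(arr(CallCols.call_date_hour), arr(CallCols.sw_id), arr(CallCols.s_imsi), key)
    }

    // variant 2: RDD[(String, List[Row])] via aggregateByKey, if per-key lists are really needed
    val grouped = callsToProcess
      .map { case (key, (arr, _)) =>
        (key, Row(arr(CallCols.call_date_hour), arr(CallCols.sw_id), arr(CallCols.s_imsi), key))
      }
      .aggregateByKey(List[Row]())(
        (list, row) => row +: list, // O(1) prepend per element
        (list1, list2) => list1 ++ list2
      )

    mapped.collect().foreach(println)
    grouped.collect().foreach(println)

    sc.stop()
  }
}

Prepending with +: rather than appending keeps each per-partition step constant-time on a List, which is why the answer does it that way.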