Spark map creation takes a very long time

Asked: 2016-08-31 17:04:11

Tags: scala apache-spark

As shown below,

Step 1: Group the calls using groupBy

//Now group the calls by the s_msisdn for call type 1
//grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String, (Array[String], String))])] 
val groupedCallsToProcess = callsToProcess.groupBy(_._1)

Step 2: Map the grouped calls.

//create a Map of the second element in the RDD, which is the callObject
//mapOfCalls: org.apache.spark.rdd.RDD[List[(String, (Array[String], String))]]

val mapOfCalls = groupedCallsToProcess.map(f => f._2.toList)

Step 3: Map to Row objects, pairing each CallsObject's fields with its msisdn.

val listOfMappedCalls = mapOfCalls.map(f => f.map(_._2).map(c => 
  Row(
      c._1(CallCols.call_date_hour),
      c._1(CallCols.sw_id),   
      c._1(CallCols.s_imsi),
      f.map(_._1).take(1).mkString
    )
  ))

Step 3, as shown above, seems to take a very long time when the data size is very large. Is there any way to make step 3 more efficient? I would really appreciate any help with this.

1 Answer:

Answer 0 (score: 2):

There are several things in your code that are very costly and that you actually don't need:

  1. You do not need the groupBy in the first step. groupBy is very costly in Spark.
  2. The whole second step is useless. toList is very costly and adds a lot of GC overhead.
  3. Remove the extra map in the third step. Every map is a linear pass over the data, and its cost scales with the work done in the map function.
  4. Never do something like f.map(_._1).take(1). You are transforming the whole list but using only 1 (or, say, 5) of its elements. Instead do f.take(5).map(_._1), and if you need only 1, use f.head._1 (see the sketch just below this list).
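
For point 4, here is a minimal, self-contained sketch (the sample values below are made up purely for illustration; only the take-before-map ordering matters):

// Hypothetical group shaped like one group from the question: a list of
// (msisdn, (callFields, callType)) tuples. All values are invented.
val f: List[(String, (Array[String], String))] = List(
  ("447700900001", (Array("2016-08-31 17:00", "SW1", "IMSI1"), "1")),
  ("447700900001", (Array("2016-08-31 17:05", "SW1", "IMSI1"), "1"))
)

// Costly: maps every element of the list, then discards all but one.
val slow = f.map(_._1).take(1).mkString

// Cheaper: take the element(s) you need first, then map only those.
val fast = f.take(1).map(_._1).mkString

// Cheapest when exactly one element is needed (assumes f is non-empty).
val fastest = f.head._1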

Before discussing how you can write this code in a different way without groupBy, let's first fix this code.

// you had this in start
val callsToProcess: RDD[(String, (Array[String], String))] = ....

// RDD[(String, Iterable[(String, (Array[String], String))])]
val groupedCallsToProcess = callsToProcess
  .groupBy(_._1)

// skip the second step

val listOfMappedCalls = groupedCallsToProcess
  .map({ case (key, iter) => {
    // this is what you did
    // val iterHeadString = iter.head._1
    // but the 1st element in each tuple of iter is actually the same as key
    // so
    val iterHeadString = key
    // or we can totally remove this iterHeadString and use key
    iter.map({ case (str1, (arr, str2)) => Row(
      arr(CallCols.call_date_hour),
      arr(CallCols.sw_id),   
      arr(CallCols.s_imsi),
      iterHeadString
    ) })
  } })      

But, like I said, groupBy is very costly in Spark. And you already have an RDD[(key, value)] in callsToProcess, so we can just use aggregateByKey directly. You may also notice that your groupBy is not useful for anything other than putting all those rows inside a list instead of directly inside an RDD.

// you had this in start
val callsToProcess: RDD[(String, (Array[String], String))] = ....

// now let's map it to look like what you needed, because we can
// totally do this without any grouping
// I somehow believe that you needed this RDD[Row] and not RDD[List[Row]]
// RDD[Row]
val mapped = callsToProcess
  .map({ case (key, (arr, str)) => Row(
      arr(CallCols.call_date_hour),
      arr(CallCols.sw_id),   
      arr(CallCols.s_imsi),
      key
  ) })
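
If the reason for building Row objects here is to create a DataFrame (an assumption on my part; the question does not say), the usual next step would look roughly like this. The schema field names and the sqlContext value are placeholders:

// Sketch only: field names are assumed, matching the CallCols indices
// used above; adjust the types if those columns are not strings.
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val schema = StructType(Seq(
  StructField("call_date_hour", StringType),
  StructField("sw_id", StringType),
  StructField("s_imsi", StringType),
  StructField("s_msisdn", StringType)
))

// sqlContext (or a SparkSession) is assumed to be in scope
val callsDF = sqlContext.createDataFrame(mapped, schema)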


// Though I cannot think of any reason for wanting this
// But if you really needed that RDD[List[Row]] thing...
// then keep the keys with your rows
// RDD[(String, Row)]
val mappedWithKey = callsToProcess
  .map({ case (key, (arr, str)) => (key, Row(
      arr(CallCols.call_date_hour),
      arr(CallCols.sw_id),   
      arr(CallCols.s_imsi),
      key
  )) })

// now aggregate by the key to create your lists
// RDD[(String, List[Row])]
val yourStrangeRDD = mappedWithKey
  .aggregateByKey(List[Row]())(
    (list, row) => row +: list, // prepend, do not append
    (list1, list2) => list1 ++ list2
  )
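
And if at the end you really only want the lists without their keys, that is one more line on top of the code above (a small sketch using the pair RDD built just above):

// drop the keys, keeping only the aggregated lists
// RDD[List[Row]]
val listsOnly = yourStrangeRDD.values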