Mapping RDD to function does not invoke the function

Date: 2018-02-03 08:06:37

Tags: scala apache-spark rdd

I am using the Spark Scala API. In my code, I have an RDD whose elements have the following type:

Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]

I need to process the second element of each tuple in the RDD (perform validations and modify values). I am using the map function to do that:

myRDD.map(line => mappingFunction(line))

Unfortunately, mappingFunction is never invoked. This is the code of the mapping function:

def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
  println("Inside mappingFunction")
  line
}

When my program ends, nothing has been printed to stdout.

To investigate the problem, I wrote a small snippet that does work:

val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))

And the following mapping function was invoked:

def callInt(i: Int): Unit = {
  println("Inside callInt")
}

Please assist in getting the RDD mapping function mappingFunction invoked. Thank you.

1 answer:

Answer 0 (score: 0)

x is a List, and a List is strict, not lazy. That is why callInt is invoked even though you never call an action.

myRDD is an RDD, and RDDs are lazy: your transformations (map, flatMap, filter) are not actually executed until their result is needed.

That means your map function does not run until you perform an action. An action is an operation that triggers execution of the preceding operations (called transformations).

Some examples of actions are collect and count.
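The same strict-versus-lazy difference can be reproduced in plain Scala, without Spark. The sketch below is only an analogy: it uses a collection view as a stand-in for a lazy RDD and toList as a stand-in for an action. The names EagerVsLazy, traced and calls are made up for illustration and are not part of the question's code.

```scala
object EagerVsLazy {
  // Counts how many times the mapped function has actually run.
  var calls = 0

  def traced(i: Int): Int = {
    calls += 1
    i * 2
  }

  def main(args: Array[String]): Unit = {
    val xs = List.range(1, 10) // the nine values 1 to 9

    // Strict: List.map runs the function for every element right away.
    val strict = xs.map(traced)
    println(s"after List.map: calls = $calls") // 9

    calls = 0

    // Lazy: mapping over a view only records the operation.
    val lazyView = xs.view.map(traced)
    println(s"after view.map: calls = $calls") // 0

    // Forcing the view plays the role of a Spark action here.
    val forced = lazyView.toList
    println(s"after toList:   calls = $calls") // 9
  }
}
```

The middle line is the situation in the question: the mapping has been declared, but nothing has asked for its result yet.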

If you do this:

myRDD.map(line => mappingFunction(line)).count()

You'll see your prints. There is nothing wrong with your code; you just need to take the lazy nature of RDDs into account.
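If you want to confirm this without relying on stdout, here is a minimal sketch that counts invocations with an accumulator. It assumes Spark 2.x or later (for sc.longAccumulator) and runs in local mode; the object name LazyMapDemo, the sample data and the accumulator name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyMapDemo {
  // Returns the invocation count seen before and after the action.
  def run(): (Long, Long) = {
    // Local mode is an assumption made so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("LazyMapDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val invocations = sc.longAccumulator("invocations")

      // Transformation only: nothing is executed at this point.
      val mapped = sc.parallelize(Seq("a", "b", "c")).map { s =>
        invocations.add(1)
        s.toUpperCase
      }
      val before = invocations.value.longValue()

      // count() is an action, so the map function runs now.
      mapped.count()
      val after = invocations.value.longValue()

      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"invocations before action: $before, after action: $after")
  }
}
```

In a healthy local run the count should go from 0 to 3, although Spark does not guarantee exactly-once accumulator updates inside transformations if a task is retried. Also keep in mind that the function passed to map runs on the executors, so on a cluster a println inside it ends up in the executor logs rather than in the driver's console.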

There is a good answer about this topic here. You can also find more information, including a full list of transformations and actions, here.