I am using Scala Spark API. In my code, I have an RDD of the following structure:
Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]
I need to process (perform validations and modify values) the second element of the RDD. I am using the map function to do that:
myRDD.map(line => mappingFunction(line))
Unfortunately, mappingFunction is not invoked. This is the code of the mapping function:
def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
  println("Inside mappingFunction")
  line
}
When my program ends, there are no printed messages in the stdout.
In order to investigate the problem, I implemented a code snippet that worked:
val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))
And the following mapping function was invoked:
def callInt(i: Int): Unit = {
  println("Inside callInt")
}
Please assist in getting the RDD mapping function mappingFunction invoked. Thank you.
Answer 0 (score: 0)
x is a List, so there is no laziness involved. That is why callInt is invoked even though you never call an action.
myRDD is an RDD, and RDDs are lazy: transformations (map, flatMap, filter) are not actually executed until their result is needed.
That means your map function does not run until you perform an action. An action is an operation that triggers execution of the preceding operations (the transformations).
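The strict-versus-lazy difference described above can be reproduced without Spark. The following is only an analogy, not how RDDs are implemented: it uses a plain Scala Iterator to stand in for a lazy transformation, and the names LazinessDemo, callCounts and traced are made up for this illustration.

```scala
object LazinessDemo {
  // Returns how many times the mapping function had run at three points:
  // (after List.map, after Iterator.map, after consuming the iterator).
  def callCounts(): (Int, Int, Int) = {
    var calls = 0
    def traced(i: Int): Int = { calls += 1; i }

    // List.map is strict: traced runs for every element immediately.
    val eager = List(1, 2, 3).map(i => traced(i))
    val afterEager = calls // 3

    // Iterator.map is lazy: it only records what to do, nothing runs yet.
    val pending = List(1, 2, 3).iterator.map(i => traced(i))
    val afterLazy = calls // still 3

    // Consuming the iterator plays the role of a Spark action.
    val forced = pending.toList
    val afterForced = calls // 6

    (afterEager, afterLazy, afterForced)
  }

  def main(args: Array[String]): Unit =
    println(callCounts()) // prints (3,3,6)
}
```

An RDD's map behaves like the middle step: the function is registered but does not run until something asks for the result.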
Some examples of actions are collect and count.
If you do this:
myRDD.map(line => mappingFunction(line)).count()
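You'll see your prints (in local mode; on a cluster, println output from a map function goes to the executors' stdout rather than the driver's). There is no problem with your code at all. You just need to take the lazy nature of RDDs into account.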
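A minimal, self-contained sketch of the same behaviour follows, assuming Spark is on the classpath and run in local mode. The object name RddLazinessDemo, the simplified (String, Int) element type and the sample data are stand-ins for the question's RDD. The accumulator is an addition that is not part of the original code: it is there only so the number of calls can be read on the driver instead of relying on println output.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddLazinessDemo {
  // Simplified stand-in for the question's mappingFunction.
  def mappingFunction(line: (String, Int)): (String, Int) = {
    println("Inside mappingFunction")
    line
  }

  // Returns (calls seen before the action, calls seen after it, count result).
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setAppName("rdd-laziness-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Accumulator used only to observe invocations from the driver.
      val calls = sc.longAccumulator("mappingFunction calls")
      val myRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

      // Transformation only: nothing is executed or printed here.
      val mapped = myRDD.map { line => calls.add(1); mappingFunction(line) }
      val before = calls.sum

      // Action: triggers the map, so mappingFunction runs once per element.
      val n = mapped.count()
      val after = calls.sum

      (before, after, n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after, n) = run()
    println(s"calls before action: $before, after action: $after, count: $n")
  }
}
```

If the mapped RDD is needed by more than one action, each action recomputes the map unless the RDD is cached, so the function may run more than once per element.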
There is a good answer about this topic here. You can also find more information and a full list of transformations and actions here.