Why is extracting an argument to a local variable in Spark considered safer?

Date: 2016-08-31 18:35:20

Tags: scala function apache-spark distributed-computing bigdata

I saw this example in the book “Learning Spark: Lightning-Fast Big Data Analysis”:

import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  // more methods here
  def getMatchesNoReference(rdd: RDD[String]): RDD[String] = {
    // Safe: extract just the field we need into a local variable
    val query_ = this.query
    rdd.map(x => x.split(query_))
  }
}

My question is about the comment: Safe: extract just the field we need into a local variable

Why is extracting to a local variable safer than using the field (which is defined as a val) itself?

3 Answers:

Answer 0 (score: 4)

The documentation section Passing Functions in Spark is really helpful here and answers your question.

The idea is that you want only the query to be communicated to the workers that need it, and not the whole object (of the class).

If you didn't do it that way (that is, if you used the field inside your map() instead of the local variable), then, quoting the docs, it would require:

...sending the object that contains that class along with the method. In a similar way, accessing fields of the outer object will reference the whole object


Note that this is not just more efficient but also safer, because it minimizes memory usage.

When handling really big data, your job will run up against its memory limits, and if it exceeds them it will be killed by the resource manager (for example, YARN). We therefore want to use as little memory as possible, so that the job succeeds rather than fails.

Moreover, a big object incurs larger communication overhead. When too much data is sent, the TCP connection may be reset by the peer, adding unnecessary overhead that we want to avoid, since bad communication is another reason for a job to fail.
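The capture behavior behind this can be demonstrated without a cluster, since Spark ships closures to executors via Java serialization. Below is a minimal sketch, assuming Scala 2.12+ (where lambdas are serializable); the `trySerialize` helper and method names are illustrative, not from the book, and the snippet is written as a script with top-level statements:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for the book's class; deliberately NOT Serializable.
class SearchFunctions(val query: String) {
  // Unsafe: the lambda reads this.query, so it closes over `this`.
  def unsafeMatcher: String => Array[String] = x => x.split(query)

  // Safe: copy the field into a local; the lambda captures only a String.
  def safeMatcher: String => Array[String] = {
    val query_ = this.query
    x => x.split(query_)
  }
}

// Returns true if the object survives Java serialization — roughly what
// Spark does to a closure before shipping it to the workers.
def trySerialize(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

val sf = new SearchFunctions(",")
println(trySerialize(sf.unsafeMatcher)) // false: drags the whole SearchFunctions along
println(trySerialize(sf.safeMatcher))   // true: only the query string is captured
```

The unsafe version fails precisely because serializing the lambda means serializing the entire enclosing object it captured.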

Answer 1 (score: 2)

Because when you extract it, only query_ has to be serialized and sent to the workers.

If you didn't extract, a complete instance of SearchFunctions would be sent.

Answer 2 (score: 2)

Since the other answers don't mention it: another reason it is safer is that the class whose field/method you reference may not be serializable. Java doesn't allow checking this at compile time, so you will hit a runtime failure. There are plenty of examples of this kind of question on Stack Overflow; the first few I found are Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects, SparkContext not serializable inside a companion object, Enriching SparkContext without incurring in serialization issues, and Spark serialization error. Searching for spark NotSerializableException will give you more, and certainly not only on Stack Overflow.

Or it may be serializable now, but a seemingly unrelated change (for example, adding a field the lambda doesn't use) could break your code by making it non-serializable, or could significantly degrade your performance.
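That fragility is easy to reproduce: make the enclosing class Serializable so the this-capturing closure ships fine, then add one field the lambda never touches. A sketch under the same assumptions as above (Scala 2.12+, illustrative class and field names, script-style top-level code):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Serializable, so a closure over this.query would ship fine...
// ...until someone adds `session`, which the lambdas never use.
class Matchers(val query: String) extends Serializable {
  val session = new Object() // NOT Serializable; unrelated to the lambdas below

  // Captures `this`, so serializing it now fails on the session field.
  def viaField: String => Boolean = x => x.contains(query)

  // The local copy insulates the closure from the rest of the object.
  def viaLocal: String => Boolean = {
    val q = query
    x => x.contains(q)
  }
}

// Returns true if the object survives Java serialization.
def trySerialize(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

val m = new Matchers("spark")
println(trySerialize(m.viaField)) // false: serializing `this` pulls in the unused field
println(trySerialize(m.viaLocal)) // true: only the local String copy is captured
```

Nobody touched the lambda, yet the field-referencing version broke; the local-variable version keeps working regardless of what else is added to the class.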