I saw this example in the book “Learning Spark: Lightning-Fast Big Data Analysis”:
import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  // more methods here
  def getMatchesNoReference(rdd: RDD[String]): RDD[Array[String]] = {
    // Safe: extract just the field we need into a local variable
    val query_ = this.query
    rdd.map(x => x.split(query_))
  }
}
My question is about the comment "Safe: extract just the field we need into a local variable". Why is extracting the field into a local variable safer than using the field (defined as a val) itself?
Answer 0 (score: 4)
Passing Functions in Spark (in the Spark programming guide) is really helpful and has the answer to your question. The idea is that you want only the query string to be shipped to the workers that need it, not the whole instance of the class.
If you didn't do it that way (if you used the field in your map() instead of the local variable), then:
...sending the object that contains that class along with the method. In a similar way, accessing fields of the outer object will reference the whole object
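For contrast, the book's companion example shows the unsafe variant (roughly, as a sketch): the bare field reference desugars to this.query, so the closure captures the whole instance:

import org.apache.spark.rdd.RDD

class SearchFunctions(val query: String) {
  // Unsafe: "query" means "this.query", so the closure captures the
  // entire SearchFunctions instance, which must be serialized and
  // shipped to every worker
  def getMatchesFieldReference(rdd: RDD[String]): RDD[Array[String]] =
    rdd.map(x => x.split(query))
}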
Note that this is also safer, not just more efficient, because it minimizes memory usage.
You see, when handling really big data, your job will be pushing against its memory limits, and if it exceeds them it will be killed by the resource manager (YARN, for example). So we want to use as little memory as possible, to make sure the job makes it through and doesn't fail!
Moreover, a big object results in larger communication overhead. The TCP connection may be reset by the peer when the transfer is too big, which incurs unnecessary overhead we want to avoid, since bad communication is another reason for a job to fail.
Answer 1 (score: 2)
Because when you extract it, only query_ has to be serialized and sent to the workers. If you didn't extract it, a complete instance of SearchFunctions would be sent.
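You can get a feel for the difference with plain JVM serialization (the HeavySearch class and serializedSize helper below are hypothetical, just for illustration; Spark serializes the cleaned closure with Java or Kryo serialization, but the proportions are the same):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in: the query string plus a large unrelated field
class HeavySearch(val query: String) extends Serializable {
  val cache: Array[Byte] = new Array[Byte](10 * 1024 * 1024) // 10 MB of baggage
}

object SerializationSizeDemo extends App {
  // How many bytes plain Java serialization produces for an object
  def serializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    buffer.size()
  }

  val hs = new HeavySearch(",")
  println(serializedSize(hs))       // roughly 10 MB: the whole instance
  println(serializedSize(hs.query)) // tens of bytes: just the string
}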
Answer 2 (score: 2)
Since the other answers don't mention it: another reason this is safer is that the class whose field/method you reference may not be serializable at all. Since Java doesn't let you check this at compile time, you will hit a runtime failure. There are plenty of examples of this kind of question on Stack Overflow; the first few I found are Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects, SparkContext not serializable inside a companion object, Enriching SparkContext without incurring in serialization issues, and Spark serialization error. Searching for spark NotSerializableException should give you more, and of course not just on Stack Overflow.
Or it may be serializable now, but a seemingly unrelated change (for example, adding a field the lambda doesn't use) can break the code by making it non-serializable, or significantly degrade your performance.
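A minimal sketch of that failure mode (the Unserializable class is hypothetical): the field-referencing version fails at runtime with SparkException: Task not serializable, while the local-variable version runs fine because only the String is shipped:

import org.apache.spark.rdd.RDD

// Hypothetical: this class does NOT extend Serializable
class Unserializable(val query: String) {
  def broken(rdd: RDD[String]): RDD[Array[String]] =
    // "query" is "this.query": the closure captures `this`, and since
    // Unserializable can't be serialized, Spark throws
    // "Task not serializable" when the closure is checked
    rdd.map(x => x.split(query))

  def fixed(rdd: RDD[String]): RDD[Array[String]] = {
    val q = query // copy into a local: only the String travels
    rdd.map(x => x.split(q))
  }
}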