I am trying to load an RDD from a MySQL database:
package ro.mfl.employees

import org.apache.spark.{SparkConf, SparkContext}
import java.sql.{Connection, DriverManager}
import org.apache.spark.rdd.JdbcRDD

class Loader(sc: SparkContext) {

  Class.forName("com.mysql.jdbc.Driver").newInstance()

  def connection(): Connection = {
    DriverManager.getConnection("jdbc:mysql://localhost/employees", "sakila", "sakila")
  }

  def load(): Unit = {
    val employeesRDD = new JdbcRDD(sc, connection, "select * from employees.employees", 0, 0, 1)
    println(employeesRDD.count())
  }
}

object Test extends App {
  val conf = new SparkConf().setAppName("test")
  val sc = new SparkContext(conf)
  val l = new Loader(sc)
  l.load()
}
When I run this, I get the following error:
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@323a9221)
- field (class: ro.mfl.employees.Loader, name: sc, type: class org.apache.spark.SparkContext)
- object (class ro.mfl.employees.Loader, ro.mfl.employees.Loader@607c6d60)
- field (class: ro.mfl.employees.Loader$$anonfun$1, name: $outer, type: class ro.mfl.employees.Loader)
- object (class ro.mfl.employees.Loader$$anonfun$1, <function0>)
- field (class: org.apache.spark.rdd.JdbcRDD, name: org$apache$spark$rdd$JdbcRDD$$getConnection, type: interface scala.Function0)
- object (class org.apache.spark.rdd.JdbcRDD, JdbcRDD[0] at JdbcRDD at Loader.scala:17)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (JdbcRDD[0] at JdbcRDD at Loader.scala:17,<function2>))
Has anyone run into this problem?

I tried making the Loader class extend java.io.Serializable, but I got the same error, just with org.apache.spark.SparkContext in place of Loader.
Answer 0 (score: 2)
Your problem is that Loader is a class, and it is not serializable. Try changing it to an object, or follow the example given below.
object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@323a9221)
This happens because your Loader is a class and you pass the SparkContext to it when creating a new instance; the connection closure you hand to JdbcRDD then drags the whole Loader instance, including its SparkContext, into serialization.
Follow this example (a simple and elegant way) and it should work:
import org.apache.spark._
import org.apache.spark.rdd.JdbcRDD
import java.sql.{DriverManager, ResultSet}

// an object rather than a class, so there is no enclosing instance to serialize
object LoadSimpleJdbc {
  def main(args: Array[String]) {
    if (args.length < 1) {
      println("Usage: [sparkmaster]")
      sys.exit(1)
    }
    val master = args(0)
    val sc = new SparkContext(master, "LoadSimpleJdbc", System.getenv("SPARK_HOME"))
    val data = new JdbcRDD(sc,
      createConnection, "SELECT * FROM panda WHERE ? <= id AND ID <= ?",
      lowerBound = 1, upperBound = 3, numPartitions = 2, mapRow = extractValues)
    println(data.collect().toList)
  }

  /** createConnection - get the JDBC connection here */
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden")
  }

  /** Maps each ResultSet row to a tuple */
  def extractValues(r: ResultSet) = {
    (r.getInt(1), r.getString(2))
  }
}
In general, try to avoid storing a SparkContext in your classes.
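For instance, here is a minimal sketch of the original Loader restructured along those lines (the two '?' placeholders, the emp_no column, and the bound values are assumptions about the schema, not something from the question): the SparkContext is passed to the method instead of being stored as a field, and the connection factory lives in a companion object, so the closure handed to JdbcRDD captures no reference to a non-serializable outer instance.

import java.sql.{Connection, DriverManager, ResultSet}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

object Loader {
  // Defined on an object, so the function value serializes without
  // dragging along any enclosing instance.
  def createConnection(): Connection = {
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:mysql://localhost/employees", "sakila", "sakila")
  }

  def extractRow(r: ResultSet): (Int, String) = (r.getInt(1), r.getString(2))
}

class Loader {
  // The SparkContext arrives as a method argument and is never stored as a field.
  def load(sc: SparkContext): Unit = {
    // JdbcRDD needs two '?' placeholders for the partition bounds;
    // emp_no and the bound values here are assumptions about the schema.
    val employeesRDD = new JdbcRDD(sc, Loader.createConnection _,
      "SELECT * FROM employees.employees WHERE ? <= emp_no AND emp_no <= ?",
      lowerBound = 1, upperBound = 500000, numPartitions = 4,
      mapRow = Loader.extractRow)
    println(employeesRDD.count())
  }
}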
Also, take a look at Serialization Exception on spark.
You could also try declaring the SparkContext as @transient (some users on SO use this approach).
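A minimal sketch of that @transient idea applied to the original Loader (again, the SQL placeholders, emp_no column, and bounds are assumptions): the class is Serializable, but the SparkContext field is excluded from serialization, so the captured Loader instance no longer pulls the context onto the executors. The field is simply null after deserialization there, which is harmless because connection() never touches it.

import java.sql.{Connection, DriverManager}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

class Loader(@transient private val sc: SparkContext) extends Serializable {

  def connection(): Connection = {
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:mysql://localhost/employees", "sakila", "sakila")
  }

  def load(): Unit = {
    // The closure below captures `this`, which is now serializable because
    // the SparkContext field is marked @transient and is skipped.
    val employeesRDD = new JdbcRDD(sc, connection _,
      "SELECT * FROM employees.employees WHERE ? <= emp_no AND emp_no <= ?",
      lowerBound = 1, upperBound = 500000, numPartitions = 4)
    println(employeesRDD.count())
  }
}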