Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext - when using JdbcRDD in Spark

Date: 2016-09-03 10:05:33

Tags: mysql scala jdbc apache-spark

I am trying to load an RDD from a MySQL database:

package ro.mfl.employees
import org.apache.spark.{SparkConf, SparkContext}
import java.sql.{Connection, DriverManager}

import org.apache.spark.rdd.JdbcRDD

class Loader(sc: SparkContext) {

  Class.forName("com.mysql.jdbc.Driver").newInstance()

  def connection(): Connection = {
    DriverManager.getConnection("jdbc:mysql://localhost/employees", "sakila", "sakila")
  }

  def load(): Unit = {
    val employeesRDD = new JdbcRDD(sc, connection, "select * from employees.employees", 0, 0, 1)
    println(employeesRDD.count())
  }

}

object Test extends App {
  val conf = new SparkConf().setAppName("test")
  val sc = new SparkContext(conf)
  val l = new Loader(sc)
  l.load()
}

When I execute this, I get the following error:

Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@323a9221)
    - field (class: ro.mfl.employees.Loader, name: sc, type: class org.apache.spark.SparkContext)
    - object (class ro.mfl.employees.Loader, ro.mfl.employees.Loader@607c6d60)
    - field (class: ro.mfl.employees.Loader$$anonfun$1, name: $outer, type: class ro.mfl.employees.Loader)
    - object (class ro.mfl.employees.Loader$$anonfun$1, <function0>)
    - field (class: org.apache.spark.rdd.JdbcRDD, name: org$apache$spark$rdd$JdbcRDD$$getConnection, type: interface scala.Function0)
    - object (class org.apache.spark.rdd.JdbcRDD, JdbcRDD[0] at JdbcRDD at Loader.scala:17)
    - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
    - object (class scala.Tuple2, (JdbcRDD[0] at JdbcRDD at Loader.scala:17,<function2>))

Has anyone encountered this problem?

I tried making the Loader class extend java.io.Serializable, but I got the same error, except with org.apache.spark.SparkContext in place of Loader.

1 Answer:

Answer 0 (score: 2)

Problem:

Your problem is that the Loader class is not serializable.

Try changing it to an object, or follow the example given below.

object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@323a9221)

This happens because Loader is a class and you pass the SparkContext into it when you create an instance. The connection method you hand to JdbcRDD becomes a closure that captures the enclosing Loader instance, and with it the non-serializable SparkContext field.

Follow this example instead (a simple and clean approach); note that the query contains two ? placeholders, which JdbcRDD fills with each partition's bounds. This should work:

import org.apache.spark._
import org.apache.spark.rdd.JdbcRDD
import java.sql.{DriverManager, ResultSet}
// a standalone object, not a class that holds the SparkContext
object LoadSimpleJdbc {
  def main(args: Array[String]) {
    if (args.length < 1) {
      println("Usage: [sparkmaster]")
      sys.exit(1)
    }
    val master = args(0)
    val sc = new SparkContext(master, "LoadSimpleJdbc", System.getenv("SPARK_HOME"))
    val data = new JdbcRDD(sc,
      createConnection, "SELECT * FROM panda WHERE ? <= id AND ID <= ?",
      lowerBound = 1, upperBound = 3, numPartitions = 2, mapRow = extractValues)
    println(data.collect().toList)
  }
/** createConnection - Get connection here **/
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    DriverManager.getConnection("jdbc:mysql://localhost/test?user=holden");
  }
/** This returns tuple **/
  def extractValues(r: ResultSet) = {
    (r.getInt(1), r.getString(2))
  }
}

In general, try to avoid storing the SparkContext inside your classes.
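If you do need to keep a Loader class, here is a minimal sketch of one way to follow that advice; the companion-object connection factory, the emp_no column, and the bounds are my assumptions, not part of the original answer:

import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

object Loader {
  // Lives in the companion object, so passing it to JdbcRDD captures no
  // Loader instance (and therefore no SparkContext).
  def createConnection(): Connection = {
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:mysql://localhost/employees", "sakila", "sakila")
  }
}

class Loader(sc: SparkContext) {
  def load(): Unit = {
    // JdbcRDD expects two ? placeholders that it fills with each partition's bounds.
    val employeesRDD = new JdbcRDD(sc,
      Loader.createConnection _,
      "SELECT * FROM employees.employees WHERE ? <= emp_no AND emp_no <= ?",
      lowerBound = 10001, upperBound = 20000, numPartitions = 2,
      mapRow = (r: ResultSet) => r.getInt(1))
    println(employeesRDD.count())
  }
}

The SparkContext is still used on the driver to build the RDD, but neither the connection factory nor the mapRow function references the Loader instance, so nothing non-serializable ends up in the task closure.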

Also, take a look at Serialization Exception on spark.

Try declaring the SparkContext as @transient (some users on SO use this approach).
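For completeness, a minimal sketch of that @transient hint applied to the original Loader; this is my reading of the suggestion, not code from the answer, and the bounded query mirrors the assumptions in the sketch above. It only works because connection() never touches sc: the field is skipped during serialization and will be null inside tasks.

import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

class Loader(@transient val sc: SparkContext) extends Serializable {

  // Safe to call inside a task: it never uses sc.
  def connection(): Connection = {
    Class.forName("com.mysql.jdbc.Driver").newInstance()
    DriverManager.getConnection("jdbc:mysql://localhost/employees", "sakila", "sakila")
  }

  def load(): Unit = {
    // The closure still captures this Loader instance, but Loader is now
    // Serializable and the @transient SparkContext is left out of it.
    val employeesRDD = new JdbcRDD(sc, connection,
      "SELECT * FROM employees.employees WHERE ? <= emp_no AND emp_no <= ?",
      lowerBound = 10001, upperBound = 20000, numPartitions = 2,
      mapRow = (r: ResultSet) => r.getInt(1))
    println(employeesRDD.count())
  }
}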