Spark JdbcRDD: strange Task not serializable exception

Posted: 2015-04-20 08:09:40

Tags: java jdbc apache-spark

I'm getting a Task not serializable exception from this simple JdbcRDD snippet on the latest Spark 1.3.1:

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.JdbcRDD;

import scala.reflect.ClassManifestFactory$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class SparkDriverApp implements Serializable {
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        SparkConf conf = new SparkConf();
        conf.setAppName("com.example.testapp");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Anonymous function that supplies the JDBC connection on each executor.
        new JdbcRDD<>(sc.sc(), new AbstractFunction0<Connection>() {
            @Override
            public Connection apply() {
                try {
                    Class.forName("com.mysql.jdbc.Driver");
                    return DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "root", "yetAnotherMyPassword");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }, "SELECT document_id, name, content FROM document WHERE document_id >= ? AND document_id <= ?",
                10001, 499999, 10, new AbstractFunction1<ResultSet, Object[]>() {
            @Override
            public Object[] apply(ResultSet resultSet) {
                return JdbcRDD.resultSetToObjectArray(resultSet);
            }
        }, ClassManifestFactory$.MODULE$.fromClass(Object[].class)).collect();
    }
}

Exception stack trace:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1460)
        ...
        Caused by: java.io.NotSerializableException: com.example.testapp.SparkDriverApp$1
Serialization stack:
        - object not serializable (class: com.example.testapp.SparkDriverApp$1, value: <function0>)
        - field (class: org.apache.spark.rdd.RDD, name: checkpointData, type: class scala.Option)
        - object (class org.apache.spark.rdd.JdbcRDD, JdbcRDD[0] at JdbcRDD at SparkDriverApp.java:44)
        - field (class: org.apache.spark.rdd.RDD$$anonfun$17, name: $outer, type: class org.apache.spark.rdd.RDD)
        - object (class org.apache.spark.rdd.RDD$$anonfun$17, <function1>)
        - field (class: org.apache.spark.SparkContext$$anonfun$runJob$5, name: func$1, type: interface scala.Function1)
        - object (class org.apache.spark.SparkContext$$anonfun$runJob$5, <function2>)
        ...
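For what it's worth, the serialization stack already names the culprit: SparkDriverApp$1 is the anonymous AbstractFunction0 subclass, and scala.runtime.AbstractFunction0 does not implement java.io.Serializable (marking the enclosing class Serializable doesn't help, since the RDD captures the function object itself). Below is a sketch, untested here, of one common workaround: the Java-friendly JdbcRDD.create factory added in Spark 1.3, whose ConnectionFactory interface already extends Serializable. The class name SparkDriverAppFixed is illustrative; connection details are copied from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

public class SparkDriverAppFixed {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("com.example.testapp"));

        // JdbcRDD.ConnectionFactory extends Serializable, so this anonymous
        // class can be shipped to executors, unlike a raw AbstractFunction0.
        JavaRDD<Object[]> rows = JdbcRDD.create(
                sc,
                new JdbcRDD.ConnectionFactory() {
                    @Override
                    public Connection getConnection() throws Exception {
                        Class.forName("com.mysql.jdbc.Driver");
                        return DriverManager.getConnection(
                                "jdbc:mysql://localhost:3306/mydb", "root", "yetAnotherMyPassword");
                    }
                },
                "SELECT document_id, name, content FROM document WHERE document_id >= ? AND document_id <= ?",
                10001, 499999, 10,
                new Function<ResultSet, Object[]>() {
                    @Override
                    public Object[] call(ResultSet resultSet) {
                        return JdbcRDD.resultSetToObjectArray(resultSet);
                    }
                });

        System.out.println(rows.count());
        sc.stop();
    }
}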

A second question: is using DataFrames (in my case, against MySQL) a better option than JdbcRDD?
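For reference, Spark 1.3 also ships a JDBC data source for DataFrames, which handles the connection and row mapping itself. A sketch under the same assumptions (table, bounds, and credentials from the question; if I recall the 1.3 API correctly, SQLContext.jdbc is the call, superseded by read().jdbc(...) in later releases):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameJdbcApp {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("com.example.testapp"));
        SQLContext sqlContext = new SQLContext(sc.sc());

        // Partitioned JDBC read over document_id, mirroring the JdbcRDD bounds:
        // jdbc(url, table, partitionColumn, lowerBound, upperBound, numPartitions)
        DataFrame documents = sqlContext.jdbc(
                "jdbc:mysql://localhost:3306/mydb?user=root&password=yetAnotherMyPassword",
                "document",
                "document_id", 10001, 499999, 10);

        documents.select("document_id", "name", "content").show();
        sc.stop();
    }
}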

0 Answers:

No answers yet.