Getting a "Task is not serializable" exception with the latest Spark 1.3.1 in a simple JdbcRDD snippet:
package com.example.testapp;

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.JdbcRDD;

import scala.reflect.ClassManifestFactory$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class SparkDriverApp implements Serializable {
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        SparkConf conf = new SparkConf();
        conf.setAppName("com.example.testapp");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // The anonymous AbstractFunction0 below is what the exception reports
        // as not serializable (com.example.testapp.SparkDriverApp$1).
        new JdbcRDD<>(sc.sc(), new AbstractFunction0<Connection>() {
            @Override
            public Connection apply() {
                try {
                    Class.forName("com.mysql.jdbc.Driver");
                    return DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "root", "yetAnotherMyPassword");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }, "SELECT document_id, name, content FROM document WHERE document_id >= ? and document_id <= ?",
                10001, 499999, 10, new AbstractFunction1<ResultSet, Object[]>() {
            @Override
            public Object[] apply(ResultSet resultSet) {
                return JdbcRDD.resultSetToObjectArray(resultSet);
            }
        }, ClassManifestFactory$.MODULE$.fromClass(Object[].class)).collect();
    }
}
Exception stack trace:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1460)
...
Caused by: java.io.NotSerializableException: com.example.testapp.SparkDriverApp$1
Serialization stack:
- object not serializable (class: com.example.testapp.SparkDriverApp$1, value: <function0>)
- field (class: org.apache.spark.rdd.RDD, name: checkpointData, type: class scala.Option)
- object (class org.apache.spark.rdd.JdbcRDD, JdbcRDD[0] at JdbcRDD at SparkDriverApp.java:44)
- field (class: org.apache.spark.rdd.RDD$$anonfun$17, name: $outer, type: class org.apache.spark.rdd.RDD)
- object (class org.apache.spark.rdd.RDD$$anonfun$17, <function1>)
- field (class: org.apache.spark.SparkContext$$anonfun$runJob$5, name: func$1, type: interface scala.Function1)
- object (class org.apache.spark.SparkContext$$anonfun$runJob$5, <function2>)
...
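From the serialization stack, the culprit seems to be the anonymous AbstractFunction0 (SparkDriverApp$1), which does not implement Serializable. One thing I am considering is the Java-friendly JdbcRDD.create helper added in Spark 1.3, since its ConnectionFactory interface already extends Serializable. A minimal sketch of the same query written that way (the class name and the final count() action are mine, just for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

public class JdbcRddCreateSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("com.example.testapp"));
        // JdbcRDD.ConnectionFactory extends Serializable, so this anonymous
        // class (declared in a static method, hence no outer reference)
        // can be shipped to the executors.
        JavaRDD<Object[]> rows = JdbcRDD.create(
                sc,
                new JdbcRDD.ConnectionFactory() {
                    @Override
                    public Connection getConnection() throws Exception {
                        Class.forName("com.mysql.jdbc.Driver");
                        return DriverManager.getConnection(
                                "jdbc:mysql://localhost:3306/mydb", "root", "yetAnotherMyPassword");
                    }
                },
                "SELECT document_id, name, content FROM document WHERE document_id >= ? AND document_id <= ?",
                10001, 499999, 10,
                // org.apache.spark.api.java.function.Function also extends Serializable.
                new Function<ResultSet, Object[]>() {
                    @Override
                    public Object[] call(ResultSet resultSet) {
                        return JdbcRDD.resultSetToObjectArray(resultSet);
                    }
                });
        System.out.println(rows.count());
    }
}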
A separate question: is using a DataFrame (in my case, against MySQL) a better choice than JdbcRDD?
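For comparison, this is roughly how I understand the DataFrame route in 1.3.x, via the built-in "jdbc" data source and SQLContext.load; the option keys are from the 1.3 docs, the table name, partition column, and bounds mirror my JdbcRDD call, and the class name is again just mine:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameJdbcSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("com.example.testapp"));
        SQLContext sqlContext = new SQLContext(sc);

        // Options understood by the built-in "jdbc" data source in Spark 1.3.x.
        Map<String, String> options = new HashMap<>();
        options.put("url", "jdbc:mysql://localhost:3306/mydb?user=root&password=yetAnotherMyPassword");
        options.put("dbtable", "document");
        // Partitioned reads over the numeric key, mirroring the JdbcRDD bounds.
        options.put("partitionColumn", "document_id");
        options.put("lowerBound", "10001");
        options.put("upperBound", "499999");
        options.put("numPartitions", "10");

        DataFrame documents = sqlContext.load("jdbc", options);
        documents.select("document_id", "name", "content").show();
    }
}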