Spark sortByKey throws an exception

Time: 2015-04-29 09:25:07

Tags: apache-spark

I am trying to sort a JavaPairRDD by key.

Configuration

Spark version: 1.3.0; mode: local

Could someone look over my code and tell me what I am doing wrong?

    JavaPairRDD<String, HashMap<String, Object>> countAndSum = grupBydate
            .reduceByKey(new Function2<HashMap<String, Object>, HashMap<String, Object>, HashMap<String, Object>>() {
                @Override
                public HashMap<String, Object> call(HashMap<String, Object> v1,
                        HashMap<String, Object> v2) throws Exception {
                    long count = Long.parseLong(v1.get(SparkToolConstant.COUNT).toString())
                            + Long.parseLong(v2.get(SparkToolConstant.COUNT).toString());
                    Double sum = Double.parseDouble(v1.get(SparkToolConstant.VALUE).toString())
                            + Double.parseDouble(v2.get(SparkToolConstant.VALUE).toString());
                    HashMap<String, Object> sumMap = new HashMap<String, Object>();
                    sumMap.put(SparkToolConstant.COUNT, count);
                    sumMap.put(SparkToolConstant.VALUE, sum);
                    return sumMap;
                }
            });


System.out.println("count before sorting : "
                                        + countAndSum.count());



    /**
     * sort by date
     */
    JavaPairRDD<String, HashMap<String, Object>> sortByDate = countAndSum
            .sortByKey(new Comparator<String>() {
                @Override
                public int compare(String dateStr1, String dateStr2) {
                    DateUtil dateUtil = new DateUtil();
                    Date date1 = dateUtil.stringToDate(dateStr1, dateFormat);
                    Date date2 = dateUtil.stringToDate(dateStr2, dateFormat);
                    if (date2 == null && date1 == null) {
                        return 0;
                    } else if (date2 != null && date1 != null) {
                        return date1.compareTo(date2);
                    } else if (date2 == null) {
                        return 1;
                    } else {
                        return -1;
                    }
                }
            });

I get the error here:

                        System.out.println("count after sorting : "
                                + sortByDate.count());

Stack trace when the job is submitted to Spark in local mode with spark-submit:

    SchedulerImpl:59 - Cancelling stage 252
    2015-04-29 14:37:19 INFO DAGScheduler:59 - Job 62 failed: count at DataValidation.java:378, took 0.107696 s
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
    org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
    org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
    org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
    org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
    org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
    org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:835)
    org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1042)
    org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
    org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
    scala.Option.foreach(Option.scala:236)
    org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1039)
    org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1038)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1038)
    org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
    org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:847)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1042)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1039)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1038)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1038)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

  


1 Answer:

Answer 0 (score: 0)

Spark first serializes the functions you pass to reduceByKey and sortByKey and ships them to the executors, so you have to make sure those functions can be serialized.

SparkToolConstant and DateUtil, referenced in your code, seem to be the cause of this error.
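
As one possible fix, here is a minimal sketch that replaces the anonymous Comparator with a standalone class implementing both Comparator and Serializable. It assumes the keys are date strings that java.text.SimpleDateFormat can parse and that dateFormat holds a pattern such as "yyyy-MM-dd"; since the original DateUtil is not shown, SimpleDateFormat is substituted for it here:

    import java.io.Serializable;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Comparator;
    import java.util.Date;

    // Hypothetical replacement for the anonymous Comparator. As a top-level
    // class it captures no enclosing instance, and it implements Serializable
    // so Spark can ship it to the executors.
    public class DateStringComparator implements Comparator<String>, Serializable {

        private static final long serialVersionUID = 1L;

        private final String pattern;

        public DateStringComparator(String pattern) {
            this.pattern = pattern;
        }

        @Override
        public int compare(String dateStr1, String dateStr2) {
            Date date1 = parse(dateStr1);
            Date date2 = parse(dateStr2);
            if (date1 == null && date2 == null) {
                return 0;
            } else if (date1 != null && date2 != null) {
                return date1.compareTo(date2);
            } else if (date2 == null) {
                return 1;  // a parseable date sorts after an unparseable one
            } else {
                return -1;
            }
        }

        private Date parse(String dateStr) {
            try {
                // SimpleDateFormat is not thread-safe, so create one per call.
                return new SimpleDateFormat(pattern).parse(dateStr);
            } catch (ParseException e) {
                return null; // mirror the original null-on-failure behavior
            }
        }
    }

The call site then becomes countAndSum.sortByKey(new DateStringComparator(dateFormat)), and the closure shipped to the executors no longer captures anything unserializable.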