I am trying to sort a JavaPairRDD by key.
Configuration
Spark version: 1.3.0, mode: local
Could someone review my code and point out what I am doing wrong?
JavaPairRDD<String, HashMap<String, Object>> countAndSum = grupBydate
        .reduceByKey(new Function2<HashMap<String, Object>, HashMap<String, Object>, HashMap<String, Object>>() {
            @Override
            public HashMap<String, Object> call(HashMap<String, Object> v1,
                    HashMap<String, Object> v2) throws Exception {
                long count = Long.parseLong(v1.get(SparkToolConstant.COUNT).toString())
                        + Long.parseLong(v2.get(SparkToolConstant.COUNT).toString());
                Double sum = Double.parseDouble(v1.get(SparkToolConstant.VALUE).toString())
                        + Double.parseDouble(v2.get(SparkToolConstant.VALUE).toString());
                HashMap<String, Object> sumMap = new HashMap<String, Object>();
                sumMap.put(SparkToolConstant.COUNT, count);
                sumMap.put(SparkToolConstant.VALUE, sum);
                return sumMap;
            }
        });
System.out.println("count before sorting : "
+ countAndSum.count());
/**
sort by date
*/
JavaPairRDD<String, HashMap<String, Object>> sortByDate = countAndSum
        .sortByKey(new Comparator<String>() {
            @Override
            public int compare(String dateStr1, String dateStr2) {
                DateUtil dateUtil = new DateUtil();
                Date date1 = dateUtil.stringToDate(dateStr1, dateFormat);
                Date date2 = dateUtil.stringToDate(dateStr2, dateFormat);
                if (date2 == null && date1 == null) {
                    return 0;
                } else if (date2 != null && date1 != null) {
                    return date1.compareTo(date2);
                } else if (date2 == null) {
                    return 1;
                } else {
                    return -1;
                }
            }
        });
I get the error here:
System.out.println("count after sorting : "
+ sortByDate.count());
Stack trace when the job is submitted with spark-submit in local mode:
SchedulerImpl:59 - Cancelling stage 252
2015-04-29 14:37:19 INFO DAGScheduler:59 - Job 62 failed: count at DataValidation.java:378, took 0.107696 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:835)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1042)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1039)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1038)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1038)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:847)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1042)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1039)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1038)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1038)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Answer 0 (score: 0)
Spark first serializes the functions you pass to reduceByKey and sortByKey and ships them to the executors, so you have to make sure those functions are serializable. SparkToolConstant and DateUtil in your code appear to be the cause of this error.
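As an illustration, here is a minimal sketch of one way to fix the sortByKey part. It assumes the keys really are date strings and uses "yyyy-MM-dd" as a stand-in for your actual dateFormat: move the comparison logic into a named class that implements both Comparator<String> and java.io.Serializable, so the comparator Spark ships to the executors no longer drags in the non-serializable DateUtil helper.

import java.io.Serializable;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Comparator;
import java.util.Date;

// Serializable comparator for date-string keys; the format string is an example.
public class DateStringComparator implements Comparator<String>, Serializable {
    private static final long serialVersionUID = 1L;
    private final String dateFormat;

    public DateStringComparator(String dateFormat) {
        this.dateFormat = dateFormat;
    }

    // Parse a key into a Date, returning null for unparseable keys
    // so they sort the same way as in the original comparator.
    private Date parse(String dateStr) {
        try {
            return new SimpleDateFormat(dateFormat).parse(dateStr);
        } catch (ParseException e) {
            return null;
        }
    }

    @Override
    public int compare(String dateStr1, String dateStr2) {
        Date date1 = parse(dateStr1);
        Date date2 = parse(dateStr2);
        if (date1 == null && date2 == null) {
            return 0;
        } else if (date1 != null && date2 != null) {
            return date1.compareTo(date2);
        } else if (date2 == null) {
            return 1;
        } else {
            return -1;
        }
    }
}

Then the anonymous comparator in the driver code can be replaced with:

JavaPairRDD<String, HashMap<String, Object>> sortByDate =
        countAndSum.sortByKey(new DateStringComparator("yyyy-MM-dd"));

The same idea applies to the reduceByKey function: everything the anonymous Function2 references, directly or indirectly, has to be serializable as well.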