Finding the maximum value in a JavaPairDStream

Time: 2016-08-03 09:09:08

Tags: hadoop apache-spark streaming apache-spark-sql

Is there a way to find the maximum value in a JavaPairDStream? My key is a String and the value is an ArrayList<Row>. I need to find the tuple whose ArrayList contains the row with the largest value in a particular column. I am using Spark 1.6, and I want to implement something similar to javaRDD.max() for a JavaPairDStream.
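A DStream has no max() of its own, but each micro-batch is an RDD, so one option is to call max() on each batch's RDD inside foreachRDD. The following is only a minimal sketch of that idea, not a confirmed solution: COLUMN_INDEX, maxInList, and ByColumnMax are illustrative names (not Spark API), and it assumes the column of interest is a double at index 0.

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.Comparator;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.VoidFunction;
    import org.apache.spark.sql.Row;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    import scala.Tuple2;

    public class PairStreamMax {

        // Hypothetical: index of the column whose value we maximize.
        static final int COLUMN_INDEX = 0;

        // Largest value of that column across all rows in one tuple's list.
        static double maxInList(ArrayList<Row> rows) {
            double m = Double.NEGATIVE_INFINITY;
            for (Row r : rows) {
                m = Math.max(m, r.getDouble(COLUMN_INDEX));
            }
            return m;
        }

        // Serializable so rdd.max(...) can ship the comparator to executors.
        static class ByColumnMax
                implements Comparator<Tuple2<String, ArrayList<Row>>>, Serializable {
            @Override
            public int compare(Tuple2<String, ArrayList<Row>> a,
                               Tuple2<String, ArrayList<Row>> b) {
                return Double.compare(maxInList(a._2()), maxInList(b._2()));
            }
        }

        static void printBatchMax(JavaPairDStream<String, ArrayList<Row>> stream) {
            stream.foreachRDD(new VoidFunction<JavaPairRDD<String, ArrayList<Row>>>() {
                @Override
                public void call(JavaPairRDD<String, ArrayList<Row>> rdd) {
                    if (!rdd.isEmpty()) {
                        // max() is an action; the winning tuple comes back
                        // to the driver for this batch only.
                        Tuple2<String, ArrayList<Row>> best = rdd.max(new ByColumnMax());
                        System.out.println("Max tuple this batch: " + best._1());
                    }
                }
            });
        }
    }

This only gives a per-batch maximum; carrying a maximum across batches is a separate problem, which is where the accumulator attempt below comes in.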

Using updateStateByKey is not an option, because I need a single state shared across all keys rather than per-key state. I tried using an accumulator as follows:

    // This comparison runs inside reduceByKey, i.e. in an executor task:
    if (accumulator.value() < max) {
        accumulator.setValue(max);
    }

I perform this inside reduceByKey, but I get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.UnsupportedOperationException: Can't read accumulator value in task
    at org.apache.spark.Accumulable.value(Accumulators.scala:98)
    at sample.sample.SampleJdd$2.call(SampleJdd.java:82)
    at sample.sample.SampleJdd$2.call(SampleJdd.java:74)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction2$1.apply(JavaPairRDD.scala:996)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
    at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:138)
    at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
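This exception is expected behavior: Spark accumulators are write-only from the executors' point of view. Tasks may only add to an accumulator; Accumulable.value() is legal only on the driver, so calling it inside reduceByKey (which runs in executor tasks) throws UnsupportedOperationException. The read-and-compare has to move to the driver. Below is a minimal sketch of that pattern under the same assumptions as above; it reuses the hypothetical PairStreamMax.maxInList helper and assumes the accumulator is an Accumulator<Double> created on the driver (e.g. via sc.accumulator(...)).

    import java.util.ArrayList;

    import org.apache.spark.Accumulator;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.VoidFunction;
    import org.apache.spark.sql.Row;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    import scala.Tuple2;

    public class DriverSideMax {

        static void trackGlobalMax(JavaPairDStream<String, ArrayList<Row>> pairStream,
                                   final Accumulator<Double> accumulator) {
            pairStream.foreachRDD(new VoidFunction<JavaPairRDD<String, ArrayList<Row>>>() {
                @Override
                public void call(JavaPairRDD<String, ArrayList<Row>> rdd) {
                    if (rdd.isEmpty()) {
                        return;
                    }
                    // map/reduce run in executor tasks, but they only combine
                    // plain values; the accumulator is never read there.
                    double batchMax = rdd.map(
                            new Function<Tuple2<String, ArrayList<Row>>, Double>() {
                                @Override
                                public Double call(Tuple2<String, ArrayList<Row>> t) {
                                    return PairStreamMax.maxInList(t._2());
                                }
                            })
                            .reduce(new Function2<Double, Double, Double>() {
                                @Override
                                public Double call(Double a, Double b) {
                                    return Math.max(a, b);
                                }
                            });
                    // Safe: this block runs on the driver after the action
                    // completes, where value() and setValue() are allowed.
                    if (accumulator.value() < batchMax) {
                        accumulator.setValue(batchMax);
                    }
                }
            });
        }
    }

Since the accumulator only ever holds one number, a plain driver-side variable updated the same way would work just as well here; the accumulator adds nothing unless tasks also need to contribute to it.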

0 Answers:

No answers