scala.NotImplementedError: put() should not be called on an EmptyStateMap while doing stateful computation on a Spark stream?

Time: 2016-06-22 13:53:07

Tags: apache-spark spark-streaming

I am running into a warning while running a stateful computation. The state consists of a BloomFilter (stream-lib) as the value, with an Integer as the key.
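For context, the job is wired up with mapWithState, roughly as in the minimal sketch below. This is a reconstruction rather than my exact code: the socket source stands in for Kafka, and the key derivation, BloomFilter capacity, batch interval, and checkpoint path are all placeholders.

    import com.clearspring.analytics.stream.membership.BloomFilter
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object StatefulBloom {
      // One BloomFilter per Integer key; capacity and error rate are illustrative.
      def updateFilter(key: Int, value: Option[String], state: State[BloomFilter]): Unit = {
        val filter = state.getOption().getOrElse(new BloomFilter(100000, 0.01d))
        value.foreach(v => filter.add(v))
        state.update(filter)
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stateful-bloom")
        val ssc  = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("hdfs:///checkpoints/stateful-bloom") // placeholder path

        // Stand-in source; the real job reads from Kafka and keys the records.
        val keyed = ssc.socketTextStream("localhost", 9999)
          .map(line => (math.abs(line.hashCode) % 10, line))

        keyed
          .mapWithState(StateSpec.function(updateFilter _).timeout(Seconds(3600)))
          .stateSnapshots()
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }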

The program runs smoothly for a few minutes, after which I get the warning below, the streaming application becomes unstable (processing time grows exponentially), and the job eventually fails.

WARN TaskSetManager: Lost task 0.0 in stage 144.0 (TID 326, mesos-slave-02): scala.NotImplementedError: put() should not be called on an EmptyStateMap 
        at org.apache.spark.streaming.util.EmptyStateMap.put(StateMap.scala:73) 
        at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:62) 
        at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55) 
        at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) 
        at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55) 
        at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155) 
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69) 
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) 
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
        at org.apache.spark.scheduler.Task.run(Task.scala:89) 
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
        at java.lang.Thread.run(Thread.java:745) 

I am using Kryo serialization. From somewhere on the internet I picked up a hint that this may be caused by a Kryo serialization bug involving OpenHashMapBasedStateMap, but I don't know how to work around it.
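One workaround I have seen suggested (I have not verified that it helps) is to explicitly register Spark's internal state-map classes with Kryo, along these lines; the class names are taken from the stack trace above:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // The state-map classes are private to Spark, so look them up by name.
      .registerKryoClasses(Array(
        Class.forName("org.apache.spark.streaming.util.OpenHashMapBasedStateMap"),
        Class.forName("org.apache.spark.streaming.util.EmptyStateMap")
      ))

The other option mentioned in the same discussions is to set spark.serializer back to org.apache.spark.serializer.JavaSerializer for this job, at the cost of slower serialization.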

Environment: the Spark cluster runs in standalone mode with 1 master and 5 slaves, each with 4 vCPUs and 8 GB RAM. Data is streamed in from a 3-node Kafka cluster (managed by a 3-node ZooKeeper cluster).

Checkpointing is done on the hadoop-cluster. In addition, we save the state in HBase (on top of the hadoop-cluster) and restore it when the streaming application starts.
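The recovery side uses StreamingContext.getOrCreate against the checkpoint directory, roughly like the following sketch (the path and batch interval are placeholders, and the HBase save/restore step is omitted):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs://hadoop-cluster/checkpoints/app" // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("stateful-bloom")
      val ssc  = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... build the Kafka input stream and the mapWithState pipeline here ...
      ssc
    }

    // Reuse the checkpointed context if one exists, otherwise build it fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()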

This question was originally raised in this spark mailing list post, but I had not received any answer there at the time of posting.

0 Answers:

There are no answers yet.