Question

We have a Spark Streaming program that reads its input from Kafka using createDirectStream and uses mapWithState to build composite objects based on a common key.

JavaMapWithStateDStream<String, InputData, Trip, Tuple2<String, CompositeData>> mappedDStream = inputMessages.mapWithState(StateSpec.function(mappingFunc).timeout(Durations.minutes(timeOutMinutes)));

We run this code on a 3-machine Hadoop YARN cluster with an HDFS checkpoint directory specified. The Hadoop version is 2.7.0 and the Spark version is 2.0.

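The configured batch interval is 3 seconds. The program runs continuously for 48 to 72 hours and then fails with the following exception.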
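For a rough sense of scale, here is a back-of-envelope sketch of how many batches and state checkpoints would run before the failure. It is only an estimate under two assumptions that are not confirmed for our job: that the mapWithState state stream is checkpointed every 10 batch intervals (what I understand to be Spark's default), and that there are at least 25 partitions (inferred from the part-00024 file name in the exception). The class and method names are made up for illustration.

```java
import java.time.Duration;

public class CheckpointLoadEstimate {

    // Number of whole intervals that fit into the given run time.
    static long intervalsIn(Duration runTime, Duration interval) {
        return runTime.toMillis() / interval.toMillis();
    }

    public static void main(String[] args) {
        // From the question: 3-second batch interval.
        Duration batchInterval = Duration.ofSeconds(3);

        // ASSUMPTION: state checkpoint interval = 10 x batch interval.
        int checkpointMultiplier = 10;
        Duration checkpointInterval = batchInterval.multipliedBy(checkpointMultiplier);

        // ASSUMPTION: at least 25 partitions, inferred from "part-00024".
        int minPartitions = 25;

        for (int hours : new int[] {48, 72}) {
            Duration runTime = Duration.ofHours(hours);
            long batches = intervalsIn(runTime, batchInterval);
            long checkpoints = intervalsIn(runTime, checkpointInterval);
            long partFilesWritten = checkpoints * minPartitions;
            System.out.printf(
                "%d h: %d batches, %d state checkpoints, at least %d part files written%n",
                hours, batches, checkpoints, partFilesWritten);
        }
    }
}
```

Under those assumptions, the job would have written thousands of checkpoints to HDFS by the time it fails, so the failure is not tied to the first few writes.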

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /.streamingcheckpoint/app1/2b86771a-0771-4f5a-a8cf-878f79a29d03/rdd-167/.part-00024-attempt-3 could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)

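We have already looked at the answers linked below, but in our case the cluster has plenty of free space (cluster disk utilization is below 30%). Even after this failure the NameNode is still up, and we can add files to HDFS using hdfs commands. We have also increased the number of threads available to the NameNode.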

could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

We have also seen the following message written to our logs continuously from the very start.

[rdd_11_28] (org.apache.spark.executor.Executor) [2017-03-16 11:35:00,690] WARN 1 block locks were not released by TID = 202:

What could be causing this failure?

Spark Streaming mapWithState fails after 48 hours with a checkpoint write issue

0 answers: