Spark DataFrame aggregation fails when deployed to a Spark HDInsight cluster

Date: 2020-07-20 19:57:41

Tags: scala apache-spark spark-structured-streaming azure-hdinsight

Background

I am trying to deploy a Spark application to a Spark HDInsight cluster (HDI 4.0) that writes an aggregated streaming DataFrame to Azure Blob Storage. For debugging purposes, however, I am currently writing the output to the console instead.

Problem

The problem is that when I deploy the application to the cluster, it fails to write the aggregated DataFrame to the console. The application exits with an error, and I have found very little information about what might be going wrong.

The final error output in the logs before the application shuts down is the following:

20/07/20 18:00:37 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:655)
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:275)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)

However, further up in the log output I can also see this message, which occurred shortly after I attempted to write the aggregated DataFrame to the console:

20/07/20 18:00:20 INFO HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
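
For context, the flag mentioned in that message is a Spark SQL configuration that can be toggled at runtime; a sketch of what that would look like is below (I have not confirmed whether changing it has any effect on the failure):

// Sketch only: toggling the two-level codegen hash map flag referenced in the log above.
// Whether this affects the failure is untested; "spark" is assumed to be the active SparkSession.
spark.conf.set("spark.sql.codegen.aggregate.map.twolevel.enabled", "false")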

Through some troubleshooting, I was able to verify that the problem is caused only by the aggregation I perform on the input DataFrame. If I write the input DataFrame to the console directly, it is output as expected. This also only seems to be a problem when deployed to the cluster; when run locally via unit tests, the aggregation works and produces output as expected.
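
For reference, writing the un-aggregated input DataFrame to the console looks roughly like this (a minimal sketch; the output mode shown here is illustrative):

// Sketch: writing the raw input stream straight to the console.
// This variant behaves as expected both locally and on the cluster.
val debugQuery = modifiedInputDF
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

debugQuery.awaitTermination()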

The aggregation I perform is the following:

val windowColExpr: Column = window(
  timeColumn = col("startTime"),
  windowDuration = "12 hours",
  slideDuration = "12 hours"
)

val outputDF: DataFrame = modifiedInputDF
  .withWatermark("startTime", "12 hours")
  .groupBy(col("id"), windowColExpr)
  .agg(min("startTime"))

In addition, my writeStream query is:

val exQuery = outputDF
  .writeStream
  .format("console")
  .outputMode("update")
  .start()

exQuery.awaitTermination()

Potential solutions

Several questions I found related to this error suggest changing the configuration of the Spark job, but none of the configurations I have tried have been successful. I have already tried the configuration described in this article. Repartitioning to match the number of cores on the cluster did not help either.
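
Roughly, the repartitioning attempt looked like this (a sketch; the partition count shown is illustrative, not the actual core count of my cluster):

// Sketch of repartitioning to match the cluster's core count.
// 32 is an illustrative value only, not my cluster's actual number of cores.
val repartitionedDF = modifiedInputDF.repartition(32)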

Any help is greatly appreciated!

0 Answers:

No answers yet.