Question

So I have this code:

KafkaUtils.
      createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, fromOffsets).
      // Key each event by its request id so mapWithState tracks one state per request.
      map(event => (event.requestId, event.toString)).
      // 200 state partitions; a key idle for 5 minutes is timed out.
      mapWithState(StateSpec.function(StateFunc.func _).numPartitions(200).timeout(Minutes(5))).
      foreachRDD((rdd: RDD[String]) => {
        process(rdd.filter(_ != null), sparkSession, topic, partitionsNum)
      })

All StateFunc.func does is update the state, like this:

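def func(batchTime: Time, key: String, event: Option[String], state: State[String]): Option[String] = {
    if (state.exists) {
      // A state that is timing out cannot be updated, so leave it alone.
      if (!state.isTimingOut()) {
        // event is an Option[String]; store its value only when one is present.
        event.foreach(state.update)
      }
    } else {
      // First event for this key: create the state.
      event.foreach(state.update)
    }
    // Emit the incoming event unchanged.
    event
}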

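Edit: all the state function does is update the state. It should stop updating after about 2-3 updates, because no more events with the same key will arrive. So eventually the states should time out and be removed by Spark.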
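To make the expected lifecycle concrete, here is a minimal toy sketch in plain Scala. It is only a model of the behaviour expected above (a key receives a few updates, goes quiet, and is dropped once it has been idle for the timeout). It is not Spark's implementation, and the names `ExpectedStateLifecycle`, `Entry`, and `step` are hypothetical.

```scala
import scala.collection.mutable

// Toy model only: a plain in-memory map standing in for the keyed state.
// It does NOT reflect how Spark stores or expires mapWithState state internally.
object ExpectedStateLifecycle {
  // Hypothetical entry: last value seen plus the batch time (ms) of the last update.
  final case class Entry(value: String, lastUpdatedMs: Long)

  val timeoutMs: Long = 5 * 60 * 1000L // mirrors timeout(Minutes(5))
  val batchMs: Long   = 50 * 1000L     // mirrors the 50 sec stream interval

  // Apply one batch of (key, event) pairs, then drop keys idle for at least timeoutMs.
  def step(state: mutable.Map[String, Entry],
           batchTimeMs: Long,
           events: Seq[(String, String)]): Unit = {
    events.foreach { case (k, v) => state(k) = Entry(v, batchTimeMs) }
    // Collect the expired keys first so the map is not mutated while it is iterated.
    val expired = state.iterator.collect {
      case (k, e) if batchTimeMs - e.lastUpdatedMs >= timeoutMs => k
    }.toList
    expired.foreach(k => state.remove(k))
  }

  def main(args: Array[String]): Unit = {
    val state = mutable.Map.empty[String, Entry]
    // Key "req-1" gets three updates in the first three batches, then goes quiet.
    (0 until 10).foreach { i =>
      val events = if (i < 3) Seq("req-1" -> s"event-$i") else Seq.empty
      step(state, i * batchMs, events)
      println(s"batch $i: ${state.size} key(s) held")
    }
  }
}
```

In this model the key is last updated at batch 2 (100 s) and is dropped at batch 8 (400 s), the first batch that falls at least 300 s later. Under that expectation the state held at any moment should be bounded by the keys seen in roughly the last five minutes.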

All the process function does is write the RDD to S3 as JSON files.

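After 3 batches with 3,600,000 records (from the Spark Streaming UI), the output size is about 2 GB, but mapWithState takes about 30 GB (it should be about the output size), and my cluster only has 40 GB. After a while, Spark fails and restarts.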

Can someone help explain why the mapWithState size is 30+ GB, and why the cluster sometimes fails with OutOfMemory?

Properties:
streamInterval is 50 sec
maxRatePerPartition is 3000 records
Kafka partitions: 8
backPressure is true
Each record is about 1500 bytes
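As a sanity check on these figures, the sketch below works out how many records and bytes three batches should carry. It assumes that maxRatePerPartition is a per-second, per-partition limit (which is how Spark's `spark.streaming.kafka.maxRatePerPartition` setting is defined) and that the limit is fully saturated in every batch. The object name `BatchSizeEstimate` is hypothetical.

```scala
// Back-of-the-envelope check of the figures quoted in the question.
// Assumption: the rate limit is reached in every batch.
object BatchSizeEstimate {
  val batchIntervalSec    = 50L   // streamInterval
  val maxRatePerPartition = 3000L // records per second per Kafka partition
  val kafkaPartitions     = 8L
  val bytesPerRecord      = 1500L // approximate record size
  val batches             = 3L

  val recordsPerBatch: Long = batchIntervalSec * maxRatePerPartition * kafkaPartitions
  val totalRecords: Long    = recordsPerBatch * batches
  val rawBytes: Long        = totalRecords * bytesPerRecord

  def main(args: Array[String]): Unit = {
    println(s"records per batch: $recordsPerBatch")        // 1200000
    println(s"records in $batches batches: $totalRecords") // 3600000
    println(s"raw payload bytes: $rawBytes")               // 5400000000
  }
}
```

The record count matches the 3,600,000 reported by the Spark Streaming UI. The raw payload comes to about 5.4 GB, which is still far below the 30 GB observed for mapWithState.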

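Update
So I tried running it again with a small batch of ~170,000 records (each record is a string 264 bytes long), so the total should be about 45 MB. As shown in the screenshot, mapWithState is about 350 MB: 199 of the stored RDDs are 2.5 KB each, and only 1 RDD is 350 MB. What is going on?!
[Screenshot: mapWithState info]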
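The same kind of check for the small run, as a minimal sketch. The 350 MB value is the approximate size reported in the update, taken here as decimal megabytes (an assumption), and `SmallRunEstimate` is a hypothetical name.

```scala
// Expected payload versus observed state size for the small run.
object SmallRunEstimate {
  val records            = 170000L
  val bytesPerRecord     = 264L
  val observedStateBytes = 350L * 1000 * 1000 // ~350 MB, assumed decimal megabytes

  val expectedBytes: Long = records * bytesPerRecord
  val ratio: Double       = observedStateBytes.toDouble / expectedBytes

  def main(args: Array[String]): Unit = {
    println(s"expected payload bytes: $expectedBytes") // 44880000
    println(s"observed / expected: $ratio")
  }
}
```

The expected payload is about 45 MB, so the stored state is roughly 7.8 times larger than the raw records it was built from.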

Spark Streaming mapWithState memory increase

0 answers: