Question

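I have a Structured Streaming job deployed on the k8s operator. It simply reads from Kafka, deserializes, adds 2 columns and stores the results in the data lake (I tried both Delta and Parquet). After a few days the executors' memory grows and eventually I get an OOM. The input rate in kB/s is really low. P.S. I use the exact same code with Cassandra as the sink, and that job has been running for almost a month without any issues. Any ideas?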


My code

spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", MetisStreamsConfig.bootstrapServers)
    .option("subscribe", MetisStreamsConfig.topics.head)
    .option("startingOffsets", startingOffsets)
    .option("maxOffsetsPerTrigger", MetisStreamsConfig.maxOffsetsPerTrigger)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .as[String]
    .withColumn("payload", from_json($"value", schema))

    // selection + filtering
    .select("payload.*")
    .select($"vesselQuantity.qid" as "qid", $"vesselQuantity.vesselId" as "vessel_id", explode($"measurements"))
    .select($"qid", $"vessel_id", $"col.*")
    .filter($"timestamp".isNotNull)
    .filter($"qid".isNotNull and !($"qid"===""))
    .withColumn("ingestion_time", current_timestamp())
    .withColumn("mapping", MappingUDF($"qid"))
    .writeStream
    .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
      log.info(s"Storing batch with id: `$batchId`")
      val calendarInstance = Calendar.getInstance()

      val year = calendarInstance.get(Calendar.YEAR)
      val month = calendarInstance.get(Calendar.MONTH) + 1
      val day = calendarInstance.get(Calendar.DAY_OF_MONTH)
      batchDF.write
        .mode("append")
        .parquet(streamOutputDir + s"/$year/$month/$day")
    }
    .option("checkpointLocation", checkpointDir)
    .start()

I switched to foreachBatch because using Delta or Parquet with partitionBy made the problem appear sooner.

Answer 1

This is a bug that was resolved in Spark 3.1.0.

See https://github.com/apache/spark/pull/28904

Other ways to work around the problem, and credit for the debugging:

https://www.waitingforcode.com/apache-spark-structured-streaming/file-sink-out-of-memory-risk/read

Even though you are using foreachBatch, you may find this helpful ...
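If you write with the built-in file sink rather than foreachBatch, one quick check is whether the sink's metadata log is growing. The sketch below assumes the linked issue concerns the `_spark_metadata` directory that the file sink keeps under the output path, and that the output path is on a local or mounted filesystem. The object name `MetadataLogSize` and the default path are hypothetical; for S3, ADLS or HDFS you would list the directory with the corresponding client instead.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object MetadataLogSize {
  /** Returns (fileName, sizeInBytes) for every regular file directly under dir, largest first. */
  def filesBySize(dir: Path): Seq[(String, Long)] = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .map(p => (p.getFileName.toString, Files.size(p)))
        .toList
        .sortBy { case (_, size) => -size }
    } finally stream.close()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical default; pass your own sink output directory as the first argument.
    val outputDir = if (args.nonEmpty) args(0) else "/tmp/stream-output"
    val logDir = Paths.get(outputDir, "_spark_metadata")
    if (Files.isDirectory(logDir))
      filesBySize(logDir).take(5).foreach { case (name, size) => println(s"$name\t$size bytes") }
    else
      println(s"No _spark_metadata directory under $outputDir")
  }
}
```

If the largest entries keep growing from one day to the next, that would be consistent with the metadata-log explanation; if the directory does not exist (as with a plain batch write inside foreachBatch), the cause is probably elsewhere.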

Answer 2

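I had the same problem with some Structured Streaming Spark 2.4.4 applications that write Delta Lake (or Parquet) output with partitionBy.

It seems to be related to JVM memory allocation inside a container, as thoroughly explained here: https://merikan.com/2019/04/jvm-in-a-container/

My solution (which depends on your JVM version) was to add some options to the YAML definition of my Spark application: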

spec:
    javaOptions: >-
        -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap

With this, my streaming app runs properly with a normal amount of memory (1 GB for the driver and 2 GB for the executors).

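Edit: while the first issue seems to be solved (the controller killing pods for memory consumption), there is still a problem with slow growth of the non-heap memory size; after a few hours, the driver/executors get killed...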
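To tell heap growth apart from the non-heap growth mentioned in the edit, it can help to log what the JVM itself reports inside the pod. The following is a minimal sketch using only the standard `java.lang.management` API. The object name `MemoryReport`, and the idea of calling it periodically (for example from the foreachBatch body on the driver), are assumptions for illustration and not part of the original answer.

```scala
import java.lang.management.ManagementFactory

object MemoryReport {
  private val Mb = 1024L * 1024L

  /** Returns (maxHeapMb, usedHeapMb, usedNonHeapMb) as seen by this JVM. */
  def snapshot(): (Long, Long, Long) = {
    val bean = ManagementFactory.getMemoryMXBean
    val heap = bean.getHeapMemoryUsage
    val nonHeap = bean.getNonHeapMemoryUsage
    (Runtime.getRuntime.maxMemory / Mb, heap.getUsed / Mb, nonHeap.getUsed / Mb)
  }

  def main(args: Array[String]): Unit = {
    val (maxHeap, usedHeap, usedNonHeap) = snapshot()
    println(s"max heap: $maxHeap MB, used heap: $usedHeap MB, used non-heap: $usedNonHeap MB")
  }
}
```

If the reported max heap is close to the node's memory rather than the pod's limit, the JVM is probably not honoring the container limit. The non-heap figure covers JVM-managed pools such as metaspace and the code cache; native allocations outside those pools do not show up here, so a pod can still be killed while these numbers look stable.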

Structured Streaming OOM
