Stateful transformation on a single DataFrame in Spark Streaming

Asked: 2019-03-06 19:58:27

Tags: scala apache-spark cassandra apache-spark-sql spark-streaming

I am trying to determine completion status at various levels of granularity. For example, a region is "complete" if all towns in that region are complete.

To keep the state at the lowest level (town) in memory, I am using the following approach in Spark:

Step 1. Load the initial state from a Cassandra table into a Spark DataFrame (a sketch of this load follows the table below).

+----------+--------+--------+------------+
| country  | region |  town  | isComplete |
+----------+--------+--------+------------+
| Country1 | State1 | Town1  | FALSE      |
| Country1 | State1 | Town2  | FALSE      |
| Country1 | State1 | Town3  | FALSE      |
| Country1 | State1 | Town4  | FALSE      |
| Country1 | State1 | Town5  | FALSE      |
| Country1 | State1 | Town6  | FALSE      |
| Country1 | State1 | Town7  | FALSE      |
| Country1 | State1 | Town8  | FALSE      |
| Country1 | State1 | Town9  | FALSE      |
| Country1 | State1 | Town10 | FALSE      |
| Country1 | State1 | Town11 | FALSE      |
+----------+--------+--------+------------+
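
loadStatusFromCassandra is roughly a read through the spark-cassandra-connector; a minimal sketch is below (the keyspace and table names are placeholders, not the real ones):

import org.apache.spark.sql.{DataFrame, SparkSession}

def loadStatusFromCassandra(spark: SparkSession): DataFrame = {
  // Read the town-level completion status
  // ("my_keyspace" / "town_status" are placeholder names)
  spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_keyspace", "table" -> "town_status"))
    .load()
    .select("country", "region", "town", "isComplete")
}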

Step 2. Start the stream processing and, using the DataFrame created from each micro-batch, update the state in the Step 1 DataFrame with a left outer join (a sketch of the join follows the tables below).

Batch 1:

+----------+--------+-------+------------+
| country  | region | town  | isComplete |
+----------+--------+-------+------------+
| Country1 | State1 | Town1 | TRUE       |
| Country1 | State1 | Town2 | TRUE       |
| Country1 | State1 | Town3 | TRUE       |
| Country1 | State1 | Town4 | TRUE       |
+----------+--------+-------+------------+

After applying batch 1:

+----------+--------+--------+------------+
| country  | region |  town  | isComplete |
+----------+--------+--------+------------+
| Country1 | State1 | Town1  | TRUE       |
| Country1 | State1 | Town2  | TRUE       |
| Country1 | State1 | Town3  | TRUE       |
| Country1 | State1 | Town4  | TRUE       |
| Country1 | State1 | Town5  | FALSE      |
| Country1 | State1 | Town6  | FALSE      |
| Country1 | State1 | Town7  | FALSE      |
| Country1 | State1 | Town8  | FALSE      |
| Country1 | State1 | Town9  | FALSE      |
| Country1 | State1 | Town10 | FALSE      |
| Country1 | State1 | Town11 | FALSE      |
+----------+--------+--------+------------+
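
The update in updateStatusDf is essentially this left outer join on the key columns; the sketch below assumes the micro-batch value wins for towns that appear in the batch and the previous status is kept otherwise (via coalesce):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.coalesce

def updateStatusDf(spark: SparkSession,
                   statusDf: DataFrame,
                   messageDf: DataFrame): DataFrame = {
  val keys = Seq("country", "region", "town")
  statusDf
    .join(messageDf, keys, "left_outer")
    .select(
      statusDf("country"),
      statusDf("region"),
      statusDf("town"),
      // take the micro-batch value for towns that were updated,
      // keep the previously known status for everything else
      coalesce(messageDf("isComplete"), statusDf("isComplete")).as("isComplete")
    )
}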

My idea was that by keeping the DataFrame in a mutable var I could update it in every batch and carry the overall state, like a global variable, across the lifetime of the streaming job.

The base dataset has about 1.2 million records (roughly 100 MB) and is expected to grow to 10 GB.

I am running into out-of-memory problems: each batch takes longer to process than the previous one, and the number of stages for the same job keeps growing with every batch. Eventually the application fails with "GC overhead limit exceeded".

var statusDf = loadStatusFromCassandra(sparkSession)

ipStream.foreachRDD { statusMsgRDD =>
  if (!statusMsgRDD.isEmpty) {
    // 1. Create a DataFrame from the current micro-batch RDD
    val messageDf = getMessageDf(sparkSession, statusMsgRDD)

    // 2. Update statusDf with a left outer join against messageDf
    statusDf = updateStatusDf(sparkSession, statusDf, messageDf)

    // 3. Use the updated statusDf to generate aggregations at higher
    //    levels and publish to a Kafka topic when a higher level
    //    (e.g. a region) is complete.
  }
}
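
Step 3 above is the roll-up to higher levels: a region counts as complete once every one of its towns is complete. A sketch of that aggregation over the updated statusDf (min over the boolean cast to int is just one way to express "all towns complete"; the actual Kafka publish is omitted):

import org.apache.spark.sql.functions.{col, min}

// A region is complete only if all of its towns are complete.
val completedRegions = statusDf
  .groupBy("country", "region")
  .agg(min(col("isComplete").cast("int")).as("allComplete"))
  .filter(col("allComplete") === 1)
  .select("country", "region")

// completedRegions would then be published to the Kafka topic.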
