I am trying to determine completion status at various levels of granularity. For example, a region is "complete" when all towns in that region are complete.
I am keeping the status at the lowest level (town) in memory in Spark, using the following approach:
Step 1. Load the initial state from a Cassandra table into a Spark DataFrame (a sketch of this load follows the table below).
+----------+--------+--------+------------+
| country  | region | town   | isComplete |
+----------+--------+--------+------------+
| Country1 | State1 | Town1  | FALSE      |
| Country1 | State1 | Town2  | FALSE      |
| Country1 | State1 | Town3  | FALSE      |
| Country1 | State1 | Town4  | FALSE      |
| Country1 | State1 | Town5  | FALSE      |
| Country1 | State1 | Town6  | FALSE      |
| Country1 | State1 | Town7  | FALSE      |
| Country1 | State1 | Town8  | FALSE      |
| Country1 | State1 | Town9  | FALSE      |
| Country1 | State1 | Town10 | FALSE      |
| Country1 | State1 | Town11 | FALSE      |
+----------+--------+--------+------------+
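For reference, step 1 could be implemented as below. This is only a sketch assuming the DataStax spark-cassandra-connector; the keyspace and table names ("my_keyspace", "town_status") are placeholders, not from the actual job.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Load the current town-level status into a DataFrame.
// Keyspace and table names are hypothetical placeholders.
def loadStatusFromCassandra(sparkSession: SparkSession): DataFrame =
  sparkSession.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_keyspace", "table" -> "town_status"))
    .load() // columns: country, region, town, isComplete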
Step 2. Start the stream and, in each micro-batch, build a DataFrame from the incoming data, then update the status in the DataFrame from step 1 with a left outer join (see the sketch after the tables below).
Batch 1:
+----------+--------+-------+------------+
| country  | region | town  | isComplete |
+----------+--------+-------+------------+
| Country1 | State1 | Town1 | TRUE       |
| Country1 | State1 | Town2 | TRUE       |
| Country1 | State1 | Town3 | TRUE       |
| Country1 | State1 | Town4 | TRUE       |
+----------+--------+-------+------------+
After applying batch 1:
+----------+--------+--------+------------+
| country  | region | town   | isComplete |
+----------+--------+--------+------------+
| Country1 | State1 | Town1  | TRUE       |
| Country1 | State1 | Town2  | TRUE       |
| Country1 | State1 | Town3  | TRUE       |
| Country1 | State1 | Town4  | TRUE       |
| Country1 | State1 | Town5  | FALSE      |
| Country1 | State1 | Town6  | FALSE      |
| Country1 | State1 | Town7  | FALSE      |
| Country1 | State1 | Town8  | FALSE      |
| Country1 | State1 | Town9  | FALSE      |
| Country1 | State1 | Town10 | FALSE      |
| Country1 | State1 | Town11 | FALSE      |
+----------+--------+--------+------------+
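The update in step 2 could look roughly like this sketch; the column names follow the tables above, and the helper signature matches the code further down. coalesce keeps the old flag for towns the batch did not touch.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

// Left outer join: keep every row of statusDf; where the batch delivered a
// matching (country, region, town) row, take its isComplete, otherwise keep
// the previous value.
def updateStatusDf(sparkSession: SparkSession,
                   statusDf: DataFrame,
                   messageDf: DataFrame): DataFrame =
  statusDf
    .join(messageDf.withColumnRenamed("isComplete", "newIsComplete"),
          Seq("country", "region", "town"), "left_outer")
    .withColumn("isComplete", coalesce(col("newIsComplete"), col("isComplete")))
    .drop("newIsComplete")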
The idea is that by keeping the DataFrame mutable (a var that is reassigned each batch), I can update it in every micro-batch and carry the overall state across the lifetime of the streaming job, like a global variable.
The base dataset has about 1.2 million records (roughly 100 MB) and is expected to grow to 10 GB.
I am running into memory problems: each batch takes longer to process than the previous one, the number of stages per job grows batch after batch, and the application eventually fails with "GC overhead limit exceeded".
var statusDf = loadStatusFromCassandra(sparkSession)

ipStream.foreachRDD { statusMsgRDD =>
  if (!statusMsgRDD.isEmpty) {
    // 1. Create a DataFrame from the current micro-batch RDD
    val messageDf = getMessageDf(sparkSession, statusMsgRDD)

    // 2. Update: left outer join statusDf with messageDf
    statusDf = updateStatusDf(sparkSession, statusDf, messageDf)

    // 3. Use the updated statusDf to generate aggregations at higher
    //    levels and publish to a Kafka topic if a higher level
    //    (e.g. a region) is complete.
  }
}
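getMessageDf is not shown above; for completeness, here is a minimal sketch of what it might look like, assuming the stream already carries parsed records with the four fields from the tables (the StatusMsg case class is hypothetical):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type matching the four columns used above.
case class StatusMsg(country: String, region: String, town: String, isComplete: Boolean)

def getMessageDf(sparkSession: SparkSession, statusMsgRDD: RDD[StatusMsg]): DataFrame = {
  import sparkSession.implicits._ // enables .toDF() on RDDs of case classes
  statusMsgRDD.toDF()             // columns: country, region, town, isComplete
}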