I have a Spark Streaming use case where I plan to keep a dataset broadcast and cached on each executor. Every micro-batch in the stream creates a DataFrame from the RDD and joins it with the batch. My test code below ends up performing the broadcast operation on every batch. Is there a way to broadcast it only once?
val testDF = sqlContext.read.format("com.databricks.spark.csv")
.schema(schema).load("file:///shared/data/test-data.txt")
val lines = ssc.socketTextStream("DevNode", 9999)
lines.foreachRDD((rdd, timestamp) => {
val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
val resultDF = recordDF.join(broadcast(testDF), "Age")
resultDF.write.format("com.databricks.spark.csv").save("file:///shared/data/output/streaming/"+timestamp)
}
For every batch, the file is read again and the broadcast is re-executed, as the logs show:
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
Any suggestions for broadcasting the dataset only once?
Answer 0 (score: 0)
It looks like broadcast tables are not reused across queries right now. See: SPARK-3863.
Do the broadcast outside the foreachRDD loop:
val testDF = broadcast(sqlContext.read.format("com.databricks.spark.csv")
.schema(schema).load(...))
lines.foreachRDD((rdd, timestamp) => {
val recordDF = ???
val resultDF = recordDF.join(testDF, "Age")
resultDF.write.format("com.databricks.spark.csv").save(...)
}
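Note that broadcast() from org.apache.spark.sql.functions is only a planner hint, so moving the call outside the loop may not by itself stop Spark from re-scanning the source file on each micro-batch. Below is a minimal sketch, assuming the same `schema`, `ssc`, and `Record` names from the question, that additionally caches the static table so the CSV should be read from disk only once:

import org.apache.spark.sql.functions.broadcast

// Read the static table once and cache it so each micro-batch
// reuses the in-memory copy instead of re-scanning the CSV file.
val testDF = sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema)
  .load("file:///shared/data/test-data.txt")
  .cache()
testDF.count() // force an action so the cache is populated eagerly

val lines = ssc.socketTextStream("DevNode", 9999)
lines.foreachRDD((rdd, timestamp) => {
  val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
  // broadcast() is a join-side hint; with testDF cached, the file
  // itself should no longer be re-read on every batch.
  val resultDF = recordDF.join(broadcast(testDF), "Age")
  resultDF.write.format("com.databricks.spark.csv")
    .save("file:///shared/data/output/streaming/" + timestamp)
})

The broadcast exchange itself may still be rebuilt per query plan (per SPARK-3863 above), but caching removes the repeated HadoopRDD input-split reads seen in the question's logs.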