Spark Streaming with a broadcast join

Date: 2016-02-19 16:03:06

Tags: scala apache-spark apache-spark-sql spark-streaming broadcast

I have a Spark Streaming use case in which I plan to keep a dataset broadcast and cached on every executor. Each micro-batch in the stream creates a DataFrame from the incoming RDD and joins it with this dataset. In my test code below, the broadcast operation is performed for every batch. Is there a way to broadcast it only once?

import org.apache.spark.sql.functions.broadcast  // for the broadcast() join hint
import sqlContext.implicits._                    // for .toDF()

val testDF = sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema).load("file:///shared/data/test-data.txt")

val lines = ssc.socketTextStream("DevNode", 9999)

lines.foreachRDD((rdd, timestamp) => {
  val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
  val resultDF = recordDF.join(broadcast(testDF), "Age")
  resultDF.write.format("com.databricks.spark.csv")
    .save("file:///shared/data/output/streaming/" + timestamp)
})

For every batch, this file is read and the broadcast is performed again, as the logs show:

16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27

16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27

Any suggestion on how to broadcast the dataset only once?

1 Answer:

Answer 0: (score: 0)

It looks like, for now, broadcast tables are not reused. See: SPARK-3863

Perform the broadcast outside the foreachRDD loop:

val testDF = broadcast(sqlContext.read.format("com.databricks.spark.csv")
 .schema(schema).load(...))

lines.foreachRDD((rdd, timestamp) => { 
  val recordDF = ???
  val resultDF = recordDF.join(testDF, "Age")
  resultDF.write.format("com.databricks.spark.csv").save(...)
})
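
A further note, not part of the original answer: if the lookup table is small enough to collect to the driver, the SQL broadcast join can be avoided entirely by shipping the data with an explicit sparkContext.broadcast, which sends it to the executors exactly once for the lifetime of the streaming application. The sketch below is only illustrative; the join column "Age" being an Int, the two-column layout of test-data.txt, and the output path are assumptions carried over from the question, and the result is written as plain text rather than CSV.

// Minimal sketch, assuming the "Age" column is an Int and the lookup table fits in driver memory.
import org.apache.spark.sql.Row

// Collect the small lookup table once on the driver, keyed by the join column.
val lookup: Map[Int, Row] = testDF.collect().map(r => r.getAs[Int]("Age") -> r).toMap
// Broadcast it once; every micro-batch reuses the same value on the executors.
val lookupBc = ssc.sparkContext.broadcast(lookup)

lines.foreachRDD { (rdd, timestamp) =>
  val joined = rdd.map(_.split(","))
    .flatMap { l =>
      val age = l(0).toInt
      // Map-side join against the broadcast lookup table; records without a match are dropped.
      lookupBc.value.get(age).map(row => (age, l(1), row))
    }
  joined.saveAsTextFile("file:///shared/data/output/streaming/" + timestamp)
}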