I have this code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

lines.foreachRDD { rdd =>
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
  sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it, foreachRDD happens at the driver level? So basically this whole code block:
lines.foreachRDD { rdd =>
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
  sparkStreamingService.run(df)
}
happens at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current DataFrame to produce a new DataFrame, and then calls another method (in another jar) that stores the DataFrame to Cassandra. So if all of this happens at the driver level, we are not using the Spark executors; how can I get the executors to process each partition of the RDD in parallel?
My sparkStreamingService run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
  metaData = populateMetaDataService.populateSiteMetaData(rowD)
  val headers = (rowD.getString(2).split(recordDelimiter)(0))
  val fields = headers.split("\u0001").map(
    fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)
  val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
  val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
  val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
  // val rawData = dataWithoutHeaders.split(recordDelimiter)
  val rowRDD = rawData
    .map(_.split("\u0001"))
    .map(attributes => Row(attributes: _*))
  val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
  dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Answer 0 (score: 2)
The invocation of foreachRDD does happen on the driver node. However, since we are operating at the RDD level, any transformation on that RDD will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
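For illustration, a minimal sketch of that split (the uppercase step is just a stand-in transformation, not part of the question's code):

lines.foreachRDD { rdd =>
  // Runs on the driver: this only builds the lineage of the new RDD.
  val parsed = rdd.map { case (_, value) => value.toUpperCase }

  // The function passed to map is shipped to the executors and applied
  // to each partition in parallel when an action (here, count) triggers it.
  val n = parsed.count()
  println(s"processed $n records in this batch") // back on the driver
}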
Since we don't know what your sparkStreamingService.run method is doing, we cannot tell you where it will be executed.
Answer 1 (score: 1)
foreachRDD may run locally, but that just means the setup runs locally. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
  val connection = createNewConnection() // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}
Note that the part of the code that is NOT based on the RDD is executed at the driver. It is the code built up using the RDD that is distributed to the workers.
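For completeness, the pattern the Spark Streaming programming guide recommends when you need per-record side effects like this is to move the connection creation into foreachPartition, so it happens once per partition on the worker instead of on the driver (createNewConnection and send are placeholders, as in the snippet above):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Runs on the worker: one connection per partition instead of per record,
    // and nothing non-serializable has to be shipped from the driver.
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}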
To get specific with your code:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection, this foreach is done on the driver
metadataDataframe.foreach(rowD => {
  metaData = populateMetaDataService.populateSiteMetaData(rowD)
  val headers = (rowD.getString(2).split(recordDelimiter)(0))
  val fields = headers.split("\u0001").map(
    fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)
  val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
  val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
  //This will run locally, creating a distributed record
  val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
  // val rawData = dataWithoutHeaders.split(recordDelimiter)
  //This will redistribute the work
  val rowRDD = rawData
    .map(_.split("\u0001"))
    .map(attributes => Row(attributes: _*))
  //again, setting this up locally, to be run distributed
  val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
  dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you could probably rewrite this to avoid the collect and keep everything distributed, but that is for you to work out, not StackOverflow.
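As a rough, hypothetical sketch only (not a drop-in replacement, since the original code builds a separate schema and DataFrame per row): the per-message splitting could be expressed with distributed DataFrame operations instead of collect() plus sparkContext.parallelize on the driver, for example:

import org.apache.spark.sql.functions.{explode, split}

// Stays at the DataFrame level: the messageContent payload is split into
// individual records on the executors rather than on the driver.
// Assumes recordDelimiter is a regex-safe delimiter string.
val exploded = df
  .select("customer", "tableName", "messageContent", "initialLoadRunning")
  .withColumn("record", explode(split(df("messageContent"), recordDelimiter)))

// Further parsing of each "record" can then be expressed as column
// expressions or a map over the resulting rows, all of it distributed.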