Trying to understand Spark Streaming flow

Time: 2017-04-21 20:25:21

Tags: apache-spark

I have this piece of code:

val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
  sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()

The way I understand it, foreachRDD happens at the driver level? So basically this whole block of code:

lines.foreachRDD { rdd =>
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
  sparkStreamingService.run(df)
}

runs at the driver level? The sparkStreamingService.run(df) method basically applies some transformations to the current DataFrame to produce a new DataFrame, and then calls another method (in another jar) that stores the DataFrame to Cassandra. So if all of this is happening at the driver level, we are not making use of the Spark executors. How can I make the executors process each partition of the RDD in parallel?

My sparkStreamingService run method:

var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
  metaData = populateMetaDataService.populateSiteMetaData(rowD)
  val headers = rowD.getString(2).split(recordDelimiter)(0)

  val fields = headers.split("\u0001").map(
    fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)

  val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
  val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)

  val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
  // val rawData = dataWithoutHeaders.split(recordDelimiter)
  val rowRDD = rawData
    .map(_.split("\u0001"))
    .map(attributes => Row(attributes: _*))

  val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
  dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})

2 Answers:

Answer 0 (score: 2):

The invocation of foreachRDD does indeed happen on the driver node. However, since we are operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.

Since we don't know what your sparkStreamingService.run method is doing, we cannot tell you where it will execute.
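As a rough illustration only (this is not your actual run method, and storeRow is a made-up placeholder), the rule of thumb is: the body of foreachRDD runs on the driver, but every RDD or DataFrame transformation it defines, and the body of foreachPartition, runs on the executors:

lines.foreachRDD { rdd =>
  // runs on the driver, once per batch interval
  val df = cassandraSQLContext.read.json(rdd.map(x => x._2)) // the map and the JSON parsing run on the executors
  df.rdd.foreachPartition { partition =>
    // this closure is shipped to the executors; partitions are processed in parallel
    partition.foreach(row => storeRow(row)) // storeRow: hypothetical per-row write
  }
}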

Answer 1 (score: 1):

foreachRDD may run locally, but that is just the setup. The RDD itself is a distributed collection, so the actual work is distributed.

To comment directly on the code from the docs:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

Note that the part of the code that is NOT based on the RDD is executed on the driver. It is the code built up using the RDD that is distributed to the workers.
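The same programming guide goes on to recommend creating such connections inside foreachPartition, so the expensive setup happens on the worker once per partition rather than once per record (createNewConnection remains a placeholder, as above):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // executed at the worker: one connection per partition of records
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}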

Your code specifically:

// df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
// Since collect created a local collection, this is done on the driver
metadataDataframe.foreach(rowD => {
  metaData = populateMetaDataService.populateSiteMetaData(rowD)
  val headers = rowD.getString(2).split(recordDelimiter)(0)

  val fields = headers.split("\u0001").map(
    fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)

  val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
  val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)

  // This will run locally, creating a distributed record
  val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
  // val rawData = dataWithoutHeaders.split(recordDelimiter)
  // This will redistribute the work
  val rowRDD = rawData
    .map(_.split("\u0001"))
    .map(attributes => Row(attributes: _*))
  // Again, setting this up locally, to be run distributed
  val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
  dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})

Ultimately, you can probably rewrite this so that the collect is no longer needed and everything stays distributed, but that is for you to work out, not StackOverflow.
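As a very rough sketch of that idea, assuming every message in the batch shares the same header/schema and the same delimiters (which may not hold for your data), the record splitting could stay as RDD transformations instead of being collected to the driver:

// hypothetical: knownSchema is built once up front (e.g. from the first message) instead of per row
val rowRDD = df.select("messageContent").rdd
  .flatMap(row => row.getString(0).split(recordDelimiter).drop(1)) // drop the header record; runs on the executors
  .map(_.split("\u0001"))
  .map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, knownSchema)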