Calculating requests per minute from timestamps in an RDD during a map

Asked: 2017-08-05 10:10:17

Tags: scala apache-spark spark-streaming rdd

I am currently trying to enrich machine learning data with the requests per minute. The data is stored in a Kafka topic, and the entire content of the topic is loaded and processed when the application starts - so, as far as I know, none of Spark Streaming's window operations can be used, because all the data arrives at the same time.

My approach was to try the following:

val kMeansFeatureRdd = kMeansInformationRdd.map(x => {

  val begin = x._2 //Long - unix timestamp millis
  val duration = x._3 //Long
  val rpm = kMeansInformationRdd.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).count()

  (duration, rpm)

})

However, with this approach I get the following exception:

org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases: 
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.

Is there a way to achieve what I am trying to do?

If you need any further information, just leave me a comment and I will update the question with whatever you need.

EDIT:

Broadcasting the RDD does not work. Broadcasting the collected array does not yield acceptable performance.

The following does execute, but it is so terribly slow that it is not really an option:

val collected = kMeansInformationRdd.collect()

val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2 //Long - unix timestamp millis
  val duration = x._3 //Long

  //linear scan over the whole collected array for every single record
  val rpm = collected.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size

  (duration, rpm)

})
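A possible way to speed this variant up (just a sketch, not one of the original attempts: it assumes only the timestamps are needed for the count, broadcasts them as one sorted array under the made-up name sortedTs, and replaces the linear filter per record with two binary searches):

val sortedTs = sc.broadcast(kMeansInformationRdd.map(_._2).collect().sorted) //sorted timestamps only, sent once per executor

val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val duration = x._3 //Long
  val arr = sortedTs.value

  //first index i with arr(i) >= target
  def lowerBound(target: Long): Int = {
    var lo = 0; var hi = arr.length
    while (lo < hi) { val mid = (lo + hi) >>> 1; if (arr(mid) < target) lo = mid + 1 else hi = mid }
    lo
  }

  //first index i with arr(i) > target
  def upperBound(target: Long): Int = {
    var lo = 0; var hi = arr.length
    while (lo < hi) { val mid = (lo + hi) >>> 1; if (arr(mid) <= target) lo = mid + 1 else hi = mid }
    lo
  }

  //number of timestamps within [x._2 - 60000, x._2]: two O(log n) lookups instead of an O(n) scan
  val rpm = (upperBound(x._2) - lowerBound(x._2 - 60000)).toLong

  (duration, rpm)
})

This keeps the broadcast small and makes the per-record cost logarithmic, but it is still bound by the driver-side collect.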

UPDATE:

This code at least gets the job done much faster - but as far as I can tell it still becomes slower and slower the more requests per minute there are, because the filtered array keeps growing - and, interestingly, it also slows down towards the end, which I cannot explain. If anyone spots the problem - or sees how the performance could be improved in general - I would be glad if you let me know.

kMeansInformationRdd = kMeansInformationRdd.cache()

//sortBy returns a new RDD, so the sorted result has to be reassigned
kMeansInformationRdd = kMeansInformationRdd.sortBy(_._2, true)

var kMeansFeatureArray: Array[(String, Long, Long)] = Array()
var buffer: collection.mutable.Map[String, Array[Long]] = collection.mutable.Map()
var counter = 0

kMeansInformationRdd.collect.foreach(x => {
  val ts = x._2
  val identifier = x._1 //make sure the identifier actually represents the entity that receives the traffic - e.g. the machine (IP?), not only the endpoint

  var bufferInstance = buffer.get(identifier).getOrElse(Array[Long]())

  bufferInstance = bufferInstance ++ Array(ts)

  val instanceSizeBefore = bufferInstance.size
  //keep only the timestamps of the last minute (60 000 ms)
  bufferInstance = bufferInstance.filter(p => p > ts - 60000)
  val instanceSizeAfter = bufferInstance.size

  buffer.put(identifier, bufferInstance)

  val rpm = bufferInstance.size.toLong

  kMeansFeatureArray = kMeansFeatureArray ++ Array((identifier, x._3, rpm)) //identifier, duration, rpm
  counter = counter + 1
  if (counter % 10000 == 0) {
    println(counter)
    println((identifier, x._3, rpm))
    println((instanceSizeBefore, instanceSizeAfter))
  }
})

val kMeansFeatureRdd = sc.parallelize(kMeansFeatureArray)
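For comparison, a fully distributed sketch that avoids collecting to the driver altogether (assuming, as in the snippets above, that kMeansInformationRdd holds (identifier, timestamp in millis, duration) tuples and that the events of a single identifier fit into one executor's memory): group by identifier, sort each group by timestamp, and count the last minute's events with a sliding window index.

val kMeansFeatureRdd = kMeansInformationRdd
  .map { case (identifier, ts, duration) => (identifier, (ts, duration)) }
  .groupByKey()
  .flatMap { case (identifier, events) =>
    val sorted = events.toArray.sortBy(_._1) //events of this identifier, ordered by timestamp
    var lower = 0 //index of the oldest event still inside the one-minute window
    sorted.zipWithIndex.map { case ((ts, duration), i) =>
      while (sorted(lower)._1 < ts - 60000) lower += 1
      val rpm = (i - lower + 1).toLong //events within [ts - 60000, ts]
      (identifier, duration, rpm) //identifier, duration, rpm
    }
  }

groupByKey keeps all events of one identifier on a single executor, so neither collect nor broadcast is needed; the trade-off is that very hot identifiers produce large groups.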

1 Answer:

Answer 0 (score: 0):

The code given in the EDIT section is not correct. That is not how a variable is broadcast in Spark. The correct way is as follows:

val collected = sc.broadcast(kMeansInformationRdd.collect())

val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2 //Long - unix timestamp millis
  val duration = x._3 //Long

  val rpm = collected.value.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size

  (duration, rpm)

})

Of course, you could also use a global variable instead of sc.broadcast, but that is not recommended. Why?

The reason is the difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING a variable using sc.broadcast():

  1. When using the external variable directly, Spark sends a copy of the serialized variable with every TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is usually around 10 times the number of executors, so when the variable (say an array) is large enough (more than 10-20 KB), the former can spend a lot of time on network transfer and cause frequent GC, which slows Spark down. Hence it is suggested that large variables (>10-20 KB) be broadcast explicitly.

  2. When using the external variable directly, the variable is not persisted; it ends with the task and therefore cannot be reused. With sc.broadcast() the variable is automatically persisted in the executors' memory and remains there until you explicitly unpersist it. A sc.broadcast variable is thus available across tasks and stages.

So if the variable is expected to be used multiple times, sc.broadcast() is recommended.
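As an illustration (a minimal sketch only, reusing the collected broadcast and kMeansInformationRdd from the snippet above; rpmRdd, maxRpm and sumRpm are made-up names):

val rpmRdd = kMeansInformationRdd.map(x =>
  collected.value.count(y => x._2 - 60000 <= y._2 && x._2 >= y._2).toLong)

val maxRpm = rpmRdd.max() //first action: each executor fetches the broadcast once
val sumRpm = rpmRdd.sum() //second action: the executor-side copy is reused, nothing is resent

collected.unpersist() //drop the copies from executor memory; Spark can re-send the broadcast if it is used again
//collected.destroy() //or release it on the driver and the executors for good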