Question

I have this piece of code that works fine in standalone mode, but it runs slowly on a cluster of 4 slaves (8 cores, 30 GB memory) on AWS.

For a file of 10000 entries
Standalone : 257s
AWS, 4 slaves : 369s

    def tabHash(nb:Int, dim:Int) = {

        var tabHash0 = Array(Array(0.0)).tail

        for( ind <- 0 to nb-1) {
            var vechash1 = Array(0.0).tail
            for( ind <- 0 to dim-1) {
                val nG = Random.nextGaussian
                vechash1 = vechash1 :+ nG
            }
            tabHash0 = tabHash0 :+ vechash1
        }
        tabHash0
    }

    def hashmin3(x:Vector, w:Double, b:Double, tabHash1:Array[Array[Double]]) = {

        var tabHash0 = Array(0.0).tail
        val x1 = x.toArray
        for( ind <- 0 to tabHash1.size-1) {
            var sum = 0.0
            for( ind2 <- 0 to x1.size-1) {
                sum = sum + (x1(ind2)*tabHash1(ind)(ind2))
            }           
            tabHash0 =  tabHash0 :+  (sum+b)/w
        }
        tabHash0

    }

    def pow2(tab1:Array[Double], tab2:Array[Double]) = {

        var sum = 0.0
        for( ind <- 0 to tab1.size-1) {
            sum = sum - Math.pow(tab1(ind)-tab2(ind),2)
        }
        sum
    }


        val w = ww
        val b = Random.nextDouble * w
        val tabHash2 = tabHash(nbseg,dim)

        var rdd_0 = parsedData.map(x => (x.get_id,(x.get_vector,hashmin3(x.get_vector,w,b,tabHash2)))).cache

        var rdd_Yet = rdd_0

        for( ind <- 1 to maxIterForYstar  ) {

            var rdd_dist = rdd_Yet.cartesian(rdd_0).flatMap{ case (x,y) => Some((x._1,(y._2._1,pow2(x._2._2,y._2._2))))}//.coalesce(64)

            var rdd_knn = rdd_dist.topByKey(k)(Ordering[(Double)].on(x=>x._2))

            var rdd_bary = rdd_knn.map(x=> (x._1,Vectors.dense(bary(x._2,k))))

            rdd_Yet = rdd_bary.map(x=>(x._1,(x._2,hashmin3(x._2,w,b,tabHash2))))


        }

I tried to broadcast some variables:

        val w = sc.broadcast(ww.toDouble)
        val b = sc.broadcast(Random.nextDouble * ww)
        val tabHash2 = sc.broadcast(tabHash(nbseg,dim))

It had no effect.

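I know the bary function is not the cause, because I tried another version of this code without hashmin3, and it works fine with 4 slaves (it gets worse with 8 slaves, but that is a separate topic).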

Answer 1

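This is bad code, especially for distributed and large computations. I can't quickly tell what the root of the problem is, but you should rewrite this code anyway.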

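Arrays are a very poor fit for general-purpose, shared data. An Array is mutable and needs a contiguous block of memory (the latter can be a problem even if you have enough memory overall). Prefer Vector (or sometimes List). Never use arrays.
`var vechash1 = Array(0.0).tail`: you create a collection with one element, then call a function to get an empty collection. If this happens rarely, performance is not a concern, but it is ugly! Use `var vechash1: Array[Double] = Array()`, `var vechash1: Vector[Double] = Vector()`, or `var vechash1 = Vector.empty[Double]` instead.
`def tabHash(nb:Int, dim:Int) =`: always declare a function's return type when it is not obvious. Scala's strength is its rich type system. Compile-time checks are very helpful (they tell you what you actually get as a result, not what you imagine you get!). This matters a lot when processing large amounts of data, because compile-time checks save you time and money. Such code is also easier to read later. For example: `def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] =`
`def hashmin3(x: Vector,`: a typo? Scala's collection `Vector` does not compile without a type parameter (unless this is Spark MLlib's `Vector`, which the `Vectors.dense` call in the question suggests).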
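As a minimal sketch of the points above: each `:+` on an `Array` copies the whole array, so building a collection by appending in a loop costs quadratic time. The helper below is hypothetical (the name `gaussianRow` and the explicit `rng` parameter are not in the question); it starts from a genuinely empty collection, uses a builder for the appends, and declares its return type.

```scala
import scala.util.Random

// Hypothetical helper, not the asker's code: builds one row of Gaussian values.
// The explicit return type documents what callers actually get back.
def gaussianRow(dim: Int, rng: Random): Vector[Double] = {
  val builder = Vector.newBuilder[Double] // appends are amortized constant time
  for (_ <- 0 until dim) builder += rng.nextGaussian()
  builder.result()
}

// An empty collection without the Array(0.0).tail detour.
val empty: Vector[Double] = Vector.empty[Double]

// A fixed seed is an assumption made here only for reproducibility.
val row: Vector[Double] = gaussianRow(3, new Random(42L))
```

If the result must be an `Array` (for example to pass to `Vectors.dense`), calling `.toArray` once at the end avoids the repeated copying.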

The first function, written more compactly:

    import scala.util.Random

    // Build nb random hash vectors, each with dim standard-normal entries.
    def tabHash(nb:Int, dim:Int): Vector[Vector[Double]] = {
      (0 until nb).map { _ =>
        (0 until dim).map(_ => Random.nextGaussian()).toVector
      }.toVector
    }
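The same table can probably be built even more directly with the two-dimensional `fill` on the `Vector` companion. This is only a sketch: the name `tabHashFill` and the optional `rng` parameter are assumptions added here so that a seed can be supplied.

```scala
import scala.util.Random

// Sketch: nb rows of dim Gaussian values each.
// fill evaluates rng.nextGaussian() once per cell, because the element is passed by name.
def tabHashFill(nb: Int, dim: Int, rng: Random = new Random()): Vector[Vector[Double]] =
  Vector.fill(nb, dim)(rng.nextGaussian())

// Usage with a fixed seed (an assumption, for reproducibility only).
val table: Vector[Vector[Double]] = tabHashFill(4, 3, new Random(1L))
```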

The second function computes ((x*M) + scalar_b)/scalar_w. It may be more efficient to use a library that is specifically optimized for matrix operations.
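Even without a matrix library, the formula can be sketched in plain Scala: each output component is the dot product of x with one row of the hash table, shifted by b and divided by w. The name `hashmin3Sketch` is hypothetical, and taking x as an `Array[Double]` is an assumption (the question converts its vector with `toArray` before the loop).

```scala
// Sketch of ((x*M) + b) / w, with one output value per row of tabHash1.
// Assumes every row has at least x.length entries, as in the question.
def hashmin3Sketch(x: Array[Double], w: Double, b: Double,
                   tabHash1: Array[Array[Double]]): Array[Double] =
  tabHash1.map { row =>
    var sum = 0.0
    var i = 0
    while (i < x.length) { // plain loop: no intermediate collections
      sum += x(i) * row(i)
      i += 1
    }
    (sum + b) / w
  }

val hashed = hashmin3Sketch(
  Array(1.0, 2.0), w = 2.0, b = 1.0,
  tabHash1 = Array(Array(1.0, 0.0), Array(0.0, 1.0)))
// hashed contains 1.0 and 1.5
```

Using `map` allocates the result array once, instead of copying it on every `:+` as the question's version does.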

The third (I guess the sign in this calculation is wrong, if it is meant to compute a squared error):

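    // Negated sum of squared differences, matching the question's pow2
    // (drop the leading minus for a plain squared error).
    def pow2(tab1:Vector[Double], tab2:Vector[Double]): Double =
      -tab1.zip(tab2).map{case (t1,t2) => Math.pow(t1 - t2, 2)}.sum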
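One possible reason for the negative sign in the question's `pow2`: as far as I can tell, MLlib's `topByKey` keeps the k largest values per key, so ranking by a negated distance selects the k nearest points. The following plain-Scala sketch illustrates that idea without Spark; the names `negSqDist`, `query` and `points` are made up for the example.

```scala
// Negated squared Euclidean distance: a larger score means a closer point.
def negSqDist(a: Vector[Double], b: Vector[Double]): Double =
  -a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

val query = Vector(0.0, 0.0)
val points = Vector(Vector(3.0, 4.0), Vector(1.0, 0.0), Vector(0.0, 2.0))

// Sort by descending score and keep the top k = 2, i.e. the 2 nearest points.
val nearest = points.sortBy(p => -negSqDist(query, p)).take(2)
// nearest == Vector(Vector(1.0, 0.0), Vector(0.0, 2.0))
```

If that is the intent, the sign is not a mistake, but a short comment in the code saying so would help readers.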

`var rdd_Yet = rdd_0`: the cached RDD is overwritten in the loop, so for `rdd_Yet` the cache is useless storage.

The final loop is hard to analyze. I think it must be simplified.

Spark works slower on a cluster than standalone
