Question

编辑：我的样本量太小了。当我对8个CPU的实际数据进行运行时，我看到速度提高了7.2倍。在我的代码中添加4个字符并不太简陋;）

我目前正在尝试“销售”管理使用Scala的好处，特别是在扩展CPU时。为此，我创建了一个简单的测试应用程序，它可以执行一系列矢量数学运算，并且发现在我的四核机器上运行时并不是特别好。有趣的是，我发现运行时是第一次通过集合时最差，并且随后的调用会变得更好。并行集合中是否存在导致此问题的一些懒惰事物，或者我只是做错了？应该注意的是，我来自C ++ / C＃世界，所以我完全可能以某种方式搞砸了我的配置。无论如何，这是我的设置：

InteliJ Scala插件

Scala 2.9.1.final

Windows 7 64位，四核处理器（无超线程）

import util.Random

  // simple Vector3D class that has final x,y,z components a length, and a '-' function
  class Vector3D(val x:Double,  val y:Double, val z:Double)
  {
    def length = math.sqrt(x*x+y*y+z*z)
    def -(rhs : Vector3D ) = new Vector3D(x - rhs.x, y - rhs.y, z - rhs.z)
  }

object MainClass {

  def main(args : Array[String]) =
  {
    println("Available CPU's: " + Runtime.getRuntime.availableProcessors())
    println("Parallelism Degree set to: " + collection.parallel.ForkJoinTasks.defaultForkJoinPool.getParallelism);
    // my position
    val myPos = new Vector3D(0,0,0);

    val r = new Random(0);

    // define a function nextRand that gets us a random between 0 and 100
    def nextRand = r.nextDouble() * 100;

    // make 10 million random targets
    val targets = (0 until 10000000).map(_ => new Vector3D(nextRand, nextRand, nextRand)).toArray
    // take the .par hit before we start profiling
    val parTargets = targets.par

    println("Created " + targets.length + " vectors")

    // define a range function
    val rangeFunc : (Vector3D => Double) = (targetPos) => (targetPos - myPos).length

    // we'll select ones that are <50
    val within50 : (Vector3D => Boolean) = (targetPos) => rangeFunc(targetPos) < 50

    // time it sequentially
    val startTime_sequential = System.currentTimeMillis()
    val numTargetsInRange_sequential = targets.filter(within50)
    val endTime_sequential = System.currentTimeMillis()
    println("Sequential (ms): " + (endTime_sequential - startTime_sequential))

    // do the parallel version 10 times
    for(i <- 1 to 10)
    {

      val startTime_par = System.currentTimeMillis()
      val numTargetsInRange_parallel = parTargets.filter(within50)
      val endTime_par = System.currentTimeMillis()

      val ms = endTime_par - startTime_par;
      println("Iteration[" + i + "] Executed in " + ms + " ms")
    }
  }
}

该程序的输出是：

Available CPU's: 4
Parallelism Degree set to: 4
Created 10000000 vectors
Sequential (ms): 216
Iteration[1] Executed in 227 ms
Iteration[2] Executed in 253 ms
Iteration[3] Executed in 76 ms
Iteration[4] Executed in 78 ms
Iteration[5] Executed in 77 ms
Iteration[6] Executed in 80 ms
Iteration[7] Executed in 78 ms
Iteration[8] Executed in 78 ms
Iteration[9] Executed in 79 ms
Iteration[10] Executed in 82 ms

那么这里发生了什么？前两次我们做过滤器，它慢了，然后事情加快了？我知道本质上会有并行性的启动成本，我只是想弄清楚在我的应用程序中表达并行性是有意义的，特别是我希望能够向Management显示一个运行3-4次的程序在四核盒子上更快。这不是一个好问题吗？

想法？

Answer 1

你有微基准疾病。您最有可能对JIT编译阶段进行基准测试。你需要先预先运行JIT来预热你的JIT。

可能最好的想法是使用像http://code.google.com/p/caliper/这样的微基准测试框架来处理所有这些问题。

修改：对于Caliper基准Scala项目有一个很好的SBT Template，引用为from this blog post

Answer 2

事情确实加快了，但这与并行与顺序无关，你不是在比较苹果与苹果。 JVM有一个JIT（及时）编译器，只有在代码被使用了一定次数后才会编译一些字节代码。因此，您在第一次迭代中看到的是对于尚未JIT编辑的代码执行速度较慢以及正在进行的JIT编译本身的时间。删除.par以便全部顺序是我在我的机器上看到的（迭代次数减少10倍，因为我使用的是旧机器）：

Sequential (ms): 312
Iteration[1] Executed in 117 ms
Iteration[2] Executed in 112 ms
Iteration[3] Executed in 112 ms
Iteration[4] Executed in 112 ms
Iteration[5] Executed in 114 ms
Iteration[6] Executed in 113 ms
Iteration[7] Executed in 113 ms
Iteration[8] Executed in 117 ms
Iteration[9] Executed in 113 ms
Iteration[10] Executed in 111 ms

但这都是顺序的！您可以使用JVM -XX:+PrintCompilation（在JAVA_OPTS中设置或使用-J-XX:+PrintCompilation scala选项来查看JVM在JIT方面的作用。在第一次迭代中，您将看到大量JVM打印语句显示正在被JIT编辑的内容，然后它会在以后稳定下来。

因此，为了比较苹果和苹果，首先运行没有par，然后添加par并运行相同的程序。在我的双核心上，当我使用.par时，我得到了：

Sequential (ms): 329
Iteration[1] Executed in 197 ms
Iteration[2] Executed in 60 ms
Iteration[3] Executed in 57 ms
Iteration[4] Executed in 58 ms
Iteration[5] Executed in 59 ms
Iteration[6] Executed in 73 ms
Iteration[7] Executed in 56 ms
Iteration[8] Executed in 60 ms
Iteration[9] Executed in 58 ms
Iteration[10] Executed in 57 ms

一旦稳定，或多或少加速2倍。

在相关的注释中，你要小心的另一件事是拳击和解拳，特别是如果你只是比较Java。像过滤器这样的scala库高阶函数正在进行原始类型的装箱和解包，这通常是那些将代码从Java转换为Scala的人最初的失望之源。

虽然在这种情况下不适用，因为for超出了时间，但使用for而不是while也需要一些成本，但2.9.1编译器应该使用-optimize scalac标志时做得不错。

Answer 3

除了之前提到的JIT优化之外，您需要评估的一个关键概念是您的问题是否倾向于并行化：分裂，线程协调和连接的固有成本会对并行处理的优势产生影响。 Scala隐藏了这种复杂性，但您确实需要知道何时应用它以获得良好的结果。

在您的情况下，虽然您正在执行大量操作，但每个操作本身几乎都是CPU微不足道的。要查看并行集合的运行情况，请尝试以单位为基础的繁重操作。

对于类似的Scala演示文稿，我使用了一个简单（无效）的算法来计算一个数字是否为素数： def isPrime(x:Int) = (2 to x/2).forall(y=>x%y!=0)

然后使用您提供的相同逻辑来确定集合中数字的数字：

val col = 1 to 1000000
col.filter(isPrime(_))  // sequential
col.par.filter(isPrime(_)) // parallel

CPU行为确实显示了两者之间的区别： prime numbers: sequential vs parallel

对于4核CPU中的并行集合，时间大约是3.5倍。

Answer 4

怎么样

val numTargetsInRange_sequential = parTargets.filter(within50)

此外，您可能会使用地图而不是过滤器操作获得更令人印象深刻的结果。

Scala并行收集运行时令人费解

4 个答案: