Question

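I'm doing a simple coin-flip experiment for a class that involves flipping a certain number of coins on a certain number of threads. To run our speedup performance tests, we use a fixed number of coin flips (I've been using one billion) and vary the number of threads. We run these tests on an AWS Extra Large High-CPU instance with 8 cores. For some reason, as soon as I use more than 6 threads, I get a significant slowdown. Worse, it's inconsistent: sometimes I get 14 seconds and sometimes 2 for the same number of threads and flips. This makes no sense. I've tried different JVMs (OpenJRE and the Sun JVM) and tried a new instance. Below are my code and benchmark results (in ms). I'd love some help. Thanks.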

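EDIT: So it seems I've solved the problem, thanks in large part to the suggestions of yadab and Bruno Reis. They suggested using a local variable to keep track of the number of heads, which I think may have been a factor. They also suggested running all of my tests within the same JVM session, which was almost certainly a factor. Thanks for everyone's help.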
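For readers who want to see the local-variable suggestion concretely, here is a minimal sketch. The class name `LocalCounterFlip` is hypothetical and this is not the asker's final code. It counts heads in a local variable and writes the field only once, after the loop. One plausible reason this helps (an assumption, not something the thread confirms) is that the `numHeads` fields of `CoinFlip` objects allocated back to back can end up on the same cache line, so threads incrementing them constantly contend with each other.

```java
import java.util.Random;

// Hypothetical variant of the question's CoinFlip: the hot loop touches only
// a local counter, and the shared field is written a single time at the end.
class LocalCounterFlip implements Runnable {
    private final long iterations;
    private long numHeads; // written once, after the loop finishes

    LocalCounterFlip(long iterations) {
        this.iterations = iterations;
    }

    @Override
    public void run() {
        Random rand = new Random();
        long heads = 0; // local variable: no other thread can see or share it
        for (long i = 0; i < iterations; i++) {
            if (rand.nextBoolean()) {
                heads++;
            }
        }
        numHeads = heads; // publish the result; the caller reads it after join()
    }

    long getHeads() {
        return numHeads;
    }
}
```

It can be dropped into the question's `main` in place of `CoinFlip`; reading `getHeads()` after `join()` is safe because `join()` makes the worker's writes visible to the joining thread.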

Speedup:
Threads  Flips       Time (ms), three runs
1        1000000000  16402  16399  16404
2        1000000000  8218   8216   8217
3        1000000000  5493   5483   5492
4        1000000000  4125   4127   4140
5        1000000000  3306   3304   3311
6        1000000000  2758   2766   2756
7        1000000000  8346   7874   10617
8        1000000000  14370  14414  17831
9        1000000000  14956  14764  15316
10       1000000000  13595  14491  14031
11       1000000000  12642  11188  10625
12       1000000000  10620  10629  10876
13       1000000000  8422   9950   9756
14       1000000000  9284   9546   10194
15       1000000000  8524   4134   8046
16       1000000000  6915   6361   7275

Code:

import java.util.Random;

public class CoinFlip implements Runnable {
    private final long iterations; //iterations is the number of times the program will run, numHeads is the number of heads counted
    private long numHeads;
    public CoinFlip(long iterations) {
        this.iterations = iterations;
    }

    @Override
    public void run() {
        Random rand = new Random();
        numHeads = 0;
        for (long i = 0; i < iterations; i++) {
            if (rand.nextBoolean()) { //True represents heads, false represents a tails
                numHeads++;
            }
        }
    }

    public long getHeads() { //numHeads getter
        return numHeads;
    }

    public static void main(String[] args) {
        final long numIterations , itersPerThread; //iterations: number of iterations, threads: number of threads to run on, itersPerThread: how many iterations each thread is responsible for
        final int threads;
        if (args.length != 2) {
            System.out.println("Usage: java CoinFlip #threads #iterations");
            return;
        }
        try {
            threads = Integer.parseInt(args[0]);
            numIterations = Long.parseLong(args[1]);
        } catch (NumberFormatException e) {
            System.out.println("Usage: java CoinFlip #threads #iterations");
            System.out.println("Invalid arguments");
            return;
        }
        itersPerThread = numIterations / ((long)threads); //Might cause rounding errors, but we were told to ignore that
        Thread[] threadList = new Thread[threads]; //List of running threads so we can join() them later
        CoinFlip[] flipList = new CoinFlip[threads]; //List of our runnables so that we can collect the number of heads later
        for (int i = 0; i < threads; i++) { //create each runnable
            flipList[i] = new CoinFlip(itersPerThread);
        }
        long time = System.currentTimeMillis(); //start time
        for (int i = 0; i < threads; i++) { //create and start each thread
            threadList[i] = new Thread(flipList[i]);
            threadList[i].start();
        }
        for (int i = 0; i < threads; i++) { //wait for all threads to finish
            try {
                threadList[i].join();
                System.out.println("Collected thread " + i);
            } catch (InterruptedException e) {
                System.out.println("Interrupted");
                return;
            }
        }
        time = System.currentTimeMillis() - time; //total running time
        long totHeads = 0; 
        for (CoinFlip t : flipList) { //Collect number of heads from each CoinFlip object
            totHeads += t.getHeads();
        }

        //Print results
        System.out.println(totHeads + " heads in " + (numIterations / threads)
                * threads + " coin tosses on " + threads + " threads");
        System.out.println("Elapsed time: " + time + "ms");
    }
}

Answer 1

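As long as you are doing only CPU-bound operations, there is little sense in using more threads than there are available cores. On the contrary, the extra threads add context-switching and scheduling overhead.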
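A minimal sketch of that advice, assuming you want the program itself to cap the thread count. The helper name `effectiveThreads` is hypothetical. Note that `Runtime.availableProcessors()` reports logical processors as seen by the JVM, which on a virtual machine may be hyperthreads or virtual cores rather than physical ones.

```java
class ThreadCap {
    // Hypothetical helper: limit the requested thread count to the number of
    // logical processors the JVM reports, and never go below one thread.
    static int effectiveThreads(int requested) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.max(1, Math.min(requested, cores));
    }

    public static void main(String[] args) {
        System.out.println("Processors reported: "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("Threads used for a request of 16: "
                + effectiveThreads(16));
    }
}
```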

Answer 2

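If you are running on a VM, the available cores may be virtual as well. Sometimes you may get 8 distinct cores, and sometimes 4 cores x 2 threads.

I suspect the underlying machine has 6 cores with 2 threads each, of which you can use at most 8.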

Answer 3

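Your threads are CPU-intensive. They never block waiting for some slow resource to become ready, so they compete with one another for the CPU.

I bet each thread keeps getting paused so that another thread can be put on the CPU; that is, the execution time slice always runs out. As a result there is a lot of context switching between threads with no real gain compared with using only 6 threads (assuming six threads can execute simultaneously).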

Answer 4

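Any Java test that runs for only a short time (less than 30 seconds) is not suitable for performance testing. The HotSpot compiler and other Java runtime mechanisms are still optimizing the code during the first several seconds of an application's run. Your timing deviations can easily be attributed to JVM startup, optimization, and shutdown.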

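If you want more realistic timings, you will have to let the program run for 30 seconds or so and only then start timing. I would also suggest lengthening the test run by at least an order of magnitude to better average out the effects of OS overhead, GC, background tasks, and so on. So: warm up your application by letting it run for 30 seconds, start your test and the timer, let it run for at least a minute, stop the timer and record your results, and then shut down the JVM.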
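The warm-up-then-measure procedure can be sketched roughly as follows. The class and method names are hypothetical, and the round and iteration counts in `main` are small placeholders rather than the 30-second and one-minute durations recommended above. The point is only that the untimed rounds and the timed round run inside the same JVM session.

```java
import java.util.Random;

class WarmupBenchmark {
    // Work under test: flip `iterations` coins on the calling thread.
    static long flip(long iterations) {
        Random rand = new Random();
        long heads = 0;
        for (long i = 0; i < iterations; i++) {
            if (rand.nextBoolean()) {
                heads++;
            }
        }
        return heads;
    }

    // Runs `warmupRounds` untimed rounds so the JIT can compile the hot loop,
    // then returns the elapsed nanoseconds of one timed round.
    static long timeAfterWarmup(int warmupRounds, long iterations) {
        long sink = 0;
        for (int r = 0; r < warmupRounds; r++) {
            sink += flip(iterations);
        }
        long start = System.nanoTime();
        sink += flip(iterations);
        long elapsed = System.nanoTime() - start;
        // Use the accumulated result so the JIT cannot discard the loops.
        if (sink < 0) {
            throw new IllegalStateException("unexpected negative head count");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long nanos = timeAfterWarmup(5, 50_000_000L);
        System.out.println("Timed round: " + nanos / 1_000_000 + " ms");
    }
}
```

`System.nanoTime()` is used instead of `System.currentTimeMillis()` because it is intended for measuring elapsed time. For serious measurements, a dedicated harness such as JMH handles warm-up and dead-code elimination more carefully than this sketch does.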

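Also, it makes more sense to chart how many coin flips you complete in a fixed amount of time than to see how long a fixed number of coin flips takes. The difference is that, where possible, you want all of your tests to run for the same amount of time.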
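One way to measure flips per fixed time window, as a rough sketch. The class name, the batch size of 10,000, and the stop-flag design are all assumptions made for illustration. Each worker counts in a local variable and checks a shared flag only once per batch, so the flag itself does not become a bottleneck.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;

class FixedTimeFlips {
    // Flips coins on `threads` threads for roughly `millis` milliseconds and
    // returns the total number of flips completed across all threads.
    static long flipsIn(int threads, long millis) {
        AtomicBoolean stop = new AtomicBoolean(false);
        long[] counts = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                Random rand = new Random();
                long flips = 0; // local counter, written out once at the end
                while (!stop.get()) {
                    // Flip in batches so the stop flag is read only occasionally.
                    for (int i = 0; i < 10_000; i++) {
                        rand.nextBoolean();
                    }
                    flips += 10_000;
                }
                counts[id] = flips; // visible to the caller after join()
            });
            workers[t].start();
        }
        long total = 0;
        try {
            Thread.sleep(millis);
            stop.set(true);
            for (int t = 0; t < threads; t++) {
                workers[t].join();
                total += counts[t];
            }
        } catch (InterruptedException e) {
            stop.set(true); // let the workers finish their current batch and exit
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int threads = 1; threads <= 8; threads++) {
            System.out.println(threads + " threads: " + flipsIn(threads, 1000) + " flips");
        }
    }
}
```

Because every run lasts about the same wall-clock time, the totals for different thread counts can be compared directly as throughput.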

Answer 5

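A quick look at your source doesn't reveal any bottleneck (there is no fine-grained synchronization anywhere).

In the comments you mention that you are running on a cloud service. Most likely the virtualized system is also serving other clients. If that assumption is correct, you cannot expect to run any meaningful benchmark, because you don't know what other processing the system may be doing besides your own workload.

Try the test on a local workstation. It should show less variation, though naturally there will still be some (there is no guarantee that every thread gets the same CPU slice).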

Answer 6

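The number of threads should not be greater than the number of CPU cores you have. You will be penalized because the VM has to switch between threads.

Using more than 6 threads causes a serious performance penalty.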
