当我使用比我的CPU有核心更多的线程时,为什么这段代码会变得更快?

时间:2016-10-13 19:02:27

标签: c++ multithreading performance

我有一些时间码,我用它来衡量给定代码片段的运行时间:

 post "/cities" => "cities#create", as: :cities

然后,为了使用它,我有以下代码用于对一组数据执行累积,但不是简单地将数字加在一起,而是首先找到总停止时间从应用Collatz Conjecture并积累它。

struct time_data {
    std::chrono::steady_clock::time_point start, end;

    auto get_duration() const {
        return end - start;
    }

    void print_data(std::ostream & out) const {
        out << str();
    }

    std::string str() const {
        static const std::locale loc{ "" };
        std::stringstream ss;
        ss.imbue(loc);
        ss << "Start:    " << std::setw(24) << std::chrono::nanoseconds{ start.time_since_epoch() }.count() << "ns\n";
        ss << "End:      " << std::setw(24) << std::chrono::nanoseconds{ end.time_since_epoch() }.count() << "ns\n";
        ss << "Duration: " << std::setw(24) << std::chrono::nanoseconds{ get_duration() }.count() << "ns\n";
        return ss.str();
    }

    static friend std::ostream & operator<<(std::ostream & out, const time_data & data) {
        return out << data.str();
    }
};

template<typename T>
time_data time_function(T && func) {
    time_data data;
    data.start = std::chrono::steady_clock::now();
    func();
    data.end = std::chrono::steady_clock::now();
    return data;
}

为了参考起见:运行它的样板template<typename T> T accumulation_function(T a, T b) { T count = 0; while (b > 1) { if (b % 2 == 0) b /= 2; else b = b * 3 + 1; ++count; } return a + count; } template<typename IT> auto std_sum(IT begin, IT end) { auto sum = (*begin - *begin); sum = std::accumulate(begin, end, sum, accumulation_function<decltype(sum)>); return sum; } template<typename IT> auto single_thread_sum(IT begin, IT end) { auto sum = (*begin - *begin); IT current = begin; while (current != end) { sum = accumulation_function(sum, *current); ++current; } return sum; } template<typename IT, uint64_t N> auto N_thread_smart_sum(IT begin, IT end) { auto sum = (*begin - *begin); std::vector<std::thread> threads; std::array<decltype(sum), N> sums; auto dist = std::distance(begin, end); for (uint64_t i = 0; i < N; i++) { threads.emplace_back([=, &sums] { IT first = begin; IT last = begin; auto & tsum = sums[i]; tsum = 0; std::advance(first, i * dist / N); std::advance(last, (i + 1) * dist / N); while (first != last) { tsum = accumulation_function(tsum, *first); ++first; } }); } for (std::thread & thread : threads) thread.join(); for (const auto & s : sums) { sum += s; } return sum; } template<typename IT> auto two_thread_smart_sum(IT begin, IT end) { return N_thread_smart_sum<IT, 2>(begin, end); } template<typename IT> auto four_thread_smart_sum(IT begin, IT end) { return N_thread_smart_sum<IT, 4>(begin, end); } template<typename IT> auto eight_thread_smart_sum(IT begin, IT end) { return N_thread_smart_sum<IT, 8>(begin, end); } template<typename IT> auto sixteen_thread_smart_sum(IT begin, IT end) { return N_thread_smart_sum<IT, 16>(begin, end); } template<typename IT> auto thirty_two_thread_smart_sum(IT begin, IT end) { return N_thread_smart_sum<IT, 32>(begin, end); } template<typename IT> auto sixty_four_thread_smart_sum(IT begin, IT end) { return N_thread_smart_sum<IT, 64>(begin, end); } 代码:

main

这是该计划的输出:

int main() {
    std::vector<int64_t> raw_data;
    auto fill_data = time_function([&raw_data] {
        constexpr uint64_t SIZE = 1'000'000'000ull;
        raw_data.resize(SIZE);
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; i++) {
            threads.emplace_back([i, SIZE, &raw_data] {
                uint64_t begin = i * SIZE / 8;
                uint64_t end = (i + 1) * SIZE / 8;
                for (uint64_t index = begin; index < end; index++) {
                    raw_data[index] = begin % (20 + i);
                }
            });
        }
        for (std::thread & t : threads) 
            t.join();
    });

    int64_t sum;

    std::cout << std::setw(25) << "Fill Data" << std::endl;
    std::cout << fill_data << std::endl;

    auto std_data = time_function([&] {
        sum = std_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "STD Sum: " << sum << std::endl;
    std::cout << std_data << std::endl;

    auto single_data = time_function([&] {
        sum = single_thread_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Single Sum: " << sum << std::endl;
    std::cout << single_data << std::endl;

    auto smart_2_data = time_function([&] {
        sum = two_thread_smart_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Two-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_2_data << std::endl;

    auto smart_4_data = time_function([&] {
        sum = four_thread_smart_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Four-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_4_data << std::endl;

    auto smart_8_data = time_function([&] {
        sum = eight_thread_smart_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Eight-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_8_data << std::endl;

    auto smart_16_data = time_function([&] {
        sum = sixteen_thread_smart_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Sixteen-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_16_data << std::endl;

    auto smart_32_data = time_function([&] {
        sum = thirty_two_thread_smart_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Thirty-Two-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_32_data << std::endl;

    auto smart_64_data = time_function([&] {
        sum = sixty_four_thread_smart_sum(raw_data.begin(), raw_data.end());
    });

    std::cout << std::setw(25) << "Sixty-Four-Thread-Smart Sum: " << sum << std::endl;
    std::cout << smart_64_data << std::endl;
    return 0;
}

前几个结果并不令人惊讶:我自编的累积代码与 Fill Data Start: 16,295,979,890,252ns End: 16,300,523,805,484ns Duration: 4,543,915,232ns STD Sum: 7750000000 Start: 16,300,550,212,791ns End: 16,313,216,607,890ns Duration: 12,666,395,099ns Single Sum: 7750000000 Start: 16,313,216,694,522ns End: 16,325,774,379,684ns Duration: 12,557,685,162ns Two-Thread-Smart Sum: 7750000000 Start: 16,325,774,466,014ns End: 16,334,441,032,868ns Duration: 8,666,566,854ns Four-Thread-Smart Sum: 7750000000 Start: 16,334,441,137,913ns End: 16,342,188,642,721ns Duration: 7,747,504,808ns Eight-Thread-Smart Sum: 7750000000 Start: 16,342,188,770,706ns End: 16,347,850,908,471ns Duration: 5,662,137,765ns Sixteen-Thread-Smart Sum: 7750000000 Start: 16,347,850,961,597ns End: 16,352,187,838,584ns Duration: 4,336,876,987ns Thirty-Two-Thread-Smart Sum: 7750000000 Start: 16,352,187,891,710ns End: 16,356,111,411,220ns Duration: 3,923,519,510ns Sixty-Four-Thread-Smart Sum: 7750000000 Start: 16,356,111,471,288ns End: 16,359,988,028,812ns Duration: 3,876,557,524ns 函数的工作方式大致相同(我已经看到它们都出现了&#34;最快的&#34 ;在连续运行中,暗示它们的实现可能类似)。当我转移到两个线程和四个线程时,代码变得更快。这是有道理的,因为我使用的是英特尔4核处理器。

但超出这一点的结果令人困惑。我的CPU只有4个内核(如果考虑超线程,则为8个内核),但即使我原谅超线程在8个线程上提供的边际性能提升,将数字增加到16,32和64个线程都会产生额外的性能提升。为什么是这样?当我已经超出CPU可以物理同时运行的线程数时,额外的线程如何产生额外的性能提升?

注意:这与this linked question不同,因为我正在处理特定的用例和代码,而链接的问题则涉及一般性。

编辑:

我对代码进行了调整,以便在创建线程时,将其优先级设置为Windows允许的最高优先级。从广义上讲,结果似乎是一样的。

std::accumulate

DOUBLE EDIT COMBO:

我编辑了程序以确保 Fill Data Start: 18,950,798,175,057ns End: 18,955,085,850,007ns Duration: 4,287,674,950ns STD Sum: 7750000000 Start: 18,955,086,975,013ns End: 18,967,581,061,562ns Duration: 12,494,086,549ns Single Sum: 7750000000 Start: 18,967,581,136,724ns End: 18,980,127,355,698ns Duration: 12,546,218,974ns Two-Thread-Smart Sum: 7750000000 Start: 18,980,127,426,332ns End: 18,988,619,042,214ns Duration: 8,491,615,882ns Four-Thread-Smart Sum: 7750000000 Start: 18,988,619,135,487ns End: 18,996,215,684,824ns Duration: 7,596,549,337ns Eight-Thread-Smart Sum: 7750000000 Start: 18,996,215,763,004ns End: 19,002,055,977,730ns Duration: 5,840,214,726ns Sixteen-Thread-Smart Sum: 7750000000 Start: 19,002,056,055,608ns End: 19,006,282,772,254ns Duration: 4,226,716,646ns Thirty-Two-Thread-Smart Sum: 7750000000 Start: 19,006,282,840,774ns End: 19,010,539,676,914ns Duration: 4,256,836,140ns Sixty-Four-Thread-Smart Sum: 7750000000 Start: 19,010,539,758,113ns End: 19,014,450,311,829ns Duration: 3,910,553,716ns 变量是lambda中的局部变量,而不是lambda外部的引用。多线程代码加速了很多,但它仍然表现出相同的行为。

tsum

TRIPLE EDIT WOMBO-COMBO:

我再次运行相同的测试,但相反。结果非常相似:

                Fill Data
Start:          21,740,931,802,656ns
End:            21,745,429,501,480ns
Duration:            4,497,698,824ns

                STD Sum: 7750000000
Start:          21,745,430,637,655ns
End:            21,758,206,612,632ns
Duration:           12,775,974,977ns

             Single Sum: 7750000000
Start:          21,758,206,681,153ns
End:            21,771,260,233,797ns
Duration:           13,053,552,644ns

   Two-Thread-Smart Sum: 7750000000
Start:          21,771,260,287,828ns
End:            21,777,845,764,595ns
Duration:            6,585,476,767ns

  Four-Thread-Smart Sum: 7750000000
Start:          21,777,845,831,002ns
End:            21,784,011,543,159ns
Duration:            6,165,712,157ns

 Eight-Thread-Smart Sum: 7750000000
Start:          21,784,011,628,584ns
End:            21,788,846,061,014ns
Duration:            4,834,432,430ns

Sixteen-Thread-Smart Sum: 7750000000
Start:          21,788,846,127,422ns
End:            21,791,921,652,246ns
Duration:            3,075,524,824ns

Thirty-Two-Thread-Smart Sum: 7750000000
Start:          21,791,921,747,330ns
End:            21,794,980,832,033ns
Duration:            3,059,084,703ns

Sixty-Four-Thread-Smart Sum: 7750000000
Start:          21,794,980,901,761ns
End:            21,797,937,401,426ns
Duration:            2,956,499,665ns

2 个答案:

答案 0 :(得分:3)

所有tsum次写入都是相同的array,这将占用内存中的连续地址。这将导致所有这些写入的缓存争用问题。有更多的线程将这些写入扩展到不同的缓存行,因此CPU核心花费的时间更少,无效并重新加载缓存行。

尝试累积到本地tsum变量(不是对sums的引用),然后在循环完成时将其写入sums[i]

答案 1 :(得分:3)

我怀疑您看到了加速,因为您的应用程序的线程占系统上运行的总活动线程的较大百分比,从而为您提供更多的时间片。当你有其他程序在运行时,你可能想看看数字是变好还是变坏。