Question

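I have a thread pool that accepts jobs (a function pointer plus data) and hands them to worker threads to complete. Some jobs are given a pointer to a completion count, a std::atomic<uint32>, which they increment when they finish, so the main thread that created those jobs can tell how many of them have completed.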

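The problem is that 12+ threads are contending on a single uint32. I've already done a lot of work to separate jobs and their data along cache lines, so this is the only remaining point of contention I want to eliminate, but I'm not sure how best to solve this particular problem.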

What is the simplest way to collect the number of completed jobs without sharing a single uint32 among multiple threads?

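(It's fine if the main thread has to flush its cache when it checks this count; I just want to avoid dirtying the worker threads' caches. Also, the worker threads don't need to know the count: they only increment it, and only the main thread reads it.)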

Update:

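I'm currently trying not to share one count at all, and instead to keep one count per worker thread, which the main thread adds together when it checks. The idea is that the main thread pays the main cost (which is fine, since it's waiting to "join" anyway).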

Here's the ugly code I cooked up in 10 minutes:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Assumes uint32 is a project typedef for std::uint32_t, and that ThreadPool
// is fully defined before this class (the inline members call into it).

class Promise {
    friend class ThreadPool;
public:

    ~Promise() {
        // _destroyPromise frees this object's memory while we are still in its
        // destructor; tolerable only because no member has a destructor
        m_pool->_destroyPromise(this);
    }

    void join() {
        while (isDone() == false) {
            if (m_pool->doAJob() == false) {
                // we've no jobs to steal, try to spin a little gentler
                std::this_thread::yield();
            }
        }
    }

    void setEndCount(uint32 count) {
        m_endCount = count;
    }

    bool isDone() {
        return m_endCount == getCount();
    }

    // sums every per-thread slot; only the reading thread touches all the lines
    uint32 getCount() {
        uint32 count = 0;
        for (uint32 n = 0; n < m_countCount; ++n) {
            count += _getCountRef(n)->load();
        }
        return count;
    }

    uint32 getRemaining() {
        return m_endCount - getCount();
    }

private:
    // ThreadPool creates these as a factory
    Promise(ThreadPool * pool, uint32 countsToKeep, uint32 endCount, uint32 countStride, void * allocatedData)
        : m_endCount(endCount)          // initialisers listed in declaration order
        , m_countCount(countsToKeep)
        , m_countStride(countStride)
        , m_pool(pool)
        , m_perThreadCount(allocatedData)
    {}

    // slot 0 is the volunteer count and worker IDs start at 1; only ThreadPool should use this directly
    std::atomic<uint32> * _getCountRef(uint32 workerID = 0) {
        return reinterpret_cast<std::atomic<uint32>*>(
            static_cast<char*>(m_perThreadCount) + m_countStride * workerID);
    }

    // data
    uint32 m_endCount;
    uint32 m_countCount;  // how many per-thread counts we keep (workers + 1 volunteer)
    uint32 m_countStride; // byte distance between consecutive counts (one cache line)
    ThreadPool * m_pool;
    void * m_perThreadCount; // an atomic count for each worker thread + one volunteer count (for non-worker threads), separated by cache line
};
```

Update 2:

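Having tested this, it seems to work well. Unfortunately, it's a fairly large structure, 64 bytes × the number of worker threads (for me, that's pushing a KB), but for the jobs I typically run it buys a speedup of roughly 5% or more. I guess this will do for now.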
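As a rough check on that size, assuming one 64-byte line per worker plus the single volunteer slot described in the code comment, 12 workers come to 832 bytes, and the counts reach a full KiB (1024 bytes) at 15 workers:

$$
(12 + 1) \times 64\ \text{B} = 832\ \text{B}, \qquad (15 + 1) \times 64\ \text{B} = 1024\ \text{B}
$$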

How can I count completed jobs in a thread pool without sharing a single variable?
