Question

I'm trying to optimize for consumer latency in an SPSC queue like this:

template <typename TYPE>
class queue
{
public:

    void produce(message m)
    {
        const auto lock = std::scoped_lock(mutex);
        has_new_messages = true;
        new_messages.emplace_back(std::move(m));
    }

    void consume()
    {
        if (UNLIKELY(has_new_messages))
        {
            const auto lock = std::scoped_lock(mutex);
            has_new_messages = false;
            messages_to_process.insert(
                messages_to_process.cend(),
                std::make_move_iterator(new_messages.begin()),
                std::make_move_iterator(new_messages.end()));
            new_messages.clear();
        }

        // handle messages_to_process, and then...

        messages_to_process.clear();
    }

private:
    TYPE has_new_messages{false};
    std::vector<message> new_messages{};
    std::vector<message> messages_to_process{};

    std::mutex mutex;
};

The consumer here tries to avoid paying for locking/unlocking the mutex by doing a check before it locks the mutex.

The question is: do I absolutely have to use TYPE = std::atomic<bool>, or can I save on atomic operations, with reading a volatile bool being fine?

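It's known that a volatile variable per se doesn't guarantee thread safety. However, std::mutex::lock() and std::mutex::unlock() provide some memory ordering guarantees. Can I rely on them to make changes to volatile bool has_new_messages eventually visible to the consumer thread outside of the mutex scope?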

Update: following @Peter Cordes' advice, I rewrote it as follows:

    void produce(message m)
    {
        {
            const auto lock = std::scoped_lock(mutex);
            new_messages.emplace_back(std::move(m));
        }
        has_new_messages.store(true, std::memory_order_release);
    }

    void consume()
    {
        if (UNLIKELY(has_new_messages.exchange(false, std::memory_order_acq_rel)))
        {
            const auto lock = std::scoped_lock(mutex);
            messages_to_process.insert(...);
            new_messages.clear();
        }
    }

Answer 1

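It can't be a plain bool. The spin loop in your reader would optimize into something like if (!has_new_messages) infinite_loop; because the compiler can hoist the load out of the loop: it is allowed to assume the variable doesn't change asynchronously.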
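
To make the hoisting concrete, here is a minimal sketch. The names plain_flag, shared_flag, wait_for_flag and demo are made up for this illustration and do not come from the question. The commented-out loop on a plain bool is the one a compiler may legally turn into an infinite loop; the std::atomic version has to perform the load on every iteration.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, for illustration only.
bool plain_flag = false;               // no synchronization: a data race if shared
std::atomic<bool> shared_flag{false};  // safe to poll from another thread

// With a plain bool the compiler may assume nobody else writes the variable,
// so a loop like the following is allowed to become
//   if (!plain_flag) for (;;) {}
//
//   while (!plain_flag) {}

// With std::atomic the load is performed on every iteration.
void wait_for_flag()
{
    while (!shared_flag.load(std::memory_order_relaxed)) {
        // spin
    }
}

// Starts a waiter thread, sets the flag, and returns once the waiter saw it.
bool demo()
{
    shared_flag.store(false, std::memory_order_relaxed);
    std::thread waiter(wait_for_flag);
    shared_flag.store(true, std::memory_order_relaxed);
    waiter.join();
    return shared_flag.load(std::memory_order_relaxed);
}
```

The relaxed order is enough here because the loop only needs to notice the flag itself; it does not read any other data published by the writer.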

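volatile works on some platforms (including most mainstream CPUs, such as x86-64 or ARM) as a crappy alternative to atomic loads/stores with memory_order_relaxed, for types that are "naturally" atomic (e.g. int or bool, because the ABI gives them natural alignment). That is, on platforms where lock-free atomic loads/stores use the same asm as normal loads/stores.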

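I recently wrote an answer comparing volatile with relaxed atomic for an interrupt handler, but it's basically the same for concurrent threads. has_new_messages.load(std::memory_order_relaxed) compiles to the same asm you would get from volatile on normal platforms (i.e. no extra fence instructions, just a plain load or store), but it is legal, portable C++.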

You can and should just use std::atomic<bool> has_new_messages; with mo_relaxed loads/stores outside the mutex, if doing the same thing with volatile would have been safe.

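Your writer should probably set the flag after releasing the mutex, or maybe use a memory_order_release store at the end of the critical section. There's no point in having the reader break out of the spin loop and try to take the mutex when the writer hasn't actually released it yet.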
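
The advice above can be sketched roughly as follows. This is one possible arrangement, not necessarily the exact code the asker ended up with: message is assumed to be std::string because the question never defines it, consume() returns a count purely so the sketch is easy to check, and the flag is cleared inside the critical section instead of with exchange().

```cpp
#include <atomic>
#include <cstddef>
#include <iterator>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using message = std::string;  // assumption: the question never defines `message`

class queue
{
public:
    void produce(message m)
    {
        {
            const std::scoped_lock lock(mutex);
            new_messages.emplace_back(std::move(m));
        }
        // Flag only after the mutex is released, so a reader that sees the
        // flag does not immediately block on a mutex the writer still holds.
        // Relaxed is enough: the mutex orders the message data itself.
        has_new_messages.store(true, std::memory_order_relaxed);
    }

    // Returns how many messages were handled (added for this sketch only).
    std::size_t consume()
    {
        // Cheap check outside the mutex: a plain load on mainstream CPUs.
        if (has_new_messages.load(std::memory_order_relaxed)) {
            const std::scoped_lock lock(mutex);
            // Clearing inside the lock means a later produce() always sets
            // the flag after this store, so no message is left unflagged.
            has_new_messages.store(false, std::memory_order_relaxed);
            messages_to_process.insert(
                messages_to_process.end(),
                std::make_move_iterator(new_messages.begin()),
                std::make_move_iterator(new_messages.end()));
            new_messages.clear();
        }

        const std::size_t handled = messages_to_process.size();
        // handle messages_to_process here, and then...
        messages_to_process.clear();
        return handled;
    }

private:
    std::atomic<bool> has_new_messages{false};
    std::vector<message> new_messages{};
    std::vector<message> messages_to_process{};
    std::mutex mutex;
};
```

A side effect of this arrangement is an occasional spurious wake-up: the consumer may see the flag set, take the mutex, and find nothing new. That costs one extra lock/unlock but never loses a message.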

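BTW, if your reader thread is spinning on has_new_messages waiting for it to become true, you should use _mm_pause() in the loop on x86 to save power and to avoid a memory-order mis-speculation pipeline clear when the value does change. Also consider falling back to OS-assisted sleep/wake after spinning a few thousand times. See What does __asm volatile ("pause" ::: "memory"); do?, and for more about memory written by one thread and read by another, see What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? (including some memory-order mis-speculation results).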
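
A spin-then-back-off wait along those lines might look like the sketch below. The helper names cpu_relax and wait_until_set, the spin limit of 4000 and the 50 microsecond sleep are all assumptions chosen for illustration. The sleep_for polling is only a simple stand-in for a real OS-assisted wake-up such as a condition variable or C++20 std::atomic::wait.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // x86 spin-wait hint
#else
inline void cpu_relax() {}  // assumption: plain no-op on other targets
#endif

// Spins up to spin_limit times, then polls with short sleeps.
// Returns true if it had to fall back to sleeping (hypothetical helper).
inline bool wait_until_set(const std::atomic<bool>& flag, int spin_limit = 4000)
{
    for (int i = 0; i < spin_limit; ++i) {
        if (flag.load(std::memory_order_acquire)) {
            return false;  // seen while spinning
        }
        cpu_relax();
    }
    while (!flag.load(std::memory_order_acquire)) {
        // Stand-in for a real blocking wait; the duration is arbitrary.
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return true;
}
```

The spin phase keeps latency low when the producer is active, while the fallback stops an idle consumer from burning a core indefinitely.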

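Or better, use a lock-free SPSC queue. There are lots of implementations using a fixed-size ring buffer, where there is no contention between reader and writer as long as the queue is neither full nor empty. If you arrange things so that the atomic position counters for the reader and the writer sit in separate cache lines, it should perform well.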
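
As a rough illustration of that idea, here is a minimal single-producer/single-consumer ring buffer sketch. It is not taken from any particular library: the class name spsc_ring, the power-of-two capacity requirement, the requirement that T be default-constructible, and the 64-byte cache line size are all assumptions made for this example.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical SPSC ring buffer: exactly one thread may call try_push and
// exactly one other thread may call try_pop.
template <typename T, std::size_t Capacity>
class spsc_ring
{
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false if the ring is full.
    bool try_push(T value)
    {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) {
            return false;
        }
        slots_[tail & (Capacity - 1)] = std::move(value);
        // Release publishes the slot contents to the consumer.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns std::nullopt if the ring is empty.
    std::optional<T> try_pop()
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) {
            return std::nullopt;
        }
        std::optional<T> value(std::move(slots_[head & (Capacity - 1)]));
        // Release tells the producer that the slot may be reused.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    // Assumed 64-byte cache lines: keep the two counters apart so the
    // producer and consumer do not invalidate each other's line needlessly.
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the producer
    alignas(64) T slots_[Capacity]{};
};
```

Each side only writes its own counter and only reads the other one, which is why the two threads rarely touch the same cache line while the queue is neither full nor empty.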

"changes to volatile bool has_new_messages, so that they eventually become visible to the consumer thread"

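This is a common misconception. Any store will become visible to all other CPU cores very quickly, because they all share a coherent cache domain, and stores are committed to it as fast as possible without needing any fence instructions.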

See If I don't use fences, how long could it take a core to see another core's writes? The worst case is maybe about a microsecond, within an order of magnitude. Usually it is less.

And volatile or atomic ensures that there actually is a store in the compiler-generated asm.

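(Related: current compilers basically don't optimize atomic<T> at all, so atomic is basically equivalent to volatile atomic. See Why don't compilers merge redundant std::atomic writes? But even without that, the compiler couldn't skip doing the store or hoist the load out of a spin loop.)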

Reading a volatile variable outside the mutex scope, instead of std::atomic
