Question

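After learning the hard way that shared variables are currently not guarded by memory barriers, I have run into another problem. Either I am doing something wrong, or the existing compiler optimizations in dmd can break multithreaded code by reordering reads of shared variables.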

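For example, when I compile an executable with dmd -O (full optimization), the compiler happily optimizes away the local variable o in this code (where cas is the compare-and-swap function from core.atomic)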

shared uint cnt;
void atomicInc  ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}

into something like this (see the disassembly below):

shared uint cnt;
void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

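In the "optimized" code, cnt is read from memory twice, which risks another thread modifying cnt between the two reads. The value passed as the expected old value and the value used to compute o + 1 can then differ, so the optimization essentially destroys the compare-and-swap algorithm.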

Is this a bug, or is there a correct way to achieve the desired result? So far the only workaround I have found is to implement the code in assembler.

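Full test code and additional details
For completeness, here is the full test code that demonstrates both problems (missing memory barriers and the optimization problem). It produces the following output on three different Windows machines with dmd 2.049 and dmd 2.050 (assuming Dekker's algorithm does not deadlock, which could happen):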

dmd -O -run optbug.d
CAS   : failed
Dekker: failed

With full optimization, the loop in atomicInc is compiled to:

; cnt is stored at 447C10h
; while ( !cas( &cnt, o, o + 1 ) ) o = cnt;
; 1) prepare call cas( &cnt, o, o + 1 ): &cnt and o go to stack, o+1 to eax
402027: mov    ecx,447C10h         ; ecx = &cnt
40202C: mov    eax,[447C10h]     ; eax = o1 = cnt
402031: inc    eax                 ; eax = o1 + 1 (third parameter)
402032: push   ecx                 ; push &cnt (first parameter)
    ; next instruction pushes current value of cnt onto stack
    ; as second parameter o instead of re-using o1
402033: push   [447C10h]    
402039: call   4020BC              ; 2) call cas    
40203E: xor    al,1                ; 3) test success
402040: jne    402027              ; no success try again
; end of main loop

Here is the test code:

import core.atomic;
import core.thread;
import std.stdio;

enum loops = 0xFFFF;
shared uint cnt;

/* *****************************************************************************
 Implement atomicOp!("+=")(cnt, 1U); with CAS. The code below doesn't work with
 the "-O" compiler flag because cnt is read twice while calling cas and another
 thread can modify cnt in between.
*/
enum threads = 8;

void atomicInc  ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}
void threadFunc ( ) { foreach (i; 0..loops) atomicInc; }

void testCas ( ) {
    cnt = 0;
    auto tgCas = new ThreadGroup;
    foreach (i; 0..threads) tgCas.create(&threadFunc);
    tgCas.joinAll;
    writeln( "CAS   : ", cnt == loops * threads ? "passed" : "failed" );
}

/* *****************************************************************************
 Dekker's algorithm. Fails on ia32 (other than atom) because ia32 can re-order 
 read before write. Most likely fails on many other architectures.
*/
shared bool flag1 = false;
shared bool flag2 = false;
shared bool turn2 = false;   // avoids starvation by executing 1 and 2 in turns

void dekkerInc ( ) {
    flag1 = true;
    while ( flag2 ) if ( turn2 ) {
        flag1 = false; while ( turn2 )  {  /* wait until my turn */ }
        flag1 = true;
    }
    cnt++;                   // shouldn't work without a cast
    turn2 = true; flag1 = false;
}

void dekkerDec ( ) {
    flag2 = true;
    while ( flag1 ) if ( !turn2 ) {
        flag2 = false; while ( !turn2 ) { /* wait until my turn */ }
        flag2 = true;
    }
    cnt--;                   // shouldn't work without a cast
    turn2 = false; flag2 = false;
}

void threadDekkerInc ( ) { foreach (i; 0..loops) dekkerInc; }
void threadDekkerDec ( ) { foreach (i; 0..loops) dekkerDec; }

void testDekker ( ) {
    cnt = 0;
    auto tgDekker = new ThreadGroup;
    tgDekker.create( &threadDekkerInc );
    tgDekker.create( &threadDekkerDec );
    tgDekker.joinAll;
    writeln( "Dekker: ", cnt == 0 ? "passed" : "failed" );
}

/* ************************************************************************** */
void main() {
    testCas;
    testDekker;
}

Answer 1

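Although the problem still seems to exist, core.atomic now exposes atomicLoad, which makes a relatively simple workaround possible. To make the cas example work, it is enough to load cnt atomically: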

void atomicInc  ( ) { 
    uint o; 
    do {
         o = atomicLoad(cnt); 
    } while ( !cas( &cnt, o, o + 1 ) );
}

Similarly, to make Dekker's algorithm work:

// ...
while ( atomicLoad(flag2) ) if ( turn2 ) {
// ...
while ( atomicLoad(flag1) ) if ( !turn2 ) {
// ...
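To make the fragment above concrete, here is a minimal sketch of the whole Dekker test with every access to the shared flags going through core.atomic. Note the assumptions: it also uses atomicStore for the writes and for the counter update inside the critical section, which goes beyond the minimal change shown above, and it relies on the default sequentially consistent ordering of atomicLoad and atomicStore. Treat it as an illustration, not as the only way to fix the test.

```d
import core.atomic;
import core.thread;
import std.stdio;

enum loops = 0xFFFF;
shared uint cnt;
shared bool flag1 = false;
shared bool flag2 = false;
shared bool turn2 = false;   // true: it is thread 2's turn

void dekkerInc()
{
    atomicStore(flag1, true);
    while (atomicLoad(flag2))
    {
        if (atomicLoad(turn2))
        {
            atomicStore(flag1, false);
            while (atomicLoad(turn2)) { /* wait until my turn */ }
            atomicStore(flag1, true);
        }
    }
    // Critical section: only one thread is here at a time, so a separate
    // load and store is enough (assumption: no other writers of cnt).
    atomicStore(cnt, atomicLoad(cnt) + 1);
    atomicStore(turn2, true);
    atomicStore(flag1, false);
}

void dekkerDec()
{
    atomicStore(flag2, true);
    while (atomicLoad(flag1))
    {
        if (!atomicLoad(turn2))
        {
            atomicStore(flag2, false);
            while (!atomicLoad(turn2)) { /* wait until my turn */ }
            atomicStore(flag2, true);
        }
    }
    atomicStore(cnt, atomicLoad(cnt) - 1);
    atomicStore(turn2, false);
    atomicStore(flag2, false);
}

void threadDekkerInc() { foreach (i; 0 .. loops) dekkerInc(); }
void threadDekkerDec() { foreach (i; 0 .. loops) dekkerDec(); }

void main()
{
    auto tg = new ThreadGroup;
    tg.create(&threadDekkerInc);
    tg.create(&threadDekkerDec);
    tg.joinAll();
    writeln("Dekker: ", atomicLoad(cnt) == 0 ? "passed" : "failed");
}
```

Using atomicStore for the writes as well is the conservative choice: it does not depend on how a particular druntime version implements the load barrier.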

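On architectures other than ia32 (ignoring string operations and SSE), which can also reorder

reads relative to reads,
or writes relative to writes,
or writes and reads to the same memory location,

additional memory barriers would be needed.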
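For the specific case of incrementing a counter, the comment in the question's test code already names the library alternative, atomicOp!("+=")(cnt, 1U). A minimal sketch of the CAS test rewritten that way follows; the thread and loop counts are copied from the question, and the "atomicOp:" label is just an illustrative choice.

```d
import core.atomic;
import core.thread;
import std.stdio;

enum loops = 0xFFFF;
enum threads = 8;
shared uint cnt;

// One atomic read-modify-write; there is no separate plain read of cnt
// in user code for the optimizer to duplicate.
void atomicInc() { atomicOp!"+="(cnt, 1U); }

void threadFunc() { foreach (i; 0 .. loops) atomicInc(); }

void main()
{
    auto tg = new ThreadGroup;
    foreach (i; 0 .. threads) tg.create(&threadFunc);
    tg.joinAll();
    writeln("atomicOp: ", atomicLoad(cnt) == loops * threads ? "passed" : "failed");
}
```

This sidesteps the hand-written CAS loop entirely, so it does not answer whether the reordering is a compiler bug, but it avoids depending on it.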

Answer 2

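Yes, code it in assembler. If you skip the cas() function and write the entire atomicInc function in assembly, it will only be a few lines of code. Until you do that, you will probably be fighting the compiler's optimizations.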

On top of that, you can use the x86 LOCK INC instruction instead of CAS, and you should be able to reduce the function to one or two lines of assembly.
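A rough sketch of what that could look like with D's inline assembler is below. The function name lockInc and its pointer parameter are made up for illustration, the version blocks assume a compiler that supports DMD-style inline asm on x86 or x86-64, and the fallback branch uses atomicOp from core.atomic for everything else.

```d
import core.atomic;
import std.stdio;

shared uint cnt;

// Hypothetical helper: atomically increments *p with a LOCK-prefixed INC.
void lockInc(shared(uint)* p)
{
    version (D_InlineAsm_X86)
    {
        asm
        {
            mov EAX, p;            // EAX = address of the counter
            lock;                  // prefix makes the next instruction atomic
            inc dword ptr [EAX];
        }
    }
    else version (D_InlineAsm_X86_64)
    {
        asm
        {
            mov RAX, p;            // RAX = address of the counter
            lock;
            inc dword ptr [RAX];
        }
    }
    else
    {
        atomicOp!"+="(*p, 1U);     // portable fallback, no inline asm
    }
}

void main()
{
    lockInc(&cnt);
    lockInc(&cnt);
    writeln(atomicLoad(cnt));
}
```

Because the increment is a single locked instruction, there is no separately loaded old value that the optimizer could read a second time.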

Compiler optimizations break multithreaded code
