How do I give each CPU core access to a portion of a Vec?

Date: 2018-03-19 00:29:16

Tags: rust

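I have an embarrassingly parallel piece of graphics rendering code that I want to run across my CPU cores. I wrote a test case (the function being computed is meaningless) to explore how I might parallelize it. I'd like to code this using only std Rust so that I learn how to use std::thread. However, I don't understand how to give each thread a portion of the framebuffer. The full test case code is further below, but I'll try to break it down first.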

The sequential form is simple:

let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
for j in 0..HEIGHT {
    for i in 0..WIDTH {
        buffer0[j][i] = compute(i as i32,j as i32);
    }
}

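I thought it would help to make a buffer of the same size, but rearranged to be 3D and indexed first by core. This is the same computation, just with the data reordered to show how the work is divided.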

let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
for c in 0..num_logical_cores {
    for y in 0..y_per_core {
        let j = y*num_logical_cores + c;
        if j >= HEIGHT {
            break;
        }
        for i in 0..WIDTH {
            buffer1[c][y][i] = compute(i as i32,j as i32)
        }
    }
}

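But when I try to put the inner part of the code in a closure and create a thread, I get errors about the buffer and lifetimes. I basically don't understand what to do and could use some guidance. I want per_core_buffer to temporarily refer to the data in buffer2 that belongs to that core, allow it to be written, synchronize all the threads, and then read buffer2 afterwards. Is this possible?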

let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let mut handles = Vec::new();
for c in 0..num_logical_cores {
    let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
    let handle = thread::spawn(move || {
        for y in 0..y_per_core {
            let j = y*num_logical_cores + c;
            if j >= HEIGHT {
                break;
            }
            for i in 0..WIDTH {
                per_core_buffer[y][i] = compute(i as i32,j as i32)
            }
        }
    });
    handles.push(handle)
}
for handle in handles {
    handle.join().unwrap();
}

This is the error, and I don't understand it:

error[E0597]: `buffer2` does not live long enough
  --> src/main.rs:50:36
   |
50 |         let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
   |                                    ^^^^^^^ borrowed value does not live long enough
...
88 | }
   | - borrowed value only lives until here
   |
   = note: borrowed value must be valid for the static lifetime...
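The `static lifetime` note comes from `thread::spawn`, whose closure must be `'static`: a spawned thread may outlive the function that created it, so it cannot hold a borrow of a local such as `buffer2`. One std-only way around this, offered here as a sketch rather than as the approach the answer below takes, is to move each per-core buffer into its thread by value and receive it back from `join()`. The function name `render_by_moving` and the small `WIDTH`/`HEIGHT` values are illustrative assumptions, not part of the original test case.

```rust
use std::thread;

// Small illustrative sizes (assumption), not the question's 40000 x 10000.
const WIDTH: usize = 8;
const HEIGHT: usize = 10;

fn compute(x: i32, y: i32) -> i32 {
    (x * y) % (x + y + 10000)
}

/// Fills a buffer indexed as [core][y][x], where image row j = y * num_cores + c.
/// Assumes num_cores >= 1.
fn render_by_moving(num_cores: usize) -> Vec<Vec<Vec<i32>>> {
    let y_per_core = HEIGHT / num_cores + 1;
    let buffer = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_cores];

    // into_iter() hands each per-core Vec to its thread by value, so the
    // closure owns its data and satisfies the 'static bound.
    let handles: Vec<_> = buffer
        .into_iter()
        .enumerate()
        .map(|(c, mut per_core)| {
            thread::spawn(move || {
                for (y, row) in per_core.iter_mut().enumerate() {
                    let j = y * num_cores + c;
                    if j >= HEIGHT {
                        break;
                    }
                    for (i, cell) in row.iter_mut().enumerate() {
                        *cell = compute(i as i32, j as i32);
                    }
                }
                per_core // hand the filled buffer back through join()
            })
        })
        .collect();

    // join() both waits for each thread and returns that thread's buffer.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let buffer = render_by_moving(4);
    // Image row j lives at [j % cores][j / cores].
    assert_eq!(buffer[7 % 4][7 / 4][5], compute(5, 7));
    println!("ok");
}
```

Nothing is copied here: moving a `Vec` transfers only its pointer, length, and capacity, so the cost per thread is independent of the buffer size.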

The complete test case is:

extern crate num_cpus;
use std::time::Instant;
use std::thread;

fn compute(x: i32, y: i32) -> i32 {
    (x*y) % (x+y+10000)
}

fn main() {
    let num_logical_cores = num_cpus::get();
    const WIDTH: usize = 40000;
    const HEIGHT: usize = 10000;
    let y_per_core = HEIGHT/num_logical_cores + 1;

    // ------------------------------------------------------------
    // Serial Calculation...
    let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
    let start0 = Instant::now();
    for j in 0..HEIGHT {
        for i in 0..WIDTH {
            buffer0[j][i] = compute(i as i32,j as i32);
        }
    }
    let dur0 = start0.elapsed();

    // ------------------------------------------------------------
    // On the way to Parallel Calculation...
    // Reorder the data buffer to be 3D with one 2D region per core.
    let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
    let start1 = Instant::now();
    for c in 0..num_logical_cores {
        for y in 0..y_per_core {
            let j = y*num_logical_cores + c;
            if j >= HEIGHT {
                break;
            }
            for i in 0..WIDTH {
                buffer1[c][y][i] = compute(i as i32,j as i32)
            }
        }
    }
    let dur1 = start1.elapsed();

    // ------------------------------------------------------------
    // Actual Parallel Calculation...
    let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
    let mut handles = Vec::new();
    let start2 = Instant::now();
    for c in 0..num_logical_cores {
        let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
        let handle = thread::spawn(move || {
            for y in 0..y_per_core {
                let j = y*num_logical_cores + c;
                if j >= HEIGHT {
                    break;
                }
                for i in 0..WIDTH {
                    per_core_buffer[y][i] = compute(i as i32,j as i32)
                }
            }
        });
        handles.push(handle)
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let dur2 = start2.elapsed();

    println!("Runtime: Serial={0:.3}ms, AlmostParallel={1:.3}ms, Parallel={2:.3}ms",
             1000.*dur0.as_secs() as f64 + 1e-6*(dur0.subsec_nanos() as f64),
             1000.*dur1.as_secs() as f64 + 1e-6*(dur1.subsec_nanos() as f64),
             1000.*dur2.as_secs() as f64 + 1e-6*(dur2.subsec_nanos() as f64));

    // Sanity check
    for j in 0..HEIGHT {
        let c = j % num_logical_cores;
        let y = j / num_logical_cores;
        for i in 0..WIDTH {
            if buffer0[j][i] != buffer1[c][y][i] {
                println!("wtf1? {0} {1} {2} {3}",i,j,buffer0[j][i],buffer1[c][y][i])
            }
            if buffer0[j][i] != buffer2[c][y][i] {
                println!("wtf2? {0} {1} {2} {3}",i,j,buffer0[j][i],buffer2[c][y][i])
            }
        }
    }

}

1 Answer:

Answer 0: (score: 0)

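Thanks to @Shepmaster for the pointers and for clarifying that this is not a simple problem in Rust, and that I needed to consider crates to find a reasonable solution. I'm only just getting started with Rust, so this really wasn't clear to me.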

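I liked the control over the number of threads that scoped_threadpool gives, so I went with that. Translating my code from above directly, I tried to use the 3D buffer with the core as the most significant index, and ran into trouble because that 3D vector does not implement the Copy trait. The need for Copy made me concerned about performance, so I went back to the original problem, implemented it more directly, and found a reasonable speedup by making each row a separate job on the pool. Copying each row will not be a large memory overhead.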

The code that works for me is:

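use scoped_threadpool::Pool; // from the scoped_threadpool crate

let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
let mut pool = Pool::new(num_logical_cores as u32);
pool.scoped(|scope| {
    // One job per row; each job gets exclusive (&mut) access to its row.
    for (y, row) in buffer2.iter_mut().enumerate() {
        scope.execute(move || {
            for x in 0..WIDTH {
                row[x] = compute(x as i32, y as i32);
            }
        });
    }
}); // scoped() returns only after every job has finished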
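For readers who want to stay with std, as the question originally asked: the same scoped pattern can be sketched with `std::thread::scope`, which assumes Rust 1.63 or later and did not exist when this was written. In this sketch each thread takes a contiguous block of rows obtained from `chunks_mut`, rather than one job per row. The function name `render_scoped` and its parameters are illustrative assumptions.

```rust
use std::thread;

fn compute(x: i32, y: i32) -> i32 {
    (x * y) % (x + y + 10000)
}

/// Fills a height x width buffer using scoped threads, each owning a
/// contiguous block of rows. Assumes num_threads >= 1.
fn render_scoped(width: usize, height: usize, num_threads: usize) -> Vec<Vec<i32>> {
    let mut buffer = vec![vec![0i32; width]; height];
    // Ceiling division so every row lands in some block; at least 1 because
    // chunks_mut panics on a chunk size of 0.
    let rows_per_thread = ((height + num_threads - 1) / num_threads).max(1);

    thread::scope(|s| {
        // chunks_mut yields non-overlapping &mut slices, so each thread
        // gets exclusive access to its own rows.
        for (block, rows) in buffer.chunks_mut(rows_per_thread).enumerate() {
            s.spawn(move || {
                for (offset, row) in rows.iter_mut().enumerate() {
                    let y = block * rows_per_thread + offset;
                    for (x, cell) in row.iter_mut().enumerate() {
                        *cell = compute(x as i32, y as i32);
                    }
                }
            });
        }
    }); // all threads are joined here, before `buffer` is read again

    buffer
}

fn main() {
    let buffer = render_scoped(16, 9, 4);
    assert_eq!(buffer[8][15], compute(15, 8));
    println!("ok");
}
```

Because the scope joins every thread before it returns, the borrow of `buffer` provably ends there, which is what lets the compiler accept non-`'static` references inside the spawned closures.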

On a 6-core, 12-thread i7-8700K, the 400000x4000 test case runs in 3.2 s serially and 481 ms in parallel, a reasonable speedup (roughly 6.7x).

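EDIT: I kept thinking about this problem and got a suggestion from Rustlang on Twitter that I should consider rayon. I converted my code to rayon and got a similar speedup with the following code.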

use rayon::prelude::*; // brings par_iter_mut into scope

let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
buffer2
    .par_iter_mut()
    .enumerate()
    // for_each runs the closure for its side effect only; nothing is collected.
    .for_each(|(y, row)| {
        for x in 0..WIDTH {
            row[x] = compute(x as i32, y as i32);
        }
    });
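
Both versions above leave the assignment of rows to threads to the library. If the interleaved assignment from the question matters (row j handled by core j % num_logical_cores), it can be sketched in std alone by dealing out `&mut` row references round-robin before spawning. This again assumes `std::thread::scope` (Rust 1.63 or later); `render_interleaved` and its parameters are hypothetical names used only for illustration.

```rust
use std::thread;

fn compute(x: i32, y: i32) -> i32 {
    (x * y) % (x + y + 10000)
}

/// Fills a height x width buffer with rows dealt round-robin to threads.
/// Assumes num_threads >= 1.
fn render_interleaved(width: usize, height: usize, num_threads: usize) -> Vec<Vec<i32>> {
    let mut buffer = vec![vec![0i32; width]; height];

    // One pile of (row index, &mut row) pairs per thread. vec![..; n] cannot
    // be used here because &mut references are not Clone.
    let mut piles: Vec<Vec<(usize, &mut Vec<i32>)>> =
        (0..num_threads).map(|_| Vec::new()).collect();
    for (j, row) in buffer.iter_mut().enumerate() {
        piles[j % num_threads].push((j, row));
    }

    thread::scope(|s| {
        for pile in piles {
            // Each pile holds disjoint rows, so moving it into a thread
            // gives that thread exclusive access to them.
            s.spawn(move || {
                for (j, row) in pile {
                    for (i, cell) in row.iter_mut().enumerate() {
                        *cell = compute(i as i32, j as i32);
                    }
                }
            });
        }
    }); // threads joined; the borrows held in the piles end here

    buffer
}

fn main() {
    let buffer = render_interleaved(16, 9, 4);
    assert_eq!(buffer[5][3], compute(3, 5));
    println!("ok");
}
```

Unlike the 3D layout in the question, the result stays a plain `[row][column]` buffer, so no index remapping is needed when reading it back.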