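I've run into trouble in a personal side project on neural network architectures. The problem is a Heisenbug segfault that occurs in the parallel section of a custom Monte Carlo algorithm I am writing.
The threads should not interact in any way in this part of the code until they reach the critical section I defined. Somehow, though, either the memory location of a local variable inside a function call is being overwritten by another thread, or the function call itself is overwriting a memory location allocated by a previous thread.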
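The intended pattern, where each thread works only on its own data until it reaches a critical section, can be sketched roughly as follows. This is a minimal illustration and not code from the project: `sum_of_squares` is a hypothetical function invented for the example. If the file is compiled without OpenMP support, the pragmas are ignored and the function simply runs serially.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical example: each thread accumulates into its own local variable
// and touches shared state only inside the critical section.
std::uint64_t sum_of_squares(const std::vector<std::uint32_t>& values) {
    std::uint64_t total = 0;  // shared by all threads
    const long n = static_cast<long>(values.size());

    #pragma omp parallel
    {
        // Declared inside the parallel region, so every thread has its own copy.
        std::uint64_t local = 0;

        #pragma omp for
        for (long i = 0; i < n; ++i) {
            local += static_cast<std::uint64_t>(values[i]) * values[i];
        }

        // The only write to shared memory happens here, one thread at a time.
        #pragma omp critical
        {
            total += local;
        }
    }
    return total;
}
```

Anything a thread reaches through a shared pointer or reference is not covered by this pattern, which is one place a bug like the one described here could hide.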
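I believe this person's problem is the same as mine, but I don't understand how to apply his insight to my code, because he did not explain how he solved it: OpenMP Causes Heisenbug Segfault
Below is the parallel section of the code I wrote, with the "tested" critical addition commented out because it did not help with the bug. The section where the error occurs is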
#include "Network.h"
#include <vector>
#include <cmath>
#include <thread>
#include <omp.h>
#include <stdint.h>
#include <iostream>
using namespace std;
using namespace AeroSW;
int main(){
    // Generate X amount of blueprints
    vector<vector<double> > inputs;
    vector<vector<double> > outputs;
    double sf = 1100000;
    double lr = 0.1;
    uint32_t duration = 3;
    for(uint32_t i = 0; i < 1000; i++){
        vector<double>* in = new vector<double>(3);
        vector<double>* out = new vector<double>(1); // These can be different sizes, but for simplicity for example
        (*in)[0] = i;
        (*in)[1] = i+1;
        (*in)[2] = i+2;
        (*out)[0] = i * 1000;
        inputs.push_back(*in);
        outputs.push_back(*out);
    }
    vector<vector<int> > bps;
    int n_i = 3;
    int n_o = 1;
    for(uint32_t i = 0; i <= 3; i++){
        int num_bps_for_this_layer = pow(7, i);
        int* val_array = new int[i];
        for(uint32_t j = 0; j < i; j++){
            val_array[j] = 7;
        }
        for(uint32_t j = 0; j < (unsigned)num_bps_for_this_layer; j++){
            vector<int>* vec_i = new vector<int>(2+i);
            (*vec_i)[0] = n_i;
            (*vec_i)[i+1] = n_o;
            for(uint32_t k = 0; k < i; k++){
                (*vec_i)[k+1] = val_array[k];
            }
            bps.push_back(*vec_i);
            if(i > 0){
                uint32_t t_i = i-1; // Temp i
                val_array[t_i]--;
                bool b_flag = false; // break flag
                while(val_array[t_i] == 0){
                    val_array[t_i] = 7;
                    if(t_i == 0){
                        b_flag = true;
                        break;
                    }
                    t_i--;
                    val_array[t_i]--;
                }
                if(b_flag) break;
            }
        }
    }
    //cout << "Hello World\n";
    uint32_t num_bins = 10;
    uint32_t num_threads = std::thread::hardware_concurrency(); // Find # of cores
    if(num_threads == 0) // Assume 1 core for systems w/out multiple cores
        num_threads = 1;
    if(num_bins < num_threads){
        num_threads = num_bins;
    }
    uint32_t bp_slice = bps.size() / num_threads;
    #pragma omp parallel num_threads(num_threads) firstprivate(num_bins, n_i, n_o, lr)
    {
        uint32_t my_id = omp_get_thread_num();
        uint32_t my_si = my_id * bp_slice; // my starting index
        uint32_t my_ei; // my ending index, exclusive
        if(my_id == num_threads -1) my_ei = bps.size();
        else my_ei = my_si + bp_slice;
        std::vector<Network*> my_nets;
        for(uint32_t i = my_si; i < my_ei; i++){
            uint32_t nl = bps[i].size();
            uint32_t* bp = new uint32_t[nl];
            for(uint32_t j = 0; j < nl; j++){
                bp[j] = bps[i][j];
            }
            Network* t_net = new Network(lr, bp, nl);
            my_nets.push_back(t_net);
        }
        for(uint32_t i = 0; i < my_nets.size(); i++){
            for(uint32_t j = 0; j < num_bins; j++){
                my_nets[i]->train(inputs, outputs, sf, inputs.size(), duration);
            }
        }
    }
}
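For reference, the index arithmetic each thread uses to pick its share of the blueprints can be isolated into a small helper. The sketch below mirrors the `my_si` / `my_ei` computation from the code above; `slice_bounds` is a hypothetical name that does not appear in the project, and the helper assumes `num_threads > 0` and `id < num_threads`.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: returns the half-open range [begin, end) of items
// owned by thread `id`. The last thread also takes the remainder.
// Assumes num_threads > 0 and id < num_threads.
std::pair<std::size_t, std::size_t> slice_bounds(std::size_t total,
                                                 std::size_t num_threads,
                                                 std::size_t id) {
    const std::size_t slice = total / num_threads;
    const std::size_t begin = id * slice;
    const std::size_t end = (id == num_threads - 1) ? total : begin + slice;
    return {begin, end};
}
```

With this scheme the ranges for different thread ids never overlap, so the slicing alone would not make two threads touch the same blueprint.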
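If anyone sees something I'm missing, or knows what I can do to fix this, please let me know!
Below is sample output from Valgrind with the Helgrind tool active, which describes what I believe is the problem.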
==26386==
==26386== Possible data race during read of size 8 at 0x6213348 by thread #1
==26386== Locks held: none
==26386== at 0x40CB26: AeroSW::Node::get_weight(unsigned int) (Node.cpp:84)
==26386== by 0x40E688: AeroSW::Network::train_tim(std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >, std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >, double, unsigned int, unsigned long) (Network.cpp:227)
==26386== by 0x4058F1: monte_carlo(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, double, double, double, std::vector<double*, std::allocator<double*> >&) [clone ._omp_fn.0] (Validation.cpp:196)
==26386== by 0x5462E5E: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==26386== by 0x404B86: monte_carlo(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, double, double, double, std::vector<double*, std::allocator<double*> >&) (Validation.cpp:136)
==26386== by 0x402467: main (NeuralNetworkArchitectureDriver.cpp:85)
==26386== Address 0x6213348 is 24 bytes inside a block of size 32 in arena "client"
==26386==
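-UPDATE- This was a heap corruption problem. I had to modify a large amount of code, but I got it working using shared_ptrs and vectors. Threads were overwriting memory locations they should not have been accessing, which caused other threads to crash because the information they were trying to access had been changed.
Answer 0 (score: 0)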
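I am writing this answer because I identified my problem with the help of a professor at my local university. The cause was heap corruption, related to the range of memory used in the program. Threads were going past their own memory allocations and using other threads' memory on the heap to store information that did not fit in their own.
I was able to deal with this by changing all object pointers to shared_ptrs, which prevents a memory location from being overwritten until all references to the object have been properly released. I also changed every array pointer, and every pointer used as an array, to a vector. After doing this the problem disappeared completely, and the program stopped crashing at random.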
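The kind of change described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the real `Network` class is not shown in this post, so `ToyNetwork` and `build_and_measure` are hypothetical stand-ins, and `node_count` merely takes the place of real training work. The layer sizes are held in a `std::vector` instead of a raw `uint32_t*` plus a length, and each network is owned by a `std::shared_ptr` instead of a raw `new`.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the project's Network class (the real one is not shown).
class ToyNetwork {
public:
    ToyNetwork(double learning_rate, std::vector<std::uint32_t> layer_sizes)
        : learning_rate_(learning_rate), layer_sizes_(std::move(layer_sizes)) {}

    double learning_rate() const { return learning_rate_; }

    // Placeholder for real work: total number of nodes across all layers.
    std::uint32_t node_count() const {
        std::uint32_t n = 0;
        for (std::uint32_t s : layer_sizes_) n += s;
        return n;
    }

private:
    double learning_rate_;
    std::vector<std::uint32_t> layer_sizes_;  // owns its storage and knows its size
};

// Builds one network per blueprint in parallel and returns each node count.
std::vector<std::uint32_t> build_and_measure(
        const std::vector<std::vector<std::uint32_t>>& blueprints,
        double learning_rate) {
    const long n = static_cast<long>(blueprints.size());
    std::vector<std::shared_ptr<ToyNetwork>> nets(blueprints.size());
    std::vector<std::uint32_t> counts(blueprints.size());

    // Each iteration reads shared input and writes only to its own slot i.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        nets[i] = std::make_shared<ToyNetwork>(learning_rate, blueprints[i]);
        counts[i] = nets[i]->node_count();
    }
    return counts;
}
```

A call such as `build_and_measure({{3, 7, 1}, {3, 1}}, 0.1)` builds two networks and returns their node counts. Smart pointers and vectors do not make code thread-safe by themselves; what they remove is the manual size and lifetime bookkeeping where out-of-bounds writes and dangling pointers tend to creep in.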
Thanks to Zulan for the recommendation!