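I am relatively new to Boost MPI. I have the libraries installed and the code compiles, but I am getting a very odd error: some of the integer data received by the slave nodes is not what the master sent. What is going on?
I am using Boost version 1.42.0 and compiling the code with mpic++ (which wraps g++ on one cluster and icpc on the other). A reduced example follows, including the output.
Code: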
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <vector>
#include <boost/mpi.hpp>
using namespace std;
namespace mpi = boost::mpi;
class Solution
{
public:
    Solution() :
        solution_num(num_solutions++)
    {
        // Master node's constructor
    }
    Solution(int solutionNum) :
        solution_num(solutionNum)
    {
        // Slave nodes' constructor.
    }
    int solutionNum() const
    {
        return solution_num;
    }
private:
    static int num_solutions;
    int solution_num;
};
int Solution::num_solutions = 0;
int main(int argc, char* argv[])
{
    // Initialization of MPI
    mpi::environment env(argc, argv);
    mpi::communicator world;
    if (world.rank() == 0)
    {
        // Create solutions
        int numSolutions = world.size() - 1; // One solution per slave
        vector<Solution*> solutions(numSolutions);
        for (int sol = 0; sol < numSolutions; ++sol)
        {
            solutions[sol] = new Solution;
        }
        // Send solutions
        for (int sol = 0; sol < numSolutions; ++sol)
        {
            world.isend(sol + 1, 0, false); // Tells the slave to expect work
            cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl;
            world.isend(sol + 1, 1, solutions[sol]->solutionNum());
        }
        // Retrieve values (solution numbers squared)
        vector<double> values(numSolutions, 0);
        for (int i = 0; i < numSolutions; ++i)
        {
            // Get values for each solution
            double value = 0;
            mpi::status status = world.recv(mpi::any_source, 2, value);
            int source = status.source();
            int sol = source - 1;
            values[sol] = value;
        }
        for (int i = 1; i <= numSolutions; ++i)
        {
            world.isend(i, 0, true); // Tells the slave to finish
        }
        // Output the solutions numbers and their squares
        for (int i = 0; i < numSolutions; ++i)
        {
            cout << solutions[i]->solutionNum() << ", " << values[i] << endl;
            delete solutions[i];
        }
    }
    else
    {
        // Slave nodes merely square the solution number
        bool finished;
        mpi::status status = world.recv(0, 0, finished);
        while (!finished)
        {
            int solNum;
            world.recv(0, 1, solNum);
            cout << "Node " << world.rank() << " receiving solution no. " << solNum << endl;
            Solution solution(solNum);
            double value = static_cast<double>(solNum * solNum);
            world.send(0, 2, value);
            status = world.recv(0, 0, finished);
        }
        cout << "Node " << world.rank() << " finished." << endl;
    }
    return EXIT_SUCCESS;
}
Running this on 21 nodes (1 master, 20 slaves) produces:
Sending solution no. 0 to node 1
Sending solution no. 1 to node 2
Sending solution no. 2 to node 3
Sending solution no. 3 to node 4
Sending solution no. 4 to node 5
Sending solution no. 5 to node 6
Sending solution no. 6 to node 7
Sending solution no. 7 to node 8
Sending solution no. 8 to node 9
Sending solution no. 9 to node 10
Sending solution no. 10 to node 11
Sending solution no. 11 to node 12
Sending solution no. 12 to node 13
Sending solution no. 13 to node 14
Sending solution no. 14 to node 15
Sending solution no. 15 to node 16
Sending solution no. 16 to node 17
Sending solution no. 17 to node 18
Sending solution no. 18 to node 19
Sending solution no. 19 to node 20
Node 1 receiving solution no. 0
Node 2 receiving solution no. 1
Node 12 receiving solution no. 19
Node 3 receiving solution no. 19
Node 15 receiving solution no. 19
Node 13 receiving solution no. 19
Node 4 receiving solution no. 19
Node 9 receiving solution no. 19
Node 10 receiving solution no. 19
Node 14 receiving solution no. 19
Node 6 receiving solution no. 19
Node 5 receiving solution no. 19
Node 11 receiving solution no. 19
Node 8 receiving solution no. 19
Node 16 receiving solution no. 19
Node 19 receiving solution no. 19
Node 20 receiving solution no. 19
Node 1 finished.
Node 2 finished.
Node 7 receiving solution no. 19
0, 0
1, 1
2, 361
3, 361
4, 361
5, 361
6, 361
7, 361
8, 361
9, 361
10, 361
11, 361
12, 361
13, 361
14, 361
15, 361
16, 361
17, 361
18, 361
19, 361
Node 6 finished.
Node 3 finished.
Node 17 receiving solution no. 19
Node 17 finished.
Node 10 finished.
Node 12 finished.
Node 8 finished.
Node 4 finished.
Node 15 finished.
Node 18 receiving solution no. 19
Node 18 finished.
Node 11 finished.
Node 13 finished.
Node 20 finished.
Node 16 finished.
Node 9 finished.
Node 19 finished.
Node 7 finished.
Node 5 finished.
Node 14 finished.
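So, while the master sends 0 to node 1, 1 to node 2, 2 to node 3 and so on, most of the slave nodes (for some reason) receive the number 19. Rather than producing the squares of the numbers from 0 to 19, we get 0 squared, 1 squared and 19 squared 18 times!
Thanks in advance to anyone who can explain this.
Alan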
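Answer 0 (score: 11)
OK, I think I have the answer, which requires some knowledge of the underlying C-style MPI calls. Boost's 'isend' function is essentially a wrapper around 'MPI_Isend', and it does not shield the user from needing to know some of the details of how 'MPI_Isend' works.
One parameter of 'MPI_Isend' is a pointer to a buffer that contains the information you wish to send. Importantly, however, this buffer cannot be reused until you know the send has completed (for example, by waiting on or testing the request). Consider the following code: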
// Get solution numbers from the solutions and store in a vector
vector<int> solutionNums(numSolutions);
for (int sol = 0; sol < numSolutions; ++sol)
{
    solutionNums[sol] = solutions[sol]->solutionNum();
}
// Send solution numbers
for (int sol = 0; sol < numSolutions; ++sol)
{
    world.isend(sol + 1, 0, false); // Indicates that we have not finished, and to expect a solution representation
    cout << "Sending solution no. " << solutionNums[sol] << " to node " << sol + 1 << endl;
    world.isend(sol + 1, 1, solutionNums[sol]);
}
This works perfectly, as each solution number sits in its own location in memory. Now consider the following minor adjustment:
// Create solutionNum array
vector<int> solutionNums(numSolutions);
for (int sol = 0; sol < numSolutions; ++sol)
{
    solutionNums[sol] = solutions[sol]->solutionNum();
}
// Send solutions
for (int sol = 0; sol < numSolutions; ++sol)
{
    int solNum = solutionNums[sol];
    world.isend(sol + 1, 0, false); // Indicates that we have not finished, and to expect a solution representation
    cout << "Sending solution no. " << solNum << " to node " << sol + 1 << endl;
    world.isend(sol + 1, 1, solNum);
}
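Now the underlying 'MPI_Isend' call is given a pointer to solNum. Unfortunately, this bit of memory is overwritten each time round the loop, so while it may look as though 4 is sent to node 5, by the time the send actually takes place the new contents of that memory location (say 19) are passed instead.
Now consider the original code: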
// Send solutions
for (int sol = 0; sol < numSolutions; ++sol)
{
    world.isend(sol + 1, 0, false); // Tells the slave to expect work
    cout << "Sending solution no. " << solutions[sol]->solutionNum() << " to node " << sol + 1 << endl;
    world.isend(sol + 1, 1, solutions[sol]->solutionNum());
}
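Here we are passing a temporary. Again, the location of this temporary in memory gets overwritten each time round the loop, and again the wrong data is sent to the slave nodes.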
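The effect can be reproduced without MPI at all. The following is only a toy model, not how MPI is implemented: DeferredSender, post and flush are hypothetical names invented for this sketch. Like 'MPI_Isend', post() records only where the payload lives, and the value is read later.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a non-blocking send (hypothetical, not part of MPI or Boost):
// post() remembers only the ADDRESS of the payload; flush() reads it later.
class DeferredSender
{
public:
    void post(const int* payload) { pending_.push_back(payload); }

    // "Completes" every posted send by reading each buffer now.
    std::vector<int> flush()
    {
        std::vector<int> delivered;
        for (std::size_t i = 0; i < pending_.size(); ++i)
        {
            delivered.push_back(*pending_[i]);
        }
        pending_.clear();
        return delivered;
    }

private:
    std::vector<const int*> pending_;
};

// One buffer, overwritten on every iteration before anything is delivered.
std::vector<int> deliver_with_reused_buffer(int count)
{
    assert(count >= 0);
    DeferredSender sender;
    int solNum = 0; // declared outside the loop so the pointer stays valid in this model
    for (int sol = 0; sol < count; ++sol)
    {
        solNum = sol;
        sender.post(&solNum);
    }
    return sender.flush();
}

// One buffer per message, all of them alive until flush() has run.
std::vector<int> deliver_with_stable_buffers(int count)
{
    assert(count >= 0);
    DeferredSender sender;
    std::vector<int> solutionNums(count);
    for (int sol = 0; sol < count; ++sol)
    {
        solutionNums[sol] = sol;
    }
    for (int sol = 0; sol < count; ++sol)
    {
        sender.post(&solutionNums[sol]);
    }
    return sender.flush();
}
```

With a count of 4, the reused-buffer variant delivers 3, 3, 3, 3, while the stable-buffer variant delivers 0, 1, 2, 3. In the real program the situation is worse than in this model, because the loop-local variable or temporary may no longer exist when the send takes place.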
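As it happens, I have been able to restructure my 'real' code to use 'send' instead of 'isend'. However, if I need to use 'isend' in future, I'll be a bit more careful!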
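Answer 1 (score: 4)
I think I stumbled over a similar problem today. When serializing a custom data type, I noticed that it was (sometimes) corrupted on the other side. The fix was to store the mpi::request return value of isend. If you look at communicator::isend_impl(int dest, int tag, const T& value, mpl::false_) in Boost's communicator.hpp, you will see that the serialized data is put into the request as a shared pointer. If the request is destroyed again, the data is invalidated and anything might happen.
So: always save the isend return value!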
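Answer 2 (score: 2)
Your compiler has aggressively optimized the "solutions[sol] = new Solution;" loop and concluded that it can jump to the end of all the num_solutions++ increments. It is of course wrong to do so, but that is what happened.
It is possible, though unlikely, that an auto-threading or auto-parallelizing compiler has caused the 20 instances of num_solutions++ to occur in semi-random order with respect to the 20 instances of solution_num = num_solutions in the initializer list of Solution(). It is more likely that an optimization went wrong.
Replace
for (int sol = 0; sol < numSolutions; ++sol) { solutions[sol] = new Solution; }
with
for (int sol = 0; sol < numSolutions; ++sol) { solutions[sol] = new Solution(sol); }
and your problem will go away. In particular, every Solution will get its own number rather than whatever number the shared static happens to hold while the compiler incorrectly reorders the 20 increments.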
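Answer 3 (score: 1)
Building on milianw's answer: my impression is that the correct way to use isend is to keep the request object it returns and to check that it has completed, using its test() or wait() methods, before making another call to isend. I think it is also valid to keep calling isend() and push the request objects onto a vector. You can then test or wait on those requests with {test,wait}_{any,some,all}.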
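At some point you also need to worry about whether you are posting sends faster than the recipients can receive them, because sooner or later you will run out of MPI buffers. In my experience, this just manifests as a crash.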
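One possible way to bound the number of outstanding sends is to keep the requests in a queue and wait on the oldest one whenever the queue is full. This is a sketch under assumptions: wait_for_slot and drain are hypothetical helpers, the limit is an arbitrary tuning value, and whether such a limit is enough to avoid exhausting MPI buffers depends on the MPI implementation. The helpers are templates over any type with a wait() member, which boost::mpi::request provides.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Blocks on the oldest outstanding requests until fewer than maxInFlight remain.
// Request is any type with a wait() member, for example boost::mpi::request.
template <typename Request>
void wait_for_slot(std::deque<Request>& inFlight, std::size_t maxInFlight)
{
    assert(maxInFlight > 0);
    while (inFlight.size() >= maxInFlight)
    {
        inFlight.front().wait(); // oldest send first
        inFlight.pop_front();
    }
}

// Waits for everything that is still outstanding.
template <typename Request>
void drain(std::deque<Request>& inFlight)
{
    while (!inFlight.empty())
    {
        inFlight.front().wait();
        inFlight.pop_front();
    }
}
```

The intended use is to call wait_for_slot(inFlight, limit) before each isend, push the returned request with inFlight.push_back(world.isend(...)), and call drain(inFlight) once all sends have been posted. As before, each send buffer has to stay valid until its request has been waited on.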