Question

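I am writing a contrib extension for distributed TensorFlow that overrides Rendezvous::RecvFromRemoteAsync(). To verify my solution, I added tensor checksums at various points in the code, on both the sender and the receiver. Strangely, I found that the checksum changes while I am still in the sending code.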

To simplify the check, I created the following function (pseudocode):

TestChecksum(Tensor t, int delay):

1.  int64 checksum1 = checksum(t)
2.  usleep(delay)
3.  int64 checksum2 = checksum(t)
4.  CHECK(checksum1 == checksum2);
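For readers who want to try the same check outside TensorFlow, here is a minimal sketch of the pseudocode above over a raw byte range. The names Checksum and BufferIsStable are hypothetical, FNV-1a is only a stand-in for whatever checksum(t) actually computes, and the function returns a bool instead of calling CHECK so that the result can be logged. It assumes the bytes are readable from host memory.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

// FNV-1a hash over a raw byte range. A stand-in (assumption) for checksum(t).
inline std::uint64_t Checksum(const void* data, std::size_t size) {
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  std::uint64_t hash = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (std::size_t i = 0; i < size; ++i) {
    hash ^= bytes[i];
    hash *= 1099511628211ULL;  // FNV-1a 64-bit prime
  }
  return hash;
}

// Hashes the buffer, waits for `delay`, hashes it again, and reports whether
// the two hashes match. A false result means some other writer touched the
// bytes during the wait.
inline bool BufferIsStable(const void* data, std::size_t size,
                           std::chrono::microseconds delay) {
  const std::uint64_t before = Checksum(data, size);
  std::this_thread::sleep_for(delay);
  const std::uint64_t after = Checksum(data, size);
  return before == after;
}
```

Calling BufferIsStable(ptr, len, std::chrono::microseconds(200000)) on the tensor's backing bytes would correspond to the failing case described below. Note that a true result only shows that no change was observed during that particular window.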

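Now, I call this function at the beginning of the RecvLocalAsync() callback, in the original gRPC code (right here).

With a delay of 100000 (microseconds), the test passes.

With a delay of 200000 (microseconds), the test fails.

I also looked at the tensor buffer and saw that it is shared across all step IDs. So it seems that another thread is changing the tensor contents while RecvFromRemoteAsync is still in progress. Is that possible? How can I know that I am receiving the correct tensor?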

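Edit - how to reproduce:

Check out this branch. The bug-reproduction code is in the last commit; if you prefer, it can be cherry-picked without conflicts.

Get the tensorflow benchmarks.

Run tf_cnn_benchmarks.py with at least 1 ps and 2 workers.

The commands I used: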

python -u tf_cnn_benchmarks.py --job_name=ps --task_index=0 --ps_hosts=<...> --worker_hosts=<...> --server_protocol=grpc --model=resnet152 --batch_size=32 --num_gpus=2 --local_parameter_device=gpu
python -u tf_cnn_benchmarks.py --job_name=worker --task_index=0 --ps_hosts=<...> --worker_hosts=<...> --server_protocol=grpc --model=resnet152 --batch_size=32 --num_gpus=2 --local_parameter_device=gpu
python -u tf_cnn_benchmarks.py --job_name=worker --task_index=1 --ps_hosts=<...> --worker_hosts=<...> --server_protocol=grpc --model=resnet152 --batch_size=32 --num_gpus=2 --local_parameter_device=gpu


0 answers: