Question

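I am writing a fairly large network simulator in C++. I tested the individual pieces regularly as I developed them, and after putting everything together it seems to work as long as the load I impose on the simulator is not too big (it is a P2P content distribution simulator, so the more different "contents" I introduce, the more data transfers the simulator has to handle). Anything above a certain threshold on the number of different contents being simulated results in a sudden SIGSEGV after a few minutes of smooth running. I assumed there was a memory leak that eventually grew too large and corrupted something, but a valgrind run with a parameter below the threshold completed flawlessly. However, if I run the program under valgrind with a critical value for the number of contents, after a certain point I start getting memory access errors in functions that previously showed no problems: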

==5987== Invalid read of size 8
==5987==    at 0x40524E: Scheduler::advanceClock() (Scheduler.cpp:38)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==  Address 0x2e63bc70 is 0 bytes inside a block of size 32 free'd
==5987==    at 0x4C2A4BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5987==    by 0x405487: Scheduler::advanceClock() (Scheduler.cpp:69)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==
==5987== Invalid read of size 4
==5987==    at 0x40584E: Request::getSimTime() const (Event.hpp:45)
==5987==    by 0x40525C: Scheduler::advanceClock() (Scheduler.cpp:38)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==  Address 0x2e63bc78 is 8 bytes inside a block of size 32 free'd
==5987==    at 0x4C2A4BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5987==    by 0x405487: Scheduler::advanceClock() (Scheduler.cpp:69)
==5987==    by 0x45BA73: TestRun::execute() (TestRun.cpp:73)
==5987==    by 0x45522B: main (CDSim.cpp:131)
==5987==

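I know it may be hard to give an answer without seeing the whole code, but is there any "high-level" hint as to what might be going on? I cannot understand why functions that seemed to work correctly suddenly start misbehaving. Is there something obvious that I might be missing?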

The offending line in the valgrind log above is if (nextEvent->getSimTime() < this->getSimTime()), in the following block:

bool Scheduler::advanceClock() {
  if (pendingEvents.size() == 0) {
    std::cerr << "WARNING: Scheduler::advanceClock() - Empty event queue before "
        "reaching the termination event" << std::endl;
    return false;
  }
  const Event* nextEvent = pendingEvents.top();
  // Check that the event is not scheduled in the past
  if (nextEvent->getSimTime() < this->getSimTime()) {
    std::cerr << "Scheduler::advanceClock() - Event scheduled in the past!" << 
        std::endl;
    std::cerr << "Simulation time: " << this->getSimTime()
        << ", event time: " << nextEvent->getSimTime()
        << std::endl;
    exit(ERR_EVENT_IN_THE_PAST);
  }
  // Update the clock with the current event time (>= previous time)
  this->setSimTime(nextEvent->getSimTime());
  ...

where pendingEvents is a boost::heap::binomial_heap.

Answer 1

I finally found the problem. When an event had completed and needed to be removed from the queue, my code looked like this:

...
// Data transfer completed, remove event from queue
// Notify the oracle, which will update the cache mapping and free resources
// in the topology
oracle->notifyCompletedFlow(nextEvent, this);
// Remove flow from top of the queue
pendingEvents.pop();
handleMap.erase(nextEvent);
delete nextEvent;
return true;

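The problem was that oracle->notifyCompletedFlow() called some methods on the scheduler that dynamically update the priority of scheduled events (for example, to react to a change in the available bandwidth in the network). As a result, the top of the queue could change during that call, so in some cases pendingEvents.pop() removed a different event and left the already deleted nextEvent in the queue. That dangling pointer is what valgrind later reported as a read inside a freed block. Popping the queue before calling the oracle resolved the problem.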
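
The reordering described above can be sketched as follows. This is only an illustration under stated assumptions, not the simulator's actual code: a std::set stands in for the boost::heap::binomial_heap, the handleMap bookkeeping is omitted, and MiniScheduler, reschedule() and the onCompleted callback (standing in for oracle->notifyCompletedFlow()) are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>

// Hypothetical stand-in for the simulator's Event class.
struct Event {
  int id;
  double simTime;
};

// Orders events by time, then by id, so the earliest event is at begin().
struct EarlierFirst {
  bool operator()(const Event* a, const Event* b) const {
    if (a->simTime != b->simTime) return a->simTime < b->simTime;
    return a->id < b->id;
  }
};

// Hypothetical miniature scheduler; std::set replaces the Boost heap.
class MiniScheduler {
 public:
  ~MiniScheduler() {
    for (const Event* e : pending_) delete e;
  }

  Event* schedule(int id, double simTime) {
    Event* e = new Event{id, simTime};
    pending_.insert(e);
    return e;
  }

  // Re-prioritise a queued event, as the oracle may do during notification.
  void reschedule(Event* e, double newTime) {
    pending_.erase(e);     // erase while the old key is still valid
    e->simTime = newTime;
    pending_.insert(e);
  }

  // Completes the earliest event and returns its id.
  int completeNext(const std::function<void(MiniScheduler&)>& onCompleted) {
    assert(!pending_.empty());
    Event* next = *pending_.begin();
    // Remove the event FIRST: the callback may reorder the queue, after
    // which "the top" would no longer be guaranteed to be `next`.
    pending_.erase(pending_.begin());
    onCompleted(*this);
    const int id = next->id;
    delete next;           // safe: the queue no longer references it
    return id;
  }

  std::size_t size() const { return pending_.size(); }
  const Event* top() const { return *pending_.begin(); }

 private:
  std::set<Event*, EarlierFirst> pending_;
};
```

With the original ordering (notify, then pop, then delete), a callback that moves another event to the front would cause that other event to be popped while the deleted one stays queued. Removing the event before the callback makes the deletion independent of whatever the callback does to the queue.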

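I apologize for leaving out the code that might have led to a quicker answer; I will try to learn from my mistakes :) Thanks for pointing me in the right direction.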

Answer 2

The culprit may be const Event* nextEvent = pendingEvents.top(); since pendingEvents looks like some kind of stack. You could try the following:

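Instrument the code where you allocate and free memory (wherever you use malloc/new and free/delete). "Instrument" here means adding some tracing, which can be as simple as trace output to std::cerr or to a file.
As a debugging aid, try using a smart pointer for Event that checks the validity of the pointer on every dereference (operator->).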
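
Both suggestions can be combined in a small debugging aid. The sketch below is an assumption about how one might do it, not an existing library: liveObjects(), tracedNew(), tracedDelete() and CheckedPtr are hypothetical names, and the Event struct is a simplified stand-in for the simulator's class.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <unordered_set>
#include <utility>

// Hypothetical debug registry of addresses that are currently alive.
inline std::unordered_set<const void*>& liveObjects() {
  static std::unordered_set<const void*> live;
  return live;
}

// Allocate through this helper so every allocation is traced and recorded.
template <typename T, typename... Args>
T* tracedNew(Args&&... args) {
  T* p = new T(std::forward<Args>(args)...);
  liveObjects().insert(p);
  std::cerr << "new    " << static_cast<const void*>(p) << '\n';
  return p;
}

// Free through this helper; the assert flags double or untracked deletes.
template <typename T>
void tracedDelete(T* p) {
  assert(liveObjects().count(p) == 1 && "double delete or untracked pointer");
  std::cerr << "delete " << static_cast<const void*>(p) << '\n';
  liveObjects().erase(p);
  delete p;
}

// Non-owning debug pointer: operator-> throws instead of reading freed memory.
template <typename T>
class CheckedPtr {
 public:
  explicit CheckedPtr(T* p) : p_(p) {}

  T* operator->() const {
    if (!alive())
      throw std::logic_error("CheckedPtr: dereference of a freed object");
    return p_;
  }

  bool alive() const { return liveObjects().count(p_) > 0; }

 private:
  T* p_;
};

// Simplified stand-in for the simulator's Event class (assumption).
struct Event {
  explicit Event(double t) : simTime(t) {}
  double getSimTime() const { return simTime; }
  double simTime;
};
```

Holding queued events as CheckedPtr<Event> would turn the invalid read into an exception at the first stale dereference. Note one limitation: if the allocator reuses a freed address for a new tracked object, the check passes for the stale pointer too, so this is a debugging aid rather than a guarantee.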

Segmentation fault when a program parameter exceeds a certain threshold
