Question

Let me start with some example code. I've made a minimal test case for this. To reproduce it, two pieces are needed:

The first executable is a small application that uses CreateProcess. Let's call it Debugger:

#include <Windows.h>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    STARTUPINFOW        si = {0};
    PROCESS_INFORMATION pi = {0};
    si.cb = sizeof(si);

    // Starts the 'App'. CreateProcessW may modify the command line,
    // so it is copied into a writable buffer first.
    auto exe = L"C:\\Tests\\x64\\Release\\TestProject.exe";
    std::vector<wchar_t> tmp;
    tmp.resize(1024);
    memcpy(tmp.data(), exe, (1 + wcslen(exe)) * sizeof(wchar_t));

    auto result = CreateProcessW(NULL, tmp.data(), NULL, NULL, FALSE, DEBUG_PROCESS, NULL, NULL, &si, &pi);
    if (!result)
    {
        std::cout << "CreateProcess failed, error " << GetLastError() << std::endl;
        return 1;
    }

    DEBUG_EVENT debugEvent = { 0 };
    bool continueDebugging = true;
    while (continueDebugging)
    {
        if (WaitForDebugEvent(&debugEvent, INFINITE))
        {
            std::cout << "Event " << debugEvent.dwDebugEventCode << std::endl;
            if (debugEvent.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT)
            {
                continueDebugging = false;
            }

            // In real life, this is more complicated... For a minimum test, this will do
            auto continueStatus = DBG_CONTINUE;
            ContinueDebugEvent(debugEvent.dwProcessId, debugEvent.dwThreadId, continueStatus);
        }
    }
    std::cout << "Done." << std::endl;

    // Keep the console window open until Enter is pressed.
    std::string s;
    std::getline(std::cin, s);

    return 0;
}

The second executable is a small application that does something silly that consumes time. Let's call it App:

#include <Windows.h>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

__declspec(noinline) void CopyVector(uint64_t value, std::vector<uint8_t> data)
{
    // irrelevant.
    data.resize(10);
    *reinterpret_cast<uint64_t*>(data.data()) = value;
}

int main(int argc, const char** argv)
{
    for (int i = 0; i < 10; ++i)
    {
        LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
        LARGE_INTEGER Frequency;

        QueryPerformanceFrequency(&Frequency);
        QueryPerformanceCounter(&StartingTime);

        // Activity to be timed
        std::vector<uint8_t> tmp;
        tmp.reserve(10'000'000 * 8);

        // The activity (*)
        uint64_t v = argc;
        for (size_t j = 0; j < 10'000'000; ++j)
        {
            v = v * 78239742 + 1278321;
            CopyVector(v, tmp);
        }

        QueryPerformanceCounter(&EndingTime);
        ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;

        // We now have the elapsed number of ticks, along with the
        // number of ticks-per-second. We use these values
        // to convert to the number of elapsed microseconds.
        // To guard against loss-of-precision, we convert
        // to microseconds *before* dividing by ticks-per-second.
        ElapsedMicroseconds.QuadPart *= 1000000;
        ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;

        std::cout << "Elapsed: " << ElapsedMicroseconds.QuadPart << " microsecs" << std::endl;
    }

    std::string s;
    std::getline(std::cin, s);
}

Note that the Debugger application does not actually do anything. It just sits there and waits until the App is done. I'm using the latest version of VS2019.

I have tested four scenarios. For each one, I measured how long a single iteration (the variable i) takes. I expected running App on its own (1) and running Debugger on its own (4) to be about equally fast, because Debugger doesn't really do anything. Reality, however, is very different:

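1. Run App directly (Windows Explorer / Ctrl-F5). On my computer, each iteration takes roughly 1 second.
2. Run App in the Visual Studio debugger (F5). Again, roughly 1 second per iteration, which is what I would expect.
3. Run Debugger in the Visual Studio debugger (F5). Again, roughly 1 second per iteration, which is also what I would expect.
4. Run Debugger directly (Windows Explorer or Ctrl-F5). This time, each iteration takes approximately 4 seconds (!). Not what I would expect!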

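I've narrowed the problem down to the vector<uint8_t> data parameter, which is passed by value (so the copy constructor is called).

I'd love to know what's going on here... Why does running under Debugger make the App 4 times slower when Debugger isn't doing anything?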

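-- Update --

I've added some stack trace and profiling functionality to my little debugger program, using a proprietary library, to compare situations (3) and (4) with each other. Basically, I counted how often each pointer occurs in the stack traces.

The following methods show up significantly in the results for case (4) but are insignificant in case (3). The number at the start of each entry is a simple counter:

(listing not preserved)

RtlpNtMakeTemporaryKey in particular seems to appear a lot. Unfortunately, I don't know what that means, and Google doesn't seem to help...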

Answer 1

The difference is in the debug heap. Read "The Windows Heap Is Slow When Launched from the Debugger" and "Accelerating Debug Runs, Part 1: _NO_DEBUG_HEAP".

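During process initialization, the system (ntdll) checks whether a debugger is present. If one is, it checks whether the environment variable _NO_DEBUG_HEAP exists and is set to a non-zero value. If it is not, ntdll sets NtGlobalFlag (in the PEB) to enable debug heap usage (FLG_HEAP_ENABLE_TAIL_CHECK, FLG_HEAP_ENABLE_FREE_CHECK, FLG_HEAP_VALIDATE_PARAMETERS). All of these checks, together with filling every allocated block with special patterns (baadf00d, and abababab at the end of the block), make every heap allocation and free slower than it would be without them.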

On the other hand, your program spends most of its time allocating and freeing memory on the heap.

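The profile shows this as well: RtlAllocateHeap, and memset, which is where the allocated block gets filled with the magic pattern. As for RtlpNtMakeTemporaryKey, this "function" consists of a single instruction, jmp ZwDeleteKey, so you are not really inside this function but "near" it, inside another heap-related function.

As Simon Mourier noted: why do cases (2) and (3) run as fast as case (1) (no debugger), while only case (4) is slow?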

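From "C++ Debugging Improvements in Visual Studio "14"":

So to improve performance when launching C++ applications with the Visual Studio debugger, in Visual Studio 2015 we disable the operating system's debug heap.

This is done by setting _NO_DEBUG_HEAP=1 in the environment of the debugged process. Compare with "Accelerating Debug Runs, Part 1: _NO_DEBUG_HEAP" (an older article): this is now the default.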

We can check this with the following code in the App:

WCHAR _no_debug_heap[32];
if (GetEnvironmentVariable(L"_NO_DEBUG_HEAP", _no_debug_heap, _countof(_no_debug_heap)))
{
    DbgPrint("_NO_DEBUG_HEAP=%S\n", _no_debug_heap);
}
else
{
    DbgPrint("error=%u\n", GetLastError());
}

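So when we start the App under the VS debugger, there is no debug heap, because the VS debugger adds _NO_DEBUG_HEAP=1. When you start your Debugger under the VS debugger and then start the App from your Debugger, look at the CreateProcessW function:

lpEnvironment

A pointer to the environment block for the new process. If this parameter is NULL, the new process uses the environment of the calling process.

Because you pass 0 here, the App uses the same environment as the Debugger and inherits _NO_DEBUG_HEAP=1.

In case (4), however, nothing sets _NO_DEBUG_HEAP=1. As a result, the debug heap is used and the App runs slower.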

Slowdown when using CreateProcess
