Question

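I started learning C++ a few days ago. I'm writing a simple xHTML parser that does not handle nested tags. For testing I've been using the following data: http://pastebin.com/bbhJHBdQ (about 10k characters). I only need to parse the data between p, h2 and h3 tags. My goal is to parse the tags and their content into the following structure: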

struct Node {
    short tag; // p = 1, h2 = 2, h3 = 3
    std::string data;
};

For example, <p> asdasd </p> would be parsed as tag = 1, string = "asdasd". I don't want to use third-party libraries, and I'm trying to optimize for speed.

Here is my code:

short tagDetect(char * ptr){
    if (*ptr == '/') {
        return 0;
    }

    if (*ptr == 'p') {
        return 1;
    }

    if (*(ptr + 1) == '2')
        return 2;

    if (*(ptr + 1) == '3')
        return 3;

    return -1;
}


struct Node {
    short tag;
    std::string data;

    Node(std::string input, short tagId) {
        tag = tagId;
        data = input;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    std::string input = GetData(); // returns the pastebin content above
    std::vector<Node> elems;

    std::string::size_type pos = 0;
    char pattern = '<';

    int openPos = 0;          // position just after '<' of the last opening tag
    short tagID, lastTag = 0; // 0 = no opening tag seen yet

    double  duration;
    clock_t start = clock();

    for (int i = 0; i < 20000; i++) {
        elems.clear();

        pos = 0;
        while ((pos = input.find(pattern, pos)) != std::string::npos) {
            pos++;
            tagID = tagDetect(&input[pos]);
            switch (tagID) {
            case 0:
                if (tagDetect(&input[pos + 1]) == lastTag && pos - openPos > 10) {
                    elems.push_back(Node(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1), lastTag));
                }

                break;
            case 1:
            case 2:
            case 3:
                openPos = pos;
                lastTag = tagID;
                break;
            }
        }

    }

    duration = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%2.1f seconds\n", duration);
}

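The code runs in a loop so that I can benchmark it. My data contains 10k characters.

I noticed that the biggest "bottleneck" in my code is substr. As written above, the code finishes in 5.8 sec. If I reduce the substr length to 10, the execution time drops to 0.4 sec. If I replace the whole substr call with "", the code finishes in 0.1 sec.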

My questions are:

How can I optimize substr, since it is the main bottleneck of my code?
Are there any other optimizations I could make to my code?

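I'm not sure whether this question is appropriate for SO, but I'm new to C++ and I don't know who else to ask whether my code is complete garbage.

The full source code can be found here: http://pastebin.com/dhR5afuE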

Answer 1

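Instead of storing substrings, you can store data that refers to sections of the original string (via pointers, iterators, or integer indices). As long as that referring data is in use, you have to make sure the original string stays unchanged. Take a hint from boost::string_ref, even if you'd rather not use it directly.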
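To make the idea concrete, here is a minimal sketch that uses std::string_view, the C++17 standard-library counterpart of boost::string_ref. It assumes a C++17 compiler; the names NodeView and makeNodeView are made up for this illustration and do not appear in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical non-owning counterpart of the question's Node:
// `data` refers to a section of the source string instead of copying it.
struct NodeView {
    short tag;              // p = 1, h2 = 2, h3 = 3
    std::string_view data;  // valid only while the source string is alive and unmodified
};

// Builds a node that views the half-open range [begin, end) of `source`.
// No characters are copied and no memory is allocated.
inline NodeView makeNodeView(const std::string& source, std::size_t begin,
                             std::size_t end, short tag) {
    assert(begin <= end && end <= source.size());
    return NodeView{tag, std::string_view(source).substr(begin, end - begin)};
}
```

In the question's loop, the push_back of Node(input.substr(...), lastTag) would become a push_back of makeNodeView(input, begin, end, lastTag), with elems declared as std::vector<NodeView>. The views dangle as soon as input is modified or destroyed, which is exactly the caveat mentioned above.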

Answer 2

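There are better substring search algorithms than a plain linear search, which is O(M×N). Look at the Boyer-Moore and Knuth-Morris-Pratt algorithms. I tested them a few years ago and B-M won.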
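As a rough sketch of what that could look like without third-party code: C++17 ships std::boyer_moore_searcher in <functional>, which plugs into std::search. The helper name findBoyerMoore is invented for this example. Note that the question searches for the single character '<', and with a one-character pattern Boyer-Moore has nothing to skip, so this is only likely to pay off for longer patterns such as a full closing tag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: returns the offset of the first occurrence of a
// non-empty `needle` in `haystack` at or after `from`, or std::string::npos.
inline std::size_t findBoyerMoore(const std::string& haystack,
                                  const std::string& needle,
                                  std::size_t from = 0) {
    assert(from <= haystack.size());
    auto first = haystack.begin() + static_cast<std::ptrdiff_t>(from);
    // The searcher preprocesses the needle once; std::search then runs it.
    auto it = std::search(first, haystack.end(),
                          std::boyer_moore_searcher(needle.begin(), needle.end()));
    return it == haystack.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - haystack.begin());
}
```

If the same needle is searched for many times, it would be worth constructing the searcher once and reusing it, since the preprocessing is repeated on every call here.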

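You could also consider using a regular expression, which is more expensive to set up but may be more efficient than a linear search during the actual search.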
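For comparison, a regex-based version might look roughly like the sketch below, using std::regex from the standard library. The pattern, the ParsedNode struct, and the parseWithRegex function are assumptions made for this example, and it handles only non-nested p, h2 and h3 tags without attributes. Whether it actually beats the hand-written loop would have to be measured.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type mirroring the question's Node.
struct ParsedNode {
    short tag;         // p = 1, h2 = 2, h3 = 3
    std::string data;  // text between the opening and closing tag
};

inline std::vector<ParsedNode> parseWithRegex(const std::string& input) {
    // Compiled once and reused: construction is the expensive part.
    // \1 requires the closing tag to match the opening tag name.
    static const std::regex pattern("<(p|h2|h3)>([^<]*)</\\1>");

    std::vector<ParsedNode> nodes;
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        assert(m.size() == 3);  // whole match, tag name, content
        const std::string name = m[1].str();
        const short tag = name == "p" ? 1 : (name == "h2" ? 2 : 3);
        nodes.push_back(ParsedNode{tag, m[2].str()});
    }
    return nodes;
}
```

This version still copies each match into a std::string, so it does not remove the substring-copy cost discussed in the question; it only replaces the search logic.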

C++ substr, optimizing for speed
