Question

I'm working on a state machine that is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

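from a file. The extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456);. Currently, processing a 41 kB file takes nearly a minute and a half. Is there something I'm seriously misunderstanding about this finite state machine?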

#include <algorithm>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, '\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

Billy3

EDIT: This is one of the strangest things I've ever seen. I just rebuilt the project and it's running reasonably again. Odd.

Answer 1

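Unless your 41 kB file is mostly comments or prefs, it will spend most of its time in state 0. For every character in state 0, you make at least two function calls.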

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed that up by first testing whether the current character is 'p':

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

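If the character isn't 'p', there is no need to make any function calls. In particular, no iterator range is created, which is probably where the time is being spent.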
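The same guard can be shown without Boost. This is only a minimal sketch: the helper names startsWithPrefAt and countPrefCalls are made up for illustration and do not appear in the question, and std::string::compare stands in for boost::starts_with.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the question): true if "pref(" starts at pos.
// Assumes pos < text.size().
bool startsWithPrefAt(const std::string& text, std::size_t pos)
{
    // Cheap single-character test first; most positions fail here,
    // so the full comparison below is rarely reached.
    if (text[pos] != 'p')
        return false;
    // compare() clamps the length to the end of the string, so a
    // truncated "pre" near the end simply compares unequal.
    return text.compare(pos, 5, "pref(") == 0;
}

// Hypothetical driver: counts positions where "pref(" begins.
std::size_t countPrefCalls(const std::string& text)
{
    std::size_t count = 0;
    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        if (startsWithPrefAt(text, pos))
            ++count;
    }
    return count;
}
```

Whether this guard explains a minute and a half is something only a profiler can confirm; the sketch just shows the shape of the check.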

Answer 2

I don't know if this is part of the problem, but you have a typo in case 0: "perf" is misspelled as "pref".

Answer 3

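It's hard to tell just by looking at it... but I'm guessing the find algorithms are what's doing it. Why are you searching inside an FSM? By definition, you should be feeding it one character at a time... add more states. Also try making results a list rather than a vector. There is a lot of copying going on with

vector<string>

But mainly: profile it!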
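As an illustration of "one character at a time, add more states", here is a sketch of comment skipping done purely with state transitions, with no find calls inside the loop. The State enum and the stripComments function are hypothetical names chosen for this example; they are not part of the question's code, and the sketch only handles /* */ comments, // comments and double-quoted strings.

```cpp
#include <string>

// Hypothetical states for a character-at-a-time comment stripper.
enum class State { Code, Slash, LineComment, BlockComment, BlockStar, String, StringEscape };

// Returns input with /* ... */ and // ... comments removed.
// Text inside double-quoted strings is copied through untouched.
std::string stripComments(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    State state = State::Code;
    for (char c : input) {
        switch (state) {
        case State::Code:
            if (c == '/') { state = State::Slash; break; }
            if (c == '"') state = State::String;
            out += c;
            break;
        case State::Slash:            // seen one '/', not yet emitted
            if (c == '/') { state = State::LineComment; break; }
            if (c == '*') { state = State::BlockComment; break; }
            out += '/';               // it was an ordinary slash after all
            out += c;
            state = (c == '"') ? State::String : State::Code;
            break;
        case State::LineComment:      // runs until end of line
            if (c == '\n') { out += c; state = State::Code; }
            break;
        case State::BlockComment:
            if (c == '*') state = State::BlockStar;
            break;
        case State::BlockStar:        // seen '*' inside a block comment
            if (c == '/') state = State::Code;
            else if (c != '*') state = State::BlockComment;
            break;
        case State::String:
            out += c;
            if (c == '\\') state = State::StringEscape;
            else if (c == '"') state = State::Code;
            break;
        case State::StringEscape:     // character after a backslash
            out += c;
            state = State::String;
            break;
        }
    }
    if (state == State::Slash)
        out += '/';                   // input ended on a lone slash
    return out;
}
```

Each character is inspected exactly once, so the cost is linear in the input size regardless of how many comments the file contains.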

Answer 4

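A finite state machine is a workable solution, but for text processing it's best to use a highly optimized finite state machine generator. In this case, a regular expression. Here it is as a Perl regex: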

# first clean the comments
$source =~ s|//.*$||mg;     # replace every "// till end of line" with nothing
$source =~ s|/\*.*?\*/||sg; # replace every "/* any text until */" with nothing
                           # depending on your data, you may need a few other
                           # rules here to avoid blanking data, you could replace
                           # the comments with a unique identifier, and then
                           # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /pref\(\s*"(.+?)",\s*(\d+)\s*\);/g) {
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

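Since most regex libraries are Perl-compatible, translating the syntax shouldn't be hard. If your search gets more complicated, then a parser would be in order.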
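For reference, one possible translation to C++ using std::regex (ECMAScript syntax) is sketched below. The function name extractPrefs is hypothetical, and the patterns are adapted rather than copied: [\s\S] replaces Perl's /s flag, [^\n]* replaces .*$, and the string pattern explicitly allows backslash escapes. Like the Perl version, the comment stripping is naive and would damage a string literal that contains // or /*.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (string argument, numeric argument) pairs
// for every pref("...", 123); call found outside comments.
std::vector<std::pair<std::string, std::string>> extractPrefs(std::string source)
{
    // ECMAScript '.' does not match newlines, so [\s\S] is used
    // to let a block comment span several lines.
    static const std::regex blockComment(R"(/\*[\s\S]*?\*/)");
    static const std::regex lineComment(R"(//[^\n]*)");
    // Group 1: string body (escapes such as \" allowed); group 2: digits.
    static const std::regex prefCall(
        R"rx(pref\(\s*"((?:[^"\\]|\\.)*)"\s*,\s*(\d+)\s*\);)rx");

    // First clean the comments, as in the Perl version.
    source = std::regex_replace(source, blockComment, "");
    source = std::regex_replace(source, lineComment, "");

    // Then walk over every match and collect the captured groups.
    std::vector<std::pair<std::string, std::string>> results;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(source.begin(), source.end(), prefCall); it != end; ++it)
        results.emplace_back((*it)[1].str(), (*it)[2].str());
    return results;
}
```

This is not a performance claim: std::regex implementations vary a lot in speed, so it is worth measuring before preferring it over the hand-written state machine.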

Answer 5

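If you're doing parsing, why not use a parser library?

I typically have Boost.Spirit.Qi in mind.

You express your grammar with EBNF-like expressions, which certainly makes maintenance easier.
It's a header-only library, so you don't have to worry about bringing a whole binary into the mix.

While I can appreciate the minimalist approach, I'm afraid that your idea of coding a finite state machine yourself is not that wise. It works well for a toy example, but as the requirements add up you'll end up with one monstrous switch, and it will get harder and harder to understand what is going on.

And please, don't tell me you know it's not going to evolve: I don't believe in oracles ;)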

Why does my finite state machine take so long to execute?
