Boost :: Regex在长表达式不匹配时抛出错误

时间:2013-02-24 06:42:55

标签: c++ regex boost

我有两个正则表达式。一个匹配python样式注释,一个匹配文件路径。

当我尝试查看注释是否与文件路径表达式匹配时,如果注释字符串超过~15个字符,则会引发错误。否则它会按预期运行。

如何修改我的正则表达式以使其没有此问题

示例代码:

#include <string>
#include "boost/regex.hpp"

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
    boost::regex re_comment("\\s*#[^\\r\\n]*");
    boost::regex re_path("\"?([A-Za-z]:)?[\\\\/]?(([^(\\\\/:*?\"<>|\\r\\n)]+[\\\\/]?)+)?\\.[\\w]+\"?");

    string shortComment = " #comment ";
    string longComment  = "#123456789012345678901234567890";
    string myPath       = "C:/this/is.a/path.doc";

    regex_match(shortComment,re_comment);    //evaluates to true
    regex_match(longComment,re_comment);     //evaluates to true

    regex_match(myPath, re_path);             //evaluates to true
    regex_match(shortComment, re_path);       //evaluates to false
    regex.match(longComment, re_path);        //throws error
}

这是抛出的错误

terminate called after throwing an instance of
    'boost::exception_detail::clone_impl<boost::exception_detail
            ::error_info_injector<std::runtime_error> >'
what():  The complexity of matching the regular expression exceeded predefined
    bounds.  Try refactoring the regular expression to make each choice made by the
    state machine unambiguous.  This exception is thrown to prevent "eternal" matches
    that take  an indefinite period time to locate.

1 个答案:

答案 0 :(得分:1)

我知道总是创造一个巨大的正则表达式以解决所有世界问题是很诱人的,而且确实可能​​有这样做的性能原因,但你还必须考虑你在构建这样的时候创建的维护噩梦怪物。话虽如此,我建议将问题分解为可管理的部分。

基本上处理引号,在dir分隔符上拆分字符串,并对路径的每个部分进行正则表达式。

#include <string>
#include "boost/regex.hpp"
#include "boost/algorithm/string.hpp"
using namespace std;
using namespace boost;


bool my_path_match(std::string line)
{
    bool ret = true;
    string drive = "([a-zA-Z]\\:)?";
    string pathElem = "(\\w|\\.|\\s)+";
    boost::regex re_pathElem(pathElem);
    boost::regex re_drive("(" + drive + "|" + pathElem + ")");

    vector<string> split_line;
    vector<string>::iterator it;

    if ((line.front() == '"') && (line.back() == '"'))
    {
        line.erase(0, 1); // erase the first character
        line.erase(line.size() - 1); // erase the last character
    }

    split(split_line, line, is_any_of("/\\"));

    if (regex_match(split_line[0], re_drive) == false)
    {
        ret = false;
    }
    else
    {
        for (it = (split_line.begin() + 1); it != split_line.end(); it++)
        {
            if (regex_match(*it, re_pathElem) == false)
            {
                ret = false;
                break;
            }
        }
    }
    return ret;
}

int main(int argc, char** argv)
{
    boost::regex re_comment("^.*#.*$");

    string shortComment = " #comment ";
    string longComment  = "#123456789012345678901234567890";
    vector<string> testpaths;
    vector<string> paths;
    vector<string>::iterator it;
    testpaths.push_back("C:/this/is.a/path.doc");
    testpaths.push_back("C:/this/is also .a/path.doc");
    testpaths.push_back("/this/is also .a/path.doc");
    testpaths.push_back("./this/is also .a/path.doc");
    testpaths.push_back("this/is also .a/path.doc");
    testpaths.push_back("this/is 1 /path.doc");

    bool ret;
    ret = regex_match(shortComment, re_comment);    //evaluates to true
    cout<<"should be true = "<<ret<<endl;
    ret = regex_match(longComment, re_comment);     //evaluates to true
    cout<<"should be true = "<<ret<<endl;

    string quotes;
    for (it = testpaths.begin(); it != testpaths.end(); it++)
    {
        paths.push_back(*it);
        quotes = "\"" + *it + "\""; // test quoted paths
        paths.push_back(quotes);
        std::replace(it->begin(), it->end(), '/', '\\'); // test backslash paths
        std::replace(quotes.begin(), quotes.end(), '/', '\\'); // test backslash quoted paths
        paths.push_back(*it);
        paths.push_back(quotes);
    }

    for (it = paths.begin(); it != paths.end(); it++)
    {
        ret = my_path_match(*it);             //evaluates to true
        cout<<"should be true = "<<ret<<"\t"<<*it<<endl;
    }

    ret = my_path_match(shortComment);       //evaluates to false
    cout<<"should be false = "<<ret<<endl;
    ret = my_path_match(longComment);        //evaluates to false
    cout<<"should be false = "<<ret<<endl;
}

是的,它(可能)会比单个正则表达式慢但是它会起作用,它不会在python注释行上抛出错误,如果你发现路径/注释失败,你应该能够找出问题并修复它(即它是可维护的)。