Question

I'm writing regular expressions for the IRC protocol ABNF message format. Below is a short example of some of the regular expressions I'm writing.

// digit      =  %x30-39                 ; 0-9
// "[0-9]"
static const std::string digit("[\x30-\x39]");

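I use earlier definitions to build more complex ones, and this gets very complicated, fast. Where I'm running into trouble, especially with the more complex regular expressions, is in composing them: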

// hexdigit = digit / "A" / "B" / "C" / "D" / "E" / "F"
// "[[0-9]ABCDEF]"
static const std::string hexdigit("[" + digit + "ABCDEF]");

A "hexdigit" is a "digit" or a "hex letter".

Note: I don't care that the RFC defines the "hexdigit" letters (ABCDEF) as uppercase only. I'm going by what the RFC says, and I don't intend to change its requirements.

const std::regex digit(dapps::regex::digit);
assert(std::regex_match("0", digit));
assert(std::regex_match("1", digit));
assert(std::regex_match("2", digit));
assert(std::regex_match("3", digit));
assert(std::regex_match("4", digit));
assert(std::regex_match("5", digit));
assert(std::regex_match("6", digit));
assert(std::regex_match("7", digit));
assert(std::regex_match("8", digit));
assert(std::regex_match("9", digit));
assert(!std::regex_match("10", digit));

In the code above, matching "digit" behaves as the ABNF intends.

However, "hexdigit" is now invalid regular expression syntax:

[[0-9]ABCDEF]

rather than

[0-9ABCDEF]

and trying to match against it won't work:

const std::regex hexdigit(dapps::regex::hexdigit);
assert(std::regex_match("0", hexdigit));
assert(std::regex_match("1", hexdigit));
assert(std::regex_match("2", hexdigit));
assert(std::regex_match("3", hexdigit));
assert(std::regex_match("4", hexdigit));
assert(std::regex_match("5", hexdigit));
assert(std::regex_match("6", hexdigit));
assert(std::regex_match("7", hexdigit));
assert(std::regex_match("8", hexdigit));
assert(std::regex_match("9", hexdigit));
assert(std::regex_match("A", hexdigit));
assert(std::regex_match("B", hexdigit));
assert(std::regex_match("C", hexdigit));
assert(std::regex_match("D", hexdigit));
assert(std::regex_match("E", hexdigit));
assert(std::regex_match("F", hexdigit));
assert(!std::regex_match("10", hexdigit));

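So if I define "digit" without the "single character in range" selector ([ ]), then "digit" can no longer be used on its own to match a digit.

I may be going about this completely the wrong way, so my question is: do I really need to keep two versions of each definition, one with brackets and one without, or is there a simpler way to compose regular expressions?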

Answer 1

Rather than merging the two character classes as you attempted, which would have needed to produce:

[0-9ABCDEF]

build an alternation (that is, a logical OR) with the pipe character |, and wrap the joined terms in a non-capturing group:

(?:[0-9]|[ABCDEF])

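The benefit of this approach is that you can join any two expressions this way, character classes or otherwise, for example a digit or whitespace: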

(?:[0-9]|\s)

So it can be applied very generally.

Minor point: you can write [ABCDEF] as [A-F], and/or make it case-insensitive with [A-Fa-f].
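As a rough illustration of the alternation approach applied to the question's string constants, here is a minimal sketch. The helper names either and matches, and the constant hexletter, are made up for this example and do not come from the question or the RFC.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the answer): joins two sub-patterns with '|'
// inside a non-capturing group, so the result can be embedded in larger patterns.
std::string either(const std::string& a, const std::string& b) {
    return "(?:" + a + "|" + b + ")";
}

// Each building block keeps its own brackets, so it is usable on its own.
const std::string digit("[0-9]");
const std::string hexletter("[A-F]");  // assumed name for the "A".."F" part
const std::string hexdigit(either(digit, hexletter));  // "(?:[0-9]|[A-F])"

// Convenience wrapper: true if the whole text matches the pattern.
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With these definitions, matches("C", hexdigit) should hold while matches("10", hexdigit) should not, and because the group is self-contained, hexdigit + hexdigit matches exactly two hex digits.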

Answer 2

I'm not sure I'm reading your question correctly. If your concern is duplicating the pattern "constants", you can do it like this:

static const std::string digit("0-9");
static const std::string hexdigit(digit + "ABCDEF");
static const std::string digit_range("[" + digit + "]");
static const std::string hexdigit_range("[" + hexdigit + "]");

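Or keep only the first two and use a utility function like this (a sketch that needs C++17 for the fold expression):

#include <string>

static const std::string digit("0-9");
static const std::string hexdigit(digit + "ABCDEF");

// Concatenates the given range fragments inside one bracket expression.
template <typename... Ranges>
std::string range_of(const Ranges&... ranges) {
    std::string result = "[";
    ((result += ranges), ...);  // append each fragment in order
    result += "]";
    return result;
}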

This way you can define different kinds of range constants and use them as std::regex pattern(range_of(hexdigit));, or even as std::regex pattern(range_of(digit, uppercase_alphabet, normal_punctuation));

Answer 3

To get the general format of an IRC message (without v3, since I assume you are not considering tagged messages from v3), you can use this simple regexp:

^\s*(:[^ \n:]* )?([A-Za-z0-9]*)( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?(:.*)?

See the demo.

It lets you break the message into its parts, matching up to six distinct parameters plus a catch-all last parameter preceded by :.
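To show how the capture groups of that expression might be read from C++, here is a minimal sketch using std::regex_search. The names irc_message and irc_field are invented for this example, and the group numbering in the comment is my reading of the pattern rather than something the answer states.

```cpp
#include <regex>
#include <string>

// The pattern from the answer, unchanged. As read here: group 1 = prefix,
// group 2 = command, groups 3-8 = up to six parameters, group 9 = trailing.
const std::regex irc_message(
    R"(^\s*(:[^ \n:]* )?([A-Za-z0-9]*)( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?(:.*)?)");

// Hypothetical helper: returns capture group `index` of `line`, or an empty
// string if the line does not match or the group did not participate.
std::string irc_field(const std::string& line, std::size_t index) {
    std::smatch m;
    if (!std::regex_search(line, m, irc_message) || index >= m.size()) {
        return "";
    }
    return m[index].str();
}
```

For a line such as ":nick!user@host PRIVMSG #channel :Hello world", irc_field(line, 2) gives the command and irc_field(line, 9) the trailing text. The captures keep their delimiters (the trailing space of the prefix, the leading space of each parameter, the leading colon of the last one), and the space in front of the trailing parameter can end up in a parameter group by itself, so callers would probably want to trim the results.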

How to compose regular expressions in code
