Question

Note: this is about .NET regular expressions.

I have a bunch of text from which I need to extract specific lines. The lines I care about have the following form:

type Name(type arg1, type arg2, type arg3)

To match these, I came up with the following regular expression:

^(\w+)\s+(\w+)\s*\(\s*((\w+)\s+(\w+)(:?,\s+)?)*\s*\)$

This confusing mess produces a Match object that looks like this:

Group 0: type Name(type arg1, type arg2, type arg3)
    Capture 0: type Name(type arg1, type arg2, type arg3)
Group 1: type
    Capture 0: type
Group 2: Name
    Capture 0: Name
Group 3: type arg3
    Capture 0: type arg1,
    Capture 1: type arg2,
    Capture 2: type arg3
Group 4: type
    Capture 0: type
    Capture 1: type
    Capture 2: type
Group 5: arg3
    Capture 0: arg1
    Capture 1: arg2
    Capture 2: arg3
Group 6:
    Capture 0: ,
    Capture 1: ,

However, that is not the whole of the input. Some of the lines may look like this:

type Name(type arg1, type[] arg2, type arg3)

Note the brackets before arg2.

So I modified my regular expression:

^(\w+)\s+(\w+)\s*\(\s*((\w+)\s*(\[\])?\s+(\w+)(:?,\s+)?)*\s*\)$

This produces a match like the following:

Group 0: type Name(type arg1, type[] arg2, type arg3)
    Capture 0: type Name(type arg1, type[] arg2, type arg3)
Group 1: type
    Capture 0: type
Group 2: Name
    Capture 0: Name
Group 3: type arg3
    Capture 0: type arg1,
    Capture 1: type[] arg2,
    Capture 2: type arg3
Group 4: type
    Capture 0: type
    Capture 1: type
    Capture 2: type
Group 5: []
    Capture 0: []
Group 6: arg3
    Capture 0: arg1
    Capture 1: arg2
    Capture 2: arg3
Group 7:
    Capture 0: ,
    Capture 1: ,

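Group 5 does contain the brackets. However, its only capture is #0, which says nothing about which parameter it belongs to (the second one).

Is there some way to associate this capture with the corresponding iteration of the outer group, or am I barking up the wrong tree?

I suppose another way to do this would be to parse the arguments out of the input separately. But surely there is a way to do it like this, isn't there?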
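To make the mismatch concrete, here is a minimal sketch of one way the association could be recovered: every .NET Capture carries an Index and a Length, so a "[]" capture can be assigned to the parameter whose outer capture (group 3) contains its position. The class name CaptureCorrelation and the method Parse are hypothetical names chosen for this illustration; the pattern is the revised one from above, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureCorrelation
{
    // The revised pattern from the question, unchanged.
    const string Pattern =
        @"^(\w+)\s+(\w+)\s*\(\s*((\w+)\s*(\[\])?\s+(\w+)(:?,\s+)?)*\s*\)$";

    // Returns one (Type, IsArray, Name) tuple per parameter.
    public static List<(string Type, bool IsArray, string Name)> Parse(string line)
    {
        var result = new List<(string Type, bool IsArray, string Name)>();
        Match m = Regex.Match(line, Pattern);
        if (!m.Success) return result;

        // Group 3 captures once per parameter; groups 4 and 6 do too,
        // so their capture indexes line up with group 3.
        CaptureCollection whole = m.Groups[3].Captures;
        for (int i = 0; i < whole.Count; i++)
        {
            Capture outer = whole[i];
            bool isArray = false;

            // Group 5 only captures when "[]" is present, so its captures
            // are matched to a parameter by position, not by index.
            foreach (Capture c in m.Groups[5].Captures)
            {
                if (c.Index >= outer.Index && c.Index < outer.Index + outer.Length)
                    isArray = true;
            }

            result.Add((m.Groups[4].Captures[i].Value,
                        isArray,
                        m.Groups[6].Captures[i].Value));
        }
        return result;
    }
}
```

This only works because groups 4 and 6 take part in every iteration; any optional group has to be matched by position in the same way as group 5.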

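Edit
To clarify, I am not building a language parser. I am converting old plain-text API documentation for a scripting language, which looks like this: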

--- foo object ---
void bar(int baz)
 * This does something.
 * Remember blah blah blah.

int getFrob()
 * Gets the frob

into a new format that I can export to HTML and so on.

Edit mkII: For the benefit of others, here is the newly revised code:

// Stage 1: match the whole signature; group 3 is the raw text between the parentheses.
m = Regex.Match(line, @"^(\w+)\s+(\w+)\s*\((.*?)\)$");
if (m.Success) {

    // A new method starts here, so store the previous one first.
    if (curMember != null) {
        curType.Add(curMember);
    }
    curMember = new XElement("method");
    curMember.Add(new XAttribute("type", m.Groups[1].Value));
    curMember.Add(new XAttribute("name", m.Groups[2].Value));

    if (m.Groups[3].Success) {
        XElement args = new XElement("arguments");

        // Stage 2: match each "type name" pair inside the argument text separately.
        MatchCollection matches = Regex.Matches(m.Groups[3].Value, @"(\w+)(\[\])?\s+(\w+)");

        foreach (Match m2 in matches) {
            XElement arg = new XElement("arg");
            arg.Add(new XAttribute("type", m2.Groups[1].Value));
            // Group 2 only succeeds when this argument has "[]".
            if (m2.Groups[2].Success) {
                arg.Add(new XAttribute("array", "array"));
            }
            arg.Value = m2.Groups[3].Value;

            args.Add(arg);
        }

        curMember.Add(args);
    }
}

First it matches the type Name(*) part; when that succeeds, it repeatedly matches type Name within the arguments part.

Answer 1

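I would do this by making it a two-stage parser.

First, I would make sure I know what I have. At this stage I don't care about matching groups.

The second stage actually tries to make sense of it all. From the first stage it is easy to get, for example, everything inside the parentheses, but parsing the parameters is harder. So take what is inside the parentheses, split it on "," for example, and then parse the parameters one by one.

If that is too hard because, for example, multidimensional arrays ([,]) are allowed, create a regular expression that picks just the first parameter off the parameter string. Then you know how long that parameter is, you remove that part from the parameter string, and you repeat with what is left, and so on.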
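The two stages might be sketched as follows, assuming parameters have the simple "type name" or "type[] name" form from the question. The class name TwoStageParser, the method ParseParams, and both patterns are assumptions made for this illustration, not code from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStageParser
{
    // Stage 1: check the overall shape of the line and grab
    // the text between the parentheses, nothing more.
    static readonly Regex LineShape = new Regex(@"^\w+\s+\w+\s*\((.*)\)$");

    // Stage 2: applied to a single parameter at a time.
    static readonly Regex OneParam = new Regex(@"^(\w+)\s*(\[\])?\s+(\w+)$");

    public static List<(string Type, bool IsArray, string Name)> ParseParams(string line)
    {
        var result = new List<(string Type, bool IsArray, string Name)>();

        Match shape = LineShape.Match(line);
        if (!shape.Success) return result;

        // Split the parameter text on commas and parse each piece on its own.
        foreach (string piece in shape.Groups[1].Value.Split(','))
        {
            Match p = OneParam.Match(piece.Trim());
            if (p.Success)
            {
                // Group 2 succeeds only if this particular parameter has "[]".
                result.Add((p.Groups[1].Value, p.Groups[2].Success, p.Groups[3].Value));
            }
        }
        return result;
    }
}
```

Because each parameter gets its own Match, an optional group such as "[]" can no longer be attributed to the wrong parameter.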

Match the whole line and extract the part inside the parentheses:

"type Name(type arg1, type[] arg2, type arg3)" => "type arg1, type[] arg2, type arg3"

Parse the arguments:

a. Eat the first argument of the argument list:

"type arg1, type[] arg2, type arg3" => "type", "arg1"

b. Remove the length of the parsed argument from the argument list:

"type arg1, type[] arg2, type arg3" => ", type[] arg2, type arg3"


", type[] arg2, type arg3".TrimStart(new char[]{ ',', ' ' }) => "type[] arg2, type arg3"

c. If the string is not empty: lather, rinse, repeat.
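Steps a to c can be sketched as a loop like the one below. The names ArgumentEater and Eat are hypothetical, and the pattern is an assumption: it accepts commas inside the brackets so that a multidimensional array such as [,] does not break the parse.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArgumentEater
{
    // Matches exactly one argument at the start of the remaining text.
    // \[,*\] allows "[]" as well as "[,]", "[,,]" and so on.
    static readonly Regex FirstArg = new Regex(@"^(\w+)\s*(\[,*\])?\s+(\w+)");

    // Returns one { type, name } pair per argument.
    public static List<string[]> Eat(string args)
    {
        var result = new List<string[]>();
        string rest = args.Trim();

        // c. Repeat while there is text left.
        while (rest.Length > 0)
        {
            // a. Eat the first argument.
            Match m = FirstArg.Match(rest);
            if (!m.Success) break; // unparseable remainder: stop instead of looping forever

            result.Add(new[] { m.Groups[1].Value + m.Groups[2].Value, m.Groups[3].Value });

            // b. Drop the parsed argument, then the separator that follows it.
            rest = rest.Substring(m.Length).TrimStart(',', ' ');
        }
        return result;
    }
}
```

Since the pattern is anchored with ^, m.Length is exactly the number of characters consumed from the front of the remaining text.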
