Regular expression for parsing robots.txt

Time: 2011-09-20 11:37:01

Tags: regex

I have the following robots.txt as an example:

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /
Disallow: /js/
Disallow: /Web_References/
Disallow: /webresource.axd
Disallow: /scriptresource.axd

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /webresource.axd
Disallow: /scriptresource.axd
Disallow: /js/
Disallow: /Web_References/

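I may be asking too much of a regular expression, but I would like to write a single expression that returns its matches grouped and ordered as follows: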

Matches
 - [0]
   - [UserAgents]
      - "googlebot"
      - "slurp"
      - "msnbot"
      - "teoma"
      - "W3C-checklink"
      - "WDG_SiteValidator"
   - [Routes]
      - [0]
        - [Permission] "Disallow"
        - [Url] "/"
      - [1]
        - [Permission] "Disallow"
        - [Url] "/js/"
      - [2]
        - [Permission] "Disallow"
        - [Url] "/Web_References/"

... etc.

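I have written separate expressions that match the individual elements of the document, but I cannot get them to work once they are pieced together. Can anyone point out where I am going wrong?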

Patterns

User agent: (?:user-agent:\s*)(?<UserAgent>[a-z_0-9-*]*)

Permission: (?<Permission>(?:allow|disallow))(?:\s*:\s*)(?<Url>[/0-9_a-z.]*)

My attempt

((?<UserAgents>(?:user-agent:\s*)(?<UserAgent>[a-z_0-9-*]*))+(?<Routes>(?<Permission>(?:allow|disallow))(?:\s*:\s*)(?<Url>[/0-9_a-z.]*))+)+

FYI, I am using Expresso to debug these expressions, with the following options checked: Multiline, Compiled, and Ignore Case.

1 Answer:

Answer 0 (score: 1)

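Try this:

^User-agent:[ \t]*(?<UserAgent>.*?)[ \t]*\r?$|^(?<Permission>Allow|Disallow):[ \t]*(?<Url>.*?)[ \t]*\r?$

I am not sure exactly which output format you want, but the regex above matches and names the parts you are interested in, one line per match. Each alternative is anchored at the start of a line, the value after the colon may be empty (as in the bare "Disallow:" line in your sample), and a trailing \r from Windows line endings is tolerated. Perhaps you can build on that. I rarely write C#, but something like this may help:

try {
    // Requires: using System; using System.Text.RegularExpressions;
    // subjectString holds the full robots.txt text.
    Regex regexObj = new Regex(
        @"^User-agent:[ \t]*(?<UserAgent>.*?)[ \t]*\r?$|^(?<Permission>Allow|Disallow):[ \t]*(?<Url>.*?)[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success) {
        // Group 0 is the whole match; the named groups start at index 1.
        for (int i = 1; i < matchResults.Groups.Count; i++) {
            Group groupObj = matchResults.Groups[i];
            if (groupObj.Success) {
                // group name: regexObj.GroupNameFromNumber(i)
                // matched text: groupObj.Value
                // match start: groupObj.Index
                // match length: groupObj.Length
            }
        }
        matchResults = matchResults.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}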
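
To get closer to the grouped structure asked for in the question, one option is to keep a line-by-line pattern in the spirit of the one above and do the grouping in code rather than in the regex itself. The following is only a rough sketch under a few assumptions: the type names Route, RobotsBlock and RobotsParser are made up for illustration, and a new block is assumed to start whenever a User-agent line follows an Allow/Disallow line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container types; the names are illustrative only.
class Route
{
    public string Permission;
    public string Url;
}

class RobotsBlock
{
    public List<string> UserAgents = new List<string>();
    public List<Route> Routes = new List<Route>();
}

static class RobotsParser
{
    // One match per line: either a User-agent line or an Allow/Disallow line.
    static readonly Regex LineRegex = new Regex(
        @"^User-agent:[ \t]*(?<UserAgent>.*?)[ \t]*\r?$|^(?<Permission>Allow|Disallow):[ \t]*(?<Url>.*?)[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    public static List<RobotsBlock> Parse(string text)
    {
        var blocks = new List<RobotsBlock>();
        RobotsBlock current = null;

        foreach (Match m in LineRegex.Matches(text))
        {
            if (m.Groups["UserAgent"].Success)
            {
                // Assumption: a User-agent line after a rule line opens a new block.
                if (current == null || current.Routes.Count > 0)
                {
                    current = new RobotsBlock();
                    blocks.Add(current);
                }
                current.UserAgents.Add(m.Groups["UserAgent"].Value);
            }
            else if (current != null)
            {
                // Rule lines before any User-agent line are ignored.
                current.Routes.Add(new Route
                {
                    Permission = m.Groups["Permission"].Value,
                    Url = m.Groups["Url"].Value
                });
            }
        }
        return blocks;
    }
}

class Program
{
    static void Main()
    {
        string sample =
            "User-agent: googlebot\nUser-agent: slurp\nDisallow: /\nDisallow: /js/\n\n" +
            "User-agent: *\nDisallow:\n";

        foreach (RobotsBlock block in RobotsParser.Parse(sample))
        {
            Console.WriteLine("UserAgents: " + string.Join(", ", block.UserAgents));
            foreach (Route r in block.Routes)
                Console.WriteLine("  " + r.Permission + " \"" + r.Url + "\"");
        }
    }
}
```

For the short sample in Main this should yield two blocks: one with the user agents googlebot and slurp and two Disallow routes, and one for * with a single Disallow route whose URL is empty. Doing the grouping in a loop keeps the regex simple. A single nested expression that reads repeated groups through Group.Captures might also work in .NET, but it is harder to debug.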