I have the following robots.txt as an example:
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /
Disallow: /js/
Disallow: /Web_References/
Disallow: /webresource.axd
Disallow: /scriptresource.axd

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /webresource.axd
Disallow: /scriptresource.axd
Disallow: /js/
Disallow: /Web_References/
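I may be asking too much of regular expressions, but I would like to write a single expression that returns its matches grouped and ordered as follows: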
Matches
- [0]
  - [UserAgents]
    - "googlebot"
    - "slurp"
    - "msnbot"
    - "teoma"
    - "W3C-checklink"
    - "WDG_SiteValidator"
  - [Routes]
    - [0]
      - [Permission] "Allow"
      - [Url] "/"
    - [1]
      - [Permission] "Disallow"
      - [Url] "/js/"
    - [2]
      - [Permission] "Disallow"
      - [Url] "/Web_References/"
... etc ...
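I have written separate expressions that match the individual elements of the document, but I cannot get them to work once they are pieced together. Perhaps someone can point out where I went wrong?
Patterns
User agent: (?:user-agent:\s*)(?<UserAgent>[a-z_0-9-*]*)
Permission: (?<Permission>(?:allow|disallow))(?:\s*:\s*)(?<Url>[/0-9_a-z.]*)
My attempt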
((?<UserAgents>(?:user-agent:\s*)(?<UserAgent>[a-z_0-9-*]*))+(?<Routes>(?<Permission>(?:allow|disallow))(?:\s*:\s*)(?<Url>[/0-9_a-z.]*))+)+
FYI, I am using Expresso to debug these expressions, with the following options checked: Multiline, Compiled, and Ignore Case.
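One probable cause, offered as a sketch rather than a definitive diagnosis: in the attempt above nothing consumes the line break between consecutive lines, so the repeated groups cannot continue past the first line. A second pitfall is that `\s*` after the colon can run across the line break of an empty `Disallow:` and capture the start of the next line as the URL. The example below is a minimal sketch of the grouped approach using .NET's `Group.Captures`. The class name `RobotsGroups`, the field `RecordPattern`, and the looser `[^\s#]*` value classes are assumptions made for illustration and are not part of the patterns above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RobotsGroups
{
    // Hypothetical pattern: one match per record (a run of User-agent lines
    // followed by a run of Allow/Disallow lines).
    // - The trailing \s* after each value lets the repeated group step over line breaks.
    // - [ \t]* (not \s*) after the colon keeps an empty "Disallow:" from
    //   swallowing the next line.
    public static readonly Regex RecordPattern = new Regex(
        @"(?:user-agent:[ \t]*(?<UserAgent>[^\s#]*)\s*)+" +
        @"(?:(?<Permission>allow|disallow)[ \t]*:[ \t]*(?<Url>[^\s#]*)\s*)+",
        RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Shortened sample input, for illustration only.
        string robots =
            "User-agent: googlebot\nUser-agent: slurp\nDisallow: /\nDisallow: /js/\n\n" +
            "User-agent: Mediapartners-Google*\nDisallow:\n\n" +
            "User-agent: *\nDisallow: /webresource.axd\n";

        foreach (Match record in RecordPattern.Matches(robots))
        {
            // A repeated group keeps every iteration in its Captures collection.
            var agents = record.Groups["UserAgent"].Captures
                .Cast<Capture>()
                .Select(c => c.Value);
            Console.WriteLine("UserAgents: " + string.Join(", ", agents));

            // Permission and Url are captured once per rule line,
            // so the two collections line up by index.
            CaptureCollection permissions = record.Groups["Permission"].Captures;
            CaptureCollection urls = record.Groups["Url"].Captures;
            for (int i = 0; i < permissions.Count; i++)
            {
                Console.WriteLine($"  [{i}] {permissions[i].Value} \"{urls[i].Value}\"");
            }
        }
    }
}
```

Each `Match` then corresponds to one entry of the desired `Matches` list, with `UserAgent`, `Permission`, and `Url` captures standing in for the `[UserAgents]` and `[Routes]` levels. This relies on `Captures`, which is specific to the .NET regex engine.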
Answer 0 (score: 1)
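Try this:
^User-agent: (?<UserAgent>.*?)$|^(?<Permission>Allow|Disallow): (?<Url>.*?)$
I am not sure exactly what format you want, but the regex above matches and names the parts you are interested in. Perhaps you can build on it. I rarely write C#, but this may be useful:
using System;
using System.Text.RegularExpressions;

// Sample input; replace with the real robots.txt contents.
string subjectString = "User-agent: *\nDisallow: /js/\n";

try {
    Regex regexObj = new Regex(
        @"^User-agent: (?<UserAgent>.*?)$|^(?<Permission>Allow|Disallow): (?<Url>.*?)$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success) {
        // Group 0 is the whole match, so start at 1 to visit only the named groups.
        for (int i = 1; i < matchResults.Groups.Count; i++) {
            Group groupObj = matchResults.Groups[i];
            if (groupObj.Success) {
                // matched text: groupObj.Value
                // match start: groupObj.Index
                // match length: groupObj.Length
                Console.WriteLine($"{regexObj.GroupNameFromNumber(i)}: {groupObj.Value}");
            }
        }
        matchResults = matchResults.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
    Console.Error.WriteLine(ex.Message);
}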
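To get from line-by-line matches to the grouped shape asked for in the question, the matches can be folded into records in ordinary code. The following is a minimal sketch of that idea, not a complete robots.txt parser. The types `RobotsRecord` and `RobotsParser` are hypothetical names, and the line pattern is a variant of the one above that assumes optional whitespace after the colon and tolerates `\r\n` line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one group of user agents and their rules.
public sealed class RobotsRecord
{
    public List<string> UserAgents { get; } = new List<string>();
    public List<KeyValuePair<string, string>> Routes { get; } =
        new List<KeyValuePair<string, string>>();
}

public static class RobotsParser
{
    // One match per User-agent / Allow / Disallow line.
    // The trailing \r? is there because, in .NET, $ with Multiline
    // matches before '\n' only.
    private static readonly Regex LinePattern = new Regex(
        @"^(?:User-agent:[ \t]*(?<UserAgent>[^\r\n]*?)" +
        @"|(?<Permission>Allow|Disallow):[ \t]*(?<Url>[^\r\n]*?))[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    public static List<RobotsRecord> Parse(string text)
    {
        var records = new List<RobotsRecord>();
        RobotsRecord current = null;

        foreach (Match m in LinePattern.Matches(text))
        {
            if (m.Groups["UserAgent"].Success)
            {
                // A User-agent line that follows at least one rule starts a new record.
                if (current == null || current.Routes.Count > 0)
                {
                    current = new RobotsRecord();
                    records.Add(current);
                }
                current.UserAgents.Add(m.Groups["UserAgent"].Value);
            }
            else if (current != null)
            {
                // Rule lines before any User-agent line are ignored in this sketch.
                current.Routes.Add(new KeyValuePair<string, string>(
                    m.Groups["Permission"].Value, m.Groups["Url"].Value));
            }
        }
        return records;
    }

    public static void Main()
    {
        // Shortened sample input, for illustration only.
        string robots =
            "User-agent: googlebot\r\nUser-agent: slurp\r\nDisallow: /\r\n\r\n" +
            "User-agent: *\r\nDisallow:\r\n";

        foreach (RobotsRecord record in Parse(robots))
        {
            Console.WriteLine("UserAgents: " + string.Join(", ", record.UserAgents));
            foreach (KeyValuePair<string, string> route in record.Routes)
            {
                Console.WriteLine($"  {route.Key} \"{route.Value}\"");
            }
        }
    }
}
```

Keeping the regex per-line and doing the grouping in a loop trades a little code for a pattern that is easier to debug in a tool such as Expresso.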