Match a given regular expression unless a given word is present (lookahead or lookbehind)

Date: 2016-04-29 22:27:38

Tags: javascript regex lookahead lookbehind

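I'm using a JavaScript regular expression to parse a series of URLs. I need to match a number in each URL (the real case is more complicated, but I've simplified it), and I only want the number to match when a given word does not appear in the URL.

That is, I want to exclude any URL with 'changelogs' in it, so '1047', '1048', '1245' and '1049' would be captured from the following list: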

http://www.opera.com/docs/changelogs/unified/1215/
http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/
http://www.blabblah/security/advisory/1047
http://booger/security/advisory/1048/
ftp://msn.global.whatever/somethingelse/1245
whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/

I know I need some kind of lookaround, but I keep striking out. This is the last pattern I tried:

(?!changelogs)(\d+)
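
For reference, here is a minimal sketch (not part of the original attempt) of how this pattern behaves on two of the sample URLs. The names `sampleUrls`, `attempt` and `attemptResults` are illustrative only. It suggests why the attempt falls short: the lookahead is checked at the position where the digits start, so the word earlier in the URL is never seen.

```javascript
// Two of the sample URLs from the list above.
const sampleUrls = [
  'http://www.opera.com/docs/changelogs/unified/1215/',
  'http://www.blabblah/security/advisory/1047',
];

// The attempted pattern. The negative lookahead only inspects the text that
// starts at the current position, so it never rejects a position where
// digits begin, even if "changelogs" appeared earlier in the URL.
const attempt = /(?!changelogs)(\d+)/;

const attemptResults = sampleUrls.map((url) => {
  const match = attempt.exec(url);
  return match ? match[1] : null;
});

console.log(attemptResults); // the changelogs URL still yields its number
```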

Here is the regex101 sandbox I'm using

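Also, it's important that the only match is the actual number. I don't want anything else to match.

Here is what my .NET code looks like (note that "BulletinOrAdvisoryPattern" is the regex in question)...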

Regex bulletinPattern = new Regex(@matchingDomain.Vendor.BulletinOrAdvisoryPattern, RegexOptions.IgnoreCase);
Match bulletinMatch = bulletinPattern.Match(referenceTitle);

if (bulletinMatch.Success)
{
    // Found the bulletin ID in the NVD Reference Title
    return bulletinMatch.Value;
}

3 Answers:

Answer 0 (score: 2)

The "ugly" regex you need is

(?<=http://www\.opera\.com\b(?!.*/changelogs(?:/|$))\S*)\d+

See the .NET regex demo

However, all you really need is

var result = input.Contains("/changelogs/") ? "" : input.Trim('/').Split('/').LastOrDefault();

See the IDEONE C# demo

var lst = new List<string>() {"http://w...content-available-to-author-only...a.com/docs/changelogs/unified/1215/",
    "http://w...content-available-to-author-only...a.com/docs/changelogs/anythingelse/anything/1215/",
    "http://w...content-available-to-author-only...a.com/security/advisory/1047",
    "http://w...content-available-to-author-only...a.com/security/advisory/1048/",
    "http://w...content-available-to-author-only...a.com/doesnt/matter/could/be/anything/1049/"};
lst.ForEach(m => Console.WriteLine(
        m.Contains("/changelogs/") ? "" : m.Trim('/').Split('/').LastOrDefault()
    ));
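
Since the question is tagged javascript, the same non-regex idea can be sketched in JavaScript. This is only a rough port under stated assumptions: `lastSegmentUnlessChangelogs` is a hypothetical helper name, and like the C# one-liner it returns the last path segment whatever it contains, without checking that it is a number.

```javascript
// Hypothetical helper mirroring the C# one-liner above.
function lastSegmentUnlessChangelogs(url) {
  // Skip any URL that contains the excluded path segment.
  if (url.includes('/changelogs/')) {
    return '';
  }
  // Strip leading and trailing slashes (the equivalent of Trim('/')),
  // then take the last path segment.
  const segments = url.replace(/^\/+|\/+$/g, '').split('/');
  return segments[segments.length - 1];
}

console.log(lastSegmentUnlessChangelogs('http://booger/security/advisory/1048/'));
console.log(lastSegmentUnlessChangelogs('http://www.opera.com/docs/changelogs/unified/1215/'));
```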

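Update

You switched the language from C# to JavaScript, and that changes the situation considerably, because the JS regex engine does not support lookbehind.

So you have to work around it: there are ways to emulate lookbehind, or you can simply use a capturing group.

If capturing is acceptable, try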

/^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/

See the regex demo

var re = /^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/gmi; 
var str = 'http://www.opera.com/docs/changelogs/unified/1215/\nhttp://www.whatever.com/docs/changelogs/anythingelse/anything/1215/\nhttp://www.blabblah/security/advisory/1047\nhttp://booger/security/advisory/1048/\nftp://msn.global.whatever/somethingelse/1245\nwhatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/';
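var res = [];
var m; // declared explicitly so the loop does not create an implicit global

while ((m = re.exec(str)) !== null) {
  res.push(m[1]); // group 1 holds the number
}
document.body.innerHTML = JSON.stringify(res, null, 4);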
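
If each URL arrives as a separate string, as in the .NET snippet from the question, the same pattern can be applied one URL at a time. The sketch below assumes that setup. `extractId`, `idPattern`, `urlsToCheck` and `extracted` are hypothetical names, and the number has to be read from capture group 1 rather than from the whole match.

```javascript
// Same pattern as above, without the g and m flags because each URL is
// tested on its own.
const idPattern = /^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/i;

// Hypothetical helper: returns the captured number, or null when the URL
// contains a "changelogs" segment or has no trailing number.
function extractId(url) {
  const match = idPattern.exec(url);
  return match ? match[1] : null;
}

const urlsToCheck = [
  'http://www.opera.com/docs/changelogs/unified/1215/',
  'http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/',
  'http://www.blabblah/security/advisory/1047',
  'http://booger/security/advisory/1048/',
  'ftp://msn.global.whatever/somethingelse/1245',
  'whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/',
];

const extracted = urlsToCheck.map(extractId);
console.log(extracted);
```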

Or use an optional group (if you want to replace):

var re = /(\/changelogs\/.*)?\/(\d+)/gi; 
var str = 'http://www.opera.com/docs/changelogs/unified/1215/\nhttp://www.whatever.com/docs/changelogs/anythingelse/anything/1215/\nhttp://www.blabblah/security/advisory/1047\nhttp://booger/security/advisory/1048/\nftp://msn.global.whatever/somethingelse/1245\nwhatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/';
var result = str.replace(re, function (m, g1, g2){
  return g1 ? m : "NEW_VAL";
});
document.body.innerHTML = result;
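
One detail worth noting with this approach: the match includes the slash in front of the number, so the replacement value takes the place of that slash as well. If the slash should be kept, a variant along the following lines may help. It is only a sketch, and `replaceId` and `changelogAware` are hypothetical names.

```javascript
// Same optional-group pattern as above.
const changelogAware = /(\/changelogs\/.*)?\/(\d+)/gi;

// Hypothetical helper: replaces the number with newValue unless the
// optional "changelogs" group took part in the match.
function replaceId(text, newValue) {
  return text.replace(changelogAware, (match, changelogPart) =>
    // Keep the original text for changelogs URLs; otherwise re-add the
    // leading slash that was consumed as part of the match.
    changelogPart ? match : '/' + newValue
  );
}

console.log(replaceId('http://booger/security/advisory/1048/', 'NEW_VAL'));
console.log(replaceId('http://www.opera.com/docs/changelogs/unified/1215/', 'NEW_VAL'));
```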

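Answer 1 (score: 1)

Something like the following should do it. If you are interested in more than just opera, you can make it more generic by replacing opera with a broader pattern. You can also replace com with something like .+ to match things like com and net, or list the allowed domains explicitly: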

(com|net|org|gov)

Here is your regex 101 updated to reflect this

Answer 2 (score: 1)

This pattern excludes URLs that contain 'changelogs' and finds the last number enclosed by slashes.

(?:\/)(?!.*changelogs)(?:\/[^\/]+)*\/(\d+)\/{0,1}
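
A quick way to check this pattern against the sample URLs from the question is sketched below. `lastNumberPattern`, `questionUrls` and `captured` are illustrative names. Tracing the pattern by hand suggests it expects two consecutive slashes (as in `http://`) before the path, so the scheme-less last sample does not appear to match.

```javascript
// The pattern from this answer, written as a JavaScript regex literal.
const lastNumberPattern = /(?:\/)(?!.*changelogs)(?:\/[^\/]+)*\/(\d+)\/{0,1}/;

const questionUrls = [
  'http://www.opera.com/docs/changelogs/unified/1215/',
  'http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/',
  'http://www.blabblah/security/advisory/1047',
  'http://booger/security/advisory/1048/',
  'ftp://msn.global.whatever/somethingelse/1245',
  'whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/',
];

// Capture group 1 holds the number; null means the pattern did not match.
const captured = questionUrls.map((url) => {
  const match = lastNumberPattern.exec(url);
  return match ? match[1] : null;
});

console.log(captured);
```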

Here is the updated regex 101