Please explain why JS RegEx isn't working

Time: 2014-01-02 16:30:01

Tags: javascript regex

I'm trying to extract the URLs contained in all img, link, and script tags in a source-code string. In Sublime Text, the regex (img|link|script).+?(href|src)="(.+?)" finds the correct results:

=>

link rel="stylesheet" type="text/css" href="assets/css/dynbm.css"
img src="assets/img/sponsor-image.png"
img class="logo" src="assets/img/sponsor-logo.png"

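But running the same pattern in JS with the following code returns a seemingly random set of URLs (including some, but not all, of the anchor-tag URLs):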

var res,
    lnks = [],
    lnk_exp = new RegExp('(img|link|script).+?(href|src)="(.+?)"', 'gi');
while (res = lnk_exp.exec(src)) {
    lnks.push(res[3]);
}
console.log(lnks);

=>

[
      "assets/css/dynbm.css", 
      "/click?url=http://www.website.com/imglink", 
      "assets/img/sponsor-image.png", 
      "/click?url=http://www.website.com/link1", 
      "/click?url=http://www.website.com/link3", 
      "/click?url=http://www.website.com/cta", 
      "assets/img/sponsor-logo.png"
]

The full string, for anyone interested:

<link rel="stylesheet" type="text/css" href="assets/css/dynbm.css"><div class="dynbm_wrap rrwidth" id="dbm-name"><div id="dynbm_screens"><div class="screen" id="slide1"><div class="dynbm_body"><div class="img_right"><a href="/click?url=http://www.website.com/imglink" onclick="return sl(this,'nw','dbm-name_i1-1');"><img src="assets/img/sponsor-image.png" alt="sponsor-image"></a></div><h3><a href="/click?url=http://www.website.com/link" onclick="return sl(this,'nw','dbm-name_h1-1');">Heading</a></h3><div class="body_content"><ul><li><a href="/click?url=http://www.website.com/link1" onclick="return sl(this,'nw','dbm-name_l1-1');">Bullet 1</a></li><li><a href="/click?url=http://www.website.com/link2" onclick="return sl(this,'nw','dbm-name_l1-2');">Bullet 2</a></li><li><a href="/click?url=http://www.website.com/link3" onclick="return sl(this,'nw','dbm-name_l1-3');">Bullet 3</a></li></ul></div><p class="action_link"><a target="_parent" href="/click?url=http://www.website.com/cta" onclick="return sl(this,'nw','dbm-name_a1-1');">Learn More</a></p></div></div></div><div class="dynbm_base"><div id="sponsored_footer"><p class="sponsored_text"><a href="/www/sponsored-by" id="sponsorlnk" target="_parent">From Our Sponsor</a></p><a target="_parent" href="/click?url=http://www.website.com" onclick="return sl(this,'nw','dbm-name_logo');"><img class="logo" src="assets/img/sponsor-logo.png" alt="Logo"></a></div><div class="disclosure"><a href="#" class="close" title="Close this message" target="_parent">close</a><h4>From Our Sponsor</h4><p>Content under this heading is from or created on behalf of the named sponsor. 
This content is not subject to the WebMD Editorial Policy and is not reviewed by the WebMD Editorial department for accuracy, objectivity or balance.</p></div></div></div><style type="text/css">#dbm-name { background: #fff; color: #000; }</style><script type="text/javascript">(function(){var e=$('dbm-name');e.find('p.sponsored_text a, .disclosure a.close, .disclosure').click(function(){e.find('.disclosure').toggleClass('visible').css('z-index',99);return false})})()</script>

3 answers:

Answer 0 (score: 1)

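I think all you need is a < at the start of your regex. That keeps it from inadvertently matching the substrings img, link, or script where they are not tags of those types, as in this piece of your example source: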

img_right"><a href="/click?url=http://www.website.com/imglink"
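To make the accidental match concrete, here is a minimal sketch that runs the question's original pattern over a fragment copied from the example source. The variable names (fragment, unanchored, found) are made up for illustration.

```javascript
// Fragment copied from the example source in the question.
var fragment = '<div class="img_right"><a href="/click?url=http://www.website.com/imglink" ' +
    'onclick="return sl(this,\'nw\',\'dbm-name_i1-1\');">' +
    '<img src="assets/img/sponsor-image.png" alt="sponsor-image"></a></div>';

// The question's pattern: nothing forces "img" to be a tag name.
var unanchored = /(img|link|script).+?(href|src)="(.+?)"/gi;

var found = [];
var firstMatchText = null;
var m;
while ((m = unanchored.exec(fragment)) !== null) {
    if (firstMatchText === null) {
        firstMatchText = m[0]; // full text of the first match
    }
    found.push(m[3]); // third capture group: the attribute value
}

console.log(firstMatchText);
console.log(found);
```

The first match starts at the img inside the class name img_right and runs on to the anchor's href, which is how an anchor URL ends up in the result list.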

So the regex should be:

<(img|link|script).+?(href|src)="(.+?)"

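Of course, HTML does not require the tag name to be directly adjacent to the opening angle bracket, so you should also allow an optional run of spaces: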

< *(img|link|script).+?(href|src)="(.+?)"
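As a rough sketch of how the anchored pattern could be dropped into the question's loop: the helper name extractTagUrls and the shortened sample string are assumptions for illustration, not part of the original post.

```javascript
// Hypothetical helper: collects href/src values from img, link and script tags.
function extractTagUrls(html) {
    // Same pattern as above; the leading "<" ties the name to a tag opening.
    var tagUrl = /< *(img|link|script).+?(href|src)="(.+?)"/gi;
    var urls = [];
    var m;
    while ((m = tagUrl.exec(html)) !== null) {
        urls.push(m[3]); // third capture group: the attribute value
    }
    return urls;
}

// Shortened sample assembled from pieces of the question's source string.
var sample = '<link rel="stylesheet" type="text/css" href="assets/css/dynbm.css">' +
    '<div class="img_right"><a href="/click?url=http://www.website.com/imglink">' +
    '<img src="assets/img/sponsor-image.png" alt="sponsor-image"></a></div>';

var urls = extractTagUrls(sample);
console.log(urls);
```

On this sample the anchor's href is no longer collected. Note that .+? can still run past the end of a tag, so a script tag with no src could pick up an attribute from a later tag; swapping .+? for [^>]+? is one way to keep the match inside a single tag.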

Keep in mind that HTML ought not be parsed with regex for Real Work™.

Answer 1 (score: 0)

First, your title is a bit sensationalist. Of course JS RegEx works; it just doesn't present its results the way Sublime Text's regex search does.

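In this case it is doing exactly what the documentation says. res[1] ... res[n] return:

The parenthesized substring matches, if any. The number of possible parenthesized substrings is unlimited.

The third substring in your search is everything after the equals sign, and that looks like the output you are getting. Just use res[0] to get the full match.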
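A short sketch of the difference between the full match and the capture groups, using one tag from the question's source as input; the variable names re and res are chosen here for illustration.

```javascript
// The question's pattern, without the g flag, applied to a single tag.
var re = /(img|link|script).+?(href|src)="(.+?)"/i;
var res = re.exec('<img class="logo" src="assets/img/sponsor-logo.png" alt="Logo">');

console.log(res[0]); // full match, the kind of text Sublime Text highlights
console.log(res[1]); // first group: the tag name
console.log(res[2]); // second group: the attribute name
console.log(res[3]); // third group: the attribute value only
```

Here res[0] is the text from img through the closing quote of the src value, while res[3] is just the URL.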

Answer 2 (score: 0)

Silly me, I found the problem myself. I needed to add a < before img/link/script so that things like /click?url=http://www.website.com/link1 are not counted as <link> matches.