Question

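I wrote a regular expression that matches the title="..." in <a> tags; unfortunately, it also matches the title="..." in <img/> tags.

Is there a way to tell the regex to look for title="..." only inside <a>? I can't use a lookbehind such as (?<=<a\s+) because lookbehinds are NOT supported in JavaScript.

Here is my expression: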

/((title=".+")(?=\s*href))|(title=".+")/igm;

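The expression above matches the following:

[screenshot of the matches]

As you can see, it also matches the title="..." found in <img/>; I need the expression to exclude titles found inside image tags.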

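Here is a link to the RegExp.

Also, if possible, I need to get rid of the title="" wrapped around the title, so that only the title text is returned, whether title comes AFTER href or BEFORE href. If that isn't possible, I suppose I can use .replace() and replace it with "".

Result with zx81's expression:

[screenshot of the matches]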

Answer 1

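First, be aware that most people prefer to parse HTML with a DOM parser, because regular expressions can go wrong on HTML in several ways. That said, for this simple task (no nesting), here is what you can do with a regex.

Using a capture group

JavaScript has no \K, and lookbehind was not available when this was written, but we can capture what we want in a capture group, then read the match from that group and ignore the rest.

This regex captures the title into Group 1: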

<a [^>]*?(title="[^"]*")

In the demo, look at the Group 1 captures in the right-hand pane: that is what we are interested in.

Sample JavaScript code

var unique_results = [];
var yourString = 'your_test_string';
var myregex = /<a [^>]*?(title="[^"]*")/g;
var thematch = myregex.exec(yourString);
while (thematch != null) {
    // is it unique?
    if(unique_results.indexOf(thematch[1]) <0) {
        // add it to array of unique results
        unique_results.push(thematch[1]);
        document.write(thematch[1],"<br />");    
    }
    // match the next one
    thematch = myregex.exec(yourString);
}
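
The question also asks for the title text without the surrounding title="". A minimal sketch, assuming the same kind of input as above: move the capture group inside the quotes so that Group 1 holds only the value. The function name extractAnchorTitles and the sample string are made up for illustration.

```javascript
// Hypothetical helper: returns the bare title values of <a> tags only.
function extractAnchorTitles(html) {
  // Same pattern as above, but the parentheses now wrap only the value.
  var re = /<a [^>]*?title="([^"]*)"/g;
  var titles = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    titles.push(m[1]); // Group 1: text between the quotes
  }
  return titles;
}

// Illustrative input: title after href, an <img> title, and title before href.
var sampleHtml = '<a href="x" title="one"><img src="y" title="pic"><a title="two" href="z">';
console.log(extractAnchorTitles(sampleHtml)); // the two <a> titles; the <img> title is skipped
```

Because the group no longer includes title=" and the closing quote, no follow-up .replace() is needed.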

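Explanation

<a         matches the start of the tag (followed by a space)
[^>]*?     lazily matches any characters that are not >, up to...
(          start of capture Group 1
title="    literal characters
[^"]*      any characters that are not a double quote
"          the closing quote
)          end of Group 1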
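
For comparison, engines that implement ES2018 lookbehind can express the original idea directly. This is only a sketch and assumes such an engine (for example a current Node.js or a recent browser); in older engines the regex literal is a syntax error, so the capture-group approach remains the portable one. The variable names and sample string are illustrative.

```javascript
// Assumes ES2018 lookbehind support; the lookbehind requires that the
// match is preceded by "<a ... title=" inside the same tag (no ">" between).
var lookbehindRe = /(?<=<a\s[^>]*?title=")[^"]*/g;

var lookbehindSample = '<a href="x" title="one"> <img src="y" title="pic">';
var lookbehindTitles = lookbehindSample.match(lookbehindRe) || [];
console.log(lookbehindTitles); // only the <a> title value is returned
```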

Answer 2

I'm not sure you can do this with a single regular expression in JavaScript; however, you could do something like this:

http://jsfiddle.net/KYfKT/1/

var str = '\
<a href="www.google.com" title="some title">\
<a href="www.google.com" title="some other title">\
<a href="www.google.com">\
<img href="www.google.com" title="some title">\
';

var matches = [];
//-- somewhat hacky use of .replace() in order to utilize the callback on each <a> tag
str.replace(/\<a[^\>]+\>/g, function (match) {
    //-- if the <a> tag includes a title, push it onto matches
    var title = match.match(/title="([^"]*)"/i); // [^"]* stops at the closing quote; group 1 is the bare value
    title && matches.push(title[1]);
});

document.body.innerText = JSON.stringify(matches);

You should probably use the DOM rather than regular expressions, though:

http://jsfiddle.net/KYfKT/3/

var str = '\
<a href="www.google.com" title="some title">Some Text</a>\
<a href="www.google.com" title="some other title">Some Text</a>\
<a href="www.google.com">Some Text</a>\
<img href="www.google.com" title="some title"/>\
';

var div = document.createElement('div');
div.innerHTML = str;
var titles = Array.apply(this, div.querySelectorAll('a[title]')).map(function (item) { return item.title; });

document.body.innerText = titles;

Answer 3

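I'm not sure where your HTML source comes from, but I do know that some browsers don't preserve the case (or the attribute order) of the source when it is read back via 'innerHTML'.

Also, both authors and browsers may use either single or double quotes. These are the two most common cross-browser pitfalls I know of.

So you could try: /<a [^>]*?title=(['"])([^\1]*?)\1/gi

It uses back-references and performs a non-greedy, case-insensitive search, so it handles both single- and double-quoted values.

The first part is already explained in zx81's answer. The final \1 matches the first capture group, so it matches whichever opening quote was used. The second capture group should then contain the bare title string. Note that inside a character class \1 is not a back-reference (it is read as an octal escape), so [^\1] matches practically any character; it is the lazy quantifier followed by the real \1 that stops the match at the closing quote.

A simple example: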

var rxp=/<a [^>]*?title=(['"])([^\1]*?)\1/gi
,   res=[]
,   tmp
;

while( tmp=rxp.exec(str) ){  // str is your string
  res.push( tmp[2] );        //example of adding the strings to an array.
}
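
A self-contained version of the loop above, as a sketch: [\s\S]*? is swapped in for [^\1]*? (both lazily match any character up to the closing quote), and the function name extractTitlesAnyQuote and the sample string are invented for illustration.

```javascript
// Hypothetical helper: handles single- or double-quoted titles, any letter case.
function extractTitlesAnyQuote(html) {
  // Group 1 is the opening quote, \1 requires the same quote to close,
  // group 2 is the text in between.
  var rxp = /<a [^>]*?title=(['"])([\s\S]*?)\1/gi;
  var res = [];
  var tmp;
  while ((tmp = rxp.exec(html)) !== null) {
    res.push(tmp[2]);
  }
  return res;
}

// Upper-case tag with a single-quoted title that contains double quotes,
// an <img> that must be skipped, and a plain double-quoted title.
var mixedSample = '<A HREF="x" TITLE=\'say "hi"\'><img title="pic"><a title="two">';
console.log(extractTitlesAnyQuote(mixedSample));
```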

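However, as others have pointed out, regex is a really poor fit for tag soup (a.k.a. HTML). Robert Messerle's alternative (using the DOM) is preferable!

Warning (I almost forgot)..
IE6 (and others?) has this nice "memory-saving feature" that conveniently strips all unneeded quotes (for strings that contain no whitespace). So there, this regex (and zx81's) will fail, because they rely on the quotes being present! Back to the drawing board... (a seemingly never-ending process when regexing HTML).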
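
One possible way to cope with stripped quotes, offered only as a sketch and not as part of the original answers: add an alternative that accepts an unquoted value running up to the next whitespace or >. The function name extractTitlesLoose and the sample markup are assumptions for illustration, and the pattern still shares the general weaknesses of regex on HTML (for example, it would also pick up an attribute whose name merely ends in title=).

```javascript
// Hypothetical helper: accepts quoted or unquoted title values on <a> tags.
function extractTitlesLoose(html) {
  // Either (quote)(value)(same quote), or a bare run of non-space, non-">" characters.
  var re = /<a\s[^>]*?title=(?:(['"])([\s\S]*?)\1|([^\s>]+))/gi;
  var titles = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    // Group 2 is set for quoted values, group 3 for unquoted ones.
    titles.push(m[2] !== undefined ? m[2] : m[3]);
  }
  return titles;
}

var looseSample = '<a href=x title=plain><a title="two words"><img title=pic>';
console.log(extractTitlesLoose(looseSample));
```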

JavaScript RegExp lookbehind alternative?
