Question

I am trying to write a regular expression to parse an HTML string.

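I need to find a word wrapped in a tag that is not followed by another specific tag, for example <br>. The regular expression below seems to work as long as there is no whitespace between the tags.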

preg_match('/\<b[^<]*?\>([^\s<]+?)\<\/b\>\s*(?!\<br\>)/ui', '<b>word</b> <br>');

Expected behavior when there is no space:
https://regex101.com/r/mKTmM3/11

Unexpected behavior when there is a space between </b> and <br>:
https://regex101.com/r/mKTmM3/10

How can I fix this?

Answer 1

Here is one way we might approach this problem.

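Let's start with a "not followed by" strategy that excludes the unwanted <br>, and see whether it works. To do that, we only need to close the expression with an end anchor ($); we probably don't want to bind it to a start anchor (^) as well: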

((<b>([a-z]+)<\/b>)((?!<br>).)*)$


We have also added extra capturing groups (); they can be removed if we don't need them.
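As a rough illustration of removing the extra groups, the sketch below rewrites the pattern above so that only the word inside <b>...</b> is captured; the repeated lookahead step becomes a non-capturing group (?:...). The variable names (lean, sample, hits, words, fullMatches) are made up for this example, and the input is the same four-line sample used in the test.

```javascript
// Same idea as the pattern above, but only the word is captured (group 1).
// (?:(?!<br>).)* consumes one character at a time, checking before each
// character that "<br>" does not start at that position.
const lean = /<b>([a-z]+)<\/b>(?:(?!<br>).)*$/gim;

const sample = [
  '<b>word</b><br>',
  '<b>word</b>   <br>',
  '<b>word</b> in text',
  'half<b>word</b> ',
].join('\n');

// matchAll requires the g flag; m makes $ match at the end of every line.
const hits = [...sample.matchAll(lean)];
const words = hits.map((m) => m[1]);
const fullMatches = hits.map((m) => m[0]);

console.log(words);       // the captured words
console.log(fullMatches); // the full matched text for each hit
```

Only the last two sample lines should produce a match, because the first two contain <br> before the end of the line.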

Test

$re = '/((<b>([a-z]+)<\/b>)((?!<br>).)*)$/im';
$str = '<b>word</b><br>
<b>word</b>   <br>
<b>word</b> in text
half<b>word</b> ';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output

array(2) {
  [0]=>
  array(5) {
    [0]=>
    string(19) "<b>word</b> in text"
    [1]=>
    string(19) "<b>word</b> in text"
    [2]=>
    string(11) "<b>word</b>"
    [3]=>
    string(4) "word"
    [4]=>
    string(1) "t"
  }
  [1]=>
  array(5) {
    [0]=>
    string(12) "<b>word</b> "
    [1]=>
    string(12) "<b>word</b> "
    [2]=>
    string(11) "<b>word</b>"
    [3]=>
    string(4) "word"
    [4]=>
    string(1) " "
  }
}

Demo

const regex = /((<b>([a-z]+)<\/b>)((?!<br>).)*)$/igm;
const str = `<b>word</b><br>
<b>word</b>   <br>
<b>word</b> in text
half<b>word</b> `;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
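The pattern in the question fails on '<b>word</b> <br>' because of backtracking: \s* may match zero characters, so the lookahead (?!<br>) gets tested directly after </b>, where the next character is a space rather than <br>, and it succeeds. A possible alternative, which is not part of the answer above, is to move the whitespace inside the lookahead so that backtracking cannot step around it. The sketch below assumes the goal is to reject the word whenever only whitespace separates </b> from <br>; the function name wordsNotFollowedByBr is hypothetical.

```javascript
// Hypothetical helper: returns the words wrapped in <b>...</b> that are not
// followed by <br>, even when whitespace sits between </b> and <br>.
function wordsNotFollowedByBr(html) {
  // <b\b keeps the opening tag from matching <br ...> itself.
  // \s* sits inside the negative lookahead, so the engine cannot give the
  // whitespace back and then pass the check on " <br>".
  const re = /<b\b[^<>]*>([^\s<]+)<\/b>(?!\s*<br>)/giu;
  return [...html.matchAll(re)].map((m) => m[1]);
}

console.log(wordsNotFollowedByBr('<b>word</b><br>'));     // no word expected
console.log(wordsNotFollowedByBr('<b>word</b> <br>'));    // no word expected
console.log(wordsNotFollowedByBr('<b>word</b> in text')); // "word" expected
```

Unlike the end-anchored pattern, this variant does not depend on where the line ends; it only looks at what directly follows </b>.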
