Question

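I am trying to write a RegEx rule that finds all the a href HTML links on my web page and adds rel="nofollow" to them.

However, I have a list of URLs that must be excluded. For example, any internal link should be matched by wildcard (e.g. pokerdiy.com), so that links pointing to my own domain are left alone. I also want to be able to specify exact URLs in the exclusion list, for example http://www.example.com/link.aspx.

Here is what I have working so far: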

(]+)(href="http://.*?(?!(pokerdiy))[^>]+>)

If you need more background or information, you can read the full thread and requirements here (skip the top part to get to the details): http://www.snapsis.com/Support/tabid/601/aff/9/aft/13117/afv/topic/afpgj/1/Default.aspx#14737

Answer 1

An improvement on James' regex:

(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>

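This regex matches links that are NOT in the string array $follow_list. The strings do not need a leading 'www'. :) The advantage is that this regex preserves the other attributes in the tag (such as target, style, title...). If a rel attribute already exists in the tag, the regex will not match, so you can force "follow" on URLs that are not in $follow_list by giving them an explicit rel attribute.

Replace with: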

$1$2$3"$4 rel="nofollow">

Full example (PHP):

function dont_follow_links( $html ) {
 // follow these websites only!
 $follow_list = array(
  'google.com',
  'mypage.com',
  'otherpage.com',
 );
 // escape regex metacharacters so the dots in the host names match literally
 $hosts = array_map(function ($host) { return preg_quote($host, '%'); }, $follow_list);
 return preg_replace(
  '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $hosts).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%',
  '$1$2$3"$4 rel="nofollow">',
  $html);
}

If you want to overwrite rel no matter what, I would use a preg_replace_callback approach, where the rel attribute is replaced separately inside the callback:

$subject = preg_replace_callback('%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"[^>]*)>%', function($m) {
    return preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]).' rel="nofollow">';
}, $subject);
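To try the callback variant end to end, here is a minimal sketch. The function name force_nofollow, the whitelist and the sample markup are assumptions made for illustration; the preg_quote and rtrim calls are small additions that are not part of the snippet above.

```php
<?php
// Hypothetical wrapper around the callback variant; the name, the whitelist
// and the sample markup below are illustrative assumptions.
function force_nofollow($html, array $follow_list) {
    // Escape the host names so that dots match literally (an addition).
    $hosts = array_map(function ($host) {
        return preg_quote($host, '%');
    }, $follow_list);
    $pattern = '%(<a\s*[^>]*href="https?://(?:(?!(?:(?:www\.)?'
        . implode('|(?:www\.)?', $hosts)
        . '))[^"]+)"[^>]*)>%';

    return preg_replace_callback($pattern, function ($m) {
        // Drop any existing rel attribute, then append rel="nofollow".
        $tag = preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]);
        // rtrim avoids a double space when rel was the last attribute.
        return rtrim($tag) . ' rel="nofollow">';
    }, $html);
}

$html = '<a href="http://example.org" rel="external">A</a> '
      . '<a href="http://www.google.com/">B</a>';
echo force_nofollow($html, array('google.com')), "\n";
```

With this input the first link should have its rel="external" replaced by rel="nofollow", while the whitelisted google.com link is left untouched.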

Answer 2

I've developed a slightly more robust version that detects whether the anchor tag already has "rel=" in it, so the attribute is not duplicated.

(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog.bandit.co.nz)[^"]+)"([^>]*)>

Matches

<a href="http://google.com">Google</a>
<a title="Google" href="http://google.com">Google</a>
<a target="_blank" href="http://google.com">Google</a>
<a href="http://google.com" title="Google" target="_blank">Google</a>

But does not match

<a rel="nofollow" href="http://google.com">Google</a>
<a href="http://google.com" rel="nofollow">Google</a>
<a href="http://google.com" rel="nofollow" title="Google" target="_blank">Google</a>
<a href="http://google.com" title="Google" target="_blank" rel="nofollow">Google</a>
<a href="http://google.com" title="Google" rel="nofollow" target="_blank">Google</a>
<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>

Replace using

$1$2$3"$4 rel="nofollow">
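As a rough check of the behaviour listed above, the pattern and replacement can be run through PHP's preg_replace. This is only a sketch: the helper name add_nofollow and the choice of sample strings are assumptions, and the answer itself does not name a language.

```php
<?php
// Pattern and replacement are copied from the answer; the helper name and
// the sample strings are assumptions for illustration.
function add_nofollow($html) {
    $pattern = '%(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blog.bandit.co.nz)[^"]+)"([^>]*)>%';
    return preg_replace($pattern, '$1$2$3"$4 rel="nofollow">', $html);
}

$samples = array(
    '<a href="http://google.com">Google</a>',                        // gets rel="nofollow"
    '<a href="http://google.com" rel="nofollow">Google</a>',         // already has rel, unchanged
    '<a target="_blank" href="http://blog.bandit.co.nz">Bandit</a>', // excluded host, unchanged
);
foreach ($samples as $sample) {
    echo add_nofollow($sample), "\n";
}
```

One thing to keep in mind: (?!.*\brel=) looks ahead to the end of the current line rather than the end of the tag, so a link without rel will also be skipped if a later tag on the same line contains rel=.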

Hope this helps someone!

James

Answer 3

(<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"

will match the first part of any link that starts with http:// or https:// and does not contain pokerdiy.com or www.example.com/link.aspx anywhere in the href attribute. Replace that with

\1\2" rel="nofollow"

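If rel="nofollow" is already present, you will end up with two of them. And of course, relative links or other protocols such as ftp:// will not be matched at all.

Explanation:

(?!\b(foo|bar)\b)[^"] matches any character other than ", unless foo or bar can be matched at the current position. The \b anchors are there to make sure we don't accidentally trigger on rebar or foonly.

This whole construct is repeated ((?: ... )+), and whatever it matches is preserved in backreference \2.

Since the next token to be matched is a ", the entire regex fails if the attribute contains foo or bar anywhere.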
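The per-character lookahead described above can be tried out with a short PHP sketch. The pattern and replacement are the ones from this answer; the sample links are assumptions chosen to show one match, two exclusions, and the duplicate-rel caveat.

```php
<?php
// Pattern and replacement come from the answer above; the sample links
// are assumptions chosen for illustration.
$pattern = '%(<a href="https?://)((?:(?!\b(pokerdiy.com|www\.example\.com/link\.aspx)\b)[^"])+)"%';
$replacement = '\1\2" rel="nofollow"';

$samples = array(
    '<a href="http://other.example.org/page">external</a>',                  // rel is added
    '<a href="http://www.pokerdiy.com/forum">internal</a>',                  // excluded domain
    '<a href="http://www.example.com/link.aspx">excluded</a>',               // excluded exact URL
    '<a href="http://other.example.org/" rel="nofollow">already tagged</a>', // ends up with two rel attributes
);
foreach ($samples as $sample) {
    echo preg_replace($pattern, $replacement, $sample), "\n";
}
```

Because the lookahead is re-checked before every character of the URL, the excluded strings are rejected wherever they occur in the href value, not just at its start.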

RegEx expression to find a href links and add NoFollow to them
