Question

I'm trying to use the rel="nofollow" attribute to close off all external URLs.

I wrote this simple middleware:

import re

NOFOLLOW_RE = re.compile(u'<a (?![^>]*rel=["\']nofollow[\'"])'\
                         u'(?![^>]*href=["\']mysite\.com[\'"])',
                         re.UNICODE|re.IGNORECASE)

class NofollowLinkMiddleware(object):

    def process_response(self, request, response):
        if ("text" in response['Content-Type']):

            response.content = re.sub(NOFOLLOW_RE, u'<a rel="nofollow" ', response.content.decode('UTF8') )
            return response
        else:
            return response

It works, but it closes all links, internal and external alike. I also don't know how to add <noindex></noindex> tags around a link.

Answer 1

First, you forgot 'http://' and the URL path. So your regexp should be:

NOFOLLOW_RE = re.compile(r'<a (?![^>]*rel=["\']nofollow[\'"])'
                         r'(?![^>]*href=["\']http://mysite\.com(/[^\'"]*)?[\'"])',
                         re.U|re.I)
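
To see what this first fix changes, the snippet below runs the question's pattern and the corrected one against one internal and one external link. It is only a sketch: the two sample links and the `mark` helper are made up for illustration, and `mysite.com` stands for your own domain.

```python
import re

# Pattern from the question: a link counts as internal only if href is exactly "mysite.com".
ORIGINAL_RE = re.compile(
    r'<a (?![^>]*rel=["\']nofollow["\'])'
    r'(?![^>]*href=["\']mysite\.com["\'])',
    re.I)

# Corrected pattern: the scheme and an optional path are part of the href.
FIXED_RE = re.compile(
    r'<a (?![^>]*rel=["\']nofollow["\'])'
    r'(?![^>]*href=["\']http://mysite\.com(/[^\'"]*)?["\'])',
    re.I)

def mark(pattern, html):
    """Hypothetical helper: add rel="nofollow" wherever the pattern matches."""
    return pattern.sub('<a rel="nofollow" ', html)

internal = '<a href="http://mysite.com/about">About</a>'
external = '<a href="http://example.org/">Elsewhere</a>'

print(mark(ORIGINAL_RE, internal))  # internal link is marked: the bug from the question
print(mark(FIXED_RE, internal))     # internal link is left alone
print(mark(FIXED_RE, external))     # external link is still marked
```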

Then you also need to treat hrefs starting with '/' and '#' as internal links:

NOFOLLOW_RE = re.compile(r'<a (?![^>]*rel=["\']nofollow[\'"])'
                         r'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"])',
                         re.U|re.I)

In addition, you may want to take third-level domains and the 'https://' protocol into account.
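
One possible way to cover subdomains is sketched below. It is an illustration rather than a drop-in replacement: `mysite.com` again stands for your own domain, and `INTERNAL_HREF` and `add_nofollow` are names invented for this example. The host part accepts any number of subdomain labels, and the host must be followed by `/`, `?`, `#`, or the closing quote, so that look-alike hosts are not treated as internal.

```python
import re

# Assumption: "mysite.com" is a placeholder for your own domain.
INTERNAL_HREF = (
    r'(?:https?://(?:[\w-]+\.)*mysite\.com(?:[/?#][^\'"]*)?'  # own domain, any subdomain
    r'|/[^\'"]*'    # root-relative path
    r'|#[^\'"]*)'   # fragment on the same page
)

NOFOLLOW_RE = re.compile(
    r'<a (?![^>]*rel=["\']nofollow["\'])'
    r'(?![^>]*href=["\']' + INTERNAL_HREF + r'["\'])',
    re.I,
)

def add_nofollow(html):
    """Hypothetical helper: mark every link the pattern considers external."""
    return NOFOLLOW_RE.sub('<a rel="nofollow" ', html)

print(add_nofollow('<a href="https://blog.mysite.com/post">post</a>'))
print(add_nofollow('<a href="http://mysite.com.evil.org/">trap</a>'))
```

The first link is left unchanged and the second one gets rel="nofollow". Relative links without a leading slash (for example href="page.html") are still treated as external, as they are in the patterns above.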

For the <noindex> tag you can use groups; have a look at re.sub() in the Python docs:

NOFOLLOW_RE = re.compile(r'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'
                         r'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
                         re.U|re.I)
...
response.content = NOFOLLOW_RE.sub(r'<noindex><a rel="nofollow" \g<link></noindex>', your_html)
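
The group substitution can be tried outside Django with a small script like the one below. This is a sketch under a few assumptions: `wrap_external_links` and the sample HTML are invented for the example, and raw strings are used so that `\g<link>` reaches `re.sub()` untouched.

```python
import re

# Same pattern as above: the "link" group captures everything after "<a "
# up to and including the closing </a>.
NOFOLLOW_RE = re.compile(
    r'<a (?P<link>(?![^>]*rel=["\']nofollow["\'])'
    r'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)["\']).*?</a>)',
    re.I,
)

def wrap_external_links(html):
    """Hypothetical helper: wrap each external link in <noindex> and add rel="nofollow"."""
    return NOFOLLOW_RE.sub(r'<noindex><a rel="nofollow" \g<link></noindex>', html)

sample = '<p><a href="http://example.org/">out</a> <a href="/in">in</a></p>'
print(wrap_external_links(sample))
```

Because `.` does not match a newline by default, a link whose text spans several lines would not be matched by this pattern.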

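This regex is quirky. I strongly recommend writing a test for it that covers every combination of <a> tags and attributes you can think of. If you later find a problem in this code, the test will help you avoid breaking everything else.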
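
Such a test might start out like the sketch below, which uses the standard unittest module. The `add_nofollow` helper, the class name, and the sample links are assumptions made for the example; extend the case lists with whatever markup your templates actually produce.

```python
import re
import unittest

# The pattern that treats "/", "#" and http(s)://mysite.com/... as internal.
NOFOLLOW_RE = re.compile(
    r'<a (?![^>]*rel=["\']nofollow["\'])'
    r'(?![^>]*href=["\'](?:https?://mysite\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)["\'])',
    re.I)

def add_nofollow(html):
    """Hypothetical helper wrapping the substitution done by the middleware."""
    return NOFOLLOW_RE.sub('<a rel="nofollow" ', html)

class NofollowRegexTest(unittest.TestCase):

    def test_external_links_are_marked(self):
        cases = {
            '<a href="http://example.org/">x</a>':
                '<a rel="nofollow" href="http://example.org/">x</a>',
            '<a class="btn" href="https://example.org/a?b=1">x</a>':
                '<a rel="nofollow" class="btn" href="https://example.org/a?b=1">x</a>',
        }
        for html, expected in cases.items():
            with self.subTest(html=html):
                self.assertEqual(add_nofollow(html), expected)

    def test_internal_and_already_marked_links_are_untouched(self):
        cases = [
            '<a href="http://mysite.com/page">x</a>',
            '<a href="https://mysite.com/">x</a>',
            '<A HREF="HTTP://MYSITE.COM/">x</A>',
            '<a href="/about">x</a>',
            "<a href='#top'>x</a>",
            '<a href="http://example.org/" rel="nofollow">x</a>',
        ]
        for html in cases:
            with self.subTest(html=html):
                self.assertEqual(add_nofollow(html), html)
```

Saved as, say, test_nofollow.py (the file name is arbitrary), it can be run with `python -m unittest test_nofollow`.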

Answer 2

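I know I'm very late, but I'm leaving this answer for others. @HighCat's answer is correct for every case but one: the regex above also adds nofollow to the bare link http://example.com (no trailing slash or path).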

So the regex for that case should be:

import re

# Matches <a ...>...</a> unless the tag already has rel="nofollow" or its
# href is internal (own domain, root-relative path, or fragment).
NOFOLLOW_RE = re.compile(r'<a (?P<link>(?![^>]*rel=["\']nofollow[\'"])'
                         r'(?![^>]*href=["\'](?:https?://example\.com/?(?:[^\'"]*)|/[^\'"]*|#[^\'"]*)[\'"]).*?</a>)',
                         re.U|re.I)

class NofollowLinkMiddleware(object):

    def process_response(self, request, response):
        # Only rewrite text responses (HTML and the like).
        if "text" in response['Content-Type']:
            response.content = NOFOLLOW_RE.sub(r'<a rel="nofollow" target="_blank" \g<link>',
                                               response.content.decode('UTF8'))
        return response
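
The difference between the two answers can be checked with a short script. The sketch below is a simplification: it drops the `link` group and the `target="_blank"` attribute and only compares how the href part is matched; the names `PATH_REQUIRED_RE`, `PATH_OPTIONAL_RE`, and `add_nofollow` are invented for the example, and `example.com` plays the role of your own domain in both patterns.

```python
import re

GUARD = r'<a (?![^>]*rel=["\']nofollow["\'])'

# Answer 1 style: a "/" is required after the host.
PATH_REQUIRED_RE = re.compile(
    GUARD + r'(?![^>]*href=["\'](?:https?://example\.com(?:/[^\'"]*)|/[^\'"]*|#[^\'"]*)["\'])',
    re.I)

# This answer's style: the "/" and the path are both optional.
PATH_OPTIONAL_RE = re.compile(
    GUARD + r'(?![^>]*href=["\'](?:https?://example\.com/?(?:[^\'"]*)|/[^\'"]*|#[^\'"]*)["\'])',
    re.I)

def add_nofollow(pattern, html):
    """Hypothetical helper: add rel="nofollow" wherever the pattern matches."""
    return pattern.sub('<a rel="nofollow" ', html)

bare = '<a href="http://example.com">home</a>'
print(add_nofollow(PATH_REQUIRED_RE, bare))  # bare domain is marked as external
print(add_nofollow(PATH_OPTIONAL_RE, bare))  # bare domain is left alone
```

One thing to keep in mind: since the optional path may follow the host directly, the second pattern also accepts a host such as example.com.evil.org as internal. Whether that matters depends on where the links in your pages come from.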

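This is a small change. I would have commented or edited instead, but I don't have enough reputation to comment, and an edit must change at least 6 characters.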

Django middleware to add rel="nofollow" to all external links
