I am stuck on one point: replacing the protocol of links in a text when they do not match a given domain.
Test case:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam <a title="mytitle" href="https://www.other-domain.de/path/index.html" target="_blank">other domain</a> nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd <a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">my domain</a>, no sea takimata <a title="mytitle" href="https://www.other-domain.de/path2/index2.html" target="_blank">other domain</a> est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed <a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">my domain</a> voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.
Regex:
$content = preg_replace('/<a (.*?)href=[\"\'](.*?)\/\/(.*?)[\"\'](.*?)>(.*?)<\/a>/i', '<a href="http://$3">$5</a>', $content);
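However, this matches all links. My goal is to apply the replacement only to links that do not match a given domain, in my case "my-domain.de".
In other words, only links that do not match the given domain should have their protocol changed from "https" to "http".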
Cheers, Malik
Answer 0 (score: 0)
For what it's worth, here is the regex you are looking for:
Raw match pattern:
<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>
Raw replacement pattern:
<a $1href="http://$2"$3>$4</a>
The PHP code is:
$content = preg_replace('/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i','<a $1href="http://$2"$3>$4</a>',$content);
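One detail worth noting: the "." in my-domain.de is unescaped in the pattern, so it matches any character, not just a literal dot. The following is a minimal sketch of a reusable variant that escapes the domain with preg_quote(). The function name downgradeForeignLinks and the sample markup are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): change https to http
// for every link whose URL does not contain the given domain.
function downgradeForeignLinks(string $html, string $keepDomain): string
{
    // preg_quote() escapes regex metacharacters such as the "." in the domain;
    // the second argument also escapes the "/" pattern delimiter.
    $domain = preg_quote($keepDomain, '/');

    $pattern = '/<a ((?:(?!href).)*?)href=["\']https:\/\/((?:(?!' . $domain . ').)*?)["\'](.*?)>(.*?)<\/a>/i';

    return preg_replace($pattern, '<a $1href="http://$2"$3>$4</a>', $html);
}

$html = '<a title="t" href="https://www.other-domain.de/a.html" target="_blank">other</a> '
      . '<a href="https://www.my-domain.de/b.html">mine</a>';

// Only the first link is rewritten to http; the my-domain.de link is left alone.
echo downgradeForeignLinks($html, 'my-domain.de'), PHP_EOL;
```

Like the original pattern, this still rejects any URL that contains the domain anywhere, including in the path of a foreign host.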
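That said, be forewarned: as Andy Lester points out, this regex is not reliable. In my opinion, though, the problem is not exactly "the nature of HTML", or at least not that simply. The point made in this admittedly great resource, http://htmlparsing.com/regexes, is that you are trying to reinvent the wheel on a bumpy road. The broader concern is: "it's not that regular expressions are evil, per se, but that overuse of regular expressions is evil." That quote is from Jeff Atwood, in an excellent write-up on the joys and horrors of regular expressions: Regular Expressions: Now You Have Two Problems. (He also has an article warning specifically against using regex to parse HTML: Parsing Html The Cthulhu Way.)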
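For comparison, here is a rough sketch of the parser-based route using PHP's bundled DOM extension. The function name, the wrapper div, and the rule "host equals the domain or is a subdomain of it" are my own assumptions for illustration; adjust them to your data.

```php
<?php
// Hypothetical helper: the name, the wrapper element and the host rule are
// assumptions made for this sketch, not part of the original answer.
function downgradeForeignLinksDom(string $html, string $keepDomain): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    // The wrapper <div> lets us load a fragment and serialize only its children;
    // the XML declaration is a common hint to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $keep = strtolower($keepDomain);
    $suffix = '.' . $keep;

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $scheme = parse_url($href, PHP_URL_SCHEME);
        $host = parse_url($href, PHP_URL_HOST);
        if (!is_string($scheme) || !is_string($host) || strtolower($scheme) !== 'https') {
            continue; // not an absolute https link
        }
        $host = strtolower($host);
        $isOwn = $host === $keep || substr($host, -strlen($suffix)) === $suffix;
        if (!$isOwn) {
            // Replace the 5-character "https" prefix with "http".
            $a->setAttribute('href', 'http' . substr($href, 5));
        }
    }

    // Serialize the children of the wrapper <div> back into a fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the parser handles attributes and whitespace itself, line breaks inside a tag are not a problem here. The serialized markup may differ cosmetically from the input (for example in whitespace between attributes).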
In particular, with my regex "solution" above, the following input (with line breaks), although valid HTML, will not match:
<a title="mytitle"
href="https://www.other-domain.de/path/index.html"
target="_blank">other domain</a>
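The reason is that "." does not match a newline by default. One possible workaround, sketched below, is to add the "s" (PCRE_DOTALL) modifier so that "." also matches line breaks. This is only an illustration of that single change; it does not make the regex a robust HTML parser.

```php
<?php
// The multi-line input from above.
$html = <<<'HTML'
<a title="mytitle"
href="https://www.other-domain.de/path/index.html"
target="_blank">other domain</a>
HTML;

// Pattern body and replacement as in the answer; only the modifiers differ below.
$body = '<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>';
$replacement = '<a $1href="http://$2"$3>$4</a>';

// Without "s": "." stops at the line break, so the input is returned unchanged.
$withoutS = preg_replace('/' . $body . '/i', $replacement, $html);

// With "s": "." also matches "\n", so the tag is matched across lines.
$withS = preg_replace('/' . $body . '/is', $replacement, $html);

var_dump($withoutS === $html);
echo $withS, PHP_EOL;
```

Note that the pattern still requires a literal space after "<a", so a tag that breaks the line directly after "<a" would need "\s" there instead.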
The following inputs, however, are handled as desired:
<a href="https://my-domain.de">my domain</a>
<a href="https://other-domain.de">other domain</a>
<a href="https://www.my-domain.de/path/index.html">my domain</a>
<a href="https://www.other-domain.de/path/index.html">other domain</a>
<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>
<a title="my title" href="https://www.other-domain.de/path/index.html" target="_blank">my domain</a>
They become:
<a href="https://my-domain.de">my domain</a>
<a href="http://other-domain.de">other domain</a>
<a href="https://www.my-domain.de/path/index.html">my domain</a>
<a href="http://www.other-domain.de/path/index.html">other domain</a>
<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>
<a title="my title" href="http://www.other-domain.de/path/index.html" target="_blank">my domain</a>
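If you want to check these cases yourself, a small script along the following lines should do. It runs the pattern and replacement from the answer over each input and compares the result with the output listed above. The $cases array and the $failures counter are just scaffolding for this sketch.

```php
<?php
// The pattern and replacement exactly as given in the answer.
$pattern = '/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i';
$replacement = '<a $1href="http://$2"$3>$4</a>';

// input => expected output, taken from the two lists above
$cases = [
    '<a href="https://my-domain.de">my domain</a>'
        => '<a href="https://my-domain.de">my domain</a>',
    '<a href="https://other-domain.de">other domain</a>'
        => '<a href="http://other-domain.de">other domain</a>',
    '<a href="https://www.my-domain.de/path/index.html">my domain</a>'
        => '<a href="https://www.my-domain.de/path/index.html">my domain</a>',
    '<a href="https://www.other-domain.de/path/index.html">other domain</a>'
        => '<a href="http://www.other-domain.de/path/index.html">other domain</a>',
    '<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>'
        => '<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>',
    '<a title="my title" href="https://www.other-domain.de/path/index.html" target="_blank">my domain</a>'
        => '<a title="my title" href="http://www.other-domain.de/path/index.html" target="_blank">my domain</a>',
];

$failures = 0;
foreach ($cases as $input => $expected) {
    $actual = preg_replace($pattern, $replacement, $input);
    if ($actual !== $expected) {
        $failures++;
        echo 'MISMATCH: ', $input, PHP_EOL;
    }
}
echo $failures === 0 ? 'all cases behave as listed' : $failures . ' mismatches', PHP_EOL;
```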
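Here is a good resource that breaks the regex down in full: http://www.myregextester.com/index.php
To replicate the test on that tool:
For convenience and posterity, I have included the tool's full explanation below, but the two conceptual highlights are:
Lookahead and negative lookahead, e.g. (?!text)
http://php.net/manual/en/regexp.reference.assertions.php
Non-capturing subpatterns, e.g. (?:text)
or (?:(?!text))
http://php.net/manual/en/regexp.reference.subpatterns.php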
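To make these two constructs a bit more concrete, here are three tiny, self-contained examples. The sample strings ("foobar", "stop" and so on) are made up purely for illustration.

```php
<?php
// Negative lookahead: match "foo" only when it is NOT followed by "bar".
preg_match_all('/foo(?!bar)/', 'foobar foobaz', $lookahead, PREG_OFFSET_CAPTURE);
// Only the second "foo" (the one in "foobaz") qualifies.
$lookaheadCount = count($lookahead[0]);
$lookaheadOffset = $lookahead[0][0][1];

// Non-capturing group: (?:ab) groups for the "+" quantifier without creating
// a backreference, so the only captured group is (c).
preg_match('/(?:ab)+(c)/', 'ababc', $nonCapturing);

// Combined, as used in the answer: consume one character at a time, but only
// while "stop" does not begin at the current position.
preg_match('/^((?:(?!stop).)*)/', 'go go stop now', $tempered);

var_dump($lookaheadCount, $lookaheadOffset, $nonCapturing, $tempered[1]);
```

The third form is what keeps the first capture group from running past href and the second from accepting a URL that contains my-domain.de.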
Match pattern explanation:
The regular expression:
`(?i-msx:<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>)`
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?i-msx:                 group, but do not capture (case-insensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <a                     '<a '
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    (?:                  group, but do not capture (0 or more
                         times (matching the least amount
                         possible)):
----------------------------------------------------------------------
      (?!                look ahead to see if there is not:
----------------------------------------------------------------------
        href             'href'
----------------------------------------------------------------------
      )                  end of look-ahead
----------------------------------------------------------------------
      .                  any character except \n
----------------------------------------------------------------------
    )*?                  end of grouping
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  href=                  'href='
----------------------------------------------------------------------
  [\"\']                 any character of: '\"', '\''
----------------------------------------------------------------------
  https:                 'https:'
----------------------------------------------------------------------
  \/                     '/'
----------------------------------------------------------------------
  \/                     '/'
----------------------------------------------------------------------
  (                      group and capture to \2:
----------------------------------------------------------------------
    (?:                  group, but do not capture (0 or more
                         times (matching the least amount
                         possible)):
----------------------------------------------------------------------
      (?!                look ahead to see if there is not:
----------------------------------------------------------------------
        my-domain        'my-domain'
----------------------------------------------------------------------
        .                any character except \n
----------------------------------------------------------------------
        de               'de'
----------------------------------------------------------------------
      )                  end of look-ahead
----------------------------------------------------------------------
      .                  any character except \n
----------------------------------------------------------------------
    )*?                  end of grouping
----------------------------------------------------------------------
  )                      end of \2
----------------------------------------------------------------------
  [\"\']                 any character of: '\"', '\''
----------------------------------------------------------------------
  (                      group and capture to \3:
----------------------------------------------------------------------
    .*?                  any character except \n (0 or more times
                         (matching the least amount possible))
----------------------------------------------------------------------
  )                      end of \3
----------------------------------------------------------------------
  >                      '>'
----------------------------------------------------------------------
  (                      group and capture to \4:
----------------------------------------------------------------------
    .*?                  any character except \n (0 or more times
                         (matching the least amount possible))
----------------------------------------------------------------------
  )                      end of \4
----------------------------------------------------------------------
  <                      '<'
----------------------------------------------------------------------
  \/                     '/'
----------------------------------------------------------------------
  a>                     'a>'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------