Question

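I'm trying to create a regex to limit our spam intake. The problem is that I'm not exactly fluent in regex. My work product below is mostly copy and paste, tweaking, and searching for help to adjust it.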

I decided to try using a regex to match emails whose links misrepresent the hostname.

For example:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>

I basically only care about the hostname, to limit false positives and avoid flagging more-or-less legitimate links such as <A HREF ...>Click here!

So far, I have this:

(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]

According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), plus a smattering of other groups that I'm not sure I care about.

What I want to do is match the string if hostname1 and hostname2 are the same. I feel like it involves a lookbehind or a lookahead, but honestly I don't know.

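Edit: Thanks to Jan for prototyping this. Following the comments on his answer, I made a quick addition to handle a case that wasn't accounted for: image tags. Large sites (e.g. BestBuy) store their images on separate content servers, which triggers the rule. I decided to exclude image tags, and I believe (in my non-expert opinion) I have done so successfully. YMMV.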

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')

Answer 1

This depends somewhat on your programming language. In PHP you could come up with something like:

href=["']https?:\/\/(?<hostname>[^\/"']+)(?=[\/"'])[^>]+>(?:https?:\/\/)?\k'hostname'
# match href, =, a single/double quote, http or https, then :// literally
# capture everything up to (but not including) the first forward slash or quote in a group called hostname
# the lookahead (?=[\/"']) pins the group to that delimiter, so it cannot backtrack to a partial hostname such as "some"
# followed by anything but >
# followed by >
# start an optional non capturing group (?:) with http/https://
# look if one can match the previously captured group called hostname

If that is the case, it is probably not a spam link (the href and the link text match).

Overview:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
<a href="https://example.com/subfolder">example.com</a> <-- will match, the others not
<a href="http://somebadsite.com">https://somegoodsite.com</a>

See a working example here on regex101.com.
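For anyone not working in PHP, here is a minimal Python sketch of the same backreference idea. Treat it as an illustration under a few assumptions rather than a drop-in filter: Python's `re` module spells the named group `(?P<hostname>...)` and the backreference `(?P=hostname)`, the function name `text_matches_href_host` is made up for this example, and the sketch requires a delimiter directly after the href hostname plus a non-hostname character after the hostname in the link text, so that a shared prefix such as `some` cannot satisfy the backreference.

```python
import re

# Python's re module writes named groups as (?P<name>...) and named
# backreferences as (?P=name), where PCRE also accepts (?<name>...) and \k'name'.
SAME_HOST = re.compile(
    r"""href=["']https?://"""           # href=, a quote, and the scheme
    r"""(?P<hostname>[^/"':]+)"""       # hostname taken from the href
    r"""(?=[/"':])"""                   # ...which must end at a delimiter
    r"""[^>]*>\s*(?:https?://)?"""      # rest of the tag, optional scheme in the link text
    r"""(?P=hostname)(?![\w.-])""",     # the same hostname again, not a longer one
    re.IGNORECASE,
)


def text_matches_href_host(html: str) -> bool:
    """Return True if a link's visible text starts with the hostname of its href.

    Hypothetical helper for illustration only.
    """
    return SAME_HOST.search(html) is not None


samples = [
    '<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
]

for link in samples:
    print(text_matches_href_host(link), link)
```

With the three sample links, only the second one should report `True`.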

Edit: As per your comment, you want the negated result, which can be achieved with a negative lookahead:

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>(?:https?:\/\/)?(?!.*\k'hostname')
# same approach as before, except for the last part: (?!...)
# the negative lookahead asserts that the captured hostname does not appear anywhere in the rest of the line

See a working example of this regex here.
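The negated check can be sketched in Python as well. This is only a rough port, assuming Python's `(?P<hostname>...)` / `(?P=hostname)` spelling for the named group and backreference; the helper name `mismatched_hosts` and the `SAMPLES` string are made up for the example. It collects the href hostname of every link whose remaining line never mentions that hostname again.

```python
import re

# Port of the negative-lookahead pattern: the match succeeds only when the
# hostname captured from the href does NOT occur later on the same line.
MISMATCH = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"]+)[^>]+>(?:https?://)?(?!.*(?P=hostname))""",
    re.IGNORECASE,
)


def mismatched_hosts(html: str) -> list[str]:
    """Return the href hostnames of links whose text names a different host.

    Hypothetical helper for illustration only.
    """
    return [m.group("hostname") for m in MISMATCH.finditer(html)]


# One link per line, because "." does not cross newlines by default.
SAMPLES = "\n".join([
    '<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
])

print(mismatched_hosts(SAMPLES))
```

Because the lookahead scans the rest of the line, a link whose text is just "Click here!" would also be reported, so in practice you may want to require that the link text looks like a URL before treating it as suspicious.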

Regex - comparing two capture groups
