Question

I want to extract the URLs from this text:

<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>

However, I want to exclude URLs that match certain patterns from the extraction. The patterns are:

http://***/i/***/***
http://***/t/***/***
http://[GoTo]
http://[NextURL]

This means I should end up with only these URLs:

http://domaine.com/text
http://domaine.com
http://domaine.com/text/text

So far, I have been using this regular expression:

$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
print_r($matches[0]);

As you can see, this extracts every URL, and I don't know how to exclude the ones that match my specific patterns.

Answer 1

What you are looking for is a negative lookahead:

$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';

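The (?!...) at the start of the sub-match prevents matching any URL that fits one of the enclosed patterns. Note that it rejects every URL containing /i/ or /t/ anywhere in its path, which is broader than the exact three-segment patterns in the question. It may need adjusting for particular edge cases, but as written it does what you need for the sample input.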
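To check this against the question's input, here is a minimal sketch that runs the pattern on the sample HTML. Storing the markup in `$string` through a nowdoc is an assumption made for illustration, since the question does not show how `$string` is populated.

```php
<?php
// Sample input copied from the question (the nowdoc keeps it verbatim).
$string = <<<'HTML'
<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>
HTML;

// Pattern from the answer: the negative lookahead placed right after
// "http(s)://" rejects the [GoTo] and [NextURL] placeholders and any
// URL whose remainder contains /i/ or /t/.
$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';

preg_match_all($regex, $string, $matches);

// $matches[0] holds the full matches, i.e. the URLs that were kept.
print_r($matches[0]);
```

On this input the script should print the three URLs the question asks for: http://domaine.com/text, http://domaine.com, and http://domaine.com/text/text.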
