Question

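I need to grab all the links from several websites. To do this I have collected the complete HTML files. I need a regular expression that puts all of the links into an array.

I don't want to collect any image files or other code files, just the HTML pages themselves.

I want it to collect all links like these: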

/https://www.hello.com
/https://www.hello.com/index.php
/https://www.hello.com/world
/https://www.hello.com/world.php
/https://www.hello.com/world.html
/https://hello.com
/https://hello.com/world
/http://www.hello.com
/http://www.hello.com/world
/http://hello.com
/http://hello.com/world
/www.hello.com
/www.hello.com/world
/hello.com
/hello.com/world
/hello
/hello/world

But not ones like these:

hello 
hello/world
hello.png
hello.zip
/hello/world.png
/hello/world.js

What regex do I need? Or is there a better way? (Perhaps by collecting the a tags?)

Answer 1

I assume you define a "link" as a hyperlink of the form <a href="...">. The following regex (already written as a PHP string) should be a good start*:

'<\\s*a\\s*[^>]*href\\s*=\\s*"([^"]+)"'


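When used with preg_match($regex, $html, $match) (after wrapping the pattern in delimiters such as ~...~, which the preg functions require), $match[1] gives you the link. However, it is still in encoded form and may contain HTML entities. To decode them, use html_entity_decode.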

$link = html_entity_decode($match[1]);

You should also exclude links that are only fragments of the same page, that is, links that start with a hash sign: $link[0] == '#'
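Putting the steps above together might look like the sketch below. It is not part of the original answer: the function name extract_links, the ~ delimiters, the i flag, and the switch from preg_match to preg_match_all (to get every link rather than only the first) are all assumptions made for illustration.

```php
<?php
// Sketch: collect every double-quoted href found in <a> tags.
// The "~" delimiters and the "i" (case-insensitive) flag are additions;
// the pattern body is the one given in the answer.
function extract_links(string $html): array
{
    $regex = '~<\\s*a\\s*[^>]*href\\s*=\\s*"([^"]+)"~i';
    preg_match_all($regex, $html, $matches);

    $links = [];
    foreach ($matches[1] as $raw) {
        $link = html_entity_decode($raw);   // e.g. &amp; becomes &
        if ($link[0] === '#') {             // skip same-page fragments
            continue;
        }
        $links[] = $link;
    }
    return $links;
}

$html = '<a href="/hello/world?a=1&amp;b=2">x</a> '
      . '<a href="#top">top</a> '
      . '<a class="c" href="https://hello.com">y</a>';

// Prints an array holding "/hello/world?a=1&b=2" and "https://hello.com".
print_r(extract_links($html));
```

The capture group requires at least one character, so $link[0] is always defined when the fragment check runs.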

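*This regex does not conform to the definition of the HTML language (I don't think that can be done 100% correctly). For example, the regex fails on links whose attribute values are not enclosed in double quotes (they may be unquoted or single-quoted).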
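One way to cover the quoting cases mentioned in the footnote is to offer three alternatives for the attribute value. The pattern below is a hypothetical variant, not the answer's regex. It relies on PCRE's branch-reset group (?|...) so that the value always lands in group 1, and it remains a heuristic rather than a real HTML parser.

```php
<?php
// Sketch: accept double-quoted, single-quoted, and unquoted href values.
// extract_links_any_quote is a hypothetical name used for illustration.
function extract_links_any_quote(string $html): array
{
    $regex = '~<a(?=\\s)[^>]*?\\shref\\s*=\\s*'
           . '(?|"([^"]*)"'            // href="..."
           . '|\'([^\']*)\''           // href='...'
           . '|([^\\s"\'>]+))~i';      // href=value
    preg_match_all($regex, $html, $matches);

    // Decode HTML entities in each captured value.
    return array_map('html_entity_decode', $matches[1]);
}

$html = "<a href='/one'>1</a> <a class=x href=/two>2</a> <a href=\"/three\">3</a>";
print_r(extract_links_any_quote($html));
```

Requiring whitespace directly before href keeps attributes such as data-href from being picked up by mistake.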

Answer 2

In this case, something like PHPQuery is probably preferable to using a regex. For an explanation of why, see this answer.
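As an illustration of the parser-based approach, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument) rather than PHPQuery, so it is a swapped-in technique and not the library the answer names. The function name extract_page_links and the extension blacklist are assumptions. The blacklist only approximates the question's "no images or code files" rule, and the sketch does not try to reject relative links that lack a leading slash.

```php
<?php
// Sketch: parse the HTML and read href attributes instead of using a regex.
function extract_page_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is often invalid; collect libxml errors silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);              // expects a non-empty string
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Hypothetical blacklist of non-page extensions; extend as needed.
    $skip = ['png', 'jpg', 'jpeg', 'gif', 'zip', 'js', 'css'];

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));   // entities already decoded
        if ($href === '' || $href[0] === '#') {
            continue;                              // no href, or fragment only
        }
        $path = (string) parse_url($href, PHP_URL_PATH);
        $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, $skip, true)) {
            continue;                              // image, archive, script...
        }
        $links[] = $href;
    }
    return $links;
}

$html = '<p><a href="/hello">a</a><a href="/hello/world.png">b</a>'
      . '<a href="#x">c</a><a href=\'/hello/world.html\'>d</a></p>';
print_r(extract_page_links($html));
```

Because the parser handles quoting and entity decoding itself, single-quoted and unquoted attributes need no special treatment here.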

Regex to capture all relative and absolute links from an HTML string
