Question

I'm writing a simple web scraper to grab some links from a website. I need to check the returned links so that I only collect the ones I actually want.

For example, here are some links returned from http://www.polygon.com/:

[0] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments

[1] http://www.polygon.com/videos

[2] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide

[3] http://www.polygon.com/features

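So links 0 and 2 are the ones I want to scrape; 1 and 3 I don't want. There is a clear visual difference between the links, so how can I compare them?

How can I check so that 1 and 3 are not returned? Ideally, I'd like to be able to supply some kind of input so that it can be adapted to any website.

I'm thinking I need to inspect each link to make sure its path goes past /2015/ and so on, but I'm pretty lost.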

Here is the PHP code I'm using to scrape the links:

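<?php

$source_url = 'http://www.polygon.com/';
$html = file_get_contents($source_url);
if ($html === false) {
    exit("Could not fetch $source_url");
}

$dom = new DOMDocument;
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$links = $dom->getElementsByTagName('a');

// Print the href of every anchor on the page.
foreach ($links as $link) {
    $input_url = $link->getAttribute('href');
    echo $input_url . "<br>";
}
?>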

Answer 1

It looks like a regular expression would help here. You could say, for example:

/* if $input_url contains a 4-digit year (20xx), slash, number(s), slash, number(s), slash */
if (preg_match('#/20\d\d/\d+/\d+/#', $input_url)) {
  echo $input_url . "<br>";
}
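
To make the check adaptable to other sites, one option is to match only the path part of each URL against a pattern you pass in. The following is a minimal sketch, not a definitive implementation: the helper name `isArticleUrl` is hypothetical, and the pattern is only an assumption about what article paths look like on polygon.com, inferred from the four sample links in the question.

```php
<?php

/**
 * Return true when the path component of $url matches $pathPattern.
 * Hypothetical helper; the name and signature are illustrative only.
 */
function isArticleUrl(string $url, string $pathPattern): bool
{
    // parse_url() gives null when the URL has no path and false when it is badly malformed.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }

    return preg_match($pathPattern, $path) === 1;
}

// Assumed shape of an article path: /YYYY/M/D/<numeric id>/<slug>
$articlePattern = '#^/20\d\d/\d{1,2}/\d{1,2}/\d+/[^/]+$#';

// The four sample links from the question.
$urls = [
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments',
    'http://www.polygon.com/videos',
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide',
    'http://www.polygon.com/features',
];

// Keep only the links whose path matches the pattern, then reindex from 0.
$articles = array_values(array_filter($urls, function ($url) use ($articlePattern) {
    return isArticleUrl($url, $articlePattern);
}));

foreach ($articles as $url) {
    echo $url . "<br>";
}
```

Because only the path is tested, the `#comments` fragment on link 0 does not affect the match. For a different site, you would pass a different pattern rather than change the function.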
