Question

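I have installed a PHP script that bans bots which ignore my robots.txt file, and I want to test that it works. Are there a few lines of PHP code I could use to simulate a spider crawling my website? Ideally it would crawl 'n' levels deep, write the results to a simple text file, and ignore both my robots.txt file and rel="nofollow".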

Answer 1

wget -r -l4 --spider -D thesite.com http://www.thesite.com

Source: http://beeznest.wordpress.com/2012/07/01/spider-a-website-with-wget/

Answer 2

You can use PHP Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net/

// Load the parser (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
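
The loop above lists the links on a single page; the question also asks for a crawl 'n' levels deep. A minimal sketch of that follows, assuming PHP 7.1+ with the built-in DOM extension (used here instead of Simple HTML DOM so the block is self-contained) and allow_url_fopen enabled. The names crawl, resolve_url and the TestSpider user agent are made up for illustration. It never reads robots.txt and never inspects rel attributes, so both are ignored by construction. Relative paths such as ../ are not normalized, so treat it as a starting point rather than a complete spider.

```php
<?php
// Sketch only: crawl() and resolve_url() are hypothetical helper names.

/** Turn an href into an absolute http(s) URL (common cases only), or null to skip it. */
function resolve_url(string $base, string $href): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    $href = explode('#', $href, 2)[0];              // drop any fragment
    if (preg_match('#^https?://#i', $href)) {
        return $href;                               // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href)) {
        return null;                                // mailto:, javascript:, ...
    }
    $p = parse_url($base);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null;
    }
    $root = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if (strpos($href, '//') === 0) {
        return $p['scheme'] . ':' . $href;          // protocol-relative
    }
    if ($href[0] === '/') {
        return $root . $href;                       // root-relative
    }
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;                    // relative to current directory
}

/**
 * Breadth-first crawl limited to the start URL's host.
 * Returns [url => depth]. Every discovered URL is requested; links are
 * followed only from pages shallower than $maxDepth.
 * robots.txt and rel="nofollow" are deliberately never checked.
 */
function crawl(string $startUrl, int $maxDepth, ?callable $fetch = null): array
{
    // Default fetcher; the user agent string is an arbitrary placeholder.
    $fetch = $fetch ?? function (string $url): ?string {
        $ctx  = stream_context_create(['http' => ['user_agent' => 'TestSpider/1.0', 'timeout' => 10]]);
        $body = @file_get_contents($url, false, $ctx);
        return $body === false ? null : $body;
    };

    libxml_use_internal_errors(true);               // silence warnings from sloppy HTML
    $host  = parse_url($startUrl, PHP_URL_HOST);
    $seen  = [$startUrl => 0];
    $queue = [[$startUrl, 0]];

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        $html = $fetch($url);
        if ($html === null || $html === '' || $depth >= $maxDepth) {
            continue;
        }
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        foreach ($doc->getElementsByTagName('a') as $a) {
            $next = resolve_url($url, $a->getAttribute('href'));
            if ($next === null || isset($seen[$next])) {
                continue;
            }
            if (parse_url($next, PHP_URL_HOST) !== $host) {
                continue;                           // stay on the same site
            }
            $seen[$next] = $depth + 1;
            $queue[]     = [$next, $depth + 1];
        }
    }
    return $seen;
}
```

To get the simple text file, something like file_put_contents('crawl.txt', implode(PHP_EOL, array_keys(crawl('http://www.thesite.com/', 3))) . PHP_EOL); should work, with crawl.txt being an arbitrary file name. The optional third argument lets a stub fetcher be passed in, which makes it possible to try the link-following logic without touching the network.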

See also: http://davidwalsh.name/php-notifications

How can I spider my own website
