Question

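I am trying to build a crawler using the Simple HTML DOM parser. Everything works fine, but when I check the website's statistics, the crawler shows up as the following:

Unknown robot (identified by empty user agent string)
Unknown robot (identified by 'bot' followed by a space or one of the following characters _+:,.;/-)

I just want it to identify itself as a proper crawler, with a name and a link back to the crawler's page.

What am I missing here? Please see the code below.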

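<?php
include 'config.php';          // presumably defines the mysqli connection $conn
include 'simple_html_dom.php';
set_time_limit(9000);

// The user agent is an HTTP context option, so it belongs under the 'http' key.
$context = stream_context_create(array(
    'http' => array('user_agent' => 'Mozilla/5.0 (compatible; My-bot/1.0; +https://mydomain.tld/bot)'),
));
$html = file_get_html('https://www.google.com/', false, $context);
$title = array();

foreach ($html->find('a') as $link) {
    $linkHref = $link->href;
    // Pass the same context here, otherwise this request is sent with an empty user agent.
    $linkHtml = file_get_html('http://example.com' . $linkHref, false, $context);
    if (!$linkHtml) {
        continue; // skip pages that could not be fetched or parsed
    }

    foreach ($linkHtml->find('title') as $title2) {
        $title2 = $title2->plaintext;
        $title[] = $conn->real_escape_string(trim($title2));
        echo $title2 . '<br>';
    }
}

?>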
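
For reference, PHP's HTTP stream wrapper reads the user agent from the `http` context options (the `user_agent` key) rather than from context params. A minimal sketch of building such a context, assuming a hypothetical helper name `buildCrawlerContext`, the placeholder bot name and URL from the question, and an arbitrary 30-second timeout:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): builds a stream context
// whose HTTP requests carry an identifying User-Agent header.
function buildCrawlerContext(string $botName, string $version, string $infoUrl)
{
    // Common crawler convention: "Mozilla/5.0 (compatible; Name/Version; +URL)".
    $userAgent = sprintf('Mozilla/5.0 (compatible; %s/%s; +%s)', $botName, $version, $infoUrl);

    // The 'http' options apply to both http:// and https:// URLs.
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,
            'timeout'    => 30, // seconds; an assumed value
        ),
    ));
}

$context = buildCrawlerContext('My-bot', '1.0', 'https://mydomain.tld/bot');

// Inspect what will be sent, without making any network request.
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], PHP_EOL;
// prints: Mozilla/5.0 (compatible; My-bot/1.0; +https://mydomain.tld/bot)
```

The resulting `$context` would then be passed as the third argument to every `file_get_html()` call. Any call made without it falls back to PHP's `user_agent` ini setting, which is empty by default.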

How to make a proper crawler using Simple HTML DOM

0 answers: