Question

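I have a PHP scraper script that I use to scrape pages on my own website. The script then parses the content as HTML and outputs it to the user. I also have a user-agent setting in PHP for pretending to be a crawler, for example GoogleBot. How can I combine the two scripts so that the page I am scraping thinks I am a crawler?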

My scraper PHP code is:

<?php
$query = $_REQUEST['q'];

$html = file_get_contents("search.php?q=$query");
preg_match_all(
    '/<div class="cl1 cld">.*?<a rel="nofollow" class="l le" href="(.*?)">(.*?)<\/a>.*?<div class="cra">(.*?)<\/div>.*?<div class="clud">(.*?)<\/div>.*?<\/div>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];
    $description = $post[3];
    $url = $post[4];

    echo "<div class='result'><div class='title'><a href='$link'>$title</a></div>$description<div class='url'>$url</div></div>";
}

?>

I have this line of code to pretend to be a crawler:

$userAgent = 'MyScraperBot (http://www.mysite.com/)';

Answer 1

If you want to keep using file_get_contents, you can set the user agent that PHP's internal HTTP fopen wrapper sends with the following:

 ini_set("user_agent", 'MyScraperBot (http://www.mysite.com/)');
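
As a rough sketch of an alternative to the global ini_set call, the same header can be set per request by passing a stream context to file_get_contents. The helper name makeUserAgentContext, the 10-second timeout, and the URL in the usage comment are assumptions made for illustration; they are not part of the original question. Note that file_get_contents only goes through the HTTP wrapper, and so only sends a user agent, when it is given a full http:// or https:// URL.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests
// send the given User-Agent header.
function makeUserAgentContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent header
            'timeout'    => 10,         // seconds; assumed value
        ],
    ]);
}

$userAgent = 'MyScraperBot (http://www.mysite.com/)';
$context   = makeUserAgentContext($userAgent);

// Usage in the scraper (the URL is an assumed placeholder):
// $html = file_get_contents(
//     'http://www.mysite.com/search.php?q=' . urlencode($query),
//     false,
//     $context
// );
```

The context only affects calls that receive it, so other file_get_contents calls in the same script keep the default user agent.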

Answer 2

You need to use cURL's curl_setopt:

// spoofing FireFox 2.0
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";

$ch = curl_init();

// set user agent
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
// set the rest of your cURL options here
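
To show how the remaining options might be filled in, here is a minimal sketch of a complete fetch. The function name fetchWithUserAgent, the timeout value, and the decision to follow redirects are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: fetches a URL with cURL while sending a custom
// User-Agent header. Returns the response body as a string, or false
// if the request fails.
function fetchWithUserAgent(string $url, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);  // set user agent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects (assumed wanted)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // seconds; assumed value
    return curl_exec($ch);
}
```

In the scraper from the question, the returned string would take the place of the file_get_contents result, for example `$html = fetchWithUserAgent($url, 'MyScraperBot (http://www.mysite.com/)');`, where `$url` is the full address of the search page.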

Useragent in a PHP scraper script
