I can scrape most websites with the code below, but some sites redirect me to distil_r_blocked.html.
These are the headers I get:
HTTP/1.1 200 OK
Date: Mon, 26 Jun 2017 20:30:12 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Cache-Control: no-cache
Cache-Control: private, no-cache, no-store, must-revalidate
Edge-Control: no-store, bypass-cache
Surrogate-Control: no-store, bypass-cache
Here is my code:
function file_get_contents_curl($target_url, $json = false) {
    $ch = curl_init();
    $headers = array();
    if ($json) {
        $headers[] = 'Content-type: application/json';
        $headers[] = 'X-HTTP-Method-Override: GET';
    }
    $options = array(
        CURLOPT_URL => $target_url,
        // Pass the header list itself; wrapping it in another array() nests it
        CURLOPT_HTTPHEADER => $headers,
        CURLOPT_FOLLOWLOCATION => 1, // follow redirects (was listed twice)
        CURLOPT_MAXREDIRS => 3,      // but give up after 3 hops
        CURLOPT_AUTOREFERER => 1,
        CURLOPT_RETURNTRANSFER => 1, // return the response instead of printing it
        CURLOPT_HEADER => 1,         // include response headers in the returned string
        CURLOPT_TIMEOUT => 10,       // was set twice (300, then 10); the last value wins
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9',
    );
    curl_setopt_array($ch, $options);
    $response = curl_exec($ch);
    if ($response === false || curl_error($ch)) {
        curl_close($ch);
        return false;
    } else {
        curl_close($ch);
        return $response;
    }
}
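// Quick check that the URL can be fetched at all
// ($target_url must already be set at this point)
$ch = curl_init($target_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
if (curl_exec($ch) === false) {
    echo 'Curl error: ' . curl_error($ch);
} else {
    echo 'Operation completed without any errors';
}
curl_close($ch);

// $data holds the response headers followed by the body, because CURLOPT_HEADER is enabled
$data = file_get_contents_curl($target_url);
// str_get_html() comes from the Simple HTML DOM library
$html = str_get_html($data);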
Is there another redirect going on?
Thanks, Simon
Answer 0 (score: 1)
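Your cURL option CURLOPT_FOLLOWLOCATION is set to TRUE, which means cURL will follow redirects. Set it to 0 and it will not follow them. You also set this option twice, in case that was not intended.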
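To see where the server is trying to send you, something like the following might help. It is only a minimal sketch: inspect_redirect() is a hypothetical helper name (not part of your code), and it assumes a PHP build where CURLINFO_REDIRECT_URL is available (5.3.7 or later). It turns redirect following off and reports the status code and the redirect target of the first response.

```php
<?php
// Hypothetical helper (not from the original code): request a URL without
// following redirects and report what the first response says.
function inspect_redirect($target_url) {
    $ch = curl_init($target_url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,  // capture the response instead of printing it
        CURLOPT_FOLLOWLOCATION => false, // stop at the first response
        CURLOPT_TIMEOUT        => 10,
    ));
    if (curl_exec($ch) === false) {
        return null; // transport-level failure (DNS, timeout, ...)
    }
    return array(
        // HTTP status code of the first response, e.g. 200 or 302
        'status'   => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        // URL the server wants to redirect to; falsy when there is no redirect
        'location' => curl_getinfo($ch, CURLINFO_REDIRECT_URL),
    );
}
```

Calling inspect_redirect($target_url) and dumping the result with var_dump() should show whether the first hop is already a redirect to the block page.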
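As for retrieving the original content, you won't be able to control that, because the server decides what response to send. At best, you could try spoofing headers or using a different IP, but that is generally frowned upon... mainly because it is sketchy behavior (in my opinion).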
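What you can do on your side is notice when you have been sent to the block page and stop there. A rough sketch, assuming the final URL contains the distil_r_blocked.html name from your question; looks_blocked() and fetch_unless_blocked() are hypothetical names used only for illustration:

```php
<?php
// Hypothetical helper: true when a URL looks like the block page
// named in the question.
function looks_blocked($effective_url) {
    return strpos($effective_url, 'distil_r_blocked.html') !== false;
}

// Hypothetical wrapper: fetch a page, following redirects, and return the
// body, or false when the request fails or ends on the block page.
function fetch_unless_blocked($target_url) {
    $ch = curl_init($target_url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 3,
        CURLOPT_TIMEOUT        => 10,
    ));
    $body = curl_exec($ch);
    // URL of the last request cURL made, after any redirects
    $final_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    if ($body === false || looks_blocked($final_url)) {
        return false;
    }
    return $body;
}
```

That way the block page never reaches str_get_html(), and you can log or skip the sites that refuse the request.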