我是使用PHP进行Web抓取的新手,但不是PHP本身。我的问题不是正则表达式相关,但似乎直接与booking.com网站有关。我想在特定城市刮取酒店的价格。为此,我在预订页面中复制了浏览器中的URL,并将其粘贴到我的代码中。
This是页面。
这是我的代码:
<?php
function getHTML($url,$timeout)
{
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
return @curl_exec($ch);
}
$html=getHTML("http://www.booking.com/searchresults.en.html?dcid=1;checkin_monthday=25;checkin_year_month=2014-7;checkout_monthday=26;checkout_year_month=2014-7;city=-1461464;class_interval=1;csflt=%7B%7D;interval_of_time=undef;no_rooms=1;or_radius=0;property_room_info=1;review_score_group=empty;score_min=0;src=city;ssb=empty;;nflt=ht_id%3D204%3Bclass%3D3%3B;unchecked_filter=class",10);
echo $html;
?>
我确实打印了预订页面但是它没有考虑到URL中的参数,因为在页面上我得到它要求预订日期&amp;城市...
我尝试在多个浏览器中粘贴此网址并隐身窗口(以查看该网址是否与Cookie或其他内容相关联),并且工作正常。也许我在cURL请求中错过了一个参数......
答案 0 :(得分:5)
我终于解决了这个问题。我只需要将浏览器的标题发送到网站。 这是代码:
function getHTML($url,$timeout)
{
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
return @curl_exec($ch);
}
$url = "booking url...";
echo getHTML($url,10);
解决方案很简单......