为特定的网页检索定义特定的PHP curl选项

时间:2015-06-19 14:53:48

标签: php curl

我正在尝试抓取此网页:SiriusXMU以获取“正在播放”的信息。这是我到目前为止的代码:

    $timeout = 60;
    $url = 'http://www.siriusxm.com/siriusxmu';
    $agent= 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0';
    $referer = 'http://www.siriusxm.com/channellineup/';

    $header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    //$header[] = "Keep-Alive: 300";
    //$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-US,en;q=0.5";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);//The URL to fetch. This can also be set when initializing a session with curl_init().
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);//The contents of the "User-Agent: " header to be used in a HTTP request.
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);//An array of HTTP header fields to set, in the format array('Content-type: text/plain', 'Content-length: 100')
    curl_setopt($ch, CURLOPT_HEADER, true);//TRUE to include the header in the output.
    curl_setopt($ch, CURLOPT_REFERER, $referer);//The contents of the "Referer: " header to be used in a HTTP request.
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');//The contents of the "Accept-Encoding: " header. This enables decoding of the response. Supported encodings are "identity", "deflate", and "gzip". If an empty string, "", is set, a header containing all supported encoding types is sent.
    //curl_setopt($ch, CURLOPT_AUTOREFERER, true);//TRUE to automatically set the Referer: field in requests where it follows a Location: redirect.
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);//TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);//The maximum number of seconds to allow cURL functions to execute.
    //curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);//FALSE to stop cURL from verifying the peer's certificate. Alternate certificates to verify against can be specified with the CURLOPT_CAINFO option or a certificate directory can be specified with the CURLOPT_CAPATH option.
    //curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);1 to check the existence of a common name in the SSL peer certificate. 2 to check the existence of a common name and also verify that it matches the hostname provided. In production environments the value of this option should be kept at 2 (default value).
    //curl_setopt($ch, CURLOPT_VERBOSE, true);//TRUE to output verbose information. Writes output to STDERR, or the file specified using CURLOPT_STDERR.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);//if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure. 
    //      
    $result = curl_exec($ch);//Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure.
    curl_close($ch);

我一直在研究我的浏览器发送的HTTP标头,它成功地启用了网页的“播出”部分,该部分显示了正在播放的内容。但是,当我使用curl模拟这些标题时,网页的“One the Air”部分会返回“抱歉,所选平台无法提供程序信息。”

Firefox AddOn HttpFox显示主页的以下内容:

00:00:03.904    0.163   1524    209 GET 200 text/html   http://www.siriusxm.com/siriusxmu

(Request-Line)  GET /siriusxmu HTTP/1.1
Host    www.siriusxm.com
User-Agent  Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
Referer http://www.siriusxm.com/channellineup/
Cookie  mmcore.tst=0.557; mmid=-318486443%7CBQAAAAo2JYEzEgwAAA%3D%3D; mmcore.pd=111492824%7CBQAAAAoBQjYlgTMSDPt9EvUCAJ3zFneyeNJIDwAAAIQ4RsgceNJIAAAAAP//////////ABB3d3cuc2lyaXVzeG0uY29tAhIMAgAAAAAAAAAAAAD///////////////8AAAAAAAFF; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-Repeat; s_vnum=1435723200051%26vn%3D2; s_lastvisit=1434723660883; s_vi=[CS]v1|2AC1956485078C76-6000010E20030C67[CE]; mm_pc=%7B%22vehiclenewness%22%3A%22new%22%2C%22PC2%22%3A%22%22%7D; sxm_platform=xm; __utmv=1.|5=serviceType=xm=1; _hjUserId=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=aHR0cDovL3d3dy5zaXJpdXN4bS5jb20vc3RyZWFtaW5n; __insp_norec_sess=true; _hjIncludedInSample=1; __utmc=1; s_cc=true; SC_LINKS=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; s_sv_sid=797366592635; QSI_HistorySession=http%3A%2F%2Fwww.siriusxm.com%2Fstreaming~1434659533837%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fchannellineup%2F%23~1434659556190%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665
Connection  keep-alive

以及为“One the Air”部分请求javascript时的以下内容:

00:00:05.293    1.186   1609    (137)   GET 304 text/javascript http://www.siriusxm.com/static/app/js/sxm-channel-ontheair.js

(Request-Line)  GET /static/app/js/sxm-channel-ontheair.js HTTP/1.1
Host    www.siriusxm.com
User-Agent  Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
Accept  */*
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
Referer http://www.siriusxm.com/siriusxmu
Cookie  mmcore.tst=0.557; mmid=-318486443%7CBQAAAAo2JYEzEgwAAA%3D%3D; mmcore.pd=111492824%7CBQAAAAoBQjYlgTMSDPt9EvUCAJ3zFneyeNJIDwAAAIQ4RsgceNJIAAAAAP//////////ABB3d3cuc2lyaXVzeG0uY29tAhIMAgAAAAAAAAAAAAD///////////////8AAAAAAAFF; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-Repeat; s_vnum=1435723200051%26vn%3D2; s_lastvisit=1434723660883; s_vi=[CS]v1|2AC1956485078C76-6000010E20030C67[CE]; mm_pc=%7B%22vehiclenewness%22%3A%22new%22%2C%22PC2%22%3A%22%22%7D; sxm_platform=xm; __utmv=1.|5=serviceType=xm=1; _hjUserId=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=aHR0cDovL3d3dy5zaXJpdXN4bS5jb20vc3RyZWFtaW5n; __insp_norec_sess=true; _hjIncludedInSample=1; __utmc=1; s_cc=true; SC_LINKS=%5B%5BB%5D%5D; s_sq=%5B%5BB%5D%5D; s_sv_sid=797366592635; QSI_HistorySession=http%3A%2F%2Fwww.siriusxm.com%2Fstreaming~1434659533837%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fchannellineup%2F%23~1434659556190%7Chttp%3A%2F%2Fwww.siriusxm.com%2Fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665
Connection  keep-alive
If-Modified-Since   Fri, 22 May 2015 02:06:57 GMT
If-None-Match   "ab841364-8501-516a21d70499b"
Cache-Control   max-age=0

网络服务器正在确定关于我的curl请求无效的内容,并且没有启用“On the Air”javascript内容,只是说“抱歉,所选平台无法提供程序信息。”

如何让curl正常工作并模拟我的浏览器,从而从此Web服务器返回有效的网页结果?

1 个答案:

答案 0 :(得分:2)

您似乎需要运行具有JavaScript解释器的客户端。

HTML包括以下内容:

<div id="on-the-air-unavailable"><p>Sorry, program information is not available for the selected platform.</p></div>

JS包括以下内容(不在一起):

$("#on-the-air-unavailable").hide();
$("#on-the-air-unavailable").show();

要让JavaScript与HTML交互,您需要将它们一起运行。

有些无头HTTP客户端可以使用JS解释器或浏览器自动化工具,例如Selenium