I'm trying to scrape data with PHP, but the URL I need to access requires POST data.
<?php
//set POST variables
$url = 'https://www.ncaa.org/';
//$url = 'https://web3.ncaa.org/hsportal/exec/hsAction?hsActionSubmit=searchHighSchool';
// This is the data to POST to the form. The KEY of the array is the name of the field. The value is the value posted.
$data_to_post = array();
$data_to_post['hsCode'] = '332680';
$data_to_post['state'] = '';
$data_to_post['city'] = '';
$data_to_post['name'] = '';
$data_to_post['hsActionSubmit'] = 'Search';
// Initialize cURL
$curl = curl_init();
// Set the options
curl_setopt($curl,CURLOPT_URL, $url);
// This sets the number of fields to post
curl_setopt($curl,CURLOPT_POST, sizeof($data_to_post));
// This is the fields to post in the form of an array.
curl_setopt($curl,CURLOPT_POSTFIELDS, $data_to_post);
//execute the post
$result = curl_exec($curl);
//close the connection
curl_close($curl);
?>
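When I try to access the second $url, which hosts the actual information, it returns "Failed to load response data", but it does let me access the NCAA home page. Is there a reason I can't load the response data even though I'm sending the correct form data?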
Answer 0: (score: 1)
The site apparently checks for a recognized user agent. By default, PHP curl does not send a User-Agent header. Add
curl_setopt($curl, CURLOPT_USERAGENT, 'curl/7.21.4');
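and the script returns a response. However, in this case the response says that it requires a newer browser than the one you have. So you should copy the user agent string from a real browser, e.g.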
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36');
In addition, it requires the parameters to be sent in application/x-www-form-urlencoded format. When you pass an array as the argument to CURLOPT_POSTFIELDS, it uses multipart/form-data. So change that line to:
curl_setopt($curl,CURLOPT_POSTFIELDS, http_build_query($data_to_post));
to convert the array to a URL-encoded string.
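To make the difference concrete, here is a minimal sketch of what http_build_query() produces for the form fields from the question. It only builds and prints the request body; no request is sent.

```php
<?php
// The same fields as in the question's script.
$data_to_post = [
    'hsCode'         => '332680',
    'state'          => '',
    'city'           => '',
    'name'           => '',
    'hsActionSubmit' => 'Search',
];

// http_build_query() turns the array into a single URL-encoded string.
// Passing this string (instead of the array) to CURLOPT_POSTFIELDS makes
// cURL send the body as application/x-www-form-urlencoded.
$body = http_build_query($data_to_post);

echo $body, PHP_EOL;
```

Empty strings are kept as empty fields (for example `state=`), so the server still sees every form field.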
In the URL, leave out ?hsActionSubmit=searchHighSchool, because that parameter is sent in the POST fields.
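If you would rather strip the query string in code than edit the URL by hand, one possible sketch is below. The variable name $endpoint is just an illustration.

```php
<?php
$url = 'https://web3.ncaa.org/hsportal/exec/hsAction?hsActionSubmit=searchHighSchool';

// Keep only the part before the first "?"; if the URL has no query
// string, explode() returns the whole URL as the first element.
$endpoint = explode('?', $url, 2)[0];

echo $endpoint, PHP_EOL;
```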
The final working script looks like this:
<?php
//set POST variables
//$url = 'https://www.ncaa.org/';
$url = 'https://web3.ncaa.org/hsportal/exec/hsAction';
// This is the data to POST to the form. The KEY of the array is the name of the field. The value is the value posted.
$data_to_post = array();
$data_to_post['hsCode'] = '332680';
$data_to_post['state'] = '';
$data_to_post['city'] = '';
$data_to_post['name'] = '';
$data_to_post['hsActionSubmit'] = 'Search';
// Initialize cURL
$curl = curl_init();
// Set the options
curl_setopt($curl,CURLOPT_URL, $url);
// Send the request as an HTTP POST (this option is a boolean, not a field count)
curl_setopt($curl, CURLOPT_POST, true);
// URL-encode the fields so they are sent as application/x-www-form-urlencoded
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($data_to_post));
// Identify as a real browser
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36');
//execute the post
$result = curl_exec($curl);
//close the connection
curl_close($curl);
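The script above prints the response directly and gives no hint when the request fails. As a rough sketch, the same options can be collected in one place, with CURLOPT_RETURNTRANSFER so the body comes back as a string and curl_error() so failures are reported. The helper names build_post_options() and post_form() are hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper: gathers the options discussed above into one
// array suitable for curl_setopt_array().
function build_post_options(string $url, array $fields, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,                       // send as POST
        CURLOPT_POSTFIELDS     => http_build_query($fields),  // urlencoded body
        CURLOPT_USERAGENT      => $userAgent,                 // browser-like UA
        CURLOPT_RETURNTRANSFER => true,                       // return body instead of printing it
    ];
}

// Hypothetical helper: performs the request and throws with cURL's own
// error message if the transfer fails.
function post_form(string $url, array $fields, string $userAgent): string
{
    $curl = curl_init();
    curl_setopt_array($curl, build_post_options($url, $fields, $userAgent));

    $result = curl_exec($curl);
    if ($result === false) {
        $message = curl_error($curl);
        curl_close($curl);
        throw new RuntimeException('cURL request failed: ' . $message);
    }

    curl_close($curl);
    return $result;
}
```

Calling post_form($url, $data_to_post, $userAgent) with the values from the script above would then return the page body as a string, or throw an exception that says why the transfer failed.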
Answer 1: (score: 0)
curl HTTPS connections need a specific option turned off: CURLOPT_SSL_VERIFYPEER
// Initialize cURL
$curl = curl_init();
// Set the options
curl_setopt($curl,CURLOPT_URL, $url);
// ** This option MUST BE FALSE **
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// Send the request as an HTTP POST
curl_setopt($curl, CURLOPT_POST, true);
// This is the fields to post in the form of an array.
curl_setopt($curl,CURLOPT_POSTFIELDS, $data_to_post);
//execute the post
$result = curl_exec($curl);
//close the connection
curl_close($curl);
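Turning CURLOPT_SSL_VERIFYPEER off means cURL no longer checks the server's certificate. If the failure really is a certificate problem, a possible alternative is to leave verification on and point cURL at a CA bundle. The sketch below shows both variants side by side. The function name ssl_options() and the bundle path are assumptions for illustration only.

```php
<?php
// Hypothetical helper: returns the SSL-related cURL options.
// Pass null to reproduce this answer's approach, or a path to a CA
// bundle file (an assumed path, e.g. a downloaded cacert.pem) to keep
// certificate verification enabled.
function ssl_options(?string $caBundlePath): array
{
    if ($caBundlePath === null) {
        // This answer's approach: skip certificate verification.
        return [CURLOPT_SSL_VERIFYPEER => false];
    }

    return [
        CURLOPT_SSL_VERIFYPEER => true,          // verify the certificate chain
        CURLOPT_SSL_VERIFYHOST => 2,             // verify the host name matches
        CURLOPT_CAINFO         => $caBundlePath, // CA bundle used for verification
    ];
}
```

The returned array can be applied with curl_setopt_array($curl, ssl_options(...)) before calling curl_exec().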