function curl_get($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$data = curl_exec($ch);
print_r(curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD));
curl_close($ch);
return $data;
}
I am trying to match a string in the page "wikipedia.sfstate.us/Scarves". I use the function above to fetch the content:
$url = "http://wikipedia.sfstate.us/Scarves";
$html = curl_get($url);
var_dump($html);
The result is:
812 //CURLINFO_SIZE_DOWNLOAD
string(812) "..." //$html string where the content is stored
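However, the whole file is 64612 bytes (according to web-sniffer.net), and 64612 = 1024 * 63 + 812. In other words, I only get the last 812 bytes of the file.
Why does this happen? Any ideas on how to get the whole content? Thanks.
P.S.: I also tried ... as shown below, but it did not help: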
if(strlen($html) < 1024){
$html = '';
$i = 0;
while($content = file_get_contents($url, FILE_TEXT, NULL, $i, $i + 1023)){
$html .= $content;
$i += 1023;
}
}
Answer 0 (score: 0)
The page you are trying to fetch has user-agent-based protection. Add a suitable user agent to your request and it works:
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.79 Safari/537.1");
Of course, if they have this kind of protection, it is probably because they do not want you scraping their content.
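A minimal sketch of that suggestion, with a sanity check on the size of what came back. The names `fetch_page` and `response_is_complete` are hypothetical, and the expected byte count is an assumption you supply yourself (for example, the 64612 bytes reported in the question). The check relies on the `http_code` and `size_download` keys of the array returned by `curl_getinfo()`.

```php
<?php
// Hypothetical helper: fetch a URL while sending a browser-like user agent.
// Returns ['body' => string, 'info' => array], or null on a transport error.
function fetch_page(string $url, string $userAgent): ?array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => $userAgent,
    ]);
    $body = curl_exec($ch);
    $info = curl_getinfo($ch); // associative array of transfer details
    curl_close($ch);

    return $body === false ? null : ['body' => $body, 'info' => $info];
}

// Hypothetical check: true when the status is 200 and at least
// $expectedBytes were downloaded. $expectedBytes is an assumption
// provided by the caller, not something cURL knows.
function response_is_complete(array $info, int $expectedBytes): bool
{
    $status   = (int) ($info['http_code'] ?? 0);
    $received = (int) ($info['size_download'] ?? 0);

    return $status === 200 && $received >= $expectedBytes;
}
```

Calling `fetch_page()` once with the default settings and once with a browser user agent, then passing each `info` array to `response_is_complete()`, should show whether the user agent is what changes the response.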
Answer 1 (score: 0)
Try this. It is the code I tested, and it works fine:
<?php
function curl_get($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.79 Safari/537.1");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$data = curl_exec($ch);
print_r(curl_getinfo($ch, CURLINFO_SIZE_DOWNLOAD));
curl_close($ch);
return $data;
}
$url = "http://wikipedia.sfstate.us/Scarves";
$html = curl_get($url);
var_dump($html);
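Note that with `CURLOPT_HEADER` enabled, the string returned by `curl_exec()` starts with the response headers, so the `var_dump` output above includes them. A minimal sketch of separating the two, assuming you read the header length with `curl_getinfo($ch, CURLINFO_HEADER_SIZE)` before closing the handle; `split_response` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: split a raw cURL response (fetched with
// CURLOPT_HEADER enabled) into its header block and its body.
// $headerSize is the value of curl_getinfo($ch, CURLINFO_HEADER_SIZE).
function split_response(string $raw, int $headerSize): array
{
    return [
        'headers' => substr($raw, 0, $headerSize),
        'body'    => substr($raw, $headerSize),
    ];
}

// Usage with a hand-written response instead of a live request:
$headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$raw     = $headers . "<html>Scarves</html>";
$parts   = split_response($raw, strlen($headers));
echo $parts['body'], "\n";
```

If only the body is needed for string matching, leaving `CURLOPT_HEADER` off avoids the split altogether.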
You can also try this other example, which writes the response to a file:
$ch = curl_init("http://wikipedia.sfstate.us/Scarves");
$fp = fopen("example_htmlpage.html", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
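The file-based example above sends no user agent and does not check whether `fopen()` or `curl_exec()` succeeded. A minimal sketch that adds both, on the assumption that the user-agent check described in the first answer is what affects the default request; `download_to_file` is a hypothetical name.

```php
<?php
// Hypothetical helper: stream a URL straight into a local file.
// Returns true on success, false if the file cannot be opened
// or the transfer fails.
function download_to_file(string $url, string $path, string $userAgent): bool
{
    // Open the target first so a bad path fails before any request is made.
    $fp = @fopen($path, 'w');
    if ($fp === false) {
        return false;
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_FILE      => $fp,         // write the body to the file handle
        CURLOPT_HEADER    => false,       // keep response headers out of the file
        CURLOPT_USERAGENT => $userAgent,
    ]);
    $ok = curl_exec($ch); // true or false when CURLOPT_FILE is set
    curl_close($ch);
    fclose($fp);

    return $ok === true;
}
```

Checking the return value, and then `filesize()` on the written file, gives a quick way to compare the result against the size you expect.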