Get the content of a Google Scholar page with PHP

Time: 2015-12-03 16:02:09

Tags: php web-scraping

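This is my journal page on Google Scholar: https://scholar.google.com/citations?user=F4z6guYAAAAJ

I can view the page in a browser, but I cannot fetch its content with PHP (cURL or file_get_contents).

I have tried many different request headers, but none of them worked.

Update: here is my code: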

// HTTP context for file_get_contents: request headers, timeout and User-Agent
$fgc_context = stream_context_create(array(
  'http' => array(
    'method'     => "GET",
    'header'     => "Accept: text/html,application/xhtml+xml,application/xml\r\n" .
                    "Accept-Charset: ISO-8859-1,utf-8\r\n" .
                    // Note: file_get_contents does not decompress gzip/deflate responses itself
                    "Accept-Encoding: gzip,deflate,sdch\r\n" .
                    "Accept-Language: en-US,en;q=0.8\r\n",
    'timeout'    => 60,
    'user_agent' => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9"
  )
));

ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
$wcnt = @file_get_contents($the_journal_url, false, $fgc_context);
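
One way to see what the server actually sent is to stop suppressing errors with `@` and inspect the response headers. The following is a minimal sketch for diagnosis, not a confirmed fix for the Google Scholar error. The helper names `last_http_status` and `decode_http_body` are hypothetical, and the code assumes the zlib extension is available. It relies on two documented PHP features: the `ignore_errors` HTTP context option, which makes `file_get_contents` return the body even on a 4xx/5xx status, and `$http_response_header`, which PHP fills in after a request through the HTTP wrapper.

```php
<?php
// Hypothetical helper: return the status code of the final response.
// $http_response_header holds one status line per hop when redirects are followed,
// so the last matching line wins.
function last_http_status(array $headers)
{
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: undo gzip compression if the final response used it.
function decode_http_body($body, array $headers)
{
    $gzip = false;
    foreach ($headers as $line) {
        if (strpos($line, 'HTTP/') === 0) {
            $gzip = false; // a new status line starts the headers of the next hop
        } elseif (preg_match('/^Content-Encoding:\s*gzip/i', $line)) {
            $gzip = true;
        }
    }
    if (!$gzip) {
        return $body;
    }
    $decoded = gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Usage sketch (performs a network request, so it is left commented out):
// $context = stream_context_create(array('http' => array(
//     'ignore_errors' => true,               // return the body even on an error status
//     'header'        => "Accept-Encoding: gzip\r\n",
// )));
// $raw = file_get_contents($the_journal_url, false, $context);
// echo last_http_status($http_response_header), "\n";
// echo decode_http_body($raw, $http_response_header);
```

If the status printed this way is not 200, the body is most likely an error page rather than the profile, which would be consistent with the "Server Error" text quoted in the question.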

At the end of the returned page, Google sends:

<H1>Server Error</H1> We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue.<p>Please try again later.</p>

1 answer:

Answer 0: (score: 0)

Try this code (run it twice, so that the first run creates the cookie):

// File used to store cookies between runs
$cookie = __DIR__ . '/cookie.txt';

$ch = curl_init();
// Disables TLS certificate verification (insecure; kept as in the original answer)
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, 'https://scholar.google.com/citations?user=F4z6guYAAAAJ');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0');
// Write received cookies to the file and send them back on later requests
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
// Return the response as a string and follow redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);

echo $data;
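
The snippet above echoes whatever `curl_exec` returns, which is `false` when the transfer fails, so a failure prints nothing. Below is a hedged sketch of the same request wrapped in a function that also reports the HTTP status and the cURL error message. The function name `fetch_page`, its parameters, and its return shape are hypothetical and not part of the original answer. Unlike the answer, it leaves certificate verification at its default (enabled) and adds timeouts plus `CURLOPT_ENCODING`, which are my assumptions about what is useful here.

```php
<?php
// Hypothetical wrapper around the cURL calls from the answer.
// Returns array('body' => string|null, 'status' => int, 'error' => string).
function fetch_page($url, $cookieFile, $timeout = 30)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0',
        CURLOPT_COOKIEJAR      => $cookieFile,  // cookies are written here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile,  // cookies are read from here on later runs
        CURLOPT_RETURNTRANSFER => true,         // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,         // follow HTTP redirects
        CURLOPT_ENCODING       => '',           // accept and decode every encoding cURL supports
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => $timeout,
    ));

    $body   = curl_exec($ch);                                   // false on failure
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);      // 0 if no response arrived
    $error  = curl_error($ch);                                  // empty string on success
    curl_close($ch);

    return array(
        'body'   => $body === false ? null : $body,
        'status' => $status,
        'error'  => $error,
    );
}
```

A call such as `fetch_page('https://scholar.google.com/citations?user=F4z6guYAAAAJ', __DIR__ . '/cookie.txt')` can then be checked before echoing: a `null` body means the transfer itself failed (see `error`), while a non-200 `status` means the server answered with something other than the requested page.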