Using cURL to fetch HTML the way an iframe does?

Time: 2017-06-01 14:33:25

Tags: php curl

I have some HTML that displays related topics from Google Trends:

<iframe id="trends-widget-1" src='https://trends.google.com/trends/embed/explore/RELATED_TOPICS?req={"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}&amp;tz=180&amp;eq=geo=BR&q=stack' width="100%" frameborder="0" scrolling="0" style="border-radius: 2px; box-shadow: rgba(0, 0, 0, 0.12) 0px 0px 2px 0px, rgba(0, 0, 0, 0.24) 0px 2px 2px 0px; height: 384px;"></iframe>

Now I want to find a way to save this HTML (for future use...). To do that, I tried using cURL:

$url = 'https://trends.google.com/trends/embed/explore/RELATED_TOPICS?req={"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}&amp;tz=180&amp;eq=geo=BR&q=stack';

        $ch = curl_init();
        $source = $url;
        curl_setopt($ch, CURLOPT_URL, $source);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1000);
        curl_setopt($ch, CURLOPT_TIMEOUT, 100);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
        $html = curl_exec($ch);
        curl_close($ch);
        echo $html;

The problem? cURL returns a Google page with the following message:

  That's an error. Your client has issued a malformed or illegal request. That's all we know.

How can I avoid this problem and fetch the HTML?

1 answer:

Answer 0 (score: 0):

The query string part of the URL is a mix of HTML entities and text that is not URL-encoded.

I think this is done to make it harder for scrapers to decode the URL correctly.

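In any case, the browser is able to interpret the query string correctly: it first decodes the HTML entities, then identifies each query parameter and its value.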

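The algorithm browsers use to perform this decoding is not trivial, and there is no dedicated PHP function that does the job. If you are interested in the topic, I think it deserves a question of its own.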
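For simple cases, the two steps described above (decode HTML entities, then split and encode each parameter) can be roughly approximated in a few lines. This is only a sketch, not the algorithm browsers actually use: `normalize_embed_query` is a hypothetical helper name, and the code assumes that no parameter value contains a literal `&`.

```php
<?php
// Hypothetical helper (not a built-in PHP function): normalizes a query
// string copied from an HTML attribute so it can be sent with cURL.
function normalize_embed_query(string $rawQuery): string
{
    // Step 1: turn HTML entities such as &amp; back into plain characters.
    $decoded = html_entity_decode($rawQuery, ENT_QUOTES, 'UTF-8');

    $pairs = [];
    // Step 2: split into parameters.
    // Assumption: no value contains a literal "&".
    foreach (explode('&', $decoded) as $pair) {
        if ($pair === '') {
            continue;
        }
        // Split on the first "=" only, so a value like "geo=BR" stays whole.
        $parts = explode('=', $pair, 2);
        $name  = $parts[0];
        $value = $parts[1] ?? '';

        // Step 3: URL-encode the name and the value separately.
        $pairs[] = urlencode($name) . '=' . urlencode($value);
    }

    return implode('&', $pairs);
}

// The query string from the question, exactly as it appears in the iframe.
$raw = 'req={"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],'
     . '"category":0,"property":""}&amp;tz=180&amp;eq=geo=BR&q=stack';

echo normalize_embed_query($raw), "\n";
```

Note that this sketch also encodes the `=` inside the `eq` value (as `%3D`), which a server should decode back to the same value.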

For your specific case, you can fix the URL like this:

// The base URL is ok

$url = 'https://trends.google.com/trends/embed/explore/RELATED_TOPICS?';

// The `req` parameter's value must be url-encoded

$url .= 'req=' . urlencode( '{"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}' );

// The last part of the query string contains html entities, specifically &amp;
// They have to be "translated" into ampersands to let the query make sense
// (I did it manually)
//
// Note also the final part of the query string does not contain special
// characters so I skipped the URL encoding

$url .= '&tz=180&eq=geo=BR&q=stack';

You end up with this URL:

https://trends.google.com/trends/embed/explore/RELATED_TOPICS?req=%7B%22comparisonItem%22%3A%5B%7B%22keyword%22%3A%22stack%22%2C%22geo%22%3A%22BR%22%2C%22time%22%3A%22today+5-y%22%7D%5D%2C%22category%22%3A0%2C%22property%22%3A%22%22%7D&tz=180&eq=geo=BR&q=stack

which can be pasted into the browser address bar and used with cURL.
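Putting the pieces together, a request with the corrected URL might look like the sketch below. The function names `build_trends_url` and `fetch_html` and the output file argument are assumptions made for illustration. The URL is built with `http_build_query`, which URL-encodes every value (including the `=` in the `eq` value). Unlike the code in the question, this sketch leaves SSL verification at its defaults.

```php
<?php
// Hypothetical helper: builds the embed URL with every parameter URL-encoded.
function build_trends_url(): string
{
    $params = [
        'req' => '{"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}',
        'tz'  => 180,
        'eq'  => 'geo=BR',
        'q'   => 'stack',
    ];

    return 'https://trends.google.com/trends/embed/explore/RELATED_TOPICS?'
         . http_build_query($params, '', '&');
}

// Hypothetical helper: fetches a URL with cURL, throwing on any failure.
function fetch_html(string $url): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
    ]);

    $html   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($html === false) {
        throw new RuntimeException('cURL error: ' . $error);
    }
    if ($status !== 200) {
        throw new RuntimeException('Unexpected HTTP status: ' . $status);
    }

    return $html;
}

// Usage from the command line: php fetch.php trends.html
// The request is only made when an output file name is given.
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    file_put_contents($argv[1], fetch_html(build_trends_url()));
}
```

Checking the HTTP status code makes a malformed-request response show up as an exception instead of being saved as if it were the page.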

Side note:

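I am not sure how much information you will be able to extract from the page once you have its source, because it relies heavily on JavaScript and AJAX calls to render its content.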