I have an HTML snippet that displays the related topics from Google Trends:
<iframe id="trends-widget-1" src='https://trends.google.com/trends/embed/explore/RELATED_TOPICS?req={"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}&tz=180&eq=geo=BR&q=stack' width="100%" frameborder="0" scrolling="0" style="border-radius: 2px; box-shadow: rgba(0, 0, 0, 0.12) 0px 0px 2px 0px, rgba(0, 0, 0, 0.24) 0px 2px 2px 0px; height: 384px;"></iframe>
Now I want to find a way to save this HTML (for future use...). To do that, I tried using cURL:
$url = 'https://trends.google.com/trends/embed/explore/RELATED_TOPICS?req={"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}&tz=180&eq=geo=BR&q=stack';
$ch = curl_init();
$source = $url;
curl_setopt($ch, CURLOPT_URL, $source);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1000);
curl_setopt($ch, CURLOPT_TIMEOUT, 100);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
$html = curl_exec($ch);
curl_close($ch);
echo $html;
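The problem? cURL returns a Google page containing the following message:
- That's an error. Your client has issued a malformed or illegal request. That's all we know.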
How can I avoid this kind of problem and extract the HTML?
Answer 0 (score: 0)
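The query string of the source URL is a mix of HTML entities and text that is not URL-encoded.
I think this is done to make it harder for scrapers to decode the URL correctly.
In any case, a browser can interpret the query string correctly: it first decodes the HTML entities, then identifies each query parameter and its value.
The algorithm browsers use for this decoding is not trivial, and there is no dedicated PHP function that does the job. If you are interested in the topic, I think it deserves a question of its own.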
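The two steps described above (decode the HTML entities, then encode each parameter value) can be approximated for simple cases. The sketch below is not the browser algorithm; `normalizeEmbedUrl` is a hypothetical helper name, and it assumes that the input is not already percent-encoded and that no parameter value contains a literal `&`.

```php
<?php
// Hypothetical helper: repairs a URL that was copied out of an HTML attribute.
// Assumptions: the input is not percent-encoded yet, and no value contains "&".
function normalizeEmbedUrl(string $rawUrl): string
{
    // Step 1: turn entities such as &amp; back into literal characters.
    $decoded = html_entity_decode($rawUrl, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Step 2: separate the base URL from the query string.
    $parts = explode('?', $decoded, 2);
    if (count($parts) < 2) {
        return $decoded; // no query string, nothing to encode
    }
    [$base, $query] = $parts;

    // Step 3: percent-encode every name and value, splitting each pair
    // on the first "=" only so that values like "geo=BR" stay intact.
    $pairs = [];
    foreach (explode('&', $query) as $pair) {
        $kv = explode('=', $pair, 2);
        $encoded = rawurlencode($kv[0]);
        if (count($kv) === 2) {
            $encoded .= '=' . rawurlencode($kv[1]);
        }
        $pairs[] = $encoded;
    }

    return $base . '?' . implode('&', $pairs);
}
```

Note that `rawurlencode()` writes a space as `%20` and also encodes the `=` inside a value as `%3D`; both are equivalent encodings of the same parameter values.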
For your specific case, you can fix the URL like this:
// The base URL is ok
$url = 'https://trends.google.com/trends/embed/explore/RELATED_TOPICS?';
// The `req` parameter's value must be url-encoded
$url .= 'req=' . urlencode( '{"comparisonItem":[{"keyword":"stack","geo":"BR","time":"today 5-y"}],"category":0,"property":""}' );
// The last part of the query string contains html entities, specifically &amp;
// They have to be "translated" into ampersands for the query to make sense
// (I did it manually)
//
// Note also that the final part of the query string does not contain special
// characters, so I skipped the URL encoding
$url .= '&tz=180&eq=geo=BR&q=stack';
You end up with this URL:
https://trends.google.com/trends/embed/explore/RELATED_TOPICS?req=%7B%22comparisonItem%22%3A%5B%7B%22keyword%22%3A%22stack%22%2C%22geo%22%3A%22BR%22%2C%22time%22%3A%22today+5-y%22%7D%5D%2C%22category%22%3A0%2C%22property%22%3A%22%22%7D&tz=180&eq=geo=BR&q=stack
which can be pasted into the browser's address bar and passed to cURL.
Note:
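I'm not sure how much information you will be able to get out of that page once you have fetched its source, because it relies heavily on JavaScript and Ajax calls to render its content.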
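To tie this back to the original goal of saving the HTML, here is a minimal sketch that builds the URL from structured data and writes the response to a file. `buildTrendsUrl` and `fetchAndSave` are hypothetical helper names, and the timeout values are arbitrary. Nothing here guarantees that Google will serve the page to a script, and as noted above the saved file may only contain the page shell without the rendered data.

```php
<?php
// Hypothetical helper: builds the embed URL from plain PHP values so that
// json_encode() and http_build_query() take care of all the encoding.
function buildTrendsUrl(string $keyword, string $geo, string $time): string
{
    $req = json_encode([
        'comparisonItem' => [
            ['keyword' => $keyword, 'geo' => $geo, 'time' => $time],
        ],
        'category' => 0,
        'property' => '',
    ]);

    // http_build_query() URL-encodes every value and joins the pairs with "&".
    $query = http_build_query([
        'req' => $req,
        'tz'  => 180,
        'eq'  => 'geo=' . $geo,
        'q'   => $keyword,
    ], '', '&');

    return 'https://trends.google.com/trends/embed/explore/RELATED_TOPICS?' . $query;
}

// Hypothetical helper: downloads $url and stores the body in $path.
// Returns true only if the request succeeded with HTTP 200 and the file was written.
function fetchAndSave(string $url, string $path): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // seconds, arbitrary choice
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // seconds, arbitrary choice

    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($html === false || $status !== 200) {
        return false;
    }

    return file_put_contents($path, $html) !== false;
}

// Usage (not executed here, the file name is just an example):
// $ok = fetchAndSave(buildTrendsUrl('stack', 'BR', 'today 5-y'), 'trends.html');
```

The only difference from the hand-built URL is that `eq=geo=BR` becomes `eq=geo%3DBR`, which is the same value once decoded.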