Fetching web page content with curl_init does not work for some links

Asked: 2012-09-26 05:22:54

Tags: php curl

I am using this code to fetch the contents of a URL entered by the user:

class MetaTagParser
{
    public $metadata;
    private $html;
    private $url;

    public function __construct($url)
    {
        $this->url = $url;

        $this->html = $this->file_get_contents_curl();

        $this->set_title();
        $this->set_meta_properties();
    }

    public function file_get_contents_curl()
    {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

        $data = curl_exec($ch);
        curl_close($ch);

        return $data;
    }

    public function set_title()
    {
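        $doc = new DOMDocument();
        // Suppress the warnings libxml raises on malformed real-world HTML.
        @$doc->loadHTML($this->html);
        $nodes = $doc->getElementsByTagName('title');

        // Guard against pages that have no <title> element.
        $this->metadata['title'] = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;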
    }

    // set_meta_properties() is called in the constructor but was not included in the question.
}

This class works for some pages, but for some URLs, such as http://www.dnaindia.com/india/report_in-a-first-upa-govt-tweets-the-press_1745346, I get this error when I try to fetch the data: "Warning: get_meta_tags(http://www.dnaindia.com/india/report_in-a-first-upa-govt-tweets-the-press_1745346): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\xampp\htdocs\prac\index.php on line 52"

It doesn't work. Any idea why this happens?

1 Answer:

Answer 0: (score: 1)

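Sometimes webmasters are not stupid and know how to protect their pages from leeching and scraping, so you have to get past that protection by presenting the user agent of a normal browser. Add the following line to your cURL options (it is written in curl_setopt_array() form; with curl_setopt() pass the same constant and string as the second and third arguments):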

CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0.1",
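
One way this could be applied to the question's code is sketched below. The helper names fetch_html() and extract_meta() are hypothetical and do not appear in the question. The sketch also assumes that the warning comes from a get_meta_tags($url) call inside the set_meta_properties() method, which the question does not show. get_meta_tags() opens the URL through PHP's own HTTP stream wrapper rather than through cURL, so a user agent set on the cURL handle would not affect that request. Reading the meta tags out of the HTML that cURL has already fetched avoids the second request altogether.

```php
<?php
// Hypothetical helper: fetch a page with cURL while sending a browser-like User-Agent.
function fetch_html($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => 0,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_FOLLOWLOCATION => 1,
        // The line suggested in the answer.
        CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0.1",
    ));

    // curl_exec() returns the body as a string, or false on failure.
    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}

// Hypothetical helper: collect <meta name="..." content="..."> pairs from HTML
// that has already been downloaded, instead of calling get_meta_tags($url).
function extract_meta($html)
{
    $meta = array();
    if (!is_string($html) || $html === '') {
        return $meta; // nothing to parse, e.g. the fetch failed
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed markup

    foreach ($doc->getElementsByTagName('meta') as $node) {
        $name = $node->getAttribute('name');
        if ($name !== '') {
            // Keys are lowercased so lookups do not depend on the page's casing.
            $meta[strtolower($name)] = $node->getAttribute('content');
        }
    }

    return $meta;
}
```

Inside the class, file_get_contents_curl() could set the same option with curl_setopt($ch, CURLOPT_USERAGENT, "..."), and set_meta_properties() could call something like extract_meta($this->html). Whether the site then answers with something other than 403 depends on how it filters requests, so this is not guaranteed to work for every URL.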