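PHP crawler does not work on Wikipedia

Time: 2016-03-24 17:19:40

Tags: php parsing web-crawler wiki

Below is my PHP code that fetches the text under id = Summary. The script works for other websites but not for Wikipedia. I have also pasted the error I get below. Does Wikipedia restrict parser scripts? If so, is there any solution to parse and get content from the wiki? Thanks in advance.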

<?php


function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);


    // var_dump($doc->loadHTMLFile($url)); die;
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// Here I am displaying the output in bold text
echo getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');
?>

Error:

Fatal error: Uncaught exception 'Exception' with message 'Failed to load http://en.wikipedia.org/wiki/A_Brief_History_of_Time' in C:\xampp\htdocs\example2.php:25 Stack trace: #0 C:\xampp\htdocs\example2.php(49): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 25

1 answer:

Answer 0 (score: 1)

This looks like a duplicate of: php crawler for wiki getting error

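The cause is that curl tries to verify the SSL certificate (and the check fails), so just adding:

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

removes the problem, but I would suggest using all of these: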

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
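
Note that the two VERIFY options above switch certificate checking off entirely. If you would rather keep it on, a possible alternative is to point curl at a CA bundle file and to include curl_error() in the exception, so the real reason for the failure is visible. The following is only a sketch under assumptions: it needs PHP 7.1 or later, the function names buildCurlOptions and fetchPage are made up for this example, and the user agent string and the path /path/to/cacert.pem are placeholders for your own values.

```php
<?php
// Sketch: build cURL options that keep certificate verification enabled.
// $caBundle is a hypothetical path to a CA bundle file (placeholder value).
function buildCurlOptions(?string $caBundle = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects, e.g. http -> https
        CURLOPT_SSL_VERIFYPEER => true,  // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,     // verify the certificate matches the host
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // placeholder user agent
    ];
    if ($caBundle !== null) {
        // Tell curl where to find the trusted root certificates.
        $options[CURLOPT_CAINFO] = $caBundle;
    }
    return $options;
}

// Fetch a page and report the underlying curl error if the request fails.
function fetchPage(string $url, ?string $caBundle = null): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, buildCurlOptions($caBundle));
    $result = curl_exec($ch);
    if ($result === false) {
        throw new Exception("Failed to load $url: " . curl_error($ch));
    }
    return $result;
}

// Example call (the bundle path is a placeholder, not a real location):
// $html = fetchPage('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', '/path/to/cacert.pem');
```

The string returned by fetchPage can be passed to $doc->loadHTML() exactly as $result is in the question's function.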