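PHP web crawler exception

Time: 2016-03-23 20:22:32

Tags: php web-crawler

Below is my code. It is supposed to output the content under the Plot tab of a wiki page. I am using getElementById, and it throws the exception I have pasted below. Can someone modify it so that it works? Thanks in advance.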

<?php
/**
 * Downloads a web page from $url, selects the element by $id
 * and returns its XML string representation.
 */
//Taking input
 if(isset($_POST['submit'])) /* i.e. the PHP code is executed only when someone presses Submit button in the below given HTML Form */
{
$var = $_POST['var'];   // Here $var is the input taken from user.
} 
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);

    if(!$doc) {
        throw new Exception("Failed to load $url");
    }

    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// call it:
echo getElementByIdAsString('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'Plot');
?>

The exception is:

Fatal error: Uncaught exception 'Exception' with message 'An element with id Plot was not found' in C:\xampp\htdocs\example2.php:23 Stack trace: #0 C:\xampp\htdocs\example2.php(35): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 23

1 answer:

Answer 0 (score: 0)

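I ran your code and it worked, returning <span class="mw-headline" id="Plot">Plot</span>. I think the problem is the way you call DOMDocument::loadHTMLFile with the @ operator:

@$doc->loadHTMLFile($url);

The @ suppresses any warning and the return value is never checked, even though this method returns:

  bool: true on success, false on failure

Sometimes it returns false (for example, when Wikipedia answers with a 403 after too many requests), and the DOM document is then empty. In that case $element = $doc->getElementById($id); cannot find the element.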
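The failure mode described above can be reproduced without any network access. The following is a minimal sketch, assuming a hypothetical local path that does not exist as a stand-in for a request that fails (such as an HTTP 403); it is not the asker's exact situation, only an illustration of what a failed load leaves behind.

```php
<?php
// Sketch only: the path below is a hypothetical file that does not exist,
// standing in for a request that fails (e.g. an HTTP 403).
$missing = __DIR__ . '/no-such-page.html';

$doc = new DOMDocument();

// @ suppresses the warning, so the return value is the only failure signal.
$loaded = @$doc->loadHTMLFile($missing);

// The document stays empty after a failed load, so the lookup finds nothing.
$element = $doc->getElementById('Plot');

var_dump($loaded);           // bool(false)
var_dump($element === null); // bool(true)
```

Because $doc itself is still a valid object after the failed load, a check such as if(!$doc) never fires; only the return value of loadHTMLFile reveals the failure.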

Try changing your code to:

<?php
/**
 * Downloads a web page from $url, selects the element by $id
 * and returns its XML string representation.
 */
//Taking input
if(isset($_POST['submit'])) /* i.e. the PHP code is executed only when someone presses Submit button in the below given HTML Form */
{
    $var = $_POST['var'];   // Here $var is the input taken from user.
}
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $loadResult = @$doc->loadHTMLFile($url);

    if(!$doc || !$loadResult) {
        throw new Exception("Failed to load $url");
    }

    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// call it:
echo getElementByIdAsString('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'Plot');
?>

Wikipedia may not be reachable from your script (some websites block scraper scripts). Try using curl to get the status code of the response:

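$url = 'https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$html = curl_exec($ch);                         // false if the transfer itself failed
$status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200 or 403
echo $status_code;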
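Once the status code is available, the download and the parsing can be separated so that a blocked request is reported as such instead of surfacing as a missing element. The sketch below is one possible arrangement, not a definitive implementation: the helper names fetchHtml and extractById, the User-Agent string, and the 15-second timeout are all assumptions made for illustration. Sending a descriptive User-Agent is something Wikimedia asks of automated clients, but whether its absence caused the 403 in this case is only a guess.

```php
<?php
// Hypothetical helper: downloads $url and fails loudly on a non-200 status.
function fetchHtml(string $url): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 15,    // assumed timeout, in seconds
        // Assumed, illustrative User-Agent; replace with your own details.
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1 (contact@example.com)',
    ]);
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($html === false) {
        throw new RuntimeException("Request to $url failed: " . curl_error($ch));
    }
    if ($status !== 200) {
        throw new RuntimeException("Unexpected HTTP status $status for $url");
    }
    return $html;
}

// Hypothetical helper: returns the markup of the element with the given id,
// or null when the HTML is empty, unparsable, or has no such element.
function extractById(string $html, string $id): ?string {
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $element = $doc->getElementById($id);
    return $element === null ? null : $doc->saveHTML($element);
}
```

With these helpers, a call such as extractById(fetchHtml($url), 'Plot') either returns the element's markup, returns null when the id is absent, or throws an exception that names the HTTP status, which makes a 403 easy to tell apart from a missing id.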