PHP crawler not working properly

Time: 2016-03-23 07:37:42

Tags: php web-crawler

I am using the PHP code below to extract the content under id = description with getElementById:

function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);

    if(!$doc) {
        throw new Exception("Failed to load $url");
    }

    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// call it:
echo getElementByIdAsString('www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'Synopsis');

With the input ('www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'Synopsis') the code above works fine, but when I try another website with the input (www.lookupbyisbn.com/Lookup/Book/0143418769/0143418769/1, reviews) it does not work. Can someone help me out?

Thanks in advance.

2 answers:

Answer 0: (score: 2)

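It seems the second URL needs some additional headers before the site will return its HTML content ($doc->loadHTMLFile($url) returns false for the second URL). I suggest you use curl to load the remote content. Have a look at the solution below, where I made a few small changes to your current code: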

function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);


    // Debug: loadHTMLFile() returns false for the second URL
    // var_dump($doc->loadHTMLFile($url)); die;

    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// call it:
echo getElementByIdAsString('http://www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'synopsis');

echo getElementByIdAsString('http://www.lookupbyisbn.com/Lookup/Book/0143418769/0143418769/1', 'reviews');
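
Once curl has returned the HTML as a string, the parsing step can be exercised on its own, without any network access. The following is only a minimal sketch of that idea: extractElementById is a hypothetical helper name (not part of the answer above), and it assumes you would rather collect libxml parser warnings internally than silence them with @.

```php
<?php
// Hypothetical helper: parse an HTML string and return one element's markup.
function extractElementById(string $html, string $id): string
{
    // Collect parser warnings internally so malformed markup does not spam the output.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new RuntimeException('The HTML could not be parsed');
    }

    $element = $doc->getElementById($id);
    if ($element === null) {
        throw new RuntimeException("An element with id $id was not found");
    }

    // saveHTML() serializes the node as HTML rather than XML.
    return $doc->saveHTML($element);
}

// Usage with an inline string standing in for the curl_exec() result.
echo extractElementById('<div id="reviews"><p>Great book</div>', 'reviews'), PHP_EOL;
```

Keeping the fetch and the parse separate like this should make it easier to tell whether a failure comes from the HTTP request or from the id lookup.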

Answer 1: (score: 0)

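Your script works OK, but you may have an easier time spotting problems if you output the errors directly (for now). I made some minor tweaks, and it seems to work as expected: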

function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $doc->formatOutput = $pretty;

    if( !$doc->loadHTMLFile($url) ) {
        echo  "Failed to load $url";
        throw new Exception("Failed to load $url");
    }

    // Obtain the element
    if( !( $element = $doc->getElementById($id) )) {
        echo  "An element with id $id was not found";
        throw new Exception("An element with id $id was not found");
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

So, with your original example:

// call it:
echo getElementByIdAsString('www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'Synopsis');

This outputs:

Failed to load www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp

That is because URL-based file locations need a scheme, e.g. http://. Updating for this:

// call it:
echo getElementByIdAsString('http://www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'Synopsis');

Outputs:

An element with id Synopsis was not found

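That is because (drumroll please...) there is no element with an id of Synopsis in that document. Looking directly at the page source, I see an element with the id synopsis (lowercase), so adjusting again, I tried: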

// call it:
echo getElementByIdAsString('http://www.abebooks.com/9780143418764/Love-Story-Singh-Ravinder-0143418769/plp', 'synopsis');

Which returned:

<div role="tabpanel" id="synopsis" class="tab-pane active">&#13;
                    <div class="ms-toggle">&#13;
                      <p class="synopsis-item">Do Love stories ever die? Can modern day gadgets like Mobile phones and the http:\ www’ era of Internet bring you the love of your life? You haven't met her earlier, but commit to marry. Will you still call this a love marriage? And what if on the engagement day while you pull the ring out from your pocket, you realize what you planned was just a dream which never comes true...? How would you react when a beautiful person comes into your life, becomes your most precious possession and then one day goes away from you...forever? Not all love stories are meant to have a perfect ending. Some stay incomplete. Yet they are beautiful in their own way. Ravin's love story is one such innocent and beautiful story. He believes love stories seldom die. They are meant to stay for the generations yet to come and read them.</p>&#13;
                      <p id="synopsis-disclaimer-text"><em>"synopsis" may belong to another edition of this title.</em></p>&#13;
                    </div>&#13;
                  </div>

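So your function itself seems fine (maybe a little rough around the edges, but that's forgivable); you just need to pay attention to the errors you took the time to build in. You managed to hit two of them in your example: an invalid URL and a non-existent id.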
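
Both mistakes can be caught before any request is made. Here is a rough sketch, assuming you only want to accept http and https URLs; assertHttpUrl is a hypothetical helper name, not something from your code. The second half uses an inline HTML string to illustrate that getElementById matches ids case-sensitively.

```php
<?php
// Hypothetical helper: reject URLs that lack an http:// or https:// scheme.
function assertHttpUrl(string $url): string
{
    // parse_url() yields no scheme for 'www.example.com/path'.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        throw new InvalidArgumentException("URL must start with http:// or https://: $url");
    }

    return $url;
}

// Id lookups are case-sensitive: 'Synopsis' and 'synopsis' are different ids.
$doc = new DOMDocument();
$doc->loadHTML('<div id="synopsis">text</div>');

$wrongCase = $doc->getElementById('Synopsis'); // different case, so no match is expected
$rightCase = $doc->getElementById('synopsis'); // exact match

var_dump($wrongCase === null, $rightCase !== null);
```

Calling assertHttpUrl($url) at the top of getElementByIdAsString would turn the missing-scheme case into an explicit error instead of a generic load failure.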

Update

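As mentioned in the other answer, the problem appears to be that the remote site returns nothing unless a User-Agent: header is set on the HTTP request. As suggested there, one option is to fetch the remote HTML with curl first, which gives you more control over the HTTP request. Another option is to set the user_agent INI directive at runtime, like so: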

ini_set('user_agent', 'Googlebot/2.1 (+http://www.google.com/bot.html)');

So, for your function, you could add:

function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    $doc->formatOutput = $pretty;    

    ini_set('user_agent', 'Googlebot/2.1 (+http://www.google.com/bot.html)');

    if( !$doc->loadHTMLFile( $url ) ) {
        echo  "Failed to load $url" . PHP_EOL;
        throw new Exception("Failed to load $url");
    }

    // Obtain the element
    if( !( $element = $doc->getElementById($id) )) {
        echo  "An element with id $id was not found" . PHP_EOL;
        throw new Exception("An element with id $id was not found");
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

I tested this, and it appears to work without needing to switch to curl (which is still a good idea in general).
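
If changing the user_agent INI setting for the whole script feels too broad, a stream context might be a narrower alternative. This is only a sketch that I have not run against the sites in question: makeUserAgentContext is a hypothetical helper name, the 10-second timeout is an arbitrary assumption, and it relies on libxml_set_streams_context() applying the context to the next libxml document load.

```php
<?php
// Hypothetical helper: build an HTTP stream context that carries a User-Agent.
function makeUserAgentContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent: header
            'timeout'    => 10,         // seconds; an arbitrary choice for this sketch
        ],
    ]);
}

$context = makeUserAgentContext('Googlebot/2.1 (+http://www.google.com/bot.html)');

// Ask libxml to use this context for the next document it loads,
// for example a following $doc->loadHTMLFile($url) call.
libxml_set_streams_context($context);
```

The same context can also be passed as the third argument to file_get_contents() if you prefer to fetch the HTML yourself and hand it to loadHTML().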