Scraping the most relevant image from a page using the Readability API

Date: 2014-01-18 06:49:31

Tags: php parsing

I am using the Readability API to do this. Their example shows lead_image_url in the response, but I am unable to fetch it.

Reference: https://www.readability.com/developers/api/parser

Is this the correct way to make a direct request?

  1. https://www.readability.com//api/content/v1/parser?url=http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/&token=1b830931777ac7c2ac954e9f0d67df437175e66e

  2. https://www.readability.com/parser/?token=1b830931777ac7c2ac954e9f0d67df437175e66e&url=http://nextbigwhat.com

  3. It says: {"messages": "The API Key in the form of the 'token' parameter is invalid.", "error": true}

    Another attempt:

    <?php
        define('TOKEN', "1b830931777ac7c2ac954e9f0d67df437175e66e");    
        define('API_URL', "https://www.readability.com/api/content/v1/parser?url=%s&token=%s");
    
       function get_image($url) {   
    
        // sanitize it so we don't break our api url    
        $encodedUrl = urlencode($url);    
        $TOKEN = '1b830931777ac7c2ac954e9f0d67df437175e66e';    
        $API_URL = 'https://www.readability.com/api/content/v1/parser?url=%s&token=%s';    
    //  $API_URL = 'http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas';    
        // build our url   
        $url = sprintf($API_URL, $encodedUrl, $TOKEN);    
    
        // call the api    
        $response = file_get_contents($url);    
        if( $response ) {    
            return false;   
        }    
        $json = json_decode($response);    
        if(!isset($json['lead_image_url'])) {    
            return false;    
        }    
    
        return $json['lead_image_url'];
    
    }
    

    Error: Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32

    One more attempt:

    require 'readability/lib/Readability.inc.php';
    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);
    
    $Readability     = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();
    
    $image= $ReadabilityData['lead_image_url'];
    $title= $ReadabilityData['title']; //This works fine.
    $content = $ReadabilityData['word_count'];
    
    echo "$content"; 
    

    It says: Notice: Undefined index: lead_image_url in F:\wamp\www\inviteold\test2.php on line 13

1 answer:

Answer 0 (score: 4)

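First, in order to use the REST API they provide, you need to create an account. After that, you can generate your own token to use in your calls. The token shown in their example will not work, because it is purposely invalid. It is there for illustration only.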

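Second, make sure the allow_url_fopen directive in your php.ini file is enabled, since file_get_contents() needs it to open remote URLs. This directive can only be changed in php.ini or the server configuration; calling ini_set('allow_url_fopen', true); at the top of the page has no effect. If you cannot change php.ini (a shared hosting solution), fetch the page with the cURL extension instead.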
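As a rough sketch of these first two points, the request could be made with your own token and a cURL fallback along the following lines. The token placeholder and the function names buildParserUrl, fetchUrl, and extractLeadImage are made up for illustration and are not part of the Readability API; the endpoint and the lead_image_url field are the ones from the question.

```php
<?php
// Hypothetical placeholder: replace with the token from your own account.
define('READABILITY_TOKEN', 'YOUR_TOKEN_HERE');

// Build the parser request URL; http_build_query() URL-encodes both values.
function buildParserUrl($pageUrl, $token) {
    $query = http_build_query(array('url' => $pageUrl, 'token' => $token), '', '&');
    return 'https://www.readability.com/api/content/v1/parser?' . $query;
}

// Fetch a URL with file_get_contents() when allow_url_fopen is enabled,
// otherwise fall back to cURL. Returns the body, or false on failure.
function fetchUrl($url) {
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return @file_get_contents($url);
    }
    if (!function_exists('curl_init')) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    return curl_exec($ch);
}

// Decode the JSON response and return lead_image_url, or false if it is absent.
function extractLeadImage($response) {
    if ($response === false) {
        return false;
    }
    $json = json_decode($response, true); // true => associative array
    if (!is_array($json) || empty($json['lead_image_url'])) {
        return false;
    }
    return $json['lead_image_url'];
}
```

With those in place, the call would be `$image = extractLeadImage(fetchUrl(buildParserUrl($url, READABILITY_TOKEN)));`. Note that the response is rejected only when the fetch fails, and that json_decode() is given true as its second argument so the result can be indexed like an array.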

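Finally, in order to parse out the images yourself, you need to retrieve all image elements from the DOM you fetched. Sometimes there will be no images and sometimes there will be; it depends on the page you are pulling from. You also need to resolve relative paths...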

Your code

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content"; 

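After running Readability, you can use the DOMDocument class to retrieve the images from the content you pulled. Instantiate a new DOMDocument and load the HTML. Make sure to call libxml_use_internal_errors to suppress the parser errors that most websites trigger. We will put this in a function so that it is easier to reuse elsewhere if needed.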

function sampleDomMedia($html) {
    // Suppress libxml parser errors
    libxml_use_internal_errors(true);

    // New document
    $dom = new DOMDocument();
    // Populate document
    $dom->loadHTML($html);
    //[...]

You can now retrieve all image elements from the document you instantiated and then read their src attribute, like so:

    //[...]
    // Get image elements
    $nodeList = $dom->getElementsByTagName('img');

    // Get length
    $length = $nodeList->length;

    // Initialize array
    $images = array();

    // Iterate over our nodes
    for($i=0;$i<$length;$i++) {
        // Get the current node
        $node = $nodeList->item($i);
        // Retrieve the src attribute
        $image = $node->getAttribute('src');

        // Push image src into $images array
        array_push($images,$image);
    }

    return $images;
}
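To see what libxml_use_internal_errors does in practice, here is a small self-contained sketch run against a deliberately malformed HTML string. The string and the function name extractImageSources are made up for illustration. It also clears the buffered errors afterwards, which the function above does not do, so that they do not accumulate across calls.

```php
<?php
// Collect the src attribute of every <img> element in an HTML string.
function extractImageSources($html) {
    // Buffer libxml errors instead of emitting PHP warnings;
    // keep the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the buffered parse errors and restore the old setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = array();
    foreach ($dom->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Illustrative input: unclosed tags and an unknown element
$html = '<div><p>Intro<img src="/a.png"><custom-tag><img src="b.jpg"></div>';
print_r(extractImageSources($html));
```

Despite the broken markup, both src values are returned and no warnings are printed.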

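Now you have an array of images to offer to the user. But before you do that, there is one more thing we forgot... We want to resolve all relative paths, so that we always have an absolute path to an image that lives on another site.

To do this, we have to determine the base domain URL, as well as the relative path of the current page we are working with. We can do that with the parse_url() function that PHP provides. For simplicity, we can put this into a function.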

function getUrls($url) {
    // Parse URL
    $urlArr = parse_url($url);

    // Determine Base URL, with scheme, host, and port
    $base = $urlArr['scheme']."://".$urlArr['host'];
    if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
        $base .= ":".$urlArr['port'];
    }

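    // Fall back to "/" when the URL has no path (e.g. http://www.nextbigwhat.com)
    $path = isset($urlArr['path']) ? $urlArr['path'] : '/';

    // Truncate the path using the position of the last forward slash
    $relative = $base.substr($path, 0, strrpos($path,"/")+1);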

    // Return our two URLs
    return array($base, $relative);
}
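For reference, this is roughly what parse_url() returns for a sample URL (example.com is a placeholder), which is where the scheme, host, port, and path keys used above come from. Keys for components that are absent from the URL are simply not set, so a bare domain has no path key.

```php
<?php
$parts = parse_url('http://example.com:8080/blog/2014/post.html?page=2');

echo $parts['scheme'], "\n"; // http
echo $parts['host'], "\n";   // example.com
echo $parts['port'], "\n";   // 8080
echo $parts['path'], "\n";   // /blog/2014/post.html
echo $parts['query'], "\n";  // page=2

// A bare domain has no 'path' key at all
$bare = parse_url('http://example.com');
var_dump(isset($bare['path'])); // bool(false)
```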

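By adding an extra parameter to the original sampleDomMedia function, we can call this function to get the paths. Then we can check the value of the src attribute to determine what kind of path it is, and resolve it.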

function sampleDomMedia($html, $url) {
    // Retrieve our URLs
    list($baseUrl, $relativeUrl) = getUrls($url);

    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $nodeList = $dom->getElementsByTagName('img');
    $length = $nodeList->length;
    $images = array();

    for($i=0;$i<$length;$i++) {
        $node = $nodeList->item($i);
        $image = $node->getAttribute('src');

        // Resolve relative paths
        if(substr($image,0,2)=="//") { // Missing protocol
            $image = "http:".$image;
        } else if(substr($image,0,1)=="/") { // Path Relative to Base
            $image = $baseUrl.$image;
        } else if(substr($image,0,4)!=="http") { // Path relative to the current directory
            $image = $relativeUrl.$image;
        }

        array_push($images,$image);
    }

    return $images;
}
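The prefix checks above do not cover every case, for example ../ segments, protocol-relative sources on an https page, or data: URIs. A slightly more general resolver might look like the sketch below. resolveImageUrl is a hypothetical helper, not part of Readability or PHP, and it deliberately ignores <base> tags and assumes the page URL is absolute.

```php
<?php
// Resolve an image src against the URL of the page it was found on.
function resolveImageUrl($src, $pageUrl) {
    // Already absolute (http:, https:, data:, ...): return unchanged
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;
    }
    $page   = parse_url($pageUrl);
    $scheme = isset($page['scheme']) ? $page['scheme'] : 'http';
    $base   = $scheme . '://' . $page['host']
            . (isset($page['port']) ? ':' . $page['port'] : '');

    // Protocol-relative: reuse the scheme of the page itself
    if (substr($src, 0, 2) === '//') {
        return $scheme . ':' . $src;
    }
    // Set any query string or fragment aside so it is not treated as path
    $cut    = strcspn($src, '?#');
    $suffix = (string) substr($src, $cut);
    $src    = (string) substr($src, 0, $cut);

    // Root-relative paths start at "/"; others start at the page's directory
    if (substr($src, 0, 1) === '/') {
        $path = $src;
    } else {
        $pagePath = isset($page['path']) ? $page['path'] : '/';
        $path = substr($pagePath, 0, strrpos($pagePath, '/') + 1) . $src;
    }
    // Collapse "." and ".." segments (empty segments are dropped as well)
    $segments = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    return $base . '/' . implode('/', $segments) . $suffix;
}
```

Inside the loop, the three-way if/else could then be replaced by a single call such as `$image = resolveImageUrl($image, $url);`.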

Last but not least, we are left with the two functions above, plus this procedural code:

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$image = $ReadabilityData['lead_image_url'];
$images = sampleDomMedia($html, $url);

$title = $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content";

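Additionally, if you think the article content itself may contain images (it usually does not), you can pass the content returned by Readability instead of the $html variable, like so: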

$title = $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['content']; // assumes the extracted HTML is stored under 'content'
$images = sampleDomMedia($content, $url);

I hope that helps.