Question

I am using the Readability API to do this. Their example shows lead_image_url, but I can't get it.

Reference: https://www.readability.com/developers/api/parser

Is this the correct way to make a direct request?

https://www.readability.com//api/content/v1/parser?url=http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/&token=1b830931777ac7c2ac954e9f0d67df437175e66e
https://www.readability.com/parser/?token=1b830931777ac7c2ac954e9f0d67df437175e66e&url=http://nextbigwhat.com

It says: {"messages": "The API Key in the form of the 'token' parameter is invalid.", "error": true}

Another attempt:

<?php
define('TOKEN', "1b830931777ac7c2ac954e9f0d67df437175e66e");
define('API_URL', "https://www.readability.com/api/content/v1/parser?url=%s&token=%s");

function get_image($url) {
    // URL-encode the target so it doesn't break our API URL
    $encodedUrl = urlencode($url);

    // Build the request URL
    $requestUrl = sprintf(API_URL, $encodedUrl, TOKEN);

    // Call the API; file_get_contents() returns false on failure
    $response = file_get_contents($requestUrl);
    if ($response === false) {
        return false;
    }

    // Decode into an associative array so that array access works
    $json = json_decode($response, true);
    if (!isset($json['lead_image_url'])) {
        return false;
    }

    return $json['lead_image_url'];
}

Error: Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content";

It says: Notice: Undefined index: lead_image_url in F:\wamp\www\inviteold\test2.php on line 13

Answer 1

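First, to use the REST API they provide, you need to create an account. After that, you can generate your own token to use in your calls. The token shown in the example will not work because it is intentionally invalid; it is there for illustration only.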

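Second, make sure the allow_url_fopen directive in your php.ini file is enabled. It is a system-level setting, so calling ini_set('allow_url_fopen', true); at the top of the page has no effect. If you cannot change php.ini (for example on a shared hosting plan), you would need a different HTTP client such as cURL.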

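Finally, to parse out the images yourself, you need to retrieve all image elements from the DOM you fetched. Sometimes there will be images and sometimes there won't; it depends on the page you are pulling from. You also need to resolve relative paths...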
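For the API route covered in the first two points, here is a minimal sketch of building the request URL and reading lead_image_url out of the response body. The helper names buildParserUrl and extractLeadImage are hypothetical, and the sketch assumes the response is a JSON object shaped like the error payload quoted in the question (an "error" flag) or containing a lead_image_url field.

```php
<?php
// Hypothetical helper: build the parser request URL from a target URL and your own token.
function buildParserUrl($targetUrl, $token) {
    // http_build_query() URL-encodes both values
    return 'https://www.readability.com/api/content/v1/parser?' . http_build_query(array(
        'url'   => $targetUrl,
        'token' => $token,
    ));
}

// Hypothetical helper: pull lead_image_url out of a raw JSON response body.
// Returns null on invalid JSON, on an error payload, or when the field is absent or null.
function extractLeadImage($body) {
    $data = json_decode($body, true); // true => associative array
    if (!is_array($data) || !empty($data['error'])) {
        return null;
    }
    return isset($data['lead_image_url']) ? $data['lead_image_url'] : null;
}
```

Keeping the parsing separate from the HTTP call means you can check it against a saved response body before you have a working token.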

Your code:

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

$image= $ReadabilityData['lead_image_url'];
$title= $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content";

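After running Readability, you can use the DOMDocument class to retrieve the images from the content you pulled. Instantiate a new DOMDocument and load the HTML. Make sure to call libxml_use_internal_errors() to suppress the errors the parser raises on most websites. We'll put this in a function so that it is easier to reuse elsewhere.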

function sampleDomMedia($html) {
    // Suppress parser errors
    libxml_use_internal_errors(true);

    // New document
    $dom = new DOMDocument();
    // Populate document
    $dom->loadHTML($html);
    //[...]

You can now retrieve all image elements from the document you instantiated and read their src attributes, like this:

    //[...]
    // Get image elements
    $nodeList = $dom->getElementsByTagName('img');

    // Get length
    $length = $nodeList->length;

    // Initialize array
    $images = array();

    // Iterate over our nodes
    for($i=0;$i<$length;$i++) {
        // Get the current node
        $node = $nodeList->item($i);
        // Retrieve the src attribute
        $image = $node->getAttribute('src');

        // Push image src into $images array
        array_push($images,$image);
    }

    return $images;
}
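To see what the error suppression does in practice, here is a small sketch that runs the same steps on a deliberately malformed snippet. The sample markup is made up for illustration; libxml_get_errors() and libxml_clear_errors() are the standard companions to libxml_use_internal_errors().

```php
<?php
// Deliberately malformed sample markup: unclosed tags and an HTML5 element
$html = '<div><article><img src="/a.png"><p>text<img src="b.png"></div>';

// Collect parser errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect the buffered errors if you want to, then free them
$errors = libxml_get_errors();   // array of LibXMLError objects, possibly empty
libxml_clear_errors();

// The document is still usable even though the markup was invalid
$srcs = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $srcs[] = $img->getAttribute('src');
}
```

Clearing the buffer matters in long-running scripts that parse many pages, since the errors otherwise accumulate in memory.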

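Now you have an array of images to work with. But before you use it, there is one more thing we forgot: we want to resolve all relative paths so that we always have an absolute path to an image that lives on another site.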

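To do that, we have to determine the base domain URL, as well as the relative path of the page we are currently working with. We can do this with PHP's parse_url() function. For simplicity, we can put it in a function.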

function getUrls($url) {
    // Parse URL
    $urlArr = parse_url($url);

    // Determine Base URL, with scheme, host, and port
    $base = $urlArr['scheme']."://".$urlArr['host'];
    if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
        $base .= ":".$urlArr['port'];
    }

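    // parse_url() omits 'path' for URLs such as http://example.com, so default to "/"
    $path = isset($urlArr['path']) ? $urlArr['path'] : "/";

    // Truncate the path using the position of the last forward slash
    $relative = $base.substr($path, 0, strrpos($path,"/")+1);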

    // Return our two URLs
    return array($base, $relative);
}
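To make the parse_url() step concrete, here is a short sketch of what it returns for two sample URLs and how the directory part of the path is cut out. The example.com URLs are placeholders chosen for illustration.

```php
<?php
// A URL with a port, a path and a query string
$withPath = parse_url('http://example.com:8080/blog/2011/post.html?x=1');
// scheme => "http", host => "example.com", port => 8080,
// path => "/blog/2011/post.html", query => "x=1"

// A bare host: the result has no 'path' key at all
$bare = parse_url('http://example.com');
// scheme => "http", host => "example.com"

// Directory part of the path: everything up to and including the last "/"
$path = $withPath['path'];
$dir  = substr($path, 0, strrpos($path, '/') + 1);   // "/blog/2011/"
```

Note that only the keys present in the URL appear in the result, which is why code that reads 'port' or 'path' should check for them first.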

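By adding one more parameter to the original sampleDomMedia function, we can call this function to get the paths. We can then inspect the value of the src attribute to determine what kind of path it is, and resolve it.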

function sampleDomMedia($html, $url) {
    // Retrieve our URLs
    list($baseUrl, $relativeUrl) = getUrls($url);

    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $nodeList = $dom->getElementsByTagName('img');
    $length = $nodeList->length;
    $images = array();

    for($i=0;$i<$length;$i++) {
        $node = $nodeList->item($i);
        $image = $node->getAttribute('src');

        // Resolve relative paths
        if(substr($image,0,2)=="//") { // Missing protocol
            $image = "http:".$image;
        } else if(substr($image,0,1)=="/") { // Path Relative to Base
            $image = $baseUrl.$image;
        } else if(substr($image,0,4)!=="http") { // Path relative to the current directory
            $image = $relativeUrl.$image;
        }

        array_push($images,$image);
    }

    return $images;
}
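One case the checks above do not cover is a src that contains "." or ".." segments, such as ../img/a.png; after prefixing, the result still carries those segments. As a rough sketch, assuming the URL has already been made absolute, they could be collapsed like this. The name collapseDotSegments is hypothetical, and for brevity the sketch drops query strings, fragments and any trailing slash.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments in an absolute URL's path.
// Assumes $url has a scheme and host; query string and fragment are discarded.
function collapseDotSegments($url) {
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '/';

    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);      // step up one directory (no-op at the root)
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;    // keep ordinary segments
        }
    }

    $base = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    return $base . '/' . implode('/', $out);
}
```

You could apply it to each entry just before pushing it into the $images array.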

Last but not least, we are left with the two functions above, plus this procedural code:

require 'readability/lib/Readability.inc.php';
$url = 'http://www.nextbigwhat.com';
$html = file_get_contents($url);

$Readability     = new Readability($html); // default charset is utf-8
$ReadabilityData = $Readability->getContent();

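// Avoid the "Undefined index" notice when the key is absent
$image = isset($ReadabilityData['lead_image_url']) ? $ReadabilityData['lead_image_url'] : null;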
$images = sampleDomMedia($html, $url);

$title = $ReadabilityData['title']; //This works fine.
$content = $ReadabilityData['word_count'];

echo "$content";

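Additionally, if you think the article content itself may contain images (usually it doesn't), you can pass the content returned by Readability instead of the $html variable, like so: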

$title = $ReadabilityData['title']; //This works fine.
// Assumes the array returned by getContent() exposes the extracted HTML under 'content'
$content = $ReadabilityData['content'];
$images = sampleDomMedia($content, $url);

I hope that helps.
