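Safari has a "Reader mode" that strips everything except the article text from pages that contain an article.
Now I need to fetch the HTML source of a website and then use PHP to get the real news content, the way Safari's "Reader mode" does!
Can you help me?? :S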
Answer 0 (score: 1)
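Someone pointed out that just posting a link to another post isn't very helpful, so I'm updating this. I've started using a PHP port of Arc90's Readability, and it works very well.
Here's a link to the PHP port of Readability.js: http://www.keyvan.net/2010/08/php-readability/
Here's a simple implementation example: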
// Assumption: Readability.php from the PHP port linked above sits next to this script
require_once 'Readability.php';
$url = 'http://';
$html = file_get_contents($url);
// clean up malformed markup first if the tidy extension is available
if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}
// give it to Readability
$readability = new Readability($html, $url);
// echo $readability->html;
// echo htmlspecialchars($tidy($readability->html, true));
// print debug output?
// useful to compare against Arc90's original JS version -
// simply click the bookmarklet with FireBug's console window open
$readability->debug = false;
// convert links to footnotes?
$readability->convertLinksToFootnotes = false;
$readability->lightClean = false;
// $readability->revertForcedParagraphElements = false;
// process it; $result tells us whether article content was found
$result = $readability->init();
if ($result) {
    // store reference to dom content processed by Readability
    $content = $readability->getContent();
    echo '<h1>' . $readability->getTitle()->textContent . '</h1>';
    echo $content->innerHTML;
} else {
    echo 'Could not find the article content.';
}
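If you only want the text of the article rather than its HTML, the extracted markup can be flattened afterwards. This is a minimal sketch using only core PHP functions; `articleHtmlToText` is a hypothetical helper name, not part of the Readability port, and it assumes the input is the string you get from `$content->innerHTML`:

```php
<?php
// Hypothetical helper (not part of the Readability port): flattens
// extracted article HTML into plain text using only core PHP functions.
function articleHtmlToText($html) {
    // Turn <br> and closing block tags into newlines so paragraphs stay separated
    $html = preg_replace('#<br\s*/?>|</(p|div|h[1-6]|li)>#i', "\n", $html);
    // Drop the remaining tags, then decode entities such as &amp; and &quot;
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Collapse runs of spaces and tabs into a single space
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Collapse whitespace around line breaks into a single newline
    $text = preg_replace('/\s*\n\s*/', "\n", $text);
    return trim($text);
}
```

It could then be called as `echo articleHtmlToText($content->innerHTML);` in place of the last echo.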
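If you want to increase the number of pages this works on, I've found that you get better results by fetching the page with cURL and setting a user agent before handing the HTML to Readability. Follow redirects as well and it gets even better.
Here's the function I use instead of file_get_contents: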
function getData($url) {
    // undo HTML-escaped ampersands and percent-encoding in the incoming URL
    $url = str_replace('&amp;', '&', urldecode(trim($url)));
    $timeout = 5;
    // temporary cookie jar, so cookies set along a redirect chain are kept
    $cookie = tempnam(sys_get_temp_dir(), 'CURLCOOKIE');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // empty string: accept every encoding cURL supports (gzip, deflate, ...)
    curl_setopt($ch, CURLOPT_ENCODING, '');
    // return the body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    // curl_exec returns false if the request fails
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
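The URL clean-up at the top of getData appears to be aimed at URLs copied out of HTML source, where `&` is written as `&amp;`. Here is a small sketch of that step in isolation; `normalizeUrl` is a hypothetical name used only for illustration, and example.com is a placeholder address:

```php
<?php
// Hypothetical helper mirroring the URL clean-up at the top of getData().
function normalizeUrl($url) {
    // trim() removes stray whitespace, urldecode() undoes percent-encoding,
    // and str_replace() turns HTML-escaped ampersands back into plain ones.
    return str_replace('&amp;', '&', urldecode(trim($url)));
}

echo normalizeUrl("  http://example.com/?a=1&amp;b=2\n"), "\n";
// prints: http://example.com/?a=1&b=2
```

Note that urldecode() also decodes every %XX sequence and turns `+` into a space, so URLs that depend on encoded characters may come out changed. If that causes trouble, dropping the urldecode() call and keeping only the trim() and str_replace() steps is one option.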
Usage:
$url = 'http://';
//$html = file_get_contents($url);
$html = getData($url);
if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}
$readability = new Readability($html, $url);
//...
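Both file_get_contents and curl_exec (with CURLOPT_RETURNTRANSFER set) return false when the request fails, so it can be worth checking the result before handing it to tidy or Readability. A minimal sketch; `isUsableHtml` is a hypothetical helper name, not something either library provides:

```php
<?php
// Hypothetical guard (not part of any library): true only for a
// non-empty string, false for a failed fetch or an empty body.
function isUsableHtml($html) {
    return is_string($html) && trim($html) !== '';
}

// Example with stand-in values instead of a live request
var_dump(isUsableHtml(false));               // bool(false)
var_dump(isUsableHtml("<html>...</html>"));  // bool(true)
```

In the snippet above it might be used right after the fetch, for example `if (!isUsableHtml($html)) { die('Could not fetch ' . $url); }`.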