Question

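The title says it all: after some Googling and several days of tinkering with code, I still can't figure out how to download the plain text of a web page.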

Using strip_tags() still leaves the JavaScript and CSS behind, and trying to clean those up with regular expressions causes problems of its own.

Is there any way (simple or complex) to download a web page, such as a Wikipedia article, as plain text using PHP?

I downloaded the page with PHP's file_get_contents(), like this:

$homepage = file_get_contents('http://www.example.com/');

As I said, I tried strip_tags() and similar approaches, but I can't get clean plain text.
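To make the problem concrete, here is a minimal sketch of what strip_tags() does to a page that contains CSS and JavaScript. The HTML string is a made-up sample for illustration, not taken from a real page.

```php
<?php
// Made-up sample page, used only for illustration.
$html = '<style>p { color: red; }</style>'
      . '<p>Hello</p>'
      . '<script>alert("hi");</script>';

// strip_tags() removes the tags themselves but keeps the text between them,
// so the CSS rule and the JavaScript statement survive in the result.
$stripped = strip_tags($html);
echo $stripped, "\n";
```

The CSS rule and the script body end up mixed in with the visible text, which is why the result is not usable as plain text.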

I also tried http://millkencode.googlecode.com/svn/trunk/htmlxtractor/ContentExtractor.php to extract the main content, but it doesn't seem to work.

Answer 1

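This is not as easy as it looks. I suggest you take a look at something like PHP Simple HTML DOM Parser. Besides the JavaScript and CSS, which are hard to remove (and using regex on HTML is not proper), there are also inline styles and similar things to deal with.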

It of course comes down to how complex the HTML is. In some cases strip_tags is enough.
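As a rough sketch of the parser-based approach, the snippet below uses PHP's built-in DOMDocument rather than PHP Simple HTML DOM Parser, so it runs without extra files. It removes the script and style elements before reading the text. The function name html_to_text() and the sample markup are assumptions made for this illustration, and character-encoding handling is left out.

```php
<?php
// Sketch: drop <script>/<style> elements with DOMDocument, then read the text.
// html_to_text() is a hypothetical helper name, not part of PHP.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy each live node list to an array first, because removing nodes changes the list.
    foreach (['script', 'style'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse runs of whitespace in the remaining text.
    return trim(preg_replace('/\s+/', ' ', $doc->textContent));
}

// Made-up sample page, used only for illustration.
$sample = '<html><head><style>p { color: red; }</style></head>'
        . '<body><h1>Title</h1> <p>Hello <b>world</b></p>'
        . '<script>alert("hi");</script></body></html>';

$plain = html_to_text($sample);
echo $plain, "\n";
```

For a downloaded page, the string returned by file_get_contents() could be passed to the helper in place of $sample. Inline styles disappear along with the tags that carry them, but the result still includes navigation and other page chrome unless a specific element is selected first.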

Answer 2

Use this code:

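// Requires simple_html_dom.php from the PHP Simple HTML DOM Parser project.
require_once('simple_html_dom.php');
$content = file_get_html('http://en.wikipedia.org/wiki/FYI');
// plaintext gives the element's text with the markup removed.
$title = $content->find('#firstHeading', 0)->plaintext;
$text = $content->find('#bodyContent', 0)->plaintext;
echo $title . $text;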

http://simplehtmldom.sourceforge.net

Download a web page as plain text
