Question

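I've been scratching my head over this for the past hour. Is there a reliable way to extract the text, and nothing else (no code, images, links, styles, or scripts), from an HTML page? I'm trying to extract all the text inside the body of an HTML document.

This includes paragraphs, plain text, and table data.

So far I've tried the simplehtmldom parser as well as file_get_contents, but neither worked. Here is the code: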

<?php

require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {

    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    return $html;

}

$html = file_get_contents('http://www.thefreedictionary.com/contempt');

echo getplaintextintrofromhtml($html);
?>

Here is a screenshot of the output:

https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk

As you can see, the output looks strange, and it doesn't even show the text of the whole page.

Answer 1

I think PHP Simple HTML DOM Parser is the simplest and quickest way. Try http://simplehtmldom.sourceforge.net/

Features:
An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way.
Requires PHP 5+.
Supports invalid HTML.
Finds tags on an HTML page with selectors, just like jQuery.
Extracts contents from HTML in a single line.

Answer 2

I don't see why you think SimpleHTMLDOM doesn't work. You just need to use it correctly: target the body, then use the ->innertext property:

function getplaintextintrofromhtml($url) {
    // require_once avoids a redeclaration error if this function is called more than once
    require_once 'simple_html_dom.php';

    $html = file_get_html($url);
    // point to the body, then get the innertext
    $data = $html->find('body', 0)->innertext;
    return $data;
}

echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');
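
One caveat: in Simple HTML DOM, innertext returns the inner HTML of the element, so the tags are still in the result; the library's plaintext property is the one that returns text. If you would rather not depend on an extra library, here is a rough sketch of the same idea using PHP's built-in DOMDocument. It also shows why plain strip_tags() gives strange output: strip_tags() removes the tags but keeps the contents of script and style elements, so those have to be removed first. The function name extractVisibleText and the sample markup are made up for illustration, and the XML prolog is a commonly used workaround for getting loadHTML() to treat the input as UTF-8.

```php
<?php

// Hypothetical helper (not part of any library): returns the readable
// text of an HTML document's body, without script or style contents.
function extractVisibleText(string $html): string
{
    $doc = new DOMDocument();

    // Real pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8. The prolog hints that to the parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove elements whose contents are code rather than readable text.
    $xpath = new DOMXPath($doc);
    $unwanted = $xpath->query('//script | //style | //noscript');
    foreach (iterator_to_array($unwanted) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Prefer the body; fall back to the whole document if there is none.
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? $body->textContent : $doc->textContent;

    // Collapse the runs of whitespace left behind by the markup.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

// Made-up sample markup so the sketch runs without network access.
$sample = <<<'HTML'
<html><head><style>p { color: red; }</style></head>
<body>
<p>Hello &amp; welcome</p>
<script>var x = 1;</script>
<table><tr><td>cell</td></tr></table>
</body></html>
HTML;

echo extractVisibleText($sample), "\n";
```

For a real page you would pass in the result of file_get_contents($url) instead of the sample string. Keep in mind that textContent simply concatenates the text nodes, so words from adjacent elements can run together when the markup has no whitespace between them.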

Answer 3

Html2Text is a good library for this.

https://github.com/mtibben/html2text

Install it with Composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"

PHP: extract all text from an HTML page
