Question

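I use TinyMCE to allow minimal text formatting on my site. I'd like to convert the HTML it produces to plain text for e-mail. I've been using a class called html2text, but it lacks UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain-text formatting, such as putting underscores around text that previously had <i> tags in the HTML.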

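Does anyone use a similar approach to convert HTML to plain text in PHP? If so, can you recommend any third-party classes I could use? Or how do you best tackle this problem?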

Answer 1

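Use html2text (which converts HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load the HTML and then iterates over the resulting DOM to extract plain text. Usage: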

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

Although it is incomplete, it is open source and contributions are welcome.

Issues with other conversion scripts:

html2text (GPL) is not compatible with the EPL.
lkessler's link (attribution) is not compatible with most open source licenses.

Answer 2

Here is another solution:

$cleaner_input = strip_tags($text);

For other variations of sanitization functions, see:

https://github.com/tazotodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

Answer 3

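Converting from HTML to text using a DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5.

Regarding UTF-8, the write-up on the "howto" page states:

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

The author provides several workarounds and states that version 2 of HTML2Text (using DOMDocument) supports UTF-8.

Note the restrictions on commercial use.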
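One way to reduce the encoding surprises described in that note is to make sure the input really is UTF-8 before handing it to any converter. The following is only a minimal sketch: the function name ensureUtf8 is made up for this example, it requires the mbstring extension, and the ISO-8859-1 fallback is an assumption based on the note's remark about typical system encodings.

```php
<?php
// Hypothetical helper (not part of HTML2Text): return $input as UTF-8.
// Assumes that any input that is not valid UTF-8 is in $fallback encoding.
function ensureUtf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    // Already valid UTF-8: leave it untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise re-encode from the assumed single-byte encoding.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}
```

You would call this on the HTML string first and pass the result to whichever converter you choose.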

Answer 4

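There's the trusty strip_tags function. It's not pretty, though: it only sanitizes. You could combine it with a string replace to get your fancy underscores.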


<?php
// to strip all tags and wrap italics with underscore
strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// to preserve anchors...
str_replace("|a", "<a", strip_tags(str_replace("<a", "|a", $text)));

?>
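The same idea can be extended a little so that tags with attributes are handled and entities are decoded afterwards. This is a rough sketch rather than a complete converter: the function name htmlToMarkedText and the choice of markers (underscores for italics, asterisks for bold) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: map a few inline tags to plain-text markers,
// turn common block endings into line breaks, then strip the rest.
function htmlToMarkedText(string $html): string
{
    // <i>/<em> (with or without attributes) become underscores.
    $html = preg_replace('~</?(?:i|em)\b[^>]*>~i', '_', $html);
    // <b>/<strong> become asterisks.
    $html = preg_replace('~</?(?:b|strong)\b[^>]*>~i', '*', $html);
    // <br> and closing block-level tags become line breaks.
    $html = preg_replace('~<br\s*/?>|</(?:p|div|h[1-6]|li)>~i', "\n", $html);
    // Remove every remaining tag, then decode entities such as &amp;.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($text);
}
```

For example, htmlToMarkedText('<p>Hello <i class="x">world</i> &amp; <b>all</b></p>') gives "Hello _world_ & *all*".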

Answer 5

You can use lynx with the -stdin and -dump options to achieve this:

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}

Answer 6

You can try this function:

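function html2text($Document) {
    // Patterns are applied in order; each has a replacement at the same index.
    $Rules = array ('@<script[^>]*?>.*?</script>@si', // strip out JavaScript
                    '@<[\/\!]*?[^<>]*?>@si',          // strip out HTML tags
                    '@([\r\n])[\s]+@',                // collapse whitespace after a line break
                    '@&(quot|#34);@i',                // replace HTML entities
                    '@&(amp|#38);@i',
                    '@&(lt|#60);@i',
                    '@&(gt|#62);@i',
                    '@&(nbsp|#160);@i',
                    '@&(iexcl|#161);@i',
                    '@&(cent|#162);@i',
                    '@&(pound|#163);@i',
                    '@&(copy|#169);@i',
                    '@&(reg|#174);@i'
             );
    $Replace = array ('',
                      '',
                      '\1',
                      '"',
                      '&',
                      '<',
                      '>',
                      ' ',
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      chr(174)
                );
    $Document = preg_replace($Rules, $Replace, $Document);
    // The /e modifier was removed in PHP 7, so the remaining numeric entities
    // are decoded with a callback. Like the entries above, chr() yields
    // single-byte (ISO-8859-1) characters, not UTF-8.
    return preg_replace_callback('@&#(\d+);@', function ($m) {
        return chr((int) $m[1]);
    }, $Document);
}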

Answer 7

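I didn't find any existing solution that fit my case: converting simple HTML e-mails to simple plain-text files.

I've opened up this repository in the hope that it helps someone. MIT licensed, by the way :)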

https://github.com/RobQuistNL/SimpleHtmlToText

Example:

$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);

Returns:

**This is HTML**
### Header ###


Newlines

Answer 8

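If you want to convert the HTML special characters rather than just remove them, as well as strip tags and prepare the text as plain text, this is the solution that worked for me: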

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';

echo htmlToPlainText($string);
// this is ( ) a test
// Yes this is! & does it get "processed"?

html_entity_decode with ENT_QUOTES | ENT_XML1 converts things like &#39;, htmlspecialchars_decode converts things like &amp;, html_entity_decode converts things like &lt;, and strip_tags removes any HTML tags left over.

Answer 9

Markdownify converts HTML to Markdown, a plain-text formatting system used on this very site.

Answer 10

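Markdownify works great for me! One thing worth mentioning: it fully supports UTF-8, which was the main reason I went looking for a solution other than html2text (mentioned earlier in this thread).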

Answer 11

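I ran into the same problem as the OP, and the solutions I tried from the top answers above did not work for my scenarios. See why at the end.

Instead, I found this helpful script. To avoid confusion, let's call it html2text_roundcube. It is available under the GPL: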

https://github.com/mtibben/html2text

It is actually an updated version of an already mentioned script - http://www.chuggnutt.com/html2text.php - updated by RoundCube mail.

Usage:

$h2t = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
echo $h2t->getText(); // prints Hello, "WORLD"

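Why html2text_roundcube proved better than the others:

The script http://www.chuggnutt.com/html2text.php did not work out of the box for cases with special HTML codes/names (e.g. &auml;) or unpaired quotes (e.g. 25" Monitor).
The script https://github.com/soundasleep/html2text had no option to hide or group the links at the end of the text, which makes a typical HTML page look bloated with links in plain-text form. Customizing its code to change how the conversion is done is also not as straightforward as simply editing an array in html2text_roundcube.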

Answer 12

function plainText($text)
{
    // Keep <br>, <p> and <li> so they can be turned into line breaks below
    $text = strip_tags($text, '<br><p><li>');
    // Replace every remaining tag with a line break
    $text = preg_replace('/<[^>]*>/', PHP_EOL, $text);

    return $text;
}

$text = "string 1 string 2 <ul><li>string 3</li><li>string 4</li></ul>string 5";

echo plainText($text);

Output:
string 1 string 2
string 3

string 4
string 5

Answer 13

I just found the PHP function strip_tags(), and it works in my case.

I tried to convert the following HTML:

<p><span style="font-family: 'Verdana','sans-serif'; color: black; font-size: 7.5pt;">&nbsp;</span>Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry's lackluster performance during this time,  revenue has grown at an average annual rate&nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&nbsp; So despite the downturn, how were we  able to manage growth as an industry?</p>

After applying the strip_tags() function, I got the following output:

&amp;nbsp;Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&amp;nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry&#039;s lackluster performance during this time,  revenue has grown at an average annual rate&amp;nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&amp;nbsp; So despite the downturn, how were we  able to manage growth as an industry?
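As the output shows, strip_tags() leaves entities such as &nbsp; in place. A possible follow-up step, sketched here under the assumption that the text is UTF-8, is to decode the entities and normalize the spacing. The function name stripTagsAndDecode is made up for this example.

```php
<?php
// Hypothetical helper: strip tags, then decode entities and tidy spaces.
function stripTagsAndDecode(string $html): string
{
    $text = strip_tags($html);
    // Decode named and numeric entities (&nbsp;, &#039;, &amp;, ...).
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0; replace it with an ordinary space.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    return trim($text);
}
```

Whether to keep non-breaking spaces or collapse double spaces is a matter of taste; both steps can be dropped without affecting the entity decoding.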

Answer 14

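If you don't want to strip the tags completely and want to keep the content inside them, you can use DOMDocument and extract the textContent of the root node, like this: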

function html2text($html) {
    $dom = new DOMDocument();
    $dom->loadHTML("<body>" . strip_tags($html, '<b><a><i><div><span><p>') . "</body>");
    $xpath = new DOMXPath($dom);
    $node = $xpath->query('body')->item(0);
    return $node->textContent; // text
}

$p = 'this is <b>test</b>. <p>how are <i>you?</i>. <a href="#">I\'m fine!</a></p>';
print html2text($p);
// this is test. how are you?. I'm fine!

One advantage of this approach is that it does not require any external packages.

Answer 15

For text in UTF-8, mb_convert_encoding worked for me. To process everything regardless of errors, make sure you use the "@" operator.

The basic code I use is:

$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent;

If you need something more advanced, you can analyze the nodes iteratively, but you will run into many problems with whitespace.

I have implemented a converter based on what I describe here. If you are interested, you can download it from git: https://github.com/kranemora/html2text

It can serve as a reference for writing your own.
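As a rough illustration of the node-by-node idea mentioned in this answer, the sketch below walks the DOM and adds a line break after a few block-level elements. The function names nodeToText and htmlToTextWithBreaks and the list of block tags are assumptions made for this example; they are not taken from the linked converter. Two further substitutions are made on purpose: a meta charset tag is prepended instead of calling mb_convert_encoding with 'HTML-ENTITIES' (which is deprecated as of PHP 8.2), and libxml_use_internal_errors() is used instead of "@".

```php
<?php
// Hypothetical helper: collect the text of a node, adding "\n" after block elements.
function nodeToText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return ''; // skip comments and other node types
    }
    // Assumed list of tags that should end with a line break.
    $blocks = ['p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'];
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= nodeToText($child);
    }
    if (in_array(strtolower($node->nodeName), $blocks, true)) {
        $text .= "\n";
    }
    return $text;
}

// Hypothetical wrapper: parse UTF-8 HTML and return text with line breaks.
function htmlToTextWithBreaks(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom = new DOMDocument();
    // The meta tag tells libxml that the input is UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    // Limit consecutive line breaks to two and trim the ends.
    return trim(preg_replace("/\n{3,}/", "\n\n", nodeToText($body)));
}
```

Whitespace inside the source HTML is passed through unchanged, which is exactly the kind of problem this answer warns about; a fuller converter would need to normalize it.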

Converting HTML to plain text in PHP for e-mail

15 answers: