Question

I have a bunch of HTML data that I'm writing to a PDF file using PHP. In the PDF, I want all of the HTML stripped and cleaned up. For example:

<ul>
    <li>First list item</li>
    <li>Second list item which is quite a bit longer</li>
    <li>List item with apostrophe 's 's</li>
</ul>

should become:

First list item
Second list item which is quite a bit longer
List item with apostrophe 's 's

However, if I just use strip_tags(), I get something like this:

   First list item&#8232;

   Second list item which is quite a bit
longer&#8232;

   List item with apostrophe &rsquo;s &rsquo;s

Note the indentation of the output as well.

Any tips on how to properly clean up the HTML so that I get nice, clean strings without messy whitespace and odd characters?

Thanks :)

Answer 1

Those characters appear to be HTML entities. Try:

html_entity_decode( strip_tags( $my_html_code ) );
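
A slightly fuller sketch of the same idea, in case it helps. The helper name html_to_text is made up for illustration, and it assumes the input is UTF-8. The flags and charset are passed explicitly so the result does not depend on the defaults of a particular PHP version, and the line separator that &#8232; decodes to (U+2028) is dropped afterwards:

```php
<?php
// Hypothetical helper; the name html_to_text is not from the original answer.
// Assumes the input and the desired output are UTF-8.
function html_to_text(string $html): string
{
    // Remove the markup first, leaving text plus entities.
    $text = strip_tags($html);

    // Decode named and numeric entities with explicit flags and charset.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &#8232; and &#8233; decode to U+2028 / U+2029 (line and paragraph
    // separators), which tend to show up as odd characters; drop them.
    return preg_replace('/[\x{2028}\x{2029}]/u', '', $text);
}

echo html_to_text('<li>List item with apostrophe &rsquo;s &rsquo;s&#8232;</li>'), "\n";
```

Note that &rsquo; decodes to the typographic apostrophe (U+2019), not the plain ASCII one. If the PDF font lacks that glyph, replacing it with ' afterwards is one option.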

Answer 2

You can decode the result of strip_tags with html_entity_decode, or remove the entities altogether with preg_replace:

$text = strip_tags($html_text);
$content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$text );

To remove the whitespace at the start of each line, use ltrim:

$content = join("\n", array_map("ltrim", explode("\n", $content )));
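
If the blank lines between items are also unwanted, a variation along these lines might work. It is only a sketch: trim_lines is a made-up name, and it assumes the text is valid UTF-8 (preg_split with the u modifier returns false otherwise).

```php
<?php
// Hypothetical helper, not from the original answer.
// Assumes $text is valid UTF-8.
function trim_lines(string $text): string
{
    // \R matches any newline sequence (\n, \r\n, \r, ...), so mixed line
    // endings are handled; trim() then strips both ends of every line.
    $lines = array_map('trim', preg_split('/\R/u', $text));

    // Keep only the lines that still contain something.
    $lines = array_filter($lines, function ($line) {
        return $line !== '';
    });

    return implode("\n", $lines);
}

echo trim_lines("   First list item\n\n   Second list item\r\n"), "\n";
```

Using trim instead of ltrim also removes trailing spaces, which is usually harmless for PDF output.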

To keep the apostrophes, use this:

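$text = strip_tags($html_text);
// Convert the apostrophe entity first so the next step does not delete it.
$text = str_replace("&rsquo;","'", $text);
// Remove any entities that are still left.
$content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$text );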

Answer 3

Use the PHP Tidy library to clean up your HTML. In your case, though, I would use the DOMDocument class to extract the data from the HTML.
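
A rough sketch of the DOMDocument route, assuming the DOM extension is available and that each <li> is the unit of text you want. The sample $html and the variable names are illustrative only:

```php
<?php
// Sketch only: $html and the choice of <li> as the unit of text are assumptions.
$html = '<ul>
    <li>First list item</li>
    <li>List item with apostrophe &rsquo;s &rsquo;s</li>
</ul>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
// The XML declaration prefix is a common way to hint UTF-8 input to libxml.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    // textContent is already entity-decoded; collapse runs of whitespace
    // (including wrapped lines) into single spaces and trim the ends.
    $items[] = trim(preg_replace('/\s+/u', ' ', $li->textContent));
}

echo implode("\n", $items), "\n";
```

Because the parser decodes entities itself, no separate html_entity_decode call is needed here, and collapsing whitespace per element takes care of the indentation and wrapped lines in one step.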