PHP - html_entity_decode does not decode everything

Time: 2014-01-05 21:26:27

Tags: php html parsing dom

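I am parsing an HTML page. At some point I take the text inside a div and use html_entity_decode to print it.

The problem is that the page contains characters like this star ★, or other shapes such as ⬛︎, ◄, ◉, etc. I have checked that these characters are not encoded as entities in the page source; they appear there exactly as you would normally see them.

The page uses charset="UTF-8".

So, when I use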

html_entity_decode($string, ENT_QUOTES, 'UTF-8');

the star, for example, gets "decoded" as â˜

I obtain $string with:
document.getElementById("id-of-div").innerText

I want to decode them correctly. How can I do this in PHP?

Note: I also tried htmlspecialchars_decode($string, ENT_QUOTES); and it gives the same result.

1 Answer:

Answer 0: (score: 5)

I tried to reproduce your problem with this simple PHP script:

<?php
  // Make sure our client knows we're sending UTF-8
  header('Content-Type: text/plain; charset=utf-8');
  $string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
  echo 'String: ' . $string . "\n";
  echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');

As expected, the output is:

String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".
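One way to confirm that the shapes really pass through untouched is to compare their raw bytes before and after decoding. The following is only a minimal sketch: the sample string and the variable names $before and $after are made up for illustration, and the "\u{...}" escape assumes PHP 7.0 or later.

```php
<?php
// Hypothetical sample input: a literal star (U+2605), a space, and one real entity.
$string  = "\u{2605} &quot;test&quot;";
$decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

// UTF-8 encodes the star as three bytes; take them from both strings.
$before = bin2hex(substr($string, 0, 3));
$after  = bin2hex(substr($decoded, 0, 3));

echo "Before: $before\n";
echo "After:  $after\n";

// The star bytes should match, and only the &quot; entities should have changed.
echo 'Star unchanged: ', var_export($before === $after, true), "\n";
echo 'Entities decoded: ', var_export(substr($decoded, 4) === '"test"', true), "\n";
```

If both checks print true, the decoding step is not what damages the characters, and the problem has to be somewhere after it.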

If I change the charset in the header to iso-8859-1, I see:

String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".

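So I would say your problem is a display problem. As you would expect, the "funny" characters are completely unaffected by html_entity_decode. It is just that whatever code you have afterwards, or whatever you are using to view the output, is wrongly using iso-8859-1 to display them.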
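To see how that kind of garbling can arise, the misreading can be simulated directly in PHP. This is a rough sketch, not a diagnosis of your exact setup: it assumes the mbstring extension is loaded and PHP 7.0 or later, and it uses Windows-1252 because browsers generally treat the iso-8859-1 label as Windows-1252. The names $star, $misread and $restored are illustrative only.

```php
<?php
// Declare the real encoding of the output (harmless no-op on the command line).
header('Content-Type: text/plain; charset=utf-8');

$star = "\u{2605}"; // the star, stored as the UTF-8 bytes E2 98 85

// Simulate a viewer that wrongly treats those three bytes as Windows-1252:
// each byte is taken as one character and re-encoded as UTF-8.
$misread = mb_convert_encoding($star, 'UTF-8', 'Windows-1252');

// Reversing the conversion recovers the original bytes, so nothing was lost.
$restored = mb_convert_encoding($misread, 'Windows-1252', 'UTF-8');

echo 'Original:  ', $star, "\n";
echo 'Misread:   ', $misread, ' (', mb_strlen($misread, 'UTF-8'), " characters)\n";
echo 'Restored:  ', $restored, "\n";
```

The misread line should show three accented or punctuation characters in place of the single star, beginning with the same â seen in the question. Because the round trip restores the star, the data itself is intact; the fix is to make whatever displays the string treat it as UTF-8, for example by sending a charset=utf-8 Content-Type header as in the script above.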