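I'm parsing an HTML page. At some point I grab the text inside a div and use html_entity_decode to print that text.
The problem is that the page contains characters like this star ★, or other shapes such as ⬛︎, ◄, ◉, etc. I have checked that these characters are not encoded in the source page; they appear there just as you would normally see them.
The page uses charset="UTF-8".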
So when I use html_entity_decode($string, ENT_QUOTES, 'UTF-8'); the star, for example, gets "decoded" to â˜
$string is obtained using document.getElementById("id-of-div").innerText
I want to decode them correctly. How can I do this in PHP?
Note: I tried htmlspecialchars_decode($string, ENT_QUOTES); and it gives the same result.
Answer 0 (score: 5)
I tried to reproduce your problem with this simple PHP script:
<?php
// Make sure our client knows we're sending UTF-8
header('Content-Type: text/plain; charset=utf-8');
// The string mixes literal UTF-8 symbols with HTML entities (&quot;)
$string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
echo 'String: ' . $string . "\n";
echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');
As expected, the output is:
String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".
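A quick way to confirm that the decode step is not what damages the symbols is to compare bytes before and after. The following is only a minimal sketch: the sample string and the variable names are made up for illustration, and the star is written as escaped bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical sample input: a literal UTF-8 star (bytes e2 98 85)
// followed by text that contains an HTML entity.
$input = "star \xE2\x98\x85 and &quot;quoted&quot;";

$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// The entity is turned into a real double-quote character...
$entityDecoded = (strpos($decoded, '"quoted"') !== false);

// ...while the three bytes of the star are still present, untouched.
$starIntact = (strpos($decoded, "\xE2\x98\x85") !== false);

var_dump($entityDecoded, $starIntact);
echo bin2hex("\xE2\x98\x85"), "\n";
```

If both checks come back true, the string leaving html_entity_decode is still valid UTF-8, and any garbling has to happen later, when the bytes are displayed.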
If I change the charset in the header to iso-8859-1, I see:
String: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: This is a "test".
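So I'd say your problem is a display problem. As you would expect, the "funny" characters are completely unaffected by html_entity_decode. It's just that whatever code you have, or whatever you are using to view the output, is incorrectly using iso-8859-1 to display them.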
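To make the diagnosis concrete, the misreading can be simulated in PHP itself. This is a rough sketch under a few assumptions: the mbstring extension is available, and misreadAsLatin1 is a hypothetical helper name introduced here, not something from the question or from PHP.

```php
<?php
// Assumes the mbstring extension is loaded.

// Hypothetical helper: treat each byte of a UTF-8 string as one
// ISO-8859-1 character and re-encode the result as UTF-8. This mimics
// what a viewer does when it applies the wrong charset.
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$star    = "\xE2\x98\x85";          // U+2605 BLACK STAR, three bytes in UTF-8
$garbled = misreadAsLatin1($star);  // three separate characters, the first is "â"

echo bin2hex($star), "\n";     // e29885
echo bin2hex($garbled), "\n";  // c3a2c298c285
echo mb_check_encoding($star, 'UTF-8') ? "valid UTF-8\n" : "not valid UTF-8\n";
```

The original bytes are valid UTF-8 throughout, so the fix belongs on the output side: make sure whatever displays the text is told it is UTF-8, for example with the Content-Type header used in the script above, rather than changing the html_entity_decode call.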