Convert named HTML entities to numeric HTML entities

Time: 2012-06-24 10:39:07

Tags: php unicode html-entities

Is there a PHP function that converts named HTML entities to their respective numeric HTML entities?

For example:

$str = "Oggi &egrave; un bel giorno";
echo entities_to_unicode($str); // Oggi &#232; un bel giorno

Thanks in advance, and have a nice day!

6 answers:

Answer 0 (score: 20)

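You are looking for a simple translation function from named HTML entities to their numeric counterparts.

This can be done with a translation table (that is, an array) and a string translation function (strtr):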

$translated = strtr($string, $HTML401NamedToNumeric);

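This works whether $string is UTF-8 encoded or uses a single-byte character set, since the entity names and the numeric references are plain ASCII.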
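A minimal sketch of the strtr approach on a UTF-8 string. The three-entry $excerpt array and the sample sentence are hypothetical stand-ins for a full table and real input:

```php
<?php
// Hypothetical three-entry excerpt of a named-to-numeric table.
$excerpt = array(
    '&nbsp;'   => '&#160;',
    '&egrave;' => '&#232;',
    '&euro;'   => '&#8364;',
);

$string = 'Oggi &egrave; un bel&nbsp;giorno, costa 5 &euro; &amp; basta';

// strtr() replaces every key it finds in a single pass and never rescans
// its own output; entities missing from the table (here &amp;) stay as is.
echo strtr($string, $excerpt), "\n";
// Oggi &#232; un bel&#160;giorno, costa 5 &#8364; &amp; basta
```

Because strtr with an array always tries the longest matching key first, the order of the entries in the table does not matter.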

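An example array of the HTML 4.01 named entities, as specified by the W3C, follows. It contains 252 entities. If you also want to support XHTML, there is one more (I put it at the end):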

$HTML401NamedToNumeric = array(
    '&nbsp;'     => '&#160;',  # no-break space = non-breaking space, U+00A0 ISOnum
    '&iexcl;'    => '&#161;',  # inverted exclamation mark, U+00A1 ISOnum
    '&cent;'     => '&#162;',  # cent sign, U+00A2 ISOnum
    '&pound;'    => '&#163;',  # pound sign, U+00A3 ISOnum
    '&curren;'   => '&#164;',  # currency sign, U+00A4 ISOnum
    '&yen;'      => '&#165;',  # yen sign = yuan sign, U+00A5 ISOnum
    '&brvbar;'   => '&#166;',  # broken bar = broken vertical bar, U+00A6 ISOnum
    '&sect;'     => '&#167;',  # section sign, U+00A7 ISOnum
    '&uml;'      => '&#168;',  # diaeresis = spacing diaeresis, U+00A8 ISOdia
    '&copy;'     => '&#169;',  # copyright sign, U+00A9 ISOnum
    '&ordf;'     => '&#170;',  # feminine ordinal indicator, U+00AA ISOnum
    '&laquo;'    => '&#171;',  # left-pointing double angle quotation mark = left pointing guillemet, U+00AB ISOnum
    '&not;'      => '&#172;',  # not sign, U+00AC ISOnum
    '&shy;'      => '&#173;',  # soft hyphen = discretionary hyphen, U+00AD ISOnum
    '&reg;'      => '&#174;',  # registered sign = registered trade mark sign, U+00AE ISOnum
    '&macr;'     => '&#175;',  # macron = spacing macron = overline = APL overbar, U+00AF ISOdia
    '&deg;'      => '&#176;',  # degree sign, U+00B0 ISOnum
    '&plusmn;'   => '&#177;',  # plus-minus sign = plus-or-minus sign, U+00B1 ISOnum
    '&sup2;'     => '&#178;',  # superscript two = superscript digit two = squared, U+00B2 ISOnum
    '&sup3;'     => '&#179;',  # superscript three = superscript digit three = cubed, U+00B3 ISOnum
    '&acute;'    => '&#180;',  # acute accent = spacing acute, U+00B4 ISOdia
    '&micro;'    => '&#181;',  # micro sign, U+00B5 ISOnum
    '&para;'     => '&#182;',  # pilcrow sign = paragraph sign, U+00B6 ISOnum
    '&middot;'   => '&#183;',  # middle dot = Georgian comma = Greek middle dot, U+00B7 ISOnum
    '&cedil;'    => '&#184;',  # cedilla = spacing cedilla, U+00B8 ISOdia
    '&sup1;'     => '&#185;',  # superscript one = superscript digit one, U+00B9 ISOnum
    '&ordm;'     => '&#186;',  # masculine ordinal indicator, U+00BA ISOnum
    '&raquo;'    => '&#187;',  # right-pointing double angle quotation mark = right pointing guillemet, U+00BB ISOnum
    '&frac14;'   => '&#188;',  # vulgar fraction one quarter = fraction one quarter, U+00BC ISOnum
    '&frac12;'   => '&#189;',  # vulgar fraction one half = fraction one half, U+00BD ISOnum
    '&frac34;'   => '&#190;',  # vulgar fraction three quarters = fraction three quarters, U+00BE ISOnum
    '&iquest;'   => '&#191;',  # inverted question mark = turned question mark, U+00BF ISOnum
    '&Agrave;'   => '&#192;',  # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
    '&Aacute;'   => '&#193;',  # latin capital letter A with acute, U+00C1 ISOlat1
    '&Acirc;'    => '&#194;',  # latin capital letter A with circumflex, U+00C2 ISOlat1
    '&Atilde;'   => '&#195;',  # latin capital letter A with tilde, U+00C3 ISOlat1
    '&Auml;'     => '&#196;',  # latin capital letter A with diaeresis, U+00C4 ISOlat1
    '&Aring;'    => '&#197;',  # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
    '&AElig;'    => '&#198;',  # latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1
    '&Ccedil;'   => '&#199;',  # latin capital letter C with cedilla, U+00C7 ISOlat1
    '&Egrave;'   => '&#200;',  # latin capital letter E with grave, U+00C8 ISOlat1
    '&Eacute;'   => '&#201;',  # latin capital letter E with acute, U+00C9 ISOlat1
    '&Ecirc;'    => '&#202;',  # latin capital letter E with circumflex, U+00CA ISOlat1
    '&Euml;'     => '&#203;',  # latin capital letter E with diaeresis, U+00CB ISOlat1
    '&Igrave;'   => '&#204;',  # latin capital letter I with grave, U+00CC ISOlat1
    '&Iacute;'   => '&#205;',  # latin capital letter I with acute, U+00CD ISOlat1
    '&Icirc;'    => '&#206;',  # latin capital letter I with circumflex, U+00CE ISOlat1
    '&Iuml;'     => '&#207;',  # latin capital letter I with diaeresis, U+00CF ISOlat1
    '&ETH;'      => '&#208;',  # latin capital letter ETH, U+00D0 ISOlat1
    '&Ntilde;'   => '&#209;',  # latin capital letter N with tilde, U+00D1 ISOlat1
    '&Ograve;'   => '&#210;',  # latin capital letter O with grave, U+00D2 ISOlat1
    '&Oacute;'   => '&#211;',  # latin capital letter O with acute, U+00D3 ISOlat1
    '&Ocirc;'    => '&#212;',  # latin capital letter O with circumflex, U+00D4 ISOlat1
    '&Otilde;'   => '&#213;',  # latin capital letter O with tilde, U+00D5 ISOlat1
    '&Ouml;'     => '&#214;',  # latin capital letter O with diaeresis, U+00D6 ISOlat1
    '&times;'    => '&#215;',  # multiplication sign, U+00D7 ISOnum
    '&Oslash;'   => '&#216;',  # latin capital letter O with stroke = latin capital letter O slash, U+00D8 ISOlat1
    '&Ugrave;'   => '&#217;',  # latin capital letter U with grave, U+00D9 ISOlat1
    '&Uacute;'   => '&#218;',  # latin capital letter U with acute, U+00DA ISOlat1
    '&Ucirc;'    => '&#219;',  # latin capital letter U with circumflex, U+00DB ISOlat1
    '&Uuml;'     => '&#220;',  # latin capital letter U with diaeresis, U+00DC ISOlat1
    '&Yacute;'   => '&#221;',  # latin capital letter Y with acute, U+00DD ISOlat1
    '&THORN;'    => '&#222;',  # latin capital letter THORN, U+00DE ISOlat1
    '&szlig;'    => '&#223;',  # latin small letter sharp s = ess-zed, U+00DF ISOlat1
    '&agrave;'   => '&#224;',  # latin small letter a with grave = latin small letter a grave, U+00E0 ISOlat1
    '&aacute;'   => '&#225;',  # latin small letter a with acute, U+00E1 ISOlat1
    '&acirc;'    => '&#226;',  # latin small letter a with circumflex, U+00E2 ISOlat1
    '&atilde;'   => '&#227;',  # latin small letter a with tilde, U+00E3 ISOlat1
    '&auml;'     => '&#228;',  # latin small letter a with diaeresis, U+00E4 ISOlat1
    '&aring;'    => '&#229;',  # latin small letter a with ring above = latin small letter a ring, U+00E5 ISOlat1
    '&aelig;'    => '&#230;',  # latin small letter ae = latin small ligature ae, U+00E6 ISOlat1
    '&ccedil;'   => '&#231;',  # latin small letter c with cedilla, U+00E7 ISOlat1
    '&egrave;'   => '&#232;',  # latin small letter e with grave, U+00E8 ISOlat1
    '&eacute;'   => '&#233;',  # latin small letter e with acute, U+00E9 ISOlat1
    '&ecirc;'    => '&#234;',  # latin small letter e with circumflex, U+00EA ISOlat1
    '&euml;'     => '&#235;',  # latin small letter e with diaeresis, U+00EB ISOlat1
    '&igrave;'   => '&#236;',  # latin small letter i with grave, U+00EC ISOlat1
    '&iacute;'   => '&#237;',  # latin small letter i with acute, U+00ED ISOlat1
    '&icirc;'    => '&#238;',  # latin small letter i with circumflex, U+00EE ISOlat1
    '&iuml;'     => '&#239;',  # latin small letter i with diaeresis, U+00EF ISOlat1
    '&eth;'      => '&#240;',  # latin small letter eth, U+00F0 ISOlat1
    '&ntilde;'   => '&#241;',  # latin small letter n with tilde, U+00F1 ISOlat1
    '&ograve;'   => '&#242;',  # latin small letter o with grave, U+00F2 ISOlat1
    '&oacute;'   => '&#243;',  # latin small letter o with acute, U+00F3 ISOlat1
    '&ocirc;'    => '&#244;',  # latin small letter o with circumflex, U+00F4 ISOlat1
    '&otilde;'   => '&#245;',  # latin small letter o with tilde, U+00F5 ISOlat1
    '&ouml;'     => '&#246;',  # latin small letter o with diaeresis, U+00F6 ISOlat1
    '&divide;'   => '&#247;',  # division sign, U+00F7 ISOnum
    '&oslash;'   => '&#248;',  # latin small letter o with stroke, = latin small letter o slash, U+00F8 ISOlat1
    '&ugrave;'   => '&#249;',  # latin small letter u with grave, U+00F9 ISOlat1
    '&uacute;'   => '&#250;',  # latin small letter u with acute, U+00FA ISOlat1
    '&ucirc;'    => '&#251;',  # latin small letter u with circumflex, U+00FB ISOlat1
    '&uuml;'     => '&#252;',  # latin small letter u with diaeresis, U+00FC ISOlat1
    '&yacute;'   => '&#253;',  # latin small letter y with acute, U+00FD ISOlat1
    '&thorn;'    => '&#254;',  # latin small letter thorn, U+00FE ISOlat1
    '&yuml;'     => '&#255;',  # latin small letter y with diaeresis, U+00FF ISOlat1
    '&fnof;'     => '&#402;',  # latin small f with hook = function = florin, U+0192 ISOtech
    '&Alpha;'    => '&#913;',  # greek capital letter alpha, U+0391
    '&Beta;'     => '&#914;',  # greek capital letter beta, U+0392
    '&Gamma;'    => '&#915;',  # greek capital letter gamma, U+0393 ISOgrk3
    '&Delta;'    => '&#916;',  # greek capital letter delta, U+0394 ISOgrk3
    '&Epsilon;'  => '&#917;',  # greek capital letter epsilon, U+0395
    '&Zeta;'     => '&#918;',  # greek capital letter zeta, U+0396
    '&Eta;'      => '&#919;',  # greek capital letter eta, U+0397
    '&Theta;'    => '&#920;',  # greek capital letter theta, U+0398 ISOgrk3
    '&Iota;'     => '&#921;',  # greek capital letter iota, U+0399
    '&Kappa;'    => '&#922;',  # greek capital letter kappa, U+039A
    '&Lambda;'   => '&#923;',  # greek capital letter lambda, U+039B ISOgrk3
    '&Mu;'       => '&#924;',  # greek capital letter mu, U+039C
    '&Nu;'       => '&#925;',  # greek capital letter nu, U+039D
    '&Xi;'       => '&#926;',  # greek capital letter xi, U+039E ISOgrk3
    '&Omicron;'  => '&#927;',  # greek capital letter omicron, U+039F
    '&Pi;'       => '&#928;',  # greek capital letter pi, U+03A0 ISOgrk3
    '&Rho;'      => '&#929;',  # greek capital letter rho, U+03A1
    '&Sigma;'    => '&#931;',  # greek capital letter sigma, U+03A3 ISOgrk3
    '&Tau;'      => '&#932;',  # greek capital letter tau, U+03A4
    '&Upsilon;'  => '&#933;',  # greek capital letter upsilon, U+03A5 ISOgrk3
    '&Phi;'      => '&#934;',  # greek capital letter phi, U+03A6 ISOgrk3
    '&Chi;'      => '&#935;',  # greek capital letter chi, U+03A7
    '&Psi;'      => '&#936;',  # greek capital letter psi, U+03A8 ISOgrk3
    '&Omega;'    => '&#937;',  # greek capital letter omega, U+03A9 ISOgrk3
    '&alpha;'    => '&#945;',  # greek small letter alpha, U+03B1 ISOgrk3
    '&beta;'     => '&#946;',  # greek small letter beta, U+03B2 ISOgrk3
    '&gamma;'    => '&#947;',  # greek small letter gamma, U+03B3 ISOgrk3
    '&delta;'    => '&#948;',  # greek small letter delta, U+03B4 ISOgrk3
    '&epsilon;'  => '&#949;',  # greek small letter epsilon, U+03B5 ISOgrk3
    '&zeta;'     => '&#950;',  # greek small letter zeta, U+03B6 ISOgrk3
    '&eta;'      => '&#951;',  # greek small letter eta, U+03B7 ISOgrk3
    '&theta;'    => '&#952;',  # greek small letter theta, U+03B8 ISOgrk3
    '&iota;'     => '&#953;',  # greek small letter iota, U+03B9 ISOgrk3
    '&kappa;'    => '&#954;',  # greek small letter kappa, U+03BA ISOgrk3
    '&lambda;'   => '&#955;',  # greek small letter lambda, U+03BB ISOgrk3
    '&mu;'       => '&#956;',  # greek small letter mu, U+03BC ISOgrk3
    '&nu;'       => '&#957;',  # greek small letter nu, U+03BD ISOgrk3
    '&xi;'       => '&#958;',  # greek small letter xi, U+03BE ISOgrk3
    '&omicron;'  => '&#959;',  # greek small letter omicron, U+03BF NEW
    '&pi;'       => '&#960;',  # greek small letter pi, U+03C0 ISOgrk3
    '&rho;'      => '&#961;',  # greek small letter rho, U+03C1 ISOgrk3
    '&sigmaf;'   => '&#962;',  # greek small letter final sigma, U+03C2 ISOgrk3
    '&sigma;'    => '&#963;',  # greek small letter sigma, U+03C3 ISOgrk3
    '&tau;'      => '&#964;',  # greek small letter tau, U+03C4 ISOgrk3
    '&upsilon;'  => '&#965;',  # greek small letter upsilon, U+03C5 ISOgrk3
    '&phi;'      => '&#966;',  # greek small letter phi, U+03C6 ISOgrk3
    '&chi;'      => '&#967;',  # greek small letter chi, U+03C7 ISOgrk3
    '&psi;'      => '&#968;',  # greek small letter psi, U+03C8 ISOgrk3
    '&omega;'    => '&#969;',  # greek small letter omega, U+03C9 ISOgrk3
    '&thetasym;' => '&#977;',  # greek small letter theta symbol, U+03D1 NEW
    '&upsih;'    => '&#978;',  # greek upsilon with hook symbol, U+03D2 NEW
    '&piv;'      => '&#982;',  # greek pi symbol, U+03D6 ISOgrk3
    '&bull;'     => '&#8226;', # bullet = black small circle, U+2022 ISOpub
    '&hellip;'   => '&#8230;', # horizontal ellipsis = three dot leader, U+2026 ISOpub
    '&prime;'    => '&#8242;', # prime = minutes = feet, U+2032 ISOtech
    '&Prime;'    => '&#8243;', # double prime = seconds = inches, U+2033 ISOtech
    '&oline;'    => '&#8254;', # overline = spacing overscore, U+203E NEW
    '&frasl;'    => '&#8260;', # fraction slash, U+2044 NEW
    '&weierp;'   => '&#8472;', # script capital P = power set = Weierstrass p, U+2118 ISOamso
    '&image;'    => '&#8465;', # blackletter capital I = imaginary part, U+2111 ISOamso
    '&real;'     => '&#8476;', # blackletter capital R = real part symbol, U+211C ISOamso
    '&trade;'    => '&#8482;', # trade mark sign, U+2122 ISOnum
    '&alefsym;'  => '&#8501;', # alef symbol = first transfinite cardinal, U+2135 NEW
    '&larr;'     => '&#8592;', # leftwards arrow, U+2190 ISOnum
    '&uarr;'     => '&#8593;', # upwards arrow, U+2191 ISOnum
    '&rarr;'     => '&#8594;', # rightwards arrow, U+2192 ISOnum
    '&darr;'     => '&#8595;', # downwards arrow, U+2193 ISOnum
    '&harr;'     => '&#8596;', # left right arrow, U+2194 ISOamsa
    '&crarr;'    => '&#8629;', # downwards arrow with corner leftwards = carriage return, U+21B5 NEW
    '&lArr;'     => '&#8656;', # leftwards double arrow, U+21D0 ISOtech
    '&uArr;'     => '&#8657;', # upwards double arrow, U+21D1 ISOamsa
    '&rArr;'     => '&#8658;', # rightwards double arrow, U+21D2 ISOtech
    '&dArr;'     => '&#8659;', # downwards double arrow, U+21D3 ISOamsa
    '&hArr;'     => '&#8660;', # left right double arrow, U+21D4 ISOamsa
    '&forall;'   => '&#8704;', # for all, U+2200 ISOtech
    '&part;'     => '&#8706;', # partial differential, U+2202 ISOtech
    '&exist;'    => '&#8707;', # there exists, U+2203 ISOtech
    '&empty;'    => '&#8709;', # empty set = null set = diameter, U+2205 ISOamso
    '&nabla;'    => '&#8711;', # nabla = backward difference, U+2207 ISOtech
    '&isin;'     => '&#8712;', # element of, U+2208 ISOtech
    '&notin;'    => '&#8713;', # not an element of, U+2209 ISOtech
    '&ni;'       => '&#8715;', # contains as member, U+220B ISOtech
    '&prod;'     => '&#8719;', # n-ary product = product sign, U+220F ISOamsb
    '&sum;'      => '&#8721;', # n-ary summation, U+2211 ISOamsb
    '&minus;'    => '&#8722;', # minus sign, U+2212 ISOtech
    '&lowast;'   => '&#8727;', # asterisk operator, U+2217 ISOtech
    '&radic;'    => '&#8730;', # square root = radical sign, U+221A ISOtech
    '&prop;'     => '&#8733;', # proportional to, U+221D ISOtech
    '&infin;'    => '&#8734;', # infinity, U+221E ISOtech
    '&ang;'      => '&#8736;', # angle, U+2220 ISOamso
    '&and;'      => '&#8743;', # logical and = wedge, U+2227 ISOtech
    '&or;'       => '&#8744;', # logical or = vee, U+2228 ISOtech
    '&cap;'      => '&#8745;', # intersection = cap, U+2229 ISOtech
    '&cup;'      => '&#8746;', # union = cup, U+222A ISOtech
    '&int;'      => '&#8747;', # integral, U+222B ISOtech
    '&there4;'   => '&#8756;', # therefore, U+2234 ISOtech
    '&sim;'      => '&#8764;', # tilde operator = varies with = similar to, U+223C ISOtech
    '&cong;'     => '&#8773;', # approximately equal to, U+2245 ISOtech
    '&asymp;'    => '&#8776;', # almost equal to = asymptotic to, U+2248 ISOamsr
    '&ne;'       => '&#8800;', # not equal to, U+2260 ISOtech
    '&equiv;'    => '&#8801;', # identical to, U+2261 ISOtech
    '&le;'       => '&#8804;', # less-than or equal to, U+2264 ISOtech
    '&ge;'       => '&#8805;', # greater-than or equal to, U+2265 ISOtech
    '&sub;'      => '&#8834;', # subset of, U+2282 ISOtech
    '&sup;'      => '&#8835;', # superset of, U+2283 ISOtech
    '&nsub;'     => '&#8836;', # not a subset of, U+2284 ISOamsn
    '&sube;'     => '&#8838;', # subset of or equal to, U+2286 ISOtech
    '&supe;'     => '&#8839;', # superset of or equal to, U+2287 ISOtech
    '&oplus;'    => '&#8853;', # circled plus = direct sum, U+2295 ISOamsb
    '&otimes;'   => '&#8855;', # circled times = vector product, U+2297 ISOamsb
    '&perp;'     => '&#8869;', # up tack = orthogonal to = perpendicular, U+22A5 ISOtech
    '&sdot;'     => '&#8901;', # dot operator, U+22C5 ISOamsb
    '&lceil;'    => '&#8968;', # left ceiling = apl upstile, U+2308 ISOamsc
    '&rceil;'    => '&#8969;', # right ceiling, U+2309 ISOamsc
    '&lfloor;'   => '&#8970;', # left floor = apl downstile, U+230A ISOamsc
    '&rfloor;'   => '&#8971;', # right floor, U+230B ISOamsc
    '&lang;'     => '&#9001;', # left-pointing angle bracket = bra, U+2329 ISOtech
    '&rang;'     => '&#9002;', # right-pointing angle bracket = ket, U+232A ISOtech
    '&loz;'      => '&#9674;', # lozenge, U+25CA ISOpub
    '&spades;'   => '&#9824;', # black spade suit, U+2660 ISOpub
    '&clubs;'    => '&#9827;', # black club suit = shamrock, U+2663 ISOpub
    '&hearts;'   => '&#9829;', # black heart suit = valentine, U+2665 ISOpub
    '&diams;'    => '&#9830;', # black diamond suit, U+2666 ISOpub
    '&quot;'     => '&#34;',   # quotation mark = APL quote, U+0022 ISOnum
    '&amp;'      => '&#38;',   # ampersand, U+0026 ISOnum
    '&lt;'       => '&#60;',   # less-than sign, U+003C ISOnum
    '&gt;'       => '&#62;',   # greater-than sign, U+003E ISOnum
    '&OElig;'    => '&#338;',  # latin capital ligature OE, U+0152 ISOlat2
    '&oelig;'    => '&#339;',  # latin small ligature oe, U+0153 ISOlat2
    '&Scaron;'   => '&#352;',  # latin capital letter S with caron, U+0160 ISOlat2
    '&scaron;'   => '&#353;',  # latin small letter s with caron, U+0161 ISOlat2
    '&Yuml;'     => '&#376;',  # latin capital letter Y with diaeresis, U+0178 ISOlat2
    '&circ;'     => '&#710;',  # modifier letter circumflex accent, U+02C6 ISOpub
    '&tilde;'    => '&#732;',  # small tilde, U+02DC ISOdia
    '&ensp;'     => '&#8194;', # en space, U+2002 ISOpub
    '&emsp;'     => '&#8195;', # em space, U+2003 ISOpub
    '&thinsp;'   => '&#8201;', # thin space, U+2009 ISOpub
    '&zwnj;'     => '&#8204;', # zero width non-joiner, U+200C NEW RFC 2070
    '&zwj;'      => '&#8205;', # zero width joiner, U+200D NEW RFC 2070
    '&lrm;'      => '&#8206;', # left-to-right mark, U+200E NEW RFC 2070
    '&rlm;'      => '&#8207;', # right-to-left mark, U+200F NEW RFC 2070
    '&ndash;'    => '&#8211;', # en dash, U+2013 ISOpub
    '&mdash;'    => '&#8212;', # em dash, U+2014 ISOpub
    '&lsquo;'    => '&#8216;', # left single quotation mark, U+2018 ISOnum
    '&rsquo;'    => '&#8217;', # right single quotation mark, U+2019 ISOnum
    '&sbquo;'    => '&#8218;', # single low-9 quotation mark, U+201A NEW
    '&ldquo;'    => '&#8220;', # left double quotation mark, U+201C ISOnum
    '&rdquo;'    => '&#8221;', # right double quotation mark, U+201D ISOnum
    '&bdquo;'    => '&#8222;', # double low-9 quotation mark, U+201E NEW
    '&dagger;'   => '&#8224;', # dagger, U+2020 ISOpub
    '&Dagger;'   => '&#8225;', # double dagger, U+2021 ISOpub
    '&permil;'   => '&#8240;', # per mille sign, U+2030 ISOtech
    '&lsaquo;'   => '&#8249;', # single left-pointing angle quotation mark, U+2039 ISO proposed
    '&rsaquo;'   => '&#8250;', # single right-pointing angle quotation mark, U+203A ISO proposed
    '&euro;'     => '&#8364;', # euro sign, U+20AC NEW
);

The one for XHTML:

    '&apos;'     => '&#39;',   # apostrophe = APL quote, U+0027 ISOnum
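As an alternative to maintaining the table by hand, it can probably be generated from PHP's own entity table. The sketch below assumes PHP 7.2+ with the mbstring extension (for mb_ord); build_named_to_numeric_map is a hypothetical helper name, not something from the answer above:

```php
<?php
/**
 * Build a named => numeric entity map from PHP's HTML 4.01 entity table.
 * build_named_to_numeric_map() is a hypothetical helper name.
 */
function build_named_to_numeric_map(): array
{
    // Maps each character to its entity, for example "è" => "&egrave;".
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    $map = array();
    foreach ($table as $char => $entity) {
        // mb_ord() returns the Unicode code point of the character.
        $map[$entity] = '&#' . mb_ord((string) $char, 'UTF-8') . ';';
    }
    return $map;
}

$map = build_named_to_numeric_map();
echo strtr('Oggi &egrave; un bel&nbsp;giorno', $map), "\n";
```

The generated map is then used with strtr exactly like the handwritten one. Which entities it contains depends on the PHP version's built-in table, so it is worth checking the result against the list above.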

Answer 1 (score: 6)

This solution is based on code from php.net:

function entities_to_unicode($str) {
    $str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
    $str = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $str);
    return $str;
}

$str = 'Oggi &egrave; un bel giorno';
echo entities_to_unicode($str);

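Answer 2 (score: 4)

// The /e modifier was removed in PHP 7, so a callback is used instead. ord() works on single bytes, hence the explicit single-byte charset.
echo preg_replace_callback('/[^!-%\x27-;=?-~ ]/', function ($m) { return '&#' . ord($m[0]) . ';'; }, html_entity_decode($str, ENT_COMPAT, 'ISO-8859-1'));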
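The one-liner above works byte by byte. A UTF-8 aware variant of the same idea might look like the sketch below, which assumes PHP 7.2+ with the mbstring extension; non_ascii_to_numeric is a hypothetical name. It keeps the same character class, so &, < and > are turned into numeric references too:

```php
<?php
/**
 * Decode every entity to UTF-8, then re-encode each character outside the
 * "safe" ASCII set as a decimal character reference.
 * non_ascii_to_numeric() is a hypothetical helper name.
 */
function non_ascii_to_numeric(string $str): string
{
    $decoded = html_entity_decode($str, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // The /u modifier makes the pattern match whole code points, not bytes.
    $result = preg_replace_callback(
        '/[^!-%\x27-;=?-~ ]/u',
        function (array $m): string {
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $decoded
    );

    // preg_replace_callback() returns null on invalid UTF-8 input.
    return $result ?? $decoded;
}

echo non_ascii_to_numeric('Oggi &egrave; un bel&nbsp;giorno'), "\n";
```

Like the original, this decodes the whole string first, so it suits plain text rather than markup whose tags must survive.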

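Answer 3 (score: 3)

This one

  1. does not require enumerating the entities in user code
  2. works on HTML code that contains named entities (crudely applying html_entity_decode to the whole string confuses the < and > decoded from &lt; and &gt; with the start and end of HTML tags)
  3. here it is: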

    function htmlent2xml($s) {
        return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
           $c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
           return htmlentities($c,ENT_XML1,"UTF-8");
        },$s);
    }
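A sketch of the same per-entity callback idea that always emits decimal references. It assumes PHP 7.4+ with the mbstring extension (mb_str_split, mb_ord); named_to_numeric is a hypothetical name:

```php
<?php
/**
 * Replace each named entity with decimal character references, leaving
 * tags, numeric references and unknown names untouched.
 * named_to_numeric() is a hypothetical helper name.
 */
function named_to_numeric(string $html): string
{
    $result = preg_replace_callback(
        '/&[a-zA-Z][a-zA-Z0-9]*;/',
        function (array $m): string {
            $chars = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($chars === $m[0]) {
                return $m[0]; // Not a known entity: keep the text as written.
            }
            // A few HTML5 entities decode to two code points, so convert each one.
            $out = '';
            foreach (mb_str_split($chars, 1, 'UTF-8') as $char) {
                $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
            }
            return $out;
        },
        $html
    );

    return $result ?? $html;
}

echo named_to_numeric('<p>Oggi &egrave; un bel&nbsp;giorno &amp; &unknown;</p>'), "\n";
// <p>Oggi &#232; un bel&#160;giorno &#38; &unknown;</p>
```

Because only the matched entity is decoded, the surrounding tags are never touched.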
    

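Answer 4 (score: 2)

First use html_entity_decode to get an unencoded version of the source code. If necessary, set the third parameter (the encoding) to the correct value.

Then use utf8_encode on the source code: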

$source_code_without_entities = html_entity_decode($source_code_with_entities);
$utf8_source_code = utf8_encode($source_code_without_entities);

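Answer 5 (score: -1)

The answer provided by @hakre is the only one that really solves the question as asked. Interestingly, none of the other answers (including the accepted one) work. By the way, the accepted answer does not actually do anything! The others at least do something, but it is wrong. The people who tried to answer do not seem to have understood that the author wants to convert named entities into their numeric counterparts.

Below is my contribution, based on a comment in the PHP documentation (https://www.php.net/manual/pt_BR/function.htmlentities.php#106535):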

function xmlentities($aString) {
    $validChars = "A-Z0-9a-z\s_-";
    $twoChars = null;
    return preg_replace_callback("/[^$validChars]/"
    // use(&$twoChars) makes $twoChars visible inside the anonymous
    // function. The "&" is required because the function changes the
    // value of this variable
                                ,function ($aMatches) use(&$twoChars) { 
                                    $oneChar = $aMatches[0];
                                    switch($oneChar) {
    // Direct substitutions: replace the entities that XML recognizes. I
    // could have used a built-in PHP function for this, but decided not to
    // because there are only 5 characters to replace
                                        case "'": return "&apos;";
                                        case '"': return "&quot;";
                                        case '&': return "&amp;";
                                        case '<': return "&lt;";
                                        case '>': return "&gt;";
    // If it is not an entity recognized by XML, special handling is
    // needed to tell whether we are dealing with ISO-8859-1 or UTF-8
    // characters
                                        default: 
    // UTF-8 extends the ASCII table in a compatible way. ASCII characters
    // take 1 byte; the other characters handled here take two bytes.
    // The lead bytes checked below run from C2 (194) to CF (207). The
    // condition detects one of these bytes, which marks the start of a
    // two-byte UTF-8 character. In that case the byte is stored in a
    // variable so that the two bytes can later be converted into a
    // single ISO-8859-1 byte. This first branch only stores the byte and
    // outputs nothing
                                            if (194 <= ord($oneChar) && ord($oneChar) <= 207) { 
                                                $twoChars = $oneChar;
                                                return "";
    // If $twoChars holds a value, a previous step stored the first byte
    // of a UTF-8 pair in it. Append the second byte, convert the pair to
    // ISO-8859-1 and reset the control variable ($twoChars) to null. Then
    // return the output formatted with the character's ordinal in the
    // ISO-8859-1 table (utf8_decode can only represent U+0080 to U+00FF,
    // so only those characters convert correctly)
                                            } else if ($twoChars) { 
                                                $twoChars .= $oneChar;
                                                $ansiChar = utf8_decode($twoChars);
                                                $twoChars = null;
                                                return "&#" . str_pad(ord($ansiChar), 3, "0", STR_PAD_LEFT) . ";";
    // If the string passed in $aString is already encoded in ISO-8859-1,
    // every character takes 1 byte, so the byte can be formatted
    // directly
                                            } else {
                                                return "&#" . str_pad(ord($oneChar), 3, "0", STR_PAD_LEFT) . ";";       
                                            }
                                    }
                                }
                                ,$aString);
}

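My version is commented (via Google Translate) and only handles "raw" strings, without entities (&xxx;). So if your string contains named entities, first convert it to its raw form before using the function: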

$text = "Oggi &egrave; un bel&nbsp;giorno";

$text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, "UTF-8"); // flags are combined with bitwise |, not logical ||

$text = xmlentities($text);

echo($text); // Output = Oggi &#232; un bel&#160;giorno