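I'm using Perl's XML::Writer to generate an import file for a program called OpenNMS. According to the documentation, I need to pre-declare all special characters as XML ENTITY declarations. Obviously I need to go through all the strings I'm exporting and catalogue the special characters used. What is the easiest way to find out which characters in a Perl string are "special" with respect to UTF-8 encoding? And is there any way to work out what the entity names for those characters should be?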
Answer 0 (score: 2)
To find the "special" characters, you can use ord to find out each character's code point. Here is an example:
# Create a Unicode test file with some Latin chars, some Cyrillic,
# and some outside the BMP.
# The BMP is the basic multilingual plane, see perluniintro.
# (Not sure what you mean by saying "non-basic".)
perl -CO -lwe "print join '', map chr, 97 .. 100, 0x410 .. 0x415, 0x10000 .. 0x10003" > u.txt
# Read it and find codepoints outside the BMP.
perl -CI -nlwe "print for map ord, grep ord > 0xffff, split //" < u.txt
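Reading perluniintro will give you a good introduction.
I'm not sure what the documentation you refer to means in its "Exported XML" section. It looks like a limitation of a system that is in fact ASCII-only and doesn't do Unicode. Or a misunderstanding of XML. Or both.
In any case, if you are looking for names, you could use or refer to the canonical ones. See XML Entity Definitions for Characters, or one of the older documents for HTML or MathML referenced there.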
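To catalogue the characters in a whole export rather than print them one by one, the same ord approach can feed a hash. The following is only a sketch: the function name list_special is made up for illustration, and it assumes the input is UTF-8 and that "special" means anything above U+007F (raise the threshold to 0xFFFF to look only outside the BMP).

```shell
# Hypothetical helper (not from the original answer): read UTF-8 text on
# stdin and print each distinct non-ASCII code point with how often it occurs.
list_special() {
  perl -CSD -ne '
    # Count every character whose code point lies above the ASCII range.
    $seen{ ord($_) }++ for grep { ord($_) > 0x7F } split //;
    # After the last line, report the code points in ascending order.
    END { printf "U+%04X %d\n", $_, $seen{$_} for sort { $a <=> $b } keys %seen }
  '
}
```

Running `list_special < u.txt` on the test file created above should list the Cyrillic and non-BMP code points, one per line.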
Answer 1 (score: 1)
You could take a look at the uniquote program. It has an --xml option. For example:
$ cat sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3πr³) and (4⁄3πr³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (ｗｉｄｅ) and (ｗｉｄｅ).
9 regular characters: (normal) and (normal).
$ uniquote -x sample
1 NFD single combining characters: (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}) and (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}).
2 NFC single combining characters: (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}) and (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}).
3 NFD multiple combining characters: (ha\x{302}\x{303}c\x{327}\x{30C}k) and (ha\x{303}\x{302}c\x{327}\x{30C}k).
3 NFC multiple combining characters: (h\x{1EAB}\x{E7}\x{30C}k) and (h\x{E3}\x{302}\x{E7}\x{30C}k).
5 invisible characters: (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}) and (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}).
6 astral characters: (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]) and (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]).
7 astral + combining chars: (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]) and (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]).
8 wide characters: (\x{FF57}\x{FF49}\x{FF44}\x{FF45}) and (\x{FF57}\x{FF49}\x{FF44}\x{FF45}).
9 regular characters: (normal) and (normal).
$ uniquote -b sample
1 NFD single combining characters: (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81) and (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81).
2 NFC single combining characters: (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9) and (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9).
3 NFD multiple combining characters: (ha\xCC\x82\xCC\x83c\xCC\xA7\xCC\x8Ck) and (ha\xCC\x83\xCC\x82c\xCC\xA7\xCC\x8Ck).
3 NFC multiple combining characters: (h\xE1\xBA\xAB\xC3\xA7\xCC\x8Ck) and (h\xC3\xA3\xCC\x82\xC3\xA7\xCC\x8Ck).
5 invisible characters: (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3) and (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3).
6 astral characters: (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]) and (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]).
7 astral + combining chars: (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]) and (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]).
8 wide characters: (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85) and (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85).
9 regular characters: (normal) and (normal).
$ uniquote -v sample
1 NFD single combining characters: (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}) and (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}).
2 NFC single combining characters: (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}) and (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}).
3 NFD multiple combining characters: (ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k) and (ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k).
3 NFC multiple combining characters: (h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k) and (h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k).
5 invisible characters: (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}) and (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}).
6 astral characters: (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]).
7 astral + combining chars: (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]).
8 wide characters: (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}) and (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}).
9 regular characters: (normal) and (normal).
$ uniquote --xml sample
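1 NFD single combining characters: (cre&#x300;me bru&#x302;le&#x301;e et fiance&#x301;) and (cre&#x300;me bru&#x302;le&#x301;e et fiance&#x301;).
2 NFC single combining characters: (cr&#xE8;me br&#xFB;l&#xE9;e et fianc&#xE9;) and (cr&#xE8;me br&#xFB;l&#xE9;e et fianc&#xE9;).
3 NFD multiple combining characters: (ha&#x302;&#x303;c&#x327;&#x30C;k) and (ha&#x303;&#x302;c&#x327;&#x30C;k).
3 NFC multiple combining characters: (h&#x1EAB;&#xE7;&#x30C;k) and (h&#xE3;&#x302;&#xE7;&#x30C;k).
5 invisible characters: (4&#x2044;3&#x2062;&#x3C0;&#x2062;r&#xB3;) and (4&#x2044;3&#x2062;&#x3C0;&#x2062;r&#xB3;).
6 astral characters: (&#x1D402; = sqrt[&#x1D400;&#xB2; + &#x1D401;&#xB2;]) and (&#x1D402; = sqrt[&#x1D400;&#xB2; + &#x1D401;&#xB2;]).
7 astral + combining chars: (&#x1D402;&#x305; = sqrt[&#x1D400;&#x305;&#xB2; + &#x1D401;&#x305;&#xB2;]) and (&#x1D402;&#x305; = sqrt[&#x1D400;&#x305;&#xB2; + &#x1D401;&#x305;&#xB2;]).
8 wide characters: (&#xFF57;&#xFF49;&#xFF44;&#xFF45;) and (&#xFF57;&#xFF49;&#xFF44;&#xFF45;).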
9 regular characters: (normal) and (normal).
$ uniquote --verbose --html sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).