How to identify all non-basic UTF-8 characters in a set of strings in Perl

Time: 2011-03-23 13:02:48

Tags: perl unicode utf-8

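I am using Perl's XML::Writer to generate an import file for a program called OpenNMS. According to the documentation, I need to pre-declare all special characters as XML ENTITY declarations. Obviously I need to go through all the strings I am exporting and catalogue the special characters used. What is the easiest way to work out which characters in a Perl string are "special" with respect to UTF-8 encoding? Is there any way to work out what the entity names for those characters should be?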

2 answers:

Answer 0 (score: 2)

To find the "special" characters, you can use ord to find out the code point of each character. Here is an example:

# Create a Unicode test file with some Latin chars, some Cyrillic,
# and some outside the BMP.
# The BMP is the basic multilingual plane, see perluniintro.
# (Not sure what you mean by saying "non-basic".)
perl -CO -lwe "print join '', map chr, 97 .. 100, 0x410 .. 0x415, 0x10000 .. 0x10003" > u.txt

# Read it and find codepoints outside the BMP.
perl -CI -nlwe "print for map ord, grep ord > 0xffff, split //" < u.txt
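
The same idea can be written as a small script that catalogues every non-ASCII character across a list of strings, which is closer to what the question asks for. This is only a sketch: the sample strings are made up, "special" is assumed to mean "outside ASCII", and the strings are assumed to be decoded character strings rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the strings being exported.
my @strings = (
    "cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}",   # Latin-1 range letters
    "plain ascii",
    "r\x{B3} and \x{1D402}",                        # superscript three, astral letter
);

# Count every character whose code point lies outside ASCII.
my %seen;
for my $string (@strings) {
    $seen{ ord $1 }++ while $string =~ /([^\x00-\x7F])/g;
}

# Report the catalogue in code point order.
binmode STDOUT, ':encoding(UTF-8)';
for my $cp (sort { $a <=> $b } keys %seen) {
    printf "U+%04X  %s  seen %d time(s)\n", $cp, chr $cp, $seen{$cp};
}
```

Changing the character class to `[^\x{0000}-\x{FFFF}]` would restrict the catalogue to characters outside the BMP, matching the one-liner above.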

You can get a good introduction by reading perluniintro.

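I'm not sure what the documentation you refer to means in its "Exported XML" section. It looks like a limitation of the system, which is de facto ASCII and does not do Unicode. Or it is a misunderstanding of XML. Or both.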

Anyway, if you are looking for names, you could use or refer to the canonical ones. See XML Entity Definitions for Characters, or one of the older documents for HTML or MathML referenced therein.
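
If the official Unicode character name is enough, Perl's core charnames module can look it up from a code point. The sketch below does that and also builds a numeric character reference and a sample ENTITY declaration. The entity naming scheme used here ("u" plus the hex code point) is purely hypothetical; real entity names would have to come from the entity definitions mentioned above.

```perl
use strict;
use warnings;
use charnames ();    # core module; viacode() maps a code point to its Unicode name

# Describe one code point: Unicode name, numeric reference, sample declaration.
sub describe_codepoint {
    my ($cp) = @_;
    my $name = charnames::viacode($cp) // 'unnamed';
    my $ref  = sprintf '&#x%X;', $cp;
    # Hypothetical entity name: "u" followed by the hex code point.
    my $decl = sprintf '<!ENTITY u%04X "%s">', $cp, $ref;
    return ($name, $ref, $decl);
}

for my $cp (0xE9, 0x3C0, 0x1D402) {
    my ($name, $ref, $decl) = describe_codepoint($cp);
    print "$name\t$ref\t$decl\n";
}
```

Note that a numeric character reference such as `&#xE9;` needs no declaration at all in XML, so the declaration step may turn out to be unnecessary.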

Answer 1 (score: 1)

You could look at the uniquote program. It has an --xml option. For example:

$ cat sample
     1  NFD single combining characters:   (crème brûlée et fiancé) and (crème brûlée et fiancé).
     2  NFC single combining characters:   (crème brûlée et fiancé) and (crème brûlée et fiancé).
     3  NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
     3  NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
     5  invisible characters:              (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
     6  astral characters:                 (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
     7  astral + combining chars:          (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
     8  wide characters:                   (ｗｉｄｅ) and (ｗｉｄｅ).
     9  regular characters:                (normal) and (normal).

$ uniquote -x sample
     1  NFD single combining characters:   (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}) and (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}).
     2  NFC single combining characters:   (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}) and (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}).
     3  NFD multiple combining characters: (ha\x{302}\x{303}c\x{327}\x{30C}k) and (ha\x{303}\x{302}c\x{327}\x{30C}k).
     3  NFC multiple combining characters: (h\x{1EAB}\x{E7}\x{30C}k) and (h\x{E3}\x{302}\x{E7}\x{30C}k).
     5  invisible characters:              (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}) and (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}).
     6  astral characters:                 (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]) and (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]).
     7  astral + combining chars:          (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]) and (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]).
     8  wide characters:                   (\x{FF57}\x{FF49}\x{FF44}\x{FF45}) and (\x{FF57}\x{FF49}\x{FF44}\x{FF45}).
     9  regular characters:                (normal) and (normal).


$ uniquote -b sample
     1  NFD single combining characters:   (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81) and (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81).
     2  NFC single combining characters:   (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9) and (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9).
     3  NFD multiple combining characters: (ha\xCC\x82\xCC\x83c\xCC\xA7\xCC\x8Ck) and (ha\xCC\x83\xCC\x82c\xCC\xA7\xCC\x8Ck).
     3  NFC multiple combining characters: (h\xE1\xBA\xAB\xC3\xA7\xCC\x8Ck) and (h\xC3\xA3\xCC\x82\xC3\xA7\xCC\x8Ck).
     5  invisible characters:              (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3) and (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3).
     6  astral characters:                 (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]) and (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]).
     7  astral + combining chars:          (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]) and (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]).
     8  wide characters:                   (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85) and (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85).
     9  regular characters:                (normal) and (normal).

$ uniquote -v sample
     1  NFD single combining characters:   (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}) and (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}).
     2  NFC single combining characters:   (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}) and (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}).
     3  NFD multiple combining characters: (ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k) and (ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k).
     3  NFC multiple combining characters: (h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k) and (h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k).
     5  invisible characters:              (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}) and (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}).
     6  astral characters:                 (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]).
     7  astral + combining chars:          (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]).
     8  wide characters:                   (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}) and (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}).
     9  regular characters:                (normal) and (normal).

$ uniquote --xml sample
     1  NFD single combining characters:   (cre&#x300;me bru&#x302;le&#x301;e et fiance&#x301;) and (cre&#x300;me bru&#x302;le&#x301;e et fiance&#x301;).
     2  NFC single combining characters:   (cr&#xe8;me br&#xfb;l&#xe9;e et fianc&#xe9;) and (cr&#xe8;me br&#xfb;l&#xe9;e et fianc&#xe9;).
     3  NFD multiple combining characters: (ha&#x302;&#x303;c&#x327;&#x30c;k) and (ha&#x303;&#x302;c&#x327;&#x30c;k).
     3  NFC multiple combining characters: (h&#x1eab;&#xe7;&#x30c;k) and (h&#xe3;&#x302;&#xe7;&#x30c;k).
     5  invisible characters:              (4&#x2044;3&#x2062;&#x3c0;&#x2062;r&#xb3;) and (4&#x2044;3&#x2062;&#x3c0;&#x2062;r&#xb3;).
     6  astral characters:                 (&#x1d402; = sqrt[&#x1d400;&#xb2; + &#x1d401;&#xb2;]) and (&#x1d402; = sqrt[&#x1d400;&#xb2; + &#x1d401;&#xb2;]).
     7  astral + combining chars:          (&#x1d402;&#x305; = sqrt[&#x1d400;&#x305;&#xb2; + &#x1d401;&#x305;&#xb2;]) and (&#x1d402;&#x305; = sqrt[&#x1d400;&#x305;&#xb2; + &#x1d401;&#x305;&#xb2;]).
     8  wide characters:                   (&#xff57;&#xff49;&#xff44;&#xff45;) and (&#xff57;&#xff49;&#xff44;&#xff45;).
     9  regular characters:                (normal) and (normal).

$ uniquote --verbose --html sample
     1  NFD single combining characters:   (cre&#768;me bru&#770;le&#769;e et fiance&#769;) and (cre&#768;me bru&#770;le&#769;e et fiance&#769;).
     2  NFC single combining characters:   (cr&egrave;me br&ucirc;l&eacute;e et fianc&eacute;) and (cr&egrave;me br&ucirc;l&eacute;e et fianc&eacute;).
     3  NFD multiple combining characters: (ha&#770;&#771;c&#807;&#780;k) and (ha&#771;&#770;c&#807;&#780;k).
     3  NFC multiple combining characters: (h&#7851;&ccedil;&#780;k) and (h&atilde;&#770;&ccedil;&#780;k).
     5  invisible characters:              (4&frasl;3&#8290;&pi;&#8290;r&sup3;) and (4&frasl;3&#8290;&pi;&#8290;r&sup3;).
     6  astral characters:                 (&#119810; = sqrt[&#119808;&sup2; + &#119809;&sup2;]) and (&#119810; = sqrt[&#119808;&sup2; + &#119809;&sup2;]).
     7  astral + combining chars:          (&#119810;&#773; = sqrt[&#119808;&#773;&sup2; + &#119809;&#773;&sup2;]) and (&#119810;&#773; = sqrt[&#119808;&#773;&sup2; + &#119809;&#773;&sup2;]).
     8  wide characters:                   (&#65367;&#65353;&#65348;&#65349;) and (&#65367;&#65353;&#65348;&#65349;).
     9  regular characters:                (normal) and (normal).
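
If uniquote is not at hand, the effect of its --xml option can be roughly approximated in a few lines of Perl. This is a minimal sketch, not uniquote's actual implementation: the function name `xml_escape_non_ascii` is made up, the input is assumed to be a decoded character string, and the XML metacharacters `&`, `<` and `>` are deliberately left alone.

```perl
use strict;
use warnings;

# Replace every non-ASCII character with a hexadecimal character reference.
# Assumes $text is a decoded character string, not raw UTF-8 bytes.
sub xml_escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf '&#x%x;', ord $1/ge;
    return $text;
}

# Prints: cr&#xe8;me br&#xfb;l&#xe9;e
print xml_escape_non_ascii("cr\x{E8}me br\x{FB}l\x{E9}e"), "\n";
```

Because numeric character references are understood by any XML parser, output produced this way does not depend on any ENTITY declarations.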