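I'm writing a piece of software in C++ that is supposed to handle UTF-16 correctly. However, because UTF-16 behaves almost like a fixed-width encoding in most cases (it isn't one), I'd like to know where I can find strings I can use to test whether it actually works.
Testing with Latin letters, or even with my country's accented letters, is nearly useless, so I'm not sure what kind of characters I should test with.
Note: the software is a C++ library, and I want to use UTF-16 both for its API and for its internal storage.
Any suggestions are welcome!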
Answer 0: (score: 3)
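The range that UTF-16 covers without surrogate pairs is U+0000 to U+FFFF. Anything above that from http://www.unicode.org/charts/ will do.
If you look at http://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt, it lists the code point ranges of the different Unicode blocks, so: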
10000..1007F; Linear B Syllabary
10080..100FF; Linear B Ideograms
10100..1013F; Aegean Numbers
10140..1018F; Ancient Greek Numbers
10190..101CF; Ancient Symbols
101D0..101FF; Phaistos Disc
10280..1029F; Lycian
102A0..102DF; Carian
10300..1032F; Old Italic
10330..1034F; Gothic
10380..1039F; Ugaritic
103A0..103DF; Old Persian
10400..1044F; Deseret
10450..1047F; Shavian
10480..104AF; Osmanya
10800..1083F; Cypriot Syllabary
10840..1085F; Imperial Aramaic
10900..1091F; Phoenician
10920..1093F; Lydian
10980..1099F; Meroitic Hieroglyphs
109A0..109FF; Meroitic Cursive
10A00..10A5F; Kharoshthi
10A60..10A7F; Old South Arabian
10B00..10B3F; Avestan
10B40..10B5F; Inscriptional Parthian
10B60..10B7F; Inscriptional Pahlavi
10C00..10C4F; Old Turkic
10E60..10E7F; Rumi Numeral Symbols
11000..1107F; Brahmi
11080..110CF; Kaithi
110D0..110FF; Sora Sompeng
11100..1114F; Chakma
11180..111DF; Sharada
11680..116CF; Takri
12000..123FF; Cuneiform
12400..1247F; Cuneiform Numbers and Punctuation
13000..1342F; Egyptian Hieroglyphs
16800..16A3F; Bamum Supplement
16F00..16F9F; Miao
1B000..1B0FF; Kana Supplement
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D200..1D24F; Ancient Greek Musical Notation
1D300..1D35F; Tai Xuan Jing Symbols
1D360..1D37F; Counting Rod Numerals
1D400..1D7FF; Mathematical Alphanumeric Symbols
1EE00..1EEFF; Arabic Mathematical Alphabetic Symbols
1F000..1F02F; Mahjong Tiles
1F030..1F09F; Domino Tiles
1F0A0..1F0FF; Playing Cards
1F100..1F1FF; Enclosed Alphanumeric Supplement
1F200..1F2FF; Enclosed Ideographic Supplement
1F300..1F5FF; Miscellaneous Symbols And Pictographs
1F600..1F64F; Emoticons
1F680..1F6FF; Transport And Map Symbols
1F700..1F77F; Alchemical Symbols
20000..2A6DF; CJK Unified Ideographs Extension B
2A700..2B73F; CJK Unified Ideographs Extension C
2B740..2B81F; CJK Unified Ideographs Extension D
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
E0100..E01EF; Variation Selectors Supplement
Take your pick!
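As a rough sketch of how code points picked from the blocks above could be turned into UTF-16 test data, the following encodes each code point by hand. The names `encode_utf16`, `make_test_string` and `check_known_pairs` are hypothetical helpers invented for this illustration, and the sample code points are just one arbitrary choice from the listed blocks.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points below U+10000 take one unit; the rest take a surrogate pair.
std::u16string encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Hypothetical helper: a mixed string with two BMP characters followed by
// one character from each of six supplementary blocks in the list above.
std::u16string make_test_string() {
    const char32_t samples[] = {
        U'a',     // Latin letter, one code unit
        0x00E9,   // accented Latin letter, still one code unit
        0x10000,  // Linear B Syllabary
        0x10330,  // Gothic
        0x12000,  // Cuneiform
        0x1D11E,  // Musical Symbols
        0x1F600,  // Emoticons
        0x20000   // CJK Unified Ideographs Extension B
    };
    std::u16string out;
    for (char32_t cp : samples) {
        out += encode_utf16(cp);
    }
    return out;
}

// Known-answer checks for the encoder itself.
void check_known_pairs() {
    const std::u16string clef = encode_utf16(0x1D11E);
    assert(clef.size() == 2 && clef[0] == 0xD834 && clef[1] == 0xDD1E);
    const std::u16string first = encode_utf16(0x10000);
    assert(first.size() == 2 && first[0] == 0xD800 && first[1] == 0xDC00);
}
```

With these samples, `make_test_string()` yields 14 code units for 8 characters, which is exactly the kind of mismatch a library treating UTF-16 as fixed-width would get wrong.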
Also, if the text you find is in another encoding (such as UTF-8), you can use a program like iconv to convert it to UTF-16.
Answer 1: (score: 0)
Grab the text of this wikipedia page. It has a large amount of cuneiform mixed with Latin letters.
Answer 2: (score: 0)
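Any character with a code point of U+10000 or above (a non-BMP character) will do, for example the characters in text with emoji. This is because only non-BMP characters are encoded as surrogate pairs, i.e. as two UTF-16 code units.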
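To make the "two code units" point concrete, here is a minimal sketch of counting and decoding code points in a UTF-16 string. The function names are hypothetical, and the handling of unpaired surrogates (counted as one unit, decoded as U+FFFD) is an assumption made for this example; a real library may prefer to reject such input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points rather than code units. A well-formed surrogate pair
// counts once; an unpaired surrogate is counted as one (ill-formed) unit.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
    }
    return n;
}

// Decode the code point that starts at index i.
// Assumption for this sketch: unpaired surrogates decode to U+FFFD.
char32_t decode_at(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    const char16_t u = s[i];
    if (is_high_surrogate(u) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
        return 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10) +
               (static_cast<char32_t>(s[i + 1]) - 0xDC00);
    }
    if (is_high_surrogate(u) || is_low_surrogate(u)) {
        return 0xFFFD;  // replacement character
    }
    return u;
}
```

A string holding `a`, the pair 0xD83D 0xDE00 and `b` has `size()` 4 but only 3 code points, and decoding at index 1 gives U+1F600. Tests that compare `size()` against the code point count on such input are a quick way to catch fixed-width assumptions.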