Need a range containing all Unicode characters

Asked: 2016-04-02 07:42:07

Tags: ruby-on-rails ruby unicode

The standard Ruby implementation allows iterating over Unicode characters:

('a'..'z').to_a 
# ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
('@'..'[').to_a 
# ["@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "["]

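I need to get an array containing all Unicode characters (different locales, punctuation, and so on). How can I do that? I don't know what the first and last characters are.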

3 answers:

Answer 0 (score: 1)

[*32..65535].
  pack("U*").
  encode('UTF-8', invalid: :replace, undef: :replace, replace: '').
  split('')
  

irb(main):070:0> [*32..65535].pack("U*").encode('UTF-8', invalid: :replace, undef: :replace, replace: '').split('')
  => [" ", "!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\u007F", "\u0080", "\u0081", "\u0082", "\u0083", "\u0084", "\u0085", "\u0086", "\u0087", "\u0088",
  ...
  "\uFFEA", "\uFFEB", "\uFFEC", "\uFFED", "\uFFEE", "\uFFEF", "\uFFF0", "\uFFF1", "\uFFF2", "\uFFF3", "\uFFF4", "\uFFF5", "\uFFF6", "\uFFF7", "\uFFF8", "\uFFF9", "\uFFFA", "\uFFFB", "\uFFFC", "\uFFFD", "\uFFFE", "\uFFFF"]

See: Array#pack, String#encode, String#split
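To make the pipeline above easier to follow, here is a minimal sketch that runs the same `pack("U*")` step on a handful of code points instead of the whole 32..65535 range. The sample code points are my own choice, and `String#chars` is used as an assumed equivalent of `split('')` for a valid UTF-8 string.

```ruby
# A handful of code points instead of the full 32..65535 range:
# "A", "B", "é" and the CJK ideograph U+4E2D.
codepoints = [0x41, 0x42, 0xE9, 0x4E2D]

# "U*" packs every integer as one UTF-8 encoded character,
# and the resulting string is tagged as UTF-8.
packed = codepoints.pack("U*")

# For a valid string, String#chars returns the same array as split('').
chars = packed.chars

puts packed
p chars
```

The `encode(..., invalid: :replace, undef: :replace, replace: '')` step in the answer is only meant to deal with code points that do not form valid characters, so it is left out of this small example.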

(This doesn't work for all code points...)

(32..127).map {|i| i.chr(Encoding::UTF_8)}

But replace 127 with 65535. Enjoy the scrolling!

  

irb(main):011:0> (32..127).map { |i| i.chr }
  => [" ", "!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\x7F"]
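One caveat when widening the range to 65535: `Integer#chr(Encoding::UTF_8)` raises `RangeError` for the surrogate code points U+D800..U+DFFF, which is probably what the "doesn't work for all code points" remark refers to. A rough sketch that skips them, assuming a hypothetical helper named `utf8_chars` (not part of Ruby):

```ruby
# Hypothetical helper: every encodable character between two code points.
def utf8_chars(first, last)
  # Surrogates cannot be encoded in UTF-8; Integer#chr raises RangeError on them.
  surrogates = (0xD800..0xDFFF)

  (first..last)
    .reject { |cp| surrogates.cover?(cp) }
    .map { |cp| cp.chr(Encoding::UTF_8) }
end

# The Basic Multilingual Plane, as in the answer above.
bmp = utf8_chars(32, 0xFFFF)
puts bmp.size

# Code points above 0xFFFF work the same way.
p utf8_chars(0x1F600, 0x1F602)
```

Unicode code points end at 0x10FFFF, so `utf8_chars(32, 0x10FFFF)` would cover every plane. That array holds over a million strings and still includes unassigned and private-use code points.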

Answer 1 (score: 1)

Parse UnicodeData.txt (its fields are described in tr44#Property Definitions).

Pay special attention to the ranges:

3400    <CJK Ideograph Extension A, First>
4DB5    <CJK Ideograph Extension A, Last>
4E00    <CJK Ideograph, First>
9FD5    <CJK Ideograph, Last>
AC00    <Hangul Syllable, First>
D7A3    <Hangul Syllable, Last>
D800    <Non Private Use High Surrogate, First>
DB7F    <Non Private Use High Surrogate, Last>
DB80    <Private Use High Surrogate, First>
DBFF    <Private Use High Surrogate, Last>
DC00    <Low Surrogate, First>
DFFF    <Low Surrogate, Last>
E000    <Private Use, First>
F8FF    <Private Use, Last>
20000   <CJK Ideograph Extension B, First>
2A6D6   <CJK Ideograph Extension B, Last>
2A700   <CJK Ideograph Extension C, First>
2B734   <CJK Ideograph Extension C, Last>
2B740   <CJK Ideograph Extension D, First>
2B81D   <CJK Ideograph Extension D, Last>
2B820   <CJK Ideograph Extension E, First>
2CEA1   <CJK Ideograph Extension E, Last>
F0000   <Plane 15 Private Use, First>
FFFFD   <Plane 15 Private Use, Last>
100000  <Plane 16 Private Use, First>
10FFFD  <Plane 16 Private Use, Last>

Whether you iterate over these ranges depends on what data you need.
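A minimal sketch of the parsing idea, assuming the semicolon-separated layout of UnicodeData.txt (code point, name, general category, ...). The helper name `assigned_ranges` and the embedded sample lines are illustrative; a real script would read the actual file from the Unicode Character Database instead.

```ruby
# A few sample lines in UnicodeData.txt format (illustrative excerpt).
sample = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
  3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
  4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
  D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
  DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DATA

# Hypothetical helper: returns [range, general_category] pairs.
# "<..., First>" / "<..., Last>" line pairs are collapsed into one range.
def assigned_ranges(text)
  ranges = []
  range_start = nil

  text.each_line do |line|
    code, name, category = line.chomp.split(';', 4)
    next if code.nil? || code.empty?

    cp = code.to_i(16)
    if name.end_with?(', First>')
      range_start = cp
    elsif name.end_with?(', Last>')
      ranges << [range_start..cp, category]
      range_start = nil
    else
      ranges << [cp..cp, category]
    end
  end

  ranges
end

ranges = assigned_ranges(sample)

# Drop surrogates (general category "Cs"), which are not characters.
usable = ranges.reject { |_range, category| category == 'Cs' }
total = usable.sum { |range, _category| range.size }

usable.each { |range, category| puts format('%04X..%04X %s', range.first, range.last, category) }
puts total
```

Filtering by general category is one way to decide which of the listed ranges to iterate. Private-use ranges, for instance, could be excluded the same way if they are not needed.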

Answer 2 (score: 0)

[*32..65535].map do |e|
  e.chr(Encoding::UTF_8).tap do |char|
    char =~ /\p{Alnum}|\p{Punct}/ || raise 
  end rescue nil # rescuing both conversion and self-raised
end.compact

The above iterates over the code points from 32 to 65535, selecting alphanumerics and punctuation.

NB: the approach above, while more or less robust, readily matches diacritics that are part of combined characters such as ç or ö.
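To illustrate the caveat, here is a sketch that decomposes "ö" into a base letter plus a combining mark, then filters the range while explicitly rejecting combining marks with `\p{M}`. The helper `letter_or_punct?` is a name made up for this example. Whether `\p{Alnum}` alone matches a given combining mark may depend on the Ruby version, so the extra check is a defensive assumption and not a documented requirement.

```ruby
# "ö" as one precomposed code point, then decomposed into "o" + U+0308.
precomposed = "\u00F6"
decomposed  = precomposed.unicode_normalize(:nfd)

# Hypothetical helper: keep alphanumerics and punctuation, drop combining marks.
def letter_or_punct?(char)
  char.match?(/\p{Alnum}|\p{Punct}/) && !char.match?(/\p{M}/)
end

surrogates = (0xD800..0xDFFF) # not encodable in UTF-8

chars = (32..0xFFFF)
        .reject { |cp| surrogates.cover?(cp) }
        .map { |cp| cp.chr(Encoding::UTF_8) }
        .select { |char| letter_or_punct?(char) }

# Show the two code points that make up the decomposed form.
p decomposed.codepoints.map { |cp| format('U+%04X', cp) }
puts chars.size
```

Skipping the surrogates up front avoids the need for a blanket `rescue`, so an unexpected error elsewhere in the block would not be silently swallowed.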