In Ruby 1.9.3, I can get the code points of a string:
> "foo\u00f6".codepoints.to_a
=> [102, 111, 111, 246]
Is there a built-in method that goes in the other direction, i.e. from an array of integers to a string?
I am aware of:
# not acceptable; only works with UTF-8
[102, 111, 111, 246].pack("U*")
# works, but not very elegant
[102, 111, 111, 246].inject('') {|s, cp| s << cp }
# concise, but I need to unshift that pesky empty string to "prime" the inject call
['', 102, 111, 111, 246].inject(:<<)
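Update (in response to Niklas' answer)
Interesting discussion.
pack("U*") always returns a UTF-8 string, whereas the inject version returns a string in the file's source encoding.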
#!/usr/bin/env ruby
# encoding: iso-8859-1
p [102, 111, 111, 246].inject('', :<<).encoding
p [102, 111, 111, 246].pack("U*").encoding
# this raises an Encoding::CompatibilityError
[102, 111, 111, 246].pack("U*") =~ /\xf6/
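For me, the inject call returns an ISO-8859-1 string, while pack returns UTF-8. To prevent the error I could use pack("U*").encode(__ENCODING__), but that makes me do extra work.
Update 2
Apparently String#<< does not always append correctly, depending on the string's encoding. So it looks like pack is still the best option.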
[225].inject(''.encode('utf-16be'), :<<) # fails miserably
[225].pack("U*").encode('utf-16be') # works
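The pack-then-encode approach from the two updates can be wrapped in a small helper. This is only a minimal sketch: `codepoints_to_string` is a hypothetical name, not a Ruby built-in, and it assumes the integers are Unicode code points that the target encoding can represent.

```ruby
# Hypothetical helper (not part of Ruby): build the string as UTF-8 first,
# then transcode it to whatever encoding the caller asks for.
def codepoints_to_string(codepoints, encoding = Encoding::UTF_8)
  codepoints.pack("U*").encode(encoding)
end

utf16  = codepoints_to_string([225], Encoding::UTF_16BE)
latin1 = codepoints_to_string([102, 111, 111, 246], Encoding::ISO_8859_1)

p utf16.encoding   # => #<Encoding:UTF-16BE>
p utf16.bytes      # => [0, 225]
p latin1.bytes     # => [102, 111, 111, 246]
```

Passing `__ENCODING__` as the second argument would give the source-encoding behaviour described in the first update.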
Answer 0 (score: 10)
The most obvious adaptation of your own attempt would be
[102, 111, 111, 246].inject('', :<<)
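However, this is not a good solution, because it only works if the initial empty string literal has an encoding capable of holding the entire Unicode character range. The following fails: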
#!/usr/bin/env ruby
# encoding: iso-8859-1
p "\u{1234}".codepoints.to_a.inject('', :<<)
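The same failure can be reproduced without a magic comment. In this sketch, `force_encoding` on an empty accumulator stands in for the source encoding (an assumption made for illustration), and the error is rescued generically because the exact exception class is not stated above.

```ruby
# Sketch only: force_encoding replaces the "# encoding: iso-8859-1" comment.
latin1_acc = String.new.force_encoding("ISO-8859-1")
utf8_acc   = String.new.force_encoding("UTF-8")

# U+1234 fits into a UTF-8 accumulator.
utf8_result = [0x1234].inject(utf8_acc, :<<)
p utf8_result.encoding

# ISO-8859-1 has no character for U+1234, so appending it raises.
latin1_error = begin
  [0x1234].inject(latin1_acc, :<<)
  nil
rescue StandardError => e
  e
end
p latin1_error.class
```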
So I would actually recommend
codepoints.pack("U*")
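I don't know what you mean by "only works with UTF-8". It creates a Ruby string with UTF-8 encoding, but UTF-8 can hold the entire Unicode character range, so what is the problem? Observe: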
irb(main):010:0> s = [0x33333, 0x1ffff].pack("U*")
=> "\u{33333}\u{1FFFF}"
irb(main):011:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):012:0> [0x33333, 0x1ffff].pack("U*") == [0x33333, 0x1ffff].inject('', :<<)
=> true
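One possible reading of the "only works with UTF-8" concern, offered as an assumption rather than something either poster stated: String#codepoints reports code points in the string's own encoding, while pack("U*") always treats its integers as Unicode code points. For a non-Unicode encoding the two can disagree, so a round trip is only safe after transcoding to UTF-8 first. A small sketch using Windows-1252 as the example encoding:

```ruby
euro   = "\u20AC"                      # EURO SIGN as a UTF-8 string
cp1252 = euro.encode("Windows-1252")   # the same character as the single byte 0x80

p euro.codepoints.to_a     # => [8364]
p cp1252.codepoints.to_a   # => [128]

# Packing the Windows-1252 code point yields U+0080, not the euro sign.
naive = cp1252.codepoints.to_a.pack("U*")
p naive == euro            # => false

# Transcoding to UTF-8 before taking code points makes the round trip work.
safe = cp1252.encode("UTF-8").codepoints.to_a.pack("U*")
p safe == euro             # => true
```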
Answer 1 (score: 2)
Depending on the values in your array and the value of Encoding.default_internal, you could try:
[102, 111, 111, 246].map(&:chr).inject(:+)
You have to be careful about encodings. Note the following:
irb(main):001:0> 0.chr.encoding
=> #<Encoding:US-ASCII>
irb(main):002:0> 127.chr.encoding
=> #<Encoding:US-ASCII>
irb(main):003:0> 128.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> 255.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):005:0> 256.chr.encoding
RangeError: 256 out of char range
from (irb):5:in `chr'
from (irb):5
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):006:0>
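By default, 256.chr fails because chr returns either a US-ASCII or an ASCII-8BIT string, depending on whether the code point is in 0..127 or 128..255.
That should cover your needs for 8-bit values. If your values are greater than 255 (presumably Unicode code points), you can do the following: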
irb(main):006:0> Encoding.default_internal = "utf-8"
=> "utf-8"
irb(main):007:0> 256.chr.encoding
=> #<Encoding:UTF-8>
irb(main):008:0> 256.chr.codepoints
=> [256]
irb(main):009:0>
With Encoding.default_internal set to "utf-8", Unicode values > 255 should work fine (but see below):
irb(main):009:0> 65535.chr.encoding
=> #<Encoding:UTF-8>
irb(main):010:0> 65535.chr.codepoints
=> [65535]
irb(main):011:0> 65536.chr.codepoints
=> [65536]
irb(main):012:0> 65535.chr.bytes
=> [239, 191, 191]
irb(main):013:0> 65536.chr.bytes
=> [240, 144, 128, 128]
irb(main):014:0>
Now it gets interesting: ASCII-8BIT and UTF-8 do not seem to mix:
irb(main):014:0> (0..127).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:US-ASCII>
irb(main):015:0> (0..128).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):016:0> (0..255).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):017:0> ((0..127).to_a + (256..1000000).to_a).map(&:chr).inject(:+).encoding
RangeError: invalid codepoint 0xD800 in UTF-8
from (irb):17:in `chr'
from (irb):17:in `map'
from (irb):17
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):018:0> ((0..127).to_a + (256..0xD7FF).to_a).map(&:chr).inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):019:0> (0..256).to_a.map(&:chr).inject(:+).encoding
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):19:in `+'
from (irb):19:in `each'
from (irb):19:in `inject'
from (irb):19
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):020:0>
ASCII-8BIT and UTF-8 strings can be concatenated as long as the ASCII-8BIT code points are all in 0..127:
irb(main):020:0> 256.chr.encoding
=> #<Encoding:UTF-8>
irb(main):021:0> (0.chr.force_encoding("ASCII-8BIT") + 256.chr).encoding
=> #<Encoding:UTF-8>
irb(main):022:0> 255.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):023:0> (255.chr + 256.chr).encoding
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):23
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):024:0>
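The compatibility rule seen in the transcript above can be checked up front with Encoding.compatible?, which returns the encoding a concatenation would have, or nil if the two strings cannot be combined. A short sketch; the explicit encoding arguments to chr are used here so that the result does not depend on Encoding.default_internal:

```ruby
ascii_only = 0.chr(Encoding::ASCII_8BIT)     # byte 0x00, inside the ASCII range
high_byte  = 255.chr(Encoding::ASCII_8BIT)   # byte 0xFF, outside the ASCII range
utf8_char  = 256.chr(Encoding::UTF_8)        # U+0100 as a UTF-8 string

ok  = Encoding.compatible?(ascii_only, utf8_char)
bad = Encoding.compatible?(high_byte, utf8_char)

p ok    # => #<Encoding:UTF-8>
p bad   # => nil

p((ascii_only + utf8_char).encoding)   # => #<Encoding:UTF-8>
```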
This brings us to a final solution to your problem:
irb(main):024:0> (0..0xD7FF).to_a.map {|c| c.chr("utf-8")}.inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):025:0>
So I think the most general answer, assuming you want UTF-8, is:
[102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+)
Assuming you know your values are in 0..255, this is simpler:
[102, 111, 111, 246].map(&:chr).inject(:+)
giving you:
irb(main):027:0> [102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+)
=> "fooö"
irb(main):028:0> [102, 111, 111, 246].map(&:chr).inject(:+)
=> "foo\xF6"
irb(main):029:0> [102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):030:0> [102, 111, 111, 246].map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):031:0>
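As a variation on the map/inject form, the characters can also be combined with join. The sketch below uses a hypothetical helper name, `codepoints_to_utf8`, and also contrasts how the two approaches treat a surrogate such as 0xD800: chr raises (as in the RangeError transcript earlier), whereas pack("U*") is assumed here to return a string that fails valid_encoding?.

```ruby
# Hypothetical helper (not a Ruby built-in): one UTF-8 character per code point.
def codepoints_to_utf8(codepoints)
  codepoints.map { |cp| cp.chr(Encoding::UTF_8) }.join
end

joined = codepoints_to_utf8([102, 111, 111, 246])
packed = [102, 111, 111, 246].pack("U*")
p joined.encoding    # => #<Encoding:UTF-8>
p joined == packed   # => true

# chr rejects a surrogate code point outright.
surrogate_error = begin
  codepoints_to_utf8([0xD800])
  nil
rescue RangeError => e
  e
end
p surrogate_error.class   # => RangeError

# pack does not raise, so the result has to be validated separately.
p [0xD800].pack("U*").valid_encoding?
```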
I hope this helps (albeit a bit late). I found this question while looking for an answer to the same problem, so I researched it myself.