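In Ruby 1.9.2, I found a way to make two strings that have the same bytes, have the same encoding, and compare as equal, yet report a different length and return different characters from [].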
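Is this a bug? If it is not a bug, I would like to understand it fully. What information is stored inside a Ruby 1.9.2 String object that allows these two strings to behave differently?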
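Here is the code that reproduces this behavior. The comments that begin with #=> show the output I got from this script, and the words in parentheses tell you my judgment of that output.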
#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2" # A well-behaved string with one character (¢)
string2 = "".concat(0xA2) # A bizarre string very similar to string1.
p string1.bytes.to_a #=> [194, 162] (good)
p string2.bytes.to_a #=> [194, 162] (good)
puts string1.encoding.name #=> UTF-8 (good)
puts string2.encoding.name #=> UTF-8 (good)
puts string1 == string2 #=> true (good)
puts string1.length #=> 1 (good)
puts string2.length #=> 2 (weird!)
p string1[0] #=> "¢" (good)
p string2[0] #=> "\xC2" (weird!)
I am running Ubuntu and compiled Ruby from source. My Ruby version is:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
Answer 0 (score: 8)
This was a bug in Ruby, and it was fixed in r29848.
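To check whether a given interpreter still shows the behavior from the question, a minimal sketch along these lines may help. The describe helper is a hypothetical name introduced here for illustration, and the .dup call is an assumption added so the script also runs where string literals are frozen; neither comes from the original post.

```ruby
# coding: utf-8
# Sketch: compare the observable properties of the two strings from the question.

# Hypothetical helper: collects the properties the question inspects.
def describe(str)
  {
    bytes: str.bytes.to_a,
    encoding: str.encoding.name,
    length: str.length,
    first: str[0]
  }
end

reference = "\xC2\xA2"           # two-byte UTF-8 literal for U+00A2
appended  = "".dup.concat(0xA2)  # append the codepoint to an empty UTF-8 string

consistent = describe(reference) == describe(appended)
puts consistent ? "consistent" : "inconsistent (behavior from the question)"
```

On an interpreter that includes the fix, both strings should describe identically; on the 1.9.2p0 build from the question, the length and first entries would differ.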
Answer 1 (score: 2)
Matz mentioned this issue on Twitter:
http://twitter.com/matz_translator/status/6597021662187520
http://twitter.com/matz_translator/status/6597055132733440
"It is hard to say for certain that this is a bug, but leaving it as it is would be unacceptable. I would rather fix this issue."
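Answer 2 (score: 1)
I think the problem lies in the encoding of the strings. Take a look at James Gray's article Shades of Gray: Ruby 1.9's String, which covers Unicode encodings.
Other strange behavior: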
# coding: utf-8
string1 = "\xC2\xA2"
string2 = "".concat(0xA2)
string3 = 0xC2.chr + 0xA2.chr
string1.bytes.to_a # => [194, 162]
string2.bytes.to_a # => [194, 162]
string3.bytes.to_a # => [194, 162]
string1.encoding.name # => "UTF-8"
string2.encoding.name # => "UTF-8"
string3.encoding.name # => "ASCII-8BIT"
string1 == string2 # => true
string1 == string3 # => false
string2 == string3 # => true
string1.length # => 1
string2.length # => 2
string3.length # => 2
string1[0] # => "¢"
string2[0] # => "\xC2"
string3[0] # => "\xC2"
string3.unpack('C*') # => [194, 162]
string4 = string3.unpack('C*').pack('C*') # => "\xC2\xA2"
string4.encoding.name # => "ASCII-8BIT"
string4.force_encoding('UTF-8') # => "¢"
string3.force_encoding('UTF-8') # => "¢"
string3.encoding.name # => "UTF-8"
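The string3 results above can be isolated in a short sketch: force_encoding changes only the encoding tag attached to the string, not its bytes, and that tag decides how length and == interpret those bytes. The variable names raw and utf8 are illustrative only, and the comments describe what I would expect on a Ruby that includes the fix.

```ruby
# coding: utf-8
# Sketch: same bytes, different encoding tag, different character semantics.

raw  = 0xC2.chr + 0xA2.chr              # Integer#chr above 0x7F gives ASCII-8BIT strings
utf8 = raw.dup.force_encoding("UTF-8")  # retag a copy; the bytes are untouched

same_bytes = raw.bytes.to_a == utf8.bytes.to_a

puts raw.encoding.name     # ASCII-8BIT
puts utf8.encoding.name    # UTF-8
puts same_bytes            # true
puts raw.length            # 2, one character per byte in a binary string
puts utf8.length           # 1, the two bytes form a single UTF-8 character
puts utf8.valid_encoding?  # true
puts raw == utf8           # false, non-ASCII content under incompatible encodings
```

This matches the string1 == string3 # => false line above: two strings with identical bytes are not equal when their encodings are incompatible and the content is not plain ASCII.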