Why are two strings with identical bytes and encoding not identical in Ruby 1.9?

Asked: 2010-11-21 06:58:52

Tags: ruby string encoding ruby-1.9

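In Ruby 1.9.2, I found a way to build two strings that have the same bytes and the same encoding, and that compare as equal, yet report different values from length and return different characters from [].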

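Is this a bug? If it is not a bug, I would like to understand it fully. What information is stored inside a Ruby 1.9.2 String object that allows these two strings to behave differently?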

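Here is the code that reproduces the behavior. The comments starting with #=> show the output I got from this script, and the word in parentheses gives my judgment of that output.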

#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2"       # A well-behaved string with one character (¢)
string2 = "".concat(0xA2)  # A bizarre string very similar to string1.
p    string1.bytes.to_a    #=> [194, 162]  (good)
p    string2.bytes.to_a    #=> [194, 162]  (good)
puts string1.encoding.name #=> UTF-8  (good)
puts string2.encoding.name #=> UTF-8  (good)
puts string1 == string2    #=> true   (good)
puts string1.length        #=> 1      (good)
puts string2.length        #=> 2      (weird!)
p    string1[0]            #=> "¢"    (good)
p    string2[0]            #=> "\xC2" (weird!)

I am running Ubuntu and compiled Ruby from source. My Ruby version is:

ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]

3 answers:

Answer 0 (score: 8)

This was a bug in Ruby, and it was fixed in r29848.
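
To check whether a given interpreter still shows the behavior from the question, a minimal sketch along these lines may help. The helper name concat_codepoint_consistent? is hypothetical and not part of Ruby; it only compares a string built with concat(Integer) against the equivalent literal.

```ruby
# Hypothetical helper (not part of Ruby): returns true when a string built
# with concat(Integer) behaves like the equivalent UTF-8 literal.
def concat_codepoint_consistent?
  literal = "\u00A2"                        # the cent sign as a UTF-8 literal
  built   = "".dup.force_encoding("UTF-8")  # empty, mutable UTF-8 string
  built.concat(0xA2)                        # append code point U+00A2

  literal.bytes.to_a == built.bytes.to_a &&
    literal.encoding == built.encoding &&
    literal.length == built.length &&
    literal[0] == built[0]
end

puts concat_codepoint_consistent?
```

On an interpreter that includes the fix this should print true. On the 1.9.2p0 build from the question, the length and [] comparisons would be expected to fail, so it should print false.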

Answer 1 (score: 2)

Matz commented on this question on Twitter:

http://twitter.com/matz_translator/status/6597021662187520

http://twitter.com/matz_translator/status/6597055132733440

"It is hard to say for certain that this is a bug, but leaving it as it is is not acceptable. I would rather fix this problem."

Answer 2 (score: 1)

I think the problem lies in the encoding of the strings. Take a look at James Gray's Shades of Gray article "Ruby 1.9's String", which covers Unicode encodings.


Some more odd behavior:

# coding: utf-8

string1 = "\xC2\xA2"       
string2 = "".concat(0xA2)  
string3 = 0xC2.chr + 0xA2.chr

string1.bytes.to_a    # => [194, 162]
string2.bytes.to_a    # => [194, 162]
string3.bytes.to_a    # => [194, 162]

string1.encoding.name # => "UTF-8"
string2.encoding.name # => "UTF-8"
string3.encoding.name # => "ASCII-8BIT"

string1 == string2    # => true
string1 == string3    # => false
string2 == string3    # => true

string1.length        # => 1
string2.length        # => 2
string3.length        # => 2

string1[0]            # => "¢"
string2[0]            # => "\xC2"
string3[0]            # => "\xC2"

string3.unpack('C*') # => [194, 162]
string4 = string3.unpack('C*').pack('C*') # => "\xC2\xA2"
string4.encoding.name # => "ASCII-8BIT"
string4.force_encoding('UTF-8') # => "¢"

string3.force_encoding('UTF-8') # => "¢"
string3.encoding.name # => "UTF-8"