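In Ruby 1.9.3-429, I am trying to parse plain-text files in various encodings and ultimately convert their contents to UTF-8 strings. Non-ASCII characters work fine when the file is encoded as UTF-8, but non-UTF-8 files cause problems.
Simplified example: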
File.open(file) do |io|
  # External encoding is the file's charset; internal encoding is UTF-8.
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil
  # Read one character at a time until end of file or end of line.
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end
  line
end
Both files contain only the string áÁð, encoded appropriately. I have verified the encoding of each file with $ file -i <file_name>.
With the UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With the ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
My interpretation is that readchar returns an incorrectly converted string, which in turn causes slice to return the wrong result.
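To make that interpretation concrete, here is a minimal sketch of one way a single character can end up reporting two code points: its UTF-8 bytes are kept but labeled as ISO-8859-1. This is only an illustration of the symptom, not a confirmed explanation of what readchar does internally. The helper name `mislabel` is hypothetical and not part of the original script.

```ruby
# Hypothetical helper for illustration only: keeps the bytes of a UTF-8
# string but relabels them as a different (single-byte) encoding.
def mislabel(utf8_string, wrong_encoding = "ISO-8859-1")
  utf8_string.dup.force_encoding(wrong_encoding)
end

correct = "\u00E1"           # "á" as UTF-8: 1 code point, 2 bytes
broken  = mislabel(correct)  # same 2 bytes, now read as 2 Latin-1 characters

puts correct.each_codepoint.count   # 1
puts broken.each_codepoint.count    # 2
puts broken == broken.slice(0, 1)   # false, which matches "SLICE FAIL"
```

If the strings returned by readchar look like `broken` here (compare `char.encoding` and `char.bytes.to_a`), that would suggest the bytes were transcoded but the encoding label was not updated.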
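Is this behavior correct, or am I specifying the file's external encoding incorrectly? I would rather not rewrite this process, so I am hoping I have made a mistake somewhere. I have reasons for parsing the file this way, but I don't think they are relevant to my question. Specifying the internal and external encodings as options to File.open produces the same result.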
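For comparison, here is a sketch of an alternative that sets only the external encoding and converts each character explicitly with String#encode, so no IO-level transcoding is involved. It assumes an ASCII-compatible charset such as ISO-8859-1, and the method name `read_first_line_utf8` is hypothetical. I have not verified how it behaves on 1.9.3-429 specifically.

```ruby
require "tempfile"

# Hypothetical helper: reads the first line character by character and
# returns it as a UTF-8 string. `charset` names the file's actual encoding
# and is assumed to be ASCII-compatible.
def read_first_line_utf8(path, charset)
  line = String.new.force_encoding(Encoding::UTF_8)
  File.open(path, "r:#{charset}") do |io|
    io.each_char do |char|
      break if char == "\n" || char == "\r"
      # Convert explicitly instead of relying on an internal encoding.
      line << char.encode(Encoding::UTF_8)
    end
  end
  line
end

# Build a small ISO-8859-1 test file containing áÁð and a newline.
tmp = Tempfile.new("latin1")
tmp.binmode
tmp.write("\u00E1\u00C1\u00F0\n".encode("ISO-8859-1"))
tmp.close

read_first_line_utf8(tmp.path, "ISO-8859-1").each_char do |char|
  puts "Character #{char} has #{char.each_codepoint.count} codepoints"
end
tmp.unlink
```

Each character should report a single code point here, because every string appended to `line` has been converted and labeled as UTF-8 by String#encode itself.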