Question

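I'm writing a crawler that uses Hpricot. It downloads a list of strings from a web page, and I then try to write that list to a file. Something is wrong with the encoding: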

"\xC3" from ASCII-8BIT to UTF-8

Some items display correctly on the web page but are printed like this:

DÃ©veloppement

str.encoding returns UTF-8, so force_encoding('UTF-8') doesn't help. How can I convert this to readable UTF-8?

Answer 1

Your string seems to have been encoded incorrectly:

"DÃ©veloppement".encode("iso-8859-1").force_encoding("utf-8")
#=> "Développement"
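A sketch of why this works, assuming the page was really UTF-8 and its bytes were at some point misread as ISO-8859-1 and converted to UTF-8 a second time. The variable names are illustrative only, and the garbled string is written with \u escapes so the snippet does not depend on the source file's encoding:

```ruby
# Assumption: UTF-8 bytes were once decoded as ISO-8859-1, giving the
# garbled text from the question (U+00C3 U+00A9 instead of U+00E9).
garbled = "D\u00C3\u00A9veloppement"

# Step 1: transcode to ISO-8859-1, which maps each character back to
# the single byte it was originally read from.
latin1 = garbled.encode("iso-8859-1")
bytes_hex = latin1.bytes.first(3).map { |b| format("%02X", b) }

# Step 2: relabel a copy of those bytes as UTF-8 without changing them.
fixed = latin1.dup.force_encoding("utf-8")

puts bytes_hex.join(" ")
puts fixed
```

The bytes C3 A9 are the UTF-8 encoding of é, so relabelling them as UTF-8 yields the intended word. If the damage happened some other way, this chain may raise or produce different garbage.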

Answer 2

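It seems your string thinks it is UTF-8, but in reality it is something else, probably ISO-8859-1.

First define (force) the correct encoding, then convert it to UTF-8.

In your example: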

puts "DÃ©veloppement".encode('iso-8859-1').encode('utf-8')

Another option is:

puts "\xC3".force_encoding('iso-8859-1').encode('utf-8') #-> Ã

If Ã makes no sense, try another encoding.
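The difference between the two calls matters here: force_encoding only relabels the existing bytes, while encode rewrites them into the target encoding. Trying several candidate encodings could look roughly like the sketch below, assuming the raw bytes are available as a binary string. The sample bytes and the candidate list are made up for illustration:

```ruby
# Hypothetical raw bytes as they might arrive from the network:
# "Développement" in ISO-8859-1, tagged as binary (ASCII-8BIT).
raw = "D\xE9veloppement".dup.force_encoding("ASCII-8BIT")

# Candidate source encodings to try; this list is an assumption,
# not an exhaustive one.
candidates = ["UTF-8", "ISO-8859-1", "Windows-1252"]

results = candidates.map do |name|
  attempt = raw.dup.force_encoding(name)  # relabel only, bytes unchanged
  # Transcode to UTF-8 only if the bytes are well-formed in that encoding.
  text = attempt.valid_encoding? ? attempt.encode("UTF-8") : nil
  [name, text]
end

results.each { |name, text| puts "#{name}: #{text.inspect}" }
```

A nil result means the bytes are not well-formed in that encoding. Single-byte encodings such as ISO-8859-1 accept any byte sequence, so a non-nil result still has to be checked by eye.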

Answer 3

"ruby 1.9: invalid byte sequence in UTF-8" describes another good approach that needs less code:

file_contents.encode!('UTF-16', 'UTF-8')
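On its own, that call leaves the string in UTF-16 and raises if the input contains invalid bytes. A fuller sketch of the round trip, assuming the goal is to drop invalid bytes and end up with UTF-8 again (the sample input is hypothetical):

```ruby
# Hypothetical input: UTF-8 text with one stray byte (\xC3) that is
# not part of a valid UTF-8 sequence.
file_contents = "D\u00E9veloppement" + " \xC3 web"
was_valid = file_contents.valid_encoding?

# Transcode to UTF-16, replacing invalid UTF-8 sequences with nothing...
file_contents.encode!("UTF-16", "UTF-8", invalid: :replace, replace: "")
# ...then transcode back so the result is UTF-8 again.
file_contents.encode!("UTF-8", "UTF-16")

puts was_valid
puts file_contents.valid_encoding?
puts file_contents
```

The detour through UTF-16 forces a real conversion, which is what allows the invalid: :replace option to take effect. On Ruby 2.1 and later, String#scrub is a more direct way to remove invalid bytes. Note that this only discards bad bytes; it does not repair text that was decoded with the wrong encoding.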