Pound sign £ causes PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3

Time: 2016-02-03 23:02:14

Tags: ruby postgresql ruby-on-rails-4 encoding utf-8

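When I collect information containing the pound sign '£' from an external source (my bank, for example) via a CSV file and save it to Postgres using ActiveRecord, I get the error: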

  

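PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3

0xa3 is the hex code for the £ sign in single-byte encodings such as ISO-8859-1. The received wisdom is to specify UTF-8 explicitly on the string while replacing invalid byte sequences.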

string.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})

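This stops the error, but it is a lossy fix because '£' is converted to '?'.

UTF-8 can represent the '£' sign, so what can be done to fix the invalid byte sequence while preserving the '£' sign?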

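1 answer:

Answer 0: (score: 2)

I am answering my own question, with thanks to Michael Fuhr, who explained that the UTF-8 byte sequence for the pound sign is 0xc2 0xa3. So all you have to do is find each occurrence of 0xa3 (163) and put 0xc2 (194) in front of it.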
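To make the byte-level picture concrete, here is a small sketch of why a lone 0xa3 is rejected and why prefixing 0xc2 repairs it. The variable names are illustrative only, and ISO-8859-1 is used as an example of a single-byte encoding in which £ is 0xa3; the source does not say which encoding the bank's file actually uses.

```ruby
# The pound sign as a proper UTF-8 string (code point U+00A3).
pound = "\u00A3"

# In UTF-8 the pound sign is the two-byte sequence 0xc2 0xa3.
utf8_bytes = pound.bytes

# In ISO-8859-1 (assumed here as an example) it is the single byte 0xa3.
latin1_bytes = pound.encode('ISO-8859-1').bytes

# A lone 0xa3 byte labelled as UTF-8 is not a valid sequence,
# which is what PostgreSQL is complaining about.
lone = [0xA3].pack('C*').force_encoding('UTF-8')
lone_valid = lone.valid_encoding?

# Putting 0xc2 in front of it yields a valid UTF-8 pound sign.
fixed = [0xC2, 0xA3].pack('C*').force_encoding('UTF-8')
fixed_valid = fixed.valid_encoding?
```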

array_bytes = string.bytes
changed = false   # set to true once a 0xc2 byte has been inserted
search_from = 0   # absolute index at which the next search starts
# Look for the £ byte (0xa3 = 163)
while (pound_ptr = array_bytes[search_from..-1].index(163))
  pound_ptr += search_from # convert the slice-relative index to an absolute one
  # The following statement finds an incorrectly sequenced £ sign,
  # i.e. a 0xa3 that is not preceded by 0xc2 (194)
  if pound_ptr == 0 || array_bytes[pound_ptr - 1] != 194
    array_bytes.insert(pound_ptr, 194)
    pound_ptr += 1 # the 0xa3 byte has moved one place to the right
    changed = true
  end
  # Search the remainder of the array for the next pound sign
  search_from = pound_ptr + 1
end
# Convert bytes to 8-bit unsigned chars and label the result as UTF-8
string = array_bytes.pack('C*').force_encoding('UTF-8') if changed
# Can now write the string to the model without the invalid byte sequence error
hash["description"] = string
Model.create!(hash)
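
The byte-insertion approach above only repairs £, and it would also alter a 0xa3 that legitimately follows a lead byte other than 0xc2 (for example 0xc3 0xa3, which is 'ã' in UTF-8). If the whole CSV file is in a single-byte encoding such as ISO-8859-1, which is an assumption the source does not confirm, a more general sketch is to transcode the string rather than patch individual bytes. The helper name to_utf8 and its default argument are hypothetical.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of raw.
# Assumes that any input which is not already valid UTF-8 is
# entirely encoded in assumed_source (ISO-8859-1 by default).
def to_utf8(raw, assumed_source = Encoding::ISO_8859_1)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  # Already valid UTF-8: leave the bytes alone.
  return utf8 if utf8.valid_encoding?

  # Otherwise reinterpret the bytes in the assumed source encoding
  # and transcode them, so 0xa3 becomes 0xc2 0xa3.
  raw.dup.force_encoding(assumed_source).encode(Encoding::UTF_8)
end

# Example input: the bytes for "A£5" as they would appear in ISO-8859-1.
raw_description = [0x41, 0xA3, 0x35].pack('C*')
description = to_utf8(raw_description)
```

The result could then be assigned to hash["description"] as before. This would not suit input that mixes valid UTF-8 with stray single-byte characters, since the valid multi-byte sequences would be misread as well.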

I have had a lot of help from this Stack Overflow forum, and I hope this helps someone else.