Invalid byte sequence when using the HTML sanitizer

Time: 2014-01-27 00:34:52

Tags: ruby-on-rails ruby encoding utf-8 html-sanitizing

I ran into this error when using the Rails HTML::FullSanitizer in the rails console:

h = HTML::FullSanitizer.new
html = "Something with invalid characters \x80 and tags ī."
h.sanitize html

ArgumentError: invalid byte sequence in UTF-8
from /Users/benaluan/.rbenv/versions/1.9.3-p385/lib/ruby/gems/1.9.1/gems/actionpack-3.2.12/lib/action_controller/vendor/html-scanner/html/sanitizer.rb:37:in `sanitize'

What I tried was re-encoding the html before sanitizing:

html = html.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

However, that also removes the ī character. Has anyone run into the same problem?

1 answer:

Answer 0: (score: 1)

Read this article, which describes your problem in detail: http://www.spacevatican.org/2012/7/7/stripping-invalid-utf-8/

The code for the solution from that article:

html = html.force_encoding('UTF-8').
      encode('UTF-16', :invalid => :replace, :replace => '').
      encode('UTF-8')
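
To see why the two approaches differ, here is a small sketch that runs both on the string from the question. The helper name strip_invalid_utf8 is made up for illustration and is not part of Rails or the article. The likely reason the 'binary' attempt drops ī is that, when the source is treated as binary, every byte at or above 0x80 has no UTF-8 mapping and is replaced. The bytes that make up ī (0xC4 0xAB) are therefore removed along with the stray \x80. The UTF-16 round trip keeps the string as UTF-8, so only bytes that are actually invalid are dropped.

```ruby
# encoding: utf-8

# The string from the question: one stray byte (\x80) plus a valid
# multi-byte character (ī).
raw = "Something with invalid characters \x80 and tags ī."

# Hypothetical helper wrapping the round trip shown above.
# dup is used so the caller's string is not mutated by force_encoding.
def strip_invalid_utf8(str)
  str.dup.force_encoding('UTF-8')
     .encode('UTF-16', :invalid => :replace, :replace => '')
     .encode('UTF-8')
end

cleaned = strip_invalid_utf8(raw)

# The attempt from the question: treating the source as binary makes
# every byte >= 0x80 undefined, so ī is stripped together with \x80.
binary_cleaned = raw.encode('UTF-8', 'binary',
                            :invalid => :replace, :undef => :replace, :replace => '')

puts raw.valid_encoding?      # false: \x80 is not valid UTF-8 here
puts cleaned.valid_encoding?  # true
puts cleaned.include?('ī')    # true: the valid character survives
puts binary_cleaned.include?('ī') # false: it was removed

# On Ruby 2.1 and later, String#scrub should give the same result as the
# round trip without the detour through UTF-16.
if raw.respond_to?(:scrub)
  puts raw.scrub('') == cleaned
end
```

The cleaned string can then be passed to h.sanitize as in the question. Note that the question was asked on Ruby 1.9.3, where String#scrub is not available, which is why the round trip is the main path here.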