Question

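I am using Nokogiri to parse HTML. I need both the text and the image tags from the page, so I use inner_html instead of the content method. However, the value returned by content is encoded correctly, while the value returned by inner_html is not. Note that the page is in Chinese and does not use UTF-8 encoding.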

Here is my code:

# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'iconv'

doc = Nokogiri::HTML.parse(open("http://www.sfzt.org/advise/view.asp?id=536"), nil, 'gb18030')

doc.css('td.font_info').each do |link|
  # Output: correctly encoded, but missing the markup I need: 目前市面上影响比
  puts link.content

  # Output: wrongly encoded, not what I expect: <img ....></img>Ŀǰ??????Ӱ??Ƚϴ?Ľ????
  # Expected: <img ....></img>目前市面上影响比
  puts link.inner_html
end

Answer 1

This is covered in the 'Encoding' section of the README: http://nokogiri.org/

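Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.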

So, if you want the result as a UTF-8 string, you have to convert the inner_html string manually:

puts link.inner_html.encode('utf-8') # for 1.9.x
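
The conversion step can be tried without Nokogiri or a network connection. The sketch below is pure Ruby and rests on assumptions: gb_html is only a stand-in for what link.inner_html would return for a GB18030 document, and to_utf8 is a hypothetical helper, not part of Nokogiri. It also covers the case where the string has lost its encoding label and is tagged as binary, in which case it has to be relabeled before it can be transcoded.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Nokogiri): return a UTF-8 copy of markup
# whose bytes are in `source_encoding`.
def to_utf8(html, source_encoding = "GB18030")
  # A binary-tagged string carries no usable encoding label, so relabel a
  # copy first. force_encoding changes only the label, never the bytes.
  if html.encoding == Encoding::ASCII_8BIT
    html = html.dup.force_encoding(source_encoding)
  end
  # encode transcodes the bytes from the labeled encoding into UTF-8.
  html.encode("UTF-8")
end

expected = "<img src=\"a.gif\">目前市面上影响比"
gb_html  = expected.encode("GB18030") # stand-in for link.inner_html

puts gb_html.encoding                 # GB18030
puts to_utf8(gb_html) == expected     # true
puts to_utf8(gb_html.b) == expected   # true (String#b gives a binary-tagged copy)
```

Printing gb_html directly to a UTF-8 terminal would show the same kind of garbled text as in the question, because the terminal would interpret GB18030 bytes as UTF-8.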

Answer 2

I think content strips out the tags well, but the inner_html method does not handle this very well.

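"I think you will end up in some pretty weird states if you change the inner_html (which contains tags) while you are traversing. In other words, if you are traversing a node tree, you shouldn't do anything that could add or remove nodes."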

Try this:

doc.css('td.font_info').each do |link|
  puts link.content
  some_stuff = link.inner_html
  link.children = Nokogiri::HTML.fragment(some_stuff, 'utf-8')
end

Incorrectly encoded HTML extracted by Nokogiri
