Display Nokogiri child nodes as raw HTML instead of &lt;tag&gt;

Date: 2015-10-23 19:27:00

Tags: ruby nokogiri html-entities

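I am converting an XML table into an HTML table and have to rearrange the nodes along the way.

To do the conversion, I scrape the XML, put it into a two-dimensional array, and then build new HTML for output.

However, some of the cells contain HTML tags, and after the conversion <su> becomes &lt;su&gt;.

The XML data is: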

<BOXHD>
  <CHED H="1">Disc diameter, inches (cm)</CHED>
  <CHED H="1">One-half or more of disc covered</CHED>
  <CHED H="2">Number <SU>1</SU>
  </CHED>
  <CHED H="2">Exhaust foot <SU>3</SU>/min.</CHED>
  <CHED H="1">Disc not covered</CHED>
  <CHED H="2">Number <SU>1</SU>
  </CHED>
  <CHED H="2">Exhaust foot<SU>3</SU>/min.</CHED>
</BOXHD>

These are my steps for converting it to an HTML table:

require 'nokogiri'

class TableCell
  attr_accessor :text, :rowspan, :colspan

  def initialize(text = '')
    @text = text
    @rowspan = 1
    @colspan = 1
  end
end

@frag = Nokogiri::HTML(xml)

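# make a 2d array to store how the cells should be arranged
# (assumption: the snippet never initializes @data; one inner array
# per header level, H="1" and H="2", is used here)
@data = [[], []]
column = 0
prev_row = -1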
@frag.xpath("boxhd/ched").each do |ched|
  row = ched.xpath("@h").first.value.to_i - 1
  column += 1 if row <= prev_row
  prev_row = row
  @data[row][column] = TableCell.new(ched.inner_html)
end  

# methods to find colspan and rowspan, put them in @data
# ... snip ...

# now build an html table
doc = Nokogiri::HTML::DocumentFragment.parse ""
Nokogiri::HTML::Builder.with(doc) do |html|
  html.table {
    @data.each do |tr|
      html.tr {
        tr.each do |th|
          next if th.nil?
          html.th(:rowspan => th.rowspan, :colspan => th.colspan).table_header th.text
        end
      }
    end
  }
end

This produces the following HTML (note that the superscripts are escaped):

<table>
    <tr>
        <th rowspan="2" colspan="1" class="table_header">Disc diameter, inches (cm)</th>
        <th rowspan="1" colspan="2" class="table_header">One-half or more of disc covered</th>
        <th rowspan="1" colspan="2" class="table_header">Disc not covered</th>
    </tr>
    <tr>
        <th rowspan="1" colspan="1" class="table_header">Number &lt;su&gt;1&lt;/su&gt; </th>
        <th rowspan="1" colspan="1" class="table_header">Exhaust foot &lt;su&gt;3&lt;/su&gt;/min.</th>
        <th rowspan="1" colspan="1" class="table_header">Number &lt;su&gt;1&lt;/su&gt;</th>
        <th rowspan="1" colspan="1" class="table_header">Exhaust foot&lt;su&gt;3&lt;/su&gt;/min.</th>
    </tr>
</table>

How do I get the raw HTML instead of entities?

I tried these without success:

@data[row][column] = TableCell.new(ched.children)
@data[row][column] = TableCell.new(ched.children.to_s)
@data[row][column] = TableCell.new(ched.to_s)

2 answers:

Answer 0 (score: 1)

This might help you understand what's happening:

require 'nokogiri'

doc = Nokogiri::XML('<root><foo></foo></root>')

doc.at('foo').content = '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo>&lt;html&gt;&lt;body&gt;bar&lt;/body&gt;&lt;/html&gt;</foo>\n</root>\n"

doc.at('foo').children = '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo>\n <html>\n <body>bar</body>\n </html>\n </foo>\n</root>\n"

doc.at('foo').children = Nokogiri::XML::Document.new.create_cdata '<html><body>bar</body></html>'
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo><![CDATA[<html><body>bar</body></html>]]></foo>\n</root>\n"

Answer 1 (score: 0)

I gave up on the builder and simply built the HTML by hand:

def html_headers
  rows = []
  @data.each do |row|
    cells = []
    row.each do |cell|
      next if cell.nil?

      # cell.text is interpolated as-is, so any markup in it stays raw
      cells << format('<th rowspan="%d" colspan="%d">%s</th>',
                      cell.rowspan, cell.colspan, cell.text)
    end
    rows << format('<tr>%s</tr>', cells.join)
  end
  rows.join
end

# call the method only after it has been defined
headers = html_headers

def replace_nodes(headers)

  # ... snip ...

  @frag.xpath("boxhd").each do |old|
    puts "replacing boxhd..."
    old.replace headers
  end

  # ... snip ...

end

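I don't understand why, but it seems that the text I replaced the <BOXHD> tag with gets parsed and becomes searchable, because I was able to change the tag names of elements that came from the data in cell.text.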