Question

I need to parse an HTML table that has the following format:

require 'nokogiri'

html_table = '<table>
    <tbody>
        <tr>
            <td>Some text in the first row!</td>
            <td>More text in the first row!</td>
        </tr>
        <td>Some text in the second row!</td>
        <td>More text in the second row!</td> </tr>
        <td>Some text in the third row!</td>
        <td>More text in the third row!</td>  </tr>
    </tbody>
</table>'

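As you can see, the last two rows have no opening <tr> tag. When I try to get all three rows with puts Nokogiri::HTML(html_table).css('table tr'), the markup gets sanitized and the last two rows turn into bare td nodes: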

<tr>
    <td>Some text in the first row!</td>
    <td>More text in the first row!</td>
</tr>

I have found ways online to fix this when the closing </tr> tag is missing, but not the other way around. Is there a simple way to solve this with Nokogiri?

Answer 1

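I think this is caused by the way Nokogiri parses the markup. One possible solution is the Nokogumbo gem, which extends Nokogiri with an HTML5 parser that handles this kind of markup more correctly. Install it with: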

gem install nokogumbo

Then use it instead of plain Nokogiri:

require 'nokogumbo' # nokogumbo also loads Nokogiri, so there is no need for: require 'nokogiri'
Nokogiri::HTML5(source_code).css('table tr').each do |row|
  p row
end

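Note that you need to pass in the page's source code; with this parser the rows are handled properly wherever an opening tag is missing. You can fetch the source of the website as follows, though of course this assumes there is only one table on the page.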

require 'open-uri'
# URI.open comes from open-uri; plain Kernel#open no longer accepts URLs on Ruby 3.0+
source_code = URI.open('http://www.url_to_website_I_want_to_parse.com')

Make sure the variable source_code is declared at the beginning, before the parsing code runs.
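
If switching parsers is not an option, a different approach (not what the answer above does, just a minimal sketch) is to repair the string before handing it to Nokogiri. The helper name insert_missing_tr is hypothetical and not part of any library. It assumes a single, non-nested table and does not account for tags inside comments or scripts.

```ruby
# Hypothetical helper, not part of Nokogiri: adds an opening <tr> in front of
# any <td>/<th> that starts while no row is open.
def insert_missing_tr(html)
  in_row = false
  # Visit only <tr>, <td> and <th> tags (opening or closing), in document order.
  html.gsub(%r{<(/?)(tr|td|th)\b[^>]*>}i) do |tag|
    closing = Regexp.last_match(1) == '/'
    name    = Regexp.last_match(2).downcase
    if name == 'tr'
      in_row = !closing        # <tr> opens a row, </tr> closes it
      tag
    elsif !closing && !in_row
      in_row = true            # a cell started outside a row: open one first
      "<tr>#{tag}"
    else
      tag
    end
  end
end

# Same shape as the table in the question: rows two and three lack <tr>.
html_table = <<~HTML
  <table>
    <tbody>
      <tr>
        <td>Some text in the first row!</td>
        <td>More text in the first row!</td>
      </tr>
      <td>Some text in the second row!</td>
      <td>More text in the second row!</td> </tr>
      <td>Some text in the third row!</td>
      <td>More text in the third row!</td> </tr>
    </tbody>
  </table>
HTML

fixed = insert_missing_tr(html_table)
puts fixed.scan(/<tr>/i).size # prints 3
```

The repaired string can then go through the same call as in the question, Nokogiri::HTML(fixed).css('table tr'), which should now see three rows.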
