Question

How can I parse and group this sample HTML with Ruby?

The HTML text:

<h2>heading one</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>

<h2>heading two</h2>
<p>different content in here <a>test</a> <b>test</b></p>

<h2>heading three</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>

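The elements are not nested, and I want to group them by heading. When I find an <h2>, I want to extract its text and everything that follows it until the next <h2>. The last heading has no following h2 to act as a delimiter.

Here is sample output: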

- Heading one
"<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>"

- Heading 2
"<p>different content in here <a>test</a> <b>test</b></p>"

Answer 1

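You can do this very quickly with Nokogiri, without having to parse the HTML with regular expressions.

You will be able to get the h2 elements and then extract their content.

Some examples are at https://www.rubyguides.com/2012/01/parsing-html-in-ruby/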

Answer 2

This should work.
Group 1 contains the heading text and group 2 contains the body.

Whitespace trimming is included.

/<h2\s*>\s*([\S\s]*?)\s*<\/h2\s*>\s*([\S\s]*?)(?=\s*<h2\s*>|\s*$)/

https://regex101.com/r/pgLIi0/1

Readable regex

 <h2 \s* >
 \s*     
 ( [\S\s]*? )                  # (1) Heading
 \s* 
 </h2 \s* >
 \s*   
 ( [\S\s]*? )                  # (2) Body
 (?= \s* <h2 \s* > | \s* $ )
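
As a sketch of how the pattern above might be applied from Ruby, the snippet below runs it through String#scan. One deliberate change, which is an assumption of this sketch and not part of the answer: in Ruby, $ matches at the end of every line, not only at the end of the string, so the final alternative uses \z here. With $ the lazy body group would stop at the first line break. The names section_re and sections and the shortened sample content are made up for illustration.

```ruby
html = <<~HTML
  <h2>heading one</h2>
  <p>content 1a</p>
  <p>content 1b</p>

  <h2>heading two</h2>
  <p>content 2a</p>
HTML

# Same structure as the readable pattern, written in extended (x) mode.
# Assumption: \z replaces $ because Ruby's $ also matches before each "\n".
section_re = %r{
  <h2\s*>\s*
  ([\S\s]*?)                 # (1) heading
  \s*</h2\s*>\s*
  ([\S\s]*?)                 # (2) body
  (?=\s*<h2\s*>|\s*\z)
}x

# scan returns one [heading, body] pair per match.
sections = html.scan(section_re)

sections.each do |heading, body|
  puts "- #{heading}"
  puts body
  puts
end
```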

Answer 3

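I'd strongly recommend against what you're trying to do, and "RegEx match open tags except XHTML self-contained tags" helps explain why. Patterns should be used only in the most trivial cases, where you own the code that generates the HTML. If you don't own the generator, any change in the HTML can break your code, often irreparably, and typically late at night during a critical outage with your boss hounding you to get it running immediately.

Using Nokogiri, this will get you into the ballpark in a more robust and recommended way. This example only collects the h2 nodes and the p nodes that follow them. Figuring out how to display them is left as an exercise.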

require 'nokogiri'

html = <<EOT
<h2>heading 1</h2>
<p>content 1a<b>test</b></p>
<p>content 1b</p>

<h2>heading 2</h2>
<p>content 2a</p>
EOT

doc = Nokogiri::HTML.parse(html)

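output = doc.search('h2').map { |h|
  paragraphs = []
  next_node = h.next_sibling

  # Walk the siblings that follow this h2. An h2 with nothing after it
  # yields [h, []]; it no longer breaks out of the whole map.
  while next_node
    case
    when next_node.text? && next_node.blank?
      # whitespace-only text node between tags: skip it
    when next_node.name == 'p'
      paragraphs << next_node
    else
      break # any other node, including the next h2, ends the group
    end

    next_node = next_node.next_sibling
  end

  [h, paragraphs]
}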

This leaves output holding an array of arrays that contain the nodes:

# => [[#(Element:0x3ff4e4034be8 {
#        name = "h2",
#        children = [ #(Text "heading 1")]
#        }),
#      [#(Element:0x3ff4e4034b98 {
#         name = "p",
#         children = [
#           #(Text "content 1a"),
#           #(Element:0x3ff4e3807ccc {
#             name = "b",
#             children = [ #(Text "test")]
#             })]
#         }),
#       #(Element:0x3ff4e4034ad0 {
#         name = "p",
#         children = [ #(Text "content 1b")]
#         })]],
#     [#(Element:0x3ff4e4034a6c {
#        name = "h2",
#        children = [ #(Text "heading 2")]
#        }),
#      [#(Element:0x3ff4e40349a4 {
#         name = "p",
#         children = [ #(Text "content 2a")]
#         })]]]
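
The display step left as an exercise could look something like the sketch below, which prints the [heading, paragraphs] pairs in the layout the question asks for. render_groups is a hypothetical helper, and FakeNode is a stand-in so the snippet runs without Nokogiri. Real Nokogiri elements respond to the same two methods (text and to_html), so the output array from the code above could be passed in directly.

```ruby
# Stand-in for Nokogiri nodes (hypothetical); only text and to_html are used.
FakeNode = Struct.new(:text, :to_html)

# Hypothetical helper: turns [[h2, [p, ...]], ...] into the question's layout.
def render_groups(pairs)
  pairs.map { |heading, paragraphs|
    body = paragraphs.map(&:to_html).join("\n")
    "- #{heading.text.capitalize}\n\"#{body}\""
  }.join("\n\n")
end

output = [
  [FakeNode.new('heading 1', '<h2>heading 1</h2>'),
   [FakeNode.new('content 1atest', '<p>content 1a<b>test</b></p>'),
    FakeNode.new('content 1b', '<p>content 1b</p>')]],
  [FakeNode.new('heading 2', '<h2>heading 2</h2>'),
   [FakeNode.new('content 2a', '<p>content 2a</p>')]]
]

puts render_groups(output)
```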

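The code also makes some assumptions about the format of the HTML, but it won't spit out garbage if that format changes. It assumes a format like this: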

<h2>
<p>
...

where an h2 is always followed by p tags until some other tag appears, including a subsequent h2.

This test:

when next_node.text? && next_node.blank?

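is necessary because HTML does not require formatting, but when it is formatted, whitespace-only "TEXT" nodes are inserted, and they produce the indentation we expect from "pretty" HTML. The parser and the browser don't care whether that whitespace is there, except in preformatted text; only humans do. It is actually better not to have those nodes, because they bloat the file and slow its transfer. But people are picky that way. In reality, the HTML sample in the code looks more like: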

<h2>heading 1</h2>\n<p>content 1a<b>test</b></p>\n<p>content 1b</p>\n\n<h2>heading 2</h2>\n<p>content 2a</p>\n

and the when statement ignores those "\n" nodes.
