Question

How do I use Nokogiri to recursively capture all text along with its formatting tags?

<div id="1">
  This is text in the TD with <strong> strong </strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

For example, I would like to capture:

"This is text in the TD with <strong> strong </strong> tags" 

"This is a child node. with <b> bold </b> tags"

"another line of text to a <a href="link.html"> link </a>"

"This is text inside a div <em>inside<em> another div inside a paragraph tag"

I can't just use .text(), because it strips the formatting tags, and I don't know how to recurse.

Added detail: Sanitize looks like an interesting gem, and I'm reading up on it now. However, some additional information may clarify what I need to do.

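I need to traverse each node, get the text, process it, and put it back. So I would grab the text from "This is text in the TD with <strong> strong </strong> tags" and modify it to "This is modified text in the TD with <strong> strong </strong> tags". Then go to the next tag in div 1 and get the text, "This is a child node. with <b> bold </b> tags", modify it to "This is a modified child node. with <b> bold </b> tags", and put it back. Go to the next div, #2, and grab the text, "another line of text to a <a href="link.html"> link </a>", modify it to "another line of modified text to a <a href="link.html"> link </a>", and put it back. Then go to the next node in div #2 and grab the text from the paragraph tag, modifying it to "This is modified text inside a div <em>inside<em> another div inside a paragraph tag".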

So after everything has been processed, the new HTML should look like this...

<div id="1">
  This is modified text in the TD with <strong> strong </strong> tags
  <p>This is a modified child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of modified text to a <a href="link.html"> link </a>"
      <p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

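My quasi-code is below, but I'm really stuck on two parts. First, grabbing only the text with its formatting (Sanitize helps, but Sanitize grabs all of the tags). I need to preserve the formatted text's formatting, including whitespace and so on, without grabbing the children of unrelated tags. Second, traversing all of the children that are directly related to the full-text tags.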

# Quasi-code
doc = Nokogiri.HTML(html)
kids = doc.at('div#1')
text_kids = kids.descendant_elements
text_kids.each do |i|
  # Grab the full text (full sentences and paragraphs) with formatting tags.
  # Currently, I have no way to grab just the text with formatting and not the other tags.
  # full_text_w_formatting is a placeholder for the method I am missing.
  modified_text = processing_code(i.full_text_w_formatting)
  i.full_text_w_formatting = modified_text
end

def processing_code(string)
  # Code to process the string (not relevant for this example)
  return modified_string
end


# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
    # This is flawed because it grabs every child and even
    # splits it based on any tag.
    # I need to traverse down only the text-related children.
    element_children.map { |kid|
      [kid, kid.descendant_elements]
    }.flatten
  end
end

Answer 1

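I use a two-step strategy: Nokogiri to extract the content you want, then a blacklist/whitelist program to strip the tags you don't want or keep the ones you do.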

require 'nokogiri'
require 'sanitize'

html = '
<div id="1">
  This is text in the TD with <strong> strong <strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>
'

doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html

This captures the contents of <div id="1"> as an HTML string:

      This is text in the TD with <strong> strong <strong> tags
      <p>This is a child node. with <b> bold </b> tags</p>
      <div id="2">
          "another line of text to a <a href="link.html"> link </a>"
          <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
      </div>
    </strong></strong>

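The trailing </strong></strong> is the result of the two opening <strong> tags. That might have been deliberate, but without closing tags, Nokogiri does some fix-ups to make the HTML correct.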

Pass html_fragment to the Sanitize gem:

doc = Sanitize.clean(
  html_fragment,
  :elements   => %w[ a b em strong ],
  :attributes => {
    'a'    => %w[ href ],
  },
)

The returned text looks like this:

 This is text in the TD with <strong> strong <strong> tags
  This is a child node. with <b> bold </b> tags 

      "another line of text to a <a href="link.html"> link </a>"
        This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em> 

</strong></strong>

Again, because the HTML is malformed and lacks the closing </strong> tags, there are two trailing closing tags.
