My goal is to modify the sentences inside tags.
For example, change this:
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a <a href="link.html"> link </a>"
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
to this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a <a href="link.html"> link </a>"
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
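This means I need to walk the nodes, grab each tag, and get all of its text and style (inline formatting) nodes, but not its child tags. Then I modify the sentence and put it back. I need to do this for every tag, using its full text, until everything has been modified.
For example, the text and style nodes grabbed for div#1 would be:
"This is text in the TD with <strong> strong </strong> tags"
As you can see, none of the other text below it is grabbed. The result should be accessible and modifiable through a variable: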
div#1.text_with_formating= "This is modified text in the TD with <strong> strong </strong> tags"
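The following code removes everything, not just the child tags, when what should remain is all of the content directly under div#1, including its inline tags. I am therefore not sure how to proceed.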
Sanitize.clean(h,{:elements => %w[b em i strong u],:remove_contents=>'true'})
How would you recommend solving this?
Answer 0 (score: 1)
If you want to find all text nodes beneath an element, use:
text_pieces = div.xpath('.//text()')
If you only want the text nodes that are direct children of the element, use:
text_pieces = div.xpath('text()')
For each text node, you can change the content however you like. However, you must be sure to use my_text_node.content = ... instead of my_text_node.content.gsub!(...).
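The reason the assignment form matters: as far as I know, Nokogiri's content reader builds a fresh String on every call, so mutating that string in place never reaches the node. The sketch below is not Nokogiri. FakeTextNode is a hypothetical stand-in that mimics only that one behavior in plain Ruby, so the difference can be seen without any gems.

```ruby
# Hypothetical stand-in for a text node (NOT Nokogiri's class).
# Assumption: like Nokogiri's reader, `content` returns a new String each call.
class FakeTextNode
  def initialize(text)
    @text = text.dup
  end

  # The reader hands back a copy, so callers cannot mutate the stored text.
  def content
    @text.dup
  end

  # The writer is the only way to change the stored text.
  def content=(new_text)
    @text = new_text.to_s.dup
  end
end

node = FakeTextNode.new("This is text")

# In-place mutation changes only the temporary copy.
node.content.gsub!("text", "modified text")
after_bang = node.content

# Assigning the result of gsub stores the change on the node.
node.content = node.content.gsub("text", "modified text")
after_assign = node.content

puts after_bang    # This is text
puts after_assign  # This is modified text
```

With a real Nokogiri text node the same pattern applies: compute the new string with gsub, then assign it through content=.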
# Replace text that is a direct child of an element
def gsub_my_text!( el, find, replace=nil, &block )
el.xpath('text()').each do |text|
next if text.content.strip.empty?
text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
end
end
# Replace text beneath an element.
def gsub_text!( el, find, replace=nil, &block )
el.xpath('.//text()').each do |text|
next if text.content.strip.empty?
text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
end
end
d1 = doc.at('#d1')
gsub_my_text!( d1, /[aeiou]+/ ){ |found| found.upcase }
puts d1
#=> <div id="d1">
#=> ThIs Is tExt In thE TD wIth <strong> strong </strong> tAgs
#=> <p>This is a child node. with <b> bold </b> tags</p>
#=> <div id="d2">
#=> "another line of text to a <a href="link.html"> link </a>"
#=> <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
#=> </div>
#=> </div>
gsub_text!( d1, /\w+/, '(\\0)' )
puts d1
#=> <div id="d1">
#=> (ThIs) (Is) (tExt) (In) (thE) (TD) (wIth) <strong> (strong) </strong> (tAgs)
#=> <p>(This) (is) (a) (child) (node). (with) <b> (bold) </b> (tags)</p>
#=> <div id="d2">
#=> "(another) (line) (of) (text) (to) (a) <a href="link.html"> (link) </a>"
#=> <p> (This) (is) (text) (inside) (a) (div) <em>(inside)<em> (another) (div) (inside) (a) (paragraph) (tag)</em></em></p>
#=> </div>
#=> </div>
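Both helpers above forward to String#gsub in one of two forms, a replacement string or a block. Here is a minimal plain-Ruby sketch of the two forms, using a made-up sample string rather than a parsed document:

```ruby
sample = "text in the TD"

# Replacement string: \0 stands for the whole match (written '\\0' in single quotes).
wrapped = sample.gsub(/\w+/, '(\\0)')

# Block form: the block's return value replaces each match.
shouted = sample.gsub(/[aeiou]+/) { |found| found.upcase }

puts wrapped  # (text) (in) (the) (TD)
puts shouted  # tExt In thE TD
```

When both a replacement string and a block are passed to gsub, Ruby uses the string and ignores the block, which is why the helpers only fall back to the block when replace is nil.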
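Edit: The following code lets you extract runs of text plus inline markup as a string, run gsub on that string, and replace the run with the resulting new markup.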
require 'nokogiri'
doc = Nokogiri.HTML '<div id="d1">
Text with <strong>strong</strong> tag.
<p>This is a child node. with <b>bold</b> tags.</p>
<div id=d2>And now we are in <a href="foo">another</a> div.</div>
Hooray for <em>me!</em>
</div>'
module Enumerable
# http://stackoverflow.com/q/4800337/405017
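# Split into runs of consecutive items, dropping the items for which the block is true.
# chunk discards items whose key is nil, so the separators must map to nil.
def split_on() chunk{|o| !yield(o) || nil }.map{|_,run| run } end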
end
require 'set'
# Given a node, call gsub on the `inner_html`
def gsub_markup!( node, find, replace=nil, &replace_block )
allowed = Set.new(%w[strong b em i u strike])
runs = node.children.split_on{ |el| el.node_type==1 && !allowed.include?(el.name) }
runs.each do |nodes|
orig = nodes.map{ |node| node.node_type==3 ? node.content : node.to_html }.join
next if orig.strip.empty? # Skip whitespace-only nodes
result = replace ? orig.gsub(find,replace) : orig.gsub(find,&replace_block)
puts "I'm replacing #{orig.inspect} with #{result.inspect}" if $DEBUG
nodes[1..-1].each(&:remove)
nodes.first.replace(result)
end
end
d1 = doc.at('#d1')
$DEBUG = true
gsub_markup!( d1, /[aeiou]+/, &:upcase )
#=> I'm replacing "\n Text with <strong>strong</strong> tag.\n " with "\n TExt wIth <strOng>strOng</strOng> tAg.\n "
#=> I'm replacing "\n Hooray for <em>me!</em>\n" with "\n HOOrAy fOr <Em>mE!</Em>\n"
puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><div id="d1">
#=> TExt wIth <strong>strOng</strong> tAg.
#=> <p>This is a child node. with <b>bold</b> tags.</p>
#=> <div id="d2">And now we are in <a href="foo">another</a> div.</div>
#=> HOOrAy fOr <em>mE!</em>
#=> </div></body></html>
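The run-splitting step is the core of the approach above: block-level children act as separators, and whatever lies between them is treated as one editable run. Here is a minimal sketch of that idea on plain arrays. split_runs is a hypothetical name, and the symbols are stand-ins for child nodes, not Nokogiri objects.

```ruby
# Hypothetical helper: split a list into runs, dropping separator items.
# Enumerable#chunk discards every item whose block value is nil.
def split_runs(items)
  items.chunk { |item| yield(item) ? nil : true }.map { |_key, run| run }
end

# Symbols stand in for child nodes; :p and :div play the block-level tags.
children = [:text, :strong, :text, :p, :text, :div, :em]
runs = split_runs(children) { |child| %i[p div].include?(child) }

p runs  # [[:text, :strong, :text], [:text], [:em]]
```

Each run can then be serialized, passed through gsub, and swapped back in, while the separators (the nested p and div) are left untouched.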
Answer 1 (score: 0)
The simplest way is:
div = doc.at('div#1')
div.replace div.to_s.sub('text', 'modified text')