Question

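I've run into a problem and need a quick solution.

I want to remove the "br" and "p" tags inside every "table", but not outside of them.

For example:

Initial HTML document: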

...
<p>Hello</p>
<table>
  <tr>
    <td><p>Text example <br>continues...</p></td>
    <td><p>Text example <br>continues...</p></td>
    <td><p>Text example <br>continues...</p></td>
    <td><p>Text example <br>continues...</p></td>
  </tr>
</table>
<p>Bye<br></p>
<p>Bye<br></p>
...

My goal:

...
<p>Hello</p>
<table>
  <tr>
    <td>Text example continues...</td>
    <td>Text example continues...</td>
    <td>Text example continues...</td>
    <td>Text example continues...</td>
  </tr>
</table>
<p>Bye<br></p>
<p>Bye<br></p>
...

Right now, this is how I clean it up:

loop do
  if html.match(/<table>(.*?)(<\/?(p|br)*?>)(.*?)<\/table>/) != nil
    html = html.gsub(/<table>(.*?)(<\/?(p|br)*?>)(.*?)<\/table>/,'<table>\1 \4</table>')
  else
    break
  end
end

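This works fine, but the problem is that I have 1xxx files, each about 1000 lines long... and each one takes 1-3 hours. ((1-3 hours) * (thousands of files)) = pain!

I'd like to do this with Sanitize or some other approach, but so far I haven't found a way.

Can anyone help me?

Thanks in advance! Manu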

Answer 1

Using Nokogiri:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-_HTML_
<p>Hello</p>
<table>
  <tr>
    <td><p>Text example <br>continues...</p></td>
    <td><p>Text example <br>continues...</p></td>
    <td><p>Text example <br>continues...</p></td>
    <td><p>Text example <br>continues...</p></td>
  </tr>
</table>
<p>Bye<br></p>
<p>Bye<br></p>
_HTML_

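# Replace each <p> inside a table cell with its plain text.
# el.text drops child tags, so the nested <br> disappears as well.
doc.xpath("//table/tr/td/p").each do |el|
  el.replace(el.text)
end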

puts doc.to_html

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Hello</p>
<table><tr>
<td>Text example continues...</td>
<td>Text example continues...</td>
<td>Text example continues...</td>
<td>Text example continues...</td>
</tr></table>
<p>Bye<br></p>
<p>Bye<br></p>
</body>
</html>
