Regular expression to remove p tags inside li and td tags

Time: 2016-04-21 20:39:34

Tags: ruby regex ruby-on-rails-4

I have this HTML content:

<p>This is a paragraph:</p>
<ul>
<li>
<p>point 1</p>
</li>
<li>
<p>point 2</p>
<ul>
<li>
<p>point 3</p>
</li>
<li>
<p>point 4</p>
</li>
</ul>
</li>
<li>
<p>point 5</p>
</li>
</ul>
<ul>
<li>
<p><strong>sub-head : </strong>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</p>
</li>
<li>
<p><strong>sub-head 2: </strong></p>
<p>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</p>
</li>
</ul>

I want to remove all <p> and </p> tags that appear between <li> and </li>, regardless of their position inside the <li>. Likewise, I need to remove the p tags between td tags inside tables.

So far, this is my controller code:

nogo = {"<li>\n<p>" => '<li>', "</p>\n</li>" => '</li>', "<td>\n<p>" => '<td>', "</p>\n</td>" => '</td>',
  '<p> </p>' => '', '<ul>' => "\n<ul>", '</ul>' => "</ul>\n", '</ol>' => "</ol>\n",
  '<table>' => "\n<table width='100%' border='0' cellspacing='0' cellpadding='0' class='table table-curved'>",
  '&lt;' => '<', '&gt;' => '>', '<br>' => '', '<p></p>' => '', ' rel="nofollow"' => ''}

c=params[:content]
       bundle_out=Sanitize.fragment(c,Sanitize::Config.merge(Sanitize::Config::BASIC,
       :elements=> Sanitize::Config::BASIC[:elements]+['table', 'tbody', 'tr', 'td', 'h1', 'h2', 'h3'],
       :attributes=>{'a' => ['href']}) )#.split(" ").join(" ")

      re = Regexp.new(nogo.keys.map { |x| Regexp.escape(x) }.join('|'))

      @bundle_out=bundle_out.gsub(re, nogo)

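I pass the HTML content above to this code through params[:content], which has already been assigned to the variable c.

Below is the output, which is not what I expected. Some closing and opening p tags still remain between the opening and closing li tags: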

<p>This is a paragraph:</p>

<ul>
<li>point 1</li>
<li>point 2</p>
<ul>
<li>point 3</li>
<li>point 4</li>
</ul>
</li>
<li>point 5</li>
</ul>

<ul>
<li><strong>sub-head : </strong>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</li>
<li><strong>sub-head 2: </strong></p>
<p>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</li>
</ul>

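My goal is simple: I just want to remove all p tags inside li and td tags, but I can't get it to work correctly. Any help is appreciated.

I would like to do this with a regular expression. I know that using regular expressions is not the right way to parse HTML content.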

1 Answer:

Answer 0: (score: 1)

I wouldn't recommend using regular expressions here, because they're a dead end unless the HTML is trivial and you create it yourself. And if you are the one creating it, then modifying it after generating it is the wrong way to go about generating content.
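To see why the stray tags survive in the question's output, here is a minimal sketch. It assumes a reduced copy of the question's `nogo` hash (only the two `li` keys) and a shortened excerpt of the sample HTML; both are cut down for illustration. Because the keys are literal strings, a `</p>` is only removed when it is immediately followed by a newline and `</li>`. A `</p>` followed by a nested `<ul>` or by another `<p>` never matches.

```ruby
# Reduced version of the question's `nogo` hash: literal keys only.
nogo = {
  "<li>\n<p>"   => '<li>',
  "</p>\n</li>" => '</li>'
}

# Regexp.union escapes each string key and joins them with '|',
# which is equivalent to the escape/join line in the question.
re = Regexp.union(nogo.keys)

# Shortened excerpt of the question's input: "point 2" is followed
# by a nested list, so its closing </p> is not next to a </li>.
html = "<li>\n<p>point 2</p>\n<ul>\n<li>\n<p>point 3</p>\n</li>\n</ul>\n</li>"

result = html.gsub(re, nogo)
puts result

# Count what is left over after the substitution.
puts "opening p tags left: #{result.scan('<p>').size}"
puts "closing p tags left: #{result.scan('</p>').size}"
```

Covering every possible neighbour of a `<p>` with further literal keys quickly becomes unmanageable, which is the practical argument for a parser.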

Use a parser. Nokogiri is the de-facto standard for Ruby, and, with some knowledge of CSS or XPath, you can quickly learn to search, or modify, HTML and XML:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <ul>
      <li>
        <p>foo</p>
      </li>
      <li>
        <span>
          <p>bar</p>
        </span>
      </li>
    </ul>
  </body>
</html>
EOT

doc.search('li p').each do |p_tag|
  p_tag.remove
end

puts doc.to_html

Running that results in:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <ul>
      <li>

      </li>
      <li>
        <span>

        </span>
      </li>
    </ul>
  </body>
</html>

The tutorials on the Nokogiri site are your starting point. Stack Overflow is also a good resource as there are many different easily-searchable questions about all aspects of using the gem.