Question

Is there any way to strip HTML tags from a string without losing the links in the anchor tags?

For example, this is my input:

 <html>
     <body>
      <a href="http://www.yahoo.com">Yahoo</a>
      <p>This is test content </p>
      <a href="http://www.google.com">Google</a>
     </body>
  </html>

The output I want:

http://www.yahoo.com Yahoo

This is test content

http://www.google.com Google

Answer 1

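Use the Rails sanitize helper.

Custom tags and attributes (only the tags and attributes listed are allowed, nothing else). For this question that would be tags: %w(a) and attributes: %w(href); note that sanitize keeps the <a> tags themselves rather than converting them to plain text.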

<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>

Here is the documentation.

Answer 2

You can parse the HTML with the Nokogiri parser and save the value of the href attribute whenever you encounter an <a> tag.

Answer 3

After a lot of research, this gem solved my problem: https://github.com/premailer/premailer

However, I had to modify its html_to_plain_text module so that it does not strip Ruby variables.

Answer 4

You can parse the HTML with Nokogiri.

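require 'nokogiri'

# html_content holds the HTML string from the question.
x = Nokogiri::HTML(html_content)
output = []
# Walk the direct children of <body>, skipping whitespace text nodes.
x.at_css('body').children.each do |tag|
  next unless tag.element?
  # Keep the attribute hash (e.g. href) only when the tag has attributes.
  output << tag.attributes unless tag.attributes.empty?
  output << tag.children
end
# p prints the inspect form shown below; puts would flatten it to plain text.
p output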
[{"href"=>#<Nokogiri::XML::Attr:0x3fef80461c98 name="href" value="http://www.yahoo.com">}, [#<Nokogiri::XML::Text:0x3fef804617d4 "Yahoo">], [#<Nokogiri::XML::Text:0x3fef80461310 "This is test content ">], {"href"=>#<Nokogiri::XML::Attr:0x3fef80461054 name="href" value="http://www.google.com">}, [#<Nokogiri::XML::Text:0x3fef80460b7c "Google">]]

You can format the output array however you need.
