Question

I'm trying to sanitize Solr search results, because they contain HTML tags:

ActionController::Base.helpers.sanitize( result_string )

It is easy to sanitize a string that has no highlighting, such as: I know <ul><li>ruby</li> <li>rails</li></ul>.

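But when the results are highlighted, there are additional important tags inside, <em> and </em>:

I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>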

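So when I sanitize a string that mixes nested HTML tags with highlighting tags, I get a string with pieces of HTML tags left in it. That's bad :)

How can I sanitize a highlighted string that has <em> tags inside the HTML tags and get the proper result (a string containing only <em> tags)?

I found a way, but it is slow and not pretty: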

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag| 
  string.gsub!( "<<em>#{tag}</em>>",  '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end

string = ActionController::Base.helpers.sanitize string, tags: %w(em)

How can I optimize this, or is there a better solution? For example, a regex that removes the HTML tags but keeps <em> and </em>.

Please help, thanks.

Answer 1

You can call gsub! to discard all HTML tags and keep only the <em> tags that stand alone, that is, the ones not embedded inside another HTML tag.

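result_string.gsub!(/(<\/?(?!em\b)\w+[^>]*>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')

will do the trick.

Explanation:

# first group (<\/?(?!em\b)\w+[^>]*>)
# find all html tags that are not <em> or </em>
# (the negative lookahead (?!em\b) rejects the tag name "em")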

# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>>   or <<em>ul</em>>

# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>>   or  </<em>ul</em>>

# and gsub replaces all of this with empty string
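
To make the idea easier to try outside a Rails console, here is a minimal plain-Ruby sketch. The helper name strip_tags_keep_em and the constant NON_EM_TAG are hypothetical, not part of the answer, and the pattern is a variation that uses a negative lookahead to skip <em> and merges the opening and closing highlighted cases into one alternative. It assumes the highlighter only wraps whole tag names, as in the question's sample string.

```ruby
# Hypothetical helper; the names are illustrative, not from the original answer.
# First alternative: any ordinary tag whose name is not "em".
# Second alternative: a tag whose name was wrapped by the highlighter,
# such as <<em>li</em>> or </<em>ul</em>>.
NON_EM_TAG = %r{</?(?!em\b)\w+[^>]*>|</?<em>\w*</em>>}

def strip_tags_keep_em(text)
  # Non-destructive gsub, so the caller's string is left untouched.
  text.gsub(NON_EM_TAG, '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
plain = 'I know <ul><li>ruby</li> <li>rails</li></ul>'

puts strip_tags_keep_em(highlighted) # I <em>know</em> <em>ruby</em> <em>rails</em>
puts strip_tags_keep_em(plain)       # I know ruby rails
```

A regex like this only removes tag-shaped text; it is not a sanitizer. Passing the result through sanitize with tags: %w(em), as the question already does, is probably still worthwhile for untrusted input.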

Answer 2

I think you can use sanitize:

Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>

So something like this should work:

sanitize result_string, tags: %w(em)

Answer 3

With the additional tags parameter of sanitize, you can specify which tags are allowed.

In your example, try:

ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )

It should do the trick.

How to sanitize a string with nested HTML tags but keep the <em> tags?
