Question

My string is as follows:

sanitize_text = "<b><i>this is the bold text</i></b><i>this is the italic</i>"

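My problem is:

I want to parse the characters in the string, find specific HTML tags ('<b>', '<i>', ...), and apply the matching properties to the text between them.
The properties need to be applied to each character.

My approach so far looks like this: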

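# /\D\d*/ yields one element per non-digit character, with any digits that
# follow attached to it; String#chars would split strictly per character.
sanitize_arr = sanitize_text.scan(/\D\d*/)

sanitize_arr.each_with_index do |char, index|
  # Pseudocode: if a new start tag '<b>' begins at this character,
  #   apply bold properties to the following characters until '</b>'.
  # Pseudocode: if a new start tag '<i>' begins at this character,
  #   apply italic properties to the following characters until '</i>'.
end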

I am curious whether I am heading in the right direction or whether there is a better solution. Please let me know.

Answer 1

Correct me if I'm wrong: you want to find specific HTML tags in the text and do something with them? Have you tried the Nokogiri gem?

Then do something like this:

require 'nokogiri'

nokogiri_object = Nokogiri::HTML(sanitize_text)
bold_text = nokogiri_object.css('b').text
puts bold_text

Output: "this is the bold text"

Answer 2

Yes, I have got it working, like this:

sanitize_text = "<b><u>this</u></b><i><p>this is the italic text</p></i>"

# Roughly one element per character (digits attach to the preceding non-digit).
sanitize_arr = sanitize_text.scan(/\D\d*/)
char_array, html_tag_array = [], []
inside_start_tag, inside_end_tag = false, false

sanitize_arr.each_with_index do |char, index|
  # Detect the beginning of a new start tag such as "<b>".
  inside_start_tag = true if char == '<' && sanitize_arr[index + 1] != '/'
  if inside_start_tag
    char_array << char
    if char == '>'
      # Start tag complete: remember it as currently open.
      inside_start_tag = false
      html_tag_array << char_array.join
      char_array = []
    end
    next
  end

  # Detect the beginning of a new end tag such as "</b>".
  inside_end_tag = true if char == '<' && sanitize_arr[index + 1] == '/'
  if inside_end_tag
    char_array << char
    if char == '>'
      # End tag complete: "</b>" becomes "<b>" and is removed from the open tags.
      inside_end_tag = false
      html_tag_array.delete(char_array.join.gsub('/', ''))
      char_array = []
    end
    next
  end

  # Apply the property to the character. These expressions are placeholders:
  # their values are discarded, so replace them with the real styling logic.
  "Bold Char" if html_tag_array.include?("<b>")
  "Italic Char" if html_tag_array.include?("<i>")
end

Please let me know if any changes would make this better.
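For comparison, here is a minimal sketch of the same open-tag bookkeeping using StringScanner from Ruby's standard library, which consumes whole tags and whole runs of text instead of single characters. The `STYLES` table and the `styled_runs` method are hypothetical names made up for this illustration, and the sketch assumes simple tags without attributes. It is not a real HTML parser.

```ruby
require 'strscan'

# Hypothetical mapping from tag names to style symbols (an assumption for illustration).
STYLES = { 'b' => :bold, 'i' => :italic, 'u' => :underline }.freeze

# Returns an array of [text, styles] pairs, one per run of text.
def styled_runs(html)
  scanner = StringScanner.new(html)
  open_tags = [] # stack of currently open tag names
  runs = []
  # Styles for the tags that are open right now; tags without a style are skipped.
  current_styles = -> { open_tags.map { |tag| STYLES[tag] }.compact.uniq }

  until scanner.eos?
    if scanner.scan(%r{</(\w+)>})
      # Closing tag: drop only the most recent matching opener.
      idx = open_tags.rindex(scanner[1])
      open_tags.delete_at(idx) if idx
    elsif scanner.scan(/<(\w+)>/)
      # Opening tag: remember it as open.
      open_tags.push(scanner[1])
    elsif (text = scanner.scan(/[^<]+/))
      # A run of plain text: record it with the styles active at this point.
      runs << [text, current_styles.call]
    else
      # A stray '<' that does not start a tag: treat it as literal text.
      runs << [scanner.getch, current_styles.call]
    end
  end
  runs
end

p styled_runs("<b><u>this</u></b><i><p>this is the italic text</p></i>")
```

Unlike `Array#delete`, which removes every matching entry, `rindex` plus `delete_at` closes only the innermost matching tag, so nested tags of the same kind stay balanced.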

Answer 3

You could write your own XML parser. No, seriously! Check out Parslet. In fact, the examples that ship with it include an XML parser.

Something like this:

require 'parslet'
require 'parslet/convenience' # provides parse_with_debug

class XML < Parslet::Parser
  root :document

  # A document is one or more formatting elements or text runs.
  rule(:document)   { (formatting | text).repeat(1) }
  rule(:formatting) { tag_pair('b').as(:bold) | tag_pair('u').as(:underline) | tag_pair('i').as(:italic) }
  rule(:text)       { match('[^<>]').repeat(1).as(:text) }

  # Matches a literal tag such as <b> or </b>.
  def tag(type)
    str('<') >> str(type) >> str('>')
  end

  # Matches an opening tag, optional nested content, and the closing tag.
  def tag_pair(type)
    tag(type) >> document.maybe >> tag('/' + type)
  end
end

parser = XML.new
input = ARGV[0]

puts parser.parse_with_debug(input).inspect

This produces something like:

> ruby xmlparser.rb "<b>bold<i>italic</i> bold again <u>underlined</u></b>"

[{:bold=>[{:text=>"bold"@3}, {:italic=>[{:text=>"italic"@10}]}, {:text=>"bold again"@21}, {:underline=>[{:text=>"underlined"@36}]}]}]

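As you can see, the tree has style nodes for bold, italic, and so on, with their content nested inside them.

This can easily be extended to handle whitespace and any other tags you care about. Handling tags you don't care about is a bit harder.

Anyway... just showing what's possible.

With Parslet, you would typically write a Transform class that converts this tree structure into whatever you ultimately want to do with it. I like the way Parslet separates parsing from processing the parsed data.

Hope this helps.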
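To show what consuming such a tree could look like, here is a plain-Ruby sketch that walks the nested structure and flattens it into runs of text with their active styles. It is a stand-in for illustration and does not use Parslet's Transform class. `TREE` and `flatten_styles` are hypothetical names, and the tree is hand-written with ordinary strings in place of the parser's positioned slices.

```ruby
# Hypothetical stand-in for the parser output: plain Ruby strings replace
# the parser's slice objects, and positions are omitted.
TREE = [
  { bold: [
    { text: 'bold' },
    { italic: [{ text: 'italic' }] },
    { text: ' bold again ' },
    { underline: [{ text: 'underlined' }] }
  ] }
].freeze

# Walks the nested tree and returns [text, styles] pairs in document order.
def flatten_styles(nodes, active = [])
  nodes.flat_map do |node|
    node.flat_map do |key, value|
      if key == :text
        # Leaf: one run of text carrying every style collected on the way down.
        [[value.to_s, active]]
      else
        # Style node: recurse with the style added. Array() turns a nil
        # (an empty tag pair) into an empty list.
        flatten_styles(Array(value), active + [key])
      end
    end
  end
end

flatten_styles(TREE).each do |text, styles|
  puts "#{text.inspect} #{styles.inspect}"
end
```

Each run carries the full list of enclosing styles, which is the information needed to apply properties to the text inside nested tags.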
