Ruby algorithm to determine a valid HTML structure

Time: 2017-07-25 17:00:44

Tags: ruby algorithm

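I have to take an array of hashes as input, where each hash describes an HTML tag (its opening and closing positions in the text, and the type of tag). I need to generate another array in which the tags are arranged in order.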

For example:

input = [
         {start_p: 0, end_p: 100, start_t: '<p>', end_t: '</p>'},
         {start_p: 10, end_p: 50, start_t: '<p>', end_t: '</p>'},
         {start_p: 0, end_p: 100, start_t: '<span>', end_t: '</span>'},
         {start_p: 20, end_p: 30, start_t: '<em>', end_t: '</em>'},
         {start_p: 40, end_p: 50, start_t: '<em>', end_t: '</em>'},
         {start_p: 50, end_p: 60, start_t: '<em>', end_t: '</em>'},
         {start_p: 70, end_p: 80, start_t: '<em>', end_t: '</em>'},
         {start_p: 8, end_p: 99, start_t: '<strong>', end_t: '</strong>'}
        ]

expected_output: [<p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>]

Rather than bare tags, each element of the output should be a hash with the position and the tag, for example:

     {position: 0, tag: '<p>'}

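Most importantly, the tags must be ordered so that they respect the HTML rule that tags may not cross: if several tags close at the same position, the one opened last must close first; if one tag closes and another opens at the same position, the closing tag comes first; and so on.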
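The ordering rules above can be expressed as a composite sort key, which avoids recursion altogether. The following is only a minimal sketch under stated assumptions: `ordered_events` is a hypothetical helper name, every tag is assumed to have start_p < end_p, and the ranges are assumed to be nestable without crossing. Ties that the rules do not cover (identical ranges) fall back to input order.

```ruby
# Hypothetical helper (not from the original post): flattens tag ranges into
# ordered open/close events. Assumes start_p < end_p for every tag and that
# the ranges can be nested without crossing.
def ordered_events(tags)
  events = tags.each_with_index.flat_map do |h, i|
    [
      # Opening event: at the same position, the tag that ends later opens
      # first; identical ranges keep input order.
      { key: [h[:start_p], 1, -h[:end_p], i],
        position: h[:start_p], tag: h[:start_t] },
      # Closing event: sorts before any opening at the same position (0 < 1);
      # the most recently opened tag closes first.
      { key: [h[:end_p], 0, -h[:start_p], -i],
        position: h[:end_p], tag: h[:end_t] }
    ]
  end
  events.sort_by { |e| e[:key] }
        .map { |e| { position: e[:position], tag: e[:tag] } }
end

input = [
  { start_p: 0,  end_p: 100, start_t: '<p>',      end_t: '</p>' },
  { start_p: 10, end_p: 50,  start_t: '<p>',      end_t: '</p>' },
  { start_p: 0,  end_p: 100, start_t: '<span>',   end_t: '</span>' },
  { start_p: 20, end_p: 30,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 40, end_p: 50,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 50, end_p: 60,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 70, end_p: 80,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 8,  end_p: 99,  start_t: '<strong>', end_t: '</strong>' }
]

puts ordered_events(input).map { |e| e[:tag] }.join
# prints <p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>
```

Because the whole job is a single sort, the cost is O(n log n), which should be acceptable for a few hundred thousand elements.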

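This is part of a legacy system, and the input and output formats cannot be changed at present. In addition, the input can be very large (hundreds of thousands of elements).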

Is there a better solution than brute-force recursion?

2 answers:

Answer 0: (score: 1)

input.group_by { |h| h[:start_p] }.
      values.
      flat_map do |a|
        x = 1.0
        a.flat_map do |h|
          x /= 2.0
          [[h[:start_p] + x, h[:start_t]], [h[:end_p] - x, h[:end_t]]]
        end
      end.sort_by(&:first).map(&:last).join
#=> "<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>"

The steps are as follows.

b = input.group_by { |h| h[:start_p] }
  #=> { 0=>[{:start_p=>0, :end_p=>100, :start_t=>"<p>", :end_t=>"</p>"},
  #        {:start_p=>0, :end_p=>100, :start_t=>"<span>", :end_t=>"</span>"}],
  #    10=>[{:start_p=>10, :end_p=>50, :start_t=>"<p>", :end_t=>"</p>"}],
  #    20=>[{:start_p=>20, :end_p=>30, :start_t=>"<em>", :end_t=>"</em>"}],
  #    40=>[{:start_p=>40, :end_p=>50, :start_t=>"<em>", :end_t=>"</em>"}],
  #    50=>[{:start_p=>50, :end_p=>60, :start_t=>"<em>", :end_t=>"</em>"}],
  #    70=>[{:start_p=>70, :end_p=>80, :start_t=>"<em>", :end_t=>"</em>"}],
  #     8=>[{:start_p=> 8, :end_p=>99, :start_t=>"<strong>", :end_t=>"</strong>"}]}
c = b.values
  #=> [[{:start_p=>0, :end_p=>100, :start_t=>"<p>", :end_t=>"</p>"},
  #     {:start_p=>0, :end_p=>100, :start_t=>"<span>", :end_t=>"</span>"}],
  #    [{:start_p=>10, :end_p=>50, :start_t=>"<p>", :end_t=>"</p>"}],
  #   ...
  #    [{:start_p=>8, :end_p=>99, :start_t=>"<strong>", :end_t=>"</strong>"}]]
d = c.flat_map do |a|
      x = 1.0
      a.flat_map do |h|
        x /= 2.0
        [[h[:start_p] + x, h[:start_t]], [h[:end_p] - x, h[:end_t]]]
      end
    end
  #=> [[0.5, "<p>"], [99.5, "</p>"], [0.25, "<span>"], [99.75, "</span>"],
  #    [10.5, "<p>"], [49.5, "</p>"], [20.5, "<em>"], [29.5, "</em>"],
  #    [40.5, "<em>"], [49.5, "</em>"], [50.5, "<em>"], [59.5, "</em>"],
  #    [70.5, "<em>"], [79.5, "</em>"], [8.5, "<strong>"], [98.5, "</strong>"]]

The first four elements (tuples) of d are the most important for understanding the approach I have taken.
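To make those four tuples concrete, here is a minimal sketch that applies the same halving offset to just the two tags starting at position 0. `offset_pairs` is a hypothetical helper name introduced for illustration, and it uses non-mutating `+` and `-` so the hashes passed in are left untouched.

```ruby
# Hypothetical helper: applies the halving offset to one group of tags that
# share the same start_p. Each later tag in the group gets a smaller offset,
# so it opens earlier and closes later than the tags before it.
def offset_pairs(group)
  x = 1.0
  group.flat_map do |h|
    x /= 2.0
    [[h[:start_p] + x, h[:start_t]], [h[:end_p] - x, h[:end_t]]]
  end
end

group = [
  { start_p: 0, end_p: 100, start_t: '<p>',    end_t: '</p>' },
  { start_p: 0, end_p: 100, start_t: '<span>', end_t: '</span>' }
]

pairs = offset_pairs(group)
# pairs is [[0.5, "<p>"], [99.5, "</p>"], [0.25, "<span>"], [99.75, "</span>"]]

puts pairs.sort_by(&:first).map(&:last).join
# prints <span><p></p></span>
```

The offsets only separate tags that share a start position; tags from different groups that end at the same position can still receive identical keys.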

e = d.sort_by(&:first)
  #=> [[0.25, "<span>"], [0.5, "<p>"], [8.5, "<strong>"], [10.5, "<p>"],
  #    [20.5, "<em>"], [29.5, "</em>"], [40.5, "<em>"], [49.5, "</p>"],
  #    [49.5, "</em>"], [50.5, "<em>"], [59.5, "</em>"], [70.5, "<em>"],
  #    [79.5, "</em>"], [98.5, "</strong>"], [99.5, "</p>"], [99.75, "</span>"]]

f = e.map(&:last)
  #=> ["<span>", "<p>", "<strong>", "<p>", "<em>", "</em>", "<em>", "</p>",
  #    "</em>", "<em>", "</em>", "<em>", "</em>", "</strong>", "</p>", "</span>"]
f.join
  #=> "<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>"

If requested, I will elaborate on the calculation of d above.
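Note that the joined string shown above contains `<em></p></em>`: the closing `</p>` and `</em>` at position 50 come from different groups and both receive the key 49.5, so the offsets do not decide their order. A quick way to check any candidate output for crossed tags is a stack-based pass like the sketch below. `well_nested?` is a hypothetical helper name, and the regular expression assumes simple tags without attributes, as in the question.

```ruby
# Hypothetical checker: returns true when every closing tag matches the most
# recently opened tag that is still open, and nothing is left open at the end.
# Assumes plain tags such as <p> and </p>, with no attributes.
def well_nested?(html)
  stack = []
  html.scan(%r{<(/?)(\w+)>}) do |slash, name|
    if slash.empty?
      stack.push(name)          # opening tag
    elsif stack.pop != name
      return false              # closing tag does not match the last opening
    end
  end
  stack.empty?
end

expected = '<p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>'
answer   = '<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>'

puts well_nested?(expected) # prints true
puts well_nested?(answer)   # prints false
```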

Answer 1: (score: 0)

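I am not sure what you mean by brute-force recursion, but this can be done with sort_by and map. It is a matter of getting sort_by right so that the result follows the required HTML rules.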

output = input.sort_by { |hsh| hsh[:start_p] }.map{|x| x.slice(:start_p, :start_t)}
output.each do |h|
  h[:position] = h.delete(:start_p)
  h[:tag] = h.delete(:start_t)
end

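The slice method is monkey-patched as follows (Ruby 2.5 and later provide Hash#slice natively, so the patch is only needed on older versions).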

module MyExtension
  module Hash 
    def slice(*keys)
      ::Hash[[keys, self.values_at(*keys)].transpose]
    end
  end
end
Hash.include MyExtension::Hash
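As a possible simplification, the position/tag hashes can be built directly inside map, which removes the need for slice and the monkey patch. This is only a sketch: `opening_events` is a hypothetical helper name, and like the code above it covers opening tags only and sorts by start position alone, so closing tags and the tie-breaking rules from the question would still need to be handled separately.

```ruby
# Hypothetical helper: builds {position:, tag:} hashes for the opening tags
# only, sorted by start position. Closing tags are not emitted, and the order
# of tags sharing a start position is not guaranteed, because sort_by is not
# a stable sort.
def opening_events(tags)
  tags.sort_by { |h| h[:start_p] }
      .map { |h| { position: h[:start_p], tag: h[:start_t] } }
end

input = [
  { start_p: 10, end_p: 50,  start_t: '<p>',      end_t: '</p>' },
  { start_p: 0,  end_p: 100, start_t: '<span>',   end_t: '</span>' },
  { start_p: 8,  end_p: 99,  start_t: '<strong>', end_t: '</strong>' }
]

p opening_events(input).map { |e| e[:position] }
# prints [0, 8, 10]
```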