Wrapping groups of adjacent elements

Date: 2012-12-26 06:25:56

Tags: ruby xpath nokogiri

I have an HTML document with a structure like this:

<li class="indent1">(something)
  <li class="indent2">(something else)</li>
  <li class="indent2">(something else)
    <li class="indent3">(another sublevel)</li>
  </li>
  <li class="indent2">(something else)</li>
</li>

What I need to do is wrap these LI tags in OL tags. There are many such lists throughout the document. The HTML needs to look like this:

<ol>
  <li>(something)
    <ol>
      <li>(something else)</li>
      <li>(something else)
        <ol>
          <li>(another sublevel)</li>
        </ol>
      </li>
      <li>(something else)</li>
    </ol>
  </li>
</ol>

How can I do this in Nokogiri? Thanks very much in advance.

Edit

Here is a sample of the HTML from the original document. My script converts all P tags to LI tags.

  <p class="indent1"><i>a.</i> This regulation describes the Army Planning, Programming,
  Budgeting, and Execution System (PPBES). It explains how an integrated Secretariat and
  Army Staff, with the full participation of major Army commands (MACOMs), Program
  Executive Offices (PEOs), and other operating agencies--</p>

  <p class="indent2">(1) Plan, program, budget, and then allocate and manage approved
  resources.</p>

  <p class="indent2">(2) Provide the commanders in chief (CINCs) of United States unified
  and specified commands with the best mix of Army forces, equipment, and support
  attainable within available resources.</p>

  <p class="indent1"><i>b.</i> The regulation assigns responsibilities and describes
  policy and procedures for using the PPBES to:</p>

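The indent1 class marks a first-level list item, indent2 a second-level item, and so on. I need to convert these indent classes into properly nested ordered lists.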
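
To make the target transformation concrete, here is a minimal plain-Ruby sketch of the nesting logic, independent of Nokogiri. It works on a flat list of class/text pairs rather than real DOM nodes. The method name nest_items and the sample data are hypothetical, and the sketch assumes levels start at 1 and never jump down more than one level deeper at a time.

```ruby
# Build nested <ol> markup from a flat list of [level, text] pairs.
# Assumptions (not from the original post): levels start at 1 and never
# increase by more than one from one item to the next; text is already
# valid inner HTML and is not escaped here.
def nest_items(items)
  out = []
  depth = 0 # number of <ol> elements currently open
  items.each do |level, text|
    if level > depth
      # Going deeper: open a new <ol> inside the <li> that is still open.
      (level - depth).times { out << '<ol>' }
    else
      # Same level or shallower: close the previous <li>,
      # then close one nested list (and its parent <li>) per level we leave.
      out << '</li>'
      (depth - level).times { out << '</ol></li>' }
    end
    out << "<li>#{text}"
    depth = level
  end
  # Close every item and list that is still open.
  depth.times { out << '</li></ol>' }
  out.join
end

# Hypothetical input: the class name and a shortened text of each paragraph.
paragraphs = [
  ['indent1', 'a.'],
  ['indent2', '(1)'],
  ['indent2', '(2)'],
  ['indent1', 'b.']
]

# Read the numeric level out of the class name, e.g. "indent2" gives 2.
items = paragraphs.map { |css_class, text| [css_class[/\d+/].to_i, text] }

puts nest_items(items)
#=> <ol><li>a.<ol><li>(1)</li><li>(2)</li></ol></li><li>b.</li></ol>
```

With a real document, the same level numbers could be read from each paragraph's class attribute before building the nested lists.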

2 answers:

Answer 0: (score: 1)

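The following solution loops through every <li> in the document and does the following:

  • If the <li> is not immediately preceded by an <ol>, replace it with a new <ol> and put the <li> inside.
  • If the <li> is immediately preceded by an <ol>, move the <li> into that <ol>.
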
document.css('li').each do |li|
  if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
    li.previous_element << li
  else
    li.replace('<ol/>').first << li
  end
end
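
The net effect of those two rules on any one level of siblings is that each run of adjacent <li> elements ends up in a single <ol>. As a rough model of that grouping only (not the Nokogiri code itself), here is a plain-Ruby sketch that treats siblings as [tag, text] pairs. The method name wrap_adjacent and the pair representation are hypothetical.

```ruby
# Group runs of adjacent 'li' entries under a single 'ol' entry.
# Siblings are modelled as [tag, text] pairs; this is a stand-in for
# real DOM nodes, used only to illustrate the grouping rule.
def wrap_adjacent(siblings)
  siblings
    .chunk_while { |a, b| a.first == 'li' && b.first == 'li' } # keep adjacent <li>s together
    .flat_map do |run|
      # A run of <li>s becomes one ['ol', children] entry; anything else passes through.
      run.first.first == 'li' ? [['ol', run]] : run
    end
end

siblings = [
  ['li', 'List 2a'],
  ['li', 'List 2b'],
  ['p',  'Break 2'],
  ['li', 'List 3']
]

p wrap_adjacent(siblings)
```

The two <li> entries before the <p> share one 'ol' entry, while the <li> after it gets its own, which mirrors how the loop above starts a new <ol> whenever something other than an <ol> precedes the <li>.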

Tested here:

require 'nokogiri'

# Use XML instead of HTML fragment due to problems with XPath
fragment = Nokogiri::XML.fragment '
  <li>List 1
    <li>List 1a</li>
    <li>List 1b
      <li>List 1bi</li>
    </li>
    <li>List 1c</li>
    New List
    <li>New List 1a</li>
  </li>
  <p>Break 1</p>
  <li>List 2a</li>
  <li>List 2b</li>
  <p>Break 2</p>
  <li>List 3 <li>List 3a</li></li>
'

fragment.css('li').each do |li|
  # Check whether the nearest preceding sibling is an <ol>, ignoring any
  # whitespace-only text nodes between this <li> and it.
  # See http://stackoverflow.com/q/14045519/405017
  if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
    li.previous_element << li
  else
    li.replace('<ol/>').first << li
  end
end

puts fragment   # I've normalized the whitespace in the output to make it clear
#=> <ol>
#=>   <li>List 1
#=>     <ol>
#=>       <li>List 1a</li>
#=>       <li>List 1b
#=>         <ol>
#=>           <li>List 1bi</li>
#=>         </ol>
#=>       </li>
#=>       <li>List 1c</li>
#=>     </ol>
#=>     New List
#=>     <ol><li>New List 1a</li></ol>
#=>   </li>
#=> </ol>
#=> <p>Break 1</p>
#=> <ol>
#=>   <li>List 2a</li>
#=>   <li>List 2b</li>
#=> </ol>
#=> <p>Break 2</p>
#=> <ol>
#=>   <li>List 3
#=>     <ol>
#=>       <li>List 3a</li>
#=>     </ol>
#=>   </li>
#=> </ol>

Answer 1: (score: -1)

The problem is that your HTML is not well-formed. You cannot parse it successfully with Nokogiri.