I have an HTML document with the following structure:
<li class="indent1">(something)
<li class="indent2">(something else)</li>
<li class="indent2">(something else)
<li class="indent3">(another sublevel)</li>
</li>
<li class="indent2">(something else)</li>
</li>
What I need to do is wrap these LI tags in OL tags. There are many lists like this throughout the file. The HTML needs to look like this:
<ol>
<li>(something)
<ol>
<li>(something else)</li>
<li>(something else)
<ol>
<li>(another sublevel)</li>
</ol>
</li>
<li>(something else)</li>
</ol>
</li>
</ol>
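How can I do this with Nokogiri? Thanks very much in advance.
Edit
Here is a sample of the HTML from the original document. My script converts all of the P tags to LI tags.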
<p class="indent1"><i>a.</i> This regulation describes the Army Planning, Programming,
Budgeting, and Execution System (PPBES). It explains how an integrated Secretariat and
Army Staff, with the full participation of major Army commands (MACOMs), Program
Executive Offices (PEOs), and other operating agencies--</p>
<p class="indent2">(1) Plan, program, budget, and then allocate and manage approved
resources.</p>
<p class="indent2">(2) Provide the commanders in chief (CINCs) of United States unified
and specified commands with the best mix of Army forces, equipment, and support
attainable within available resources.</p>
<p class="indent1"><i>b.</i> The regulation assigns responsibilities and describes
policy and procedures for using the PPBES to:</p>
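The indent1 class marks a first-level list item, indent2 a second-level item, and so on. I need to convert these indent classes into proper ordered lists.
Answer 0 (score: 1)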
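The following solution works by looping through every <li> in the document and doing the following:
- If the <li> is not immediately preceded by an <ol>, replace the <li> with a new <ol>, then put the <li> inside it.
- If the <li> is immediately preceded by an <ol>, move this <li> into it.
document.css('li').each do |li|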
if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
li.previous_element << li
else
li.replace('<ol/>').first << li
end
end
Tested here:
require 'nokogiri'
# Use XML instead of HTML fragment due to problems with XPath
fragment = Nokogiri::XML.fragment '
<li>List 1
<li>List 1a</li>
<li>List 1b
<li>List 1bi</li>
</li>
<li>List 1c</li>
New List
<li>New List 1a</li>
</li>
<p>Break 1</p>
<li>List 2a</li>
<li>List 2b</li>
<p>Break 2</p>
<li>List 3 <li>List 3a</li></li>
'
fragment.css('li').each do |li|
# Complex test to see if the preceding element is an <ol> and there's no non-empty text between the li and it
# See http://stackoverflow.com/q/14045519/405017
if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
li.previous_element << li
else
li.replace('<ol/>').first << li
end
end
puts fragment # I've normalized the whitespace in the output to make it clear
#=> <ol>
#=> <li>List 1
#=> <ol>
#=> <li>List 1a</li>
#=> <li>List 1b
#=> <ol>
#=> <li>List 1bi</li>
#=> </ol>
#=> </li>
#=> <li>List 1c</li>
#=> </ol>
#=> New List
#=> <ol><li>New List 1a</li></ol>
#=> </li>
#=> </ol>
#=> <p>Break 1</p>
#=> <ol>
#=> <li>List 2a</li>
#=> <li>List 2b</li>
#=> </ol>
#=> <p>Break 2</p>
#=> <ol>
#=> <li>List 3
#=> <ol>
#=> <li>List 3a</li>
#=> </ol>
#=> </li>
#=> </ol>
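
The code above assumes the <li> elements are already nested inside one another. For the flat <p class="indentN"> markup shown in the question's edit, one possible approach is a single pass that tracks the current depth and opens or closes <ol> tags as the level changes. The following is only a minimal sketch in plain Ruby, not part of the original answer: `nest_items` is a hypothetical helper name, the input is assumed to be an array of [level, html] pairs, and levels are assumed to start at 1 and never increase by more than one at a time.

```ruby
# Hypothetical helper: turn [level, html] pairs into nested <ol> markup.
# Assumes levels start at 1 and never jump deeper by more than one step.
def nest_items(items)
  out = +''
  depth = 0 # number of <ol> elements currently open
  items.each do |level, html|
    if level > depth + 1
      raise ArgumentError, "level jumps from #{depth} to #{level}"
    elsif level > depth
      # One level deeper: open a new <ol> inside the still-open <li>.
      out << '<ol>'
    else
      # Same level or shallower: close the previous <li>,
      # then close each deeper list together with its parent <li>.
      out << '</li>'
      (depth - level).times { out << '</ol></li>' }
    end
    out << '<li>' << html
    depth = level
  end
  # Close whatever is still open at the end of the input.
  depth.times { out << '</li></ol>' }
  out
end

items = [[1, 'a.'], [2, '(1)'], [2, '(2)'], [1, 'b.']]
puts nest_items(items)
```

With Nokogiri, the pairs could be collected with something like `doc.css('p').map { |p| [p['class'][/\d+/].to_i, p.inner_html] }`, assuming every <p> carries an indentN class. The resulting string can then be parsed or written out as needed.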
Answer 1 (score: -1)
The problem is that your HTML is not well-formed. You cannot parse it successfully with Nokogiri.