Question

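I'm using Nokogiri as part of my Ruby on Rails docx generator, and I've run into a problem. I use Nokogiri to parse each paragraph in my application and perform some operations on all the text that is wrapped in HTML tags.

However, when I iterate over each paragraph, I miss the unordered lists. Here is what the text editor produces in my example: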

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p><strong>Just testing <em>something</em> out </strong>over here.</p>
<p>Here's a paragraph that contains bullets though:</p>
<ul>
<li>One thing here.</li>
<li>Another thing here</li>
</ul>
<p>Some more text.</p>
</body></html>

I use this Ruby code to iterate over the paragraphs:

# test = the HTML above that I just pasted
html = Nokogiri::HTML(test)
html.xpath("//p").each do |paragraph|
  # some code here that converts HTML -> WordML
end

As a result, the code only captures this:

# output of html.xpath("//p")
<p><strong>Just testing <em>something</em> out </strong>over here.</p>
<p>Here's a paragraph that contains bullets though:</p>
<p>Some more text.</p>

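I need some way to capture the p tags and, at the same time, treat ul tags as if they were p tags. Otherwise, I only convert the HTML inside paragraph tags to WordML and the unordered lists are skipped.

So I got halfway there: html.xpath("//p | //ul") gets me that far, but a problem appears when I have nested ul tags. For example: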

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p><strong>Just testing <em>something</em> out </strong>over here.</p>
<p>Here's a paragraph that contains bullets though:</p>
<ul>
<li>One thing here.<ul><li>One more thing</li></ul>
</li>
<li>Another thing here</li>
</ul>
<p><br></p>
<ul><li>nothing</li></ul>
<p>Some more text.</p>
</body></html>

becomes

<p><strong>Just testing <em>something</em> out </strong>over here.</p>
<p>Here's a paragraph that contains bullets though:</p>
<ul>
<li>One thing here.<ul><li>One more thing</li></ul>
</li>
<li>Another thing here</li>
</ul>
<ul><li>One more thing</li></ul>
<p><br></p>
<ul><li>nothing</li></ul>
<p>Some more text.</p>

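As you can see, the output includes the nested ul data twice (I assume because it is a nested ul tag, so //ul also matches it on its own).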

Answer 1

I figured this out with some trial-and-error syntax guessing. I was able to solve the problem by using html.xpath("//p", "//ul") in the example above.

Answer 2

You can do two things:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>p1</p>
<p>p2</p>
<ul>
<li>l1</li>
</ul>
<p>p3</p>
</body></html>
EOT

doc.search('p', 'ul').map(&:to_html)
# => ["<p>p1</p>", "<p>p2</p>", "<p>p3</p>", "<ul>\n<li>l1</li>\n</ul>"]

This uses CSS, which can find either type of node. It looks for <p> tags first, then for <ul> tags.

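Using XPath:

doc.search('//p | //ul').map(&:to_html)
# => ["<p>p1</p>", "<p>p2</p>", "<ul>\n<li>l1</li>\n</ul>", "<p>p3</p>"]

This looks for <p> or <ul> tags in a single query, rather than one and then the other, so the results come back in document order.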

Can Nokogiri treat unordered lists as paragraphs?
