Question

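I'm writing a web crawler in Ruby and I'm using Nokogiri::HTML to parse the pages. I need to print the pages out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter, and I can't figure out what it wants.

My crawler caches the HTML of each web page and writes it to a file on my local machine. I'd like to "pretty print" the HTML as I do so, so that it looks nice and is properly formatted.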

Answer 1

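@mislav's answer is slightly wrong. Nokogiri does support pretty-printing if you:

Parse the document as XML
Tell Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
Use to_xhtml or to_xml to specify the pretty-printing parameters

In action: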

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>

Answer 2

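By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful only for debugging.

There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I found a post that does this with Nokogiri and an XSLT stylesheet.

It boils down to this: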

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

Of course, this requires you to download the linked xsl file to your filesystem. I tried it out quickly on my machine and it works like a charm.

Answer 3

This worked for me:

 pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)

I tried the REXML version above, but it corrupted some of my documents. And I hate bringing XSLT into a new project. Both feel outdated. :)

Answer 4

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
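
The snippet above assumes a variable named xml already holds the markup. A minimal self-contained sketch, using a hypothetical sample string and writing into a String instead of $stdout so the result can be saved to a file:

```ruby
require 'rexml/document'

# Assumed sample input; REXML needs well-formed XML with a single root.
xml = '<section><h1>Main Section 1</h1><p>Intro</p></section>'

doc = REXML::Document.new(xml)

# Same call as above, but collecting the output in a String.
pretty = String.new
doc.write(pretty, 2)

puts pretty
```

Note that REXML's indenting writer may also move text content onto its own lines, so check the result against documents where whitespace matters.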

Answer 5

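My solution was to add a print method to the actual Nokogiri objects. After running the code in the snippet below, you should be able to just write node.print, and it will pretty-print the contents. No xslt required :-)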

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace every run of whitespace with a single space, prefix the output
  # with the requested indentation, and print the text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
end

Answer 6

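I know I'm very late answering this question, but I'll leave an answer anyway. I tried all of the approaches above, and they only work to a certain extent.

Nokogiri does format the HTML, but it doesn't care about opening or closing tags, so pretty formatting is out of the picture.

I found a gem called htmlbeautifier that works like a charm. I hope others who are still looking for an answer will find it useful.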

Answer 7

Why not try the pp method?

require 'pp'
pp some_var
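
For context, pp is also where the pretty_print method from the question comes from: pp calls obj.pretty_print(q), passing in its own PP printer object, so the method is a hook for customizing debug output rather than something to call with HTML options. A small sketch with a hypothetical CachedPage class (not part of Nokogiri) showing what that parameter is:

```ruby
require 'pp'

# Hypothetical class, used only to illustrate the pretty_print(q) hook.
class CachedPage
  def initialize(url, bytes)
    @url = url
    @bytes = bytes
  end

  # `q` is the PP printer that pp passes in; we describe our layout to it.
  def pretty_print(q)
    q.group(1, '#<CachedPage', '>') do
      q.breakable
      q.text 'url='
      q.pp @url
      q.comma_breakable
      q.text 'bytes='
      q.pp @bytes
    end
  end
end

page = CachedPage.new('http://example.com/', 1270)

# PP.pp appends to the given output object and returns it.
dump = PP.pp(page, String.new, 79)
puts dump
```

The result is a debugging view of the Ruby object, not re-indented HTML.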

How do I pretty-print HTML with Nokogiri?
