Question

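I'm currently trying to get the inner HTML of an element on a page using Nokogiri. However, I get not only the element's text but also its escape sequences. Is there a way to suppress or remove them with Nokogiri?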

require 'nokogiri'
require 'open-uri'

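# URI.open is the open-uri entry point on current Rubies; bare open(url) no longer works on Ruby 3.0+
page = Nokogiri::HTML(URI.open("http://the.page.url.com"))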

page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html

This returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

What is the most efficient and direct Nokogiri (or Ruby) way to do this?

Answer 1

page.at_css("td[custom-attribute='foo']")
    .parent
    .css('td')
    .css('a')
    .text               # you want the text, not the inner HTML
    .strip              # removes leading and trailing whitespace

String#strip

Sidenote: css('td a') is likely more efficient than css('td').css('a').

Answer 2

It's important to drill in to the closest node containing the text you want. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
  </body>
</html>
EOT

doc.at('body').inner_html # => "\n    <p>foo</p>\n  "
doc.at('body').text       # => "\n    foo\n  "
doc.at('p').inner_html    # => "foo"
doc.at('p').text          # => "foo"

at, at_css and at_xpath return a Node (an XML::Element). search, css and xpath return a NodeSet. There is a big difference in how text or inner_html returns information depending on whether you are looking at a Node or a NodeSet:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]

doc.at('p').class     # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet

doc.at('p').text     # => "foo"
doc.search('p').text # => "foobar"

Note that search returned a NodeSet, and that text returned the nodes' text concatenated together. This is rarely what you want.

Also note that Nokogiri is smart enough to tell whether a selector is CSS or XPath 99% of the time, so using the generic search and at for either type of selector is very convenient.

How to make Nokogiri's inner_html ignore/remove escape sequences
