Question

I want to extract the text after the first <br> (the state).

The HTML is:

<div class="location">
    Country
    <br>
    State
    <br>
    City
</div>

Currently, I can extract all of the <div> text using the following:

a = Mechanize.new
page = a.get(url)
state = page.at('.location').text
puts state

Any ideas?

Answer 1

It's easy, but you have to understand how the document is represented in the DOM inside Nokogiri.

There are tags, which are Element nodes, and the intervening text, which are Text nodes:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="location">
    Country
    <br>
    State
    <br>
    City
</div>
EOT

doc.at('.location br').next_sibling.text.strip # => "State"

Here is what Nokogiri says the <br> is:

doc.at('.location br').class # => Nokogiri::XML::Element

And the Text node that follows it:

doc.at('.location br').next_sibling.class # => Nokogiri::XML::Text

And how we access the Text node's content:

doc.at('.location br').next_sibling.text # => "\n    State\n    "

Again, looking at the <div> tag and its next sibling node:

doc.at('.location').class # => Nokogiri::XML::Element
doc.at('.location').next_sibling.class # => Nokogiri::XML::Text
doc.at('.location').next_sibling # => #<Nokogiri::XML::Text:0x3fcf58489c7c "\n">

By the way, you can get at Mechanize's Nokogiri parser to work with the DOM using something like:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com')
doc = page.parser
doc.class # => Nokogiri::HTML::Document
doc.title # => "Example Domain"

You said: "I can't do doc.at('.location br br').next_sibling.text or doc.at('.location br').next_sibling.next_sibling.text"

The first assertion is correct. You can't use '.location br br', because you can't nest a tag inside a <br>, so br br is nonsense as a CSS selector for HTML.

The second assertion is wrong. You can use next_sibling.next_sibling, but you have to be aware of the tags in the DOM. In your HTML sample it doesn't return anything reasonable:

doc.at('.location br').to_html # => "<br>"
doc.at('.location br').next_sibling.to_html # => "\n    State\n    "
doc.at('.location br').next_sibling.next_sibling.to_html # => "<br>"

Getting the text of a <br> returns an empty string, because <br> can't wrap text:

doc.at('br').text # => ""

So you just didn't go far enough: one more next_sibling past the second <br> reaches the Text node that holds the city.

But if walking the DOM like that is the intent, I'd do it more simply.

Answer 2

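Try the following. The children of the <div> are, in order, the "Country" text node, the first <br>, and the "State" text node, so index 2 is the state.

a = Mechanize.new
page = a.get(url)
# The answer as posted used the selector ".kiwii-no-link-color";
# ".location" is what matches the HTML shown in the question.
state = page.search(".location").children[2].text.strip
puts state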

How to extract text after <br> using Mechanize
