Question

I'm trying to use Nokogiri's CSS method to get some names from my HTML.

Here is an example of the HTML:

<section class="container partner-customer padding-bottom--60">
    <div>
        <div>
            <a id="technologies"></a>
            <h4 class="center-align">The Team</h4>
        </div>
    </div>
    <div class="consultant list-across wrap">
        <div class="engineering">
            <img class="" src="https://v0001.jpg" alt="Person 1"/>
            <p>Person 1<br>Founder, Chairman &amp; CTO</p>
        </div>
        <div class="engineering">
            <img class="" src="https://v0002.png" alt="Person 2"/></a>
            <p>Person 2<br>Founder, VP of Engineering</p>
        </div>
        <div class="product">
            <img class="" src="https://v0003.jpg" alt="Person 3"/></a>
            <p>Person 3<br>Product</p>
        </div>
        <div class="Human Resources &amp; Admin">
            <img class="" src="https://v0004.jpg" alt="Person 4"/></a>
            <p>Person 4<br>People &amp; Places</p>
        </div>
        <div class="alliances">
            <img class="" src="https://v0005.jpg" alt="Person 5"/></a>
            <p>Person 5<br>VP of Alliances</p>
        </div>

So far, this is what I have in my people.rake file:

  staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
  all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)

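I'm having trouble getting all the values of the alt="" attribute (the person's name), because the img is nested under several divs.

Currently, using div.consultant, it gets all the names plus roles, i.e. Person 1Founder, Chairman & CTO, instead of just the person's name in alt=.

How can I simply get the values in alt?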

Answer 1

Your desired output isn't clear, and the HTML is broken.

Start with this:

require 'nokogiri'

doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]

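It's not a good idea to use text against the output of css. css returns a NodeSet, and text against a NodeSet concatenates the text of every node in it. That often mangles the content and forces you to figure out how to pull it apart again, which ends up as horrible code: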

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"

This behavior is documented in NodeSet#text:

Get the inner text of all contained Node objects

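Instead, use text (AKA inner_text or content) against the individual nodes, which is documented as:

Returns the content for this node

That gives you the exact text for each node, which you can then join however you want: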

doc.search('p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from Nodes when scraping".

Get all elements within the alt tag using Nokogiri's CSS method
