Ruby + Nokogiri: Loop through divs and find text inside them

Asked: 2014-09-24 00:58:53

Tags: ruby

I have this HTML. Note that everything is nested inside the .listing div:

    <div id="listing_1085130_featured" class="item listing 1085130 even featured selected" data-blockindex="0" se:map:point="40.7219,-74.0034" se:map="map" se:behavior="selectable hoverable rememberable clickable mappable" style="cursor: pointer;">
        <div class="item_inner ">
            <div class="featured_tag hidden-xs">Featured Listing</div>
            <div class="selected_marker hidden-xs hidden-sm">
                <div id="results_list" class="photo">
                    <a href="/building/27-wooster/ph?featured=1">
                        <img border="0" src="https://s3.amazonaws.com/img.streeteasy.com/nyc/image/47/76017947.jpg" alt="27 Wooster Street #PH">
                    </a>
                    <div id="featured-tag-on-responsive" class="visible-xs">Featured Listing</div>
                </div>
                <div class="details">
                    <div class="details_title">
                        <h5>
                            <a se:clickable:target="true" href="/building/27-wooster/ph?featured=1">27 Wooster Street #PH</a>
                        </h5>
                        <div class="item_tools">
                        </div>
                        <div class="closer"></div>
                        <div class="details_info first_detail_info">
                            <div class="details_info">
                                <div class="details_info">
                                    <div class="details_info">
                                    </div>
                                    <div class="closer"></div>
                                </div>
                            </div>

    ....

I have many of these. How do I get the href of the first link inside #results_list, which in this case is /building/27-wooster/ph?featured=1?

Here is my approach so far:

require 'json'
require 'open-uri'
require 'nokogiri'

def scrape(page_number)
  # URI.open is required on Ruby 3.0+; the query parameters must be joined with "&".
  doc = Nokogiri::HTML(URI.open("http://streeteasy.com/for-sale/soho?page=#{page_number}&sort_by=price_desc"))
  doc.css(".listing").each do |listing|
    # grab data inside that specific listing
  end
end

Is there a way to look inside that listing? Something like listing.children("#results_list a").first.href

2 answers:

Answer 0 (score: 0)

This works for me:

doc.css("#results_list/a").each do |listing|
  p listing['href']
end

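To get only the first match, use at_css. Replacing the loop above with this line returns the href of the first matching link instead of printing every one: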

doc.at_css("#results_list/a")['href']

Answer 1 (score: 0)

  

"Is there a way to look inside that listing?"

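Yes, but in HTML an id must be unique within a page, so it is suspicious that every .listing div contains a div with id="results_list". Nokogiri, however, does not seem to have a problem with multiple elements sharing the same id: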

require 'nokogiri'

html = <<'END_OF_HTML'
<div class="item listing 1085130 even featured selected">
  <div>
    <div id="results_list" class="photo">
     <a href="/building/27-wooster/ph?featured=1">hello</a>
     <a href="#">apple</a>
    </div>
  </div>
</div>

<div class="item listing 1085131 even featured selected">
  <div>
    <div id="results_list" class="photo">
     <a href="/building/27-wooster/ph?featured=1">world</a>
     <a href="#">cherry</a>
    </div>
  </div>
</div>

<div class="item listing 1085132 even featured selected">
  <div>
    <div id="results_list" class="photo">
     <a href="/building/27-wooster/ph?featured=1">goodbye</a>
     <a href="#">peach</a>
    </div>
  </div>
</div>
END_OF_HTML

doc = Nokogiri::HTML(html)

doc.css(".listing").each do |div|
  a_tag = div.at_xpath('.//div[@id="results_list"]/a')
  puts a_tag.text
end

--output:--
hello
world
goodbye

at_xpath() returns the first matching element, and the leading .// restricts the search to descendants of the current element.