Question

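I'm trying to scrape a website that contains several <p> tags that always start with "Located in: ...". None of the other <p> tags start with those words.

How do I get my scraper to extract only those specific tags?

Here is scraper.rb: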

    require 'open-uri'
    require 'nokogiri'
    require 'csv'

    # Store URL to be scraped
    url = "http://www.timeout.com/london/restaurants/the-50-best-street-food-stalls-in-london?package_page=68111"
    # Parse the page with Nokogiri
    # (URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs)
    page = Nokogiri::HTML(URI.open(url))

    # Collect the text of every <h3>
    name = []
    page.css('h3').each do |line|
      name << line.text.strip
    end

    # Collect the text of every <p>
    zero = []
    page.css('p').each do |line|
      zero << line.text.strip
    end

Here is the incoming HTML to be scraped:

    <div class="feature-item__text">

      <h3>
        Yu Kyu
      </h3>
      <p class="feature_item__annotation--truncated">
        <p>Everybody knows that on any given visit to...</p>
        <p><strong>Don't miss:</strong> Curry Katsu Sandwich (&pound;6.50).</p>
        <p><strong>Find them at:</strong><a href="http://www.timeout.com/london/restaurants/kerb">Kerb</a>.</p>
        <p><strong>But first check:</strong> <a href="...">@_YuKyu_</a></p>
      </p>
    </div>
    </div>
    <div class="listing_meta_controls"></div>
    </article>

Answer 1

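There are several problems with your question and with how it lines up with the HTML.

The site may be changing its wording, which throws off scrapers: "Located in:" has become "Find them at:". If so, you may not be able to rely on that text as a waypoint when locating the information you want.

That said, CSS doesn't let us find text that starts with something, but XPath does: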

@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a')

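That selector finds every <strong> tag whose text starts with "Find" (the <strong>Find them at:</strong> tags) and returns the sibling <a> tags that follow it, so you can work with the tag's text or its 'href' attribute, depending on what you need. Using that selector I see 84 hits on the page, for example: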

@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a').first.to_html 
#=> "<a href=\"http://www.timeout.com/london/restaurants/kerb\">Kerb</a>"

@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a').first.text 
#=> "Kerb"
@doc.search('//strong[starts-with(text(), "Find")]/following-sibling::a').first['href'] 
#=> "http://www.timeout.com/london/restaurants/kerb"

If you want to use CSS, you can take a different tack. Find the containing <div>, then search inside it:

require 'nokogiri'
require 'open-uri'

URL = 'http://www.timeout.com/london/restaurants/the-50-best-street-food-stalls-in-london?package_page=68111'
doc = Nokogiri::HTML(URI.open(URL)) # URI.open: Kernel#open no longer accepts URLs on Ruby 3.0+
feature_items = doc.search('div.feature-item__text').map{ |div|
  h3 = div.at('h3').text.strip
  a = div.at('strong + a')
  a_text = a.text.strip
  a_href = a['href']

  {
    h3: h3,
    a_text: a_text,
    a_href: a_href
  }
}

This returns an array of hashes, each holding the information for one particular place.

Here are the first five found:

feature_items[0, 5]
# => [{:h3=>"Yu Kyu",
#      :a_text=>"Kerb",
#      :a_href=>"http://www.timeout.com/london/restaurants/kerb"},
#     {:h3=>"Luardos",
#      :a_text=>"Kerb",
#      :a_href=>"http://www.timeout.com/london/restaurants/kerb"},
#     {:h3=>"Mission Mariscos",
#      :a_text=>"The Schoolyard",
#      :a_href=>"http://www.timeout.com/london/shopping/broadway-market-1"},
#     {:h3=>"Butchies",
#      :a_text=>"Broadway Market",
#      :a_href=>"http://www.timeout.com/london/shopping/broadway-market-1"},
#     {:h3=>"BBQ Dreamz",
#      :a_text=>"Kerb",
#      :a_href=>"http://www.timeout.com/london/restaurants/kerb"}]

Answer 2

If I understand correctly, you can simply do

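zero = []
page.css('p').each do |line|
  text = line.text.strip
  # include? already returns false for an empty string, so no separate
  # blank check (Rails' present?) is needed; the parentheses are required
  # for the condition to parse.
  if text.include?('Located in')
    zero << text
  end
end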
