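I'd like to build a scraper with the mechanize gem. Parts of it work, but the problem I'm running into is that the HTML structure I want to parse varies: some entries don't contain all of the elements I'm asking for.
This is what I have so far, and it works fine: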
require 'mechanize'

mechanize = Mechanize.new

################ GET TOTAL NUMBER OF PAGES ################
page = mechanize.get("https://yellow.local.ch/de/q?browse=physiotherapy&where=&page=1")
total_pages = page.search(".page").last.at("a").text.strip.to_i

########## LOOP OVER EACH PAGE ##############
total_pages.times do |page_num|
  url = "https://yellow.local.ch/de/q?browse=physiotherapy&where=&page=#{page_num + 1}"
  puts url
  page = mechanize.get(url)
  containers = page.search(".container")

  ################ GET DATA FOR EACH PAGE ####################
  containers.each do |container|
    firma = container.at("h2").text.strip
    categories = container.search(".categories").at("span").text.strip
    address = container.search(".address").at("span").text.strip
    phone = container.search(".contact").at("a").text.strip
    ############## URL empty (not extracted yet) ######################
  end
end
This is the HTML I want to parse:
<div class="container clearfix">
<h2>
<a class="details-entry-title-link" href="https://yellow.local.ch/de/d/Arlesheim/4144/Physiotherapie/panta-rhei-KfNsAr3NLV8NSL4Z1Q3CQA">panta rhei</a>
</h2>
<div class="listing-index-content">
<span class="categories first-two">Physiotherapie</span>
<br>
<span class="address first-two">Tramweg 2, 4144 Arlesheim</span>
<br>
<span class="contact last-two">
<span class="phone">
<label>Telefon</label>
<span class="value"><span class="star">*</span><a rel="nofollow" class="number" href="tel:+41617016318">061 701 63 18</a>
</span>
</span>
|
<span class="url">
<a class="redirect" data-href="https://www.local.ch/redirect?entity_id=KfNsAr3NLV8NSL4Z1Q3CQA&hmac=926efdfbc1e4c457a45e16b4e903b8dfdff31c8f&locale=de&url=http%3A%2F%2Fwww.pantarhei-arlesheim.ch" href="http://www.pantarhei-arlesheim.ch">www.pantarhei-arlesheim.ch</a>
</span>
</span>
<br>
<span class="icons last-two">
<span id="heart-place_KfNsAr3NLV8NSL4Z1Q3CQA">
<button class="button-reseted heart" data-action="favorited" data-create="/de/favorites?place_id=KfNsAr3NLV8NSL4Z1Q3CQA" data-delete="/de/favorites/:id">
<i class="heart-icon icon-heart-outline"></i>
</button>
<span class="interestingness" data-load-with="&lt;span class=&quot;bootstrap&quot;&gt;&lt;i class=&quot;icon-spinner icon-spin&quot;&gt;&lt;/i&gt;&lt;/span&gt;">
2
</span>
</span>
</span>
<a class="big info-link button details-entry-link-button" href="https://yellow.local.ch/de/d/Arlesheim/4144/Physiotherapie/panta-rhei-KfNsAr3NLV8NSL4Z1Q3CQA" title="panta rhei">Details</a>
</div>
</div>
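Sometimes the <span class="url"> part is missing. So I think I have to check each container for that element: if it is there, all is fine; if not, skip it and move on to the next element to parse.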
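A minimal sketch of the kind of check I have in mind, assuming each `container` is a Nokogiri node as returned by `page.search(".container")`. The helper names `text_at`, `parse_container` and `entries_with_url` are made up for illustration, and the selectors `.phone a` and `.url a` are taken from the HTML above. The idea relies on `at` returning `nil` when nothing matches, so the result can be tested before calling `text` on it:

```ruby
# Hypothetical helper: stripped text of the first node matching `selector`
# inside `node`, or nil when nothing matches.
def text_at(node, selector)
  found = node.at(selector) # `at` returns nil if there is no match
  found && found.text.strip
end

# Hypothetical helper: collects the fields of one ".container" node.
# Missing elements become nil instead of raising NoMethodError.
def parse_container(container)
  link = container.at(".url a")
  {
    firma:      text_at(container, "h2"),
    categories: text_at(container, ".categories"),
    address:    text_at(container, ".address"),
    phone:      text_at(container, ".phone a"),
    url:        link && link["href"]
  }
end

# Hypothetical helper: parses every container and skips those without a URL.
def entries_with_url(containers)
  containers.map { |container| parse_container(container) }
            .reject { |entry| entry[:url].nil? }
end
```

Inside the page loop, `entries_with_url(containers)` could then stand in for the `containers.each` block. Dropping the `reject` step would keep listings without a website and simply leave `url` as `nil`.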
Thanks for your help.