How can I ignore certain results when web scraping?

Time: 2018-07-23 11:16:34

Tags: python selenium web-scraping

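I am new to web scraping and want to get a list of hotels and prices from booking.com. However, the page uses JavaScript and, after it loads, marks some hotels as "sold out", so those entries have no price. As a result, the prices in the retrieved list are shifted up by one or more positions, depending on how many hotels are sold out.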

Here is the code I am using:

from selenium import webdriver
chrome_path = r"C:\Users\shiks\Desktop\chromedriver_win32\chromedriver.exe"
dr = webdriver.Chrome(chrome_path)
dr.get("https://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggJCAlhYSDNYBGhsiAEBmAExwgEKd2luZG93cyAxMMgBDNgBAegBAfgBApICAXmoAgM;sid=83ab0db61cc2291ed9a9875978c46395;checkin_month=8&checkin_monthday=7&checkin_year=2018&checkout_month=8&checkout_monthday=8&checkout_year=2018&class_interval=1&dest_id=20088325&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&offset=0&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=New%20York%2C%20New%20York%20State%2C%20USA&ss_all=0&ss_raw=new%20yor&ssb=empty&sshis=0&")
hotel = dr.find_elements_by_class_name("sr-hotel__name")
price = dr.find_elements_by_css_selector("strong.price")
for hotel1, price1 in zip(hotel, price):
    print(hotel1.text + " - " + price1.text)

This is the output I get:

The Watson Hotel - ₹ 14,824
citizenM New York Times Square - ₹ 21,984
Studio Plus - Midtown Spacious Apartment - ₹ 21,984
HGU New York - ₹ 22,397
Comfortable 2 bedroom by Wall street - ₹ 31,632
Gansevoort Meatpacking - ₹ 20,261
MOXY NYC Times Square - ₹ 15,782
La Quinta Inn & Suites New York City Central Park - ₹ 19,165
Madison LES Hotel - ₹ 33,079
Courtyard by Marriott New York Manhattan/Central Park - ₹ 23,362
LUMA Hotel - Times Square - ₹ 16,884
The Assemblage John Street - ₹ 29,709
Broadway at Times Square Hotel - ₹ 15,919
Splendid Apartment by Times SQ - ₹ 36,456
Candlewood Suites NYC -Times Square - ₹ 20,399

However, HGU New York is sold out, and the price "22,397" actually belongs to "Comfortable 2 bedroom by Wall street". How can I fix this?
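The shift can be reproduced without a browser. The following is a minimal sketch using made-up data (the hotel names and prices are hypothetical, not taken from the page): when one hotel has no price element, the two lists have different lengths and zip pairs them purely by position.

```python
# Made-up data: three hotels, the middle one sold out (so it has no price).
hotels = ["Hotel A", "Hotel B (sold out)", "Hotel C"]
prices = ["₹ 100", "₹ 300"]  # only two price elements exist on the page

# zip pairs items by position and stops at the shorter list, so Hotel B
# receives Hotel C's price and Hotel C is dropped without any error.
pairs = list(zip(hotels, prices))
print(pairs)
```

Every hotel after the first sold-out one ends up with the wrong price, which matches the output shown above.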

2 answers:

Answer 0 (score: 1)

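I would use the same approach as the other answer, but locate the result blocks with XPath rather than CSS selectors, because locating elements with CSS has not worked well for me. I really like the SelectorGadget Chrome extension for getting an element's XPath: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb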

My full code:

from selenium import webdriver

chrome_path = r"C:\Users\shiks\Desktop\chromedriver_win32\chromedriver.exe"
dr = webdriver.Chrome(chrome_path)

dr.get("https://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggJCAlhYSDNYBGhsiAEBmAExwgEKd2luZG93cyAxMMgBDNgBAegBAfgBApICAXmoAgM;sid=83ab0db61cc2291ed9a9875978c46395;checkin_month=8&checkin_monthday=7&checkin_year=2018&checkout_month=8&checkout_monthday=8&checkout_year=2018&class_interval=1&dest_id=20088325&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&offset=0&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=New%20York%2C%20New%20York%20State%2C%20USA&ss_all=0&ss_raw=new%20yor&ssb=empty&sshis=0&")

search_results = dr.find_elements_by_xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "sr_flex_layout", " " ))]')

for result_card in search_results:

    hotel_name = result_card.find_elements_by_class_name("sr-hotel__name")[0].text

    price_obj = result_card.find_elements_by_css_selector("strong.price")

    if price_obj:
        price = price_obj[0].text
    else:
        price = 'Unknown'

    print(hotel_name, price)

Answer 1 (score: 0)

For this particular website, doing

hotel = dr.find_elements_by_class_name("sr-hotel__name")
price = dr.find_elements_by_css_selector("strong.price")

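is incorrect because, as you have seen, each sold-out hotel shifts the prices by one position. My suggestion is to use the elements with the class sr_item sr_item_new sr_item_default sr_property_block sr_flex_layout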

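An element with this class is the complete block of details for one hotel. Once you have these blocks, you can loop over them and read the hotel name and price from each one. That way you can check whether the price is empty and, if it is, ignore that block.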
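As a rough sketch of that idea (not tested against the live site): the function name available_hotels is made up for illustration, the selectors are the ones quoted in this thread and may no longer match the page, and the driver is assumed to expose Selenium's find_elements(by, value) method.

```python
# Selenium's By.CSS_SELECTOR constant is the plain string "css selector";
# using the string directly keeps this sketch free of imports.
CSS = "css selector"


def available_hotels(driver):
    """Return (name, price) pairs, skipping result blocks with no price."""
    rows = []
    # Each ".sr_item" element is assumed to wrap exactly one hotel.
    for block in driver.find_elements(CSS, ".sr_item"):
        names = block.find_elements(CSS, ".sr-hotel__name")
        prices = block.find_elements(CSS, "strong.price")
        # A sold-out hotel has no price element (or an empty one): ignore it.
        if not names or not prices or not prices[0].text.strip():
            continue
        rows.append((names[0].text.strip(), prices[0].text.strip()))
    return rows
```

Called as available_hotels(dr) after dr.get(...), this would return only the hotels that still show a price. Because the name and price are looked up inside the same block, they cannot drift apart.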