Some data does not appear when scraping with a for loop in Selenium / Python?

Asked: 2020-03-30 23:04:12

Tags: python-3.x pandas selenium selenium-webdriver

I am using a for loop with the Selenium web driver to scrape multiple pages of booking.com. However, some items do not appear, even though they are present when I inspect the page. Could you tell me what the problem is and how to fix it? I checked other posts here and they all suggest using a timer; I added a timer (sleep) every time a new page is loaded, but without success.

If I scrape a single page I get the complete records, but it takes a lot of time, so I want to automate it. According to the booking.com link, each page provides 28 records, and the second page has an offset of 25.

Here I am trying to extract the hotels in Wellington, which has 4 pages; I tested my code on two pages. Please help and let me know what is going wrong.

My code is below:


#Importing necessary libraries

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
import re
import requests

from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from itertools import zip_longest

# Creating empty lists for hotel names, ratings, locations, description links etc. and appending them inside the loop
names = []
rating = []
location = []
links = []
reviews = []
price = []
p1 = []
desc = []
loc = []
src_link = []
category = []

driver = webdriver.Chrome(ChromeDriverManager().install())
for pageno in range(0,50,25):

    print(pageno)

    driver.get("https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggI46AdIM1gEaK4BiAEBmAEJuAEXyAEM2AED6AEBiAIBqAIDuAKhtYn0BcACAQ&sid=560904567b64f1e8c80d883e4882616f&tmpl=searchresults&checkin_month=8&checkin_monthday=1&checkin_year=2020&checkout_month=8&checkout_monthday=4&checkout_year=2020&class_interval=1&dest_id=-1521348&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=47769c9973ad002d&ss=Wellington&ss_all=0&ssb=empty&sshis=0&ssne=Wellington&ssne_untouched=Wellington&top_ufis=1&rows=25&offset=0" + str(pageno))
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

#Hotel name
    for item in soup.findAll('span', {'class': 'sr-hotel__name'}):
        names.append(item.get_text(strip=True))
        #print(names)
# Number of reviews
    for item in soup.findAll('div', {'class' : 'bui-review-score__text'}):
        reviews.append(item.get_text(strip = True))
        #print(reviews)
#Number of ratings
    for item in soup.findAll("div", {'class': 'bui-review-score__badge'}):
        rating.append(item.get_text(strip=True))
        #print(rating)
# Extracting each hotel links
    for item in soup.findAll("a", {'class': 'hotel_name_link url'}):
        item = item.get("href").strip("\n")
        links.append(f"https://www.booking.com{item}")
#Extracting each hotel image link
    for link in soup.find_all("img", class_='hotel_image'):
        a = link.attrs["src"]
        src_link.append(a)
#Opening each hotel link and extracting location and hotel description
    for item in links:
        r = requests.get(item)
        soup = BeautifulSoup(r.text, 'html.parser')
        for item in soup.findAll("div", {'id': 'property_description_content'}):
            desc.append(item.get_text("\n", strip=True))
        for item in soup.findAll("span", {'class': 'hp_address_subtitle'}):
            loc.append(item.get_text(strip = True))
#Extracting hotel category type
    for item in links:


        driver.get(item)
        WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2#hp_hotel_name")))

        try:
            job_title = driver.find_element_by_css_selector("h2#hp_hotel_name>span").text
            category.append(job_title)
        #print(category)

        except:
            category.append("None")

# Converting all the details into dataframe and csv file
final = []
for item in zip_longest(names, reviews, rating, desc, loc, src_link, links, category):
    final.append(item)

df5 = pd.DataFrame(
    final, columns=['Names', 'Reviews','Rating', 'Description', 'Location', 'image', 'links', 'category'])
#df5.to_csv('booked.csv')
#driver.quit()

Output: the hotel names, reviews and ratings are not shown for the last 20 records.

[screenshot of the output]

2 Answers:

Answer 0 (score: 0):

Use Python's built-in string formatting, i.e. "There are {} pages.".format('4'). The old way of doing it was "There are %s pages." % 4.
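
A minimal, runnable comparison of the two styles (the value 4 here is just an illustrative page count):

pages = 4
print("There are {} pages.".format(pages))   # new-style str.format()
print("There are %s pages." % pages)         # old-style % formatting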

You already have the right idea of counting the items and stepping through them in steps of 25 with the range function. But also note that if you do the following:

for i in range(1, 25):

it only counts up to 24 and never actually includes 25. So in your range function it never actually reaches 50, and therefore there is no second page. I would do this instead:

for pageno in range(0,51,25):
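
A quick check in a plain Python shell shows what each range actually yields:

print(list(range(0, 50, 25)))   # [0, 25]
print(list(range(0, 51, 25)))   # [0, 25, 50]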

and change the driver.get string so that it works with .format(), which substitutes whatever you pass to format() in place of the curly braces:

driver.get("https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggI46AdIM1gEaK4BiAEBmAEJuAEXyAEM2AED6AEBiAIBqAIDuAKhtYn0BcACAQ&sid=560904567b64f1e8c80d883e4882616f&tmpl=searchresults&checkin_month=8&checkin_monthday=1&checkin_year=2020&checkout_month=8&checkout_monthday=4&checkout_year=2020&class_interval=1&dest_id=-1521348&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=47769c9973ad002d&ss=Wellington&ss_all=0&ssb=empty&sshis=0&ssne=Wellington&ssne_untouched=Wellington&top_ufis=1&rows=25&offset={}".format(pageno) 

Answer 1 (score: 0):

If you can get everything from the hotel's main page, then I don't think you should implement so many for loops, because some of the loops mentioned in @ThePyGuy's answer are quite illogical.

First, get the response:

url="https://www.booking.com/searchresults.en-gb.html?dest_id=-1521348;dest_type=city;offset=0;ss=Wellington;tmpl=searchresults"

response=requests.get(url)

Now call the method, which should be implemented like this:

from urllib.parse import urljoin   # needed to build absolute hotel URLs

def pagination(response):
    data = {}
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = soup.findAll("a", {'class': 'hotel_name_link url'})
    img_urls = soup.findAll("img", class_='hotel_image')
    for i in urls:
        # Open each hotel page and pull all the details from it
        resp = requests.get(urljoin(response.url, i.get("href").strip("\n")))
        sp = BeautifulSoup(resp.text, 'html.parser')
        data['Names'] = sp.h2.text.strip()  # You can get Category from here also
        data['Rating'] = sp.find("div", {'class': 'bui-review-score__badge'}).get_text(strip=True)
        data['Reviews'] = sp.find('div', {'class': 'bui-review-score__text'}).get_text(strip=True)
        data['Description'] = next(iter([item.get_text("\n", strip=True) for item in sp.findAll("div", {'id': 'property_description_content'})]), 'None')
        data['Location'] = sp.find("span", {'class': 'hp_address_subtitle'}).get_text(strip=True)
        data['image'] = img_urls[urls.index(i)].attrs["src"]
        data['links'] = resp.url
        print(data)
    # Follow the "Next page" link until there is none left
    try:
        next_page = soup.find("a", {'title': re.compile("Next page")}).attrs['href']
        if next_page:
            response = requests.get(next_page)
            print(response.url)
            pagination(response)
    except AttributeError:
        print('Scraping Completed...!')

You can use the dictionary, which is best suited for writing the csv file. In case you think you can't get the data from the hotel's main page, here is a screenshot of the data:

[screenshot of the scraped data]

Just add the csv-writing code in place of print(data) and then use it wherever you need it.
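
As a rough sketch of that idea (assuming each hotel's data dict is appended to a list instead of being printed, and reusing pandas, which the question already imports), the export could look like this:

import pandas as pd

rows = []   # inside pagination(), replace print(data) with rows.append(dict(data))

# ... run pagination(response) so that rows gets filled ...

# write everything out in one go once the crawl has finished
df = pd.DataFrame(rows, columns=['Names', 'Rating', 'Reviews', 'Description', 'Location', 'image', 'links'])
df.to_csv('booked.csv', index=False)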