I am using a for loop with the Selenium WebDriver to scrape multiple pages of booking.com. However, some items do not appear, even though they are present when I inspect the page. Can you tell me what the problem is and how to fix it? I checked other posts here and they all suggest using a timer. I use a timer whenever a new page is loaded, but without success.
If I scrape a single page I can get the complete records, but that takes a lot of time, so I want to automate it. According to booking.com's links, each page provides 28 records, and the second page's offset is 25.
Here I am trying to extract the hotels in Wellington, which has 4 pages. I tested my code with two pages. Please help and tell me what is going wrong.
My code is below:
# Importing the necessary libraries
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
import re
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from itertools import zip_longest
# Creating empty lists for hotel names, ratings, locations, description links etc., appended in the loops below
names = []
rating = []
location = []
links = []
reviews = []
price = []
p1 = []
desc = []
loc = []
src_link = []
category = []
driver = webdriver.Chrome(ChromeDriverManager().install())
for pageno in range(0, 50, 25):
    print(pageno)
    driver.get("https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggI46AdIM1gEaK4BiAEBmAEJuAEXyAEM2AED6AEBiAIBqAIDuAKhtYn0BcACAQ&sid=560904567b64f1e8c80d883e4882616f&tmpl=searchresults&checkin_month=8&checkin_monthday=1&checkin_year=2020&checkout_month=8&checkout_monthday=4&checkout_year=2020&class_interval=1&dest_id=-1521348&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=47769c9973ad002d&ss=Wellington&ss_all=0&ssb=empty&sshis=0&ssne=Wellington&ssne_untouched=Wellington&top_ufis=1&rows=25&offset=0" + str(pageno))
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # Hotel name
    for item in soup.findAll('span', {'class': 'sr-hotel__name'}):
        names.append(item.get_text(strip=True))
    #print(names)
    # Number of reviews
    for item in soup.findAll('div', {'class': 'bui-review-score__text'}):
        reviews.append(item.get_text(strip=True))
    #print(reviews)
    # Number of ratings
    for item in soup.findAll("div", {'class': 'bui-review-score__badge'}):
        rating.append(item.get_text(strip=True))
    #print(rating)
    # Extracting each hotel's link
    for item in soup.findAll("a", {'class': 'hotel_name_link url'}):
        item = item.get("href").strip("\n")
        links.append(f"https://www.booking.com{item}")
    # Extracting each hotel's image link
    for link in soup.find_all("img", class_='hotel_image'):
        a = link.attrs["src"]
        src_link.append(a)
# Opening each hotel link and extracting the location and hotel description
for item in links:
    r = requests.get(item)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll("div", {'id': 'property_description_content'}):
        desc.append(item.get_text("\n", strip=True))
    for item in soup.findAll("span", {'class': 'hp_address_subtitle'}):
        loc.append(item.get_text(strip=True))
# Extracting the hotel category type
for item in links:
    driver.get(item)
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h2#hp_hotel_name")))
    try:
        job_title = driver.find_element_by_css_selector("h2#hp_hotel_name>span").text
        category.append(job_title)
        #print(category)
    except:
        category.append("None")
# Converting all the details into a dataframe and a csv file
final = []
for item in zip_longest(names, reviews, rating, desc, loc, src_link, links, category):
    final.append(item)
df5 = pd.DataFrame(
    final, columns=['Names', 'Reviews', 'Rating', 'Description', 'Location', 'image', 'links', 'category'])
#df.to_csv('booked.csv')
#driver.quit()
Output: the hotel name, reviews and rating are missing for the last 20 records.
Answer 0 (score: 0)
Use Python's built-in string formatting, i.e. "There are {} pages.".format('4'). The old way of doing this is "There are %s pages." % 4.
You already have the idea of counting the number of items and stepping by 25 using the range function. Also note that if you do the following:
for i in range(1, 25):
it only counts up to 24 and never actually reaches 25. So in your range function it never actually reaches 50, and therefore there is no second page. I would do this:
for pageno in range(0, 51, 25):
and change the driver.get string so that it works with .format(), which will substitute whatever you put in format for the curly braces.
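As a quick illustration of the two formatting styles mentioned above (standalone, not tied to the booking.com URL):

```python
# New-style str.format vs. old-style % formatting -- both produce the same text.
new_style = "There are {} pages.".format(4)
old_style = "There are %s pages." % 4
print(new_style)  # There are 4 pages.
print(old_style)  # There are 4 pages.
```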
driver.get("https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggI46AdIM1gEaK4BiAEBmAEJuAEXyAEM2AED6AEBiAIBqAIDuAKhtYn0BcACAQ&sid=560904567b64f1e8c80d883e4882616f&tmpl=searchresults&checkin_month=8&checkin_monthday=1&checkin_year=2020&checkout_month=8&checkout_monthday=4&checkout_year=2020&class_interval=1&dest_id=-1521348&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=47769c9973ad002d&ss=Wellington&ss_all=0&ssb=empty&sshis=0&ssne=Wellington&ssne_untouched=Wellington&top_ufis=1&rows=25&offset={}".format(pageno))
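A minimal sketch of the combined range-and-format fix; the URL is abbreviated here to its key parameters, and in practice the full query string from the question should be kept intact:

```python
# range(0, 51, 25) steps through the offsets 0, 25 and 50,
# and .format() substitutes each offset into the {} placeholder.
base = "https://www.booking.com/searchresults.en-gb.html?ss=Wellington&rows=25&offset={}"
for pageno in range(0, 51, 25):
    print(base.format(pageno))
```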
Answer 1 (score: 0)
If you can get everything from the hotel's main page, then I don't think you should implement that many for loops, since some of the loops mentioned in @ThePyGuy's answer are quite illogical.
First, get the response:
url = "https://www.booking.com/searchresults.en-gb.html?dest_id=-1521348;dest_type=city;offset=0;ss=Wellington;tmpl=searchresults"
response = requests.get(url)
Now call the method, which should be implemented like this:
import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def pagination(response):
    data = {}
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = soup.findAll("a", {'class': 'hotel_name_link url'})
    img_urls = soup.find_all("img", class_='hotel_image')
    for i in urls:
        resp = requests.get(urljoin(response.url, i.get("href").strip("\n")))
        sp = BeautifulSoup(resp.text, 'html.parser')
        data['Names'] = sp.h2.text.strip()  # You can also get the category from here
        data['Rating'] = sp.find("div", {'class': 'bui-review-score__badge'}).get_text(strip=True)
        data['Reviews'] = sp.find('div', {'class': 'bui-review-score__text'}).get_text(strip=True)
        data['Description'] = next(iter([item.get_text("\n", strip=True) for item in sp.findAll("div", {'id': 'property_description_content'})]), 'None')
        data['Location'] = sp.find("span", {'class': 'hp_address_subtitle'}).get_text(strip=True)
        data['image'] = img_urls[urls.index(i)].attrs["src"]
        data['links'] = resp.url
        print(data)
    try:
        next_page = soup.find("a", {'title': re.compile("Next page")}).attrs['href']
        if next_page:
            response = requests.get(next_page)
            print(response.url)
            pagination(response)
    except AttributeError:
        print('Scraping Completed...!')
You can use the dictionary in whatever way best suits producing a csv file. If you think you cannot get the data from the hotel's main page, here is a screenshot of the data.
Just add csv-writing code in place of print(data), and then use it wherever you need it.
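A minimal sketch of such a csv-writing replacement for print(data), using csv.DictWriter from the standard library. The field names mirror the keys of the data dict in the answer, and the write_row helper and the booked.csv file name are my own illustrative choices (the latter echoing the commented-out name in the question); append mode is assumed so each record adds a row to the same file:

```python
import csv

# Column order mirrors the keys assigned to data in the pagination function.
FIELDS = ['Names', 'Rating', 'Reviews', 'Description', 'Location', 'image', 'links']

def write_row(data, path='booked.csv'):
    """Append one hotel record to the CSV, writing the header on first use."""
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # file is empty -> write the header row once
            writer.writeheader()
        writer.writerow(data)
```

Calling write_row(data) inside the loop instead of print(data) then accumulates one row per hotel across all pages.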