Booking.com web-scraping script not working

Asked: 2021-05-18 12:54:41

Tags: python python-3.x web-scraping beautifulsoup

I wrote a script to scrape the hotel names, ratings, and perks of the hotels on this page: link

Here is my script:

import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)]

root_url = 'https://www.booking.com/'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]

pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    try:
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)

    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)

    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')



data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})


#print(data.head(20))

data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')

It worked: I made a loop to scrape all the hotel links, then scrape the ratings and perks for each of those hotels. But I was getting duplicates, so instead of: links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]

I used: links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)] as you can see in my script above.

But now it no longer works: I only get Nan. Before, when I still had the duplicates, there were some Nan, but most rows had the perks and ratings I wanted. I don't understand why.
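
As an aside, another way to avoid the duplicates would have been to keep the broad find_all and de-duplicate the hrefs afterwards. A minimal sketch; the '/hotel/' filter is an assumption about how Booking.com structures its hotel URLs:

all_links = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
# dict.fromkeys() keeps the first occurrence of each href and drops the repeats, preserving order
links1 = [h for h in dict.fromkeys(all_links) if '/hotel/' in h]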

Here is the HTML for a hotel link:

[screenshot: hotellink]

Here is the HTML used to get the name (after I get the link, the script goes to that hotel's page):

[screenshot: namehtml]

Here is the HTML used to get all the perks associated with a hotel (as with the name, the script goes to the link scraped earlier):

[screenshot: perkshtml]

And here are my results:

[screenshot: output]

1 Answer:

Answer 0 (score: 1)

The href attributes on that site contain newline characters: one at the start, and a few more partway through. So when you try to combine them with root_url, you do not get valid URLs.

The fix is to remove all the newlines. Since each href always starts with a /, the trailing slash can also be dropped from root_url, or you can use urllib.parse.urljoin().
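
For reference, the urljoin() variant could look like this (a sketch; note that urljoin() does not strip the newlines by itself, so they still have to be removed):

from urllib.parse import urljoin

root_url = 'https://www.booking.com/'
# urljoin() resolves the leading / in each href, so the trailing slash on root_url is harmless
urls1 = [urljoin(root_url, i.replace('\n', '')) for i in links1]

Here is the full script with the newline fix applied: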

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

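# Strip the newlines embedded in each href before joining it with the root URL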
links1 = [a['href'].replace('\n','')  for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url',  href=True)]
root_url = 'https://www.booking.com'
urls1 = [f'{root_url}{i}' for i in links1]

pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    try:
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')


data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})

#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')

This gives you an output CSV file:

Notes;Points fort;Nom
 8,3 ;['Parking (fee required)', 'Free WiFi Internet Access Included', 'Family Rooms', 'Airport Shuttle', 'Non Smoking Rooms', '24 hour Front Desk', 'Bar'];Elysées Union
 8,4 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', '24 hour Front Desk', 'Rooms/Facilities for Disabled'];Hyatt Regency Paris Etoile
 8,3 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', 'Restaurant', '24 hour Front Desk', 'Bar'];Pullman Paris Tour Eiffel
 8,7 ;['Free WiFi Internet Access Included', 'Non Smoking Rooms', 'Restaurant', '24 hour Front Desk', 'Rooms/Facilities for Disabled', 'Elevator', 'Bar'];citizenM Paris Gare de Lyon
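
One more thing worth noting: the per-hotel requests inside the loop are sent without the headers dict, so Booking.com may treat them differently from the search-results request. An optional refinement (a sketch that reuses the headers and urls1 from the script above) is to send everything through a requests.Session:

session = requests.Session()
session.headers.update(headers)  # the same headers dict defined at the top

for url in urls1:
    results = session.get(url)   # the headers are now sent with every hotel request too
    soup = BeautifulSoup(results.text, "html.parser")
    # ... same parsing as in the loop above ...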