I'm trying to scrape some data from booking.com, but I get this error:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.booking.com%0a', port=443): Max retries exceeded with url: /hotel/fr/elyseesunion.fr.html?label=gen173nr-1FCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AEB6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBeACAQ&sid=c9f6a7c7371b88db9274005810b6f9e1&dest_id=-1456928&dest_type=city&group_adults=2&group_children=0&hapos=1&hpos=1&no_rooms=1&sr_order=popularity&srepoch=1621245602&srpvid=00bd465162de01a4&ucfs=1&from=searchresults%0A;highlight_room= (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000019BF5ECBBB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
I know this means the site is blocking me from scraping its data.
I tried several answers from this site, but none of them worked.
Here is my script:
import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random
#headers= {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []
results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")
links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
root_url = 'https://www.booking.com'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]
#print(urls1[0])
for url in urls1:
    results = requests.get(url)
    time.sleep(random.random()*10)
    soup = BeautifulSoup(results.text, "html.parser")
    div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
    pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
    print(pointfort)
As you can see, I tried time.sleep, I tried headers, and I also tried timeout, and so on.
Any ideas how to fix this? Maybe I should just wait a while?
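One detail worth noting in the traceback: the failing host is `www.booking.com%0a`, and `%0a` is a URL-encoded newline, so DNS resolution (`getaddrinfo`) fails before the site ever sees the request. Scraped `href` values can carry trailing whitespace, so one plausible fix (a sketch with made-up sample links, not a verified fix for this exact page) is to strip each link before building the final URL:

```python
# Scraped hrefs sometimes end with "\n"; a stray newline in the URL
# becomes "%0a" in the hostname and makes getaddrinfo fail.
raw_links = ["/hotel/fr/elyseesunion.fr.html\n", "/hotel/fr/other.html"]

root_url = "https://www.booking.com"
# str.strip() removes surrounding whitespace before the URL is assembled
clean_urls = [root_url + link.strip() for link in raw_links]

print(clean_urls[0])  # https://www.booking.com/hotel/fr/elyseesunion.fr.html
```

The same `.strip()` could be applied inside the list comprehension that builds `urls1` in the script above.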
Answer 0 (score: 0)
I'm using the requests_html library; see its doc link.
Try this code snippet. It works for me: it fetches the page and saves it to an HTML file.
from requests_html import HTMLSession
session = HTMLSession()
url = "https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1FCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AEB6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBeACAQ&sid=7685a7b3f07c84e4aadff993a229c309&tmpl=searchresults&class_interval=1&dest_id=-1456928&dest_type=city&dtdisc=0&inac=0&index_postcard=0&label_click=undef&lang=en-gb&offset=0&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&soz=1&srpvid=525248f6e64d0200&ss_all=0&ssb=empty&sshis=0&top_ufis=1&lang_click=other;cdl=fr;lang_changed=1"
r = session.get(url)
with open("test.html", "wb") as f:
    f.write(r.content)
print("completed")
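Once the page is saved, it can be parsed offline with BeautifulSoup using the same selectors as in the question. A minimal sketch, using a small inline HTML sample instead of `test.html` so it is self-contained (the class names are taken from the question's script and may differ on the live site):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the saved test.html page
sample_html = """
<div class="hp_desc_important_facilities clearfix hp_desc_important_facilities--bui">
  <div class="important_facility" data-name-en="Free WiFi"></div>
  <div class="important_facility" data-name-en="Parking"></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# bs4 matches a single class against multi-valued class attributes
div = soup.find("div", {"class": "hp_desc_important_facilities"})
facilities = [x["data-name-en"] for x in div.select('div[class*="important_facility"]')]
print(facilities)  # ['Free WiFi', 'Parking']
```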