I'm trying to scrape some data from booking.com, but I get this error:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.booking.com%0a', port=443): Max retries exceeded with url: /hotel/fr/elyseesunion.fr.html?label=gen173nr-1FCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AEB6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBeACAQ&sid=c9f6a7c7371b88db9274005810b6f9e1&dest_id=-1456928&dest_type=city&group_adults=2&group_children=0&hapos=1&hpos=1&no_rooms=1&sr_order=popularity&srepoch=1621245602&srpvid=00bd465162de01a4&ucfs=1&from=searchresults%0A;highlight_room= (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000019BF5ECBBB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
I know this means the site is blocking me from scraping its data.
I tried several answers from this site, but none of them worked.
Here is my script:
import numpy as np
import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random
#headers= {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []
results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")
links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]
root_url = 'https://www.booking.com'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]
#print(urls1[0])
for url in urls1:
    results = requests.get(url)
    time.sleep(random.random()*10)
    soup = BeautifulSoup(results.text, "html.parser")
    div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
    pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
    print(pointfort)
As you can see, I tried time.sleep, I tried headers, and I also tried timeout, and so on.
Any ideas how to fix this? Maybe I should just wait a while?
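One detail worth noting in the traceback: the failing host is `www.booking.com%0a`, and `%0a` is a URL-encoded newline, so DNS resolution (`getaddrinfo`) fails before the site ever sees the request. Scraped `href` values can carry trailing whitespace, so one plausible fix (a sketch with made-up sample links, not a verified fix for this exact page) is to strip each link before building the final URL:

```python
# Scraped hrefs sometimes end with "\n"; a stray newline in the URL
# becomes "%0a" in the hostname and makes getaddrinfo fail.
raw_links = ["/hotel/fr/elyseesunion.fr.html\n", "/hotel/fr/other.html"]

root_url = "https://www.booking.com"
# str.strip() removes surrounding whitespace before the URL is assembled
clean_urls = [root_url + link.strip() for link in raw_links]

print(clean_urls[0])  # https://www.booking.com/hotel/fr/elyseesunion.fr.html
```

The same `.strip()` could be applied inside the list comprehension that builds `urls1` in the script above.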
Answer 0 (score: 0)
I'm using the requests_html library; see its doc link.
Try this code snippet. It works for me: it fetches the page and saves it to an HTML file.
from requests_html import HTMLSession
session = HTMLSession()
url = "https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1FCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AEB6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBeACAQ&sid=7685a7b3f07c84e4aadff993a229c309&tmpl=searchresults&class_interval=1&dest_id=-1456928&dest_type=city&dtdisc=0&inac=0&index_postcard=0&label_click=undef&lang=en-gb&offset=0&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&soz=1&srpvid=525248f6e64d0200&ss_all=0&ssb=empty&sshis=0&top_ufis=1&lang_click=other;cdl=fr;lang_changed=1"
r = session.get(url)
with open("test.html", "wb") as f:
    f.write(r.content)
print("completed")
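Once the page is saved, it can be parsed offline with BeautifulSoup using the same selectors as in the question. A minimal sketch, using a small inline HTML sample instead of `test.html` so it is self-contained (the class names are taken from the question's script and may differ on the live site):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the saved test.html page
sample_html = """
<div class="hp_desc_important_facilities clearfix hp_desc_important_facilities--bui">
  <div class="important_facility" data-name-en="Free WiFi"></div>
  <div class="important_facility" data-name-en="Parking"></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# bs4 matches a single class against multi-valued class attributes
div = soup.find("div", {"class": "hp_desc_important_facilities"})
facilities = [x["data-name-en"] for x in div.select('div[class*="important_facility"]')]
print(facilities)  # ['Free WiFi', 'Parking']
```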