Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Date: 2021-05-17 10:08:43

Tags: python python-3.x web-scraping beautifulsoup

I am trying to scrape some data from booking.com, but I get this error:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.booking.com%0a', port=443): Max retries exceeded with url: /hotel/fr/elyseesunion.fr.html?label=gen173nr-1FCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AEB6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBeACAQ&sid=c9f6a7c7371b88db9274005810b6f9e1&dest_id=-1456928&dest_type=city&group_adults=2&group_children=0&hapos=1&hpos=1&no_rooms=1&sr_order=popularity&srepoch=1621245602&srpvid=00bd465162de01a4&ucfs=1&from=searchresults%0A;highlight_room= (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000019BF5ECBBB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

I take this to mean that the website is blocking me from scraping its data.

I have tried several answers from this site, but none of them worked.
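(Worth noting: the host in the traceback is `www.booking.com%0a`, not `www.booking.com`. `%0a` is a URL-encoded newline, so the hostname handed to DNS resolution ends with `\n`, which by itself is enough to make `getaddrinfo` fail. A minimal sketch showing what that encoded character actually is:)

```python
from urllib.parse import unquote

# The failing host in the traceback is "www.booking.com%0a".
# "%0a" decodes to a newline, so DNS resolution is asked to look up
# a hostname that literally ends with "\n" -- getaddrinfo fails.
host = "www.booking.com%0a"
decoded = unquote(host)
print(repr(decoded))  # 'www.booking.com\n'
```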

Here is my script:

import random
import re
import time
from random import randint

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests import get

#headers= {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []

results = requests.get(url0, headers = headers)


soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href']  for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a',  href=True)]
 

root_url = 'https://www.booking.com'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]

#print(urls1[0])

for url in urls1:
    results = requests.get(url)

    time.sleep(random.random() * 10)

    soup = BeautifulSoup(results.text, "html.parser")

    div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
    pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]

print(pointfort)
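One thing worth checking in the loop above: `href` values scraped from a page can carry stray whitespace or a trailing newline, and concatenating them onto `root_url` as-is produces exactly the kind of `www.booking.com%0a` host seen in the traceback. A hedged sketch of building the URL list more defensively with the standard library (the sample hrefs here are made up for illustration):

```python
from urllib.parse import urljoin

root_url = "https://www.booking.com"
# Example hrefs as they might come back from BeautifulSoup, including
# one with a trailing newline that would later break DNS resolution.
links1 = ["/hotel/fr/elyseesunion.fr.html\n", "/hotel/fr/other.fr.html"]

# .strip() removes stray whitespace/newlines; urljoin handles the slashes.
urls1 = [urljoin(root_url, link.strip()) for link in links1]
print(urls1)
```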

As you can see, I have tried time.sleep and custom headers. I have also tried a timeout, and so on.

Any ideas on how to solve this?

Or should I just wait for a while?

1 Answer:

Answer 0 (score: 0):

I am using the requests_html library. See this doc link.

Try this code snippet. It worked for me. It fetches the page and stores it in an HTML file.


from requests_html import HTMLSession

session = HTMLSession()

url = "https://www.booking.com/searchresults.en-gb.html?label=gen173nr-1FCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AEB6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBeACAQ&sid=7685a7b3f07c84e4aadff993a229c309&tmpl=searchresults&class_interval=1&dest_id=-1456928&dest_type=city&dtdisc=0&inac=0&index_postcard=0&label_click=undef&lang=en-gb&offset=0&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&soz=1&srpvid=525248f6e64d0200&ss_all=0&ssb=empty&sshis=0&top_ufis=1&lang_click=other;cdl=fr;lang_changed=1"
r = session.get(url)
with open("test.html", "wb") as f:
    f.write(r.content)

print("completed")
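Separately from the answer above: if connection errors persist even after the URL itself is clean, a common mitigation with plain `requests` is to mount a retry-enabled adapter on a `Session`, so transient failures are retried with backoff before `ConnectionError` is raised. A sketch under that assumption (not part of the original answer; `Retry` comes from urllib3, which requests uses internally):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times with exponential backoff on common transient
# status codes before giving up and raising an exception.
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)

# session.get(url, headers=headers, timeout=10) would now retry
# transient failures instead of failing on the first one.
```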