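I'm trying to scrape information from AllRecipes.co.uk, but when I run my code I don't get the page I expect. Instead, I'm sent to an interstitial page that asks me to accept the privacy policy first. This means I can't scrape the pages I want, because every page I request comes back as this "accept privacy policy" page.
The website is AllRecipes.co.uk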
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import numpy as np
import os
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
headers = {
    'user-agent': userAgent
}
dishType = "main-recipes"
url = 'http://allrecipes.co.uk/recipes/' + dishType + '.aspx?page='
#endPage = 1259
endPage = 3
for i in range(2, endPage):
    delays = [5, 7, 9, 11, 13, 15]
    delay = np.random.choice(delays)
    time.sleep(delay)
    print("Getting request " + str(i))
    r = requests.get(url + str(i))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)
    #names = soup.findAll('div', attrs = {'class' : "col-sm-7"})
    #for name in names:
    #    print(name)
Answer 0 (score: 1)
You just need to set the euConsentId cookie to some text value.
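To fit this into your code, I would instantiate a session and set the cookie on it. The interactive check here shows that the cookie alone gets you past the privacy page: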
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: url = "http://allrecipes.co.uk/recipes/main-recipes.aspx?page=2"
In [4]: BeautifulSoup(requests.get(url).content, "html.parser").title.get_text()
Out[4]: 'About your privacy on this site'
In [5]: import uuid
In [6]: BeautifulSoup(requests.get(url, cookies={'euConsentId': str(uuid.uuid4())}).content, "html.parser").title.get_text()
Out[6]: 'Main course recipes - All recipes UK '
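
The session approach mentioned above could look roughly like this. It is a minimal sketch, not a tested scraper: the helper names make_consent_session and fetch_page, the BASE_URL constant, and the timeout value are made up for illustration, and it assumes the site keeps accepting any random UUID string as the euConsentId value, as it did in the check above.

```python
import uuid

import requests

# URL pattern taken from the question; dish_type is e.g. "main-recipes".
BASE_URL = "http://allrecipes.co.uk/recipes/{dish_type}.aspx"


def make_consent_session(user_agent=None):
    """Return a requests.Session that sends the euConsentId cookie."""
    session = requests.Session()
    # Assumption: any random UUID string is accepted as the consent id.
    session.cookies.set("euConsentId", str(uuid.uuid4()))
    if user_agent:
        # Optional: reuse the browser-like user agent from the question.
        session.headers.update({"User-Agent": user_agent})
    return session


def fetch_page(session, dish_type, page, timeout=30):
    """Fetch one listing page and return its raw HTML bytes."""
    response = session.get(
        BASE_URL.format(dish_type=dish_type),
        params={"page": page},  # becomes ?page=<page>
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.content
```

In your loop you would create the session once before the for statement, then replace requests.get(url + str(i)) with fetch_page(session, dishType, i) and pass the result to BeautifulSoup as before. Because the cookie lives on the session, every request in the loop sends it.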