I want to scrape a website that requires logging in, using Python with the requests and BeautifulSoup libraries (no Selenium). Here is my code:
import requests
from bs4 import BeautifulSoup

auth = (username, password)

headers = {
    'authority': 'signon.springer.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'origin': 'https://signon.springer.com',
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://signon.springer.com/login?service=https%3A%2F%2Fpress.nature.com%2Fcallback%3Fclient_name%3DCasClienthttps%3A%2F%2Fpress.nature.com&locale=en&gtm=GTM-WDRMH37&message=This+page+is+only+accessible+for+approved+journalists.+Please+log+into+your+press+site+account.+For+more+information%3A+https%3A%2F%2Fpress.nature.com%2Fapprove-as-a-journalist&_ga=2.25951165.1431685211.1610963078-2026442578.1607341887',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'SESSION=40d2be77-b3df-4eb6-9f3b-dac31ab66ce3',
}

params = (
    ('service', 'https://press.nature.com/callback?client_name=CasClienthttps://press.nature.com'),
    ('locale', 'en'),
    ('gtm', 'GTM-WDRMH37'),
    ('message', 'This page is only accessible for approved journalists. Please log into your press site account. For more information: https://press.nature.com/approve-as-a-journalist'),
    ('_ga', '2.25951165.1431685211.1610963078-2026442578.1607341887'),
)

data = {
    'username': username,
    'password': password,
    'rememberMe': 'true',
    'lt': 'LT-95560-qF7CZnAtuDqWS1sFQgBMqPVifS5mTg-16c07928-2faa-4ce0-58c7-5a1f',
    'execution': 'e1s1',
    '_eventId': 'submit',
    'submit': 'Login'
}

session = requests.session()
response = session.post('https://signon.springer.com/login', headers=headers, params=params, data=data, auth=auth)
print(response)
# time.sleep(5) does not make any difference

soup = BeautifulSoup(response.content, 'html.parser')
print(soup)  # I'm not getting the results that I want
I am not getting the required HTML page containing all the data I want; the HTML page I get back is the login page. Here is the HTML response: https://www.codepile.net/pile/EGY0YQMv
I think the problem is that I want to scrape this page:
https://press.nature.com/press-releases
but when I click that link (while not logged in), it redirects me to a different site to log in:
https://signon.springer.com/login
To obtain all the headers, params, and data values I used:
inspect page -> network -> find login request -> copy cURL -> https://curl.trillworks.com/
I have tried several POST and GET approaches, with and without the auth parameter, but the result is the same.
What am I doing wrong?
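A failed form login typically shows exactly this symptom: the server silently returns the login form again instead of the target page. One quick sanity check is to test whether a response body still looks like a login form before parsing it further. A minimal sketch (the password-field heuristic and the sample HTML below are assumptions for illustration, not the real Springer markup):

```python
from bs4 import BeautifulSoup

def looks_like_login_page(html):
    """Heuristic: a page that still contains a password input
    is most likely the login form, not the target content."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('input', {'type': 'password'}) is not None

# Stand-in pages to demonstrate the check:
login_html = ('<form><input type="text" name="username">'
              '<input type="password" name="password"></form>')
content_html = '<div class="press-release"><h1>Title</h1></div>'

print(looks_like_login_page(login_html))    # True
print(looks_like_login_page(content_html))  # False
```

Running this check on `response.content` right after the POST makes the failure visible immediately instead of after scraping.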
Answer 0 (score: 1)
Try running the script with your username and password fields filled in and let me know what you get. If it still doesn't log you in, make sure to include additional headers in the POST request.
import requests
from bs4 import BeautifulSoup

link = 'https://signon.springer.com/login'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    # the line above parses the keys and values available in the login form
    payload['username'] = username
    payload['password'] = password
    print(payload)  # when you print this, you should see the required parameters within payload
    s.post(link, data=payload)
    # as we have already logged in, the login cookies are stored within the session;
    # in our subsequent requests we reuse the same session we have been using from the very beginning
    r = s.get('https://press.nature.com/press-releases')
    print(r.status_code)
    print(r.text)
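The key line in the answer above is the dict comprehension, which scrapes every named <input> out of the freshly fetched login form, so the per-request CAS tokens (lt, execution, _eventId) are always current instead of hard-coded. A small self-contained illustration (the form HTML below is a stand-in, not the actual markup served by signon.springer.com):

```python
from bs4 import BeautifulSoup

# Illustrative form mimicking the hidden CAS fields from the question;
# the real values change on every request, which is why hard-coding fails.
form_html = '''
<form>
  <input name="username" value="">
  <input name="password" value="">
  <input type="hidden" name="lt" value="LT-12345">
  <input type="hidden" name="execution" value="e1s1">
  <input type="hidden" name="_eventId" value="submit">
</form>
'''
soup = BeautifulSoup(form_html, 'html.parser')
# Collect every input that has a name; missing value attributes default to ''
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
print(payload)
# {'username': '', 'password': '', 'lt': 'LT-12345',
#  'execution': 'e1s1', '_eventId': 'submit'}
```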
Answer 1 (score: 0)
Have you tried using Selenium together with bs4 and requests? You can make the browser wait until it finds an element:
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # seconds
driver.get("https://press.nature.com/press-releases")  # redirects to the login link
# ...log in here...
driver.get("https://press.nature.com/press-releases")  # the page behind the login
That way you can go to the login URL and log in, then navigate to the page you want to scrape.
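If you do combine Selenium with requests, a common pattern is to log in once in the browser and then hand the session cookies over to a requests.Session for fast follow-up scraping. A sketch under that assumption (the helper name and the fake cookie below are illustrative; with a real driver you would pass `driver.get_cookies()`):

```python
import requests

def session_from_selenium_cookies(selenium_cookies):
    """Copy cookies from selenium's driver.get_cookies() format
    (a list of dicts) into a requests.Session, so subsequent
    requests reuse the logged-in state without the browser."""
    s = requests.Session()
    for c in selenium_cookies:
        s.cookies.set(c['name'], c['value'], domain=c.get('domain'))
    return s

# Fake cookie in the same shape driver.get_cookies() returns:
cookies = [{'name': 'SESSION', 'value': 'abc123',
            'domain': 'press.nature.com'}]
s = session_from_selenium_cookies(cookies)
print(s.cookies.get('SESSION'))  # abc123
```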
Answer 2 (score: 0)
I think your auth parameter is not in a format that requests accepts. You could try importing HTTPBasicAuth:
from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth(username, password)
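Note that HTTPBasicAuth only attaches an `Authorization: Basic <base64>` header to the request; it does not fill in an HTML login form, so since signon.springer.com appears to use a form-based (CAS) login, this alone may not be enough. A quick demonstration of what it actually does (credentials are made up):

```python
import requests
from requests.auth import HTTPBasicAuth

# Auth objects in requests are callables that mutate a prepared request
auth = HTTPBasicAuth('alice', 's3cret')
req = requests.Request('GET', 'https://example.com').prepare()
auth(req)
print(req.headers['Authorization'])
# Basic YWxpY2U6czNjcmV0
```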