Scraping a website that requires login, using BeautifulSoup

Date: 2021-01-18 12:03:01

Tags: python web-scraping beautifulsoup

I want to scrape a website that requires login, using Python with the requests library and BeautifulSoup (no Selenium). Here is my code:

import requests
from bs4 import BeautifulSoup

auth = (username, password)
headers = {
    'authority': 'signon.springer.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'origin': 'https://signon.springer.com',
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://signon.springer.com/login?service=https%3A%2F%2Fpress.nature.com%2Fcallback%3Fclient_name%3DCasClienthttps%3A%2F%2Fpress.nature.com&locale=en&gtm=GTM-WDRMH37&message=This+page+is+only+accessible+for+approved+journalists.+Please+log+into+your+press+site+account.+For+more+information%3A+https%3A%2F%2Fpress.nature.com%2Fapprove-as-a-journalist&_ga=2.25951165.1431685211.1610963078-2026442578.1607341887',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'SESSION=40d2be77-b3df-4eb6-9f3b-dac31ab66ce3',
}

params = (
    ('service', 'https://press.nature.com/callback?client_name=CasClienthttps://press.nature.com'),
    ('locale', 'en'),
    ('gtm', 'GTM-WDRMH37'),
    ('message', 'This page is only accessible for approved journalists. Please log into your press site account. For more information: https://press.nature.com/approve-as-a-journalist'),
    ('_ga', '2.25951165.1431685211.1610963078-2026442578.1607341887'),
)

data = {
  'username': username,
  'password': password,
  'rememberMe': 'true',
  'lt': 'LT-95560-qF7CZnAtuDqWS1sFQgBMqPVifS5mTg-16c07928-2faa-4ce0-58c7-5a1f',
  'execution': 'e1s1',
  '_eventId': 'submit',
  'submit': 'Login'
}

session = requests.Session()
response = session.post('https://signon.springer.com/login', headers=headers, params=params, data=data, auth=auth)
print(response)
# time.sleep(5) does not make any difference
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)  # I'm not getting the results that I want

I am not getting the HTML page with all the data I want; the HTML page I get back is the login page. Here is the HTML response: https://www.codepile.net/pile/EGY0YQMv

I think the problem is that the page I want to scrape is:

https://press.nature.com/press-releases

but when I click that link (while not logged in), it redirects me to a different site to log in:

https://signon.springer.com/login

To obtain all the headers, params and data values I used:

inspect page -> network -> find login request -> copy cURL -> https://curl.trillworks.com/
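Note that the `lt` and `execution` values in the `data` dict are one-time CAS tokens: they are generated fresh each time the login page is served, so values copied out of a cURL command are stale by the time they are POSTed again. A minimal standard-library sketch of extracting them from the form instead of hard-coding them (the sample HTML below is illustrative only; the real form on signon.springer.com may use different field names):

```python
from html.parser import HTMLParser

class FormInputParser(HTMLParser):
    """Collects name/value pairs from <input> tags in a login form."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            d = dict(attrs)
            if "name" in d:
                self.fields[d["name"]] = d.get("value", "")

# Illustrative HTML only -- fetch the real login page with session.get() first.
sample = '''
<form method="post">
  <input type="text" name="username" />
  <input type="password" name="password" />
  <input type="hidden" name="lt" value="LT-12345-example" />
  <input type="hidden" name="execution" value="e1s1" />
  <input type="hidden" name="_eventId" value="submit" />
</form>
'''

parser = FormInputParser()
parser.feed(sample)
print(parser.fields["lt"])  # the fresh one-time token to send in the POST body
```

Fetching the login page and re-reading these hidden fields on every run also gives the session the cookies the server expects, instead of the hard-coded `SESSION` cookie in the headers above.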

I have tried several combinations of post and get requests, with and without the auth parameter, but the result is the same. What am I doing wrong?

3 Answers:

Answer 0 (score: 1)

Try running the script after filling in your username and password fields and let me know what you get. If it still doesn't log you in, make sure to use additional headers in the post request.

import requests
from bs4 import BeautifulSoup

link = 'https://signon.springer.com/login'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,'html.parser')   
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}         
    # the line above collects the names and values of all inputs in the login form
    payload['username'] = username
    payload['password'] = password

    print(payload) #when you print this, you should see the required parameters within payload 

    s.post(link,data=payload)
    # as we have already logged in, the login cookies are stored within the session
    # in our subsequent requests we reuse the same session we have been using from the very beginning
    r = s.get('https://press.nature.com/press-releases')
    print(r.status_code)
    print(r.text)
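A quick way to check whether the POST actually logged you in: requests follows redirects by default, so after a failed login the final URL in `r.url` usually still points at the sign-on host. A small sketch of that check (the URLs below are examples):

```python
# requests follows redirects by default, so r.url holds the final URL after
# all redirects. If it is still on the sign-on host, the login did not succeed.
def logged_in(final_url: str) -> bool:
    return "signon.springer.com" not in final_url

print(logged_in("https://press.nature.com/press-releases"))        # True: reached the target site
print(logged_in("https://signon.springer.com/login?service=..."))  # False: bounced back to login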

Answer 1 (score: 0)

Have you tried using Selenium together with bs4 and requests? You can make the browser wait until it can select an element:

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # seconds to wait when locating elements
driver.get("https://press.nature.com/press-releases")  # redirects to the login page
# then log in (fill the username/password fields and submit the form)
driver.get("https://press.nature.com/press-releases")  # the page behind the login

That way you can go to the login URL and log in, and then go to the page you want to scrape.

Answer 2 (score: 0)

I think your auth parameter is not in a format that requests will accept. You could try importing HTTPBasicAuth:

from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth(username, password)
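For reference, `HTTPBasicAuth` only attaches an `Authorization: Basic <base64(user:pass)>` header to the request; it does not fill in a login form. An equivalent construction with just the standard library (the credentials below are placeholders):

```python
import base64

# HTTPBasicAuth adds a header of the form "Authorization: Basic <token>",
# where <token> is the base64 encoding of "username:password".
username, password = "jane@example.com", "s3cret"
token = base64.b64encode(f"{username}:{password}".encode()).decode()
header = f"Basic {token}"
print(header)
```

A CAS-style form like the one on signon.springer.com expects the credentials in the POST body, so Basic auth alone is unlikely to log you in here.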