Question

I wrote a scraper for www.researchgate.net, but I seem to be stuck on the login page forever.

Here is my code:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

params = {'login': 'my_email', 'password': 'my_password'}
session.post("https://www.researchgate.net/application.Login.html", data = params)
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print(BeautifulSoup(s.text, "html.parser").title)

Can anyone see what is wrong with my code? Why am I redirected to the login page every time?

Answer 1

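The login form has hidden fields that you may need to supply (I can't test this, as I don't have a login there).

One is request_token, which is set to a long base64-encoded string. Others that may also be needed are invalidPasswordCount and loginCookie.

In addition, there is a session cookie that you may need to send along with the login credentials.

To make this work you need an initial GET to obtain the request_token, which you then have to extract somehow, for example with BeautifulSoup. If you use a requests session, the cookies will be sent with the subsequent POST automatically, so you don't have to worry about that.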

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# initial GET to retrieve token and set cookies
r = session.get('https://www.researchgate.net/application.Login.html')
soup = BeautifulSoup(r.text, "html.parser")
request_token = soup.find('input', attrs={'name':'request_token'})['value']

params = {'login': 'my_email', 'password': 'my_password', 'request_token': request_token, 'invalidPasswordCount': 0, 'loginCookie': 'yes'}
session.post("https://www.researchgate.net/application.Login.html", data=params)
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print(BeautifulSoup(s.text, "html.parser").title)
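
Since the exact set of hidden fields is a guess here, it may be safer to collect every hidden input from the login page instead of hard-coding their names. The following is a minimal sketch of that idea, not a tested login routine. It swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages. The sample markup is made up for illustration, the names HiddenInputCollector and hidden_fields are hypothetical, and it collects hidden inputs from the whole page rather than from one specific form.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Keep only named inputs whose type is "hidden" (case-insensitive).
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden input fields found in the given HTML text."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


# Made-up sample markup for illustration; the real login page may differ.
sample = """
<form method="post">
  <input type="hidden" name="request_token" value="abc123">
  <input type="hidden" name="invalidPasswordCount" value="0">
  <input type="text" name="login">
</form>
"""

# Start from whatever hidden fields the page supplies, then add the credentials.
params = hidden_fields(sample)
params.update({"login": "my_email", "password": "my_password"})
print(params)
```

In the real flow, the text of the initial GET response (r.text above) would take the place of sample, and the resulting params would be passed as data to session.post.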

Answer 2

Thanks to mhawke, I modified my original code following his suggestions and finally managed to log in.

Here is my new code:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
loginpage = session.get("https://www.researchgate.net/application.Login.html")
request_token = BeautifulSoup(loginpage.text, "html.parser").form.find("input", {"name": "request_token"}).attrs["value"]
print(request_token)
params = {"request_token":request_token,
          "invalidPasswordCount":"0",
          'login': 'my_email', 
          'password': 'my_password',
          "setLoginCookie":"yes"
          }
session.post("https://www.researchgate.net/application.Login.html", data = params)
# print(session.cookies.get_dict())  # uncomment to inspect the session cookies
s = session.get("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang")
print(BeautifulSoup(s.text, "html.parser").title)
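
To check whether a request was bounced back to the login page without reading the page title by eye, one option is to look at the final URL of the response. This is only a sketch: it assumes the redirect target has the same path as the login URL used above, and landed_on_login_page and LOGIN_PATH are hypothetical names.

```python
from urllib.parse import urlsplit

# Path of the login URL used in the question; assumed to be the redirect target.
LOGIN_PATH = "/application.Login.html"


def landed_on_login_page(final_url, login_path=LOGIN_PATH):
    """Return True if final_url points at the login page, ignoring the query string."""
    return urlsplit(final_url).path.rstrip("/") == login_path


print(landed_on_login_page("https://www.researchgate.net/application.Login.html?x=1"))
print(landed_on_login_page("https://www.researchgate.net/search.Search.html?type=researcher&query=zhang"))
```

With requests, s.url holds the final URL after any redirects and s.history lists the intermediate responses, so landed_on_login_page(s.url) could be called right after the session.get above.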

What is wrong with my request.session for a Python crawler?
