Python requests: POST login (with cookies) followed by a GET scrape fails

Asked: 2018-06-20 12:45:05

Tags: python python-requests http-post session-cookies http-get

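I am trying to scrape horse racing results (sorted by fair starting price) from the website "https://www.timeform.com/", but it seems I am not being logged in properly (my POST request is not working).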

Any suggestions?

My code:

import requests

url_get='https://www.timeform.com/horse-racing/result/brighton/2018-06-11/0200/6/1/phoenix-arts-club-fillies-handicap'
url_post_pgin='https://www.timeform.com/horse-racing/account/handlelogin?returnUrl=%2Fhorse-racing%2F'
payload = {"EmailAddress":"mymail@gmail.com","Password":"XXXXXXXXX","RememberMe":"true"}

headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-US,en;q=0.5',
'Connection':'keep-alive',
'Content-Length':'195',
'Content-Type':'application/x-www-form-urlencoded',
'DNT':1,
'Host':'www.timeform.com',
'Referer':'https://www.timeform.com/horse-racing/account/sign-in?returnUrl=%2Fhorse-racing%2F',
'Upgrade-Insecure-Requests':1,
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'}

s = requests.session()
login = s.post(url=url_post_pgin,headers=headers, data=payload) #this should log me in, but I am afraid is not doing good job
get_data=s.get(url=url_get,headers=headers, cookies=s.cookies)

#file = open("rr.html", "w")
#file.write(str(m.text))
#file.close()

*Edit: I changed the names of the URL variables.

1 Answer:

Answer 0 (score: 1)

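After a close look at the website and your code, the problem comes from three things: the headers, the way you send the requests, and a request that you never send.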

Problem 1: The headers

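Your headers dict is large. A large set of headers is not necessarily a bad one, but in your case you are sending many unnecessary entries (for example, a hard-coded Content-Length), which may confuse the server. I tried replacing your headers with the following: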

headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'}

As you can see, I removed a lot of entries from it.
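As a minimal sketch of the same idea, the shared headers can be set once on the session so that they do not need to be passed to every call. The names COMMON_HEADERS and make_session are hypothetical, and Content-Type is left out here on the assumption that requests fills in Content-Type and Content-Length itself when a dict is passed as data. The request is only prepared, not sent, so the resulting headers can be inspected offline:

```python
import requests

# Headers kept from the answer above; everything else is left to requests.
COMMON_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) "
                  "Gecko/20100101 Firefox/54.0",
}


def make_session():
    """Return a session that sends COMMON_HEADERS with every request."""
    s = requests.Session()
    s.headers.update(COMMON_HEADERS)
    return s


s = make_session()

# Build the login request without sending it, to see what would go out.
login_request = requests.Request(
    "POST",
    "https://www.timeform.com/horse-racing/account/handlelogin",
    data={"EmailAddress": "mymail@gmail.com"},  # placeholder payload
)
prepared = s.prepare_request(login_request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Content-Type"])    # set by requests for form data
print(prepared.headers["Content-Length"])  # computed from the encoded body
```

With the headers on the session, later calls can be written as s.post(url, data=payload) and s.get(url) without a headers argument.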

Problem 2: The wrong way of sending the request (in your case)

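The whole point of requests.session() is to keep track of all cookies so that you do not have to inject them manually into every request. So you should change this: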

get_data=s.get(url=url_get,headers=headers, cookies=s.cookies)

to this:

get_data=s.get(url=url_get,headers=headers)

(Whenever you send a request through a requests.session, the stored cookies are injected automatically.)
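The automatic cookie handling can be checked without touching the network. In this small sketch the cookie name example_auth and its value are made up and stand in for whatever the server would set at login. The requests are prepared but never sent:

```python
import requests

s = requests.Session()

# Stand-in for a cookie the server would set at login (hypothetical name/value).
s.cookies.set("example_auth", "abc123", domain="www.timeform.com", path="/")

# No cookies= argument: the session attaches matching cookies by itself.
prepared = s.prepare_request(
    requests.Request("GET", "https://www.timeform.com/horse-racing/")
)
print(prepared.headers.get("Cookie"))

# A request to a different host does not receive the cookie.
other = s.prepare_request(requests.Request("GET", "https://example.com/"))
print(other.headers.get("Cookie"))
```

This is why passing cookies=s.cookies to s.get adds nothing: the session already applies its cookie jar to every request it prepares.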

Problem 3: The request you are not sending

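As mentioned at the start, the last problem is that you forgot to send one request. I won't go into detail, but the request you are missing is this GET of the sign-in page: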

s.get(url='https://www.timeform.com/horse-racing/account/sign-in', headers=headers)

After all of these changes, your code should look like this:

import requests

url_get='https://www.timeform.com/horse-racing/result/brighton/2018-06-11/0200/6/1/phoenix-arts-club-fillies-handicap'
url_post_pgin='https://www.timeform.com/horse-racing/account/handlelogin?returnUrl=%2Fhorse-racing%2F'
url_post_pgin2='https://www.timeform.com/horse-racing/account/sign-in'
payload = {"EmailAddress":"mymail@gmail.com","Password":"XXXXXXXXX","RememberMe":"true"}

headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'}

s = requests.session()

login = s.post(url=url_post_pgin,headers=headers, data=payload) # submit the login form
s.get(url=url_post_pgin2,headers=headers) # the request that was missing: load the sign-in page
get_data=s.get(url=url_get,headers=headers) # the session sends its cookies automatically
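To confirm that the login actually worked before parsing anything, the final response can be checked and saved for inspection. This is only a sketch: login_looks_ok and save_html are hypothetical helpers, and the marker text "Sign Out" is an assumption about what the page shows to logged-in users, so replace it with something you see in the real page:

```python
def login_looks_ok(status_code, html, marker="Sign Out"):
    """Return True if the page loaded (HTTP 200) and contains the marker text.

    The default marker is a guess; use a string that only appears
    on the page when you are logged in.
    """
    return status_code == 200 and marker in html


def save_html(html, path):
    """Write the page to disk so it can be opened in a browser."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


# Usage with the get_data response from the code above:
#   if login_looks_ok(get_data.status_code, get_data.text):
#       save_html(get_data.text, "rr.html")
#   else:
#       print("Login probably failed, status", get_data.status_code)
```

Saving the HTML replaces the commented-out file-writing lines in the question, which referred to an undefined variable m.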

Hope this helps.