Python data scraping - forms authentication problem

Time: 2017-01-31 02:26:02

Tags: python python-3.x web-scraping forms-authentication

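Below is some code I have been using to try to log in to the Cook's Illustrated website (https://www.cooksillustrated.com/sign_in).

I start a session, grab the authenticity token and the hidden encoding field, then pass the "name" and "value" of the email and password fields (found by inspecting the elements in Chrome). The form does not appear to contain any other elements; however, the POST does not log me in.

I noticed that all the CSRF tokens end in "==", so I tried stripping those characters off. That did not work.

I also tried changing the POST data to use the "id" attribute of the form inputs instead of "name" (a shot in the dark, really; judging from other examples I have seen, the names should work).

Any ideas would be much appreciated.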

import requests, lxml.html
s = requests.session()

# go to the login page and get its text
login = s.get('https://www.cooksillustrated.com/sign_in')
login_html = lxml.html.fromstring(login.text)

# find the hidden fields names and values; store in a dictionary
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib['name']: x.attrib['value'] for x in hidden_inputs}
print(form)

# I noticed that they all ended in two = signs, so I tried taking that off
# form['authenticity_token'] = form['authenticity_token'][:-2]

# this adds to the form payload the two named fields for user name and password
# found using the "inspect elements" on the login screen
form['user[email]'] = 'my_email'
form['user[password]'] = 'my_pw'

# this uses "id" instead of "name" from the input fields
#form['user_email'] = 'my_email'
#form['user_password'] = 'my_pw'

response = s.post('https://www.cooksillustrated.com/sign_in', data=form)
print(form)

# trying to see if it worked - but the response URL is login again instead of main page
# and it can't find my name
# responses are okay, but I think that just means it posted the form
print(response.url)
print('Christopher' in response.text)
print(response.status_code)
print(response.ok)

1 Answer:

Answer 0 (score: 0)

Well, the POST request URL should be https://www.cooksillustrated.com/sessions. If you capture all the traffic while logging in, you will find the actual POST request made to the server:

POST /sessions HTTP/1.1
Host: www.cooksillustrated.com
Connection: keep-alive
Content-Length: 179
Cache-Control: max-age=0
Origin: https://www.cooksillustrated.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.cooksillustrated.com/sign_in
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8

utf8=%E2%9C%93&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo

Note that the last line is the URL-encoded data of this request, which contains four parameters: utf8, authenticity_token, user[email], and user[password].
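If you want to double-check what that body contains, you can decode it with the standard library. This is only a minimal sketch: CAPTURED_BODY and decode_form_body are names made up for illustration, and the string is the body of the captured request above, copied verbatim.

```python
from urllib.parse import parse_qsl

# Body of the captured POST /sessions request, copied verbatim.
CAPTURED_BODY = (
    "utf8=%E2%9C%93"
    "&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM"
    "%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D"
    "&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo"
)


def decode_form_body(body):
    """Return the URL-encoded form fields of `body` as a dict."""
    return dict(parse_qsl(body, keep_blank_values=True))


params = decode_form_body(CAPTURED_BODY)
# %E2%9C%93 decodes to the check mark, %5B and %5D to [ and ],
# %2B to + and the trailing %3D%3D to ==, which is part of the token.
```

When you pass a dict through data=, requests does this encoding for you, so the dict should hold the decoded names and values.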

So in your case, form should include all of them:

form = {'user[email]': 'my_email',
        'user[password]': 'my_pw',
        'utf8': '✓',  # the field name is utf8, as in the captured request
        'authenticity_token': 'xxxxxx'  # copy the token verbatim, including the trailing '=='
}
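Rather than pasting the token in by hand, you could read the hidden fields from the sign-in page each time and add the credentials to them. Below is a rough sketch that uses only the standard library's html.parser (the lxml XPath from the question should serve just as well). HiddenInputCollector and build_login_form are hypothetical names, and unlike the question's XPath this version collects every hidden input on the page, not only those inside a form.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute becomes an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_form(login_html, email, password):
    """Merge the page's hidden fields with the two credential fields."""
    collector = HiddenInputCollector()
    collector.feed(login_html)
    form = dict(collector.fields)
    form["user[email]"] = email
    form["user[password]"] = password
    return form
```

With the session from the question, this would be called as build_login_form(login.text, 'my_email', 'my_pw'), which keeps utf8 and authenticity_token exactly as the server sent them.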

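In addition, you may want to add some headers so that the request appears to come from Chrome (or whichever browser you prefer), because the default User-Agent of requests is python-requests/2.13.0 (the number is the installed version), and some sites do not like traffic from "robots":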

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           # 'br' is left out: requests only decodes Brotli if a Brotli package is installed
           'Accept-Encoding': 'gzip, deflate',
           # ... more
}

Now we are ready to make the POST request:

response = s.post('https://www.cooksillustrated.com/sessions', data=form, headers=headers)
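To tell whether the login went through, one option is to look at where the session ended up after redirects, as the question already does with response.url. Here is a small sketch, assuming a failed login lands back on /sign_in (that is taken from the question's description and not verified against the site); still_on_login_page is a hypothetical helper.

```python
from urllib.parse import urlsplit


def still_on_login_page(final_url, login_path="/sign_in"):
    """Return True if the final URL's path is the login page."""
    # Compare only the path, so query strings and fragments are ignored.
    return urlsplit(final_url).path.rstrip("/") == login_path
```

After the POST above, still_on_login_page(response.url) returning True would suggest the credentials or the token were rejected. A 200 status alone does not tell you that, since the sign-in page itself is served with 200.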