Python: Requests authentication

Date: 2017-01-31 10:43:38

Tags: python beautifulsoup python-requests

I am trying to scrape a website with BeautifulSoup. The website requires a login:

https://www.bahn.de/p/view/meinebahn/login.shtml

From searching the web, I understand that one correct way to get authorized is to use requests.

My code is as follows:

import requests
from bs4 import BeautifulSoup

url = 'https://www.bahn.de/p/view/meinebahn/login.shtml'
header = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko)     Chrome","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp    ,*/*;q=0.8"}

user = "username"
pwrd = "password"

response = requests.post(url, headers=header, auth=(user, pwrd))
page = requests.get('https://fahrkarten.bahn.de/privatkunde/meinebahn/meine_bahn_portal.go?lang=de&country=DEU#stay')

soup = BeautifulSoup(page.text, 'html.parser')

Unfortunately, this does not work: soup is an HTML text that includes "You have been logged out of our system", even though response is <Response [200]>.

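I am struggling a bit with auth, for two reasons:

  1. Is my understanding of the auth method even correct, i.e. first post the login details and then visit the page "behind" the login, or does it work differently?
  2. How can I find out whether a website requires a more specialized authentication method? Are there keywords to look for in the HTML code?

Any help is appreciated, as I really want to understand this and I am obviously too much of a newbie to draw the right conclusions from the manuals (e.g. http://docs.python-requests.org/en/master/user/authentication/).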

2 answers:

Answer 0 (score: 3)

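The easiest way to understand how a website authenticates is to capture the traffic while you log in and see what happens behind the scenes: which URL is used, what data is submitted, and so on.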

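You can use Fiddler or Charles, or, most conveniently, the Chrome developer tools (press F12 to open them). The captured login request looks like this:

[Screenshot: login request]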

In your case, the whole request is:

POST /privatkunde/start/start.post HTTP/1.1
Host: fahrkarten.bahn.de
Connection: keep-alive
Content-Length: 74
Cache-Control: max-age=0
Origin: https://www.bahn.de
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.bahn.de/p/view/meinebahn/login.shtml
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8

scope=bahnde&lang=de&country=DEU&username=demo&password=demo&login-submit=
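
As a rough illustration of why the captured request matters: its body is an ordinary URL-encoded form, whereas `auth=(user, pwrd)` in requests is shorthand for HTTP Basic authentication, which puts the credentials into an `Authorization` header instead. The sketch below uses only the standard library and the demo values from the capture above; the variable names are chosen for illustration.

```python
import base64
from urllib.parse import parse_qsl, urlencode

# Body of the captured POST request, copied from the capture above.
captured_body = "scope=bahnde&lang=de&country=DEU&username=demo&password=demo&login-submit="

# Split the body into form fields; keep_blank_values=True keeps the
# empty "login-submit" field, which would otherwise be dropped.
form_fields = dict(parse_qsl(captured_body, keep_blank_values=True))

# Re-encoding the fields reproduces the captured body. This is the kind of
# payload that requests builds when the dict is passed as data=...
form_body = urlencode(form_fields)

# For comparison: HTTP Basic authentication sends the credentials in a
# header, base64-encoded as "user:password" (ASCII credentials assumed).
user, pwrd = form_fields["username"], form_fields["password"]
basic_header = "Basic " + base64.b64encode(f"{user}:{pwrd}".encode("utf-8")).decode("ascii")

print(form_fields)
print(form_body == captured_body)
print("Authorization:", basic_header)
```

The login form never looks at an `Authorization` header, which is consistent with the original attempt returning a 200 response without actually logging in.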

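Most importantly, because cookies are used for authentication, the whole process needs a session, which is then reused to access the other pages that are only available to logged-in users.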

import requests

session = requests.Session()  # a session stores and resends cookies automatically

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko)     Chrome",
    # ... add the other headers used in the actual POST request
}

data = {'scope': 'bahnde', 'lang': 'de', 'country': 'DEU',
        'username': 'xxxx', 'password': 'xxxx', 'login-submit': ''}

# log in
response = session.post(url='https://fahrkarten.bahn.de/privatkunde/start/start.post', data=data, headers=headers)

# once logged in, the same session can be used to access other pages
# it is worth confirming that the login actually succeeded by checking response.text
content = response.text
# look for your username or another marker, e.g. with content.find(...)
r2 = session.get(url='xxx')  # access other pages; 'xxx' is a placeholder for the target URL
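
The login-then-fetch flow can be wrapped in a small helper that also fails loudly when the protected page still shows the logged-out text. This is only a sketch: `fetch_behind_login` and its parameters are hypothetical names, the logged-out marker is whatever text the site actually shows (the question only gives it in translation), and `session` is assumed to behave like a `requests.Session`.

```python
def fetch_behind_login(session, login_url, form_data, protected_url, logged_out_marker):
    """Log in with `session`, then fetch `protected_url` with the same session.

    `session` is expected to offer post()/get() like requests.Session, so the
    cookies set by the login response are sent along with the second request.
    """
    login_response = session.post(login_url, data=form_data)
    login_response.raise_for_status()  # fail on HTTP errors (4xx/5xx)

    page = session.get(protected_url)
    page.raise_for_status()

    # A 200 status alone does not prove the login worked, so also check
    # that the page does not contain the site's logged-out message.
    if logged_out_marker in page.text:
        raise RuntimeError("still logged out: the login request was probably rejected")
    return page.text


# Usage with the real library (not executed here; values are placeholders):
#   import requests
#   html = fetch_behind_login(requests.Session(), login_url, data,
#                             protected_url, logged_out_marker)
```

Passing the session in as an argument keeps the helper independent of the network, so it can be exercised with a stand-in object before it is pointed at the real site.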

Answer 1 (score: 0)

It may be because you are requesting the wrong page. Have a look at the form on the login page:

<form method="post" name="staticLogin" id="kv-static-logi" action="https://fahrkarten.bahn.de/privatkunde/start/start.post">
<input name="scope" value="bahnde" type="hidden">
<input name="lang" value="de" type="hidden">
<input name="country" value="DEU" type="hidden">
<p>
<input id="kv-static-login-username_ab" name="username" class="from" maxlength="60" autocomplete="off" placeholder="Benutzername" type="text">
</p>

<p>
<input id="kv-static-login-password_ab" name="password" class="from" maxlength="60" placeholder="Passwort" type="password">
</p>

<p><button type="submit" name="login-submit" class="btn slim no-margin" style="float: left">Login</button>
<a id="vergessen" href="https://fahrkarten.bahn.de/privatkunde/start/start.post?scope=pwvergessen&amp;lang=de">Login vergessen?</a>
</p></form>

You should send the request to https://fahrkarten.bahn.de/privatkunde/start/start.post with the username and password fields. Keep whatever the response gives you (tokens etc.)!
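
The form can also be read programmatically, which helps with the earlier question of what to look for in the HTML: the `action` attribute gives the URL to post to, and the named inputs (including hidden ones) give the fields to send. Below is a minimal sketch using the standard library's `html.parser`; `LoginFormParser` is a hypothetical helper, and the HTML is a shortened copy of the form above. BeautifulSoup could do the same job.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the action URL and the named fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif self._in_form and tag in ("input", "button") and "name" in attrs:
            # Hidden inputs carry preset values; the others start out empty.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


# Shortened copy of the login form shown above.
FORM_HTML = """
<form method="post" name="staticLogin" action="https://fahrkarten.bahn.de/privatkunde/start/start.post">
<input name="scope" value="bahnde" type="hidden">
<input name="lang" value="de" type="hidden">
<input name="country" value="DEU" type="hidden">
<input name="username" type="text">
<input name="password" type="password">
<button type="submit" name="login-submit">Login</button>
</form>
"""

parser = LoginFormParser()
parser.feed(FORM_HTML)

# Fill in the credentials ('xxxx' is a placeholder) on top of the preset fields.
data = dict(parser.fields, username="xxxx", password="xxxx")
print(parser.action)
print(data)
```

Reading the fields from the page instead of hard-coding them means that hidden values such as tokens, if the site adds any, are picked up and sent back automatically.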

See ya!